A comprehensive framework for efficient, scalable, and performance-portable tensor applications

Tensor-centric computations are the compute-intensive core of large-scale parallel applications in scientific computing and machine learning. Sustained performance gains for such computations will depend on improving hardware efficiency through customization and reduced data movement, which in turn requires advances in algorithm-architecture co-design methodology. Further, the transition to customized hardware poses significant challenges for application developer productivity and performance-portability.

This project brings together a team with complementary expertise and a focused plan to address these challenges. We aim to significantly advance both the performance-portability of tensor applications and the algorithm-architecture co-design methodology and tools for such computations. The proposed project spans the full application-to-architecture software/hardware stack and also considers the cross-cutting concern of accuracy and correctness. Our team focuses on reconfigurable architectures, designing and evaluating the core functional hardware elements required to implement the experimental architectures developed in this project, for both FPGA and ASIC targets.

This research effort is funded by NSF PPOSS-PP of Scalable Systems, award number 22-507.