Minisymposium Presentation
Resource-Efficient AI System Design
Presenter
Ana Klimovic is an Assistant Professor in the Systems Group of the Computer Science Department at ETH Zurich. Her research interests span operating systems, computer architecture, and their intersection with machine learning. Ana's work focuses on computer system design for large-scale applications such as cloud computing services, data analytics, and machine learning. Before joining ETH in August 2020, Ana was a Research Scientist at Google Brain and completed her Ph.D. in Electrical Engineering at Stanford University.
Description
Today’s large-scale AI model training and serving jobs require many hardware accelerators, making them extremely costly and power-hungry. Yet despite this demand, AI jobs often underutilize individual GPUs for a variety of reasons, including data preprocessing stalls, communication stalls, limited batching opportunities, and imbalanced memory and compute usage across the individual operators within a job. This inefficient use of hardware accelerators further increases costs. In this talk, I will discuss why optimizing hardware accelerator (e.g., GPU) utilization is key to improving the cost and energy efficiency of AI workloads and how we can achieve this. I will present several computer systems that we are building as part of the Swiss AI Initiative to optimize GPU cluster configurations and job parallelization strategies for distributed AI training jobs, and to share GPUs efficiently while maximizing performance.