Overcoming Challenges in Serving Large Language Models

Thursday, 12 October, 2023 - 11:00-11:40

Theofilos Papapanagiotou, Amazon


Discover the secrets of hosting GPT-type models in a Kubernetes cluster with multi-GPU nodes. As the demand for custom GPT models grows, SREs are increasingly tasked with providing these capabilities in their organizations. We'll dive into the complexities of serving such models, including their large size and the need for GPU sharding and tensor parallelism. Learn about model file formats, model quantization techniques, and leveraging open-source tools like Hugging Face Accelerate. Gain insights into the trade-offs between serving latency, prediction accuracy, and distributed serving, and explore best practices for optimizing resource allocation. Don't miss our live demo showcasing the performance and trade-offs of a GPT-based model. Empower yourself with practical knowledge to meet the demands of hosting language models effectively.

Theofilos is an accomplished ML architect and an expert in serving large language models with a focus on scalability and performance optimization. With a strong background in ML infrastructure and MLOps principles, he brings a wealth of experience to the table. As a maintainer of the KServe project and contributor to Kubeflow, Theofilos is actively involved in advancing the field of model serving. His deep understanding of Kubernetes, GPU optimization, and open-source tools allows him to navigate the challenges of hosting custom GPT-based models with ease. Attend his talk to gain valuable insights, best practices, and practical knowledge that will empower you to scale and optimize your language models effectively.

@conference {292147,
author = {Theofilos Papapanagiotou},
title = {Overcoming Challenges in Serving Large Language Models},
year = {2023},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}
