Dwith Chenna, MTS Product Engineer for AI Inference at AMD, presents the “Quantization Techniques for Efficient Deployment of Large Language Models: A Comprehensive Review” tutorial at the May 2025 Embedded Vision Summit.
Deploying large language models (LLMs) in resource-constrained environments is challenging because of these models' significant computational and memory demands. To address this challenge, various quantization techniques have been proposed to reduce a model's resource requirements while largely preserving its accuracy. This talk provides a comprehensive review of post-training quantization (PTQ) methods, highlighting their trade-offs and applications in LLMs.
Chenna explains quantization techniques such as GPTQ (accurate post-training quantization for generative pre-trained transformers), activation-aware weight quantization (AWQ) and SmoothQuant, and evaluates their performance on popular LLMs such as the Open Pre-trained Transformer (OPT) series and Meta's Llama 2. His results demonstrate that these techniques can significantly reduce model size and computational requirements while maintaining accuracy, making these models suitable for deployment in edge environments.
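The methods above build on the same basic idea: map floating-point weights to low-bit integers using a scaling factor, so that matrix multiplies can run in cheaper integer arithmetic. The sketch below is a generic illustration of symmetric per-channel int8 weight quantization (not the speaker's code, and simpler than GPTQ or AWQ, which additionally correct for quantization error or activation statistics); function names are hypothetical.

```python
import numpy as np

def quantize_per_channel(w: np.ndarray, n_bits: int = 8):
    """Symmetric per-output-channel weight quantization (generic PTQ sketch)."""
    qmax = 2 ** (n_bits - 1) - 1              # e.g. 127 for int8
    # One scale per output channel (row), from the max absolute weight.
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Recover an approximation of the original weights.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)
q, scale = quantize_per_channel(w)
err = np.abs(w - dequantize(q, scale)).max()
print(q.dtype, err)
```

Per-channel scaling keeps the rounding error proportional to each row's dynamic range, which is why it is the common baseline that methods like AWQ and SmoothQuant improve on for outlier-heavy LLM weights and activations.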
See here for a PDF of the slides.

