Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
This blog post was originally published at Nota AI's website. It is reprinted here with the permission of Nota AI.

Our method, Trimmed-Llama, reduces the key-value cache (KV cache) and latency of cross-attention-based Large Vision Language Models (LVLMs) without sacrificing performance. We identify sparsity in LVLM cross-attention maps, showing a consistent layer-wise pattern where most […]
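As a rough illustration of the idea (not the exact Trimmed-Llama procedure), the sketch below prunes cross-attention KV entries for visual tokens whose aggregated attention mass is low, keeping only the most-attended fraction. The function name, `keep_ratio` parameter, and tensor layout are assumptions made for this example.

```python
import torch

def trim_visual_kv(attn_weights, visual_keys, visual_values, keep_ratio=0.2):
    """
    Hedged sketch: drop cross-attention KV entries for visual tokens
    that receive little attention mass, keeping the top `keep_ratio`.

    attn_weights:  (batch, heads, q_len, n_visual) cross-attention probabilities
    visual_keys:   (batch, heads, n_visual, head_dim)
    visual_values: (batch, heads, n_visual, head_dim)
    """
    # Aggregate attention mass per visual token across heads and query positions.
    token_scores = attn_weights.sum(dim=(1, 2))            # (batch, n_visual)

    n_visual = visual_keys.shape[2]
    n_keep = max(1, int(n_visual * keep_ratio))
    keep_idx = token_scores.topk(n_keep, dim=-1).indices   # (batch, n_keep)

    # Gather only the surviving visual KV entries.
    idx = keep_idx[:, None, :, None].expand(
        -1, visual_keys.shape[1], -1, visual_keys.shape[-1]
    )
    trimmed_keys = visual_keys.gather(2, idx)
    trimmed_values = visual_values.gather(2, idx)
    return trimmed_keys, trimmed_values, keep_idx
```

Because the trimmed keys and values are what get stored in the KV cache and reused in later layers and decoding steps, a smaller surviving set translates directly into lower memory and latency.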