In "On-Device LLMs: State of the Union, 2026," Vikas Chandra and Raghuraman Krishnamoorthi explain why running LLMs on phones has moved from novelty to practical engineering, and why the biggest breakthroughs came not from faster chips but from rethinking how models are built, trained, compressed, and deployed.
Why run LLMs locally?
Four reasons: latency (cloud round-trips add hundreds of milliseconds, breaking real-time experiences), privacy (data that never leaves the device can’t be breached), cost (shifting inference to user hardware saves serving costs at scale), and availability (local models work without connectivity). The trade-off is clear: frontier reasoning and long conversations still favor the cloud, but daily utility tasks like formatting, light Q&A, and summarization increasingly fit on-device.
Memory bandwidth is the real bottleneck
People over-index on TOPS. Mobile NPUs are powerful, but decode-time inference is memory-bandwidth bound: generating each token requires streaming the full model weights. Mobile devices have 50-90 GB/s bandwidth; data center GPUs have 2-3 TB/s. That 30-50x gap dominates real throughput.
This is why compression has an outsized impact. Going from 16-bit to 4-bit isn’t just 4x less storage; it’s 4x less memory traffic per token. Available RAM is also tighter than specs suggest (often under 4GB after OS overhead), limiting model size and architectural choices like mixture of experts (MoE).
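To make the arithmetic concrete, here is a back-of-the-envelope sketch of why bandwidth, not TOPS, sets decode speed (the model size and bandwidth figures are illustrative assumptions, not measurements):

```python
def decode_tokens_per_sec(params_billion: float, bits_per_weight: int,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode throughput when each generated token must
    stream the full weights from memory (ignores KV cache traffic,
    activations, and compute)."""
    bytes_per_token = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Illustrative: a 3B model on a phone with ~60 GB/s of memory bandwidth.
print(decode_tokens_per_sec(3, 16, 60))  # ~10 tok/s at 16-bit
print(decode_tokens_per_sec(3, 4, 60))   # ~40 tok/s at 4-bit: 4x less traffic
```

The same model quantized to 4-bit moves a quarter of the bytes per token, which is why the throughput ceiling rises roughly 4x before any compute optimization.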
Power matters too. Rapid battery drain or thermal throttling kills products. This pushes toward smaller, quantized models and bursty inference that finishes fast and returns to low power.
Small models have gotten better
Where 7B parameters once seemed like the minimum for coherent generation, sub-billion-parameter models now handle many practical tasks. The major labs have converged: Llama 3.2 (1B/3B), Gemma 3 (down to 270M), Phi-4 mini (3.8B), SmolLM2 (135M-1.7B), and Qwen2.5 (0.5B-1.5B) all target efficient on-device deployment. Below ~1B parameters, architecture matters more than size: deeper, thinner networks consistently outperform wide, shallow ones.
Training methodology and data quality drive capability at small scales. High-quality synthetic data, domain-targeted mixes, and distillation from larger teachers buy more than adding parameters. Reasoning isn’t purely a function of model size: distilled small models can outperform base models many times larger on math and reasoning benchmarks.
The practical toolkit
Quantization: Train in 16-bit, deploy at 4-bit. Post-training quantization (GPTQ, AWQ) preserves most quality with 4x memory reduction. The challenge is outlier activations; techniques like SmoothQuant and SpinQuant handle these by reshaping activation distributions before quantization. Going lower is possible: ParetoQ found that at 2 bits and below, models learn fundamentally different representations, not just compressed versions of higher-precision models.
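As a rough illustration of the storage side, here is a minimal round-to-nearest group-wise int4 quantizer (the group size and symmetric range are assumptions; GPTQ, AWQ, SmoothQuant, and SpinQuant layer error compensation and outlier handling on top of this basic scheme):

```python
import numpy as np

def quantize_int4_groupwise(w: np.ndarray, group_size: int = 64):
    """Round-to-nearest 4-bit quantization with one scale per group of weights.
    This is the simplest PTQ baseline; production methods add error
    compensation (GPTQ) and activation-aware scaling (AWQ) on top."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # int4 range [-8, 7]
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(shape)

# Quantize a random weight matrix and check reconstruction error.
w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int4_groupwise(w)
print("mean abs error:", np.abs(w - dequantize(q, s, w.shape)).mean())
```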
KV cache management: For long context, KV cache can exceed model weights in memory. Compressing or selectively retaining cache entries often matters more than further weight quantization. Key approaches include preserving “attention sink” tokens, treating heads differently based on function, and compressing by semantic chunks.
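For intuition, here is a toy sketch of the "attention sink" idea: keep the first few tokens plus a sliding window of recent ones and evict the rest (names and sizes are mine; real implementations operate on per-layer, per-head key/value tensors):

```python
from collections import deque

class SinkWindowKVCache:
    """Toy sink-plus-sliding-window eviction policy. Entries stand in for
    per-token (key, value) tensors; the first `num_sink` tokens are never
    evicted, and only the most recent `window` tokens are kept after that."""
    def __init__(self, num_sink: int = 4, window: int = 1024):
        self.num_sink = num_sink
        self.sinks = []                      # first tokens, never evicted
        self.recent = deque(maxlen=window)   # fixed-size recent window

    def append(self, kv_entry):
        if len(self.sinks) < self.num_sink:
            self.sinks.append(kv_entry)
        else:
            self.recent.append(kv_entry)     # oldest non-sink entry falls out

    def entries(self):
        return self.sinks + list(self.recent)

cache = SinkWindowKVCache(num_sink=4, window=8)
for t in range(20):
    cache.append(f"kv_{t}")
print(cache.entries())  # kv_0..kv_3 plus the last 8 tokens
```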
Speculative decoding: A small draft model proposes multiple tokens; the target model verifies them in parallel. This breaks the one-token-at-a-time bottleneck, delivering 2-3x speedups. Diffusion-style parallel token refinement is an emerging alternative.
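A greedy sketch of the verify loop (real systems check all draft positions in a single batched forward pass and use a probabilistic accept/reject rule; the toy model functions below are stand-ins):

```python
def speculative_decode_step(draft_next, target_next, context, k=4):
    """One round of greedy speculative decoding: the draft proposes k tokens,
    the target checks them, and we keep the longest agreeing prefix plus one
    token from the target, so each round emits at least one token."""
    proposal, ctx = [], list(context)
    for _ in range(k):                       # cheap, sequential drafting
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok in proposal:                     # in practice: one parallel verify pass
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    accepted.append(target_next(ctx))        # target always contributes one token
    return accepted

# Toy integer 'models': the draft agrees with the target most of the time.
target = lambda ctx: (len(ctx) * 7) % 10
draft = lambda ctx: (len(ctx) * 7) % 10 if len(ctx) % 5 else 0
print(speculative_decode_step(draft, target, [1, 2, 3]))  # -> [1, 8, 5]
```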
Pruning: Structured pruning (removing entire heads or layers) keeps matrices dense, so it runs fast on standard mobile hardware. Unstructured pruning achieves higher sparsity but only pays off with sparse-kernel support that most mobile accelerators lack.
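A sketch of the structured variant, dropping whole attention heads by a simple importance proxy so the result stays a smaller dense matrix (the L2-norm score and shapes are assumptions; real pipelines use better importance estimates and fine-tune afterwards):

```python
import numpy as np

def prune_heads(w_o: np.ndarray, num_heads: int, keep: int):
    """Structured pruning sketch: score each attention head by the L2 norm of
    its slice of the output projection and keep the top `keep` heads, yielding
    a smaller dense matrix that any standard mobile backend can run."""
    d_model = w_o.shape[0]
    head_dim = d_model // num_heads
    heads = w_o.reshape(num_heads, head_dim, -1)        # per-head slices
    importance = np.linalg.norm(heads, axis=(1, 2))     # one score per head
    keep_idx = np.sort(np.argsort(importance)[-keep:])  # most important heads
    return heads[keep_idx].reshape(keep * head_dim, -1), keep_idx

w_o = np.random.randn(512, 512).astype(np.float32)      # 8 heads of dim 64
pruned, kept = prune_heads(w_o, num_heads=8, keep=6)
print(pruned.shape, kept)                                # (384, 512) + kept indices
```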
Software stacks have matured
No more heroic custom builds. ExecuTorch handles mobile deployment with a 50KB footprint. llama.cpp covers CPU inference and prototyping. MLX optimizes for Apple Silicon. Pick based on your target; they all work.
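For example, prototyping with llama.cpp through its Python bindings takes a few lines (the GGUF path below is a placeholder; any 4-bit quantized small model works):

```python
# Prototype local inference with llama.cpp via the llama-cpp-python bindings.
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-3.2-1b-instruct-q4.gguf",  # placeholder path
            n_ctx=2048)
out = llm("Summarize in one sentence: on-device LLMs trade peak capability "
          "for latency, privacy, and cost.", max_tokens=64)
print(out["choices"][0]["text"])
```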
Beyond text
The same techniques apply to vision-language and image generation models. Native multimodal architectures, which tokenize all modalities into a shared backbone, simplify deployment and let the same compression playbook work across modalities.
What’s next
MoE on edge remains hard: sparse activation reduces per-token compute, but all experts still have to be resident in memory (or paged in), so memory movement stays the bottleneck. Test-time compute lets small models spend more inference budget on hard queries; Llama 3.2 1B with search strategies can outperform the 8B model. On-device personalization via local fine-tuning could deliver user-specific behavior without shipping private data off-device.
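The simplest test-time-compute recipe is best-of-N with a verifier, sketched below (the generator and scorer are stand-ins for a small local model and a reward/verifier model; published results use richer search than this):

```python
import random

def best_of_n(generate, score, prompt, n=8):
    """Minimal test-time compute: sample n candidate answers from a small
    model and return the one the verifier scores highest. More budget (larger
    n, or tree search) buys better answers without a bigger model."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: a noisy generator and a verifier that prefers the right answer.
generate = lambda prompt: random.choice(["42", "41", "43", "42"])
score = lambda answer: 1.0 if answer == "42" else 0.0
print(best_of_n(generate, score, "What is 6 * 7?"))  # almost always "42"
```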
Bottom line
Phones didn’t become GPUs. The field learned to treat memory bandwidth, not compute, as the binding constraint, and to build smaller, smarter models designed for that reality from the start.
Read the full article here.

