Embedl, a Swedish deep-tech pioneer in AI model optimization, today announced FlashHead, an optimization method that makes the most popular language models, including Llama-3.2 (Meta), Gemma-3 (Google DeepMind), and Qwen-3 (Alibaba), the fastest models for on-device inference.
The technology, “FlashHead: Efficient Drop-in Replacement for the Classification Head in Language Model Inference,” reduces latency by up to 43% while preserving full model accuracy.
“FlashHead eliminates a major AI deployment bottleneck,” says Hans Salomonsson, CEO of Embedl. “This means world-class small language models now run at lightning speed on everyday devices: fast, compact, and sustainable.”
Today’s classification head predicts the next token by assigning a probability to all possible tokens, but it is resource-intensive. State-of-the-art models include hundreds of thousands of possible tokens in their vocabularies, causing the head to become a severe bottleneck for inference. FlashHead reformulates a language model through the lens of information retrieval, making it faster and less computationally demanding. Notably, FlashHead delivers these efficiency gains while leaving the model’s output virtually unchanged.
FlashHead achieves this through several innovations, including equal-sized clustering for fast memory access, multi-probe retrieval, probabilistic sampling for increased speed, and selective quantization.
These optimizations make the classification head a minor cost, enabling fast inference even on low-power devices.
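To illustrate the bottleneck and the retrieval-style remedy described above (this is not Embedl's implementation; all sizes, the random clustering, and the variable names are chosen purely for illustration): a standard head scores every vocabulary token with one large matrix multiply, whereas an equal-sized-cluster, multi-probe scheme scores only the tokens in a few promising clusters. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 256, 50_000  # hidden size and vocabulary size (illustrative)
W = rng.standard_normal((vocab, d)).astype(np.float32)  # head weight matrix
h = rng.standard_normal(d).astype(np.float32)           # hidden state

# Standard classification head: one matmul over the entire vocabulary.
full_logits = W @ h
full_top = int(np.argmax(full_logits))

# Retrieval-style head: partition tokens into equal-sized clusters
# (here randomly, for illustration), score cluster centroids, then
# probe only the best few clusters ("multi-probe").
n_clusters, probes = 500, 8
ids = rng.permutation(vocab).reshape(n_clusters, -1)  # equal-sized clusters
centroids = W[ids].mean(axis=1)                       # (n_clusters, d)
best = np.argsort(centroids @ h)[-probes:]            # top clusters to probe
cand = ids[best].ravel()                              # candidate token ids
approx_top = int(cand[np.argmax(W[cand] @ h)])        # score candidates only

# Only probes/n_clusters of the vocabulary is scored exactly.
print(len(cand), "of", vocab, "tokens scored")
```

With a learned (rather than random) clustering, the probed candidates would be far more likely to contain the true top token; the sketch only shows why the candidate matmul is a small fraction of the full-vocabulary one.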
FlashHead has been tested on several of the world’s most widely used open models, including Llama-3.2, Gemma-3, and Qwen-3.
These speedups are relative to state-of-the-art optimization (W4A16 quantization). When combined with mixed-precision optimization, the Embedl-optimized Llama-3.2 1B on an RTX 3500 Ada Generation reaches almost the same latency (2.06 ms) as the original model on the much more powerful H200 GPU (1.95 ms), enabling true on-device AI.
Starting December 8, 2025, developers and researchers can access and use optimized FlashHead models for Llama-3.2, Gemma-3, and Qwen-3 on Hugging Face. Visit Hugging Face to try FlashHead and experience its efficiency improvements firsthand.
“This milestone makes state-of-the-art models run faster, cheaper, and locally for everyone, no cloud required,” says Hans Salomonsson.
Embedl AB is a Swedish company specializing in AI optimization. Embedl delivers solutions to make AI models faster, smaller, and more energy-efficient, powering high-performance AI on any Edge device.
Media Contact:
Press Embedl AB
frida@embedl.com | www.embedl.com