Tag

#compression

7 articles

Bonsai 27B is a full open reasoning model that fits on an iPhone

PrismML has compressed a 27-billion-parameter AI model to under 4 GB, enabling it to run on an iPhone. Apple is reportedly testing the technology, which could advance on-device AI capabilities.

Jul 15 · 21

tech

Meet Nemotron Labs 3 Puzzle 75B A9B: A Compressed Hybrid MoE LLM Delivering 2.03x Server Throughput

NVIDIA introduces Nemotron-Labs-3-Puzzle-75B-A9B, a compressed hybrid MoE LLM delivering 2.03x server throughput, leveraging hardware-aware compression and knowledge distillation.

Jul 9 · 35

NVIDIA Releases Nemotron-Labs-3-Puzzle-75B-A9B: A Compressed Hybrid MoE LLM Delivering 2.03x Server Throughput at Matched User Throughput

Learn how NVIDIA's new AI model Nemotron-Labs-3-Puzzle-75B-A9B uses compression and smart design to work faster and more efficiently than previous versions, without sacrificing quality.

Jul 8 · 40

Sina's open model VibeThinker-3B aims to show reasoning compresses well but factual knowledge doesn't

A new AI model called VibeThinker-3B shows that logical reasoning can be made small and efficient, while factual knowledge still requires larger models. This discovery could make AI more accessible and powerful.

Jun 27 · 41

The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache

As KV cache memory outpaces model weights in large language models, three compression techniques—TurboQuant, OSCAR, and EpiCache—are emerging as key contenders. While each offers distinct methods for optimization, they are seen as complementary rather than competitive.

Jun 18 · 56

A Coding Implementation to Compress and Benchmark Instruction-Tuned LLMs with FP8, GPTQ, and SmoothQuant Quantization using llmcompressor

Learn to compress instruction-tuned language models using FP8, GPTQ, and SmoothQuant quantization techniques with llmcompressor, and benchmark their performance.

May 17 · 56

Google Introduces TurboQuant: A New Compression Algorithm that Reduces LLM Key-Value Cache Memory by 6x and Delivers Up to 8x Speedup, All with Zero Accuracy Loss

Google introduces TurboQuant, a new compression algorithm that reduces LLM key-value cache memory by 6x and delivers up to 8x speedup without accuracy loss.

Mar 24 · 98