X-trader NEWS
Open your market's potential
Google releases KV cache compression technology; storage demand expected to take a hit as US memory stocks fall across the board

By Li Jia, Bao Yilong
Source: Wallstreetcn
Google has launched **TurboQuant**, a memory compression technology that compresses the key-value cache of large language models down to 3 bits, achieving a **6x reduction in memory usage** and up to **8x acceleration**, sparking market concerns over storage demand. However, Morgan Stanley pointed out that the technology applies only to the **inference stage** and may instead unlock more AI application scenarios. Memory stocks tumbled during the session but pared their losses by the close: Micron fell more than 3.4%, Seagate Technology closed down 2.6%, and Western Digital's decline narrowed to 1.6%.
U.S. memory chip stocks sold off sharply during Wednesday's session: Western Digital dropped as much as 6.5%, Micron Technology fell about 4%, and Seagate Technology fell more than 5%.
Google's new AI memory compression technology, TurboQuant, has raised worries about the outlook for storage demand. The technology is said to cut cache memory usage for large language models by at least 6x without sacrificing accuracy, while delivering up to 8x faster performance; it is aimed at relieving memory bottlenecks in AI inference and vector search.
At Wednesday's close, the U.S. Memory Chip & Hardware Supply Chain Index fell 2.08% to 113.03 points, after dipping to an intraday low of 109 points. Micron led decliners, falling more than 3.4%; Seagate Technology closed down 2.6%, while Western Digital pared its loss to 1.6%.
## Google TurboQuant Shakes Up Storage Demand
Google’s TurboQuant is a memory compression technology designed specifically for large language models and vector search engines, with the core goal of solving the storage bottleneck of key-value (KV) caches in AI systems.
According to Google's announcement, TurboQuant can compress the KV cache to 3 bits per value without requiring model training or fine-tuning. Tests on open-source models such as Gemma and Mistral show a **6x reduction in KV memory usage**, and on NVIDIA H100 GPUs the algorithm delivers up to an **8x performance improvement** over unquantized KV implementations.
The technology achieves compression in two steps: first, it uses the **PolarQuant** method to rotate data vectors for high-quality compression, then applies the quantized Johnson-Lindenstrauss algorithm to eliminate residual errors. Google noted that traditional vector quantization methods create an extra 1–2 bits of memory overhead per value, partially offsetting compression benefits — an issue TurboQuant improves upon.
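To make the two-step pipeline concrete, here is a minimal sketch of the general rotate-then-quantize idea. Google has not published TurboQuant's implementation here, so a generic random orthogonal rotation and per-vector 3-bit min-max quantization stand in for PolarQuant and the quantized Johnson-Lindenstrauss step; every dimension and number below is illustrative.

```python
# Minimal sketch of rotate-then-quantize KV cache compression.
# NOT Google's TurboQuant: a generic random rotation and 3-bit min-max
# quantizer are used purely to illustrate the memory arithmetic.
import numpy as np

def random_rotation(dim: int, seed: int = 0) -> np.ndarray:
    """Sample a random orthogonal matrix via QR decomposition."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def quantize_3bit(x: np.ndarray):
    """Per-vector asymmetric min-max quantization to 3 bits (8 levels)."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 7.0          # 2**3 - 1 = 7 steps between levels
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

# Toy KV cache: 1024 cached tokens with head dimension 128, stored as FP16.
kv = np.random.default_rng(1).standard_normal((1024, 128)).astype(np.float16)

R = random_rotation(128)
rotated = kv.astype(np.float32) @ R             # rotate before quantizing
codes, lo, scale = quantize_3bit(rotated)
recovered = dequantize(codes, lo, scale) @ R.T  # rotate back after decoding

rel_err = np.linalg.norm(recovered - kv) / np.linalg.norm(kv)
print(f"relative error: {rel_err:.3f}, raw payload ratio: {16 / 3:.1f}x")
# The per-vector lo/scale values add overhead on top of the 3-bit codes --
# the kind of side-information cost the article says TurboQuant reduces.
```

The rotation step matters because it spreads outlier coordinates across the whole vector, keeping the min-max range tight so that a coarse 3-bit grid loses far less information.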
TurboQuant is set to be presented at **ICLR 2026**, and PolarQuant is planned for **AISTATS 2026**. Google has validated the technology across benchmarks including LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval, and stated it is also suitable for vector retrieval in large-scale search engines.
---
## Jevons Paradox Revisited? TurboQuant May Unlock More AI Use Cases
Morgan Stanley noted that Google’s TurboQuant only affects key-value caches during inference, and does **not** impact high-bandwidth memory (HBM) used for model weights, nor does it relate to training workloads.
Therefore, this does not mean total storage or hardware demand will drop by 6x. Instead, efficiency gains will increase per-GPU throughput: the same hardware can support 4–8x longer context lengths, or significantly larger batch sizes without triggering out-of-memory errors.
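As a back-of-envelope illustration of why context length scales with the compression ratio, consider the standard KV-cache sizing formula with hypothetical model dimensions; the layer, head, and memory-budget numbers below are assumptions for illustration, not figures from the article or from Google.

```python
# Back-of-envelope KV-cache sizing under a fixed memory budget. The model
# shape (roughly that of a mid-size open model) and the 40 GiB budget are
# hypothetical assumptions.
layers, kv_heads, head_dim = 32, 8, 128

def kv_bytes_per_token(bits_per_value: float) -> float:
    # Both keys and values are cached, hence the leading factor of 2.
    return 2 * layers * kv_heads * head_dim * bits_per_value / 8

budget_gib = 40  # memory assumed to be set aside for the KV cache
for label, bits in [("FP16", 16), ("3-bit", 3)]:
    per_tok = kv_bytes_per_token(bits)
    max_tokens = budget_gib * 1024**3 / per_tok
    print(f"{label:>5}: {per_tok / 1024:5.1f} KiB/token -> "
          f"~{max_tokens / 1e6:.1f}M cacheable tokens")
# 16 bits -> 3 bits is a ~5.3x raw gain, consistent with the 4-8x
# context-length range cited above once overheads and kernels are counted.
```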
Still, the memory sector has posted substantial gains year-to-date and valuations are already stretched, so any technical development that might reduce hardware demand is enough to trigger a defensive market reaction. Morgan Stanley also warned that, because the compression technology can be integrated directly into platform infrastructure, it may put marginal downward pressure on the software layer.
In its analysis, Morgan Stanley cited the **Jevons Paradox**, arguing that improved efficiency could actually boost overall demand. The logic is: by compressing data volume and transmission, TurboQuant drastically lowers the service cost per query, making AI deployment more profitable.
This means models previously reliant on cloud clusters can run on local hardware, effectively lowering barriers to large-scale AI deployment. This would unlock more application scenarios and drive higher utilization of existing infrastructure.
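The cost logic can be made concrete with a toy calculation; the hourly GPU cost and baseline throughput below are invented for illustration and are not figures from Morgan Stanley or Google.

```python
# Toy cost-per-query arithmetic behind the Jevons Paradox argument.
# All inputs are illustrative assumptions.
gpu_hour_cost = 3.00   # assumed cost of one accelerator-hour, USD
baseline_qph = 1_000   # assumed queries per hour before compression
speedup = 5.0          # mid-range of the 4-8x throughput gain cited above

cost_before = gpu_hour_cost / baseline_qph
cost_after = gpu_hour_cost / (baseline_qph * speedup)
print(f"cost per query: ${cost_before:.4f} -> ${cost_after:.4f}")
# Jevons Paradox: if cheaper queries expand usage by more than the 5x
# efficiency gain, total hardware demand rises rather than falls.
```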
Morgan Stanley called TurboQuant a **breakthrough reshaping the cost curve of AI deployment**, comparing its impact to that of DeepSeek. It represents a positive signal for cloud providers and model platforms, delivering strong return on investment in long-context inference and retrieval-intensive applications. The long-term impact on computing and memory hardware is judged as **neutral to positive**.
---
## Risk Warning and Disclaimer
Markets carry risk; invest with caution. This article does not constitute personalized investment advice, nor does it take into account the specific investment objectives, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article suit their particular circumstances. Any investment made on the basis of this article is at your own risk.
Contact: Sarah
Phone: +1 6269975768
Email: xttrader777@gmail.com
Address: 250 Consumers Rd, Toronto, ON M2J 4V6, Canada