Apple's breakthrough — running large language models 20x faster from flash memory

Source: LLM in a Flash — Alizadeh et al., Apple, arXiv 2312.11514

I started this one expecting a technical improvement. The more interesting part is what the improvement reveals about where current AI systems are still awkward, expensive, or surprisingly fragile.

There’s a straightforward constraint on running large language models locally: the models are bigger than the RAM in most devices. A 70B-parameter model in half precision (2 bytes per parameter) takes around 140GB of memory. A high-end consumer laptop has 32–64GB of RAM. Simple arithmetic says it can’t work.

Apple’s LLM-in-a-Flash paper says otherwise. The insight: the flash storage in modern devices holds 0.5–2TB, with read speeds around 1–10 GB/s, so capacity is plentiful even though flash is much slower than DRAM. The feed-forward layers of the models the paper studies are sparse: at any given inference step, only a small fraction of their neurons are actually active. If you can predict which parameters you’ll need and load only those from flash, you can run a much larger model from a much smaller DRAM footprint.

Two specific techniques make this work. “Windowing” keeps the neurons activated for the last few tokens in DRAM and reuses them across successive generation steps, so each new token only needs the neurons that are not already resident. That dramatically reduces the volume of data transferred from flash. “Row-column bundling” stores the weights that belong to the same neuron together, so they can be read from flash in larger contiguous chunks, which is the access pattern flash handles best.

Combined, these let models up to 2× the size of the available DRAM run with reported inference speedups of 4–5× on CPU and 20–25× on GPU over naive loading. The practical significance: this is one of the key technical enablers for private, on-device AI inference.
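To make the arithmetic and the windowing idea concrete, here is a minimal toy sketch in Python. It is not the paper's implementation. The class `SlidingWindowCache`, the functions `fp16_footprint_gb`, `bundle`, and `simulate`, and every parameter value (neuron counts, the 80% overlap between consecutive tokens) are assumptions made up for illustration. The `bundle` helper reflects my reading of row-column bundling: the up-projection column and down-projection row of the same neuron are stored side by side so one contiguous read fetches both.

```python
import random
from collections import deque


def fp16_footprint_gb(n_params: float) -> float:
    """Memory for the weights at 2 bytes per parameter, in decimal GB."""
    return n_params * 2 / 1e9


def bundle(up_cols, down_rows):
    """Toy bundling: record i = up-projection column i + down-projection row i."""
    return [list(col) + list(row) for col, row in zip(up_cols, down_rows)]


class SlidingWindowCache:
    """Keeps in DRAM the neurons activated by the last `window` tokens."""

    def __init__(self, window: int):
        self.recent = deque(maxlen=window)  # one set of neuron ids per token
        self.resident = set()               # neuron ids currently held in DRAM

    def step(self, active: set) -> int:
        """Process one token; return how many neurons must be read from flash."""
        to_load = active - self.resident
        self.recent.append(active)          # the oldest token's set falls out
        self.resident = set().union(*self.recent)
        return len(to_load)


def simulate(window, n_tokens=200, n_neurons=10_000, n_active=500,
             keep=0.8, seed=0):
    """Total neurons loaded from flash over a synthetic token stream.

    Assumption: each token reuses about `keep` of the previous token's
    active neurons. Real activation patterns will differ.
    """
    rng = random.Random(seed)
    cache = SlidingWindowCache(window)
    active = set(rng.sample(range(n_neurons), n_active))
    loaded = 0
    for _ in range(n_tokens):
        loaded += cache.step(active)
        kept = set(rng.sample(sorted(active), int(keep * n_active)))
        fresh = set(rng.sample(range(n_neurons), n_active - len(kept)))
        active = kept | fresh
    return loaded


if __name__ == "__main__":
    print("70B params in fp16 (GB):", fp16_footprint_gb(70e9))
    print("neurons loaded, no window:", simulate(window=0))
    print("neurons loaded, window=5: ", simulate(window=5))
```

With `window=0` nothing stays resident, so every active neuron is reloaded for every token, which stands in for naive loading. Any positive window lets consecutive tokens share neurons, so the count of flash reads drops. How much it drops in practice depends on how strongly real activations overlap between tokens, which this toy only assumes.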

In plain English, that is why the result matters beyond the headline speedup. It changes where people should look and what they should question, and it suggests that one comfortable assumption, that a model must fit entirely in DRAM to run on a device, probably needs to be retired.

So the question is not whether the method sounds clever. Many things sound clever in AI. The question is whether it removes a real bottleneck once the system leaves the paper and meets the world.