Source: LLM in a Flash — Alizadeh et al., Apple, arXiv 2312.11514
There’s a straightforward constraint on running large language models locally: the models are bigger than the RAM in most devices. A 70B-parameter model in half precision needs around 140 GB for its weights alone. A high-end consumer laptop has 32–64 GB of RAM. Simple arithmetic says it can’t work. Apple’s LLM-in-a-Flash paper argues the limit is softer than it looks.
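The arithmetic can be checked in a few lines. This is a minimal sketch that counts weights only (no activations or KV cache) and uses decimal gigabytes; the function name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9

# 70B parameters at half precision (2 bytes per parameter).
fp16_gb = weight_memory_gb(70e9, 2)
print(f"70B model at fp16: {fp16_gb:.0f} GB")

# Compare against the laptop RAM sizes quoted in the text.
for ram_gb in (32, 64):
    print(f"{ram_gb} GB of RAM holds {ram_gb / fp16_gb:.0%} of the weights")
```

Even the 64 GB machine holds less than half of the weights, which is the gap the rest of the piece is about.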
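The insight: flash memory (SSDs) in modern devices holds 0.5–2 TB, with read speeds around 1–10 GB/s. That is far more capacity than DRAM, but far less bandwidth, so streaming every weight from flash at each step would be too slow. What makes the approach viable is sparsity: in the feed-forward layers of the models the paper studies, only a small fraction of neurons produce nonzero activations at any given inference step. If you can predict which parameters you’ll need and load only those from flash, you can run a much larger model from a much smaller DRAM footprint.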
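A rough calculation shows why sparsity matters at these bandwidths. It is a sketch only: the 5% active fraction is a hypothetical value chosen for illustration, not a figure from the paper, and the model treats every read as running at full sequential bandwidth.

```python
def seconds_to_read(gigabytes: float, bandwidth_gb_per_s: float) -> float:
    """Idealized transfer time, ignoring latency and small-read overhead."""
    return gigabytes / bandwidth_gb_per_s

MODEL_GB = 140.0        # 70B parameters at fp16, as in the text
ACTIVE_FRACTION = 0.05  # hypothetical share of weights needed per step

for bandwidth in (1.0, 10.0):  # GB/s, the range quoted in the text
    full = seconds_to_read(MODEL_GB, bandwidth)
    sparse = seconds_to_read(MODEL_GB * ACTIVE_FRACTION, bandwidth)
    print(f"{bandwidth:>4.0f} GB/s: all weights {full:6.1f} s, "
          f"active weights only {sparse:5.2f} s")
```

Under these assumptions, reading everything costs many seconds per step even on fast flash, while reading only the active slice cuts the transfer by the same factor as the sparsity.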
Two specific techniques make this work. “Windowing” reuses previously activated neurons across successive generation steps, so only the neurons that are newly needed have to be read from flash. “Row-column bundling” stores weights that are always used together next to each other, so each read is a larger contiguous chunk, which suits flash’s strength in sequential access. Combined, the paper reports, these enable models up to 2× the size of the available DRAM to run 4–5× faster on CPU and 20–25× faster on GPU than naive loading.
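Both ideas can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's implementation: the class `NeuronWindow`, the function `bundle`, the window size, and the toy active sets are all hypothetical, and real weights would be tensors rather than lists.

```python
from collections import deque

class NeuronWindow:
    """Tracks which neurons stay in DRAM for the last `k` tokens."""

    def __init__(self, k: int):
        # Active-neuron sets for the most recent k tokens; the oldest
        # set is dropped automatically when a new one is appended.
        self.recent = deque(maxlen=k)

    def resident(self) -> set:
        """Union of neurons active for any token still in the window."""
        return set().union(*self.recent)

    def step(self, active: set):
        """Return (neurons to read from flash, neurons that can be freed)."""
        before = self.resident()
        self.recent.append(set(active))
        after = self.resident()
        return after - before, before - after

def bundle(up_vectors, down_vectors):
    """Place neuron i's up- and down-projection vectors back to back,
    so a single contiguous read fetches both."""
    return [list(u) + list(d) for u, d in zip(up_vectors, down_vectors)]

# Toy run: consecutive tokens activate overlapping sets of neurons.
window = NeuronWindow(k=3)
steps = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6}]
loaded = 0
for active in steps:
    to_load, to_evict = window.step(active)
    loaded += len(to_load)
    print(f"active={sorted(active)} load={sorted(to_load)} evict={sorted(to_evict)}")
print(f"neurons read with windowing: {loaded}, without: {sum(map(len, steps))}")

# Toy bundling: two neurons, vectors of length 2 in each projection.
rows = bundle([[0.1, 0.2], [0.3, 0.4]], [[1.0, 2.0], [3.0, 4.0]])
print(rows[0])  # neuron 0's up and down weights in one contiguous record
```

The more the active sets of neighbouring tokens overlap, the fewer neurons each step has to fetch. Bundling halves the number of reads per neuron while doubling the size of each one.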
The practical significance: this is one of the key technical enablers for private, on-device AI inference. Keeping inference local means data never has to leave the device, the model keeps working offline, and responses don’t wait on a network round trip. Together, these point toward a future where meaningful AI capability doesn’t require a cloud connection.