Flash-MoE: Run a 397B AI Model on Your Laptop
Flash-MoE just hit the front page of Hacker News with an audacious claim: running a 397 billion parameter model on a MacBook Pro at 4.4 tokens per second. No cloud. No GPU cluster. Just a laptop and some clever engineering.
For indie hackers who've dreamed of running frontier-scale models locally, this is the breakthrough we've been waiting for.
The Numbers
Let that sink in: 397 billion parameters, frontier scale by any measure, running on consumer hardware at usable speeds.
| Configuration | Speed | Quality | Notes |
|---|---|---|---|
| 4-bit experts, FMA kernel | 4.36 tok/s | Excellent | Full tool calling. 209GB on disk. |
| 4-bit experts, baseline | 3.90 tok/s | Excellent | Before FMA optimization. |
| 2-bit experts | 5.74 tok/s | Good | Breaks JSON/tool calling. |
The hardware? A MacBook Pro with M3 Max, 48GB RAM, and a 1TB SSD. No exotic setup. No enterprise hardware. The same machine many developers already own.
How It Works: SSD Expert Streaming
The secret sauce is SSD Expert Streaming. The model's 209GB of expert weights live on disk, not in RAM. Only the active experts — K=4 per token, about 6.75MB each — get loaded on demand.
The key insight: In Mixture-of-Experts (MoE) models, only a small fraction of parameters are activated per token. Flash-MoE exploits this by streaming experts from SSD instead of trying to fit everything in memory.
The approach is inspired by Apple's "LLM in a Flash" paper but taken further. Instead of complex custom caching, the authors discovered something surprising: trusting the OS page cache beats every custom solution.
"Trust the OS" Philosophy
Every custom caching approach they tried — Metal LRU, malloc cache, LZ4 compressed cache — was slower than simply letting the OS page cache handle it. The results:
- Metal LRU cache: removed entirely; the code ran 38% faster without it
- LZ4 expert compression: 13% slower (decompression overhead outweighed the I/O savings)
- Plain OS page cache: 71% hit rate naturally, with zero custom code
This is the kind of counter-intuitive discovery that only comes from building and measuring. OS developers have spent decades optimizing the page cache. Use it.
Pure C/Metal: No Python Overhead
Flash-MoE is written entirely in C and Objective-C with hand-tuned Metal shaders. No Python. No PyTorch. No frameworks.
Why does this matter? Every abstraction layer costs performance. By stripping everything down to bare metal, they achieved:
- FMA-optimized dequant kernel: 12% faster by rearranging math to use GPU fused multiply-add
- Deferred GPU execution: CPU prepares next layer while GPU computes current one
- Accelerate BLAS for linear attention: 64% faster delta-net updates
- C BPE tokenizer: 20x faster startup (180ms vs 3500ms)
The codebase is ~7,000 lines of C and ~1,200 lines of Metal shaders. Readable. Hackable. No dependency hell.
The Model: Qwen3.5-397B-A17B
Flash-MoE runs Qwen3.5-397B-A17B, a Mixture-of-Experts model (the A17B suffix indicates roughly 17 billion active parameters per token) with a hybrid architecture:
- 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention
- 512 experts per layer: K=4 activated per token plus one shared expert
- Hidden dimension: 4096
- Full tool calling support: Production-ready JSON output
This isn't a toy model. It's a frontier-scale model with tool calling capabilities, running locally on a laptop.
What This Means for Indie Hackers
The implications are massive for indie builders:
No more API bills for experimentation. Run a 397B model locally. Test prompts, build features, prototype agents — all without watching your credit card balance.
Privacy-first AI products. Sensitive data never leaves your machine. Build for healthcare, legal, finance without compliance nightmares.
Offline AI agents. No internet required. Run AI in air-gapped environments, on planes, in remote locations.
Getting Started
Flash-MoE is open source on GitHub; the repository covers building from source and running interactive chat with tool calling.
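A plausible clone-and-build flow looks like the following. The `make` target, binary name, and flags here are assumptions, not taken from the repository; check the README for the real instructions.

```shell
# Hypothetical commands — verify against the repo README.
git clone https://github.com/danveloper/flash-moe
cd flash-moe
make                 # pure C/Metal build: no Python, no framework deps

# Interactive chat with tool calling (4-bit experts, per the table above):
./flash-moe --model qwen3.5-397b-a17b-q4 --chat
```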
The 4-bit quantized experts (209GB) are required for production use. The 2-bit version is faster but breaks JSON output — fine for chat, not for tool use.
The Bottom Line
Flash-MoE proves something many thought impossible: frontier-scale models can run on consumer hardware. Not in 5 years. Not with next-gen chips. Today.
The combination of MoE sparsity, SSD streaming, and bare-metal engineering makes it work. For indie hackers building AI products, this is a game-changer. No more choosing between API costs and model quality.
Check it out at github.com/danveloper/flash-moe. The future of local AI just got a lot more interesting.
I write about AI tools and automation for indie hackers. Flash-MoE is exactly the kind of breakthrough that levels the playing field — big models without big budgets.
Build Smarter With AI
Check out my guides and products for indie hackers building with AI.