Flash-MoE: Run a 397B AI Model on Your Laptop
Flash-MoE just hit the front page of Hacker News with an audacious claim: running a 397 billion parameter model on a MacBook Pro at 4.4 tokens per second. No cloud. No GPU cluster. Just a laptop and some clever engineering.
For indie hackers who've dreamed of running frontier-scale models locally, this is the breakthrough we've been waiting for.
The Numbers
Let that sink in: 397 billion parameters, frontier scale by any measure, running on consumer hardware at usable speeds.
| Configuration | Speed | Quality | Notes |
|---|---|---|---|
| 4-bit experts, FMA kernel | 4.36 tok/s | Excellent | Full tool calling. 209GB on disk. |
| 4-bit experts, baseline | 3.90 tok/s | Excellent | Before FMA optimization. |
| 2-bit experts | 5.74 tok/s | Good | Breaks JSON/tool calling. |
The hardware? A MacBook Pro with M3 Max, 48GB RAM, and a 1TB SSD. No exotic setup. No enterprise hardware. The same machine many developers already own.
How It Works: SSD Expert Streaming
The secret sauce is SSD Expert Streaming. The model's 209GB of expert weights live on disk, not in RAM. Only the active experts — K=4 per token, about 6.75MB each — get loaded on demand.
The key insight: In Mixture-of-Experts (MoE) models, only a small fraction of parameters are activated per token. Flash-MoE exploits this by streaming experts from SSD instead of trying to fit everything in memory.
The approach is inspired by Apple's "LLM in a Flash" paper but taken further. Instead of complex custom caching, the authors discovered something surprising: trusting the OS page cache beats every custom solution.
"Trust the OS" Philosophy
Every custom caching approach they tried — Metal LRU, malloc cache, LZ4 compressed cache — was slower than simply letting the OS page cache handle it. The results:
- Metal LRU cache: removed entirely; the code ran 38% faster without it
- LZ4 expert compression: 13% slower (decompression overhead outweighed the I/O savings)
- Plain OS page cache: 71% hit rate naturally, with zero custom code
This is the kind of counter-intuitive discovery that only comes from building and measuring. OS developers have spent decades optimizing the page cache. Use it.
Pure C/Metal: No Python Overhead
Flash-MoE is written entirely in C and Objective-C with hand-tuned Metal shaders. No Python. No PyTorch. No frameworks.
Why does this matter? Every abstraction layer costs performance. By stripping everything down to bare metal, they achieved:
- FMA-optimized dequant kernel: 12% faster by rearranging math to use GPU fused multiply-add
- Deferred GPU execution: CPU prepares next layer while GPU computes current one
- Accelerate BLAS for linear attention: 64% faster delta-net updates
- C BPE tokenizer: 20x faster startup (180ms vs 3500ms)
The codebase is ~7,000 lines of C and ~1,200 lines of Metal shaders. Readable. Hackable. No dependency hell.
The Model: Qwen3.5-397B-A17B
Flash-MoE runs Qwen3.5-397B-A17B, a Mixture-of-Experts model (the A17B suffix indicates roughly 17 billion active parameters per token) with a hybrid architecture:
- 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention
- 512 experts per layer: K=4 activated per token plus one shared expert
- Hidden dimension: 4096
- Full tool calling support: Production-ready JSON output
This isn't a toy model. It's a frontier-scale model with tool calling capabilities, running locally on a laptop.
What This Means for Indie Hackers
The implications are massive for indie builders:
No more API bills for experimentation. Run a 397B model locally. Test prompts, build features, prototype agents — all without watching your credit card balance.
Privacy-first AI products. Sensitive data never leaves your machine. Build for healthcare, legal, finance without compliance nightmares.
Offline AI agents. No internet required. Run AI in air-gapped environments, on planes, in remote locations.
Getting Started
Flash-MoE is open source on GitHub; the repository covers building from source and running interactive chat with tool calling.
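A plausible clone-and-build flow looks like the following. The `make` target, binary name, and flags here are assumptions, not taken from the repository; check the README for the real instructions.

```shell
# Hypothetical commands — verify against the repo README.
git clone https://github.com/danveloper/flash-moe
cd flash-moe
make                 # pure C/Metal build: no Python, no framework deps

# Interactive chat with tool calling (4-bit experts, per the table above):
./flash-moe --model qwen3.5-397b-a17b-q4 --chat
```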
The 4-bit quantized experts (209GB) are required for production use. The 2-bit version is faster but breaks JSON output — fine for chat, not for tool use.
The Bottom Line
Flash-MoE proves something many thought impossible: frontier-scale models can run on consumer hardware. Not in 5 years. Not with next-gen chips. Today.
The combination of MoE sparsity, SSD streaming, and bare-metal engineering makes it work. For indie hackers building AI products, this is a game-changer. No more choosing between API costs and model quality.
Check it out at github.com/danveloper/flash-moe. The future of local AI just got a lot more interesting.
I write about AI tools and automation for indie hackers. Flash-MoE is exactly the kind of breakthrough that levels the playing field — big models without big budgets.
Build Smarter With AI
Check out my guides and products for indie hackers building with AI.