March 31, 2026

Ollama MLX: Run Local AI 3x Faster on Apple Silicon

Ollama just shipped version 0.19, and it's a big deal for anyone running local LLMs on a Mac. The new release swaps in Apple's MLX framework as the backend on Apple Silicon — and the numbers are wild.

On an M5 Max, Ollama can now prefill at 1,851 tokens per second and decode at 134 tokens per second with the Qwen3.5-35B-A3B model. That's not cloud inference. That's your laptop.
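To make those throughput figures concrete, here's a rough latency estimate for a single agent turn. The prompt and output sizes below are illustrative assumptions, not Ollama's benchmark configuration:

```python
# Back-of-envelope latency from the quoted M5 Max throughput numbers.
PREFILL_TPS = 1851   # tokens/s, prompt processing (quoted figure)
DECODE_TPS = 134     # tokens/s, generation (quoted figure)

prompt_tokens = 4000  # assumed: a typical coding-agent context
output_tokens = 500   # assumed: a typical response length

prefill_s = prompt_tokens / PREFILL_TPS   # time to ingest the prompt
decode_s = output_tokens / DECODE_TPS     # time to generate the reply
total_s = prefill_s + decode_s

print(f"prefill: {prefill_s:.1f}s, decode: {decode_s:.1f}s, total: {total_s:.1f}s")
```

At those rates, a 4,000-token prompt is ingested in about two seconds, so most of a turn's latency comes from generation, not prompt processing.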

What Changed Under the Hood

Previous versions of Ollama relied on GGML (via llama.cpp) for inference. That worked well, but Apple's MLX framework takes better advantage of the unified memory architecture that makes Apple Silicon special. Instead of copying data between CPU and GPU memory, MLX keeps everything in one shared pool.

The result is dramatically lower latency and higher throughput — especially on M5 chips, which add dedicated GPU Neural Accelerators that MLX can tap into directly.

That combination of MLX and the M5 accelerators is what Ollama 0.19 brings to the table.

Why This Matters for Builders

If you're running Claude Code, OpenCode, Codex, or any coding agent that uses Ollama as its local backend, this update makes everything snappier. Improved prompt caching means repeated tool calls and branching conversations don't waste time reprocessing the same system prompt.
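The idea behind prompt caching is simple: if a new request shares a prefix (like a system prompt) with an earlier one, only the unshared suffix needs prefill. Here's a toy sketch of that idea in Python; it is a conceptual model, not Ollama's actual cache implementation:

```python
def common_prefix_len(a, b):
    # Count how many leading tokens two sequences share.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PrefixCache:
    """Toy model of prompt caching: reuse work for shared prefixes."""

    def __init__(self):
        self._seen = []  # token sequences from earlier requests

    def process(self, tokens):
        # Reuse the longest prefix shared with any earlier request;
        # only the remaining suffix still needs prefill.
        reused = max((common_prefix_len(tokens, s) for s in self._seen), default=0)
        self._seen.append(tokens)
        return len(tokens) - reused  # tokens actually processed

cache = PrefixCache()
system = list(range(1000))                         # shared 1000-token system prompt
first = cache.process(system + [2001, 2002])       # cold: everything is prefilled
second = cache.process(system + [3001, 3002, 3003])  # warm: only the new suffix
print(first, second)
```

The first request pays for the full prompt, but the second only pays for its three new tokens, which is why repeated agent turns with a long shared system prompt feel so much faster.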

For indie hackers and solo builders, this is about cost and privacy.

Apple even highlighted Ollama alongside OpenClaw in the announcement. The local AI stack is maturing fast — your Mac is becoming a serious inference machine, not just a dev terminal.

The Performance Numbers

Ollama's benchmarks used Qwen3.5-35B-A3B quantized to NVFP4 on Apple's M5 lineup. The improvements over the previous GGML-based backend are significant.

You'll need at least 32GB of unified memory to run the 35B model comfortably. If you've got an M5 Max with 64GB or more, you're in an excellent spot.
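The 32GB floor follows from simple arithmetic. NVFP4 is a 4-bit format, so the weights alone take roughly half a byte per parameter; the rest of the pool goes to the KV cache, activations, and the OS itself, since unified memory is shared with everything else on the machine. The overhead split is an assumption for illustration:

```python
# Why 32 GB is the floor for a 35B model in a 4-bit format.
params = 35e9          # parameter count of the 35B model
bits_per_weight = 4    # NVFP4 stores weights in 4 bits

weights_gb = params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB
print(f"weights alone: {weights_gb:.1f} GB")

# Remaining headroom on a 32 GB machine must cover the KV cache,
# activations, and the OS, all sharing the same unified pool.
headroom_gb = 32 - weights_gb
print(f"headroom on 32 GB: {headroom_gb:.1f} GB")
```

That leaves under 15GB for everything else on a 32GB machine, which is workable but tight once contexts get long; 64GB gives you room for bigger contexts and other running apps.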

Getting Started

Download Ollama 0.19 and pull the coding-optimized model:

ollama run qwen3.5:35b-a3b-coding-nvfp4

To use it with Claude Code:

ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4

Or with OpenClaw:

ollama launch openclaw --model qwen3.5:35b-a3b-coding-nvfp4
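You can also talk to the model from your own scripts through Ollama's local REST API, which listens on port 11434 by default. Here's a minimal sketch using the standard /api/generate endpoint and only the Python standard library; the model tag is the one from this post, and the example call at the bottom assumes a running server:

```python
import json
import urllib.request

def build_payload(model: str, prompt: str) -> dict:
    # Minimal non-streaming request body for Ollama's /api/generate endpoint.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str, host: str = "http://localhost:11434") -> str:
    # POST the prompt to the local Ollama server and return the completion text.
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Example (requires the Ollama server to be running locally):
# print(generate("qwen3.5:35b-a3b-coding-nvfp4", "Write a binary search in Python."))
```

Setting "stream" to False returns the whole completion in one JSON response, which keeps scripts simple; leave it streaming for interactive use.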

The Bigger Picture

Local AI is no longer a toy. Between Ollama's MLX backend, Apple's Neural Accelerators on M5, and open-weight models like Qwen3.5 getting better every month, the gap between local and cloud inference is closing fast.

For builders who care about privacy, cost control, and offline capability, this is the update you've been waiting for. Your Mac just became a much more capable AI workstation — and it didn't cost you a cent.


Running Local AI on Your Mac?

I write about AI tools, automation, and building smarter every week. Check out more posts on local models, coding agents, and the builder's toolkit.
