Ollama is now powered by MLX on Apple Silicon in preview
TL;DR Highlight
Ollama has switched its inference backend on Apple Silicon from llama.cpp to Apple's MLX framework, delivering close to 2x faster inference. On M5 chips it also leverages the GPU Neural Accelerator, bringing meaningful performance gains to coding-agent workflows.
Who Should Read
Developers running local LLMs or coding agents such as Claude Code or Codex on Apple Silicon Macs, especially MacBook/Mac Studio users with 32GB or more of unified memory.
Core Mechanics
- Starting with Ollama 0.19, the inference backend on macOS has switched from llama.cpp (GGUF format) to MLX, a machine learning framework created by Apple. MLX is optimized for Apple Silicon's unified memory architecture, in which CPU, GPU, and Neural Engine share a single memory pool.
- On M5 Max, the Prefill speed (the rate at which the prompt is processed before the first token is generated) improved by approximately 57%, from 1154 tokens/s in version 0.18 to 1810 tokens/s in version 0.19. Decode speed (the rate at which tokens are generated) improved by approximately 93%, from 58 tokens/s to 112 tokens/s. The test model was Alibaba's Qwen3.5-35B-A3B.
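The percentage gains above follow directly from the reported throughput numbers; a quick sanity check:

```python
# Verify the reported M5 Max speedups from the raw 0.18 -> 0.19 numbers.
prefill_018, prefill_019 = 1154, 1810   # tokens/s (prompt processing)
decode_018, decode_019 = 58, 112        # tokens/s (token generation)

prefill_gain = (prefill_019 - prefill_018) / prefill_018
decode_gain = (decode_019 - decode_018) / decode_018

print(f"Prefill: +{prefill_gain:.0%}")  # Prefill: +57%
print(f"Decode:  +{decode_gain:.0%}")   # Decode:  +93%
```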
- On M5, M5 Pro, and M5 Max chips, the newly added GPU Neural Accelerator further boosts both TTFT (Time To First Token, the latency before the first response) and token generation speed.
- This update introduces support for NVFP4 (a 4-bit floating-point quantization format developed by NVIDIA). It reduces memory usage and storage while better preserving model accuracy compared to the existing Q4_K_M format. Since this format is primarily used in cloud production environments, it enables direct comparison between local and production results.
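To illustrate why a 4-bit float format can preserve accuracy better than plain integer rounding, here is a simplified sketch of block-scaled FP4 quantization. This is not NVIDIA's actual NVFP4 implementation (real NVFP4 stores one FP8 scale per 16-element block; here the scale stays in full precision for clarity):

```python
# Simplified block-scaled FP4 quantization in the spirit of NVFP4:
# each block of values shares one scale, and each value is rounded to
# the nearest FP4 (E2M1) representable magnitude.

# All non-negative magnitudes representable in FP4 E2M1:
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block of floats to (scale, fp4_codes)."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0                      # map the largest magnitude onto 6.0
    codes = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(mag if x >= 0 else -mag)
    return scale, codes

def dequantize_block(scale, codes):
    return [scale * c for c in codes]

weights = [0.12, -0.3, 0.47, 0.9, -1.2, 0.02, 0.2, 0.33]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
print([round(x, 3) for x in restored])
# [0.1, -0.3, 0.4, 0.8, -1.2, 0.0, 0.2, 0.3]
```

The per-block scale is what lets a 4-bit grid track the local dynamic range of the weights, which is where the accuracy advantage over a single global INT4 scale comes from.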
- The caching system has been significantly improved. Cache can now be reused across multiple conversations, reducing memory usage and increasing cache hit rates for tools like Claude Code that share system prompts. Two features were also added: 'intelligent checkpoints' that save snapshots at appropriate positions within prompts, and a smarter eviction policy that retains shared prefixes longer even after old branches are deleted.
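The prefix-reuse idea behind this can be sketched as follows. The class and method names here are hypothetical, not Ollama's actual internals; the point is that a conversation sharing a cached system prompt only pays prefill cost for its new tokens:

```python
# Illustrative sketch of prefix-based KV-cache reuse (not Ollama's real code):
# checkpoints are keyed by the token prefix, and a new request only needs to
# prefill the tokens beyond its longest cached prefix.

class PrefixCache:
    def __init__(self):
        self._checkpoints = {}   # prefix tuple -> opaque KV-cache snapshot

    def save(self, tokens):
        self._checkpoints[tuple(tokens)] = f"kv-snapshot[{len(tokens)}]"

    def longest_prefix(self, tokens):
        """Return (cached_len, snapshot) for the longest stored prefix."""
        for n in range(len(tokens), 0, -1):
            snap = self._checkpoints.get(tuple(tokens[:n]))
            if snap is not None:
                return n, snap
        return 0, None

cache = PrefixCache()
system_prompt = list(range(50))            # stand-in for 50 system-prompt tokens
cache.save(system_prompt)                  # checkpoint after the shared prefix

request = system_prompt + [900, 901, 902]  # new conversation, same system prompt
cached_len, _ = cache.longest_prefix(request)
print(f"reuse {cached_len} tokens, prefill only {len(request) - cached_len}")
# reuse 50 tokens, prefill only 3
```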
- The officially recommended model in this preview release is the Qwen3.5-35B-A3B NVFP4 version, tuned for coding tasks. It requires 32GB or more of unified memory and is primarily targeted at use cases integrating with coding agents like Claude Code or OpenClaw.
- Future updates were teased, including easier ways to import custom fine-tuned models into Ollama and expansion of supported architectures. Currently, only specific architectures take the MLX path.
Evidence
- A user shared benchmark results from an M4 Pro + 48GB RAM environment. Comparing the same Qwen3.5-35B-A3B model across formats, NVFP4 (PromptEval 13.2 t/s, Decode 66.5 t/s) was about 2x faster than Q4_K_M (6.6 t/s, 30.0 t/s), while int4 (59.4 t/s, 84.4 t/s) was the fastest overall. The user noted, however, that they did not verify quality differences.
- Several commenters observed that LM Studio has supported MLX for a long time, and some shared experiences where the GGUF format consistently produced slightly better benchmark results. Some viewed Ollama's MLX adoption as belated.
- A user running Qwen 70B 4-bit with llama.cpp on an M2 Max 96GB hoped the MLX transition would improve memory handling, and was curious how it compares with GGUF paths on large models. This comment also reconfirmed that Ollama had been using llama.cpp internally.
- An M4 Max 48GB user reported that even simple queries like "Hello world" took 6–25 seconds, which appears to stem from the model's "thinking" phase rather than inference itself. Since coding-focused models may enable thinking by default, checking the configuration was advised.
- A majority expressed optimism about the future of on-device LLMs, citing privacy, elimination of external API dependencies, distributed data-center demand, and power savings. Some, however, noted the practical limitation that running coding agents comfortably on a 16GB RAM Mac is still difficult.
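The "about 2x" figure in that user's benchmark can be read straight off the reported numbers:

```python
# Speedup ratios implied by the user's M4 Pro benchmark (same model, three formats).
results = {                       # format: (prompt_eval t/s, decode t/s)
    "q4_K_M": (6.6, 30.0),
    "nvfp4": (13.2, 66.5),
    "int4": (59.4, 84.4),
}
base_prompt, base_decode = results["q4_K_M"]
for fmt, (p, d) in results.items():
    print(f"{fmt}: prompt x{p / base_prompt:.1f}, decode x{d / base_decode:.1f}")
```

NVFP4 comes out at roughly 2.0x the Q4_K_M prompt rate and 2.2x its decode rate, matching the "about 2x" characterization; int4 is faster still, but with unverified quality.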
How to Apply
- If you want Claude Code integrated with a local LLM on an M1/M2/M3/M4 Mac (32GB or more), upgrade to Ollama 0.19 and run `ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4`. Decode speed is roughly 2x faster than version 0.18, noticeably reducing wait times in agent loops.
- If you run RAG or agent workflows on a local Mac that repeatedly process long system prompts or large contexts (50k+ tokens), the new intelligent cache checkpoints cut the cost of reprocessing the same prefix; simply upgrading is enough to see the benefit.
- If you're on a newer chip like the M5 Max, prioritize NVFP4-quantized models. Quality is more stable than int4, and because it is the same format used in cloud production (NVIDIA GPU servers), local test results carry over directly to production.
- If you're already satisfied with llama.cpp or LM Studio and GGUF models, note that community benchmarks suggest minor quality differences exist; rather than switching immediately, compare quality directly with the same model before deciding.
Code Example
snippet
# Integrate with Claude Code
ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4
# Integrate with OpenClaw
ollama launch openclaw --model qwen3.5:35b-a3b-coding-nvfp4
# Chat directly
ollama run qwen3.5:35b-a3b-coding-nvfp4
# Measure performance (using --verbose flag)
ollama run qwen3.5:35b-a3b-nvfp4 "calculate fibonacci numbers in a one-line bash script" --verbose
# Performance comparison by format (based on M4 Pro 48GB)
# Model PromptEvalRate EvalRate
# qwen3.5:35b-a3b-q4_K_M 6.6 t/s 30.0 t/s
# qwen3.5:35b-a3b-nvfp4 13.2 t/s 66.5 t/s
# qwen3.5:35b-a3b-int4     59.4 t/s           84.4 t/s
Terminology
MLX: A machine learning framework created by Apple exclusively for Apple Silicon. Designed to maximize use of the unified memory architecture in which CPU, GPU, and Neural Engine share memory.
NVFP4: A 4-bit floating-point quantization format developed by NVIDIA. It offers less accuracy loss than INT4 while saving memory, making it widely used for cloud production inference.
TTFT: Stands for Time To First Token. The time elapsed from when a prompt is entered until the first output token is produced. The longer it is, the more the user perceives the initial response as slow.
Prefill: The stage in which an LLM processes the entire input prompt at once before generating a response. The longer the context, the more time this takes.
KV Cache: Stands for Key-Value Cache. A cache that stores and reuses attention values previously computed by the LLM. When functioning well, it avoids recomputing from scratch when repeating the same conversation.
GGUF: The model file format used by llama.cpp. Supports various quantization options (such as Q4_K_M) and works well for CPU inference. MLX uses its own format instead of this one.