Ollama is now powered by MLX on Apple Silicon in preview
TL;DR Highlight
Ollama has switched its inference backend on Apple Silicon from llama.cpp to Apple's MLX framework, nearly doubling decode speed in the published benchmark. On M5 chips it also uses the GPU Neural Accelerator, which brings meaningful performance gains to coding agent workflows.
Who Should Read
Developers running coding agents such as Claude Code or Codex, or local LLMs in general, on an Apple Silicon Mac, especially MacBook or Mac Studio users with 32GB or more of unified memory.
Core Mechanics
- Starting with Ollama 0.19, the inference backend on macOS has switched from llama.cpp (GGUF format) to MLX, a machine learning framework created by Apple. MLX is optimized for Apple Silicon's unified memory architecture, in which the CPU, GPU, and Neural Engine share the same memory.
- On M5 Max, prefill speed (the rate at which the prompt is processed before the first token is generated) improved by approximately 57%, from 1154 tokens/s in version 0.18 to 1810 tokens/s in version 0.19. Decode speed (the rate at which output tokens are generated) improved by approximately 93%, from 58 tokens/s to 112 tokens/s. The test model was Alibaba's Qwen3.5-35B-A3B.
- On M5, M5 Pro, and M5 Max chips, the newly added GPU Neural Accelerator further boosts both TTFT (Time To First Token, the latency before the first response) and token generation speed.
- This update introduces support for NVFP4 (a 4-bit floating-point quantization format developed by NVIDIA). It reduces memory usage and storage while better preserving model accuracy compared to the existing Q4_K_M format. Since this format is primarily used in cloud production environments, it enables direct comparison between local and production results.
- The caching system has been significantly improved. Cache can now be reused across multiple conversations, which reduces memory usage and increases cache hit rates for tools like Claude Code that share system prompts. Two further additions: an 'intelligent checkpoint' feature that saves snapshots at appropriate positions within prompts, and a smarter eviction policy that retains shared prefixes longer even after old branches are deleted.
- The officially recommended model in this preview release is the Qwen3.5-35B-A3B NVFP4 version, tuned for coding tasks. It requires 32GB or more of unified memory and is primarily targeted at use cases integrating with coding agents like Claude Code or OpenClaw.
- Future updates were teased, including easier ways to import custom fine-tuned models into Ollama and expansion of supported architectures. Currently, only specific architectures take the MLX path.
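The cross-conversation cache reuse described above can be illustrated with a toy model. This is a minimal sketch, not Ollama's implementation: `PrefixCache`, `checkpoint`, and `longest_hit` are hypothetical names, tokens are plain strings, the saved state is a placeholder string, and eviction is simple least-recently-used, which is only an assumption about how a shared prefix might be kept alive longer than a stale branch.

```python
from collections import OrderedDict


def common_prefix_len(a, b):
    """Number of leading tokens that two token sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixCache:
    """Toy cache keyed by token prefixes, with least-recently-used eviction."""

    def __init__(self, max_entries=4):
        self.max_entries = max_entries
        self.entries = OrderedDict()  # tuple(tokens) -> placeholder for saved state

    def checkpoint(self, tokens):
        """Store a snapshot for this prefix; evict the least recently used if full."""
        key = tuple(tokens)
        self.entries[key] = f"state@{len(key)}"
        self.entries.move_to_end(key)
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)

    def longest_hit(self, prompt):
        """Return how many leading prompt tokens are covered by a stored snapshot."""
        best_key, best_len = None, 0
        for key in self.entries:
            covers = len(key) <= len(prompt) and common_prefix_len(key, prompt) == len(key)
            if covers and len(key) > best_len:
                best_key, best_len = key, len(key)
        if best_key is not None:
            self.entries.move_to_end(best_key)  # a hit marks the prefix as recently used
        return best_len


system = ["<sys>", "you", "are", "a", "coding", "agent"]
cache = PrefixCache(max_entries=2)
cache.checkpoint(system)  # snapshot at the end of the shared system prompt

chat_a = system + ["fix", "this", "bug"]
chat_b = system + ["write", "a", "test"]
reused_a = cache.longest_hit(chat_a)
reused_b = cache.longest_hit(chat_b)
print(reused_a, len(chat_a) - reused_a)  # prints: 6 3
```

Both conversations reuse the six system-prompt tokens and only need to prefill their own three. Because every hit moves the shared prefix to the most-recently-used position, a later eviction drops an old branch first.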
Evidence
- A user shared benchmark results from an M4 Pro with 48GB of RAM. Comparing the same Qwen3.5-35B-A3B model across formats, NVFP4 (prompt eval 13.2 t/s, decode 66.5 t/s) was about 2x faster than Q4_K_M (6.6 t/s, 30.0 t/s), while int4 (59.4 t/s, 84.4 t/s) was the fastest overall. The user noted that they did not verify quality differences.
- Several commenters observed that LM Studio has supported MLX for a long time, and some viewed Ollama's MLX adoption as belated. Some users shared experiences in which the GGUF format consistently produced better benchmark results; the margin was not large, but a difference exists.
- A user running Qwen 70B 4-bit with llama.cpp on an M2 Max with 96GB hoped the MLX transition would improve memory handling, and was curious about actual performance comparisons with the GGUF path on large models. This comment also reconfirmed that Ollama had been using llama.cpp internally.
- An M4 Max user with 48GB of RAM reported that even simple queries like 'Hello world' took 6–25 seconds. This appears to be caused by the model going through a 'thinking' phase rather than by inference speed itself. Coding-focused models may have thinking enabled by default, so checking the configuration was advised.
- A majority expressed optimism about the future of on-device LLMs, citing privacy, elimination of external API dependencies, distributed data center demand, and power savings. However, some comments pointed out the practical limitation that running coding agents comfortably on a 16GB RAM Mac is still difficult.
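The throughput figures quoted in this summary are easier to compare as ratios. The sketch below only redoes the arithmetic on the reported numbers (the M5 Max figures for 0.18 versus 0.19, and the community M4 Pro figures for Q4_K_M versus NVFP4). The helper name `speedup` is illustrative, and nothing here is a new measurement.

```python
def speedup(before, after):
    """Return (ratio, percent_gain) for a throughput that went from before to after."""
    ratio = after / before
    return ratio, (ratio - 1.0) * 100.0


# Tokens per second as reported in the text: (before, after).
reported = {
    "M5 Max prefill, 0.18 -> 0.19": (1154, 1810),
    "M5 Max decode, 0.18 -> 0.19": (58, 112),
    "M4 Pro prompt eval, Q4_K_M -> NVFP4": (6.6, 13.2),
    "M4 Pro decode, Q4_K_M -> NVFP4": (30.0, 66.5),
}

for label, (before, after) in reported.items():
    ratio, pct = speedup(before, after)
    print(f"{label}: {ratio:.2f}x ({pct:+.0f}%)")
```

The "nearly 2x" headline corresponds to the decode figure (about 1.93x); the prefill gain is smaller, at about 1.57x.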
How to Apply
- If you want to use Claude Code with a local LLM on an M1/M2/M3/M4 Mac (32GB or more), you can get started right away by upgrading to Ollama 0.19 and running `ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4`. In the published M5 Max benchmark, decode speed is roughly 2x that of version 0.18, which noticeably reduces wait times in agent loops.
- If you're running RAG or agent workflows on a local Mac that repeatedly process long system prompts or large contexts (50k+ tokens), the new intelligent cache checkpoint feature reduces the cost of reprocessing the same prefix. Simply upgrading is enough to see the benefit.
- If you're on a newer chip like M5 Max, prioritize NVFP4 quantized models. Quality is reported to be more stable than int4, and since it's the same format used in cloud production (NVIDIA GPU servers), local test results can be compared directly with production results.
- If you're already using llama.cpp or LM Studio with GGUF models and are satisfied, note that community benchmarks suggest minor quality differences exist. Rather than switching immediately, compare quality and speed directly with the same model before deciding.
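For the side-by-side comparison suggested above, one option is to read the timing fields that Ollama's local REST API returns from `/api/generate` and convert them to tokens per second. This is a minimal sketch under a few assumptions: the server listens on the default `localhost:11434`, the response carries `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, and `eval_duration` (durations in nanoseconds), and the installed version accepts the optional `think` field for thinking-capable models. The helper names are hypothetical, and the sample response uses illustrative values, not a measurement.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_request(model, prompt, think=None):
    """Build a non-streaming generate payload; think is only sent when set."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if think is not None:
        payload["think"] = think  # assumption: supported by the installed version
    return payload


def rates_from_response(resp):
    """Convert nanosecond timing fields into tokens per second (None if missing)."""
    def rate(count_key, duration_key):
        count, duration_ns = resp.get(count_key), resp.get(duration_key)
        if not count or not duration_ns:
            return None
        return count / (duration_ns / 1e9)

    return {
        "prefill_tps": rate("prompt_eval_count", "prompt_eval_duration"),
        "decode_tps": rate("eval_count", "eval_duration"),
    }


def measure(model, prompt, think=None):
    """Send one request to a running Ollama server and return its rates."""
    body = json.dumps(build_request(model, prompt, think)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as r:
        return rates_from_response(json.load(r))


# Illustrative response shape only; these numbers are not a benchmark result.
sample = {
    "prompt_eval_count": 1810,
    "prompt_eval_duration": 1_000_000_000,
    "eval_count": 224,
    "eval_duration": 2_000_000_000,
}
print(rates_from_response(sample))
```

With a server running, calling `measure("qwen3.5:35b-a3b-coding-nvfp4", "hello", think=False)` and then the same prompt against a GGUF build of the model gives comparable prefill and decode rates. Passing `think=False` is one way to check whether slow replies to trivial prompts come from a thinking phase rather than from inference speed.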
Code Example
# Integrate with Claude Code
ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4
# Integrate with OpenClaw
ollama launch openclaw --model qwen3.5:35b-a3b-coding-nvfp4
# Chat directly
ollama run qwen3.5:35b-a3b-coding-nvfp4
# Measure performance (using --verbose flag)
ollama run qwen3.5:35b-a3b-nvfp4 "calculate fibonacci numbers in a one-line bash script" --verbose
# Performance comparison by format (community benchmark, M4 Pro 48GB)
# Model                     PromptEvalRate   EvalRate
# qwen3.5:35b-a3b-q4_K_M    6.6 t/s          30.0 t/s
# qwen3.5:35b-a3b-nvfp4     13.2 t/s         66.5 t/s
# qwen3.5:35b-a3b-int4      59.4 t/s         84.4 t/s