Running Gemma 4 locally with LM Studio's new headless CLI and Claude Code
TL;DR Highlight
This article explains how to run the Google Gemma 4 26B-A4B model locally on macOS using LM Studio 0.4.0's lms CLI and integrate it with Claude Code. Thanks to the MoE architecture, it can run at 51 tok/s on a 48GB MacBook Pro, enabling coding tasks without API costs.
Who Should Read
Developers who want to adopt local models instead of cloud AI due to API costs or data privacy concerns. Specifically, developers who have an Apple Silicon Mac with 48GB or more of memory and are using AI coding tools like Claude Code.
Core Mechanics
- Cloud AI APIs suffer from rate limits, costs, privacy concerns, and network latency; local models address these with zero API costs, no data leaving the machine, and stable availability.
- Google Gemma 4 is not a single model but consists of four model families: E2B, E4B (optimized for on-device use), 26B-A4B (MoE), and 31B (Dense). E2B and E4B support audio input, and the 31B Dense model achieves the highest benchmark scores of 85.2% on MMLU Pro and 89.2% on AIME 2026.
- The 26B-A4B model uses the MoE (Mixture of Experts, an architecture that selectively activates only some of the total parameters) approach: it has 128 experts plus 1 shared expert, but activates only 8 experts (3.8B parameters) per token. This yields inference cost at the level of a 4B dense model with much higher quality.
- The effective performance of 26B-A4B is estimated at around the level of a 10B dense model (sqrt(26B × 4B) ≈ 10B); it scores 82.6% on MMLU Pro and 88.3% on AIME 2026, approaching 31B Dense (85.2%, 89.2%). By Elo score (~1441), it is also comparable to Qwen 3.5 397B-A17B and Kimi-K2.5, models in the 400B~1000B parameter range.
- On a 14-inch MacBook Pro M4 Pro (48GB unified memory), Gemma 4 26B-A4B operates at 51 tok/s and supports a 256K context window, vision input, native function/tool calling, and configurable thinking modes.
- LM Studio 0.4.0 introduces llmster (a standalone inference engine separated from the desktop app) and the lms CLI, enabling model download, loading, and serving entirely from the terminal without a GUI. It can also be used on headless servers, in CI/CD pipelines, and over SSH sessions.
- Key new features of LM Studio 0.4.0 include the llmster daemon (background service), the lms CLI, parallel request handling (serving simultaneous requests via continuous batching), a stateful REST API (a /v1/chat endpoint that maintains conversation history server-side), and MCP integration.
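The headline numbers above can be sanity-checked with quick shell arithmetic. The figures (26B total, 3.8B active, the sqrt dense-equivalent heuristic) come from the article; the 0.5 bytes/parameter factor for ~4-bit quantization is a rough assumption that ignores quantization metadata and KV-cache overhead.

```shell
# Dense-equivalent estimate for the 26B-A4B MoE: sqrt(total x active)
awk 'BEGIN { printf "%.1fB\n", sqrt(26 * 4) }'        # prints 10.2B

# Fraction of parameters active per token
awk 'BEGIN { printf "%.0f%%\n", 3.8 / 26 * 100 }'     # prints 15%

# Rough weight footprint at ~4-bit quantization (0.5 bytes/param)
awk 'BEGIN { printf "%.0f GB\n", 26e9 * 0.5 / 1e9 }'  # prints 13 GB
```

The last line is why a 48GB unified-memory machine is comfortable: ~13GB of weights plus KV cache and the OS still leaves headroom, while 32GB machines start to feel tight.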
Evidence
- "A commenter shared a setup for connecting Gemma 4 26B-A4B to Claude Code using a llama.cpp server on an M1 Max 64GB MacBook. They pointed out that Gemma 4 26B-A4B generates tokens about twice as fast as Qwen3.5 35B-A3B (40 tok/s), but lags significantly on the tau2 benchmark (a measure of agentic task capability) at 68% vs 81%. It may therefore be unsuitable for heavy agentic tasks that require many tool calls."
How to Apply
- If you want to reduce API costs or avoid sending code/data to external servers, install LM Studio 0.4.0 or later, download and load Gemma 4 26B-A4B with the `lms` CLI, and serve it as an OpenAI-compatible API to replace Claude Code's backend model with a free local model.
- If you have less than 48GB of memory or need faster speeds, choose your model bearing in mind that an MoE model's entire weights must fit in memory even though only a fraction is active per token. For example, in a 32GB environment, consider Gemma 4 E4B or a smaller quantized variant, or use Ollama or llama.cpp with a lightweight quantized GGUF file such as `unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL` to reduce memory usage.
- When applying local models to agentic coding tasks that require many tool calls, keep in mind that Gemma 4 26B-A4B scores only 68% on the tau2 benchmark versus 81% for Qwen3.5 35B-A3B. For such workloads, it is better to evaluate coding-specialized models like Qwen3-coder first.
- If you want to integrate local LLMs into headless servers or CI/CD environments, you can either use LM Studio 0.4.0's headless mode (lms CLI + llmster daemon) or run a llama.cpp server directly. llama.cpp can launch a server with a single command: `llama-server --reasoning auto --fit on -hf <model name> --temp 1.0`.
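As a sketch, the lms-based path from the bullets above might look like the following session. The command names (`lms get`, `lms load`, `lms server start`) mirror the existing lms CLI; the exact 0.4.0 flags, the model identifier, and the default port 1234 are assumptions here, so check `lms --help` against your install.

```shell
# Download the model without the GUI (model identifier assumed)
$ lms get gemma-4-26b-a4b

# Load it into the llmster daemon
$ lms load gemma-4-26b-a4b

# Start the OpenAI-compatible server (default port assumed to be 1234)
$ lms server start

# Sanity-check the endpoint before pointing Claude Code at it
$ curl http://localhost:1234/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "gemma-4-26b-a4b", "messages": [{"role": "user", "content": "hello"}]}'
```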
Code Example
```shell
# Run a Gemma 4 26B-A4B local server using llama.cpp + Swival
$ llama-server \
    --reasoning auto \
    --fit on \
    -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
    --temp 1.0 --top-p 0.95 --top-k 64

# Run the Swival agent in a separate terminal
$ uvx swival --provider llamacpp

# If running with Ollama instead
$ ollama run gemma4:26b
```
Terminology
MoE (Mixture of Experts): A method of placing multiple 'expert' networks inside a model and selectively computing only some of them for each input token. The total parameter count is large, but the actual computation per token is small, resulting in faster inference.
unified memory: An architecture in Apple Silicon Macs where the CPU and GPU share the same memory pool. Unlike conventional PCs with separate GPU VRAM, there is no separate VRAM limit, making it easier to load large models.
llmster: A standalone inference engine separated out in LM Studio 0.4.0. Previously built into the GUI app, it can now run detached as a background service.
headless CLI: A way of controlling a program with terminal commands only, without a graphical user interface (GUI). It can be used on servers or over SSH.
GGUF: A model file format used by llama.cpp. It embeds quantization (compression to reduce model size) settings, and names like Q4_K_XL indicate the 4-bit quantization level.
tau2 benchmark: A benchmark that measures an AI agent's ability to perform complex tasks by actually calling tools. It evaluates agentic task capability rather than simple question answering.