Google Gemma 4 Runs Natively on iPhone with Full Offline AI Inference
TL;DR Highlight
Google's open-source Gemma 4 model now runs on iPhone with fully local, offline inference and no cloud dependency, showing that on-device AI has moved past the experimental stage into practical use.
Who Should Read
iOS/Android developers looking to add AI features to mobile apps or considering edge AI solutions with privacy and offline requirements.
Core Mechanics
- Google's open-source model family, Gemma 4, can now run inference completely locally and offline on iPhone. There is no API call or cloud dependency.
- The flagship 31B variant performed comparably to Qwen 3.5's 27B model in benchmarks, despite carrying roughly 4 billion more parameters.
- Lightweight variants designed specifically for mobile deployment are available: E2B (2 billion parameters) and E4B (4 billion parameters). Google recommends E2B even for its own apps, a choice driven by memory and thermal constraints.
- To get started, simply download the 'Google AI Edge Gallery' app from the App Store and select the desired model variant. Local inference is possible immediately without any separate settings or accounts.
- Google AI Edge Gallery is not just a simple text interface, but a platform that includes image recognition, voice interaction, and an extensible Skills framework. It is positioned as a foundation for developers to experiment with on-device AI.
- Inference runs on the iPhone's GPU via Metal, and reported response latency is noticeably low in practice. On an iPhone 16 Pro, measured figures were a prefill speed of 231 t/s, a decode speed of 16 t/s, and a time to first token of 1.16 seconds.
- The ability to operate offline is of practical value in enterprise use cases such as field work, medical environments, and where cloud processing is impossible due to data privacy regulations.
- Community members note that the same models can be run on Android via AICore or llama.cpp.
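The throughput figures reported above translate directly into end-to-end latency. A minimal sketch, assuming the iPhone 16 Pro numbers (231 t/s prefill, 16 t/s decode) hold constant over a request; the function name and workload sizes are illustrative, not part of any official benchmark:

```python
# Rough latency estimate from the reported iPhone 16 Pro figures.
# Assumes steady throughput; real devices throttle under heat.
PREFILL_TPS = 231.0  # prompt-processing speed, tokens/second
DECODE_TPS = 16.0    # generation speed, tokens/second

def estimated_latency(prompt_tokens: int, output_tokens: int) -> float:
    """Seconds to process the prompt and generate the full output."""
    return prompt_tokens / PREFILL_TPS + output_tokens / DECODE_TPS

# A 500-token prompt with a 100-token reply:
print(round(estimated_latency(500, 100), 2))  # ≈ 8.41 s
```

Most of the time goes to decoding, which is why short, focused outputs matter far more than short prompts on-device.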
Evidence
- There was criticism of the choice to use the GPU (Metal). One comment read, 'It seems they gave up on compiling custom attention kernels for Apple's dedicated NPU, the ANE (Apple Neural Engine), and worked around it with Metal.' Metal is easy to port to, but consumes far more battery than the dedicated NPU. The assessment was that this remains a flashy tech demo until the backend is rewritten for the ANE.
- A developer built and released 'pucky', an offline code-generation app for iPhone powered by Gemma 4, on GitHub (https://github.com/blixt/pucky). The 4B model is technically runnable but automatically falls back to 2B due to memory constraints; the app generates a single TypeScript file and compiles it with oxc. The developer noted that passing App Store review is difficult, so the app must be built directly in Xcode.
- Developers shared experiences of being blocked by Apple's guideline 2.5.2 when trying to ship App Store apps that bundle local LLMs. This suggests Apple is restricting LLM use within the App Store, raising a practical concern that distribution paths for on-device AI apps may be limited.
- The model architecture of Gemma 4 also drew criticism: 'Gemma 4 tends to activate almost all of its weights, resulting in high power consumption.' Commenters noted it is less efficient than Qwen3-coder, which uses MoE (Mixture of Experts) to activate only about 3 billion parameters at a time, and concluded that significant efficiency is still being left on the table.
- There were also warnings about the reliability of small models. One user shared an exchange in which the model confidently gave the wrong answer ('Yes, it is') to the question 'Is it okay to give an avocado to a dog?' The case is a reminder that relying on small on-device models for medical or safety-related judgments is dangerous.
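The MoE critique above comes down to how many parameters fire per token. A rough sketch for intuition; the 30B total for Qwen3-coder is an assumption here, and all figures are approximate:

```python
# Illustrative per-token active-parameter comparison between a dense
# model and an MoE model. All figures are rough assumptions.
def active_fraction(active_b: float, total_b: float) -> float:
    """Fraction of weights that participate in each forward pass."""
    return active_b / total_b

dense = active_fraction(4.0, 4.0)   # dense Gemma variant: every weight fires
moe = active_fraction(3.0, 30.0)    # assumed MoE: ~3B of ~30B total active

print(dense, moe)  # 1.0 vs 0.1
```

Under these assumptions the MoE model touches a tenth of its weights per token, which is the efficiency gap the commenters were pointing at.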
How to Apply
- If you are building enterprise field apps in domains such as healthcare, finance, or defense, where data privacy regulations rule out cloud AI APIs, adopting Google AI Edge Gallery with the Gemma 4 E2B/E4B variants as an on-device inference base can satisfy both regulatory compliance and the need for AI functionality.
- If you plan to embed AI in an iOS app, Apple's guideline 2.5.2 may block your App Store submission, so review TestFlight or enterprise distribution paths in advance. As the community developer's case shows, building directly in Xcode or sideloading can also serve as alternatives.
- When integrating an LLM into a mobile app, model size selection matters. Given memory and thermal constraints, E2B is more realistic than E4B; in practice, attempts to run the 4B model often fall back to 2B due to memory limits, so it is safer to design the UX around E2B from the start.
- If battery consumption of Gemma 4 on-device inference is a concern, add logic to limit usage in battery-sensitive scenarios (e.g., minimize background processing and keep contexts short) until the current GPU (Metal) backend gains ANE (Apple Neural Engine) support.
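The model-size choices above mostly reduce to weight memory. A back-of-envelope sketch; the quantization bit-widths are assumptions, and real usage adds KV cache and runtime overhead on top of this lower bound:

```python
# Lower-bound estimate of model weight memory: params * bits / 8.
# Billions of parameters times bytes per parameter gives decimal GB.
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    return params_billions * bits_per_weight / 8

# E2B vs E4B at common quantization levels:
for params in (2, 4):
    for bits in (4, 8):
        print(f"{params}B @ {bits}-bit: {weight_memory_gb(params, bits):.1f} GB")
```

At 4-bit quantization the 2B model needs roughly 1 GB just for weights, which is why E2B is the realistic default on phones with tight memory budgets.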
Terminology
ANE: Abbreviation for Apple Neural Engine, Apple's dedicated AI accelerator built into iPhone and Mac chips. It is far more power-efficient than the GPU and can run AI workloads with much less battery drain.
Metal: A low-level graphics and compute API from Apple. Unlike the ANE, it runs on the general-purpose GPU, which makes porting easy but consumes more battery.
MoE: Abbreviation for Mixture of Experts, an architecture that selectively activates only a subset of 'expert' parameters depending on the input. For the same total parameter count, the actual compute per token drops, improving speed and efficiency.
Edge AI: Running AI computation directly on user devices (the 'edge'), such as smartphones and IoT hardware, rather than on cloud servers. It enables offline operation and privacy protection, but within hardware limits.
On-device inference: An AI model computing results entirely within the device, without sending data to an external server. It reduces response latency, works without the internet, and keeps data on the device.
Skills framework: An extensible structure in Google AI Edge Gallery for adding capabilities an LLM can use, such as web search and external tool integration, in plugin form.