Less Is More: Engineering Challenges of On-Device Small Language Model Integration in a Mobile Application
TL;DR Highlight
Five failure modes and eight practical solutions emerged from five days of running on-device SLMs (Gemma 4 E2B, Qwen3 0.6B) in a Wordle-style word-guessing game.
Who Should Read
Android developers aiming to embed LLM functionality in mobile apps with offline/privacy guarantees, particularly those deploying on-device inference to production without cloud APIs.
Core Mechanics
- SLMs aren’t scaled-down cloud models; they exhibit different failure patterns, such as wrapping JSON in markdown code fences, translating JSON keys into the output language, or emitting invalid UTF-8.
- Compliance with numerical constraints like word length is poor. Qwen3 0.6B initially violated the ‘give me a 5-letter word’ constraint 30–50% of the time; even after adding CRITICAL markers and specific examples, 10–15% of outputs still violated it, ultimately forcing word selection onto a curated fixed list.
- Output quality degrades rapidly after 3–5 generations within the same chat session. The cause is saturation of the KV cache (the model’s memory of the previous conversation): Qwen3 0.6B degraded after 3 turns and Gemma 4 E2B after 5–7, producing repetitive, meaningless responses. A session-rotation sketch follows this list.
- Success rates fall exponentially with the number of output fields the LLM must produce. With 85% success per field, a 7-field schema completes end-to-end only 0.85^7 ≈ 32% of the time, while a 2-field schema reaches 0.85^2 ≈ 72%.
- WorkManager should not be used for LLM puzzle generation; it was the source of 7 distinct bugs. WorkManager caches a SUCCEEDED status even across process restarts, making it fundamentally unsuitable for foreground tasks like LLM inference that the user is actively watching. Switching to Kotlin coroutines resolved all 7 bugs.
- After evaluating 9 models, only 2 shipped. Gemma 3 1B was eliminated after 8 hours and replaced by the half-sized Qwen3 0.6B. Each model has a unique prompt-tuning profile, so each supported model amounts to a separate AI integration.
- The final architecture uses a curated JSON file for word selection, the LLM generates only 3 hints, and a deterministic fallback covers LLM failures. The game remains fully playable even if the LLM never succeeds; a minimal sketch of this division of labor follows this list.
- Prompt engineering tips: use full language names ('Brazilian Portuguese') instead of ISO codes ('pt'), provide specific negative examples rather than abstract rules, and repeat the language specification in both the system and user prompts to reduce language drift (illustrated in the prompt sketch after this list).
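Below is a minimal Kotlin sketch of the session rotation described above. The `LlmSession` interface and `RotatingSession` wrapper are illustrative assumptions, stand-ins for whatever inference runtime you use; the point is recreating the session before the KV cache saturates.

```kotlin
// Hypothetical session API; substitute your runtime's actual session type.
interface LlmSession : AutoCloseable {
    fun generate(prompt: String): String
}

// Transparently recreates the session every [maxTurns] generations,
// before KV-cache saturation degrades output quality.
class RotatingSession(
    private val newSession: () -> LlmSession,
    private val maxTurns: Int = 3, // Qwen3 0.6B failed after ~3 turns; ~5 suits Gemma 4 E2B
) : AutoCloseable {
    private var session: LlmSession = newSession()
    private var turns = 0

    fun generate(prompt: String): String {
        if (turns >= maxTurns) {
            session.close()        // drop the saturated KV cache
            session = newSession() // fresh context window
            turns = 0
        }
        turns++
        return session.generate(prompt)
    }

    override fun close() = session.close()
}
```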
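And a sketch of the final architecture, assuming a hypothetical `PuzzleFactory`: the curated asset always supplies the word, the model only ever supplies hints, and deterministic hints keep the game playable on any failure.

```kotlin
data class Puzzle(val word: String, val hints: List<String>)

class PuzzleFactory(
    private val curatedWords: List<String>,                 // validated JSON asset, loaded at startup
    private val llmHints: (word: String) -> List<String>?,  // returns null on any failure
) {
    fun nextPuzzle(): Puzzle {
        val word = curatedWords.random()                    // word never comes from the LLM
        val hints = llmHints(word)?.takeIf { it.size == 3 }
            ?: fallbackHints(word)                          // playable even if the LLM never succeeds
        return Puzzle(word, hints)
    }

    // Deterministic hints derived from the word itself; no model involved.
    private fun fallbackHints(word: String): List<String> = listOf(
        "The word has ${word.length} letters.",
        "It starts with '${word.first()}'.",
        "It ends with '${word.last()}'.",
    )
}
```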
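An illustrative prompt assembly applying these tips; the wording is an assumption, not the paper's verbatim prompt. Note the full language name, the concrete negative example, and the language repeated in both prompts.

```kotlin
fun buildHintPrompt(word: String, language: String = "Brazilian Portuguese"): Pair<String, String> {
    val system = """
        You write hints for a word-guessing game.
        Respond ONLY in $language. Keep JSON keys in English.
        Bad (do not do this): {"pista": "..."}
        Good: {"hint": "..."}
    """.trimIndent()
    val user = """
        Write one short hint in $language for the word "$word".
        The hint text must be in $language.
    """.trimIndent()
    return system to user  // language stated in both, reducing drift
}
```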
Evidence
- "Word length violation rate: 30–50% before prompt optimization, reduced to 10–15% after adding CRITICAL markers and specific examples, and 0% after curating word generation. JSON code fence wrapping: occurred in most initial calls, reduced to 20–30% after adding ‘no code fences’ to the prompt, and contributed approximately 15–20% successful parsing with Qwen3 0.6B when using structural parsing (Strategy 5). Pixel 7 Pro with Gemma 4 E2B: approximately 30 tok/s decoding and 10-second initialization with the CPU backend. Qwen3 0.6B: approximately 35 tok/s and 5-second initialization. Generating a 50-token hint takes 1–2 seconds. Removing WorkManager (commit 6ce6435) deleted 767 lines of code and added 576, resulting in a net reduction of 191 lines and resolving all 7 bugs."
How to Apply
- "If your LLM handles both word selection and hint generation, separate the word selection into a validated JSON asset file and have the LLM generate only hints. This single change simultaneously resolves issues with word length violations, repeated words, nonexistent words, and incorrect language words. If you’re handling SLM JSON parsing with simple deserialization, replace it with a five-step pipeline: UTF-8 sanitization, code fence removal, direct parsing, regular expression extraction, and field inference based on value type, ignoring key names. The final step handles cases where the model translates JSON keys into another language. If you have batch generation logic that calls the LLM multiple times within the same session, add a rotation that creates a new session every 3–5 turns. When retrying, include specific failure reasons like ‘word has 7 letters but we asked for 5’ in the prompt to increase retry success rates."
Code Example
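The paper does not publish its parser, so the following Kotlin sketch reconstructs the five-step defensive pipeline from How to Apply, assuming org.json (bundled with Android). The `Hints` type and function names are illustrative, not the paper's code.

```kotlin
import org.json.JSONObject

data class Hints(val values: List<String>)

fun parseHints(raw: String): Hints? {
    // Step 1: UTF-8 sanitization. Strip replacement chars and control chars.
    var text = raw.replace("\uFFFD", "")
        .filter { it == '\n' || it == '\t' || it >= ' ' }

    // Step 2: code fence removal. SLMs often wrap JSON in ```json ... ```.
    text = text.replace(Regex("```[a-zA-Z]*"), "").trim()

    // Step 3: direct parsing of the cleaned string.
    toJson(text)?.let { return inferFields(it) }

    // Step 4: regex extraction. Salvage the first {...} block from chatter.
    Regex("""\{[\s\S]*\}""").find(text)?.value
        ?.let { candidate -> toJson(candidate)?.let { return inferFields(it) } }

    return null // caller retries or falls back to deterministic hints
}

private fun toJson(s: String): JSONObject? =
    try { JSONObject(s) } catch (e: Exception) { null }

// Step 5: field inference by value type, ignoring key names. Handles models
// that translate keys (e.g. "hint" -> "pista"): every string value is a hint.
private fun inferFields(obj: JSONObject): Hints? {
    val strings = obj.keys().asSequence()
        .mapNotNull { obj.opt(it) as? String }
        .toList()
    return if (strings.isNotEmpty()) Hints(strings) else null
}
```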
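A short companion sketch of contextual retry with failure feedback; `generate` and `validate` are assumed hooks into your model call and constraint checks.

```kotlin
// validate returns a human-readable failure reason, or null when acceptable.
suspend fun generateWithFeedback(
    basePrompt: String,
    generate: suspend (String) -> String,
    validate: (String) -> String?,
    maxAttempts: Int = 3,
): String? {
    var prompt = basePrompt
    repeat(maxAttempts) {
        val output = generate(prompt)
        val failure = validate(output) ?: return output
        // e.g. failure == "word has 7 letters but we asked for 5"
        prompt = "$basePrompt\nYour previous answer was rejected: $failure. Correct this."
    }
    return null
}
```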
Terminology
- SLM (Small Language Model): a small language model (here, sub-3B parameters) that runs entirely on-device, e.g. Gemma 4 E2B (2.6B parameters) or Qwen3 0.6B (600M parameters).
- KV cache: the model's memory of the previous conversation within a session; its saturation after 3–5 turns causes the quality degradation described above.
- Code fence: the triple-backtick markdown wrapper that SLMs often emit around JSON output.
Related Papers
Dynamic Context Evolution for Scalable Synthetic Data Generation
A framework that completely eliminates duplication and repetition in large-scale synthetic data generation with LLMs using three mechanisms (VTS + Semantic Memory + Adaptive Prompt).
90%+ fewer tokens per session by reading a pre-compiled wiki instead of exploring files cold. Built from Karpathy's workflow.
A workflow-sharing post on how pre-organizing a codebase in wiki format, instead of having Claude explore the codebase directly each time, can reduce token usage per Claude session by more than 90%.
I mass deleted 3 months of AI generated code last week. Here is what I learned.
A retrospective by a developer who deleted 3 months' worth of code after over-relying on AI code generation; access to the original post is blocked, so its actual content could not be verified.
This new technique saves 60% of my token expenses
You can reduce LLM response tokens by 60% by using a telegraphic style that only keeps nouns and verbs, excluding articles, conjunctions, and auxiliary verbs.
Taught Claude to talk like a caveman to use 75% less tokens.
This post details a prompt technique that drastically compresses Claude's response style, reducing token usage by 75%, which could be useful for developers interested in reducing API costs.
Original Abstract
On-device Small Language Models (SLMs) promise fully offline, private AI experiences for mobile users (no cloud dependency, no data leaving the device). But is this promise achievable in practice? This paper presents a longitudinal practitioner case study documenting the engineering challenges of integrating SLMs (Gemma 4 E2B, 2.6B parameters; Qwen3 0.6B, 600M parameters) into Palabrita, a production Android word-guessing game. Over a 5-day development sprint comprising 204 commits (~90 directly AI-related), the system underwent a radical transformation: from an ambitious design where the LLM generated complete structured puzzles (word, category, difficulty, and five hints as JSON) to a pragmatic architecture where curated word lists provide the words and the LLM generates only three short hints, with a deterministic fallback if it fails. We identify five categories of failures specific to on-device SLM integration: output format violations, constraint violations, context quality degradation, latency incompatibility, and model selection instability. For each failure category, we document the observed symptoms, root causes, and the prompt engineering and architectural strategies that effectively mitigated them, including multi-layer defensive parsing, contextual retry with failure feedback, session rotation, progressive prompt hardening, and systematic responsibility reduction. Our findings demonstrate that on-device SLMs are viable for production mobile applications, but only when the developer accepts a fundamental constraint: the most reliable on-device LLM feature is one where the LLM does the least. We distill our experience into eight actionable design heuristics for practitioners integrating SLMs into mobile apps.