Less Is More: Engineering Challenges of On-Device Small Language Model Integration in a Mobile Application
TL;DR Highlight
Five failure modes and eight practical solutions emerged from five days of running on-device SLMs (Gemma 4 E2B, Qwen3 0.6B) in a Wordle-style word-guessing game.
Who Should Read
Android developers aiming to embed LLM functionality in mobile apps with offline/privacy guarantees, particularly those deploying on-device inference to production without cloud APIs.
Core Mechanics
- SLMs aren’t scaled-down cloud models; they exhibit different failure patterns, such as wrapping JSON in markdown code fences, translating JSON keys into the output language, or emitting invalid UTF-8.
- Compliance with numerical constraints like word length is poor. Qwen3 0.6B initially violated the ‘give me a 5-letter word’ constraint 30–50% of the time; even after adding CRITICAL markers and specific examples, 10–15% of outputs still violated it, ultimately forcing word selection onto a curated fixed list.
- Output quality degrades rapidly after 3–5 generations within the same chat session. The cause is saturation of the KV cache (the model’s memory of the previous conversation): Qwen3 0.6B degraded after 3 turns and Gemma 4 E2B after 5–7, producing repetitive, meaningless responses. A session-rotation sketch follows this list.
- Success rates fall exponentially with the number of output fields the LLM must produce. With 85% success per field, a 7-field schema completes end-to-end only 0.85^7 ≈ 32% of the time, while a 2-field schema reaches 0.85^2 ≈ 72%.
- WorkManager should not be used for LLM puzzle generation; it was the source of 7 distinct bugs. WorkManager caches a SUCCEEDED status even across process restarts, making it fundamentally unsuitable for foreground tasks like LLM inference that the user is actively watching. Switching to Kotlin coroutines resolved all 7 bugs.
- After evaluating 9 models, only 2 shipped. Gemma 3 1B was eliminated after 8 hours and replaced by the half-sized Qwen3 0.6B. Each model has a unique prompt-tuning profile, so each supported model amounts to a separate AI integration.
- The final architecture uses a curated JSON file for word selection, the LLM generates only 3 hints, and a deterministic fallback covers LLM failures. The game remains fully playable even if the LLM never succeeds; a minimal sketch of this division of labor follows this list.
- Prompt engineering tips: use full language names ('Brazilian Portuguese') instead of ISO codes ('pt'), provide specific negative examples rather than abstract rules, and repeat the language specification in both the system and user prompts to reduce language drift (illustrated in the prompt sketch after this list).
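Below is a minimal Kotlin sketch of the session rotation described above. The `LlmSession` interface and `RotatingSession` wrapper are illustrative assumptions, stand-ins for whatever inference runtime you use; the point is recreating the session before the KV cache saturates.

```kotlin
// Hypothetical session API; substitute your runtime's actual session type.
interface LlmSession : AutoCloseable {
    fun generate(prompt: String): String
}

// Transparently recreates the session every [maxTurns] generations,
// before KV-cache saturation degrades output quality.
class RotatingSession(
    private val newSession: () -> LlmSession,
    private val maxTurns: Int = 3, // Qwen3 0.6B failed after ~3 turns; ~5 suits Gemma 4 E2B
) : AutoCloseable {
    private var session: LlmSession = newSession()
    private var turns = 0

    fun generate(prompt: String): String {
        if (turns >= maxTurns) {
            session.close()        // drop the saturated KV cache
            session = newSession() // fresh context window
            turns = 0
        }
        turns++
        return session.generate(prompt)
    }

    override fun close() = session.close()
}
```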
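And a sketch of the final architecture, assuming a hypothetical `PuzzleFactory`: the curated asset always supplies the word, the model only ever supplies hints, and deterministic hints keep the game playable on any failure.

```kotlin
data class Puzzle(val word: String, val hints: List<String>)

class PuzzleFactory(
    private val curatedWords: List<String>,                 // validated JSON asset, loaded at startup
    private val llmHints: (word: String) -> List<String>?,  // returns null on any failure
) {
    fun nextPuzzle(): Puzzle {
        val word = curatedWords.random()                    // word never comes from the LLM
        val hints = llmHints(word)?.takeIf { it.size == 3 }
            ?: fallbackHints(word)                          // playable even if the LLM never succeeds
        return Puzzle(word, hints)
    }

    // Deterministic hints derived from the word itself; no model involved.
    private fun fallbackHints(word: String): List<String> = listOf(
        "The word has ${word.length} letters.",
        "It starts with '${word.first()}'.",
        "It ends with '${word.last()}'.",
    )
}
```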
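An illustrative prompt assembly applying these tips; the wording is an assumption, not the paper's verbatim prompt. Note the full language name, the concrete negative example, and the language repeated in both prompts.

```kotlin
fun buildHintPrompt(word: String, language: String = "Brazilian Portuguese"): Pair<String, String> {
    val system = """
        You write hints for a word-guessing game.
        Respond ONLY in $language. Keep JSON keys in English.
        Bad (do not do this): {"pista": "..."}
        Good: {"hint": "..."}
    """.trimIndent()
    val user = """
        Write one short hint in $language for the word "$word".
        The hint text must be in $language.
    """.trimIndent()
    return system to user  // language stated in both, reducing drift
}
```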
Evidence
- "Word length violation rate: 30–50% before prompt optimization, reduced to 10–15% after adding CRITICAL markers and specific examples, and 0% after curating word generation. JSON code fence wrapping: occurred in most initial calls, reduced to 20–30% after adding ‘no code fences’ to the prompt, and contributed approximately 15–20% successful parsing with Qwen3 0.6B when using structural parsing (Strategy 5). Pixel 7 Pro with Gemma 4 E2B: approximately 30 tok/s decoding and 10-second initialization with the CPU backend. Qwen3 0.6B: approximately 35 tok/s and 5-second initialization. Generating a 50-token hint takes 1–2 seconds. Removing WorkManager (commit 6ce6435) deleted 767 lines of code and added 576, resulting in a net reduction of 191 lines and resolving all 7 bugs."
How to Apply
- "If your LLM handles both word selection and hint generation, separate the word selection into a validated JSON asset file and have the LLM generate only hints. This single change simultaneously resolves issues with word length violations, repeated words, nonexistent words, and incorrect language words. If you’re handling SLM JSON parsing with simple deserialization, replace it with a five-step pipeline: UTF-8 sanitization, code fence removal, direct parsing, regular expression extraction, and field inference based on value type, ignoring key names. The final step handles cases where the model translates JSON keys into another language. If you have batch generation logic that calls the LLM multiple times within the same session, add a rotation that creates a new session every 3–5 turns. When retrying, include specific failure reasons like ‘word has 7 letters but we asked for 5’ in the prompt to increase retry success rates."
Code Example
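The paper does not publish its parser, so the following Kotlin sketch reconstructs the five-step defensive pipeline from How to Apply, assuming org.json (bundled with Android). The `Hints` type and function names are illustrative, not the paper's code.

```kotlin
import org.json.JSONObject

data class Hints(val values: List<String>)

fun parseHints(raw: String): Hints? {
    // Step 1: UTF-8 sanitization. Strip replacement chars and control chars.
    var text = raw.replace("\uFFFD", "")
        .filter { it == '\n' || it == '\t' || it >= ' ' }

    // Step 2: code fence removal. SLMs often wrap JSON in ```json ... ```.
    text = text.replace(Regex("```[a-zA-Z]*"), "").trim()

    // Step 3: direct parsing of the cleaned string.
    toJson(text)?.let { return inferFields(it) }

    // Step 4: regex extraction. Salvage the first {...} block from chatter.
    Regex("""\{[\s\S]*\}""").find(text)?.value
        ?.let { candidate -> toJson(candidate)?.let { return inferFields(it) } }

    return null // caller retries or falls back to deterministic hints
}

private fun toJson(s: String): JSONObject? =
    try { JSONObject(s) } catch (e: Exception) { null }

// Step 5: field inference by value type, ignoring key names. Handles models
// that translate keys (e.g. "hint" -> "pista"): every string value is a hint.
private fun inferFields(obj: JSONObject): Hints? {
    val strings = obj.keys().asSequence()
        .mapNotNull { obj.opt(it) as? String }
        .toList()
    return if (strings.isNotEmpty()) Hints(strings) else null
}
```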
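A short companion sketch of contextual retry with failure feedback; `generate` and `validate` are assumed hooks into your model call and constraint checks.

```kotlin
// validate returns a human-readable failure reason, or null when acceptable.
suspend fun generateWithFeedback(
    basePrompt: String,
    generate: suspend (String) -> String,
    validate: (String) -> String?,
    maxAttempts: Int = 3,
): String? {
    var prompt = basePrompt
    repeat(maxAttempts) {
        val output = generate(prompt)
        val failure = validate(output) ?: return output
        // e.g. failure == "word has 7 letters but we asked for 5"
        prompt = "$basePrompt\nYour previous answer was rejected: $failure. Correct this."
    }
    return null
}
```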
Terminology
- SLM (Small Language Model): a small language model (here, sub-3B parameters) that runs entirely on-device, e.g. Gemma 4 E2B (2.6B parameters) or Qwen3 0.6B (600M parameters).
- KV cache: the model's memory of the previous conversation within a session; its saturation after 3–5 turns causes the quality degradation described above.
- Code fence: the triple-backtick markdown wrapper that SLMs often emit around JSON output.
Related Papers
Dynamic Context Evolution for Scalable Synthetic Data Generation
A framework that completely eliminates duplication and repetition in large-scale synthetic data generation with LLMs using three mechanisms (VTS + Semantic Memory + Adaptive Prompt).
90%+ fewer tokens per session by reading a pre-compiled wiki instead of exploring files cold. Built from Karpathy's workflow.
A workflow-sharing post on how pre-organizing a codebase in wiki format, instead of having Claude explore the codebase directly each time, can reduce token usage per Claude session by more than 90%.
I mass deleted 3 months of AI generated code last week. Here is what I learned.
A retrospective by a developer who deleted 3 months' worth of code after over-relying on AI code generation; access to the original post is blocked, so its actual content could not be verified.
This new technique saves 60% of my token expenses
You can reduce LLM response tokens by 60% by using a telegraphic style that only keeps nouns and verbs, excluding articles, conjunctions, and auxiliary verbs.
Taught Claude to talk like a caveman to use 75% less tokens.
This post details a prompt technique that drastically compresses Claude's response style, reducing token usage by 75%, which could be useful for developers interested in reducing API costs.
Original Abstract
On-device Small Language Models (SLMs) promise fully offline, private AI experiences for mobile users (no cloud dependency, no data leaving the device). But is this promise achievable in practice? This paper presents a longitudinal practitioner case study documenting the engineering challenges of integrating SLMs (Gemma 4 E2B, 2.6B parameters; Qwen3 0.6B, 600M parameters) into Palabrita, a production Android word-guessing game. Over a 5-day development sprint comprising 204 commits (~90 directly AI-related), the system underwent a radical transformation: from an ambitious design where the LLM generated complete structured puzzles (word, category, difficulty, and five hints as JSON) to a pragmatic architecture where curated word lists provide the words and the LLM generates only three short hints, with a deterministic fallback if it fails. We identify five categories of failures specific to on-device SLM integration: output format violations, constraint violations, context quality degradation, latency incompatibility, and model selection instability. For each failure category, we document the observed symptoms, root causes, and the prompt engineering and architectural strategies that effectively mitigated them, including multi-layer defensive parsing, contextual retry with failure feedback, session rotation, progressive prompt hardening, and systematic responsibility reduction. Our findings demonstrate that on-device SLMs are viable for production mobile applications, but only when the developer accepts a fundamental constraint: the most reliable on-device LLM feature is one where the LLM does the least. We distill our experience into eight actionable design heuristics for practitioners integrating SLMs into mobile apps.