Large Language Model Selection for Test-Driven Prompting in Android/iOS Development
TL;DR Highlight
Extends Python-centric LLM code generation research to Android (Java) and iOS (Swift), and distills a decision tree for choosing the right model at the right time.
Who Should Read
Android/iOS developers or teams looking to adopt AI code generation in mobile app development. Especially useful when you need criteria for choosing GPT-4o vs open-source models.
Core Mechanics
- 8,704 evaluations on HumanEval & MBPP — directly compared GPT-4o, GPT-4o-mini, Qwen 14B, Qwen 32B across Android (Java) and iOS (Swift)
- TDP (Test-Driven Prompting, a technique that embeds test cases in the prompt to guide correct answers) improved accuracy by an average of +2.22 pp over baseline prompting
- Mobile platform accuracy ranged 66.85%–88.87%, consistently lower than Python code generation (86.90%–91.30%) regardless of model size
- A decision tree for selecting models based on first-attempt accuracy, budget constraints, and self-hosting preference
- Remediation Accuracy (the rate at which incorrect code is fixed on retry) was also measured, providing evaluation closer to real-world workflows
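The paper's decision tree branches on first-attempt accuracy, budget, and self-hosting. A minimal sketch of that selection logic, assuming illustrative thresholds and branch ordering (the exact cut-offs are not reproduced here and the `select_model` helper is hypothetical):

```python
def select_model(accuracy_needed: float,
                 budget_constrained: bool,
                 must_self_host: bool) -> str:
    """Illustrative model-selection decision tree over the paper's three
    axes: first-attempt accuracy, budget, self-hosting requirement.
    Thresholds here are placeholders, not the paper's measured values."""
    if must_self_host:
        # Self-hosting rules out the GPT-4o family; pick Qwen size by need.
        return "Qwen 32B" if accuracy_needed > 0.75 else "Qwen 14B"
    if budget_constrained:
        return "GPT-4o-mini"
    return "GPT-4o"

print(select_model(0.9, budget_constrained=False, must_self_host=True))
```

The point of the sketch is that the three inputs are cheap to determine up front, so the choice can be codified rather than re-debated per project.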
Evidence
- TDP vs baseline prompting: average +2.22 pp (95% CI [1.22–3.23 pp], p < 0.001, Cohen's d = 0.3974)
- Mobile's best accuracy (88.87%) barely exceeds Python's worst (86.90%)
- 544 programming tasks × 4 models × 2 platforms × 2 strategies = 8,704 evaluations
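The headline evaluation count follows directly from the experimental grid:

```python
tasks, models, platforms, strategies = 544, 4, 2, 2
total = tasks * models * platforms * strategies
print(total)  # -> 8704, matching the study's reported evaluation count
```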
How to Apply
- Include expected input/output test cases alongside the function signature in your code generation prompt (TDP); expect roughly a 2 pp accuracy gain.
- If cost matters, consider self-hosting Qwen 32B instead of GPT-4o; attach a Remediation (retry) loop for tasks with low first-attempt accuracy.
- Since mobile code generation quality is lower than Python, build a pipeline that auto-runs unit tests in CI to always validate LLM output.
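The remediation loop from the second bullet can be sketched as: generate, run the embedded test cases, and retry with failure feedback. This is a minimal illustration assuming a pluggable `generate` callable standing in for the LLM call, and using Python functions as the target (the paper measures Java and Swift, which would need a compile-and-run step instead of `exec`):

```python
from typing import Callable, Optional

def remediation_loop(
    generate: Callable[[str], str],          # hypothetical LLM call: prompt -> code
    prompt: str,
    test_cases: list,                        # (args_tuple, expected) pairs
    func_name: str,
    max_retries: int = 2,
) -> tuple[Optional[str], int]:
    """Generate code, check it against the TDP test cases, and retry with
    failure feedback appended to the prompt (remediation accuracy)."""
    feedback = ""
    for attempt in range(max_retries + 1):
        code = generate(prompt + feedback)
        namespace: dict = {}
        try:
            exec(code, namespace)            # load the candidate function
            fn = namespace[func_name]
            failures = [(args, expected, fn(*args))
                        for args, expected in test_cases
                        if fn(*args) != expected]
        except Exception as e:
            failures = [((), f"raised {e!r}", None)]
        if not failures:
            return code, attempt             # attempt == 0 means first-try success
        # Remediation: feed the failing cases back into the next prompt.
        feedback = ("\nYour previous attempt failed these cases:\n"
                    + "\n".join(f"- args={a} expected={e} got={g}"
                                for a, e, g in failures))
    return None, max_retries
```

In a CI setting the same loop would shell out to the platform's unit-test runner instead of `exec`, which is exactly the "always validate LLM output" pipeline the last bullet recommends.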
Code Example
# Test-Driven Prompting Example (Swift function generation)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = """
Write a Swift function that reverses a string.
Requirements:
- Function signature: func reverseString(_ s: String) -> String
Test cases (your output must pass all of these):
- reverseString("hello") == "olleh"
- reverseString("") == ""
- reverseString("a") == "a"
- reverseString("Swift") == "tfiwS"
Return only the function implementation.
"""
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
swift_code = response.choices[0].message.content
Original Abstract
Large language model (LLM) code generation research predominantly focuses on Python, with test-driven prompt engineering exclusively targeting this language. This study presents a comprehensive LLM selection framework for mobile development through rigorous empirical analysis. We conducted 8,704 evaluations across 544 programming tasks (HumanEval and MBPP datasets) on Android (Java) and iOS (Swift) platforms using four state-of-the-art LLMs (GPT-4o, GPT-4o-mini, Qwen 14B, and Qwen 32B), two prompting strategies (base and test-driven), and two metrics (accuracy and remediation accuracy). Systematic analysis of platform-specific patterns yielded a decision tree incorporating first-attempt correctness, budget constraints, and self-hosting requirements, validated through three industry-relevant use cases. Results show test-driven prompting (TDP) achieves a +2.22 pp average accuracy improvement over baseline (95% CI [1.22–3.23 pp], p < 0.001, d = 0.3974). However, LLMs consistently underperform in mobile development (66.85%–88.87%) compared to Python-based code generation (86.90%–91.30%) regardless of model size or type. This framework establishes groundwork for platform-specific optimizations while providing practitioners with actionable guidance for model selection in mobile development contexts.