Large Language Model Selection for Test-Driven Prompting in Android/iOS Development
TL;DR Highlight
Extends Python-centric LLM code generation research to Android (Java) and iOS (Swift), and distills a decision tree for choosing the right model at the right time.
Who Should Read
Android/iOS developers or teams looking to adopt AI code generation in mobile app development. Especially useful when you need criteria for choosing GPT-4o vs open-source models.
Core Mechanics
- 8,704 evaluations on HumanEval & MBPP — directly compared GPT-4o, GPT-4o-mini, Qwen 14B, Qwen 32B across Android (Java) and iOS (Swift)
- TDP (Test-Driven Prompting, a technique that embeds test cases in the prompt to guide correct answers) improved accuracy by an average of +2.22 pp over baseline prompting
- Mobile platform accuracy ranged 66.85%–88.87%, consistently lower than Python code generation (86.90%–91.30%) regardless of model size
- A decision tree for selecting models based on first-attempt accuracy, budget constraints, and self-hosting preference
- Remediation Accuracy (the rate at which incorrect code is fixed on retry) was also measured, providing evaluation closer to real-world workflows
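The paper's decision tree branches on first-attempt accuracy, budget, and self-hosting. A minimal sketch of that selection logic, assuming illustrative thresholds and branch ordering (the exact cut-offs are not reproduced here and the `select_model` helper is hypothetical):

```python
def select_model(accuracy_needed: float,
                 budget_constrained: bool,
                 must_self_host: bool) -> str:
    """Illustrative model-selection decision tree over the paper's three
    axes: first-attempt accuracy, budget, self-hosting requirement.
    Thresholds here are placeholders, not the paper's measured values."""
    if must_self_host:
        # Self-hosting rules out the GPT-4o family; pick Qwen size by need.
        return "Qwen 32B" if accuracy_needed > 0.75 else "Qwen 14B"
    if budget_constrained:
        return "GPT-4o-mini"
    return "GPT-4o"

print(select_model(0.9, budget_constrained=False, must_self_host=True))
```

The point of the sketch is that the three inputs are cheap to determine up front, so the choice can be codified rather than re-debated per project.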
Evidence
- TDP vs baseline prompting: average +2.22 pp (95% CI [1.22–3.23 pp], p < 0.001, Cohen's d = 0.3974)
- Mobile's best accuracy (88.87%) barely exceeds Python's worst (86.90%)
- 544 programming tasks × 4 models × 2 platforms × 2 strategies = 8,704 evaluations
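The headline evaluation count follows directly from the experimental grid:

```python
tasks, models, platforms, strategies = 544, 4, 2, 2
total = tasks * models * platforms * strategies
print(total)  # -> 8704, matching the study's reported evaluation count
```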
How to Apply
- Include expected input/output test cases alongside the function signature in your code generation prompt (TDP); expect roughly a 2 pp accuracy gain.
- If cost matters, consider self-hosting Qwen 32B instead of GPT-4o; attach a Remediation (retry) loop for tasks with low first-attempt accuracy.
- Since mobile code generation quality is lower than Python, build a pipeline that auto-runs unit tests in CI to always validate LLM output.
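The remediation loop from the second bullet can be sketched as: generate, run the embedded test cases, and retry with failure feedback. This is a minimal illustration assuming a pluggable `generate` callable standing in for the LLM call, and using Python functions as the target (the paper measures Java and Swift, which would need a compile-and-run step instead of `exec`):

```python
from typing import Callable, Optional

def remediation_loop(
    generate: Callable[[str], str],          # hypothetical LLM call: prompt -> code
    prompt: str,
    test_cases: list,                        # (args_tuple, expected) pairs
    func_name: str,
    max_retries: int = 2,
) -> tuple[Optional[str], int]:
    """Generate code, check it against the TDP test cases, and retry with
    failure feedback appended to the prompt (remediation accuracy)."""
    feedback = ""
    for attempt in range(max_retries + 1):
        code = generate(prompt + feedback)
        namespace: dict = {}
        try:
            exec(code, namespace)            # load the candidate function
            fn = namespace[func_name]
            failures = [(args, expected, fn(*args))
                        for args, expected in test_cases
                        if fn(*args) != expected]
        except Exception as e:
            failures = [((), f"raised {e!r}", None)]
        if not failures:
            return code, attempt             # attempt == 0 means first-try success
        # Remediation: feed the failing cases back into the next prompt.
        feedback = ("\nYour previous attempt failed these cases:\n"
                    + "\n".join(f"- args={a} expected={e} got={g}"
                                for a, e, g in failures))
    return None, max_retries
```

In a CI setting the same loop would shell out to the platform's unit-test runner instead of `exec`, which is exactly the "always validate LLM output" pipeline the last bullet recommends.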
Code Example
# Test-Driven Prompting Example (Swift function generation)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = """
Write a Swift function that reverses a string.
Requirements:
- Function signature: func reverseString(_ s: String) -> String
Test cases (your output must pass all of these):
- reverseString("hello") == "olleh"
- reverseString("") == ""
- reverseString("a") == "a"
- reverseString("Swift") == "tfiwS"
Return only the function implementation.
"""
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
swift_code = response.choices[0].message.content
Original Abstract
Large language model (LLM) code generation research predominantly focuses on Python, with test-driven prompt engineering exclusively targeting this language. This study presents a comprehensive LLM selection framework for mobile development through rigorous empirical analysis. We conducted 8,704 evaluations across 544 programming tasks (HumanEval and MBPP datasets) on Android (Java) and iOS (Swift) platforms using four state-of-the-art LLMs (GPT-4o, GPT-4o-mini, Qwen 14B, and Qwen 32B), two prompting strategies (base and test-driven), and two metrics (accuracy and remediation accuracy). Systematic analysis of platform-specific patterns yielded a decision tree incorporating first-attempt correctness, budget constraints, and self-hosting requirements, validated through three industry-relevant use cases. Results show test-driven prompting (TDP) achieves a +2.22 pp average accuracy improvement over baseline (95% CI [1.22–3.23 pp], p < 0.001, d = 0.3974). However, LLMs consistently underperform in mobile development (66.85%–88.87%) compared to Python-based code generation (86.90%–91.30%) regardless of model size or type. This framework establishes groundwork for platform-specific optimizations while providing practitioners with actionable guidance for model selection in mobile development contexts.