MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices?
TL;DR Highlight
Benchmarking whether LLMs can auto-generate C++ kernels for the MNN mobile inference engine; the MoKA multi-agent system lifts the compilation success rate to 93.7%.
Who Should Read
Mobile ML engineers handling on-device AI inference optimization, or devs building LLM-based code automation pipelines.
Core Mechanics
- Even the latest LLMs such as GPT-5 and Claude-Sonnet-4.5 exhibit compilation failure rates above 54% on mobile kernel generation — primarily due to a lack of MNN framework-specific knowledge
- Both LoRA fine-tuning and GRPO (reinforcement learning) show minimal performance improvement — training data for mobile inference frameworks is simply too sparse for fine-tuning to overcome
- MoKA (multi-agent system) separates Coder + Debugger + Accelerator roles in a plan-and-execute loop for iterative improvement
- Debugger parses compilation errors with tree-sitter, understands repository structure, and fixes cross-file dependency errors
- Accelerator receives on-device profiling results and automatically suggests hardware optimizations like SIMD vectorization and cache blocking — achieves up to 6.82x speedup on LayerNorm2D
- MobileKernelBench with 190 tasks and 95 ONNX operators is released — PyTorch/ONNX pair format ensures cross-framework compatibility
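The Coder → Debugger → Accelerator split above can be sketched as a plan-and-execute loop. This is a minimal illustration, not the paper's implementation; every agent function here is a hypothetical stub standing in for an LLM call or a toolchain invocation:

```python
# Minimal sketch of a MoKA-style plan-and-execute loop.
# All agent functions are hypothetical stubs, not the paper's code.

def generate_kernel(task):
    """Coder: draft an MNN C++ kernel from the task spec (stub)."""
    return f"// kernel for {task}"

def compile_kernel(code):
    """Toolchain wrapper: return (success, error_log) (stub)."""
    return True, ""

def fix_kernel(code, error_log):
    """Debugger: repair the kernel given compiler errors (stub)."""
    return code

def profile_and_optimize(code):
    """Accelerator: profile on-device, apply one optimization (stub)."""
    return code, 1.2  # (new code, measured speedup vs native MNN)

def moka_loop(task, max_debug=5, max_optimize=10):
    code = generate_kernel(task)              # Coder drafts the kernel
    for _ in range(max_debug):                # Debugger loop: compile, fix
        ok, err = compile_kernel(code)
        if ok:
            break
        code = fix_kernel(code, err)
    best_code, best_speedup = code, 1.0
    for _ in range(max_optimize):             # Accelerator loop: one change per round
        code, speedup = profile_and_optimize(best_code)
        if speedup > best_speedup:            # keep only measured improvements
            best_code, best_speedup = code, speedup
    return best_code, best_speedup

code, speedup = moka_loop("LayerNorm2D")
```

The key design points the sketch preserves are the separation of roles and the fact that the Accelerator keeps a kernel only when on-device measurement confirms a speedup.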
Evidence
- MoKA compilation success rate (CSR) 93.7% — baseline Claude-Sonnet-4.5 single query 46.3%, +47.4pp improvement
- MoKA fast1.5 (proportion of kernels >1.5x faster than the native MNN baseline) 27.4% — vs 4.7% for a single query and 5.3% for pass@10, an overwhelming margin
- MoKA functional correctness rate (FCR) 75.3% — Claude pass@10 47.9% (+27.4pp), single query 34.2% (+41.1pp)
- LayerNorm2D case: 10 optimization iterations achieve max 6.82x speedup, average 2.82x
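For reference, the three headline metrics can be computed from per-task results as follows. This is a sketch with a made-up result schema, not the paper's evaluation code:

```python
# Sketch: computing CSR, FCR, and fast1.5 from per-task results.
# The result dicts use an assumed schema for illustration only.

results = [
    {"compiled": True,  "correct": True,  "speedup": 1.8},
    {"compiled": True,  "correct": True,  "speedup": 1.1},
    {"compiled": True,  "correct": False, "speedup": 0.0},
    {"compiled": False, "correct": False, "speedup": 0.0},
]

n = len(results)
csr = sum(r["compiled"] for r in results) / n   # compilation success rate
fcr = sum(r["correct"] for r in results) / n    # functional correctness rate
# fast1.5: correct kernels that beat the native baseline by >1.5x
fast15 = sum(r["correct"] and r["speedup"] > 1.5 for r in results) / n

print(csr, fcr, fast15)  # 0.75 0.5 0.25
```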
How to Apply
- When developing kernels for mobile inference engines (MNN, NCNN, etc.), compose a multi-agent loop with Coder → Debugger → Accelerator roles in order; this produces far higher-quality code than simple repeated prompting.
- For automating compilation-error debugging, the context-injection method of parsing error locations with tree-sitter and providing the repository tree in the prompt significantly reduces the LLM's hallucination of framework APIs.
- When building a performance optimization agent, limiting prompts to "find only one bottleneck and suggest only one optimization per iteration" reduces the search space and makes history-based self-reflection more effective.
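A lightweight version of the context-injection step can be sketched without tree-sitter, using a regex over GCC/Clang-style diagnostics plus a shallow repository-tree dump. This is illustrative only; the paper's Debugger parses code structure with tree-sitter, and the error log below is a made-up sample:

```python
import os
import re

# Match "path/file.cpp:line:col: error: message" (GCC/Clang style).
ERROR_RE = re.compile(
    r"^(?P<file>[^\s:]+):(?P<line>\d+):(?P<col>\d+): error: (?P<msg>.+)$",
    re.MULTILINE,
)

def parse_errors(compile_log):
    """Extract (file, line, message) triples from a compiler log."""
    return [(m["file"], int(m["line"]), m["msg"])
            for m in ERROR_RE.finditer(compile_log)]

def repo_tree(root, max_depth=2):
    """Render a shallow directory tree to inject into the prompt."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath[len(root):].count(os.sep)
        if depth >= max_depth:
            dirnames[:] = []  # stop descending past max_depth
            continue
        indent = "  " * depth
        lines.append(f"{indent}{os.path.basename(dirpath) or root}/")
        lines.extend(f"{indent}  {f}" for f in sorted(filenames))
    return "\n".join(lines)

# Hypothetical compiler output for a kernel that references a missing member.
log = "source/backend/cpu/CPULayerNorm.cpp:42:10: error: use of undeclared identifier 'mEps'"
print(parse_errors(log))
```

The parsed (file, line, message) triples and the tree string are what get injected into the Debugger prompt so the model reasons about real file locations instead of inventing them.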
Code Example
# MoKA Accelerator prompt example (based on paper Appendix B.3)
accelerator_prompt = """
You are an expert in model deployment, proficient in PyTorch and C++ programming,
and familiar with the coding style of the MNN framework.
Your task is to analyse the performance bottlenecks of the following MNN operator code
and propose optimisation methods to accelerate it.
Then identify **exactly one** highest-impact speed bottleneck,
propose **exactly one** optimisation method and propose a modification plan.
Operator information: {op_info}
Current implementation: {code_book}
Current performance: {performance}
History optimisation info: {history_optmz_info}
Requirements:
- Return **one and only one** optimisation method -- the largest expected speedup.
- Keep fields brief; avoid lists of alternatives, disclaimers, or generic advice.
- Avoid repeating optimisations that have already been attempted.
Output format (JSON):
{{
"bottleneck": "<max 100 words>",
"optimisation_method": "<max 100 words>",
"modification_plan": "<max 100 words>"
}}
"""
# Debugger compilation error prompt example
debugger_prompt = """
Operator information: {op_info}
Current implementation: {code_book}
Compilation errors: {compile_error}
Analyze the errors and provide suggestions.
Note:
- Only provide semantic suggestions (in text).
- For cross-file errors, refer to the relevant code snippets and adjust only the current code.
- Do NOT suggest modifications to other MNN framework files.
Output format (JSON):
{{
"local_error_suggestion": [],
"crossfile_error_suggestion": []
}}
"""Terminology
Related Resources
Original Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in code generation, yet their potential for generating kernels specifically for mobile devices remains largely unexplored. In this work, we extend the scope of automated kernel generation to the mobile domain to investigate the central question: Can LLMs write efficient kernels for mobile devices? To enable systematic investigation, we introduce MobileKernelBench, a comprehensive evaluation framework comprising a benchmark prioritizing operator diversity and cross-framework interoperability, coupled with an automated pipeline that bridges the host-device gap for on-device verification. Leveraging this framework, we conduct extensive evaluation on the CPU backend of Mobile Neural Network (MNN), revealing that current LLMs struggle with the engineering complexity and data scarcity inherent to mobile frameworks; standard models and even fine-tuned variants exhibit high compilation failure rates (over 54%) and negligible performance gains due to hallucinations and a lack of domain-specific grounding. To overcome these limitations, we propose the Mobile Kernel Agent (MoKA), a multi-agent system equipped with repository-aware reasoning and a plan-and-execute paradigm. Validated on MobileKernelBench, MoKA achieves state-of-the-art performance, boosting compilation success to 93.7% and enabling 27.4% of generated kernels to deliver measurable speedups over native libraries.