MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices?
TL;DR Highlight
Benchmarking whether LLMs can auto-generate C++ kernels for the MNN mobile inference engine; the MoKA multi-agent system lifts the compilation success rate to 93.7%.
Who Should Read
Mobile ML engineers handling on-device AI inference optimization, or devs building LLM-based code automation pipelines.
Core Mechanics
- Even the latest LLMs such as GPT-5 and Claude-Sonnet-4.5 exhibit compilation failure rates above 54% on mobile kernel generation — primarily due to a lack of MNN framework-specific knowledge
- Both LoRA fine-tuning and GRPO (reinforcement learning) show minimal performance improvement — training data for mobile inference frameworks is simply too sparse for fine-tuning to overcome
- MoKA (multi-agent system) separates Coder + Debugger + Accelerator roles in a plan-and-execute loop for iterative improvement
- Debugger parses compilation errors with tree-sitter, understands repository structure, and fixes cross-file dependency errors
- Accelerator receives on-device profiling results and automatically suggests hardware optimizations like SIMD vectorization and cache blocking — achieves up to 6.82x speedup on LayerNorm2D
- MobileKernelBench with 190 tasks and 95 ONNX operators is released — PyTorch/ONNX pair format ensures cross-framework compatibility
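The Coder → Debugger → Accelerator split above can be sketched as a plan-and-execute loop. This is a minimal illustration, not the paper's implementation; every agent function here is a hypothetical stub standing in for an LLM call or a toolchain invocation:

```python
# Minimal sketch of a MoKA-style plan-and-execute loop.
# All agent functions are hypothetical stubs, not the paper's code.

def generate_kernel(task):
    """Coder: draft an MNN C++ kernel from the task spec (stub)."""
    return f"// kernel for {task}"

def compile_kernel(code):
    """Toolchain wrapper: return (success, error_log) (stub)."""
    return True, ""

def fix_kernel(code, error_log):
    """Debugger: repair the kernel given compiler errors (stub)."""
    return code

def profile_and_optimize(code):
    """Accelerator: profile on-device, apply one optimization (stub)."""
    return code, 1.2  # (new code, measured speedup vs native MNN)

def moka_loop(task, max_debug=5, max_optimize=10):
    code = generate_kernel(task)              # Coder drafts the kernel
    for _ in range(max_debug):                # Debugger loop: compile, fix
        ok, err = compile_kernel(code)
        if ok:
            break
        code = fix_kernel(code, err)
    best_code, best_speedup = code, 1.0
    for _ in range(max_optimize):             # Accelerator loop: one change per round
        code, speedup = profile_and_optimize(best_code)
        if speedup > best_speedup:            # keep only measured improvements
            best_code, best_speedup = code, speedup
    return best_code, best_speedup

code, speedup = moka_loop("LayerNorm2D")
```

The key design points the sketch preserves are the separation of roles and the fact that the Accelerator keeps a kernel only when on-device measurement confirms a speedup.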
Evidence
- MoKA compilation success rate (CSR) 93.7% — baseline Claude-Sonnet-4.5 single query 46.3%, +47.4pp improvement
- MoKA fast1.5 (proportion of kernels >1.5x faster than the native MNN baseline) 27.4% — vs 4.7% for a single query and 5.3% for pass@10, an overwhelming margin
- MoKA functional correctness rate (FCR) 75.3% — Claude pass@10 47.9% (+27.4pp), single query 34.2% (+41.1pp)
- LayerNorm2D case: 10 optimization iterations achieve max 6.82x speedup, average 2.82x
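For reference, the three headline metrics can be computed from per-task results as follows. This is a sketch with a made-up result schema, not the paper's evaluation code:

```python
# Sketch: computing CSR, FCR, and fast1.5 from per-task results.
# The result dicts use an assumed schema for illustration only.

results = [
    {"compiled": True,  "correct": True,  "speedup": 1.8},
    {"compiled": True,  "correct": True,  "speedup": 1.1},
    {"compiled": True,  "correct": False, "speedup": 0.0},
    {"compiled": False, "correct": False, "speedup": 0.0},
]

n = len(results)
csr = sum(r["compiled"] for r in results) / n   # compilation success rate
fcr = sum(r["correct"] for r in results) / n    # functional correctness rate
# fast1.5: correct kernels that beat the native baseline by >1.5x
fast15 = sum(r["correct"] and r["speedup"] > 1.5 for r in results) / n

print(csr, fcr, fast15)  # 0.75 0.5 0.25
```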
How to Apply
- When developing kernels for mobile inference engines (MNN, NCNN, etc.), compose a multi-agent loop with Coder → Debugger → Accelerator roles in order; this produces far higher-quality code than simple repeated prompting.
- For automating compilation-error debugging, the context-injection method of parsing error locations with tree-sitter and providing the repository tree in the prompt significantly reduces the LLM's hallucination of framework APIs.
- When building a performance optimization agent, limiting prompts to "find only one bottleneck and suggest only one optimization per iteration" reduces the search space and makes history-based self-reflection more effective.
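A lightweight version of the context-injection step can be sketched without tree-sitter, using a regex over GCC/Clang-style diagnostics plus a shallow repository-tree dump. This is illustrative only; the paper's Debugger parses code structure with tree-sitter, and the error log below is a made-up sample:

```python
import os
import re

# Match "path/file.cpp:line:col: error: message" (GCC/Clang style).
ERROR_RE = re.compile(
    r"^(?P<file>[^\s:]+):(?P<line>\d+):(?P<col>\d+): error: (?P<msg>.+)$",
    re.MULTILINE,
)

def parse_errors(compile_log):
    """Extract (file, line, message) triples from a compiler log."""
    return [(m["file"], int(m["line"]), m["msg"])
            for m in ERROR_RE.finditer(compile_log)]

def repo_tree(root, max_depth=2):
    """Render a shallow directory tree to inject into the prompt."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath[len(root):].count(os.sep)
        if depth >= max_depth:
            dirnames[:] = []  # stop descending past max_depth
            continue
        indent = "  " * depth
        lines.append(f"{indent}{os.path.basename(dirpath) or root}/")
        lines.extend(f"{indent}  {f}" for f in sorted(filenames))
    return "\n".join(lines)

# Hypothetical compiler output for a kernel that references a missing member.
log = "source/backend/cpu/CPULayerNorm.cpp:42:10: error: use of undeclared identifier 'mEps'"
print(parse_errors(log))
```

The parsed (file, line, message) triples and the tree string are what get injected into the Debugger prompt so the model reasons about real file locations instead of inventing them.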
Code Example
# MoKA Accelerator prompt example (based on paper Appendix B.3)
accelerator_prompt = """
You are an expert in model deployment, proficient in PyTorch and C++ programming,
and familiar with the coding style of the MNN framework.
Your task is to analyse the performance bottlenecks of the following MNN operator code
and propose optimisation methods to accelerate it.
Then identify **exactly one** highest-impact speed bottleneck,
propose **exactly one** optimisation method and propose a modification plan.
Operator information: {op_info}
Current implementation: {code_book}
Current performance: {performance}
History optimisation info: {history_optmz_info}
Requirements:
- Return **one and only one** optimisation method -- the largest expected speedup.
- Keep fields brief; avoid lists of alternatives, disclaimers, or generic advice.
- Avoid repeating optimisations that have already been attempted.
Output format (JSON):
{{
"bottleneck": "<max 100 words>",
"optimisation_method": "<max 100 words>",
"modification_plan": "<max 100 words>"
}}
"""
# Debugger compilation error prompt example
debugger_prompt = """
Operator information: {op_info}
Current implementation: {code_book}
Compilation errors: {compile_error}
Analyze the errors and provide suggestions.
Note:
- Only provide semantic suggestions (in text).
- For cross-file errors, refer to the relevant code snippets and adjust only the current code.
- Do NOT suggest modifications to other MNN framework files.
Output format (JSON):
{{
"local_error_suggestion": [],
"crossfile_error_suggestion": []
}}
"""Terminology
Related Resources
Original Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in code generation, yet their potential for generating kernels specifically for mobile devices remains largely unexplored. In this work, we extend the scope of automated kernel generation to the mobile domain to investigate the central question: Can LLMs write efficient kernels for mobile devices? To enable systematic investigation, we introduce MobileKernelBench, a comprehensive evaluation framework comprising a benchmark prioritizing operator diversity and cross-framework interoperability, coupled with an automated pipeline that bridges the host-device gap for on-device verification. Leveraging this framework, we conduct extensive evaluation on the CPU backend of Mobile Neural Network (MNN), revealing that current LLMs struggle with the engineering complexity and data scarcity inherent to mobile frameworks; standard models and even fine-tuned variants exhibit high compilation failure rates (over 54%) and negligible performance gains due to hallucinations and a lack of domain-specific grounding. To overcome these limitations, we propose the Mobile Kernel Agent (MoKA), a multi-agent system equipped with repository-aware reasoning and a plan-and-execute paradigm. Validated on MobileKernelBench, MoKA achieves state-of-the-art performance, boosting compilation success to 93.7% and enabling 27.4% of generated kernels to deliver measurable speedups over native libraries.