Resource-Efficient Iterative LLM-Based NAS with Feedback Memory
TL;DR Highlight
Automated neural architecture search in about 18 hours on a single RTX 4090, driven by a closed loop in which an LLM generates, trains, evaluates, and improves PyTorch architecture code directly.
Who Should Read
ML researchers and practitioners interested in Neural Architecture Search (NAS) without expensive hardware, and teams wanting to automate model design with LLM coding agents.
Core Mechanics
- Proposed a code-driven NAS framework where an LLM agent generates PyTorch model implementations, trains them, and iteratively improves based on results
- The agent operates in a closed loop: generate architecture code, train briefly, evaluate performance, analyze results, generate improved code
- Uses program synthesis rather than traditional NAS search spaces — the LLM can explore architectures beyond predefined primitives
- Discovered novel architectural patterns that outperform human-designed baselines on target tasks
- The entire search process runs on commodity hardware (single RTX 4090) in under 18 hours
- Generated architectures are fully reproducible PyTorch code, not black-box configurations
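The closed loop in the bullets above can be sketched as a short driver. This is a hedged sketch, not the paper's actual API: generate_code, evaluate, and improve_prompt are hypothetical hooks standing in for the Code Generator LLM, the brief training run, and the Prompt Improver.

```python
# Minimal sketch of the generate-train-evaluate-improve loop.
# The LLM calls and the training step are stubbed out as callables.

def run_search(iterations, generate_code, evaluate, improve_prompt):
    """Generate architecture code, score it with a short training run,
    and fold the analysis back into the next prompt."""
    best_code, best_acc = None, 0.0
    prompt = "Design a CNN for the target task."
    for _ in range(iterations):
        code = generate_code(prompt)      # LLM emits PyTorch model code
        acc = evaluate(code)              # brief training -> proxy accuracy
        if acc > best_acc:                # track the best candidate so far
            best_code, best_acc = code, acc
        prompt = improve_prompt(prompt, code, acc)  # analysis step
    return best_code, best_acc
```

In the real system the evaluate step is a one-epoch training run used purely as a fast ranking signal, so the loop can afford thousands of iterations on one GPU.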
Evidence
- A full 2000-iteration NAS run completes on a single RTX 4090 GPU in approximately 18 hours
- Discovered architectures match or exceed human-designed baselines; on CIFAR-10 one-epoch proxy accuracy, DeepSeek-Coder-6.7B improves from 28.2% to 69.2%, Qwen2.5-7B from 50.0% to 71.5%, and GLM-5 from 43.2% to 62.0%
- LLM agent successfully navigates the search space without human intervention
- Code-level program synthesis reaches architectures that predefined-primitive NAS search spaces cannot express
How to Apply
- Set up the LLM agent with access to a Python execution environment and PyTorch
- Define the target task and evaluation metrics, then let the agent run the search loop autonomously
- Start with a shorter search budget to get initial results, then extend for better architectures
- The generated PyTorch code can be directly deployed without any conversion steps
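Because execution failures are treated as first-class learning signals, the evaluation step can be sketched as a harness that compiles the generated code string and turns exceptions into structured diagnostics for the feedback memory. This is an illustrative sketch: evaluate_candidate and train_one_epoch are assumed names, not the paper's API.

```python
# Hedged sketch: run a generated architecture string and return either a
# proxy accuracy or a structured failure record for the feedback memory.

def evaluate_candidate(code_str, train_one_epoch):
    """exec() the generated code, train briefly, and report the outcome;
    failures become diagnostics rather than discarded trajectories."""
    namespace = {}
    try:
        exec(code_str, namespace)             # should define a Net class
        net_cls = namespace["Net"]
        acc = train_one_epoch(net_cls)        # one-epoch proxy accuracy
        return {"ok": True, "accuracy": acc}
    except Exception as exc:                  # OOM, bad shapes, syntax errors
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
```

Recording the error string (e.g., a CUDA out-of-memory message) lets the Prompt Improver spot recurring failure patterns across iterations instead of silently retrying.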
Code Example
# Example of the historical feedback memory structure passed to the Prompt Improver
import json

K = 5  # sliding-window size: keep only the last K improvement attempts

history = [
    {
        "problem": "Layers too deep, failed to converge within 1 epoch",
        "suggestion": "Reduce number of layers from 6→3 and add BatchNorm",
        "outcome": "accuracy improved from 0.42 → 0.55",
    },
    {
        "problem": "Channel size 512 too large, causing OOM error",
        "suggestion": "Reduced channel size to 256",
        "outcome": "error: CUDA out of memory",
    },
    # ... keep last K=5 entries
]
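The "keep last K=5 entries" comment can be realized as a simple sliding-window update; update_history is an illustrative helper, not a function from the paper.

```python
def update_history(history, entry, k=5):
    """Append a diagnostic triple and keep only the last k attempts,
    so the prompt context stays constant-size (the Markov-style window)."""
    history.append(entry)
    return history[-k:]
```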
best_acc = 0.55    # accuracy of the current best architecture (placeholder)
best_code = "..."  # PyTorch source of the current best Net class (placeholder)

prompt = f"""
You are a visionary deep learning architect.
Task: CIFAR-10 image classification (no pretrained weights)
Current best architecture (accuracy: {best_acc:.3f}):
{best_code}
Recent improvement history (last {K} attempts):
{json.dumps(history, ensure_ascii=False, indent=2)}
Based on the history, identify recurring failure patterns and suggest
a concrete improvement. Then generate a complete PyTorch Net(nn.Module) class.
"""
Terminology
Related Resources
Original Abstract
Neural Architecture Search (NAS) automates network design, but conventional methods demand substantial computational resources. We propose a closed-loop pipeline leveraging large language models (LLMs) to iteratively generate, evaluate, and refine convolutional neural network architectures for image classification on a single consumer-grade GPU without LLM fine-tuning. Central to our approach is a historical feedback memory inspired by Markov chains: a sliding window of $K{=}5$ recent improvement attempts keeps context size constant while providing sufficient signal for iterative learning. Unlike prior LLM optimizers that discard failure trajectories, each history entry is a structured diagnostic triple -- recording the identified problem, suggested modification, and resulting outcome -- treating code execution failures as first-class learning signals. A dual-LLM specialization reduces per-call cognitive load: a Code Generator produces executable PyTorch architectures while a Prompt Improver handles diagnostic reasoning. Since both the LLM and architecture training share limited VRAM, the search implicitly favors compact, hardware-efficient models suited to edge deployment. We evaluate three frozen instruction-tuned LLMs (${\leq}7$B parameters) across up to 2000 iterations in an unconstrained open code space, using one-epoch proxy accuracy on CIFAR-10, CIFAR-100, and ImageNette as a fast ranking signal. On CIFAR-10, DeepSeek-Coder-6.7B improves from 28.2% to 69.2%, Qwen2.5-7B from 50.0% to 71.5%, and GLM-5 from 43.2% to 62.0%. A full 2000-iteration search completes in ${\approx}18$ GPU hours on a single RTX~4090, establishing a low-budget, reproducible, and hardware-aware paradigm for LLM-driven NAS without cloud infrastructure.