Resource-Efficient Iterative LLM-Based NAS with Feedback Memory
TL;DR Highlight
Automated neural architecture search in about 18 hours on a single RTX 4090, driven by a closed loop in which an LLM generates, trains, evaluates, and improves PyTorch architecture code directly.
Who Should Read
ML researchers and practitioners interested in Neural Architecture Search (NAS) without expensive hardware, and teams wanting to automate model design with LLM coding agents.
Core Mechanics
- Proposed a code-driven NAS framework where an LLM agent generates PyTorch model implementations, trains them, and iteratively improves based on results
- The agent operates in a closed loop: generate architecture code, train briefly, evaluate performance, analyze results, generate improved code
- Uses program synthesis rather than traditional NAS search spaces — the LLM can explore architectures beyond predefined primitives
- Discovered novel architectural patterns that outperform human-designed baselines on target tasks
- The entire search process runs on commodity hardware (single RTX 4090) in under 18 hours
- Generated architectures are fully reproducible PyTorch code, not black-box configurations
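The closed loop in the bullets above can be sketched as a short driver. This is a hedged sketch, not the paper's actual API: generate_code, evaluate, and improve_prompt are hypothetical hooks standing in for the Code Generator LLM, the brief training run, and the Prompt Improver.

```python
# Minimal sketch of the generate-train-evaluate-improve loop.
# The LLM calls and the training step are stubbed out as callables.

def run_search(iterations, generate_code, evaluate, improve_prompt):
    """Generate architecture code, score it with a short training run,
    and fold the analysis back into the next prompt."""
    best_code, best_acc = None, 0.0
    prompt = "Design a CNN for the target task."
    for _ in range(iterations):
        code = generate_code(prompt)      # LLM emits PyTorch model code
        acc = evaluate(code)              # brief training -> proxy accuracy
        if acc > best_acc:                # track the best candidate so far
            best_code, best_acc = code, acc
        prompt = improve_prompt(prompt, code, acc)  # analysis step
    return best_code, best_acc
```

In the real system the evaluate step is a one-epoch training run used purely as a fast ranking signal, so the loop can afford thousands of iterations on one GPU.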
Evidence
- A full 2000-iteration NAS run completes on a single RTX 4090 GPU in approximately 18 hours
- Discovered architectures match or exceed human-designed baselines; on CIFAR-10 one-epoch proxy accuracy, DeepSeek-Coder-6.7B improves from 28.2% to 69.2%, Qwen2.5-7B from 50.0% to 71.5%, and GLM-5 from 43.2% to 62.0%
- LLM agent successfully navigates the search space without human intervention
- Code-level program synthesis reaches architectures that predefined-primitive NAS search spaces cannot express
How to Apply
- Set up the LLM agent with access to a Python execution environment and PyTorch
- Define the target task and evaluation metrics, then let the agent run the search loop autonomously
- Start with a shorter search budget to get initial results, then extend for better architectures
- The generated PyTorch code can be directly deployed without any conversion steps
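Because execution failures are treated as first-class learning signals, the evaluation step can be sketched as a harness that compiles the generated code string and turns exceptions into structured diagnostics for the feedback memory. This is an illustrative sketch: evaluate_candidate and train_one_epoch are assumed names, not the paper's API.

```python
# Hedged sketch: run a generated architecture string and return either a
# proxy accuracy or a structured failure record for the feedback memory.

def evaluate_candidate(code_str, train_one_epoch):
    """exec() the generated code, train briefly, and report the outcome;
    failures become diagnostics rather than discarded trajectories."""
    namespace = {}
    try:
        exec(code_str, namespace)             # should define a Net class
        net_cls = namespace["Net"]
        acc = train_one_epoch(net_cls)        # one-epoch proxy accuracy
        return {"ok": True, "accuracy": acc}
    except Exception as exc:                  # OOM, bad shapes, syntax errors
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
```

Recording the error string (e.g., a CUDA out-of-memory message) lets the Prompt Improver spot recurring failure patterns across iterations instead of silently retrying.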
Code Example
# Example of the historical feedback memory structure passed to the Prompt Improver
import json

K = 5  # sliding-window size: keep only the last K improvement attempts

history = [
    {
        "problem": "Layers too deep, failed to converge within 1 epoch",
        "suggestion": "Reduce number of layers from 6→3 and add BatchNorm",
        "outcome": "accuracy improved from 0.42 → 0.55",
    },
    {
        "problem": "Channel size 512 too large, causing OOM error",
        "suggestion": "Reduced channel size to 256",
        "outcome": "error: CUDA out of memory",
    },
    # ... keep last K=5 entries
]
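The "keep last K=5 entries" comment can be realized as a simple sliding-window update; update_history is an illustrative helper, not a function from the paper.

```python
def update_history(history, entry, k=5):
    """Append a diagnostic triple and keep only the last k attempts,
    so the prompt context stays constant-size (the Markov-style window)."""
    history.append(entry)
    return history[-k:]
```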
best_acc = 0.55    # accuracy of the current best architecture (placeholder)
best_code = "..."  # PyTorch source of the current best Net class (placeholder)

prompt = f"""
You are a visionary deep learning architect.
Task: CIFAR-10 image classification (no pretrained weights)
Current best architecture (accuracy: {best_acc:.3f}):
{best_code}
Recent improvement history (last {K} attempts):
{json.dumps(history, ensure_ascii=False, indent=2)}
Based on the history, identify recurring failure patterns and suggest
a concrete improvement. Then generate a complete PyTorch Net(nn.Module) class.
"""
Terminology
Related Resources
Original Abstract
Neural Architecture Search (NAS) automates network design, but conventional methods demand substantial computational resources. We propose a closed-loop pipeline leveraging large language models (LLMs) to iteratively generate, evaluate, and refine convolutional neural network architectures for image classification on a single consumer-grade GPU without LLM fine-tuning. Central to our approach is a historical feedback memory inspired by Markov chains: a sliding window of $K{=}5$ recent improvement attempts keeps context size constant while providing sufficient signal for iterative learning. Unlike prior LLM optimizers that discard failure trajectories, each history entry is a structured diagnostic triple -- recording the identified problem, suggested modification, and resulting outcome -- treating code execution failures as first-class learning signals. A dual-LLM specialization reduces per-call cognitive load: a Code Generator produces executable PyTorch architectures while a Prompt Improver handles diagnostic reasoning. Since both the LLM and architecture training share limited VRAM, the search implicitly favors compact, hardware-efficient models suited to edge deployment. We evaluate three frozen instruction-tuned LLMs (${\leq}7$B parameters) across up to 2000 iterations in an unconstrained open code space, using one-epoch proxy accuracy on CIFAR-10, CIFAR-100, and ImageNette as a fast ranking signal. On CIFAR-10, DeepSeek-Coder-6.7B improves from 28.2% to 69.2%, Qwen2.5-7B from 50.0% to 71.5%, and GLM-5 from 43.2% to 62.0%. A full 2000-iteration search completes in ${\approx}18$ GPU hours on a single RTX~4090, establishing a low-budget, reproducible, and hardware-aware paradigm for LLM-driven NAS without cloud infrastructure.