Resource-Efficient Iterative LLM-Based NAS with Feedback Memory

Mar 12, 2026 • Xiaojie Gu, Dmitry Ignatov, Radu Timofte

TL;DR Highlight

A loop in which an LLM directly generates, evaluates, and refines PyTorch code, automatically searching for neural network architectures in about 18 hours on a single RTX 4090.

Who Should Read

ML engineers who design lightweight models for edge devices, or who want to try NAS (neural architecture search) on a local GPU without cloud infrastructure.

Core Mechanics

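NAS is automated as a closed loop: the LLM directly generates PyTorch code → trains it for 1 epoch → analyzes the result → proposes the next code improvement.
A sliding window that keeps only the most recent K=5 attempts holds the context size constant while still allowing iterative learning (an idea borrowed from Markov chains).
Failed code executions are also recorded as structured (problem, suggested fix, outcome) triples so that the LLM can pick up on failure patterns; prior LLM optimizers simply discarded failures.
Roles are split between two LLMs, a Code Generator (code generation) and a Prompt Improver (diagnosis and improvement suggestions), which reduces the load on each call.
On CIFAR-10, DeepSeek-Coder-6.7B improves from 28.2% to 69.2%, Qwen2.5-7B from 50.0% to 71.5%, and GLM-5 from 43.2% to 62.0%, a large gain in every case.
No LLM fine-tuning is involved: only frozen models of 7B parameters or fewer are used, and a full 2000-iteration search completes in about 18 hours on a single RTX 4090.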
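
The sliding-window memory described above can be sketched with a bounded deque. This is a minimal illustration, not the authors' implementation: the names `DiagnosticTriple` and `FeedbackMemory` are hypothetical, and the only details taken from the summary are the window size K=5 and the three fields of each entry.

```python
import json
from collections import deque
from dataclasses import asdict, dataclass


@dataclass
class DiagnosticTriple:
    """One history entry: what went wrong, what was tried, what happened."""
    problem: str
    suggestion: str
    outcome: str


class FeedbackMemory:
    """Hypothetical helper that keeps only the K most recent attempts."""

    def __init__(self, k: int = 5):
        # A deque with maxlen drops the oldest entry automatically,
        # so the prompt context stays the same size at every iteration.
        self._entries = deque(maxlen=k)

    def add(self, problem: str, suggestion: str, outcome: str) -> None:
        self._entries.append(DiagnosticTriple(problem, suggestion, outcome))

    def to_json(self) -> str:
        # Serialized form that can be pasted into the next prompt.
        return json.dumps([asdict(e) for e in self._entries],
                          ensure_ascii=False, indent=2)

    def __len__(self) -> int:
        return len(self._entries)


memory = FeedbackMemory(k=5)
for i in range(7):
    memory.add(f"problem {i}", f"suggestion {i}", f"outcome {i}")
```

After seven calls to `add`, only attempts 2 through 6 remain, which is the constant-context behavior the window is meant to provide.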

Evidence

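DeepSeek-Coder-6.7B on CIFAR-10: 28.2% → 69.2% (+41.0 percentage points), Spearman ρ=0.754 (p≈0)
Qwen2.5-7B on CIFAR-10: 50.0% → 71.5% (+21.5 percentage points), the highest accuracy over 2000 iterations
GLM-5 goes from 43.2% to 62.0% on CIFAR-10 in only 100 iterations, with a 91.0% success rate, the most stable of the three models
Ablation: removing the historical feedback memory causes the search to stall or performance to drop, and it fails to beat the single-shot baseline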
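
For readers unfamiliar with the Spearman ρ quoted above, the following sketch shows how such a rank correlation is computed. The summary does not say which two quantities were ranked; this example assumes iteration index versus proxy accuracy, and the accuracy values are made-up toy numbers, not results from the paper.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (sx * sy)


# Toy data (assumption): accuracy mostly rises with iteration, with two dips.
iterations = [1, 2, 3, 4, 5]
toy_accuracy = [0.28, 0.25, 0.41, 0.39, 0.52]
rho = spearman_rho(iterations, toy_accuracy)
```

A value near 1 means accuracy tends to rise with iteration count even if individual steps regress, which is one way to read a positive ρ for an iterative search.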

How to Apply

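Run the open-source DeepSeek-Coder-6.7B-Instruct or Qwen2.5-7B-Instruct locally, put the current best code plus the 5 most recent attempt records into the prompt, and ask the model to generate new PyTorch model code.
In each iteration, train the generated code for just 1 epoch to get a fast proxy accuracy. When a run fails, store the error message in the history as a (problem, suggestion, outcome) triple as well, and include it in the next prompt.
In a VRAM-limited environment the LLM and the training job share the GPU, so the search itself naturally favors lightweight models and can be used directly to discover architectures for edge deployment.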
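
The steps above can be sketched as a small driver loop. This is a skeleton under stated assumptions, not the paper's code: `run_search` and its three callables (`diagnose`, `generate_code`, `evaluate`) are hypothetical stand-ins for the Prompt Improver call, the Code Generator call, and the 1-epoch training run, which you would supply yourself.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Tuple

History = List[Dict[str, str]]


def run_search(
    diagnose: Callable[[Optional[str], History], Tuple[str, str]],
    generate_code: Callable[[Optional[str], str], str],
    evaluate: Callable[[str], float],
    iterations: int = 10,
    k: int = 5,
) -> Tuple[Optional[str], float, History]:
    """Closed loop: diagnose -> generate -> evaluate -> record, repeated."""
    history = deque(maxlen=k)  # sliding window of the last k attempts
    best_code: Optional[str] = None
    best_acc = 0.0

    for _ in range(iterations):
        # Prompt Improver role: read best code + recent history, propose a fix.
        problem, suggestion = diagnose(best_code, list(history))
        # Code Generator role: turn the suggestion into a new candidate.
        code = generate_code(best_code, suggestion)
        try:
            # Stand-in for the 1-epoch proxy training run.
            acc = evaluate(code)
            outcome = f"accuracy {best_acc:.3f} -> {acc:.3f}"
            if acc > best_acc:
                best_code, best_acc = code, acc
        except Exception as exc:
            # Failures are kept as history entries instead of being dropped.
            outcome = f"error: {exc}"
        history.append(
            {"problem": problem, "suggestion": suggestion, "outcome": outcome}
        )

    return best_code, best_acc, list(history)
```

Passing the LLM calls and the trainer in as callables keeps the loop testable with cheap stubs before any GPU time is spent.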

Code Example

# Example structure of the historical feedback memory passed to the Prompt Improver
import json

K = 5  # sliding-window size: keep only the most recent K attempts

# Placeholder values for illustration; in a real run these come from the search loop
best_acc = 0.55
best_code = "class Net(nn.Module): ..."

history = [
    {
        "problem": "Network is too deep to converge within 1 epoch",
        "suggestion": "Reduce the layer count from 6 to 3 and add BatchNorm",
        "outcome": "accuracy improved from 0.42 to 0.55"
    },
    {
        "problem": "512 channels is too large and causes an OOM error",
        "suggestion": "Reduce the channel count to 256",
        "outcome": "error: CUDA out of memory"
    },
    # ... keep the most recent K=5 entries
]

prompt = f"""
You are a visionary deep learning architect.
Task: CIFAR-10 image classification (no pretrained weights)

Current best architecture (accuracy: {best_acc:.3f}):
{best_code}

Recent improvement history (last {K} attempts):
{json.dumps(history, ensure_ascii=False, indent=2)}

Based on the history, identify recurring failure patterns and suggest
a concrete improvement. Then generate a complete PyTorch Net(nn.Module) class.
"""
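
Because the prompt asks for a complete `Net` class, it can help to check the reply before spending GPU time on it. The sketch below is one possible pre-check using only the standard library; `extract_net_source` is a hypothetical helper, and the assumption that the model wraps its code in a Markdown fence is ours, not something the summary states.

```python
import ast
import re
from typing import Optional

FENCE = "`" * 3  # built at runtime to avoid a literal fence inside this example
CODE_BLOCK = re.compile(FENCE + r"(?:python)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_net_source(response: str) -> Optional[str]:
    """Return the code from an LLM reply if it parses and defines class Net."""
    match = CODE_BLOCK.search(response)
    # Fall back to treating the whole reply as code when no fence is found.
    source = match.group(1) if match else response
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return None  # caller can log this as a failed attempt in the history
    defines_net = any(
        isinstance(node, ast.ClassDef) and node.name == "Net"
        for node in tree.body
    )
    return source if defines_net else None


reply = (
    "Here is the improved model:\n"
    + FENCE + "python\n"
    + "class Net:\n    pass\n"
    + FENCE
)
candidate = extract_net_source(reply)
```

A `None` result is a natural trigger for recording a failure entry rather than launching training. The check only parses the code and never executes it, so it says nothing about whether the model actually trains.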

Terminology

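NAS: Neural Architecture Search. A technique in which an algorithm automatically finds the network structure (layer types, counts, connectivity, and so on) instead of a person designing it by hand.

closed-loop pipeline: A structure that cycles through generation → evaluation → feedback → regeneration. Similar to the inspect-then-rework loop in factory automation.

Markov chain: A mathematical model in which the next state depends only on the current state, not on the full earlier history. The paper borrows this idea by conditioning each step on only the most recent K attempts, much like choosing the next move in Go by looking only at the last few moves.

proxy accuracy: A provisional accuracy obtained by training for just 1 epoch instead of running full training. Similar to judging quality from a scale model before a building is finished.

instruction-tuned LLM: An LLM further trained to follow user instructions well. Like GPT-4o or Claude, a model tuned to carry out a request as stated.

frozen LLM: An LLM used as-is without updating its parameters. The model itself stays unchanged and only the prompt varies.

diagnostic triple: A concept introduced in this paper that records each attempt as three structured parts: (identified problem, suggested modification, actual outcome). Like a medical record that keeps cause, prescription, and result together.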

Related Resources

Paper code repository (includes prompt templates, training scripts, and iteration logs)

Original Abstract

Neural Architecture Search (NAS) automates network design, but conventional methods demand substantial computational resources. We propose a closed-loop pipeline leveraging large language models (LLMs) to iteratively generate, evaluate, and refine convolutional neural network architectures for image classification on a single consumer-grade GPU without LLM fine-tuning. Central to our approach is a historical feedback memory inspired by Markov chains: a sliding window of $K=5$ recent improvement attempts keeps context size constant while providing sufficient signal for iterative learning. Unlike prior LLM optimizers that discard failure trajectories, each history entry is a structured diagnostic triple -- recording the identified problem, suggested modification, and resulting outcome -- treating code execution failures as first-class learning signals. A dual-LLM specialization reduces per-call cognitive load: a Code Generator produces executable PyTorch architectures while a Prompt Improver handles diagnostic reasoning. Since both the LLM and architecture training share limited VRAM, the search implicitly favors compact, hardware-efficient models suited to edge deployment. We evaluate three frozen instruction-tuned LLMs ($\leq 7$B parameters) across up to 2000 iterations in an unconstrained open code space, using one-epoch proxy accuracy on CIFAR-10, CIFAR-100, and ImageNette as a fast ranking signal. On CIFAR-10, DeepSeek-Coder-6.7B improves from 28.2% to 69.2%, Qwen2.5-7B from 50.0% to 71.5%, and GLM-5 from 43.2% to 62.0%. A full 2000-iteration search completes in $\approx 18$ GPU hours on a single RTX 4090, establishing a low-budget, reproducible, and hardware-aware paradigm for LLM-driven NAS without cloud infrastructure.