FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast | AI Paper Digest

TL;DR Highlight

A framework in which multiple AI agents share their failure experiences and evolve their memory through collective intelligence, with no fine-tuning.

Who Should Read

ML engineers and agent-system developers who want to add a self-improvement mechanism to LLM agents. Especially useful if you want to raise agent performance through prompts alone, without fine-tuning.

Core Mechanics

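FORGE works without fine-tuning (no weight updates): it automatically turns failed episodes into knowledge artifacts, either Rules (conditional heuristics) or Examples (few-shot demonstrations), and injects them into the agent's memory.
The core mechanism is 'Champion Broadcast': N agent instances run in parallel, and at each stage the memory of the best-performing instance overwrites the memory of all the others, propagating improved knowledge across the population.
All four tested models (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) improve by 1.7-7.7× over zero-shot and by 29-72% over plain Reflexion.
Memory representation comparison: Examples (few-shot demonstrations) gives the highest returns on three of the four models. Rules uses about 40% fewer tokens while still performing reliably, giving the best cost-to-reliability trade-off.
Graduation mechanism: instances that clear a score threshold stop learning and are frozen, which keeps them from being overwritten by worse memory later and saves unnecessary compute.
Weaker models gain more: Gemini, which had the lowest baseline, improved 7.7×, while Grok, which had the best baseline, improved 1.7×. FORGE appears to narrow the capability gap between models.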
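
To make the two memory representations concrete, here is a minimal Python sketch of how Rules and Examples artifacts might be stored and rendered into a system prompt. The class names (`Rule`, `Example`), their fields, the prompt layout, and the sample action string are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Rule:
    """Conditional heuristic in 'IF condition THEN action' form (hypothetical schema)."""
    condition: str
    action: str

    def render(self) -> str:
        return f"- IF {self.condition} THEN {self.action}"


@dataclass
class Example:
    """Few-shot demonstration in a ReAct-like layout (hypothetical schema)."""
    thought: str
    action: str
    observation: str

    def render(self) -> str:
        return (
            f"Thought: {self.thought}\n"
            f"Action: {self.action}\n"
            f"Observation: {self.observation}"
        )


Artifact = Union[Rule, Example]


def build_system_prompt(base_prompt: str, memory: List[Artifact]) -> str:
    """Append learned artifacts to the base prompt; model weights are never touched."""
    rules = [a.render() for a in memory if isinstance(a, Rule)]
    examples = [a.render() for a in memory if isinstance(a, Example)]
    sections = [base_prompt]
    if rules:
        sections.append("Learned rules:\n" + "\n".join(rules))
    if examples:
        sections.append("Worked examples:\n" + "\n\n".join(examples))
    return "\n\n".join(sections)


# Made-up artifacts for illustration only.
memory: List[Artifact] = [
    Rule("a user host shows suspicious activity", "analyse it before restoring"),
    Example("Host A looks compromised.", "Remove(host_a)", "Malicious process removed."),
]
prompt = build_system_prompt("You defend a network.", memory)
print(prompt)
```

A Rule renders to a single line while an Example renders to several, which gives some intuition for why a Rules-only memory can cost fewer tokens per call.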

Evidence

FORGE improves on Reflexion by 29-72% in all 12 model-representation conditions, for example: Gemini Examples −78.9→−24.5, Qwen Examples −57.6→−24.3, Llama Examples −53.9→−28.3, Grok Rules −79.9→−33.7
Catastrophic failure rate (return < −100): from about 90% at zero-shot down to as low as about 1% with FORGE
Token cost (Gemini): Rules ~106M, Examples ~177M, Mixed ~188M, so Rules is about 40% cheaper than Examples while still performing reliably
The best single-checkpoint return is −3.60 (Gemini Rules), close to the best deep reinforcement learning (DRL) score of −3.47. The overall average is lower than this, however
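
The headline percentages can be checked directly from the return pairs listed above. This sketch assumes that "improvement over Reflexion" means the relative reduction in the magnitude of the negative return; the source does not spell out the formula, so treat `relative_improvement` as an assumed definition.

```python
# (Reflexion return, FORGE return) pairs quoted in the Evidence list.
reported = {
    "Gemini Examples": (-78.9, -24.5),
    "Qwen Examples": (-57.6, -24.3),
    "Llama Examples": (-53.9, -28.3),
    "Grok Rules": (-79.9, -33.7),
}


def relative_improvement(baseline: float, improved: float) -> float:
    """Fraction of the baseline's negative return that was recovered (assumed metric)."""
    return (improved - baseline) / abs(baseline)


improvements = {
    name: round(100 * relative_improvement(base, forge), 1)
    for name, (base, forge) in reported.items()
}

# Token cost on Gemini, in millions of tokens.
rules_tokens, examples_tokens = 106, 177
token_saving_pct = round(100 * (1 - rules_tokens / examples_tokens), 1)

for name, pct in improvements.items():
    print(f"{name}: {pct}% better than Reflexion")
print(f"Rules vs Examples token saving: {token_saving_pct}%")
```

Under this definition each of the four pairs falls inside the reported 29-72% band, and the Rules token saving works out to roughly 40%, matching the figures quoted above.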

How to Apply

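If your single agent already has a Reflexion loop, run N copies of that loop in parallel and, at the end of each stage, copy the memory file of the highest-scoring instance over the others. That copy step is Champion Broadcast.
When the agent fails (its reward drops below a set threshold), abort the episode immediately, have the same LLM analyze the failed trajectory, generate conditional rules (Rules) or ReAct-style demonstrations (Examples), and inject them into the system prompt.
In cost-sensitive production settings, choosing Rules over Examples saves about 40% of the tokens at similar performance. Setting a graduation threshold so that converged instances are excluded from further learning cuts wasted compute.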
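
The file-copy step and the graduation check from the tips above can be sketched in a few lines. This assumes each instance keeps its memory in its own text file; the function name `broadcast_champion`, the file layout, and the sample scores are hypothetical and only illustrate the idea.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, Set


def broadcast_champion(
    memory_files: Dict[str, Path],
    scores: Dict[str, float],
    graduated: Set[str],
    graduation_threshold: float,
) -> str:
    """Copy the top scorer's memory file over every non-graduated instance.

    Instances scoring above the threshold are added to `graduated` and are
    never overwritten. Returns the champion's name.
    """
    champion = max(scores, key=scores.get)
    for name, score in scores.items():
        if score > graduation_threshold:
            graduated.add(name)
    for name, path in memory_files.items():
        if name != champion and name not in graduated:
            shutil.copyfile(memory_files[champion], path)
    return champion


# Toy run with three instances and made-up scores.
with tempfile.TemporaryDirectory() as tmp:
    files = {}
    for name in ("a", "b", "c"):
        files[name] = Path(tmp) / f"{name}_memory.txt"
        files[name].write_text(f"rules learned by {name}")
    frozen: Set[str] = set()
    champ = broadcast_champion(
        files, {"a": -40.0, "b": -12.0, "c": -60.0}, frozen, graduation_threshold=-15.0
    )
    contents = {name: path.read_text() for name, path in files.items()}
```

In this toy run instance `b` is both the champion and the only graduate, so `a` and `c` end up with a copy of its memory file. Passing the same `frozen` set into later stages keeps graduated instances untouched.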

Code Example

# FORGE core logic (pseudocode)
# Agent, env, reflector_agent, and evaluate are placeholders for your own
# agent wrapper, environment, reflection agent, and checkpoint evaluation.

# 1. Failure detection -> memory update (inner loop)
def reflexion_loop(agent, memory, max_attempts=3, failure_threshold=-1.1):
    for attempt in range(max_attempts):
        obs = env.reset()  # each attempt starts a fresh episode
        trajectory = []
        for step in range(30):  # CAGE-2 horizon
            action = agent.act(obs, memory)
            obs, reward = env.step(action)
            trajectory.append((action, obs, reward))

            # Abort the episode as soon as a failure is detected
            if reward < failure_threshold:
                # The same LLM analyzes the failure and produces a knowledge artifact
                new_artifact = reflector_agent.analyze(
                    trajectory=trajectory,
                    representation='rules'  # or 'examples', 'mixed'
                )
                memory.append(new_artifact)
                break  # the next attempt restarts the episode
    return memory

# 2. Champion Broadcast (outer loop)
def forge_protocol(N=10, S=6, graduation_threshold=-15):
    instances = [Agent(memory=[]) for _ in range(N)]
    graduated = set()

    for stage in range(S):
        # Train every non-graduated instance (sequential here; run in parallel in practice)
        for i, agent in enumerate(instances):
            if i not in graduated:
                agent.memory = reflexion_loop(agent, agent.memory)

        # Checkpoint evaluation
        scores = {i: evaluate(instances[i])
                  for i in range(N) if i not in graduated}
        if not scores:
            break  # every instance has graduated

        # Assumption: the champion is chosen among all instances evaluated this
        # stage, so a top scorer that graduates now can still share its memory
        champion_idx = max(scores, key=scores.get)
        champion_memory = instances[champion_idx].memory

        # Graduation: freeze instances that are good enough
        for i, score in scores.items():
            if score > graduation_threshold:
                graduated.add(i)

        # Champion Broadcast: overwrite the remaining active instances
        for i in range(N):
            if i not in graduated and i != champion_idx:
                instances[i].memory = champion_memory.copy()  # full replacement

    return instances

Terminology

ReAct: Short for Reasoning + Acting. An agent pattern in which the LLM solves a problem by repeating Thought → Action → Observation. Similar to a person working through a problem while jotting notes.

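Reflexion: A technique in which an agent reflects on itself after a failure and stores the lessons as text in memory. Similar to keeping a notebook of the exam questions you got wrong.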

POMDP: Partially Observable Markov Decision Process. A setting in which the agent cannot see the whole environment and must decide from partial information. Similar to driving in fog.

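Population-Based Training (PBT): A training method that trains several models or agents at once, selects the well-performing ones, and propagates them to the rest. Similar to several teams in a company competing, with the best team's method then shared company-wide.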

Champion Broadcast: The core mechanism of this paper. The memory of the best-performing agent is copied to all the other agents. Like photocopying the top student's notes for the whole class.

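Graduation: A mechanism that freezes agent instances once they reach a certain performance level, so they are no longer trained. Like locking an answer you got right so you cannot erase it by mistake.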

Few-shot: Showing an LLM a few examples and having it perform a similar task. Like showing a student two or three worked examples and then asking them to solve a similar problem.

Original Abstract

Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7× over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below −100) to as low as ~1%. We find that (1) population broadcast is critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, Rules offers the best cost-reliability profile with ~40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.