EurekAgent: Agent Environment Engineering for Autonomous Scientific Discovery

TL;DR Highlight

Instead of prescribing complex workflows for LLM agents, EurekAgent gives them a well-designed environment, and it reached SOTA across math, kernel, and ML benchmarks.

Who Should Read

ML engineers designing autonomous research agents or AI copilot systems. Developers who want to plug CLI agents such as Claude Code or Codex into a real research-automation pipeline.

Core Mechanics

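Prior work specified in detail how an agent should do research (the workflow). EurekAgent does the opposite: it designs only the environment in which research happens and lets the agent decide its own strategy.
The environment is engineered along four axes: Permissions (Docker isolation plus a hidden evaluator), Artifact (shared memory built on the filesystem and Git), Budget (limits on time and API cost), and Human-in-the-loop (web monitor plus terminal UI).
It runs as a Prepare → Propose → Implement (parallel) loop. In the Propose phase the agent searches the web for existing solutions; in the Implement phase up to P sessions implement one hypothesis each in parallel.
Parallel sessions in the same round cannot see each other's code (same-round isolation), which keeps them from converging prematurely on a single local optimum.
The evaluation script (hidden evaluator) sits outside the agent's workspace and returns only a score, which structurally blocks reward hacking (gaming the score).
Claude Code serves as the CLI agent and GLM-5.1 as the base LLM. With no training (fine-tuning) and environment design alone, the system outperformed competitors based on test-time training.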
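The round structure and the same-round isolation rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not EurekAgent's implementation: `Solution`, `propose`, `implement`, `run_rounds`, and the toy `score_fn` are hypothetical names introduced here, and a single number stands in for the code a session would write.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    round_no: int
    session: int
    value: float  # stand-in for the code an agent session would write
    score: float


def score_fn(value: float) -> float:
    """Toy metric with its maximum at value == 3.0 (stand-in for the hidden evaluator)."""
    return -(value - 3.0) ** 2


def propose(history: tuple[Solution, ...], parallel: int) -> list[float]:
    """Propose `parallel` hypotheses around the best solution from earlier rounds."""
    base = max(history, key=lambda s: s.score).value if history else 0.0
    # Deterministic spread so that sessions explore different directions.
    return [base + (i - (parallel - 1) / 2) for i in range(parallel)]


def implement(round_no: int, session: int, hypothesis: float,
              snapshot: tuple[Solution, ...]) -> Solution:
    """One session: it receives only the frozen snapshot of earlier rounds."""
    assert all(s.round_no < round_no for s in snapshot)
    return Solution(round_no, session, hypothesis, score_fn(hypothesis))


def run_rounds(rounds: int, parallel: int) -> list[Solution]:
    history: list[Solution] = []
    for r in range(1, rounds + 1):
        snapshot = tuple(history)  # immutable view shared by this round's sessions
        hypotheses = propose(snapshot, parallel)
        with ThreadPoolExecutor(max_workers=parallel) as pool:
            futures = [pool.submit(implement, r, i, h, snapshot)
                       for i, h in enumerate(hypotheses)]
            results = [f.result() for f in futures]
        # Results become visible to other sessions only after the round ends.
        history.extend(results)
    return history


if __name__ == "__main__":
    best = max(run_rounds(rounds=4, parallel=3), key=lambda s: s.score)
    print(best)
```

The design point is that each session gets a snapshot taken before the round starts, so nothing written during the round can leak sideways; sharing happens only at the round boundary.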

Evidence

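On the 26-circle packing problem, it raised the previous best AI result (AlphaEvolve) from 2.635986 to 2.635999, at a total API cost under $11.
On the TriMul kernel engineering task, it reached 2005.03 μs, about 4.3% faster than the best leaderboard entry (2096.04 μs) and about 10.8% faster than the previous best AI result, TTT-Discover (2247.78 μs).
On the 7 competitions of MLE-Bench Lite, it ranked first with an any-medal rate of 85.71% and a gold-medal rate of 71.43%. The 2nd through 5th place systems (AIBuildAI, Famou-Agent, and others, using Claude Opus or Gemini) all stopped at 71.43%.
The average API cost across the three math tasks was $17 or less, and it beat competitors that use proprietary commercial models (Claude Opus 4.6, Gemini 2.5 Pro) using only the open-source model GLM-5.1.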
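The TriMul percentages follow directly from the reported runtimes. A quick check using only the numbers above (`relative_speedup` is a helper name introduced here); the medal rates are consistent with 6 of 7 and 5 of 7 competitions:

```python
def relative_speedup(baseline_us: float, new_us: float) -> float:
    """Fraction of runtime saved relative to the baseline."""
    return (baseline_us - new_us) / baseline_us


eurekagent_us = 2005.03
leaderboard_us = 2096.04
ttt_discover_us = 2247.78

print(f"vs leaderboard best: {relative_speedup(leaderboard_us, eurekagent_us):.1%}")
print(f"vs TTT-Discover:     {relative_speedup(ttt_discover_us, eurekagent_us):.1%}")
# Rates over the 7 MLE-Bench Lite competitions
print(f"any-medal: {6 / 7:.2%}, gold-medal: {5 / 7:.2%}")
```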

How to Apply

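When building your own research-automation pipeline, instead of putting detailed procedures in the agent prompt, try isolating the execution space in a Docker sandbox and moving the evaluation script to an external service, so that score manipulation is blocked structurally.
When running several agents in parallel, block file access between sessions in the same round and expose only previous-round results through a shared Git repository. This preserves diversity while letting good solutions accumulate into the next round.
For long-running agents, enforce a wall-clock timer and an API cost tracker at the environment layer. This makes the agent budget-aware and nudges it to submit results before the deadline.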
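Budget enforcement at the environment layer might look like the sketch below. `BudgetGuard` and `BudgetExceeded` are hypothetical names, not part of EurekAgent, and the limits are illustrative values. The idea is that the harness, not the prompt, owns the clock and the ledger, while the agent can still query what is left.

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised by the environment when a session runs out of time or money."""


class BudgetGuard:
    def __init__(self, time_limit_s: float, cost_limit_usd: float, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + time_limit_s
        self._cost_limit = cost_limit_usd
        self._spent = 0.0

    def remaining(self) -> dict:
        """What the agent may query to plan around its budget."""
        return {
            "seconds_left": max(0.0, self._deadline - self._clock()),
            "usd_left": max(0.0, self._cost_limit - self._spent),
        }

    def charge(self, usd: float) -> None:
        """Record the cost of one API call, then enforce both limits."""
        self._spent += usd
        self.check()

    def check(self) -> None:
        if self._clock() >= self._deadline:
            raise BudgetExceeded("wall-clock limit reached")
        if self._spent > self._cost_limit:
            raise BudgetExceeded("API cost limit reached")


if __name__ == "__main__":
    guard = BudgetGuard(time_limit_s=7200.0, cost_limit_usd=20.0)  # illustrative limits
    guard.charge(0.35)  # hypothetical cost of one model call
    print(guard.remaining())
```

Injecting the clock as a parameter keeps the guard testable without waiting for real time to pass.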

Code Example

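# EurekAgent run example (based on the open-source repository)
# 1. Prepare the task definition files
# - problem_description.md: problem statement
# - evaluator.py: hidden evaluation script (not accessible to the agent)
# - submission_spec.md: submission format specification
# - initial_solution.py: (optional) initial code

# 2. Run EurekAgent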
python run_eurekagent.py \
  --problem problem_description.md \
  --evaluator evaluator.py \
  --spec submission_spec.md \
  --initial initial_solution.py \
  --rounds 5 \
  --parallel 3 \
  --time-propose 20 \
  --time-implement 120 \
  --api-budget 20.0 \
  --agent claude-code \
  --model glm-5.1

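# Core structure of the environment engineering
# Permissions: runs inside a Docker container; the evaluator is an external gRPC service
# Artifact: Git-managed under ./runs/{run_id}/, ranked_solutions.json updated automatically
# Budget: each session gets a time-check API; the run stops automatically when cost is exceeded
# Human-in-loop: web monitor at localhost:8080, intervention through a terminal TUI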
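The score-only boundary in the Permissions line can be illustrated with a small Python sketch. It uses a local subprocess in place of the external gRPC service, so it shows only the shape of the interface; real isolation needs the container boundary. `score_submission`, `EVALUATOR_SOURCE`, and the toy metric are assumptions made for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical evaluator source; in a real setup it lives outside the agent's container.
EVALUATOR_SOURCE = '''
import json, sys
values = json.loads(open(sys.argv[1]).read())
print(sum(values))  # toy metric: sum of the submitted numbers
'''


def score_submission(evaluator: Path, submission: Path, timeout_s: float = 30.0) -> float:
    """Run the evaluator in a separate process and return only the score."""
    proc = subprocess.run(
        [sys.executable, str(evaluator), str(submission)],
        capture_output=True, text=True, timeout=timeout_s,
    )
    if proc.returncode != 0:
        # Deliberately opaque: no traceback or evaluator internals reach the agent.
        raise ValueError("submission rejected")
    return float(proc.stdout.strip())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as hidden, tempfile.TemporaryDirectory() as workspace:
        evaluator = Path(hidden) / "evaluator.py"
        evaluator.write_text(EVALUATOR_SOURCE)
        submission = Path(workspace) / "submission.json"
        submission.write_text("[1.5, 2.5]")
        print(score_submission(evaluator, submission))
```

The caller gets back a float or a generic rejection and nothing else, which is the property that makes the evaluator hard to game from inside the workspace.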

Terminology

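CLI agent: An AI agent that runs in the terminal (command line). Think of an AI assistant such as Claude Code that writes and executes code directly.

reward hacking: When an agent uses tricks to raise its evaluation score instead of achieving the actual goal. It is like peeking at the answers on an exam or tampering with the grading rubric.

environment engineering: An approach that, rather than telling the agent what to do, designs the environment the agent acts in (permissions, tools, constraints) so that good behavior emerges naturally.

affordance: A term from ecological psychology for the action possibilities an environment offers an actor. In EurekAgent it refers to designing which capabilities are allowed or blocked for the agent.

test-time training: A technique that further trains the model at inference time. The key point is that EurekAgent achieved better results without it.

MCP (Model Context Protocol): A standard protocol that lets AI agents use external tools (web search, browsers, and so on). Think of it as the interface for plugging add-ons into an agent.

same-round isolation: A design that isolates agent sessions running in parallel in the same round so they cannot see each other's code. It is like putting up partitions so teammates cannot copy from one another.

GLM-5.1: An open-source large language model made by Zhipu AI. In this paper it is used as the base LLM for Claude Code (the agent).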

Original Abstract

LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and iterate scientific solutions, and have produced results that outperform human-designed approaches. As model capabilities continue to improve, we argue that the bottleneck for autonomous scientific discovery is shifting from prescribing agent workflows to designing agent environments: the resources, constraints, and interfaces that shape agent behavior. We frame this as environment engineering: building environments that amplify productive behaviors, such as open-ended exploration, systematic artifact management, and inter-agent collaboration, while suppressing harmful behaviors, such as reward hacking and high-friction human oversight. We present EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery. EurekAgent engineers the environment along four dimensions: permissions engineering for bounded agent execution and isolated evaluation; artifact engineering for filesystem and Git-based collaboration; budget engineering for budget-aware exploration; and human-in-the-loop engineering for easy human supervision and intervention. EurekAgent sets new state-of-the-art results on multiple mathematics, kernel engineering, and machine learning tasks, including new state-of-the-art 26-circle packing results discovered with less than $11 in total API cost. We open-source our code and results, and call for environment engineering as a core research direction for developing reliable autonomous research agents.