Scaling Karpathy's Autoresearch: What Happens When the Agent Gets a GPU Cluster
TL;DR Highlight
An experiment report: given 16 GPUs, Claude Code ran 910 experiments in 8 hours, achieved a 2.87% improvement in validation loss, and developed its own strategy for leveraging a mixed H100/H200 hardware pool.
Who Should Read
ML engineers who spend significant time on iterative hyperparameter tuning, and infrastructure engineers interested in giving AI agents autonomous control of cloud infrastructure.
Core Mechanics
- Claude Code autonomously managed the entire ML experimentation loop: designing experiments, submitting GPU jobs, monitoring results, updating hypotheses, and iterating — without human intervention between iterations.
- In 8 hours with 16 GPUs, the agent ran 910 experiments and found a configuration that reduced validation loss by 2.87% compared to the human-tuned baseline.
- The agent spontaneously developed a strategy for the mixed H100/H200 cluster: assigning larger batch sizes to the faster H200s and smaller jobs to H100s to maximize throughput.
- The agent maintained a running hypothesis log, systematically ruling out dead ends and prioritizing promising directions — exhibiting behavior closer to a research scientist than a grid search.
- Failure modes included the agent occasionally getting stuck in local optima and needing human nudges to explore different regions of the search space.
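The loop described above (propose a config, run it, read the result, update the hypothesis log, repeat) can be sketched in miniature. Everything here is a hypothetical stand-in: `toy_loss` replaces a real training run, and `propose` reduces the agent's hypothesis tracking to its simplest form, exploring around the best configuration seen so far. It is an illustration of the loop's shape, not the actual tooling from the report.

```python
import random

def toy_loss(cfg):
    # Stand-in for a real training run: a smooth bowl whose minimum sits
    # at lr=3e-4, batch=512 (values chosen purely for illustration).
    return (cfg["lr"] - 3e-4) ** 2 * 1e6 + (cfg["batch"] - 512) ** 2 * 1e-5

def propose(history, rng):
    # Simplest possible "hypothesis log": exploit the current best config
    # by perturbing it, falling back to random sampling when history is empty.
    if history:
        best = min(history, key=lambda h: h["loss"])["cfg"]
        lr = best["lr"] * rng.uniform(0.5, 2.0)
        batch = max(32, int(best["batch"] * rng.uniform(0.5, 2.0)))
    else:
        lr = rng.uniform(1e-5, 1e-2)
        batch = rng.choice([64, 128, 256, 512])
    return {"lr": lr, "batch": batch}

def run_loop(budget, seed=0):
    # Propose -> evaluate -> record, for a fixed experiment budget.
    rng = random.Random(seed)
    history = []
    for _ in range(budget):
        cfg = propose(history, rng)
        history.append({"cfg": cfg, "loss": toy_loss(cfg)})
    return min(history, key=lambda h: h["loss"])
```

In the real system, `toy_loss` would be an asynchronous GPU job submission plus result collection, and `propose` would be the agent reasoning over its full experiment log rather than a one-line heuristic.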
Evidence
- The experiment logs were shared publicly, showing the agent's actual decision trail — commenters found the emergent H100/H200 scheduling strategy particularly impressive.
- ML researchers noted that 2.87% validation loss improvement in 8 hours is genuinely competitive with what a skilled human ML engineer could achieve in a similar time budget.
- Skeptics raised reproducibility concerns: the improvement might be specific to this model/dataset combination and the agent's choices might not generalize.
- The cost analysis showed the 8-hour GPU run cost approximately $800-1200 — comparable to a day of senior ML engineer time, prompting discussion about the economics.
How to Apply
- Set up a structured experiment logging system before unleashing an agent on hyperparameter search — the agent needs a consistent format to read its own history.
- Define the search space explicitly and impose hard constraints (max batch size, min learning rate) to prevent the agent from exploring obviously bad regions.
- Implement a 'human checkpoint' every N experiments or every hour: review the agent's hypothesis log and redirect if it's stuck or heading in an unproductive direction.
- Start with a smaller GPU allocation (2-4 GPUs) to verify the agent is behaving sensibly before scaling up to the full cluster.
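The first two steps above — a consistent log format the agent can re-read, and hard constraints on the search space — can be sketched as follows. The JSONL schema, the `CONSTRAINTS` bounds, and all function names are assumptions for illustration, not the format used in the report.

```python
import json
import os
import tempfile

# Hypothetical hard bounds on the search space; jobs outside these
# ranges are clamped before submission rather than rejected.
CONSTRAINTS = {"lr": (1e-5, 1e-2), "batch": (32, 2048)}

def clamp(cfg):
    # Enforce the hard constraints before a job is ever submitted.
    return {k: min(max(cfg[k], lo), hi) for k, (lo, hi) in CONSTRAINTS.items()}

def log_experiment(path, cfg, val_loss):
    # Append-only JSONL: one record per experiment in a fixed schema,
    # so the agent can reliably parse its own history later.
    with open(path, "a") as f:
        f.write(json.dumps({"cfg": cfg, "val_loss": val_loss}) + "\n")

def read_history(path):
    # The agent re-reads this at each iteration to update its hypotheses.
    with open(path) as f:
        return [json.loads(line) for line in f]
```

An append-only line-per-record format is deliberate: concurrent jobs can log without coordination, and a truncated final line (from a crash) corrupts at most one record.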
Terminology
Validation Loss: A metric measuring model error on held-out validation data — lower is better, and it's the primary selection metric when comparing ML experiments.
Hyperparameter Tuning: The process of finding the optimal configuration values (learning rate, batch size, etc.) for a training run, typically requiring many experiments.
H100/H200: NVIDIA's high-end data center GPUs — the H200 has more memory and memory bandwidth, making it faster than the H100 for many workloads.
Grid Search: An exhaustive hyperparameter search strategy that tries all combinations in a defined grid — contrasted with more intelligent Bayesian or agent-driven search.
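For contrast with the agent-driven approach above, grid search in its entirety is a few lines: enumerate every combination and keep the best. The example search space and scoring function below are illustrative assumptions, not values from the report.

```python
import itertools

def grid_search(space, score):
    # Exhaustive search: evaluate every combination in the grid and
    # return the lowest-scoring configuration. Cost grows as the product
    # of the sizes of all axes, which is why it scales poorly.
    keys = list(space)
    best_cfg, best_score = None, float("inf")
    for values in itertools.product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        s = score(cfg)
        if s < best_score:
            best_cfg, best_score = cfg, s
    return best_cfg, best_score
```

Unlike the agent, grid search spends its budget uniformly: it cannot rule out a dead-end region early or concentrate experiments around a promising one.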