Anthropic's original take-home assignment, open-sourced
TL;DR Highlight
Anthropic open-sourced a performance optimization challenge they use internally for hiring — and Claude Opus 4.5 scored higher than top human candidates in a 2-hour window.
Who Should Read
Systems programmers interested in performance optimization challenges, and ML researchers tracking AI's capabilities on hard algorithmic problems.
Core Mechanics
- Anthropic uses a multi-stage performance optimization problem as a hiring filter — they open-sourced this challenge, allowing public benchmarking.
- Claude Opus 4.5 was run against the challenge with a 2-hour time limit and achieved a score higher than the best human candidates who had also completed it.
- The challenge involves real-world performance work: profiling, identifying bottlenecks, applying algorithmic and systems-level optimizations, measuring results.
- This is a meaningful benchmark because it's a real task with a measurable objective metric (performance improvement), not a subjective evaluation; a minimal scoring-harness sketch follows this list.
- The 2-hour constraint is significant — it's not unlimited time for the AI to brute-force approaches but a time-boxed task matching human evaluation conditions.
- Implication: for structured optimization tasks with clear success metrics, AI is now competitive with strong human candidates at the level Anthropic hires.
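To make the measurable-metric point concrete, here is a minimal scoring-harness sketch in Python: it times a baseline implementation against an optimized one on a fixed workload and reports the speedup as the score. The function names and workload are illustrative assumptions, not part of Anthropic's published challenge.

```python
# Minimal sketch of an objective scoring harness: time a baseline implementation
# against an optimized one on the same workload and report the speedup as the score.
# baseline_impl, optimized_impl, and make_workload are hypothetical stand-ins.
import statistics
import time


def make_workload():
    # Fixed workload so both implementations are measured on identical input.
    return list(range(1_000_000))


def baseline_impl(data):
    # Naive reference implementation (placeholder).
    return sorted(data, reverse=True)


def optimized_impl(data):
    # Candidate-optimized implementation (placeholder).
    return data[::-1]


def median_runtime(fn, data, runs=5):
    # Median of several runs reduces the impact of a single noisy timing.
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(data)
        times.append(time.perf_counter() - start)
    return statistics.median(times)


if __name__ == "__main__":
    workload = make_workload()
    base = median_runtime(baseline_impl, workload)
    opt = median_runtime(optimized_impl, workload)
    print(f"baseline {base:.4f}s, optimized {opt:.4f}s, speedup {base / opt:.2f}x")
```

Using the median of several runs keeps one noisy measurement from skewing the score, which matters when the score itself is the evaluation signal.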
Evidence
- Anthropic published the challenge publicly, enabling community verification of Claude's results by having others attempt the same problem.
- HN commenters noted the importance of the 'real task' framing: many AI benchmarks are gameable, but a performance optimization challenge with measurable output is harder to fake.
- Several engineers attempted the challenge and shared their scores, providing human comparison points confirming Claude's result is genuinely strong.
- Discussion of the implications for hiring: if AI can match strong candidates on technical screening tasks, what does that mean for the purpose of such screens?
How to Apply
- Run your own internal performance optimization challenges against Claude — the public Anthropic challenge provides a calibration baseline.
- For hiring: re-evaluate what your technical screens are testing if AI can now match human performance — the goal should shift to tasks requiring novel problem framing, not just execution.
- Use Claude for performance debugging sessions: profile first, describe the bottleneck, and use Claude to enumerate and prioritize optimization approaches before implementing (see the cProfile sketch after this list).
- Try the Anthropic challenge yourself first (before asking Claude to do it) — the process reveals where human expertise still adds unique value.
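As a sketch of the profile-first workflow suggested above, the following hypothetical Python snippet captures the top cProfile hotspots and packages them into a prompt asking Claude to enumerate and rank optimization approaches. `slow_pipeline` and the prompt wording are stand-ins for your own code and phrasing, not anything from Anthropic's challenge.

```python
# Minimal sketch of a profile-first loop: capture the top hotspots with cProfile,
# then paste the summary into a Claude prompt asking for ranked optimization ideas.
import cProfile
import io
import pstats


def slow_pipeline():
    # Placeholder for the real code path you want to optimize.
    total = 0
    for i in range(200_000):
        total += sum(int(ch) for ch in str(i))
    return total


profiler = cProfile.Profile()
profiler.enable()
slow_pipeline()
profiler.disable()

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)  # keep only the top 10 entries

prompt = (
    "Here is a cProfile summary of my hot path. "
    "List the likely bottlenecks and rank optimization approaches "
    "by expected payoff before I implement anything:\n\n" + buffer.getvalue()
)
print(prompt)
```

The point of this ordering is that the model reasons about measured hotspots rather than guessing where the program spends its time.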
Terminology
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study systematically showing that LLM-written TLA+ specifications usually pass syntax checks but match the real system's behavior (conformance) only about 46% of the time, highlighting the practical limits of AI-driven formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic released NLA, a technique that converts the numeric vectors (activations) inside an LLM into readable natural language, a new advance in interpretability research into what the model is actually thinking.
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and a PHP interpreter from scratch using only the documentation; even the best model passes 95%+ of tests on only 3% of the tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Split a request into three tickets and even Claude/GPT will simply write the security-vulnerable code 53-86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode safety as a single vector direction, and removing it disables safety fine-tuning.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.