Anthropic's original take-home assignment, open-sourced
TL;DR Highlight
Anthropic open-sourced a performance optimization challenge they use internally for hiring — and Claude Opus 4.5 scored higher than top human candidates in a 2-hour window.
Who Should Read
Systems programmers interested in performance optimization challenges, and ML researchers tracking AI's capabilities on hard algorithmic problems.
Core Mechanics
- Anthropic uses a multi-stage performance optimization problem as a hiring filter — they open-sourced this challenge, allowing public benchmarking.
- Claude Opus 4.5 was run against the challenge with a 2-hour time limit and achieved a score higher than the best human candidates who had also completed it.
- The challenge involves real-world performance work: profiling, identifying bottlenecks, applying algorithmic and systems-level optimizations, measuring results.
- This is a meaningful benchmark because it's a real task with a measurable objective metric (performance improvement), not a subjective evaluation; a minimal scoring-harness sketch follows this list.
- The 2-hour constraint is significant — it's not unlimited time for the AI to brute-force approaches but a time-boxed task matching human evaluation conditions.
- Implication: for structured optimization tasks with clear success metrics, AI is now competitive with strong human candidates at the level Anthropic hires.
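To make the measurable-metric point concrete, here is a minimal scoring-harness sketch in Python: it times a baseline implementation against an optimized one on a fixed workload and reports the speedup as the score. The function names and workload are illustrative assumptions, not part of Anthropic's published challenge.

```python
# Minimal sketch of an objective scoring harness: time a baseline implementation
# against an optimized one on the same workload and report the speedup as the score.
# baseline_impl, optimized_impl, and make_workload are hypothetical stand-ins.
import statistics
import time


def make_workload():
    # Fixed workload so both implementations are measured on identical input.
    return list(range(1_000_000))


def baseline_impl(data):
    # Naive reference implementation (placeholder).
    return sorted(data, reverse=True)


def optimized_impl(data):
    # Candidate-optimized implementation (placeholder).
    return data[::-1]


def median_runtime(fn, data, runs=5):
    # Median of several runs reduces the impact of a single noisy timing.
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(data)
        times.append(time.perf_counter() - start)
    return statistics.median(times)


if __name__ == "__main__":
    workload = make_workload()
    base = median_runtime(baseline_impl, workload)
    opt = median_runtime(optimized_impl, workload)
    print(f"baseline {base:.4f}s, optimized {opt:.4f}s, speedup {base / opt:.2f}x")
```

Using the median of several runs keeps one noisy measurement from skewing the score, which matters when the score itself is the evaluation signal.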
Evidence
- Anthropic published the challenge publicly, enabling community verification of Claude's results by having others attempt the same problem.
- HN commenters noted the importance of the 'real task' framing: many AI benchmarks are gameable, but a performance optimization challenge with measurable output is harder to fake.
- Several engineers attempted the challenge and shared their scores, providing human comparison points confirming Claude's result is genuinely strong.
- Discussion of the implications for hiring: if AI can match strong candidates on technical screening tasks, what does that mean for the purpose of such screens?
How to Apply
- Run your own internal performance optimization challenges against Claude — the public Anthropic challenge provides a calibration baseline.
- For hiring: re-evaluate what your technical screens are testing if AI can now match human performance — the goal should shift to tasks requiring novel problem framing, not just execution.
- Use Claude for performance debugging sessions: profile first, describe the bottleneck, and use Claude to enumerate and prioritize optimization approaches before implementing (see the cProfile sketch after this list).
- Try the Anthropic challenge yourself first (before asking Claude to do it) — the process reveals where human expertise still adds unique value.
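As a sketch of the profile-first workflow suggested above, the following hypothetical Python snippet captures the top cProfile hotspots and packages them into a prompt asking Claude to enumerate and rank optimization approaches. `slow_pipeline` and the prompt wording are stand-ins for your own code and phrasing, not anything from Anthropic's challenge.

```python
# Minimal sketch of a profile-first loop: capture the top hotspots with cProfile,
# then paste the summary into a Claude prompt asking for ranked optimization ideas.
import cProfile
import io
import pstats


def slow_pipeline():
    # Placeholder for the real code path you want to optimize.
    total = 0
    for i in range(200_000):
        total += sum(int(ch) for ch in str(i))
    return total


profiler = cProfile.Profile()
profiler.enable()
slow_pipeline()
profiler.disable()

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)  # keep only the top 10 entries

prompt = (
    "Here is a cProfile summary of my hot path. "
    "List the likely bottlenecks and rank optimization approaches "
    "by expected payoff before I implement anything:\n\n" + buffer.getvalue()
)
print(prompt)
```

The point of this ordering is that the model reasons about measured hotspots rather than guessing where the program spends its time.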
Terminology
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study systematically showing that LLM-written TLA+ specifications usually pass syntax checks but match the real system's behavior (conformance) only about 46% of the time, highlighting the practical limits of AI-driven formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic released NLA, a technique that converts the numeric vectors (activations) inside an LLM into readable natural language, a new advance in interpretability research into what the model is actually thinking.
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and a PHP interpreter from scratch using only the documentation; even the best model passes 95%+ of tests on only 3% of the tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Split a request into three tickets and even Claude/GPT will simply write the security-vulnerable code 53-86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode safety as a single vector direction, and removing it disables safety fine-tuning.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.