Where's the shovelware? Why AI coding claims don't add up
TL;DR Highlight
A 25-year veteran developer debunks AI coding tool productivity claims with real data and software release statistics. If AI truly gives 10x productivity, why isn't there a flood of new software?
Who Should Read
Developers using AI coding tools (Cursor, Copilot, Claude Code, etc.) who wonder 'am I really faster?' or engineering managers considering an AI-first transition.
Core Mechanics
- The METR study showed developers felt 20% faster with AI but were actually 19% slower. Developers are unreliable observers of their own productivity.
- The author coin-flipped between AI and manual work for six weeks, timing each task. AI came out 21% slower by median, and reaching statistical significance would have required roughly four more months of data; there was no dramatic 2x difference in either direction.
- Cursor claims 'extraordinary productivity,' GitHub Copilot says 'delegate like a boss,' Google claims 25% faster, and 14% of developers report 10x improvement. These claims fuel management FOMO.
- The author's core argument: if AI truly explodes productivity, Steam games, indie apps, and SaaS launches should surge — but actual release statistics show no such shovelware flood.
- These exaggerated claims cause real damage. Companies use 'AI-first' as a rationale to cut developer hiring, lower salaries, and compress project timelines to 20% of the original estimate.
- As of writing (Sept 2025), graphs of global software releases show no significant increase following AI adoption; this is the author's core empirical evidence.
- The author doesn't reject AI coding's dream itself, but warns that claimed productivity levels aren't supported by data, and management decisions based on them are dangerous.
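The shovelware check above boils down to simple arithmetic: if AI tools deliver anything like 10x productivity, aggregate release counts should jump, not creep. A minimal sketch of that sanity check follows; the monthly figures are illustrative placeholders, not the author's data, and would need to be replaced with real counts (e.g. new Steam games per month from a tracking site) to reproduce the argument.

```python
import statistics

# Hypothetical monthly release counts (e.g. new Steam games per month).
# Replace with real figures to reproduce the author's check for a
# post-ChatGPT "shovelware flood".
pre_ai = [450, 470, 460, 480, 490, 500]   # illustrative months before AI tools
post_ai = [495, 505, 490, 510, 500, 515]  # illustrative months after

def pct_change(before: list[int], after: list[int]) -> float:
    """Relative change in mean monthly releases (positive = more releases)."""
    return statistics.mean(after) / statistics.mean(before) - 1

if __name__ == "__main__":
    # A 10x productivity gain should show up as a change far larger
    # than single-digit percentages.
    print(f"change in mean monthly releases: {pct_change(pre_ai, post_ai):+.1%}")
```

Even generous single-digit growth in such a series is nowhere near what a genuine order-of-magnitude productivity gain would produce.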
Evidence
- One developer shared that their manager declared 'we are AI-first' and compressed project timelines to 20% of the original estimate. SVP- and PM-level AI mania is at unprecedented levels: a stark gap between reality and executive expectations.
- Many comments took a middle ground: AI is dramatically useful only for specific tasks. It is clearly faster for learning new APIs, scaffolding, mock creation, and codebase exploration, but makes no difference, or is a net negative, for modifications to large existing codebases.
- A former CTO countered with 'an infinite-x improvement for me': someone who had stopped coding hands-on could build large systems again thanks to Claude Code, noting that professional developers and non-developers have completely different experiences.
- Some argued the data (which runs through March-April) was gathered too early, since agentic coding only became usable around April-May 2025. As a specific counterexample, one developer used agents to build a CLI tool in a day that would otherwise have taken multiple weeks.
- Paul Krugman's research on internet productivity was cited: the internet likewise produced no significant change in productivity statistics for 25 years after its debut, suggesting AI's economic effects may not show up until the 2030s.
How to Apply
- To measure the impact of AI coding tools on your team, borrow the METR study's methodology: flip a coin to assign each task to AI or non-AI and record the actual time spent. Use data, not feelings, to avoid self-deception.
- Don't apply AI tools uniformly. Focus them on tasks with proven effectiveness (scaffolding, mock writing, new-API exploration, boilerplate generation) and stick with manual work for complex modifications to existing code.
- When executives try to compress timelines citing AI productivity, negotiate realistic schedules using the METR results (perceived +20% vs. actual -19%) and software release statistics as evidence.
- Multiple comments flagged juniors uncritically accepting AI-generated code. During code review, check AI-generated code especially for style-guide compliance, use of existing libraries, and PR size.
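The coin-flip methodology in the first bullet can be sketched as a small self-experiment harness. The task times below are illustrative, not the author's data, and the permutation test is a generic significance check standing in for whatever analysis the METR study actually used.

```python
import random
import statistics

def assign_condition(rng: random.Random) -> str:
    """Coin-flip assignment before starting each task (METR-style)."""
    return rng.choice(["ai", "manual"])

def median_slowdown(ai_times: list[float], manual_times: list[float]) -> float:
    """Relative difference of median completion times (positive = AI slower)."""
    return statistics.median(ai_times) / statistics.median(manual_times) - 1

def permutation_p_value(ai_times: list[float], manual_times: list[float],
                        n_iter: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference of medians."""
    rng = random.Random(seed)
    observed = abs(statistics.median(ai_times) - statistics.median(manual_times))
    pooled = list(ai_times) + list(manual_times)
    k = len(ai_times)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = abs(statistics.median(pooled[:k]) - statistics.median(pooled[k:]))
        if diff >= observed:
            hits += 1
    return hits / n_iter

if __name__ == "__main__":
    # Illustrative minutes per task, not real measurements.
    ai = [95, 120, 60, 150, 80, 110]
    manual = [70, 100, 55, 130, 75, 90]
    print(f"median slowdown with AI: {median_slowdown(ai, manual):+.0%}")
    print(f"permutation p-value: {permutation_p_value(ai, manual):.3f}")
```

With only a handful of tasks the p-value will usually be large, which matches the author's observation that a small solo experiment needs months of data before a ~20% effect becomes statistically significant.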
Terminology
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study systematically showing that when LLMs write TLA+ specifications they pass syntax checks easily, but behavioral conformance with the real system reaches only about 46%, exposing the practical limits of AI-based formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic released NLA, a technique that converts the numeric vectors (activations) inside an LLM into directly readable natural language. A new advance in interpretability research into what AI is actually "thinking."
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and a PHP interpreter from scratch using only the documentation; even the best model achieved 95%+ pass rates on only 3% of all tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Split a request across three tickets and even Claude/GPT will simply write security-vulnerable code 53-86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode safety as a single vector direction, and removing it disables safety fine-tuning.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.