Further human + AI + proof assistant work on Knuth's "Claude Cycles" problem
TL;DR Highlight
A post describing how human experts, AI (LLMs), and formal proof assistants such as Lean worked together on the 'Claude Cycles' problem posed by computer scientist Donald Knuth. It is presented as evidence that AI can contribute meaningfully to mathematical research.
Who Should Read
Developers and researchers curious about how far AI can be used in mathematical reasoning or formal verification, especially those interested in proof assistants like Lean or Coq, or in AI for mathematics.
Core Mechanics
- This post covers a collaborative approach using human mathematicians + LLMs (large language models) + formal proof assistants (e.g., Lean) to solve a mathematical problem called 'Claude Cycles' posed by legendary computer scientist Donald Knuth.
- The original tweet could not be loaded (the page requires JavaScript), so this summary relies on community comments and context. These indicate that the post reports further progress beyond previous work, and that this kind of human-AI collaboration is producing real results in pure mathematics research.
- LLMs are characterized as strong at 'broad but shallow search' — meaning that when an expert sets the direction, LLMs excel at rapidly exploring a wide possibility space and proposing candidate ideas.
- Formal proof assistants like Lean and Coq are software tools that allow mathematical proofs to be written in a machine-verifiable form. Using these tools to verify proof ideas suggested by AI can reliably filter out errors.
- Some in the community predicted that AlphaGo-style reinforcement learning (RL) applied to Lean's syntax tree will eventually prove more powerful than LLMs, on the grounds that such RL can reason over much longer time scales.
- There was an observation that a professional mathematician's toolkit consists of roughly 10 core tricks, and if these tricks could be encoded as latent vectors (abstract representations inside AI models), AI could greatly accelerate mathematical research.
- Alongside the optimism there was a sober assessment: AI handles 'repetitive expert-level tasks' well when guided by specialists, but still has blind spots on truly difficult and complex problems.
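To make the point about machine-verifiable proofs concrete, here is a minimal Lean 4 sketch. It is a toy statement chosen for illustration and is unrelated to the Claude Cycles problem itself; the theorem name `toy_add_comm` is hypothetical, while `Nat.add_comm` is a lemma from Lean's core library.

```lean
-- Toy example (not from the original post): Lean accepts this only
-- because the supplied term really proves the stated equation.
theorem toy_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A concrete arithmetic fact, checked by computation.
example : 2 + 2 = 4 := rfl

-- Replacing 4 with 5 above would make Lean reject the file,
-- which is how an incorrect AI-suggested step gets filtered out.
```

The checker only confirms that a proof matches its statement. Whether the statement is the one you actually care about still needs human review.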
Evidence
- A witty comment went viral suggesting that "AI will win a Fields Medal (the highest honor in mathematics) before it takes on the role of a McDonald's manager." The argument is that while doing mathematics may seem like using a brain as a hammer to tighten a screw, LLMs' strength in "broad and shallow search" actually makes them a good fit for mathematical research.
- One commenter predicted that AlphaGo-style reinforcement learning applied to Lean's syntax tree will become the dominant approach instead of LLMs, because RL-based methods can search over much longer time scales and are better suited to complex proofs.
- A more measured comment noted that it is unsurprising AI performs well when guided by experts: AI handles experts' "lazy work" effectively, but still has blind spots on truly hard problems.
- One commenter said it was hard to tell whether the thread participants were bots or humans, a meta-observation on how deeply AI has become involved in mathematics community discussions.
- Others wondered "if anyone would tackle P≠NP this way" or asked practical questions like "what does this mean for ordinary people?", suggesting that this type of research still largely remains within specialist communities.
How to Apply
- When you need to verify a mathematical proof or the correctness of an algorithm, you can build a two-stage pipeline: generate draft proof ideas with an LLM, then check them with a proof assistant like Lean or Coq to catch errors mechanically.
- Rather than trying to solve complex math problems with an LLM alone, split the roles: a domain expert (or an expert-level prompt) sets the direction and the LLM explores candidate paths. This tends to yield more reliable results.
- If you are interested in combining AlphaGo-style RL with formal proof tools, use DeepMind's AlphaProof and related papers as references and experiment with reinforcement learning agents in the Lean environment. The field is currently advancing rapidly.
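The two-stage pipeline described above can be sketched as a propose-then-verify loop. This is a minimal illustration under stated assumptions, not the method used in the original post: `toy_proposer` is a hypothetical stand-in for the LLM, and `toy_verifier` is a hypothetical stand-in for the proof assistant that only runs a finite test, whereas a real setup would hand each candidate to Lean or Coq.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# A candidate pairs a human-readable label with the formula it proposes.
Candidate = Tuple[str, Callable[[int], int]]


def propose_then_verify(
    candidates: Iterable[Candidate],
    verify: Callable[[Candidate], bool],
) -> Optional[Candidate]:
    """Stage 2 filter: return the first candidate the verifier accepts, else None."""
    for candidate in candidates:
        if verify(candidate):
            return candidate
    return None


def toy_proposer() -> List[Candidate]:
    """Hypothetical stand-in for stage 1 (the LLM): plausible guesses, some wrong."""
    return [
        ("n * n", lambda n: n * n),
        ("n * (n - 1) // 2", lambda n: n * (n - 1) // 2),
        ("n * (n + 1) // 2", lambda n: n * (n + 1) // 2),
    ]


def toy_verifier(candidate: Candidate) -> bool:
    """Hypothetical stand-in for the proof assistant.

    Compares the proposed closed form for 0 + 1 + ... + n with a direct sum
    for n < 50. This is only a finite test; a real proof assistant would
    check a proof that covers every n.
    """
    _, formula = candidate
    return all(formula(n) == sum(range(n + 1)) for n in range(50))


accepted = propose_then_verify(toy_proposer(), toy_verifier)
print(accepted[0] if accepted else "no candidate verified")
```

The first two guesses are rejected by the verifier and only the third is returned. Keeping `verify` as a plain callable means the toy check can later be swapped for a call into a real proof assistant without changing the loop.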