Hamilton-Jacobi-Bellman Equation: Reinforcement Learning and Diffusion Models
TL;DR Highlight
A math blog post showing how 1840s physics equations connect modern RL and Diffusion Models, explaining that continuous-time RL and generative model training are two faces of the same optimal control problem.
Who Should Read
ML researchers or graduate students who want a deep mathematical understanding of reinforcement learning or Diffusion Models — especially developers familiar with discrete-time Bellman equations but unfamiliar with their extension to continuous time.
Core Mechanics
- Richard Bellman published his Dynamic Programming theory in 1952, and later discovered that extending it to continuous time yields a mathematically identical structure to the Hamilton-Jacobi equation from 1840s physics. In other words, the core equation of modern RL is essentially a rediscovery of a classical mechanics equation.
- Taking the time step h to zero in the discrete-time Bellman equation yields the HJB (Hamilton-Jacobi-Bellman) equation, a PDE. Its key structure is 'immediate reward + gradient of future value × rate of system change = 0', maximized over actions and with the value's time-derivative (or discount) term included. This expresses the optimal control condition as a differential equation.
- In stochastic systems with noise (Itô processes, i.e., dynamics mixed with Brownian Motion), Itô's formula adds a second-order term, ½ tr(σσᵀ ∇²V), to the HJB equation, where V is the value function. The term reduces to a scaled Laplacian when σσᵀ is a multiple of the identity. This is the fundamental difference from deterministic systems.
- Policy Iteration in continuous-time RL consists of two steps. In the Policy Evaluation step, the Feynman-Kac formula is used to estimate the value function of the current policy via Monte Carlo. In the Policy Improvement step, a better action is found using the gradient of the estimated value function.
- Model-Free continuous-time Q-learning is also introduced. A neural network approximates the Q-function (the value of state-action pairs) and is trained to satisfy the continuous-time analogue of the TD error condition from discrete-time Q-learning.
- Practical examples include the Stochastic LQR (linear dynamics + quadratic cost + noise) and the Merton Portfolio (continuous-time portfolio optimization) problems. Both yield closed-form solutions when the HJB is solved, making them useful for algorithm validation.
- Diffusion Model training can be interpreted as a stochastic optimal control problem. Viewing the reverse process from noise to data as a control problem reveals that the score function (the gradient of the log-density of the noised data distribution) corresponds to the optimal control input.
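The limit described in the HJB bullets above can be written out as a short derivation. This is a sketch under assumed notation that does not appear in the summary: state x, action a, drift f(x, a), reward rate r(x, a), finite-horizon value V(t, x), and step size h. In the stochastic case the Bellman step is understood in expectation.

```latex
% Assumed notation (not from the source): dynamics \dot{x} = f(x,a),
% reward rate r(x,a), finite-horizon value V(t,x), time step h.
\begin{align*}
V(t,x) &= \max_{a}\Big[\, r(x,a)\,h + V\big(t+h,\; x + f(x,a)\,h\big) \Big] + o(h)
  && \text{(discrete-time Bellman)} \\
V\big(t+h,\; x + f\,h\big) &= V(t,x) + \partial_t V\,h + \nabla_x V \cdot f\,h + o(h)
  && \text{(first-order Taylor expansion)} \\
0 &= \partial_t V + \max_{a}\Big[\, r(x,a) + \nabla_x V \cdot f(x,a) \Big]
  && \text{(cancel } V,\ \text{divide by } h,\ h \to 0\text{)} \\
0 &= \partial_t V + \max_{a}\Big[\, r + \nabla_x V \cdot f
  + \tfrac{1}{2}\operatorname{tr}\!\big(\sigma\sigma^{\top}\nabla_x^{2} V\big) \Big]
  && \text{(with } dX = f\,dt + \sigma\,dW\text{, via It\^o's formula)}
\end{align*}
```

The last line differs from the deterministic one only by the second-order trace term, which comes from the quadratic variation of the Brownian increment.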
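The Policy Evaluation step above can be illustrated on a toy problem where the answer is known. The following is a minimal sketch, not the post's implementation. It assumes a one-dimensional closed-loop system dX = -theta·X dt + sigma dW under a fixed policy, a running reward of -x², no discounting, and an Euler-Maruyama discretization. All function and parameter names are hypothetical.

```python
import math
import random


def closed_form_value(x0, theta, sigma, T):
    """Exact V(0, x0) = E[integral_0^T -X_t^2 dt] for dX = -theta*X dt + sigma dW."""
    a = (1.0 - math.exp(-2.0 * theta * T)) / (2.0 * theta)
    expected_cost = x0 ** 2 * a + sigma ** 2 / (2.0 * theta) * (T - a)
    return -expected_cost


def feynman_kac_estimate(x0, theta, sigma, T, n_steps=100, n_paths=5000, seed=0):
    """Monte Carlo value estimate: average the accumulated reward over sampled paths."""
    rng = random.Random(seed)
    dt = T / n_steps
    sqrt_dt = math.sqrt(dt)
    total = 0.0
    for _path in range(n_paths):
        x = x0
        reward_sum = 0.0
        for _step in range(n_steps):
            reward_sum += -x * x * dt  # running reward (left Riemann sum)
            # Euler-Maruyama step of the closed-loop SDE (assumed dynamics)
            x += -theta * x * dt + sigma * sqrt_dt * rng.gauss(0.0, 1.0)
        total += reward_sum
    return total / n_paths


if __name__ == "__main__":
    exact = closed_form_value(1.0, 1.0, 0.5, 1.0)
    estimate = feynman_kac_estimate(1.0, 1.0, 0.5, 1.0)
    print(f"closed form: {exact:.4f}, Monte Carlo: {estimate:.4f}")
```

The estimate uses path sampling only; no PDE is solved. The gap to the closed form comes from Monte Carlo noise and time-discretization bias, both of which shrink with more paths and smaller steps.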
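The diffusion bullet above can be made concrete with a case where the score is known exactly. This is a sketch under stated assumptions, not the post's construction: the data are one-dimensional Gaussian N(m, s²), the forward noising SDE is dX = -½·beta·X dt + sqrt(beta) dW with constant beta, and the reverse SDE is integrated with Euler-Maruyama. The term `beta * score(...)` is the extra drift that plays the role of the control input. All names are hypothetical.

```python
import math
import random
import statistics


def marginal_mean_var(t, m, s, beta):
    """Mean and variance of X_t for X_0 ~ N(m, s^2) under the forward SDE."""
    decay = math.exp(-beta * t)
    return m * math.sqrt(decay), s * s * decay + (1.0 - decay)


def score(x, t, m, s, beta):
    """Gradient of the log-density of the noised marginal p_t (exact for Gaussian data)."""
    mean, var = marginal_mean_var(t, m, s, beta)
    return -(x - mean) / var


def sample_reverse_sde(m, s, beta=1.0, T=2.0, n_steps=200, n_paths=2000, seed=0):
    """Integrate the reverse SDE from time T down to 0, starting from the exact p_T."""
    rng = random.Random(seed)
    dt = T / n_steps
    mean_T, var_T = marginal_mean_var(T, m, s, beta)
    samples = []
    for _path in range(n_paths):
        x = rng.gauss(mean_T, math.sqrt(var_T))
        for k in range(n_steps):
            t = T - k * dt
            control = beta * score(x, t, m, s, beta)  # score-driven drift correction
            x += (0.5 * beta * x + control) * dt + math.sqrt(beta * dt) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


if __name__ == "__main__":
    xs = sample_reverse_sde(m=2.0, s=0.5)
    print(f"sample mean {statistics.mean(xs):.3f}, sample std {statistics.pstdev(xs):.3f}")
```

With the exact score, the generated samples should approximately recover the data mean and standard deviation. In a trained model the score would be a learned network, and the starting distribution would typically be a standard normal rather than the exact p_T.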
Evidence
- A commenter who identified themselves as a beginner in RL asked, 'This post is beyond my level; are there good books or resources with step-by-step implementations using ML libraries?' This is evidence that the post demands a substantial mathematical background.
- A person who studied control theory as an electrical engineering undergraduate commented that they were glad 'the mathematics of control theory has remained useful for so long,' suggesting that readers with a control theory background will find the content much more approachable.
- A general software engineer expressed anxiety, saying they felt 'completely overwhelmed by mathematicians' and weren't sure 'if the software field will survive in five years, like being an ice seller when the refrigerator is about to be invented.' This illustrates how high the mathematical difficulty is for the average developer.
- One commenter raised a fundamental issue: 'it is not clear why continuous-time mathematics applies to digital computers.' They pointed out that real numbers are defined as equivalence classes of Dedekind cuts or Cauchy sequences, while digital computers only handle finite bit strings, making it far from obvious that analysis equations requiring infinite precision directly correspond to algorithms. They criticized this as a problem that is always 'swept under the rug' in numerical analysis.
- Comments also reported formatting bugs (Bellman equation labels overlapping with formulas), stray quotation marks mixed into HJB equations, and a 'suggest correction' link returning a 404 error, indicating the blog post is still not fully polished.
How to Apply
- When working with continuous-time RL environments (robotics, financial portfolio optimization, etc.), applying the continuous-time Q-learning described in this post instead of discrete-time Q-learning enables learning that is less sensitive to the choice of time interval. In particular, the Merton Portfolio example can be used directly as a baseline validation for financial reinforcement learning projects.
- When customizing Diffusion Models or researching new sampling algorithms, reinterpreting the reverse SDE as a stochastic optimal control problem offers a new perspective on score function design. Refer to the Diffusion Models section of this post to examine what control problem your model's score matching objective is solving.
- When implementing Neural Policy Iteration that approximates the value function with a neural network, applying the Feynman-Kac Monte Carlo technique from this post to the Evaluation step allows value estimation through path sampling alone, without directly solving the PDE. It is recommended to first validate on a simple problem with an analytic solution like LQR before extending to more complex environments.
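For the baseline validation suggested above, a closed-form target is useful. The sketch below assumes one common variant of the Merton problem, which may differ from the one in the post: a single risky asset following geometric Brownian motion, a risk-free rate, CRRA utility of terminal wealth with risk aversion gamma (not equal to 1), and no consumption. Under those assumptions the optimal policy is the constant fraction (mu - r) / (gamma · sigma²). A grid search over constant fractions should agree with it. All names are hypothetical.

```python
def merton_fraction(mu, r, sigma, gamma):
    """Closed-form optimal risky-asset fraction under the stated assumptions."""
    return (mu - r) / (gamma * sigma ** 2)


def certainty_equivalent_growth(pi, mu, r, sigma, gamma):
    """Growth rate whose maximizer also maximizes expected CRRA utility
    when a constant fraction pi is held in the risky asset."""
    return r + pi * (mu - r) - 0.5 * gamma * pi ** 2 * sigma ** 2


def best_constant_fraction(mu, r, sigma, gamma, grid_size=201, pi_max=2.0):
    """Brute-force search over constant fractions in [0, pi_max]."""
    grid = [pi_max * i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda pi: certainty_equivalent_growth(pi, mu, r, sigma, gamma))


if __name__ == "__main__":
    params = dict(mu=0.08, r=0.02, sigma=0.2, gamma=2.0)
    print("closed form:", merton_fraction(**params))
    print("grid search:", best_constant_fraction(**params))
```

A learned policy for the same setup can be compared against `merton_fraction` in the same way: if the agent's average risky-asset fraction drifts far from the closed-form value, the training loop is a more likely culprit than the environment.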