Hamilton-Jacobi-Bellman Equation: Reinforcement Learning and Diffusion Models
TL;DR Highlight
A math blog post showing how 1830s physics equations connect modern RL and Diffusion Models, explaining that continuous-time RL and generative model training are two faces of the same optimal control problem.
Who Should Read
ML researchers or graduate students who want a deep mathematical understanding of reinforcement learning or Diffusion Models — especially developers familiar with discrete-time Bellman equations but unfamiliar with their extension to continuous time.
Core Mechanics
- Richard Bellman published his theory of Dynamic Programming in 1952, and later discovered that extending it to continuous time yields a structure mathematically identical to the Hamilton-Jacobi equation from 1830s physics. In other words, the core equation of modern RL is essentially a rediscovery of a classical mechanics equation.
- Taking the time interval h to zero in the discrete-time Bellman equation yields the HJB (Hamilton-Jacobi-Bellman) equation in PDE form. The key structure is: 'immediate reward + gradient of future value × rate of system change = 0' — expressing the optimal control condition as a differential equation.
- In stochastic systems with noise (Itô processes, i.e., dynamics mixed with Brownian Motion), Itô's formula adds a second-order diffusion term, ½ tr(σσᵀ∇²V), to the HJB equation — this reduces to a scaled Laplacian only when the noise is isotropic. This term is the fundamental difference from deterministic systems.
- Policy Iteration in continuous-time RL consists of two steps. In the Policy Evaluation step, the Feynman-Kac formula is used to estimate the value function of the current policy via Monte Carlo. In the Policy Improvement step, a better action is found using the gradient of the estimated value function.
- Model-free continuous-time Q-learning is also introduced. The Q-function (the value of state-action pairs) is learned directly by training a neural network to satisfy the continuous-time analogue of the TD-error condition from discrete-time Q-learning.
- Practical examples include the Stochastic LQR (linear dynamics + quadratic cost + noise) and the Merton Portfolio (continuous-time portfolio optimization) problems. Both yield closed-form solutions when the HJB is solved, making them useful for algorithm validation.
- Diffusion Model training can be interpreted as a stochastic optimal control problem. Viewing the reverse process from noise to data as a control problem reveals that the score function (the log-gradient of the data distribution) corresponds to the optimal control input.
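The limiting argument in the bullets above can be written out explicitly. A sketch in standard notation (value function $V$, reward $r$, drift $f$, diffusion coefficient $\sigma$), with discounting omitted for brevity:

```latex
% Discrete-time Bellman equation over a small step h:
\[
V(x,t) = \max_a \Big\{ r(x,a)\,h + V\big(x + f(x,a)\,h,\; t+h\big) \Big\} + o(h).
\]
% Taylor-expanding V, subtracting V(x,t), dividing by h, and letting h -> 0:
\[
\partial_t V(x,t) + \max_a \Big[\, r(x,a) + \nabla_x V(x,t) \cdot f(x,a) \,\Big] = 0.
\]
% For the Ito process dX_t = f(X_t,a_t) dt + sigma(X_t) dW_t,
% Ito's formula contributes the second-order trace term:
\[
\partial_t V + \max_a \Big[\, r + \nabla_x V \cdot f
  + \tfrac{1}{2}\,\mathrm{tr}\!\big(\sigma\sigma^{\top}\,\nabla_x^2 V\big) \,\Big] = 0.
\]
```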
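The control-theoretic reading of Diffusion Models in the last bullet can be stated as a pair of SDEs (Anderson's reverse-time SDE; the notation here is standard, not taken verbatim from the post):

```latex
% Forward (noising) process:
\[
dX_t = f(X_t, t)\,dt + g(t)\,dW_t .
\]
% Reverse (denoising) process, run backward in time:
\[
dX_t = \big[\, f(X_t, t) - g(t)^2\, \nabla_x \log p_t(X_t) \,\big]\,dt
       + g(t)\, d\bar{W}_t .
\]
```

The score $\nabla_x \log p_t$ enters the reverse drift exactly as a feedback control input steering noise back toward the data distribution, so learning it by score matching can be read as solving the associated stochastic optimal control problem.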
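The Feynman-Kac step in Policy Evaluation can be illustrated on a toy problem. A minimal sketch, not the post's implementation: the function name `feynman_kac_mc` and the test PDE (the backward heat equation, whose solution is known in closed form) are my own choices for validation.

```python
import numpy as np

def feynman_kac_mc(x0, t0, T, sigma, g, n_paths=20_000, n_steps=200, seed=0):
    """Monte Carlo estimate of V(x0, t0) = E[g(X_T) | X_{t0} = x0]
    for dX = sigma dW, via Euler-Maruyama path simulation."""
    rng = np.random.default_rng(seed)
    dt = (T - t0) / n_steps
    x = np.full(n_paths, float(x0))
    for _ in range(n_steps):
        x += sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
    return g(x).mean()

# Check against the closed form V(x, t) = x^2 + sigma^2 (T - t),
# the exact solution of V_t + (sigma^2 / 2) V_xx = 0 with V(x, T) = x^2.
x0, t0, T, sigma = 1.0, 0.0, 1.0, 0.5
est = feynman_kac_mc(x0, t0, T, sigma, g=lambda x: x**2)
exact = x0**2 + sigma**2 * (T - t0)
print(est, exact)  # estimate should be close to 1.25
```

The same pattern — simulate paths of the policy's SDE, average a cost functional — estimates the value function in Policy Evaluation without ever discretizing the PDE on a grid.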
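The LQR bullet's point about closed-form validation can be made concrete in a few lines. A sketch of the scalar deterministic case only (the parameter values are illustrative, not from the post): the HJB ansatz V(x) = p x² reduces the PDE to an algebraic Riccati equation, and the simulated closed-loop cost should match V(x0).

```python
import numpy as np

# Scalar LQR: dx = (a x + b u) dt, cost = integral of (q x^2 + r u^2) dt.
# The ansatz V(x) = p x^2 turns the HJB equation into
#   q + 2 a p - b^2 p^2 / r = 0,   with optimal feedback u = -(b p / r) x.
a, b, q, r = -1.0, 1.0, 1.0, 1.0
p = r * (a + np.sqrt(a**2 + b**2 * q / r)) / b**2
k = b * p / r  # feedback gain

# Sanity check 1: p solves the Riccati equation.
assert abs(q + 2 * a * p - b**2 * p**2 / r) < 1e-10

# Sanity check 2: simulated closed-loop cost matches V(x0) = p x0^2.
x0, dt = 1.0, 1e-4
x, cost = x0, 0.0
for _ in range(200_000):  # integrate to t = 20; the state has decayed to ~0
    u = -k * x
    cost += (q * x**2 + r * u**2) * dt
    x += (a * x + b * u) * dt
print(cost, p * x0**2)  # the two values should agree closely
```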
Evidence
- A commenter who identified themselves as a beginner in RL asked, "This post is beyond my level — are there good books or resources with step-by-step implementations using ML libraries?" This is evidence that the post demands a substantial mathematical background.
- A reader who studied control theory as an electrical engineering undergraduate commented that they were glad "the mathematics of control theory has remained useful for so long," suggesting that readers with a control theory background will find the content much more approachable.
- A general software engineer honestly expressed anxiety, saying they felt "completely overwhelmed by mathematicians" and weren't sure "if the software field will survive in five years — like being an ice seller when the refrigerator is about to be invented," illustrating how high the mathematical difficulty is for the average developer.
- One commenter raised a fundamental issue: "it is not clear why continuous-time mathematics applies to digital computers." They pointed out that real numbers are defined as equivalence classes of Dedekind cuts or Cauchy sequences, while digital computers only handle finite bit strings, making it far from obvious that analysis equations requiring infinite precision directly correspond to algorithms, criticizing this as a problem that is always "swept under the rug" in numerical analysis.
- Comments also reported formatting bugs (Bellman equation labels overlapping with formulas), stray quotation marks mixed into the HJB equations, and a "suggest correction" link returning a 404 error, indicating the blog post is still not fully polished.
How to Apply
- When working with continuous-time RL environments (robotics, financial portfolio optimization, etc.), applying the continuous-time Q-learning described in this post instead of discrete-time Q-learning enables learning that is less sensitive to the choice of time interval. In particular, the Merton Portfolio example can be used directly as a baseline validation for financial reinforcement learning projects.
- When customizing Diffusion Models or researching new sampling algorithms, reinterpreting the reverse SDE as a stochastic optimal control problem offers a new perspective on score function design. Refer to the Diffusion Models section of the post to examine what control problem your model's score matching objective is solving.
- When implementing Neural Policy Iteration (approximating the value function with a neural network), apply the Feynman-Kac Monte Carlo technique to the Evaluation step to estimate values through path sampling alone, without directly solving the PDE.
- Validate first on a simple problem with an analytic solution, such as LQR, before extending to more complex environments.
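Using the Merton problem as a baseline check can be very lightweight. A minimal sanity-check sketch, assuming the standard Merton setup (risky drift `mu`, risk-free rate `rf`, volatility `sigma`, CRRA coefficient `gamma`); the parameter values are illustrative, not from the post. The HJB solution gives a constant optimal risky fraction, and a grid search over constant-fraction strategies should recover it:

```python
import numpy as np

# Merton portfolio with CRRA utility: the HJB solution gives the constant
# optimal risky fraction  pi* = (mu - rf) / (gamma * sigma^2).
# Check: the certainty-equivalent growth rate of a constant-fraction strategy,
#   g(pi) = rf + pi (mu - rf) - 0.5 gamma pi^2 sigma^2,
# is a concave quadratic whose maximizer on a grid should coincide with pi*.
mu, rf, sigma, gamma = 0.08, 0.02, 0.2, 2.0
pi_star = (mu - rf) / (gamma * sigma**2)

grid = np.linspace(0.0, 2.0, 2001)
growth = rf + grid * (mu - rf) - 0.5 * gamma * grid**2 * sigma**2
print(pi_star, grid[np.argmax(growth)])  # both should be about 0.75
```

A learned financial RL policy can be compared against `pi_star` in the same way before moving to settings without a closed form.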
Terminology
HJB (Hamilton-Jacobi-Bellman) equation: A partial differential equation that the optimal action strategy must satisfy in an optimal control problem. It is the continuous-time extension of the discrete-time Bellman equation, obtained by taking the time interval to zero.
Itô process: A stochastic system that evolves over time as deterministic drift plus Brownian Motion (random perturbations). It is used to model financial asset prices or the noisy movement of robots.
Feynman-Kac formula: A method for expressing the solution of a PDE as the expected value over stochastic paths (samples). It enables Monte Carlo estimation through simulation instead of solving the equation directly.
Score Function: The gradient of the log probability density function of a data distribution (∇ log p(x)). It serves as the key signal indicating the direction of denoising in Diffusion Models, and corresponds to the optimal control input from an optimal control perspective.
LQR (Linear Quadratic Regulator): A control problem with linear dynamics and a quadratic cost function. This combination yields the optimal control input cleanly as a linear function of the state, making it a frequently used benchmark problem for algorithm validation.
Dynamic Programming Principle (DPP): The principle that an optimal strategy can be decomposed into "the current choice plus the optimal strategy going forward." It is the core idea behind the derivation of the HJB equation and the theoretical foundation of all RL algorithms including Q-learning.