MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU
TL;DR Highlight
MegaTrain is a system that uses CPU memory as primary storage and the GPU solely as a compute engine, enabling full-precision training of 120B-parameter models on a single H200 GPU.
Who Should Read
ML engineers and researchers blocked from large-model training by insufficient GPU VRAM, especially developers who need to fine-tune or experiment in a single-GPU environment.
Core Mechanics
- MegaTrain adopts a 'memory-centric' design, storing model parameters and optimizer states in CPU host memory rather than GPU VRAM. The GPU acts as a 'transient compute engine,' receiving and processing data only when computation is required.
- Parameters are streamed from CPU to GPU layer by layer, and the computed gradients are sent back to the CPU. This minimizes the data resident on the GPU, dramatically reducing VRAM usage.
- To address CPU-GPU bandwidth bottlenecks, a 'pipelined double-buffered' execution engine was introduced. Utilizing multiple CUDA streams, parameter prefetching, actual computation, and gradient offloading are executed concurrently, keeping the GPU constantly busy.
- Instead of maintaining the existing autograd graph, 'stateless layer templates' were introduced, dynamically binding parameters as they are streamed in. This eliminates the need to keep graph metadata on the GPU and increases scheduling flexibility.
- Stable training of models up to 120B parameters was validated on an H200 GPU paired with 1.5TB of CPU memory. On a 14B model, MegaTrain achieved 1.84x higher training throughput than DeepSpeed ZeRO-3 with CPU offloading.
- It was also confirmed that a 7B model can be trained with a 512k token context on a single GH200 (GPU-CPU integrated architecture) device. This is significant as it enables ultra-long context learning on a single node.
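The pipelined double-buffered streaming described above can be sketched conceptually in pure Python (no CUDA): a background worker plays the role of the prefetch stream, fetching the next layer's parameters while the main thread computes on the current layer. All names here (`fetch_params`, `compute_layer`, `run_pipeline`) are illustrative stand-ins, not MegaTrain's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_params(layer_id):
    # Stand-in for a host-to-device (CPU -> GPU) parameter copy.
    return {"layer": layer_id, "weights": [layer_id] * 4}

def compute_layer(params, x):
    # Stand-in for the forward computation of one layer.
    return x + sum(params["weights"])

def run_pipeline(num_layers, x):
    # Double buffering: while layer i is computing, layer i+1's
    # parameters are already being fetched in the background,
    # mimicking overlapping CUDA streams.
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        future = prefetcher.submit(fetch_params, 0)   # prime buffer A
        for i in range(num_layers):
            params = future.result()                  # wait for current layer
            if i + 1 < num_layers:                    # start filling buffer B
                future = prefetcher.submit(fetch_params, i + 1)
            x = compute_layer(params, x)              # overlaps next prefetch
    return x

print(run_pipeline(3, 0))  # layers 0,1,2 contribute 0+4+8 -> prints 12
```

In the real system the fetch would be an asynchronous `cudaMemcpyAsync` on a dedicated stream; the scheduling pattern (fetch ahead by one, then wait) is the same.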
Evidence
- One user with an RTX 3080 (10GB VRAM) reported hitting OOM errors when training models beyond roughly 40M~50M parameters, and responded positively, saying this approach would let them train much larger models locally on a PC with ample CPU RAM.
- One commenter argued the idea itself isn't novel and criticized its practical speed. They claimed roughly 1,000 tok/s on a 4090 with a similar method, versus the paper's reported 341 tok/s on a single 3090 for a 14B model, emphasizing that this is still too slow for practical pretraining.
- The same commenter mentioned additional optimization techniques not mentioned in the paper, such as accumulating gradients directly into the optimizer state instead of offloading them separately, using the Muon optimizer which uses half the VRAM compared to Adam, and applying 4-bit quantization to both parameters and optimizer states.
- There was a cynical response to the H200 + 1.5TB host memory requirement: 'while it's a single GPU, it's by no means a lightweight setup.' The commenter noted this is hardly an accessible environment for the average developer.
- Some saw similarity to PyTorch's FSDP (Fully Sharded Data Parallel). Specifically, commenters asked how much of this approach could be reproduced with the `torch.distributed.fsdp` primitives alone, and questioned how effective the technique would be on architectures with unified GPU-CPU memory such as Apple's M series.
- There was an opinion that this technique is practically useful only for small-scale fine-tuning tasks and is too slow for large-scale pretraining. Reactions also indicated similarity to DeepSpeed.
How to Apply
- If you need to fine-tune a 14B~30B scale model in a single GPU environment with limited VRAM (e.g., RTX 3090, RTX 4090), applying the MegaTrain approach instead of DeepSpeed ZeRO-3 + CPU offloading can yield up to 1.84x faster training throughput on the same hardware.
- If you want to experiment with models larger than 100B parameters without multi-GPU in a server environment with a high-end single GPU like H200 and 1TB or more of CPU memory, MegaTrain's layer-by-layer streaming approach can enable 120B model training without a multi-node setup.
- If you need to train a 7B model requiring ultra-long contexts (512k tokens or more) on a single device, combining MegaTrain with an architecture like GH200, which has high-speed GPU-CPU memory connectivity, can handle it without multi-GPU.
- If direct implementation is difficult, consider first reviewing PyTorch's `torch.distributed.fsdp` (Fully Sharded Data Parallel) feature, which supports similar CPU offloading, and then incrementally applying the double buffering technique from the MegaTrain paper if performance is insufficient.
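As a starting point for the FSDP route above, parameter offloading can be enabled in a few lines. This is a minimal sketch assuming a PyTorch build with CUDA and a single-process group; the toy `Linear` model and port number are placeholders, not MegaTrain's configuration.

```python
# Minimal sketch: CPU offloading with PyTorch FSDP on a single GPU.
# Requires a CUDA-capable PyTorch install; the model is a toy placeholder.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

# FSDP needs an initialized process group, even with one process.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=0, world_size=1)

model = torch.nn.Linear(4096, 4096).cuda()

# offload_params=True keeps sharded parameters (and their gradients)
# in host memory, moving them to the GPU only around compute.
model = FSDP(model, cpu_offload=CPUOffload(offload_params=True))

out = model(torch.randn(8, 4096, device="cuda"))
dist.destroy_process_group()
```

Unlike MegaTrain's pipelined design, stock FSDP offloading transfers parameters synchronously around each shard's compute, which is why the paper's double buffering would be the next increment if throughput is insufficient.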
Terminology
Full Precision Training: Training model parameters in their original FP32 (32-bit floating-point) format, as opposed to mixed-precision training with FP16 or BF16. Higher precision requires more memory.
CPU Offloading: Moving data that does not fit in GPU VRAM (parameters, optimizer states, etc.) to CPU RAM; the GPU fetches it only when needed. It slows training but allows handling much larger models.
ZeRO-3: The most aggressive stage of DeepSpeed's memory-optimization strategy. Parameters, gradients, and optimizer states are partitioned across multiple GPUs and fetched only when needed.
Double Buffering: Pre-loading the data required for the next operation. While A is being processed, B is loaded in advance, so B can be processed immediately after A finishes, without waiting.
Autograd Graph: The computational graph that PyTorch and similar frameworks maintain internally for backpropagation. It consumes a significant amount of memory during training.
Throughput: The number of tokens processed per unit of time (tok/s). An indicator of training speed; a higher value means more data can be trained in the same amount of time.