Scaling Modern Transformers (Part 0: Intro | Part 1: Tokenization)
This is a zero-to-one guide on scaling modern transformers with n-dimensional parallelism. Transformers have driven much of the deep learning revolution, yet few practical guides reflect SOTA architectures and the complexities of large-scale language modelling. While excellent resources such as DeepMind’s How to Scale Your Model and HuggingFace’s Ultra Scale Playbook exist, a gap remains between theory and end-to-end implementation. We aim to bridge that gap by showing you how to scale a model from scratch (in JAX, with code) to current standards.
Find the complete code for this guide on our GitHub repository. More information about the authors can be found in the Conclusion.
Modern transformers are at the heart of today’s deep learning systems, but taking them from a single-GPU prototype to a multi-node cluster is not straightforward. Scaling efficiently requires understanding how data moves through the hardware, how models can be split across devices, and how training infrastructure ties everything together.
This guide is a practical, code-first walkthrough of scaling modern transformers in JAX. Our goal is to bridge the gap between high-level scaling theory and hands-on implementation. By the end, you should feel comfortable building a modern transformer that runs on TPUs or GPUs, sharding it across devices, and training it at scale with the techniques used in SOTA systems.
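As a small preview of the kind of device-level thinking the later parts cover, here is a minimal sketch of placing an array across devices with JAX's sharding API. This is not the guide's own code; the mesh axis name and array shapes are illustrative assumptions, and the snippet runs on whatever devices JAX sees (CPU, GPU, or TPU).

```python
# Minimal sketch: sharding an array across available devices with JAX.
# Axis name "data" and the array shape are illustrative, not from the guide.
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D mesh over all visible devices.
devices = jax.devices()
mesh = Mesh(devices, axis_names=("data",))

# Create a batch whose leading dimension divides evenly across the mesh,
# then place it so each device holds one slice along the "data" axis.
x = jnp.ones((8 * len(devices), 128))
x_sharded = jax.device_put(x, NamedSharding(mesh, P("data", None)))

print(x_sharded.sharding)  # describes how the array is laid out across devices
```

Later parts of the guide build on this same primitive to express data, tensor, and pipeline parallelism.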
Prior to reading this guide, we assume you are familiar with the following topics and resources (or equivalent material):
By the end of this guide, you should be able to:
This is v1.0. We plan to update the guide as we implement more complex ideas and architectures.
Here’s how the guide is structured: