Large Language Diffusion Models


Motivation

We contend that the intelligence of LLMs—manifested in scalability, instruction-following, in-context learning,
conversational ability, and compression—stems not from the autoregressive mechanism per se, but rather from
the core principle of generative modeling: approximating the true language distribution through maximum
likelihood estimation.

We introduce LLaDA (Large Language Diffusion with mAsking), a simple yet principled
generative paradigm for large language models that demonstrates the aforementioned remarkable capabilities.

Method

LLaDA is a masked diffusion model [1, 2] that follows the standard pretraining and SFT pipeline
while sampling via diffusion. During pretraining, each token is masked independently
with probability t ∼ U[0, 1]; in SFT, only response tokens may be masked. At inference,
the model simulates a reverse diffusion process from fully masked (t = 1) to fully unmasked (t = 0),
predicting all masked tokens simultaneously at each step and flexibly remasking a subset of them.
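To make this concrete, here is a minimal PyTorch sketch of the forward masking step, a masked-diffusion pretraining loss, and a reverse-diffusion sampler with low-confidence remasking. The model interface (a callable returning per-token logits), the MASK_ID value, the 1/t loss weighting, the linear remasking schedule, and the confidence-based remasking rule are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical [MASK] token id; depends on the tokenizer

def forward_mask(tokens, t, mask_id=MASK_ID):
    """Mask each token independently with probability t (one t per sequence)."""
    noise = torch.rand_like(tokens, dtype=torch.float)
    is_masked = noise < t.unsqueeze(-1)                      # Bernoulli(t) per position
    noisy = torch.where(is_masked, torch.full_like(tokens, mask_id), tokens)
    return noisy, is_masked

def pretraining_loss(model, tokens):
    """Cross-entropy on masked positions only, reweighted by 1/t
    (a common weighting for masked-diffusion objectives; an assumption here)."""
    b, seq_len = tokens.shape
    t = torch.rand(b, device=tokens.device).clamp_min(1e-3)  # t ~ U[0, 1]
    noisy, is_masked = forward_mask(tokens, t)
    logits = model(noisy)                                    # (b, seq, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    return ((ce * is_masked).sum(-1) / t).mean() / seq_len

@torch.no_grad()
def sample(model, prompt, gen_len=128, steps=64, mask_id=MASK_ID):
    """Reverse diffusion: start from a fully masked response (t = 1), predict all
    masked tokens each step, then remask the least-confident predictions so the
    masking ratio follows a linear schedule down to t = 0."""
    x = torch.cat([prompt,
                   torch.full((gen_len,), mask_id, dtype=torch.long, device=prompt.device)])
    gen = slice(len(prompt), len(x))                         # response region
    for step in range(steps):
        t_next = 1.0 - (step + 1) / steps                    # target mask ratio after this step
        logits = model(x.unsqueeze(0))[0, gen]               # (gen_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        still_masked = x[gen] == mask_id
        x[gen] = torch.where(still_masked, pred, x[gen])     # fill every masked slot
        # never remask tokens committed in earlier steps
        conf = torch.where(still_masked, conf, torch.full_like(conf, float("inf")))
        n_remask = int(round(t_next * gen_len))
        if n_remask > 0:
            lowest = conf.topk(n_remask, largest=False).indices
            x[gen.start + lowest] = mask_id                  # low-confidence remasking
    return x[gen]
```

A single training step would then be `loss = pretraining_loss(model, batch); loss.backward()`, while generation starts from a prompt and a fully masked response region and gradually commits tokens as steps progress.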

Scalability

LLaDA demonstrates strong scalability, with an overall scaling trend that is highly
competitive with that of autoregressive baselines trained on the same data.

