Authors: Shaobo Wang, Xuan Ouyang*, Tianyi Xu*, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, Kexin Yang, Xingzhang Ren†, Dayiheng Liu†, Linfeng Zhang†
Affiliations: Shanghai Jiao Tong University, Alibaba Group, University of Wisconsin–Madison, Mila-Quebec AI Institute
<aside> 📢
We are excited to introduce OPUS (Optimizer-aware Projected Utility Selection), a theoretically principled framework for dynamic, per-iteration data selection during LLM pre-training.
While most pipelines still rely on static filtering or gradient-only online heuristics, these approaches either fail to adapt to the model’s evolving training needs or become prohibitively expensive at scale. Critically, they also often score examples in a geometry that does not match the optimizer’s actual parameter updates. OPUS is optimizer-aware: it evaluates utility directly in the optimizer-induced update geometry by estimating each sample’s one-step contribution toward a high-quality proxy distribution. To make per-step scoring scalable, OPUS combines efficient inner-product computation with CountSketch-based random projections, and it replaces brittle greedy top-k selection with Boltzmann soft sampling to preserve diversity (a minimal illustrative sketch of these two components follows this overview).
Experiments show that OPUS delivers clear gains. For example, on a GPT-2-XL model trained for 30B tokens on the FineWeb-Edu dataset with the Muon optimizer, OPUS outperforms the strongest static and dynamic data selection baselines by +2.5–2.8 points, and it even matches the performance of a 200B-token training run.
OPUS is also compatible with industrial static filters such as FineWeb-Edu classifiers. Specifically, even when OPUS selects from a middle-quality data pool while all baselines are trained on higher-quality data, it still outperforms both strong static filters and prior online selection methods despite this disadvantage.
</aside>
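For a concrete picture of the two scalability and diversity components named above, the sketch below shows how a CountSketch-style gradient projection and Boltzmann soft sampling can be wired together. It is an illustrative NumPy toy under assumed names and sizes (`countsketch`, `boltzmann_select`, the sketch dimension, and the temperature are all choices made for the example), not the released OPUS code.

```python
import numpy as np

def countsketch(vec, sketch_dim, seed=0):
    """Project a flat gradient vector into `sketch_dim` buckets with random
    signs. With a shared seed (shared hash functions), inner products between
    sketched vectors approximate the original inner products."""
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, sketch_dim, size=vec.shape[0])
    signs = rng.choice([-1.0, 1.0], size=vec.shape[0])
    sketch = np.zeros(sketch_dim)
    np.add.at(sketch, buckets, signs * vec)   # scatter-add into buckets
    return sketch

def boltzmann_select(scores, k, temperature=1.0, rng=None):
    """Sample k candidates without replacement with prob ∝ exp(score / T),
    instead of greedy top-k, to keep diversity in the selected batch."""
    rng = rng or np.random.default_rng()
    logits = (scores - scores.max()) / temperature   # stabilized softmax
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(scores), size=k, replace=False, p=probs)

# Toy usage: score N candidates by the inner product between their sketched
# per-example gradients and the sketched validation gradient, then soft-sample k.
N, d, m, k = 8, 10_000, 256, 4
per_example_grads = np.random.randn(N, d)
val_grad = np.random.randn(d)
sketched = np.stack([countsketch(g, m) for g in per_example_grads])
scores = sketched @ countsketch(val_grad, m)
chosen = boltzmann_select(scores, k, temperature=0.5)
```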
Takeaways for dynamic data selection in LLM pre-training (OPUS)
<aside> 🔥
Consider a training stream $\mathcal{D}_{\mathrm{tr}}$ and a proxy validation set $\mathcal{D}_{\mathrm{val}}$. At iteration $t$, we draw a candidate pool $\mathcal{B}_t=\left\{z_{t,1}, \ldots, z_{t,N}\right\}$ of size $N$ from the stream. The goal is to select a subset $\widehat{\mathcal{B}}_t \subset \mathcal{B}_t$ of size $k$ (typically $k=\lfloor\rho N\rfloor$) that maximizes progress on $\mathcal{D}_{\mathrm{val}}$. Let $\mathcal{L}(\mathcal{D};\theta):=\frac{1}{|\mathcal{D}|}\sum_{z \in \mathcal{D}} \ell(z;\theta)$ denote the empirical risk. We define the Utility of a candidate subset $S$, denoted $U^{(t)}(S)$, as the reduction in loss on the validation set achieved by updating the model using $S$:
$$ \begin{equation}U^{(t)}(S):=\mathcal{L}\left(\mathcal{D}_{\mathrm{val}}; \theta_t\right)-\mathcal{L}\left(\mathcal{D}_{\mathrm{val}}; \theta_{t+1}(S)\right),\end{equation} $$
where $\theta_{t+1}(S)$ represents the model parameters after a single optimization step using the subset $S$. Here $\mathcal{D}_{\mathrm{val}}$ serves as a proxy signal for the desired training direction; in practice it can be a held-out validation file or a curated proxy pool. We interpret $\theta_{t+1}(S)$ as a one-step lookahead update obtained by applying the optimizer to subset $S$ while holding the optimizer state fixed at step $t$.
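To make the definition in Eq. (1) concrete, the following sketch evaluates the one-step lookahead utility by brute force for a single candidate subset; this is exactly the quantity that the next paragraph argues is too expensive to compute for every subset. It is a hedged PyTorch illustration that substitutes a plain SGD step for the actual optimizer update, and `loss_fn`, `subset`, `val_batch`, and `lr` are assumed names, not part of OPUS.

```python
import copy
import torch

def one_step_lookahead_utility(model, subset, val_batch, loss_fn, lr=1e-3):
    """Brute-force evaluation of the utility in Eq. (1) for one candidate
    subset S, using a plain SGD step as the illustrative update rule."""
    # Validation loss at the current parameters theta_t.
    with torch.no_grad():
        loss_before = loss_fn(model, val_batch).item()

    # One-step lookahead: update a *copy* of the model on the candidate
    # subset, so theta_t itself is left untouched.
    lookahead = copy.deepcopy(model)
    lookahead.zero_grad(set_to_none=True)
    loss_fn(lookahead, subset).backward()
    with torch.no_grad():
        for p in lookahead.parameters():
            if p.grad is not None:
                p -= lr * p.grad  # theta_{t+1}(S) under plain SGD

    # Validation loss after the hypothetical update; the drop is U^{(t)}(S).
    with torch.no_grad():
        loss_after = loss_fn(lookahead, val_batch).item()
    return loss_before - loss_after
```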
Directly evaluating Eq. (1) for every candidate subset is computationally intractable. To derive an efficient scoring mechanism, we perform a second-order Taylor expansion of the validation loss around $\theta_t$ :
$$ \begin{equation} \begin{split} \mathcal{L}\!\left(\mathcal{D}_{\mathrm{val}};\theta_{t+1}(S)\right) &\approx \mathcal{L}\!\left(\mathcal{D}_{\mathrm{val}};\theta_t\right) +(\theta_{t+1}-\theta_t)^{\top}\mathbf{g}_{\mathrm{val}} \\ &\quad+\frac{1}{2}(\theta_{t+1}-\theta_t)^{\top}\mathbf{H}_{\mathrm{val}}(\theta_{t+1}-\theta_t), \end{split} \end{equation} $$
where $\mathbf{g}_{\mathrm{val}}=\nabla \mathcal{L}\left(\mathcal{D}_{\mathrm{val}}; \theta_t\right)$ is the validation gradient and $\mathbf{H}_{\mathrm{val}}$ is the Hessian. Substituting this into Eq. (1), the utility becomes:
$$ \begin{equation} U^{(t)}(S) \approx-\underbrace{\left(\theta_{t+1}(S)-\theta_t\right)^{\top} \mathbf{g}_{\mathrm{val}}}_{\text{Proxy Alignment}}-\underbrace{\frac{1}{2}\left(\theta_{t+1}(S)-\theta_t\right)^{\top} \mathbf{H}_{\mathrm{val}}\left(\theta_{t+1}(S)-\theta_t\right)}_{\text{Redundancy Interaction}}. \end{equation} $$
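As a concrete reading of Eq. (3), the sketch below computes the Proxy Alignment term as an inner product between a candidate update and the validation gradient, and the Redundancy Interaction term through a Hessian-vector product (double backprop), so the full validation Hessian is never materialized. This is an illustrative PyTorch fragment under assumed names (`loss_fn`, `delta_theta`), not the OPUS implementation.

```python
import torch

def taylor_utility(model, val_batch, delta_theta, loss_fn):
    """Second-order approximation of U^{(t)}(S) from Eq. (3).
    `delta_theta` is the candidate update theta_{t+1}(S) - theta_t, given as a
    list of tensors matching model.parameters()."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Validation gradient g_val at theta_t; keep the graph so we can
    # differentiate it again for the Hessian-vector product.
    val_loss = loss_fn(model, val_batch)
    g_val = torch.autograd.grad(val_loss, params, create_graph=True)

    # Proxy Alignment term: (theta_{t+1}(S) - theta_t)^T g_val.
    align = sum((d.detach() * g).sum() for d, g in zip(delta_theta, g_val))

    # Redundancy Interaction term via a Hessian-vector product:
    # grad_theta [ g_val^T v ] = H_val v, with v = delta_theta held constant.
    hvp = torch.autograd.grad(align, params)
    redundancy = 0.5 * sum((d.detach() * h).sum() for d, h in zip(delta_theta, hvp))

    return (-align - redundancy).item()
```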
A critical limitation of prior work is the assumption that the parameter update $\Delta \theta=\theta_{t+1}-\theta_t$ follows simple SGD dynamics (i.e., $\Delta \theta \propto-\sum \nabla \ell(z)$). In modern LLM training, the actual update direction can deviate substantially from the raw gradient because of optimizer state, such as adaptive scaling and momentum.
We therefore model training dynamics through an optimizer-dependent transformation $\Psi_t(\cdot)$, which maps a per-example gradient to its effective update at step $t$ :