xAGI AI Briefing: Training, RL, and Optimization
Powered by the Datagraph data-annotation marketplace
For a long time, the prevailing wisdom in AI has been "more is better" when it comes to training data. However, recent research has revealed a fundamental limitation: the "data wall".1 This is the point where simply scaling the quantity of web data no longer yields meaningful performance improvements. In response, the industry is moving beyond mere collection and toward a more active, intentional approach to data curation and synthesis.
We're excited to partner with Datagraph for this edition of the xAGI AI Briefing.
Datagraph is a premier marketplace that bridges the gap between expert knowledge and advanced AI development. By connecting subject-matter experts directly with AI labs, Datagraph ensures that cutting-edge models are trained on the highest quality, most relevant datasets.
Ready to put your expertise to work? Join as a knowledge expert and earn up to $499 per hour. If you're an AI lab seeking high-quality data, explore their datasets at datagraph.in.
A promising new paradigm for generating high-quality training material is the "source rephrasing" approach.1 Instead of using a massive, expensive language model to create new knowledge
de novo, this method leverages smaller, more cost-effective language models to rephrase and refine existing web content.1 This is the core principle behind frameworks like
WRAP (Web Rephrase Augmented Pre-training) and REWIRE, which transform raw web documents into higher-quality, task-aligned formats like Q&A pairs or instructional passages.1 This technique provides superior diversity and coverage at a fraction of the computational cost, making the creation of state-of-the-art training corpora more accessible than ever.1
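To make the rephrasing loop concrete, here is a minimal sketch of a WRAP-style pipeline. The prompt wording and the pluggable generate callable are illustrative assumptions, not the papers' exact setup:

```python
# A WRAP-style rephrasing loop: a small LM rewrites raw web text into an
# instructional Q&A format. The prompt wording and the `generate` callable
# are illustrative assumptions, not the papers' exact setup.

REPHRASE_PROMPT = (
    "Rewrite the following web text as a concise question-and-answer pair "
    "that preserves all factual content:\n\n{document}"
)

def rephrase_corpus(documents, generate):
    """Yield synthetic Q&A-style views of raw web documents.

    `generate` is any callable mapping a prompt string to a completion,
    e.g. a thin wrapper around a small open-weights model.
    """
    for doc in documents:
        rewritten = generate(REPHRASE_PROMPT.format(document=doc))
        # WRAP trains on a mixture of real and rephrased text, so keep both
        # views rather than replacing the source document.
        yield {"source": doc, "synthetic": rewritten}

if __name__ == "__main__":
    fake_generate = lambda p: "Q: What is the data wall? A: ..."  # model stub
    docs = ["The data wall is the point where more raw web data stops helping."]
    for pair in rephrase_corpus(docs, fake_generate):
        print(pair["synthetic"])
```

The design point the code mirrors is that rephrasing augments rather than replaces the source corpus, which is where the diversity and coverage gains come from.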
NVIDIA's Nemotron-CC dataset is a prime example of this data rephrasing in practice.2 The latest version of this dataset is an updated web crawl that has been enhanced with synthetic rephrasing using models like Qwen3-30B-A3B. Its high-quality synthetic subset, known as
Nemotron-Synth, has become a competitive benchmark for pre-training.1 Even so, newer synthetic methods such as BeyondWeb have shown consistent gains, outperforming Nemotron-Synth by up to 2.6 percentage points and achieving a 2.7x training speedup.1 The development of these advanced data-creation techniques signals a profound shift: the industry is moving from passive data filtering to active, generative data enhancement.
The community is also making strides in data quality through dedicated curation efforts. Hugging Face's new FineWeb-edu dataset is a significant contribution, providing a high-quality, education-focused training resource for the open-source community.3 The dataset was created with an educational-quality classifier, itself trained on synthetic annotations from Llama-3-70B-Instruct, to filter the original FineWeb dataset down to its most educational web pages. This project addresses an ongoing challenge: many high-performance models are released without their training datasets, despite the immense impact those datasets have on performance.3
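For readers who want to inspect the data directly, here is a short sketch of streaming FineWeb-edu and re-filtering by the published classifier score. The config name "sample-10BT" and the int_score field follow the dataset card; treat both as assumptions to verify before use:

```python
# Stream FineWeb-edu and keep only the most educational pages. The config
# name "sample-10BT" and the `int_score` field follow the dataset card;
# treat both as assumptions to verify before use.
from datasets import load_dataset

stream = load_dataset(
    "HuggingFaceFW/fineweb-edu", "sample-10BT", split="train", streaming=True
)

# FineWeb-edu ships the classifier's educational-quality score alongside each
# page, so users can re-filter at a stricter threshold than the default.
high_quality = (ex for ex in stream if ex.get("int_score", 0) >= 4)

for example in high_quality:
    print(example["text"][:200])
    break  # just show the first surviving page
```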
Finally, the DCLM (DataComp for Language Models) benchmark explicitly validates the importance of this data-centric approach.5 Rather than focusing on finding the best model for a fixed dataset, DCLM fixes the model architecture and training code, challenging researchers to innovate by creating the best possible training sets.5 This benchmark includes various scales and compute budgets, making it accessible to a wide range of researchers and enabling the study of how data quality scales with model size.5 By formalizing data as the primary variable for innovation, DCLM solidifies the notion that a new era of data-driven performance is here.
| Project Name | Purpose | Core Methodology | Key Features | Source/Reference |
| --- | --- | --- | --- | --- |
| FineWeb-edu | Open-source, high-quality training data | Filtering web data with a classifier trained on synthetic data | Educational content focus; public release of classifier and code; sample subsets available | 3 |
| DCLM | Benchmark for data-centric innovation | Fixes model and code; challenges researchers to create the best datasets | Multi-scale design; evaluation on 53 downstream datasets; supports re-use of data and custom extraction | 5 |
| Nemotron-CC | Next-generation pre-training dataset | Updates web crawls and enriches with synthetic, rephrased data | Includes high-value math and code, diverse Q&A; uses models like Qwen3 for rephrasing | 1 |
The Reinforcement Learning Renaissance
The training of large language models using reinforcement learning has traditionally been a resource-intensive challenge. A significant bottleneck has been the "synchronous" nature of many training pipelines, where the system must wait for the longest sequence in a batch to finish generating before it can proceed.7 This results in massive GPU idle time and training inefficiency. The latest advancements in RL frameworks are directly addressing this problem through a new focus on system-level efficiency.
The AReaL (Ant Reasoning RL) framework represents a groundbreaking solution to this problem with its "fully asynchronous" design.7 This system completely decouples the generation of data from the training process, allowing each rollout worker to continuously produce outputs without waiting for others.7 This "system-algorithm co-design" has yielded impressive results, with the boba² release of AReaL demonstrating a
2.77x speedup compared to synchronous systems like Verl, while achieving superior or on-par training performance.7 For researchers, the recent release of
AReaL-lite further simplifies the process by offering a lightweight, algorithm-first codebase that maintains high performance with 80% fewer lines of code.8
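The following toy sketch illustrates the asynchronous idea in plain Python: rollout workers push finished trajectories into a shared buffer while a trainer consumes them, so nothing stalls on the slowest generation. It is a conceptual illustration only, not AReaL's actual code:

```python
# A toy illustration of the asynchronous idea: rollout workers push finished
# trajectories into a shared buffer while the trainer consumes them, so no
# one waits for the slowest generation in a batch. Conceptual only; this is
# not AReaL's actual implementation.
import queue
import random
import threading
import time

buffer: "queue.Queue[dict]" = queue.Queue(maxsize=1024)

def rollout_worker(worker_id: int) -> None:
    while True:
        # Generation time varies wildly with output length; a synchronous
        # system would stall every worker on the longest generation here.
        time.sleep(random.uniform(0.01, 0.5))
        buffer.put({"worker": worker_id, "trajectory": "..."})

def trainer(batch_size: int = 8) -> None:
    while True:
        batch = [buffer.get() for _ in range(batch_size)]
        # In a real system this would be a PPO/GRPO update on possibly stale
        # trajectories, which is why AReaL pairs the asynchronous design with
        # staleness-aware corrections on the algorithm side.
        print(f"update on samples from workers {[t['worker'] for t in batch]}")

for i in range(4):
    threading.Thread(target=rollout_worker, args=(i,), daemon=True).start()
threading.Thread(target=trainer, daemon=True).start()
time.sleep(2)  # let a few updates run, then exit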
This asynchronous approach is part of a larger ecosystem of new, highly specialized RL frameworks. Verl, for example, is a popular open-source framework that exemplifies the synchronous approach, and it provides multi-node distributed training with automatic Ray cluster setup for algorithms like PPO and GRPO.9 NVIDIA's
NeMo-RL offers an enterprise-grade solution that has added support for the Megatron-Core library, enabling a 6D parallelism strategy and advanced optimizations like sequence packing to handle training of massive models.11 Finally,
THUDM's Slime framework showcases a hybrid approach, designed for RL scaling by connecting a Megatron backend for training with an SGLang backend for flexible data generation.12 Real-world bug reports show how tightly these components interact in practice, with Megatron and SGLang used together for GRPO training.13
These frameworks are being put to the test on complex tasks. The new Qwen 2.5 models, for example, are trained with a methodology built around Group Relative Policy Optimization (GRPO).9 This application to coding tasks highlights how these advanced RL frameworks are not just theoretical constructs but are actively being used to train powerful, next-generation models with advanced reasoning abilities.14 The race for the future of large-scale RL is no longer just about which algorithm is best but about which underlying system can handle the immense scale and complexity most efficiently.
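At the heart of GRPO is a simple idea: sample a group of completions per prompt, score each, and normalize every reward against its group, removing the need for a learned value model. A minimal sketch of that advantage computation, with illustrative reward values:

```python
# The group-relative advantage at the core of GRPO: sample several
# completions per prompt, score each, and normalize rewards within the
# group, removing the need for a learned value model. Reward values below
# are illustrative.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar reward per completion."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Each completion is judged relative to its siblings for the same prompt.
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],   # prompt 1: two of four pass tests
                        [0.0, 0.0, 0.0, 1.0]])  # prompt 2: one of four passes
print(grpo_advantages(rewards))
```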
| Framework | Core Technology | Approach | Key Features |
| --- | --- | --- | --- |
| AReaL | AReaL's own system with PPO optimizations | Asynchronous | 2.77x speedup; decouples generation and training; scales up to 1K GPUs; simplified "lite" version |
| Verl | Ray cluster management | Synchronous | Multi-node distributed training; supports PPO, GRPO; checkpoint persistence; easy-to-use CLI |
| NVIDIA NeMo-RL | Megatron-Core, PyTorch DTensor, vLLM | Both (async rollouts available) | 6D parallelism; sequence packing; simplifies Megatron config; supports MoE models |
| Slime | Megatron (training) + SGLang (data generation) | Hybrid | High-performance training; flexible data generation; built for RL scaling; used for GLM-4.5 |
The Optimization Frontier: Performance Beyond Size
The pursuit of better performance is not limited to data and frameworks; it is also being waged at the most fundamental levels of the AI stack. Recent advancements are demonstrating that significant gains can be found by refining core training algorithms and model compression techniques.
A new research paper introduces the AdLoCo method, a three-stage approach designed to improve efficiency in distributed training.16 Its key innovation is
adaptive batched DiLoCo, which dynamically adjusts the local batch size on each compute node to strike the optimal balance between computation and communication.16 This approach significantly lowers synchronization delays, thereby improving convergence speed. The method also includes
Multi-Instance Training (MIT), which allows nodes to run multiple lightweight training streams in parallel, and a "switch mode" that introduces gradient accumulation as batch sizes grow.16 This holistic, end-to-end optimization of the distributed training process shows that efficiency is now a primary design goal, not an afterthought.
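A rough sketch helps illustrate the adaptive-batching idea: grow the local batch while gradients are noisy, then flip to gradient accumulation once the device limit is hit. The growth rule and threshold below are assumptions for illustration, not AdLoCo's exact algorithm:

```python
# Illustrative adaptive-batching logic: grow the local batch while gradients
# are noisy, then flip to gradient accumulation ("switch mode") once the
# device limit is reached. The growth rule and threshold are assumptions for
# illustration, not AdLoCo's exact algorithm.

def next_batch_config(batch_size: int, grad_noise: float,
                      max_device_batch: int = 64, growth: int = 2):
    """Return (device_batch, accumulation_steps) for the next local round."""
    if grad_noise > 1.0:                # noisy gradients: a larger batch helps
        batch_size *= growth
    if batch_size <= max_device_batch:  # requested batch still fits in memory
        return batch_size, 1
    # Switch mode: emulate the large batch via gradient accumulation so the
    # effective batch keeps growing without exceeding device memory.
    steps = -(-batch_size // max_device_batch)  # ceiling division
    return max_device_batch, steps

# Requesting 48 * 2 = 96 exceeds the device limit of 64, so the node runs
# two accumulation steps of 64 instead.
print(next_batch_config(48, grad_noise=1.4))  # -> (64, 2)
```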
Parallel to this, model compression and quantization are reaching new levels of sophistication. The latest releases from vLLM's LLM Compressor (v0.7.0) and NVIDIA's TensorRT-LLM are leading this charge.18 They are integrating advanced techniques like
QuIP and SpinQuant, which use Hadamard-based rotations to lessen quantization sensitivity and improve the accuracy of low-bit models.18 Another major development is the adoption of
DeepSeek v3-style block quantization, which allows for efficient model compression without the need for a calibration dataset.18 The fact that these two major projects are converging on the same powerful techniques highlights the maturity and importance of this optimization frontier.
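The following sketch combines both ideas under simplifying assumptions: a Hadamard rotation spreads weight outliers across dimensions before quantization (the intuition behind QuIP and SpinQuant), and one scale per weight block gives calibration-free int8 compression in the spirit of DeepSeek v3. Block size and bit width are illustrative choices:

```python
# Both ideas under simplifying assumptions: a Hadamard rotation spreads
# weight outliers across dimensions before quantization (the QuIP/SpinQuant
# intuition), and one scale per weight block gives calibration-free int8
# compression in the spirit of DeepSeek v3. Block size and bit width are
# illustrative choices.
import torch

def hadamard(n: int) -> torch.Tensor:
    """Normalized Hadamard matrix for n a power of two (Sylvester construction)."""
    h = torch.ones(1, 1)
    while h.shape[0] < n:
        h = torch.cat([torch.cat([h, h], 1), torch.cat([h, -h], 1)], 0)
    return h / torch.sqrt(torch.tensor(float(n)))

def block_quantize(w: torch.Tensor, block: int = 128):
    """Symmetric int8 quantization with one scale per contiguous block."""
    flat = w.reshape(-1, block)
    scale = flat.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp((flat / scale).round(), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(256, 256)
h = hadamard(256)
w_rot = h @ w @ h.T                     # orthogonal, so it is undone exactly
q, scale = block_quantize(w_rot)
dequant = (q.float() * scale).reshape(256, 256)
print((h.T @ dequant @ h - w).abs().max())  # reconstruction error
```

Because the rotation is orthogonal, it can be inverted exactly at inference; its job is purely to flatten outliers so each block's scale wastes fewer quantization levels.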
Even the most foundational algorithms are being re-examined. The Adam optimizer, a cornerstone of modern deep learning, has received renewed attention. Research and community discussions show that its epsilon parameter, often treated as a simple stability constant, is actually a critical tuning variable.20 In reinforcement learning, a larger epsilon can act as a "damping/trust region" term, which helps stabilize optimization in a domain where the target is constantly changing.20 Further exploration in this area has led to the development of Apple's new
AdEMAMix optimizer.22 This optimizer, a simple modification of Adam, uses a mixture of two exponential moving averages, and it is based on the surprising observation that gradients can remain relevant for tens of thousands of training steps.22 AdEMAMix has demonstrated significant efficiency gains, with one model performing comparably to an AdamW model trained on nearly twice as many tokens.22
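A condensed sketch of the AdEMAMix update, following the paper's description: mix a fast EMA of the gradient with a very slow one on top of Adam's second-moment normalization. The defaults below are as best recalled from the paper, and the schedulers it applies to alpha and beta3 early in training are omitted for brevity:

```python
# A condensed AdEMAMix step following the paper's description: mix a fast EMA
# of the gradient with a very slow one, on top of Adam's second-moment
# normalization. Defaults are as best recalled from the paper; the schedulers
# it applies to alpha and beta3 early in training are omitted for brevity.
import torch

def ademamix_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999, 0.9999),
                  alpha=5.0, eps=1e-8, step=1):
    b1, b2, b3 = betas
    state["m1"] = b1 * state["m1"] + (1 - b1) * grad    # fast EMA (Adam-like)
    state["m2"] = b3 * state["m2"] + (1 - b3) * grad    # slow EMA: old gradients
    state["v"] = b2 * state["v"] + (1 - b2) * grad**2   # second moment
    m1_hat = state["m1"] / (1 - b1 ** step)             # bias corrections
    v_hat = state["v"] / (1 - b2 ** step)
    param -= lr * (m1_hat + alpha * state["m2"]) / (v_hat.sqrt() + eps)
    return param

p = torch.zeros(4)
st = {"m1": torch.zeros(4), "m2": torch.zeros(4), "v": torch.zeros(4)}
for t in range(1, 4):
    p = ademamix_step(p, torch.randn(4), st, step=t)
print(p)
```

The slow EMA (beta3 near 1) is what lets the optimizer keep exploiting gradients from tens of thousands of steps ago, which is the paper's central observation.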
| Technique | Type | Problem Solved | Key Features |
| --- | --- | --- | --- |
| AdLoCo | Distributed training optimization | Communication and synchronization bottlenecks in training | Adaptive batch sizing; multi-instance training; gradient-accumulation switch mode |
| QuIP / SpinQuant | Quantization / inference optimization | Accuracy loss in low-bit quantization | Hadamard-based rotations to stabilize quantization and improve accuracy |
| DeepSeek v3 Block Quantization | Quantization / inference optimization | Inefficient model compression and deployment | Partitions weights into blocks for independent quantization; no calibration dataset required |
| AdEMAMix | Foundational optimization algorithm | Inefficient use of past gradients; convergence speed | Mixture of two exponential moving averages; reduces forgetting; faster time to target |
Today's Top News and People
Model Spotlight: Qwen 2.5, the New Contender in Code
Alibaba's Qwen family of models continues to be a dominant force in the open-source community.23 Today's news highlights the new
Qwen 2.5 Max model, which is showing fantastic performance in coding tasks.15 The model's success is a testament to the advancements in RL training frameworks and datasets discussed throughout this report.14
For a closer look at the model's capabilities, a video from Jayendra_ram tests the Qwen 2.5 Max model against its Turbo counterpart.15 The video and official blog posts provide valuable insight into the new model's performance and the techniques used to train it.
Video: Jayendra_ram's Qwen 2.5 Code Test: https://www.youtube.com/watch?v=w_7WI6rbrjI 15
Qwen 2.5 Max Blog: https://qwenlm.github.io/blog/qwen2.5-max/ 15
Qwen 2.5 Turbo Blog: https://qwenlm.github.io/blog/qwen2.5-turbo/ 15
People in AI: Finbarr Timbers
A key figure in the "Reinforcement Learning Renaissance" is Finbarr Timbers, an AI researcher and investor with a background at DeepMind and AI2.24 His work is focused on the application of reinforcement learning for large language models, an area he has characterized as the "We are so back" era of RL.23
Readers interested in learning more about his perspective on RL, scaling reasoning, and the future of AI can explore his work and interviews.
Finbarr Timbers' Blog: https://substack.com/@finbarrtimbers 25
Website: https://finbarr.ca/ 24
Interview: Interconnects Interview with Finbarr Timbers on the "We are So Back" Era of Reinforcement Learning 23
Looking Ahead: Connecting the Dots
Today's news, when viewed as a whole, paints a clear picture of a maturing AI landscape. The initial race to scale models is evolving into a more sophisticated, multi-front campaign for end-to-end optimization. Success in the future will not hinge on a single breakthrough in a new model architecture but on a coordinated effort across three distinct yet interconnected frontiers.
First, the industry is mastering the art of data. The shift from passively collecting vast, raw web data to actively curating and synthetically rephrasing it will unlock performance gains that were once thought to be limited by the "data wall." Second, the race is on to build the next generation of training engines. The new RL frameworks, from AReaL's asynchronous design to NeMo-RL's Megatron-Core integration, are fundamentally changing how models are trained at scale. These systems are not merely new algorithms; they are a new class of engineering solution designed to eliminate systemic bottlenecks like GPU idle time. Finally, the focus on foundational optimizations, from AdLoCo's adaptive batching to Apple's AdEMAMix optimizer and the latest quantization techniques, demonstrates a commitment to efficiency at every layer.
Together, these developments show that the future of AI is a holistic challenge. The most successful models will be those built by teams that can simultaneously master data, leverage cutting-edge training systems, and optimize every aspect of the technology stack, from the foundational algorithms to the hardware-specific implementations.
Works cited
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale ..., accessed August 28, 2025, https://www.arxiv.org/pdf/2508.10975
nvidia/Nemotron-CC-v2 · Datasets at Hugging Face, accessed August 28, 2025, https://huggingface.co/datasets/nvidia/Nemotron-CC-v2
HuggingFaceFW/fineweb-edu · Datasets at Hugging Face, accessed August 28, 2025, https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
What can we learn from Hugging Face's Fineweb Dataset - Kili Technology, accessed August 28, 2025, https://kili-technology.com/large-language-models-llms/what-can-we-learn-from-hugging-face-s-fineweb-dataset
DataComp-LM (DCLM), accessed August 28, 2025, https://www.datacomp.ai/dclm/
apple/DCLM-7B - Hugging Face, accessed August 28, 2025, https://huggingface.co/apple/DCLM-7B
inclusionAI/AReaL-boba-2-32B - Hugging Face, accessed August 28, 2025, https://huggingface.co/inclusionAI/AReaL-boba-2-32B
inclusionAI/AReaL: Distributed RL System for LLM Reasoning - GitHub, accessed August 28, 2025, https://github.com/inclusionAI/AReaL
Verl: State-of-the-art RL Training for LLMs — SkyPilot documentation, accessed August 28, 2025, https://docs.skypilot.co/en/latest/examples/training/verl.html
Scale Machine Learning & AI Computing | Ray by Anyscale, accessed August 28, 2025, https://www.ray.io/
Reinforcement Learning with NVIDIA NeMo-RL: Megatron-Core ..., accessed August 28, 2025, https://developer.nvidia.com/blog/reinforcement-learning-with-nvidia-nemo-rl-megatron-core-support-for-optimized-training-throughput/
THUDM/slime: slime is a LLM post-training framework ... - GitHub, accessed August 28, 2025, https://github.com/THUDM/slime
[Bug] megatron+sglang GRPO training of an MoE model, single node: after initializing the SGLang engine, the printed probs are all None #8857 - GitHub, accessed August 28, 2025, https://github.com/sgl-project/sglang/issues/8857
Part 1 - Mathematical Reasoning with GRPO | Reinforcement ..., accessed August 28, 2025,
The NEW AI Model Qwen 2.5 Max is Fantastic at Coding! - YouTube, accessed August 28, 2025,
AdLoCo: adaptive batching significantly improves communications efficiency and convergence for Large Language Models - arXiv, accessed August 28, 2025, https://arxiv.org/html/2508.18182v1
[2508.18182] AdLoCo: adaptive batching significantly improves communications efficiency and convergence for Large Language Models - arXiv, accessed August 28, 2025, https://arxiv.org/abs/2508.18182
LLM Compressor 0.7.0 release recap | Red Hat Developer, accessed August 28, 2025, https://developers.redhat.com/articles/2025/08/25/llm-compressor-070-release-recap
Release Notes — TensorRT-LLM - GitHub Pages, accessed August 28, 2025, https://nvidia.github.io/TensorRT-LLM/release-notes.html
Intuitive explanation for Adam's epsilon parameter : r/reinforcementlearning - Reddit, accessed August 28, 2025, https://www.reddit.com/r/reinforcementlearning/comments/j9rflf/intuitive_explanation_for_adams_epsilon_parameter/
Using larger epsilon with Adam for RL? : r/reinforcementlearning - Reddit, accessed August 28, 2025, https://www.reddit.com/r/reinforcementlearning/comments/ctytuq/using_larger_epsilon_with_adam_for_rl/
The AdEMAMix Optimizer: Better, Faster, Older - Apple Machine Learning Research, accessed August 28, 2025, https://machinelearning.apple.com/research/ademamix-optimizer
Interconnects | Nathan Lambert | Substack, accessed August 28, 2025,
Finbarr Timbers, accessed August 28, 2025, https://finbarr.ca/
Finbarr Timbers - Substack, accessed August 28, 2025, https://substack.com/@finbarrtimbers