I couch-surfed in the Bay Area for 2 months to learn how AI models are really trained.
The biggest takeaway: pre-training is hitting a wall.
For the last 2 months, I've been a full-on AI nomad. I've been couch-surfing across the Bay Area, hitting up every hackathon and meetup I could find, and basically talking to any eng leader or AI researcher who would listen. My goal was to get past the hype and understand how foundation models are actually built and improved.
I wanted to share the biggest thing I learned about how AI models are trained, which boils down to two key phases:
I) Pre-training (What we've been doing):
This is what you probably think of: sucking up the entire public internet. Every book, blog post, Wikipedia article, GitHub repo, and YouTube transcript. For a long time, bigger data = smarter model. This got us incredible results.
BUT... the consensus is that around a year ago, the gains from this started to plateau. Why? We've basically... run out of high-quality public internet to scrape. The returns from just adding more random text are diminishing fast.
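If you want one equation for why, the Chinchilla scaling law (Hoffmann et al., 2022) is the usual reference. Quoting its published fitted form here purely as an illustration:

```latex
% Chinchilla-style loss scaling (Hoffmann et al., 2022):
% N = model parameters, D = training tokens;
% E, A, B, \alpha, \beta are empirically fitted constants.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Hold N fixed and keep growing D, and the B/D^β term shrinks toward zero, so loss flattens out at the irreducible floor E. Each new scraped token buys you less than the last.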
II) Post-training (Where the real gains are now):
This is the "finishing school" for models, and it's where AI labs are investing heavily.
Since raw internet data is tapped out, the only way to make models smarter, more accurate, and better at reasoning is to use high-quality, expert-guided feedback. This isn't just data scraping; it's data creation.
This includes all the terms you hear thrown around (I'll sketch what these records actually look like right after this list):
RLHF (Reinforcement Learning from Human Feedback): The classic "was this response better or worse?" loop, but done by experts.
Expert Prompt-Response Pairs: Getting a PhD in physics to write a perfect, step-by-step answer to a complex problem, which is then used as a gold-standard fine-tuning example.
Preference Ranking Data: Showing the model two answers to a tricky legal or medical question and having an actual lawyer or doctor pick the better one.
Annotated Trajectories: This one is super important for reasoning. It means recording an expert as they solve a multi-step problem (like debugging code or doing a complex financial analysis) and teaching the model to replicate that entire reasoning path, not just the final answer.
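To make those terms concrete, here's a minimal sketch of what such records tend to look like in practice. Every field name and value below is a hypothetical illustration, not any lab's actual schema:

```python
# Hypothetical post-training data records; field names and contents
# are illustrative, not any specific lab's schema.

# Expert prompt-response pair (gold-standard supervised fine-tuning example):
sft_example = {
    "prompt": "Derive the energy levels of a particle in a 1D infinite square well.",
    "response": "Step 1: Start from the time-independent Schrodinger equation...",
    "annotator": {"credential": "PhD, Physics", "verified": True},
}

# Preference ranking record (used to train a reward model):
preference_example = {
    "prompt": "Is this non-compete clause enforceable in California?",
    "chosen": "Likely unenforceable: California generally voids non-compete clauses...",
    "rejected": "Yes, non-competes are always enforceable.",
    "annotator": {"credential": "JD, licensed attorney"},
}

# Annotated trajectory (the full multi-step reasoning path, not just the answer):
trajectory_example = {
    "task": "Debug a failing unit test in a payment service",
    "steps": [
        {"action": "read_stack_trace", "observation": "KeyError: 'currency'"},
        {"action": "inspect_fixture", "observation": "test fixture lacks a 'currency' field"},
        {"action": "edit_code", "rationale": "default to 'USD' when the field is absent"},
        {"action": "rerun_tests", "observation": "all tests pass"},
    ],
    "final_answer": "Missing 'currency' key in the test fixture; added a default.",
}
```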
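And for the preference-ranking piece specifically, here's roughly what training on those records looks like. A minimal sketch assuming PyTorch; `reward_model` is a hypothetical stand-in for any network that scores a (prompt, response) pair, and the objective is the standard pairwise Bradley-Terry loss from the RLHF literature, not any lab's actual code:

```python
import torch
import torch.nn.functional as F

def reward_loss(reward_model, prompt, chosen, rejected):
    # Score both answers with the (hypothetical) reward model.
    r_chosen = reward_model(prompt, chosen)      # scalar score
    r_rejected = reward_model(prompt, rejected)  # scalar score
    # Push the expert-preferred answer to outscore the rejected one:
    # minimize -log(sigmoid(r_chosen - r_rejected)).
    return -F.logsigmoid(r_chosen - r_rejected)
```

The intuition: the model learns to assign the lawyer-approved answer a higher score than the rejected one, and that scalar reward is what steers the later RL step.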
If you want to go deep on this, the GPQA paper (a benchmark of graduate-level, "Google-proof" expert questions) is a fantastic read. It shows how even frontier models struggle with questions written by domain PhDs, which is exactly the kind of gap this expert data is meant to close.
This whole experience convinced me that the biggest bottleneck in AI is no longer just compute; it's access to a scalable network of actual experts who can generate this data.
So, I'm building a project to tackle exactly this: datagraph.in
The goal is to connect AI labs directly with an engaged community of university students, PhD candidates, and verified domain experts to create the high-quality, bespoke post-training data they need.
If you're at an AI lab or on a team that's struggling with scaling your data quality for post-training workflows, I'd love to chat.
Feel free to DM me here or shoot me an email at saurav@xagi.in.
I'll leave you with my favorite quote from this whole journey, via OpenAI's CTO:
"The model of today is the worst model you will ever use."