
AI’s new motto: Fake it till you make it, with synthetic data

At CES 2025, Elon Musk revealed that AI companies are running out of training data, but they have a plan.

AI Logs
Sejal Sharma is IE’s AI columnist, offering deep dives into the world of artificial intelligence and its transformative impact across industries. Her bi-monthly AI Logs column explores the latest trends, breakthroughs, and ethical dilemmas in AI, delivering expert analysis and fresh insights. To stay informed, subscribe to our AI Logs newsletter for exclusive content.

For millennia, farmers have grown their crops on the land, relying on rich soil, plentiful groundwater, and predictable seasons. It's the magnificence of nature's cycle: the soil gives nutrients to the crop, then replenishes itself to sustain a second harvest. Then a third. And a fourth.

Big technology companies are like the farmers. The artificial intelligence (AI) models they develop are like plants. And the soil is quality data. These companies train their AI models on vast amounts of data that provide the essential ‘nutrients,’ if you will, for their advancement.

Just as farmers must care for and replenish the soil to ensure a second harvest, technology companies must responsibly manage and refine their data sources so that innovation, whether the leap from AI to artificial general intelligence (AGI) or a cure for AIDS, doesn't stall.

If the soil is overused, crops wither. Likewise, when data is outdated or poorly handled, AI models falter. Thankfully, we humans have not yet run out of soil, but it seems we are running out of quality data to train AI models. 

Elon Musk recently made this very point at the Consumer Electronics Show (CES) 2025. He said, "We have exhausted basically the cumulative sum of human knowledge in AI training. That was basically last year."

If that's true, it's no longer just about finding data; it's about creating it. Musk's thinking aligns with that of several other prominent academics and scientists in AI. Ilya Sutskever, OpenAI's co-founder and former chief scientist, said in December last year, "We've achieved peak data, and there'll be no more. We have to deal with the data that we have. There's only one internet."

True, there is only one internet, and its stock of usable training data is expected to be exhausted between 2026 and 2032, according to research by the AI forecasting organization Epoch, which estimated how much text data is currently available and how much future models will need for training.

By the time 2026 or 2032 rolls around (a six-year window that doesn't inspire much confidence), a lot of new data will have been created on the internet.

The volume of data created, captured, and replicated on the internet in 2024 was 149 zettabytes. By 2026, this number is expected to increase to 221 zettabytes. To put things in perspective, a zettabyte (ZB) is equal to 1 trillion gigabytes (GB). But coming back to the Epoch researchers' point: even this growth through 2026 and beyond will not be enough to satiate the needs of AI models.

These zettabytes do not include the data that lies in the deep web, which accounts for about 90 percent of the internet. Most general-purpose AI, including language models, is limited to the publicly accessible surface web.

The problem with training on deep web data is that a lot of it is deeply problematic. Think illegal porn, extreme ideologies, human trafficking, and other illegal activities. At a time when AI models are suggesting that we add glue to our pizza if our cheese doesn't stick, let's not mix things up with the deep web, shall we?

There is a solution being floated for when there's no more new human text to use. It's called synthetic data: artificial data generated by computers that looks and behaves like real data. For example, instead of training an AI model on real photos of people, you feed it computer-generated images of faces.
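To make the idea concrete, here is a minimal sketch of one simple synthetic-data technique: fit a statistical distribution to a small "real" dataset, then sample new records from it. The field names, values, and the per-field Gaussian model are all illustrative assumptions, not anything the companies mentioned above actually use; real synthetic-data pipelines typically rely on generative models far more sophisticated than this.

```python
import random
import statistics

# A tiny stand-in for "real" data: (age, income) records.
# In practice this would be sensitive or scarce data you cannot reuse freely.
real_records = [(34, 52000), (29, 48000), (41, 61000), (37, 55000), (45, 67000)]

def fit_gaussian(values):
    """Estimate the mean and standard deviation of one field."""
    return statistics.mean(values), statistics.stdev(values)

def generate_synthetic(records, n, seed=0):
    """Sample n synthetic records from per-field Gaussians fitted to the real data."""
    rng = random.Random(seed)
    fields = list(zip(*records))                 # transpose: one tuple per field
    params = [fit_gaussian(f) for f in fields]   # (mean, stdev) per field
    return [tuple(rng.gauss(mu, sigma) for mu, sigma in params)
            for _ in range(n)]

# 1,000 synthetic (age, income) pairs, with no real person behind any of them.
synthetic = generate_synthetic(real_records, 1000)
print(len(synthetic))  # → 1000
```

The sketch also hints at the bias problem discussed below: the synthetic records can only ever reflect the statistics of the five real records they were fitted to, so any skew in the original data is faithfully reproduced.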

It's a bit like farmers turning to hydroponics when overpopulation and urbanization degraded the soil. Hydroponic crops helped fill the gap, but they raised new questions about whether "man-made" food can really be called food. And that was my last farming analogy for the day. I promise.

Big tech is now forced to rethink its approach, turning to synthetic data to fill the gap. The appeal of synthetic data also lies in its ability to sidestep issues like privacy violations or training on copyrighted content. Companies like OpenAI are already embroiled in 13 lawsuits across the globe, accused of violating copyright laws by training their models on content they haven't paid for or sought permission to use.

Synthetic data, however, is not without its issues. One big problem is that it can reinforce existing biases. If the original data it's based on is biased, the synthetic data will be biased, too, and that can lead AI to make unfair or skewed decisions.

Another issue is that synthetic data can make AI “hallucinate,” essentially creating answers that sound right but are totally made up. 

“How do you know if it’s the hallucinated answer or it’s a real answer? It’s challenging to find the ground truth,” asked Musk. Musk’s worry about telling the difference between real and fake answers sums up a big problem: how can we trust AI when the data it’s trained on might not even be accurate?

Despite these concerns, for now, synthetic data might be the best option for companies racing to build more advanced AI systems like AGI or artificial superintelligence (ASI).

Other solutions

Logically, another way to address the shortage of high-quality data is to create more human-generated content.

With its deep pockets, big tech could simply hire a thousand writers to produce human-generated data every day and train its AI models on their output. If 1,000 writers are paid $50,000 annually, the cost is $50 million a year. Big tech can presumably cough up that kind of money, but the approach would still be too costly and time-consuming.

Another solution could be to bring all the offline data online. Think books, academic work, old computer data, and handwritten notes that have been hidden in forgotten drawers or filed away in dusty boxes. This data isn’t currently accessible on the internet, yet it may hold valuable insights that could drive innovation, academic research, or even commercial success.

Converting these paper records into digital format would require significant effort: scanning, transcribing, and deciphering handwriting. But the potential benefits would be enormous. By digitizing this information, we could unlock previously untapped data that could enhance AI models, improve historical accuracy, and help solve modern challenges with a better understanding of the past.

But here is the issue: converting paper-based data into a usable digital form doesn’t come free. This feat would require lots of money, time, and human resources, especially if the data is scattered across physical archives or is in different languages. For an organization taking on this task, there’s the further question of how to monetize the project. Will they release it freely to the public? Or will they charge users to access it?

“AI will do anything you want and suggest things you have never even thought of…in max three to four years,” said Musk at CES 2025. The next few years will likely determine whether we have ‘faked it’ well enough to ‘make it’ to a future where AI can genuinely change the world for the better.


Sejal is a Delhi-based journalist, currently dedicated to reporting on technology and culture. She is particularly enthusiastic about covering artificial intelligence, the semiconductor industry and helping people understand the powers and pitfalls of technology. Outside of work, she likes to play badminton and spend time with her dogs. Feel free to email her for pitches or feedback on her work.