A hot potato: After scraping the entire web to build their generative models, AI companies are now working on a new training paradigm based on computer-made data. Digital synthesis, it seems, is better than human-made content for AI evolution. And it should pose no copyright or privacy infringement issues.

An AI feedback loop is threatening to destroy the future of generative AI algorithms, so big tech corporations are scrambling for a solution that could feed LLMs the right data to grow and evolve. The future of AI training seemingly hinges on "synthetic data," which is a less onanistic way to say that algorithms should talk to each other if they want to keep a sane (digital) mind.

According to a recent report by the Financial Times, Microsoft, OpenAI, and LLM startup Cohere are among the companies already testing the use of the aforementioned synthetic data. Unlike "natural" information provided by mere humans, synthetic data is generated by a computer algorithm, while human supervisors provide feedback and fill the gaps, a process known as reinforcement learning from human feedback (RLHF).

With generative AI algorithms becoming increasingly sophisticated, even the richest AI companies (Microsoft, Google, etc.) have no easy way to get new "quality" content to keep training their large language models (LLMs). According to Cohere CEO Aidan Gomez, the web is "so noisy and messy" that it cannot possibly provide the data AI companies need.

Gomez said that to improve the performance of today's LLMs in tackling science, healthcare, or business challenges, training efforts will require "unique and sophisticated datasets" created by world-class experts. However, this kind of human-created data is "extremely" expensive, so AI companies are employing AI algorithms to… train AI algorithms.

Basic AI models are already being developed with the sole purpose of outputting text, code, or other "complex" information related to healthcare or financial fraud. This "synthetic" information can in turn be used to train a new generation of advanced LLMs, providing customers with even more "intelligence" and text-generation proficiency.
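As a concrete illustration, a synthetic-data pipeline of this kind could be as simple as the sketch below: an existing "teacher" model writes labeled examples about financial fraud, which are saved to a JSONL file for training a newer model. The API client, model name, prompt, and file schema are assumptions made for illustration; none of the companies mentioned has published its actual pipeline.

```python
# Hypothetical sketch: an existing model generates "synthetic" labeled
# examples that can later be used to train a newer model. The model name,
# prompt, and JSONL schema are illustrative assumptions, not a documented
# pipeline from Microsoft, OpenAI, or Cohere.
import json
from openai import OpenAI  # any OpenAI-compatible chat API would work here

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o-mini"  # placeholder "teacher" model

PROMPT = (
    "Write a short, realistic description of a financial transaction, "
    "then on a new line write either 'LABEL: fraud' or 'LABEL: legitimate'."
)

with open("synthetic_fraud.jsonl", "w", encoding="utf-8") as f:
    for _ in range(100):  # number of synthetic examples is arbitrary
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,  # higher temperature for more varied examples
        )
        text = resp.choices[0].message.content
        # Each line becomes one candidate training record; humans can
        # spot-check these before any fine-tuning run.
        f.write(json.dumps({"text": text}) + "\n")
```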

Gomez said that Cohere is working on an AI model for advanced mathematics, with two distinct models talking to each other, one acting as a math tutor and the other as a student. The two models have a "conversation about trigonometry," Gomez said, and it's all synthetic. Humans can later check whether the models said something wrong or made something up entirely.
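A rough sketch of that tutor/student setup might look like the snippet below, which alternates turns between two differently prompted models and keeps the transcript as one synthetic training example. It assumes an OpenAI-compatible chat API; the model name, prompts, and turn count are invented for illustration, since Cohere has not published its implementation.

```python
# Hypothetical sketch of "two models talking to each other" to produce
# synthetic tutoring data. Assumes an OpenAI-compatible chat API; the
# model, prompts, and number of turns are illustrative, not Cohere's setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o-mini"  # placeholder model name

TUTOR_PROMPT = "You are a patient trigonometry tutor. Explain step by step."
STUDENT_PROMPT = "You are a student learning trigonometry. Ask short questions."

def reply_as(system_prompt: str, transcript: list[tuple[str, str]], speaker: str) -> str:
    """Generate the next turn for `speaker`, framing the other party's lines as 'user'."""
    messages = [{"role": "system", "content": system_prompt}]
    for who, text in transcript:
        role = "assistant" if who == speaker else "user"
        messages.append({"role": role, "content": text})
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    return resp.choices[0].message.content

transcript: list[tuple[str, str]] = [("student", "Why does sin^2(x) + cos^2(x) equal 1?")]
for _ in range(3):  # a few tutor/student exchanges
    transcript.append(("tutor", reply_as(TUTOR_PROMPT, transcript, "tutor")))
    transcript.append(("student", reply_as(STUDENT_PROMPT, transcript, "student")))

# The transcript becomes one synthetic training example; a human reviewer
# later checks it for errors or made-up "facts" before it is used.
for who, text in transcript:
    print(f"{who.upper()}: {text}\n")
```

The trick in this kind of setup is that each model sees the other's messages as the "user" role, so both sides stay in character throughout the exchange while the full conversation is captured as training data.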

AI models talking to each other could also solve the increasingly troubling privacy and copyright issues faced by LLM corporations like OpenAI. Well-crafted synthetic datasets could remove biases and imbalances in existing data, said Ali Golshan, CEO of AI startup Gretel, though he concedes that purely synthetic training could impede progress as well. The web is already littered with AI-generated information, which will lead to chatbot degradation and "regurgitated knowledge" over time, completing the AI feedback loop mentioned above.