The European Union has introduced new legislation requiring artificial intelligence (AI) companies to be more transparent about the data used to train their systems, aiming to lift the veil on one of the industry’s most closely guarded secrets.
This move comes amid a surge in public engagement and investment in generative AI, driven by Microsoft-backed OpenAI’s release of ChatGPT 18 months ago.
Generative AI, which can rapidly produce text, images, and audio content, has raised significant questions about data sourcing. Concerns have focused particularly on whether using copyrighted materials, such as bestselling books and Hollywood movies, without permission constitutes copyright infringement.
The EU’s AI Act, recently passed and set to be phased in over the next two years, mandates that companies deploying general-purpose AI models provide ‘detailed summaries’ of their training data.
The specifics of these summaries are yet to be finalized, with the newly established AI Office planning to release a template following stakeholder consultations in early 2025.
This requirement has met with resistance from AI companies, which argue that revealing their datasets would compromise trade secrets and give competitors an unfair advantage.
The level of detail in these transparency reports could significantly impact both small startups and major tech companies like Google and Meta, which are heavily invested in AI technology.
Over the past year, companies such as Google, OpenAI, and Stability AI have faced lawsuits from creators alleging unauthorized use of their content for AI training.
In the United States, President Joe Biden has issued executive orders addressing AI security risks, but copyright issues remain largely untested.
However, there is growing bipartisan support in Congress for requiring tech companies to compensate rights holders for their data.
In response to increased scrutiny, several tech companies have signed content-licensing agreements with media outlets.
OpenAI, for example, has deals with the Financial Times and The Atlantic, while Google has partnered with News Corp and Reddit.
Despite these efforts, OpenAI faced criticism in March when its CTO, Mira Murati, declined to confirm whether YouTube videos were used to train its video-generating tool Sora, citing terms and conditions.
The issue of AI-generated content mimicking real voices has also sparked controversy. Last month, OpenAI was criticized for using a voice similar to actress Scarlett Johansson’s in a demonstration of the latest version of ChatGPT, raising further ethical and legal questions.
European lawmakers remain split on the issue. Dragos Tudorache, a key figure in drafting the AI Act, insists that AI companies should publicly disclose their datasets so that creators can find out whether their work was used.
France, under President Emmanuel Macron, has privately opposed regulations that could stifle the competitiveness of European AI startups.