Image of a computer with a vast landscape behind it, metaphorically questioning the limits of training generative AI.

Training Generative AI: How Much Content is up for Grabs? 

Generative AI has seen explosive growth over the past several months. Along the way we've witnessed the incredible capabilities of sophisticated AI models, and with those capabilities have come a host of questions about ethics, privacy, and shifting norms. One issue of particular concern is copyright: from infringement of existing copyrighted material to determining ownership of AI-generated works, everything becomes more complicated. And when it comes to training generative AI, there's disagreement about what's up for grabs.

Google's recent statement to the Australian government aims to turn copyright law on its head: rather than needing to obtain permission from copyright holders, Google asserts that developers training AI models should be able to use content by default unless creators opt out. Such a move would put content creators on the defensive and upend the way copyright works. Is this the direction we want our AI leaders to take us? 

Data Sets for Training Generative AI Models 

Google’s stance rests on the reality that generative AI models need vast amounts of data to train on, especially if they’re going to continue to improve. Where should this training data come from? Of course, giving models access to all data available online makes the path easier for those training AI.  

But many of the people who created that content wouldn't want AI models scraping it and eventually drawing on it for their outputs. Allowing access by default shifts the responsibility onto copyright holders to opt out, creating more work and worry for them. Moreover, those who don't opt out can have their copyrighted material used endlessly without any payment whatsoever. And if generated content too closely resembles an original work, the creator's legal protections, like the ability to sue, can be weakened. 

Google isn't the only company thinking along these lines; the opt-out idea follows the approach Stability AI set up for Stable Diffusion at the end of 2022. But in many cases opting out isn't simple, especially when content has already been used in training and untangling its influence afterward becomes a burdensome task. 
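In practice, one of the few opt-out mechanisms available to site owners today is the robots.txt file, which tells well-behaved crawlers what they may fetch. The sketch below uses Python's standard-library `urllib.robotparser` to show how such a rule would be interpreted; the crawler name `ExampleAIBot` is hypothetical, standing in for whatever user agent an AI company's scraper announces.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt in which a site owner opts out of an AI training
# crawler (ExampleAIBot) while still allowing all other crawlers.
robots_txt = """
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

# The AI crawler is blocked; an ordinary search crawler is not.
print(parser.can_fetch("ExampleAIBot", "https://example.com/article"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/article"))     # True
```

Of course, this only works for crawlers that honestly identify themselves and respect the file, and it does nothing for content that was scraped before the rule was added, which is exactly the burden described above.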


Fair Use? 

Courts are already hearing cases in which artists claim that AI companies are using millions of works without a license to do so. Such cases hinge on the interpretation of intellectual property law, particularly the fair use doctrine, which leaves plenty of room for debate. Using “limited portions of a work…for purposes such as commentary, criticism, news reporting, and scholarly reports” is allowed, but no specific percentage of a work is laid out.  

When it comes to AI, a business that knowingly uses training data with unlicensed works can be held responsible if the circumstances aren’t deemed fair use, potentially having to pay large amounts in damages. So if a company is working with AI, it’s in its best interest to ensure that the training data for its models is licensed and fairly obtained. 


Moving Forward with Generative AI Ethically and Responsibly 

To successfully navigate this new terrain, where online content now flows through AI, everyone has a role to play: AI developers, content creators, and businesses all need to do their part to keep practices accountable and fair. Transparency is key: tracking content use with metadata tags could create a verification system similar to information-authenticity trails. (Something like Project Origin, which creates a digital “chain of trust” for online media, could be applied to generative AI to provide transparency about the training data that has been used.) Considering the number of people and the amount of data affected, it's also better policy for AI companies to obtain opt-in permission, much as Creative Commons licenses let creators grant permissions up front, than to require opting out. 
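The opt-in idea above can be made concrete with a small filter over content metadata. This is a minimal sketch under an assumed record format: each item carries a machine-readable `license` tag, and only items whose creators granted permission via an allow-listed license make it into a training set. The field names and license list are illustrative, not a real standard.

```python
# Illustrative allow-list of licenses that signal an explicit grant of reuse.
ALLOWED_LICENSES = {"CC-BY-4.0", "CC-BY-SA-4.0", "CC0-1.0"}

def opted_in(record: dict) -> bool:
    """Include a record only if its metadata shows an explicit permission."""
    return record.get("license") in ALLOWED_LICENSES

corpus = [
    {"id": 1, "license": "CC-BY-4.0", "text": "An openly licensed essay."},
    {"id": 2, "license": None, "text": "All rights reserved; no opt-in."},
    {"id": 3, "license": "CC0-1.0", "text": "A public-domain dedication."},
]

# Only explicitly permitted records survive the filter.
training_set = [r for r in corpus if opted_in(r)]
print([r["id"] for r in training_set])  # [1, 3]
```

The key design point is the default: anything without a recognizable permission tag is excluded, which is the opposite of the opt-out regime Google proposes.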

All of this depends heavily on regulation and on voluntary participation by those driving AI forward. Google and other tech giants have recently pledged to uphold best practices for secure and trustworthy AI systems. Consumers, creators, and businesses are relying on these powerful companies to stand by those voluntary commitments in order to build a positive future with AI.  

At Infused Innovations, we strive for transparency and ethical use of AI. When you adopt AI for your business, we can guide you through the steps necessary to ensure that you’re protecting your customers’ rights as well as your own legal standing. We can help you build AI models with data that you’re in control of. Contact us to share your thoughts on this important topic and discuss how we can work together to use AI responsibly to achieve your best potential. 
