Harmonizing Creation and Compensation: The Future of Open Source AI in a Copyrighted World
How can Open Source AI models remain competitive in an environment that requires copyrighted material for training?
At the forefront of AI are foundation models and Generative AI (GenAI) models that demonstrate amazing skill at creating novel output derived from their training data. GenAI is trained on a huge corpus of data (text, images, videos, music), both to learn to express itself in a human-like manner appropriate for the culture and to learn information about a wide variety of topics. AI training data is not mined for its discrete, expressive content. GenAI is able to produce new content similar to existing material and can be prompted, unintentionally or intentionally, to reproduce content on which it was trained.
Content creators and copyright holders are concerned that GenAI models train on copyrighted material and reproduce it in their output, infringing on the holders' rights without license or compensation. The AI industry generally believes that AI models should be permitted to train on any publicly available material, much the same way that humans learn from the entire culture: reading library books, watching TV shows, and listening to music. AI models do not reproduce exact training data sets for unrelated queries, but their ability to combine, derive, and sometimes recall portions of the training corpus changes the presumptions behind that analogy.
Open Source Software proved to be a huge benefit to the technology industry by encouraging innovation and leveraging the skills of a vast pool of talent around the world. Open Source AI will provide fertile ground for a similar impact. AI entrepreneurs need the same access to and benefits from Open Source AI models that Open Source Software provided. The largest and best AI models are trained on copyrighted material. If access to the best models and best training data is limited to closed models or models with restrictions on use, progress in the AI industry could be strangled.
The OSI (Open Source Initiative) AI license definition will guarantee commercial use (see the Open Source AI Draft here). We have analyzed the challenges associated with the Open Source AI definition in our previous blog. If so-called Open Source AI models are trained on copyrighted material, it might not be possible to use them for commercial purposes, because the models are derivative works. The small and medium businesses building new products around AI models don't have the resources to seek out and license each piece of training material independently. The large companies that create state-of-the-art Open Source foundation AI models may license the training material for their own commercial use, but they are not necessarily going to acquire the rights for all downstream users.
We would like to explore the challenge of copyrighted material in AI by comparing it with the challenge that patents posed for the technology sector.
The Open Invention Network and the non-assertion guarantees of corporate participants provide patent protection, but there is nothing equivalent for training data, and the copyright owners have different motivations than tech companies. For software development, the tech companies both owned the patents and directly benefited from the innovation of Open Source Software, so they could trade off the benefits derived from the different assets in their corporate portfolios, especially in collaboration with other tech companies. For Open Source AI models, the copyrighted training data is owned by, and benefits, different entities than the AI models. The owners of the copyrighted data need to be compensated in some manner, whether through licenses, royalties, or outright purchase of the rights.
The Open Source Software community has created patent pools that mostly eliminate concerns about patent infringement, but the major tech companies own the patents, contribute to Open Source Software, and use Open Source Software commercially, so they can balance the costs and benefits amongst themselves. Copyrighted training data is mostly owned by third parties outside of the tech industry, who do not gain an equivalent benefit from allowing free use of their data.
For over a hundred years, music performances have been managed by rights organizations, including legislated compulsory licensing requirements. This provides a guaranteed framework and centralized clearance process for music performances. We propose to use an equivalent concept for data used and reproduced by Generative AI. This associates payment with performance, when revenue is generated from the operation of an AI model.
The music industry created a system of rights management organizations (performance rights, mechanical rights, synchronization rights) to track the uses of copyrighted material, collect payments, and distribute royalties to composers, performers, and publishers. Expanding on the historical agreements for performances, modern digital services such as Spotify, Apple Music, and YouTube have agreements with the rights management organizations to provide payments. Google's YouTube has created a system called Content ID to recognize the reproduction of copyrighted material and pay on behalf of its users who publish content that includes copyrighted material. While the content creators on YouTube are the ones performing the infringement, the YouTube platform hosts the content and the revenue-generating advertising and operates at scale, so it is in the best position to implement and manage the remedy.
Advantages of a performance-based approach for AI model royalties:
It bypasses the arguments about paying for training.
It aligns the royalties with performance, as is common practice in the entertainment industry.
It pairs royalty payments with revenue generation.
It provides stable, predictable costs for AI publishers.
It associates payment with the actual use of data, not the size of the model or the amount of training data.
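As a rough illustration of how payment could follow performance rather than training, the sketch below splits a fixed share of an AI service's revenue among rights holders in proportion to how often their works were matched in generated output. The royalty rate, the function name distribute_royalties, the holder names, and the usage counts are all hypothetical assumptions for illustration and do not describe any existing rights-management scheme.

```python
from decimal import Decimal, ROUND_HALF_UP

def distribute_royalties(revenue, royalty_rate, usage_counts):
    """Split a royalty pool among rights holders pro rata by usage.

    revenue       -- revenue generated by the AI service in a period (assumed)
    royalty_rate  -- fraction of revenue set aside for royalties (assumed)
    usage_counts  -- hypothetical count of generated outputs matched to each holder
    """
    pool = Decimal(str(revenue)) * Decimal(str(royalty_rate))
    total_uses = sum(usage_counts.values())
    cent = Decimal("0.01")
    if total_uses == 0:
        # Nothing was matched in this period, so nothing is owed.
        return {holder: Decimal("0.00") for holder in usage_counts}
    return {
        holder: (pool * count / total_uses).quantize(cent, rounding=ROUND_HALF_UP)
        for holder, count in usage_counts.items()
    }

# Hypothetical period: 10,000 in revenue with a 5% royalty rate.
payments = distribute_royalties(
    10000, "0.05", {"label_a": 60, "author_b": 30, "studio_c": 10}
)
for holder, amount in payments.items():
    print(holder, amount)
```

Because the pool is a fixed fraction of revenue, the publisher's cost stays predictable, and each holder's payment depends only on measured use, not on model size or the volume of training data.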
If the entertainment industry is too intransigent in its negotiating position, the AI industry will either wait it out and win through attrition or adopt alternative training methods that avoid copyright infringement. The big tech companies dominate social media and can shift attention towards or away from existing celebrity and entertainment brands. If the brands atrophy and decay, the importance of the copyrighted material diminishes. The AI industry can create and promote personalized entertainment content that is provably not derivative of copyrighted material. While the entertainment industry is show business, the AI industry is also a business, and it can be motivated to innovate around technical hurdles for business reasons.
If the major tech companies do license copyrighted material for training, that may lock out Open Source AI models and inhibit innovation. The Open Source AI models could be used for experiments and education, but not commercially. The creators of Open Source AI models trained on copyrighted material may license the training data for their own use, but they are not going to license the data for all potential downstream users and don’t have the equivalent of a patent pool for indemnification.
Although most of the content creators in the entertainment industry are averse to the compulsory licensing framework used in the music industry, it does provide a simplified way for small downstream developers of applications based on Open Source AI to compete on a more level playing field with major tech companies. The large tech companies and large copyright catalog owners can continue to negotiate individual agreements.
Or maybe the tech companies grow sufficiently large to purchase the vast majority of the back catalogs to mine for training data, thereby bringing the cost-benefit under the same corporate structure.
There is always the option to build the model on licensed data, but this is something that only a few companies can do, and it is probably close to impossible in Open Source settings. There are a few examples where large companies building LLMs (Large Language Models) have licensed parts of the data (e.g., Reddit's AI content licensing deal with Google).
One company that takes the licensing approach all the way is Bria.ai. Bria is a visual Generative AI company done right. They licensed all the images used to train their models. On top of using a licensed dataset, Bria also built an attribution engine, fostering a sustainable ecosystem that benefits all. The attribution engine analyzes the original images and their impact on the generated results, and Bria then compensates those creators. But even Bria cannot provide its models as pure Open Source, because users are required to run the attribution agent in order to compensate the creators.
Commoditized computer hardware and Open Source Software democratized software development, which spurred a huge amount of innovation and efficiency that benefited the entire world. A few deep-pocketed tech companies and startups could negotiate individual deals with the owners of the large back catalogs of copyrighted material from the entertainment industry for their closed AI models. A new, democratized, dynamic, and innovative industry based on Generative AI requires a flexible, competitive, and scalable licensing model for state-of-the-art training data. A royalty model based on similarity and interpretability of Generative AI model output is a possible outline of a solution for the unprivileged user and developer community.
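To make the idea of a similarity-based royalty a little more concrete, the following is a minimal sketch that compares a generated output against a catalog of works and assigns royalty shares to the works it resembles most. It assumes that outputs and catalog works can each be represented as numeric feature vectors (embeddings); the function names, the toy vectors, and the 0.8 threshold are hypothetical choices for illustration, not a description of any deployed attribution system.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def attribution_shares(output_vec, catalog, threshold=0.8):
    """Return each catalog work's share of the royalty for one generated output.

    Only works whose similarity to the output meets the (assumed) threshold
    participate; shares are normalized so that they sum to 1.
    """
    scores = {work: cosine_similarity(output_vec, vec) for work, vec in catalog.items()}
    matched = {work: s for work, s in scores.items() if s >= threshold}
    total = sum(matched.values())
    if total == 0:
        return {}  # The output is not close enough to any catalog work.
    return {work: s / total for work, s in matched.items()}

# Toy two-dimensional "embeddings" for three hypothetical works.
catalog = {"song_a": [1.0, 0.0], "song_b": [0.0, 1.0], "song_c": [1.0, 1.0]}
print(attribution_shares([1.0, 0.0], catalog))
```

In practice the hard part is the representation itself: the shares are only as meaningful as the similarity measure behind them, which is why interpretability of model output matters for this kind of scheme.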
Navigating the intersection of Open Source AI and digital rights management requires innovative and collaborative solutions. By integrating the adaptability of Open Source models with equitable compensation systems inspired by the music industry, we can respect content creators' rights while maintaining the collaborative spirit essential for AI's progress. Embracing compensation tied to the AI inference and output generation phase (performance) offers a path forward that balances innovation with fairness, ensuring AI development remains open and accessible. As we continue to explore AI's possibilities, fostering a sustainable ecosystem that respects both technological advancement and copyright is crucial. The LF AI & Data Generative AI Commons and the LF Academy Software Foundation provide a unique forum to continue this conversation. Together, we can achieve a future where Open Source AI flourishes alongside the creative works that inspire it.