More ethical AI? Fairly Trained launches to certify gen AI tools trained on licensed data

It is in some ways the “original sin” of generative AI: many of the leading models from the likes of OpenAI and Meta have been trained on data scraped from the web without the prior knowledge or express permission of those who posted it.

AI companies that took this approach argue it is fair game and legally permissible. As OpenAI put it in a recent blog post: “Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents. We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness.”

Indeed, the same type of data scraping occurred long before generative AI became the latest tech sensation, and it powered many research databases and popular commercial products, including the very search engines, such as Google, that data posters relied on to drive traffic and audiences to their projects.

Nonetheless, there is growing and vocal opposition to this type of data scraping, with numerous best-selling authors and artists suing various AI companies for allegedly infringing their copyrights by training on their work without express consent. (VentureBeat uses some of the companies being sued, including Midjourney and OpenAI, to create header artwork for our articles.)

Now a new organization has emerged to support those who believe data creators and posters should be asked for consent before their work is used in AI training.

Called “Fairly Trained,” the non-profit announced its launch today. It is co-founded and led by CEO Ed Newton-Rex, a former Stability AI employee turned vocal critic of the company, which is behind the widely used Stable Diffusion open source image generation model, among other AI models.

“We believe there are many consumers and companies who would prefer to work with generative AI companies who train on data provided with the consent of its creators,” reads the organization’s website.

Respectful AI?

“I firmly believe there is a path forward for generative AI that treats creators with the respect they deserve, and that licensing training data is key to this,” Newton-Rex wrote in a post on the social network X. “If you work at or know a generative AI company that takes this approach, I hope you’ll consider getting certified.”

VentureBeat reached out to Newton-Rex over email and asked him about the common argument from leading AI companies and proponents that training on publicly available data is analogous to what human beings already do passively when observing other works of art and creative material that may later inspire them — consciously or otherwise. He wasn’t having it. As he wrote in response:

“I think the argument is flawed for two reasons. First, AI scales. A single AI, trained on all the world’s content, can produce enough output to replace the demand for much of that content. No individual human can scale in this way. Second, human learning is part of a long-established social contract. Every creator who wrote a book, or painted a picture, or composed a song, did so knowing that others would learn from it. That was priced in. This is definitively not the case with AI. Those creators did not create and publish their work in the expectation that AI systems would learn from it and then be able to produce competing content at scale. The social contract has never been in place for the act of AI training. AI training is a different proposition from human learning, based on different assumptions and with different effects. It should be treated as such.”

Fair enough. But what about companies that have already trained on data publicly posted online?

Newton-Rex advises that they change course and train new models on data obtained with creators’ permission, ideally by licensing it from them, potentially for a fee. (This is an approach OpenAI has lately adopted with news outlets, including The Associated Press and Axel Springer, the publisher of Politico and Business Insider; OpenAI is reportedly paying millions of dollars annually for the privilege of using their data. However, OpenAI has continued to defend its right to collect and train on public data it scrapes even without licensing deals in place.)

“My only suggestion is that they [AI companies generally] change their approach, and move to a licensing model. We are still early in the evolution of generative AI, and there is still time to help contribute to creating an ecosystem in which the work that human creators and AI companies do is mutually beneficial,” Newton-Rex wrote us.

Certification — for a fee

Fairly Trained elaborated on the motivations behind its founding in a blog post:

“There is a divide emerging between two types of generative AI companies: those who get the consent of training data providers, and those who don’t, claiming they have no legal obligation to do so. We know there are many consumers and companies who would prefer to work with the former, because they respect creators’ rights. But right now it’s hard to tell which AI companies take which approach.”

In other words: Fairly Trained still wants people to be able to use generative AI tools and services. The organization simply wants to help consumers find and choose tools trained on data expressly licensed to AI companies for that purpose, as opposed to tools trained on whatever was scraped from the public web.

To help consumers make this type of informed decision, Fairly Trained offers a “Licensed Model (L) certification for AI providers.”

The Licensed Model (L) certification process is outlined on the Fairly Trained website: an AI company fills out an online form, then completes a longer written submission to Fairly Trained and answers any follow-up questions.

Fairly Trained charges companies seeking L certification a fee on a sliding scale based on annual revenue, ranging from a one-time submission fee of $150 plus $500 annually at the low end to a one-time fee of $500 plus $6,000 annually for companies with annual revenue above $10 million.
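For illustration, here is a minimal, hypothetical Python sketch of that sliding scale. Only the two endpoint tiers named above come from the article; the function name, the single $10 million threshold, and the collapse of everything below it into the lowest tier are illustrative assumptions, not Fairly Trained’s actual published schedule.

```python
# Hypothetical sketch of Fairly Trained's sliding-scale L certification fees.
# Only the two endpoint tiers below are stated in the article; the scale's
# intermediate tiers and exact revenue thresholds are not specified here.

def l_certification_fees(annual_revenue_usd: float) -> dict:
    """Return the (assumed) one-time submission fee and annual fee in USD."""
    if annual_revenue_usd > 10_000_000:
        # Top tier stated in the article: annual revenue above $10 million.
        return {"one_time_submission_fee": 500, "annual_fee": 6_000}
    # Bottom tier stated in the article. Real intermediate tiers exist on
    # the sliding scale but are not given here, so this sketch collapses
    # everything under $10 million into the lowest tier.
    return {"one_time_submission_fee": 150, "annual_fee": 500}

print(l_certification_fees(250_000))
# {'one_time_submission_fee': 150, 'annual_fee': 500}
print(l_certification_fees(25_000_000))
# {'one_time_submission_fee': 500, 'annual_fee': 6000}
```

The real schedule, including any tiers between these endpoints, is the one published on Fairly Trained’s website.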

VentureBeat asked Newton-Rex via email why the non-profit charges fees. He responded: “We charge fees to cover our costs. I think the fees are low enough that they shouldn’t be prohibitive for generative AI companies.”

Already, some companies have sought and received Fairly Trained’s L certification, including Beatoven.AI, Boomy, BRIA AI, Endel, LifeScore, Rightsify, Somms.ai, Soundful, and Tuney. Newton-Rex said the certification process for these AI firms took place “over the last month or so,” but he declined to comment on which companies paid fees and how much they paid.

Asked about services that fall somewhere between the public-scraping and licensing approaches, such as Adobe and Shutterstock, which say their stock image library terms of service allow them to train gen AI models on creators’ works (among other uses), Newton-Rex demurred.

“We’d rather not comment on specific models that we haven’t certified,” he wrote. “If they feel they’ve trained models that meet our certification requirements, I hope they’ll apply for certification.”

Noteworthy advisers and supporters

Among Fairly Trained’s advisers, according to its website, are Tom Gruber, former chief technologist of Siri (the voice assistant startup acquired by Apple), and Maria Pallante, president and CEO of the Association of American Publishers.

The nonprofit also lists among its supporters the Association of American Publishers, the Association of Independent Music Publishers, Concord (a leading music and audio group), and Universal Music Group. The latter two are suing AI company Anthropic over its Claude chatbot’s reproduction of copyrighted song lyrics.

Asked via email whether Fairly Trained was involved in any of the AI lawsuits, Newton-Rex answered in writing: “No, I’m not involved in any of the lawsuits.”

Are any of these groups donating money to Fairly Trained? Newton-Rex said “there’s no funding at this stage” for the enterprise, aside from the fees it charges for certification.



Author: Carl Franzen
Source: VentureBeat
