A massive open-source AI dataset, LAION-5B, which has been used to train popular AI text-to-image generators like Stable Diffusion and Google’s Imagen, contains at least 1,008 instances of child sexual abuse material, a new report from the Stanford Internet Observatory found — with thousands more instances suspected. The Stanford Internet Observatory is a program of the Cyber Policy Center, a joint initiative of the Freeman Spogli Institute for International Studies and Stanford Law School.
The LAION-5B dataset, which was released in March 2022 and contains more than 5 billion images and related captions from the internet, may also include thousands of additional pieces of suspected child sexual abuse material, or CSAM, according to the report. The report warned that CSAM in the dataset could enable AI products built on this data to output new and potentially realistic child abuse content.
In response, LAION told 404 Media on Tuesday that out of “an abundance of caution,” it was taking down its datasets temporarily “to ensure they are safe before republishing them.”
LAION datasets have come under fire before
But this is not the first time LAION’s image datasets have come under fire. As far back as October 2021, cognitive scientist Abeba Birhane, currently a senior fellow in trustworthy AI at Mozilla, published a paper, “Multimodal datasets: misogyny, pornography, and malignant stereotypes,” which examined LAION-400M, an earlier image dataset. It found that the dataset contained “troublesome and explicit images and text pairs of rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content.”
In September 2022, an artist discovered that private medical record photos taken by her doctor in 2013 were referenced in the LAION-5B image dataset. The artist, Lapine, found the photos through the Have I Been Trained website, which allows people to look for their work in popular AI training datasets.
And a class-action lawsuit, Andersen et al. v. Stability AI LTD et al., was brought by visual artists Sarah Andersen, Kelly McKernan, and Karla Ortiz against Stability AI, Midjourney, and DeviantArt in January 2023. While LAION was not sued, it was named in the lawsuit, which said that “Stability is alleged to have ‘downloaded or otherwise acquired copies of billions of copyrighted images without permission to create Stable Diffusion’ known as ‘training images.’ Over five billion images were scraped (and thereby copied) from the internet for training purposes for Stable Diffusion through the services of an organization (LAION, Large-Scale Artificial Intelligence Open Network) paid by Stability.”
Ortiz, an award-winning artist who has worked for Industrial Light & Magic (ILM), Marvel Film Studios, Universal Studios and HBO, spoke at a virtual FTC panel in October and discussed the LAION-5B dataset.
“LAION-5B is a dataset that contains 5.8 billion text and image pairs, which…includes the entirety of my work and the work of almost everyone I know,” she said. “Beyond intellectual property, data sets like LAION-5B also contain deeply concerning material like private medical records, non consensual pornography, images of children, even social media pictures of our actual faces.”
AI pioneer Andrew Ng has criticized removing access to LAION
As VentureBeat reported in September, Andrew Ng, co-founder and former head of Google Brain, has made no bones about the fact that the latest advances in machine learning have depended on free access to large quantities of data, much of it scraped from the open internet.
In an issue of his DeepLearning.ai newsletter, The Batch, titled “It’s Time to Update Copyright for Generative AI,” he wrote that a lack of access to massive popular datasets such as Common Crawl, The Pile, and LAION would put the brakes on progress or at least radically alter the economics of current research.
“This would degrade AI’s current and future benefits in areas such as art, education, drug development, and manufacturing, to name a few,” he said.
And in the June 7 edition of The Batch, Ng admitted that the AI community is entering an era in which it will be called upon to be more transparent in its collection and use of data. “We shouldn’t take resources like LAION for granted, because we may not always have permission to use them,” he wrote.
LAION was founded to create an open-source dataset
Christoph Schuhmann, a high school teacher and trained actor based in Hamburg, Germany, helped found LAION, short for “Large-scale AI Open Network.” According to an April 2023 Bloomberg article, Schuhmann was hanging out on a Discord server for AI enthusiasts and was inspired by the first iteration of OpenAI’s DALL-E to make sure there would be an open-source dataset to help train text-to-image diffusion models.
“Within a few weeks, Schuhmann and his colleagues had 3 million image-text pairs. After three months, they released a dataset with 400 million pairs,” the Bloomberg article said. “That number is now over 5 billion, making LAION the largest free dataset of images and captions.”
Since then, the nonprofit LAION has weighed in publicly on open-source AI topics: For example, after an open letter in March 2023 calling for an AI ‘pause’ heated up a fierce debate around risks vs. hype, LAION called for accelerating research and establishing a joint, international computing cluster for large-scale open-source artificial intelligence models.
LAION was scraped from visual data on shopping sites
LAION was built, in part, from visual data scraped from online shopping services such as Shopify, eBay and Amazon. In a recent paper from the Allen Institute for AI called “What’s in My Big Data?”, researchers studied LAION-2B-en, a subset of LAION-5B that consists of 2.32 billion images with English captions. The paper found, for example, that 6% of the documents in LAION-2B-en were from Shopify.
“That was a surprise because no one had looked at that before,” Jesse Dodge, a research scientist at the Allen Institute for AI, told VentureBeat in November. “No one had been able to say like, what parts of the internet is the most images of text from in this dataset?”
Author: Sharon Goldman
Source: Venturebeat
Reviewed By: Editorial Team