One of the world’s largest AI training datasets is about to get bigger and ‘substantially better’
January 12, 2024
Massive AI training datasets, or corpora, have been called “the backbone of large language models.” But EleutherAI, the organization that created one of the world’s largest of these datasets, an 825 GB open-source, diverse text corpus called the Pile, became a target in 2023 amid a growing uproar over the legal and ethical impact of the datasets used to train the most popular LLMs…