Chief data officers, data scientists and data analysts of all stripes may be interested in a new AI support and information community that’s debuting today.
That special-interest group is the idea of a startup, YData, a self-described “data-centric AI community” that created what it claims is the first development platform for data quality to accelerate the development of AI solutions. The new company aims to break down barriers for data science teams, researchers and beginners to create a “friendly place where data quality issues are discussed and solved,” CEO and founder Gonçalo Martins Ribeiro told VentureBeat via email.
Seattle-based YData is a for-profit organization of open-source enthusiasts and community builders that makes its open-source-based tools available to the community free of charge. Its business model is to sell enterprise support on top of those tools, Ribeiro said.
YData’s development platform follows a data-centric mindset by bringing together the major data science frameworks with proprietary tools for data access and profiling, synthetic data generation, and labeling to deliver better data quality for AI. Higher data quality means fewer errors and biases, and a more representative dataset that helps ensure AI is built responsibly. Organizations in the financial services, utilities and telecoms sectors have already adopted the company’s technology, Ribeiro said.
Research shows that there will be no mainstream digital transformation without high-quality data to go with it. Recognizing the recent paradigm shift in approach to AI development — from model-centric to data-centric — YData created the Data-Centric AI Community to promote community-driven and expert-guided transformations for better AI development, Ribeiro said.
YData has been a pioneer in community-driven AI transformation, launching the Synthetic Data Community in 2020. In 2021, YData open-sourced two notable libraries, ydata-synthetic and ydata-quality, and placed them on GitHub, with the sole goal of ensuring data science teams have access to high-quality data.
YData’s Synthesiser uses state-of-the-art deep-learning techniques to learn the statistical properties of the actual data and reproduce them in a new dataset. YData’s Pandas Profiling helps users profile raw data and understand its quality in a few lines of code.
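To give a rough sense of what profiling raw data "in a few lines of code" can involve, here is a minimal sketch in plain Python. It does not use YData's libraries or reproduce their API; the function names `profile_column` and `profile_table` and the sample rows are hypothetical, chosen only for illustration. The sketch counts missing and distinct values per column and adds basic statistics for columns that are entirely numeric.

```python
import math
import statistics


def profile_column(values):
    """Summarize one column: total, missing and distinct counts, plus numeric stats."""
    # Treat None and float NaN as missing values.
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    # Only report mean/min/max when every present value is a number.
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric and len(numeric) == len(present):
        summary["mean"] = statistics.fmean(numeric)
        summary["min"] = min(numeric)
        summary["max"] = max(numeric)
    return summary


def profile_table(rows):
    """Profile a list of dict rows, column by column."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical sample data with one missing value.
rows = [
    {"age": 34, "city": "Porto"},
    {"age": None, "city": "Seattle"},
    {"age": 41, "city": "Porto"},
]
report = profile_table(rows)
```

A dedicated profiling library goes much further than this (correlations, distributions, warnings), but even a summary of this kind can surface missing values and suspicious columns before any model is trained.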
“We understand that a community driving the paradigm shift to data-centric AI is essential, and we aim to focus on data profiling, synthetic data, and data labeling, the most significant pain points of the data scientists,” Ribeiro said.
With experts such as Andrew Ng raising awareness of the data-centric approach, and with the first competitions and workshops already held, the Data-Centric AI Community stands as the missing piece of the data-centric movement, Ribeiro said. “We believe that having quality data is truly a game-changer and that by creating high-quality data that resembles real-world data that was initially inaccessible, endless possibilities can be unlocked. Being able to profile and understand data, early in development, is crucial and can save a lot of time and money for organizations,” he said.
“Not every company, researcher or student has access to the most valuable data like some tech giants do. As ML algorithms [and] coding frameworks evolve rapidly, it’s safe to say the scarcest resource in AI is high-quality data at scale. We need to find ways to improve the data used for AI development. The Data-Centric AI Community is a step towards addressing that,” Ribeiro said.
A Q&A with the CEO and founder
VentureBeat: Are you the first dev community to specialize in AI development?
Ribeiro: Not the first dev community for AI, but the first dev community focused on the new trend of data-centric AI. After the initial buzz created by Andrew Ng and some initiatives like the Resources Hub and the Stanford and ETH workshop, we’re the first community centered around this topic, and we provide a lot of open source [software] to help data scientists move from a model-centric to a data-centric approach.
VentureBeat: What are the main challenges in resetting development mindset from model-centric to data-centric?
Ribeiro: Until now, most platforms and tools available follow the model-centric approach. Even well-renowned conferences, such as NeurIPS, were focused on optimizing models. It was only last year that NeurIPS launched a track for Datasets, making it clear that focusing on the data – the prime matter of AI – is the missing piece that companies are still struggling with. Many have challenges to overcome, from changing the status quo of building AI solutions, to the lack of tooling available, not to mention education and training. At the Data-Centric AI Community, we aim to help overcome all of these challenges by fostering community-driven and expert-guided discussions, content and new open-source projects.
VentureBeat: Tell me something about the new project that I probably don’t know.
Ribeiro: The Data-Centric AI Community already counts with some open-source contributors and projects that can be found on GitHub, being the data quality profiling and the synthetic data generation some of the most popular worldwide.
Author: Chris J. Preimesberger
Source: Venturebeat