Have you ever tried to intentionally forget something you had already learned? You can imagine how difficult it would be.
As it turns out, it’s also difficult for machine learning (ML) models to forget information. So what happens when these algorithms are trained on outdated, incorrect or private data?
Retraining the model from scratch every time an issue arises with the original dataset is hugely impractical. This has given rise to a new field in AI called machine unlearning.
With new lawsuits being filed what seems like every other day, the need for ML systems to efficiently ‘forget’ information is becoming paramount for businesses. Algorithms have proven to be incredibly useful in many areas, but the inability to forget information has significant implications for privacy, security and ethics.
Let’s take a closer look at the nascent field of machine unlearning — the art of teaching artificial intelligence (AI) systems to forget.
Understanding machine unlearning
As you might have gathered by now, machine unlearning is the process of erasing the influence that specific datasets have had on an ML system.
Most often, when a concern arises with a dataset, it’s a case of modifying or simply deleting the dataset. But in cases where the data has been used to train a model, things can get tricky. ML models are essentially black boxes. This means that it’s difficult to understand exactly how specific datasets impacted the model during training and even more difficult to undo the effects of a problematic dataset.
OpenAI, the creators of ChatGPT, have repeatedly come under fire regarding the data used to train their models. A number of generative AI art tools are also facing legal battles regarding their training data.
Privacy concerns have also been raised after membership inference attacks showed that it’s possible to infer whether specific data was used to train a model. This means that a model can potentially reveal information about the individuals whose data was used to train it.
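To make the membership inference idea concrete, here is a toy sketch. The setup is entirely hypothetical (a small, deliberately overfit least-squares model on synthetic data), not any specific published attack; the point is simply that an overfit model tends to assign lower loss to its training examples, so thresholding per-example loss can guess membership better than chance.

```python
# Toy membership inference sketch: hypothetical setup, not a real attack
# implementation. An overfit model has lower loss on its training
# "members" than on held-out data, which leaks membership.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: 25 training "members", 25 held-out non-members.
n_train, n_out, n_feat = 25, 25, 20
X = rng.normal(size=(n_train + n_out, n_feat))
true_w = rng.normal(size=n_feat)
y = X @ true_w + rng.normal(scale=0.1, size=n_train + n_out)
X_in, y_in = X[:n_train], y[:n_train]
X_out, y_out = X[n_train:], y[n_train:]

# "Train" a model; with 20 features and 25 examples it nearly interpolates.
w, *_ = np.linalg.lstsq(X_in, y_in, rcond=None)

def per_example_loss(w, X, y):
    """Squared error of the linear model on each example."""
    return (X @ w - y) ** 2

# Members tend to have much lower loss than non-members, so a simple
# threshold on the loss recovers membership better than chance.
loss_in = per_example_loss(w, X_in, y_in)
loss_out = per_example_loss(w, X_out, y_out)
threshold = np.median(np.concatenate([loss_in, loss_out]))
accuracy = ((loss_in < threshold).mean() + (loss_out >= threshold).mean()) / 2
print(f"attack accuracy: {accuracy:.2f}")  # chance level would be 0.50
```

Real attacks against deep models are far more sophisticated, but the underlying signal — the model behaving differently on data it was trained on — is the same.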
While machine unlearning might not keep companies out of court, it would certainly help the defense’s case to show that datasets of concern have been removed entirely.
With the current technology, if a user requests data deletion, the entire model would need to be retrained, which is hugely impractical. The need for an efficient way to handle data removal requests is imperative for the progression of widely accessible AI tools.
The mechanics of machine unlearning
The simplest solution to produce an unlearned model is to identify problematic datasets, exclude them and retrain the entire model from scratch. While this method is currently the simplest, it is prohibitively expensive and time-consuming.
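In a toy setting, that "exclude and retrain" approach looks like the following sketch. Everything here is illustrative: the tiny least-squares fit stands in for an expensive training run.

```python
# Minimal sketch of exact unlearning by full retraining. The least-squares
# "train" function is a stand-in for an expensive training pipeline.
import numpy as np

rng = np.random.default_rng(1)

def train(X, y):
    """Stand-in for a full (expensive) training run."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100)

original_model = train(X, y)

# Suppose rows 10-19 must be forgotten (e.g. a user deletion request).
forget = np.arange(10, 20)
keep = np.setdiff1d(np.arange(len(X)), forget)

# Exact unlearning: retrain from scratch on everything except the forget
# set. By construction the result contains no trace of the deleted rows,
# but the full training cost is paid again on every request.
unlearned_model = train(X[keep], y[keep])
shift = np.linalg.norm(original_model - unlearned_model)
print(f"parameter shift after unlearning: {shift:.4f}")
```

The guarantee is airtight — the deleted rows were never seen — which is exactly why full retraining remains the gold standard that cheaper methods are measured against.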
Recent estimates indicate that training an ML model currently costs around $4 million. Due to an increase in both dataset size and computational power requirements, this number is predicted to rise to a whopping $500 million by 2030.
The “brute force” retraining approach might be appropriate as a last resort under extreme circumstances, but it’s far from a silver bullet solution.
The conflicting objectives of machine unlearning present a challenging problem: an effective method must forget the problematic data while retaining the model’s utility, and it must do so efficiently. There’s no point in developing a machine unlearning algorithm that uses more energy than retraining would.
Progression of machine unlearning
All this isn’t to say there hasn’t been progress toward developing an effective unlearning algorithm. The first mention of machine unlearning appeared in a 2015 paper, with a follow-up paper in 2016. The authors propose a system that allows incremental updates to an ML system without expensive retraining.
A 2019 paper furthers machine unlearning research by introducing a framework that expedites the unlearning process by strategically limiting the influence of data points in the training procedure. This means specific data can be removed from the model with minimal negative impact on performance.
Another 2019 paper outlines a method to “scrub” network weights clean of information about a particular set of training data without access to the original training dataset. This prevents the forgotten data from being recovered by probing the weights.
A 2020 paper introduced the novel approach of sharding and slicing optimizations. Sharding limits each data point’s influence to a single partition of the training data, while slicing divides each shard’s data further and trains incremental models. This approach aims to expedite the unlearning process and eliminate extensive retraining.
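The sharding half of that idea can be sketched in a few lines. This is a loose illustration in the spirit of sharded training, not a faithful reimplementation of the paper (slicing and the checkpointing it enables are omitted): data is split into disjoint shards, one sub-model is trained per shard, and predictions are aggregated, so forgetting a point only requires retraining the single shard that contained it.

```python
# Sketch of shard-based unlearning (illustrative only; slicing omitted).
# Each example belongs to exactly one shard, so deleting it touches only
# that shard's sub-model.
import numpy as np

rng = np.random.default_rng(2)
NUM_SHARDS = 4

def train_shard(X, y):
    """Fit one sub-model on a single shard's data."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

X = rng.normal(size=(80, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=80)

# Assign every example to exactly one shard.
shard_of = np.arange(len(X)) % NUM_SHARDS
models = [train_shard(X[shard_of == s], y[shard_of == s])
          for s in range(NUM_SHARDS)]

def predict(models, X):
    """Aggregate by averaging the shard models' predictions."""
    return np.mean([X @ w for w in models], axis=0)

# Unlearn example 7: retrain only its shard. The other shards never saw
# the point, so they stay untouched; the cost is roughly 1/NUM_SHARDS of
# a full retrain.
i = 7
s = shard_of[i]
keep = (shard_of == s) & (np.arange(len(X)) != i)
models[s] = train_shard(X[keep], y[keep])
preds = predict(models, X[:5])
```

The trade-off is that each sub-model sees less data, which can cost some accuracy — one reason aggregation and slicing strategies matter in the full method.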
A 2021 study introduced a new algorithm that can unlearn more data samples than existing methods while maintaining the model’s accuracy. Later in 2021, researchers developed a strategy for handling data deletion in models, even when deletions are based only on the model’s output.
Since the term was introduced in 2015, various studies have proposed increasingly efficient and effective unlearning methods. Despite significant strides, a complete solution is yet to be found.
Challenges of machine unlearning
Like any emerging area of technology, we generally have a good idea of where we want to go, but not a great idea of how to get there. Some of the challenges and limitations machine unlearning algorithms face include:
- Efficiency: Any successful machine unlearning tool must use fewer resources than retraining the model would. This applies to both computational resources and time spent.
- Standardization: Currently, the methodology used to evaluate the effectiveness of machine unlearning algorithms varies between each piece of research. To make better comparisons, standard metrics need to be identified.
- Efficacy: Once an ML algorithm has been instructed to forget a dataset, how can we be confident it has really forgotten it? Solid validation mechanisms are needed.
- Privacy: Machine unlearning must ensure that it doesn’t inadvertently compromise sensitive data in its efforts to forget. Care must be taken to ensure that traces of data are not left behind in the unlearning process.
- Compatibility: Machine unlearning algorithms should ideally be compatible with existing ML models. This means that they should be designed in a way that they can be easily implemented into various systems.
- Scalability: As datasets become larger and models more complex, it’s important that machine unlearning algorithms are able to scale to match. They need to handle large amounts of data and potentially perform unlearning tasks across multiple systems or networks.
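On the efficacy point, one common sanity check is to compare the candidate unlearned model against a model retrained from scratch without the forget set, which serves as the gold standard. The following toy sketch assumes a linear least-squares model, where a cheap closed-form "downdate" of the training statistics happens to exist; for deep models no such exact shortcut is known, which is precisely why validation is hard.

```python
# Toy efficacy check: does a cheap unlearning update match a full retrain?
# The "candidate" subtracts the forget set's contribution from the
# normal-equation statistics instead of refitting on all kept rows. This
# exact shortcut only exists for simple models like least squares.
import numpy as np

rng = np.random.default_rng(3)

X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=60)
forget = np.arange(5)                          # rows to be forgotten
keep = np.setdiff1d(np.arange(len(X)), forget)

# Gold standard: full retraining without the forget set.
gold, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)

# Candidate cheap unlearning: downdate the sufficient statistics
# (X^T X and X^T y) rather than touching the kept rows at all.
XtX, Xty = X.T @ X, X.T @ y
Xf, yf = X[forget], y[forget]
candidate = np.linalg.solve(XtX - Xf.T @ Xf, Xty - Xf.T @ yf)

# A tiny parameter gap is evidence the forget set's influence is gone;
# a large gap would mean the cheap method failed to unlearn.
gap = np.linalg.norm(candidate - gold)
print(f"parameter gap vs. retrained reference: {gap:.2e}")
```

For large neural networks, comparisons like this are run on model outputs and on resistance to membership inference rather than on raw parameters — and defining the right metric is itself one of the standardization problems noted above.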
Addressing all these issues poses a significant challenge and a healthy balance must be found to ensure a steady progression. To help navigate these challenges, companies can employ interdisciplinary teams of AI experts, data privacy lawyers and ethicists. These teams can help identify potential risks and keep track of progress made in the machine unlearning field.
The future of machine unlearning
Google recently announced the first machine unlearning challenge. This aims to address the issues outlined so far. Specifically, Google hopes to unify and standardize the evaluation metrics for unlearning algorithms, as well as foster novel solutions to the problem.
The competition, which considers an age predictor tool that must forget certain training data to protect the privacy of specified individuals, began in July and runs through mid-September 2023. For business owners who might have concerns about data used in their models, the results of this competition are most certainly worth paying attention to.
In addition to Google’s efforts, the continuous build-up of lawsuits against AI and ML companies will undoubtedly spark action within these organizations.
Looking further ahead, we can anticipate advancements in hardware and infrastructure to support the computational demands of machine unlearning. Interdisciplinary collaboration may also increase and help streamline development: legal professionals, ethicists and data privacy experts may join forces with AI researchers to align unlearning algorithms with legal and ethical requirements.
We should also expect that machine unlearning will attract attention from lawmakers and regulators, potentially leading to new policies and regulations. And as issues of data privacy continue to make headlines, increased public awareness could also influence the development and application of machine unlearning in unforeseen ways.
Actionable insights for businesses
Understanding the value of machine unlearning is crucial for businesses that are looking to implement or have already implemented AI models trained on large datasets. Some actionable insights include:
- Monitoring research: Keeping an eye on recent academic and industry research will help you stay ahead of the curve. Pay particular attention to the results of events like Google’s machine unlearning challenge. Consider subscribing to AI research newsletters and following AI thought leaders for up-to-date insights.
- Implementing data handling rules: It’s crucial to examine your current and historical data handling practices. Always try to avoid using questionable or sensitive data during the model training phase. Establish procedures or review processes for the proper handling of data.
- Consider interdisciplinary teams: The multifaceted nature of machine unlearning benefits from a diverse team that could include AI experts, data privacy lawyers and ethicists. This team can help ensure your practices align with ethical and legal standards.
- Consider retraining costs: It never hurts to prepare for the worst. Consider the costs for retraining in the case that machine unlearning is unable to solve any issues that may arise.
Keeping pace with machine unlearning is a smart long-term strategy for any business using large datasets to train AI models. By implementing some or all of the strategies outlined above, businesses can proactively manage any issues that may arise due to the data used in the training of large AI models.
Final thoughts
AI and ML are dynamic and continuously evolving fields. Machine unlearning has emerged as a crucial aspect of these fields, allowing them to adapt and evolve more responsibly. It ensures better data handling capabilities while maintaining the quality of the models.
The ideal scenario is to use the right data from the start, but the reality is that our perspectives, information and privacy needs change over time. Adopting and implementing machine unlearning is no longer optional but a necessity for businesses.
In the broader context, machine unlearning fits into the philosophy of responsible AI. It underscores the need for systems that are transparent and accountable and that prioritize user privacy.
It’s still early days, but as the field progresses and evaluation metrics become standardized, implementing machine unlearning will inevitably become more manageable. This emerging trend warrants a proactive approach from businesses that regularly work with ML models and large datasets.
Matthew Duffin is a mechanical engineer, dedicated blogger and founder of Rare Connections.
Author: Matthew Duffin, Rare Connections
Source: Venturebeat