
Researchers turn to Harry Potter to make AI forget about copyrighted material

As the debate heats up around the use of copyrighted works to train large language models (LLMs) such as OpenAI’s ChatGPT, Meta’s Llama 2 and Anthropic’s Claude 2, one obvious question arises: can these models even be altered or edited to remove their knowledge of such works, without totally retraining or rearchitecting them?

In a new paper published on arXiv.org, an open-access site that does not peer-review submissions, co-authors Ronen Eldan of Microsoft Research and Mark Russinovich of Microsoft Azure propose a way to do exactly this by erasing specific information from a sample LLM: namely, all knowledge of the existence of the Harry Potter books (including characters and plots) from Meta’s open source Llama 2-7B.

As the Microsoft researchers write: “While the model took over 184K GPU-hours to pretrain, we show that in about 1 GPU hour of finetuning, we effectively erase the model’s ability to generate or recall Harry Potter-related content.”

This work provides an important step toward adaptable language models. The ability to refine AI over time according to shifting organizational needs is key to long-term, enterprise-safe deployments.

“Traditional models of [machine] learning predominantly focus on adding or reinforcing knowledge through basic fine-tuning but do not provide straightforward mechanisms to ‘forget’ or ‘unlearn’ knowledge,” the authors write.

How did they overcome this? They developed a three-part technique to approximate unlearning specific information in LLMs.

First, they further trained a model on the target data (the Harry Potter books) and compared its predictions with those of a baseline model to identify the tokens most closely related to that data.

Second, they replaced unique Harry Potter expressions with generic counterparts and generated alternative predictions approximating a model without that training.

Third, they fine-tuned the baseline model on these alternative predictions, so that when it is prompted with related context it no longer reproduces the original text.
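The three steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names, the toy logits, the replacement dictionary and the `alpha` value are all hypothetical, and the specific rule used here (subtracting the scaled positive difference between the further-trained model's scores and the baseline's) is an assumed way of "comparing predictions to a baseline model."

```python
# Illustrative sketch only: toy numbers and hypothetical names throughout.

def generic_logits(baseline, reinforced, alpha=1.0):
    """Lower the score of tokens that the further-trained model boosts.

    Assumed rule (not confirmed by the article):
        generic = baseline - alpha * max(0, reinforced - baseline)
    Tokens the further-trained model does not favor are left unchanged.
    """
    return {
        tok: baseline[tok] - alpha * max(0.0, reinforced[tok] - baseline[tok])
        for tok in baseline
    }


def replace_anchors(text, anchor_map):
    """Swap target-specific expressions for generic counterparts."""
    for specific, generic in anchor_map.items():
        text = text.replace(specific, generic)
    return text


# Step 1: toy next-token scores after a prompt such as "Harry's best friend is".
baseline = {"Ron": 4.0, "Jack": 2.0, "a": 1.0}
reinforced = {"Ron": 6.0, "Jack": 1.5, "a": 1.0}  # further trained on the books

# "Ron" is boosted by the further-trained model, so its target score drops.
targets = generic_logits(baseline, reinforced, alpha=1.0)

# Step 2: a hypothetical dictionary of generic counterparts.
anchor_map = {"Hogwarts": "Mystic Academy", "Harry": "Jon"}
generic_text = replace_anchors("Harry went back to Hogwarts.", anchor_map)

# Step 3 (not shown): fine-tune the baseline model so that its predictions
# move toward targets like these on the original text.
print(targets)
print(generic_text)
```

In this toy case the Potter-specific token ends up scored no higher than a generic name, which is the kind of alternative prediction the fine-tuning step would then teach the baseline model to produce.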

To evaluate, they tested the model’s ability to generate or discuss Harry Potter content using 300 automatically generated prompts, as well as by inspecting token probabilities. As Eldan and Russinovich state, “to the best of our knowledge, this is the first paper to present an effective technique for unlearning in generative language models.”
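Inspecting token probabilities can be pictured as comparing the probability a model assigns to a target-specific token before and after unlearning. The sketch below is a minimal illustration with made-up scores and hypothetical names; it does not reproduce the authors' evaluation or their 300 prompts.

```python
import math


def softmax(logits):
    """Convert a dict of raw token scores into probabilities."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


# Hypothetical next-token scores for a prompt about Harry's best friend.
scores_before = {"Ron": 6.0, "Jack": 2.0, "a": 1.0}  # original model
scores_after = {"Ron": 2.0, "Jack": 2.0, "a": 1.0}   # after unlearning

p_before = softmax(scores_before)["Ron"]
p_after = softmax(scores_after)["Ron"]

# A successful unlearning run should show the target token losing probability.
print(f"P('Ron') before: {p_before:.3f}, after: {p_after:.3f}")
```

A check like this only covers one prompt and one token; a fuller evaluation would repeat it over many prompts, as the authors did with their automatically generated set.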

They found that the original model could easily discuss intricate Harry Potter plot details, but after only about an hour of fine-tuning with their technique, “it’s possible for the model to essentially ‘forget’ the intricate narratives of the Harry Potter series.” Performance on standard benchmarks such as ARC, BoolQ and Winogrande “remains almost unaffected.”

As the authors note, more testing is still needed given limitations of their evaluation approach. Their technique may also be more effective for fictional texts than non-fiction, since fictional worlds contain more unique references.

Nonetheless, this proof-of-concept provides “a foundational step towards creating more responsible, adaptable, and legally compliant LLMs in the future.” As the authors conclude, further refinement could help address “ethical guidelines, societal values, or specific user requirements.”

In summarizing their findings, the authors state: “Our technique offers a promising start, but its applicability across various content types remains to be thoroughly tested. The presented approach offers a foundation, but further research is needed to refine and extend the methodology for broader unlearning tasks in LLMs.”

Moving forward, more general and robust techniques for selective forgetting could help keep AI systems aligned with business or societal priorities as those needs change over time.


Author: Bryson Masse
Source: Venturebeat
Reviewed By: Editorial Team
