‘Subliminal learning’: Anthropic uncovers how AI fine-tuning secretly teaches bad habits
July 31, 2025
A new study by Anthropic shows that language models can pick up hidden characteristics during distillation, a popular method for fine-tuning models for specialized tasks. While the traits passed on through this phenomenon, which the authors call “subliminal learning,” can be benign, the research finds that they can also lead to unwanted results, such as misalignment and harmful behavior.
What is subliminal…