![](https://toptech.news/wp-content/uploads/2025/02/Less-r-own13-02-2025.jpg)
Language models generalize better when left to develop their own solutions, according to a new study by the University of Hong Kong and the University of California, Berkeley. The findings, which apply to both large language models (LLMs) and vision language models (VLMs), challenge one of the LLM community's core assumptions: that models require hand-labeled training examples. In fact, the researchers show that training models on too many hand-crafted examples can harm their ability to generalize to unseen data.
SFT vs RL in model training
For a long time, supervised fine-tuning (SFT) has been the gold standard for training LLMs and VLMs. Once a model is pre-trained on raw text and image data, companies and AI labs usually post-train it on a large dataset of hand-crafted examples in question/answer or request/response format. After SFT, the model can undergo additional training stages, such as reinforcement learning from human feedback (RLHF), where the model tries to learn implicit human preferences based on signals such as answer rankings or liking/disliking the model’s responses.
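For concreteness, a hand-crafted SFT example is essentially a prompt/response pair, and the training loss is typically computed only on the response tokens. The snippet below is a minimal, hypothetical sketch of that idea; the field names, whitespace "tokenization" and card-arithmetic prompt are illustrative assumptions, not the study's actual data format.

```python
# Hypothetical shape of a single hand-crafted SFT example.
sft_example = {
    "prompt": "Combine the cards 4, 7, 8 and 8 into an equation that equals 24.",
    "response": "(7 - 8 / 8) * 4 = 24",
}

IGNORE = None  # stands in for the "no loss here" label (often -100 in practice)

def build_training_pair(example):
    """Concatenate prompt and response, masking the prompt positions so the
    loss falls only on the hand-crafted answer (whitespace splitting is a
    stand-in for a real tokenizer)."""
    prompt_tokens = example["prompt"].split()
    response_tokens = example["response"].split()
    tokens = prompt_tokens + response_tokens
    labels = [IGNORE] * len(prompt_tokens) + response_tokens
    return tokens, labels

tokens, labels = build_training_pair(sft_example)
print(tokens[:4], labels[-5:])
```

RLHF then layers a preference signal on top of a model trained this way, rather than adding more labeled answers.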
SFT is useful for steering a model’s behavior toward the kind of tasks the model creators have designed it for. However, gathering the data is a slow and costly process, which is a bottleneck for many companies and labs.
Recent developments in LLMs have created interest in pure reinforcement learning (RL) approaches, where the model is given a task and left to learn it on its own without hand-crafted examples. The most prominent example is DeepSeek-R1, the OpenAI o1 competitor that relied mostly on reinforcement learning to learn complex reasoning tasks.
Generalization vs memorization
One of the key problems of machine learning (ML) systems is overfitting, where the model performs well on its training data but fails to generalize to unseen examples. During training, the model gives the false impression of having learned the task, while in practice it has just memorized its training examples. In large and complex AI models, separating generalization from memorization can be difficult.
The new study focuses on the generalization abilities of RL and SFT training in textual and visual reasoning tasks. For textual reasoning, an LLM trained on a set of rules should be able to generalize to variants of those rules. In visual reasoning, a VLM's task performance should remain consistent when aspects of the visual input, such as color and spatial layout, change.
![](https://venturebeat.com/wp-content/uploads/2025/02/image_fed545.png?w=800)
In their experiments, the researchers used two representative tasks. First was GeneralPoints, a benchmark that evaluates a model's arithmetic reasoning capabilities. The model is given four cards, as textual descriptions or images, and is asked to combine them to reach a target number. To study rule-based generalization, the researchers trained the model using one set of rules, then evaluated it using a different rule. For visual generalization, they trained the model using cards of one color and tested its performance on cards of other colors and numbering schemes.
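Because GeneralPoints has a verifiable outcome, its reward can be pictured as a simple checker. The function below is an illustrative sketch only, assuming a 24-style target and numeric card values; the study's actual rule variants (for example, how face cards are counted) are exactly what gets changed between training and evaluation.

```python
import re

def generalpoints_reward(cards, expression, target=24):
    """Return 1.0 if the proposed equation uses each card exactly once and
    evaluates to the target number, otherwise 0.0 (a sketch, not the study's
    exact reward)."""
    used = [int(tok) for tok in re.findall(r"\d+", expression)]
    if sorted(used) != sorted(cards):
        return 0.0  # every card must appear exactly once
    try:
        value = eval(expression, {"__builtins__": {}}, {})  # arithmetic only
    except Exception:
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0

# (7 - 8 / 8) * 4 = 24, so this candidate earns the reward.
print(generalpoints_reward([4, 7, 8, 8], "(7 - 8 / 8) * 4"))  # 1.0
```

Swapping in a different rule at test time, while keeping the arithmetic the same, is what reveals whether the model learned to reason about the cards or merely memorized one rule.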
The second task is V-IRL, which tests the model’s spatial reasoning capabilities in an open-world navigation domain that uses realistic visual input. This task also comes in pure-language and vision-language versions. The researchers evaluated generalization by changing the kind of instructions and visual representations the model was trained and tested on.
![](https://venturebeat.com/wp-content/uploads/2025/02/image_5eb42e.png?w=800)
They ran their tests on Llama-3.2-Vision-11B, warming the model up by training it on a small SFT dataset, then creating separate versions for each task and training paradigm. For each task, they separately scaled up training with RL and with SFT. The SFT process trains the model on additional hand-crafted solutions, while RL lets the model generate many solutions for each problem, evaluate the results, and train itself on the correct answers.
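That RL loop can be pictured roughly as follows. This is a deliberately simplified, rejection-sampling-style sketch with placeholder functions; the study's actual RL algorithm and reward implementation may differ, and all of the names below are hypothetical.

```python
import random

# Placeholder stand-ins for the real model, sampler and optimizer; in the
# study the model is an SFT-warmed Llama-3.2-Vision-11B.
def sample_solutions(model, problem, n=8):
    """Pretend to sample n candidate solutions from the model."""
    return [f"candidate {i} for: {problem}" for i in range(n)]

def verifiable_reward(problem, solution):
    """Pretend reward: the real one checks the solution against task rules."""
    return float(random.random() > 0.5)

def update_model(model, rewarded_pairs):
    """Pretend gradient step on the self-generated, rewarded solutions."""
    print(f"updating on {len(rewarded_pairs)} self-generated examples")

def rl_round(model, problems):
    """One simplified round: generate, score with the verifier, reinforce."""
    rewarded = []
    for problem in problems:
        for solution in sample_solutions(model, problem):
            if verifiable_reward(problem, solution) > 0:
                rewarded.append((problem, solution))
    update_model(model, rewarded)

rl_round(model=None, problems=["cards: 4 7 8 8, target 24"])
```

The key contrast with SFT is that the training targets here are generated and filtered by the model itself rather than written by humans.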
The findings show that reinforcement learning consistently improves performance on examples that are drastically different from training data. On the other hand, SFT seems to memorize the training rules and doesn’t generalize to out-of-distribution (OOD) examples. These observations apply to both text-only and multimodal settings.
![](https://venturebeat.com/wp-content/uploads/2025/02/image_30908f.png?w=800)
Implications for real-world applications
While their experiments show that RL is better at generalizing than SFT, the researchers also found that SFT is helpful for stabilizing the model’s output format, and is crucial to enabling RL to achieve its performance gains. The researchers found that, without the initial SFT stage, RL training did not achieve desirable results.
This is somewhat different from the results obtained with DeepSeek-R1-Zero, which was post-trained with pure RL. The researchers suggest the discrepancy may be due to the different backbone model used in their experiments.
It is clear that there is a lot of untapped potential in RL-heavy approaches. For use cases with verifiable results, letting models learn on their own can lead to unanticipated solutions that humans could not have crafted themselves. This could come in very handy in settings where creating hand-crafted examples is tedious and expensive.
Author: Ben Dickson
Source: Venturebeat
Reviewed By: Editorial Team