Thursday, July 10, 2025

How to Fine-Tune Small Language Models to Think with Reinforcement Learning

A visual tour and from-scratch guide to training GRPO reasoning models in PyTorch

Reasoning models are currently in fashion. DeepSeek-R1, Gemini-2.5-Pro, OpenAI’s O-series models, Anthropic’s Claude, Magistral, and Qwen3 — there is a new one every month. When you ask these models a question, they go into a chain of thought before generating an answer.

A simple demonstration of what reasoning looks like. When asked a question, the Language Model (LM) generates a chain of thought first, followed by the answer. (Illustration by the Author)
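
As a toy example, a reasoning-style completion often looks something like the text below. I am assuming the common convention of wrapping the reasoning in <think> tags and the final answer in <answer> tags; the exact tags vary between models.

    Question: What is 17 x 6?

    <think>
    17 x 6 = (17 x 5) + 17 = 85 + 17 = 102
    </think>
    <answer>102</answer>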

I recently asked myself the question, “Hmm… I wonder if I should write a Reinforcement Learning loop from scratch that teaches this ‘thinking’ behaviour to really small models — like only 135 million parameters”. It should be easy, right?

Well, it wasn’t.

Small models simply do not have the world knowledge that large models do. This means models under 1B parameters lack the “common sense” needed to easily reason through complex logical tasks. Therefore, you cannot just rely on compute to train them to reason.

You need additional tricks up your sleeve.

In this article, I won’t just cover tricks though. I will cover the major ideas behind training reasoning behaviours into language models, share some simple code snippets, and give practical tips for fine-tuning Small Language Models (SLMs) with RL.

This article is divided into 5 sections:

  1. Intro to RLVR (Reinforcement Learning with Verifiable Rewards) and why it is uber cool
  2. A visual overview of the GRPO algorithm and the clipped surrogate PPO loss
  3. A code walkthrough!
  4. Supervised fine-tuning and practical tips to train reasoning models
  5. Results!

Unless otherwise mentioned, all images used in this article are illustrations produced by the author.

At the end of this article, I will link to the 50-minute companion YouTube video of this article. If you have any queries, that video likely has the answers/clarification you need. You can also reach out to me on X (@neural_avb).

1. Reinforcement Learning with Verifiable Rewards (RLVR)

Before diving into the specific challenges with small models, let’s first introduce some terms.

Group Relative Policy Optimization, or GRPO, is a (rather new) Reinforcement Learning (RL) technique that researchers are using to fine-tune Large Language Models (LLMs) on logical and analytical tasks. Since its inception, a new term has been circulating in the LLM research space: RLVR, or Reinforcement Learning with Verifiable Rewards.

To understand what makes RLVR unique, it’s helpful to contrast it with the most common application of RL in language models: RLHF (Reinforcement Learning with Human Feedback). In RLHF, an RL module is trained to maximize scores from a separate reward model, which acts as a proxy for human preferences. This reward model is trained on a dataset where humans have ranked or rated different model responses.

In other words, RLHF trains LLMs to output responses that are more aligned with human preferences. It pushes models to follow instructions more closely.
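
As a rough sketch of what that looks like in practice, the reward model is usually trained with a pairwise (Bradley-Terry style) loss that pushes the score of the human-preferred response above the rejected one. The names below (reward_model, chosen_ids, rejected_ids) are placeholders for illustration, not code from this article:

    import torch.nn.functional as F

    def preference_loss(reward_model, chosen_ids, rejected_ids):
        # reward_model maps a tokenized (prompt + response) to a single scalar score
        r_chosen = reward_model(chosen_ids)      # score for the human-preferred response
        r_rejected = reward_model(rejected_ids)  # score for the rejected response
        # push the preferred score above the rejected one
        return -F.logsigmoid(r_chosen - r_rejected).mean()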

RLVR tries to solve a different problem: it teaches a model to be verifiably correct, often by learning to generate its own chain of thought.

Where RLHF uses a subjective reward model, RLVR uses an objective verifier. The core idea is to provide rewards based on whether an answer is demonstrably correct, not on a prediction of what a human might prefer.
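
In code, such a verifier can be as small as a function that extracts the model’s final answer and checks it against the ground truth. Here is a minimal sketch, assuming the <answer> tag format from earlier and a binary 0/1 reward; the exact reward scheme used later in this article may differ:

    import re

    def verifiable_reward(completion: str, ground_truth: str) -> float:
        # pull out whatever the model wrote inside <answer>...</answer>
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if match is None:
            return 0.0  # no parseable answer, no reward
        answer = match.group(1).strip()
        # the reward is objective: either the answer is demonstrably correct, or it is not
        return 1.0 if answer == ground_truth.strip() else 0.0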

