Thursday, July 10, 2025

How to Fine-Tune Small Language Models to Think with Reinforcement Learning

A visual tour and from-scratch guide to training GRPO reasoning models in PyTorch

Reasoning models are currently in fashion. DeepSeek-R1, Gemini-2.5-Pro, OpenAI’s O-series models, Anthropic’s Claude, Magistral, and Qwen3 — there is a new one every month. When you ask these models a question, they go into a chain of thought before generating an answer.

A simple demonstration of what reasoning looks like. When asked a question, the Language Model (LM) generates a chain of thought first, followed by the answer. (Illustration by the Author)
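
As a toy example, a reasoning-style completion often looks something like the text below. I am assuming the common convention of wrapping the reasoning in <think> tags and the final answer in <answer> tags; the exact tags vary between models.

    Question: What is 17 x 6?

    <think>
    17 x 6 = (17 x 5) + 17 = 85 + 17 = 102
    </think>
    <answer>102</answer>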

I recently asked myself the question, “Hmm… I wonder if I should write a Reinforcement Learning loop from scratch that teaches this ‘thinking’ behaviour to really small models — like only 135 million parameters”. It should be easy, right?

Well, it wasn’t.

Small models simply do not have the world knowledge that large models do. This means models under 1B parameters lack the “common sense” needed to easily reason through complex logical tasks. Therefore, you cannot just rely on compute to train them to reason.

You need additional tricks up your sleeve.

In this article, I won’t just cover tricks though. I will cover the major ideas behind training reasoning behaviours into language models, share some simple code snippets, and give practical tips for fine-tuning Small Language Models (SLMs) with RL.

This article is divided into 5 sections:

  1. Intro to RLVR (Reinforcement Learning with Verifiable Rewards) and why it is uber cool
  2. A visual overview of the GRPO algorithm and the clipped surrogate PPO loss
  3. A code walkthrough!
  4. Supervised fine-tuning and practical tips to train reasoning models
  5. Results!

Unless otherwise mentioned, all images used in this article are illustrations produced by the author.

At the end of this article, I will link to the 50-minute companion YouTube video of this article. If you have any queries, that video likely has the answers/clarification you need. You can also reach out to me on X (@neural_avb).

1. Reinforcement Learning with Verifiable Rewards (RLVR)

Before diving into the specific challenges with small models, let’s first introduce some terms.

Group Relative Policy Optimization, or GRPO, is a (rather new) Reinforcement Learning (RL) technique that researchers are using to fine-tune Large Language Models (LLMs) on logical and analytical tasks. Since its inception, a new term has been circulating in the LLM research space: RLVR, or Reinforcement Learning with Verifiable Rewards.

To understand what makes RLVR unique, it’s helpful to contrast it with the most common application of RL in language models: RLHF (Reinforcement Learning with Human Feedback). In RLHF, an RL module is trained to maximize scores from a separate reward model, which acts as a proxy for human preferences. This reward model is trained on a dataset where humans have ranked or rated different model responses.

In other words, RLHF trains LLMs to output responses that are more aligned with human preferences. It pushes models to follow instructions more closely.
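
As a rough sketch of what that looks like in practice, the reward model is usually trained with a pairwise (Bradley-Terry style) loss that pushes the score of the human-preferred response above the rejected one. The names below (reward_model, chosen_ids, rejected_ids) are placeholders for illustration, not code from this article:

    import torch.nn.functional as F

    def preference_loss(reward_model, chosen_ids, rejected_ids):
        # reward_model maps a tokenized (prompt + response) to a single scalar score
        r_chosen = reward_model(chosen_ids)      # score for the human-preferred response
        r_rejected = reward_model(rejected_ids)  # score for the rejected response
        # push the preferred score above the rejected one
        return -F.logsigmoid(r_chosen - r_rejected).mean()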

RLVR tries to solve a different problem: it teaches a model to be verifiably correct, often by learning to generate its own chain of thought.

Where RLHF uses a subjective reward model, RLVR uses an objective verifier. The core idea is to provide rewards based on whether an answer is demonstrably correct, not on a prediction of what a human might prefer.
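
In code, such a verifier can be as small as a function that extracts the model’s final answer and checks it against the ground truth. Here is a minimal sketch, assuming the <answer> tag format from earlier and a binary 0/1 reward; the exact reward scheme used later in this article may differ:

    import re

    def verifiable_reward(completion: str, ground_truth: str) -> float:
        # pull out whatever the model wrote inside <answer>...</answer>
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if match is None:
            return 0.0  # no parseable answer, no reward
        answer = match.group(1).strip()
        # the reward is objective: either the answer is demonstrably correct, or it is not
        return 1.0 if answer == ground_truth.strip() else 0.0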

