Unsloth: Fine-Tune LLMs from VS Code Using Free Colab GPUs
Here’s a workflow that changes everything: fine-tune LLMs directly from Visual Studio Code, running on Google Colab’s free GPUs. No expensive hardware. No cloud bills. Just your familiar IDE connected to free compute.
Unsloth makes this possible—and makes it 2x faster with 70% less VRAM than standard approaches.
Why Fine-Tune at All?
Before diving into the how, let’s address the why. Base models like Llama, Qwen, or Gemma are trained on internet-scale data to be generalists. They’re impressive, but they’re not yours.
Fine-tuning transforms a general model into a specialist:
1. Domain Expertise
A base model knows a little about everything. Fine-tuning on your data—medical records, legal documents, codebases, internal wikis—creates a model that deeply understands your domain. A model fine-tuned for radiology can outperform a general-purpose model like GPT-4 on radiology tasks, despite being orders of magnitude smaller.
2. Style and Voice
Want outputs that match your brand voice? Formal legal language? Casual customer support? Fine-tuning on examples teaches the model how to communicate, not just what to say.
3. Task-Specific Performance
Base models are optimized for general chat. Fine-tuning on structured outputs (JSON, SQL, specific formats) dramatically improves reliability for production use cases. No more fighting with prompts to get consistent formatting.
4. Smaller, Faster, Cheaper
A fine-tuned 3B model often beats a general 70B model on narrow tasks. That’s 20x fewer parameters, meaning faster inference, lower costs, and the ability to run locally.
5. Data Privacy
Fine-tuning means your sensitive data trains a model you control—not one that lives on someone else’s servers. For healthcare, legal, and enterprise applications, this is often a requirement, not a preference.
6. Reasoning and Chain-of-Thought
With reinforcement learning (GRPO/DPO), you can train models to reason through problems step-by-step. This is how labs create “reasoning models”—and now you can do it too.
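To make that concrete: GRPO-style training optimizes the model against reward functions you write yourself. Below is a minimal, illustrative sketch of such a function; the name, parameters, and scoring rule are invented for this example and are not part of Unsloth's or TRL's API.
# Illustrative reward function for GRPO-style training (made up for this post).
# It scores each generated completion: +1 if the reference answer appears,
# plus a small bonus if the model shows visible step-by-step reasoning.
def correctness_reward(completions, answers):
    scores = []
    for completion, answer in zip(completions, answers):
        score = 0.0
        if answer.strip() in completion:
            score += 1.0   # final answer is present
        if "step" in completion.lower():
            score += 0.1   # crude check for visible reasoning
        scores.append(score)
    return scores
During training, the model generates several candidate completions per prompt, each gets scored, and the policy is pushed toward the higher-scoring ones.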
The barrier has always been hardware. Fine-tuning typically requires 40-80GB of VRAM. That’s an A100 or H100—thousands of dollars in cloud compute or tens of thousands in hardware.
Unsloth removes that barrier.
The VS Code + Colab Workflow
This is the key insight: Google Colab gives you free GPU access (T4s, sometimes better), and the Google Colab VS Code extension lets you use those GPUs directly from your local IDE while running Unsloth's notebooks.
What this means:
- Write and edit code in your familiar VS Code environment
- Execute on Colab’s free GPUs
- No context switching between browser tabs
- Full IDE features: debugging, extensions, git integration
Setup in 5 Minutes
1. Install the Colab Extension
Open VS Code extensions (Ctrl+Shift+X) and search for “Google Colab”. Install it.
2. Clone Unsloth’s Notebooks
git clone https://github.com/unslothai/notebooks
cd notebooks
3. Open a Notebook and Connect
Open any notebook (e.g., nb/Qwen3_(4B)-GRPO.ipynb). In the kernel selector, choose “Colab”, then “+ Add New Colab Server”. Authenticate with Google, select GPU as the hardware accelerator, and you’re connected.
4. Run Your Fine-Tuning
Hit “Run All”. Unsloth handles the rest—installing dependencies, loading the model, running training. Watch your model improve in real time.
That’s it. You’re fine-tuning LLMs from VS Code on free hardware.
Why Unsloth is Fast
The efficiency gains aren’t magic—they’re engineering. Rather than relying on stock framework implementations, Unsloth’s team:
- Manually derived gradients for all compute-heavy operations
- Wrote custom Triton kernels for attention, MLP, and RoPE
- Built a manual backpropagation engine that avoids framework overhead
- Implemented padding-free packing to eliminate wasted compute
The result: 2x faster training, 70% less VRAM, zero accuracy loss. No approximations—just better implementation.
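That last point, padding-free packing, is easy to picture with a toy example. The sketch below just counts how many tokens the GPU has to process when variable-length examples are padded to a common length versus packed back to back; the numbers are invented, and Unsloth's real implementation works at the kernel level.
# Toy comparison of padded vs. packed batching (example lengths are made up).
seq_lengths = [180, 37, 512, 95]  # token counts of four training examples

padded_tokens = len(seq_lengths) * max(seq_lengths)  # every example padded to the longest
packed_tokens = sum(seq_lengths)                      # examples concatenated, no padding

print(f"padded: {padded_tokens} tokens")  # 2048 tokens, most of them padding
print(f"packed: {packed_tokens} tokens")  # 824 tokens of real data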
What You Can Train
Unsloth supports the full spectrum:
| Category | Models |
|---|---|
| Text LLMs | Llama 3.x, Qwen 3, Gemma 3, DeepSeek, Mistral, gpt-oss |
| Vision LLMs | Qwen3-VL, Gemma 3 Vision, Ministral 3 VL |
| Text-to-Speech | Orpheus-TTS, sesame/csm-1b |
| Embeddings | EmbeddingGemma, BERT-style models |
| MoE | DeepSeek, GLM, Qwen MoE (12x faster) |
Real VRAM Numbers
These are achievable on Colab’s free tier (a 16GB T4, with roughly 15GB usable):
- Llama 3.2 1B/3B: Fits easily, fast iteration
- Qwen3 4B with GRPO: Full reinforcement learning workflow
- Gemma 3 4B Vision: Multimodal fine-tuning
With Colab Pro (A100):
- gpt-oss 20B: 14GB VRAM
- Llama 3.1 8B: Full fine-tuning with room to spare
- 500K context training: Possible on 80GB for 20B models
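If you want to verify what Colab actually assigned you, and watch your own peak usage during a run, a quick PyTorch check in any notebook cell looks like this:
import torch

# Report the GPU model, its total memory, and the peak allocated so far.
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB total")
print(f"{torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB peak allocated")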
A Minimal Fine-Tuning Example
from unsloth import FastLanguageModel

# Load the model with 4-bit quantization (fits in much less VRAM)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Add LoRA adapters so only a small set of weights is trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank: size of the low-rank update matrices
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    lora_alpha=16,  # scaling factor for the LoRA update
    lora_dropout=0,
)

# Your training loop here—Unsloth handles the optimization
The notebooks handle all the boilerplate. Just swap in your dataset.
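If you'd rather see what that training-loop placeholder looks like outside the notebooks, here is a sketch that mirrors the pattern Unsloth's notebooks use: hand the LoRA-wrapped model to TRL's SFTTrainer. It assumes dataset is a Hugging Face Dataset with a "text" column; exact keyword names vary across trl versions.
from trl import SFTTrainer, SFTConfig

# Supervised fine-tuning with TRL, following the pattern in Unsloth's notebooks.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,       # recent trl versions take this as processing_class
    train_dataset=dataset,     # expects a "text" column by default
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size of 8
        max_steps=60,                   # raise this for a real run
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()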
When to Fine-Tune vs. Prompt Engineer
Fine-tuning isn’t always the answer:
Use prompting when:
- You need quick iteration
- The task is straightforward
- You don’t have training data
- The base model already performs well
Use fine-tuning when:
- Prompting hits a ceiling
- You need consistent structured outputs
- Domain expertise matters
- You want smaller, faster models
- Data privacy is a requirement
- You’re building reasoning capabilities
Getting Started Today
- Install the VS Code Colab extension
- Clone the notebooks: git clone https://github.com/unslothai/notebooks
- Pick a model that fits your task
- Prepare your data in the expected format (usually prompt/completion pairs; see the sketch after this list)
- Run the notebook and iterate
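The exact schema depends on the notebook you pick, but a common shape is the Alpaca-style instruction/input/output record. Here are two illustrative records (the field names follow the Alpaca convention; the contents are invented):
# Two illustrative Alpaca-style training records (contents are made up).
examples = [
    {
        "instruction": "Extract the invoice total as JSON.",
        "input": "Invoice #1042. Total due: $318.40.",
        "output": '{"invoice_number": 1042, "total": 318.40}',
    },
    {
        "instruction": "Summarize the support ticket in one sentence.",
        "input": "Customer reports the app crashes when exporting reports.",
        "output": "The app crashes during report export.",
    },
]
A few hundred to a few thousand records in this shape is enough to start iterating; quality and consistency matter more than raw volume.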
The Unsloth documentation covers dataset formatting, hyperparameter tuning, and deployment. Their fine-tuning guide walks through the full workflow.
The Bottom Line
Fine-tuning used to require serious hardware investment. Now you can do it from VS Code using free Colab GPUs, with training that’s 2x faster than standard approaches.
The combination of accessible compute (Colab) and efficient training (Unsloth) means anyone can create specialized models. The next production-ready fine-tune might come from someone working on a laptop at a coffee shop.
That’s a meaningful shift in who gets to build AI.
The Menon Lab explores tools that democratize AI development. Follow along for more on making advanced ML accessible.