Cohere Transcribe Runs in Your Browser — 1 Hour of Audio in 100 Seconds, Completely Free

By Prahlad Menon · 5 min read

Speech recognition just had a quiet but significant moment. Cohere released a 2-billion-parameter transcription model — open-source, Apache 2.0 — that runs entirely in your browser using WebGPU. No installation. No API key. No audio ever leaves your machine.

1 hour of audio transcribed in 100 seconds. Locally. For free.

It currently sits at #1 on HuggingFace’s Open ASR Leaderboard for English accuracy, and matches or beats all existing open-source models across 13 other languages.

Try it now →

What Is Cohere Transcribe?

cohere-transcribe-03-2026 is Cohere’s first audio model. It’s a dedicated ASR (automatic speech recognition) model trained from scratch on 500,000 hours of curated audio-transcript pairs across 14 languages:

European: English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish
East/Southeast Asian: Chinese (Mandarin), Japanese, Korean, Vietnamese
MENA: Arabic

The license is Apache 2.0 — meaning you can use it commercially, self-host it, build products with it, modify it. No strings.

The Architecture: Why It’s Fast

Most recent speech models take a shortcut: they grab a pre-trained text LLM and bolt audio understanding onto it. Models like Qwen3-ASR-1.7B and IBM Granite Speech work this way. It’s cheaper to train but slow to run — you’re doing full autoregressive inference through a giant text backbone just to get a transcript.

Cohere took the opposite approach. cohere-transcribe-03-2026 uses a Fast-Conformer encoder-decoder architecture:

  • Conformer encoder — interleaves CNN and Transformer layers. CNNs handle local acoustic features (phonemes, rapid sound transitions). Transformers handle long-range linguistic context (sentence meaning, speaker intent). Interleaving them gives you both.
  • Lightweight decoder — more than 90% of parameters live in the encoder. The decoder is deliberately small, minimizing autoregressive compute.
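The parameter split above explains the speedup: autoregressive decoding pays a per-token cost proportional to decoder size, while the encoder runs once over the whole utterance. A back-of-envelope sketch (the ~2 FLOPs-per-parameter-per-token rule of thumb and the token count are illustrative assumptions, not measurements from the article):

```python
# Rough decode-cost model: generating each token costs ~2 FLOPs per
# decoder parameter, so per-token cost scales with decoder size.

def decode_flops(decoder_params: float, n_tokens: int) -> float:
    """Approximate FLOPs spent generating n_tokens autoregressively."""
    return 2 * decoder_params * n_tokens

# A 2B model with >90% of parameters in the encoder has a <0.2B decoder.
small_decoder = decode_flops(0.2e9, n_tokens=1500)  # dedicated ASR decoder
llm_backbone  = decode_flops(2.0e9, n_tokens=1500)  # LLM-as-decoder approach

print(f"decode-cost ratio: {llm_backbone / small_decoder:.0f}x")  # 10x
```

Under these toy numbers the LLM-backbone design spends ~10x more compute on the decoding loop alone, which is why pushing parameters into the encoder buys throughput.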

The result: roughly 3x the throughput of similarly sized competitors. On the RTFx metric (inverse real-time factor — how fast the model processes audio relative to real time), Cohere Transcribe pulls ahead of every other 1B+ model at the same accuracy level.
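RTFx is simple to compute: audio duration divided by wall-clock processing time, with values above 1 meaning faster than real time. A quick sanity check of the headline figure:

```python
# RTFx (inverse real-time factor): seconds of audio processed per
# second of wall-clock time. RTFx > 1 means faster than real time.

def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    return audio_seconds / wall_seconds

# The article's headline figure: 1 hour of audio in 100 seconds.
print(rtfx(3600, 100))  # 36.0
```

That 36x is exactly the "1 hour in 100 seconds" claim restated.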

The training data got serious attention too: 500K hours of curated pairs, synthetic augmentation after error analysis, noise augmentation across 0–30 dB SNR range, a 16k multilingual BPE tokenizer trained in-distribution, and audio decontamination checks to prevent test/train overlap.
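Noise augmentation at a controlled SNR works by scaling the noise so the signal-to-noise ratio of the mix hits a target in decibels. A minimal sketch of the idea (plain-Python, equal-length sample lists; the article only states the 0–30 dB range, not the recipe):

```python
import math

def mix_at_snr(signal, noise, snr_db):
    """Add `noise` to `signal`, scaled so the mix has the target SNR in dB."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Choose scale so that 10*log10(p_signal / (scale^2 * p_noise)) == snr_db.
    scale = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]
```

In a real pipeline each training clip would be mixed with recorded noise at an SNR drawn uniformly from the 0–30 dB range.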

The WebGPU Browser Demo

The headline capability is the in-browser inference. Using WebGPU — the modern GPU compute API now available in Chrome and Edge — the model runs entirely client-side. Your audio never touches a server.

This is made possible by the same technology stack that’s been pushing on-device AI forward: WebGPU exposes GPU compute to web applications without requiring CUDA, Metal, or any local install. If your GPU supports WebGPU (most modern discrete GPUs do, and many integrated GPUs), you can run the model from a tab.

The practical implications:

  • Privacy-sensitive transcription — medical notes, legal recordings, personal audio — stays on your device
  • Zero cost — no API calls, no tokens, no billing
  • No setup friction — share a link; recipients can transcribe immediately

The 100 seconds per hour figure is real: on a WebGPU-capable GPU, the model processes audio at roughly 36x real time.

How It Compares

| Model             | Size | WER (English) | Languages | License    | Browser   |
|-------------------|------|---------------|-----------|------------|-----------|
| Cohere Transcribe | 2B   | #1 Open ASR   | 14        | Apache 2.0 | ✅ WebGPU |
| Whisper Large v3  | 1.5B | Strong        | 99        | MIT        | —         |
| Distil-Whisper    | 756M | Good          | 1 (EN)    | MIT        | —         |
| Qwen3-ASR-1.7B    | 1.7B | Competitive   | Multi     | Apache 2.0 | —         |

Cohere wins on accuracy (English) and is competitive across its 14 languages. Whisper still wins on language breadth at 99 languages. The browser execution is unique to Cohere Transcribe for a model of this quality.
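WER, the metric behind the leaderboard rankings above, is word-level edit distance divided by the number of reference words. A minimal reference implementation (a sketch for intuition; real evaluations also normalize casing and punctuation first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # 1-D dynamic-programming edit distance over word sequences.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,            # deletion
                       d[j - 1] + 1,        # insertion
                       prev + (r != h))     # substitution (or match)
            prev = cur
    return d[len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```

One dropped word out of a six-word reference gives a WER of 1/6; lower is better.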

Self-Hosting and Production Use

For production, Cohere collaborated with the vLLM team to add native serving support — the PR is merged. That means you can serve Cohere Transcribe with the same open-source stack you’d use for any other LLM, with batching, concurrency management, and all the vLLM performance optimizations.

from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="CohereLabs/cohere-transcribe-03-2026",
    device="cuda"
)

result = pipe("your_audio.mp3")
print(result["text"])

Or use the model’s native transcribe() method, which handles long-form audio chunking automatically:

from transformers import AutoModel

model = AutoModel.from_pretrained("CohereLabs/cohere-transcribe-03-2026")
transcript = model.transcribe("long_audio.wav")

The model is available on HuggingFace at CohereLabs/cohere-transcribe-03-2026, and also via Cohere’s Model Vault for managed enterprise deployment.

Why This Matters

A few threads converge here that are worth naming.

The “runs in the browser” moment is new. Until recently, state-of-the-art speech recognition meant cloud APIs — OpenAI Whisper API, Google Speech-to-Text, AWS Transcribe. The tradeoff was accuracy for privacy and cost. Cohere Transcribe at this quality level in a browser tab collapses that tradeoff.

WebGPU is quietly becoming a platform. What started as a web graphics API is now capable of running billion-parameter models. The same shift that happened with JavaScript (from toy scripting to production runtime) is happening with the browser as an inference environment.

Open source is winning ASR. A year ago, proprietary APIs were the obvious choice for production-quality transcription. Today, an Apache 2.0 model sits at #1 on the leaderboard, runs locally, and can be self-hosted with vLLM. The calculus for building on proprietary services has shifted.

For anyone building voice agents, transcription pipelines, medical documentation tools, or anything that touches audio — this is the model to evaluate first.
