Tencent's 440MB Translation Model Runs Entirely On-Device — and Outperforms 72B Models
Tencent’s Hunyuan team released something genuinely interesting this week: a translation model that fits in 440MB, runs entirely on-device without an internet connection, and, according to Tencent’s own published benchmarks, is built on a base model that outperforms models 40× larger.
The model is called Hy-MT1.5-1.8B-1.25bit, and it’s fully open-source on HuggingFace.
What It Is
Hy-MT1.5 is a 1.8B parameter translation model built by Tencent’s Hunyuan team. The base model was trained through a multi-stage pipeline: MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning — all focused on translation quality rather than general capability.
It supports 33 languages, 5 dialects and minority languages, and 1,056 translation directions.
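The direction count can be sanity-checked with a little arithmetic. The sketch below assumes a "direction" is an ordered source-to-target pair among the 33 main languages. That reading is inferred from the numbers, not stated in the release, and `count_directions` is a hypothetical helper.

```python
def count_directions(num_languages: int) -> int:
    """Count ordered (source, target) pairs with source != target.

    Assumption: a "translation direction" is an ordered pair of
    distinct languages, so A->B and B->A count separately.
    """
    return num_languages * (num_languages - 1)


# 33 languages give 33 * 32 ordered pairs.
print(count_directions(33))
```

Under that assumption the result matches the stated 1,056, which suggests the 5 dialects and minority languages are counted separately from the direction total.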
The base model in FP16 weighs 3.3GB — reasonable for a server but too heavy for a phone. Tencent’s AngelSlim team then applied Sherry, their 1.25-bit quantization framework (accepted at ACL 2026), to compress it down to 440MB.
How Sherry Works
The compression method is worth understanding because it’s clever. Standard quantization rounds weights to the nearest representable value and calls it done. Sherry does something more structured:
For every group of 4 weights, the 3 most important are each stored as a single bit (either -1 or +1), and the remaining 1 is zeroed out entirely. A group therefore has 4 possible zero positions times 8 sign patterns, or 32 states, which fits exactly in 5 bits: an effective 1.25 bits per weight. The pattern (3 active, 1 zeroed, aligned to powers of two) is specifically designed to map cleanly to SIMD instructions on mobile CPUs.
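To make the 5-bit figure concrete, here is a minimal sketch of one way to pack such a group. It is an illustration only, not Sherry's actual bit layout or kernel. It assumes "least important" means smallest magnitude, it omits the scale factors a real quantizer would need, and `pack_group` and `unpack_group` are hypothetical names.

```python
def pack_group(weights):
    """Pack 4 weights into a 5-bit integer code.

    Illustrative layout (an assumption, not Sherry's real format):
      bits 0-1: index of the weight that is zeroed out
      bits 2-4: sign bits of the 3 remaining weights (1 means +1)
    """
    if len(weights) != 4:
        raise ValueError("expected a group of exactly 4 weights")
    # Assumption: the least important weight is the smallest in magnitude.
    zero_idx = min(range(4), key=lambda i: abs(weights[i]))
    code = zero_idx
    bit = 2
    for i, w in enumerate(weights):
        if i == zero_idx:
            continue
        if w >= 0:
            code |= 1 << bit
        bit += 1
    return code


def unpack_group(code):
    """Recover the {-1, 0, +1} pattern from a 5-bit code."""
    zero_idx = code & 0b11
    out = []
    bit = 2
    for i in range(4):
        if i == zero_idx:
            out.append(0)
        else:
            out.append(1 if (code >> bit) & 1 else -1)
            bit += 1
    return out


code = pack_group([0.9, -0.1, -0.7, 0.4])
print(code, unpack_group(code))
```

Every code from 0 to 31 decodes to a distinct pattern with exactly one zero, so no bit combination is wasted.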
The result: a 7.5× reduction in size from FP16 (3.3GB down to 440MB), with what the team describes as “minimal accuracy loss.”
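The reported figures can be compared against the theoretical limit with some quick arithmetic. The sketch below uses only the numbers quoted in this article (1.8B parameters, 3.3GB, 440MB, 1.25 bits) and decimal megabytes. The helper names are hypothetical.

```python
def ideal_ratio(src_bits: float, dst_bits: float) -> float:
    """Compression ratio if every weight went from src_bits to dst_bits."""
    return src_bits / dst_bits


def size_mb(params: float, bits_per_weight: float) -> float:
    """Storage in decimal megabytes for `params` weights at a given width."""
    return params * bits_per_weight / 8 / 1e6


print(f"ideal FP16 -> 1.25-bit ratio: {ideal_ratio(16, 1.25):.1f}x")
print(f"reported ratio (3300 MB / 440 MB): {3300 / 440:.1f}x")
print(f"1.8B weights at 1.25 bits: {size_mb(1.8e9, 1.25):.0f} MB")
```

Taken at face value, the 440MB file is larger than a pure 1.25-bit encoding of 1.8B weights would be. The article does not say what accounts for the gap. Scale factors or some tensors kept at higher precision would be one plausible explanation, but that is a guess.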
The Benchmark Claim
This is the headline number: on the Flores-200 Chinese-to-foreign translation benchmark, Hy-MT1.5-1.8B (the full-precision base) outperforms Tower-Plus-72B and Qwen3-32B — models with 40× and 18× more parameters respectively. It also beats Microsoft Translator and Doubao (ByteDance’s translation API) in several language directions.
The 1.25-bit compressed version maintains competitive quality with the FP16 base, which is the point of the Sherry algorithm.
These benchmarks are self-reported, which is worth noting. But the benchmark itself (Flores-200) is standard, and the training choices (RL fine-tuning, distillation, translation-specific pre-training) are plausible reasons why a smaller specialized model might legitimately beat a larger generalist one on a focused task.
On-Device, Offline, Private
The practical story here is about what you don’t need:
- No internet connection
- No cloud API calls
- No data leaving the device
Tencent has released an Android APK demo that includes a “background word extraction mode” — tap any text in any app, get an instant translation without switching context. The demo runs on a Snapdragon 7+ Gen 2 (a mid-range chip, not a flagship).
The GGUF format is available for llama.cpp, and Tencent has submitted a PR (#22836) to llama.cpp to support their new STQ_0 kernel for 1.25-bit inference.
Why This Matters
Two trends converge here.
First, specialized beats generalist for bounded tasks. A model trained end-to-end for translation — with RL feedback tuned specifically for translation quality — doesn’t need to carry around general reasoning, coding, or math capability. That’s why it can beat a 72B model at the one thing it’s built for.
Second, on-device AI is maturing faster than the discourse suggests. 440MB is smaller than many apps. A Snapdragon 888 (2021 flagship) can run it smoothly. The privacy and latency arguments for local inference are obvious; what’s changing is that quality is catching up.
The combination of aggressive quantization (Sherry’s 1.25-bit), task-specialized training, and mobile-optimized inference kernels is a template that will likely appear in other domains — speech recognition, document parsing, classification tasks where a specialized 1-2B model can match much larger generalists.
Get It
- HuggingFace: AngelSlim/Hy-MT1.5-1.8B-1.25bit
- GGUF version: AngelSlim/Hy-MT1.5-1.8B-1.25bit-GGUF
- Android Demo APK: available on the GGUF model page
- Sherry paper (ACL 2026): arxiv.org/abs/2601.07892
- AngelSlim toolkit: github.com/Tencent/AngelSlim