MedCPT Hits 5 Million Downloads — Here's How to Use It in Your Medical RAG Pipeline
NIH’s MedCPT just crossed 5 million downloads on Hugging Face. That’s a meaningful milestone — and if you’re building anything that touches medical text, it’s worth understanding why this model exists, how it works, and when you should use it instead of a general-purpose embedding model.
Why General Embeddings Fall Short in Medicine
If you’ve tried plugging text-embedding-3-small or a general BGE model into a medical RAG pipeline, you’ve probably noticed the cracks:
- “MI” gets nearly the same vector whether the surrounding context means myocardial infarction or motivational interviewing; the model can’t disambiguate the abbreviation
- “Positive” means something very different in oncology vs psychiatry
- Clinical shorthand (SOB, c/o, prn) doesn’t exist in general training corpora
- PubMed abstracts have their own syntactic patterns that general models never saw
The core problem: general embedding models are trained on web text. Medical language has its own vocabulary, syntax, and semantics. A query like “ACE inhibitor contraindications in bilateral renal artery stenosis” requires a model that has seen thousands of papers discussing exactly that trade-off.
What MedCPT Is
MedCPT (Medical Contrastively Pre-trained Transformer) is a family of three models from NIH/NLM, trained on an unprecedented dataset: 255 million real user query-article pairs from PubMed search logs.
That last part is key. It’s not synthetic data or document pairs — it’s 255 million times a real clinician, researcher, or student typed a query into PubMed and clicked an article. That’s a behavioral signal that captures what relevance actually means in biomedical search.
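The “contrastively pre-trained” part of the name refers to the training objective: given a query, the clicked article should score higher than other articles in the batch. This is a schematic sketch of that in-batch contrastive (InfoNCE-style) loss, not the actual training code; `info_nce_loss` and the toy 3-dim vectors are illustrative inventions:

```python
import math

def info_nce_loss(q, docs, pos_idx, temperature=0.07):
    """Schematic InfoNCE loss: the clicked article (pos_idx) should score
    higher than the other in-batch articles for this query."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # Scaled dot-product scores, then softmax cross-entropy against the clicked article
    scores = [dot(q, d) / temperature for d in docs]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[pos_idx]

# Toy 3-dim "embeddings": the query is closest to doc 0 (the clicked one)
q = [1.0, 0.0, 0.0]
docs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(info_nce_loss(q, docs, pos_idx=0) < info_nce_loss(q, docs, pos_idx=1))  # True
```

Minimizing this loss over 255 million click pairs is what pulls queries and their relevant articles together in the shared embedding space.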
The family has three components:
| Model | Role | Max tokens | Use for |
|---|---|---|---|
| ncbi/MedCPT-Query-Encoder | Query embedding | 64 | Short queries, questions, clinical notes |
| ncbi/MedCPT-Article-Encoder | Document embedding | 512 | PubMed abstracts, clinical docs, guidelines |
| ncbi/MedCPT-Cross-Encoder | Re-ranking | 512 | Scoring query-doc pairs after retrieval |
The query and article encoders share the same embedding space — dot product similarity works across them directly.
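In practice that shared space means retrieval is just a dot product between vectors from the two different encoders. A pure-Python sketch with toy 4-dim vectors standing in for the real 768-dim outputs:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy stand-ins for encoder outputs (real MedCPT vectors are 768-dim)
query_vec = [0.2, 0.9, 0.1, 0.0]      # would come from MedCPT-Query-Encoder
article_vecs = [
    [0.1, 0.8, 0.2, 0.1],             # would come from MedCPT-Article-Encoder
    [0.9, 0.0, 0.1, 0.3],
]
scores = [dot(query_vec, v) for v in article_vecs]
best = max(range(len(scores)), key=scores.__getitem__)
print(best)  # 0: the first article scores highest
```

With real embeddings this becomes a single matrix multiply, which is why the steps below scale to large corpora.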
Setting Up MedCPT
pip install transformers torch
No API keys. No accounts. Runs locally.
Step 1: Embed Your Documents
import torch
from transformers import AutoTokenizer, AutoModel
# Load article encoder once (reuse across documents)
article_model = AutoModel.from_pretrained("ncbi/MedCPT-Article-Encoder")
article_tokenizer = AutoTokenizer.from_pretrained("ncbi/MedCPT-Article-Encoder")
article_model.eval()
def embed_articles(articles: list[list[str]]) -> torch.Tensor:
"""
articles: list of [title, abstract] pairs
returns: (N, 768) tensor of embeddings
"""
with torch.no_grad():
encoded = article_tokenizer(
articles,
truncation=True,
padding=True,
return_tensors="pt",
max_length=512,
)
return article_model(**encoded).last_hidden_state[:, 0, :]
# Example: embed a small corpus
corpus = [
[
"Metformin as first-line therapy for type 2 diabetes",
"Metformin remains the recommended first-line pharmacological therapy for type 2 diabetes due to its efficacy, safety profile, low cost, and potential cardiovascular benefits...",
],
[
"SGLT2 inhibitors in heart failure with reduced ejection fraction",
"Sodium-glucose cotransporter-2 (SGLT2) inhibitors have demonstrated significant reductions in cardiovascular death and hospitalization in patients with HFrEF...",
],
[
"GLP-1 receptor agonists and weight management in obesity",
"GLP-1 receptor agonists (GLP-1 RAs) reduce body weight through multiple mechanisms including delayed gastric emptying, increased satiety, and reduced food intake...",
],
]
doc_embeddings = embed_articles(corpus)
print(f"Corpus embedded: {doc_embeddings.shape}") # (3, 768)
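Note that `embed_articles` tokenizes the whole corpus in one call, which works for three documents but not for thousands. A minimal batching helper (the batch size of 32 is an arbitrary choice; tune it to your hardware):

```python
def batched(items, batch_size=32):
    """Yield successive slices so the encoder never sees the full corpus at once."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Usage sketch with the embed_articles function above:
# doc_embeddings = torch.cat([embed_articles(batch) for batch in batched(corpus)])
print([len(b) for b in batched(list(range(70)), batch_size=32)])  # [32, 32, 6]
```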
Step 2: Embed Your Query
query_model = AutoModel.from_pretrained("ncbi/MedCPT-Query-Encoder")
query_tokenizer = AutoTokenizer.from_pretrained("ncbi/MedCPT-Query-Encoder")
query_model.eval()
def embed_query(query: str) -> torch.Tensor:
"""Returns (768,) embedding for a single query."""
with torch.no_grad():
encoded = query_tokenizer(
[query],
truncation=True,
padding=True,
return_tensors="pt",
max_length=64,
)
return query_model(**encoded).last_hidden_state[:, 0, :].squeeze()
query_emb = embed_query(
"What is the best first-line treatment for a newly diagnosed type 2 diabetic patient?"
)
print(f"Query embedded: {query_emb.shape}") # (768,)
Step 3: Retrieve
import torch.nn.functional as F
def retrieve(query_emb, doc_embeddings, corpus, top_k=3):
# Cosine similarity
scores = F.cosine_similarity(query_emb.unsqueeze(0), doc_embeddings)
top_indices = scores.argsort(descending=True)[:top_k]
return [(corpus[i][0], scores[i].item()) for i in top_indices]
results = retrieve(query_emb, doc_embeddings, corpus)
for title, score in results:
print(f" [{score:.3f}] {title}")
Output:
[0.847] Metformin as first-line therapy for type 2 diabetes
[0.612] GLP-1 receptor agonists and weight management in obesity
[0.489] SGLT2 inhibitors in heart failure with reduced ejection fraction
Step 4: Re-rank with the Cross-Encoder (Optional but Recommended)
The cross-encoder is slower but more accurate — use it to re-rank your top-K results:
from transformers import AutoModelForSequenceClassification
cross_tokenizer = AutoTokenizer.from_pretrained("ncbi/MedCPT-Cross-Encoder")
cross_model = AutoModelForSequenceClassification.from_pretrained("ncbi/MedCPT-Cross-Encoder")
cross_model.eval()
def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[tuple]:
pairs = [[query, doc] for doc in candidates]
with torch.no_grad():
encoded = cross_tokenizer(
pairs, truncation=True, padding=True,
return_tensors="pt", max_length=512,
)
scores = cross_model(**encoded).logits.squeeze(dim=1)
ranked = sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True)
return ranked[:top_k]
# The corpus here is small, so we re-rank all of it; in production, pass only the top retrieved docs
candidate_texts = [f"{c[0]}. {c[1]}" for c in corpus]
reranked = rerank(
"best first-line treatment for type 2 diabetes",
candidate_texts,
)
for doc, score in reranked:
print(f" [{score:.3f}] {doc[:80]}...")
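One caveat: the cross-encoder's logits are unbounded relevance scores, not probabilities; only their relative order matters. If you want display-friendly numbers, one option (my assumption, not something the model card prescribes) is a softmax over the candidate set:

```python
import math

def softmax(xs):
    """Numerically stable softmax: turns raw logits into values that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

logits = [7.2, 3.1, -0.4]  # hypothetical cross-encoder scores for three candidates
probs = softmax(logits)
print(round(sum(probs), 6))  # 1.0
```

These are relative to the candidate set you scored, so don't compare them across queries.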
Full RAG Pipeline in ~50 Lines
Here’s the complete pattern — drop in your own document collection and LLM:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
# --- Models (load once) ---
q_tok = AutoTokenizer.from_pretrained("ncbi/MedCPT-Query-Encoder")
q_mod = AutoModel.from_pretrained("ncbi/MedCPT-Query-Encoder").eval()
a_tok = AutoTokenizer.from_pretrained("ncbi/MedCPT-Article-Encoder")
a_mod = AutoModel.from_pretrained("ncbi/MedCPT-Article-Encoder").eval()
x_tok = AutoTokenizer.from_pretrained("ncbi/MedCPT-Cross-Encoder")
x_mod = AutoModelForSequenceClassification.from_pretrained("ncbi/MedCPT-Cross-Encoder").eval()
def encode(model, tokenizer, texts, max_len):
with torch.no_grad():
enc = tokenizer(texts, truncation=True, padding=True,
return_tensors="pt", max_length=max_len)
return model(**enc).last_hidden_state[:, 0, :]
def medcpt_rag(query: str, docs: list[str], top_k: int = 3) -> list[str]:
# 1. Embed
q_emb = encode(q_mod, q_tok, [query], 64)
d_emb = encode(a_mod, a_tok, [[d, ""] for d in docs], 512)
# 2. Dense retrieval (top 10)
scores = F.cosine_similarity(q_emb, d_emb)
top10 = scores.argsort(descending=True)[:10].tolist()
candidates = [docs[i] for i in top10]
# 3. Cross-encoder re-rank (top k)
pairs = [[query, c] for c in candidates]
with torch.no_grad():
enc = x_tok(pairs, truncation=True, padding=True,
return_tensors="pt", max_length=512)
rerank_scores = x_mod(**enc).logits.squeeze(dim=1)
ranked = sorted(zip(candidates, rerank_scores.tolist()),
key=lambda x: x[1], reverse=True)
return [doc for doc, _ in ranked[:top_k]]
# --- Use it ---
context_docs = medcpt_rag(
query="contraindications for ACE inhibitors",
docs=your_document_collection,
)
# Feed to any LLM
prompt = f"""You are a clinical AI assistant. Use only the following sources:
{chr(10).join(f'[{i+1}] {d}' for i, d in enumerate(context_docs))}
Question: What are the contraindications for ACE inhibitors?
Answer:"""
When to Use MedCPT vs General Embeddings
| Use case | MedCPT | General (text-embedding-3, BGE) |
|---|---|---|
| PubMed / clinical literature search | ✅ Better | ❌ Misses medical semantics |
| EHR / clinical notes retrieval | ✅ Better | ⚠️ Mediocre |
| Drug/disease ontology matching | ✅ Better | ❌ Struggles with synonyms |
| General FAQ / product docs | ⚠️ Overkill | ✅ Fine |
| Multilingual content | ❌ English only | ✅ Better |
| Very short texts (<10 words) | ⚠️ OK | ✅ Fine |
The rule of thumb: if your documents would appear on PubMed or in a clinical system, use MedCPT.
Pairing with a Vector DB
MedCPT outputs 768-dimensional vectors. Drop them straight into any vector store:
# Qdrant example
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
client = QdrantClient(":memory:") # or your Qdrant URL
client.create_collection(
collection_name="pubmed",
vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
# Index your corpus
points = [
PointStruct(id=i, vector=emb.tolist(), payload={"title": c[0], "abstract": c[1]})
for i, (c, emb) in enumerate(zip(corpus, doc_embeddings))
]
client.upsert(collection_name="pubmed", points=points)
# Search
hits = client.search(
collection_name="pubmed",
query_vector=query_emb.tolist(),
limit=5,
)
Works identically with Chroma, Weaviate, Pinecone, pgvector — it’s just a 768-dim vector.
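And for a corpus of a few thousand documents you don't strictly need a vector DB at all; brute-force cosine over an in-memory list is fast enough. A pure-Python sketch (`top_k` is a hypothetical helper, toy 2-dim vectors stand in for the 768-dim embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=5):
    """Return the k (index, score) pairs with highest cosine similarity."""
    scored = sorted(enumerate(cosine(query_vec, d) for d in doc_vecs),
                    key=lambda t: t[1], reverse=True)
    return scored[:k]

docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print([i for i, _ in top_k([1.0, 0.1], docs, k=2)])  # [0, 1]
```

Reach for a real vector store once you need persistence, metadata filtering, or sub-second search over millions of vectors.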
The 5M Number in Context
5 million downloads is notable not just as a vanity metric. It tells you:
- The model is production-tested — at this scale, most edge cases have already been hit and reported
- LitSense 2.0 uses it in production at NIH, serving millions of PubMed searches
- The community has done the integration work — there are examples for LangChain, LlamaIndex, Haystack, custom pipelines
- It’s not going away — NIH is committed to it, and the Hugging Face repo is actively maintained
For healthcare AI builders, that stability matters. General-purpose embedding models get deprecated, fine-tuned, versioned, and priced. MedCPT is open-weight, free, and institutionally backed.
Resources
- Paper: arXiv:2307.00589 — published in Bioinformatics
- Code: github.com/ncbi/MedCPT
- Query Encoder: huggingface.co/ncbi/MedCPT-Query-Encoder
- Article Encoder: huggingface.co/ncbi/MedCPT-Article-Encoder
- Cross Encoder: huggingface.co/ncbi/MedCPT-Cross-Encoder
- LitSense 2.0: https://www.ncbi.nlm.nih.gov/research/litsense/
If you’re building a medical AI system and you haven’t looked at MedCPT yet, now’s the time. Five million downloads suggest plenty of builders already have.