site-md: Serve Markdown to AI Agents, HTML to Humans — Same URLs
Here’s the problem: AI agents are terrible at parsing HTML. They waste tokens on nav bars, cookie banners, and script tags just to get at your actual content. Meanwhile, your human visitors need all that chrome.
site-md solves this with content negotiation. Same URL, different response based on who’s asking:
GET /docs → <html>…</html> (humans)
GET /docs.md → # Docs … (agents)
GET /docs → # Docs … (Accept: text/markdown)
One-command install. No content duplication. No separate API.
Install
npx site-md
That’s the entire setup. The CLI:
- Writes middleware.ts (or merges into your existing one)
- Adds an API route at app/api/site-md/[...path]/route.ts
- Wraps your next.config with withNextMd
Test it:
curl http://localhost:3000/ # HTML
curl http://localhost:3000/index.md # Markdown
curl http://localhost:3000/llms.txt # Site index for LLMs
How Detection Works
A request gets Markdown when any of these match:
| Trigger | Example |
|---|---|
| Path ends with .md | /docs.md, /blog/post.md |
| ?format=md query param | /docs?format=md |
| Accept: text/markdown header | Agents negotiating content type |
| Known bot User-Agent | GPTBot, ClaudeBot, Googlebot |
| Path is /llms.txt or /llms-full.txt | Standard LLM index files |
Everything else passes through untouched. Your human visitors never see a difference.
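To make the table concrete, here is a minimal sketch of the detection logic as a pure function. It is not site-md's actual code: the `IncomingRequest` shape, the `wantsMarkdown` name, and the three-entry bot list are assumptions made for illustration.

```typescript
// Hypothetical request shape; site-md's real internals may differ.
interface IncomingRequest {
  pathname: string; // e.g. "/docs.md"
  query: Record<string, string | undefined>; // parsed query string
  headers: Record<string, string | undefined>; // lower-cased header names
}

// Small sample of bot names, taken from the table above (not a complete list).
const KNOWN_BOTS = ["GPTBot", "ClaudeBot", "Googlebot"];

const LLM_INDEX_PATHS = new Set(["/llms.txt", "/llms-full.txt"]);

// Returns true when any one of the five triggers matches.
function wantsMarkdown(req: IncomingRequest): boolean {
  const accept = (req.headers["accept"] ?? "").toLowerCase();
  const userAgent = (req.headers["user-agent"] ?? "").toLowerCase();

  return (
    req.pathname.endsWith(".md") ||
    req.query["format"] === "md" ||
    accept.includes("text/markdown") ||
    KNOWN_BOTS.some((bot) => userAgent.includes(bot.toLowerCase())) ||
    LLM_INDEX_PATHS.has(req.pathname)
  );
}
```

A browser request for /docs with Accept: text/html matches none of the triggers, so it would fall through to the normal HTML response.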
Why This Matters Practically
For agent builders
If you’re building agents that consume web content, you already know the pain: Jina Reader (the r.jina.ai prefix), Firecrawl, and similar wrappers are all workarounds for the fact that websites serve bloated HTML to everyone.
site-md puts the solution on the publisher side. If sites adopt this, your agent can just append .md to any URL and get clean content. No third-party extraction service needed.
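On the agent side, that could look something like the sketch below. The helper names `toMarkdownUrl` and `fetchMarkdown` are made up for this example, and the handling of trailing slashes is an assumption. The root path maps to /index.md, matching the curl example earlier.

```typescript
// Map a page URL to its Markdown twin by appending ".md" to the path.
// Assumption: a trailing slash is dropped first, and "/" becomes "/index.md".
function toMarkdownUrl(pageUrl: string): string {
  const url = new URL(pageUrl);
  if (url.pathname.endsWith(".md")) return url.toString();
  const trimmed = url.pathname.replace(/\/+$/, ""); // drop trailing slashes
  url.pathname = trimmed === "" ? "/index.md" : `${trimmed}.md`;
  return url.toString();
}

// Fetch the Markdown version, sending the Accept header as a second signal.
async function fetchMarkdown(pageUrl: string): Promise<string> {
  const response = await fetch(toMarkdownUrl(pageUrl), {
    headers: { Accept: "text/markdown" },
  });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}
```

Query strings survive the rewrite, so https://example.com/blog/post/?ref=x would become https://example.com/blog/post.md?ref=x.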
For site owners
Your content is already being consumed by AI. The question is whether it’s being consumed well. Garbled HTML parsing means:
- Your docs get misquoted
- Your product gets misunderstood
- Your content gets attributed incorrectly
Serving clean Markdown means AI agents represent your content accurately.
For SEO/discovery
The /llms.txt endpoint is becoming a de facto standard (like robots.txt for AI). It tells agents what content is available and how to access it. site-md generates this automatically from your sitemap.
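For a rough idea of what that generation step might involve, here is a sketch that turns a list of sitemap URLs into an llms.txt body. It loosely follows the layout from the llms.txt proposal (an H1 title, a blockquote summary, then a list of links). The `buildLlmsTxt` name, the "Pages" section heading, the path-based link labels, and the .md link targets are all assumptions, not site-md's real output.

```typescript
interface LlmsTxtInput {
  title: string;
  description: string;
  pageUrls: string[]; // URLs already extracted from the sitemap
}

// Build an llms.txt body: title, summary, then one Markdown link per page.
function buildLlmsTxt({ title, description, pageUrls }: LlmsTxtInput): string {
  const links = pageUrls.map((pageUrl) => {
    const url = new URL(pageUrl);
    const path = url.pathname.replace(/\/+$/, ""); // drop trailing slashes
    const label = path === "" ? "Home" : path.slice(1); // label from the path
    url.pathname = path === "" ? "/index.md" : `${path}.md`; // link to Markdown
    return `- [${label}](${url.toString()})`;
  });

  return [`# ${title}`, "", `> ${description}`, "", "## Pages", "", ...links, ""].join("\n");
}
```

A real generator would likely pull page titles from the pages themselves rather than from the URL path; the path-based label just keeps the sketch short.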
Configuration
The defaults work, but you can tune:
import { withNextMd } from "site-md/config";

export default withNextMd(
  { reactStrictMode: true },
  {
    cacheTTL: 600, // cache Markdown 10 min
    passthrough: ["/admin/*", "/app/*"], // never convert these
    stripSelectors: [".cookie-banner"], // remove from output
    bots: {
      trainingScrapers: "block", // block GPTBot, Bytespider
      searchCrawlers: "markdown",
      userAgents: "markdown",
    },
    llmsTxt: {
      title: "My Site",
      description: "Public docs for AI consumers",
      sitemapUrl: "/sitemap.xml",
    },
  }
);
The bots config is smart — you can serve Markdown to legitimate agents while blocking training scrapers entirely. Granular control over who gets what.
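One way to picture how such a policy could be applied: classify the User-Agent into a category, then look up the configured action. This is a sketch, not site-md's implementation. The `resolveBotAction` helper, the "passthrough" fallback value, and the bot-to-category grouping are assumptions (in particular, reading userAgents as "user-triggered agents" such as ChatGPT-User is a guess).

```typescript
type BotAction = "markdown" | "block" | "passthrough";
type BotCategory = "trainingScrapers" | "searchCrawlers" | "userAgents";
type BotPolicy = Record<BotCategory, BotAction>;

// Assumed grouping for illustration only; the real lists are not shown here.
const BOT_CATEGORIES: Record<BotCategory, string[]> = {
  trainingScrapers: ["GPTBot", "Bytespider"],
  searchCrawlers: ["Googlebot"],
  userAgents: ["ChatGPT-User"],
};

// Find the first category whose bot names appear in the User-Agent string
// and return the action configured for it; unknown clients pass through.
function resolveBotAction(userAgent: string, policy: BotPolicy): BotAction {
  const ua = userAgent.toLowerCase();
  for (const category of Object.keys(BOT_CATEGORIES) as BotCategory[]) {
    const matches = BOT_CATEGORIES[category].some((name) =>
      ua.includes(name.toLowerCase()),
    );
    if (matches) return policy[category];
  }
  return "passthrough";
}
```

With the policy from the config above, a GPTBot request would resolve to "block" and a Googlebot request to "markdown", while an ordinary browser falls through untouched.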
The Bigger Picture: Content Negotiation for AI
This is essentially HTTP content negotiation — a pattern from the ’90s — applied to the AI era. The server looks at who’s asking and responds appropriately:
- Accept: text/html → Full rendered page
- Accept: text/markdown → Clean content
- Known bot UA → Clean content
- .md extension → Clean content
It’s elegant because it uses existing web standards rather than inventing new infrastructure.
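For the Accept-header part of that negotiation, here is a small sketch that respects q-values, so a client sending `text/markdown;q=0.5, text/html` still gets HTML. Whether site-md weighs q-values this way is not stated here; `parseAccept` and `chooseRepresentation` are illustrative names, and treating wildcards like `*/*` as "HTML is fine" is a deliberate simplification.

```typescript
// Parse an Accept header into media types with q-values (default q = 1).
function parseAccept(header: string): Map<string, number> {
  const weights = new Map<string, number>();
  for (const part of header.split(",")) {
    const segments = part.split(";");
    const mediaType = (segments[0] ?? "").trim().toLowerCase();
    if (!mediaType) continue;
    let q = 1;
    for (const param of segments.slice(1)) {
      const [key = "", value = ""] = param.split("=");
      if (key.trim().toLowerCase() === "q") {
        const parsed = Number.parseFloat(value);
        if (!Number.isNaN(parsed)) q = parsed;
      }
    }
    weights.set(mediaType, q);
  }
  return weights;
}

type Representation = "text/html" | "text/markdown";

// HTML stays the default; Markdown wins only when it is listed explicitly
// and weighted at least as high as HTML (or a wildcard standing in for it).
function chooseRepresentation(acceptHeader: string | undefined): Representation {
  const weights = parseAccept(acceptHeader ?? "");
  const markdownQ = weights.get("text/markdown") ?? 0;
  const htmlQ =
    weights.get("text/html") ?? weights.get("text/*") ?? weights.get("*/*") ?? 0;
  return markdownQ > 0 && markdownQ >= htmlQ ? "text/markdown" : "text/html";
}
```

When the body depends on the Accept header, the response should also carry `Vary: Accept` so shared caches keep the HTML and Markdown versions apart.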
How It Compares
| Approach | Publisher effort | Agent effort | Quality |
|---|---|---|---|
| Raw HTML scraping | None | High (parsing) | Poor |
| Jina Reader / Firecrawl | None | Medium (API call) | Good |
| Separate /api/content | High (build + maintain) | Low | Good |
| site-md | Low (one command) | Low (append .md) | Excellent |
The tradeoff is clear: minimal publisher effort, minimal agent effort, maximum content fidelity.
Limitations
- Next.js only: this is middleware-based, tightly coupled to Next.js App Router. If you’re on Astro, Remix, or plain HTML, you’ll need a different approach.
- Dynamic content: works best for content-heavy pages. Highly interactive SPAs won’t produce useful Markdown.
- Cache invalidation: the cacheTTL means agents might see slightly stale content. Set it low for fast-changing pages.
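To see why a TTL implies staleness, consider a minimal in-memory TTL cache like the one sketched below. This is a generic illustration, not site-md's cache: the `TtlCache` class is hypothetical, and the TTL is assumed to be in seconds, in line with the "600 = 10 min" comment in the config.

```typescript
// Generic TTL cache: entries are served until they expire, even if the
// underlying page has changed in the meantime.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlSeconds: number;
  private now: () => number;

  // `now` is injectable so expiry can be tested without waiting.
  constructor(ttlSeconds: number, now: () => number = () => Date.now()) {
    this.ttlSeconds = ttlSeconds;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: force a fresh conversion
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, {
      value,
      expiresAt: this.now() + this.ttlSeconds * 1000,
    });
  }
}
```

With a TTL of 600, an edit published one minute after a page was cached stays invisible to agents for up to nine more minutes, which is the tradeoff the cacheTTL setting controls.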
Try It
If you run a Next.js site with docs, a blog, or any content that agents should be able to consume cleanly:
npx site-md --title "Your Site" --description "Docs for humans and machines" --yes
Two files. Zero content duplication. Your site now speaks both languages.
site-md on GitHub — MIT licensed, by @yazinsai