CrawlAI RAG: Turn Any Website Into a Queryable Knowledge Base
CrawlAI RAG: Turn Any Website Into a Queryable Knowledge Base
Ever wanted to ask questions about an entire website without reading every page? CrawlAI RAG does exactly that—crawl a site, index all its content, and chat with it using natural language.
Repository: github.com/AnkitNayak-eth/CrawlAI-RAG
What It Does
CrawlAI RAG is an AI-powered website intelligence platform:
- Crawl — Spider through all internal pages of any website
- Index — Extract text, chunk it, generate embeddings, store in a vector database
- Query — Ask natural-language questions, get grounded answers
The answers come directly from the website content—no hallucinations, no making things up.
The Tech Stack
| Layer | Technology |
|---|---|
| Backend | FastAPI |
| Frontend | Streamlit |
| RAG Framework | LangChain |
| Vector Database | ChromaDB |
| Embeddings | Sentence-Transformers |
| LLM | Groq (LLaMA 3.3 70B) |
| Web Scraping | BeautifulSoup4 + Playwright |
This is a clean, modern RAG stack. FastAPI for speed, ChromaDB for local vector storage, and Groq for fast inference on LLaMA 3.3 70B.
Key Features
Multi-Website Indexing
Index multiple websites into the same vector database. Query across all of them at once—useful for research, competitive analysis, or building comprehensive knowledge bases.
True RAG Architecture
The system follows proper Retrieval-Augmented Generation:
- User query → retrieve most relevant chunks from ChromaDB
- Relevant chunks → LLM context
- LLM generates answer grounded in retrieved content
This minimizes hallucinations because the LLM can only reference what’s actually on the website.
Clean Text Extraction
Uses BeautifulSoup4 for static content and Playwright for JavaScript-rendered pages. Extracts readable text, strips boilerplate, and chunks intelligently.
How to Use It
1. Index a Website
Enter URL: https://example.com
Click "Index Website"
→ Crawls all pages, chunks content, generates embeddings
2. Ask Questions
"What is this website about?"
"List all services mentioned"
"Who is the author?"
"What pricing plans are available?"
The system retrieves relevant chunks and generates accurate, sourced answers.
Use Cases
Documentation Q&A Index your product docs, let users ask questions naturally. No more searching through hundreds of pages.
Competitive Intelligence Index competitor websites. Ask “What features do they offer?” or “How do they describe their pricing?”
Research & Analysis Crawl multiple sources on a topic. Query across all of them to synthesize information.
Client Onboarding Index a client’s existing website before a project. Quickly understand their business, services, and messaging.
Internal Knowledge Bases Turn your company wiki or intranet into a conversational interface.
Why ChromaDB + Groq?
ChromaDB is lightweight, runs locally, and doesn’t require external services. Perfect for prototyping and smaller-scale deployments.
Groq provides the fastest LLaMA inference available—running LLaMA 3.3 70B at hundreds of tokens per second. Free tier available for experimentation.
For production scale, you could swap ChromaDB for Qdrant/Pinecone and Groq for any OpenAI-compatible endpoint.
Getting Started
# Clone the repo
git clone https://github.com/AnkitNayak-eth/CrawlAI-RAG.git
cd CrawlAI-RAG
# Install dependencies
pip install -r requirements.txt
# Set up environment
cp .env.example .env
# Add your Groq API key
# Run the backend
uvicorn main:app --reload
# Run the frontend (separate terminal)
streamlit run app.py
The RAG Pipeline
Website URL
↓
Crawler (BeautifulSoup/Playwright)
↓
Raw HTML → Clean Text
↓
Text Chunker (LangChain)
↓
Sentence-Transformers Embeddings
↓
ChromaDB Vector Store
↓
User Query → Similarity Search
↓
Top-K Chunks → LLM Context
↓
Groq LLaMA 3.3 → Answer
Limitations
- JavaScript-heavy sites may require Playwright mode (slower)
- Rate limiting — be respectful when crawling; add delays for large sites
- No login/auth — can’t crawl authenticated pages
- Storage — large sites = large vector databases
Similar Tools
- Firecrawl — Production-grade web scraping API with LLM-ready output
- Crawl4AI — Async crawler optimized for AI/LLM data extraction
- ScrapeGraphAI — AI-powered scraping with graph-based extraction
CrawlAI RAG is simpler and more focused: crawl → index → query. If you need the full pipeline in one place, it’s a great starting point.
Built by Ankit Kumar Nayak. The Menon Lab covers open-source AI tools and emerging ML infrastructure.