SentrySearch: Natural Language Search Over Hours of Video Footage

By Prahlad Menon · 4 min read

The use case is immediately obvious: you have hours of dashcam footage, security camera recordings, or raw video files, and you need to find one specific moment. Not a timestamp. Not a filename. Just: “red truck running a stop sign.”

SentrySearch does exactly that — and exports the trimmed clip.

How It Works

The architecture is straightforward once you see it. SentrySearch:

  1. Splits your mp4 files into overlapping chunks (default 30s, 5s overlap)
  2. Embeds each chunk as a video using either Gemini Embedding 2 (API) or Qwen3-VL (local)
  3. Stores the vectors in a local ChromaDB database
  4. Searches by embedding your text query into the same vector space and finding the nearest match
  5. Trims the top match from the original file and saves it as a clip

The key step is #2 — video embeddings, not frame-by-frame image embeddings. The model actually watches each chunk as a video and produces a single embedding that represents the visual content, motion, and sequence of events. This is what makes “red truck running a stop sign” work as a query rather than just “red truck” or “stop sign” independently.
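Steps 1, 3, and 4 above can be sketched in a few lines of Python. The `TinyIndex` class below is a minimal stand-in for the ChromaDB collection SentrySearch actually uses, and the function names are illustrative, not the tool's real API; the embeddings would come from whichever backend (Gemini or Qwen3-VL) produced the chunk vectors:

```python
import math

CHUNK_SECONDS, OVERLAP_SECONDS = 30, 5  # the stated defaults

def chunk_spans(duration_s, chunk_s=CHUNK_SECONDS, overlap_s=OVERLAP_SECONDS):
    """Step 1: overlapping (start, end) spans covering the whole file."""
    spans, start, step = [], 0, chunk_s - overlap_s
    while start < duration_s:
        spans.append((start, min(start + chunk_s, duration_s)))
        if start + chunk_s >= duration_s:
            break
        start += step
    return spans

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class TinyIndex:
    """Stand-in for the ChromaDB collection: store chunk vectors,
    query by cosine similarity against an embedded text query (step 4)."""
    def __init__(self):
        self.items = []  # list of (embedding, metadata) pairs

    def add(self, embedding, metadata):
        self.items.append((embedding, metadata))

    def query(self, query_embedding, n=3):
        ranked = sorted(self.items,
                        key=lambda it: cosine(query_embedding, it[0]),
                        reverse=True)
        return [(round(cosine(query_embedding, e), 2), m)
                for e, m in ranked[:n]]
```

A 70-second file with these defaults yields three overlapping spans — (0, 30), (25, 55), (50, 70) — so an event straddling a chunk boundary still lands fully inside at least one chunk.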

Two Backends: Cloud vs. Fully Local

Gemini backend (default): Uses Google’s Gemini Embedding 2 API. Better search quality, no local GPU required, free tier available at aistudio.google.com. Setup is one command:

sentrysearch init  # prompts for API key, validates, done

Local backend (fully private): Uses Qwen3-VL-Embedding running entirely on your machine. No API calls, no data leaves your system. Auto-detects your hardware and picks the right model:

| Hardware | Model | Notes |
| --- | --- | --- |
| Apple Silicon 24GB+ / NVIDIA 18GB+ VRAM | Qwen3-VL 8B | Full precision |
| Apple Silicon 16GB | Qwen3-VL 2B | 8B won't fit |
| NVIDIA 8–16GB VRAM | Qwen3-VL 8B (4-bit) | .[local-quantized] install |
| Intel Mac / CPU-only | — | Too slow; use Gemini API instead |
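The auto-detection presumably reduces to a lookup like the following. The function name and the exact decision shape are an illustrative sketch (the thresholds mirror the table above), not SentrySearch's actual code:

```python
def pick_local_model(platform: str, memory_gb: float,
                     quantized_extra: bool = False) -> str:
    """Choose a local embedding model from the hardware table.

    platform: "apple" for Apple Silicon (memory_gb = unified RAM),
              "nvidia" for an NVIDIA GPU (memory_gb = VRAM);
              anything else falls through to the cloud API.
    quantized_extra: whether the [local-quantized] install is present.
    """
    if platform == "apple":
        if memory_gb >= 24:
            return "Qwen3-VL 8B"          # full precision fits
        if memory_gb >= 16:
            return "Qwen3-VL 2B"          # 8B won't fit in 16 GB
    elif platform == "nvidia":
        if memory_gb >= 18:
            return "Qwen3-VL 8B"
        if memory_gb >= 8 and quantized_extra:
            return "Qwen3-VL 8B (4-bit)"  # needs [local-quantized]
    return "use Gemini API"               # Intel Mac / CPU-only: too slow
```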

Install the local backend:

uv tool install ".[local]"           # Mac / NVIDIA full precision
uv tool install ".[local-quantized]" # NVIDIA with 4-bit quantization

The Full Workflow

# Install
git clone https://github.com/ssrajadh/sentrysearch.git
cd sentrysearch
uv tool install .

# Index your footage
sentrysearch index /path/to/footage

# Search
sentrysearch search "red truck running a stop sign"

Output:

#1 [0.87] front_2024-01-15_14-30.mp4 @ 02:15-02:45
#2 [0.74] left_2024-01-15_14-30.mp4 @ 02:10-02:40
#3 [0.61] front_2024-01-20_09-15.mp4 @ 00:30-01:00

Saved clip: ./match_front_2024-01-15_14-30_02m15s-02m45s.mp4

Similarity scores are shown alongside each result. If the best score falls below the confidence threshold (default 0.41), SentrySearch prompts before trimming — you won't silently get a wrong clip. Passing --save-top N exports the top N clips instead of just the best match.
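The threshold behavior amounts to a small decision function. This is a guess at the shape of the logic — 0.41 matches the stated default, but the function and its signature are hypothetical, not SentrySearch's source:

```python
DEFAULT_THRESHOLD = 0.41  # the stated default confidence threshold

def clips_to_save(results, save_top=1, threshold=DEFAULT_THRESHOLD,
                  confirm=lambda msg: False):
    """results: list of (score, filename) sorted best-first.

    Returns the matches to trim. Matches scoring below the threshold
    are only kept if the user confirms (confirm defaults to 'no').
    """
    chosen = []
    for score, name in results[:save_top]:
        if score >= threshold or confirm(
                f"Score {score:.2f} is below {threshold}; trim anyway?"):
            chosen.append((score, name))
    return chosen
```

With the example output above, the default run trims only the 0.87 match; a hypothetical all-low-score result set would trigger the prompt instead of silently exporting a clip.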

Tesla Dashcam: Search + Telemetry Overlay

If you drive a Tesla, SentrySearch has a feature that goes beyond clip extraction. Starting with Tesla firmware 2025.44.25+, dashcam videos embed telemetry data directly in SEI NAL units in the video file — speed, GPS coordinates, location name, turn signal state. SentrySearch can read that data and burn it as a HUD overlay onto any trimmed clip:

# Search and auto-apply overlay to the result
sentrysearch search "running a red light" --overlay

# Or apply overlay to any Tesla dashcam file directly
sentrysearch overlay /path/to/tesla_video.mp4

The overlay shows speed, GPS coordinates, reverse-geocoded location name, and turn signal status — frame-accurate, since it’s reading from the embedded telemetry rather than estimating. For insurance claims, incident documentation, or just reviewing a close call, that context matters. A clip showing 52mph in a 25mph zone is a different artifact than the same clip without the data.
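Once the telemetry has been decoded from the SEI units, burning it onto a clip is a standard ffmpeg drawtext job. The sketch below builds such a command for a single static HUD line — the layout and field formatting are invented for illustration, and a frame-accurate renderer like SentrySearch's would vary the text per frame rather than use one fixed filter:

```python
def overlay_cmd(src, dst, speed_mph, lat, lon, place, signal):
    """Build an ffmpeg command that draws one telemetry HUD line."""
    hud = f"{speed_mph} mph | {lat:.5f},{lon:.5f} | {place} | signal: {signal}"
    hud = hud.replace(":", r"\:")  # drawtext treats ':' as an option separator
    return [
        "ffmpeg", "-i", src,
        "-vf",
        f"drawtext=text='{hud}'"
        ":x=10:y=h-40:fontsize=24:fontcolor=white:box=1:boxcolor=black@0.5",
        "-c:a", "copy",  # re-encode video for the overlay, pass audio through
        dst,
    ]
```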

Why This Matters Beyond the Demo

The obvious use case is security/dashcam footage review. But the underlying capability — semantic search over raw video using natural language — has a much wider range of applications:

Medical imaging and procedure review — search surgical recordings, endoscopy footage, or training videos by describing a specific maneuver or finding. Connects directly to the spatial reasoning work in MedOpenClaw — the same VLM spatial understanding that helps agents navigate 3D volumes could index video with the same approach.

Legal and compliance — search hours of deposition recordings, court proceedings, or workplace incident footage by event description rather than timestamp.

Sports analysis — “fast break leading to turnover,” “penalty kick save,” “player collision in the third quarter.” Frame-accurate clip extraction for coaching review.

Journalism and documentary research — search archive footage by content rather than metadata. Decades of raw footage that currently requires manual review becomes keyword-searchable.

Content moderation at scale — the local model option is particularly relevant here, where processing can’t go through third-party APIs for privacy or compliance reasons.

The Local-First Angle

The Gemini API backend is the path of least resistance for getting started. But the Qwen3-VL local option is the more interesting story.

Video content is one of the last major data types that hasn’t been made fully searchable without cloud dependency. Running a capable VLM locally for video embeddings changes the economics for anyone processing sensitive, proprietary, or high-volume footage where per-API-call costs or data residency requirements would make a cloud approach impractical.

The 8B model on Apple Silicon with 24GB+ RAM or a mid-range NVIDIA GPU is fast enough for real workloads. The 4-bit quantized option brings it down to hardware that’s genuinely accessible.

SentrySearch is a clean implementation of a capability that’s been theoretically possible for a while but hadn’t been packaged into something a developer could install and use in an afternoon. That’s the contribution.

Repo: github.com/ssrajadh/sentrysearch
OpenClaw skill: clawhub.ai/ssrajadh/natural-language-video-search