HUANGCHIHHUNGLeo/claude-real-video | GitHub repository details

claude-real-video

Let Claude — or any LLM — actually watch a video.


Same 58-second clip: fixed 1 fps sampling = 58 frames. crv keeps the 26 that actually differ — and --grid packs them into 3 contact sheets. Fewer tokens, nothing missed.
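The arithmetic behind that caption can be checked with a short sketch. The 3x3 sheet layout is an assumption made only for illustration (the text above does not say how many frames `--grid` puts on a sheet), and `sheets_needed` and `reduction_percent` are hypothetical helper names, not part of crv.

```python
import math


def sheets_needed(frames: int, cols: int = 3, rows: int = 3) -> int:
    """Contact sheets required to hold `frames` tiles.

    The 3x3 layout is an assumed grid size, not a documented crv default.
    """
    return math.ceil(frames / (cols * rows))


def reduction_percent(before: int, after: int) -> float:
    """Share of frames removed, as a percentage of the original count."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    print(sheets_needed(26))                    # sheets for the 26 kept frames
    print(round(reduction_percent(58, 26), 1))  # frames dropped vs 1 fps sampling
```

Under that assumed layout, 26 frames fit on 3 sheets, which matches the caption; any layout holding 9 to 12 frames per sheet gives the same count.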

Most AI tools don't really see a video. Paste a YouTube link into ChatGPT and it reads the transcript, not the picture. Claude won't take a video file at all. Even Gemini, which can read video natively, has to send it up to Google and samples frames at a fixed interval (1 fps by default), so fast cuts slip past.

claude-real-video does it differently, and locally: point it at a URL or a file, and it pulls the frames that actually matter (every scene change, not a fixed quota), throws away the near-duplicates, transcribes the audio, and hands you a clean folder any LLM can read. All the processing happens on your own machine; the only thing sent anywhere is the frames/text you choose to paste into an LLM afterwards.

| | fixed-interval sampling | claude-real-video |
|---|---|---|
| Frame selection | every N seconds | scene-change detection + density floor |
| Repeated shots (A-B-A cuts) | sent again every time | sliding-window dedup sends each shot once |
| Static slide (10 min) | ~600 near-identical frames | collapses to 1 (dedup) |
| Fast-cut reel | misses frames between samples | catches each visual change |
| Audio | often ignored | Whisper transcript w/ language detect |
| Where the processing happens | often in someone's cloud | on your machine (you choose what to share with an LLM afterwards) |
| Input | usually local file only | URL (yt-dlp) or local file |
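The "scene-change detection + density floor" row can be illustrated with a small sketch. This is not crv's actual implementation: `select_frames` and its parameters are hypothetical, the per-frame scores are assumed to come from some scene detector that rates each frame between 0 and 1, and the floor is interpreted here as "keep a frame whenever this many seconds have passed since the last kept one".

```python
from typing import List, Tuple


def select_frames(
    scores: List[Tuple[float, float]],  # (timestamp in seconds, scene-change score 0..1)
    scene_threshold: float = 0.30,
    floor_seconds: float = 1.0,
    max_frames: int = 150,
) -> List[float]:
    """Return the timestamps to keep.

    A frame is kept if it is a scene change (score >= scene_threshold) or if
    at least `floor_seconds` have elapsed since the previously kept frame.
    Selection stops once `max_frames` timestamps have been collected.
    """
    kept: List[float] = []
    for ts, score in scores:
        is_scene_change = score >= scene_threshold
        gap_too_long = not kept or ts - kept[-1] >= floor_seconds
        if is_scene_change or gap_too_long:
            kept.append(ts)
        if len(kept) >= max_frames:
            break
    return kept


if __name__ == "__main__":
    demo = [(0.0, 0.0), (0.5, 0.05), (1.0, 0.02), (1.2, 0.9),
            (1.5, 0.1), (2.0, 0.0), (2.3, 0.01)]
    print(select_frames(demo))
```

In the demo input, the frame at 1.2 s is kept because of its high score (a cut), while 1.0 s and 2.3 s are kept only because of the floor. A fixed 1-second sampler would have missed the cut at 1.2 s.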

| OS | command |
|---|---|
| macOS | `brew install ffmpeg` |
| Linux | `sudo apt install ffmpeg` (or your distro's package manager) |
| Windows | `winget install Gyan.FFmpeg`, or `choco install ffmpeg`, or download a build and add its `bin\` folder to your `PATH` |
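After installing, it can help to confirm that ffmpeg is actually reachable on your `PATH`. The snippet below is a generic check, not something crv ships: `ffmpeg_version` is a hypothetical helper that relies only on the standard library and on ffmpeg's `-version` flag.

```python
import shutil
import subprocess
from typing import Optional


def ffmpeg_version() -> Optional[str]:
    """Return the first line of `ffmpeg -version`, or None if ffmpeg is not on PATH."""
    path = shutil.which("ffmpeg")
    if path is None:
        return None
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = ffmpeg_version()
    print(version or "ffmpeg not found on PATH; install it for your OS first")
```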

| flag | default | meaning |
|---|---|---|
| `-o, --out` | `crv-out` | output directory |
| `--scene` | `0.30` | scene-change sensitivity (lower = more frames) |
| `--fps-floor` | `1.0` | at least one frame every N seconds |
| `--max-frames` | `150` | hard cap on total frames |
| `--lang` | `auto` | Whisper language (`en`, `zh`, `auto`, ...) |
| `--dedup-threshold` | `8` | % of pixels that must change for a frame to count as new; higher = fewer frames |
| `--dedup-window` | `4` | compare against the last N kept frames, so a shot the model already saw doesn't come back after a cutaway (`1` = consecutive-only) |
| `--report` | off | keep dropped frames in `./dropped` + write `report.html` visualising every keep/drop decision |
| `--no-transcribe` | off | skip audio |
| `--keep-audio` | off | also save the full soundtrack (`audio.m4a`) so audio models can hear it |
| `--why` | – | why you're watching, e.g. `--why "find the pricing strategy"`; written into `MANIFEST.txt` so the model analyses with that lens instead of a generic summary |
| `--kb` | – | also save the analysis as a dated markdown note into this folder (your Obsidian vault, notes dir, ...), so it joins your knowledge base instead of dying in `crv-out` |
| `--cookies` | – | Netscape cookie file for login-gated sources |
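How `--dedup-threshold` and `--dedup-window` interact is easiest to see in code. The sketch below is an assumption-laden illustration, not crv's source: frames are modelled as flat lists of grayscale values, and `changed_percent`, `dedup`, and the `pixel_tolerance` parameter are hypothetical names introduced only for this example.

```python
from collections import deque
from typing import Deque, List, Sequence

Frame = Sequence[int]  # flattened grayscale pixels, each 0..255


def changed_percent(a: Frame, b: Frame, pixel_tolerance: int = 16) -> float:
    """Percentage of pixels whose value differs by more than `pixel_tolerance`."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and the same size")
    changed = sum(1 for x, y in zip(a, b) if abs(x - y) > pixel_tolerance)
    return 100.0 * changed / len(a)


def dedup(frames: List[Frame], threshold: float = 8.0, window: int = 4) -> List[int]:
    """Return indices of frames that differ from every one of the last `window` kept frames."""
    recent: Deque[Frame] = deque(maxlen=window)  # oldest kept frame falls out automatically
    kept: List[int] = []
    for i, frame in enumerate(frames):
        is_new = all(changed_percent(frame, seen) >= threshold for seen in recent)
        if is_new:
            kept.append(i)
            recent.append(frame)
    return kept


if __name__ == "__main__":
    shot_a = [0] * 100
    shot_b = [255] * 100
    shot_a_noisy = [0] * 95 + [255] * 5  # only 5% of pixels changed vs shot_a
    clip = [shot_a, shot_b, shot_a, shot_a_noisy, shot_b]
    print(dedup(clip, window=4))  # A-B-A cut: each shot kept once
    print(dedup(clip, window=1))  # consecutive-only: shots return after a cutaway
```

With a window of 4, the second appearance of each shot is recognised and dropped. With a window of 1, only the immediately preceding kept frame is compared, so the A-B-A pattern sends shot A twice.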

HUANGCHIHHUNGLeo/claude-real-video

Project description

claude-real-video

Why not just sample frames?

Install

System requirement: ffmpeg

Usage

Options

What `--grid` output looks like

Use it from Python

How it works

Notes

crv Pro — understand how a video was shot

License
