D

Draftr AI

2h ago

The format you already know

URL → clean markdown

Prompt → scripts → QA

Word-level forced alignment

9:16 · libass · sidechain

Short-form video, pointed at things worth knowing.

People will sit through a five-minute breakdown of quantum computing delivered over Minecraft parkour. They won't open the paper it's based on. We didn't set out to exploit that. We set out to point it at something worth knowing.

Draftr is a pipeline that takes any written source (a URL, a PDF, a raw paste) and produces a fully rendered, narrated, subtitled short-form video. No editing. No recording. No script writing. You drop the content in chat. The machine handles everything else.

The same format that made you watch six hours of gameplay clips can make you actually learn something. We just had to build the machine.

Getting the text out of anything.

The internet stores knowledge in dozens of formats: behind JavaScript renders, inside PDF binaries, spread across multi-page documentation sites. We use Firecrawl to reach all of them. For a single article URL, the backend calls Firecrawl's scrape endpoint and requests both the markdown and summary formats. Firecrawl renders the page, strips the nav, footer, and ads, and returns the actual content as clean markdown.
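A minimal sketch of what that single-URL request could look like. The endpoint path, the `formats` field, and the helper names `build_scrape_payload` and `scrape` are assumptions for illustration, not Draftr's actual backend code; check them against Firecrawl's current API reference before relying on them.

```python
import json
import urllib.request

# Assumed endpoint; verify against Firecrawl's current API reference.
FIRECRAWL_SCRAPE_URL = "https://api.firecrawl.dev/v2/scrape"


def build_scrape_payload(url: str) -> dict:
    """Request body asking for both the markdown and summary formats."""
    return {"url": url, "formats": ["markdown", "summary"]}


def scrape(url: str, api_key: str, endpoint: str = FIRECRAWL_SCRAPE_URL) -> dict:
    """POST the payload and return the decoded JSON response (hypothetical helper)."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(build_scrape_payload(url)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping the payload builder separate from the network call makes the request shape easy to inspect without spending API credits.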

For an entire documentation website, we first call the map endpoint to enumerate candidate URLs across the domain, then rank and filter them by path heuristics. Paths containing /blog, /docs, or /research score up. Paths like /tag or /legal score down. The top-ranked URLs go into a batch crawl job, polled until completion.
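The path heuristics above can be sketched as a simple scoring function. The weights (plus or minus one per match), the "score above zero" cutoff, and the names `score_url` and `top_urls` are assumptions; the source only says which paths score up and which score down.

```python
from urllib.parse import urlparse

# Path fragments from the description above; the +1/-1 weights are assumed.
BOOST = ("/blog", "/docs", "/research")
PENALTY = ("/tag", "/legal")


def score_url(url: str) -> int:
    """Score a URL by the path fragments it contains."""
    path = urlparse(url).path.lower()
    score = sum(1 for marker in BOOST if marker in path)
    score -= sum(1 for marker in PENALTY if marker in path)
    return score


def top_urls(urls: list[str], limit: int = 10) -> list[str]:
    """Keep positively scored URLs, best first, capped at `limit`."""
    ranked = sorted(urls, key=score_url, reverse=True)
    return [url for url in ranked if score_url(url) > 0][:limit]
```

The list returned by `top_urls` is what would be handed to the batch crawl job.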

A script pipeline that writes for the ear.

Once the source text is ingested, it goes to the Producer stage. The backend sends the cleaned brief to OpenAI, which returns a structured bundle of candidate scripts. Each script includes a title, hook line, narration text, caption text, estimated duration, visual beat notes, gameplay tags, music tags, and the source facts it used.
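One way to picture the bundle is as a list of typed records. The field names below mirror the list in the paragraph above, but the exact keys, the `scripts` wrapper, and the `CandidateScript` and `parse_bundle` names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateScript:
    """One script in the Producer's bundle (field names are assumed)."""
    title: str
    hook: str
    narration: str
    caption: str
    estimated_duration_s: float
    visual_beats: list[str] = field(default_factory=list)
    gameplay_tags: list[str] = field(default_factory=list)
    music_tags: list[str] = field(default_factory=list)
    source_facts: list[str] = field(default_factory=list)


def parse_bundle(payload: dict) -> list[CandidateScript]:
    """Turn a decoded JSON bundle into typed scripts.

    Raises TypeError if a script is missing a required field or carries
    an unexpected one, which doubles as a basic schema-shape check.
    """
    return [CandidateScript(**item) for item in payload["scripts"]]
```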

The goal is not one generic summary. The Producer generates up to fifteen scripts per batch, each covering a different angle of the source material. Every narration text must land inside a configured word range targeting 25–30 second videos.
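To make the word range concrete, here is a rough sketch that derives it from the target duration. The 150 words-per-minute narration pace is an assumption, not a figure from the pipeline; the real configured range may differ.

```python
import math


def word_range(min_seconds: float, max_seconds: float,
               words_per_minute: int = 150) -> tuple[int, int]:
    """Word counts that keep narration inside the duration window.

    The lower bound rounds up and the upper bound rounds down, so any
    count in the range stays within the window at the assumed pace.
    """
    low = math.ceil(min_seconds * words_per_minute / 60)
    high = math.floor(max_seconds * words_per_minute / 60)
    return low, high


def narration_in_range(narration: str, low: int, high: int) -> bool:
    """True when the whitespace-separated word count falls in [low, high]."""
    return low <= len(narration.split()) <= high
```

At the assumed pace, `word_range(25, 30)` gives a window of 63 to 75 words.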

After generation, the backend runs a local QA and repair pass. It checks hook grounding, duplicate ideas, schema shape, pacing, and word counts. If a script bundle fails validation, the backend sends the errors back for repair and retries up to three times before isolating the remaining failed slots and continuing independently.
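The validate, repair, retry loop might be sketched like this. The `validate` and `repair` callables are hypothetical stand-ins for the real checks and the model round-trip, and treating "isolating" as "return the failed slot indices alongside the passing scripts" is one reading of the description, not a confirmed detail.

```python
from typing import Callable


def collect_errors(bundle: list, validate: Callable) -> dict[int, list[str]]:
    """Map each failing slot index to its list of validation errors."""
    errors = {}
    for slot, script in enumerate(bundle):
        problems = validate(script)
        if problems:
            errors[slot] = problems
    return errors


def run_qa(bundle: list, validate: Callable, repair: Callable,
           max_repairs: int = 3) -> tuple[list, list[int]]:
    """Repair failing scripts up to `max_repairs` times.

    Returns the scripts that pass and the slot indices that still fail,
    so the caller can handle those slots separately.
    """
    errors = collect_errors(bundle, validate)
    attempts = 0
    while errors and attempts < max_repairs:
        bundle = repair(bundle, errors)  # assumed to return a new bundle
        errors = collect_errors(bundle, validate)
        attempts += 1
    passed = [s for slot, s in enumerate(bundle) if slot not in errors]
    return passed, sorted(errors)
```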

Every word, timestamped to the millisecond.

With scripts in hand, the orchestrator calls the Narrator. This is a separate agent from the Producer, built on ElevenLabs, whose job is to convert the narration text into audio and return word-level timestamps. Every word in the output has a precise start and end time in milliseconds, produced by running ElevenLabs' forced alignment over the generated audio.

The word timings are then passed directly into the subtitle generator, which builds an Advanced SubStation Alpha (.ass) track, the same subtitle format used in anime fansubs and broadcast production. Unlike SRT captions, which just display text at timestamps, ASS supports per-word animation effects. Draftr ships with five subtitle presets: a karaoke-style sweep, and four single-word-pop variants using the Komika, Bebas Neue, Anton, and Lilita One typefaces.
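As a rough illustration of the single-word-pop idea, the sketch below turns word timings into ASS `Dialogue` events, one per word. The input shape (`text`, `start_ms`, `end_ms`) and the style name `Pop` are assumptions; a complete .ass file also needs `[Script Info]`, `[V4+ Styles]`, and `[Events]` headers, which are omitted here.

```python
def ass_time(ms: int) -> str:
    """Format milliseconds as an ASS timestamp, H:MM:SS.cc (centiseconds)."""
    centis = (ms + 5) // 10  # round to the nearest centisecond
    hours, rest = divmod(centis, 360_000)
    minutes, rest = divmod(rest, 6_000)
    seconds, centis = divmod(rest, 100)
    return f"{hours}:{minutes:02d}:{seconds:02d}.{centis:02d}"


def word_pop_events(words: list[dict], style: str = "Pop") -> list[str]:
    """One Dialogue line per word, visible only while that word is spoken.

    Field order follows the standard ASS event format:
    Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text.
    """
    return [
        f"Dialogue: 0,{ass_time(w['start_ms'])},{ass_time(w['end_ms'])},"
        f"{style},,0,0,0,,{w['text']}"
        for w in words
    ]
```

ASS timestamps only carry centisecond precision, so millisecond timings are rounded on the way in.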

Clip in. MP4 out. Nothing in between.

The final step is assembly. The renderer takes three inputs (gameplay video, narration audio, and the generated .ass subtitle file) and runs them through a single FFmpeg encode pass. The gameplay clip is cropped and scaled to fill a 9:16 vertical frame. The narration audio goes on the primary audio track.

The subtitle track is burned directly into the video using libass during the encode pass. The final file is a self-contained MP4. It goes straight to Supabase storage and the public URL is returned to the orchestrator. A batch of ten videos generates in about three minutes from the moment ingestion completes.
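The assembly step can be sketched as a single FFmpeg invocation built in Python. The 1080x1920 output size, the libx264 and AAC codecs, the `-shortest` flag, and the function name are assumptions; the source only specifies the 9:16 crop-and-scale, the narration on the primary audio track, and the libass burn-in.

```python
def build_ffmpeg_command(gameplay: str, narration: str, subtitles: str,
                         output: str, width: int = 1080,
                         height: int = 1920) -> list[str]:
    """Build one encode pass: fill a vertical frame, burn subtitles, mux audio."""
    video_filter = (
        # Scale until the frame is fully covered, then crop the overflow.
        f"scale={width}:{height}:force_original_aspect_ratio=increase,"
        f"crop={width}:{height},"
        # The ass filter requires an FFmpeg build with libass enabled.
        f"ass={subtitles}"
    )
    return [
        "ffmpeg", "-y",
        "-i", gameplay,      # input 0: gameplay video
        "-i", narration,     # input 1: narration audio
        "-vf", video_filter,
        "-map", "0:v:0",     # video from the gameplay clip
        "-map", "1:a:0",     # audio from the narration
        "-c:v", "libx264",
        "-c:a", "aac",
        "-shortest",         # stop when the shorter input ends
        output,
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)`. Subtitle paths containing characters such as `:` or `'` need escaping inside the filter string, which this sketch does not handle.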