## Pipeline flowchart

## Step details
### 1 · Parse & Validate URL

**Module:** `youtube/parser.py → parse_youtube_url()`

Accepts any of these as input:

- Full YouTube watch URL (`https://youtube.com/watch?v=...`)
- Short URL (`https://youtu.be/...`)
- Embed URL or Shorts URL
- Bare 11-character video ID
- Bare playlist ID (`PL`, `UU`, `LL`, `FL`, `RD`, `UL`, `WL`, `OLAK5uy_`)
- Playlist URL (`?list=...`)
- Path to a `.txt` batch file (one URL per line)

Returns a `ParsedURL` domain object with `url_type` (`"video"` or `"playlist"`), `video_id`, and/or `playlist_id`. For playlists, `extract_playlist_videos()` expands the playlist into individual video IDs before the pipeline starts.
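The dispatch rules above can be sketched as follows. This is a simplified stand-in for `parse_youtube_url()`, not the real implementation: embed/Shorts URLs and `.txt` batch files are omitted, and the regex and prefix list merely mirror the rules listed above.

```python
import re
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse, parse_qs

VIDEO_ID_RE = re.compile(r"^[A-Za-z0-9_-]{11}$")
PLAYLIST_PREFIXES = ("PL", "UU", "LL", "FL", "RD", "UL", "WL", "OLAK5uy_")

@dataclass
class ParsedURL:
    url_type: str                      # "video" or "playlist"
    video_id: Optional[str] = None
    playlist_id: Optional[str] = None

def parse_youtube_url(raw: str) -> ParsedURL:
    raw = raw.strip()
    # Bare playlist ID, then bare 11-character video ID
    if "/" not in raw and raw.startswith(PLAYLIST_PREFIXES):
        return ParsedURL("playlist", playlist_id=raw)
    if VIDEO_ID_RE.match(raw):
        return ParsedURL("video", video_id=raw)
    parsed = urlparse(raw)
    query = parse_qs(parsed.query)
    if "list" in query:                # ?list=... takes priority over ?v=...
        return ParsedURL("playlist", playlist_id=query["list"][0])
    if parsed.netloc.endswith("youtu.be"):
        return ParsedURL("video", video_id=parsed.path.lstrip("/"))
    if "v" in query:
        return ParsedURL("video", video_id=query["v"][0])
    raise ValueError(f"unrecognized YouTube input: {raw!r}")
```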
### 2 · Fetch Video Metadata

**Module:** `youtube/metadata.py → get_video_metadata()`

Retrieves title, duration (seconds), and the chapter list. Two early-exit checks:

- **Cache hit** — if the video ID is in the SQLite cache and `--force` is not set → emit `VIDEO_SKIPPED`.
- **Output path check** — if notes already exist in the output directory and `--force` is not set → skip.
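The two checks compose as below. This is a hedged sketch: `cache.has()`, the `output_dir` parameter, and the exact notes filename are assumptions, not the real repository API.

```python
from pathlib import Path

# Sketch of the early-exit logic; cache.has() and the <title>.md
# filename convention are assumed names, not the documented API.
def should_skip(video_id: str, title: str, cache,
                output_dir: Path, force: bool) -> bool:
    if force:                                  # --force bypasses both checks
        return False
    if cache.has(video_id):                    # cache hit -> VIDEO_SKIPPED
        return True
    return (output_dir / f"{title}.md").exists()   # notes already on disk
```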
### 3 · Fetch Transcript

**Module:** `youtube/transcript.py → fetch_transcript()`

Tries the languages in the preference list in order (default: `["en"]`), falling back to the next if one is unavailable. If a `cookie_file` is configured, it is used for all requests, enabling private-video access.

- Retries: up to 3 attempts with backoff on network errors.
- Returns a `VideoTranscript` with a list of `TranscriptSegment` objects (text, start time, duration).
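The retry behaviour can be sketched like this. Assumptions: the real `fetch_transcript()` wraps a YouTube request directly rather than taking a callable, and its exact backoff curve and jitter are not documented here.

```python
import asyncio
import random

# Sketch of retry-with-backoff: up to `attempts` tries, doubling
# the delay after each network failure.
async def fetch_with_retries(request, attempts: int = 3,
                             base_delay: float = 1.0):
    delay = base_delay
    last_err = None
    for _ in range(attempts):
        try:
            return await request()
        except ConnectionError as err:          # retry network errors only
            last_err = err
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))
            delay *= 2                          # exponential backoff
    raise last_err                              # all attempts failed
```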
### 4 · Chunk & Generate

**Module:** `pipeline/generation.py → StudyMaterialGenerator`

Token counting uses LiteLLM's `token_counter` with the configured model. If the transcript exceeds 4,000 tokens (`DEFAULT_CHUNK_SIZE`), it is split into chunks with a 200-token overlap, breaking on sentence boundaries where possible.

**4a — Chapter-level generation.** Activated when the video has chapters and its duration exceeds 3,600 seconds (1 hour).

- Chapters are processed concurrently, up to 3 at a time.
- Each chapter is independently chunked and reduced to a Markdown document.
- Events: `CHAPTER_GENERATING` → `CHAPTER_CHUNK_GENERATING` → `CHAPTER_COMBINING` → `CHAPTER_COMPLETE`

**4b — Whole-video generation.** Used when chapter-level generation is not activated.

- Single-pass for small transcripts; multi-chunk with a combine step for large ones.
- Events: `GENERATION_START` → `CHUNK_GENERATING` → `GENERATION_COMBINING` → `GENERATION_COMPLETE`
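The chunking step above can be sketched as follows. This is not the real implementation: the generator counts tokens with LiteLLM's `token_counter`, whereas here a whitespace split stands in; only the constants mirror the documented defaults.

```python
import re

DEFAULT_CHUNK_SIZE = 4000   # max tokens per chunk
CHUNK_OVERLAP = 200         # tokens carried into the next chunk

def chunk_transcript(text: str, chunk_size: int = DEFAULT_CHUNK_SIZE,
                     overlap: int = CHUNK_OVERLAP) -> list[str]:
    # Split on sentence boundaries, then pack sentences into chunks.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks: list[str] = []
    current: list[str] = []
    tokens = 0
    for sentence in sentences:
        n = len(sentence.split())               # crude token count
        if current and tokens + n > chunk_size:
            chunks.append(" ".join(current))
            # Seed the next chunk with the tail of this one (overlap).
            tail: list[str] = []
            tail_tokens = 0
            for prev in reversed(current):
                tail.insert(0, prev)
                tail_tokens += len(prev.split())
                if tail_tokens >= overlap:
                    break
            current, tokens = tail, tail_tokens
        current.append(sentence)
        tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```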
### 5 · Write Outputs

**Module:** `pipeline/_artifacts.py`

| Condition | Output |
|---|---|
| Standard video | `<OUTPUT_DIR>/<sanitized title>.md` |
| Chapter-aware video | `<OUTPUT_DIR>/<title>/01_<chapter>.md`, `02_<chapter>.md`, … |
| `--quiz` | `<title>_quiz.md` alongside the notes |
| `--export-transcript txt` | `<title>_transcript.txt` |
| `--export-transcript json` | `<title>_transcript.json` |
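The "sanitized title" in the table could be produced by something like the sketch below; the exact rules in `pipeline/_artifacts.py` (length cap, which characters are dropped) are assumptions.

```python
import re

# Sketch of filename sanitization: drop filesystem-unsafe and
# control characters, collapse whitespace, cap the length.
def sanitize_title(title: str, max_len: int = 120) -> str:
    safe = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "", title)
    safe = re.sub(r"\s+", " ", safe).strip()
    return safe[:max_len] or "untitled"
```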
### 6 · Persist to Cache

**Module:** `storage/repository.py → DatabaseRepository.aupsert_video_cache()`

After a successful run, the following are written to SQLite:

- Video metadata (id, title, duration)
- Raw transcript text and language
- Token usage (prompt, completion, total)
- Cost estimate (USD)
- Timing (transcript-fetch seconds, generation seconds)
- Model name

These records back the `stats` / `history` commands.
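A minimal synchronous sketch of such an upsert is shown below. The real method is async (`aupsert_video_cache`) and its schema and column names are not documented here, so the ones used are assumptions.

```python
import sqlite3

# Assumed schema mirroring the fields listed above.
CACHE_SCHEMA = """
CREATE TABLE IF NOT EXISTS video_cache (
    video_id TEXT PRIMARY KEY,
    title TEXT, duration_s INTEGER,
    transcript TEXT, language TEXT,
    prompt_tokens INTEGER, completion_tokens INTEGER, total_tokens INTEGER,
    cost_usd REAL, fetch_s REAL, generation_s REAL, model TEXT
)"""

def upsert_video_cache(conn: sqlite3.Connection, row: dict) -> None:
    conn.execute(CACHE_SCHEMA)
    cols = ", ".join(row)
    params = ", ".join(":" + c for c in row)
    updates = ", ".join(f"{c} = excluded.{c}" for c in row if c != "video_id")
    # SQLite UPSERT: insert, or update the existing row on key conflict.
    conn.execute(
        f"INSERT INTO video_cache ({cols}) VALUES ({params}) "
        f"ON CONFLICT(video_id) DO UPDATE SET {updates}",
        row,
    )
    conn.commit()
```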
### 7 · Events & Result

`PipelineEvent` objects are emitted via the `on_event` callback throughout the run. The CLI subscribes via `PipelineDashboard` to update the live terminal UI. At the end, `CorePipeline.run()` returns a `PipelineResult` with `success_count`, `failure_count`, `total_count`, `errors`, and `metrics`. See Pipeline Events for the full event-type reference.

## Concurrency model

notewise uses `asyncio` throughout.
| Semaphore | Default | Config key |
|---|---|---|
| Video-level concurrency | 5 | `MAX_CONCURRENT_VIDEOS` |
| Chapter-level concurrency | 3 | Code default only |
| YouTube request rate | 10 / min | `YOUTUBE_REQUESTS_PER_MINUTE` |
Videos in a batch are processed concurrently up to `MAX_CONCURRENT_VIDEOS`. Within a single long video with chapters, chapters are processed concurrently up to `DEFAULT_MAX_CONCURRENT_CHAPTERS`.
A `PipelineSharedState` object is passed into `CorePipeline` during batch runs so all video instances share the same semaphores.
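The shared-semaphore pattern can be sketched as below. The constant mirrors the documented default; `process_video` and `run_batch` are stand-ins for the real per-video pipeline and batch driver, not the actual API.

```python
import asyncio

MAX_CONCURRENT_VIDEOS = 5

class PipelineSharedState:
    """One instance per batch, so every video shares the same limit."""
    def __init__(self) -> None:
        self.video_semaphore = asyncio.Semaphore(MAX_CONCURRENT_VIDEOS)

async def process_video(video_id: str, shared: PipelineSharedState) -> str:
    async with shared.video_semaphore:      # at most 5 videos in flight
        await asyncio.sleep(0)              # stand-in for fetch + generate
        return video_id

async def run_batch(video_ids: list[str]) -> list[str]:
    shared = PipelineSharedState()          # shared across all videos
    return list(await asyncio.gather(
        *(process_video(v, shared) for v in video_ids)))
```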