transcription
Production-ready audio/video transcription using OpenAI Whisper. Features model selection, timing sync (SRT/VTT/JSON), speaker diarization via pyannote, and optimized batch processing for media workflows.
Introduction
This skill provides a professional-grade framework for converting media assets into text with OpenAI Whisper. It is designed for developers, content creators, and media engineers who need high-accuracy, automated transcription workflows. By supporting several installation methods (standard Python packages, high-performance C++ via whisper.cpp, and GPU-accelerated execution with Insanely Fast Whisper), it adapts to diverse infrastructure requirements. The skill handles complex transcription tasks such as multi-speaker diarization with pyannote.audio, frame-accurate timing synchronization for editing software like Final Cut Pro, and bulk processing of large video libraries.
- Multi-engine support: Choose among OpenAI Whisper (Python), whisper.cpp (C++), and Insanely Fast Whisper (GPU) to match performance needs.
- Advanced export formats: Generate standard SRT and WebVTT for subtitles, or structured JSON with word-level timing for programmatic use.
- Speaker diarization: Integrated support for pyannote.audio to identify and label individual speakers in multi-voice content.
- Workflow optimization: Pre-processing tools include FFmpeg-based audio extraction, noise reduction using highpass and lowpass filters, and FFprobe analysis for frame-rate consistency.
- Batch processing: Automated scripts transcribe entire directories of media files, complete with temp-file cleanup and output management.
- Production-ready patterns: Includes guidance on model selection, from 'tiny' for quick previews to 'large-v3' for final high-accuracy production delivery.
- Recommended input: For optimal results, extract audio with FFmpeg as mono 16 kHz WAV (pcm_s16le).
- Contextual assistance: Improve accuracy by supplying an initial prompt containing domain-specific vocabulary or a context description.
- Scaling: Use environment-specific optimizations, such as CUDA device flags on GPU hardware, to significantly reduce processing time on long-form content.
- Constraints: Large models such as 'large-v3' require significant VRAM (approx. 10 GB); ensure hardware meets the minimum requirements for the chosen model size.
- File compatibility: Supports all standard video and audio containers (MP4, MOV, AVI, MP3, WAV) through FFmpeg integration.
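The FFmpeg extraction and noise-reduction steps above can be sketched as a small command builder. This is a minimal illustration, not the skill's actual script: the filter cutoffs (80 Hz highpass, 8 kHz lowpass) and the filenames are assumptions chosen for the example.

```python
import shlex

def build_extract_cmd(src: str, dst: str) -> list[str]:
    """Build an FFmpeg command that extracts mono 16 kHz PCM audio,
    with highpass/lowpass filters to suppress rumble and hiss."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vn",                                   # drop the video stream
        "-af", "highpass=f=80,lowpass=f=8000",   # simple noise shaping
        "-ac", "1",                              # mono
        "-ar", "16000",                          # 16 kHz sample rate
        "-c:a", "pcm_s16le",                     # 16-bit little-endian PCM
        dst,
    ]

cmd = build_extract_cmd("interview.mp4", "interview.wav")
print(shlex.join(cmd))
# subprocess.run(cmd, check=True) would execute it when FFmpeg is installed
```

Building the argument list instead of a shell string avoids quoting problems with paths that contain spaces.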
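Model selection, initial prompts, and SRT export could fit together as below, assuming the openai-whisper Python package. `whisper.load_model` and `model.transcribe(..., initial_prompt=...)` are that package's real API, but the helper names here are our own illustration.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def transcribe_to_srt(audio_path: str, model_name: str = "large-v3",
                      prompt: str | None = None) -> str:
    """Transcribe an audio file and render SRT. 'tiny' is fast for previews;
    'large-v3' trades speed (and ~10 GB VRAM) for final-delivery accuracy."""
    import whisper  # pip install openai-whisper; deferred so the module loads without it

    model = whisper.load_model(model_name)  # pass device="cuda" on GPU hardware
    result = model.transcribe(audio_path, initial_prompt=prompt)
    blocks = []
    for i, seg in enumerate(result["segments"], start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

The initial prompt is where domain vocabulary goes, e.g. `transcribe_to_srt("talk.wav", prompt="A lecture on pyannote, FFmpeg, and WebVTT")`.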
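The batch-processing bullet (directory walk, temp-file cleanup, output management) can be sketched as follows. The extension list and the `transcribe(src, workdir)` callback shape are assumptions for this example, not the skill's actual script interface.

```python
import tempfile
from pathlib import Path

MEDIA_EXTS = {".mp4", ".mov", ".avi", ".mp3", ".wav"}  # containers listed above

def is_media(path: Path) -> bool:
    """True for files the FFmpeg-backed pipeline can ingest."""
    return path.suffix.lower() in MEDIA_EXTS

def batch_transcribe(root: str, out_dir: str, transcribe) -> list[Path]:
    """Transcribe every media file under `root`, writing one .srt per input.
    `transcribe(src, workdir)` is a caller-supplied hook returning SRT text."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for media in sorted(p for p in Path(root).rglob("*") if is_media(p)):
        # TemporaryDirectory guarantees intermediate WAVs are cleaned up
        with tempfile.TemporaryDirectory() as workdir:
            srt = transcribe(media, Path(workdir))
        target = out / media.with_suffix(".srt").name
        target.write_text(srt, encoding="utf-8")
        written.append(target)
    return written
```

Sorting the file list makes reruns deterministic, which helps when resuming an interrupted bulk job.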
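Merging diarization output with transcript segments might look like this. pyannote.audio itself is not invoked here (it requires a Hugging Face token and a model download), so the `turns` list is assumed to come from its diarization pipeline; the maximum-overlap heuristic is one simple alignment choice among several.

```python
def assign_speakers(segments: list[dict],
                    turns: list[tuple[float, float, str]]) -> list[dict]:
    """Label each transcript segment with the speaker whose diarization
    turn overlaps it most. `segments` hold Whisper-style start/end/text;
    `turns` are (start, end, speaker_label) triples."""
    labeled = []
    for seg in segments:
        best, best_overlap = "UNKNOWN", 0.0
        for t_start, t_end, speaker in turns:
            # length of the intersection of the two time intervals
            overlap = min(seg["end"], t_end) - max(seg["start"], t_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labeled.append({**seg, "speaker": best})
    return labeled
```

A segment straddling a speaker change is attributed to whichever speaker covers more of it; finer word-level attribution would need word timestamps.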
Repository Stats
- Stars: 255
- Forks: 31
- Open Issues: 7
- Language: TypeScript
- Default Branch: main
- Sync Status: Idle
- Last Synced: Apr 29, 2026, 01:39 AM