
transcription

Production-ready audio/video transcription using OpenAI Whisper. Features model selection, timing sync (SRT/VTT/JSON), speaker diarization via pyannote, and optimized batch processing for media workflows.

Introduction

This skill provides a professional-grade framework for converting media assets into text using OpenAI Whisper. It is designed for developers, content creators, and media engineers who require high-accuracy, automated transcription workflows. By supporting various installation methods—including standard Python packages, high-performance C++ via whisper.cpp, and GPU-accelerated execution with Insanely Fast Whisper—it adapts to diverse infrastructure requirements. The skill enables users to handle complex transcription tasks such as multi-speaker diarization using pyannote.audio, frame-accurate timing synchronization for editing software like Final Cut Pro, and bulk processing for large video libraries.

  • Multi-engine support: Choose between OpenAI Whisper (Python), whisper.cpp (C++), and Insanely Fast Whisper (GPU) for varied performance needs.

  • Advanced export formats: Generate standard SRT and WebVTT for subtitles or structured JSON with word-level timing for programmatic use.

  • Speaker diarization: Integrated support for pyannote.audio to identify and label individual speakers in multi-voice content.

  • Workflow optimization: Pre-processing tools include FFmpeg-based audio extraction, noise reduction using highpass and lowpass filters, and FFprobe analysis for frame-rate consistency.

  • Batch processing: Automated scripts are provided to transcribe entire directories of media files, complete with temp-file cleanup and output management.

  • Production-ready patterns: Includes guidance on model selection—from 'tiny' for quick previews to 'large-v3' for final high-accuracy production delivery.

  • Recommended input: For optimal results, extract audio using FFmpeg as mono-channel 16kHz WAV (pcm_s16le).

  • Contextual assistance: Enhance accuracy by providing initial prompts that include domain-specific vocabulary or context descriptions.

  • Scaling: Use environment-specific optimizations like CUDA device flags for GPU hardware to significantly reduce processing time on long-form content.

  • Constraints: Large models like 'large-v3' require significant VRAM (approx. 10GB); ensure hardware meets minimum requirements for the chosen model size.

  • File compatibility: Supports all standard video and audio containers (MP4, MOV, AVI, MP3, WAV) through FFmpeg integration.

Repository Stats

  • Stars: 255
  • Forks: 31
  • Open Issues: 7
  • Language: TypeScript
  • Default Branch: main
  • Sync Status: Idle
  • Last Synced: Apr 29, 2026, 01:39 AM