transcription

Production-ready audio/video transcription using OpenAI Whisper. Features model selection, timing synchronization, speaker diarization, and batch processing for media workflows.

Introduction

This skill provides a framework for speech-to-text transcription with OpenAI Whisper. It is aimed at developers and content creators who need automated transcription, subtitle generation, and speaker identification in their media pipelines. It supports several installation options: the standard Python-based OpenAI Whisper, the C++ implementation whisper.cpp, and the GPU-accelerated Insanely Fast Whisper, so users can trade off speed, hardware constraints, and accuracy.

The skill covers the end-to-end workflow, starting with media preparation: ffmpeg is used to extract audio in a format suited to the models (for example, 16 kHz mono WAV). It also provides post-processing patterns, including converting Whisper JSON output to frame-accurate timing for video editing suites such as Final Cut Pro, and using pyannote.audio for speaker diarization to identify distinct voices in multi-speaker recordings.
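The two steps described above can be sketched as follows. This is a minimal illustration, not the skill's own scripts: the function names are hypothetical, the ffmpeg flags are one common way to produce 16 kHz mono 16-bit WAV, and the timecode conversion assumes an integer, non-drop-frame frame rate and a Whisper-style result with a `segments` list of `start`/`end`/`text` entries. Fractional rates such as 29.97 fps need drop-frame handling that is not shown here.

```python
from pathlib import Path
from typing import Dict, List, Optional


def ffmpeg_extract_args(src: str, dst: Optional[str] = None) -> List[str]:
    """Build an ffmpeg command that writes 16 kHz mono 16-bit PCM WAV.

    Hypothetical helper: it only builds the argument list; run it with
    subprocess.run(args, check=True) on a machine with ffmpeg installed.
    """
    out = dst or str(Path(src).with_suffix(".wav"))
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", src,             # input audio or video file
        "-vn",                 # ignore any video stream
        "-ac", "1",            # downmix to a single (mono) channel
        "-ar", "16000",        # resample to 16 kHz
        "-c:a", "pcm_s16le",   # 16-bit little-endian PCM
        out,
    ]


def seconds_to_timecode(seconds: float, fps: int = 30) -> str:
    """Convert seconds to HH:MM:SS:FF (assumes integer, non-drop-frame fps)."""
    total_frames = round(seconds * fps)   # snap to the nearest frame
    frames = total_frames % fps
    total_seconds = total_frames // fps
    hh, rem = divmod(total_seconds, 3600)
    mm, ss = divmod(rem, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{frames:02d}"


def segments_to_timecodes(result: Dict, fps: int = 30) -> List[Dict]:
    """Map each Whisper-style segment to frame-aligned in/out timecodes."""
    return [
        {
            "in": seconds_to_timecode(seg["start"], fps),
            "out": seconds_to_timecode(seg["end"], fps),
            "text": seg["text"].strip(),
        }
        for seg in result.get("segments", [])
    ]
```

Rounding to the nearest frame before splitting into hours, minutes, and seconds keeps the frame field from ever reaching `fps`, so a timestamp such as 0.99 s at 25 fps rolls over cleanly to the next second.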

  • Multi-model support: Choose from tiny, base, small, medium, and large-v3 models based on VRAM capacity and accuracy needs.
  • Format flexibility: Generate industry-standard subtitle formats including SRT, VTT, and detailed JSON with word-level timestamps.
  • Audio engineering: Includes precise ffmpeg recipes for noise reduction, highpass/lowpass filtering, and channel normalization.
  • Workflow automation: Pre-configured bash scripts for batch processing entire directories of video files.
  • Performance optimization: Guidelines for GPU usage (CUDA), initial prompts that give the model context, and chunking strategies for long-form content.
  • Input: Supports raw audio files (mp3, wav) and video containers (mp4, mov, avi) via automated extraction.
  • Output: Time-coded text files, JSON metadata, and diarized speaker logs.
  • Constraints: Requires local compute (a GPU with ample VRAM is recommended for large-v3) and a configured environment (Python or C++ dependencies).
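As an illustration of the subtitle output listed above, the sketch below turns Whisper-style segments into SRT text. It assumes each segment is a dict with `start` and `end` in seconds and a `text` string; the function names are hypothetical and not taken from the skill. In practice the Whisper command-line tool can write SRT directly, so a converter like this is mainly useful when segments are post-processed first.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms_total = round(seconds * 1000)          # work in whole milliseconds
    hh, rem = divmod(ms_total, 3_600_000)
    mm, rem = divmod(rem, 60_000)
    ss, ms = divmod(rem, 1000)
    return f"{hh:02d}:{mm:02d}:{ss:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        text = seg["text"].strip()            # Whisper text often has a leading space
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)
```

SRT uses a comma before the milliseconds, while VTT uses a period and begins with a `WEBVTT` header line, so the same segment loop can be adapted for VTT with those two changes.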

Repository Stats

Stars: 255
Forks: 31
Open Issues: 7
Language: TypeScript
Default Branch: main
Sync Status: Idle
Last Synced: Apr 29, 2026, 08:08 AM