Introduction

The speak skill provides a local-first text-to-speech (TTS) engine built on the Kokoro TTS model. It converts raw strings, text files, and documents into audio files without calling external cloud APIs, so the input text never leaves the machine. It is aimed at developers, content creators, and other users who need voice generation for accessibility, narration, or media production workflows. Because synthesis runs entirely locally, data stays under the user's control and no network round trip is involved.

Multilingual Support: Synthesize speech in English (US/UK), Mandarin (cmn), Japanese (ja), French (fr-fr), and Italian (it) using a diverse library of pre-trained voices.
Advanced Audio Customization: Adjust speech speed and blend multiple voice profiles into a single output voice.
Flexible Format Support: Accepts plain text strings and text files as well as structured formats such as EPUB and PDF, which enables automated audiobook creation and long-form narration.
No External Dependency: Operates fully offline; requires only the kokoro-v1.0.onnx model and voices-v1.0.bin files to be present in the working directory.
Stream Playback: Provides a stream option that sends audio directly to the output device for real-time feedback, with no intermediate files saved to disk.
Usage: Download the model files and place them in the project root. Then invoke the tool from the command line on a text file or string, using the --voice parameter to select a voice.
Constraints: Synthesis runs on local hardware, so performance depends on the machine's CPU/GPU capabilities. The uv tool must be installed to manage the dependencies.
Use Cases: Turning technical documentation into audio guides, building localized TTS agents, prototyping interactive voice interfaces, and creating personal reading assistants.
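
The prerequisites listed above (both model files present, uv installed) can be checked before running the tool. The following is a minimal sketch, not part of the speak skill itself: the helper names missing_model_files and uv_available are hypothetical, and the script assumes the model files are expected in the directory it is run from.

```python
from pathlib import Path
import shutil

# File names taken from the feature list above.
REQUIRED_FILES = ("kokoro-v1.0.onnx", "voices-v1.0.bin")


def missing_model_files(workdir: Path) -> list[str]:
    """Return the required model files that are absent from workdir."""
    return [name for name in REQUIRED_FILES if not (workdir / name).is_file()]


def uv_available() -> bool:
    """Report whether a `uv` executable can be found on PATH."""
    return shutil.which("uv") is not None


if __name__ == "__main__":
    # Assumption: the tool is launched from the directory holding the models.
    missing = missing_model_files(Path.cwd())
    if missing:
        print("Missing model files:", ", ".join(missing))
    if not uv_available():
        print("uv was not found on PATH")
    if not missing and uv_available():
        print("Preflight checks passed")
```

Running the script from the directory that should hold the model files reports anything that still needs to be downloaded or installed.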
