gemini-audio
Implement Google Gemini API audio capabilities: process, transcribe, and summarize audio files, analyze environmental sounds, and generate natural speech with controllable TTS.
Introduction
This skill wraps the audio capabilities of the Google Gemini API so that developers and analysts can add transcription, summarization, and multi-modal audio understanding to their workflows. Using models such as gemini-2.5-flash and gemini-2.5-pro, it handles inputs ranging from professional podcasts and meeting recordings to raw environmental audio. It also hides much of the File API's complexity: uploading long recordings (up to 9.5 hours of audio), managing file retention, and keeping token consumption, and therefore cost, under control.
- Transcribe audio files to text with high accuracy, including support for timestamp generation in MM:SS format and multi-speaker identification.
- Summarize complex audio content, extract key action items, and perform semantic analysis on speech, music, or environmental sounds like birdsong or sirens.
- Generate high-quality, natural-sounding speech from text input with advanced control over style, pace, tone, and accent using the Gemini TTS native audio models.
- Support for multiple industry-standard audio formats including WAV, MP3, AAC, FLAC, OGG, and AIFF, with automated downsampling for processing efficiency.
- Integrated helper scripts for common developer tasks like batch transcription, specific segment analysis, and audio-to-text workflows.
- Flexible input methods including direct file uploads for large datasets exceeding 20MB and inline byte transmission for smaller audio snippets.
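The choice between inline bytes and a File API upload can be sketched as follows. This is a minimal illustration, not the skill's own helper script: it assumes the `google-genai` Python SDK is installed and that `GEMINI_API_KEY` is set in the environment. The names `choose_input_method`, `transcribe`, and `AUDIO_MIME_TYPES` are hypothetical and introduced only for this example.

```python
import os

# 20MB total-request limit for inline audio, as cited in this skill.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024

# Assumed extension-to-MIME mapping for the audio formats listed above.
AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aac": "audio/aac",
    ".flac": "audio/flac",
    ".ogg": "audio/ogg",
    ".aiff": "audio/aiff",
}


def choose_input_method(size_bytes, prompt_bytes=0, reuse=False):
    """Return "inline" or "file_api" for an audio payload of the given size."""
    # The limit applies to the whole request, so the prompt counts too.
    if reuse or size_bytes + prompt_bytes >= INLINE_LIMIT_BYTES:
        return "file_api"
    return "inline"


def transcribe(path, prompt="Generate a transcript of the speech.",
               model="gemini-2.5-flash"):
    """Send one audio file to Gemini and return the response text."""
    # Imported lazily so the helpers above work without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()  # expects GEMINI_API_KEY in the environment
    mime_type = AUDIO_MIME_TYPES[os.path.splitext(path)[1].lower()]
    prompt_size = len(prompt.encode("utf-8"))
    if choose_input_method(os.path.getsize(path), prompt_size) == "inline":
        # Small file: send the raw bytes in the request itself.
        with open(path, "rb") as f:
            audio = types.Part.from_bytes(data=f.read(), mime_type=mime_type)
    else:
        # Large file: upload once and reference it in the request.
        audio = client.files.upload(file=path)
    response = client.models.generate_content(model=model, contents=[prompt, audio])
    return response.text
```

A call such as `transcribe("meeting.mp3")` (the file name is a placeholder) would then return the transcript text. Passing `reuse=True` to `choose_input_method` forces the File API path, which suits recordings that will be analyzed more than once.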
- Use the File API for larger files (up to 2GB) or when repeated analysis is required; note that files uploaded via this method are subject to a 48-hour auto-delete policy and project quota limits.
- For cost optimization, prioritize the gemini-2.5-flash model for general transcription and summarization, reserving pro tiers for complex reasoning tasks.
- Prompt engineering is essential for segment-specific results: provide clear time ranges in MM:SS format to isolate analysis to specific moments in a recording.
- Ensure the environment is configured with the GEMINI_API_KEY through the .env file in the skill or project directory to allow the client to auto-detect credentials securely.
- Be aware of the 20MB request limit for inline data; leverage multipart requests or the File API for any production-grade processing pipelines.
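The MM:SS and cost tips above can be illustrated with a few small helpers. This is a sketch under stated assumptions: the function names are hypothetical, and the figure of 32 tokens per second of audio comes from Google's Gemini audio documentation rather than from this skill, so verify it against current pricing before relying on it.

```python
import math

# Assumption: Gemini represents each second of audio as 32 tokens.
TOKENS_PER_SECOND = 32


def to_mmss(seconds):
    """Format a non-negative offset in seconds as MM:SS."""
    if seconds < 0:
        raise ValueError("offset must be non-negative")
    minutes, secs = divmod(int(seconds), 60)
    # Minutes are not wrapped at 60, so long recordings give e.g. "62:05".
    return f"{minutes:02d}:{secs:02d}"


def segment_prompt(start_s, end_s, task="Provide a transcript of the speech"):
    """Build a prompt that limits the analysis to one time range."""
    if not 0 <= start_s < end_s:
        raise ValueError("expected 0 <= start_s < end_s")
    return f"{task} from {to_mmss(start_s)} to {to_mmss(end_s)}."


def estimate_audio_tokens(duration_s):
    """Rough input-token estimate for a clip of the given length."""
    return math.ceil(duration_s) * TOKENS_PER_SECOND


print(segment_prompt(150, 209))
# Provide a transcript of the speech from 02:30 to 03:29.
print(estimate_audio_tokens(60))
# 1920
```

The resulting prompt string can be sent alongside the audio in any request, and the token estimate gives a quick way to compare the cost of analyzing a whole recording against a single segment.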
Repository Stats
- Stars: 1
- Forks: 0
- Open Issues: 0
- Language: Handlebars
- Default Branch: main
- Sync Status: Idle
- Last Synced: May 3, 2026, 06:23 PM