gemini-audio

Introduction

This skill provides an interface to the Google Gemini API so that developers and analysts can add audio processing to their workflows. It targets applications that need transcription, summarization, and multi-modal audio understanding. Using models such as gemini-2.5-flash and gemini-2.5-pro, it handles inputs ranging from professional podcasts and meeting recordings to raw environmental audio. The skill wraps the details of the File API, including recordings up to 9.5 hours long, file retention, and token consumption, to keep analysis cost-effective.
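As a rough illustration of the token budgeting mentioned above, the sketch below estimates input tokens for a recording. It assumes the rate of 32 tokens per second of audio given in the Gemini API audio documentation (not stated in this skill's description), and the function name `estimate_audio_tokens` is hypothetical rather than part of the skill.

```python
# Assumption: the Gemini audio documentation states that each second of
# audio is represented as 32 tokens. Check the current docs before relying
# on this figure for billing estimates.
TOKENS_PER_SECOND = 32


def estimate_audio_tokens(duration_seconds: float) -> int:
    """Return an approximate input-token count for an audio clip."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return round(duration_seconds * TOKENS_PER_SECOND)


if __name__ == "__main__":
    # One minute of audio at the assumed rate.
    print(estimate_audio_tokens(60))  # 1920
```

Multiplying the estimate by the per-token price of the chosen model gives a first-order cost figure before any audio is sent.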

Transcribe audio files to text with high accuracy, including support for timestamp generation in MM:SS format and multi-speaker identification.
Summarize complex audio content, extract key action items, and perform semantic analysis on speech, music, or environmental sounds like birdsong or sirens.
Generate high-quality, natural-sounding speech from text input with advanced control over style, pace, tone, and accent using the Gemini TTS native audio models.
Accept multiple industry-standard audio formats, including WAV, MP3, AAC, FLAC, OGG, and AIFF, with automatic downsampling for processing efficiency.
Run integrated helper scripts for common developer tasks such as batch transcription, specific-segment analysis, and audio-to-text workflows.
Choose between input methods: File API uploads for audio larger than 20MB, or inline byte transmission for smaller snippets.
Use the File API for larger files (up to 2GB) or when the same audio will be analyzed repeatedly; files uploaded this way are deleted automatically after 48 hours and count against project quota limits.
For cost optimization, prefer the gemini-2.5-flash model for general transcription and summarization, and reserve gemini-2.5-pro for complex reasoning tasks.
Prompt engineering is essential for segment-specific results: provide clear time ranges in MM:SS format to isolate analysis to specific moments in a recording.
Set GEMINI_API_KEY in a .env file in the skill or project directory so the client can detect credentials automatically.
Inline data is subject to a 20MB request limit; use multipart requests or the File API for production pipelines.
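
The notes above on choosing an input method and on MM:SS segment prompts can be sketched as follows. This is a minimal sketch, not the skill's own helper scripts: the names `choose_input_method`, `segment_prompt`, and `transcribe_segment` are hypothetical, and the API call assumes the `google-genai` Python SDK (`genai.Client`, `client.files.upload`, `types.Part.from_bytes`, `client.models.generate_content`) with GEMINI_API_KEY already present in the environment.

```python
import os
import re

INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # 20MB request limit noted above

# MM:SS with at least two minute digits and seconds in the range 00-59.
_MMSS = re.compile(r"\d{2,}:[0-5]\d")


def choose_input_method(size_bytes: int, reuse: bool = False) -> str:
    """Return "file_api" for large or reused audio, otherwise "inline"."""
    # The limit covers the whole request (prompt included), so audio at or
    # above the limit always goes through the File API.
    if reuse or size_bytes >= INLINE_LIMIT_BYTES:
        return "file_api"
    return "inline"


def segment_prompt(start: str, end: str) -> str:
    """Build a prompt that restricts transcription to a MM:SS time range."""
    for stamp in (start, end):
        if not _MMSS.fullmatch(stamp):
            raise ValueError(f"timestamp {stamp!r} is not in MM:SS format")
    return f"Provide a transcript of the speech from {start} to {end}."


def transcribe_segment(path, start, end,
                       model="gemini-2.5-flash", mime_type="audio/mp3"):
    """Hypothetical end-to-end call; requires google-genai and an API key."""
    # Imported lazily so the helpers above work without the SDK installed.
    from google import genai
    from google.genai import types

    # Assumes GEMINI_API_KEY is already exported; .env loading is not shown.
    client = genai.Client()
    if choose_input_method(os.path.getsize(path)) == "file_api":
        audio = client.files.upload(file=path)
    else:
        with open(path, "rb") as f:
            audio = types.Part.from_bytes(data=f.read(), mime_type=mime_type)
    response = client.models.generate_content(
        model=model, contents=[segment_prompt(start, end), audio]
    )
    return response.text
```

Passing `reuse=True` routes even a small file through the File API, which fits the case where the same recording is queried several times within the 48-hour retention window.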

