ai-multimodal
Process and generate multimedia with Google Gemini. Analyze audio, images, videos, and PDFs using long context windows. Supports transcription, visual QA, OCR, and AI-driven image creation.
Discover reusable agent skills, browse implementation details, and find the right skill for your workflow.
Robot perception system design, configuration, and optimization for cameras, LiDAR, and sensor fusion pipelines. Includes camera calibration, 3D reconstruction, and production deployment best practices.
Implement Google Gemini API vision capabilities for image and document analysis, including captioning, object detection, segmentation, and multi-image comparison.
Extract text from images using the Tesseract OCR engine, supporting multiple languages, image preprocessing, and various formats.
Generate and edit images using the Gemini API via the nanaban CLI. Create illustrations, logos, and icons, or perform photo edits like background removal and style transfer.
Find, review, and remove duplicate or near-duplicate images in FiftyOne datasets using computer vision similarity embeddings.
Generate high-quality visual content, characters, and scenes using structured JSON prompts and automated Python execution for guided image synthesis.
High-performance document intelligence library for extracting text, tables, code, and metadata from 91+ file formats, with OCR and LLM-ready output.
macOS visual automation tool for precise window capture, video recording, UI mockup annotation, Excalidraw wireframing, and automated visual regression testing.
Capture snapshots, video clips, and monitor motion events from RTSP and ONVIF compatible security cameras.
Analyze and identify codebase patterns (naming, architecture, testing) to maintain consistency and enforce standards during development.
Interface to the Google Gemini image generation API for text-to-image generation, image editing, style templates, and automated retry workflows.