mls
Unified local ML inference server for ASR, TTS, Translation, Image Generation, and Vision on Apple Silicon, powered by MLX.
Introduction
MLS (MLX Local Serving) provides high-performance infrastructure for running multiple on-device machine learning models on macOS with Apple Silicon. By keeping all active models resident in GPU memory, it eliminates cold-start latency and exposes a unified HTTP interface for multimodal AI tasks. It is aimed at developers, researchers, and power users who need reliable, private, low-latency inference for local automation or creative workflows without relying on external cloud APIs.
- Multi-modal capabilities: ASR (Qwen3), TTS (Qwen3-VoiceDesign), neural machine translation (TranslateGemma), image generation (Z-Image-Turbo), and vision (jina-vlm).
- Unified HTTP/JSON API for easy integration with tools such as LangChain, the OpenAI SDKs, and local automation wrappers like OpenClaw.
- Real-time dashboard for monitoring GPU utilization, memory usage, inference queues, and live server logs.
- File-based batch processing for long-form text translation and synthesis, with progress polling via API status endpoints.
- Drop-in OpenAI-compatible vision completion endpoint for multimodal chat applications.
- Requires macOS 14+ on Apple Silicon and Python 3.12+ with the uv package manager.
- Serves on http://127.0.0.1:18321 by default (local access only).
- Inputs for ASR and translation tasks should be absolute local file paths so the server can read them directly.
- Supports 70+ translation languages and customizable voice instructions for TTS (VoiceDesign model) to control output characteristics such as tone and accent.
- Model management via the server control API: individual services can be paused, resumed, or restarted without disrupting the rest of the stack.
- Performance tip for image generation: use the 20-step configuration for higher quality at the cost of latency, or keep the 9-step default for real-time needs.
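Since ASR and translation inputs must be absolute local file paths, a small helper can normalize user-supplied paths before building a request body. This is a minimal sketch; the `file` and `model` field names and the `qwen3-asr` model id are illustrative assumptions, not the documented schema.

```python
from pathlib import Path


def asr_payload(audio_path: str, model: str = "qwen3-asr") -> dict:
    """Build an ASR request body with an absolute local file path.

    MLS expects absolute paths; expanding "~" and resolving relative
    segments here avoids file-not-found errors on the server side.
    Field names and the model id are assumptions for illustration.
    """
    path = Path(audio_path).expanduser().resolve()
    return {"model": model, "file": str(path)}
```

Resolving the path on the client keeps the server from having to guess the caller's working directory.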
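Because the vision endpoint is advertised as OpenAI-compatible, a request can follow the standard chat-completions payload shape and be sent to the default local address with only the standard library. The `/v1/chat/completions` path and the `jina-vlm` model name are assumptions based on the OpenAI convention and the model list above.

```python
import json
import urllib.request

MLS_BASE = "http://127.0.0.1:18321"  # default MLS address


def build_vision_request(image_b64: str, prompt: str, model: str = "jina-vlm") -> dict:
    """Build an OpenAI-style chat completion payload with an inline image.

    Uses the data-URL image_url convention from the OpenAI chat API;
    MLS advertises a drop-in compatible endpoint, so the same shape
    should apply. The model name is an assumption.
    """
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }


def post_json(path: str, payload: dict) -> dict:
    """POST a JSON payload to the local MLS server and decode the reply."""
    req = urllib.request.Request(
        MLS_BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # With the server running, something like:
    # reply = post_json("/v1/chat/completions",
    #                   build_vision_request(image_b64, "Describe this image."))
    pass
```

Because the endpoint mimics the OpenAI API, the official OpenAI SDK should also work by pointing its `base_url` at `http://127.0.0.1:18321`.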
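The batch-processing feature implies a submit-then-poll workflow. The sketch below shows a generic polling loop against a status endpoint; the `state` field name and the `completed`/`failed` values are assumptions about the API, not documented behavior.

```python
import json
import time
import urllib.request

TERMINAL_STATES = {"completed", "failed"}  # assumed terminal job states


def job_finished(status: dict) -> bool:
    """Return True when a polled status payload reports a terminal state.

    The 'state' field and its values are assumptions for illustration.
    """
    return status.get("state") in TERMINAL_STATES


def poll_job(status_url: str, interval: float = 2.0, timeout: float = 600.0) -> dict:
    """Poll a job status URL until the job finishes or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with urllib.request.urlopen(status_url) as resp:
            status = json.load(resp)
        if job_finished(status):
            return status
        time.sleep(interval)
    raise TimeoutError(f"job at {status_url} did not finish within {timeout}s")
```

A fixed polling interval keeps things simple; long-form synthesis jobs may warrant a longer interval or exponential backoff.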
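Per-service pause/resume/restart through the control API could be wrapped in a small client helper. The `/v1/services/{name}/{action}` route below is a guess at the shape of the control API, not the documented path.

```python
import urllib.request

MLS_BASE = "http://127.0.0.1:18321"  # default MLS address

ACTIONS = {"pause", "resume", "restart"}  # actions named in the README


def control_url(service: str, action: str) -> str:
    """Build a control-API URL for one service.

    The /v1/services/{name}/{action} route is a hypothetical layout
    used for illustration only.
    """
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return f"{MLS_BASE}/v1/services/{service}/{action}"


def control_service(service: str, action: str) -> None:
    """Send a control command (e.g. pause the TTS service) to the server."""
    req = urllib.request.Request(control_url(service, action), method="POST")
    urllib.request.urlopen(req)
```

Targeting one service at a time matches the README's point that a single model can be cycled without disrupting the whole stack.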
Repository Stats
- Stars: 11
- Forks: 1
- Open Issues: 0
- Language: HTML
- Default Branch: main