
mls

Unified local ML inference server for ASR, TTS, Translation, Image Generation, and Vision on Apple Silicon, powered by MLX.

Introduction

MLS (MLX Local Serving) provides a high-performance infrastructure for running multiple on-device machine learning models on macOS with Apple Silicon. By keeping all active models resident in GPU memory, it eliminates cold-start latency and exposes a unified HTTP interface for multimodal AI tasks. The system is aimed at developers, researchers, and power users who need reliable, private, low-latency inference for local automation or creative workflows without relying on external cloud APIs.
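As a rough sketch of what talking to the unified HTTP/JSON interface could look like, the snippet below builds and sends a translation request using only the standard library. The `/v1/translate` route and the payload field names are illustrative assumptions, not documented endpoints; check the server's API reference for the real schema.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:18321"  # MLS default local address


def build_translation_request(text: str, source: str, target: str) -> dict:
    # Hypothetical payload shape; the actual MLS schema may differ.
    return {"text": text, "source_lang": source, "target_lang": target}


def translate(text: str, source: str = "en", target: str = "de") -> dict:
    payload = build_translation_request(text, source, target)
    req = urllib.request.Request(
        f"{BASE_URL}/v1/translate",  # assumed route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Because the server speaks plain HTTP/JSON, the same request works from curl, LangChain tool wrappers, or any language with an HTTP client.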

  • Multi-modal capabilities including ASR (Qwen3), TTS (Qwen3-VoiceDesign), Neural Machine Translation (TranslateGemma), Image Generation (Z-Image-Turbo), and Vision (jina-vlm).

  • Unified API architecture utilizing HTTP/JSON for easy integration with tools like LangChain, OpenAI SDKs, and local automation wrappers like OpenClaw.

  • Real-time dashboard for monitoring GPU utilization, memory usage, inference queues, and live server logs.

  • File-based batch processing for long-form text translation and synthesis, with robust support for polling progress via API status endpoints.

  • Drop-in OpenAI-compatible vision completion endpoint for multimodal chat applications.

  • Requires macOS 14+ on Apple Silicon and Python 3.12+ with the uv package manager.

  • Operates by default on http://127.0.0.1:18321 for local access.

  • Inputs for ASR and translation tasks should be provided as absolute local file paths to ensure correct system access.

  • The system supports 70+ languages for translation and offers customizable voice instruction for TTS (VoiceDesign model) to control output characteristics such as tone and accent.

  • Model management is handled through the server control API, allowing individual service pause, resume, or restart cycles without disrupting the entire stack.

  • Performance tuning: use the 20-step image-generation configuration for higher quality at the cost of latency, or keep the 9-step default for near-real-time use.
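The file-based batch workflow above (submit a job, then poll the status endpoint) can be sketched as a small polling loop. The status route, job-ID field, and state names here are assumptions for illustration; the fetcher is passed in as a callable so the loop can be exercised without a running server.

```python
import time
from typing import Callable


def poll_job(job_id: str, fetch_status: Callable[[str], dict],
             interval: float = 1.0, timeout: float = 600.0) -> dict:
    """Poll a batch job until it reaches a terminal state.

    The "completed"/"failed" state names are assumed, not taken from
    the MLS API; fetch_status would wrap e.g. GET /v1/jobs/{job_id}.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(job_id)
        if status.get("state") in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```

Injecting the fetcher also makes retry and backoff policy easy to change without touching transport code.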

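Because the vision endpoint is OpenAI-compatible, clients can reuse the standard chat-completions payload shape. The sketch below builds such a payload with an inline base64 image; the model name is taken from the feature list above, and the `/v1/chat/completions` path is inferred from OpenAI compatibility rather than confirmed documentation.

```python
import base64


def build_vision_payload(image_bytes: bytes, prompt: str,
                         model: str = "jina-vlm") -> dict:
    """Build an OpenAI-style chat-completions payload with an inline image.

    The data-URL content format follows the OpenAI chat API; the model
    name here may differ from what the server actually registers.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

# The payload would be POSTed to http://127.0.0.1:18321/v1/chat/completions
# (path assumed from OpenAI compatibility), so existing OpenAI SDK clients
# can point their base_url at the local server instead.
```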
Repository Stats

Stars: 11
Forks: 1
Open Issues: 0
Language: HTML
Default Branch: main