
# Rapid-MLX

Rapid-MLX is an OpenAI-compatible inference server optimized for Apple Silicon (MLX). It is 2-4x faster than Ollama and supports full tool calling, reasoning separation, and prompt caching.

| Property | Details |
|----------|---------|
| Description | Local LLM inference server for Apple Silicon. Docs |
| Provider Route on LiteLLM | `rapid_mlx/` |
| Provider Doc | Rapid-MLX ↗ |
| Supported Endpoints | `/chat/completions` |

## Quick Start

### Install and start Rapid-MLX

```shell
brew tap raullenchai/rapid-mlx
brew install rapid-mlx
rapid-mlx serve qwen3.5-9b
```

Or install via pip:

```shell
pip install vllm-mlx
rapid-mlx serve qwen3.5-9b
```

## Usage - `litellm.completion` (calling OpenAI compatible endpoint)

```python
import litellm

response = litellm.completion(
    model="rapid_mlx/default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
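Streaming works through the same call by passing `stream=True`. A minimal sketch, assuming the server from the Quick Start is running locally (`litellm` is imported inside the function so the snippet loads even without a server available):

```python
def stream_chat(prompt: str, model: str = "rapid_mlx/default") -> str:
    """Stream a completion from a local Rapid-MLX server.

    Requires `rapid-mlx serve` (see Quick Start) to be running.
    """
    import litellm  # imported lazily so the sketch can be defined offline

    pieces = []
    for chunk in litellm.completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    ):
        # Each streamed chunk carries an OpenAI-style delta; content may be None.
        text = chunk.choices[0].delta.content or ""
        print(text, end="", flush=True)
        pieces.append(text)
    return "".join(pieces)

# Example (needs a running server):
# stream_chat("Hello!")
```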

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `RAPID_MLX_API_KEY` | API key (optional; Rapid-MLX does not require auth by default) | `not-needed` |
| `RAPID_MLX_API_BASE` | Server URL | `http://localhost:8000/v1` |
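If the server is not running on the default port, point LiteLLM at it via the environment variables above. A short sketch (port 9000 is a hypothetical example):

```python
import os

# Point LiteLLM at a Rapid-MLX server on a non-default port
# (port 9000 here is hypothetical, chosen for illustration).
os.environ["RAPID_MLX_API_BASE"] = "http://localhost:9000/v1"

# Rapid-MLX does not require auth by default, so a placeholder key suffices.
os.environ["RAPID_MLX_API_KEY"] = "not-needed"
```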

## Supported Models

Any MLX model served by Rapid-MLX works; use the model name as loaded by the server. Common choices:

- `rapid_mlx/default` - whatever model is currently loaded
- `rapid_mlx/qwen3.5-9b` - best small model for general use
- `rapid_mlx/qwen3.5-35b` - smart and fast
- `rapid_mlx/qwen3.5-122b` - frontier-level MoE model

## Features

- **Streaming** - full SSE streaming support
- **Tool calling** - 17 parser formats (Qwen, Hermes, MiniMax, GLM, etc.)
- **Reasoning separation** - native support for thinking models (Qwen3, DeepSeek-R1)
- **Prompt caching** - KV cache reuse and DeltaNet state snapshots for fast TTFT
- **Multi-token prediction** - speculative decoding for supported models
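Tool calling uses the standard OpenAI `tools` schema, which LiteLLM passes through to the server. A sketch of a request payload, assuming that format (the `get_weather` function is hypothetical, for illustration only):

```python
# Hypothetical tool definition in the standard OpenAI tools schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

# Requires a running Rapid-MLX server; uncomment to send the request:
# import litellm
# response = litellm.completion(
#     model="rapid_mlx/default", messages=messages, tools=tools
# )
# print(response.choices[0].message.tool_calls)
```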