Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B
TL;DR Highlight
We open-sourced a real-time multimodal AI speech and video conversation system that runs entirely locally on an Apple Silicon M3 Pro with no internet connection. It has drawn attention for handling speech recognition, video understanding, and TTS simultaneously with zero cloud costs.
Who Should Read
Developers who want to build their own on-device AI voice assistants, or backend/ML developers who want to build a multimodal pipeline locally without cloud AI API costs.
Core Mechanics
- This project (Parlor) is a real-time multimodal conversation system that takes microphone and camera input and responds with voice, with all processing done only on the user's device.
- It uses Google's recently released Gemma 3n E2B model for language understanding and video recognition, with the LiteRT-LM runtime providing GPU-accelerated inference.
- Kokoro is used for TTS (text-to-speech). It operates with the MLX backend on Mac and the ONNX backend on Linux.
- The structure streams microphone and camera data from the browser to a FastAPI server via WebSocket, and the server returns the processed audio to the browser via WebSocket.
- The browser runs Silero VAD (Voice Activity Detection model) to enable hands-free conversation without push-to-talk, and also supports barge-in functionality to interrupt the AI while it's speaking.
- TTS is streamed sentence by sentence, so audio playback starts before the entire response is generated, reducing perceived latency.
- The motivation for development was to eliminate server costs and make an English-learning service sustainable. Six months ago an RTX 5090 was required for real-time processing, but it now runs on an M3 Pro.
- The developer particularly emphasized Gemma 3n E2B's multilingual support, which lets users freely mix their native language and the language they are learning in a single conversation, which is especially useful for language learning.
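The sentence-by-sentence TTS streaming described above can be sketched in a few lines. This is an illustrative assumption, not Parlor's actual code: `synthesize` is a stand-in for a Kokoro call, and in the real server each chunk would be pushed to the browser over the WebSocket rather than collected into a list.

```python
import re
from typing import Iterator

def split_sentences(text: str) -> list[str]:
    # Naive splitter: break on ., !, or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def synthesize(sentence: str) -> bytes:
    # Stand-in for a Kokoro TTS call; a real call would return PCM audio.
    return sentence.encode("utf-8")

def stream_tts(reply: str) -> Iterator[bytes]:
    # Yield audio per sentence so playback can start before the full
    # response is generated. In a FastAPI WebSocket handler each chunk
    # would be sent with `await ws.send_bytes(chunk)`; barge-in amounts
    # to abandoning this generator when the VAD detects new user speech.
    for sentence in split_sentences(reply):
        yield synthesize(sentence)

chunks = list(stream_tts("Hello there. How can I help?"))
```

Because the generator yields as soon as the first sentence is ready, perceived latency is bounded by one sentence of LLM output plus one TTS call, not by the full response.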
Evidence
- During offline testing, a bug was found where the page would hang at "loading..." when localhost was first opened with the internet disconnected. One user reported it works normally if the page is loaded once while online and the connection is then cut, and was impressed by the fast performance, including video input, on an M4 Pro with 48GB.
- A developer building a similar project shared that Gemma 3n E2B is still too heavy despite being E2B, and that they use a Qwen 0.8B model instead, showing that the trade-off between model size and real-time responsiveness is a real barrier.
- Multiple commenters agreed that Kokoro TTS latency is very low; one remarked, "Apple should have used this in Siri," criticizing Apple for falling behind.
- One comment noted that only the text portion of Gemma 3n E2B can be fine-tuned, sharing the experience of fine-tuning it into an "AI that talks like a pirate" along with a related video link, and pointing out that fine-tuning does not reach the TTS stage.
- Some users reported that Gemma E2B speech recognition does not reach real time on hardware such as an M1 Max (64GB), RTX 5060 Ti (16GB), and Snapdragon 8 Gen 2, and asked for solutions, suggesting that performance is not guaranteed outside M3 Pro-class environments.
How to Apply
- If you need a hands-free workshop assistant or a voice AI for long drives, you can launch Parlor as a local server and open a browser to use a voice assistant for timers, calculations, memo search, and so on, without the internet and without push-to-talk.
- If you operate an English-learning service or a multilingual conversation app and cloud API costs are a burden, you can eliminate server costs by adopting Parlor's structure (FastAPI + WebSocket + Gemma 3n E2B + Kokoro) as an on-device pipeline.
- If you want to adapt Gemma 3n E2B's response style to a specific domain, you can apply text fine-tuning to teach it the desired tone or response pattern. Keep in mind that fine-tuning applies only to the text-generation stage, not to TTS.
- If Gemma 3n E2B feels too heavy for a low-spec environment (M1 or lower, a GPU with 16GB or less, etc.), as noted in the comments, try swapping in a smaller model such as Qwen 0.8B first and measure real-time latency to confirm the trade-off.
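When comparing models as suggested above, one concrete number to track is time-to-first-audio-chunk, since sentence-streamed TTS makes that the latency the user actually perceives. A minimal harness might look like this; `fake_pipeline` is a placeholder you would replace with the real LLM + TTS generator for each candidate model.

```python
import time
from typing import Any, Callable, Iterable, Tuple

def time_to_first_chunk(pipeline: Callable[[], Iterable[Any]]) -> Tuple[float, Any]:
    # Measure how long the pipeline takes to produce its first audio
    # chunk -- the delay the listener actually perceives.
    start = time.perf_counter()
    first = next(iter(pipeline()))
    return time.perf_counter() - start, first

def fake_pipeline():
    # Placeholder; swap in the real model + TTS generator here to
    # compare, e.g., Gemma 3n E2B against a smaller model.
    yield b"first-audio-chunk"
    yield b"second-audio-chunk"

latency, first = time_to_first_chunk(fake_pipeline)
```

Run it once per candidate model under identical prompts and hardware; the model with the lower time-to-first-chunk wins on perceived responsiveness even if its total generation time is similar.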
Code Example
snippet
# Architecture flow (excerpt from README)
Browser (mic + camera)
│
│ WebSocket (audio PCM + JPEG frames)
▼
FastAPI server
├── Gemma 3n E2B via LiteRT-LM (GPU) → understands speech + vision
└── Kokoro TTS (MLX on Mac, ONNX on Linux) → speaks back
│
│ WebSocket (streamed audio chunks)
▼
Browser (playback + transcript)
# Installation and execution (based on README)
git clone https://github.com/fikrikarim/parlor
cd parlor
cp .env.example .env
# Modify necessary settings in .env
pip install -r requirements.txt
uvicorn src.main:app --reload
Terminology
LiteRT-LM: An on-device LLM inference runtime created by Google, an execution engine optimized for running models quickly on mobile and edge devices.
Kokoro: An open-source TTS (text-to-speech) model with very low latency, often used in real-time conversation systems.
VAD (Voice Activity Detection): Technology that automatically detects when a person is speaking in microphone input, enabling hands-free conversation without a push-to-talk button.
barge-in: A feature that lets the user interrupt the AI while it is speaking; necessary for a natural conversation flow.
MLX: Apple's machine learning framework optimized for Apple Silicon; it performs GPU inference efficiently by exploiting the unified memory architecture of M-series chips.
E2B (Effective 2 Billion): A lightweight variant of the Gemma 3n series whose memory footprint is comparable to a 2-billion-parameter model, designed to enable real-time inference on consumer hardware.
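To make the VAD entry above concrete, here is a toy energy-threshold detector. Silero VAD is a trained neural model and far more robust; this sketch only illustrates the shape of the task (16-bit PCM frame in, speech/no-speech out), and the threshold value is arbitrary.

```python
import array
import math

def is_speech(pcm_bytes: bytes, threshold: float = 500.0) -> bool:
    # Toy VAD: compute the RMS energy of 16-bit mono PCM samples and
    # flag the frame as speech if it exceeds a fixed threshold.
    # Real systems (e.g. Silero) use a trained model instead.
    samples = array.array("h", pcm_bytes)
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > threshold

silence = bytes(2 * 160)                         # 160 zero samples
tone = array.array("h", [4000] * 160).tobytes()  # constant loud frame
```

A fixed energy threshold breaks down with background noise, which is why neural VADs like Silero are used in practice.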