Introducing Juno: An Open Voice Layer for Mac
Hey Juno!
Juno is a local-first, open-source voice layer for Mac — live voice input, screen context, rewrites, formatting, and simple actions.
Cassini Research is releasing Juno today as CR-01. It is a Mac-native voice layer that turns natural speech into polished text and simple actions, directly inside the apps you already use.
The keyboard remains the main interface for knowledge work because it is precise, not because it is natural. Intent usually forms faster than we can type. Long prompts, product notes, specs, bug reports, and emails often start as thoughts that would be easier to say than type. The problem is that voice on computers has historically felt like mere transcription, not an actual input layer.
We built Juno because the next input layer should feel live, private, editable, and usable all day.
What Changed
Voice input is no longer a niche accessibility feature; it is a serious productivity surface. Products like Wispr Flow, Superwhisper, Aqua, and MacWhisper moved the category forward and proved that users want natural speech translated into polished text.
They also clarified what we wanted to build. We wanted the full loop in one tool: live transcripts, strong final formatting, selected-text edits, and screen context, all powered by local models. We wanted a tool with no accounts, no usage meters, and open source code. If voice is going to become the primary way you write prompts and notes every day, an arbitrary cloud usage limit becomes a bottleneck.
For this category, open source is not decoration. It is the architecture.
What Juno Does
Juno sits where you already write. You trigger it, speak naturally, and watch the transcript appear live in the HUD. When you stop, Juno cleans up the final text, formats it, and inserts it into your active app. If direct insertion isn’t available, Juno keeps the result copy-ready.
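The insert-or-copy behavior can be pictured as a small fallback wrapper. This is a minimal sketch, not Juno's code: `deliver`, `insert`, and `copy` are hypothetical names, and the two callables stand in for whatever native insertion and clipboard calls the Mac shell really uses.

```python
from typing import Callable


def deliver(text: str,
            insert: Callable[[str], bool],
            copy: Callable[[str], None]) -> str:
    """Try direct insertion first; if it fails, keep the text copy-ready.

    `insert` is assumed to return True on success. Any failure, including
    an exception (for example when secure input is active), triggers the
    copy fallback so the text is never lost.
    """
    try:
        if insert(text):
            return "inserted"
    except Exception:
        # Sketch only: a real implementation would log the specific error.
        pass
    copy(text)
    return "copied"


# Usage with stand-in callables: insertion is unavailable, so we copy.
clipboard = []
result = deliver("Send the updated timeline by Friday morning.",
                 insert=lambda t: False,
                 copy=clipboard.append)
```

The point of the shape is that every path ends with the text somewhere the user can reach it.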
It handles natural, messy speech seamlessly. An utterance can sound like this:
“Send the updated timeline by Friday, actually Friday morning.”
Juno catches the correction and writes the sentence you actually meant.
You can also select text and command it:
“Make this shorter and more direct.”
Juno rewrites the selected passage in place, without forcing you into a separate editor window. It also handles simple, create-only actions for native Mac apps:
“Note that the design review moved to Thursday. Remind me tomorrow at 9 to send the agenda. Set an alarm for 6:30.”
From one command, Juno will create an Apple Note, a Reminder, and an alarm.
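One way to picture "create-only" is as a validated action plan. The sketch below is an assumption for illustration: the `Action` type, the `op`/`kind`/`when` fields, and the hand-written plan are hypothetical, not Juno's real schema. It shows how a planner's output could be checked so that anything other than creating a note, reminder, or alarm is rejected.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

ALLOWED_KINDS = {"note", "reminder", "alarm"}


@dataclass(frozen=True)
class Action:
    kind: str                   # "note", "reminder", or "alarm"
    text: str                   # body of the note or reminder
    when: Optional[str] = None  # kept as a plain string; time parsing is out of scope


def validate_plan(plan: List[Dict[str, str]]) -> List[Action]:
    """Accept only create operations on known kinds; raise on anything else."""
    actions = []
    for item in plan:
        if item.get("op") != "create":
            raise ValueError(f"only create actions are allowed, got {item.get('op')!r}")
        kind = item.get("kind")
        if kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown action kind: {kind!r}")
        actions.append(Action(kind=kind, text=item.get("text", ""), when=item.get("when")))
    return actions


# A plan a planning model might emit for the utterance above (hand-written here).
example_plan = [
    {"op": "create", "kind": "note", "text": "Design review moved to Thursday"},
    {"op": "create", "kind": "reminder", "text": "Send the agenda", "when": "tomorrow 09:00"},
    {"op": "create", "kind": "alarm", "text": "", "when": "06:30"},
]
actions = validate_plan(example_plan)
```

Validating before executing keeps the blast radius small: a misheard command can create an extra note, but it cannot delete or modify anything.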
How We Built It
Juno feels fast because it is built like a real-time system, not a basic speech-to-text wrapper. It consists of a native Mac shell and a local runtime.
- The Mac Shell: Owns the product surface (shortcuts, HUD, permissions, active-app detection, window state, insertion, and copy fallback).
- The Local Runtime: Owns the voice pipeline (audio, live preview, final transcription, formatting, actions, dictionary, memory, and screen context).
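The core path is straightforward: capture audio, show a live preview in the HUD, run the final transcription, clean up and format the text, then insert it into the active app or fall back to copy.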
For speech recognition, Juno uses mlx-community/whisper-large-v3-turbo on Apple Silicon. For writing, formatting, and action planning, we use mlx-community/Qwen3-4B-Instruct-2507-4bit through MLX LM. A smaller Qwen3-0.6B-4bit model handles lighter correction work to minimize latency.
To keep latency low, we cache the KV state for the static LLM prompt prefix, so repeated writing and planning passes do not pay the same cold prefill cost each time. Under the hood, Juno is a stateful, latency-aware voice runtime.
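The idea behind prefix caching can be shown with a toy model. This sketch deliberately avoids real MLX calls: `PrefixCache`, `toy_prefill`, and `run` are hypothetical names, and the "state" is just a tuple of tokens standing in for per-layer key/value tensors. What it does show accurately is the accounting: the static prefix is processed once, and each later request only pays for its new tokens.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


class PrefixCache:
    """Memoizes the expensive prefill of a static prompt prefix."""

    def __init__(self, prefill: Callable[[str], Tuple[str, ...]]):
        self._prefill = prefill
        self._states: Dict[str, Tuple[str, ...]] = {}
        self.prefill_calls = 0  # counts cold prefills, for illustration

    def state_for(self, prefix: str) -> Tuple[str, ...]:
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        if key not in self._states:
            self.prefill_calls += 1
            self._states[key] = self._prefill(prefix)
        return self._states[key]


def toy_prefill(prefix: str) -> Tuple[str, ...]:
    # Stand-in for a forward pass over the prefix tokens.
    return tuple(prefix.split())


def run(cache: PrefixCache, prefix: str, user_text: str) -> List[str]:
    # Copy the cached state so the stored prefix is never mutated,
    # then extend it with only the new tokens.
    state = list(cache.state_for(prefix))
    state.extend(user_text.split())
    return state


SYSTEM_PREFIX = "You clean up dictated text. Keep the meaning. Fix corrections."
cache = PrefixCache(toy_prefill)
first = run(cache, SYSTEM_PREFIX, "send the timeline by friday")
second = run(cache, SYSTEM_PREFIX, "make this shorter")
```

After both calls, `cache.prefill_calls` is 1: the second request reused the stored prefix state. With a real model the same rule applies, with the added care that the cached state must be copied or trimmed back to the prefix before each reuse.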
The Engineering Challenges
Building a voice layer that feels instant and native means handling several edge cases that are easy to overlook:
- Dual-Lane Live Transcription: Whisper works over audio windows. If you commit text too early, the HUD flashes incorrect words. Juno runs speech through two lanes: a low-latency preview lane for the HUD, and a heavier final lane for accuracy, cleanup, writing, and actions. The HUD doesn't blindly stream whatever the model guesses; it uses strict agreement logic before committing words, keeping unstable tail text provisional.
- Silence & Hallucinations: Speech models notoriously hallucinate on quiet audio. Juno uses Voice Activity Detection (VAD) and strict end-window defenses to ensure that silence and partial audio do not turn into fake words.
- Mid-Utterance Corrections: People pause, restart, and change dates halfway through a sentence. Juno treats these as normal speech patterns, not errors.
- Native Insertion: macOS apps handle insertion differently depending on focus, permissions, or secure input. Juno uses native insertion when possible and gracefully defaults to a copy-ready fallback when it cannot. The user never loses their text.
- Local Vocabulary & Context: Generic models struggle with internal acronyms, teammate names, and specific product terms. Juno utilizes a local dictionary and memory so it gets better at your specific workflow without sending your data to the cloud. It also utilizes bounded screen context, allowing you to say “make this tighter” without needing to read the whole screen aloud.
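The agreement logic described in the first item can be sketched with a simple policy: commit a word only once consecutive preview hypotheses agree on it. This is one common approach, shown under stated assumptions rather than as Juno's exact rule. `AgreementBuffer` and `required` are hypothetical names, and each hypothesis is assumed to cover the whole utterance so far rather than a sliding window.

```python
from typing import List, Tuple


def common_prefix(seqs: List[List[str]]) -> List[str]:
    """Longest run of leading words shared by every sequence."""
    prefix = []
    for column in zip(*seqs):
        if any(word != column[0] for word in column):
            break
        prefix.append(column[0])
    return prefix


class AgreementBuffer:
    """Commits words only after `required` consecutive hypotheses agree."""

    def __init__(self, required: int = 2):
        self.required = required
        self.committed: List[str] = []
        self._history: List[List[str]] = []

    def update(self, hypothesis: str) -> Tuple[List[str], List[str]]:
        words = hypothesis.split()
        self._history = (self._history + [words])[-self.required:]
        if len(self._history) == self.required:
            agreed = common_prefix(self._history)
            # Committed text never changes; it can only be extended.
            n = len(self.committed)
            if len(agreed) > n and agreed[:n] == self.committed:
                self.committed = agreed
        # Everything past the committed words stays provisional.
        tail = words[len(self.committed):]
        return list(self.committed), tail


buf = AgreementBuffer(required=2)
step1 = buf.update("send the")
step2 = buf.update("send the updated time")
step3 = buf.update("send the updated timeline by")
```

After the third update, "send the updated" is committed while "timeline by" remains provisional, so the HUD never has to retract the unstable guess "time". A stricter `required` value trades a little latency for fewer visible corrections.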
The Local Bet
A daily voice layer requires a strict privacy posture. People dictate private names, financial numbers, unreleased product ideas, and whatever is currently on screen. You shouldn't have to trust a remote server with that data.
Juno runs locally on Apple Silicon. Audio, transcripts, history, dictionary, and memory never leave your machine.
Local models can now handle this exact loop: transcribing accurately, preserving intent, fixing structure, and using local context. The remaining work is making that loop fast and native.
Juno is open source, local-first, and free forever.
Stop Typing, Start Speaking
Speak the messy version. Correct yourself mid-sentence. Select text and say the edit. We are excited to see how it fits into your daily Mac workflow.