projects·project·shipped

Wispr Flow

Thesis

Local-first voice-to-text for macOS. Capture speech, transcribe with Whisper, optionally route through a local LLM for rewriting — all on-device. Cloud voice transcription loses on two dimensions Zaid refuses to compromise on: latency (API round-trip kills flow state) and privacy (sensitive utterances leaving the machine is a non-starter).

Status

Shipped. Repo pinned, 3 stars, single-push build from 2026-03-24. Portfolio lists it as Live.

Stack

  • Transcription: Whisper via Apple MLX (hardware-accelerated inference on Apple Silicon).
  • LLM rewrite: Dolphin3 via Ollama (local).
  • Audio capture: system-level microphone with voice activity detection (VAD) for segmentation.
  • Threading: dedicated threads for capture / transcription / rewriting; Python asyncio rejected for real-time audio work.
  • Config: Whisper model tier (tiny.en through large-v3), VAD energy threshold, push-to-talk key.
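
The three tunables above can be sketched as a small config object. This is an illustrative shape only, not the project's actual schema; field names and defaults are assumptions.

```python
from dataclasses import dataclass

# Hypothetical config for the three tunables the notes mention.
# Names and defaults are illustrative, not wispr-flow's real schema.
@dataclass
class FlowConfig:
    whisper_model: str = "large-v3"      # any Whisper tier, tiny.en through large-v3
    vad_energy_threshold: float = 0.02   # RMS energy above which a frame counts as speech
    push_to_talk_key: str = "fn"         # key held down to capture audio

# Smaller model trades accuracy for latency on weaker hardware.
fast_config = FlowConfig(whisper_model="tiny.en")
```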

Key decisions

  • MLX over PyTorch: 3-4x faster inference on Apple Silicon because MLX avoids the CPU→GPU memory copy that slows PyTorch on unified-memory Macs. Tradeoff is a smaller ecosystem, acceptable for a single inference task.
  • Ollama over custom model serving: clean API, model management free, minimal abstraction tax.
  • Multi-threaded over async: audio capture needs real-time guarantees Python's async loop doesn't provide. Dedicated threads with explicit sync primitives give predictable latency.
  • delta, ... prefix as magic word: raw Whisper output is pasted instantly by default. Prepending delta to an utterance routes it through the LLM for cleanup. Keeps the common case fast and the LLM available when you want it — no mode toggle, no settings screen.
  • INT4 quantization for the rewrite LLM: required to hit the 500ms budget on consumer hardware.
  • Streaming transcription: process audio chunks as they arrive; don't wait for silence.
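
The threading and magic-word decisions above can be sketched as a toy pipeline: dedicated threads handed work over queues, with the `delta` prefix deciding whether an utterance takes the LLM hop. Whisper, Ollama, and keystroke injection are stubbed out; every name here is illustrative, not the project's actual code.

```python
import queue
import threading

# capture -> transcription -> routing, each stage on its own thread,
# joined by explicit queues (the "explicit sync primitives" in the notes).
audio_q = queue.Queue()   # raw audio chunks in
text_q = queue.Queue()    # transcribed text out

MAGIC_PREFIX = "delta"

def fake_transcribe(chunk):                 # stand-in for Whisper/MLX
    return chunk

def fake_rewrite(text):                     # stand-in for the Ollama rewrite
    return text.strip().capitalize()

outputs = []
def paste(text):                            # stand-in for keystroke injection
    outputs.append(text)

def transcriber():
    while True:
        chunk = audio_q.get()
        if chunk is None:                   # sentinel: shut down cleanly
            text_q.put(None)
            break
        text_q.put(fake_transcribe(chunk))

def router():
    while True:
        text = text_q.get()
        if text is None:
            break
        if text.lower().startswith(MAGIC_PREFIX):
            # magic word present: strip it and route through the LLM
            paste(fake_rewrite(text[len(MAGIC_PREFIX):].lstrip(", ")))
        else:
            paste(text)  # common case: raw Whisper output, no LLM hop

threads = [threading.Thread(target=t) for t in (transcriber, router)]
for t in threads:
    t.start()
audio_q.put("delta, fix this sentence")
audio_q.put("plain dictation goes straight through")
audio_q.put(None)
for t in threads:
    t.join()
print(outputs)  # → ['Fix this sentence', 'plain dictation goes straight through']
```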

Learnings

  • Latency budget matters more than absolute accuracy for dictation flow. Sub-300ms end-to-end feels instant; 800ms feels broken.
  • Magic-word UX (delta, ...) beats a mode selector for frictionless dual-path tools.
  • For Apple Silicon inference, MLX is the right default for narrow workloads even though PyTorch has the broader ecosystem.

Outcomes

  • Sub-300ms end-to-end latency on M1 MacBook (portfolio).
  • Fully offline, fully private.
  • Used daily by Zaid (memory confirms: wispr-flow is also the backend for his dictation tooling in /Users/zaid/dictation).

Open questions

  • How often does the delta rewrite actually get used vs raw paste? Usage data would validate the dual-path UX choice.
  • What's the fallback plan if Ollama isn't running? Currently only the delta path breaks; raw paste still works, but a clearer error would help.
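
One cheap answer to the Ollama question: probe its default local endpoint (port 11434) before attempting the delta path, and fall back to raw paste when it's down. A minimal sketch; `route` and `rewrite_with_llm` are hypothetical names, not the project's actual functions.

```python
import urllib.error
import urllib.request

def ollama_available(timeout: float = 0.5) -> bool:
    # Ollama's server answers a bare GET on its default port when running;
    # connection refused means the delta path is unavailable.
    try:
        with urllib.request.urlopen("http://localhost:11434", timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False

def rewrite_with_llm(text: str) -> str:
    # Placeholder for the real Ollama call; returns input unchanged here.
    return text

def route(text: str) -> str:
    # Only attempt the LLM hop when Ollama responds; otherwise paste raw
    # so dictation never stalls, and the failure mode is visible up front.
    if text.startswith("delta") and ollama_available():
        return rewrite_with_llm(text)
    return text
```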

Links