DVT Log: Building a Zero-Cloud Privacy Vault with Local Computing
During the Design Validation Test (DVT) phase, our goal was to prove that the C1 could deliver accurate, fast transcripts without sending a single byte of audio to the cloud. Most AI recorders use Wi-Fi to offload processing to cloud services such as AWS. We took the opposite path: we physically desoldered the Wi-Fi/Bluetooth transceiver module from our DVT test boards. The C1 functions as a pure USB audio device, streaming raw 16-bit PCM data to your computer, where all automatic speech recognition (ASR) runs locally.
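To make "raw 16-bit PCM" concrete, here is a minimal sketch of how a desktop client might turn a chunk of bytes from the USB audio stream into normalized samples for a local ASR engine. The function name `pcm16_to_floats` is hypothetical, and the little-endian, signed, mono layout is an assumption (the usual layout for USB audio), not something taken from the C1 firmware.

```python
import struct


def pcm16_to_floats(raw: bytes) -> list[float]:
    """Convert signed 16-bit PCM bytes to floats in [-1.0, 1.0).

    Assumes little-endian, single-channel samples (an assumption,
    not confirmed by the device documentation).
    """
    usable = len(raw) - (len(raw) % 2)  # drop a trailing partial sample, if any
    samples = struct.unpack(f"<{usable // 2}h", raw[:usable])
    return [s / 32768.0 for s in samples]


# Three samples: zero, the maximum positive value, the minimum negative value.
chunk = b"\x00\x00\xff\x7f\x00\x80"
print(pcm16_to_floats(chunk))
```

Nothing in this path touches the network: the bytes arrive over USB and are decoded in process.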
Tuning local VAD (Voice Activity Detection)
Continuous recording on a developer's desk is a data-dump nightmare. It captures keyboard clicking, breathing, and hours of silence. To prevent this noise from bloating our Markdown files, we implemented local VAD filtering in our desktop client. We hit an immediate problem: the standard WebRTC VAD was too aggressive. When developers spoke softly at the whiteboard or mumbled code terms like "git push," the VAD cut off the ends of sentences.
We resolved this by adjusting two main parameters in our VAD daemon:
- Frequency Bandpass: We restricted the VAD frequency gate to 300 Hz to 3400 Hz (the classic telephony voice band, where most speech energy sits), which ignores the high-frequency click of mechanical switches and the low-frequency hum of PC fans.
- Tail Buffer Extension: We increased the silence hangover buffer from 500 ms to 2.0 seconds. The buffer waits for a full 2 seconds of silence before closing the file segment, ensuring that natural pauses during architectural problem-solving are not sliced in half.
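The two adjustments above can be sketched in a few lines. This is an illustration of the general technique, not our daemon's code and not the WebRTC VAD API: the function names, the 16 kHz sample rate, and the 20 ms frame length are assumptions, and the naive DFT is written for readability rather than speed. Only the 300 Hz to 3400 Hz band and the 500 ms and 2.0 s hangover values come from the text.

```python
import cmath
import math

SAMPLE_RATE = 16_000        # Hz (assumed capture rate)
FRAME_MS = 20               # frame length in ms (assumed)
HANGOVER_MS = 2_000         # silence required before a segment closes
BAND_HZ = (300.0, 3400.0)   # frequency gate


def band_energy_ratio(frame, sample_rate=SAMPLE_RATE, band=BAND_HZ):
    """Return the fraction of a frame's spectral energy inside `band`."""
    n = len(frame)
    total = in_band = 0.0
    for k in range(1, n // 2):  # skip DC, stay below Nyquist
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        power = abs(coeff) ** 2
        total += power
        if band[0] <= k * sample_rate / n <= band[1]:
            in_band += power
    return in_band / total if total else 0.0


def segment_with_hangover(speech_flags, frame_ms=FRAME_MS, hangover_ms=HANGOVER_MS):
    """Group per-frame speech flags into (start, end) frame ranges.

    A segment closes only after `hangover_ms` of continuous non-speech.
    `end` is exclusive and points just past the last speech frame, so the
    trailing silence is trimmed (a choice made for this sketch).
    """
    hangover_frames = hangover_ms // frame_ms
    segments = []
    start = last_speech = None
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech >= hangover_frames:
            segments.append((start, last_speech + 1))
            start = None
    if start is not None:  # stream ended while a segment was still open
        segments.append((start, last_speech + 1))
    return segments


# Two bursts of speech separated by a 1-second thinking pause (50 frames).
flags = [True] * 10 + [False] * 50 + [True] * 10 + [False] * 130
print(segment_with_hangover(flags, hangover_ms=500))    # pause splits the sentence
print(segment_with_hangover(flags, hangover_ms=2_000))  # pause stays inside one segment
```

With the 500 ms hangover the one-second pause produces two segments; with the 2.0 s hangover both bursts land in a single segment. A frame would typically be flagged as speech only when `band_energy_ratio` and overall loudness both exceed thresholds, which are tuning choices not shown here.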
Benchmarking Local ASR Engines
Local transcription has to be fast and power-efficient. We compiled the local ASR models with ONNX Runtime, using Metal acceleration on macOS and DirectML on Windows. Here is the benchmark data from our DVT testing on a MacBook Pro M2 (16 GB unified memory):
| Model Variant | RAM Footprint | 1-Hour Audio Process Time | WER (Word Error Rate) on Code Namespaces |
|---|---|---|---|
| Lightweight Model (39M) | 210 MB | 38 seconds | 8.4% (errors on camelCase and class names) |
| Standard Model (74M) | 420 MB | 84 seconds | 3.2% (handles code syntax accurately) |
| Heavyweight Model (244M) | 1.2 GB | 252 seconds (4.2 minutes) | 2.9% (high battery drain, too slow) |
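To put the process-time column in perspective, the figures can be converted to a speed factor relative to real time (audio duration divided by processing time). The short script below only does that arithmetic on the numbers from the table; the dictionary layout and the `speed_factor` name are illustrative, and 1.2 GB is rounded to 1200 MB.

```python
AUDIO_SECONDS = 3600  # one hour of audio, as in the table

# (RAM in MB, processing time in seconds), copied from the table
BENCHMARKS = {
    "Lightweight (39M)": (210, 38),
    "Standard (74M)": (420, 84),
    "Heavyweight (244M)": (1200, 252),  # 1.2 GB, 4.2 minutes
}


def speed_factor(process_seconds, audio_seconds=AUDIO_SECONDS):
    """How many seconds of audio are transcribed per second of compute."""
    return audio_seconds / process_seconds


for name, (ram_mb, seconds) in BENCHMARKS.items():
    print(f"{name}: {speed_factor(seconds):.1f}x real time, {ram_mb} MB RAM")
```

This works out to roughly 94.7x, 42.9x, and 14.3x real time for the three variants.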
The Lightweight Model (39M) was fast but failed on code vocabulary, transcribing `useEffect` as "you select" or `mcp` as "NCP". The Heavyweight Model (244M) was the most accurate but consumed 1.2 GB of RAM and drained the laptop battery. We chose the Standard Model (74M) as our default engine. At 420 MB of RAM, it transcribes a one-hour meeting in 84 seconds, delivering clean, code-aware transcripts with minimal system overhead.
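For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the engine's output, divided by the number of reference words. The sketch below is a generic textbook implementation, not our benchmarking harness; the sample sentence is made up, and only the `useEffect` to "you select" error comes from our testing. Comparison is case-sensitive here, which matters for camelCase identifiers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One mangled identifier costs two edits: a substitution plus an insertion.
print(word_error_rate("wrap it in useEffect", "wrap it in you select"))  # 0.5
```

A single misheard identifier in a four-word sentence already yields a WER of 0.5, which is why errors on code terms weigh so heavily in short technical utterances.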