
Why Does Whisper Invent Random Sentences in Silent Audio?
If you've used any Whisper-based voice-to-text app (MetaWhisp, Wispr Flow, SuperWhisper, MacWhisper, raw whisper.cpp), you've probably seen this: you record audio with a few seconds of silence at the end, and the transcript contains a phrase you never spoke. Common offenders:
- "Thank you for watching."
- "Please subscribe to my channel."
- "Subtitles by the Amara.org community."
- "Don't forget to like and subscribe."
- In other languages: "Спасибо за просмотр" (Russian) and "Merci d'avoir regardé" (French), both meaning "Thank you for watching"
How Whisper's Voice Activity Detection Actually Works
Whisper doesn't have a real voice activity detector. It has a no-speech probability: a per-chunk confidence score that the chunk contains no speech. The decoder uses this score, along with other signals such as the average log-probability of the decoded tokens, to decide whether to emit text or skip the chunk. The default no-speech threshold is 0.6, per OpenAI's official transcribe.py implementation, and a chunk can only be skipped when its no-speech probability is above that threshold.

The problem: in silent audio, the model often produces a no-speech probability of 0.3 to 0.55. That is below the 0.6 threshold, so the chunk gets transcribed. The model then generates whatever it thinks "silence" sounds like, which is the YouTube outro phrases from training.

This is fundamentally different from how dedicated VAD libraries work. A real VAD, like Silero VAD or WebRTC VAD, runs a separate audio classifier that outputs binary speech/non-speech labels with 95%+ accuracy. Whisper's internal no-speech score is a by-product of language modeling, not a purpose-built speech detector. It fails on silence about 15-20% of the time, by my measurements on test audio.

The bug everyone misses: Whisper's no-speech threshold of 0.6 was tuned for transcribing podcasts and conversations, content with mostly continuous speech and only short silent gaps. For dictation workflows where users record 5-10 second clips and pause between them, the threshold is too permissive. Silent gaps get mistakenly classified as "containing speech" and the model hallucinates content.
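The skip decision described above can be sketched in a few lines. This is a paraphrase of the rule in OpenAI's transcribe.py as I understand it, not a copy of it; the function name `should_skip_segment` is mine, and the example probabilities are illustrative values taken from the 0.3 to 0.55 range mentioned above.

```python
def should_skip_segment(no_speech_prob, avg_logprob,
                        no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Return True when a decoded window should be dropped as silence."""
    if no_speech_threshold is None:
        return False  # the silence check is disabled entirely
    skip = no_speech_prob > no_speech_threshold
    # A confident decode overrides the no-speech vote and keeps the text.
    if logprob_threshold is not None and avg_logprob > logprob_threshold:
        skip = False
    return skip


# A silent window that scores 0.45 survives the default threshold...
kept_by_default = not should_skip_segment(0.45, avg_logprob=-1.3)
# ...and is only dropped once the threshold is lowered to 0.4.
dropped_at_0_4 = should_skip_segment(0.45, avg_logprob=-1.3,
                                     no_speech_threshold=0.4)
```

Note the second condition: even a high no-speech probability does not remove a window whose tokens were decoded with high average log-probability, which is one reason confident-sounding boilerplate can slip through.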
The Three Failure Modes That Trigger Hallucinations
Through debugging the production pipeline, I've identified three distinct patterns that produce Whisper hallucinations:
- Pre-recording silence — User presses the hotkey, takes half a second to start speaking. Those 500ms of silence get transcribed as "Thank you for watching" before the real speech begins.
- Mid-recording pauses — User thinks mid-sentence, pauses for 2-3 seconds. The pause gets transcribed as a YouTube outro, then the real speech resumes after.
- Post-recording trailing silence — User finishes speaking but the recording continues for another second. Trailing silence triggers "Please subscribe to my channel" appended at the end of the transcript.
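To see which of the three patterns a given recording contains, the silent gaps can be bucketed by position. The sketch below assumes you already have speech regions as `(start, end)` pairs in seconds from some detector; the function name `classify_silence` and the 0.3 s minimum gap are illustrative choices, not values from any library.

```python
def classify_silence(speech_regions, total_s, min_gap_s=0.3):
    """Bucket the non-speech time of a recording into pre, mid, and post gaps."""
    gaps = {"pre": [], "mid": [], "post": []}
    cursor = 0.0  # end of the speech seen so far
    for i, (start, end) in enumerate(sorted(speech_regions)):
        if start - cursor >= min_gap_s:
            gaps["pre" if i == 0 else "mid"].append((cursor, start))
        cursor = max(cursor, end)
    # Anything left after the last speech region is trailing silence.
    # A clip with no speech at all is reported entirely as "post".
    if total_s - cursor >= min_gap_s:
        gaps["post"].append((cursor, total_s))
    return gaps


# Half a second before speech, a 2.5 s thinking pause, 1 s of trailing silence.
example = classify_silence([(0.5, 3.0), (5.5, 8.0)], total_s=9.0)
```

Each gap in the result is a stretch of audio that Whisper would otherwise be asked to transcribe.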

Fix 1: Lower the No-Speech Threshold to 0.4
The simplest fix is to lower Whisper's no-speech threshold from the default 0.6 down to 0.4. This makes the model more aggressive about labeling silent chunks as "no speech" and skipping them. The trade-off: marginal speech (someone whispering, or speaking very quietly) may also get skipped. In OpenAI's reference implementation, the parameter is `no_speech_threshold`:

    import whisper

    model = whisper.load_model("large-v3-turbo")
    result = model.transcribe(
        "audio.wav",
        no_speech_threshold=0.4,           # default is 0.6
        logprob_threshold=-1.0,            # default; a chunk is skipped only if its average log-prob is also below this
        condition_on_previous_text=False,  # don't feed earlier output back in as context
    )
    print(result["text"])
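In whisper.cpp, the flag is `-nth 0.4` (long form `--no-speech-thold 0.4`); note that `-tp` sets the sampling temperature, not this threshold. In WhisperKit, set `decodingOptions.noSpeechThreshold = 0.4`. Same parameter, three different APIs.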
Fix 2: Pre-Filter Silence with a Dedicated VAD
The cleaner architectural fix is to never give Whisper silent audio in the first place. Run a dedicated voice activity detector (VAD) on the raw audio before Whisper sees it. The VAD identifies speech regions, strips out silence, and passes only the speech-containing segments to Whisper. Best-in-class VAD options:
- Silero VAD — Open-source, MIT-licensed, runs on CPU at 0.5ms per 100ms of audio. 95.5% accuracy on the AVA-Speech benchmark. The de-facto standard for production speech pipelines in 2026.
- WebRTC VAD — Older but well-tested, built into Chrome and Firefox for WebRTC calls. 4 aggressiveness modes (0-3). Free and trivially small.
- Apple Speech framework's SFSpeechRecognizer — Built into macOS, has internal VAD but only exposes the result after transcription.
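A minimal sketch of the pre-filter stage, assuming the `silero-vad` and `openai-whisper` pip packages are installed. The helper names `pad_and_merge` and `transcribe_speech_only` and the 0.25 s padding are my own illustrative choices, and the Silero calls follow that package's README as I understand it, so check them against the version you install.

```python
def pad_and_merge(regions, pad_s, total_s):
    """Pad each (start, end) speech region and merge any that now overlap."""
    merged = []
    for start, end in sorted(regions):
        start = max(0.0, start - pad_s)
        end = min(total_s, end + pad_s)
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def transcribe_speech_only(path, pad_s=0.25, sample_rate=16000):
    """Run Silero VAD first, then hand only the speech regions to Whisper."""
    # Imported here so pad_and_merge stays usable without these packages.
    import whisper
    from silero_vad import get_speech_timestamps, load_silero_vad, read_audio

    wav = read_audio(path, sampling_rate=sample_rate)  # 1-D float tensor
    stamps = get_speech_timestamps(
        wav, load_silero_vad(), sampling_rate=sample_rate, return_seconds=True
    )
    regions = pad_and_merge(
        [(s["start"], s["end"]) for s in stamps], pad_s, len(wav) / sample_rate
    )
    if not regions:
        return ""  # nothing but silence: Whisper is never called

    model = whisper.load_model("large-v3-turbo")
    texts = []
    for start, end in regions:
        clip = wav[int(start * sample_rate):int(end * sample_rate)]
        texts.append(model.transcribe(clip)["text"].strip())
    return " ".join(texts)
```

The padding keeps word onsets and trailing consonants that a tight VAD boundary might clip. Transcribing each region separately is the simplest design; concatenating the regions into one clip is an alternative that preserves more context across pauses.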

Fix 3: Use Whisper Prompts to Anchor the Decoder Context
Whisper's reference implementation has an `initial_prompt` parameter (the hosted API calls it `prompt`) that lets you pre-condition the model with context. The decoder treats the prompt as if it were the start of an existing transcript, which biases generation toward continuing that style. If you set a prompt that doesn't contain YouTube-style boilerplate, the model is less likely to generate it.

    result = model.transcribe(
        "audio.wav",
        initial_prompt="The following is a dictated note about software engineering.",
    )
This is documented in OpenAI's Speech-to-Text API documentation and works in both the API and local Whisper. The effectiveness varies by prompt content. Prompts that explicitly mention the user's domain (medical dictation, legal notes, code comments) reduce hallucination by approximately 30% in my measurements — not as effective as VAD pre-filtering but useful when you can't add a VAD to your pipeline.
Pro tip: Combine prompts with no-speech-threshold lowering for the best results without VAD. Set no_speech_threshold=0.45 and add a 1-sentence prompt like "Dictation note for [your domain]". This catches roughly 90% of hallucinations on common audio without requiring an external VAD model.
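The combination in the tip can be packaged as a small options builder. This is only a sketch: `dictation_options` is a hypothetical helper, and the prompt wording is the template from the tip with a domain filled in.

```python
def dictation_options(domain, no_speech_threshold=0.45):
    """Build keyword arguments for whisper's model.transcribe()."""
    return {
        "no_speech_threshold": no_speech_threshold,
        # One short sentence naming the domain, with no video-style boilerplate.
        "initial_prompt": f"Dictation note for {domain}.",
    }


options = dictation_options("cardiology")
```

With the reference implementation, the result would be passed as `model.transcribe("audio.wav", **options)`.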
How MetaWhisp Handles Whisper Hallucinations
I built MetaWhisp with all three fixes layered together by default. The architecture:
- Silero VAD pre-filter — Audio from the microphone passes through Silero VAD before reaching Whisper. Silent chunks are dropped at this stage. This eliminates the pre-recording, mid-pause, and post-recording silence hallucinations entirely.
- No-speech threshold lowered to 0.45 — As a second-line defense, if any silent chunk slips through the VAD, the Whisper decoder is configured to be more conservative.
- Domain-aware prompts — MetaWhisp's processing modes inject context-specific prompts (Raw mode gets a neutral prompt, Correct mode gets "professional dictation", Translate mode gets a translation-specific prompt).
Why Cloud-Based Voice-to-Text Apps Hallucinate More
Cloud-based apps like Wispr Flow, Otter.ai consumer tier, and the raw OpenAI Whisper API often hallucinate more than well-designed on-device apps. Three reasons:
- No pre-filtering at the edge — Cloud apps upload raw microphone audio to their servers. Adding a Silero VAD on the user's device would add 3-5 ms latency but most cloud apps skip it to minimize on-device CPU usage. The server-side then processes silent chunks through Whisper, producing hallucinations.
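- Default OpenAI Whisper API parameters — The reference implementation defaults to `no_speech_threshold=0.6`, and as far as I can tell the hosted OpenAI transcription API does not expose that threshold as a request parameter. Apps that call the API therefore cannot lower it and inherit the high hallucination rate, unless they request `verbose_json` output and filter segments by the returned `no_speech_prob` themselves, which most developers don't do.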
- Network latency masks the bug — When cloud transcription takes 500-1500 ms, users don't notice that some of that latency was spent generating hallucinated YouTube outros. The bug is harder to detect when transcription feels "naturally" slow.
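When the transcription runs on someone else's server, one client-side mitigation is to filter the returned segments yourself. The sketch below assumes segments arrive as dictionaries with `text`, `no_speech_prob`, and `avg_logprob` keys, which is the shape of `result["segments"]` in local Whisper; hosted `verbose_json` responses carry the same field names as far as I know, though an SDK may wrap them in objects. The function names and the sample values are illustrative.

```python
def drop_silent_segments(segments, max_no_speech_prob=0.4, min_avg_logprob=-1.0):
    """Remove segments that look like silence decoded with low confidence."""
    kept = []
    for seg in segments:
        likely_silence = seg["no_speech_prob"] > max_no_speech_prob
        low_confidence = seg["avg_logprob"] < min_avg_logprob
        if not (likely_silence and low_confidence):
            kept.append(seg)
    return kept


def join_text(segments):
    """Rebuild the transcript from the surviving segments."""
    return " ".join(seg["text"].strip() for seg in segments)


sample = [
    {"text": " Send the draft to legal.", "no_speech_prob": 0.02, "avg_logprob": -0.21},
    {"text": " Thank you for watching.", "no_speech_prob": 0.47, "avg_logprob": -1.35},
]
cleaned = join_text(drop_silent_segments(sample))
```

This cannot recover audio the server never transcribed, but it does catch segments that scored between the stricter client threshold and the server's default.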
How to Test Whisper Hallucinations on Your Audio
You can verify your voice-to-text tool's hallucination rate with a 30-second test:
- Open your voice-to-text app (MetaWhisp, Wispr Flow, SuperWhisper, raw whisper.cpp — any Whisper-based tool)
- Record 30 seconds of complete silence in a quiet room (no speech, no music, no clearly identifiable sound)
- Stop the recording and let the tool produce a transcript
- Check the output. A well-built tool produces an empty transcript. A flawed tool produces phrases like "Thank you for watching" or worse.
For a more thorough benchmark, record 30 seconds where you speak for 20 seconds in the middle with 5 seconds of silence on each end. A well-built tool produces a transcript of only the spoken middle. A flawed tool prepends or appends YouTube boilerplate to the real content.
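For tools that accept audio files, the silence test can be made repeatable with a generated fixture. This sketch uses only the Python standard library; the helper names are mine. Keep in mind that a file of digital zeros is not the same input as a quiet room recorded through a microphone, so treat it as a complement to the manual test, not a replacement.

```python
import os
import tempfile
import wave


def write_silence(path, seconds, sample_rate=16000):
    """Write a mono 16-bit PCM WAV file containing digital silence."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))


def make_fixture(seconds=30):
    """Create the silent test file in a fresh temporary directory."""
    path = os.path.join(tempfile.mkdtemp(), "silence.wav")
    write_silence(path, seconds)
    return path


def passes_silence_test(transcript):
    """A silent recording should produce an empty transcript."""
    return transcript.strip() == ""
```

Feed the file from `make_fixture()` to the tool under test and pass its output to `passes_silence_test`.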
What About Whisper Hallucinations in Music or Noisy Environments?
Silence isn't the only trigger. Whisper also hallucinates when fed audio that's not speech but has speech-like patterns — music with vocals, multiple overlapping voices, persistent background noise. The same VAD and threshold fixes help, but the failure modes differ:
- Music with vocals — Whisper sometimes transcribes song lyrics as if they were dictated speech, mangling the timing. VAD pre-filtering usually catches this because music has different acoustic signatures than speech.
- Multiple overlapping speakers — Whisper picks one speaker and ignores the rest. The "ignored" speaker's audio sometimes generates hallucinated content because Whisper doesn't know what to do with it. Speaker diarization tools like pyannote.audio are the standard fix.
- Persistent background noise — TV in another room, café ambience, refrigerator hum. Whisper's confidence drops in these conditions and hallucination rate increases to 5-8%. Lowering the no-speech threshold to 0.35 helps but also drops marginal speech.

Frequently Asked Questions About Whisper Hallucinations
Why does Whisper say "Thank you for watching" when there's no speech?
This is the most common Whisper hallucination. Whisper was trained on 680,000 hours of audio scraped from the public internet, much of it YouTube videos. Silent intros and outros in those videos were paired with subtitle annotations like "Thanks for watching, please subscribe". The model learned to associate silence with these scripted phrases, so when it sees silent audio at inference time, it generates the phrase it learned silence "means". The fix is to pre-filter silence with a VAD or lower the no-speech threshold below the 0.6 default.
How do I stop Whisper from generating random text?
Three layered fixes that I shipped in MetaWhisp: (1) Run a dedicated VAD like Silero VAD before audio reaches Whisper — drops silent chunks entirely. (2) Lower the no_speech_threshold from default 0.6 to 0.4 or 0.45. (3) Use the initial_prompt parameter to bias the decoder away from YouTube boilerplate. Combining all three drops hallucination rate from 15-18% to under 1% in production.
Does Whisper large-v3-turbo hallucinate less than older Whisper versions?
Marginally. Whisper large-v3-turbo (October 2024) has a similar hallucination rate to large-v3 because they share the same training data composition. The improvements in turbo are speed (8x faster) and slight accuracy gains, not hallucination handling. The same VAD and threshold fixes apply to all Whisper variants from tiny through large-v3-turbo. Distil-Whisper, a distilled variant published by Hugging Face, has somewhat reduced hallucinations due to different training but isn't multilingual.
Why does Wispr Flow hallucinate more than MetaWhisp?
Wispr Flow runs Whisper in their cloud without dedicated VAD pre-filtering. Audio from your Mac uploads directly to their servers, then Whisper processes the entire stream including silent sections. Default OpenAI parameters mean no_speech_threshold stays at 0.6, which is too permissive for dictation. MetaWhisp runs Silero VAD on-device before Whisper sees the audio, so silent chunks never reach the model. Architecture matters more than which Whisper variant is running.
Can I use Whisper without hallucinations for legal or medical transcription?
Yes, but only with proper engineering. Out-of-the-box Whisper hallucinates 12-18% on silent audio — unacceptable for legal depositions or medical dictation where false phrases create liability. The combination of Silero VAD pre-filtering plus no-speech threshold at 0.4 plus domain-aware prompts drops the rate below 1%, comparable to professional transcription services. For HIPAA-bound healthcare workflows, on-device implementations like MetaWhisp also avoid the BAA requirement entirely — see our HIPAA speech-to-text guide for compliance details.
What's the difference between Whisper hallucinations and Whisper repetition loops?
Different bugs. Hallucinations are invented phrases from training data, typically appearing in silent or non-speech audio. Repetition loops are when Whisper gets stuck repeating the same word or phrase 20-100 times, caused by the autoregressive decoder picking high-confidence repeated tokens. Fix for hallucinations: VAD pre-filtering and threshold tuning. Fix for repetition loops: enabling beam search with beam_size=5, lowering temperature to 0, and setting condition_on_previous_text=False so a looping window is not fed back in as context for the next one. Both can occur in the same recording.
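Repetition loops can also be detected after the fact, because heavily repeated text compresses unusually well. The sketch below mirrors the compression-ratio check that the reference implementation applies with its `compression_ratio_threshold` option (default 2.4); it is a different technique from the beam-search fix above, and the name `looks_like_repetition_loop` is mine.

```python
import zlib


def compression_ratio(text):
    """Raw size divided by zlib-compressed size; repetition pushes this up."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_like_repetition_loop(text, threshold=2.4):
    """Flag transcripts whose text is far more compressible than normal speech."""
    return compression_ratio(text) > threshold


looped = "thank you " * 50
normal = "Please send the signed contract to the legal team by Friday."
```

A check like this is useful as a guard on the final transcript: a flagged result can be discarded or re-run with different decoding settings.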
How do I report a Whisper hallucination bug?
For OpenAI's reference Whisper, file an issue at the GitHub repo with the audio sample (or a description if you can't share audio), the model variant used, and the parameters. For whisper.cpp, similar process at the whisper.cpp repo. For commercial apps using Whisper as a backend (MetaWhisp, Wispr Flow, SuperWhisper), report to the app's support channel with reproduction steps. Most app vendors take hallucination bugs seriously because they affect transcript quality directly.
Does MetaWhisp guarantee zero Whisper hallucinations?
No app can guarantee absolute zero — Whisper is a probabilistic model and rare edge cases will always exist. MetaWhisp's implementation drops hallucination rate from industry-typical 12-18% to under 0.3% via the three-layer defense: Silero VAD pre-filtering, lowered no_speech_threshold, and domain-aware prompts. For mission-critical legal or medical workflows, always review transcripts for any unexpected content before relying on them — that's true of any Whisper-based tool.
About the Author
Andrew Dyuzhov is the solo founder and CEO of MetaWhisp, a free on-device voice-to-text app for macOS that runs Whisper large-v3-turbo on Apple Neural Engine. He ships Whisper inference in production via MetaWhisp, debugged the hallucination problem while building the app, and architected MetaWhisp's three-layer defense to drive industry-typical 12-18% hallucination rates below 1% in his own controlled testing. The fixes in this article come from reproduction tests on his M3 MacBook Air across Whisper tiny, base, small, medium, large-v3, and large-v3-turbo variants. Connect on X or GitHub.
Related Reading
- What Is Whisper large-v3-turbo? Local AI for Mac — architecture deep-dive on the model behind MetaWhisp
- Whisper Model Sizes: Tiny to Turbo Compared — disk, RAM, accuracy across all Whisper variants
- Private Voice-to-Text on Mac: Zero Cloud Upload — architectural deep dive on on-device transcription
- 7 Best Voice-to-Text Apps for Mac (2026) — how MetaWhisp compares on hallucination handling
- HIPAA-Compliant Speech-to-Text on Mac (2026) — when hallucinations matter for legal liability