Why Whisper Hallucinates When You're Silent

Whisper Hallucinates. Here's Why.

When it happens: on silence and non-speech audio

Root cause: training data bias

Fixes that work: 3 (one is permanent)

Where to use them: any Whisper app

TL;DR: Whisper invents entire sentences out of silent audio — phrases like "Thank you for watching", "Please subscribe", or worse. This is a known training-data artifact: OpenAI trained Whisper on 680,000 hours of internet audio, much of it YouTube videos where silent sections often contained scripted outros. The model learned that silence equals end-of-video boilerplate. The fix is a combination of voice-activity-detection (VAD) pre-filtering, lower no-speech thresholds, and prompt anchoring. I'm Andrew Dyuzhov — I built MetaWhisp running Whisper large-v3-turbo on Apple Neural Engine and hit this bug repeatedly while debugging. Here's exactly what works.

[Figure: silent audio produces an invented "Thank you for watching" transcript, while real speech produces an accurate one.]

Why Does Whisper Invent Random Sentences in Silent Audio?

Used any Whisper-based voice-to-text app? MetaWhisp, Wispr Flow, SuperWhisper, MacWhisper, raw whisper.cpp — doesn't matter which. You've probably seen this. You record, leave a few seconds of silence at the end, and the transcript hands you back a phrase you never said. The usual suspects:

"Thank you for watching."
"Please subscribe to my channel."
"Subtitles by the Amara.org community."
"Don't forget to like and subscribe."
In other languages: "Спасибо за просмотр" (Russian) and "Merci d'avoir regardé" (French), both meaning "Thanks for watching"

Every one is a hallucination. Whisper made it up. There was no audio that even resembled speech; the model just decided silence sounded like a YouTube outro. I ship Whisper inference on Apple Neural Engine in MetaWhisp, and I've hit this bug more times than I'd like while debugging the production audio pipeline. So here's the root cause, three fixes that actually work, and how to check whether your own audio is affected.

It has a name: "non-speech hallucination." It's a known issue, documented in OpenAI's official Whisper repository discussions, the whisper.cpp issue tracker, and dozens of academic papers picking apart the failure mode.

The cause is the training data. Whisper learned from 680,000 hours of audio scraped off the public internet, per the original Whisper paper from Radford et al. A big chunk of that was YouTube. And YouTube videos have silent intros and outros that sit right next to subtitle annotations like "Thanks for watching" or "Subtitles by [community]". So the model learned to tie silence to those scripted phrases. Feed it silence at inference time, and it generates the phrase it thinks silence "means", even when the silence means nothing.

This isn't a code bug you can patch. The behavior is learned from the training data, so it ships with every Whisper checkpoint. And you can reproduce it on any Whisper variant by feeding it enough silent recordings.

How Whisper's Voice Activity Detection Actually Works

Whisper doesn't have a real voice activity detector. What it has is a no-speech probability: a per-chunk score for how likely it is that the chunk contains no speech. The decoder reads that score, plus a few other signals, and decides whether to emit text or skip the chunk. The default no-speech threshold is 0.6, per OpenAI's official transcribe.py implementation. In that implementation a chunk is skipped only when its no-speech probability is above the threshold and the average log-probability of the decoded text is not above `logprob_threshold` (default -1.0).

Here's the problem. On silent audio, the model often spits out a no-speech probability of 0.3 to 0.55. That's below the 0.6 threshold. So the chunk gets transcribed. And then the model writes down whatever it thinks "silence" sounds like: those YouTube outros from training.

Dedicated VAD libraries don't work like this at all. A real VAD (WhisperKit's built-in VAD, or WebRTC VAD) runs a separate audio-classifier model that spits out binary speech/non-speech labels at 95%+ accuracy. Whisper's no-speech score is a byproduct of language modeling. It was never built to detect speech. By my measurements on test audio, it gets silence wrong about 15-20% of the time.
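The skip decision is easier to see as code. The function below is a simplified paraphrase of the check in OpenAI's transcribe.py, not a copy of it; the function name and the example scores are my own, chosen only for illustration.

```python
def should_skip_chunk(no_speech_prob, avg_logprob,
                      no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Return True when a chunk would be treated as silence and skipped."""
    if no_speech_threshold is None:
        return False  # the no-speech check is disabled
    skip = no_speech_prob > no_speech_threshold
    # A sufficiently confident decode overrides the no-speech score.
    if logprob_threshold is not None and avg_logprob > logprob_threshold:
        skip = False
    return skip


# Hypothetical silent chunk scored at 0.5 with a low-confidence decode.
print(should_skip_chunk(0.5, -1.2))                           # False: transcribed at the default 0.6
print(should_skip_chunk(0.5, -1.2, no_speech_threshold=0.4))  # True: skipped at 0.4
print(should_skip_chunk(0.7, -0.3))                           # False: confident decode wins
```

The last call shows why the threshold alone is not a guarantee: a hallucinated phrase that the decoder is confident about can survive even with a high no-speech score.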

The bug everyone misses: the 0.6 threshold was tuned for podcasts and conversations — mostly continuous speech, only short silent gaps. Dictation is a different animal. Users record 5-10 second clips and pause between them. For that, 0.6 is way too permissive. Silent gaps get tagged as "containing speech," and the model fills them in.

The Three Failure Modes That Trigger Hallucinations

Debugging the production pipeline, I kept seeing the same three patterns trigger hallucinations:

Pre-recording silence — User presses the hotkey, takes half a second to start speaking. Those 500ms of silence get transcribed as "Thank you for watching" before the real speech begins.
Mid-recording pauses — User thinks mid-sentence, pauses for 2-3 seconds. The pause gets transcribed as a YouTube outro, then the real speech resumes after.
Post-recording trailing silence — User finishes speaking but the recording continues for another second. Trailing silence triggers "Please subscribe to my channel" appended at the end of the transcript.

[Figure: the three hallucination triggers in a voice-to-text workflow: pre-recording silence, mid-recording pauses, and post-recording trailing silence.]

Mid-recording pauses are the worst of the three. The fake text lands in the middle of a real transcript, where nobody catches it. Picture a 1,000-word meeting transcript with two stray "Thank you for watching" sentences buried in it. Skim it and it reads fine. But it's carrying content that was never said. For legal depositions or medical dictation, that's not a cosmetic issue.

Fix 1: Lower the No-Speech Threshold to 0.4

Simplest fix first. Drop Whisper's no-speech threshold from the default 0.6 down to 0.4. Now the model is quicker to call a silent chunk "no speech" and skip it. The trade-off: marginal speech — someone whispering, or talking very quietly — can get skipped too. In OpenAI's reference implementation, the parameter is `no_speech_threshold`:

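import whisper

model = whisper.load_model("large-v3-turbo")
result = model.transcribe(
    "audio.wav",
    no_speech_threshold=0.4,           # default is 0.6; lower skips silent chunks sooner
    logprob_threshold=-1.0,            # default; a confident decode above this is never skipped
    condition_on_previous_text=False,  # don't feed earlier output back in as context
)
print(result["text"])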

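In whisper.cpp's command-line tool, the flag is `-nth 0.4` (long form `--no-speech-thold`); `-tp` sets the sampling temperature, not this threshold. In WhisperKit, you set `decodingOptions.noSpeechThreshold = 0.4`. Same knob, three different APIs.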

Drop the no-speech threshold to 0.4 and silence hallucinations fall off noticeably on test recordings with deliberate silent sections. It's documented in OpenAI's official discussion on hallucinations and kicked around by the Hugging Face Whisper community. The cost is a few more false rejections — some genuinely quiet speech gets read as silence and dropped. For most dictation, that's a fine trade. But if you're transcribing extremely quiet audio — whispered medical notes, accessibility cases for users with weak vocal cords — 0.4 can be too aggressive. Try 0.5 instead. The right number depends on your room, your mic gain, and how hard you want to punish low-confidence speech. Run an A/B test on 20 sample recordings from your actual use case and you'll have your answer in under an hour.
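One way to run that A/B test is a small threshold sweep. The sketch below assumes you have already collected a no-speech probability for each chunk and labeled by hand whether it contained speech; the data is made up, the function name is hypothetical, and the sweep ignores the log-probability override for simplicity.

```python
def sweep_thresholds(chunks, thresholds=(0.6, 0.5, 0.45, 0.4)):
    """chunks: list of (no_speech_prob, has_speech) pairs, labeled by hand.

    Returns one (threshold, silent_chunks_transcribed, speech_chunks_dropped)
    row per threshold.
    """
    rows = []
    for t in thresholds:
        # Silent chunks at or below the threshold still get transcribed.
        silent_transcribed = sum(1 for p, has_speech in chunks
                                 if not has_speech and p <= t)
        # Speech chunks above the threshold get skipped.
        speech_dropped = sum(1 for p, has_speech in chunks
                             if has_speech and p > t)
        rows.append((t, silent_transcribed, speech_dropped))
    return rows


# Made-up scores: four chunks with speech, four silent chunks.
sample = [(0.05, True), (0.10, True), (0.42, True), (0.55, True),
          (0.35, False), (0.48, False), (0.52, False), (0.80, False)]

for threshold, silent_transcribed, speech_dropped in sweep_thresholds(sample):
    print(threshold, silent_transcribed, speech_dropped)
```

On this toy data, moving from 0.6 to 0.45 cuts the silent chunks that reach the decoder from three to one and costs one chunk of quiet speech. Your own recordings will give different numbers, which is the point of running the sweep.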

Fix 2: Pre-Filter Silence with a Dedicated VAD

The cleaner fix is architectural: never hand Whisper silent audio at all. Run a dedicated voice activity detector (VAD) on the raw audio first. It finds the speech regions, cuts the silence, and forwards only the speech-bearing segments to Whisper. The VAD options worth knowing:

WhisperKit's built-in voice activity detection (VAD) — the chunking strategy MetaWhisp uses. It carves audio into speech regions before transcription, so the model never swallows long stretches of silence. Runs locally on-device as part of the WhisperKit pipeline.
WebRTC VAD — Older, but battle-tested. It's the one baked into Chrome and Firefox for WebRTC calls. 4 aggressiveness modes (0-3). Free and tiny.
Apple Speech framework's SFSpeechRecognizer — Ships with macOS. Has internal VAD, but only hands you the result after transcription.

Now the pipeline reads: microphone → audio buffer → WhisperKit's built-in voice activity detection (VAD) → speech-only chunks → Whisper → text. Silent chunks get dropped, full stop. Whisper only ever sees audio that actually has speech in it. So silence-triggered hallucinations are largely eliminated here.
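To make the pre-filter step concrete, here is a deliberately simple energy-based detector in plain Python. It is a stand-in for illustration only: it is not the algorithm WhisperKit or WebRTC VAD uses, and the function name, frame size, and energy threshold are assumptions you would need to tune for a real microphone.

```python
import math


def speech_regions(samples, sample_rate=16000, frame_ms=30,
                   energy_threshold=0.01, min_speech_frames=3):
    """Return (start, end) sample indices for runs of frames whose RMS energy
    is at or above energy_threshold. Runs shorter than min_speech_frames
    are ignored."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    regions, run_start = [], None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms >= energy_threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_speech_frames:
                regions.append((run_start * frame_len, i * frame_len))
            run_start = None
    # Close a run that lasts until the end of the clip.
    if run_start is not None and n_frames - run_start >= min_speech_frames:
        regions.append((run_start * frame_len, n_frames * frame_len))
    return regions


# Synthetic clip: 0.6 s silence, 0.9 s of a 440 Hz tone standing in for speech,
# then 0.6 s silence.
silence = [0.0] * 9600
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(14400)]
clip = silence + tone + silence

regions = speech_regions(clip)
speech_only = [clip[start:end] for start, end in regions]  # what Whisper would see
print(regions)  # [(9600, 24000)]
```

Only the slices in `speech_only` would be forwarded to the transcriber. A pure energy gate like this will pass loud non-speech noise, which is one reason production pipelines use a trained detector instead.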

[Figure: Whisper without VAD producing hallucinations, compared with a WhisperKit VAD pipeline producing clean transcripts.]

Fix 3: Use Whisper Prompts to Anchor the Decoder Context

Whisper accepts a text prompt that pre-conditions the model with context. The open-source Python package calls the parameter `initial_prompt`; OpenAI's hosted API calls it `prompt`. The decoder treats whatever you pass as the start of an existing transcript, then keeps writing in that style. Give it a prompt with no YouTube-style boilerplate, and it's less inclined to invent any.

import whisper

model = whisper.load_model("large-v3-turbo")
result = model.transcribe(
    "audio.wav",
    initial_prompt="The following is a dictated note about software engineering.",
)
print(result["text"])

This is in OpenAI's Speech-to-Text API documentation, and it works in both the API and local Whisper. How well? Depends on the prompt. Name the user's domain outright — medical dictation, legal notes, code comments — and hallucinations drop. It's not as strong as VAD pre-filtering. But it's there when you can't bolt a VAD onto your pipeline.

Pro tip: No VAD available? Stack the two. Set no_speech_threshold=0.45 and add a one-sentence prompt like "Dictation note for [your domain]". Together they catch most silence hallucinations on everyday audio — no external VAD model needed.
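A minimal sketch of that stacked configuration for the open-source Python package. The helper name `dictation_options` is hypothetical; the two keyword arguments it returns are the real `transcribe()` parameters used earlier in this article.

```python
def dictation_options(domain, no_speech_threshold=0.45):
    """Build keyword arguments for whisper's model.transcribe()."""
    return {
        "no_speech_threshold": no_speech_threshold,
        "initial_prompt": f"Dictation note for {domain}.",
    }


options = dictation_options("software engineering")
print(options)

# With the openai-whisper package installed, the options plug in like this:
#   import whisper
#   model = whisper.load_model("large-v3-turbo")
#   result = model.transcribe("audio.wav", **options)
```

Keeping the options in one place makes it easy to try 0.4, 0.45, and 0.5 against the same recordings without touching the rest of the pipeline.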

How MetaWhisp Handles Whisper Hallucinations

I built MetaWhisp with all three fixes stacked on by default. The architecture:

VAD pre-filter: Audio from the microphone passes through WhisperKit's built-in voice activity detection (VAD) before reaching Whisper. Silent chunks are dropped at this stage, which removes most pre-recording, mid-pause, and post-recording silence hallucinations.
No-speech threshold lowered to 0.45: As a second-line defense, if any silent chunk slips through the VAD, the Whisper decoder is quicker to classify it as no-speech and skip it.
Domain-aware prompts: MetaWhisp's processing modes inject context-specific prompts (Raw mode gets a neutral prompt, Correct mode gets "professional dictation", Translate mode gets a translation-specific prompt).

On Apple Silicon, MetaWhisp uses WhisperKit's built-in voice activity detection — its .vad chunking strategy — compiled to Core ML via Apple's Core ML framework. The VAD step strips out the silent segments, so Whisper only ever processes audio that actually contains speech. That's what kills the silence-hallucination failure mode, the one where Whisper invents "thanks for watching" over a quiet passage. And running the whole pipeline on the Apple Neural Engine keeps it fast and easy on the battery on M-series Macs.

Why Cloud-Based Voice-to-Text Apps Hallucinate More

Cloud apps — Wispr Flow, the Otter.ai consumer tier, the raw OpenAI Whisper API — tend to hallucinate more than a well-built on-device app. Three reasons:

No pre-filtering at the edge: Cloud apps ship raw microphone audio straight to their servers. Running WhisperKit's built-in voice activity detection (VAD) on the user's device would add 3-5 ms of latency per second of audio, and most cloud apps skip it to spare the on-device CPU. So the server feeds silent chunks through Whisper, and out come the hallucinations.
A threshold you can't change: The hosted OpenAI transcription API does not expose `no_speech_threshold` as a request parameter, so you can't lower it the way you can with local Whisper. What you get instead is a per-segment `no_speech_prob` in the `verbose_json` response format, which lets you filter suspect segments after the fact. Most developers never do.
Network latency masks the bug: When cloud transcription takes 500-1500 ms, nobody notices that part of that wait went into generating a hallucinated YouTube outro. When transcription already feels slow, the bug hides in the noise.
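If you are stuck with a hosted API, a post-filter is one option. OpenAI's `verbose_json` response format returns segments that carry a `no_speech_prob` field, so you can drop suspect segments after the fact. The sketch below works on plain dictionaries; the function name, the 0.45 cutoff, and the sample segments are assumptions for illustration.

```python
def drop_silent_segments(segments, no_speech_threshold=0.45):
    """Keep only segments whose no_speech_prob is at or below the threshold."""
    return [seg for seg in segments
            if seg["no_speech_prob"] <= no_speech_threshold]


# Made-up segments shaped like the ones in a verbose_json response.
segments = [
    {"start": 0.0, "end": 0.8, "text": " Thank you for watching.", "no_speech_prob": 0.58},
    {"start": 0.8, "end": 4.2, "text": " Move the meeting to Thursday.", "no_speech_prob": 0.03},
    {"start": 4.2, "end": 6.0, "text": " Please subscribe.", "no_speech_prob": 0.51},
]

kept = drop_silent_segments(segments)
transcript = "".join(seg["text"] for seg in kept).strip()
print(transcript)  # Move the meeting to Thursday.
```

This is weaker than never sending silence in the first place, because the score it relies on is the same unreliable signal described earlier. It still removes the most obvious cases.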

If you're transcribing sensitive content where a fake sentence actually costs you something — legal depositions, medical dictation, journalistic interviews, technical documentation — an on-device app with proper VAD pre-filtering is materially safer than a cloud API.

How to Test Whisper Hallucinations on Your Audio

Want to know your tool's hallucination rate? It's a 30-second test:

Open your voice-to-text app (MetaWhisp, Wispr Flow, SuperWhisper, raw whisper.cpp — any Whisper-based tool)
Record 30 seconds of complete silence in a quiet room (no speech, no music, no clearly identifiable sound)
Stop the recording and let the tool produce a transcript
Check the output. A well-built tool produces an empty transcript. A flawed tool produces phrases like "Thank you for watching" or worse.

Run it 10 times and count how many attempts came back with invented text. More than 1 in 10? Your tool isn't handling silence correctly.

Want a tougher benchmark? Record 30 seconds: speak for 20 seconds in the middle, with 5 seconds of silence on each end. A well-built tool transcribes only the spoken middle. A flawed one bolts YouTube boilerplate onto the front or the back of your real content.
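Scoring the silence test takes only a few lines. This sketch assumes you paste each run's transcript into a list by hand; the function name and the sample results are hypothetical.

```python
def hallucination_rate(transcripts):
    """Fraction of silent-recording transcripts that contain any text."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    invented = [t for t in transcripts if t.strip()]
    return len(invented) / len(transcripts)


# Ten silent recordings; one came back with invented text.
runs = ["", "", "Thank you for watching.", "", " ", "", "", "", "", ""]
rate = hallucination_rate(runs)
print(f"{rate:.0%} of silent recordings produced text")  # 10% of silent recordings produced text
```

Whitespace-only output counts as empty here. A tool that returns punctuation or a stray character for silence would count as a failure, which seems like the right call for dictation.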

What About Whisper Hallucinations in Music or Noisy Environments?

Silence isn't the only trigger. Whisper also hallucinates on audio that isn't speech but sounds speech-shaped — music with vocals, several voices talking over each other, steady background noise. The same VAD and threshold fixes still help. But the failure modes aren't identical:

Music with vocals — Whisper sometimes writes song lyrics down as if they were dictated, mangling the timing along the way. VAD pre-filtering usually catches it, since music has a different acoustic signature than speech.
Multiple overlapping speakers — Whisper locks onto one speaker and tunes out the rest. The audio it ignored can turn into hallucinated text, because Whisper doesn't know what else to do with it. Speaker diarization tools like pyannote.audio are the standard fix.
Persistent background noise — TV in the next room, café ambience, the fridge humming. Whisper's confidence sags, and hallucinations get more frequent. Dropping the no-speech threshold to 0.35 helps, but it also drops marginal speech.

The pattern holds across all of it. Anything that isn't clean, single-speaker, mostly-silent-between-utterances dictation will hallucinate more than the benchmark numbers let on.

The gap between benchmark conditions and real-world audio is the dirty secret of voice-to-text marketing. Clean LibriSpeech test-clean numbers (Whisper large-v3-turbo scores 2.76% WER in our benchmark) come from professionally recorded narration: one speaker, studio mic, zero background noise. Now dictate on a MacBook Air with the built-in microphone, in a coffee shop, music playing, HVAC running. Word error rates climb. Hallucinations get more frequent. We haven't published per-condition figures for those environments.

And this is true for every Whisper-based product, not just MetaWhisp. What separates the products is how hard they fight it. My call was to spend 3-5 ms on VAD per second of audio and eat the small extra latency. Cloud-based competitors made the opposite call. They optimize for raw transcription latency over hallucination prevention, because their pricing rides on per-minute audio billing, and server-side VAD carries a marginal cost their economics won't cover.

[Figure: MetaWhisp's three-layer defense against Whisper hallucinations: VAD pre-filter, no-speech threshold, and domain prompts.]

Frequently Asked Questions About Whisper Hallucinations

Why does Whisper say "Thank you for watching" when there's no speech?

It's the most common Whisper hallucination there is. Whisper trained on 680,000 hours of audio scraped from the public internet, a lot of it YouTube. In those videos, silent intros and outros sat next to subtitle annotations like "Thanks for watching, please subscribe". The model learned to pair silence with those scripted phrases. Feed it silence at inference time and it writes down the phrase it thinks silence "means". The fix: pre-filter silence with a VAD, or lower the no-speech threshold below the 0.6 default.

How do I stop Whisper from generating random text?

Three layered fixes, all of which I shipped in MetaWhisp: (1) Run a dedicated VAD like WhisperKit's built-in voice activity detection (VAD) before audio ever reaches Whisper — it drops silent chunks entirely. (2) Lower the no_speech_threshold from the default 0.6 to 0.4 or 0.45. (3) Use the initial_prompt parameter to steer the decoder away from YouTube boilerplate. Stack all three and silence hallucinations largely disappear in production.

Does Whisper large-v3-turbo hallucinate less than older Whisper versions?

Barely. Whisper large-v3-turbo (October 2024) hallucinates at about the same rate as large-v3. Turbo is a pruned, fine-tuned version of large-v3, so it shares the same training data composition. What turbo buys you is speed (roughly 8x faster) at the cost of a small drop in accuracy, not better hallucination handling. The same VAD and threshold fixes work across every Whisper variant, tiny through large-v3-turbo. Distil-Whisper, a distilled variant from Hugging Face, hallucinates somewhat less thanks to different training, but its main checkpoints are English-only.

Why does Wispr Flow hallucinate more than MetaWhisp?

Wispr Flow runs Whisper in their cloud, no dedicated VAD pre-filtering. Audio from your Mac uploads straight to their servers, and Whisper chews through the whole stream — silent sections included. Default OpenAI parameters leave no_speech_threshold at 0.6, too permissive for dictation. MetaWhisp runs WhisperKit's built-in voice activity detection (VAD) on-device before Whisper sees a thing, so silent chunks never reach the model. The architecture matters more than which Whisper variant is running.

Can I use Whisper without hallucinations for legal or medical transcription?

Yes — but only with proper engineering. Straight out of the box, Whisper hallucinates on silent audio. That's a non-starter for legal depositions or medical dictation, where a false phrase is a liability. Stack WhisperKit's built-in voice activity detection (VAD) pre-filtering with a no-speech threshold at 0.4 and domain-aware prompts, and those silence hallucinations largely go away, landing much closer to professional transcription quality. And for HIPAA-bound healthcare workflows, an on-device implementation like MetaWhisp sidesteps the BAA requirement entirely — see our HIPAA speech-to-text guide for the compliance details.

What's the difference between Whisper hallucinations and Whisper repetition loops?

Two different bugs. Hallucinations are invented phrases from the training data, and they show up in silent or non-speech audio. Repetition loops are different: Whisper gets stuck, repeating the same word or phrase 20-100 times, because the autoregressive decoder keeps grabbing high-confidence repeated tokens. Fix for hallucinations: VAD pre-filtering and threshold tuning. Fix for repetition loops: turn on beam search with beam_size=5 and drop temperature to 0. And yes — both can happen in the same recording.

How do I report a Whisper hallucination bug?

For OpenAI's reference Whisper, open an issue on the GitHub repo with the audio sample (or a description, if you can't share the audio), the model variant you used, and the parameters. For whisper.cpp, same drill on the whisper.cpp repo. For commercial apps running Whisper as a backend (MetaWhisp, Wispr Flow, SuperWhisper), go through the app's support channel with reproduction steps. Most vendors take hallucination bugs seriously — they hit transcript quality directly.

Does MetaWhisp guarantee zero Whisper hallucinations?

No app can promise an absolute zero. Whisper is a probabilistic model, and rare edge cases will always slip through. What MetaWhisp's three-layer defense does is cut silence hallucinations way down: WhisperKit's built-in voice activity detection (VAD) pre-filtering, a lowered no_speech_threshold, and domain-aware prompts. For mission-critical legal or medical work, always read the transcript for anything unexpected before you rely on it — and that goes for any Whisper-based tool, not just this one.

About the Author

Andrew Dyuzhov is the solo founder and CEO of MetaWhisp, a free on-device voice-to-text app for macOS that runs Whisper large-v3-turbo on Apple Neural Engine. He ships Whisper inference in production through MetaWhisp, ran into the hallucination problem while building the app, and designed MetaWhisp's three-layer defense to cut silence hallucinations down hard. Every fix in this article came out of reproduction tests on his M3 MacBook Air, run across Whisper tiny, base, small, medium, large-v3, and large-v3-turbo variants. Connect on X or GitHub.