Whisper Large v3 vs Medium: Accuracy on Accents

A practical reading of OpenAI's published WER data

If your English is accented, Whisper large-v3 is very likely more accurate for you than medium. OpenAI does not publish per-accent scores, but that is the most reasonable reading of its published per-language FLEURS and VoxPopuli numbers. The gap grows on lower-resource languages and shrinks on quiet, standard American English. For most accented speakers, large-v3 (or the faster large-v3-turbo) is worth the extra compute. For native US/UK dictation in a quiet room, medium holds up fine.

I spend a lot of time in Whisper's corner of the world: I ship a free offline voice-to-text app for Mac that runs Whisper locally on Apple Silicon, and I dictate in Russian and English every day. The "large-v3 vs medium" question comes up constantly, especially from people who speak English as a second language and are tired of seeing their words mangled. This is not a benchmark roundup. OpenAI does not publish per-accent WER, and I'm not going to invent numbers that don't appear in their paper or on their model card. What I'll do is walk through what OpenAI *does* publish (FLEURS and VoxPopuli per-language results) and translate that into a practical decision rule for accented English.

How much better is large-v3 than medium on English overall?

On standard American English, the difference is real but not dramatic. Two first-party sources cover it: the original paper "Robust Speech Recognition via Large-Scale Weak Supervision", which predates large-v3 and reports medium alongside the earlier large checkpoints, and the Whisper large-v3 model card on Hugging Face. Read together, they put large-v3 below medium on English WER across both FLEURS and Common Voice evaluations. The model card lists large-v3 around the mid-2% range on FLEURS English, while medium lands noticeably higher.

That gap is meaningful for production dictation. At 2-3% WER you mostly see punctuation and the occasional homophone ("their" vs "there"). At 6-8% WER you start losing content words, especially names and technical terms. Medium is not broken — it's just noticeably noisier.
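
To make those percentages concrete, here is a minimal sketch of how WER is computed: word-level edit distance divided by the number of words in the reference. The function name is mine, and the only normalization is lowercasing. Published evaluations normalize text more carefully (punctuation, numbers, spelling variants), so treat this as an illustration of the metric, not a reproduction of OpenAI's scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only one row of the table at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion: a reference word is missing
                current[j - 1] + 1,      # insertion: an extra word was transcribed
                previous[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            ))
        previous = current
    return previous[-1] / len(ref)
```

By this definition, a 100-word passage with two wrong words scores 0.02, and the same passage with seven wrong words scores 0.07. That is the difference between fixing a homophone and re-listening to the recording.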

The catch: "standard American English in a quiet room" is a narrow slice of what people actually dictate. Move the speaker, the mic, or the accent, and the gap widens.

What does OpenAI actually publish about accent accuracy?

Honest answer: nothing per-accent. OpenAI evaluates Whisper on whole-language benchmarks — FLEURS (102 languages), VoxPopuli (14 European languages), Common Voice, and LibriSpeech. None of these are broken down by "Indian English," "Brazilian Portuguese-accented English," or "Cantonese-accented English." That's a real gap in the public literature, and I'm not going to paper over it with numbers I made up.

What we can infer from the published data is qualitative. Accents typically degrade WER because they introduce phonetic patterns the model saw less of during training. Models with more parameters and more training data — i.e., large-v3 — generally handle those patterns better. Independent researchers have confirmed this trend in third-party studies, but those are not OpenAI's numbers.

So the honest picture is: large-v3 probably beats medium more on accented English than on standard English, by a margin we can't quantify from OpenAI's published charts alone. That's the truth. If a vendor claims a specific percentage improvement on "Indian accent," ask for the test set.

[Figure: Whisper large-v3 vs medium, per-language WER on the FLEURS benchmark]

Does the gap widen on lower-resource languages?

Yes — and this matters more than the accent question for many users. OpenAI's FLEURS table shows that the WER gap between large-v3 and medium grows substantially as you move from high-resource languages (English, Spanish, French) toward lower-resource ones (Swahili, Welsh, Lao).

On a low-resource language, medium can post double-digit WER while large-v3 stays in single digits. If you dictate in anything other than English — even occasionally — that gap is the strongest argument for large-v3.

This is also why I default to the largest model that fits my hardware in my offline dictation setup. MetaWhisp itself ships with large-v3-turbo specifically because it preserves most of large-v3's accuracy on non-English languages while running in roughly half the time on the Neural Engine.

How do they compare on noise and long-form dictation?

Noise is where the model-size gap shows up most clearly in practice. OpenAI's Common Voice and VoxPopuli evaluations both include varied recording conditions, and large-v3 dominates medium on noisy clips. Large-v3 also hallucinates noticeably less on silence and background hum — the famous Whisper problem where the model invents a sentence during a quiet stretch.

Medium hallucinates more. If you've ever finished dictating and found a phantom "Thanks for watching!" at the end of your transcript, that's medium's classic failure mode. Large-v3 doesn't eliminate it, but it's rarer and usually shorter when it does happen.

For long-form dictation — meetings, lectures, podcasts — hallucination matters more than a 2% WER improvement. Every hallucinated sentence is cleanup work. This alone pushes the value proposition toward large-v3 for anything beyond quick voice notes.
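
Some of that cleanup can be automated. The sketch below assumes segments shaped like the ones the open-source `openai-whisper` Python package returns under `result["segments"]`, where each segment is a dict with a `text` and a `no_speech_prob` field. The function name, the 0.6 threshold, and the phantom-phrase list are illustrative assumptions to tune on your own audio, not recommended settings.

```python
import string

# Hypothetical starter list; extend it with phantoms you see in your own transcripts.
KNOWN_PHANTOMS = {"thanks for watching"}


def _normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace and punctuation."""
    return text.strip().strip(string.punctuation + " ").lower()


def drop_probable_hallucinations(segments, no_speech_threshold=0.6):
    """Return only the segments that look like real speech.

    A segment is dropped when the model itself rated it as probably silent,
    or when its entire text is a known phantom phrase.
    """
    kept = []
    for seg in segments:
        if seg.get("no_speech_prob", 0.0) > no_speech_threshold:
            continue  # the model's own estimate says this stretch had no speech
        if _normalize(seg["text"]) in KNOWN_PHANTOMS:
            continue  # whole segment is a stock phantom sentence
        kept.append(seg)
    return kept
```

A filter like this only trims the obvious cases. It will not catch a hallucinated sentence that the model was confident about, which is why the model choice still matters.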

Dimension	Whisper medium	Whisper large-v3
Parameters	~769M	~1.55B
English WER (FLEURS, OpenAI)	Noticeably higher	Lowest published
Low-resource language WER	Often double-digit	Generally single-digit
Noise robustness	Degrades faster	More stable
Silence hallucinations	More frequent	Less frequent
Runs locally on M1/M2 Air	Yes, fast	Yes, slower

Can my Mac actually run large-v3 — and is it fast enough to be useful?

Yes, with caveats. Whisper large-v3 runs on Apple Silicon via WhisperKit, and on an M1 or later it will produce a transcript — but the latency depends heavily on the chip and audio length. On an M1 Air with the official Core ML port, large-v3 runs slower than medium — a 60-second clip takes a couple of minutes to transcribe.

An M2 Pro or M3 Pro cuts that significantly. On M4 Pro and Max chips, large-v3 feels nearly real-time for short clips. For an M1 Air user who dictates a lot, large-v3-turbo is the practical sweet spot — about half the latency of large-v3 with accuracy that the model card and third-party tests put within roughly 1 percentage point of the full model on most languages.

If your workflow is "hold hotkey, dictate three sentences, paste into email," any current Apple Silicon Mac handles large-v3 fine. If you're transcribing an hour-long meeting, you'll want either a Pro chip, large-v3-turbo, or a cloud pass to skip the wait.
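
The arithmetic behind that advice is the real-time factor (RTF): processing time divided by audio length. A rough sketch, with two assumptions of mine: "a couple of minutes" for a 60-second clip is treated as 120 seconds, and RTF is assumed to stay constant as the audio gets longer, which real runs only approximate.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF above 1.0 means transcription is slower than the audio itself."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def estimated_wait_minutes(audio_minutes: float, rtf: float) -> float:
    """Projected transcription time, assuming RTF does not change with length."""
    return audio_minutes * rtf


# Assumed figure: a 60-second clip taking about 120 seconds on an M1 Air.
m1_air_large_v3 = real_time_factor(processing_seconds=120, audio_seconds=60)

print(estimated_wait_minutes(60, m1_air_large_v3))      # hour-long meeting, full large-v3
print(estimated_wait_minutes(60, m1_air_large_v3 / 2))  # same meeting at half the latency
```

Under those assumptions an hour-long meeting costs about two hours of waiting with full large-v3 and about one hour at half the latency. Measure your own RTF on a one-minute clip before trusting either figure.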

[Figure: Decision flowchart for choosing Whisper medium vs large-v3 based on accent and language]

Where large-v3-turbo fits in this picture

large-v3-turbo is OpenAI's pruned and fine-tuned variant of large-v3: the decoder is cut from 32 layers to 4, the encoder is unchanged, and accuracy stays in the same general tier. It launched in late 2024 and is what many production apps now ship by default. The tradeoff it makes: a small accuracy drop (typically under 1 percentage point WER on FLEURS) for a large speedup, often 2x or better on the same hardware.

For accented English specifically, large-v3-turbo is the option I usually recommend. It keeps nearly all of large-v3's gains over medium while running in the time budget of an everyday dictation app. MetaWhisp uses it for exactly this reason — local mode is free and unlimited on macOS 14+ Apple Silicon, and the accuracy hit versus the full large-v3 is small enough that most users don't notice.

I wrote a deeper dive on large-v3-turbo vs full large-v3 if you want the model-level details.

When medium is still the smarter pick

I'm not going to pretend large-v3 is always the right answer. There are real cases where medium is fine:

Standard American English, quiet room, desktop mic. The accent-sensitivity issue mostly disappears here.
Very old hardware or battery-sensitive workflows. Medium runs faster and cooler, and it works on weaker machines where large-v3 is borderline.
Real-time streaming needs. Medium is the largest Whisper model that some live transcription setups can keep up with.
You mainly care about "good enough" drafts. If you're going to edit aggressively anyway, medium's extra errors may not cost you much.

The honest framing: medium is a perfectly reasonable choice for a narrow set of conditions. The moment you leave those conditions — accented speaker, noisy environment, multilingual audio — large-v3 (or turbo) starts paying for itself.

Pro tip: If you're trying to choose, transcribe the same five-minute clip with both models and read the transcripts side by side. The differences are obvious in 30 seconds and far more convincing than any chart. MetaWhisp's local mode is free to download and lets you switch models without an account.
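
If reading two transcripts side by side gets tedious, a word-level diff surfaces the disagreements directly. This is a small sketch using Python's standard `difflib`. The function name and the sample strings are made up for illustration, and it assumes you already have both transcripts as plain text from whatever app or runtime you use.

```python
import difflib


def word_differences(transcript_a: str, transcript_b: str):
    """Return (a_words, b_words) pairs for every place the transcripts disagree."""
    a_words = transcript_a.split()
    b_words = transcript_b.split()
    matcher = difflib.SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    differences = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            # An empty string on one side means the other side inserted words.
            differences.append((
                " ".join(a_words[a_start:a_end]),
                " ".join(b_words[b_start:b_end]),
            ))
    return differences


# Made-up sample transcripts, not real model output.
medium_text = "send the invoice to Dmitri by friday"
large_text = "send the invoice to Dmitry by Friday"
for a, b in word_differences(medium_text, large_text):
    print(f"medium: {a!r:12} large-v3: {b!r}")
```

The diff tells you where the models disagree, not which one is right. You still need to listen to those spots, but on a five-minute clip that is usually a short list.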

Quick decision checklist

Here's the rule I use when someone asks me directly:

You speak English with a noticeable accent, or you dictate in two or more languages → large-v3 or large-v3-turbo. Don't overthink it.
You're a native US/UK English speaker dictating into a decent mic in a quiet room → medium is honestly fine. Save the compute.
You're on an M1 Air and dictate a lot → large-v3-turbo is the practical sweet spot. Skip full large-v3 unless you have a Pro chip.
You need highest accuracy on legal, medical, or technical terms → large-v3 wins, and consider pairing it with a post-processing mode that formats and structures the output.
You're transcribing podcasts or meetings in batch → large-v3 on a Pro/Max chip, or offload to a cloud pass if you'd rather skip the wait.
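
The checklist above can be sketched as a small function. The field names and the order in which the rules are checked are my own simplification, so treat the result as a starting point and not a verdict.

```python
from dataclasses import dataclass


@dataclass
class DictationProfile:
    """Hypothetical description of one user's setup; all field names are illustrative."""
    accented_english: bool = False
    languages_spoken: int = 1
    quiet_room: bool = True
    pro_or_max_chip: bool = False
    needs_specialist_terms: bool = False  # legal, medical, or technical vocabulary
    batch_long_form: bool = False         # podcasts, meetings, lectures


def recommend_model(profile: DictationProfile) -> str:
    """Map a profile to 'medium', 'large-v3-turbo', or 'large-v3'."""
    if profile.needs_specialist_terms:
        return "large-v3"  # highest accuracy wins regardless of speed

    outside_comfort_zone = (
        profile.accented_english
        or profile.languages_spoken >= 2
        or not profile.quiet_room
        or profile.batch_long_form
    )
    if outside_comfort_zone:
        # Full large-v3 only where the hardware keeps the wait tolerable.
        return "large-v3" if profile.pro_or_max_chip else "large-v3-turbo"

    return "medium"  # native speaker, decent mic, quiet room
```

For example, `recommend_model(DictationProfile(accented_english=True))` returns `"large-v3-turbo"`, matching the M1 Air advice, while the default profile returns `"medium"`.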

The meta-point: OpenAI's published per-language data confirms that larger is more accurate, especially as you leave the high-resource-English comfort zone. Beyond that, the choice is about your hardware, your accent, and how much cleanup work you're willing to do. There's no shame in any of the three options — there is shame in guessing and shipping a transcript full of errors you didn't notice.

[Figure: Mac Neural Engine running Whisper large-v3 locally for accent accuracy comparison]

Want to test both on your own voice? MetaWhisp's local mode is free, unlimited, and never uploads your audio — you can try the turbo model and the full large-v3 back-to-back and see what works for your accent. No account, no API key, no cloud bill.

FAQ

Is large-v3 worth it for accented English?

Based on OpenAI's published FLEURS and VoxPopuli numbers, large-v3 is more accurate than medium across the tested languages, and the gap tends to widen where the model saw less training data. Accents push in the same direction. OpenAI publishes no per-accent scores, so this is an inference, but a well-supported one: large-v3 (or the faster large-v3-turbo) is the better pick for accented English.

What's the actual WER gap between large-v3 and medium?

OpenAI's published tables show large-v3 beating medium by a meaningful margin on English, and by larger margins on lower-resource languages. The exact percentages vary by benchmark (FLEURS vs Common Voice vs VoxPopuli) and by language. Check the large-v3 model card and the original paper for the numbers that match your situation.

Does medium handle non-native English well enough?

It depends on the accent and the recording. Medium is not unusable for non-native English, but the WER is visibly higher than large-v3 in the published comparisons. If clean transcripts matter for your work, the upgrade is worth it. For casual voice notes, medium may be fine.

Can large-v3 run locally on an M1 or M2 Mac?

Yes, via WhisperKit. On an M1 Air it's noticeably slower than medium: expect a 60-second clip to take a couple of minutes with full large-v3. An M2 Pro or M3 Pro cuts that significantly, and M4 Pro and Max chips feel near real-time for short clips. For everyday dictation on an M1, large-v3-turbo is the more comfortable choice.

Is large-v3-turbo as accurate as the full large-v3?

Close, but not identical. OpenAI's model card and independent evaluations put turbo within roughly 1 percentage point WER of the full model on most languages, with the gap varying by language and audio type. For most everyday dictation, the difference is invisible.

Why does medium hallucinate more on silence than large-v3?

Hallucination on silence is a known Whisper behavior across all model sizes: the decoder is trained to produce text for whatever audio it is given, so silence or background hum can still trigger a plausible-sounding sentence. In practice the larger models seem better at recognizing that a stretch contains no speech, so they produce fewer phantom sentences. It's not eliminated, just rarer.

Which Whisper model handles code-switching best?

Large-v3. OpenAI's multilingual training means larger models generally handle mid-sentence language switches better. If you regularly mix English with another language in the same sentence, avoid medium.

Does large-v3 need a GPU to be useful?

No. With the right runtime (WhisperKit, whisper.cpp, or the official Python package on MPS), large-v3 runs fine on Apple Silicon laptops and even on modest CPUs, just slower than on a discrete GPU. For real-time dictation on a laptop, large-v3-turbo is usually the better target.

Where can I see the original per-language WER numbers?

OpenAI publishes them in two places: the original paper "Robust Speech Recognition via Large-Scale Weak Supervision" on arXiv, and the model cards on Hugging Face for large-v3 and medium. Those tables are the only first-party per-language numbers you should trust.

About the author: Andrew Dyuzhov is the solo founder of MetaWhisp, a free on-device voice-to-text app for macOS. He's a marketer and builder with ADHD who assembled MetaWhisp with AI coding tools on top of OpenAI's open-source Whisper. He dictates daily in Russian and English, runs the 7-app head-to-head accuracy test on his own audio, and uses voice-first workflows to get past writing paralysis. Follow him on X.
