✂️🎙️
4 Ways to Strip Fillers from Whisper
Built-in app mode: MetaWhisp Clean
AI post-processing: GPT-4 / Claude
Scripted regex: sed / Python
Manual: Word find-replace
TL;DR: Four practical ways to remove filler words (um, uh, like, you know, I mean, basically, literally) from Whisper transcripts on Mac in 2026. (1) Use a voice-to-text app with built-in filler removal — MetaWhisp's Clean mode does this automatically with your own OpenAI key for pennies per day. (2) Run the transcript through ChatGPT or Claude with a cleanup prompt. (3) Apply a regex script for bulk processing. (4) Use Word's find-and-replace for one-off cleanup. Method 1 (Clean mode at dictation time) is the lowest-friction option for most users; Method 2 is best for editing existing transcripts; Method 3 is best for batch processing.
Whisper transcript filler word removal diagram showing before and after cleanup pipeline for Mac voice-to-text

Why Does Whisper Keep Filler Words in Transcripts?

OpenAI's Whisper is designed for verbatim accuracy. The model transcribes what you actually said, including hesitations, restarts, and filler words. This is a deliberate design choice in the architecture: removing fillers would require post-processing logic that introduces accuracy trade-offs and language-specific tuning, per OpenAI's Whisper repository documentation. For some use cases (legal transcription, medical dictation, journalism quotes), verbatim output is exactly what you want — every "um" is part of the record. For most other use cases (emails, Slack messages, blog drafts, meeting notes), fillers are noise that needs to come out before the transcript is useful. The common filler words in English: I'm Andrew Dyuzhov, solo founder of MetaWhisp. I built MetaWhisp's Clean mode specifically to handle filler removal for users who want polished transcripts without manual cleanup. This guide covers the Clean mode approach plus three alternative methods for cleaning up Whisper output from any voice-to-text app.
The reason Whisper outputs fillers verbatim, while older speech-recognition systems like Google Cloud Speech-to-Text sometimes auto-strip them, is architectural. Whisper is an end-to-end encoder-decoder transformer trained on raw audio paired with text transcripts from the public internet — much of that text came from professional captioning where fillers were retained for accuracy. Per the original Whisper paper from Radford et al., the training objective rewards the model for matching the source transcript exactly, including any "um" or "uh" that the human captioner chose to preserve. Older ASR systems used phoneme-level decoding with rule-based language models that could be configured to skip non-content words, but Whisper's neural decoder doesn't have an explicit filler-skipping layer. Post-processing is the only way to remove fillers from Whisper output. The trade-off is worth it because Whisper's overall transcription accuracy substantially exceeds the older systems, even with the filler-retention quirk.

Method 1: MetaWhisp Clean Mode (Automatic at Dictation Time)

The lowest-friction option is using a voice-to-text app that handles filler removal automatically as part of the transcription pipeline. MetaWhisp's Clean mode does this — your audio runs through Whisper for accurate transcription, then through a lightweight GPT pass that strips fillers and fixes grammar without changing your voice. Setup:
  1. Download MetaWhisp (free, requires Apple Silicon M1+)
  2. Open MetaWhisp Settings → Processing Modes
  3. Select Clean mode
  4. For free tier: enter your OpenAI API key (costs roughly $0.01-0.05/day of normal use). For Pro tier: no API key needed.
  5. Press Right Option, speak naturally with fillers, release
  6. The text that pastes into your active app has fillers removed automatically
Before (Raw mode output):
"so um I was thinking like we should probably uh move the deadline you know to next Friday because um the design team needs like more time to you know finish things"
After (Clean mode output):
"I was thinking we should move the deadline to next Friday because the design team needs more time to finish."
The key property of Clean mode: it preserves your original meaning and phrasing but removes fillers, fixes grammar, and adds punctuation. It does NOT restructure sentences or upgrade vocabulary — that's what Rewrite mode does. Clean is the right setting when you want polished text that still sounds like you.
Pro tip: Set MetaWhisp's default mode to Clean for daily dictation. Switch to Raw mode only for specific use cases (meeting notes, journaling, legal transcription) where verbatim output matters. This way you get filler-free output automatically without having to remember to enable it per-recording.

Method 2: AI Post-Processing with ChatGPT or Claude

If you already have a Whisper transcript from another app (MacWhisper file transcription, Wispr Flow, raw whisper.cpp, Word M365 Transcribe), the easiest cleanup path is pasting it into ChatGPT or Claude with a targeted prompt. The prompt that works well:
You are a transcript editor. Below is a verbatim voice transcript that contains
filler words and hesitations. Remove fillers (um, uh, like, you know, I mean,
basically, literally, actually as fillers), fix grammar, add proper punctuation,
and capitalize sentence starts. PRESERVE the original meaning, vocabulary, and
sentence structure. Do NOT rewrite, paraphrase, or upgrade word choice. Just
clean up.

Transcript:
[paste your transcript here]
This works in: Cost: Effectively free on free tiers for occasional use. On API access, roughly $0.001-0.005 per 1,000-word transcript via GPT-4o-mini or Claude Haiku. For high-volume use (50+ transcripts per day), API access via scripting is cheaper than the consumer Plus subscription. Strengths: Highest output quality among automatic methods. Handles edge cases (regional discourse markers, multilingual fillers, domain-specific verbal habits). Weaknesses: Requires copy-paste step. Manual per-transcript. Doesn't scale to batch processing without API scripting.
AI post-processing workflow diagram showing raw Whisper transcript flowing through ChatGPT or Claude with cleanup prompt producing filler-free transcript on Mac

Method 3: Regex Script for Bulk Processing

For batch processing dozens or hundreds of transcripts, AI APIs are overkill and slow. A regex-based script handles 90% of filler removal at near-zero cost and millisecond latency. Python script:
import re

FILLERS = [
    r'\b(um+|uh+|hmm+|er+|ah+|mhm+)\b',                # hesitations
    r'\b(like|you know|I mean|sort of|kind of)\b',      # discourse particles
    r'\b(basically|literally|actually|honestly)\b',     # intensifiers (use with care)
    r'\b(\w+)-\1\b',                                    # word repetitions: "the the"
]

def remove_fillers(text):
    for pattern in FILLERS:
        text = re.sub(pattern, '', text, flags=re.IGNORECASE)
    # Collapse multiple spaces
    text = re.sub(r'\s+', ' ', text)
    # Fix punctuation spacing
    text = re.sub(r'\s+([.,!?])', r'\1', text)
    return text.strip()

# Process a transcript file
with open('transcript.txt') as f:
    raw = f.read()

clean = remove_fillers(raw)

with open('transcript-clean.txt', 'w') as f:
    f.write(clean)
Bash one-liner for quick cleanup:
sed -E 's/\b(um|uh|like|you know|I mean|basically|literally)\b//gi' transcript.txt
Strengths: Zero cost. Millisecond processing. Easily scriptable for batch jobs (`for file in *.txt; do ... done`). Weaknesses: Doesn't handle context. Removes "basically" even when the word is being used legitimately ("the basically free option"). Doesn't fix grammar. Doesn't restructure restart phrases.
The regex approach has a known false-positive problem with intensifiers like "basically", "literally", "actually", and "honestly". These words are sometimes fillers and sometimes meaningful — "the literally last option" is a valid use of "literally", but "literally I just" is a filler use. Regex can't distinguish context. The pragmatic solution: keep the regex pattern aggressive for clear fillers (um, uh, hmm) and conservative for intensifiers. Move the intensifier removal to the AI post-processing pass (Method 2), which understands context. For batch processing with mixed needs, run the regex first for the cheap clear-filler wins, then send the partially-cleaned output through AI for the context-sensitive intensifiers. This two-stage pipeline is dramatically faster than running everything through AI while still handling the edge cases. The cost reduction matters at scale: for a podcaster transcribing 5 hours of audio per week, regex pre-pass cuts API costs by roughly 70% versus pure-AI cleanup, with comparable final output quality on the intensifier cases.

Method 4: Manual Cleanup in Word or BBEdit

For one-off transcripts where you want full control, manual cleanup with find-and-replace in any text editor works. This is the fallback when other methods produce edge cases you need to fix anyway. In Microsoft Word:
  1. Press Cmd+F to open the Find pane
  2. Click the gear icon → Advanced Find & Replace
  3. Enable Use wildcards in the search options
  4. Find: \b(um|uh|hmm|er|ah)\b → Replace: (empty) → Replace All
  5. Repeat with discourse particles: \b(like|you know|I mean)\b
  6. Manual pass for remaining edge cases
In BBEdit (Mac code editor, free for basic features):
  1. Cmd+F for Find
  2. Enable Grep for regex search
  3. Search pattern: \b(um|uh|hmm|like|you know|I mean)\b
  4. Replace with empty string
  5. Use Replace All
Manual cleanup is slowest but most precise. For high-stakes transcripts (legal depositions, published interviews, broadcast scripts) where every word choice matters, manual is the only safe option — AI cleanup can introduce subtle interpretation changes that automated methods don't flag.

What Counts as a Filler Word in Different Contexts?

The same word can be filler in one context and content in another. Context-sensitive filler removal:
WordFiller useContent use
like"It was, like, really cold""I like coffee"
basically"Basically, I mean, you know""This is basically a regex engine"
literally"I literally just woke up""Translated literally from French"
actually"It's, actually, kind of fine""That's actually wrong; the correct answer is X"
honestly"Honestly, I don't, you know""He answered honestly about the bug"
you know"It was, you know, complicated""You know my brother, right?"
AI cleanup (Method 2) handles this distinction well because context-sensitive language understanding is the LLM's core capability. Regex (Method 3) struggles. Manual cleanup (Method 4) is most accurate but slowest.
For high-stakes content where the distinction between filler and content matters — published journalism, academic writing, legal briefs — the safest workflow is two-pass: run AI cleanup first to get 95% of fillers, then manual review to verify the remaining ambiguous cases. The AI pass alone may strip a meaningful "literally" or "actually" that changes the sentence meaning. Manual-only is too slow for transcripts longer than a few hundred words. The hybrid approach catches both the high-frequency cheap wins (clear fillers) via AI and the low-frequency expensive risks (context-sensitive intensifiers) via human review. This pattern applies whether you're using Word's M365 Transcribe as the source or any other Whisper-based output. For low-stakes routine writing (emails, Slack messages, internal docs), pure AI cleanup is fine because the cost of a rare false-positive removal is low — you'll catch it on read-through. The hybrid two-pass approach pays off when content has legal, journalistic, or academic stakes that make false positives expensive.

How Many Filler Words Does the Average Person Use?

Linguistics research suggests filler frequency varies widely by speaker, context, and language. Per Wikipedia's overview of filler words in linguistics, English speakers use fillers at rates of roughly 2-5% of total spoken words in casual conversation, dropping to 1-2% in formal presentations and rising to 5-8% in spontaneous unprepared speech. Practical implication: a 1,000-word raw Whisper transcript of casual conversation contains 20-50 filler words. A presentation transcript contains 10-20. An unprepared brainstorm transcript can contain 50-80. The cleanup workload scales accordingly. For voice-to-text users who dictate routine work content (Slack, emails, notes), the filler rate tends to land around 3-4% — which means roughly 30-40 fillers per 1,000 words of dictation. Removing them by hand takes 2-3 minutes per 1,000 words; automatically via Clean mode or AI takes seconds.
Filler word frequency by speaking context chart showing percentages for formal presentation casual conversation brainstorm anxious speech Mac voice-to-text

What's the Difference Between Filler Removal and Rewrite Mode?

Two distinct operations that often get confused: Compare:
Original (Raw): "so um I was thinking like we should probably uh move the deadline you know to next Friday because um the design team needs like more time"

Clean mode: "I was thinking we should move the deadline to next Friday because the design team needs more time."

Rewrite mode: "I'd like to propose extending the deadline to next Friday. The design team requires additional time to deliver quality work."
Clean mode keeps your conversational voice ("I was thinking we should..."). Rewrite mode makes it more formal ("I'd like to propose..."). Use Clean for Slack messages and casual emails; use Rewrite for client communication and published documentation.

How Do Other Voice-to-Text Apps Handle Filler Removal?

A quick survey of how the major Mac voice-to-text apps approach filler word removal in 2026: The pattern: apps that ship with built-in cleanup as a default mode are dramatically easier to use day-to-day than apps requiring manual post-processing. The friction of "transcribe, then open ChatGPT, then paste, then prompt, then copy back" is enough to make most users skip cleanup entirely on routine writing. This is why MetaWhisp's Clean mode runs the GPT pass automatically in the same hotkey-to-text pipeline — users never see the cleanup step, they just see clean transcripts appearing where their cursor was.
Mac voice-to-text apps filler removal comparison matrix showing MetaWhisp Wispr Flow SuperWhisper Otter MacWhisper Apple Dictation built-in cleanup capabilities

Does Filler Removal Hurt Transcript Accuracy?

Filler removal is a post-processing step that doesn't affect the underlying Whisper transcription accuracy. The model still hears "um" and outputs it; the cleanup pass then removes it. The Whisper word error rate (5-7% on clean English per OpenAI's model card) measures core transcription accuracy, separate from filler retention. What can hurt accuracy is over-aggressive filler removal that strips meaningful words. The intensifier problem (literally, actually, basically as both filler and content) is the most common source of accidental content loss. The fixes: For most users running MetaWhisp Clean mode or AI post-processing, the false-positive rate on intensifiers is low enough that the cleanup is net-positive without manual review. For high-stakes content, build review into your workflow. The general principle: filler removal is one of the cheapest improvements you can make to a Whisper transcript, with the most consistent reader-experience payoff. A cleaned 1,000-word transcript reads roughly 30-40% faster than the raw verbatim version, which translates directly to time saved for any downstream reader.

Frequently Asked Questions About Filler Word Removal

How do I automatically remove filler words from Whisper transcripts?

Use a voice-to-text app with built-in cleanup like MetaWhisp's Clean mode. It runs your audio through Whisper for accurate transcription, then through a lightweight GPT pass that removes fillers (um, uh, like, you know) and fixes grammar without changing your voice. Setup takes 5 minutes. Cost is roughly $0.01-0.05 per day on free tier with your own OpenAI key, or included in Pro tier.

Why does Whisper keep "um" and "uh" in transcripts?

Whisper is designed for verbatim accuracy. The training data (680,000 hours of audio with paired transcripts) preserved fillers because professional captioning retains them for accuracy. Whisper's neural decoder doesn't have an explicit filler-skipping layer like older rule-based ASR systems. Post-processing is the only way to remove fillers from Whisper output, which is what Clean mode and AI cleanup do.

What's the cheapest way to clean up Whisper transcripts?

For occasional cleanup: paste the transcript into free ChatGPT or Claude with a cleanup prompt. Zero cost for most users. For high-volume: write a regex script in Python or sed — millisecond processing, zero cost. For ongoing daily dictation: MetaWhisp Clean mode with your own OpenAI API key costs roughly $1-1.50 per month for typical use.

Should I use regex or AI for filler removal?

Use regex for clear hesitation markers (um, uh, hmm, er, ah) — fast, free, no false positives. Use AI (ChatGPT, Claude, or MetaWhisp Clean mode) for context-sensitive intensifiers (literally, actually, basically) where the word might be filler or content. Two-pass workflow combines both: regex first for cheap wins, AI second for ambiguous cases.

Does filler removal change my voice in the transcript?

Clean mode and similar "filler removal only" methods preserve your voice — they remove hesitations and fix grammar but keep your original sentence structure and vocabulary. Rewrite mode (a different operation) does change your voice by restructuring sentences and upgrading vocabulary. Use Clean for casual writing where authenticity matters; use Rewrite for formal client communication where polish matters more than voice.

Can I batch-process many Whisper transcripts at once?

Yes. For batch processing, use a regex script in Python or sed that loops over your transcript files. Each file processes in milliseconds. For higher quality batch processing with context awareness, write a small Python script that calls the OpenAI API or Anthropic API with the cleanup prompt for each file. Cost via API is roughly $0.001-0.005 per 1,000-word transcript.

What about filler words in other languages?

Each language has its own fillers. Spanish: "eh", "este", "o sea". French: "euh", "ben", "tu vois". German: "ähm", "halt", "äh". Russian: "ну", "это", "как бы". Whisper preserves these too. AI cleanup with GPT or Claude handles non-English fillers because the LLMs are multilingual. Regex needs language-specific patterns. MetaWhisp's Clean mode handles fillers for 99 Whisper-supported languages via its GPT post-processing.

How accurate is automatic filler removal?

For clear hesitation markers (um, uh, hmm), automatic removal is essentially 100% accurate — these are never content words. For discourse particles (like, you know), automatic removal hits 90-95% accuracy depending on the method. For intensifiers (literally, actually, basically), accuracy drops to 80-90% because context determines whether the word is filler or content. AI-based methods score higher on intensifiers than regex-based methods.

About the Author

Andrew Dyuzhov is the solo founder and CEO of MetaWhisp, a free on-device voice-to-text app for macOS that runs Whisper large-v3-turbo on Apple Neural Engine. He built MetaWhisp's Clean mode and Rewrite mode to handle filler removal and text polish for users who want dictated content that doesn't sound dictated. The four methods in this article reflect his hands-on experience tuning the Clean mode prompts, testing regex scripts against real Whisper transcripts, and working with users who needed transcript cleanup for journalism, legal, and academic workflows. Connect on X or GitHub.

Related Reading