Zoom audio → BlackHole → pyannote.audio + Whisper → diarized transcript

Yes, speaker-diarized Zoom transcription on Mac is doable locally, but only as a DIY project combining pyannote.audio for "who spoke when" with OpenAI Whisper for "what was said," typically wired together by WhisperX. MetaWhisp does not ship speaker labels today, and this post walks through what the real alternatives are for each kind of user.

Can you transcribe Zoom calls locally on a Mac with speaker labels?

Yes, but only as a DIY project. The two real open-source pieces are pyannote.audio for speaker diarization and OpenAI Whisper for transcription, typically glued together by WhisperX. You also need a way to capture Zoom's audio locally: BlackHole, Loopback, or a system audio route. None of this is a one-click install, and MetaWhisp does not ship speaker labels in its current build.
The rest of this post walks through what that DIY stack actually is, what running it on Apple Silicon looks like in practice, and the honest tradeoffs. I will also say plainly where MetaWhisp fits and where it does not, because pretending we already do this would be a worse disservice than admitting we do not. If you only need a transcript without speaker labels, MetaWhisp's free local mode handles that today; see the Zoom transcription on Mac guide. The rest of this article is for the people who specifically need lines like "Alex: I'll handle the API" in the output.

What speaker diarization actually is

Speaker diarization is the "who spoke when" half of the problem. It is a separate discipline from speech recognition, which is the "what was said" half. Whisper does not do diarization at all, by design: it is an ASR system, not a diarization system, and treating it like one is the single most common reason DIY pipelines fail.
Speaker diarization is the "who spoke when" half of the problem. It is distinct from speech recognition, which is the "what was said" half. They get conflated constantly, and that confusion is the single biggest reason people get stuck.
Whisper, on its own, does not produce speaker labels. It produces a flat stream of words with optional timestamps per segment. If two people take quick turns, Whisper will not tell you who said what. The OpenAI paper "Robust Speech Recognition via Large-Scale Weak Supervision" makes this clear: it is an ASR system, not a diarization system. (See github.com/openai/whisper for the model card and paper link.)
To go from "this is what was said" to "Alex said this, then Jamie said that," you bolt on a separate model that segments the audio by speaker and aligns those segments to the transcript. That is what pyannote.audio does.
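To make the "align segments to the transcript" step concrete, here is a minimal sketch of one common approach: give each transcript segment the speaker whose diarization turns overlap it the most. The `Turn` and `Segment` classes and the `label_segments` helper are hypothetical names invented for this illustration; they are not part of pyannote.audio, Whisper, or WhisperX, and real pipelines usually do this at word level rather than segment level.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One diarization turn: a speaker talking from start to end (seconds)."""
    start: float
    end: float
    speaker: str


@dataclass
class Segment:
    """One transcript segment: text spoken from start to end (seconds)."""
    start: float
    end: float
    text: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, turns, unknown="UNKNOWN"):
    """Pair each segment with the speaker whose turns overlap it the most."""
    labeled = []
    for seg in segments:
        totals = {}
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > 0:
                totals[turn.speaker] = totals.get(turn.speaker, 0.0) + shared
        # Segments that overlap no turn at all keep a placeholder label.
        speaker = max(totals, key=totals.get) if totals else unknown
        labeled.append((speaker, seg.text))
    return labeled


turns = [Turn(0.0, 4.0, "SPEAKER_00"), Turn(4.0, 9.0, "SPEAKER_01")]
segments = [
    Segment(0.2, 3.8, "I'll handle the API."),
    Segment(3.9, 8.5, "Great, I'll take the docs."),
]
for speaker, text in label_segments(segments, turns):
    print(f"{speaker}: {text}")
```

The labels stay generic (`SPEAKER_00`, `SPEAKER_01`); turning them into "Alex" and "Jamie" is a manual mapping you apply afterwards.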

The honest DIY stack: pyannote.audio + Whisper on Apple Silicon

The community-standard open-source stack for this is three pieces, in this order: a virtual audio driver (BlackHole) to capture Zoom audio, pyannote.audio for diarization, and WhisperX as the wrapper that aligns pyannote's segments to Whisper's words. All three are real, free, public open-source projects, but "free" here means free as in money, not free as in time.
The community-standard open-source stack for this is three pieces, in this order:
1. Audio capture. A virtual audio driver on macOS that routes Zoom's output to a file. BlackHole is the popular free option; Loopback by Rogue Amoeba is the paid one. Both are real projects, both have public docs, both work.
2. Diarization. pyannote.audio, the open-source library maintained by Hervé Bredin and the pyannoteAI team. Their pretrained pipeline on Hugging Face takes a WAV file and returns speaker turns (start time, end time, speaker label) that you can save as an RTTM file. It is a real public project; read the repo's README, not a third-party summary, before you commit to a version.
3. Transcription + alignment. Either raw OpenAI Whisper, or WhisperX, a wrapper project that runs Whisper, refines the output to word-level timestamps with a forced-alignment model, and then assigns those words to pyannote's speaker segments. WhisperX is what most people actually use, because matching "Speaker 2" to the right words is finicky with vanilla Whisper.
You can do this on Apple Silicon. pyannote's pipeline is PyTorch-based and runs on CPU or MPS. Whisper runs on CPU, MPS, or, in our case at MetaWhisp, on the Apple Neural Engine via WhisperKit. Neither path needs an NVIDIA GPU.
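If you save the diarization output as RTTM, reading it back is simple. The sketch below assumes the standard RTTM layout, where each `SPEAKER` line carries an onset and a duration (not an end time) in the fourth and fifth fields and the speaker label in the eighth. `parse_rttm` is a hypothetical helper written for this post, not a pyannote function, and the sample lines are made up.

```python
def parse_rttm(text):
    """Turn RTTM text into a sorted list of (start, end, speaker) tuples."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and any record that is not a speaker turn.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        onset = float(fields[3])
        duration = float(fields[4])
        # RTTM stores onset + duration, so the end time is derived here.
        turns.append((onset, onset + duration, fields[7]))
    return sorted(turns)


sample = """\
SPEAKER call 1 0.00 4.00 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER call 1 4.00 5.00 <NA> <NA> SPEAKER_01 <NA> <NA>
"""
for start, end, speaker in parse_rttm(sample):
    print(f"{start:6.2f} {end:6.2f} {speaker}")
```

Converting to explicit start and end times up front makes the later overlap arithmetic against Whisper's timestamps less error-prone.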

What it takes to get this running on macOS

Plan on a half-day if you already have Python and a Hugging Face account, and more like a weekend if you do not. The "free" part stops at the software license. You also need disk space (the model weights are several gigabytes), a Hugging Face access token, and the audio file in 16 kHz mono WAV for the best CPU behavior.
The "free" part stops at the software license. In time, expect a half-day if you already have Python and a Hugging Face account, and more like a weekend if you do not. Specifically:
- Python 3.10+ and a working PyTorch install on your Mac.
- A Hugging Face account and a read-access token, because pyannote's gated models require you to accept the model's terms on the model card before downloading. This is a manual step people miss.
- Disk space. pyannote's pipeline, the Whisper large-v3 model, and the alignment model together are several gigabytes.
- The audio file from your Zoom call, captured as 16 kHz mono WAV. If you record in stereo 48 kHz, pyannote will still work, but you will save a lot of CPU by downsampling.
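A short preflight check can catch the most commonly missed items on that list before you start a long download. This is a sketch, not an official installer check: `preflight` is a hypothetical helper, and `HF_TOKEN` is an assumed environment variable name for your Hugging Face read token. It cannot tell whether you have accepted the gated model's terms on the model card; that step remains manual.

```python
import os
import shutil
import sys


def preflight(version_info=None, env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    version_info = sys.version_info if version_info is None else version_info
    env = os.environ if env is None else env
    problems = []
    if tuple(version_info[:2]) < (3, 10):
        problems.append("Python 3.10 or newer is required")
    # HF_TOKEN is an assumed variable name for the Hugging Face read token.
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set")
    # ffmpeg is used later to downsample the recording.
    if which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    return problems
```

Calling `preflight()` with no arguments inspects the current interpreter and environment; the parameters exist only so the checks can be exercised without touching your real setup.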
Pro tip: Downsample to 16 kHz mono before you run the pipeline, not after. pyannote's diarization quality is unchanged at 16 kHz, and the CPU cost roughly halves. A one-line ffmpeg -i input.wav -ac 1 -ar 16000 output.wav saves you a real amount of waiting.
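If you process recordings in a script, you can check whether a file already is 16 kHz mono and only then call ffmpeg. The sketch below uses Python's standard `wave` module, which reads plain PCM WAV files and may reject other WAV encodings; `needs_downsample` and `downsample_command` are hypothetical helper names, and the ffmpeg arguments are the ones from the tip above.

```python
import wave


def needs_downsample(path, rate=16000, channels=1):
    """True if the WAV at `path` is not already `rate` Hz with `channels` channels."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() != rate or wav.getnchannels() != channels


def downsample_command(src, dst):
    """Argument list for the ffmpeg call from the tip: mono, 16 kHz output."""
    return ["ffmpeg", "-i", src, "-ac", "1", "-ar", "16000", dst]
```

To run the conversion, pass the list to `subprocess.run(downsample_command("input.wav", "output.wav"), check=True)`; using an argument list rather than a shell string avoids quoting problems with file names that contain spaces.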
If you are not comfortable with pip install, Hugging Face tokens, and the occasional ffmpeg invocation, the next section is your real starting point.

How to route Zoom audio into a local recorder

You need Zoom's audio to leave the app, go into a "device" your recorder can see, and end up in a file. On macOS that means a virtual audio device. The free path is BlackHole; the paid path is Loopback. Either way, the audio never leaves your Mac.
This is the part most guides skip and most people get wrong. You need Zoom's audio to come out of the app, go into a "device" that your recorder can see, and then into a file. On macOS, that means a virtual audio device.
The free path: install BlackHole, create a Multi-Output Device in Audio MIDI Setup (a built-in macOS app) that sends Zoom's audio to both your speakers and BlackHole, then set the recorder's input to BlackHole. The result is that you hear the call normally and your recorder also captures it.
The paid path: Rogue Amoeba's Loopback does the same thing with a friendlier UI and per-app routing. Check their site for current pricing; I have not verified the latest dollar figure, so go straight to the vendor page rather than trust a number from this blog post.
Either way, hit Record in your capture app, run the Zoom meeting, stop recording, then feed the WAV into your pyannote + WhisperX script. The whole meeting becomes a file you control. No Zoom bot, no third-party cloud, no "this transcript will be deleted in 30 days."
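If you would rather script the recording than click Record in a GUI app, ffmpeg's avfoundation input on macOS can read from an audio device directly. This is a swapped-in alternative to the capture app described above, not something the routing setup requires. The sketch assumes the virtual device shows up under the name "BlackHole 2ch"; confirm the name on your machine with `ffmpeg -f avfoundation -list_devices true -i ""`. `capture_command` is a hypothetical helper.

```python
def capture_command(out_path, device="BlackHole 2ch", seconds=None):
    """Build an ffmpeg argument list that records an audio device to a WAV file.

    The device name is an assumption; list your devices with ffmpeg first.
    """
    # The avfoundation input string is "video:audio"; an empty video part
    # before the colon means audio only.
    cmd = ["ffmpeg", "-f", "avfoundation", "-i", f":{device}"]
    if seconds is not None:
        # Stop automatically after this many seconds.
        cmd += ["-t", str(seconds)]
    # Write 16 kHz mono straight away so no later downsampling is needed.
    cmd += ["-ac", "1", "-ar", "16000", out_path]
    return cmd


print(" ".join(capture_command("meeting.wav", seconds=3600)))
```

Without `seconds`, ffmpeg keeps recording until you stop it by pressing `q` in the terminal.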

Why MetaWhisp doesn't ship speaker diarization (yet)

I get asked about this every week. Here is the honest version, founder-to-user.
Diarization is a separate model family from ASR. We ship Whisper large-v3-turbo on the Neural Engine via WhisperKit; see our on-device transcription page for the architecture. Adding a pyannote-quality diarization layer means another model pipeline, another set of model weights, and another set of failure modes: off-by-one word alignment, overlapping speech, more than four speakers in a room, kids and dogs on the call. I do not want to ship it until it works at the level our users expect from a paid app.
It is on the roadmap. I am not putting a date on it. The way I make product decisions is: does this work in my own weekly use, on a real Mac, on real audio, at least 19 times out of 20. We are not there yet for diarization.
What we ship today is plain transcription with high accuracy. Our own LibriSpeech test-clean run came in at 2.76% WER (97% accuracy). That is the only first-party number I will publish, because it is the only one I can point you to a real run of.
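For readers unfamiliar with the metric: word error rate is the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words, and "accuracy" here is simply 1 minus WER (1 - 0.0276 = 0.9724, which rounds to 97%). The sketch below is a textbook edit-distance implementation for illustration only. It is not the script behind the MetaWhisp number, and real evaluations normally normalize case and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("i will handle the api", "i will handle an api")
print(f"WER {wer:.2%}, word accuracy {1 - wer:.2%}")
```

Note that WER says nothing about speaker labels; diarization quality is measured separately.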

What MetaWhisp actually does for meeting audio

If you have a Zoom recording saved as an audio or video file, you can transcribe it in MetaWhisp's local mode for free. Drop the file in, pick the language (or use auto-detect across 99 supported languages), and you get a clean transcript on your Mac. No upload, no account, no time cap. Audio never leaves the machine.
The same is true for live dictation while the meeting is in progress: hold the global hotkey, talk, paste into Notes, Mail, Slack, wherever. That workflow is what I personally use for my own calls. For meeting workflows specifically, the most useful processing modes are: None of these modes identify the speaker. They are not magic. They are honest text transformation that runs on your own hardware.

Honest comparison: four real ways to label speakers on Mac

Here is the comparison I wish someone had shown me before I spent a weekend on the DIY path. Every cell is a public fact from the cited source, a dash means not published, and I have not invented a number to fill a gap.
| Approach | Speaker labels | Where audio goes | Setup effort | Recurring cost |
| --- | --- | --- | --- | --- |
| DIY pyannote + Whisper | Yes, generic IDs | Your Mac | High (Python, models, ffmpeg) | Free |
| WhisperX (wrapper) | Yes, generic IDs | Your Mac | Medium (one repo install) | Free |
| Zoom built-in transcript | "Speaker 1, Speaker 2…" (no real names) | Zoom cloud | None | Requires a paid Zoom plan (see Zoom's docs) |
| MetaWhisp (current build) | No | Your Mac in local mode | Low (download, install, dictate) | Free local; Pro $30/year for cloud polish |
A note on Zoom's built-in transcript: it does label speakers, but as generic numbers, and the transcript lives on Zoom's servers. For many people that is fine. For people who need the audio to stay on their laptop, it is a non-starter.

Picking the right approach for your situation

If you need speaker labels and you do not mind a weekend project, the DIY path is the only fully-local option that actually exists in 2026. If you just need a meeting transcript without speaker labels, MetaWhisp's local mode is the fastest path. If you need labels but do not have time to DIY, there is no fully-local one-click app for that on Mac today; you pay a third party with your money, or you pay with your time.
If you need speaker labels and you do not mind a weekend project, the DIY path is the only fully-local option that actually exists in 2026. The wrapper that will save you the most time is WhisperX, not raw pyannote + Whisper, because the alignment is what kills you otherwise. Use BlackHole for audio, not Loopback, until you have a reason to pay.
If you need a meeting transcript and you do not actually need speaker labels, the meeting transcription without a bot workflow gives you a clean local file in minutes, no Python required. That is what I use for most of my own calls.
If you need a transcript with speaker labels and you do not have the time to DIY, the honest answer is: there is no fully-local one-click app for that on Mac today, in this build of the ecosystem. You either pay a third party with your money or your time. That is the state of things and I would rather tell you that plainly than pretend otherwise.
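For the WhisperX route, the run itself usually comes down to one command. The sketch below only builds that command as an argument list so you can inspect it before running anything. The flag names (`--model`, `--diarize`, `--hf_token`, `--device`, `--compute_type`, `--output_dir`) follow the WhisperX README at the time of writing and should be checked against `whisperx --help` for your installed version. The CPU device with int8 compute type is an assumption about what works on a Mac, and `whisperx_command` is a hypothetical helper.

```python
def whisperx_command(audio_path, hf_token, output_dir="transcripts",
                     model="large-v3"):
    """Build a WhisperX CLI call that transcribes and diarizes one file.

    Flag names are taken from the WhisperX README; verify them locally.
    """
    return [
        "whisperx", audio_path,
        "--model", model,
        "--diarize",                # run pyannote and attach speaker labels
        "--hf_token", hf_token,     # needed to download the gated models
        "--device", "cpu",          # assumption: CPU backend on a Mac
        "--compute_type", "int8",   # assumption: int8 for CPU inference
        "--output_dir", output_dir,
    ]


print(" ".join(whisperx_command("meeting.wav", "<your-token>")))
```

Passing the token on the command line makes it visible in your shell history, so treat it like a password and rotate it if it leaks.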

MetaWhisp for Mac: free, on-device, no account.

Free download for macOS 14+

Frequently asked questions

❓

Does MetaWhisp do speaker diarization?

No. The current build transcribes audio to text with high accuracy but does not identify who is speaking. Speaker diarization is on the roadmap but I will not put a date on it until it works reliably in my own weekly use. When it does ship, it will run locally on the Neural Engine with no audio leaving the Mac.

❓

Can MetaWhisp transcribe a live Zoom meeting?

Indirectly. MetaWhisp does not join your Zoom call as a bot. You can capture the meeting audio locally (BlackHole, Loopback, or a screen recorder) and feed the file into MetaWhisp, or you can dictate your own notes during the meeting with the global hotkey. The first path gives you a transcript of the call. The second gives you your own notes, instantly pasted into any app.

❓

Is pyannote.audio free for personal use?

The library is open-source and the diarization model weights are available on Hugging Face. You need a free Hugging Face account and to accept the model's terms on the model card before downloading. License terms are set per model, so read the license on the repo and on the specific model card before deploying it in a paid product.

❓

What is the easiest way to get speaker-labeled Zoom transcripts on Mac?

There is no fully-local one-click app for this in 2026. The fastest path that keeps audio on your Mac is: install BlackHole, capture the Zoom audio to a WAV file, install WhisperX (which wraps Whisper and pyannote.audio), and run it. Plan on a half-day setup if you already have Python, longer if you do not. There is no honest shortcut past that today.

❓

Will MetaWhisp add speaker diarization?

It is on the roadmap. I do not publish ship dates for features that have not cleared my own weekly-use bar. When it does ship, it will run locally on the Neural Engine with no audio leaving the Mac, because that is the bar the rest of the product sets. Until then, the DIY path above is the only fully-local option I can recommend in good conscience.

❓

Can Whisper and pyannote run on Apple Silicon (M1/M2/M3/M4)?

Yes, both run on Apple Silicon. Whisper runs on CPU, MPS (Metal Performance Shaders), or via WhisperKit on the Neural Engine. pyannote.audio runs on CPU or MPS via PyTorch. Neither needs an NVIDIA GPU. On an M-series Mac, diarizing a one-hour meeting typically takes roughly an hour, and transcription alone is faster.

❓

Is there a free Mac app for meeting transcription with speaker labels?

Not one that keeps audio fully local today. Zoom's built-in transcript is included with a paid Zoom plan but is processed on Zoom's cloud. The open-source stack (WhisperX + pyannote + BlackHole) is free but takes real setup. The honest answer for 2026 is: free + local + speaker labels is still a DIY weekend project. If that is not your weekend, you are paying someone, with money or with audio.


About the author. Andrew Dyuzhov is the solo founder of MetaWhisp. He is a marketer and builder with ADHD who assembled MetaWhisp on top of open-source Whisper with AI coding tools. He is not an ML researcher, lawyer, or doctor, and the only first-party accuracy number he will publish is the 2.76% WER LibriSpeech test-clean run referenced above. He dictates this blog in Russian and English, mostly while waiting on a long pyannote job to finish.
