Recently a post about generating audiobooks started trending on HN, and some people in the comments wished they could clone their voice and narrate text without sending it off their machine. It’s never been easier!

For this example, we only need a Mac, uv (a modern Python package manager), ffmpeg for audio processing, and optionally ChatGPT for transcribing your voice (though you can also do it manually or use mlx-whisper, for example). We will be using F5-TTS-MLX, an open-source implementation of the F5-TTS speech synthesis model in MLX, Apple’s array framework for Apple Silicon.

The final result: compare it to the original source video of him. The source quality is very important!

Step-by-step guide

  1. Install ffmpeg and uv via `brew install ffmpeg uv`.
  2. Create a directory and initialize a uv project: `mkdir voiceclone && cd voiceclone && uv init`, then add the package with `uv add f5-tts-mlx`. In theory you could use uvx and skip this step, like I do below for yt-dlp, but I wasn’t able to get it working here, so we will be using a regular, pip-like package install.
  3. I won’t be using my own voice, but I like pg’s essays, so I will use this video of him. Let’s download it using yt-dlp: `uvx yt-dlp -x --audio-format wav -o out.wav https://www.youtube.com/shorts/S0gEZ72uBWU`
  4. We have to convert it to a suitable format, mono 16-bit audio at 24 kHz trimmed to the first 10 seconds, with `ffmpeg -i out.wav -ac 1 -ar 24000 -sample_fmt s16 -t 10 sample.wav` (there’s an optional sanity-check snippet after this list).
  5. We should also get a reference text for this audio. You only need it once, but it’s important that it matches the recording, so either transcribe it manually or throw sample.wav into ChatGPT/Whisper/Gemini and get a transcription (there’s also an mlx-whisper sketch after this list).
  6. Now, run the following command, replacing --ref-text with the transcription from step 5 and --text with your desired text: `uv run -m f5_tts_mlx.generate --ref-audio sample.wav --ref-text "If you want to start up one day, if you want to start a startup one day, what do you do now in college? There are only two things you need initially." --output gen.wav --text "As for where political correctness began, if you think about it, you probably already know the answer. Did it begin outside universities and spread to them from this external source? Obviously not; it has always been most extreme in universities."`
  7. Your output file is gen.wav, and there are lots of customization options. F5 can generate audio segments up to 30 seconds long (this includes the 10-second reference audio), so working with shorter sentences is generally easier; for longer passages you can generate sentence by sentence and stitch the clips together (see the last sketch after this list).
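
A few optional sketches for the steps above. First, if you want to double-check that sample.wav came out the way step 4 intended (mono, 16-bit, 24 kHz, at most ~10 seconds), the standard library’s wave module is enough. This is just a convenience check, not something F5 requires:

```python
# check_sample.py - sanity-check the reference clip from step 4
import wave

with wave.open("sample.wav", "rb") as w:
    channels = w.getnchannels()      # expect 1 (mono)
    sample_width = w.getsampwidth()  # expect 2 bytes (16-bit)
    rate = w.getframerate()          # expect 24000 Hz
    seconds = w.getnframes() / rate  # expect <= ~10 s

print(f"channels={channels}, bits={sample_width * 8}, rate={rate}, length={seconds:.1f}s")
```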
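
Second, if you’d rather keep transcription (step 5) on the machine as well, mlx-whisper can do it in a few lines. A minimal sketch, assuming you’ve added the package with `uv add mlx-whisper`; the model repo below is just one of the mlx-community conversions, any of them should work:

```python
# transcribe.py - produce the reference text for --ref-text locally
import mlx_whisper

result = mlx_whisper.transcribe(
    "sample.wav",
    path_or_hf_repo="mlx-community/whisper-turbo",  # assumption: my pick of Whisper MLX conversion
)
print(result["text"])  # paste this into --ref-text in step 6
```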
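
Finally, since each generation is capped at roughly 30 seconds including the reference clip, the simplest way to narrate a longer passage is to generate it sentence by sentence and join the clips. A rough sketch under those assumptions: it reuses the exact CLI flags from step 6, splits naively on ". ", and joins the parts with ffmpeg’s concat demuxer; run it with `uv run narrate.py` so the project environment (with f5-tts-mlx) is used:

```python
# narrate.py - generate a longer passage sentence by sentence, then join the clips
import subprocess
import sys

REF_AUDIO = "sample.wav"
REF_TEXT = "the transcription from step 5 goes here"
TEXT = "Your longer text goes here. Each sentence becomes its own short clip. The clips are joined at the end."

clips = []
for i, sentence in enumerate(s.strip() for s in TEXT.split(". ") if s.strip()):
    out = f"part_{i}.wav"
    # same command as step 6, one sentence at a time
    subprocess.run(
        [sys.executable, "-m", "f5_tts_mlx.generate",
         "--ref-audio", REF_AUDIO,
         "--ref-text", REF_TEXT,
         "--text", sentence if sentence.endswith(".") else sentence + ".",
         "--output", out],
        check=True,
    )
    clips.append(out)

# join the parts with ffmpeg's concat demuxer
with open("clips.txt", "w") as f:
    f.writelines(f"file '{c}'\n" for c in clips)
subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", "clips.txt", "full.wav"], check=True)
print("wrote full.wav")
```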

Going beyond

The original implementation has 🇬🇧🇺🇸🇫🇮🇫🇷🇮🇳🇮🇹🇯🇵🇨🇳🇷🇺🇪🇸 language support. The MLX implementation also includes a duration predictor specifically for English, which makes it easier to produce natural-sounding audio, but it isn’t available for other languages. Personally I’m interested in German, and someone has fine-tuned F5 for it, so training an additional duration predictor could help translate audio into German, similar to what Lex Fridman did for his interview with Javier Milei, but open source.