So, I recently cooked up this side project called pdf-to-podcast.com. It's pretty simple: you toss in a PDF, and it spits out an audio podcast you can listen to while doing the dishes or whatever. But behind the scenes, there's some cool tech stuff happening.
The Tech Stack
- Google Gemini (the LLM): This is Google's large language model. It’s like the brains of the operation. It takes the dry, boring text from your PDF and turns it into a conversation between multiple people. Like magic, but with math.
- OpenAI TTS-1 (the Voice): This is OpenAI's text-to-speech model. It gives the podcast a voice (or multiple voices, actually). You can pick from different options, like a deep, authoritative voice or a more upbeat one.
- Promptic (My Little Helper): I made this little Python library to make talking to LLMs like Gemini easier. It's like a translator that helps my code and the LLM understand each other.
- Tenacity (the Try-Hard): Ever tried the old trick of turning your computer off and on again when it glitches? Tenacity is the code version of that: a Python retry library that runs a failed call again when there's a hiccup. It's particularly useful with LLMs because they can sometimes be a bit finicky.
- concurrent.futures (the Multitasker): This is a module in Python's standard library that lets me do multiple things at once. It's like having a bunch of little chefs working together to make a meal faster. In this case, it speeds up the audio generation process.
- Gradio (the Face of the Operation): This is the user interface you see when you use pdf-to-podcast.com. It's pretty straightforward – you upload a PDF, click a button, and boom, you get a podcast.
Why Gemini and OpenAI?
Now, you might be wondering why I went with Google Gemini for the LLM and OpenAI for TTS. Here's the deal:
- Cost Efficiency: These two play really well together. Gemini is surprisingly affordable, and OpenAI's TTS pricing is pretty reasonable. This means I can offer pdf-to-podcast.com without breaking the bank (or yours).
- Quality: Both models are top-notch. Gemini creates some really engaging conversations from dry text, and OpenAI's TTS sounds natural and expressive.
The Secret Sauce (aka How It Works)
- Upload Your PDF: It's as easy as drag-and-drop.
- Gemini Does Its Thing: The LLM reads your PDF and crafts a podcast-style conversation.
- Multi-Voice Magic: Each part of the conversation gets assigned a different speaker, and OpenAI TTS gives them their own voice.
- Audio Mixing: All the voices are blended into one audio file.
- Podcast Time: You get your brand-new podcast, ready to listen to!
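If you like seeing things in code, the steps above can be sketched roughly like this. Treat everything here as an assumption rather than the real implementation: the `Speaker: text` script format, the `DialogueLine` class, and the helper names are all made up for illustration, and `fake_tts` just returns labeled bytes instead of calling OpenAI. The voice names are OpenAI's stock TTS voices as far as I know, but check the current docs before relying on them.

```python
from dataclasses import dataclass
from itertools import cycle

# Assumed list of OpenAI TTS voice names; verify against the current docs.
VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]


@dataclass
class DialogueLine:
    speaker: str
    text: str


def parse_script(script):
    """Turn 'Speaker: text' lines (an assumed format) into DialogueLine objects."""
    lines = []
    for raw in script.strip().splitlines():
        if not raw.strip():
            continue  # skip blank lines
        speaker, _, text = raw.partition(":")  # split on the first colon only
        lines.append(DialogueLine(speaker.strip(), text.strip()))
    return lines


def assign_voices(lines, voices=VOICES):
    """Give each distinct speaker a voice, in order of first appearance."""
    mapping = {}
    pool = cycle(voices)  # wraps around if there are more speakers than voices
    for line in lines:
        if line.speaker not in mapping:
            mapping[line.speaker] = next(pool)
    return mapping


def fake_tts(text, voice):
    """Hypothetical stand-in for a TTS call; returns labeled bytes, not audio."""
    return f"[{voice}] {text}\n".encode()


def build_episode(script):
    """Parse the script, voice each line, and join the pieces in order."""
    lines = parse_script(script)
    voice_for = assign_voices(lines)
    # Joining bytes is only a placeholder for real audio mixing.
    return b"".join(fake_tts(l.text, voice_for[l.speaker]) for l in lines)


script = """
Host: Today we're covering a paper on sleep.
Guest: Happy to be here.
Host: Let's start with the basics.
"""
episode = build_episode(script)
```

The main design point is that voices are tied to speakers, not to individual lines, so the host sounds like the same person from start to finish.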
Prompt Generation Pro-tip
If you're writing prompts for a project like this, try Anthropic's prompt generation tool. You may need to create an account with them to use it; once you've logged in, you'll find it in the dashboard.