Falling asleep is more fun with an AI in your ear

When I was a teenager, I had a CD player in my room, and I used to listen to fairy tales to fall asleep. The narrator’s voice would relax me and I’d fall asleep quickly. Fast forward to yesterday: I was playing with Google Text-to-Speech for an unrelated project and had gotten one of their code samples to generate some speech for me. I had also played around with OpenAI’s GPT-3, which I had found wonderfully surrealist and which had stuck in my mind, so I thought I should combine the two and create a podcast of nonsensical stories that you could listen to while drifting off to sleep.

Having already played with Google’s speech synthesis, I thought it would be pretty quick and easy to create this: all I’d have to do is generate some text with GPT-3 and have Google speak it. GPT-3 is an AI model that can generate very convincing text from a sample; you basically give it a topic and a few sentences, and it continues in the same vein, writing very natural-sounding prose. Half an hour later, I had an AI-generated logo, AI-generated soundscapy background music, an AI-generated fairytale, and an AI-narrated audio file. A day later, I had seven:

The Deep Dreams podcast.

Here’s how I did it:

A brief history

I started playing with GPT-3 and some code yesterday afternoon. By today, I had generated a few episodes, and I posted a link to Hacker News, where people seemed to like the project and had many good suggestions, some of which I implemented. In this writeup, I will detail what I tried in separate sections, mentioning what I tried initially and what I tried later, after I got some feedback, so the writeup won’t be in strictly chronological order.

Let’s dive in!

GPT-3

Not pictured: The robot hand that is holding the pen.

First, I started with GPT-3. This part was pretty straightforward: OpenAI provides a “playground” where you can type your prompt, select a few parameters, and have the model complete a few sentences at a time. The results were immediately pretty usable.

I started with a prompt that looked similar to this:

The storyteller started telling the children an old fairytale, to help them fall asleep. The fairytale was very calming, pleasant and soothing, and it was about a princess, her fairy godmother, and her evil stepmother. It went like this:

Once upon a time,

The model immediately continued the story with what is now the contents of episode 1 of the podcast, sentence by sentence. One small issue I encountered was that the model would often repeat itself a lot. Luckily, there’s a parameter you can tweak (the frequency penalty) to penalize repetition, so the model is more likely to come up with novel material rather than just repeat the same sentences over and over.
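For reference, here’s roughly what the equivalent call looks like through the openai Python library as it worked for GPT-3. I used the playground rather than the API at this stage, so the parameter values below are illustrative, not the ones I actually used:

import openai

openai.api_key = "sk-..."  # Your OpenAI API key.

prompt = (
    "The storyteller started telling the children an old fairytale, "
    "to help them fall asleep. It went like this:\n\nOnce upon a time,"
)

response = openai.Completion.create(
    engine="davinci",
    prompt=prompt,
    max_tokens=200,
    temperature=0.8,        # Some randomness keeps the story interesting.
    frequency_penalty=0.8,  # Penalize verbatim repetition.
    presence_penalty=0.4,   # Nudge the model toward new topics.
)

story = prompt + response["choices"][0]["text"]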

You can also help the model along if it sticks to the same topic too much, by slightly changing some of the text in the middle of generation. For example, if the model insists on having the main character go back to the forest and chop wood every paragraph, you can change “and he went back to the woods” to “and he went to the city”, and the model will take that and run with it. It might still have the character go to the forest near the city, but at least it’s something.

Another issue that I still haven’t solved is the model’s tendency to stop early. It seems that, sometimes, it just runs out of things to say, and then it starts adding “and they lived happily ever after” to the text, or changing the subject completely, and it’s hard to get it to write more if it doesn’t want to. That’s also the main reason why the episodes are less than ten minutes long.

A caveat

I want to point out here a huge caveat of GPT-3 that you need to be aware of, and that I wasn’t, which caused my costs to be many times larger than they could have been: GPT-3 charges you both for the generated text and for the prompt! That means that if you use the playground in the default configuration, as I did, you might end up writing a prompt of 10 words, then getting another 10 generated, then another 10, then another 10. In that case, GPT-3 will charge you for 90 words total, even though you only generated 30: the first request bills 10 prompt words plus 10 generated, the second bills 20 plus 10, and the third 30 plus 10. That’s because, when you press “continue”, the previously-generated text becomes part of the prompt and gets billed again, every time you generate another sentence, so the total cost grows quadratically with the length of the text.

A better way to do that would be to have the API generate hundreds of words, and then use some of those as the prompt for the next generation.
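Something like this sketch, for example (the number of rounds and the 1,500-character window here are arbitrary, just to show the idea):

import openai

def generate_story(seed, rounds=5, window=1500):
    story = seed
    for _ in range(rounds):
        response = openai.Completion.create(
            engine="davinci",
            # Only send the tail of the story, so we don't pay for the
            # whole text again on every request.
            prompt=story[-window:],
            max_tokens=400,  # Generate a lot of text per request.
        )
        story += response["choices"][0]["text"]
    return story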

Speech synthesis

Tara was somehow comforted by the familiar voice that ordered her to kill all humans.

The next step after generating the text was to get it narrated. Luckily, I had some fresh experience with using Google’s speech synthesis API, which has an array of fairly convincing voices.

I wrote some code to generate an MP3 file from the text that GPT-3 wrote, and I had my first narration. Unfortunately, Google’s API will only let you send up to 5,000 characters to narrate at a time, so there were some issues with the longer stories, but with the power of Python and some handy libraries, I had the API narrate parts of the story and then stitched them together into a complete whole.
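The splitting itself is simple. Here’s a minimal sketch of one way to do it (not necessarily what my code does), chunking on sentence boundaries so the narration doesn’t get cut off mid-sentence, and assuming no single sentence exceeds the limit:

import re

def split_text(text, limit=4500):
    """Split text into chunks of whole sentences, each under `limit` characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > limit:
            chunks.append(current)
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks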

This is how episodes one through five were generated, but the voices Google offered weren’t a great fit for this use case, and people remarked on how grating this particular voice was, which is something you definitely don’t want when you’re trying to fall asleep. After looking at Amazon Polly and Microsoft’s Azure text-to-speech, I decided that the latter was much more pleasant-sounding and had a few voices that were much better than Google’s.

A few minutes later, I had my generation code using the new service, and episodes six and seven sound much more pleasing. I’m not sure whether this API has the same 5,000-character limitation, but the splitting works so well that I didn’t bother to find out whether it can generate an entire episode in one go.

The code that generates audio from SSML (a markup language that adds annotations to text so that the text-to-speech engine knows how to pronounce things) is pretty straightforward, and comes basically straight from the Azure docs:

from azure.cognitiveservices.speech import (
    AudioDataStream,
    SpeechConfig,
    SpeechSynthesizer,
    SpeechSynthesisOutputFormat,
)

# Point the SDK at your subscription and region.
speech_config = SpeechConfig(
    subscription=mykey,
    region="westeurope",
)

# Ask for a high-bitrate MP3 instead of the default WAV output.
speech_config.set_speech_synthesis_output_format(
    SpeechSynthesisOutputFormat["Audio48Khz192KBitRateMonoMp3"]
)

# audio_config=None means "give me the audio data, don't play it aloud".
synthesizer = SpeechSynthesizer(speech_config=speech_config, audio_config=None)

result = synthesizer.speak_ssml_async(ssml_text).get()
stream = AudioDataStream(result)
# Despite the method's name, this writes the MP3 data we requested above.
stream.save_to_wav_file(outfile)

This will take the script and write the narrated audio to an MP3 file.
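For context, the ssml_text above is just the story text wrapped in SSML tags, something like this (the voice name and pause length here are illustrative, not necessarily the ones I used):

ssml_text = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    Once upon a time, <break time="500ms"/> there lived a princess...
  </voice>
</speak>
"""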

Background music

You’ll notice, however, that there’s more to an episode than just the narration. Episodes also include strategic pauses, background music, and fades.

For the background music, I wanted something that’s algorithmically generated, to fit with the general theme. I found a site that generates soundscapy audio and has a paid export function, which I used to buy a ten-minute-long MP3 of the sounds. (UPDATE: It turns out that the audio files I bought from that site are copyrighted, so I have to take down all the episodes and recreate them with new background music.) I imported the audio into Audacity, added the narration and background tracks, added some silences and fades, and the first episode was ready!

This way of doing things was fine for one episode, but if I was going to make a second one I didn’t want to have to manually mix tracks again. I looked around for a Python library that could do it for me, and I found pydub.

pydub is an audio manipulation library that lets you easily adjust volume, generate silence, add fades, and cut and join tracks, which was everything I needed. I wrote a few lines of code to shorten the background track to the duration of the episode, add pauses before and after the narration, and fade the background track in and out, and another part of the episode generation was done.

Here’s some of the code:

import pydub

# `episode` and `background` are the paths of the narration and
# background-music MP3s; `tempepisode` and `tempbg` are temporary files.
aepisode = pydub.AudioSegment.from_mp3(episode)
abackground = pydub.AudioSegment.from_mp3(background)

# Add silence to the start and end.
apadded_episode = (
    pydub.AudioSegment.silent(duration=7000)
    + aepisode
    + pydub.AudioSegment.silent(duration=8000)
)
apadded_episode.export(tempepisode, format="mp3")

# Cut the background track to the length of the narration, and fade it
# in and out.
acut_bg = (
    abackground[: apadded_episode.duration_seconds * 1000]
    .fade_in(5000)
    .fade_out(5000)
)

# Lower the background track volume by 20 dB.
alower_volume_cut_bg = acut_bg - 20

# Export a temporary background track.
alower_volume_cut_bg.export(tempbg, format="mp3")

After this, another step uses ffmpeg to mix the two tracks.
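For illustration, that step looks something like this (my exact ffmpeg invocation may differ; amix overlays the two inputs and, with duration=first, keeps the length of the first one):

import subprocess

# Mix the padded narration with the quieted, faded background track.
subprocess.run(
    [
        "ffmpeg",
        "-i", tempepisode,
        "-i", tempbg,
        "-filter_complex", "amix=inputs=2:duration=first",
        outfile,
    ],
    check=True,
)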

At that point, I could run the script and have it generate the entire audio file, start to finish, from the script, without any other manual work.

Logo

The Deep Dreams podcast logo.

Since everything else fit with the theme, I wanted the logo to be AI-generated too. I found a page online that would let you generate images using an AI, and quickly created the logo there. Unfortunately, I don’t remember the name of the site, but it doesn’t matter much anyway.

I’m quite happy with the logo (seen here on the right): it’s vague enough to be perfect for the podcast, and it’s AI-generated, which fits this whole effort nicely.

Costs

A human child donates money to a panhandling robot.

One point I was curious about in this whole process was what the costs would be. The paid services I use are GPT-3 for the text generation and Azure text-to-speech for the narration. GPT-3 cost about $2 for all seven episodes, and Azure text-to-speech cost so little that I can’t estimate it yet (Microsoft has a $200 free tier for three months). I think it’s a few cents per hour of audio, though, so it’s very cheap for what I’m doing.

All in all, I estimate that, to produce these episodes, I spent around $2 in services and $1500 in time, which is basically par for the course for side-projects. Obviously, you can’t really price the time I spent like that, because I enjoyed doing this, but this way of thinking about it helps me remember that $2 (or whatever I spend out of pocket on my projects) is nothing compared to the enjoyment I get from building these things.

Epilogue

That’s more or less the entire process I followed in building this. The first episode took half an hour to an hour to make, the second episode took two or three hours (because that’s when I wrote the autogeneration code), and the rest of the episodes took a few minutes each.

If you want to see how the sausage is made, all my code and assorted things are here:

https://gitlab.com/stavros/deep-dreams

Also, if you have any feedback, comments, episode requests, or whatever, feel free to Tweet or toot at me, or email me directly.