I’ve been playing with the text to speech capabilities built into Apple Mac OS X recently. There are many voices to choose from, including a handful of high quality voices that are nearly indistinguishable from human speech (Siri quality). Text to speech is available by selecting the text you want spoken, right clicking and choosing “Speech” -> “Start Speaking”. You can even have it read this text. Try it!
Hello, this is a test of the Apple Text to Speech capability. Some of the voices sound better than others. Apple has given control over speaking speed as well. The weather is sunny and cool. How did this sound?
Pauses
Right away I found it wasn’t quite perfect. Sentences are run together with nary a pause. This can be mitigated to some extent by using “…” at the end of each sentence. The preceding paragraph can be rewritten as follows, and the pauses between sentences are now evident, which makes it much easier to listen to and follow:
Hello, this is a test of the Apple Text to Speech capability... Some of the voices sound better than others... Apple has given control over speaking speed as well... The weather is sunny and cool... How did this sound?
But there is still a problem when you want a longer pause between paragraphs. Any amount of whitespace can be added between paragraphs, but it is ignored by the parser. If you insert “…” on a blank line it will speak “horizontal ellipsis”. Likewise, “,” (comma) and “-” (dash) are spoken aloud rather than treated as pauses. The secret to pausing lies in the parser’s ability to process embedded speech API commands inline with the text. In this case you want to use the silence command. To place a pause you add “[[slnc x]]” where x is the number of milliseconds to pause. For a 2 second pause you would add “[[slnc 2000]]”. To add a 3.5 second pause between paragraphs in the preceding text, it would now look like this:
Hello, this is a test of the Apple Text to Speech capability... Some of the voices sound better than others... Apple has given control over speaking speed as well... [[slnc 3500]] The weather is sunny and cool... How did this sound?
Now it’s much easier to follow.
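If you have many paragraphs, inserting the silence command by hand gets tedious. Here is a minimal Python sketch that joins paragraphs with an embedded “[[slnc x]]” command. The function name join_with_pauses and its pause_ms parameter are my own invention for illustration, not part of any Apple API:

```python
def join_with_pauses(paragraphs, pause_ms=3500):
    """Join paragraphs with an embedded [[slnc N]] command between them.

    join_with_pauses and pause_ms are hypothetical names for this sketch.
    N is the pause length in milliseconds, as described in the text.
    """
    if pause_ms < 0:
        raise ValueError("pause_ms must not be negative")
    separator = f" [[slnc {int(pause_ms)}]] "
    # Skip empty paragraphs so we never emit two pauses in a row.
    cleaned = [p.strip() for p in paragraphs if p.strip()]
    return separator.join(cleaned)


if __name__ == "__main__":
    text = join_with_pauses([
        "Apple has given control over speaking speed as well...",
        "The weather is sunny and cool... How did this sound?",
    ])
    print(text)
```

The result is just a string, so you can paste it anywhere the speech feature is available. On a Mac it should also be possible to pass it to the built-in say command from Terminal, though I am assuming the embedded commands behave the same way there.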
Numbers
Sometimes you may want it to speak each digit of a number individually rather than speaking the number’s value. Let’s say you have the number “5551234” and want it read as “five five five one two three four” instead of “five million five hundred fifty one thousand two hundred thirty four”. To tell the parser you want the number read literally, insert “[[nmbr LTRL]]” before the number. To return to the normal parsing method, add “[[nmbr NORM]]” after it.
[[nmbr LTRL]] 5551234 [[slnc 2000]] [[nmbr NORM]] 5551234
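The same wrapping can be done automatically for every long number in a piece of text. This is only a sketch: the function name read_digits_literally and the min_digits cutoff of 4 are assumptions of mine, chosen so that short numbers such as “12” are still read normally:

```python
import re


def read_digits_literally(text, min_digits=4):
    """Wrap long runs of digits in [[nmbr LTRL]] ... [[nmbr NORM]].

    read_digits_literally and min_digits are hypothetical names for
    this sketch. Runs shorter than min_digits are left untouched.
    """
    pattern = re.compile(r"\b\d{%d,}\b" % min_digits)

    def wrap(match):
        # Switch to literal mode, emit the digits, then switch back.
        return f"[[nmbr LTRL]] {match.group(0)} [[nmbr NORM]]"

    return pattern.sub(wrap, text)


if __name__ == "__main__":
    print(read_digits_literally("Call 5551234 from room 12"))
```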
Emphasis
Maybe you want to really stress a particular word like “do NOT try this at home!”. Simply capitalizing has no impact. Have a listen:
Do NOT try this at home!
You can tell the parser to speak particular words with more emphasis by using the emphasis command “[[emph +]]”. Turn it off after the word with “[[emph -]]”.
Do [[emph +]] NOT [[emph -]] try this at home!
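If you have a list of words that should always be stressed, the emphasis commands can be added with a small helper. Again, this is a sketch under my own assumptions: the emphasize function is hypothetical, and it matches whole words case-sensitively so that “NOT” is stressed but “not” is left alone:

```python
import re


def emphasize(text, words):
    """Surround each listed word with [[emph +]] and [[emph -]].

    emphasize is a hypothetical helper for this sketch. Matching is
    case-sensitive and limited to whole words.
    """
    if not words:
        return text
    alternatives = "|".join(re.escape(w) for w in words)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")

    def wrap(match):
        # Turn emphasis on before the word and off again after it.
        return f"[[emph +]] {match.group(0)} [[emph -]]"

    return pattern.sub(wrap, text)


if __name__ == "__main__":
    print(emphasize("Do NOT try this at home!", ["NOT"]))
```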
Others
You can also control the context interpretation, which helps when a word has two different pronunciations depending on context. For example: “coordinates” as in a map location, and “coordinates” as in coordinating an event. You can also change the pitch, speed, and volume, among others.
Here is a link to the Apple Speech Synthesizer API for customizing the output: https://developer.apple.com/library/mac/documentation/UserExperience/Conceptual/SpeechSynthesisProgrammingGuide/FineTuning/FineTuning.html
Thanks a million for this, I’ve been trying to insert longer pauses between TTS sentences for ages and finally know how.
Yeah, thanks for this little discussion. Very useful.
Adding to the thanks here – being able to add pauses and emphasis has been incredibly helpful. Thanks so much for sharing this!