The problem with Linux text-to-speech (TTS)

Ken Starks is the executive director of Reglue (Recycled Electronics and GNU/Linux Used for Education), which accepts broken or decommissioned computers to refurbish and place into the homes of financially disadvantaged kids in and around the Austin, Texas area.

In 2015, the Free Software Foundation presented Reglue with the Award for Projects of Social Benefit. After losing his larynx to cancer, Ken used text-to-speech (TTS) software to present at LibrePlanet 2015. In this interview, Ken tells us about his Texas Linux Fest talk, "Text-to-speech and Linux."

What are some practical implementations of text-to-speech software?

Text-to-speech software is most often used for two purposes. TTS can be, and often is, used for screen reading for the vision-challenged. While it's often confused with speech recognition software, the lines are fairly blurred between the two. The most highly defined use for TTS is allowing someone who is unable to speak to communicate. I lost my voice to cancer in January of this year, and I rely upon TTS to communicate or lecture to a group, or to communicate in real time.

Are you satisfied with the current situation of text-to-speech software available for Linux? Is there room for improvement?

Succinctly answered, no and yes. But I've never been known to be succinct, or for having any intention of becoming succinct (nod to Firefly's Jayne Cobb). Text-to-speech software in the Linuxsphere is in shambles. What I've discovered is that while there are many choices in Linux for this software, said software is not even close to being ready for the EDCU, the everyday computer user. I am carpal tunnel challenged, so my acronym list is growing daily. Here's what I've discovered:

On the 16th of January, 2015, I woke up knowing that I would soon be asleep again, and that when I woke up this time, my larynx would be gone, along with my ability to speak. And for the record, I'll note that some people do not necessarily count that as a bad thing. My biggest mistake was in assuming the software I would need for TTS was ready and waiting for my use. What I failed to note as I skimmed through the TTS options available to me was that none of that software was going to work out of the box.

So I jumped into the world of TTS in the Linuxsphere and found out that the water was over my head almost immediately. Now I could go on and give you example after example to make my point. I think that one illustration in particular will do that for me nicely. When I first searched for my options for getting a TTS app ready for use, this is the first thing I found (source):

Installing the enhanced CMU Arctic voices

These voices were developed by the Language Technologies Institute at Carnegie Mellon University. They sound much better than both the diphone and the MBROLA voices ... The drawback is that each voice takes over a hundred megs on disk, and with six English voices to choose from, that can take up a lot of bandwidth to download and depending on how much disk space you have to work with, 600 plus megs of space might be a bit much for voice data. However, the HTS voices discussed in the next section may in fact provide equal or better quality synthesis, and are only less than 2% of the size.

Downloading the voices

We will download everything we need for the English voices into a temporary directory (total download size is approximately 600 megs—you might want to go brew some coffee or something, lots of it...we might be here a while):

Code:

mkdir cmu_tmp
cd cmu_tmp/
wget -c http://www.speech.cs.cmu.edu/cmu_arctic/packed/cmu_us_awb_arctic-0.90-release.tar.bz2
wget -c http://www.speech.cs.cmu.edu/cmu_arctic/packed/cmu_us_bdl_arctic-0.95-release.tar.bz2
wget -c http://www.speech.cs.cmu.edu/cmu_arctic/packed/cmu_us_jmk_arctic-0.95-release.tar.bz2
wget -c http://www.speech.cs.cmu.edu/cmu_arctic/packed/cmu_us_clb_arctic-0.95-release.tar.bz2
wget -c http://www.speech.cs.cmu.edu/cmu_arctic/packed/cmu_us_rms_arctic-0.95-release.tar.bz2
wget -c http://www.speech.cs.cmu.edu/cmu_arctic/packed/cmu_us_slt_arctic-0.95-release.tar.bz2

Note: You can add the option "--limit-rate" to wget to set a maximum transfer speed (e.g., "wget -c --limit-rate=60K ..." to limit the download rate to 60KB/s).

Really? This is the "solution" for TTS on Ubuntu or derivatives? You gotta be kidding me. I mean, if you search further, you might find something marginally easier; but not by much. First off, let me mention that if you want to include the voices from MBROLA, they are not open source. They are not any easier to install either, just so it's said. Now, to be fair, there are some attempts for TTS as browser extensions that can be found in Chrome. Speakit seemed like it was going to be my solution, but it is limited in function and realm of use (for me, anyway). My needs boil down to one thing. I need to be able to communicate in real time or even close to real time. I've increased my "swype" WPM to almost 70, so it comes down to having a decent frontend for the TTS software.
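At its simplest, a frontend for real-time conversation is a loop that takes each typed line and hands it to a speech engine right away. The following is a minimal sketch of that idea, not a finished tool. It assumes Festival is installed and relies on its `festival --tts` mode, which speaks text read from standard input. The names `TTS_CMD` and `speak_lines` are invented here for illustration.

```shell
# Hypothetical knob: any command that reads text on stdin and speaks it.
# The default assumes Festival is installed.
TTS_CMD="${TTS_CMD:-festival --tts}"

# Read typed lines from stdin and speak each one as soon as Enter is pressed.
speak_lines() {
    while IFS= read -r line; do
        # Skip empty lines so a stray Enter does not start the synthesizer.
        [ -n "$line" ] || continue
        # TTS_CMD is left unquoted on purpose so "festival --tts" splits into words.
        printf '%s\n' "$line" | $TTS_CMD
    done
}
```

Running `speak_lines` in a terminal and typing a sentence followed by Enter would speak that sentence. Starting the engine once per line keeps the sketch short, but it adds a delay that a real frontend would want to avoid.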

This is the problem we are facing, boiled down to the simplest of terms: The software that makes the voices produce actual words and the voices themselves are most often two different entities, sometimes three. My first experience with Orca wasn't pleasant. I thought I was hearing a scratched record or speaker crackle somewhere, and it was bugging the hell out of me. It turns out it was Orca programmed to start reading the first screen upon boot. It was absolutely horrible voice representation, and I don't see how that can even be released as user-ready software—not with that "voice." So the user who is trying to use Festival for example, will want to add different voices as the default ones are too robotic. Here's where the problems begin.

The developers from each application start out working just fine together, but eventually one of the devs will do something on his or her end that changes, let's say, critical file paths. That's fine, unless you are the dev for the software that needs to access those files and no one told you about the update. Suddenly the TTS app is broken because it's looking for voice files in one directory when the other dev has moved them and didn't bother to tell anyone. Sadly, sometimes these apps are just plain abandoned. Who suffers? The end user. That's why I'm trying to bring attention to this area specifically.

I fully agree that there are TTS solutions to be found within the Android/Chromebook and the iExperience. I actually use an app named Speech Assistant on my Nexus 7. I use it on my tablet when I can, and it is everything that I believe a TTS application should be—not only in the mobile market, but in Linux as well. And that's the focus of my work here. The mobile market has decent TTS apps. Linux doesn't. Telling someone to just forget about Linux and use the mobile app is a dodge, plain and simple. It's a lazy way to walk away from a challenge.

That recommendation is often made by people who have no clue what they are talking about and who neither understand nor care about the user's particular needs. My computer and laptop are my eyes and voice facing the entire world. We don't need a cramped or clumsy mobile app when we are working. We need this solution on the desktop and we need it as soon as possible. In my previous articles and blogs concerning this topic, people from all over the world have asked me to keep them informed if I should find a decent solution to this problem. There is a global interest in this software.

It gave me a lot of satisfaction to read a comment made by Mr. Marcel Gagne. Marcel is a journalist and an author I highly admire, and I read his work weekly. Marcel made a comment in response to my FOSS Force article pertaining to this matter. Marcel mentioned that he had written an article 10 years ago bemoaning the state of TTS in Linux. Marcel said he was way past being disappointed by the fact that this problem still existed an entire decade later. That would make two of us, it would seem.

I'm not really sure about the difference between the two aside from screen reading capabilities. What can a Linux community do to fix this problem?

I've received scathing emails for saying this previously, concerning a phrase most all of us use daily. That phrase is "The Linux Community." It's been my firsthand experience to note that there really isn't anything close to a "Linux Community." On our best day, we are a large number of warring factions, verbally slaughtering each other and leaving bloody trails as we run and gun across the Internet battlefield. But a more condensed Linux community, a community or group focused on a particular problem or a particular matter, can go a long way to fix this problem.

The first thing anyone needs to do in order to fix a problem is to bring it to the attention of people capable of fixing it. That's not always an easy thing to do. Actually, it's rarely an easy thing to do. The task you face is convincing a developer that a large number of people need and will use their software. The term "developer" gets used a lot in our day-to-day dealings within FOSS. My opinion is that this offhand identifying term glosses over what a developer is and how difficult her or his job can be. It takes years and years to become efficient and skilled at programming, and that includes all of the programming languages. Asking a software developer to take time from his professional and personal life in order to work on your project should be done as humbly as possible. Asking Michelangelo to climb down from his scaffolding to paint your garage takes finesse and your hat clutched firmly in your hands.

Are there any noticeable projects trying to fix this problem that we should be keeping an eye on in the near future? If not, what's stopping the further development of text-to-speech software on the Linux platform?

Noticeable? Not yet, but there will be in the next few sentences. My FOSS Force article talking about the need for an easy-to-use frontend for the open source Java app MaryTTS brought results. At this time, there is a focused effort by three Java developers to develop this GUI. MaryTTS is an amazing application, but as with many TTS programs, it's a PITA to use. Much of the time, the user interface is the command line. For 95 percent of the EDCU, this isn't going to work, and I'll reference my example cited earlier.

The three developers are working across two continents and three time zones in order to apply their skills to this project. We have an alpha, or at least a proof of concept, now. Basic text-to-speech function is working, but there are a lot of other features that will be built into the GUI. And again, so that there isn't any confusion, our project is a frontend for MaryTTS. As for a name, we'd like it to be easily recognizable as the frontend that it is. Right now, we are working with Voices4MaryTTS, but other names are also being explored. We welcome suggestions that we can use to name this application. Here is an example of what we are producing at this time:

Audio file

You can download it here. Again, this is barely an Alpha production, so keep that in mind.
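To make the idea of a frontend concrete: MaryTTS can run as a local server that accepts text over HTTP and returns audio, so at its core a frontend only has to pass typed text along and play the result. The sketch below is not the project's code. It assumes a MaryTTS server listening on localhost port 59125 with a `/process` endpoint and the parameter names shown, which should be checked against the installed MaryTTS version. The names `MARY_URL`, `CURL`, and `mary_say` are made up for illustration.

```shell
# Assumed address of a locally running MaryTTS server.
MARY_URL="${MARY_URL:-http://localhost:59125/process}"
# Hypothetical override so the HTTP client can be swapped out.
CURL="${CURL:-curl}"

# mary_say TEXT OUTFILE: ask the server to synthesize TEXT into a WAV file.
mary_say() {
    # --get sends the --data fields as a URL query string;
    # --data-urlencode takes care of spaces and punctuation in the text.
    "$CURL" --silent --get \
        --data-urlencode "INPUT_TEXT=$1" \
        --data "INPUT_TYPE=TEXT" \
        --data "OUTPUT_TYPE=AUDIO" \
        --data "AUDIO=WAVE_FILE" \
        --data "LOCALE=en_US" \
        --output "$2" \
        "$MARY_URL"
}
```

With a server running, something like `mary_say "Good morning" hello.wav` followed by a player such as `aplay hello.wav` would speak the sentence. Wrapping exactly this kind of call in a window with a text box is the part the EDCU is currently missing.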

Is the Linux platform currently falling behind its competitors in this area?

It has never even been a serious contender in the race. In my opinion, most TTS applications in Linux have remained in hobbyist mode since inception. And I'm sure that statement will chap the ass of many, but a simple comparison between all of the Linux programs using TTS vs. Mac, Windows, and even the mobile market will bear me out. Hopefully we can raise enough awareness to at least see some forward movement on TTS in Linux. Hopefully.

This article is part of the Speaker Interview Series for Texas Linux Fest. Texas Linux Fest is the first state-wide, annual, community-run conference for Linux and open source software users and enthusiasts from around the Lone Star State.

Aleksandar Todorović
I'm a part of the tech department for an awesome investigative journalism network called OCCRP. I'm really passionate about open source software, artificial intelligence and information security. My open source contributions are now merged with projects like reddit, elementary OS and the Tor Project. I'm running a personal blog where I share my personal stories.

4 Comments

The needs of the two groups that use TTS software are different enough that the state of the software in Linux for the vision-challenged is not nearly as bad as it is for the speech-challenged.

For the vision-challenged, a command line interface is a plus; for the speech-challenged, it's a minus. The vision-challenged are quite willing to sacrifice naturalness of voice for speed of reading; the speech-challenged are just the opposite, wanting naturalness and not speed.

It seems that a lot of the current state of TTS software on Linux is because the developers have been targeting the vision-challenged and not paying any attention to the speech-challenged at all. Let's hope that Ken's efforts lead to this shortcoming being properly addressed.

As far as I know there are also ibmtts (the default TTS of JAWS under Windows, also known as Voxin or Eloquence) and Pico (Google's TTS, which is also used with TalkBack).
- IBMTTS costs about 4 to 5 bucks at http://voxin.oralux.net/get.php, but the website is currently down for maintenance.
- Pico is open source.
Maybe this helps.

This article is not entirely accurate. I'm not writing to criticize, but to point out all of your options. I'm a blind person, and as such I don't need or necessarily want high-quality voices, which often translates to enormously sized ones taking up space on my hard drive. To my ears the naturalness sounds more like a sophisticated but not very correct algorithm for running the speech together in an attempt to mimic how a human speaks. The stress is often off, sometimes by only a little, sometimes hilariously so, and it messes up the flow of the text. I'm much more used to voices that use a formant method of speech rather than a concatenative one, as the natural voices do. It's not as natural, but it's faster and certainly smaller. I've been working for the better part of three years to improve espeak, available at http://espeak.sf.net, to bring its US English up to snuff.

If you're judging the state of Linux TTS by how easy Festival is to get going, then I'm not at all surprised you found it hard. I've never succeeded in getting that piece of software working, and I've tried more than a few times. I'll be more than willing to help improve the state of TTS in Linux if you'll just direct me on what needs improving. I do, however, take issue with your statement that it's not easy enough. I've never quite understood Windows users' complaints that there are no graphical installers, click-through wizards, that kind of thing for Linux. We do things differently here.

What I think you might want is a good natural voice that sounds more human than computer. You should be able to have that. There are commercial-quality voices for Linux that might fit your needs. I know of two such companies that make them. There's Cepstral, from I believe Nuance, and Ivona; I'm not sure exactly which company manufactures that. I've used Windows as new as Windows 10, and I have yet to find anything significantly easier or better about that platform that Linux cannot do. I find just the opposite, as a matter of fact.

I don't know if there is any good open source voice dictation software for Linux. I've heard that there is, and then I've heard that there is not. If there is not any, or if what's out there isn't good, then that needs to be fixed, because you have just as much a right as anyone else to use your computer without paying insanely high prices for software or hardware. I'll be more than willing to help. I can even point you to communities, mostly blind and visually impaired specific, it's true, but nevertheless communities where you can get your issues fixed. I've often struggled to get my voice heard when it comes to accessibility. I never imagined that it would be just as hard for you.

Man, I wish I'd taken the time to read my comment more thoroughly before I hit submit. You don't need voice dictation software, you need good-quality TTS voices. Well, at least I sort of covered that. Your basic options are espeak and ibmtts or Voxin; the names are numerous for the good old Eloquence synthesizer that JAWS and Window-Eyes use. That's about it for the fast, responsive voices. After that come the natural voices. There's Festival, which you've already had much more trouble getting going than you should have; Pico, which Google uses in Android; the Cepstral and Ivona natural voices; and MaryTTS, which Orca can't support yet. There is in fact a standardized speech API similar to SAPI 4 and 5 for Windows. It's called speech-dispatcher.

The main problem here is that we're so spread out, and not all of the disabled communities have much to do with each other. I wasn't even aware there were people who depended on TTS who weren't blind before I read your article. We need to form one big, committed disabled community and focus on improving the software we depend on, instead of being in our own little circles and complaining that Windows and Mac get all the press and credit. I'm as guilty as anyone else of this sometimes. I wonder if the Linux Foundation would consider adding some sort of TTS effort to their Core Infrastructure Initiative? It's not exactly core in the sense that enterprises use it, but it's certainly the cornerstone of software for most disabled people, along with Orca, the main screen reader.

Added to that, we as a community need to get more people on the ground attending Linux conferences and getting the word out that there are people who use Linux every day and depend on software that maybe not everyone else has heard of. I've been trying to find one to attend for a while, but I live in East Texas and am usually severely low on funds, so I can't usually afford the mileage to attend conferences when most of the big ones are out of state or out of the country. We *can* improve this, but we have to put the work into it. We can't simply complain that Windows has it better and expect that to get anywhere. And I mean me as well. I do this sometimes when I get discouraged. Linux has some amazing qualities, one of which is that it can be run live and talks at launch, allowing you to install eyes-free, a huge, huge difference from Windows, and I'm determined to make it better.

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.