The problem with Linux text-to-speech (TTS)

Ken Starks is the executive director of Reglue (Recycled Electronics and GNU/Linux Used for Education), which accepts broken or decommissioned computers to refurbish and place into the homes of financially disadvantaged kids in and around the Austin, Texas area.

In 2015, the Free Software Foundation presented Reglue with the Award for Projects of Social Benefit. After losing his larynx to cancer, Ken used text-to-speech (TTS) software to present at LibrePlanet 2015. In this interview, Ken tells us about his Texas Linux Fest talk, Text-to-speech and Linux.

What are some practical implementations of text-to-speech software?

Text-to-speech software is most often used for two purposes. TTS can be, and often is, used for screen reading for the vision-challenged. It's often confused with speech recognition software, and the lines between the two are fairly blurred. The most highly defined use for TTS is allowing someone who is unable to speak to communicate. I lost my voice to cancer in January of this year, and I rely upon TTS to lecture to a group or to communicate in real time.

Are you satisfied with the current situation of text-to-speech software available for Linux? Is there room for improvement?

Succinctly answered, no and yes. But I've never been known to be succinct, or for having any intention of becoming succinct (nod to Firefly's Jayne Cobb). Text-to-speech software in the Linuxsphere is in shambles. What I've discovered is that while there are many choices in Linux for this software, said software is not even close to being ready for the EDCU, the everyday computer user. I am carpal tunnel challenged, so my acronym list is growing daily. Here's what I've discovered:

On the 16th of January, 2015, I woke up knowing that I would soon be asleep again, and that when I woke up this time, my larynx would be gone, along with my ability to speak. And for the record, I'll note that some people do not necessarily count that as a bad thing. My biggest mistake was in assuming the software I would need for TTS was ready and waiting for my use. What I failed to note as I skimmed through the TTS options available to me was that none of that software was going to work out of the box.

So I jumped into the world of TTS in the Linuxsphere and found out that the water was over my head almost immediately. Now I could go on and give you example after example to make my point. I think that one illustration in particular will do that for me nicely. When I first searched for my options for getting a TTS app ready for use, this is the first thing I found (source):

Installing the enhanced CMU Arctic voices

These voices were developed by the Language Technologies Institute at Carnegie Mellon University. They sound much better than both the diphone and the MBROLA voices ... The drawback is that each voice takes over a hundred megs on disk, and with six English voices to choose from, that can take up a lot of bandwidth to download and depending on how much disk space you have to work with, 600 plus megs of space might be a bit much for voice data. However, the HTS voices discussed in the next section may in fact provide equal or better quality synthesis, and are only less than 2% of the size.

Downloading the voices

We will download everything we need for the English voices into a temporary directory (total download size is approximately 600 megs—you might want to go brew some coffee or something, lots of it...we might be here a while):

Code:

mkdir cmu_tmp
cd cmu_tmp/
wget -c http://www.speech.cs.cmu.edu/cmu_arctic/packed/cmu_us_awb_arctic-0.90-release.tar.bz2
wget -c http://www.speech.cs.cmu.edu/cmu_arctic/packed/cmu_us_bdl_arctic-0.95-release.tar.bz2
wget -c https://www.speech.cs.cmu.edu/cmu_arctic/packed/cmu_us_clb_arctic-0.95-release.tar.bz2
wget -c https://www.speech.cs.cmu.edu/cmu_arctic/packed/cmu_us_jmk_arctic-0.95-release.tar.bz2
wget -c https://www.speech.cs.cmu.edu/cmu_arctic/packed/cmu_us_rms_arctic-0.95-release.tar.bz2
wget -c https://www.speech.cs.cmu.edu/cmu_arctic/packed/cmu_us_slt_arctic-0.95-release.tar.bz2

Note: You can add the option "--limit-rate" to wget to set a maximum transfer speed (e.g., "wget -c --limit-rate=60K ..." to limit the download rate to 60KB/s).
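
For illustration only, the six downloads and the optional rate limit could be folded into a small loop. This is a sketch, not part of the quoted instructions: the function names arctic_url and fetch_arctic_voices and the DRY_RUN switch are made up for this example, and the base URL assumes the https form that most of the links above use. By default it only prints the wget commands it would run.

```shell
#!/usr/bin/env bash
# Sketch only: prints the wget commands for the six CMU Arctic voices
# named in the quoted instructions. Set DRY_RUN=0 to actually download.

# Assumption: the https form of the URL used by most of the links above.
BASE_URL="https://www.speech.cs.cmu.edu/cmu_arctic/packed"
SPEAKERS="awb bdl clb jmk rms slt"

# arctic_url SPEAKER: print the tarball URL for one speaker.
# In the quoted commands, awb is at release 0.90 and the others at 0.95.
arctic_url() {
    local speaker="$1"
    local version="0.95"
    if [ "$speaker" = "awb" ]; then
        version="0.90"
    fi
    printf '%s/cmu_us_%s_arctic-%s-release.tar.bz2\n' "$BASE_URL" "$speaker" "$version"
}

# fetch_arctic_voices [RATE]: download (or, by default, just print) each voice.
# RATE, if given, is handed to wget's --limit-rate option, e.g. 60K.
fetch_arctic_voices() {
    local rate="${1:-}"
    local speaker
    for speaker in $SPEAKERS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "wget -c ${rate:+--limit-rate=$rate }$(arctic_url "$speaker")"
        else
            wget -c ${rate:+--limit-rate="$rate"} "$(arctic_url "$speaker")"
        fi
    done
}

fetch_arctic_voices 60K
```

Even in this tidier form, the user still has to know the speaker codes, the release numbers, and what to do with the archives afterward.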

Really? This is the "solution" for TTS on Ubuntu or derivatives? You gotta be kidding me. I mean, if you search further, you might find something marginally easier, but not by much. First off, let me mention that if you want to include the voices from MBROLA, they are not open source. They are not any easier to install either, just so it's said. Now, to be fair, there are some attempts at TTS in the form of browser extensions for Chrome. Speakit seemed like it was going to be my solution, but it is limited in function and realm of use (for me, anyway). My needs boil down to one thing. I need to be able to communicate in real time or even close to real time. I've increased my "swype" WPM to almost 70, so it comes down to having a decent frontend for the TTS software.

This is the problem we are facing, boiled down to the simplest of terms: The software that makes the voices produce actual words and the voices themselves are most often two different entities, sometimes three. My first experience with Orca wasn't pleasant. I thought I was hearing a scratched record or speaker crackle somewhere, and it was bugging the hell out of me. It turns out it was Orca, programmed to start reading the first screen upon boot. It was an absolutely horrible voice representation, and I don't see how that can even be released as user-ready software, not with that "voice." So the user who is trying to use Festival, for example, will want to add different voices, as the default ones are too robotic. Here's where the problems begin.

The developers from each application start out working just fine together, but eventually one of the devs will do something on his or her end that changes, let's say, critical file paths. That's fine, unless you are the dev for the software that needs to access those files. No one told her or him about this update. Suddenly the TTS app is broken because it's looking for voice files in one directory when the voice developer has moved them to another and didn't bother to tell anyone. Sadly, sometimes these apps are just plain abandoned. Who suffers? The end user. That's why I'm trying to bring attention to this area specifically.
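
To make that failure mode concrete, here is a minimal sketch of how a frontend could look in more than one place for voice files instead of relying on a single hard-coded path. The function name first_existing_dir and the "example-tts" directories are hypothetical and are not taken from any real TTS package.

```shell
#!/usr/bin/env bash
# Sketch only: the directory names below are hypothetical examples,
# not paths used by any particular TTS package.

# first_existing_dir DIR...: print the first argument that is an existing
# directory; return 1 if none of them exists.
first_existing_dir() {
    local dir
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Try a per-user location first, then a system-wide one.
if voice_dir=$(first_existing_dir "$HOME/.local/share/example-tts/voices" /usr/share/example-tts/voices); then
    echo "Using voices from: $voice_dir"
else
    echo "No voice directory found; the voice package may have moved its files." >&2
fi
```

A check like this does not fix the coordination problem, but it turns a silent breakage into a message the user can act on.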

I fully agree that there are TTS solutions to be found within the Android/Chromebook and the iExperience. I actually use an app named Speech Assistant on my Nexus 7. I use it on my tablet when I can, and it is everything that I believe a TTS application should be—not only in the mobile market, but in Linux as well. And that's the focus of my work here. The mobile market has decent TTS apps. Linux doesn't. Telling someone to just forget about Linux and use the mobile app is a dodge, plain and simple. It's a lazy way to walk away from a challenge.

That recommendation is often made by people who have no clue what they are talking about and who neither understand nor care about the user's particular needs. My computer and laptop are my eyes and voice facing the entire world. We don't need a cramped or clumsy mobile app when we are working. We need this solution on the desktop and we need it as soon as possible. In my previous articles and blogs concerning this topic, people from all over the world have asked me to keep them informed if I should find a decent solution to this problem. There is a global interest in this software.

It gave me a lot of satisfaction to read a comment made by Mr. Marcel Gagne. Marcel is a journalist and an author I highly admire, and I read his work weekly. Marcel made a comment in response to my FOSS Force article pertaining to this matter. Marcel mentioned that he had written an article 10 years ago bemoaning the state of TTS in Linux. Marcel said he was way past being disappointed by the fact that this problem still existed an entire decade later. That would make two of us, it would seem.

I'm not really sure about the difference between the two aside from screen reading capabilities. What can a Linux community do to fix this problem?

I've received scathing emails for saying this previously, concerning a phrase most all of us use daily. That phrase is "The Linux Community." It's been my firsthand experience to note that there really isn't anything close to a "Linux Community." On our best day, we are a large number of warring factions, verbally slaughtering each other and leaving bloody trails as we run and gun across the Internet battlefield. But a more condensed Linux community, a community or group focused on a particular problem or a particular matter, can go a long way to fix this problem.

The first thing anyone needs to do in order to fix a problem is to bring it to the attention of people capable of fixing it. That's not always an easy thing to do. Actually, it's rarely an easy thing to do. The task you face is convincing a developer that a large number of people need and will use their software. The term "developer" gets used a lot in our day-to-day dealings within FOSS. My opinion is that this offhand identifying term glosses over what a developer is and how difficult her or his job can be. It takes years and years to become efficient and skilled at programming, and that includes all of the programming languages. Asking a software developer to take time from her or his professional and personal life in order to work on your project should be done as humbly as possible. Asking Michelangelo to climb down from his scaffolding to paint your garage takes finesse and your hat clutched firmly in your hands.

Are there any noticeable projects trying to fix this problem that we should be keeping an eye on in the near future? If not, what's stopping the further development of text-to-speech software on the Linux platform?

Noticeable? Not yet, but there will be in the next few sentences. My FOSS Force article talking about the need for an easy-to-use front end for the open source Java app MaryTTS brought results. At this time, there is a focused effort by three Java developers to develop this GUI. MaryTTS is an amazing application, but as with many TTS programs, it's a PITA to use. Much of the time, the user interface is the command line. For 95 percent of the EDCU, this isn't going to work, and I'll reference my example cited earlier.
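
As a rough idea of what "the user interface is the command line" can look like, here is a hedged sketch of asking a MaryTTS server for speech from a shell. It is not from the interview. It assumes a MaryTTS server is already running locally and accepts HTTP requests at /process on port 59125 with the parameters shown, so check your own installation before relying on any of it. The helper names mary_process_url and mary_say are invented for this example.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes a MaryTTS server is already running and reachable
# at /process on port 59125; adjust MARY_HOST and MARY_PORT if yours differs.
MARY_HOST="${MARY_HOST:-localhost}"
MARY_PORT="${MARY_PORT:-59125}"

# mary_process_url: print the endpoint that requests are sent to.
mary_process_url() {
    printf 'http://%s:%s/process\n' "$MARY_HOST" "$MARY_PORT"
}

# mary_say TEXT [OUTFILE]: ask the server to synthesize TEXT into a WAV file.
# curl -G with --data-urlencode puts the URL-encoded text in the query string.
mary_say() {
    curl -sS -G "$(mary_process_url)" \
        --data-urlencode "INPUT_TEXT=$1" \
        --data "INPUT_TYPE=TEXT" \
        --data "OUTPUT_TYPE=AUDIO" \
        --data "AUDIO=WAVE_FILE" \
        --data "LOCALE=en_US" \
        --output "${2:-speech.wav}"
}

# Example (needs a running server; aplay is one common way to play the file):
#   mary_say "Hello from the command line" hello.wav && aplay hello.wav
```

Typing a sentence, waiting for a file, and then playing it is workable for a script but far too slow for conversation, which is the gap a graphical frontend is meant to close.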

The three developers are working across two continents and three time zones in order to apply their skills to this project. We have an alpha, or at least a proof of concept, now. Basic text-to-speech function is working, but there are a lot of other features that will be built into the GUI. And again, so that there isn't any confusion, our project is a frontend for MaryTTS. As for a name, we'd like it to be easily recognizable as the frontend that it is. Right now, we are working with Voices4MaryTTS, but other names are also being explored. We welcome suggestions that we can use to name this application. Here is an example of what we are producing at this time:

You can download it here. Again, this is barely an Alpha production, so keep that in mind.

Is the Linux platform currently falling behind its other competitors in this area?

It has never even been a serious contender in the race. In my opinion, most TTS applications in Linux have remained in hobbyist mode since inception. And I'm sure that statement will chap the ass of many, but a simple comparison between all of the Linux programs using TTS vs. Mac, Windows, and even the mobile market will bear me out. Hopefully we can raise enough awareness to at least see some forward movement on TTS in Linux. Hopefully.

Texas Linux Fest
Speaker Interview

This article is part of the Speaker Interview Series for Texas Linux Fest. Texas Linux Fest is the first state-wide, annual, community-run conference for Linux and open source software users and enthusiasts from around the Lone Star State.

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.