Listen Up: Democratizing Voice Technology for More Personal Human-Machine Interaction

German composer and conductor Richard Strauss once said, “The human voice is the most beautiful instrument of all, but it is the most difficult to play.” Although Strauss was likely commenting on one of his opera singers, today’s technology firms are also encountering the highs and lows of employing the human voice to create more personalized interactions.

Voice recognition technologies can help create engaging user experiences by providing a natural, seamless way to interact with a broad range of devices. Voice is easier to use than typing or touch screens, lowering the barrier between us and our devices.

Most people are familiar with voice-enabled technologies such as Apple’s Siri, Amazon’s Alexa, Microsoft’s Cortana, and Google Assistant. They use these technologies to find information, order food, schedule appointments, play music, or watch a television program. In just a decade, voice technologies have evolved from an entertaining novelty to a commonly used tool for many consumer applications and certain enterprise solutions.

Inclusivity Limits

Despite this evolution, progress in deploying voice is not as great as many of us had hoped. The majority of voice technology initiatives are under the control of a few leading companies, who started early and have dedicated extensive financial resources to development. Most voice data is expensive and proprietary. And developers need an enormous amount of data to build voice recognition applications.

To maximize their potential user base and monetization opportunities, those companies focused on developing technologies in dominant languages such as English and Spanish. They built data sets to support voice recognition by machines. However, most of the voice samples were created by trained speakers – males communicating in their native language.

As a result, voice technology has not been particularly inclusive – in terms of the variety of speakers and the languages spoken. For example, an Austrian friend who speaks both English and German has trouble with her smart speaker. Because of her heavy Austrian accent, the virtual assistant doesn’t understand her when she speaks either language – no matter how clearly she enunciates each word. She relies on her children, native English speakers, to give commands.

Most of the leading companies will not build technologies for smaller, underrepresented languages. This is unfortunate, because language is important to people’s cultural and political identities. While English has become the lingua franca of the Internet, it’s not the same as having technologies in your own language.

By focusing on a handful of dominant languages for voice technologies, we risk losing much of the cultural richness of our interactions with the world. On a more practical level, extending the reach of voice recognition to less widely spoken languages could open the door to new innovation. Think about the regions where literacy rates are still relatively low and the need to read and write is a barrier to technology use. Our hypothesis is that voice technology could unlock huge digital potential for an audience that hasn’t been broadly included in digital knowledge until now. The potential upside for the digital economy could be profound.

Open Source of Innovation

Our solution to this challenge is to call on the open source community to help democratize voice technologies by improving voice recognition and natural-language processing algorithms. But there are clear barriers to open source innovation.

Developers need a voice technology stack, including a database of training data that teaches machines how to understand language. The more data, the better. The established companies have this data, and developers can license it. But it is typically available only in a limited range of languages. And if an application becomes successful, the licensing costs for the data can become prohibitively expensive.

To address these challenges, Mozilla created a technology stack to help make voice recognition open and accessible to everyone. Our database, Common Voice, is being created using an online platform that allows volunteers to read a sentence in their language.

The platform collects the voice samples into a single data set. Other volunteers check the work of contributors to verify and improve the quality of the collection. The data set currently includes hundreds of thousands of sentences – all validated samples – from more than 51,000 voices. And this data set is available to open source programmers. Any programming community that wants to begin building a language corpus in their native tongue can use this data set or even add to it.
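The review step described above can be pictured with a small sketch. It is only an illustration of how volunteer votes might be tallied, not the actual Common Voice implementation: the `Vote` record, the `validated_clips` function, and the two-vote minimum are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    """One reviewer's judgment of one recorded sentence (hypothetical record)."""
    clip_id: str    # identifier of the recorded clip
    reviewer: str   # volunteer who listened to the clip
    is_valid: bool  # True if the recording matches the prompted sentence


def validated_clips(votes, min_votes=2):
    """Return the sorted ids of clips accepted by a majority of reviewers.

    min_votes is an assumed threshold, not a documented platform rule.
    Only the first vote from each reviewer on a given clip is counted.
    """
    up = defaultdict(int)
    down = defaultdict(int)
    seen = set()
    for vote in votes:
        key = (vote.clip_id, vote.reviewer)
        if key in seen:
            continue  # ignore repeat votes by the same reviewer
        seen.add(key)
        if vote.is_valid:
            up[vote.clip_id] += 1
        else:
            down[vote.clip_id] += 1
    clips = set(up) | set(down)
    return sorted(
        clip for clip in clips
        if up[clip] + down[clip] >= min_votes and up[clip] > down[clip]
    )
```

In this sketch, a clip enters the validated set only once enough distinct volunteers have listened to it and most of them agree that it matches the sentence. That is one simple way a volunteer-run collection could keep its quality high without a central editor.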

As part of our effort to bridge the digital speech divide, we’ve also created an open-source automatic speech recognition engine, Deep Speech. The technology was developed to make speech recognition technology and trained models available to open source developers. By making voice data freely and publicly available, and ensuring that data represents the diversity of real people, we hope to make voice recognition technology better for everyone.

The results of our projects are encouraging and sometimes surprising. In the early stages of the project, for example, we began working in English. As we opened the project to other languages, we expected to see the fastest pace of growth in commonly spoken languages such as German, Spanish, and French.

But remember, language can be political. About two years ago, our fastest-growing language was Catalan – which is spoken in northeastern Spain, in Catalonia, where political conflict was high. One way the Catalan culture has always surfaced is by using Catalan instead of Spanish. Our project showed a big community movement rallying around Common Voice and contributing language samples in Catalan. That’s something we hadn’t expected.

We shouldn’t have been surprised. Software always has a strong cultural element, whether or not it’s recognized. Software is supposed to be neutral, but we should recognize that it also makes implicit cultural value statements and judgments.

A Level Playing Field

Whether we build business-to-business or business-to-consumer software, voice recognition technologies will be an essential part of technology for the foreseeable future. That becomes problematic when only a few companies hold the resources needed to deploy voice recognition into applications. What’s more, the fact that their large consumer base consistently interacts with their devices means that these companies are light years ahead of the competition in terms of collecting more diverse language samples.

Companies with no in-house voice technology become increasingly dependent on the four or five leaders. These firms can set the price of their technology and they can determine whether or not users should have any expectation of privacy for their data. That’s an issue – for Mozilla and for me personally.

We know that voice recognition technologies can listen to us at any time, even though they are only supposed to respond to “wake words.” Anonymized cloud recording is not always anonymous. We know that samples are collected and stored even when users have not spoken the wake word. In some cases, police have been given access to recordings for legal proceedings. All companies – and all individuals – should be concerned about these privacy issues.

Having an open source alternative that does not contribute to the dominance of these companies would be worthwhile. Technology becomes more innovative when it isn’t controlled by a small group of companies on a predetermined path. From our experience building the Firefox open source Web browser over the last 20 years, we know that some of the best product innovation comes from users and open source developers. Successful open source technology is, by its nature, available to a broad range of developers and entrepreneurs. It’s the fast track to a more vibrant ecosystem and a brighter digital future.


About Horizons by SAP

Horizons by SAP is a future-focused journal where forward thinkers in the global tech ecosystem share perspectives on how technologies and business trends will impact SAP customers in the future. The 2020 issue of Horizons by SAP focuses on Context-Aware IT, with contributors from SAP, Microsoft, Verizon, Mozilla, and more. To learn and read more, visit www.sap.com/horizons.



Katharina Borchert is chief open innovation officer at Mozilla Corporation.