Say What?! Voice Tech Speaks Business Fluently Now. Is Your Enterprise Ready?

Alexa’s recent and, thankfully, brief habit of breaking into unprompted maniacal laughter freaked out some users and made for a good story. Turns out the devices thought they heard the prompt “Alexa, laugh.” (The company disabled that prompt.)

A funny story, but one that neatly encompasses the state of voice tech. It’s cool, and it kind of works, but sometimes it doesn’t.

With sales of voice devices predicted to top 50 million units this year, voice is being touted as the biggest consumer tech disruption since the smartphone. And where consumer tech goes, enterprise follows, as we saw with BYOD, social media, and tablets. The global voice tech industry is expected to reach US$126.5 billion by 2023.

Businesses are taking note. Last January, JPMorgan Chase hired VaynerMedia as its agency of record for voice technology to help the finance giant set up its customer voice strategy. In late 2017, Amazon introduced its Alexa for Business service, which pairs Alexa devices with workplace software. Your future employees will have used voice tech since kindergarten. The voice revolution is coming.

Enterprise might be lagging behind consumer voice tech adoption, but 2018 could very well be the year when it begins to make itself felt in workplaces across the world. Why should executives care? One word: productivity. The ability of computers to convert voice to text using techniques like machine learning has quietly gained near-perfect accuracy. A study by researchers from Stanford University, the University of Washington, and Baidu USA found that voice input was nearly three times faster than typing and that error rates for the two input methods were nearly indistinguishable.

Further, voice is emerging as a powerful enabler for two other technologies that are hovering around the edges of the enterprise: augmented reality (AR) and virtual reality (VR). AR-equipped glasses have already made inroads into places like the warehouse, where a peek into the upper corner of a lens lets pickers find packages while leaving their hands free to work faster. Companies are already adding voice to the picture, ratcheting up productivity even further. Mixed reality apps—a combo meal of voice, AR, and VR, if you will—will hit $9 billion by 2022, according to Juniper Research, and most of that will be driven by voice.

So voice isn’t just for laughs anymore. But how far will it go in business and how will those changes manifest themselves? Will we say goodbye to keyboards, ciao to paper, never again to resetting a password?

There are still some pretty serious barriers to adoption that need to be tackled before voice becomes truly integrated into a business environment. And while spontaneous evil laughter can be funny, the room for error in consumer products simply does not exist in the enterprise sphere. That’s why it’s important to look at voice’s promise and challenges now, before unprompted hilarity causes a business disaster.

The Impending Enterprise Voice Tech Tsunami

Recent advances in voice tech mean that it can enable a more natural way to interact with computers and machines. Most people are already used to chatbots. Improvements in machine learning, artificial intelligence, and natural language processing all feed into voice. And its potential for improving accessibility for those with disabilities is enormous (see “Voicing a New Level of Accessibility”).

Enterprise technology trails consumer tech by about 18 months, says Mark Plakias, ex-vice president of knowledge transfer at telecom company Orange Silicon Valley, meaning voice is on track to hit offices this year. “The technology is going to continue to improve, the algorithms will improve, and there will be more functionality shipping with these devices because there will be more third-party apps,” he says.

Voice won’t replace all the other technologies you’re already using; rather it will likely be an add-on. The future of user experience will be multimodal, involving a combination of screens, AR, VR, voice, chat, stylus pens, and gestures.

An example is giving directions—a combination of text, voice, and visual works best, with visual leading the cast as the primary interface. How best to combine the various elements should be a case-by-case decision, says Alexander Rudnicky, a professor at Carnegie Mellon University and part of its Speech Group and Language Technologies Institute. The key is to ensure, as with giving directions, that you’re making the best choice of primary and secondary interfaces for each scenario. “Certain people like me need to always step back and think about what it is that the human actually needs in this situation rather than what sounds cool.”

Not every situation calls for the full menu of interface options, however. In cases where users are choosing between reading and voice, chatter won’t always win out. In some circumstances, reading will still be more efficient or convenient; in others, voice will be. For example, visually scanning an e-mail inbox is still the best way to figure out what’s important—does anyone want to listen to every e-mail? But voice might be the tool of choice when responding. The ability to switch between the two modes could help tame the e-mail beast.

As with any new technology in the workplace, there will be a period of building trust that voice is reliable, a help and not a hindrance. Voice comes with additional complications: How comfortable will employees be speaking an important memo to a voice-to-text application instead of typing it? Perhaps not very, at first. And especially not in front of colleagues.

Voicing a New Level of Accessibility

For some, voice tech might seem a convenience. For those with disabilities, it could change their entire career.

Voice tech in the workplace could transform accessibility—and propel some specialized innovations out of research labs and into offices to help people with problems such as vision impairments. Yet while voice tech is looking very promising as an assistive technology, developers must think about all aspects of accessibility from the get-go or entire protected classes of employees could be excluded. For those for whom English is not their first language, for example, not being understood by a voice system could be a big disadvantage at work. “I think we have to be abundantly aware of the unintended consequences and in particular how some people might be left behind with voice,” says Sara Holoubek, CEO of New York–based innovation and strategy consultancy Luminary Labs.

The Way You Say

We might take language for granted, but it’s actually an extremely complex activity. One of the big challenges of language technology design is predicting the type and style of language used within a particular context, says Rudnicky. We tend to talk in particular ways in specific instances. Language is also highly variable: we can talk about the same thing in many different ways. Most of us speak one way at home and another at work, and even that can differ depending on who we’re speaking to (a colleague? the boss?) and what about. Researchers used to simulate a context and conversation, then use the resulting transcripts as the language and grammar models for systems to learn from. “Unfortunately,” says Rudnicky, “that would be something that never really ends because there’s always some other way of saying something.”

Newer, more streamlined techniques instead measure how far a new utterance sits from previously seen language to infer its meaning; in other words, they match what was said to the correct intent. While machine learning, for example, has minimized the challenge of data analysis for language models, “you still need to know what people are talking about and their intent,” he says, which means general conversation remains a challenge (see “How Tech Learns to Talk”).
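
To make the idea concrete, here is a minimal sketch of intent matching by similarity, assuming a toy bag-of-words representation; the intent names and example phrases are invented for illustration, not drawn from any real system.

```python
# A minimal sketch: match an utterance to the closest known intent by
# cosine similarity over bag-of-words vectors. Intents and phrases are
# hypothetical; real systems use learned embeddings and far more data.
from collections import Counter
import math

INTENT_EXAMPLES = {
    "book_meeting": "schedule a meeting with the team tomorrow",
    "check_inventory": "how many units do we have in the warehouse",
    "file_report": "file my weekly expense report",
}

def vectorize(text):
    # Bag of words: count each token, ignoring order.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def match_intent(utterance):
    # Pick the intent whose example is "closest" (most similar).
    vec = vectorize(utterance)
    return max(INTENT_EXAMPLES,
               key=lambda i: cosine(vec, vectorize(INTENT_EXAMPLES[i])))

print(match_intent("can you set up a meeting for tomorrow"))  # book_meeting
```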

How Tech Learns to Talk

Giving voice to ones and zeros requires a combination of methods.

Teaching tech to talk is no small thing. Here are the major elements that make it happen:

  • Natural language processing. NLP is a field in AI that sits at the intersection of computer science and computational linguistics.
  • Statistical models. Researchers used to go by a set of rigid language rules embodied in grammars but now use a more flexible statistical approach that assigns probabilities to different interpretations of an utterance: a more realistic way of thinking about language and how we use it (see the sketch following this list).
  • Language components. To improve recognition accuracy and devise appropriate spoken responses, voice tech analyzes various aspects of how we talk, including grammar, syntax, word choice, sentiment, semantics, vocabulary, use in context, and error identification and correction.
  • Conversational interfaces. These are systems that can manage an interaction with a human.
  • Natural language understanding. One of the biggest challenges for AI, NLU needs to deal with the messiness of language—all the slang, mistakes, and new words we exchange and invent.
  • Interactive language learning. A newer approach and a move away from statistical models, interactive learning uses interactions with humans to teach AI.
  • The Turing test. Computer scientist Alan Turing’s original test judges whether a machine is good enough to fool people into thinking it’s human.
  • The Winograd Schema Challenge. This update to Turing’s test, launched in 2016, is a multiple-choice test of machine intelligence. At the inaugural running, the highest score was 58%.
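
As one concrete illustration of the statistical-models idea above, here is a toy bigram model that assigns probabilities to two transcriptions a recognizer might have to choose between; the training corpus and candidate phrases are invented for this sketch.

```python
# Toy bigram language model with add-one smoothing. It scores competing
# transcriptions of the same audio; the corpus and candidates are invented.
from collections import Counter

corpus = (
    "please recognize speech in this meeting "
    "the system can recognize speech quickly "
    "we walked along a nice beach yesterday"
).split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = len(unigrams)

def prob(sentence):
    # Multiply smoothed bigram probabilities across the sentence.
    words = sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
    return p

# Acoustically similar candidates end up with very different probabilities:
for candidate in ("recognize speech", "wreck a nice speech"):
    print(candidate, prob(candidate))
```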

Call centers have been using voice recording, mining, and sentiment analysis for years, says Plakias, but that works only because it deals with a limited range of conversations. Moving from call centers to business meetings is the next challenge. A voice system needs to identify speakers, what’s important, what’s chatter, what directions are given, and many other variables. Plus, as Plakias points out—no surprise here—it’s not unusual for people to be highly distracted in meetings.

“The thing with meetings is they’re not like fixed tasks that you can predict,” says Rudnicky. “People talk about whatever they’re going to talk about, and trying to understand what happens in a meeting is a more difficult problem.”

Right now, voice tech can listen in and take some commands. The next step will be voice tech with the ability to summarize an entire meeting on its own. That’s really difficult to do. “Most AI experts will say that kind of level of reasoning is years away,” says Plakias.

And if to err is human, the good news is that our errors are valuable to researchers like Rudnicky. “Errors are really important because they keep happening,” he says.

But errors are difficult for AI and voice tech because they a) need to be identified and b) need to be corrected. Implicit confirmation—a verbal repetition, like how a waiter repeats an order back to us before heading off to the kitchen—is one method of working with errors. “You want people to have some idea of what’s going on in the machine’s mind just like you do when you’re talking with somebody,” says Rudnicky. During a conversation with another human, “you want to keep track of what they’re thinking, which you do by basically inferring from what they’re saying.” The AI running voice tech needs to be able to do the same thing.
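
A rough sketch of how implicit confirmation might look inside a dialog manager follows; the prompts, confidence threshold, and function name are all hypothetical.

```python
# Sketch of implicit confirmation: the system restates what it heard inside
# the next prompt, so the user can correct an error without a separate
# "did you mean...?" turn. Threshold and wording are illustrative only.
def next_prompt(heard_item, confidence):
    if confidence < 0.5:
        # Low confidence: fall back to explicit confirmation.
        return f"Did you say {heard_item}?"
    # Implicit confirmation: echo the value in the follow-up question,
    # like a waiter repeating an order back before heading to the kitchen.
    return f"OK, {heard_item}. What time should I schedule it?"

print(next_prompt("a meeting with finance", 0.9))
# -> OK, a meeting with finance. What time should I schedule it?
print(next_prompt("a meeting with finance", 0.3))
# -> Did you say a meeting with finance?
```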

Designing voice for the best user experience is unfamiliar territory because we’re so used to seeing and touching, and UX has been designed around those actions. Information on a screen is presented with context—text, graphics, and so on—which helps anticipate what the user might want to do next, guides the next action, or provides some sort of focus. But voice is like a blank canvas, so designers must compensate for the absence of those cues.


Discerning a user’s true intent is tricky. We’re used to telling systems what to do, but we shouldn’t reflexively go to the other extreme, where the system takes the lead and anticipates every action. Microsoft’s Clippy serves as an early example of something that thought it knew what you wanted to do and rarely did (and thus became a fail meme ahead of its time). There’s a middle ground that takes into account that machines are much better at learning than before (and continue to improve) but that humans are very, very good at it. Researchers recently judged that the smartest AI has the IQ of a six-year-old.

Sara Holoubek, CEO of New York–based innovation and strategy consultancy Luminary Labs, thinks the skills that are available and popular with consumer voice tech, like travel planning and information retrieval, could lead to the development of enterprise skills. Today, Alexa, which has over 10,000 skills, might manage a playlist; tomorrow it might manage a company’s digital asset management system or sort through stacks of résumés to find candidates.

Voice tech will make it easier to file a report or make a request, particularly in an environment, such as healthcare or construction, where you need to keep both hands free. Physicians are leaders in using voice in the workplace, developing smart speaker applications that surface information on symptoms, treatments, and patient records (although it should be noted that consumer voice devices are not yet HIPAA compliant). Hospitals, including Boston Children’s Hospital and Beth Israel Deaconess Medical Center, have voice initiatives that are looking for ways to use voice to help patients during their hospital stays. And voice is already being used in ambulances, for example, to help medics determine treatment protocols on the way to the ER. “Any type of search function inside the organization would benefit greatly from voice,” says Holoubek.

Voice is also moving beyond its grindingly annoying role as gatekeeper of call tree hell. It is becoming a more active and helpful part of the customer experience with voice-enabled products and services, like helping with product information. It will also reduce the friction of querying a database, like a CRM system, because search by voice can surface more results, enable complex data dives, and return answers more quickly than typing. Using voice for search is also a very natural thing to do—it is, after all, how we ask questions of each other.
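
As a rough sketch of what voice-driven CRM search could look like, here is a toy example that maps a transcribed query onto a parameterized SQL statement; the schema, table, and single phrasing rule are assumptions for illustration, and real systems need far richer language understanding.

```python
# Toy voice-driven CRM search: a transcribed query is mapped onto a
# parameterized SQL query. Schema and phrasing rule are hypothetical.
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, region TEXT, revenue REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)", [
    ("Acme Corp", "EMEA", 1200000.0),
    ("Globex", "APAC", 800000.0),
    ("Initech", "EMEA", 310000.0),
])

def voice_search(transcript):
    # Toy rule: handle only "show me accounts in <region>" phrasings.
    m = re.search(r"accounts in (\w+)", transcript.lower())
    if not m:
        return []
    # Parameterized query: the spoken value is never spliced into the SQL.
    rows = conn.execute(
        "SELECT name, revenue FROM accounts WHERE lower(region) = ?",
        (m.group(1),),
    )
    return rows.fetchall()

print(voice_search("Show me accounts in EMEA"))
# -> [('Acme Corp', 1200000.0), ('Initech', 310000.0)]
```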

“It’s going to take a lot of work and improvement for voice technology to get to the point where we can try an application like database searching, which is precisely why new technology always starts on the consumer side,” says Holoubek. “There’s a lot less risk in piloting something with a consumer base.”

One problem is our world of data. It’s like the final warehouse scene in Raiders of the Lost Ark: so many crates, with financial, logistics, supply chain, and CRM data each isolated in its own box. Today, entire analytics or data operations teams work on data to make it useful. For voice to work, data will need to be accessible and organized so that meaning can be extracted for a wide variety of searches and applications.

Different consumer devices already sit inside “walled garden” ecosystems, says Dan Miller of Opus Research, which specializes in voice technology, with skills developed for specific operating systems. He thinks the systems that enable integration and customization with “killer skills” will likely come from outside the current big players. Earlier this year, Amazon relaunched its developer skills console to make it easier for developers to create and test skills. Remember “There’s an app for that”? Now, skills are where it’s at.

Fair Playback

Bias and emotional intelligence are other challenges for voice tech.

We already know that AI can pick up a lot of biases because it learns from its creators and users. So enterprises need to think about how that could manifest in the workplace, says Holoubek. A voice system screening résumés for recruiters, for example, might overlook a particular group of candidates because of bias in its training data.

The source of machine-based emotional intelligence (EI) has traditionally been sentiment data from call center analytics, says Plakias. But in that scenario people tend to be either happy or upset, a binary emotional landscape. Regular interactions between people don’t tend to be quite so clear cut. Voice tech’s AI will need more emotional intelligence to respond correctly, personalize interactions, and, simply, encourage employees to interact with it. It’s about comfort. We’re only in the beginning phases of creating AI with EI, but a first step, an emotionally intelligent chatbot, was revealed last year.
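
To illustrate the gap between a binary emotional landscape and something richer, here is a toy scorer that rates an utterance against several emotion word lists rather than just happy/upset; the lexicons are invented stand-ins for real sentiment resources.

```python
# Toy multi-class emotion scorer: instead of a happy/upset binary, score an
# utterance against several small emotion lexicons. Word lists are invented.
LEXICONS = {
    "frustrated": {"again", "still", "waiting", "broken"},
    "confused": {"unclear", "why", "how", "lost"},
    "satisfied": {"great", "thanks", "perfect", "works"},
}

def emotion(utterance):
    words = set(utterance.lower().split())
    # Count lexicon hits per emotion and return the best match.
    scores = {label: len(words & lex) for label, lex in LEXICONS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "neutral"

print(emotion("why is this still broken after all that waiting"))  # frustrated
print(emotion("see you at the standup"))                           # neutral
```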

See Me, Hear Me

But what will really take voice tech to the next level is combining it with augmented and virtual reality. Pairing AR or VR with a conversational user experience could create compelling experiences without the interference of the mouse and keyboard.

Imagine onboarding new employees using voice tech and AR. Two weeks before they start work, the employer sends AR headsets to their homes, and whenever they want, they can tell the system to give them a tour of the new building and offices and walk them through the entire space—“Take me to the coffee station!” New employees will be able to become familiar with the environment before they’ve stepped a (real) foot inside.

It could also introduce coworkers, enable training sessions, expedite workspace setup, and remove the dreaded first-day discomfort and confusion (no one really likes asking where the bathrooms are). It will be a completely new experience that will be more efficient and even interesting and exciting for the new employee.

Maintenance projects could be completed more quickly and safely with the use of AR and voice tech. An AR headset could let a maintenance person identify which machine isn’t working properly, diagnose the problem, virtually locate and try out replacement parts, check whether they’re on hand, and generate a purchase order for them if not. The entire process could take a handful of minutes.

Privacy and Security Challenges

Yet as voice gathers momentum, it arrives on the scene at a time of heightened concerns about security and how data is used. As Candid Wüest, principal threat researcher at Symantec, says, “If you build it, they will hack it.” Already, fingerprint and voice security systems have been tricked. Spoofing a voice is more difficult, he says, but not impossible. Since doing so would require a voice sample, public-facing executives would be more likely targets than rank-and-file employees. “It is a risk that has to be considered,” he says.

And as with other biometrics, once a voice pattern is copied, there’s no going back. The security priority, says Wüest, is to implement systems that can distinguish between a live voice and a recording. Randomization, in which randomly generated voice snippets or phrases are used only once, is a good idea. Most current voice security applications are for authentication; using voice in sensitive environments might call for combining factors, such as a PIN plus voice, he says.
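
Here is a minimal sketch of that randomization idea, assuming the speech recognition itself happens elsewhere; the word list, phrase length, and matching rule are all invented for illustration.

```python
# Sketch of a one-time challenge phrase: a replayed recording from an old
# session fails because the phrase never repeats. Illustrative only; a real
# system would pair this with speaker verification and liveness detection.
import secrets

WORDS = ["amber", "falcon", "river", "cobalt", "maple", "orbit", "tundra"]

def new_challenge(n=3):
    # secrets provides cryptographically strong randomness for the phrase.
    return " ".join(secrets.choice(WORDS) for _ in range(n))

def verify(challenge, spoken_transcript):
    # Accept only a fresh transcript that matches the one-time phrase.
    return spoken_transcript.strip().lower() == challenge

challenge = new_challenge()
print("Please say:", challenge)
print(verify(challenge, challenge))       # live user repeating it -> True
print(verify(challenge, "amber falcon"))  # stale or wrong phrase -> False
```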

Will security and privacy concerns mean the return of the private office? Apart from the noise factor, enterprises need to consider where sensitive information is accessed via voice. Microphones have gotten smaller and better, and open-plan offices reign.

It will be challenging to determine what is private and what can or should be shared, both internally and with outside vendors and clients.

Virtual agents work better the more they’re used (that is, the more they know about an individual), which means considering how information is used, guarded, and owned, says Miller. “We’re mapping to treat the spoken word as an asset.”

The European Union’s General Data Protection Regulation, which went into effect in May, contains provisions on biometric patterns. Data that can be used to identify an individual must be secure, and sensitive personally identifiable data has more stringent security requirements around access and storage, says Wüest. “This hopefully will increase the security around how this information is stored and handled,” he says.

Transparency is the best policy, adds Wüest, which means being open about when microphones are on or off; what’s stored, for how long, and so forth. Both employees and clients might be sensitive to the storage of their voice (only a few attributes of their voice are actually stored, but still). “A very important part is that they inform all the users and clients openly,” he says. “Telling them what they are going to store and how it’s going to be used because if it’s just kept secretly then everyone will have their suspicions and kind of think, ‘Oh, they’re recording everything I say.’”

It Pays to Speak Up

The ultimate test for enterprise voice tech might not be the high-level, big-picture applications, but something simpler yet just as important—employee satisfaction. Voice could help restore some much-needed work-life balance by making employees more proactive, better able to plan out their workdays, and more productive. The hope is that as voice assistants become smarter, they’ll handle many quotidian tasks, and integrations across data silos will improve cross-organizational communication, cooperation, and efficiency. No more late nights slogging over data reports, in other words.

The point of voice technology is to make working better—more efficient, less stressful, safer, and maybe even fun—and coming innovations in voice will be part of the future of work. “Voice should not be used as a way to engineer humans out,” says Holoubek. “It should be a way to lift them up and to use all that is great about humanity to do business and do business well.”


Samir Patel is lead product manager for SAP Conversational AI.
Timo Elliott is an innovation evangelist for SAP.
Danielle Beurteaux is a New York–based writer. 


This story originally appeared in the Digitalist Magazine, Executive Quarterly.