Voice Input: The Interface Problem – UX Collective

It’s not that I don’t believe in voice input. I like the idea. Lots of people like the idea. We persist in our unsatisfying and solipsistic conversations with Siri, Alexa, and Cortana because we like the idea. “Ok Google… okay… fine, I guess.” An estimated 47.3 million Americans have access to a smart speaker. So something is going on with voice input technologies. But let’s be very, very blunt: conversation is not happening. Understanding is not happening. Oh, you can brute force your way to buying laundry detergent with a sufficiently advanced neural network. But you’re not talking.

My issue with voice input is not the technology, it’s the interface. I’m not talking about Amazon’s SDK system of “skills” or any of the other similar systems offered by Microsoft, Google, or Apple. There is something charming about the logical and mechanistic approach to language implied by the design of voice input systems. (There are subtle differences between each company’s approach but the fact that companies exist that can syndicate skills out to multiple systems indicates that they are basically built on the same model.)

In these systems, the intent of the spoken command may be high or low utility. The user’s utility doesn’t matter. High utility tasks are focused and specific, while low utility tasks are vague and hard to decipher. The system determines the utility of the user’s words according to its own needs. Utterances themselves may be highly variable, but they are all acceptable if they contain certain keywords that move the dialogue towards a definite objective. Try speaking the following question to your smart speaker: “Alexa, snorft is ra weather im New York today?” Alexa has no more idea what “snorft” means than I do, but she will dutifully answer. Interactions are quickly herded into pre-existing objectives and, provided the necessary slots are filled — weather, New York, today — your smart speaker will disregard the meaningless words, because most of your words are not intelligible to smart speakers.
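To make the herding concrete, here is a minimal sketch of slot filling by keyword. It is not how Alexa or any real platform is implemented; the vocabularies and the function name `fill_slots` are my own assumptions for illustration. It only shows how a system can answer a question while ignoring most of its words.

```python
import re

# Hypothetical vocabularies for a toy "weather" intent. Real platforms use
# trained models and far larger slot definitions.
INTENT_KEYWORDS = {"weather"}
CITIES = ["new york", "london", "chicago"]
DATES = ["today", "tomorrow"]


def fill_slots(utterance):
    """Return the slots found in the utterance, ignoring every other word."""
    # Lowercase and replace anything that is not a letter or whitespace.
    text = re.sub(r"[^a-z\s]", " ", utterance.lower())
    words = text.split()
    normalized = " ".join(words)

    slots = {}
    if INTENT_KEYWORDS & set(words):
        slots["intent"] = "weather"
    for city in CITIES:
        # Crude substring match, good enough for a toy example.
        if city in normalized:
            slots["city"] = city
            break
    for date in DATES:
        if date in words:
            slots["date"] = date
            break
    return slots


print(fill_slots("Alexa, snorft is ra weather im New York today?"))
```

In this sketch "snorft", "ra", and "im" never reach the part of the program that answers the question. Only the three filled slots do.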

These systems don’t deal in language as most linguists would define it. It’s a false model — it’s the type of logical theory that some gentleman-philosopher in London’s Royal Society might have created in the age of Newton. But who really cares if that’s not accurate? Language is weird and not optimized for commercial purposes. Most of the companies developing applications on their system would prefer that user interactions be reduced to the simple Boolean outcome of purchase (true) or not-purchase (false). They pursue a false model of language in which each statement should have a logical flow and intent because it serves their purpose in structuring a transaction of goods or information. (To understand the problems with this approach, I recommend Josh Ziegler’s “Communication is hard”. You can find it here, here, and here.)

But real people don’t talk that way. This incompatibility between the transactional needs of the systems and the non-logical, unstructured form of human speech leads to errors and misunderstandings. Eliminating error and misunderstanding is the biggest challenge faced by the designers of voice input systems. Companies designing apps for Alexa and her sisters are encouraged to create a strong error strategy — to assume that misunderstanding, lack of understanding, and not hearing are the default outcomes.

If you design your system to prevent the possibility of misunderstanding, you’re actually recreating a breaching experiment. Sociologist Harold Garfinkel conducted breaching experiments in the 1960s to study the structure of social norms. In a breaching experiment, social norms were flagrantly violated by an experimenter in order to examine other people’s reactions. In one infamous example, Garfinkel encouraged his graduate students to experiment on their friends by asking for repeated clarification on statements with obvious meanings. Here’s a selection from a transcript from one of these conversations:

S: Hi Ray. How is your girlfriend feeling?

E: What do you mean, “How is she feeling?” Do you mean physical or mental?

S: I mean how is she feeling? What’s the matter with you?

E: Nothing. Just explain a little clearer. What do you mean?

S: Skip it. How are your Med School applications coming?

E: What do you mean, “How are they?”

S: You know what I mean!

E: I really don’t.

S: What’s the matter with you? Are you sick?

Garfinkel noted that the unwitting participants in such experiments (his students’ friends) typically reacted with rage to the repeated requests for clarification. An important aspect of all human conversations is the assumption of shared understanding. People get really angry when they are asked to clarify statements that seem obvious to them.

This is exactly what voice input technologies are designed to do. Because they need all interactions to lead to a transaction, they attempt to clarify the details of those transactions. In the process, they must strike a balance between clarifying everything and disregarding those aspects of a statement they don’t understand.

Natural Language — A Crappy Interface

Voice input platforms are making the best of a bad situation. They have an interface problem. The interface that isn’t working is natural language. Natural human speech is a crappy interface for voice input technologies. The reason is simple: human speech was developed to be understood by another human being. Attempting to get a computer to understand speech at the level of a sugar-addled toddler is fiendishly difficult.

Most people imagine that human speech is a series of complete sentences intended to convey a coherent thought to an interlocutor, like dialogue in a novel or an Aaron Sorkin script. Here’s a transcript of a scene from the television show The West Wing in which fictional President Jed Bartlet speaks with a soon-to-retire Supreme Court justice named Joseph Crouch:

BARTLET: We’ll make our announcement on Thursday.

CROUCH: You’ve decided on Harrison.

BARTLET: I haven’t made a decision yet, Joseph.

CROUCH: You’ve made the call. [beat] Did you even consider Mendoza?

BARTLET: Mendoza was on the short list.

CROUCH: Mendoza was on the short list so you can show you had an Hispanic on the short list.

BARTLET: That’s not true, Joseph.

If that were the form of most human speech, recognizing the content and mirroring the structure would be pretty easy. Most people think they actually talk this way. But we don’t. We trail off. We drop words. We speak in short bursts full of word-soup, cliché, and verbal tics. We are functionally incoherent, except to other human beings.

Here is the transcript of an actual President talking to his actual lawyer:

COHEN: And, I’ve spoken to Allen Weisselberg about how to set the whole thing up with …

TRUMP: So, what do we got to pay for this? One-fifty?

COHEN: … funding. Yes. Um, and it’s all the stuff.

TRUMP: Yeah, I was thinking about that.

COHEN: All the stuff. Because — here, you never know where that company — you never know what he’s —

TRUMP: Maybe he gets hit by a truck.

COHEN: Correct. So, I’m all over that.

Sadly, the reality of presidential conversation does not live up to the fictional version. Any natural language processing technology would struggle to parse a single detail from that conversation. The entire thing depends on context and shared knowledge of unstated assumptions. What is the intent? What is the flow?

Voice input platforms try to solve this problem by focusing on certain key details called “slots” and clarifying them before moving on to the next key detail. The result is an interaction that has the most superficial form of human speech — alternating interlocutors — without any of the deep structures that make human speech meaningful or useful. The only context is the assumption that the interaction should end with either a simple command (“Play the Talking Heads”) or a commercial transaction (“Buy Tide detergent”).
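The slot-by-slot clarification described above can be sketched in a few lines. This is a toy under stated assumptions: the slot names, the prompts, and the functions `first_missing` and `run_dialogue` are hypothetical and do not come from any vendor’s SDK.

```python
# Hypothetical required slots and prompts for a toy "buy" intent.
REQUIRED_SLOTS = ["item", "quantity"]
PROMPTS = {
    "item": "What would you like to buy?",
    "quantity": "How many do you want?",
}


def first_missing(slots):
    """Return the name of the first unfilled required slot, or None."""
    for name in REQUIRED_SLOTS:
        if not slots.get(name):
            return name
    return None


def run_dialogue(slots, answers):
    """Ask about one missing slot per turn, filling it with the next answer.

    `answers` stands in for the user's spoken replies. Whatever the user
    says is poured into the slot that was just asked about.
    """
    slots = dict(slots)
    transcript = []
    for answer in answers:
        missing = first_missing(slots)
        if missing is None:
            break
        transcript.append(("system", PROMPTS[missing]))
        transcript.append(("user", answer))
        slots[missing] = answer
    return slots, transcript


filled, turns = run_dialogue({"item": "Tide detergent"}, ["two"])
print(filled)
print(turns)
```

The transcript alternates speakers, which is the only thing it has in common with a conversation. The loop knows nothing except which slot is still empty.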

The solution to this problem is simple, but remains unacceptable to the companies that build these platforms: abandon natural language.

Beeps, Hums & Clicks

When I remove a MetroCard from my wallet and swipe my entrance to the New York City subway, I am treated to three possible noise outputs. A single beep tells me to proceed. A double beep tells me to swipe again. And a triple beep tells me that I have insufficient funds on my card to proceed. This is, I think we can all agree, a pretty crappy communication system. In the naturally sound-rich environment of a New York subway station, I am unlikely to be able to distinguish the number of beeps my swipe elicited. Many sound designers have pointed out that varied tones would improve comprehension, just as the same sentence can be either a statement or a question based on whether the voice rises at the end.

Tone communicates a huge amount to human beings. If you have ever listened to people speaking a language you did not understand, you know how much you can comprehend through tone alone. Guitarists use these quirks of tone in language to make their guitars appear to “speak.” A few notes played in the correct order can handle the simple communications of voice output more efficiently than Alexa’s disturbing absence of vocal affect. Smart speakers attempt to create a rough approximation of the rise and fall of human speech. In the process, they create a kind of tonal uncanny valley.

What if, instead of words, smart speakers allowed users to interact via other vocalizations like beeps, hums, and clicks? Trust me, I know how stupid that sounds. Most people would agree that no user would willingly learn a bunch of weird whistles or throat clicks just to order pizza or play a song.

As crazy as it may sound, I’m going to question that assumption. Let’s look at the most successful new interface of the last 20 years — the touchscreen on your smartphone. In order to interact with this interface, you learned a new system of gesture. These gestures only seem intuitive in retrospect. You needed to learn to pinch and swipe, just like you needed to learn to use a mouse or a keyboard. Each one demands users master an artificial system of learned behaviors.

Would it really be so difficult to click or hum to interact with a smart speaker? Is that any more artificial than doing a reverse-pinch on a picture? The Khoisan language family of sub-Saharan Africa offers more noises than any other, including the variable clicks of the !Kung language. Vocalizing beeps, hums, or clicks would seem weird and arbitrary at first. But there’s a benefit to this weirdness. The artifice of the act would remind us we were interacting with a computer and not a person.

Spoken language existed for hundreds of thousands of years before writing was invented. Writing has always been artificial. It resembles spoken language but it was designed to communicate at a distance. Intent, utility, and flow are meaningful concepts in written language. There was always a reason to write something down. But spoken language is different. We speak the way dogs sniff at each other’s nethers. We fill the silence. We reassure. Spoken language is a reflection of our social nature.

If you want voice input to work really well, you need to remind users that they aren’t using spoken language. They are vocalizing with a purpose. Just like writing is an artificial form, this type of vocalizing needs to be artificial. Throw a couple beeps and clicks into a sentence and you know you’re not “just chatting.”

African Talking Drums

I doubt that any company is going to follow my advice on introducing non-speech vocalizations to voice input. It’s just too weird. All of the major players in the market have already committed themselves intellectually to the idea of natural language processing.

One of the dirty secrets of the technology industry is that companies don’t pursue the technology, they pursue ideas. Once they have committed themselves to an idea, they will not reverse course even if the idea proves itself wrong. Spoken human language integrates poorly with the structure and dependencies of computer code, but Amazon, Google, Microsoft, and Apple will continue to kluge together an approximation of human speech. They are unlikely to reverse course and agree that they are building sound-input interfaces rather than language-input interfaces. It’s an ego thing.

I still believe that platform-specific vocalizations will naturally emerge, just as every communications technology has evolved unique forms of expression. From telegraph messages to Slack discussions, the form and limitations of the technology always stimulate our normal human inventiveness around language. Maybe it won’t be clicks and hums, but we will not end up talking to smart speakers like we talk to people.

Right now, all of these technologies operate on a similar model to Morse code. The sound is translated into written speech. Morse code is not a language like American Sign Language or French. It is a way of translating the alphabet into easily transmissible sounds or pulses. Similarly, Alexa takes sound waves and attempts to translate their patterns into written language which the system then uses to distill intent and fill slots. Not so long ago, this would have been close to impossible, but modern machine learning is very good at recognizing patterns in sound waves.

There are a lot of steps from spoken language to written language and then back to spoken language. Even if both the interlocutors were human, the potential for error and misunderstanding exists, just as error and misunderstanding were relatively common during the telegraph era.

Which brings me to African talking drums. For what follows, I am indebted to James Gleick’s “The Information: A History, a Theory, a Flood.” In this fascinating book on the history of information, Gleick talks about the pre-modern transmission of information on the African continent by means of drumming. Naturally, most European visitors to Africa during those years considered the drumming merely a primitive affectation of the natives. They had no idea that information was being communicated long distances via sound. Eventually, a missionary named Roger T. Clarke made the effort to learn the language of the drums.

The first thing that Clarke noted once he was able to understand the drumming was that the messages seemed unnecessarily poetic and lengthy. For example, when a drummer wanted to communicate “come back home” to a friend or family member, he would drum the following words: “Make your feet come back the way they went, make your legs come back the way they went, plant your feet and your legs below, in the village which belongs to us.” Was this due to a peculiarity of the native African language or culture (following Sapir-Whorf)? Or was this rhythmic verbosity just a function of the inherently musical nature of drumming? Actually, it was neither.

Even for a talented drummer, there was a limit to the number of sounds that could be played and carried over long distances. This meant that certain tones and beats needed to represent more than one thing, just like our homonyms and homophones. We can differentiate “two,” “to,” and “too” in spoken language by listening to the context. The “speakers” using African talking drums created context for their percussive homonyms through redundancy. Even if the listener missed a couple beats, redundancy allowed them to get the flavor of the message that was being sent. So “don’t be afraid” was rendered as “Bring your heart back down out of your mouth, your heart out of your mouth, get it back down from there.”

Redundancy doesn’t work so well as a solution to the myriad misunderstandings of an Alexa or a Siri. One of the ostensible advantages of these technologies is that they save you time. Instead of laboriously typing out a task, you just “Ok Google” and you’re done. Personally, I doubt that saving time is actually a benefit for voice input. Convenience seems to me a stronger motivator. Sometimes my hands are busy and sometimes I would prefer not to break the flow of social interaction with the inherently anti-social action of bending over my phone. We don’t talk to each other to save time (sometimes we talk to each other to waste time) so I doubt users would expect verbal efficiency from a smart speaker.

Redundancy might be useful if only we could get users to repeat their requests in two or three different ways. That’s unlikely. But redundancy doesn’t need to be expressed by the user alone. You can build redundancy into the interactions by spreading it across both interlocutors. Listen to the way that your smart speaker already repeats back your requests with slight variations. It is creating redundancy and increasing your opportunities to spot any errors in understanding. The more context the system can place around the interaction, the greater the likelihood that misunderstanding is avoided. The designers of these systems try to eliminate redundancy. They use machine learning to try to figure out what you mean on first utterance. It would be better to use machine learning to figure out how much redundancy is required to eliminate most errors. Smart speakers are going to become more important and be used for more and more things. We need to abandon our illogical focus on efficiency if voice input is ever going to be as useful as typing.

Imagine that redundancy in communication takes the form of a wave. Small redundancies would have a slight amplitude — little difference between the peak and the trough of the wave. The smallest redundancy that a smart speaker could offer is simple repetition.

“Alexa, get a new package of kleenex.”

“Okay, I will get a new package of kleenex.”

This tiny redundancy should be sufficient to eliminate many errors of communication. But in the event of misunderstanding, it actually compounds the problem. It leaves both the user and the system with the impression that they have been heard and understood when, in fact, the misunderstanding persists.

To eliminate miscommunication you need to increase the amplitude of the redundancy by adding variations. This can be done by the substitution of synonyms — “okay, I will purchase a package of tissues.” Or it can be done by adding contextual information to the response — “okay, I will purchase the same kind of tissues you bought last month from Amazon.” The greater the difference between the initial request and the redundant reply — the larger the amplitude of the redundancy — the easier it is to spot misunderstanding. In the event of miscommunication, a high amplitude redundancy will appear so bizarre to the user that they will immediately stop and reassess.
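The difference between low and high amplitude replies can be sketched as follows. This is only an illustration: the synonym table and both function names are hypothetical, and a real system would need a far richer lexicon and actual purchase history to draw on.

```python
# Hypothetical synonym table for illustration only.
SYNONYMS = {
    "get": "purchase",
    "kleenex": "tissues",
}


def echo_confirmation(request):
    """Low amplitude: repeat the request back almost word for word."""
    return f"Okay, I will {request}."


def varied_confirmation(request, context=""):
    """Higher amplitude: swap in synonyms and optionally append context."""
    words = [SYNONYMS.get(word.lower(), word) for word in request.split()]
    reply = "Okay, I will " + " ".join(words)
    if context:
        reply += " " + context
    return reply + "."


request = "get a new package of kleenex"
print(echo_confirmation(request))
print(varied_confirmation(request))
print(varied_confirmation(request, "from Amazon, the same kind you bought last month"))
```

If the system had actually heard “get a new package of Clean X,” the echo would sound fine to both parties. The varied reply would have nothing to substitute and no purchase history to cite, and that gap is what gives the misunderstanding away.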

Alexa’s SDK encourages people developing apps for its system to use simple, conversational words. It’s tempting to imagine that this would eliminate misunderstanding. However, the words we use most in conversation actually have the widest range of meanings. The word “get”, a common word by any measure, has upwards of 40 different definitions in most dictionaries. Therefore you would want the highest amplitude of redundancy for apparently simple words — make, have, give, go. When these words are used, repetition will always compound the misunderstanding since the user will assume that the system gets what she means by get.

The amplitude of redundancy could be adjusted depending on the number of misunderstandings and miscommunications observed by the system administrators. Right now, many systems are just trying to do a better job guessing what the user meant. High amplitude redundancy can reduce this guesswork. The amplitude should be proportional to the range of possible meanings each word or sound could possess. Only when the input is unambiguous in meaning should repetition be used. Never forget that for the user, simple repetition is annoying. Simple repetition is annoying.
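One way to picture amplitude scaling with ambiguity is the sketch below. Everything in it is an assumption for illustration: the sense counts are placeholders (only the figure of 40 for “get” echoes the dictionary count mentioned earlier), and the thresholds are arbitrary knobs of the kind an administrator might tune as misunderstandings are observed.

```python
# Placeholder sense counts. Only "get" reflects the figure quoted in the
# text; the other numbers are invented for this example.
SENSE_COUNTS = {"get": 40, "make": 30, "go": 30, "tissues": 1, "detergent": 1}


def ambiguity(request):
    """Return the largest sense count among the words in the request.

    Unknown words are treated as unambiguous (count of 1), which is itself
    a simplifying assumption.
    """
    counts = (SENSE_COUNTS.get(word.lower(), 1) for word in request.split())
    return max(counts, default=1)


def choose_strategy(request, vary_above=1, add_context_above=10):
    """Pick a confirmation style whose amplitude grows with ambiguity."""
    score = ambiguity(request)
    if score > add_context_above:
        return "synonyms+context"
    if score > vary_above:
        return "synonyms"
    return "repeat"


print(choose_strategy("get tissues"))
print(choose_strategy("tissues"))
```

Raising or lowering `vary_above` and `add_context_above` is the adjustment described above: the more errors the administrators see, the lower the thresholds should go.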

What Success Sounds Like

Recently, Amazon reprogrammed Alexa to encourage kids to say “please” and “thank you.” This reflected feedback they were getting from concerned parents who worried that their children were not developing good manners because they were used to issuing commands to smart speakers without these verbal niceties.

As the parent of two young children, I can assure you this is nonsense. Young children have always struggled to learn to say “please” and “thank you.” It’s not Alexa’s fault that kids are kids. Saying please to a smart speaker makes about as much sense as saying thank you to your keyboard after you type a sentence. (Thank you keyboard for that apt simile.)

But this concern does raise an interesting issue: most popular technologies end up affecting human language and behavior. Hashtags get thrown into casual conversations between friends. Acronyms like “LOL” and mistypings like “pwned” have wormed their way into the Oxford English Dictionary. The contrivances that started out platform-specific end up being used in new situations. Sometimes this starts as a self-conscious joke (#selfconsciousjoke), but soon enough these forms are used without irony. Some social scientists have suggested that the need to be brief in telegraph messages transformed written English from the florid prose of the Victorian era to its current blunt form.

So what effects are voice input technologies like Alexa having on language? Actually, they haven’t had any effects. Because these platforms insist they are using natural language, they have failed to create any platform-specific patterns of language or speech. In their desperation to fit into culture, they haven’t had any effect on culture.

To the product managers of Siri or Alexa, this probably sounds like success. Why reinvent the language when you can simply master it? But the new words and grammatical forms that have been introduced by the typewriter, telegraph, radio, television, and internet reflect the impact that those technologies have had on human civilization. The changes they demanded from our culture and language helped to make them sticky. A technology that demands nothing of us isn’t convenient, it’s invisible. Unless voice input technologies are willing to abandon the clunky interface of natural language, they will never affect culture. And the list of revolutionary technologies that had no effect on culture is a short one.

Author: Nathan Hunt
