Voice Content and Usability

We’ve been conversing for a long time. Whether to present information, perform transactions, or just to check in on one another, people have yammered aside, chattering and gesticulating, through spoken discussion for many generations. Only recently have conversations started to be written, and only recently have we outsourced them to the system, a device that exhibits a significantly higher affinity for written communications than for the vernacular rigors of spoken language.

Laptops have trouble because between spoken and written speech, talk is more primitive. Machines must wrestle with the complexity of human statement, including the disfluencies and pauses, the gestures and body speech, and the variations in expression choice and spoken dialect, which may impede even the most skillfully crafted human-computer interaction. In the human-to-human situation, spoken language also has the opportunity of face-to-face call, where we can easily interpret visual interpersonal cues.

In contrast, written language develops its own fossil record of dated terms and phrases as we commit to recording and keeping usages long after they are no longer relevant in spoken communication ( for example, the salutation” To whom it may concern” ). Because it tends to be more consistent, smooth, and proper, written word is necessarily far easier for devices to interpret and know.

Spoken speech is not a pleasure in this regard. Besides the visual cues that mark conversations with emphasis and personal context, there are also linguistic cues and vocal behaviors that modulate conversation in nuanced ways: how something is said, never what. Our spoken language conveys much more than the written word could ever contain, whether it be rapid-fire, low-pitched, or high-decibel, sarcastic, stilted, or sighing. So when it comes to voice interfaces—the machines we conduct spoken conversations with—we face exciting challenges as designers and content strategists.

Voice-to-text interactions

We interact with voice interfaces for a variety of reasons, but according to Michael McTear, Zoraida Callejas, and David Griol in The Conversational Interface, those motivations by and large mirror the reasons we initiate conversations with other people, too ( ). We typically strike up a conversation as a result:

we need something done ( such as a transaction ),
we seek knowledge of something ( some kind of information ), or
we are social beings and want someone to talk to ( conversation for conversation’s sake ).

These three categories, which I refer to as transactional, informational, and prosocial, also apply to essentially every voice interaction: a single conversation that starts with the voice interface’s first greeting and ends with the user leaving the interface. Note here that a conversation in our human sense—a chat between people that leads to some result and lasts an arbitrary length of time—could encompass multiple transactional, informational, and prosocial voice interactions in succession. In other words, a voice interaction is a conversation, but it must not be one particular voice interaction.

Purely prosocial conversations are more gimmicky than captivating in most voice interfaces, because machines don’t yet have the capacity to really want to know how we’re doing and to do the sort of glad-handing humans crave. Additionally, there is ongoing debate about whether users actually prefer the type of organic human conversation that starts with a prosocial voice and progresses seamlessly into new ones. In fact, in Voice User Interface Design, Michael Cohen, James Giangola, and Jennifer Balogh recommend sticking to users ‘ expectations by mimicking how they interact with other voice interfaces rather than trying too hard to be human—potentially alienating them in the process ( ).

That leaves two different types of conversations we can have with one another that a voice interface can also have easily, including one that is transactional and one that is informational, teaching us something new ( “discuss a musical” ).

Transactional voice interactions

When you order a Hawaiian pizza with extra pineapple, you’re typically having a conversation and a voice interaction when you’re tapping buttons on a food delivery app. Even when we walk up to the counter and place an order, the conversation quickly pivots from an initial smattering of neighborly small talk to the real mission at hand: ordering a pizza ( generously topped with pineapple, as it should be ).

Alison: Hey, how are things going?
Burhan: Hi, welcome to Crust Deluxe! It’s chilly outside. How can I help you?
Alison: Can I get a pizza from Hawaii with extra pineapple.
Burhan: Sure, what size?
Large, Alison.
Burhan: Anything else?
Alison: No, that’s it.
Burhan: Something to drink?
I’ll have a bottle of Coke, Alison.
Burhan: You got it. That will cost$ 13.55 and take about fifteen minutes.

Each progressive disclosure in this transactional conversation reveals more and more of the desired outcome of the transaction: a service rendered or a product delivered. Transactional conversations exhibit a few key characteristics: they’re direct, to the point, and economical. They quickly dispense with pleasantries.

Informational voice interactions

Meanwhile, some conversations are primarily about obtaining information. Alison might only want to place an order at Crust Deluxe, but she might not want to leave without a pizza at all. She might be just as interested in whether they serve halal or kosher dishes, gluten-free options, or something else. Even though we have a prosocial mini-conversation once more at the beginning to practice politeness, we are after much more.

Alison: Hey, how are things going?
Burhan: Hi, welcome to Crust Deluxe! It’s chilly outside. How can I help you?
Alison: Can I ask a few questions?
Burhan: Of course! Go right ahead.
Do you have any halal options available on the menu, Alison?
Burhan: Absolutely! On request, we can make any pie halal. We also have lots of vegetarian, ovo-lacto, and vegan options. Do you have any other dietary restrictions in mind?
Alison: What about gluten-free pizzas?
Burhan: For both our deep-dish and thin-crust pizzas, we can definitely make a gluten-free crust for you, without a problem. Anything else I can answer for you?
Alison: That’s it for now. Good to know. Thank you.
Burhan: Anytime, come back soon!

This dialogue is entirely different. Here, the goal is to get a certain set of facts. Informational conversations are research expeditions to gather data, news, or facts in search of the truth. Voice interactions that are informational might be more long-winded than transactional conversations by necessity. Responses are typically longer, more in-depth, and carefully communicated so that the customer is aware of the important lessons.

Voice Interfaces

Voice interfaces, in essence, use speech to assist users in accomplishing their objectives. But simply because an interface has a voice component doesn’t mean that every user interaction with it is mediated through voice. We’re most concerned in this book with pure voice interfaces because multimodal voice interfaces can lean on visual components like screens as crutches, which are completely dependent on spoken conversation and lack any visual component, making them much more nuanced and challenging to deal with.

Though voice interfaces have long been integral to the imagined future of humanity in science fiction, only recently have those lofty visions become fully realized in genuine voice interfaces.

IVR ( interactive voice response ) systems

Though written conversational interfaces have been fixtures of computing for many decades, voice interfaces first emerged in the early 1990s with text-to-speech ( TTS ) dictation programs that recited written text aloud, as well as speech-enabled in-car systems that gave directions to a user-provided address. We became familiar with the first real voice interfaces that could actually be spoken to without having to deal with overburdened customer service representatives as a result of the development of interactive voice response ( IVR ) systems.

IVR systems allowed organizations to reduce their reliance on call centers but soon became notorious for their clunkiness. These systems, which are commonplace in the corporate world, were primarily intended as metaphorical switchboards to direct customers to real phone agents (” Say Reservations to book a flight or check an itinerary” ), and it is likely that when you call an airline or hotel conglomerate, you will have the opportunity to have a conversation with one. Despite their functional issues and users ‘ frustration with their inability to speak to an actual human right away, IVR systems proliferated in the early 1990s across a variety of industries (, PDF).

IVR systems have a reputation for having less scintillating conversations than we’re used to in real life ( or even in science fiction ), despite being extremely repetitive and monotonous conversations that typically don’t veer from a single format.

Screen readers

The invention of the screen reader, a tool that converts visual content into synthesized speech, was a development of IVR systems in parallel. For Blind or visually impaired website users, it’s the predominant method of interacting with text, multimedia, or form elements. Perhaps the closest thing we have today to an out-of-the-box implementation of content delivered through voice is represented by screen readers.

Among the first screen readers known by that moniker was the Screen Reader for the BBC Micro and NEEC Portable developed by the Research Centre for the Education of the Visually Handicapped (RCEVH) at the University of Birmingham in 1986 ( ). The first IBM Screen Reader for text-based computers was created by Jim Thatcher in the same year, which was later recreated for computers with graphical user interfaces ( GUIs ) ( ).

With the rapid growth of the web in the 1990s, the demand for accessible tools for websites exploded. Screen readers started facilitating quick interactions with web pages that ostensibly allow disabled users to traverse the page as an aural and temporal space rather than a visual and physical one with the introduction of semantic HTML and especially ARIA roles in 2008, enabling speedy interactions with the pages. In other words, screen readers for the web “provide mechanisms that translate visual design constructs—proximity, proportion, etc. in A List Apart, writes Aaron Gustafson, “into useful information.” ” At least they do when documents are authored thoughtfully” ( ).

There is a big draw for screen readers: they’re challenging to use and relentlessly verbose, despite being incredibly instructive for voice interface designers. The visual structures of websites and web navigation don’t translate well to screen readers, sometimes resulting in unwieldy pronouncements that name every manipulable HTML element and announce every formatting change. Working with web-based interfaces takes a cognitive toll for many screen reader users.

In Wired, accessibility advocate and voice engineer Chris Maury considers why the screen reader experience is ill-suited to users relying on voice:

I disliked the operation of Screen Readers from the beginning. Why are they designed the way they are? It makes no sense to present information visually before converting it to audio only after that. All of the time and energy that goes into creating the perfect user experience for an app is wasted, or even worse, adversely impacting the experience for blind users. __ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

In many cases, well-designed voice interfaces can speed users to their destination better than long-winded screen reader monologues. After all, users of the visual interface have the advantage of freely scurrying around the viewport to find information without worrying about it. Blind users, meanwhile, are obligated to listen to every utterance synthesized into speech and therefore prize brevity and efficiency. Users with disabilities who have long had no choice but to use clumsy screen readers might find that voice interfaces, especially more contemporary voice assistants, provide a more streamlined experience.

Voice assistants

Many of us immediately associate voice assistants with the subset of voice interfaces that are now commonplace in living rooms, smart homes, and offices with the film HAL from 2001: A Space Odyssey or Majel Barrett’s voice as the omniscient computer in Star Trek. Voice assistants are akin to personal concierges that can answer questions, schedule appointments, conduct searches, and perform other common day-to-day tasks. And they’re quickly gaining more attention from accessibility advocates for their assistive potential.

Before the earliest IVR systems found success in the enterprise, Apple published a demonstration video in 1987 depicting the Knowledge Navigator, a voice assistant that could transcribe spoken words and recognize human speech to a great degree of accuracy. Then, in 2001, Tim Berners-Lee and others created their vision for a” semantic web agent” that would carry out routine tasks like” checking calendars, making appointments, and finding locations” ( hinter paywall ). It wasn’t until 2011 that Apple’s Siri finally entered the picture, making voice assistants a tangible reality for consumers.

There are a lot of variations in the programmability and customization of some voice assistants compared to others ( Fig. 1 ). As a result of the breadth of voice assistants available today ( Fig. 1 ). At one extreme, everything except vendor-provided features is locked down, for example, at the time of their release, the core functionality of Apple’s Siri and Microsoft’s Cortana couldn’t be extended beyond their existing capabilities. There are no other means by which developers can interact with Siri at a low level, aside from predefined categories of tasks like sending messages, hailing rideshares, making restaurant reservations, and other things, which are still unavoidable today.

At the opposite end of the spectrum, voice assistants like Amazon Alexa and Google Home offer a core foundation on which developers can build custom voice interfaces. For this reason, developers who feel constrained by the limitations of Siri and Cortana are increasingly using programmable voice assistants that are extensibable and customizable. Amazon offers the Alexa Skills Kit, a developer framework for building custom voice interfaces for Amazon Alexa, while Google Home offers the ability to program arbitrary Google Assistant skills. Users of the Amazon Alexa and Google Assistant ecosystems can choose from among the thousands of custom-built skills available today.

As businesses like Amazon, Apple, Microsoft, and Google continue to occupy their positions, they are also selling and open-sourcing an unheard array of tools and frameworks for designers and developers, aiming to make creating voice interfaces as simple as possible, even without code.

Often by necessity, voice assistants like Amazon Alexa tend to be monochannel—they’re tightly coupled to a device and can’t be accessed on a computer or smartphone instead. In contrast, many development platforms, such as Google’s Dialogflow, have omnichannel capabilities that allow users to create a single conversational interface that then becomes a voice interface, textual chatbot, and IVR system upon deployment. I don’t prescribe any specific implementation approaches in this design-focused book, but in Chapter 4 we’ll get into some of the implications these variables might have on the way you build out your design artifacts.

Voice content

Simply put, voice content is content delivered through voice. Voice content must be free-flowing and organic, contextless and concise—everything written content isn’t enough to preserve what makes human conversation so compelling in the first place.

Our world is replete with voice content in various forms: screen readers reciting website content, voice assistants rattling off a weather forecast, and automated phone hotline responses governed by IVR systems. We’re most concerned with the content in this book being delivered auditorically, not as an option but as a necessity.

For many of us, our first foray into informational voice interfaces will be to deliver content to users. There is only one issue: any content we already have isn’t in any way suitable for this new environment. So how do we make the content trapped on our websites more conversational? And how do we create fresh copy that works with voice-activated text?

Lately, we’ve begun slicing and dicing our content in unprecedented ways. Websites are, in many ways, massive vaults of what I call macrocontent: lengthy prose that can last for miles in a browser window while being viewed in microfilm format in newspaper archives. Back in 2002, well before the present-day ubiquity of voice assistants, technologist Anil Dash defined microcontent as permalinked pieces of content that stay legible regardless of environment, such as email or text messages:

An example of microcontent can be a day’s weather forecast [sic], the arrival and departure times for an airplane flight, an abstract from a lengthy publication, or a single instant message. __ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

I would update Dash’s definition of microcontent to include all instances of bite-sized content that transcends written communiqués. After all, today we encounter microcontent in interfaces where a small snippet of copy is displayed alone, unmoored from the browser, like a textbot confirmation of a restaurant reservation. The best way to learn how to stretch your content to the limits of its potential is through microcontent, which will inform both established and new delivery methods.

As microcontent, voice content is unique because it’s an example of how content is experienced in time rather than in space. We can instantly look at a digital sign for an instant and be informed when the next train is coming, but voice interfaces keep our attention captive for so long that we can’t quickly evade or skip, a feature that screen reader users are all too familiar with.

Because microcontent is fundamentally made up of isolated blobs with no relation to the channels where they’ll eventually end up, we need to ensure that our microcontent truly performs well as voice content—and that means focusing on the two most important traits of robust voice content: voice content legibility and voice content discoverability.

Fundamentally, how voice content manifests in perceived time and space both affect the legibility and discoverability of our voice content.