Voice Content and Usability

We’ve been having conversations for thousands of years. Whether to convey information, conduct transactions, or simply check in on one another, people have yammered away, chattering and gesticulating, through spoken conversation for countless generations. Only in the last few millennia have we begun to commit our conversations to writing, and only in the last few decades have we begun to outsource them to the computer, a machine that shows much more affinity for written correspondence than for the slangy vagaries of spoken language.

Computers have trouble because, between spoken and written language, speech is the more primordial. To have successful conversations with us, machines must grapple with the messiness of human speech: the disfluencies and pauses, the gestures and body language, and the variations in word choice and spoken dialect that can stymie even the most carefully crafted human-computer interaction. In the human-to-human scenario, spoken language also has the privilege of face-to-face contact, where we can readily interpret nonverbal social cues.

In contrast, written language is fixed the moment we commit it to record, and it retains usages long after they become obsolete in spoken communication (the salutation “To whom it may concern,” for example), generating its own fossil record of outdated terms and phrases. Because it tends to be more consistent, polished, and formal, written text is fundamentally much easier for machines to parse and understand.

Spoken language has no such luxury. Besides the nonverbal cues that decorate conversations with emphasis and emotional context, there are also verbal cues and vocal behaviors that modulate conversation in nuanced ways: how something is said, not what. Whether rapid-fire, low-pitched, or high-decibel, whether sarcastic, stilted, or sighing, our spoken language conveys much more than the written word could ever muster. So when it comes to voice interfaces, the machines we conduct spoken conversations with, we face exciting challenges as designers and content strategists.

Voice Interactions

We interact with voice interfaces for a variety of reasons, but according to Michael McTear, Zoraida Callejas, and David Griol in The Conversational Interface, those motivations by and large mirror the reasons we initiate conversations with other people, too. Generally, we start a conversation because:

we need something done (such as a transaction),
we want to know something (information of some sort), or
we are social beings and want someone to talk to.

These three categories, which I call transactional, informational, and prosocial, also characterize essentially every voice interaction: a single conversation from beginning to end that realizes some outcome for the user, starting with the voice interface’s first greeting and ending with the user exiting the interface. Note here that a conversation in our human sense (a chat between people that leads to some result and lasts an arbitrary length of time) could encompass multiple transactional, informational, and prosocial voice interactions in succession. In other words, a voice interaction is a conversation, but a conversation is not necessarily a single voice interaction.

Purely prosocial conversations are more gimmicky than captivating in most voice interfaces, because machines don’t yet have the capacity to really want to know how we’re doing or to do the sort of glad-handing humans crave. There’s also ongoing debate as to whether users actually prefer the sort of organic human conversation that begins with a prosocial voice interaction and shifts seamlessly into other types. In fact, in Voice User Interface Design, Michael Cohen, James Giangola, and Jennifer Balogh recommend sticking to users’ expectations by mimicking how they interact with other voice interfaces rather than trying too hard to be human, which could alienate them in the process.

That leaves two genres of conversation we can have with one another that a voice interface can easily have with us, too: a transactional voice interaction realizing some outcome (“buy iced tea”) and an informational voice interaction teaching us something new (“discuss a musical”).

Transactional voice interactions

Unless you’re tapping buttons on a food delivery app, you’re generally having a conversation, and therefore a voice interaction, when you order a Hawaiian pizza with extra pineapple. Even when we walk up to the counter and place an order, the conversation quickly pivots from an initial smattering of neighborly small talk to the real mission at hand: ordering a pizza (generously topped with pineapple, as it should be).

Alison: Hey, how’s it going?
Burhan: Hello and welcome to Crust Deluxe! It’s chilly outside. How can I help you?
Alison: Can I get a Hawaiian pizza with extra pineapple?
Burhan: Sure, what size?
Alison: Large.
Burhan: Anything else?
Alison: No, that’s it.
Burhan: Something to drink?
Alison: I’ll have a bottle of Coke.
Burhan: You got it. That’ll be $13.55 and about fifteen minutes.

Each progressive disclosure in this transactional conversation reveals more and more of the desired outcome of the transaction: a service rendered or a product delivered. Transactional conversations have certain key traits: they’re direct, to the point, and economical. They quickly dispense with pleasantries.
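One way to picture this progressive disclosure is as slot filling, a common way of modeling transactional dialogues in which each turn fills in one missing detail of the order. The Python sketch below is purely illustrative: the slot names (item, size, drink), the prompts, and the helper functions next_prompt and run_order are assumptions invented for this example, not part of any particular voice platform.

```python
from typing import Optional

# Hypothetical slots for a pizza order, asked in this order.
PROMPTS = {
    "item": "How can I help you?",
    "size": "Sure, what size?",
    "drink": "Something to drink?",
}


def next_prompt(order: dict) -> Optional[str]:
    """Return the prompt for the first unfilled slot, or None when the order is complete."""
    for slot, prompt in PROMPTS.items():
        if slot not in order:
            return prompt
    return None


def run_order(answers: list) -> dict:
    """Fill slots one turn at a time from a list of scripted customer answers."""
    order = {}
    for answer in answers:
        # Find the slot the interface is currently asking about.
        pending = next((slot for slot in PROMPTS if slot not in order), None)
        if pending is None:
            break  # Every slot is filled; ignore any extra answers.
        order[pending] = answer
    return order


order = run_order(["Hawaiian pizza with extra pineapple", "large", "bottle of Coke"])
print(order)
print(next_prompt(order))  # None: nothing left to ask
```

Once every slot is filled there is nothing left to ask, which mirrors how a transactional conversation ends as soon as the outcome is fully specified.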

Informational voice interactions

Meanwhile, some conversations are primarily about obtaining information. Though Alison might visit Crust Deluxe with the sole purpose of placing an order, she might not actually want to walk out with a pizza at all. She might be just as interested in whether they serve halal or kosher dishes, gluten-free options, or something else entirely. Here, though we again have a prosocial mini-conversation at the beginning to establish politeness, we’re after much more.

Alison: Hey, how’s it going?
Burhan: Hello and welcome to Crust Deluxe! It’s chilly outside. How can I help you?
Alison: Can I ask a few questions?
Burhan: Of course! Go right ahead.
Alison: Do you have any halal options on the menu?
Burhan: Totally! On request, we can make any pie halal. We also have lots of vegetarian, ovo-lacto, and vegan options. Do you have any other dietary restrictions in mind?
Alison: What about gluten-free pizzas?
Burhan: We can definitely do a gluten-free crust for you, no problem, for both our deep-dish and thin-crust pizzas. Anything else I can answer for you?
Alison: That’s it for now. Good to know. Thank you.
Burhan: Anytime!

This is a very different dialogue. Here, the goal is to obtain a particular set of facts. Informational conversations are research expeditions to gather data, news, or facts in search of the truth. Voice interactions that are informational might be more long-winded than transactional conversations by necessity. Responses are typically longer, more in-depth, and carefully communicated to ensure that the customer understands the main ideas.

Voice Interfaces

At their core, voice interfaces employ speech to support users in reaching their goals. But simply because an interface has a voice component doesn’t mean that every user interaction with it is mediated through voice. Because multimodal voice interfaces can lean on visual components like screens as crutches, we’re most concerned with pure voice interfaces, which depend entirely on spoken conversation, lack any visual component whatsoever, and are therefore much more nuanced and challenging to tackle.

Though voice interfaces have long been integral to the imagined future of humanity in science fiction, only recently have those lofty visions become fully realized in genuine voice interfaces.

Interactive voice response (IVR) systems

Though written conversational interfaces have been fixtures of computing for many decades, voice interfaces first emerged in the early 1990s with text-to-speech (TTS) dictation programs that recited written text aloud, as well as speech-enabled in-car systems that gave directions to a user-provided address. With the advent of interactive voice response (IVR) systems, intended as an alternative to overburdened customer service representatives, we became acquainted with the first true voice interfaces that engaged in authentic conversation.

IVR systems allowed organizations to reduce their reliance on call centers, but they soon became notorious for their clunkiness. Commonplace in the corporate world, these systems were primarily designed as metaphorical switchboards to guide customers to a real phone agent (“Say Reservations to book a flight or check an itinerary”); chances are you’ll enter a conversation with one when you call an airline or hotel conglomerate. Despite their functional issues and users’ frustration with their inability to speak to an actual human right away, IVR systems proliferated in the early 1990s across a variety of industries.

While IVR systems are well suited to highly repetitive, monotonous conversations that rarely veer from a single format, they have a reputation for less scintillating conversation than we’re used to in real life (or even in science fiction).

Screen readers

Parallel to the evolution of IVR systems was the invention of the screen reader, a tool that transcribes visual content into synthesized speech. For Blind or visually impaired website users, it’s the predominant method of interacting with text, multimedia, or form elements. Screen readers represent perhaps the closest equivalent we have today to an out-of-the-box implementation of content delivered through voice.

Among the first screen readers known by that moniker was the Screen Reader for the BBC Micro and NEEC Portable developed by the Research Centre for the Education of the Visually Handicapped (RCEVH) at the University of Birmingham in 1986. That same year, Jim Thatcher created the first IBM Screen Reader for text-based computers, which was later reworked for computers with graphical user interfaces (GUIs).

With the rapid growth of the web in the 1990s, the demand for accessible tools for websites exploded. Thanks to the introduction of semantic HTML and especially ARIA roles beginning in 2008, screen readers started facilitating speedy interactions with web pages that ostensibly allow disabled users to traverse the page as an aural and temporal space rather than a visual and physical one. In other words, screen readers for the web “provide mechanisms that translate visual design constructs—proximity, proportion, etc.—into useful information,” writes Aaron Gustafson in A List Apart. “At least they do when documents are authored thoughtfully.”

Though deeply instructive for voice interface designers, screen readers have one significant problem: they’re difficult to use and unremittingly verbose. The visual structures of websites and web navigation don’t translate well to screen readers, sometimes resulting in unwieldy pronouncements that name every manipulable HTML element and announce every formatting change. For many screen reader users, working with web-based interfaces exacts a cognitive toll.

In Wired, accessibility advocate and voice engineer Chris Maury considers why the screen reader experience is ill-suited to users relying on voice:

From the beginning, I hated the way that Screen Readers work. Why are they designed the way they are? It makes no sense to present information visually and then, and only then, translate that into audio. All of the time and energy that goes into creating the perfect user experience for an app is wasted, or even worse, adversely impacting the experience for blind users.

In many cases, well-designed voice interfaces can speed users to their destination better than long-winded screen reader monologues. After all, visual interface users have the benefit of darting around the viewport freely to find information, ignoring areas irrelevant to them. Blind users, meanwhile, are obligated to listen to every utterance synthesized into speech and therefore prize brevity and efficiency. Disabled users who have long had no choice but to employ clunky screen readers may find that voice interfaces, particularly more modern voice assistants, offer a more streamlined experience.

Voice assistants

When we think of voice assistants (the subset of voice interfaces now commonplace in living rooms, smart homes, and offices), many of us immediately picture HAL from 2001: A Space Odyssey or hear Majel Barrett’s voice as the omniscient computer in Star Trek. Voice assistants are akin to personal concierges that can answer questions, schedule appointments, conduct searches, and perform other common day-to-day tasks. And they’re rapidly gaining more attention from accessibility advocates for their assistive potential.

Before the earliest IVR systems found success in the enterprise, Apple published a demonstration video in 1987 depicting the Knowledge Navigator, a voice assistant that could transcribe spoken words and recognize human speech to a great degree of accuracy. Then, in 2001, Tim Berners-Lee and others formulated their vision for a Semantic Web “agent” that would perform typical errands like “checking calendars, making appointments, and finding locations.” It wasn’t until 2011 that Apple’s Siri finally entered the picture, making voice assistants a tangible reality for consumers.

Thanks to the plethora of voice assistants available today, there is considerable variation in how programmable and customizable certain voice assistants are over others (Fig 1.1). At one extreme, everything except vendor-provided features is locked down; for example, at the time of their release, the core functionality of Apple’s Siri and Microsoft’s Cortana couldn’t be extended beyond their existing capabilities. Even today, it isn’t possible to program Siri to perform arbitrary functions, because there’s no means by which developers can interact with Siri at a low level, apart from predefined categories of tasks like sending messages, hailing rideshares, making restaurant reservations, and certain others.

At the opposite end of the spectrum, voice assistants like Amazon Alexa and Google Home offer a core foundation on which developers can build custom voice interfaces. For this reason, programmable voice assistants that lend themselves to customization and extensibility are becoming increasingly popular among developers who feel stifled by the limitations of Siri and Cortana. Amazon offers the Alexa Skills Kit, a developer framework for building custom voice interfaces for Amazon Alexa, while Google Home offers the ability to program arbitrary Google Assistant skills. Today, users can choose from among thousands of custom-built skills within both the Amazon Alexa and Google Assistant ecosystems.

As businesses like Amazon, Apple, Microsoft, and Google continue to dominate their markets, they are also selling and open-sourcing an unmatched range of tools and frameworks for designers and developers, aiming to make creating voice interfaces as simple as possible, even without the use of any code.

Often by necessity, voice assistants like Amazon Alexa tend to be monochannel—they’re tightly coupled to a device and can’t be accessed on a computer or smartphone instead. In contrast, many development platforms, such as Google’s Dialogflow, have omnichannel capabilities that allow users to create a single conversational interface that then becomes a voice interface, textual chatbot, and IVR system upon deployment. In this design-focused book, I don’t recommend any particular implementation strategies, but in Chapter 4 we’ll discuss some of the possible effects that these variables might have on how you construct your design artifacts.

Voice Content

Simply put, voice content is content delivered through voice. To preserve what makes human conversation so compelling in the first place, voice content needs to be free-flowing and organic, contextless and concise: everything written content isn’t.

Our world is replete with voice content in various forms: screen readers reciting website content, voice assistants rattling off a weather forecast, and automated phone hotline responses governed by IVR systems. We’re most concerned with content delivered auditorily, not as an option but as a necessity.

For many of us, our first foray into informational voice interfaces will be to deliver content to users. There’s only one problem: any content we already have isn’t in any way ready for this new habitat. So how do we make the content trapped on our websites more conversational? And how do we write new copy that lends itself to voice interactions?

Lately, we’ve begun slicing and dicing our content in unprecedented ways. Websites are, in many respects, colossal vaults of what I call macrocontent: lengthy prose that can extend for infinitely scrollable miles in a browser window, like microfilm viewers of newspaper archives. Back in 2002, well before the present-day ubiquity of voice assistants, technologist Anil Dash defined microcontent as permalinked pieces of content that stay legible regardless of environment, such as email or text messages:

A day’s weather forcast [sic], the arrival and departure times for an airplane flight, an abstract from a long publication, or a single instant message can all be examples of microcontent.

I’d update Dash’s definition of microcontent to include all examples of bite-sized content that go well beyond written communiqués. After all, today we encounter microcontent in interfaces where a small snippet of copy is displayed alone, unmoored from the browser, like a textbot confirmation of a restaurant reservation. Microcontent offers the best opportunity to gauge how your content can be stretched to the very edges of its capabilities, informing delivery channels both established and novel.

As microcontent, voice content is unique because it’s an example of how content is experienced in time rather than in space. We can glance at a digital sign underground for an instant and know when the next train is arriving, but voice interfaces hold our attention captive for periods of time that we can’t easily escape or skip, something screen reader users are all too familiar with.

We need to make sure that our microcontent truly performs well as voice content because it is essentially composed of isolated blobs without any connection to the channels where they will eventually end up. This means focusing on the two most crucial characteristics of robust voice content: voice content legibility and voice content discoverability.

Fundamentally, the legibility and discoverability of our voice content both depend on how it manifests in perceived time and space.