Code Like A Girl

Why your voice UI likely sucks

This sheep thinks your voice UI sucks. (Image: a sheep looking through a fence at the camera with a serious expression. Source: Pixabay)

Your voice UI likely sucks. When using voice recognition becomes a common trope for comedy, you know you have a problem. See the following YouTube clips for examples.

Let’s ignore how problematic The Big Bang Theory is in general. In this clip, you can see that Kunal Nayyar’s character is understood perfectly, despite his accented English, while Kripke’s speech disorder causes him a lot of frustration when he tries to use Siri.

Why is it so hard to make a good voice UI?

If you’re from the US, have you ever watched a TV show or listened to a podcast from, say, England? Did you have trouble understanding it? They speak English, too, so why is it so hard to understand?

Well, first of all, no language is the same in every country it’s spoken, and it’s not even the same in different regions within a country. There are differences in speech by age, gender, and identity. And it’s not just “accent,” which has to do with phonetics (how sounds are produced) and prosody (rhythm and intonation). It’s also vocabulary and sometimes syntax, or the order words appear in an utterance.*

*We don’t speak in full sentences. Linguists often call chunks of speech “utterances.”

It works this way in other languages, too. Take Spanish, for example. I lived in Spain when I was an undergraduate, and I remember talking to my host mom about Spanish around the world. She told me she had such a hard time understanding Cuban Spanish. Check out the following video to see some examples of Spanish spoken in different parts of the world. (Ignore the first part that focuses just on Spain. Catalán and Euskera are NOT versions of Spanish, but different languages that are spoken in Spain).

If you’d like to explore more videos of Spanish dialects to see how different they are, check out OSU’s Spanish Dialects repository.

Ok, accents are different. Got it. So?

So… who is designing your voice UI? What data are they feeding into it? Voice- and speech-related detection systems are biased. They are biased with respect to gender (Google). They are biased with respect to race (Natural Language Processing). And they are biased with respect to accent (YouTube and Bing). Given that there’s been some work done on differences in voice quality based on sexuality, they may well be biased against non-cis/hetero people, too.

It’s probably safe to assume they’re also biased against disability, even though I don’t have a source for that at the time of writing. (We are generally pretty terrible at inclusive design for disability.) These systems are being fed biased data. Maybe not intentionally, but it’s still happening, and it’s still a problem. If a homogeneous group of people with no speech disabilities builds a voice UI, and they aren’t thoughtful about the voice data they train it on, it won’t be an inclusive voice UI. And it will suck.
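One way to see this in your own product is to stop reporting a single overall accuracy number and start comparing error rates per speaker group. Below is a minimal sketch of that idea in Python. The transcribe() function and the group labels are placeholders for whatever recognizer and speaker metadata you actually have, not a reference to any particular vendor’s API.

```python
from collections import defaultdict

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard word error rate via edit distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def error_rate_by_group(test_set, transcribe):
    """test_set: iterable of (audio, reference_text, speaker_group) tuples.
    transcribe: your recognizer, audio -> text. Both are placeholders here."""
    scores_by_group = defaultdict(list)
    for audio, reference, group in test_set:
        scores_by_group[group].append(word_error_rate(reference, transcribe(audio)))
    # One average per group makes the gaps visible instead of hiding them
    # inside a single overall number.
    return {group: sum(scores) / len(scores) for group, scores in scores_by_group.items()}
```

If the per-group averages diverge sharply, say, markedly worse error for one accent or one gender, that gap is exactly the bias described above, and a single aggregate score would have hidden it.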

Oh, and let’s not forget that the set of possible utterances is effectively unlimited, and training a voice UI on every permutation of an expression is no small undertaking. Couple that with all the different ways of pronouncing each of those expressions. Yikes!
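To make the scale concrete, here’s a toy sketch (the intent and the phrasings are made up): even one simple “set an alarm” request, with a handful of interchangeable wordings per slot, multiplies out fast, and that’s before anyone actually pronounces any of it.

```python
from itertools import product

# Toy example: interchangeable wordings for a single, simple intent.
openers = ["set", "create", "make", "can you set", "i need"]
objects = ["an alarm", "a wake-up alarm", "an alert"]
times = ["for 7", "for 7 am", "at 7 in the morning", "at seven"]

variants = [" ".join(parts) for parts in product(openers, objects, times)]
print(len(variants))  # 5 * 3 * 4 = 60 wordings, for one tiny intent
```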

It looks like some people are catching on to this, but I don’t trust Silicon Valley and the giant tech magnates to be truly inclusive in their design. (Sorry, friends who work for these companies!) Especially since YouTube, whose captions struggle with accents, is owned by Google, while the Google Home Mini commercial claims it works for anyone.

Crap. What can I do?

If you design voice UIs, one thing you can do is make sure you are training with data from a wide variety of sources. This is going to take a lot more time and resources, but it is worth it. Talk to your stakeholders about why this is important.
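What that can look like in practice is auditing your training set by speaker group before you ever train. Here’s a rough sketch, assuming you have (or can collect) accent, dialect, or other demographic labels for your recordings; the labels and the downsampling strategy are illustrative, not a prescription.

```python
import random
from collections import defaultdict

def balance_by_group(samples, group_of, seed=0):
    """Downsample so every speaker group contributes equally to training.
    samples: list of training examples; group_of: function mapping a sample
    to its accent/dialect/speaker-group label (placeholder metadata here)."""
    by_group = defaultdict(list)
    for sample in samples:
        by_group[group_of(sample)].append(sample)
    # The smallest group sets the cap, so no single group dominates training.
    cap = min(len(group) for group in by_group.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_group.values():
        balanced.extend(rng.sample(group, cap))
    rng.shuffle(balanced)
    return balanced
```

Downsampling the large groups is the bluntest possible fix, but even this forces the question of which voices are in your data at all; collecting more recordings from the under-represented groups is the better long-term answer.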

Hint: people won’t use your digital thing with voice UI if it doesn’t work for them. Which means they likely won’t give you money for it. Stakeholders like money, right?

If your voice UI is awful, you will hear about it on social media. You’ll become part of the comedy trope of bad voice UIs. That’s not great for your brand.

Above all, think about how you would feel if you tried to use a cool new toy and it didn’t work for you, but worked for someone else who didn’t have the same {insert characteristic/identity} as you.