VUI Assessment Guidelines


We have created these guidelines to help designers create user-friendly Amazon Alexa skills and Google Home actions.


Functionality What does it do and does it deliver value to the customer? Skills that people will use often, such as multiple times a day, score high. The less frequent the use, the less useful. Skills that have the potential of becoming part of a customer’s routine, or daily life, score high.
Modality fit Is the skill a good fit for voice? For instance, the Bird Song Alexa skill is a nearly perfect fit because it has to do with listening to bird songs. Finding just the right hotel room and booking it is not, because reviewing search results and conducting multi-step interactions are tedious and slow with voice compared to, say, a desktop website or a mobile app, where one can quickly scan search results and complete transactions with a few clicks and taps.
Invocation How easy is it for someone to remember how to invoke the skill? The easier to remember, the better. Not having to remember anything (e.g., “What’s the stock price for IBM?”) is even better.
Closing When you say “Stop” or “Goodbye,” does it stop or does it go on talking some more, as in: “Thank you for using our skill. Please visit us at”? Ideally, it says nothing when the user says “Stop.” When concluding an experience after a natural interaction (say, ordering sandwiches), the less it says, the better.
Brevity How long are the prompts? Be concise, but not cryptic. Ideally, you want to shorten your prompts as the user becomes familiar with the skill. Don’t offer the same experience regardless of whether the user is a first-time user or a power user.
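As a minimal sketch of prompt tapering, the function below picks a shorter welcome prompt as the user's visit count grows. The function name, the thresholds, and the sandwich-shop wording are all hypothetical, chosen only for illustration; a real skill would read the visit count from whatever per-user storage its platform provides.

```python
def welcome_prompt(visit_count: int) -> str:
    """Return a welcome prompt that gets shorter as the user gains experience.

    The thresholds (1 and 5 visits) are illustrative assumptions, not
    recommended values.
    """
    if visit_count <= 1:
        # First-time user: explain what the skill can do.
        return ("Welcome to Sandwich Shop. You can order a sandwich, "
                "hear today's specials, or reorder your last meal. "
                "What would you like to do?")
    if visit_count <= 5:
        # Returning user: list the options without the explanation.
        return "Welcome back. Order, specials, or reorder?"
    # Power user: get out of the way.
    return "Welcome back. What'll it be?"
```

The same idea applies to any prompt the user hears repeatedly, not only the welcome.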
Help Does it offer useful help? Provide contextualized, precise, and relevant help that is actionable and that will enable the user to get back on track with the conversation.
Questions Does it ask questions clearly and in a way that enables the customer to know what to say? For instance, are there questions where it’s not clear if they are multiple choice or “Yes/No”?
Cognitive Load Does the skill require the customer to remember more than they should? Is the user forced to listen to many choices? Is the user forced to recall a piece of information provided earlier in the conversation?
Language flow How conversational does the interaction feel? Does the multi-turn conversation flow or does it sound robotic? Does the assistant use markers (whether via sounds or phatic expressions, such as “Ok,” “Great,” “Got it”) to create momentum and sustain a cooperative feel?
Error Strategies How well does it handle no input (the customer doesn’t say anything) and no match (the customer says something out of scope)? Is the user helped to recover from errors easily? What does the assistant do when it doesn’t understand the user? Does it provide an example of what the user can say, or does it just say, “Can you say that again?”
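One common way to handle these two cases is to escalate: a light reprompt on the first error, then a reprompt that tells the user what they can say, then a graceful exit. The sketch below assumes a hypothetical question about drink size; the prompt wording, the `REPROMPTS` table, and the two-attempt limit are illustrative assumptions rather than platform requirements.

```python
# Hypothetical reprompts for a "what size?" question, keyed by error type.
# Each list escalates from a light nudge to an explicit example.
REPROMPTS = {
    "no_input": [
        "What size would you like?",
        "I didn't hear anything. You can say small, medium, or large.",
    ],
    "no_match": [
        "Sorry, what size was that?",
        "I didn't catch that. You can say small, medium, or large.",
    ],
}

# Spoken once the reprompts are exhausted, instead of looping forever.
GIVE_UP = "I'm having trouble understanding. Let's try again later."


def recovery_prompt(error_type: str, attempt: int) -> str:
    """Return the prompt for the given error type and 1-based attempt number."""
    prompts = REPROMPTS[error_type]
    if attempt > len(prompts):
        return GIVE_UP
    return prompts[attempt - 1]
```

Keeping separate wording for no input and no match lets the assistant avoid apologizing for a misunderstanding when the user simply said nothing.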
Speech Recognition Accuracy How often does the skill not recognize something that it should recognize? Does the assistant hear something other than what the customer says? Does it often return no match (that is, fail to match what the user said to anything it expected)?
Perceived Latency How fast does the skill respond? And if it doesn’t respond fast, does it employ mitigating strategies, such as saying “Hang on one sec, let me fetch the information for you” or playing a background sound that makes it clear that the skill is formulating a response? Latency is an important dimension when assessing the quality of a voice experience. Customers will not tolerate voice experiences that respond slowly, or respond in a way that feels slow. The operative word here is “perceived.” The user may need to wait 5 seconds for a response, but if those 5 seconds are covered by some language or sound while the assistant, say, retrieves information, the perceived latency may be much lower than the actual latency.
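A rough sketch of one mitigating strategy, using Python's standard asyncio library: start the slow lookup, and speak a filler phrase only if the lookup has not finished within a short threshold. The `fetch` and `speak` callables and the one-second default are hypothetical stand-ins; real voice platforms expose their own mechanisms for interim responses.

```python
import asyncio


async def respond(fetch, speak, filler_after: float = 1.0) -> None:
    """Speak the fetched answer, covering a slow fetch with a filler phrase.

    fetch: hypothetical async callable that returns the answer text.
    speak: hypothetical callable that sends text to the user.
    filler_after: seconds to wait before speaking the filler (an assumption).
    """
    task = asyncio.ensure_future(fetch())
    # Wait up to the threshold; asyncio.wait does not cancel the task on timeout.
    done, _pending = await asyncio.wait({task}, timeout=filler_after)
    if not done:
        # The fetch is slow: fill the silence so the wait feels shorter.
        speak("Hang on one sec, let me fetch that for you.")
    # Deliver the real answer once it is available.
    speak(await task)
```

A fast fetch produces only the answer, so users are not made to sit through a filler phrase they do not need.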
Text To Speech (TTS) Does the skill pronounce things incorrectly? How often does the assistant mispronounce a word or a phrase? Is the sentence spoken with the right pauses at the right places in the spoken prompt? For instance, when speaking a list of items, is there a silence gap between list items?
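Pauses between list items can be expressed with the SSML `<break>` element, which both major voice platforms accept in some form. The helper below is a minimal sketch: the function name and the 300 ms default are assumptions for illustration, and the right pause length is best found by listening to the result.

```python
from xml.sax.saxutils import escape


def list_to_ssml(items, pause_ms: int = 300) -> str:
    """Join list items into SSML with a short pause after each comma.

    pause_ms is an illustrative default, not a recommended value.
    """
    separator = f', <break time="{pause_ms}ms"/> '
    # Escape each item so characters such as "&" do not break the markup.
    spoken = separator.join(escape(item) for item in items)
    return f"<speak>{spoken}</speak>"
```

For example, `list_to_ssml(["ham", "turkey"])` returns `<speak>ham, <break time="300ms"/> turkey</speak>`.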
Information architecture Do the steps go in an order that makes sense?  Is the information presented and offered well structured?  Can the user easily create a mental model and use that model to help themselves navigate the conversation?