A glossary of digital audio and voice first terms

The below is a work in progress. Please feel free to pitch in with your feedback via the form at the bottom of this page.

Acoustic Model: A representation that maps “the relationship between an audio signal and the phonemes or other linguistic units that make up speech. The model is learned from a set of audio recordings and their corresponding transcripts. It is created by taking audio recordings of speech, and their text transcriptions, and using software to create statistical representations of the sounds that make up each word.”

Adaptive System: A system that adapts its behavior to changing parameters, such as the user’s identity, the time of day, day of week or month, the context of the interaction, etc.

Alexa: A cloud service that powers Amazon’s family of Echo devices. The service is also available to license by third party hardware and software providers using the Alexa Voice Service (AVS).

Alexa Skill: Software that third party developers build to add new functionality to Alexa. An Alexa skill is to Alexa what a mobile app is to the iOS or Android mobile platforms. Developers use the Alexa Skills Kit to build skills and submit them to Amazon for certification. Once a skill is certified and published, end users of Alexa (through Echo products and products that are Alexa-enabled through AVS) can discover and enable it from the Alexa Skills Store.

Alexa Skills Store: The marketplace where users of Alexa-enabled products can search for skills and enable or disable them. The Alexa Skills Store is to Alexa skills what the App Store is to iOS apps and what Google Play is to Android apps.

Alexa Voice Service (AVS): A Software Development Kit (SDK) that enables hardware manufacturers and software developers to integrate Alexa into their hardware or software. For instance, a manufacturer of a Bluetooth speaker may add microphones to the speaker, use AVS, and turn a once simple Bluetooth speaker into an Echo-like device. AVS was made generally available in June 2015.

Alexa Skills Kit (ASK): A Software Development Kit (SDK) that enables developers to build and launch an Alexa skill. For instance, Uber, Starbucks, Meetup, and everyone who has published a skill on the Alexa Skills Store used ASK to build and submit their skills. ASK was launched in June 2015.

Always Listening Device: A device that is always listening for a “wake word” and that sends the audio captured after the wake word has been detected for additional processing.

ASR: Automatic Speech Recognition, or Automatic Speech Recognizer. Usually this refers to software that is able to take audio input and map that input to a word or a language utterance.

ASR Tuning: The activity of iteratively configuring the ASR software to better map, both in accuracy and in speed, the audio input to a word or an utterance.

Barge-in: The ability of the user to interrupt system prompts while those prompts are being played. If barge-in is enabled in an application, then as soon as the user begins to speak, the system stops playing its prompt and begins processing the user’s input.

Bixby: Samsung’s voice assistant, launched in the United States in July 2017.

Confidence Score: A number (usually a fraction between 0.00 and 1.00 – e.g., 0.87) that is returned by the ASR and that reflects the confidence that the ASR has in the result provided. A 1.00 confidence means that the ASR is as certain as it can be that it has returned the correct result. A result with a confidence score of 0.91 is deemed more likely to be correct by the ASR than one with a score of 0.78.

Confidence Threshold: A number (usually a fraction between 0.00 and 1.00, e.g., 0.87) that sets the mark below which ASR results are ignored. For instance: if the user were to say, “Austin,” and the recognizer were to return, “Austin” with a score of 0.92, “Boston” with 0.87, “Houston” with 0.65, “Aspen” with 0.52, and “Oslo” with 0.43, and the threshold were set at 0.55, the 3-best would be, “Austin,” “Boston,” and “Houston.” If the threshold were set at 0.70, only the first two, “Austin” and “Boston,” would be returned. If the threshold were set to 0.40, the 3-best would still be “Austin,” “Boston,” and “Houston.” The 4-best with a threshold of 0.40 would be “Austin,” “Boston,” “Houston,” and “Aspen.”

The Cooperative Principle: The proposition that listeners and speakers must act cooperatively and mutually accept one another to be understood in a particular way to carry out an effective verbal conversation. As phrased by the British philosopher of language Paul Grice, who introduced it, “Make your conversational contribution such as is required, at the stage at which it occurs, by the accepted purpose or direction of the talk exchange in which you are engaged.”

Cortana: Microsoft’s voice assistant, launched in the United States in April 2014.

Directed Dialog: Interactions where the exchange between the user and the system is guided by the application: the system asks questions or offers options and the user responds to them. Directed dialogs stand in contrast to “Mixed-initiative” dialogs, since they require the user to answer the specific question asked and do not accept any other piece of information, whether additive (the user answered the question but also provided an additional piece of information) or substitutive (the user instead provided an altogether different piece of information that is relevant and that the system will ask for at some point).

Discovery: The process of learning what a system can do. Discovery in voice is a non-trivial problem. On an iPhone, for instance, users discover the available apps through the same visual/touch interface that the phone uses in its normal operations. Discovering what a voice system can do by voice, the primary modality of such a system, is cumbersome by comparison. The problem is more acute for discovering functionality added by third parties. Amazon Alexa provides a visual interface for searching for “Alexa skills,” which alleviates the discovery problem and lets users rely on a mechanism that they are familiar with. The solution, however, is generally deemed a stopgap measure, as it forces the user to resort to a non-primary modality (visual/touch) as a crutch to support the primary modality (voice).

Disfluency: Verbal utterances such as “ah,” “um,” etc., produced by speakers when hesitating or when holding on to their turn in a dialog.

Earcon: The audio equivalent of an “icon” in graphical user interfaces. Earcons are used to signal conversation marks (e.g., when the system starts listening, when the system stops listening) as well as to communicate brand, mood, and emotion during a voice first based interaction.

Echo (The Amazon Echo): A Far Field device released by Amazon in November 2014. “Echo” has also come to represent the Amazon branded category of devices (Echo Dot, Echo Tap, Echo Look, Echo Show) that interact with the Amazon Alexa cloud service.

Echo Cancellation: A technique that filters out the audio that a device is itself playing while the device processes incoming audio for speech recognition. By being “aware” of the audio signal that it is generating, a system whose input includes that signal along with, say, spoken audio from a user is able to process the user’s signal more accurately.

End-pointing: The marking of the start and the end of a speaker’s utterance for the purposes of ASR processing.
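
As a rough illustration of end-pointing, the sketch below marks the start and end of an utterance from per-frame energy values. It is a deliberately naive, energy-based approach and an assumption made for this example, not a description of how any particular ASR does it; the function name, the threshold, and the trailing-silence setting are all hypothetical.

```python
from typing import List, Optional, Tuple


def find_endpoints(
    frame_energies: List[float],
    energy_threshold: float = 0.1,
    trailing_silence_frames: int = 3,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the first utterance, or None.

    start is the first frame at or above the threshold; end is the last
    such frame before `trailing_silence_frames` consecutive quiet frames
    (or before the audio runs out).
    """
    start = None
    end = None
    quiet_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            if start is None:
                start = i  # first loud frame: the utterance begins
            end = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= trailing_silence_frames:
                break  # enough trailing silence: the utterance is over
    if start is None:
        return None  # no speech detected at all (a no-input situation)
    return start, end


# Hypothetical energy values, one per audio frame.
energies = [0.01, 0.02, 0.40, 0.55, 0.05, 0.60, 0.03, 0.02, 0.01, 0.50]
print(find_endpoints(energies))
```

The short dip in the middle of the example does not end the utterance, because it is shorter than the trailing-silence setting.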

False Accept: An instance where the ASR mistakenly accepted an utterance as a valid response.

False Reject: An instance where the ASR mistakenly rejected an utterance as an invalid response.

Far Field Speech Recognition: Speech recognition technology that is able to process speech spoken by a user at a distance from the receiving device (usually 10 feet or more), usually in a context where there is ambient noise. The first mainstream device to perform Far Field Speech Recognition well was the Amazon Echo, which came to market in November 2014. The speech recognition technology used on handheld mobile devices (e.g., Siri) is called Near Field Speech Recognition.

Google Action: The equivalent of an Alexa Skill. Google also variably refers to Actions as “Agents,” “Assistants” and “Apps”.

Google Assistant: The cloud service provided by Google that powers Google’s Far Field device (Google Home) as well as other Android based devices (e.g., smartphones, tablets).

Google Home: Google’s Far Field device (the equivalent of the Amazon Echo). The device was launched in October 2016.

Grammar: A shorthand, encoded description of the set of utterances that the ASR can accept.
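
To make the idea of a “shorthand, encoded description” concrete, here is a minimal sketch in which a grammar is a list of slots, each listing its alternatives. The format is invented for this example (real recognizers use their own grammar formats, such as the W3C’s SRGS), and the names `YES_NO_GRAMMAR`, `expand`, and `matches` are hypothetical.

```python
from itertools import product
from typing import List, Set

# Hypothetical grammar format: one list of alternatives per slot.
# An empty string marks a slot as optional.
YES_NO_GRAMMAR: List[List[str]] = [
    ["", "um"],
    ["yes", "no"],
    ["", "please", "thanks"],
]


def expand(grammar: List[List[str]]) -> Set[str]:
    """Spell out every utterance that the grammar accepts."""
    utterances = set()
    for choice in product(*grammar):
        utterances.add(" ".join(word for word in choice if word))
    return utterances


def matches(grammar: List[List[str]], utterance: str) -> bool:
    """True if the utterance (ignoring case and extra spaces) is in the grammar."""
    normalized = " ".join(utterance.lower().split())
    return normalized in expand(grammar)


print(len(expand(YES_NO_GRAMMAR)))            # three short slots, many utterances
print(matches(YES_NO_GRAMMAR, "Yes please"))  # in grammar
print(matches(YES_NO_GRAMMAR, "maybe"))       # out of grammar: a no-match
```

Three short lists stand in for a dozen distinct utterances, which is the point of the shorthand.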

The Gricean Maxims: A set of specific rational principles observed by people who obey the Cooperative Principle (see above). These principles enable effective verbal conversational communication between humans. British philosopher of language Paul Grice proposed four conversational maxims: Quality, Quantity, Relevance, and Manner.

The Gutenberg Parenthesis: The proposition that the last 500 years or so (the time between the invention of typeset printing, which ushered in the era of the written word as the main mode of communicating knowledge, and the recent arrival of distributed social media) is a short parenthesis in a history of human communication that has relied on informal and decentralized communication: in oral form prior to Gutenberg, and currently via social media and orally.

Houndify: A platform launched in 2015 by music identification service SoundHound that enables developers to integrate speech recognition and Natural Language Processing systems into hardware and other software systems.

Invoke: A hardware Echo-like device manufactured by Harman Kardon that enables users to engage Cortana in Far Field conversations. The device is slated to be released generally in the Fall of 2017.

Mixed-initiative Dialog: Interactions where the user may unilaterally issue a request rather than simply provide exactly the information asked for by system prompts. For instance, while making a flight reservation, the system may ask the user, “What day are you planning to fly out?” Instead of answering that question, the user may say, “I’m flying to Denver, Colorado.” A Mixed-initiative system would recognize that the user did not simply give the exact answer to the question asked but, in addition to it (additive) or instead of it (substitutive), volunteered information that the system was going to request later on. Such a system would accept this information, remember it, and continue the conversation. In contrast, a “Directed Dialog” system would rigidly insist on the departure date and would not proceed unless it received that piece of information.
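
The difference between the two dialog styles can be sketched as a small slot-filling routine. This is an illustrative sketch only: the slot names, the prompts, and the `handle_turn` and `next_prompt` functions are hypothetical, and the `parsed` argument stands in for whatever a real NLP step would extract from the user’s utterance.

```python
from typing import Dict, Optional

# Hypothetical slots for the flight example, in the order the system asks for them.
SLOT_ORDER = ["departure_date", "arrival_city"]
PROMPTS = {
    "departure_date": "What day are you planning to fly out?",
    "arrival_city": "Where are you flying to?",
}


def handle_turn(
    asked_slot: str,
    parsed: Dict[str, str],
    filled: Dict[str, str],
    mixed_initiative: bool,
) -> Dict[str, str]:
    """Return the slots known after one user turn.

    A directed dialog keeps only the slot it asked for; a mixed-initiative
    dialog also keeps any other slot the user volunteered.
    """
    updated = dict(filled)
    for slot, value in parsed.items():
        if mixed_initiative or slot == asked_slot:
            updated[slot] = value
    return updated


def next_prompt(filled: Dict[str, str]) -> Optional[str]:
    """Prompt for the first slot that is still missing, or None when done."""
    for slot in SLOT_ORDER:
        if slot not in filled:
            return PROMPTS[slot]
    return None


# The system asked for the date, but the user said "I'm flying to Denver, Colorado."
volunteered = {"arrival_city": "Denver, Colorado"}
mixed = handle_turn("departure_date", volunteered, {}, mixed_initiative=True)
directed = handle_turn("departure_date", volunteered, {}, mixed_initiative=False)
print(mixed)     # the city is remembered
print(directed)  # the city is discarded
print(next_prompt(mixed))
```

In both cases the system still has to ask for the departure date, but only the mixed-initiative version avoids asking for the city again later.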

Natural Language Processing (NLP): Technology that extracts the “meaning” of a user’s utterance or typed text. A meaning usually consists of an “Intent” and “Name-Value” pairs. The utterance, “I want to book a flight from Washington, DC to Boston,” has the Intent “Book-a-Flight” with the Name-Value pairs being “Departure City”=“Washington, DC” and “Arrival City”=“Boston, MA”. An NLP system takes the flat sequence of words, “I want to book a flight from Washington, DC to Boston,” and produces a “meaning structure” (usually a JSON object) that boils down the sequence of words to an Intent and Name-Value pairs. The JSON object delivered can then be inspected by what is often called “middleware software” that can now easily extract the information in the object and execute additional business logic (e.g., retrieve available flight information, or ask for additional missing information, e.g., “What date would you be flying out of Washington, DC?”).
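
A meaning structure for the flight example might look like the sketch below. The field names (`utterance`, `intent`, `slots`) and the `REQUIRED_SLOTS` table are assumptions made for illustration; every NLP platform defines its own schema.

```python
import json
from typing import Dict, List

# Hypothetical meaning structure for the example utterance.
meaning = {
    "utterance": "I want to book a flight from Washington, DC to Boston",
    "intent": "Book-a-Flight",
    "slots": {
        "Departure City": "Washington, DC",
        "Arrival City": "Boston, MA",
    },
}

# Hypothetical middleware rule: what each intent needs before it can run.
REQUIRED_SLOTS: Dict[str, List[str]] = {
    "Book-a-Flight": ["Departure City", "Arrival City", "Departure Date"],
}


def missing_slots(meaning: dict) -> List[str]:
    """List the required Name-Value pairs that the utterance did not supply."""
    required = REQUIRED_SLOTS.get(meaning["intent"], [])
    return [name for name in required if name not in meaning["slots"]]


print(json.dumps(meaning, indent=2))  # the JSON object handed to the middleware
print(missing_slots(meaning))         # what the system still has to ask for
```

Here the middleware would notice that the departure date is missing and prompt the user for it.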

N-Best: In speech recognition, given an audio input, an ASR returns a list of results, with each result ascribed a “confidence score” (usually a percentage or a fraction between 0 and 1, e.g., “0.87”). N-Best refers to the “N” results that were returned by the ASR and that were above the “confidence threshold”. For instance, if the user were to say, “Austin,” and the recognizer were to return, “Austin” with a score of 0.92, “Boston” with 0.87, “Houston” with 0.65, “Aspen” with 0.52, and “Oslo” with 0.43, and the threshold were set at 0.55, the 3-best would be, “Austin,” “Boston,” and “Houston”. If the threshold were set at 0.70, only the first two, “Austin” and “Boston,” would be returned. If the threshold were set to 0.40, the 3-best would still be “Austin,” “Boston,” and “Houston.” The 4-best with a threshold of 0.40 would be “Austin,” “Boston,” “Houston,” and “Aspen.”
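
The Austin example can be reproduced in a few lines. This is a minimal sketch, assuming the recognizer hands back (text, score) pairs; the `n_best` function is hypothetical and is not any vendor’s API.

```python
from typing import List, Tuple

# The recognizer's raw results from the example above: (text, confidence score).
results: List[Tuple[str, float]] = [
    ("Austin", 0.92),
    ("Boston", 0.87),
    ("Houston", 0.65),
    ("Aspen", 0.52),
    ("Oslo", 0.43),
]


def n_best(results: List[Tuple[str, float]], n: int, threshold: float) -> List[str]:
    """Return up to n results at or above the threshold, best first."""
    accepted = [(text, score) for text, score in results if score >= threshold]
    accepted.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in accepted[:n]]


print(n_best(results, n=3, threshold=0.55))  # ['Austin', 'Boston', 'Houston']
print(n_best(results, n=3, threshold=0.70))  # ['Austin', 'Boston']
print(n_best(results, n=3, threshold=0.40))  # ['Austin', 'Boston', 'Houston']
print(n_best(results, n=4, threshold=0.40))  # ['Austin', 'Boston', 'Houston', 'Aspen']
```

The threshold decides which results survive at all, while N only caps how many of the survivors are returned.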

Near Field Speech Recognition: In contrast to “Far Field” speech recognition, which processes speech spoken by a human to a device from a distance (usually of 10 feet or more), Near Field speech recognition technology is used for handling spoken input on handheld mobile devices (e.g., Siri on the iPhone) that are used within inches, or at most two feet, of the speaker.

No-input Error: A situation where the system did not detect any speech input from the user.

No-match Error: A situation where the system was not able to match the user’s response to the responses that it expected the user to provide.

Out of Scope (OOS) Error: See No-match Error.

Persona: The personality of the system (formal, playful, chatty, aggressive, friendly, etc.) that comes across in the way the system engages with the user. The persona is influenced by factors such as the perceived gender of the system, the type of language the system uses, and how the system handles errors.

Progressive Prompting: The technique of beginning an exchange by providing the user with minimal instructions and elaborating on those instructions only if encountering response errors (e.g., no-input, no-match, etc.).
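
One simple way to sketch progressive prompting is to keep a list of prompts ordered from terse to detailed and pick one based on how many errors have occurred so far. The prompt wording and the `prompt_for_attempt` function are hypothetical.

```python
# Hypothetical prompts, ordered from minimal to most detailed.
PROMPTS = [
    "Which city?",
    "Please say the name of the city you are flying to.",
    "Sorry, I still didn't get that. You can say a city name, such as Austin or Boston.",
]


def prompt_for_attempt(error_count: int) -> str:
    """Pick the prompt for the current attempt.

    error_count is the number of no-input or no-match errors so far;
    once the list runs out, the most detailed prompt is repeated.
    """
    index = min(error_count, len(PROMPTS) - 1)
    return PROMPTS[index]


for errors in range(4):
    print(errors, prompt_for_attempt(errors))
```

A user who answers correctly the first time only ever hears the shortest prompt.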

Prompt: The instruction or response that a system “speaks” to the user.

Recognition Tuning: The activity of configuring the ASR’s settings to optimize recognition accuracy and processing speed.

Secondary Orality: “Secondary orality is orality that is dependent on literate culture and the existence of writing, such as a television anchor reading the news or radio. While it exists in sound, it does not have the features of primary orality because it presumes and rests upon literate thought and expression, and may even be people reading written material.”

Siri: A voice based assistant launched by Apple on October 4th, 2011.

Speech To Text (STT): Software that converts an audio signal to words (text). “Speech to Text” is a term that is less frequently used in the industry than “Speech Recognition,” “Speech Reco,” or “ASR.”

Speech Recognizer: See ASR.

Tapered Prompting: The technique of eliding a prompt or a piece of a prompt in the context of a multistep interaction or a multi-part system response. For example, instead of the system asking repetitively, “What is your level of satisfaction with our service?” “What is your level of satisfaction with our pricing?” “What is your level of satisfaction with our cleanliness,” the system would ask: “What is your level of satisfaction with our service?” “How about our pricing?” “And our cleanliness?” The technique is used to provide a more natural and less robotic-sounding user experience.
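
The survey example above can be generated mechanically, as in this small sketch. The `tapered_prompts` function and its fixed wording are assumptions made for illustration; a production system would likely vary the phrasing more.

```python
from typing import List


def tapered_prompts(topics: List[str]) -> List[str]:
    """Ask the full question once, then elide it for the remaining topics."""
    prompts = []
    for i, topic in enumerate(topics):
        if i == 0:
            # The first prompt carries the full question.
            prompts.append(f"What is your level of satisfaction with our {topic}?")
        elif i == len(topics) - 1:
            # The last prompt signals that the list is ending.
            prompts.append(f"And our {topic}?")
        else:
            prompts.append(f"How about our {topic}?")
    return prompts


for prompt in tapered_prompts(["service", "pricing", "cleanliness"]):
    print(prompt)
```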

Text to Speech (TTS): Technology that converts text to audio that is spoken by the system. TTS is usually used in the context of dynamically retrieved information (e.g., a product ID), or when the list of possible items to be spoken by the system (e.g., full addresses) is very large, and therefore recording all of the options is not practical.

Voice First: Interfaces are said to be “Voice First” when the primary interface between the user and an automated system is a voice based one. “Voice First” does not necessarily mean “Voice Only”. A Voice First interface can have an additional, adjunct interface (usually a visual one) that can supplement the experience. For instance, one can ask if the nearest post office is open, receive the answer verbally, and then be provided with additional details about the post office location on a visual interface (mobile app, desktop browser).

Voice Biometrics: Technology that identifies specific markers within a given piece of audio that was spoken by a human being and uses those markers to uniquely model the speaker’s voice. The technology is the voice equivalent of technology that takes a visual fingerprint of a person and associates that unique fingerprint with the person’s identity. Voice Biometrics technology is used for both Voice Identification and Voice Verification.

Voice Identification (Voice ID): The capability of discriminating a speaker’s identity among a list of possible speaker identities based on the characteristics of the speaker’s voice input. Voice ID systems are usually trained by being provided with samples of speaker voices.

Voice Verification: The capability of confirming an identity claim based on a speaker’s voice input. Unlike Voice Identification, which attempts to match a given speaker’s voice input against a universe of speaker voices, Voice Verification compares a voice input against a given speaker’s voice and provides a likelihood match score. Voice Verifications are usually done in an “Identity Claim” setting: the user claims to be someone and then is “challenged” to verify their identity by speaking.
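
The contrast between Voice Identification and Voice Verification can be sketched with toy “voiceprints.” Everything here is an assumption made for illustration: the three-number vectors stand in for whatever model a real biometrics engine builds, cosine similarity stands in for its scoring method, and the names and the 0.9 threshold are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical enrolled voiceprints, one per known speaker.
ENROLLED: Dict[str, List[float]] = {
    "alice": [0.9, 0.1, 0.3],
    "bob": [0.2, 0.8, 0.5],
}


def identify(sample: List[float]) -> str:
    """Identification: pick the closest speaker among all enrolled speakers."""
    return max(ENROLLED, key=lambda name: cosine(sample, ENROLLED[name]))


def verify(claimed: str, sample: List[float], threshold: float = 0.9) -> Tuple[bool, float]:
    """Verification: compare the sample against the claimed speaker only."""
    score = cosine(sample, ENROLLED[claimed])
    return score >= threshold, score


sample = [0.85, 0.15, 0.35]      # voiceprint of the person speaking now
print(identify(sample))          # who, among everyone enrolled, is this?
print(verify("alice", sample))   # is this really alice?
print(verify("bob", sample))     # is this really bob?
```

Identification searches the whole list of speakers, whereas verification scores a single identity claim against a threshold.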

Voice User Interface (VUI): The voice equivalent of Graphical User Interface (GUI), VUI is a type of user interface that allows users to interact with electronic devices by speaking and listening to spoken text or “earcons”.

Wake Word: The spoken word or phrase that “wakes up” an always listening device.

