Glossary / Appendix

Below you will find definitions of terms commonly used throughout this documentation. If anything is unclear, please reach out for clarification.

General Terms

Plugin

A module that encapsulates a process designed to perform a specific task (detect speakers, identify languages, find keywords) in a specific way (using deep neural network embeddings, i-vectors+DNN bottleneck, etc.). A plugin thus contains the plan that links a series of components (acoustic front-end, representation, classifier, fusion, calibration) into a pipeline. It generally captures a specific approach (algorithm) or process that uses a data model in the domain to perform the task, though the algorithm is generally independent of the data or audio-condition specialization in a plugin's domain(s). For more information on the different plugin types and capabilities, see the Plugins information page and the Domain entry below.

Domain

A domain always resides within a plugin. The domain contains the specific information (trained models, parameters, etc.) needed to prepare the plugin for specific operating conditions. Every plugin must have at least one domain, but may have many. Examples include telephone (tel), analog and/or digital push-to-talk (ptt), and distant microphone. Some plugins have a general domain trained on many data types, commonly called "multi-condition".

Task

In the context of OLIVE, Task typically refers to the goal of a plugin, or the problem it is designed to address. For example, the plugin sad-dnn-v6 has the Task of SAD, or speech activity detection. For more information on the range of Tasks we currently have plugins for, please refer to the Plugins documentation.
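The relationship between a plugin, its Task, and its domains can be pictured with a small data structure. This is only an illustrative sketch: the `Plugin` and `Domain` classes and their fields are hypothetical and are not part of the OLIVE API, and the domains attached to sad-dnn-v6 here are made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Domain:
    """Hypothetical stand-in for a plugin domain (trained models, parameters)."""
    name: str                                   # e.g. "tel" or "ptt"
    classes: List[str] = field(default_factory=list)


@dataclass
class Plugin:
    """Hypothetical stand-in for a plugin: one Task, one or more domains."""
    name: str
    task: str
    domains: List[Domain]

    def __post_init__(self) -> None:
        # The glossary states that every plugin must have at least one domain.
        if not self.domains:
            raise ValueError("a plugin must have at least one domain")

    def domain(self, name: str) -> Domain:
        """Look up a domain by name; raises KeyError if it does not exist."""
        for d in self.domains:
            if d.name == name:
                return d
        raise KeyError(name)


# Illustrative only: these domain names are assumptions, not a real listing.
sad = Plugin(name="sad-dnn-v6", task="SAD", domains=[Domain("tel"), Domain("ptt")])
```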

Class

A specific target category of interest to be identified or detected by the system. A class can refer to a variety of things, depending on the respective plugin type it belongs to. For example, a 'class' in the context of a speaker identification plugin is an individual speaker of interest; it is a language or dialect in a LID plugin, a keyword in a KWS or QBE plugin, or a topic when referring to a TPD plugin. Classes can be pre-enrolled within a plugin, as is often the case with language identification plugins, but it is often necessary for end users to enroll their own classes of interest, as in the case of speaker identification plugins.

Frame

A frame is a very short slice of audio, typically 10 ms long. FrameScorer plugins report a score for each frame of audio submitted.
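As a rough illustration of the arithmetic, the sketch below converts between audio duration, frame count, and frame start time. It assumes a fixed 10 ms frame step; the function names are hypothetical, and the actual frame rate is reported by the plugin rather than assumed.

```python
FRAME_MS = 10  # assumed frame step; the glossary says "typically 10 ms"


def frame_count(duration_ms: int, frame_ms: int = FRAME_MS) -> int:
    """Number of whole frames that fit in an audio span of duration_ms."""
    return duration_ms // frame_ms


def frame_start_seconds(index: int, frame_ms: int = FRAME_MS) -> float:
    """Start time, in seconds, of the frame with the given zero-based index."""
    return index * frame_ms / 1000.0
```

Under this assumption, a 3-second buffer yields 300 frame scores, and the frame at index 150 starts 1.5 seconds into the audio.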

Plugin Traits (Common API Processes)

A plugin's functionality is defined by the Traits that it implements. Each plugin trait is associated with the set of messages that it is allowed to send, and that must be implemented for proper functionality. Below, these traits and their associated messages are defined.

FrameScorer

A frame scorer provides a score output at a fixed interval across an audio file or buffer, generally every 10 ms. SAD and VTD are currently the only frame scorers in OLIVE.

RegionScorer

A region scorer provides scores for audio sub-segments detected within an audio file or buffer. For example, a KWS plugin would report each detected keyword together with its time boundaries and its score.

GlobalScorer

A global scorer assumes that an audio file or buffer is all of the same class and scores it as a unit. Examples include language identification and speaker verification.
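The three scoring traits differ mainly in the shape of what they return. The sketch below models those shapes with plain data classes; the class and field names are assumptions made for illustration and do not mirror the actual OLIVE message definitions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FrameScores:
    """Frame scorer output: one score per fixed-length frame."""
    class_id: str        # e.g. "speech"
    frame_ms: int        # frame step, generally 10
    scores: List[float]  # scores[i] covers the frame starting at i * frame_ms


@dataclass
class RegionScore:
    """Region scorer output: one detection with time boundaries and a score."""
    class_id: str        # e.g. the detected keyword
    start_s: float
    end_s: float
    score: float


@dataclass
class GlobalScores:
    """Global scorer output: one score per class for the whole file or buffer."""
    scores: Dict[str, float]

    def best(self) -> str:
        """Class with the highest score."""
        return max(self.scores, key=self.scores.get)


# Made-up values, purely to show the three shapes side by side.
sad_out = FrameScores(class_id="speech", frame_ms=10, scores=[-1.2, 0.4, 2.3])
kws_out = [RegionScore(class_id="hello", start_s=1.25, end_s=1.80, score=3.1)]
lid_out = GlobalScores(scores={"eng": 2.1, "spa": -0.4})
```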

Common API Processes

Adaptable / Adaptation

Adaptation typically uses in-domain data from a specific operational environment to alter the core behavior of the system so that it functions more effectively. Unlike adding data to a class, adaptation alters the system as a whole and thus generally produces a new domain. Plugins that are adaptable support either supervised or unsupervised adaptation.

Unsupervised adaptation improves performance without human input, using audio examples provided by users or accrued from use in a mission's audio conditions. This type of adaptation is triggered either when a user-specified amount of data has accrued or when the end-user application explicitly requests it. Unsupervised adaptation does not create a new domain; it alters an existing domain, and the change is reversible.

Supervised adaptation, by contrast, requires human input, generally in the form of data that is properly annotated with respect to the target plugin. Language labels, for example, are necessary to perform LID adaptation, speech/non-speech labels for SAD adaptation, speaker labels for SID adaptation, and so on. Supervised adaptation creates a new domain in most cases.

The benefits of adaptation vary with the targeted application, the data conditions, and the amount of labeled data available to run adaptation.

More details about adaptation can be found in the Adaptation sections of the API Documentation or CLI User Guide, or within individual plugin sections, if appropriate.

Supervised Adaptation

Human-assisted improvement of the plugin, generally with feedback to the system in the form of annotations of target phenomena or error corrections.

Unsupervised Adaptation

Autonomous adaptation using unlabelled data; requires no human labelling or feedback. This is not currently supported by any OLIVE plugins in the traditional sense, but some plugins do support the ability to perform an Update, which is a form of unsupervised adaptation.

Enrollable / Enrollment

Enrollment is the mechanism by which target classes are added to a plugin domain. An enrollable plugin allows users to add new classes and augment existing classes. Examples include speaker detection, language recognition, and keyword spotting.

Augmentable / Augmentation

Plugins that support enrollment also support augmentation. Augmentation is simply the process of adding more data to an existing class.
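The difference between enrolling a new class and augmenting an existing one can be sketched with a toy in-memory store. `ToyDomain` and its methods are hypothetical and exist only for this example; a real plugin would build or update a model from the audio rather than keep a list of file paths.

```python
from typing import Dict, List


class ToyDomain:
    """Hypothetical class store illustrating enrollment vs. augmentation."""

    def __init__(self) -> None:
        self._classes: Dict[str, List[str]] = {}

    def enroll(self, class_id: str, audio_path: str) -> bool:
        """Add audio for a class.

        Returns True if this call created a new class (enrollment) and
        False if it added data to an existing class (augmentation).
        """
        is_new = class_id not in self._classes
        self._classes.setdefault(class_id, []).append(audio_path)
        return is_new

    def examples(self, class_id: str) -> List[str]:
        """Audio examples currently associated with a class."""
        return list(self._classes[class_id])


domain = ToyDomain()
created = domain.enroll("speaker-A", "call_01.wav")    # new class: enrollment
augmented = domain.enroll("speaker-A", "call_02.wav")  # same class: augmentation
```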

Updateable / Update

Updating occurs when a user invokes unsupervised adaptation on a plugin/domain by requesting that the plugin use the operational data examples it has accrued throughout normal usage to update the plugin's parameters and models to better fit the usage environment.

Diarization

Diarization is the process of automatically segmenting an audio file or stream based on a set of target phenomena. SAD is diarization based on speech and non-speech segments. Speaker diarization segments a file based on speaker changes.
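For the SAD case, one simple way to picture diarization is to threshold per-frame scores and merge consecutive speech frames into segments. This is a minimal sketch under stated assumptions: a 10 ms frame step, a threshold of 0.0, and a hypothetical function name. It is not how any particular OLIVE plugin derives its segments.

```python
from typing import List, Sequence, Tuple


def frames_to_segments(
    scores: Sequence[float],
    threshold: float = 0.0,   # assumed decision threshold
    frame_ms: int = 10,       # assumed frame step
) -> List[Tuple[float, float]]:
    """Merge runs of frames scoring above threshold into (start_s, end_s) pairs."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame in the current speech run, if any
    for i, score in enumerate(scores):
        is_speech = score > threshold
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        # The audio ended while still inside a speech run.
        segments.append((start * frame_ms / 1000, len(scores) * frame_ms / 1000))
    return segments
```

Practical systems usually also smooth the scores or pad the segment boundaries, which this sketch omits.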

Audio Vector

An audio vector is a representation of an audio file in a form pre-processed for a specific task. For example, an audio file may be stored as a speaker vector representation for a speaker detection plugin. This is useful because the vector is a very compact form of the file: it is quick to read into memory and much faster to score against than reading the original file from disk.
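To see why scoring against a stored vector is fast, consider comparing two fixed-length vectors directly. The sketch below uses cosine similarity purely as a stand-in; the actual vector format and scoring back-end are plugin-specific and are not described here, and the function name and example values are made up.

```python
import math
from typing import Sequence


def cosine_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical vectors: one stored at enrollment time, one from new audio.
enrolled = [0.12, -0.40, 0.88]
incoming = [0.10, -0.35, 0.90]
similarity = cosine_score(enrolled, incoming)
```

Scoring here touches only a handful of numbers, with no audio decoding or feature extraction, which is the practical benefit the definition above describes.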