Plugin Documentation
This page contains high-level information about OLIVE plugins and their associated concepts. For more detailed information about integrating with plugins, please refer to the lower-level Plugin API Integration Details pages.
For more information about a specific plugin, please find its info page link in the list at the bottom of this page.
Anatomy of a Plugin
OLIVE plugins encapsulate the audio processing technologies and capabilities of the OLIVE system in modular pieces. This modularity facilitates system upgrades, capability additions, and incremental updates. It also allows models to be tuned to improve processing in targeted audio conditions, and allows multiple tools or options to be offered for a given task.
Each plugin has a specific type, which defines the task that it is capable of performing (see Plugin Types below). A plugin consists of two parts: the plugin proper, which contains the recipe or algorithm for performing the task, and one or more domains, which contain the data models used to run the algorithm and perform the function. Since plugins are generally machine learning-based, the same algorithm may have different strengths depending on the specific data used to train it (e.g. telephone audio, push-to-talk audio, high effort vocalization, conversational speech). This separation between the algorithm and the data model allows us to deliver new functionality that is based on training data independently of the algorithm by delivering a new domain. Classes and enrollments (described below) are associated with a domain.
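The relationship between a plugin proper, its domains, and the classes attached to each domain can be pictured with a small data model. This is only an illustrative sketch: the `Plugin` and `Domain` types, their fields, and the domain name `tel-v1` are hypothetical and are not part of the OLIVE API.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """Data models for one audio condition, plus the classes tied to it (hypothetical type)."""
    name: str
    classes: set[str] = field(default_factory=set)


@dataclass
class Plugin:
    """The algorithm ("plugin proper") and the domains it can run with (hypothetical type)."""
    name: str
    function_type: str
    domains: dict[str, Domain] = field(default_factory=dict)

    def add_domain(self, domain: Domain) -> None:
        # A new domain adds capability without changing the algorithm itself.
        self.domains[domain.name] = domain


plugin = Plugin(name="sid-embed-v2", function_type="SID")
plugin.add_domain(Domain(name="multi-v1"))
plugin.add_domain(Domain(name="tel-v1"))  # hypothetical domain name
```

Keeping classes on the domain rather than on the plugin mirrors the statement above that classes and enrollments are associated with a domain.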
Plugin Types
Plugin function types refer to the core task the plugin performs and have standard abbreviations. Most plugins are designed to perform one specific function (for example, language identification, keyword spotting). We refer to plugins as working on audio segments, since OLIVE can process both audio files (on the file system) and audio buffers (sent as data through the API).
For more information regarding each plugin type, but not a specific plugin or domain, including plugin type definitions, general output formats, and use cases, please click on the name of the plugin in the 'Function Type' column of the table below to visit that plugin type's information page:
Table of Plugin Function Types
Function type | Abbreviation | Scoring type | Classes | Description |
---|---|---|---|---|
Speech Activity Detection | SAD | Frame / Region | speech | Identifies speech regions in an audio segment |
Speaker Identification | SID | Global | Enrolled speakers | Identifies whether a single-talker audio segment contains a target speaker |
Speaker Detection | SDD | Region | Enrolled speakers | Detects a target speaker in audio with multiple talkers |
Speaker Diarization (Deprecated) | DIA | Region | Detect each unlabeled speaker | Segments audio into clusters of unique speakers |
Language Identification | LID | Global | Languages in training data and/or enrolled languages in some plugins | Detect and label a single language per input audio segment or file |
Language Detection | LDD | Region | Languages in training data and/or enrolled languages in some plugins | Detect and label one or more language regions per input audio |
Automatic Speech Recognition | ASR | Region | | Creates a text transcription of the input audio |
Keyword Spotting (Deprecated) | KWS | Region | Enrolled text keywords | Language-specific approach to keyword detection using speech recognition and text |
Query by Example Keyword Spotting | QBE | Region | Enrolled audio keywords | Language independent approach to word spotting using one or more audio examples |
Gender Identification | GID | Global | male, female | Determines whether the audio segment was spoken by a male or female voice. Single output class for a given audio input. |
Gender Detection | GDD | Region | male, female | Detects and labels whether speech is likely spoken by a male or female voice. Capable of detecting multiple regions within the input audio. |
Topic Detection (Deprecated) | TPD | Region | Enrolled topics | Detects topic regions in an audio segment |
Speech Enhancement | ENH | N/A | N/A | Reduces noise in an audio segment |
Voice Type Discrimination | VTD | Frame | live-speech | Detects the presence of live-produced human speech, distinguishing it from silence, noise, and speech coming from an electronic device |
Text Machine Translation | TMT | TextTransformer | | Performs a translation of text from the input language specified to the output language specified. Does not currently output timing information. Does not operate on audio. |
Scoring Types
Different function types score audio segments at different levels of granularity. Some differences in plugin functionality are essentially differences in how an audio segment is treated: as a single unit or as potentially multiple units. For example, the main difference between speaker identification and speaker detection is how a segment is scored. Speaker identification assumes that the audio segment sent to it for scoring is homogeneous and comes from a single speaker, whereas speaker detection allows for the possibility of multiple speakers in a given audio segment. There are three major scoring types:
Frame
- Assigns a score for each 10ms frame of the audio segment submitted for scoring.
Region
- Assigns and reports time boundaries defining region(s) within the audio segment, and for each region, an accompanying score for each detected class.
Global
- Assigns a single score for the entire audio segment for each of the plugin's classes.
For more information on these scoring types, refer to the Plugin Traits page.
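To make the difference between frame and region output concrete, the sketch below turns a list of per-frame scores (one per 10 ms frame) into time-bounded regions by thresholding. The function name, the threshold value, and the merging rule are assumptions made for illustration. OLIVE plugins may derive regions quite differently.

```python
FRAME_MS = 10  # one score per 10 ms frame, as described above


def frames_to_regions(frame_scores, threshold=0.0):
    """Merge runs of frames scoring above `threshold` into (start_ms, end_ms) regions.

    The threshold and the merging rule are illustrative assumptions.
    """
    regions = []
    start = None  # index of the first frame in the current run, if any
    for i, score in enumerate(frame_scores):
        if score > threshold and start is None:
            start = i
        elif score <= threshold and start is not None:
            regions.append((start * FRAME_MS, i * FRAME_MS))
            start = None
    if start is not None:
        # The last run extends to the end of the segment.
        regions.append((start * FRAME_MS, len(frame_scores) * FRAME_MS))
    return regions


regions = frames_to_regions([-1.0, 2.0, 3.0, -1.0, -1.0, 4.0])
```

A global score, by contrast, would be a single value per class for the whole segment, with no time boundaries attached.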
Classes
Certain plugin types have classes as an attribute. These can be common, cross-mission categories that are often pre-trained, like speech, languages, or dialects, or they can be ad-hoc, mission-specific classes like speakers or topics. A plugin’s classes may be completely fixed, as in gender identification (male, female) or speech activity detection (speech), or an open set, as in language identification (English, Spanish, Mandarin, etc.), topics, or speakers. Some plugins allow the user to add new classes or modify existing classes. Some class sets are inherently closed, like those of SAD and GID, where the plugin is complete and covers the world of possible classes. Others, like those of LID, SID, and TPD plugins, will probably never cover all possible classes, so these plugins must always be able to treat a segment as though it may not belong to any of the classes the plugin recognizes (i.e. ‘out of set’).
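One simple way to picture out-of-set handling for an open class set is to reject a segment when no class scores high enough. The sketch below assumes a dictionary of per-class global scores and a hypothetical threshold of 0.0. The function name, the score scale, and the decision rule are all illustrative and do not describe how any OLIVE plugin makes this decision.

```python
def label_segment(scores, threshold=0.0):
    """Return the best-scoring class, or 'out of set' if no class reaches `threshold`.

    `scores` maps class name to a global score; the threshold is a hypothetical value.
    """
    if not scores:
        return "out of set"
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "out of set"


in_set = label_segment({"English": 2.5, "Spanish": -1.0})
rejected = label_segment({"English": -3.0, "Spanish": -4.0})
```

A closed class set such as GID would not need the rejection branch, since every segment is assumed to belong to one of the known classes.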
Enrollments
Enrollments are a subset of classes that the user can create and/or modify. Both creation of a class and modification of an existing class are class modification requests, where the first class modification request for a given class also has the effect of creating the new class if it does not yet exist. Enrollments may be generated by end users with examples from their own data, and can be learned from anywhere between a single example or a small number of examples (SID, QBE) and a relatively large number of examples (LID, TPD). Speakers are typically enrollments, as are query-based keywords and topics. Languages can also be enrolled and augmented with certain plugins. Since enrollments are dynamic, they may be incrementally updated “on the fly” with new examples.
For integration details regarding enrollments, refer to the Enrollments section of the API Integration page. To determine if a plugin supports enrollments, or to check what its default enrolled classes are (if any), refer to that plugin's details page from the Specific Plugins list below.
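The "create on first modification" behavior described above can be sketched with an in-memory stand-in. The `EnrollmentStore` class, its method names, and the file names are hypothetical. The real enrollment calls are documented on the API Integration page.

```python
from collections import defaultdict


class EnrollmentStore:
    """Hypothetical in-memory stand-in for the enrolled classes of one domain."""

    def __init__(self):
        self._examples = defaultdict(list)

    def modify_class(self, class_id, audio_example):
        """Add an example to `class_id`; return True if this request created the class."""
        created = class_id not in self._examples
        self._examples[class_id].append(audio_example)
        return created

    def example_count(self, class_id):
        """Number of examples enrolled so far for `class_id` (0 if it does not exist)."""
        return len(self._examples.get(class_id, []))


store = EnrollmentStore()
first = store.modify_class("speaker_a", "call_001.wav")   # creates the class
second = store.modify_class("speaker_a", "call_002.wav")  # augments the existing class
```

Because each request only appends an example, the same call covers both initial enrollment and later incremental updates.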
Online Updates
Considerable improvements to system accuracy and calibration can be achieved by updating a plugin post-deployment to better match recently observed audio conditions. Several plugins are able to perform unsupervised updates to certain submodules of the plugin. These updates do not require labels or human input, and are based on information collected automatically during normal system use. In most cases, a system update must be invoked by the user via the API. The API also provides an option to determine whether an update is ready to be applied.
For integration details regarding the update functionality, refer to the Update section of the API Integration page. To check if a plugin supports online updates, refer to its detailed information page from the Specific Plugins list below.
Adaptation
As with online updates, a plugin can be updated by exposing it to the mission's audio conditions, or to similarly representative audio conditions, and this can yield even larger boosts in performance. Adaptation, however, requires human input and, in some cases, data that is properly annotated with respect to the target plugin. Language labels, for example, are necessary to perform LID adaptation, speech labels for SAD adaptation, speaker labels for SID adaptation, and so on. The benefits of adaptation vary with the targeted application, the data conditions, and the amount of labeled data available for adaptation.
More details about adaptation can be found in the Adaptation sections of the API Documentation or CLI User Guide, or within individual plugin sections, if appropriate.
Naming Conventions
Plugin Names
Each plugin proper is given a three part name in the following form: function-attribute-version:
function
- three-letter abbreviation from column two of the Plugin Types Table above
attribute
- string that identifies the key attribute of the plugin, generally the algorithm name
version
- tracks the iteration of the plug-in in development in the form of v<digit>
For example, sid-embed-v2 is a speaker identification plugin that uses a speaker embeddings algorithm, and it is the second release or update of the approach. An additional alphanumeric character may be appended to the version number if a plugin is re-released with bug fixes but its performance is expected to be the same. For example, sad-dnn-v6a is a modified version of sad-dnn-v6, but the changes were meant to address errors or shortcomings in the plugin, not to change the algorithm or data used.
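The naming convention above lends itself to a small parser. The sketch below is an assumption-laden illustration, not an OLIVE utility: the function name is hypothetical, and the pattern assumes lowercase names, an attribute without hyphens, and at most one trailing revision character.

```python
import re

# function-attribute-v<digit>, with an optional single revision character (assumed pattern)
PLUGIN_NAME = re.compile(
    r"(?P<function>[a-z]{3})-(?P<attribute>[a-z0-9]+)-v(?P<version>\d+)(?P<revision>[a-z0-9]?)"
)


def parse_plugin_name(name):
    """Split a plugin name into function, attribute, version, and revision strings."""
    match = PLUGIN_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a recognized plugin name: {name!r}")
    return match.groupdict()


first_release = parse_plugin_name("sid-embed-v2")
bugfix_release = parse_plugin_name("sad-dnn-v6a")
```

Here `bugfix_release["revision"]` is `"a"`, while `first_release["revision"]` is an empty string, which matches the distinction between sad-dnn-v6a and an unrevised release.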
Domain Names
Domain names typically have two or three parts: condition-version or language-condition-version for plugins that have language-dependent domain components. Keyword spotting domains also contain the language for which the domain was trained.
language
- the language for which the domain was trained if language-dependent, or a representation of the set of languages contained within a LID plugin's domain
condition
- the specific audio environment for which the domain was trained, or “multi” if the domain was developed to be condition independent
version
- tracks the iteration of the plug-in in development in the form of v<digit>
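Domain names can be split in a similar way, with the language part treated as optional. This is a sketch under stated assumptions: the function name and the example domain names `multi-v1` and `eng-tel-v2` are hypothetical, and the code assumes that no part contains a hyphen.

```python
import re


def parse_domain_name(name):
    """Split a domain name into language (None if absent), condition, and version."""
    parts = name.split("-")
    if len(parts) == 2:
        language = None
        condition, version = parts
    elif len(parts) == 3:
        language, condition, version = parts
    else:
        raise ValueError(f"expected two or three parts: {name!r}")
    if not re.fullmatch(r"v\d+", version):
        raise ValueError(f"version must look like v<digit>: {name!r}")
    return {"language": language, "condition": condition, "version": version}


condition_only = parse_domain_name("multi-v1")      # hypothetical domain name
with_language = parse_domain_name("eng-tel-v2")     # hypothetical domain name
```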
Specific Plugins
For additional information about specific plugins, including their options and implementation details, please refer to the specific plugin pages, accessible from each of the individual Plugin Task pages.