Plugin Documentation
This page contains high-level information about OLIVE plugins and their associated concepts. For more detailed information about integrating with plugins, please refer to the lower-level Plugin API Integration Details pages.
For more information about a specific plugin, please find its info page link in the list at the bottom of this page.
Anatomy of a Plugin
OLIVE Plugins encapsulate the actual audio processing technologies and capabilities of the OLIVE system into modular pieces. This modularity facilitates system upgrades, capability additions, and incremental updates, and provides other benefits, such as allowing models to be tuned to improve processing in targeted audio conditions, or offering multiple tools or options to choose from to accomplish a given task.
Each plugin has a specific type, which defines the task that it is capable of performing (see Plugin Types below). Plugins consist of two parts: the plugin proper, which contains the recipe or algorithm for how to perform the task, and one or more domains, which contain the data models used to run the algorithm and perform the function. Since plugins are generally machine learning-based, the same algorithm may have multiple strengths depending on the specific data used to train it (e.g. telephone audio, push-to-talk audio, high-effort vocalization, conversational speech). This separation between the algorithm and the data model allows new functionality that is based on training data to be delivered independently of the algorithm, by delivering a new domain. Classes and enrollments (described below) are associated with a domain.
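As a concrete illustration, the minimal sketch below models the plugin/domain relationship described above in Python. All of the type and field names are hypothetical and are not part of the OLIVE API; the plugin and domain names follow the Naming Conventions described later on this page.

```python
# Hypothetical data model of the plugin/domain relationship; these
# classes are illustrative only and are not part of the OLIVE API.
from dataclasses import dataclass, field


@dataclass
class Domain:
    """A data model trained for a specific audio condition."""
    name: str                                          # e.g. "tel-v1" (telephone audio)
    classes: list[str] = field(default_factory=list)   # classes/enrollments belong to a domain


@dataclass
class Plugin:
    """The 'plugin proper': the recipe/algorithm for one task type."""
    name: str              # e.g. "sid-embed-v2"
    task: str              # e.g. "SID" (see Plugin Types below)
    domains: list[Domain] = field(default_factory=list)


# One algorithm, multiple data models: delivering a new domain adds new
# trained functionality without changing the algorithm itself.
sid = Plugin(
    name="sid-embed-v2",
    task="SID",
    domains=[
        Domain(name="tel-v1", classes=["speaker_A", "speaker_B"]),
        Domain(name="ptt-v1"),  # push-to-talk condition, no enrollments yet
    ],
)
```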
Plugin Types
Plugin function types refer to the core task the plugin performs and have standard abbreviations. Most plugins are designed to perform one specific function (for example, language identification or keyword spotting). We refer to plugins as working on audio segments, since OLIVE can process both audio files (on the file system) and audio buffers (sent as data through the API).
For more information about each plugin type (as opposed to a specific plugin or domain), including plugin type definitions, general output formats, and use cases, please click the name of the plugin type in the 'Function type' column of the table below to visit that plugin type's information page:
Table of Plugin Function Types
Function type | Abbreviation | Scoring type | Classes | Description |
---|---|---|---|---|
Speech Activity Detection | SAD | Frame / Region | speech | Identifies speech regions in an audio segment |
Speaker Identification | SID | Global | Enrolled speakers | Identifies whether a single-talker audio segment contains a target speaker |
Speaker Detection | SDD | Region | Enrolled speakers | Detects a target speaker in audio with multiple talkers |
Speaker Diarization (Deprecated) | DIA | Region | Detect each unlabeled speaker | Segments audio into clusters of unique speakers |
Language Identification | LID | Global | Languages in training data and/or enrolled languages in some plugins | Detect and label a single language per input audio segment or file |
Language Detection | LDD | Region | Languages in training data and/or enrolled languages in some plugins | Detect and label one or more language regions per input audio |
Automatic Speech Recognition | ASR | Region | N/A | Creates a text transcription of the input audio |
Keyword Spotting (Deprecated) | KWS | Region | Enrolled text keywords | Language-specific approach to keyword detection using speech recognition and text |
Query by Example Keyword Spotting | QBE | Region | Enrolled audio keywords | Language independent approach to word spotting using one or more audio examples |
Gender Identification | GID | Global | male, female | Determines whether the audio segment was spoken by a male or female voice. Single output class for a given audio input. |
Gender Detection | GDD | Region | male, female | Detects and labels whether speech is likely spoken by a male or female voice. Capable of detecting multiple regions within the input audio. |
Topic Detection (Deprecated) | TPD | Region | Enrolled topics | Detects topic regions in an audio segment |
Speech Enhancement | ENH | Audio to Audio | N/A | Reduces noise in an audio segment |
Voice Type Discrimination | VTD | Frame / Region | live-speech | Detects the presence of live-produced human speech, differentiating it from silence, noise, and speech coming from an electronic device |
Text Machine Translation | TMT | TextTransformer | N/A | Translates text from the specified input language to the specified output language. Does not currently output timing information. Does not operate on audio. |
Audio Redaction | RED | Audio to Audio | N/A | Replaces selected audio regions with either 'bleeped' or transformed audio for privacy protection purposes |
Deep Fake Audio Detection | DFA | Global | synthetic | Identifies whether audio is likely to be synthetically generated by a deep fake algorithm, or naturally generated by a human talker |
Speaker Highlighting | SHL | Region | Highlighted Speaker | Detects additional regions in audio where the seeded speaker is found. Requires human intervention in the form of a selected region containing representative speech from the speaker of interest. |
Face Detection from Image | FDI | BoundingBoxScorer | face(s) | Detect one or more faces in an image. Detects faces in general, not necessarily specific faces. |
Face Detection from Video | FDV | BoundingBoxScorer | face(s) | Detect one or more faces in a video. Detects faces in general, not necessarily specific faces. |
Face Recognition from Image | FRI | BoundingBoxScorer | enrolled face(s) | Detect one or more specific, enrolled face(s) in an image. Outputs a bounding box where face is detected. |
Face Recognition from Video | FRV | BoundingBoxScorer | enrolled face(s) | Detect one or more specific, enrolled face(s) in a video. Outputs a bounding box where face is detected, with accompanying timestamp(s). |
Scoring Types
Different function types score audio segments at different levels of granularity. Some plugin functionality differences are essentially differences in how an audio segment is treated -- as a single unit or as potentially multiple units. For example, the main difference between speaker identification and speaker detection is how a segment is scored: speaker identification assumes that the audio segment sent to it for scoring is homogeneous and comes from a single speaker, whereas speaker detection allows for the possibility of multiple speakers in a given audio segment. There are three major scoring types (a sketch of the corresponding result shapes follows the list):
Frame
- Assigns a score for each 10 ms frame of the audio segment submitted for scoring.
Region
- Assigns and reports time boundaries defining region(s) within the audio segment, and for each region, an accompanying score for each detected class.
Global
- Assigns a single score for the entire audio segment for each of the plugin's classes.
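To illustrate, the hypothetical Python structures below show what results of each scoring type might look like. The field names and example values are assumptions for illustration, not the OLIVE API's actual output format.

```python
# Hypothetical result shapes for the three scoring types; illustrative
# only, not the actual OLIVE output format.
from dataclasses import dataclass


@dataclass
class FrameScores:
    """Frame scoring: one score per 10 ms frame (e.g. SAD, VTD)."""
    frame_length_ms: int   # 10 ms per frame
    scores: list[float]    # scores[i] covers [i*10, (i+1)*10) ms


@dataclass
class RegionScore:
    """Region scoring: time boundaries plus a score for a detected class."""
    start_s: float
    end_s: float
    class_id: str          # e.g. an enrolled speaker or a language
    score: float


@dataclass
class GlobalScore:
    """Global scoring: one score per class for the entire segment."""
    class_id: str
    score: float


# The same short segment scored three ways might look like:
sad_result = FrameScores(frame_length_ms=10, scores=[0.1, 0.9, 0.8])
sdd_result = [RegionScore(start_s=0.01, end_s=0.03, class_id="speaker_A", score=2.3)]
lid_result = [GlobalScore("english", 1.7), GlobalScore("spanish", -0.4)]
```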
For more information on these scoring types, refer to the Plugin Traits page.
Classes
Certain plugin types have classes as an attribute. These can be common, cross-mission categories that are often pre-trained - like speech, languages, or dialects - or they can be ad-hoc, mission-specific classes like speakers or topics. A plugin’s classes may be completely fixed, as in gender identification (male, female) or speech activity detection (speech), or an open set, as in language identification (English, Spanish, Mandarin, etc.), topics, or speakers. Some plugins allow the user to add new classes or modify existing classes. Some class sets are inherently closed, like those of SAD and GID, where the plugin is complete and covers the world of possible classes. Others, like LID, SID, and TPD plugins, will probably never be complete in covering all classes, and thus will always need to be able to treat a segment as though it may not come from any of the classes the plugin recognizes (i.e. ‘out of set’).
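As an illustration of the out-of-set concept, the sketch below shows one common way an open-set plugin's global scores might be interpreted. The threshold logic and names are assumptions for illustration; actual OLIVE plugins handle out-of-set decisions internally.

```python
# Illustrative out-of-set handling for an open-set classifier; the
# threshold and class names are hypothetical.
def best_class(global_scores: dict[str, float], threshold: float = 0.0) -> str:
    """Pick the top-scoring class, falling back to 'out-of-set' when no
    known class scores above the threshold."""
    top = max(global_scores, key=global_scores.get)
    if global_scores[top] < threshold:
        return "out-of-set"  # segment likely not from any known class
    return top


print(best_class({"english": 2.1, "spanish": -0.3}))   # -> english
print(best_class({"english": -1.2, "spanish": -0.9}))  # -> out-of-set
```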
Enrollments
Enrollments are a subset of classes that the user can create and/or modify. Both creation of a class and modification of an existing class are class modification requests; the first class modification request for a given class also has the effect of creating that class if it does not yet exist. Enrollments may be generated by end users with examples from their own data, and can be learned from as few as one or a small number of examples (SID, QBE) up to a relatively large number of examples (LID, TPD). Speakers are typically enrollments, as are query-based keywords and topics. Languages can also be enrolled and augmented with certain plugins. Since enrollments are dynamic, they may be incrementally updated “on the fly” with new examples.
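The sketch below illustrates what a client-side enrollment flow could look like. The client object and method names are hypothetical placeholders; the actual calls are documented in the Enrollments section of the API Integration page referenced below.

```python
# Hypothetical enrollment (class modification) flow; method names are
# placeholders, not the actual OLIVE API.
def enroll(client, plugin: str, domain: str, class_id: str, audio_path: str) -> None:
    """Submit a class modification request.

    The first request for a given class id also creates the class;
    later requests with the same id incrementally update it "on the
    fly" with new examples."""
    with open(audio_path, "rb") as f:
        client.modify_class(
            plugin=plugin,
            domain=domain,
            class_id=class_id,  # e.g. a speaker name for SID
            audio=f.read(),
        )


# Enrolling the same speaker twice augments, rather than replaces,
# the existing enrollment:
# enroll(client, "sid-embed-v2", "tel-v1", "speaker_A", "call_1.wav")
# enroll(client, "sid-embed-v2", "tel-v1", "speaker_A", "call_2.wav")
```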
For integration details regarding enrollments, refer to the Enrollments section of the API Integration page. To determine if a plugin supports enrollments, or to check what its default enrolled classes are (if any), refer to that plugin's details page from the Specific Plugins list below.
Online Updates
Considerable improvements to system accuracy and calibration can be achieved by updating a plugin post-deployment to better align it with the conditions observed in recent history. Several plugins are able to perform unsupervised updates to certain submodules of the plugin. These updates do not require labels or human input, and are based on information collected automatically during normal system use. In most cases, an update must be invoked by the user via the API, which also provides an option to determine whether an update is ready to be applied.
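A sketch of that flow is shown below. The client object and method names are hypothetical; see the Update section of the API Integration page for the actual calls.

```python
# Hypothetical online update flow; method names are placeholders, not
# the actual OLIVE API.
def maybe_update(client, plugin: str, domain: str) -> bool:
    """Apply an unsupervised online update if one is ready.

    The plugin accumulates statistics automatically during normal use;
    no labels or human input are required, but the update itself must
    be invoked explicitly via the API."""
    if client.update_ready(plugin=plugin, domain=domain):
        client.apply_update(plugin=plugin, domain=domain)
        return True
    return False
```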
For integration details regarding the update functionality, refer to the Update section of the API Integration page. To check if a plugin supports online updates, refer to its detailed information page from the Specific Plugins list below.
Adaptation
As with online updates, even larger boosts in performance can be possible by exposing a plugin to the mission's audio conditions, or to similarly representative audio conditions. Adaptation, however, requires human input and, in some cases, data that is properly annotated with respect to the target plugin. Language labels, for example, are necessary to perform LID adaptation, speech labels for SAD adaptation, speaker labels for SID adaptation, and so on. The benefits of adaptation vary with the targeted application, the data conditions, and the amount of labeled data available to run adaptation.
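As a sketch, LID adaptation might be driven by labeled audio such as the following. The plugin/domain names and client calls are hypothetical placeholders for the actual interfaces described in the Adaptation documentation.

```python
# Hypothetical LID adaptation input: adaptation, unlike online updates,
# requires properly labeled, representative data.
labeled_audio = [
    ("mission_call_01.wav", "english"),  # language labels for LID adaptation
    ("mission_call_02.wav", "spanish"),
    ("mission_call_03.wav", "english"),
]

# Placeholder calls, not the actual OLIVE API:
# for path, label in labeled_audio:
#     client.add_adaptation_example(plugin="lid-embed-v1", domain="multi-v1",
#                                   audio_path=path, label=label)
# client.run_adaptation(plugin="lid-embed-v1", domain="multi-v1")
```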
More details about adaptation can be found in the Adaptation sections of the API Documentation or CLI User Guide, or within individual plugin sections, if appropriate.
Naming Conventions
Plugin Names
Each plugin proper is given a three-part name of the form function-attribute-version:
function
- three-letter abbreviation from column two of the Plugin Types Table above
attribute
- string that identifies the key attribute of the plugin, generally the algorithm name
version
- tracks the iteration of the plugin in development in the form of v<digit>
For example, sid-embed-v2 is a speaker identification plugin using a speaker embeddings algorithm, and is the second release or update of the approach. An additional alphanumeric character may be appended to the version number if a plugin is re-released with bug fixes but the performance is expected to be the same. For example, sad-dnn-v6a is a modified version of sad-dnn-v6, but the changes were meant to address errors or shortcomings in the plugin, not to change the algorithm or data used.
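The convention is regular enough to parse mechanically, as the sketch below shows using the two plugin names from this section.

```python
# Parsing the function-attribute-version convention, including the
# optional trailing character that marks a bug-fix re-release.
import re

PLUGIN_NAME = re.compile(
    r"^(?P<function>[a-z]{3})-(?P<attribute>.+)-v(?P<version>\d+)(?P<fix>[a-z0-9]?)$"
)

for name in ("sid-embed-v2", "sad-dnn-v6a"):
    m = PLUGIN_NAME.match(name)
    print(m.group("function"), m.group("attribute"),
          m.group("version"), m.group("fix") or "(none)")
# sid embed 2 (none)
# sad dnn 6 a
```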
Domain Names
Domain names typically have two or three parts: condition-version, or language-condition-version for plugins that have language-dependent domain components. Keyword spotting domains, for example, also contain the language for which the domain was trained.
language
- the language for which the domain was trained if language-dependent, or a representation of the set of languages contained within a LID plugin's domain
condition
- the specific audio environment for which the domain was trained, or “multi” if the domain was developed to be condition independent
version
- tracks the iteration of the domain in development in the form of v<digit>
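The same approach works for domain names, where the language part is optional. The example domain names below are hypothetical.

```python
# Parsing [language-]condition-version domain names; the example names
# are hypothetical.
import re

DOMAIN_NAME = re.compile(
    r"^(?:(?P<language>[a-z]+)-)?(?P<condition>[a-z]+)-v(?P<version>\d+)$"
)

for name in ("tel-v1", "eng-tel-v1", "multi-v1"):
    m = DOMAIN_NAME.match(name)
    print(m.group("language") or "(none)", m.group("condition"), m.group("version"))
# (none) tel 1
# eng tel 1
# (none) multi 1
```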
Specific Plugins
For additional information about specific plugins, including their options and implementation details, please refer to the specific plugin pages, accessible from each of the individual Plugin Task pages.