Plugin Documentation

This page contains high-level information about OLIVE plugins and their associated concepts. For more detailed information about integrating with plugins, please refer to the lower-level Plugin API Integration Details pages.

For more information about a specific plugin, please find its info page link in the list at the bottom of this page.

Anatomy of a Plugin

OLIVE plugins encapsulate the audio processing technologies and capabilities of the OLIVE system in modular pieces. This modularity facilitates system upgrades, capability additions, and incremental updates, and it provides other benefits, such as allowing models to be tuned to improve processing in targeted audio conditions, or offering multiple tools or options to accomplish a given task.

Each plugin has a specific type, which defines the task it is capable of performing (see Plugin Types below). A plugin consists of two parts: the plugin proper, which contains the recipe, or algorithm, for how to perform the task, and one or more domains, which contain the data models used to run the algorithm and perform the function. Since plugins are generally machine learning-based, the same algorithm may have different strengths depending on the specific data used to train it (e.g. telephone audio, push-to-talk audio, high-effort vocalization, conversational speech). This separation between the algorithm and the data model allows new functionality based on training data to be delivered independently of the algorithm, simply by delivering a new domain. Classes and enrollments (described below) are associated with a domain.
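
To make the plugin/domain separation concrete, the sketch below models the relationship as simple data structures. All names here are hypothetical and for illustration only; they do not reflect OLIVE's actual internal representation or API.

```python
from dataclasses import dataclass, field

@dataclass
class Domain:
    """One data model for a plugin, trained for a specific audio condition."""
    name: str                                          # e.g. "tel-v1" (telephone audio)
    classes: list[str] = field(default_factory=list)   # classes/enrollments live at the domain level

@dataclass
class Plugin:
    """The plugin proper: the algorithm or recipe, independent of training data."""
    name: str                                          # e.g. "sid-embed-v2"
    task: str                                          # e.g. "SID" (see Plugin Types below)
    domains: list[Domain] = field(default_factory=list)

# Support for a new audio condition can be delivered as a new domain,
# without changing the algorithm itself:
sid = Plugin("sid-embed-v2", "SID", [Domain("tel-v1"), Domain("ptt-v1")])
```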

Plugin Types

Plugin function types refer to the core task a plugin performs and have standard abbreviations. Most plugins are designed to perform one specific function (for example, language identification or keyword spotting). We refer to plugins as working on audio segments, since OLIVE can process both audio files (on the file system) and audio buffers (sent as data through the API).

For more information about each plugin type in general, including type definitions, common output formats, and use cases (as opposed to details about a specific plugin or domain), click the name of the plugin type in the 'Function Type' column of the table below to visit that plugin type's information page:

Table of Plugin Function Types

| Function Type | Abbreviation | Scoring Type | Classes | Description |
| --- | --- | --- | --- | --- |
| Speech Activity Detection | SAD | Frame / Region | speech | Identifies speech regions in an audio segment |
| Speaker Identification | SID | Global | Enrolled speakers | Identifies whether a single-talker audio segment contains a target speaker |
| Speaker Detection | SDD | Region | Enrolled speakers | Detects a target speaker in audio with multiple talkers |
| Speaker Diarization (Deprecated) | DIA | Region | Each unlabeled speaker detected | Segments audio into clusters of unique speakers |
| Language Identification | LID | Global | Languages in training data and/or enrolled languages in some plugins | Detects and labels a single language per input audio segment or file |
| Language Detection | LDD | Region | Languages in training data and/or enrolled languages in some plugins | Detects and labels one or more language regions per input audio |
| Automatic Speech Recognition | ASR | Region | N/A | Creates a text transcription of the input audio |
| Keyword Spotting (Deprecated) | KWS | Region | Enrolled text keywords | Language-specific approach to keyword detection using speech recognition and text |
| Query by Example Keyword Spotting | QBE | Region | Enrolled audio keywords | Language-independent approach to word spotting using one or more audio examples |
| Gender Identification | GID | Global | male, female | Determines whether the audio segment was spoken by a male or female voice; single output class for a given audio input |
| Gender Detection | GDD | Region | male, female | Detects and labels whether speech is likely spoken by a male or female voice; capable of detecting multiple regions within the input audio |
| Topic Detection (Deprecated) | TPD | Region | Enrolled topics | Detects topic regions in an audio segment |
| Speech Enhancement | ENH | Audio to Audio | N/A | Reduces noise in an audio segment |
| Voice Type Discrimination | VTD | Frame / Region | live-speech | Detects the presence of live-produced human speech, differentiating it from silence, noise, and speech coming from an electronic device |
| Text Machine Translation | TMT | TextTransformer | N/A | Translates text from the specified input language to the specified output language; does not currently output timing information and does not operate on audio |
| Audio Redaction | RED | Audio to Audio | N/A | Replaces selected audio regions with either 'bleeped' or transformed audio for privacy protection purposes |
| Deep Fake Audio Detection | DFA | Global | synthetic | Identifies whether audio is likely to be synthetically generated by a deep fake algorithm or naturally produced by a human talker |
| Speaker Highlighting | SHL | Region | Highlighted speaker | Detects additional regions in audio where the seeded speaker is found; requires human intervention in the form of a selected region of representative speech from the speaker in order to locate additional regions of that speaker in the file/audio |
| Face Detection from Image | FDI | BoundingBoxScorer | face(s) | Detects one or more faces in an image; detects faces in general, not necessarily specific faces |
| Face Detection from Video | FDV | BoundingBoxScorer | face(s) | Detects one or more faces in a video; detects faces in general, not necessarily specific faces |
| Face Recognition from Image | FRI | BoundingBoxScorer | Enrolled face(s) | Detects one or more specific, enrolled face(s) in an image; outputs a bounding box where each face is detected |
| Face Recognition from Video | FRV | BoundingBoxScorer | Enrolled face(s) | Detects one or more specific, enrolled face(s) in a video; outputs a bounding box where each face is detected, with accompanying timestamp(s) |

Scoring Types

Different function types score audio segments at different levels of granularity. Some plugin functionality differences are essentially differences in how an audio segment is treated: as a single unit, or as potentially multiple units. For example, the main difference between speaker identification and speaker detection is how a segment is scored: speaker identification assumes that the audio segment sent to it for scoring is homogeneous and comes from a single speaker, whereas speaker detection allows for the possibility of multiple speakers in a given audio segment. There are three major scoring types; a brief sketch of each result shape follows the list:

  • Frame - Assigns a score for each 10ms frame of the audio segment submitted for scoring.
  • Region - Assigns and reports time boundaries defining region(s) within the audio segment, and for each region, an accompanying score for each detected class.
  • Global - Assigns a single score for the entire audio segment for each of the plugin's classes.
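
The practical difference between these types is easiest to see in the shape of the results each one produces. The structures below are a hypothetical illustration only; actual OLIVE result formats are covered on the Plugin Traits page and in the API documentation.

```python
# Frame scoring: one score per 10 ms frame (here, a SAD "speech" score
# for the first five frames of a segment).
frame_scores = {"speech": [0.12, 0.07, 0.85, 0.91, 0.88]}

# Region scoring: time boundaries plus a score for each detected class,
# e.g. (start_seconds, end_seconds, class, score) from a language detector.
region_scores = [
    (1.20, 4.75, "english", 0.93),
    (5.10, 9.40, "spanish", 0.88),
]

# Global scoring: a single score per class for the entire segment,
# e.g. language identification scores.
global_scores = {"english": 0.93, "spanish": -1.47, "mandarin": -2.10}
```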

For more information on these scoring types, refer to the Plugin Traits page.

Classes

Certain plugin types have classes as an attribute. These can be common, cross-mission categories that are often pre-trained, like speech, languages, or dialects, or they can be ad-hoc, mission-specific classes like speakers or topics. A plugin's classes may be completely fixed, as in gender identification (male, female) or speech activity detection (speech), or an open set, as in language identification (English, Spanish, Mandarin, etc.), topics, or speakers. Some plugins allow the user to add new classes or modify existing classes. Some class sets are inherently closed, like those of SAD and GID, where the plugin is complete and covers the world of possible classes. Others, like those of LID, SID, and TPD plugins, will probably never be complete in covering all classes, and thus must always be able to treat a segment as though it may not come from any of the classes the plugin recognizes (i.e. 'out of set').
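
A common generic way to handle an open class set is to accept the top-scoring class only when its score clears a decision threshold, and to report the segment as out of set otherwise. The snippet below illustrates that general idea; it is not a description of how any particular OLIVE plugin makes this decision.

```python
def best_class_or_out_of_set(scores: dict[str, float], threshold: float = 0.0) -> str:
    """Return the top-scoring class, or 'out-of-set' when no class
    score clears the decision threshold (generic illustration only)."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "out-of-set"

# With log-likelihood-ratio-style scores, all classes scoring below 0
# would leave the segment unassigned:
print(best_class_or_out_of_set({"english": -0.8, "spanish": -1.3}))  # -> out-of-set
```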

Enrollments

Enrollments are a subset of classes that the user can create and/or modify. Both creation of a class and modification of an existing class are class modification requests; the first class modification request for a given class also has the effect of creating the new class if it does not yet exist. Enrollments may be generated by end users with examples from their own data and can be learned from a single example or a small number of examples (SID, QBE) up to a relatively large number of examples (LID, TPD). Speakers are typically enrollments, as are query-based keywords and topics. Languages can also be enrolled and augmented with certain plugins. Since enrollments are dynamic, they may be incrementally updated “on the fly” with new examples.
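
Conceptually, an enrollment is just one or more audio examples submitted under a class name to a chosen plugin and domain, where the first submission for an unknown class name also creates the class. The sketch below illustrates that flow; `client.enroll` and the surrounding names are assumptions for illustration, not actual OLIVE API calls.

```python
def enroll_speaker(client, plugin, domain, speaker_id, audio_paths):
    """Submit audio examples under a class name; the first request for an
    unknown class also creates it (hypothetical client, assumed API)."""
    for path in audio_paths:
        client.enroll(plugin=plugin, domain=domain,
                      class_id=speaker_id, audio=path)

# Because enrollments are dynamic, the same call can later add new
# examples to an existing class incrementally, "on the fly".
```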

For integration details regarding enrollments, refer to the Enrollments section of the API Integration page. To determine if a plugin supports enrollments, or to check what its default enrolled classes are (if any), refer to that plugin's details page from the Specific Plugins list below.

Online Updates

Considerable improvements to system accuracy and calibration can be achieved by updating a plugin post-deployment to better align it with the conditions observed in recent history. Several plugins are able to perform unsupervised updates to certain submodules of the plugin. These updates do not require labels or human input and are based on information collected automatically during normal system use. In most cases, an update must be invoked by the user via the API, which also provides an option to determine whether an update is ready to be applied.
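
In client code, this workflow typically looks like a check-then-apply sequence. The sketch below is hypothetical; `is_update_ready` and `apply_update` are stand-in names for the actual calls documented in the Update section of the API Integration page.

```python
def maybe_update(client, plugin, domain):
    """Apply a pending unsupervised update if one is ready (assumed API names)."""
    if client.is_update_ready(plugin=plugin, domain=domain):
        client.apply_update(plugin=plugin, domain=domain)
        return True
    return False
```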

For integration details regarding the update functionality, refer to the Update section of the API Integration page. To check if a plugin supports online updates, refer to its detailed information page from the Specific Plugins list below.

Adaptation

As with online updates, it is possible to achieve even larger boosts in performance by exposing a plugin to the mission's audio conditions, or to similarly representative audio conditions. Adaptation, however, requires human input and, in some cases, data that is properly annotated with respect to the target plugin: language labels, for example, are necessary to perform LID adaptation, speech labels for SAD adaptation, speaker labels for SID adaptation, and so on. The benefits of adaptation vary with the target application, the data conditions, and the amount of labeled data available to run adaptation.
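
Because adaptation is supervised, its input is essentially a set of audio examples paired with task-appropriate labels. A hypothetical manifest for LID adaptation might look like the sketch below; the actual input format expected by OLIVE is described in the adaptation documentation referenced next.

```python
import csv

# Hypothetical LID adaptation manifest: each audio file is paired with
# the language label required for supervised adaptation.
labeled_audio = [
    ("mission/clip_001.wav", "english"),
    ("mission/clip_002.wav", "spanish"),
    ("mission/clip_003.wav", "spanish"),
]

with open("lid_adaptation_manifest.csv", "w", newline="") as f:
    csv.writer(f).writerows(labeled_audio)
```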

More details about adaptation can be found in the Adaptation sections of the API Documentation or CLI User Guide, or within individual plugin sections, if appropriate.

Naming Conventions

Plugin Names

Each plugin proper is given a three-part name of the form function-attribute-version:

  • function - three-letter abbreviation from column two of the Plugin Types Table above
  • attribute - string that identifies the key attribute of the plugin, generally the algorithm name
  • version - tracks the iteration of the plugin in development, in the form v<digit>

For example, sid-embed-v2 is a speaker identification plugin that uses a speaker embeddings algorithm and is the second release or update of the approach. An additional alphanumeric character may be appended to the version number if a plugin is re-released with bug fixes but the performance is expected to be the same. For example, sad-dnn-v6a is a modified version of sad-dnn-v6, where the changes were meant to address errors or shortcomings in the plugin, not to change the algorithm or data used.
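
The convention is regular enough to parse mechanically. The helper below is an illustrative sketch of the rule just described, not an OLIVE utility:

```python
import re

# function-attribute-version, where the version is v<digit> with an
# optional alphanumeric bug-fix suffix (e.g. "sad-dnn-v6a").
PLUGIN_NAME = re.compile(
    r"^(?P<function>[a-z]{3})-(?P<attribute>.+)-v(?P<version>\d+)(?P<revision>[a-z0-9]?)$"
)

def parse_plugin_name(name: str) -> dict:
    match = PLUGIN_NAME.match(name)
    if not match:
        raise ValueError(f"not a recognized plugin name: {name}")
    return match.groupdict()

print(parse_plugin_name("sid-embed-v2"))
# {'function': 'sid', 'attribute': 'embed', 'version': '2', 'revision': ''}
print(parse_plugin_name("sad-dnn-v6a"))
# {'function': 'sad', 'attribute': 'dnn', 'version': '6', 'revision': 'a'}
```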

Domain Names

Domain names typically have two or three parts: condition-version, or language-condition-version for plugins that have language-dependent domain components. Keyword spotting domains, for example, also contain the language for which the domain was trained.

  • language - the language for which the domain was trained if language-dependent, or a representation of the set of languages contained within a LID plugin's domain
  • condition - the specific audio environment for which the domain was trained, or “multi” if the domain was developed to be condition independent
  • version - tracks the iteration of the domain in development, in the form v<digit>
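
A matching sketch for domain names, again purely illustrative: since the language part is optional, the simplest approach is to split on hyphens and branch on the number of parts. The example name eng-tel-v1 is hypothetical.

```python
def parse_domain_name(name: str) -> dict:
    """Parse condition-version or language-condition-version (illustration only)."""
    parts = name.split("-")
    if len(parts) == 2:
        condition, version = parts
        return {"language": None, "condition": condition, "version": version}
    if len(parts) == 3:
        language, condition, version = parts
        return {"language": language, "condition": condition, "version": version}
    raise ValueError(f"unexpected domain name form: {name}")

print(parse_domain_name("multi-v1"))    # condition-independent domain
print(parse_domain_name("eng-tel-v1"))  # hypothetical language-dependent domain
```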

Specific Plugins

For additional information about specific plugins, including their options, implementation details, and more, please refer to the specific plugin pages, accessible from each of the individual Plugin Task pages.