dia-hybrid-v2 (Speaker Diarization)

Version Changelog

Plugin Version	Change
v2.0.0	Initial plugin release, functionally identical to v1, but updated to be compatible with OLIVE 5.0.0
v2.0.1 (latest)	Updated to be compatible with OLIVE 5.1.0

Description

Speaker Diarization plugins segment the submitted audio to determine 'who spoke when.' In contrast to Speaker Detection (SDD) plugins, there is no enrollment functionality or any concept of 'speakers of interest' or 'target speakers.' Instead, speaker clusters are defined automatically with class names such as 'spk1', 'spk2', etc.

This plugin is based on Variational Bayes Diarization in an i-vector space defined by SRI’s hybrid alignment framework, which is powered by deep neural network bottleneck features.

Domains

multi-v1
- A generic domain trained for close-talking audio conditions (telephony, close-talking microphone) and for varied conditions such as distant speech, compressed speech, and background noise.

Inputs

An audio file or buffer to be scored.

Outputs

In the basic case, Diarization returns a list of regions labelled with “spkN”. ‘N’ is an integer denoting an unknown speaker, and the maximum N is the total number of unknown speakers in the file. All regions labelled with the same N are deemed to be spoken by the same speaker. Region start and end times are given in seconds. The 'score' field is not a confidence value; it is a fixed placeholder (1.0 in the example below) kept to maintain output format compatibility with other region-scoring OLIVE plugins.

    input-audio.wav 0.470 8.210 spk3 1.0
    input-audio.wav 8.320 13.110 spk4 1.0
    input-audio.wav 13.280 29.960 spk3 1.0
    input-audio.wav 30.350 32.030 spk3 1.0
    input-audio.wav 32.310 46.980 spk1 1.0
    input-audio.wav 47.790 51.120 spk2 1.0
    input-audio.wav 51.360 54.290 spk3 1.0
    input-audio.wav 54.340 55.400 spk2 1.0
    input-audio.wav 55.550 58.790 spk2 1.0
    input-audio.wav 58.820 76.340 spk1 1.0
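The region list above can be post-processed with a few lines of Python. The following is a minimal sketch, assuming the output has been saved as whitespace-separated text with exactly five fields per line (audio name, start, end, label, score) and that audio names contain no spaces. The names `Region`, `parse_regions`, and `speech_time_per_speaker` are hypothetical helpers for illustration and are not part of the OLIVE API.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Region(NamedTuple):
    audio: str    # audio file name (assumed to contain no spaces)
    start: float  # region start, in seconds
    end: float    # region end, in seconds
    label: str    # automatically assigned cluster label, e.g. "spk3"
    score: float  # placeholder value, not a confidence


def parse_regions(text: str) -> List[Region]:
    """Parse lines of the form 'audio start end label score'."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        audio, start, end, label, score = fields
        regions.append(Region(audio, float(start), float(end), label, float(score)))
    return regions


def speech_time_per_speaker(regions: List[Region]) -> Dict[str, float]:
    """Sum the duration of all regions assigned to each speaker label."""
    totals = defaultdict(float)
    for region in regions:
        totals[region.label] += region.end - region.start
    return dict(totals)


# First three lines of the sample output shown above.
SAMPLE = """\
input-audio.wav 0.470 8.210 spk3 1.0
input-audio.wav 8.320 13.110 spk4 1.0
input-audio.wav 13.280 29.960 spk3 1.0
"""

regions = parse_regions(SAMPLE)
totals = speech_time_per_speaker(regions)
```

Grouping by label in this way gives a per-speaker talk time for a single file; the score field is carried along but should not be used for ranking or thresholding.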

Functionality (Traits)

The functions of this plugin are defined by its Traits and implemented API messages. The Traits are listed below, along with the corresponding API messages for each.

REGION_SCORER – Score all submitted audio, returning labeled regions within the submitted audio, where each region includes a proposed speaker label and a corresponding score.
- RegionScorerRequest

Compatibility

OLIVE 5.1+

Limitations

Diarization plugins perform their clustering without any specific speaker knowledge. These plugins have no concept of individual speakers of interest or enrolled speaker models. Therefore, the only labels that will come back from the plugin are 'spk1', 'spk2', and so on, corresponding to proposed individual speakers within the audio. If your target use case involves searching for one or more known speakers, consider using a Speaker Detection (SDD) or Speaker Identification (SID) plugin instead.

Speed and Memory Usage

The current approach to diarization is slow, processing audio at close to real-time speed. Work is ongoing to replace the Variational Bayes Diarization approach with a segmentation-by-classification approach that will combine segmentation and speaker detection into a single stage and improve system speed.

Speaker Persistence and Labeling Across Files

The definition of spkN for one processing instance is not retained for use in other processing instances. For example, spk1 in fileA is not necessarily the same speaker as spk1 in fileB. If speaker persistence and inter-file label consistency are required, please consider Speaker Detection (SDD) technology instead.
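When results from several files are pooled, one simple precaution is to qualify each label with the file it came from, so that identical labels from different files are never merged by accident. The following is a minimal sketch; `qualify_label` and the file names are hypothetical and chosen only for illustration. It does not establish speaker identity across files, it only keeps the labels apart.

```python
def qualify_label(audio_id: str, label: str) -> str:
    """Return a label that is unique per file, e.g. 'fileA.wav:spk1'."""
    return f"{audio_id}:{label}"


# Hypothetical labels returned for two separately processed files.
labels_a = [qualify_label("fileA.wav", lab) for lab in ["spk1", "spk2", "spk1"]]
labels_b = [qualify_label("fileB.wav", lab) for lab in ["spk1"]]

# The raw label "spk1" occurs in both files, but the qualified labels never collide.
overlap = set(labels_a) & set(labels_b)
```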

Minimum Speech Duration

The system will only attempt to perform speaker diarization on submitted audio if there is more than 5 seconds of total speech detected in the file or buffer. In addition, only segments that are 2 seconds or longer will be considered for clustering and given a diarization score and proposed speaker label.
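The two thresholds can be written out as a small filter. This is a sketch of the rule as stated above, not the plugin's actual implementation: the plugin performs its own speech detection internally, whereas this example assumes a hypothetical list of `(start, end)` speech segments in seconds is already available. The constant and function names are likewise illustrative.

```python
from typing import List, Tuple

MIN_TOTAL_SPEECH_S = 5.0  # total speech must exceed this for diarization to run
MIN_SEGMENT_S = 2.0       # only segments at least this long are clustered

Segment = Tuple[float, float]  # (start, end) in seconds


def segments_eligible_for_clustering(speech_segments: List[Segment]) -> List[Segment]:
    """Return the segments that would be clustered under the stated rule."""
    total_speech = sum(end - start for start, end in speech_segments)
    if total_speech <= MIN_TOTAL_SPEECH_S:
        return []  # not enough speech: diarization is not attempted
    return [(s, e) for s, e in speech_segments if e - s >= MIN_SEGMENT_S]


# 8.0 s of speech in total; the 1.5 s segment is too short to be clustered.
enough = segments_eligible_for_clustering([(0.0, 1.5), (2.0, 5.0), (6.0, 9.5)])

# Only 4.0 s of speech in total, so nothing is diarized.
too_little = segments_eligible_for_clustering([(0.0, 3.0), (4.0, 5.0)])
```

A practical consequence is that short interjections can be left without a speaker label even in a file that is otherwise diarized.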

Maximum Number of Speakers

Currently, the plugin is configured to differentiate a maximum of 6 distinct speakers within any scored audio. If you need to distinguish between a larger set of unknown speakers within a file, or if you never expect as many as 6 speakers in your audio conditions, contact SRI to discuss ways to maximize performance for your use case.
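One way to notice when this limit may be affecting results is to count the distinct labels returned for a file and flag files that reach the maximum. The following is a heuristic sketch only: reaching 6 labels does not prove that more speakers were present, and `MAX_SPEAKERS`, `distinct_speakers`, and `hit_speaker_cap` are hypothetical names, not plugin settings or API calls.

```python
from typing import Iterable

MAX_SPEAKERS = 6  # configured maximum described above


def distinct_speakers(labels: Iterable[str]) -> int:
    """Count the unique speaker labels in one file's results."""
    return len(set(labels))


def hit_speaker_cap(labels: Iterable[str]) -> bool:
    """True if the file used every available speaker label."""
    return distinct_speakers(labels) >= MAX_SPEAKERS


# Labels taken from the sample output: four distinct speakers, below the cap.
sample_labels = ["spk3", "spk4", "spk3", "spk3", "spk1",
                 "spk2", "spk3", "spk2", "spk2", "spk1"]
sample_count = distinct_speakers(sample_labels)
sample_capped = hit_speaker_cap(sample_labels)
```

Files flagged this way are candidates for manual review before concluding that exactly 6 people spoke.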

Comments

Global Options

This plugin does not currently have user-configurable options, though some performance tweaks and configuration changes are possible. If this plugin does not perform adequately on your data conditions, or if you have a specific use case, please contact SRI to discuss how the plugin can be tuned for optimal performance on your data.