sid-dplda-v3 (Speaker Identification)

Version Changelog

Plugin Version	Change
v3.0.0	Initial plugin release, functionally identical to v2.0.2 but updated to include GPU support with proper configuration. Tested and released with OLIVE 5.5.0.

Description

Speaker Identification plugins score a submitted segment of audio against one or more enrolled speakers with the goal of determining whether the speech in the segment in question was produced by one of the enrolled speakers.

This is a SID plugin leveraging dynamic calibration and discrimination via a DNN-powered DPLDA backend. This plugin accounts for the conditions of the trial to provide superior calibration performance out-of-the-box relative to prior plugins.

This plugin features:

Discriminative PLDA: SRI-pioneered approach to modeling of speaker variability using a DNN-trained backend and internal calibration defined by conditions of the audio. This approach is considerably faster and more reliable than the prior approach from SRI termed Trial-based Calibration (TBC).
Multi-bandwidth embeddings: The new embeddings DNN leverages the information in the 8-16kHz bandwidth to provide improved accuracy in audio files sampled above 8kHz. The user does not need to set any option to select a bandwidth, as all audio is resampled to 16kHz prior to processing. Upsampling 8kHz audio to 16kHz is suitable for this plugin due to the manner in which it was trained.

Domains

multi-v1
- Multi-condition domain tested heavily on telephone and microphone conditions, multiple languages, distances, and varying background noises and codecs.

Inputs

For enrollment, an audio file or buffer with a corresponding speaker identifier/label. For scoring, an audio buffer or file.

Outputs

Generally, a list of scores, one for each of the speakers enrolled in the domain, for the entire segment. As with SAD and LID, scores are log-likelihood ratios where a score greater than 0 is considered a detection. SID plugins in particular, due to their association with forensics, are generally calibrated or use dynamic calibration to ensure valid log-likelihood ratios and so facilitate detection. Plugins may be altered to return only detections, rather than a list of enrollees and scores, but this is generally done on the client side for the sake of flexibility.

Example output:

/data/sid/audio/file1.wav speaker1 -0.5348
/data/sid/audio/file1.wav speaker2 3.2122
/data/sid/audio/file1.wav speaker3 -5.5340
/data/sid/audio/file2.wav speaker1 0.5333
/data/sid/audio/file2.wav speaker2 -4.9444
/data/sid/audio/file2.wav speaker3 -2.6564
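
The client-side filtering mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not part of the OLIVE API: it assumes the scores arrive as whitespace-separated text lines in the format of the example output, and the function names parse_scores and detections are hypothetical.

```python
from collections import defaultdict

# The example output from this page, one "<audio path> <speaker> <score>" per line.
EXAMPLE = """\
/data/sid/audio/file1.wav speaker1 -0.5348
/data/sid/audio/file1.wav speaker2 3.2122
/data/sid/audio/file1.wav speaker3 -5.5340
/data/sid/audio/file2.wav speaker1 0.5333
/data/sid/audio/file2.wav speaker2 -4.9444
/data/sid/audio/file2.wav speaker3 -2.6564
"""


def parse_scores(text):
    """Parse score lines into {audio path: {speaker label: score}}."""
    scores = defaultdict(dict)
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # Split from the right so a path containing spaces stays in one piece.
        path, speaker, score = line.rsplit(None, 2)
        scores[path][speaker] = float(score)
    return dict(scores)


def detections(scores, threshold=0.0):
    """Keep only speakers scoring above the threshold, best score first."""
    return {
        path: sorted(
            ((speaker, score) for speaker, score in per_file.items() if score > threshold),
            key=lambda item: item[1],
            reverse=True,
        )
        for path, per_file in scores.items()
    }


for path, hits in detections(parse_scores(EXAMPLE)).items():
    print(path, hits)
```

With the default threshold of 0, only speaker2 is kept for file1.wav and only speaker1 for file2.wav. Raising the threshold trades fewer detections for higher reliability.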

Enrollments

SID plugins allow for class modifications. A class modification is the capability to enroll a class with one or more samples of that class's speech; in this case, the class is a new speaker. A new enrollment is created with the first class modification, which consists of sending the system an audio sample from a speaker, generally 5 seconds or more, along with a label for that speaker. This enrollment can be augmented with subsequent class modification requests by adding more audio with the same speaker label.
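
The create-then-augment behavior described above can be modeled with a toy in-memory store. This is only a sketch of the semantics, not the OLIVE API: the EnrollmentStore class and its method names are hypothetical, and real enrollments hold speaker models rather than lists of file names.

```python
class EnrollmentStore:
    """Toy stand-in for a domain's enrollments (hypothetical, not the OLIVE API)."""

    def __init__(self):
        # speaker label -> list of audio samples enrolled under that label
        self._samples = {}

    def modify_class(self, label, audio):
        """The first call for a label creates the enrollment; later calls augment it."""
        created = label not in self._samples
        self._samples.setdefault(label, []).append(audio)
        return "created" if created else "augmented"

    def remove_class(self, label):
        """Remove an enrollment entirely; returns True if the label existed."""
        return self._samples.pop(label, None) is not None

    def samples(self, label):
        """Return a copy of the audio samples enrolled for a label."""
        return list(self._samples.get(label, []))


store = EnrollmentStore()
print(store.modify_class("speaker1", "call_01.wav"))  # first sample for this label
print(store.modify_class("speaker1", "call_02.wav"))  # same label, more audio
print(store.samples("speaker1"))
```

The two modify_class calls correspond to two class modification requests carrying the same speaker label; remove_class mirrors a class removal request.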

Functionality (Traits)

The functions of this plugin are defined by its Traits and implemented API messages. A list of these Traits is below, along with the corresponding API messages for each. Click a message name to go to additional implementation details.

GLOBAL_SCORER – Score all submitted audio, returning a single score for the entire audio segment for each of the enrolled speakers of interest.
- GlobalScorerRequest
CLASS_MODIFIER – Enroll new speaker models or augment existing speaker models with additional data.
- ClassModificationRequest
- ClassRemovalRequest

Compatibility

OLIVE 5.1+

Limitations

Known or potential limitations of the plugin are outlined below.

Detection Granularity

All current SID plugins assume that an audio segment contains only a single speaker and may be scored as a single unit. If a given segment contains multiple speakers, the entire segment will still be scored as a unit. Speaker Detection (SDD) is another plugin type that also aims to locate known speakers but does not make this assumption; it instead attempts to locate and label the regions belonging to individual speakers within the audio segment.

Minimum Speech Duration

The system will only attempt to perform speaker identification if the submitted audio segment contains more than min_speech seconds of detected speech (configurable, 0.5 seconds by default).

Comments

GPU Support

Please refer to the OLIVE GPU Installation and Support documentation page for instructions on how to enable and configure GPU capability in supported plugins. By default this plugin will run on CPU only.

Minimum Speech

The plugin will only process files with at least 0.5 seconds of detected speech (configurable).

Advanced (Experimental) Usage for Exporting Embeddings

A custom option to output the embeddings, the speech regions, and the duration of speech is available by setting output_ivs_dump_path=$OUTPUT_PATH, where OUTPUT_PATH is a user-defined directory in which to store these items. During enrollment and evaluation, if the information for an audio file (identified by its md5sum) already exists in this directory, it is loaded instead of being re-computed.
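
The load-instead-of-recompute behavior can be illustrated with a small md5-keyed cache. This is a sketch of the general technique only: the plugin's actual file naming and storage format are not documented here, so the ".pkl" naming, the pickle format, and the load_or_compute helper are all assumptions made for illustration.

```python
import hashlib
import pickle
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_or_compute(audio_path, dump_dir, compute):
    """Load cached results for audio_path if present, else compute and store them.

    `compute` is any callable taking the audio path and returning a picklable
    object (for example, a dict holding an embedding, speech regions, and
    speech duration).
    """
    dump_dir = Path(dump_dir)
    dump_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one pickle file per audio md5sum.
    cache_file = dump_dir / (md5_of_file(audio_path) + ".pkl")
    if cache_file.exists():
        with open(cache_file, "rb") as handle:
            return pickle.load(handle)
    result = compute(audio_path)
    with open(cache_file, "wb") as handle:
        pickle.dump(result, handle)
    return result
```

Because the key is the md5sum of the file contents rather than the file name, a renamed copy of the same audio reuses the cached entry, while any change to the audio produces a new one.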

The plugin is an audio vectorizer and class exporter/importer.

Global Options

The following options are available to this plugin and are adjustable in the plugin's configuration file, plugin_config.py.

Option Name	Description	Default	Expected Range
threshold	Detection threshold: a higher value results in fewer detections being output, but of higher reliability.	0.0	-10.0 to 20.0
min_speech	The minimum amount of detected speech, in seconds, that a segment must contain in order to be scored/analyzed for the presence of enrolled speakers.	0.5	0.5 to 4.0
sad_threshold	SAD threshold for determining the audio to be used in metadata extraction.	-2.0	-5.0 to 6.0
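
Before editing the configuration, proposed values can be checked against the defaults and expected ranges in the table above. The sketch below is a standalone helper, not the format of plugin_config.py: the OPTIONS dictionary and resolve_options function are hypothetical, and treating the expected ranges as hard limits is an assumption made for illustration.

```python
# (default, low, high) for each option, copied from the options table.
OPTIONS = {
    "threshold": (0.0, -10.0, 20.0),
    "min_speech": (0.5, 0.5, 4.0),
    "sad_threshold": (-2.0, -5.0, 6.0),
}


def resolve_options(overrides=None):
    """Merge overrides onto the defaults, rejecting unknown or out-of-range values."""
    resolved = {name: default for name, (default, _, _) in OPTIONS.items()}
    for name, value in (overrides or {}).items():
        if name not in OPTIONS:
            raise KeyError(f"unknown option: {name}")
        _, low, high = OPTIONS[name]
        if not low <= value <= high:
            raise ValueError(
                f"{name}={value} is outside the expected range {low} to {high}"
            )
        resolved[name] = value
    return resolved


print(resolve_options({"threshold": 2.0, "min_speech": 1.0}))
```

Calling resolve_options with no arguments returns the documented defaults, which makes it easy to see which values a given set of overrides actually changes.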