sid-embed-v6 (Speaker Identification)

Version Changelog

Plugin Version	Change
v6.0.0	Initial plugin release, functionally identical to v5.0.0, but updated to be compatible with OLIVE 5.0.0
v6.0.1	Updated to be compatible with OLIVE 5.1.0
v6.0.2	Compatibility and bug fixes, released with OLIVE 5.2.0

Description

Speaker Identification plugins score a submitted segment of audio against one or more enrolled speakers with the goal of determining whether the speech in the segment in question was produced by one of the enrolled speakers.

The release of the sid-embed-v6 plugin provides a new level of performance and automatic adaptation to new domains with a focus on cross-language and non-English speaker trials.

This plugin features:

Joint PLDA: SRI-pioneered approach to modeling speaker and language variability jointly in the PLDA modeling space.
Dynamic Mean Normalization: A novel approach to automatically adapting a single parameter (the mean of the embedding space) based on the enrollment data available from the domain. This innovation provides a major improvement to performance in new domains while counteracting calibration mismatch prior to PLDA modeling. It also simplifies the calibration process, enabling the use of a single calibration model across domains. This option is turned off by default.
Multi-bandwidth embeddings: The new embeddings DNN leverages the information in the 8-16kHz bandwidth to provide improved accuracy in audio files above 8kHz. The user does not need to set any option to choose a bandwidth, because all audio is resampled to 16kHz prior to processing. The upsampling of 8kHz to 16kHz is suitable for this plugin due to the manner in which it was trained.
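
As a rough illustration of the Dynamic Mean Normalization idea described above, the sketch below re-centers embeddings on the mean of the enrollment embeddings from the target domain. This is not the plugin's implementation: the function names (domain_mean, mean_normalize) and the representation of embeddings as plain lists of floats are assumptions made for illustration only.

```python
from typing import List, Sequence

Vector = List[float]


def domain_mean(enrollment_embeddings: Sequence[Sequence[float]]) -> Vector:
    """Average the enrollment embeddings dimension by dimension.

    Hypothetical helper: the plugin's real adaptation logic is not documented here.
    """
    if not enrollment_embeddings:
        raise ValueError("at least one enrollment embedding is required")
    dim = len(enrollment_embeddings[0])
    count = len(enrollment_embeddings)
    return [sum(e[i] for e in enrollment_embeddings) / count for i in range(dim)]


def mean_normalize(embedding: Sequence[float], mean: Sequence[float]) -> Vector:
    """Subtract the domain mean from an embedding before any later modeling step."""
    return [x - m for x, m in zip(embedding, mean)]


# Toy 2-dimensional embeddings (real embeddings have far more dimensions).
enroll = [[1.0, 2.0], [3.0, 4.0]]
mu = domain_mean(enroll)
centered = mean_normalize([2.5, 2.0], mu)
print(mu, centered)
```

Only a single parameter (the mean) is adapted in this sketch, which mirrors the description above; everything else about the embedding space is left unchanged.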

Domains

multicond-v1
- Multi-condition domain tested heavily on telephone and microphone conditions, multiple languages, distances, and varying background noises and codecs.
multilang-v1
- A domain specifically optimized for cross-language comparison trials and trained predominantly on non-English data for enhanced performance on non-English speech.

Inputs

For enrollment, an audio file or buffer with a corresponding speaker identifier/label. For scoring, an audio buffer or file.

Outputs

Generally, a list of scores for the entire segment, one for each speaker enrolled in the domain. As with SAD and LID, scores are log-likelihood ratios, where a score greater than 0 is considered a detection. Because of their association with forensics, SID plugins in particular are generally calibrated, or use dynamic calibration, to ensure valid log-likelihood ratios that facilitate detection. Plugins may be altered to return only detections rather than a list of enrollees and scores, but this filtering is generally done on the client side for the sake of flexibility.

Example output:

/data/sid/audio/file1.wav speaker1 -0.5348
/data/sid/audio/file1.wav speaker2 3.2122
/data/sid/audio/file1.wav speaker3 -5.5340
/data/sid/audio/file2.wav speaker1 0.5333
/data/sid/audio/file2.wav speaker2 -4.9444
/data/sid/audio/file2.wav speaker3 -2.6564
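
Client-side filtering of detections can be sketched as follows. The example assumes the whitespace-separated "file speaker score" layout shown in the example output and applies the score-greater-than-0 rule. The names Score, parse_scores, and detections are hypothetical and are not part of the OLIVE API.

```python
from typing import List, NamedTuple


class Score(NamedTuple):
    audio: str
    speaker: str
    llr: float  # log-likelihood ratio reported by the plugin


def parse_scores(text: str) -> List[Score]:
    """Parse lines of the form '<audio path> <speaker label> <score>'."""
    scores = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from the right so that audio paths containing spaces survive.
        audio, speaker, llr = line.rsplit(None, 2)
        scores.append(Score(audio, speaker, float(llr)))
    return scores


def detections(scores: List[Score], threshold: float = 0.0) -> List[Score]:
    """Keep only scores above the threshold (0 for calibrated LLRs)."""
    return [s for s in scores if s.llr > threshold]


EXAMPLE = """\
/data/sid/audio/file1.wav speaker1 -0.5348
/data/sid/audio/file1.wav speaker2 3.2122
/data/sid/audio/file1.wav speaker3 -5.5340
/data/sid/audio/file2.wav speaker1 0.5333
/data/sid/audio/file2.wav speaker2 -4.9444
/data/sid/audio/file2.wav speaker3 -2.6564
"""

hits = detections(parse_scores(EXAMPLE))
for hit in hits:
    print(hit.audio, hit.speaker, hit.llr)
```

With the example output, only speaker2 for file1.wav and speaker1 for file2.wav score above 0.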

Enrollments

SID plugins allow for class modifications. A class modification is the capability to enroll a class with one or more samples of that class's speech; in this case, the class is a new speaker. The first class modification for a speaker creates a new enrollment: the system is sent an audio sample from the speaker, generally 5 seconds or more, along with a label for that speaker. Subsequent class modification requests with the same speaker label augment the enrollment with additional audio.
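
The create-then-augment behavior described above can be modeled with a small in-memory sketch. EnrollmentStore and its methods are hypothetical stand-ins for illustration; they do not represent the OLIVE API or how the plugin stores speaker models, and the audio paths are made up.

```python
from typing import Dict, List


class EnrollmentStore:
    """Toy model of speaker enrollments keyed by speaker label (hypothetical)."""

    def __init__(self) -> None:
        self._samples: Dict[str, List[str]] = {}

    def modify_class(self, speaker_label: str, audio_path: str) -> bool:
        """Add audio for a speaker; return True if this created a new enrollment."""
        created = speaker_label not in self._samples
        self._samples.setdefault(speaker_label, []).append(audio_path)
        return created

    def samples(self, speaker_label: str) -> List[str]:
        """Return a copy of the audio samples enrolled for a speaker."""
        return list(self._samples.get(speaker_label, []))


store = EnrollmentStore()
first = store.modify_class("speaker1", "enroll_a.wav")   # creates the enrollment
second = store.modify_class("speaker1", "enroll_b.wav")  # augments the same enrollment
print(first, second, store.samples("speaker1"))
```

The same request type serves both purposes in this sketch: whether it creates or augments depends only on whether the label has been seen before.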

Functionality (Traits)

The functions of this plugin are defined by its Traits and implemented API messages. These Traits are listed below, along with the corresponding API messages for each.

GLOBAL_SCORER – Score all submitted audio, returning a single score for the entire audio segment for each of the enrolled speakers of interest.
- GlobalScorerRequest
CLASS_MODIFIER – Enroll new speaker models or augment existing speaker models with additional data.
- ClassModificationRequest
- ClassRemovalRequest

Compatibility

OLIVE 5.1+

Limitations

Known or potential limitations of the plugin are outlined below.

Detection Granularity

All current SID plugins assume that an audio segment contains only a single speaker and may be scored as a single unit. If a given segment contains multiple speakers, the entire segment will still be scored as a unit. Speaker detection (SDD) is a separate plugin type that also aims to locate known speakers but does not make this assumption; instead, it attempts to locate and label regions consisting of individual speakers within the audio segment.

Minimum Speech Duration

The system will only attempt to perform speaker identification if the submitted audio segment contains more than a minimum duration of detected speech (configurable as min_speech; 0.5 seconds by default).
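
A minimal sketch of this check, assuming speech activity detection yields a list of (start, end) regions in seconds. The region representation and the function names are assumptions for illustration; only the min_speech name and its 0.5 second default come from the text above.

```python
from typing import Iterable, Tuple

Region = Tuple[float, float]  # (start_seconds, end_seconds) of detected speech

MIN_SPEECH_DEFAULT = 0.5  # default min_speech value stated above


def total_speech_seconds(regions: Iterable[Region]) -> float:
    """Sum the durations of all detected speech regions."""
    return sum(end - start for start, end in regions)


def has_enough_speech(regions: Iterable[Region],
                      min_speech: float = MIN_SPEECH_DEFAULT) -> bool:
    """True only if detected speech exceeds min_speech seconds."""
    return total_speech_seconds(regions) > min_speech


short_clip = [(0.0, 0.2), (1.0, 1.2)]  # about 0.4 s of speech in total
long_clip = [(0.0, 2.0)]               # 2.0 s of speech
print(has_enough_speech(short_clip), has_enough_speech(long_clip))
```

Note that the total counts only detected speech, not the overall length of the audio, so a long recording with little speech can still fall below the minimum.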

Comments

Global Options

The following options are available to this plugin and are adjustable in the plugin's configuration file, plugin_config.py.

Option Name	Description	Default	Expected Range
sad_threshold	SAD threshold for determining the audio to be used in metadata extraction	1.0	-5.0 - 6.0