sid-embedSmolive-v1 (Speaker Identification)
Version Changelog
Plugin Version | Change |
---|---|
v1.0.0 | Initial plugin release with OLIVE 5.4.0. This plugin is based on sid-embed-v6 but adds model quantization and pruning, along with a full-resolution fallback model for hardware that cannot run the pruned/quantized model. |
Description
This plugin is based heavily on its predecessor, sid-embed-v6, with the important distinction that the model has been both quantized and pruned. Quantized models perform computations at a reduced bit resolution (int8 in our case) rather than standard floating-point precision, allowing for a more compact model representation and the use of high-performance vectorized operations on many hardware platforms; further details can be found in the PyTorch quantization documentation. Pruning makes neural network models smaller by removing a subset of the nodes that have minimal impact on performance; the pruned network is then fine-tuned on the target task to minimize any performance loss. The result is a much lighter-weight and often faster model that sacrifices little to no accuracy.
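The two ideas above can be illustrated with a minimal NumPy sketch: affine int8 quantization with a single scale/zero-point, and unstructured magnitude pruning. This is a conceptual toy, not the plugin's actual PyTorch pipeline; all function names here are illustrative.

```python
import numpy as np

def quantize_int8(weights):
    """Affine-quantize float32 weights to int8 with one scale/zero-point."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0            # int8 covers 256 levels
    zero_point = round(-128 - w_min / scale)
    q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

def prune_by_magnitude(weights, sparsity=0.5):
    """Unstructured pruning sketch: zero out the smallest-magnitude weights."""
    k = int(weights.size * sparsity)
    thresh = np.partition(np.abs(weights).ravel(), k)[k]
    return np.where(np.abs(weights) < thresh, 0.0, weights)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, scale, zp = quantize_int8(w)
assert w.nbytes // q.nbytes == 4   # int8 storage is 4x smaller than float32
```

The 4x storage reduction and the ability to run integer vector instructions are what make the quantized model attractive on supported hardware; in a real deployment the fine-tuning step after pruning recovers most of the lost accuracy.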
Speaker Identification plugins score a submitted segment of audio against one or more enrolled speakers with the goal of determining whether the speech in the segment in question was produced by one of the enrolled speakers.
Like its predecessor sid-embed-v6, this plugin provides a new level of performance and automatic adaptation to new domains, with a focus on cross-language and non-English speaker trials.
This plugin features:
- Joint PLDA: An SRI-pioneered approach to jointly modeling speaker and language variability in the PLDA modeling space.
- Dynamic Mean Normalization: A novel approach to automatically adapting a single parameter (the mean of the embedding space) based on the enrollment data available from the domain. This innovation provides a major improvement to performance in new domains while counteracting calibration mismatch prior to PLDA modeling, simplifying the calibration process and enabling the use of a single calibration model across domains. By default this option is turned off.
- Multi-bandwidth embeddings: The new embeddings DNN leverages the information in the 8-16kHz bandwidth to provide improved accuracy on audio above 8kHz. No options are required from the user to define a bandwidth of choice, as all audio is resampled to 16kHz prior to processing. Upsampling from 8kHz to 16kHz is suitable for this plugin due to the manner in which it was trained.
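The Dynamic Mean Normalization idea can be sketched in a few lines: estimate the embedding-space mean from whatever enrollment data the domain provides, and subtract it from both test and enrollment embeddings before PLDA scoring. The function name and the toy 2-D "embeddings" below are illustrative, not the plugin's internals.

```python
import numpy as np

def dynamic_mean_normalize(test_emb, enroll_embs):
    # Estimate the embedding-space mean from the domain's enrollment data
    # and subtract it from both sides prior to PLDA scoring.
    domain_mean = enroll_embs.mean(axis=0)
    return test_emb - domain_mean, enroll_embs - domain_mean

# Toy 2-D vectors standing in for the DNN's speaker embeddings.
enroll = np.array([[1.0, 2.0], [3.0, 4.0]])
test = np.array([2.0, 3.0])
normed_test, normed_enroll = dynamic_mean_normalize(test, enroll)
```

Because the mean is re-estimated per domain rather than fixed at training time, a single downstream calibration model can serve multiple domains, which is the simplification the bullet above describes.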
Domains
- multicond-prun-int8-v1
- Multi-condition domain tested heavily on telephone and microphone conditions, multiple languages, distances, and varying background noises and codecs.
Inputs
For enrollment, an audio file or buffer with a corresponding speaker identifier/label. For scoring, an audio buffer or file.
Outputs
Generally, a list of scores, one for each speaker enrolled in the domain, covering the entire segment. As with SAD and LID, scores are log-likelihood ratios, where a score greater than 0 is considered a detection. SID plugins in particular, due to their association with forensics, are generally calibrated or use dynamic calibration to ensure valid log-likelihood ratios that facilitate detection. Plugins may be altered to return only detections rather than a full list of enrollees and scores, but this filtering is generally done on the client side for the sake of flexibility.
Example output:
/data/sid/audio/file1.wav speaker1 -0.5348
/data/sid/audio/file1.wav speaker2 3.2122
/data/sid/audio/file1.wav speaker3 -5.5340
/data/sid/audio/file2.wav speaker1 0.5333
/data/sid/audio/file2.wav speaker2 -4.9444
/data/sid/audio/file2.wav speaker3 -2.6564
Enrollments
SID plugins allow for class modifications. A class modification is the capability to enroll a class with one or more samples of that class's speech, in this case a new speaker. A new enrollment is created by the first class modification, which sends the system an audio sample from a speaker, generally 5 seconds or more, along with a label for that speaker. The enrollment can be augmented by subsequent class-modification requests that add more audio under the same speaker label.
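The create-then-augment bookkeeping can be pictured as a map from speaker label to accumulated samples. This toy class only illustrates the semantics of class modification; its names are not the OLIVE API.

```python
class SpeakerEnrollments:
    """Toy bookkeeping for class modifications: the first modification for a
    label creates the enrollment; later ones augment it."""

    def __init__(self):
        self._audio = {}  # speaker label -> list of audio samples

    def class_modify(self, label, audio_sample):
        # Creates the enrollment on first use of the label, augments after.
        self._audio.setdefault(label, []).append(audio_sample)

    def samples(self, label):
        return self._audio.get(label, [])

store = SpeakerEnrollments()
store.class_modify("speaker1", "call1.wav")   # creates the enrollment
store.class_modify("speaker1", "call2.wav")   # augments it
```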
Functionality (Traits)
The functions of this plugin are defined by its Traits and implemented API messages. These Traits are listed below, along with the corresponding API messages for each; click a message name to go to additional implementation details.
- GLOBAL_SCORER – Score all submitted audio, returning a single score for the entire audio segment for each of the enrolled speakers of interest.
- CLASS_MODIFIER – Enroll new speaker models or augment existing speaker models with additional data.
Compatibility
OLIVE 5.4+
Limitations
Known or potential limitations of the plugin are outlined below.
Quantized Model Hardware Compatibility
There are certain host/hardware requirements for the quantized model to run, namely support for AVX2. To avoid a lack of this support rendering the plugin nonfunctional, a full-precision (float32) model is included and will be loaded and used in the rare case that the quantized model fails to load. This uses more memory but allows the plugin to function.
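The fallback decision described above is essentially a try/except around model loading. A sketch under the assumption that a failed quantized load surfaces as a runtime error; the loader callables here are illustrative placeholders, not the plugin's internals.

```python
def load_sid_model(load_quantized, load_full_precision):
    """Prefer the compact int8 model; fall back to float32 when the
    quantized model fails to load (e.g. no AVX2 on the host)."""
    try:
        return load_quantized()
    except RuntimeError:
        # More memory, but the plugin keeps working.
        return load_full_precision()

def quantized_loader_without_avx2():
    raise RuntimeError("quantized engine requires avx2")

model = load_sid_model(quantized_loader_without_avx2, lambda: "float32-model")
```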
Detection Granularity
All current SID plugins assume that an audio segment contains only a single speaker and may be scored as a single unit. If a given segment contains multiple speakers, the entire segment will still be scored as a unit. Speaker detection (SDD) represents another plugin type with the goal of locating known speakers, but that does not have this assumption, and will instead attempt to locate and label regions consisting of individual speakers within the audio segment.
Minimum Speech Duration
The system will only attempt to perform speaker identification if the submitted audio segment contains more than a minimum amount of detected speech (configurable as min_speech; 0.5 seconds by default).
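This gate can be expressed as a simple check over SAD output; an illustrative sketch, not the plugin's internal logic, assuming speech regions arrive as (start, end) pairs in seconds.

```python
MIN_SPEECH = 0.5  # seconds, mirroring the plugin's documented default

def should_score(speech_regions, min_speech=MIN_SPEECH):
    """Return True when SAD-detected speech totals more than min_speech."""
    total = sum(end - start for start, end in speech_regions)
    return total > min_speech
```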
Comments
Global Options
The following options are available to this plugin, adjustable in the plugin's configuration file, plugin_config.py.
Option Name | Description | Default | Expected Range |
---|---|---|---|
sad_threshold | SAD threshold for determining the audio to be used in metadata extraction | 1.0 | -5.0 to 6.0 |
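When editing plugin_config.py by hand it is easy to leave the documented range; a small range check mirroring the table above can catch that early. This helper is hypothetical, not part of the plugin's configuration machinery.

```python
def validate_sad_threshold(value, low=-5.0, high=6.0):
    """Range check matching the documented expected range for sad_threshold."""
    if not low <= value <= high:
        raise ValueError(f"sad_threshold {value} outside [{low}, {high}]")
    return value

sad_threshold = validate_sad_threshold(1.0)  # the documented default passes
```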