sad-dnn-v7 (Speech Activity Detection)
Version Changelog
Plugin Version | Change |
---|---|
v7.0.0 | Initial plugin release, functionally identical to v6.0.0, but updated to be compatible with OLIVE 5.0.0 |
v7.0.1 | Updated to be compatible with OLIVE 5.1.0; introduces the 'fast-multi-v1' domain |
v7.0.2 | Bug fixes from v7.0.1, released with OLIVE 5.2.0 |
Description
Speech activity detection (SAD, often referred to as voice activity detection) detects the presence of human vocalizations (speech) for each region or frame. In general, SAD outputs are processed for human listening and contain a very short buffer around each detected region to avoid edge effects when listening to the speech regions.
This plugin is a general-purpose, DNN-based speech activity detector capable of performing both frame scoring and region scoring.
Domains
- multi-v1
- Multi-condition domain trained on push-to-talk (PTT), telephony, and distant-microphone speech. It is also hardened against speech containing music, for more robust performance when encountering such conditions.
- fast-multi-v1
- Multi-condition domain trained on the same data as the multi-v1 model above, but with configuration changes that allow it to process much more quickly, at the cost of a possible slight loss of accuracy in some circumstances.
Inputs
An audio file or buffer, an (optional) identifier, and (optional) regions to be scored. If no regions are specified, the entire audio file or buffer will be scored.
Outputs
SAD has two possible output formats: frame scores and region scores. Frame scores return a log-likelihood ratio (LLR) score of speech vs. non-speech for each 10ms frame of the input audio segment. An LLR greater than 0 indicates that speech is more likely than non-speech, a score of exactly 0 indicates that speech and non-speech are equally likely, and 0 is generally used as the threshold for detecting speech regions. Region scores internally post-process the frame scores to return speech regions, with some padding and interpolation.
An excerpt example of what frame scores typically look like:
-0.68944
-0.47805
-0.27453
-0.07032
0.13456
0.34013
0.53357
0.97258
1.10885
This can be transformed into a region-scoring output, either client-side or by requesting region scores from the plugin. Region scores output by the plugin provide the timestamp boundaries, in seconds, of the locations where speech was detected in the audio.
When the plugin is asked for region scores, the conversion from frame scores to region scores is done internally by applying a threshold to the scores. Contiguous frames that are above the threshold are converted to timestamps representing the start and end of each of these regions. These regions are then extended in duration by subtracting a padding value (typically 0.5s or lower) from each region's start time and adding the same padding value to its end time. For example, with a 1-second padding value, a region that starts at 10 seconds and ends at 20 seconds becomes a 'padded' region spanning 9 to 21 seconds. If, after padding, two regions overlap in time, they are merged into a single extended region.
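The same conversion can be approximated client-side. Below is a minimal Python sketch, assuming frame scores arrive as a list of floats; the 10ms frame length and 0.0 threshold come from this page, the function name is illustrative, and the 0.3s default padding mirrors the value noted under "Speech Region Padding" below.

```python
FRAME_SECONDS = 0.01  # each frame score covers one 10ms frame


def frames_to_regions(frame_scores, threshold=0.0, padding=0.3):
    """Convert per-frame LLRs to padded, merged (start, end) regions in seconds."""
    regions = []
    start = None
    for i, score in enumerate(frame_scores):
        if score > threshold and start is None:
            start = i * FRAME_SECONDS              # speech region opens here
        elif score <= threshold and start is not None:
            regions.append((start, i * FRAME_SECONDS))
            start = None
    if start is not None:                          # region runs to end of audio
        regions.append((start, len(frame_scores) * FRAME_SECONDS))

    # Pad each region (clamping the start at zero), then merge any overlaps.
    padded = [(max(0.0, s - padding), e + padding) for s, e in regions]
    merged = []
    for s, e in padded:
        if merged and s <= merged[-1][1]:          # overlaps the previous region
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged
```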
An example of SAD region scores:
test_1.wav 10.159 13.219 speech 0.00000000
test_1.wav 149.290 177.110 speech 0.00000000
test_1.wav 188.810 218.849 speech 0.00000000
The final number in this format is a placeholder, retained for formatting compatibility with other region-scoring plugins.
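For client-side consumption, each output line can be split into its five whitespace-separated fields. A small illustrative Python parser (the function name is ours, not part of the plugin):

```python
def parse_region_line(line):
    """Parse '<audio id> <start> <end> <label> <placeholder>' (times in seconds)."""
    audio_id, start, end, label, _placeholder = line.split()
    return audio_id, float(start), float(end), label


# e.g. parse_region_line("test_1.wav 10.159 13.219 speech 0.00000000")
# -> ("test_1.wav", 10.159, 13.219, "speech")
```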
Adaptation
Many SAD plugins allow for supervised adaptation. Supervised adaptation is the process of retraining the speech and non-speech SAD models with new labeled audio samples, to make them a better match to the operational environment. If, while using SAD, too many segments of speech are being missed, or too many false alarms are appearing (speech being hypothesized in regions without true speech), first try raising the threshold (to reduce false alarms) or lowering it (to reduce missed speech segments). If changing the threshold does not sufficiently improve the performance of the system, supervised adaptation may be able to address the performance deficiencies. Please check the individual plugin documentation page to verify that this functionality is available for the plugin you are working with. Supervised adaptation is accomplished by providing the system with three inputs:
- audio
- regions, in the form of start and end timestamps, in seconds
- labels for each region
The labels indicate which regions of the audio are speech ('S') or non-speech ('NS'). The system adapts the model using this set of data and labels in order to improve performance in target conditions that differ from its original training. Adaptation can substantially improve performance with 6 minutes or more of speech and non-speech region annotations, and performance can improve with as little as one minute. Adaptation durations of less than one minute have not been tested, so results will be uncertain. Inputs to the plugin should include both S and NS regions. Inputs do not need to be balanced, but it is preferable that the S and NS regions are of similar total duration, since only speech or only non-speech may provide little performance benefit.
For example:
20131213T071501UTC_11020_A.wav S 72.719000 73.046
20131213T071501UTC_11020_A.wav NS 51.923000 53.379000
Note that the proper full or relative path to each audio file must be provided in order for it to be used for processing.
Note also that, as of OLIVE 5.1, regions for all OLIVE operations must be given in seconds. This is in contrast to previous versions of OLIVE, where some operations, such as adaptation through the Enterprise API, required milliseconds.
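A small, purely illustrative Python check, assuming annotation lines in the format shown above (the function name and error handling are ours, not part of OLIVE), can catch bad labels, reversed regions, and unresolvable paths before adaptation is attempted:

```python
import os


def parse_annotation_line(line):
    """Parse '<audio path> <S|NS> <start> <end>', with times in seconds."""
    path, label, start, end = line.split()
    if label not in ("S", "NS"):
        raise ValueError(f"label must be 'S' or 'NS', got {label!r}")
    start, end = float(start), float(end)
    if not 0.0 <= start < end:
        raise ValueError(f"bad region [{start}, {end}] in {path}")
    if not os.path.exists(path):  # full or relative path must resolve
        raise FileNotFoundError(path)
    return path, label, start, end
```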
For more details about integrating adaptation through the API, or about performing adaptation using the OLIVE command line tools, see the appropriate section in the CLI User Guide.
Functionality (Traits)
The functions of this plugin are defined by its Traits and implemented API messages. A list of these Traits is below, along with the corresponding API messages for each. Click a message name below to be taken to additional implementation details.
- FRAME_SCORER - Score all submitted audio, returning a score corresponding to each 10ms frame of the file, representing the likelihood that each respective frame contains a detection.
- REGION_SCORER - Score all submitted audio, returning labeled regions, where each region represents the timestamp boundaries where speech was detected.
- SUPERVISED_ADAPTER - Allow users to perform domain adaptation to enhance performance in new audio conditions by providing the system with new, labeled audio data to learn from. Performance shows substantial improvement with six minutes of annotated speech and non-speech.
This SAD plugin is capable of accepting Annotation Regions as part of the Audio message that each of the above messages includes. When these optional Annotation Regions are provided, SAD will return results only for the specified regions.
Compatibility
OLIVE 5.1+
Limitations
Any known or potential limitations with this plugin are listed below.
Speech Disclaimer
A SAD plugin detects any and all speech, including singing, whether it is intelligible or not, and whether it is live, recorded, or even machine-produced.
DTMF False Alarms
It is possible for this plugin to produce false alarms when presented with audio containing DTMF tone signals and certain other signals with a similar structure to speech.
Minimum Audio Length
A minimum waveform duration of 0.31 seconds is required to produce a meaningful speech detection.
Comments
Speech Region Padding
When performing region scoring with this plugin, a 0.3-second padding is added to the start and end of each detected speech region, in order to reduce edge effects for human listening and to otherwise smooth the output regions of SAD.
Supervised Adaptation Guidance
When performing supervised adaptation with this plugin, the user must provide adaptation audio and annotations with both speech and non-speech segments, preferably of similar total duration. Supervised adaptation is suggested when operating in new acoustic environments where SAD is found to be not working as expected: either too many true speech regions are being missed, or too many non-speech segments are being falsely identified as speech.
We suggest adapting with a minimum of 6 minutes of total annotations, which, based on experimental results, should provide satisfactory baseline adaptation when encountering stationary background noises, up to a minimum of 30 minutes total when dealing with non-stationary or music-like background environments. As little as one minute of data can be used for adaptation and still provide performance improvements. Adaptation durations of less than 60 seconds have not been tested, and a warning will be triggered if adaptation is performed with fewer than 60 seconds of adaptation annotation regions.
A minimum of 3 seconds of annotations must be provided in order for adaptation to be performed. If fewer than 3 seconds are provided, adaptation will halt and an error will be reported.
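The duration rules above can be summarized in a short illustrative check (the 3-second hard minimum and 60-second warning threshold come from this page; the function itself is a sketch, not part of the plugin):

```python
import warnings


def check_adaptation_duration(regions):
    """regions: iterable of (start, end) annotation times, in seconds."""
    total = sum(end - start for start, end in regions)
    if total < 3.0:  # adaptation halts with an error below 3 seconds
        raise ValueError(f"only {total:.1f}s of annotations; at least 3s required")
    if total < 60.0:  # untested territory: a warning is triggered below 60 seconds
        warnings.warn(f"{total:.1f}s of annotations; durations under 60s are untested")
    return total
```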
Global Options
The following options are available to this plugin, adjustable in the plugin's configuration file, plugin_sad_config.py.
Option Name | Description | Default | Expected Range |
---|---|---|---|
threshold | Detection threshold: a higher value results in fewer detections being output. Increase the threshold value to reduce the number and duration of false speech segments; reduce it if there are too many missed speech segments. | 0.0 | -4.0 to 4.0 |
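As a sketch of how this might look in practice (the actual contents and structure of plugin_sad_config.py may differ; only the option name, default, and range above are documented):

```python
# plugin_sad_config.py -- illustrative sketch only; the real file may
# define additional, undocumented options or use a different structure.

# Detection threshold applied to the frame LLRs. Raise it to cut false
# alarms; lower it to recover missed speech. Expected range: -4.0 to 4.0.
threshold = 0.0
```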