vtd-dnn-v7 (Voice Type Discrimination)

Version Changelog

Plugin Version | Change
v7.0.1 | Initial plugin release. This plugin shares a codebase with sad-dnn-v7.0.1; only the models and parameters differ, being configured for live speech detection. Released with OLIVE 5.1.0.
v7.0.2 | Bug fixes from v7.0.1. Released with OLIVE 5.2.0.

Description

Voice type discrimination (VTD) plugins are designed to detect the presence of speech coming from a live human talker. The goal of VTD is to distinguish live-produced human speech not only from silence or noise, but also from speech played over an electronic speaker, such as a television or phone. When live speech is detected, it is labeled with the timestamps corresponding to its location in the audio. Like SAD, this plugin may be used as either a frame scorer or a region scorer.

Domains

  • vtd-v1
    • Domain designed to detect live speech indoors, differentiating in-room live speech from background distractors such as TV, radio, door sounds, telephone ringing, and traffic.

Inputs

An audio file or buffer, with an optional identifier and/or optional regions.

Outputs

The current VTD plugin is capable of performing both frame scoring and region scoring.

For frame scoring, the plugin typically outputs a log-likelihood ratio (LLR) score of live speech vs. non-speech or non-live speech for each 10 ms frame of the input audio segment (i.e. 100 frames per second). An LLR greater than 0 indicates that live speech is more likely than non-speech or non-live speech, and 0 is generally used as the threshold for detecting live-speech regions. A score of exactly 0 indicates that live speech is as likely as non-speech or non-live speech. VTD plugins may also post-process frame scores to return speech regions, though this is often done client-side for flexibility.

An excerpt example of what this typically looks like:

    -0.68944
    -0.47805
    -0.27453
    -0.07032
    0.13456
    0.34013
    0.53357
    0.97258
    1.10885

This can be transformed into a region scoring output, either client-side, or by requesting region scores from the plugin. Region scores output by the plugin will provide the timestamp boundaries, in seconds, of the locations where live speech was detected in the audio.
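As a first step toward a client-side conversion, each frame score can be paired with a time offset and a live-speech decision. The following is a minimal sketch, not part of the plugin or the OLIVE API: the function name `label_frames` is hypothetical, the excerpt above is treated as if it began at frame 0, and a frame is counted as live speech only when its LLR is strictly greater than the threshold.

```python
FRAME_SHIFT_S = 0.01  # 10 ms per frame, as stated in the text


def label_frames(llrs, threshold=0.0):
    """Return (offset_s, llr, is_live_speech) for each frame score.

    Assumption: offsets are relative to the first score in `llrs`.
    """
    return [
        (round(i * FRAME_SHIFT_S, 2), llr, llr > threshold)
        for i, llr in enumerate(llrs)
    ]


# The excerpt of frame scores shown above.
scores = [-0.68944, -0.47805, -0.27453, -0.07032, 0.13456,
          0.34013, 0.53357, 0.97258, 1.10885]

labeled = label_frames(scores)
for offset_s, llr, is_live in labeled:
    tag = "live-speech" if is_live else "other"
    print(f"{offset_s:.2f}s  {llr:+.5f}  {tag}")
```

With the default threshold of 0, the first four frames of the excerpt fall below the threshold and the remaining five are labeled as live speech.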

When the plugin is asked for region scores, the conversion from frame scores to region scores is done internally by applying a threshold to the scores. Contiguous frames that are above the threshold are converted to timestamps representing the start and end of each region. These regions are then extended by a padding value (typically 0.5 s or lower): the padding is subtracted from the region start time and added to the region end time. For example, if a region starts at 10 seconds and ends at 20 seconds, with a 1-second padding value, the new 'padded' region will range from 9 to 21 seconds. If two regions overlap in time after padding, they are merged into a single extended region.
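The plugin's internal implementation is not shown in this document. For clients that prefer to do the conversion themselves, the steps described above (threshold, group contiguous frames, pad, merge) can be sketched roughly as follows. The function name `frames_to_regions` and its parameters are hypothetical; clamping padded regions to the bounds of the audio and merging regions that merely touch are assumptions of this sketch, not documented plugin behavior.

```python
def frames_to_regions(llrs, threshold=0.0, frame_shift=0.01, padding=0.3):
    """Convert per-frame LLRs into padded, merged (start_s, end_s) regions."""
    # Step 1: find runs of contiguous frames whose score exceeds the threshold.
    raw = []
    run_start = None
    for i, llr in enumerate(llrs):
        if llr > threshold:
            if run_start is None:
                run_start = i  # a new above-threshold run begins here
        elif run_start is not None:
            raw.append((run_start * frame_shift, i * frame_shift))
            run_start = None
    if run_start is not None:  # the last run extends to the end of the audio
        raw.append((run_start * frame_shift, len(llrs) * frame_shift))

    # Step 2: pad each region, then merge regions that now overlap.
    total_s = len(llrs) * frame_shift
    merged = []
    for start, end in raw:
        # Assumption: padded regions are clamped to [0, total duration].
        start = max(0.0, start - padding)
        end = min(total_s, end + padding)
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# The worked example from the text: speech from 10 s to 20 s, 1-second padding.
llrs = [-1.0] * 1000 + [1.0] * 1000 + [-1.0] * 1000
regions = frames_to_regions(llrs, padding=1.0)
print(regions)  # a single region spanning roughly 9 s to 21 s
```

Doing this client-side allows the threshold and padding to be tuned per application without re-running the plugin, which is the flexibility the text refers to.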

An example of VTD region scores:

    test_1.wav 10.159 13.219 speech 0.00000000
    test_1.wav 149.290 177.110 speech 0.00000000
    test_1.wav 188.810 218.849 speech 0.00000000

The final number in this format is a placeholder that retains formatting compatibility with other region-scoring plugins.
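Region score lines in this layout can be read with a few lines of code. The sketch below assumes the fields are whitespace-separated and that the audio identifier contains no spaces; the `Region` type and `parse_region_line` function are hypothetical names introduced for illustration, not part of OLIVE.

```python
from typing import NamedTuple


class Region(NamedTuple):
    audio_id: str
    start_s: float
    end_s: float
    label: str
    score: float  # placeholder value, kept only for format compatibility


def parse_region_line(line):
    """Parse one 'audio_id start end label score' line into a Region."""
    audio_id, start, end, label, score = line.split()
    return Region(audio_id, float(start), float(end), label, float(score))


# The example region scores shown above.
lines = [
    "test_1.wav 10.159 13.219 speech 0.00000000",
    "test_1.wav 149.290 177.110 speech 0.00000000",
    "test_1.wav 188.810 218.849 speech 0.00000000",
]

parsed = [parse_region_line(line) for line in lines]
total_live_speech_s = sum(r.end_s - r.start_s for r in parsed)
print(f"{len(parsed)} regions, {total_live_speech_s:.3f} s of live speech")
```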

Adaptation

VTD does not currently support adaptation.

Functionality (Traits)

The functions of this plugin are defined by its Traits and implemented API messages. A list of these Traits is below, along with the corresponding API messages for each. Click the message name below to be brought to additional implementation details below.

This plugin is capable of accepting Annotation Regions as part of the Audio message that each of the above messages includes. When these optional Annotation Regions are provided, VTD will return results only for those specified regions.

Compatibility

OLIVE 5.1+

Limitations

Any known or potential limitations with this plugin are listed below.

Speech Intelligibility

The VTD plugin detects any and all live speech, whether it is intelligible or not.

Live-speech Detection Difficulties

It is especially difficult to detect live speech, or to differentiate it from pre-recorded or electronic-speaker-produced speech, 1) when the microphone is placed very close to distractor sources such as a TV or radio, 2) when these distractors are played at an unusually high volume, or 3) when the microphone is very distant from the source and the signal is weak.

Minimum Audio Length

A minimum waveform duration of 0.3 seconds is suggested to produce a meaningful live-speech detection.
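A client can guard against very short inputs before submitting audio. This is a minimal sketch under the assumption that the sample count and sample rate of the buffer are known; `is_long_enough` is a hypothetical helper, not an OLIVE function.

```python
MIN_DURATION_S = 0.3  # suggested minimum duration from the text


def is_long_enough(num_samples, sample_rate_hz, min_duration_s=MIN_DURATION_S):
    """Return True if the audio meets the suggested minimum duration."""
    return num_samples / sample_rate_hz >= min_duration_s


# Example: buffers sampled at 8 kHz (the sample rate here is illustrative).
print(is_long_enough(2400, 8000))  # 0.3 s of audio
print(is_long_enough(1600, 8000))  # 0.2 s of audio
```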

Comments

Speech Region Padding

When performing region scoring with this plugin, a 0.3-second padding is added to the start and end of each detected speech region, in order to reduce edge effects for human listening and to otherwise smooth the output regions of VTD.

Global Options

The following options are available to this plugin and are adjustable in the plugin's configuration file, plugin_config.py.

Option Name | Description | Default | Expected Range
threshold | Detection threshold. A higher value results in fewer detections being output. Increase the threshold to reduce the number and duration of false live-speech segments. Reduce the threshold if too many live-speech segments are being missed. | 0.0 | -4.0 to 4.0