
qbe-ftdnnSmolive-v1 (Query by Example Keyword Spotting)

Version Changelog

Plugin Version | Change
v1.0.0 | Initial plugin release with OLIVE 5.4.0. This plugin is based on qbe-tdnn-v5 but moves to a Factorized TDNN model architecture and adds model quantization and pruning. A full-resolution model is also included for hardware compatibility when the pruned model cannot operate.

Description

This plugin is based heavily on its predecessor, qbe-tdnn-v5, with the important distinction that the model has been both quantized and pruned, in addition to adopting factorized TDNN models. Quantized models perform computations at a lower bit resolution (8-bit integer in our case) than standard floating-point precision, allowing for a more compact model representation and the use of high-performance vectorized operations on many hardware platforms. Further details can be found in the PyTorch documentation. Pruning aims to make neural network models smaller by removing a subset of the nodes that have minimal impact on performance. The pruned neural network is then fine-tuned on the target task to minimize the performance loss. The result is a much lighter-weight and often faster model that sacrifices little to no accuracy.
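To make the two ideas above concrete, here is a minimal pure-Python sketch of 8-bit quantization and pruning. It is an illustration only and not the plugin's implementation: the function names are hypothetical, the quantizer assumes a single symmetric scale for the whole weight list, and the pruning step zeroes individual low-magnitude weights, which is a simplification of removing whole nodes.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale (assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer representation."""
    return [q * scale for q in quantized]


def prune_by_magnitude(values, fraction):
    """Zero out the given fraction of weights with the smallest absolute value."""
    n_prune = int(len(values) * fraction)
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    pruned = list(values)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.50, -0.02, 0.31, 0.004, -0.75, 0.10]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
pruned = prune_by_magnitude(weights, 0.5)
```

In this sketch each restored value differs from the original by at most half of `scale`, which is the precision given up in exchange for the smaller representation. A real pipeline would fine-tune the model after pruning, as described above.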

Query by Example Keyword Spotting plugins are designed to allow users to detect and label targeted keywords or key phrases that are defined by audio sample enrollments. They have no language model or other constraints that accompany a traditional keyword spotting plugin, so they are language independent.

This is a query by example plugin built on Factorized TDNN architecture models, with score calibration and test-adaptive merging of examples. It uses dynamic time warping to compensate for speed and cadence differences when detecting keywords. This version also removes the Kaldi dependency, replacing a poorly performing bottleneck feature extractor with an in-house developed and trained model, and reduces false alarms by filtering out overlapping lower-confidence detections.
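The role of dynamic time warping can be illustrated with a short sketch. This is the textbook full-sequence DTW recurrence over lists of feature vectors, shown under the assumption of a Euclidean frame distance; it is not the plugin's implementation, and locating a keyword inside longer audio would need a subsequence variant. The function name `dtw_distance` is hypothetical.

```python
def dtw_distance(query, test):
    """Accumulated DTW cost between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(query), len(test)
    # cost[i][j] = best cost of aligning the first i query frames with the first j test frames
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two frames (an assumed choice)
            d = sum((a - b) ** 2 for a, b in zip(query[i - 1], test[j - 1])) ** 0.5
            # Extend the cheapest of: repeat test frame, repeat query frame, advance both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


query = [[0.0], [1.0], [2.0]]
slow = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]  # same pattern at half speed
other = [[5.0], [5.0], [5.0]]
same_cost = dtw_distance(query, slow)
other_cost = dtw_distance(query, other)
```

Because each frame may be repeated along the alignment path, the half-speed rendition of the query aligns at zero cost, while an unrelated sequence does not.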

Domains

  • multi-int8-v1
    • Multi-condition domain meant for general-purpose audio conditions including telephone, broadband microphone, and other noisy situations without too many digital or PTT distortions.

Inputs

For enrollment, an audio file or buffer with a corresponding keyword or query label. For scoring, an audio buffer or file.

Outputs

When one or more of the enrolled keywords has been detected in the submitted audio, QBE returns a region or list of timestamped regions (in seconds), each with a score for the keyword that has been detected.

The output of QBE follows the format of the traditional KWS output exactly:

<audio_file_path> <start_time_s> <end_time_s> <keyword_id> <score>

Example:

/data/qbe/test/testFile1.wav 0.630 1.170 Airplane 4.37324614709
/data/qbe/test/testFile2.wav 0.350 1.010 Watermelon -1.19732598006
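
A line in this format can be read into a structured record with a few lines of Python. This is a sketch rather than part of the OLIVE tooling: the `Detection` type and `parse_detection` function are hypothetical names, the code assumes the keyword ID contains no whitespace, and the zero threshold in the usage example is an arbitrary illustration, not a recommended operating point.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    audio_path: str
    start_s: float
    end_s: float
    keyword_id: str
    score: float


def parse_detection(line):
    """Parse '<audio_file_path> <start_time_s> <end_time_s> <keyword_id> <score>'."""
    # Split from the right so a path containing spaces still parses;
    # this assumes the keyword ID itself has no whitespace.
    path, start, end, keyword, score = line.strip().rsplit(None, 4)
    return Detection(path, float(start), float(end), keyword, float(score))


lines = [
    "/data/qbe/test/testFile1.wav 0.630 1.170 Airplane 4.37324614709",
    "/data/qbe/test/testFile2.wav 0.350 1.010 Watermelon -1.19732598006",
]
detections = [parse_detection(line) for line in lines]
# Keep only detections above an illustrative threshold of 0.0
confident = [d for d in detections if d.score > 0.0]
```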

Enrollments

Query by Example plugins allow class modifications. A class modification is the capability to enroll a class with one or more samples of that class's speech; in this case, a new keyword to enroll. A new enrollment is created with the first class modification request, which consists of sending the system an audio sample of a new keyword or key phrase, along with a label for that query. The label is not used at all by the system for detection, and is only a reference for the user to help recall what the query was. This means that it's completely acceptable to enroll a sample where a speaker is saying something like "buenas noches", and to label it "good night - spanish" or "good night" or even "lorem ipsum" in the system. This enrollment can be augmented with subsequent class modification requests by adding more audio with the same query label.
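The enrollment behavior described above can be modeled with a small in-memory structure. This is a conceptual sketch only: `EnrollmentStore` and its methods are hypothetical and are not the OLIVE API. It shows that the first request for a label creates the enrollment, later requests with the same label augment it, and the label is an arbitrary reference string.

```python
from collections import defaultdict


class EnrollmentStore:
    """Hypothetical stand-in for a set of keyword enrollments keyed by label."""

    def __init__(self):
        self._samples = defaultdict(list)

    def add_sample(self, label, audio):
        """Create the enrollment on first use, or augment an existing one."""
        self._samples[label].append(audio)
        return len(self._samples[label])

    def labels(self):
        return sorted(self._samples)


store = EnrollmentStore()
# The label is free text; it does not need to match what is spoken.
first = store.add_sample("good night - spanish", b"<audio of 'buenas noches'>")
second = store.add_sample("good night - spanish", b"<second audio sample>")
```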

Functionality (Traits)

The functions of this plugin are defined by its Traits and implemented API messages. A list of these Traits is below, along with the corresponding API messages for each.

Compatibility

OLIVE 5.4+

Limitations

Known or potential limitations of the plugin are outlined below.

Quantized Model Hardware Compatibility

There are certain host/hardware requirements for the quantized models to be able to run; namely, support for AVX2. To avoid the situation where a lack of this support would cause the plugin to become nonfunctional, a full-precision (float32) model has been included that will be loaded and used in the rare case that the quantized model fails to load. This will use more memory, but allows the plugin to function.
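The fallback pattern described above, along with a way to check a Linux host for AVX2 ahead of time, can be sketched as follows. Both functions are hypothetical illustrations, not the plugin's loading code: `cpu_flags` assumes text in the format of Linux `/proc/cpuinfo`, and the two loader callables are placeholders for whatever actually loads each model.

```python
def cpu_flags(cpuinfo_text):
    """Collect CPU feature flags from Linux /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def load_with_fallback(load_quantized, load_full_precision):
    """Try the quantized model first; fall back to float32 if it fails to load."""
    try:
        return load_quantized(), "int8"
    except Exception:
        return load_full_precision(), "float32"


sample_cpuinfo = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2\n"
supports_avx2 = "avx2" in cpu_flags(sample_cpuinfo)


def failing_loader():
    # Placeholder that simulates a host where the quantized model cannot load
    raise RuntimeError("quantized model unsupported on this host")


model, precision = load_with_fallback(failing_loader, lambda: "full-model")
```

On a real Linux host, the text passed to `cpu_flags` would come from reading `/proc/cpuinfo`.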

Query/Keyword Recognizability

The longer and more distinct the enrolled keyword or key phrase is, the better it will be recognized. Shorter keywords that occur often in speech (for example, enrolling the word 'a') or that sound very similar to other words may cause false alarms. In general, QBE plugins are language and speaker independent, and enrolled queries should be able to find speech from other talkers as well. However, if the enrolled keyword example is spoken by someone with a particular accent or non-standard pronunciation, it may not generalize from one or a few samples. It's possible to enroll multiple samples and/or samples from multiple speakers to maximize the coverage of the query's model.

Silence Sensitivity During Enrollment

This version of the plugin is known to be very sensitive to silence included when enrolling new keyword queries, especially at the beginning or end of the query. Care should be taken to ensure that the boundaries of the enrollment submissions are as tight to the actual speech as possible. If excessive silence is included in the enrollment, the system may treat it as part of the query. The dynamic time warping algorithm may then produce keyword detections that erroneously include large amounts of silence, and keyword search time may increase drastically. Future versions of the plugin will address this.
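One way to tighten enrollment boundaries before submission is a simple energy-based trim. The sketch below is an assumption-laden illustration, not a feature of the plugin: it assumes samples are floats in [-1, 1], uses a hypothetical frame length of 160 samples (10 ms at 16 kHz) and an arbitrary RMS threshold, and would need tuning for noisy recordings.

```python
def trim_silence(samples, frame_len=160, threshold=0.01):
    """Drop leading and trailing frames whose RMS energy is below the threshold."""

    def rms(frame):
        return (sum(x * x for x in frame) / len(frame)) ** 0.5

    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    active = [i for i, frame in enumerate(frames) if rms(frame) >= threshold]
    if not active:
        return []  # nothing above the threshold: treat the whole clip as silence
    start = active[0] * frame_len
    end = min(len(samples), (active[-1] + 1) * frame_len)
    return samples[start:end]


# 320 silent samples, 160 samples of signal, then 480 silent samples
clip = [0.0] * 320 + [0.5, -0.5] * 80 + [0.0] * 480
trimmed = trim_silence(clip)
```

Only the edges are trimmed; any pauses inside the key phrase are left untouched so the query's internal timing is preserved.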

Comments

Very short keyword queries will be confusable with many other words, since the phonemes they consist of may be common or frequently occur as part of other words, or sound very similar to words or sounds that may commonly occur. The longer and more distinct a keyword is, the lower the likelihood of false alarms.

Global Options

This plugin does not feature user-configurable parameters.