qbe-tdnn-v5 (Query by Example Keyword Spotting)
Version Changelog
| Plugin Version | Change |
|---|---|
| v5.0.0 | Initial plugin release with OLIVE 5.1.0 |
| v5.0.1 | Increased threshold to minimize false alarms. Updated for OLIVE 5.1.0 |
Description
Query by Example Keyword Spotting (QBE) plugins allow users to detect and label targeted keywords or key phrases that are defined by enrolled audio samples. Because they have no language model or other constraints that accompany a traditional keyword spotting plugin, they are language independent.
This is a query by example plugin built on the latest TDNN architecture models, with score calibration and test-adaptive merging of examples. It uses dynamic time warping to compensate for speed and cadence differences when detecting keywords. This version also removes the Kaldi dependency, replacing a poorly performing bottleneck feature extractor with a model developed and trained in-house, and reduces false alarms by filtering out overlapping lower-confidence detections.
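To illustrate how dynamic time warping absorbs differences in speaking rate, the following is a minimal textbook sketch, not the plugin's actual implementation. It assumes one-dimensional feature sequences and an absolute-difference frame cost; the function name `dtw_distance` and the sample sequences are hypothetical.

```python
from math import inf


def dtw_distance(query, segment):
    """Return the accumulated DTW alignment cost between two 1-D sequences."""
    n, m = len(query), len(segment)
    # cost[i][j] = minimal accumulated cost of aligning query[:i] with segment[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(query[i - 1] - segment[j - 1])  # assumed frame cost
            cost[i][j] = local + min(
                cost[i - 1][j],      # query advances, segment frame is reused
                cost[i][j - 1],      # segment advances, query frame is reused
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


# Hypothetical data: the same contour spoken at half speed, and an unrelated one.
query = [1.0, 2.0, 3.0, 2.0, 1.0]
slow = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0]
other = [3.0, 1.0, 3.0, 1.0, 3.0]

slow_cost = dtw_distance(query, slow)
other_cost = dtw_distance(query, other)
```

In this toy case the slowed-down copy aligns with zero cost because each query frame can be matched to two consecutive segment frames, while the unrelated contour accumulates a positive cost.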
Domains
- multi-v1
  - Multi-condition domain meant for general-purpose audio conditions including telephone, broadband microphone, and other noisy situations without too many digital or PTT distortions.
Inputs
For enrollment, an audio file or buffer with a corresponding keyword or query label. For scoring, an audio buffer or file.
Outputs
When one or more of the enrolled keywords has been detected in the submitted audio, QBE returns a region or list of timestamped regions (in seconds), each with a score for the keyword that has been detected.
The output of QBE follows the format of the traditional KWS output exactly:
<audio_file_path> <start_time_s> <end_time_s> <keyword_id> <score>
Example:
/data/qbe/test/testFile1.wav 0.630 1.170 Airplane 4.37324614709
/data/qbe/test/testFile2.wav 0.350 1.010 Watermelon -1.19732598006
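Output lines in this format can be parsed with a few lines of Python. The sketch below is one possible approach, not part of the plugin: the names `Detection` and `parse_detection` are hypothetical, it assumes the keyword ID contains no whitespace, and the score cutoff of 0.0 is an arbitrary illustrative value rather than a recommended threshold.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    audio_path: str
    start_s: float
    end_s: float
    keyword_id: str
    score: float


def parse_detection(line):
    """Parse one '<path> <start> <end> <keyword_id> <score>' output line."""
    # Split from the right so a path containing spaces stays in one piece.
    # Assumption: the keyword ID itself contains no whitespace.
    path, start, end, keyword, score = line.strip().rsplit(None, 4)
    return Detection(path, float(start), float(end), keyword, float(score))


lines = [
    "/data/qbe/test/testFile1.wav 0.630 1.170 Airplane 4.37324614709",
    "/data/qbe/test/testFile2.wav 0.350 1.010 Watermelon -1.19732598006",
]
detections = [parse_detection(line) for line in lines]

# Keep only detections above an illustrative (assumed) score cutoff.
confident = [d for d in detections if d.score > 0.0]
```

With the two example lines above, only the `Airplane` detection passes the illustrative cutoff.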
Enrollments
Query by Example plugins allow class modifications. A class modification is the ability to enroll a class with one or more samples of that class's speech; in this case, a new keyword. The first class modification request creates a new enrollment: the user sends the system an audio sample of a new keyword or key phrase, along with a label for that query. The system does not use the label for detection; it is only a reference to help the user recall what the query was. It is therefore perfectly acceptable to enroll a sample in which a speaker says "buenas noches" and to label it "good night - spanish", "good night", or even "lorem ipsum". An enrollment can be augmented with subsequent class modification requests that add more audio under the same query label.
Functionality (Traits)
The functions of this plugin are defined by its Traits and implemented API messages. A list of these Traits is below, along with the corresponding API messages for each.
- REGION_SCORER – Score all submitted audio, returning labeled regions within the submitted audio, where each region includes a detected keyword and corresponding score for this keyword.
- CLASS_MODIFIER – Enroll new keyword models or augment existing keyword models with additional data.
Compatibility
OLIVE 5.1+
Limitations
Known or potential limitations of the plugin are outlined below.
Query/Keyword Recognizability
The longer and more distinct the enrolled keyword or key phrase is, the better it will be recognized. Shorter keywords that occur often in speech (for example, the word 'a') or that sound very similar to other words may cause false alarms. In general, QBE plugins are language and speaker independent, and enrolled queries should be able to find the same keyword spoken by other talkers. However, if the enrolled example is spoken with a particular accent or non-standard pronunciation, a query enrolled from only one or a few samples may not generalize. Enrolling multiple samples, and/or samples from multiple speakers, maximizes the coverage of the query's model.
Silence Sensitivity During Enrollment
This version of the plugin is known to be very sensitive to silence included in enrollment audio, especially at the beginning or end of the query. Take care that the boundaries of each enrollment submission are as tight to the actual speech as possible. If an enrollment includes excessive silence, the system may treat that silence as part of the query. The dynamic time warping algorithm may then produce keyword detections that erroneously include large amounts of silence, and keyword search time may increase drastically. Future versions of the plugin will address this.
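One way to tighten enrollment boundaries before submission is a simple amplitude-based trim. The sketch below is only an illustration under stated assumptions: samples are floats normalized to [-1, 1], the frame length of 160 samples and the threshold of 0.01 are hypothetical values, and the function name `trim_silence` is not part of OLIVE. Real recordings with background noise would likely need a proper speech activity detector instead.

```python
def trim_silence(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices bounding the non-silent frames.

    A frame counts as non-silent when its mean absolute amplitude exceeds
    `threshold`. Returns None if every frame is below the threshold.
    """
    active_starts = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        level = sum(abs(x) for x in frame) / len(frame)
        if level > threshold:
            active_starts.append(start)
    if not active_starts:
        return None
    end = min(active_starts[-1] + frame_len, len(samples))
    return active_starts[0], end


# Hypothetical clip: 320 silent samples, 320 "speech" samples, 480 silent samples.
clip = [0.0] * 320 + [0.5, -0.5] * 160 + [0.0] * 480
bounds = trim_silence(clip)
trimmed = clip[bounds[0]:bounds[1]] if bounds else []
```

For the synthetic clip above, the returned bounds cover only the middle 320 samples, so the leading and trailing silence is dropped before enrollment.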
Comments
Very short keyword queries are confusable with many other words, because their phonemes may be common, may frequently occur as part of other words, or may sound very similar to other commonly occurring words or sounds. The longer and more distinct a keyword is, the lower the likelihood of false alarms.
Global Options
This plugin does not feature user-configurable parameters.