kws-end2end-commercial-v1 (Keyword spotting)

Version Changelog

Plugin Version	Change
v1.0.0	Plugin release of a keyword spotting (KWS) system for English which filters the listed keywords in the output of a wav2vec2-based automatic speech recognition (ASR) system, tested and published with OLIVE 6.1.0

Description

This KWS system processes the output of an ASR system that converts the speech contained in a submitted audio segment to text, producing a transcript of what is being said. Currently the outputs are based on word-level transcriptions. ASR plugins do not perform any translation; they only perform speech-to-text in the native language. All KWS domains are language-dependent, and each one is meant to work only with a single, specific language.

The output format is identical to the existing ASR plugin output, which is detailed below. This version does not provide a word-level confidence, which is left for future work; a placeholder score of "1.0" is inserted instead. The input speech is resampled to 16 kHz and processed at this sampling frequency throughout the system.

Each domain in the plugin is language-specific, and is only capable of transcribing speech in that one language. The commercially viable plugin contains a single English domain, while the standard plugin contains the same domains as the asr-end2end plugin. All domains are trained on a diverse set of 8 kHz and 16 kHz training data covering conversational telephone, broadcast, read, and distant speech (training data may vary per domain depending on availability), and are expected to perform well in a wide variety of acoustic conditions. The plugin will have wider coverage for high-resource languages, e.g. English, Mandarin, and Spanish, on account of more diverse training data. All domains have been trained on degraded speech as well as clean speech, and are therefore expected to be more robust to noisy or reverberant recordings. An n-gram language model is used during decoding to improve the KWS performance.

Any input audio at a sample rate other than 16000 Hz will be resampled and processed as 16 kHz audio. Any information above 8 kHz, the Nyquist frequency at this sample rate, will be discarded.

Domains

english-augmented-v1

Inputs

For scoring, an audio buffer or file. Neither OLIVE nor the KWS plugins verify that the audio passed as input is actually spoken in the language that the domain is capable of recognizing. The burden lies on the user to manually or automatically screen this audio before attempting recognition.

Outputs

Similar to the ASR plugins, this KWS plugin is a region scorer, and as such will return a list of words in the order they are spoken. Each detected word is reported as a region with start and end timestamps in seconds, the 'value' of that word, and an accompanying placeholder score. Every output word is part of the vocabulary that the domain for that language was trained with. At this point, out-of-vocabulary words are not supported, so uncommon words, slang words, names, and some other vocabulary may not be recognized by these plugins. If you are interested in this feature, please contact us to start a conversation about adding such functionality.

Note that all current KWS plugin domains will output words in their 'native' script. This means that for languages like English and Spanish, each word will be in ASCII text, using the Latin alphabet. For Mandarin Chinese, Russian, and Farsi, however, words will be composed of Unicode characters in the native script.

An example output excerpt for the keywords 'going to fly' and 'saint louis':

    input-audio.wav 0.330 0.460 going 1.00000000
    input-audio.wav 0.450 0.520 to 1.00000000
    input-audio.wav 0.510 0.940 fly 1.00000000
    input-audio.wav 1.500 1.660 going 1.00000000
    input-audio.wav 1.650 1.720 to 1.00000000
    input-audio.wav 1.710 1.930 fly 1.00000000
    input-audio.wav 2.100 2.380 saint 1.00000000
    input-audio.wav 2.370 2.950 louis 1.00000000
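
The excerpt above can be read back into structured records with a few lines of Python. This is only a minimal sketch, assuming the whitespace-separated layout shown in the excerpt (file, start, end, word, score) and single-token words; the names WordRegion, parse_kws_output, and find_phrase are hypothetical helpers and are not part of OLIVE.

```python
from typing import List, NamedTuple, Tuple


class WordRegion(NamedTuple):
    """One scored region from the plugin's text output."""
    filename: str
    start: float  # seconds
    end: float    # seconds
    word: str
    score: float  # placeholder 1.0 in this plugin version


def parse_kws_output(text: str) -> List[WordRegion]:
    """Parse lines of the assumed form: <file> <start> <end> <word> <score>."""
    regions = []
    for line in text.splitlines():
        # Split from the right so a filename containing spaces stays intact.
        fields = line.strip().rsplit(None, 4)
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        filename, start, end, word, score = fields
        regions.append(
            WordRegion(filename, float(start), float(end), word, float(score))
        )
    return regions


def find_phrase(regions: List[WordRegion], phrase: str) -> List[Tuple[float, float]]:
    """Return (start, end) spans where the words of `phrase` occur consecutively."""
    target = phrase.lower().split()
    if not target:
        return []
    words = [r.word.lower() for r in regions]
    spans = []
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            spans.append((regions[i].start, regions[i + len(target) - 1].end))
    return spans


sample = """\
input-audio.wav 0.330 0.460 going 1.00000000
input-audio.wav 0.450 0.520 to 1.00000000
input-audio.wav 0.510 0.940 fly 1.00000000
input-audio.wav 1.500 1.660 going 1.00000000
input-audio.wav 1.650 1.720 to 1.00000000
input-audio.wav 1.710 1.930 fly 1.00000000
input-audio.wav 2.100 2.380 saint 1.00000000
input-audio.wav 2.370 2.950 louis 1.00000000
"""

regions = parse_kws_output(sample)
print(find_phrase(regions, "going to fly"))
print(find_phrase(regions, "saint louis"))
```

Because the plugin reports individual words rather than whole keyword phrases, grouping consecutive words as in find_phrase is one way a client might recover the time span of a multi-word keyword.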

Functionality (Traits)

The functions of this plugin are defined by its Traits and implemented API messages. These Traits are listed below, along with the corresponding API messages for each.

REGION_SCORER – Score all submitted audio, returning labeled regions within the submitted audio, where each region includes a detected topic of interest and corresponding score for this topic.
- RegionScorerRequest

Compatibility

OLIVE 6.1+

Limitations

Because this is the debut KWS plugin release for OLIVE, there are several known limitations that will affect its use.

KWS is language-dependent and also largely audio-domain-dependent. The domains that a plugin can effectively cover are largely determined by the data used to train the speech recognition system's acoustic model. The individual words that the plugin is capable of recognizing are determined by the vocabulary that the corresponding language model was trained with. This means that some uncommon or unofficial words, like slang or other types of colloquial speech, as well as names or places, may not be recognized by a plugin out-of-the-box. Several factors can limit the vocabulary of a language model, including the age of the text data used during development and the source or domain of this data (such as broadcast news transcripts versus social media posts).

Language Dependence

Each domain of this KWS plugin is language-specific, and is only capable of transcribing speech in a single language. OLIVE performs no filtering or verification to ensure that the speech passed to this domain is indeed in the correct language; this burden lies on the users to either manually or automatically triage input audio if the source language is unknown. An OLIVE Language ID plugin could be used as a front-end to triage out-of-domain languages before passing audio to the KWS plugin.
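
The front-end triage idea can be sketched as a simple filter over language ID scores. This is an illustration only: the score dictionary layout, the language labels, and the threshold value are all assumptions, and triage_by_language is a hypothetical helper rather than an OLIVE API call.

```python
from typing import Dict, List


def triage_by_language(
    lid_scores: Dict[str, Dict[str, float]],
    target: str,
    threshold: float,
) -> List[str]:
    """Return files whose top-scoring language is `target` with score >= threshold.

    `lid_scores` maps filename -> {language label -> score}; this layout is
    an assumption made for the example, not the Language ID plugin's format.
    """
    selected = []
    for filename, scores in lid_scores.items():
        if not scores:
            continue  # no language hypothesis for this file
        best = max(scores, key=scores.get)
        if best == target and scores[best] >= threshold:
            selected.append(filename)
    return selected


# Hypothetical scores for two files.
example_scores = {
    "a.wav": {"eng": 5.2, "spa": -3.1},
    "b.wav": {"eng": -2.0, "spa": 4.4},
}
print(triage_by_language(example_scores, target="eng", threshold=0.0))
```

Only the files returned by such a filter would then be submitted to the English KWS domain.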

Overlap

To keep memory usage of the plugin under control, incoming audio is segmented into smaller chunks for processing, preventing large audio files from overwhelming system RAM. This makes the plugin's resource usage less dependent on the length of the input audio, but transcription errors near these break points become more likely, especially if a word happens to be split in the process. For this reason the audio is split into chunks that overlap slightly, which reduces the chance of a split word causing errors and gives the recognizer another chance to correctly identify the word or words being spoken near the break points. Error-reconciliation code attempts to merge the results from these overlapped sections, but it is still possible for issues to be encountered at these "chunk" boundaries.
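
The overlapped chunking described here can be illustrated with a short sketch. The chunk length and overlap used below are placeholder values chosen for the example, since the plugin's actual values are not documented here, and overlapping_chunks is a hypothetical helper, not the plugin's implementation.

```python
from typing import List, Tuple


def overlapping_chunks(
    total_s: float, chunk_s: float, overlap_s: float
) -> List[Tuple[float, float]]:
    """Split [0, total_s] into (start, end) chunks that overlap by overlap_s seconds."""
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")
    step = chunk_s - overlap_s  # how far each new chunk advances
    chunks = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        chunks.append((start, end))
        if end >= total_s:
            break  # the last chunk reached the end of the audio
        start += step
    return chunks


# Assumed values for illustration: 25 s of audio, 10 s chunks, 2 s overlap.
print(overlapping_chunks(25.0, 10.0, 2.0))
```

Each pair of neighbouring chunks shares overlap_s seconds of audio, so a word cut at the end of one chunk appears whole in the next; reconciling the duplicate hypotheses from the shared region is the step the plugin's error-reconciliation code handles.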

Minimum Speech Duration

The system will not attempt to perform speech recognition unless 0.31 seconds of speech or more is found in the file.
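
A client can apply the same kind of check before submitting audio. The sketch below assumes that speech regions are available as non-overlapping (start, end) pairs in seconds, for example from a speech activity detection step, and that the threshold applies to the summed speech duration; both points are assumptions, and has_enough_speech is a hypothetical helper.

```python
from typing import Iterable, Tuple

MIN_SPEECH_S = 0.31  # minimum speech duration stated in this documentation


def total_speech(regions: Iterable[Tuple[float, float]]) -> float:
    """Sum the durations of (start, end) speech regions given in seconds."""
    return sum(max(0.0, end - start) for start, end in regions)


def has_enough_speech(
    regions: Iterable[Tuple[float, float]], min_speech_s: float = MIN_SPEECH_S
) -> bool:
    """Return True if the summed speech duration meets the minimum."""
    return total_speech(regions) >= min_speech_s


# Hypothetical regions: two short bursts (0.2 s total) versus one 0.5 s region.
print(has_enough_speech([(0.0, 0.1), (1.0, 1.1)]))
print(has_enough_speech([(0.0, 0.5)]))
```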

Comments

GPU Support

This plugin was designed and developed to run optimally on GPU hardware. It is capable of running on CPU in the absence of an available GPU or the proper configuration, but it will do so at a significantly reduced speed.

Please refer to the OLIVE GPU Installation and Support documentation page for instructions on how to enable and configure GPU capability in supported plugins. By default this plugin will run on CPU only.

Global Options

The following options are available to this plugin, adjustable in the plugin's configuration file: plugin_config.py.

Option Name	Description	Default	Expected Range
sad_threshold	Speech detection threshold: Higher value results in less speech being found and processed.	0.0	-4.0 to 4.0
unicode_normalization	Enable or disable unicode normalization on the plugin output for Arabic languages.	None	None (no normalization), "NFC", "NFD", "NFKC", or "NFKD".
LM_process	Enable or disable the use of the language model	True	True, False
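
The defaults and expected ranges in the table can be checked programmatically before editing the configuration. The sketch below is not the layout of plugin_config.py, which is not reproduced here; validate_options is a hypothetical helper, and treating the sad_threshold range as inclusive is an assumption.

```python
from typing import Any, Dict

# Defaults as listed in the table above.
DEFAULTS: Dict[str, Any] = {
    "sad_threshold": 0.0,
    "unicode_normalization": None,
    "LM_process": True,
}

NORMALIZATION_FORMS = (None, "NFC", "NFD", "NFKC", "NFKD")


def validate_options(options: Dict[str, Any]) -> Dict[str, Any]:
    """Merge `options` over the defaults and check each documented range."""
    unknown = set(options) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    merged = {**DEFAULTS, **options}

    sad = merged["sad_threshold"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(sad, bool) or not isinstance(sad, (int, float)):
        raise ValueError("sad_threshold must be a number")
    if not -4.0 <= sad <= 4.0:  # inclusive bounds are an assumption
        raise ValueError("sad_threshold must be between -4.0 and 4.0")

    if merged["unicode_normalization"] not in NORMALIZATION_FORMS:
        raise ValueError("unicode_normalization must be None, NFC, NFD, NFKC, or NFKD")

    if not isinstance(merged["LM_process"], bool):
        raise ValueError("LM_process must be True or False")
    return merged


# Raising the threshold means less audio is treated as speech.
print(validate_options({"sad_threshold": 1.0}))
```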