asr-whisper-v1 (Automatic Speech Recognition)

Version Changelog

Plugin Version	Change
v2.0.0	Initial plugin release of the faster-whisper-based fine-tuned Whisper ASR models, tested and published with OLIVE 5.7.0. Now includes Japanese support.

Description

Automatic Speech Recognition plugins convert the speech contained within a submitted audio segment to text, creating a transcript of what is being said. Currently the outputs are based on word-level transcriptions. ASR plugins do not perform any translation; they simply perform speech-to-text in the native language. All ASR domains are language-dependent, and each one is meant to work only with a single, specific language.

This plugin builds on the first release of SRI's whisper-based ASR plugin, which uses an in-house fine-tuned whisper model for mapping acoustic features to the letters/characters in the target language. It is now based on the faster-whisper architecture, providing memory usage and speed improvements over the initial release of the plugin. Currently, this plugin supports 7 languages: English, Spanish, Russian, Mandarin, Ukrainian, Japanese, and Khmer. The output format is identical to the asr-dynapy output, with the exception that asr-whisper returns segment-level output (with segment boundaries) instead of word-level output (with word boundaries). Please see the details below. This version does not yet provide a word-level confidence, which is left for future work; a placeholder score of "1.0" is inserted instead. The input speech is resampled to 16 kHz and processed at this sampling frequency throughout the system.

Each domain in the plugin is specific to a single language, and is only capable of transcribing speech in that one language. See below for the domains (and languages) available for this plugin, along with additional details regarding each.

All domains are trained on a diverse set of training data covering conversational telephone, broadcast, read, distant speech (training data may vary per domain depending on availability), and are expected to perform well on a wide variety of acoustic conditions. The plugin will have a wider domain coverage on high-resourced domains, e.g. English, Mandarin, Spanish, on account of more diverse training data.

Any input audio at a sample rate different than 16000 Hz will be resampled and processed as 16 kHz audio. Any higher-frequency (> 8 kHz) information will be discarded.

All of the current domains are based on the wav2vec v2.0 model architecture. An n-gram language model is used during decoding to improve the ASR performance.

Domains

english-v1 (both 8k and 16k data)
japanese-v1
khmer-augmented-v2
mandarin-augmented-v2
russian-v1
spanish-v1
ukrainian-v1

Inputs

For scoring, an audio buffer or file. There is no verification performed by OLIVE or by ASR plugins that the audio passed as input is actually being spoken in the language that the domain is capable of recognizing. The burden lies on the user to manually or automatically screen this audio before attempting to recognize.

Outputs

ASR plugins are region scorers, and as such will return a list of words, in the order they are spoken. Each detected word consists of timestamp regions in seconds (start and end time pairs), each with an accompanying placeholder score, along with the 'value' of that word. Each word output must be part of the vocabulary that specific language's domain was trained with. At this point, out-of-vocabulary words are not supported, so uncommon words, slang words, names, and some other vocabulary may not be able to be recognized by these plugins. If interested in this feature in the future, please contact us to start a conversation about adding such functionality.

Note that all current ASR plugin domains will output words in their 'native' script. This means that for languages like English and Spanish, each word will be in ASCII text, with the Latin alphabet. For Mandarin Chinese, Russian, and Farsi, however, words will be composed of Unicode characters in the native script.

An example output excerpt for an English domain:

    input-audio.wav 0.00 11.00 of a three days fever from all which considerations we may conclude as a whole that these things which cannot make good the advantages they promise 1.00000000
    input-audio.wav 11.00 15.42 which are never made perfect by the assembly of all good things 1.00000000

An example output excerpt for a Mandarin Chinese domain:

    input-audio.wav 0.00 11.00 没 什么 意见 的 就 收拾 个 绿化 就 收拾 外边 的 绿化 外边儿 那个 绿化 就 有的 请 检查 一点儿 有的 那个 布拉教 不 到 的 有的 就 去 教 然后 1.00000000
    input-audio.wav 11.00 14.12 让 它 那个 花儿 干 粗 的 那个 就 把 它 剔 出来 1.00000000
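
The excerpts above suggest a simple line layout: file name, start time, end time, transcribed text, and the placeholder score. As a minimal sketch, assuming that layout holds for every line and that the file name contains no whitespace, a line could be parsed as follows. The `Segment` type and `parse_segment_line` function are hypothetical names introduced for illustration and are not part of OLIVE.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One transcribed segment (hypothetical container, not an OLIVE type)."""
    filename: str
    start: float  # seconds
    end: float    # seconds
    text: str
    score: float  # currently a placeholder value of 1.0


def parse_segment_line(line: str) -> Segment:
    """Parse '<file> <start> <end> <text ...> <score>'.

    Assumption: the file name contains no whitespace, so the first three
    tokens and the last token are fixed fields and the rest is the text.
    """
    tokens = line.split()
    if len(tokens) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(tokens)}")
    return Segment(
        filename=tokens[0],
        start=float(tokens[1]),
        end=float(tokens[2]),
        text=" ".join(tokens[3:-1]),
        score=float(tokens[-1]),
    )


if __name__ == "__main__":
    example = ("input-audio.wav 11.00 15.42 which are never made perfect "
               "by the assembly of all good things 1.00000000")
    print(parse_segment_line(example))
```

Because the text field is rejoined with single spaces, the same function handles the space-separated Mandarin output shown above.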

Functionality (Traits)

The functions of this plugin are defined by its Traits and implemented API messages. A list of these Traits is below, along with the corresponding API messages for each. Click a message name to be brought to additional implementation details.

REGION_SCORER – Score all submitted audio, returning labeled regions within the submitted audio, where each region includes a detected topic of interest and corresponding score for this topic.
- RegionScorerRequest

Compatibility

OLIVE 5.7+

Limitations

ASR is language dependent and also largely audio domain dependent. The domains that a plugin can effectively cover are largely determined by the data used to train the speech recognition system's acoustic model. The individual words that the plugin is capable of recognizing are determined by the vocabulary that the corresponding language model was trained with. This means that some uncommon or unofficial words, like slang or other types of colloquial speech, as well as names or places, may not be recognized by a plugin out-of-the-box. Several factors contribute to what might limit the vocabulary of a language model, including the age of the text data used during development and the source or domain of this data (such as broadcast news transcripts versus social media posts).

Regarding performance, ASR plugins are especially sensitive to the tradeoff between accuracy and realistic resource requirements and runtime constraints. Typically, larger, more resource-hungry models are capable of achieving greater accuracy overall, but at the expense of longer runtimes and higher memory requirements. SRI is able to tune our domains to balance these constraints and performance based on customer needs.

Language Dependence

Each domain of this ASR plugin is language specific, and is only capable of transcribing speech in a single language. There is no filter or any sort of verification performed by OLIVE to ensure that the speech passed to this domain is indeed of the correct language - this burden lies on the users to either manually or automatically triage input audio if the source language is unknown. An OLIVE Language ID plugin could be used as a front-end to triage out-of-domain languages before passing audio to the ASR plugin.
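
The triage step described above can be sketched as a small routing function. This is only an illustration: the language labels, the score dictionary, and the threshold of 0.0 are assumptions about what a Language ID front-end might return, and `pick_domain` is a hypothetical helper rather than an OLIVE call. Only the domain names are taken from the Domains list above.

```python
from typing import Dict, Optional

# Domain names come from the Domains list; the language labels used as
# keys are hypothetical and depend on the Language ID plugin in use.
DOMAIN_BY_LANGUAGE = {
    "english": "english-v1",
    "japanese": "japanese-v1",
    "khmer": "khmer-augmented-v2",
    "mandarin": "mandarin-augmented-v2",
    "russian": "russian-v1",
    "spanish": "spanish-v1",
    "ukrainian": "ukrainian-v1",
}


def pick_domain(lid_scores: Dict[str, float],
                threshold: float = 0.0) -> Optional[str]:
    """Return the ASR domain for the top-scoring language, or None.

    None means the audio should not be sent to ASR: either no language
    scored at or above the (assumed) threshold, or the best language has
    no matching domain.
    """
    if not lid_scores:
        return None
    language, score = max(lid_scores.items(), key=lambda item: item[1])
    if score < threshold:
        return None
    return DOMAIN_BY_LANGUAGE.get(language)


if __name__ == "__main__":
    print(pick_domain({"english": 2.5, "spanish": -1.0}))
    print(pick_domain({"farsi": 4.0}))
```

Returning `None` for unsupported or low-confidence languages keeps out-of-domain audio from being transcribed by a domain that cannot recognize it.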

Overlap

To keep memory usage of the plugin under control, incoming audio is segmented into smaller chunks for processing, preventing large audio files from overwhelming system RAM during processing. This allows the plugin performance to be less dependent on the length of input audio, but transcription errors near these break points become much more likely, especially if a word happens to be split in the process. Often when chunking audio like this, the audio is split into chunks that overlap slightly, to minimize the chance of a split word causing errors, and give the recognizer another chance to correctly identify the word or words being spoken near the break points. Typically this requires some sort of duplicated recognition resolution or merging logic, for reconciling differences in recognition output within the overlapped sections. This feature has not yet been added to the plugin, which will be addressed in a future release. Due to the lack of overlap/conflict resolution, this plugin is currently configured to have 0 overlap between consecutive segments, and there may be transcription errors resulting from these audio chunk break points.
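
To make the chunking behavior concrete, the sketch below computes chunk boundaries for a given audio length. It is not the plugin's implementation: the function name `chunk_bounds` and the 10-second chunk length in the usage example are assumptions, since the plugin's actual chunk length is not stated here. With `overlap_seconds=0.0` it reproduces the back-to-back segmentation described above, and a nonzero overlap shows how overlapping chunks would be laid out.

```python
from typing import List, Tuple


def chunk_bounds(total_seconds: float,
                 chunk_seconds: float,
                 overlap_seconds: float = 0.0) -> List[Tuple[float, float]]:
    """Return (start, end) times of consecutive chunks covering the audio.

    overlap_seconds=0.0 mirrors the plugin's current configuration; the
    chunk length itself is a caller-supplied, hypothetical value.
    """
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap must be >= 0 and shorter than a chunk")

    step = chunk_seconds - overlap_seconds
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        if end >= total_seconds:
            break  # the last chunk reached the end of the audio
        start += step
    return bounds


if __name__ == "__main__":
    # No overlap: a word spanning 10.0 s or 20.0 s would be split.
    print(chunk_bounds(25.0, 10.0))
    # Two seconds of overlap: each break point is covered twice.
    print(chunk_bounds(25.0, 10.0, overlap_seconds=2.0))
```

In the overlapped case, each break point falls inside two chunks, which is why merging logic is needed to reconcile the duplicated recognition output.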

Minimum Speech Duration

The system will not attempt to perform speech recognition unless 0.31 seconds of speech or more is found in the file.
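
As a rough illustration of this check, the sketch below sums the duration of detected speech regions and compares the total against the 0.31 second minimum. The region list format and the function names are assumptions made for this example. How the plugin actually accumulates speech internally is not documented here.

```python
from typing import Iterable, Tuple

MIN_SPEECH_SECONDS = 0.31  # minimum stated in this section


def total_speech(regions: Iterable[Tuple[float, float]]) -> float:
    """Sum the durations of (start, end) speech regions, in seconds.

    Assumption: regions do not overlap, as would be expected from a
    speech activity detector.
    """
    return sum(end - start for start, end in regions)


def has_enough_speech(regions: Iterable[Tuple[float, float]],
                      minimum: float = MIN_SPEECH_SECONDS) -> bool:
    """Return True when recognition would be attempted under this rule."""
    return total_speech(regions) >= minimum


if __name__ == "__main__":
    print(has_enough_speech([(0.0, 0.1), (1.0, 1.1)]))  # about 0.2 s of speech
    print(has_enough_speech([(0.0, 0.5)]))              # 0.5 s of speech
```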

Comments

GPU Support

This plugin was designed and developed to run optimally on GPU hardware. It is capable of running on CPU in the absence of an available GPU or the proper configuration, but it will do so at a significantly reduced speed.

Please refer to the OLIVE GPU Installation and Support documentation page for instructions on how to enable and configure GPU capability in supported plugins. By default this plugin will run on CPU only.

Global Options

The following options are available to this plugin, adjustable in the plugin's configuration file: plugin_config.py.

Option Name	Description	Default	Expected Range
sad_threshold	Speech detection threshold: Higher value results in less speech being found and processed.	0.0	-4.0 to 4.0
unicode_normalization	Enable or disable unicode normalization on the plugin output for Arabic languages.	None	None (no normalization), "NFC", "NFD", "NFKC", or "NFKD".
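
The four normalization forms accepted by unicode_normalization are the standard Unicode normalization forms. The sketch below uses Python's standard `unicodedata` module to show what each setting would do to a transcript string. The wrapper `normalize_transcript` is a hypothetical helper for illustration, and whether the plugin applies normalization in exactly this way is an assumption.

```python
import unicodedata
from typing import Optional

VALID_FORMS = {"NFC", "NFD", "NFKC", "NFKD"}


def normalize_transcript(text: str, form: Optional[str] = None) -> str:
    """Apply a Unicode normalization form, or return text unchanged.

    form=None mirrors the option's default of no normalization.
    """
    if form is None:
        return text
    if form not in VALID_FORMS:
        raise ValueError(f"unsupported normalization form: {form!r}")
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    composed = "\u00e9"     # 'e' with acute accent as one code point
    decomposed = "e\u0301"  # 'e' followed by a combining acute accent
    # NFC composes and NFD decomposes; both strings render identically.
    print(normalize_transcript(decomposed, "NFC") == composed)
    print(normalize_transcript(composed, "NFD") == decomposed)
```

Choosing a single form matters mostly when transcripts are compared or searched, since visually identical strings in different forms do not compare equal.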