asr-end2end-v1 (Automatic Speech Recognition)

Version Changelog

  • v1.0.0 – Initial plugin release of the end-to-end models, tested and published with OLIVE 5.5.0.
  • v1.0.1 – Updated plugin, improved overall stability and performance, updated sample rate configuration to allow boosted performance when using 16 kHz audio, bug fixes and workflow-compatibility fixes.

Description

Automatic Speech Recognition (ASR) plugins convert the speech contained within a submitted audio segment to text, creating a transcript of what is being said. Currently the outputs are word-level transcriptions. ASR plugins do not perform any translation; they simply perform speech-to-text in the native language. All ASR domains are language-dependent, and each one is meant to work only with a single, specific language.

This plugin is the first release of SRI's end-to-end ASR plugin, which uses a wav2vec v2.0 model to map the input audio samples to the letters/characters of the target language. Currently, this plugin supports 10 languages, which overlap with the language capabilities of the alternative asr-dynapy-v4 plugin. The output format is identical to the asr-dynapy output and is detailed below. This first version does not provide a word-level confidence, which is left for future work; it instead inserts a placeholder score of "1.0". The input speech is resampled to 16 kHz and processed at this sampling frequency throughout the system.

Each domain in the plugin is specific to a single language, and is only capable of transcribing speech in that one language. See below for the domains (and languages) available for this plugin, along with additional details regarding each.

All domains are trained on a diverse set of training data covering conversational telephone, broadcast, read, and distant speech (training data may vary per domain depending on availability), and are expected to perform well in a wide variety of acoustic conditions. High-resourced domains, e.g. English, Mandarin, and Spanish, will have wider coverage on account of more diverse training data.

Any input audio at a sample rate different than 16000 Hz will be resampled and processed as 16 kHz audio. Any higher-frequency (> 8 kHz) information will be discarded.
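
The 8 kHz figure above follows from the Nyquist limit: audio sampled at 16 kHz can only represent content below half that rate. The short sketch below illustrates the arithmetic only; the function name is hypothetical and this is not the plugin's resampling code.

```python
TARGET_RATE_HZ = 16000  # processing rate stated in the text


def retained_bandwidth_hz(source_rate_hz: int,
                          target_rate_hz: int = TARGET_RATE_HZ) -> float:
    """Highest frequency that can survive resampling to the target rate.

    Hypothetical helper for illustration: a sampled signal can only
    represent content below half its sample rate (the Nyquist limit),
    and resampling upward cannot restore content the source never had.
    """
    return min(source_rate_hz, target_rate_hz) / 2


for rate in (8000, 16000, 44100):
    print(rate, "Hz input keeps content up to",
          retained_bandwidth_hz(rate), "Hz")
```

Under this reasoning, 8 kHz telephone audio carries at most 4 kHz of bandwidth even after being resampled to 16 kHz, while 44.1 kHz audio loses everything above 8 kHz.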

All of the current domains are based on the wav2vec v2.0 model architecture. An n-gram language model is used during decoding to improve the ASR performance.

Domains

  • english-v1 (both 8k and 16k data)
  • farsi-v1
  • french-v1
  • iraqiArabic-v1
  • levantineArabic-v1
  • mandarin-v1
  • pashto-v1
  • russian-v1
  • spanish-v1
  • ukrainian-v1

Inputs

For scoring, an audio buffer or file. There is no verification performed by OLIVE or by ASR plugins that the audio passed as input is actually being spoken in the language that the domain is capable of recognizing. The burden lies on the user to manually or automatically screen this audio before attempting to recognize.

Outputs

ASR plugins are region scorers, and as such will return a list of words in the order they are spoken. Each detected word is reported as a region with start and end timestamps in seconds, the 'value' of that word, and an accompanying placeholder score. Each output word must be part of the vocabulary that the specific language's domain was trained with. At this point, out-of-vocabulary words are not supported, so uncommon words, slang, names, and some other vocabulary may not be recognized by these plugins. If you are interested in this feature, please contact us to start a conversation about adding such functionality.

Note that all current ASR plugin domains output words in their 'native' script. This means that for languages like English and Spanish, each word will be in ASCII text, using the Latin alphabet. For Mandarin Chinese, Russian, and Farsi, however, words will be composed of Unicode characters in the native script.

An example output excerpt for an English domain:

    input-audio.wav 0.000 0.190 AND 1.00000000
    input-audio.wav 0.210 0.340 WE'RE 1.00000000
    input-audio.wav 0.330 0.460 GOING 1.00000000
    input-audio.wav 0.450 0.520 TO 1.00000000
    input-audio.wav 0.510 0.940 FLY 1.00000000
    input-audio.wav 1.080 1.300 WAS 1.00000000
    input-audio.wav 1.290 1.390 THAT 1.00000000
    input-audio.wav 1.290 1.390 IT 1.00000000
    input-audio.wav 1.380 1.510 WE'RE 1.00000000
    input-audio.wav 1.500 1.660 GOING 1.00000000
    input-audio.wav 1.650 1.720 TO 1.00000000
    input-audio.wav 1.710 1.930 FLY 1.00000000
    input-audio.wav 1.920 2.110 OVER 1.00000000
    input-audio.wav 2.100 2.380 SAINT 1.00000000
    input-audio.wav 2.370 2.950 LOUIS 1.00000000

An example output excerpt for a Mandarin Chinese domain:

    input-audio.wav 0.280 0.610 战斗 1.00000000
    input-audio.wav 0.600 0.880 爆发 1.00000000
    input-audio.wav 0.870 0.970 的 1.00000000
    input-audio.wav 0.960 1.420 居民区 1.00000000
    input-audio.wav 1.410 2.120 有很多 1.00000000
    input-audio.wav 2.110 2.590 忠于 1.00000000
    input-audio.wav 2.580 3.140 萨德尔 1.00000000
    input-audio.wav 3.130 3.340 的 1.00000000
    input-audio.wav 3.330 3.720 武装 1.00000000
    input-audio.wav 3.710 4.190 份子 1.00000000
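
The excerpts above have five whitespace-separated fields per line: audio name, start time, end time, word, and score. A minimal parsing sketch, assuming that layout holds for every line, might look like the following. The `WordRegion` type and function names are hypothetical and are not part of the OLIVE API.

```python
from typing import Iterable, NamedTuple


class WordRegion(NamedTuple):
    """One line of text output (hypothetical container for illustration)."""
    audio_id: str
    start_s: float
    end_s: float
    word: str
    score: float


def parse_line(line: str) -> WordRegion:
    # Split from the right so the last four fields are always
    # start, end, word, score, even if the audio name contains spaces.
    audio_id, start, end, word, score = line.strip().rsplit(maxsplit=4)
    return WordRegion(audio_id, float(start), float(end), word, float(score))


def transcript(lines: Iterable[str]) -> str:
    """Join the words of all non-empty lines in the order they appear."""
    regions = [parse_line(line) for line in lines if line.strip()]
    return " ".join(region.word for region in regions)


sample = [
    "input-audio.wav 0.000 0.190 AND 1.00000000",
    "input-audio.wav 0.210 0.340 WE'RE 1.00000000",
    "input-audio.wav 0.330 0.460 GOING 1.00000000",
]
print(transcript(sample))  # AND WE'RE GOING
```

The same function works unchanged on the Mandarin excerpt, since Python strings are Unicode and the fields are still separated by ASCII spaces.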

Note that for languages that read from right to left, the direction that text is rendered may appear to 'flip' when viewing the bare text output in a terminal or text editor that doesn't properly deal with the orientation switch mid-line in some operating systems. This can cause the order of the 'word' and 'score' fields to reverse relative to the output of left-to-right read languages, and appear like this Farsi output example:

    input-audio.wav 0.000 0.480 1.00000000 خوب 
    input-audio.wav 0.470 0.740 1.00000000 ای 
    input-audio2.wav 0.000 0.320 1.00000000 آره
    input-audio2.wav 0.310 0.460 1.00000000 می 
    input-audio2.wav 0.450 0.680 1.00000000 گم 
    input-audio2.wav 0.670 0.880 1.00000000 چند 
    input-audio2.wav 0.870 1.330 1.00000000 داره

This is a rendering problem only, however, and rest assured that if interacting with OLIVE through the API, all ordering is properly preserved. Most methods of automatically parsing the raw text output should also properly deal with the ordering, as column-based operators like awk are not affected by the visual order.
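
To see that the stored field order is unaffected by how a terminal draws the line, the sketch below builds one output line in its logical order and splits it again. The Farsi word is written with Unicode escapes so that this listing itself is not subject to the display flip. The helper name `has_rtl` is hypothetical.

```python
import unicodedata

# Logical (stored) order: audio name, start, end, word, score.
farsi_word = "\u062e\u0648\u0628"  # the first word of the Farsi example
fields = ["input-audio.wav", "0.000", "0.480", farsi_word, "1.00000000"]
line = " ".join(fields)


def has_rtl(text: str) -> bool:
    """True if any character has a right-to-left bidirectional class."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


parts = line.split()
word, score = parts[3], float(parts[4])

# The word is still the fourth field and the score the fifth,
# whatever order a bidi-aware display chooses to draw them in.
print(has_rtl(word), score)
```

A check like `has_rtl` can also be used to flag lines whose on-screen appearance may differ from their stored order.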

Functionality (Traits)

The functions of this plugin are defined by its Traits and implemented API messages. A list of these Traits is below, along with the corresponding API messages for each.

  • REGION_SCORER – Score all submitted audio, returning labeled regions within the submitted audio, where each region includes a detected topic of interest (for ASR, a word) and a corresponding score for this topic.

Compatibility

OLIVE 5.5+

Limitations

As this is the debut ASR plugin release for OLIVE, there are several known limitations that will affect the usage of this plugin.

ASR is language dependent and also largely audio-domain dependent. The domains that a plugin can effectively cover are largely determined by the data used to train the speech recognition system's acoustic model. The individual words that the plugin is capable of recognizing are determined by the vocabulary that the corresponding language model was trained with. This means that some uncommon or unofficial words, like slang or other types of colloquial speech, as well as names or places, may not be recognizable by a plugin out-of-the-box. Several factors contribute to what might limit the vocabulary of a language model, including the age of the text data used during development and the source or domain of this data (such as broadcast news transcripts versus social media posts).

Regarding performance, ASR plugins are especially sensitive to the tradeoff between accuracy and realistic resource and runtime constraints. Typically, larger, more resource-hungry models are capable of achieving greater accuracy overall, but at the expense of longer runtimes and higher memory requirements. SRI is able to tune our domains to balance these constraints and performance based on customer needs.

Language Dependence

Each domain of this ASR plugin is language specific, and is only capable of transcribing speech in a single language. There is no filter or any sort of verification performed by OLIVE to ensure that the speech passed to this domain is indeed of the correct language - this burden lies on the users to either manually or automatically triage input audio if the source language is unknown. An OLIVE Language ID plugin could be used as a front-end to triage out-of-domain languages before passing audio to the ASR plugin.
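
The triage idea can be sketched as a small routing step that maps a language ID result to one of the domains listed for this plugin. Everything below is an assumption for illustration: the label strings, the score scale, and the `choose_domain` helper are hypothetical, and the actual labels depend on the Language ID plugin in use.

```python
from typing import Dict, Optional

# Hypothetical LID labels mapped to this plugin's domain names.
# Adjust the keys to whatever labels your LID plugin actually emits.
LID_TO_ASR_DOMAIN = {
    "english": "english-v1",
    "farsi": "farsi-v1",
    "french": "french-v1",
    "iraqi_arabic": "iraqiArabic-v1",
    "levantine_arabic": "levantineArabic-v1",
    "mandarin": "mandarin-v1",
    "pashto": "pashto-v1",
    "russian": "russian-v1",
    "spanish": "spanish-v1",
    "ukrainian": "ukrainian-v1",
}


def choose_domain(lid_scores: Dict[str, float],
                  min_score: float = 0.0) -> Optional[str]:
    """Return the ASR domain for the top-scoring language, or None.

    None means the audio should not be sent to ASR: either no language
    scored above `min_score` (an assumed threshold) or the top language
    has no matching domain.
    """
    if not lid_scores:
        return None
    label, score = max(lid_scores.items(), key=lambda kv: kv[1])
    if score < min_score:
        return None
    return LID_TO_ASR_DOMAIN.get(label)


print(choose_domain({"english": 3.2, "french": -1.0}))  # english-v1
print(choose_domain({"german": 5.0}))                   # None
```

Returning None rather than falling back to a default domain keeps out-of-domain audio from being transcribed by the wrong language model.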

Overlap

To keep memory usage of the plugin under control, incoming audio is segmented into smaller chunks for processing, preventing large audio files from overwhelming system RAM. This makes plugin performance less dependent on the length of the input audio, but transcription errors near these break points become much more likely, especially if a word happens to be split in the process. Often when chunking audio like this, the audio is split into chunks that overlap slightly, to minimize the chance of a split word causing errors and to give the recognizer another chance to correctly identify the word or words being spoken near the break points. Typically this requires some sort of duplicate-resolution or merging logic to reconcile differences in recognition output within the overlapped sections. This feature has not yet been added to the plugin and will be addressed in a future release. Due to the lack of overlap/conflict resolution, this plugin is currently configured to have 0 overlap between consecutive segments, and there may be transcription errors resulting from these audio chunk break points.
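
A rough sketch of how chunk boundaries differ with and without overlap is shown below. The 10-second chunk length is purely an assumed value for illustration, since the plugin's actual chunk size is not stated here, and `chunk_bounds` is a hypothetical helper rather than the plugin's implementation.

```python
from typing import List, Tuple


def chunk_bounds(total_s: float, chunk_s: float,
                 overlap_s: float = 0.0) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for consecutive chunks.

    With overlap_s == 0 (the configuration described above) each chunk
    starts exactly where the previous one ended, so a word spanning a
    boundary is cut in two.
    """
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")
    step = chunk_s - overlap_s
    bounds = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break
        start += step
    return bounds


print(chunk_bounds(25.0, 10.0))
# [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]
print(chunk_bounds(25.0, 10.0, overlap_s=1.0))
# [(0.0, 10.0), (9.0, 19.0), (18.0, 25.0)]
```

In the second call each boundary region is transcribed twice, which is what would require the merging logic described above.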

Arabic Script Languages

Note that for the Farsi and Arabic domains, and for other languages that read from right to left, the direction that text is rendered may appear to 'flip' when viewing the bare text output in a terminal or text editor that doesn't properly deal with the orientation switch mid-line in some operating systems. This can cause the order of the 'word' and 'score' fields to reverse relative to the output of left-to-right read languages, and appear like this Farsi output example:

    input-audio.wav 0.000 0.480 1.00000000 خوب 
    input-audio.wav 0.470 0.740 1.00000000 ای 
    input-audio2.wav 0.000 0.320 1.00000000 آره 
    input-audio2.wav 0.310 0.460 1.00000000 می 
    input-audio2.wav 0.450 0.680 1.00000000 گم 
    input-audio2.wav 0.670 0.880 1.00000000 چند 
    input-audio2.wav 0.870 1.330 1.00000000 داره

This is a rendering problem only, however, and rest assured that if interacting with OLIVE through the API, all ordering is properly preserved. Most methods of automatically parsing the raw text output should also properly deal with the ordering, as column-based operators like awk are not affected by the visual order.

Minimum Speech Duration

The system will not attempt to perform speech recognition unless 0.31 seconds of speech or more is found in the file.
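
A caller who already has speech regions for a file can apply a similar check before submitting audio. The sketch below assumes the 0.31 seconds refers to the total detected speech summed over the file, which the text does not spell out. The helper name and the region format are hypothetical.

```python
from typing import Iterable, Tuple

MIN_SPEECH_S = 0.31  # minimum speech duration stated in the text


def has_enough_speech(speech_regions: Iterable[Tuple[float, float]],
                      min_speech_s: float = MIN_SPEECH_S) -> bool:
    """True if the summed duration of (start, end) regions meets the minimum.

    Assumption: the threshold applies to total speech in the file,
    not to the longest single region.
    """
    total = sum(end - start for start, end in speech_regions)
    return total >= min_speech_s


print(has_enough_speech([(0.0, 0.2), (1.0, 1.2)]))  # True
print(has_enough_speech([(0.0, 0.1)]))              # False
```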

Comments

GPU Support

This plugin was designed and developed to run optimally on GPU hardware. It is capable of running on CPU in the absence of an available GPU or the proper configuration, but it will do so at a significantly reduced speed.

Please refer to the OLIVE GPU Installation and Support documentation page for instructions on how to enable and configure GPU capability in supported plugins. By default this plugin will run on CPU only.

Global Options

The following options are available to this plugin, adjustable in the plugin's configuration file: plugin_config.py.

  • sad_threshold – Speech detection threshold: a higher value results in less speech being found and processed. Default: 0.0. Expected range: -4.0 to 4.0.
  • unicode_normalization – Enable or disable Unicode normalization on the plugin output for Arabic languages. Default: None. Expected values: None (no normalization), "NFC", "NFD", "NFKC", or "NFKD".
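
The values accepted by unicode_normalization are the names of the four standard Unicode normalization forms. To illustrate what these forms do to Arabic-script text, the sketch below uses Python's standard `unicodedata` module. The `normalize_word` wrapper is hypothetical and only mirrors the option's accepted values; it is not the plugin's code.

```python
import unicodedata
from typing import Optional

VALID_FORMS = (None, "NFC", "NFD", "NFKC", "NFKD")


def normalize_word(word: str, form: Optional[str] = None) -> str:
    """Apply a Unicode normalization form, or return the word unchanged."""
    if form not in VALID_FORMS:
        raise ValueError(f"unsupported normalization form: {form!r}")
    if form is None:
        return word  # None means no normalization
    return unicodedata.normalize(form, word)


# The same letter written two ways:
decomposed = "\u0627\u0653"  # ALEF followed by combining MADDAH ABOVE
composed = "\u0622"          # single code point ALEF WITH MADDA ABOVE

print(normalize_word(decomposed, "NFC") == composed)    # True
print(normalize_word(composed, "NFD") == decomposed)    # True
print(normalize_word(decomposed) == composed)           # False
```

Without normalization, the two spellings look identical on screen but compare as different strings, which matters when matching transcript words against a keyword list or reference text.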