
asr-dynapy-v4 (Automatic Speech Recognition)

Version Changelog

  • v4.0.0
    • Initial plugin release, based on v3.0.0, with bug fixes, merged specialty features such as Streaming, and the addition of memory-saving look-ahead model architectures for some domains. Published with OLIVE 5.5.0.

Description

Automatic Speech Recognition plugins perform speech-to-text conversion of the speech contained within a submitted audio segment, creating a transcript of what is being said. Currently the outputs are word-level transcriptions. ASR plugins do not perform any translation; they simply convert speech to text in the native language. All ASR domains are language-dependent, and each is meant to work only with a single, specific language.

This plugin builds on its predecessor, asr-dynapy-v2, and features bug fixes and compatibility updates, as well as 3 additional domains bringing new language capabilities. It is based on SRI's DynaSpeak recognition platform, features word-based region-scoring outputs, and provides the capability to output a score for each timestamped word. This score represents a NN confidence value if available, and backs off to a word posterior score if the confidence is not available. If neither of these measures is available, the 'score' field will contain -1.0. All domains described below currently report word posterior scores, but the plugin is capable of handling NN confidence values in updated domains that will be delivered in the future.

Each domain in the plugin is specific to a single language, and is only capable of transcribing speech in that one language. See below for the domains (and languages) available for this plugin, along with additional details regarding each.

All current domains are trained on conversational telephone speech and will perform best on matched audio conditions, but they are still capable of recognition on mismatched audio.

Any input audio at a sample rate higher than 8000 Hz will be resampled and processed as 8 kHz audio. Any higher-frequency (> 4 kHz) information will be discarded.
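Callers that control their own audio pipeline may prefer to downsample before submission rather than rely on the plugin's internal resampling. A minimal sketch of the equivalent operation using SciPy (SciPy is an assumption for illustration, not a plugin dependency; the 16 kHz input rate is just an example):

```python
import numpy as np
from scipy.signal import resample_poly

def downsample_to_8k(samples: np.ndarray, rate: int) -> np.ndarray:
    """Resample a mono signal to 8 kHz, mirroring what the plugin does
    internally: content above 4 kHz is filtered away before decimation."""
    if rate == 8000:
        return samples
    # resample_poly applies an anti-aliasing low-pass filter automatically
    return resample_poly(samples, up=8000, down=rate)

# one second of 16 kHz audio becomes one second of 8 kHz audio
x = np.zeros(16000)
y = downsample_to_8k(x, 16000)
```

Polyphase resampling (rather than naive decimation) matters here because simply dropping every other sample would alias any content above 4 kHz into the audible band.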

All of the current domains are based on the time-delay neural network (TDNN) architecture. Some of these domains are also chain models, which compute frames at a lower frequency without sacrificing accuracy, allowing faster processing thanks to less computation. The chain-model domains use much deeper networks than previous technologies, providing much better accuracy. Refer to the domains list below to see which domains are and aren't currently chain models.

Domains (Supported Languages)

Distant Speech targeted domains

These domains have been augmented with additional audio data sources to better handle distortions and other effects caused by non-close-talk recording conditions.

  • mandarin-tdnnChainStaticSrilm-multi-v1
    • Mandarin domain augmented for distant speech.
  • russian-tdnnChainLookaheadRnnlm-multi-v2
    • Russian domain augmented for distant speech.
  • spanish-tdnnChainStaticRnnlm-multi-v2
    • Spanish domain augmented for distant speech.
  • ukrainian-tdnnChainStaticSrilm-multi-v1
    • Ukrainian domain augmented for distant speech.

Telephony Speech trained domains with Lookahead models

  • english-tdnnLookaheadRnnlm-tel-v2
    • English domain trained with conversational telephony speech. This domain features lookahead architecture for memory footprint benefits and reports word posterior scores.
  • farsi-tdnnLookaheadRnnlm-tel-v1
    • Farsi domain trained with conversational telephony speech. This domain now features lookahead models and reports word posterior scores.
  • french-tdnnLookaheadRnnlm-tel-v1
    • French domain augmented with African-accented French data. This domain features lookahead models and reports word posterior scores.
  • iraqiArabic-tdnnLookaheadRnnlm-tel-v1
    • Iraqi Arabic dialect domain trained with conversational telephony speech. This domain features lookahead models and reports word posterior scores.
  • korean-tdnnLookahead-tel-v1
    • Korean domain trained with conversational telephony speech. Features lookahead models and reports word posterior scores.
  • levantineArabic-tdnnLookaheadRnnlm-tel-v1
    • Levantine Arabic dialect domain trained with conversational telephony speech. This domain features lookahead models and reports word posterior scores.
  • mandarin-tdnnLookaheadRnnlm-tel-v1
    • Mandarin Chinese domain trained with conversational telephony speech. This domain features lookahead models and reports word posterior scores.
  • pashto-tdnnLookaheadRnnlm-tel-v1
    • Pashto domain trained with conversational telephony speech. This domain features lookahead models and reports word posterior scores.
  • russian-tdnnLookaheadRnnlm-tel-v1
    • Russian domain trained with conversational telephony speech. This domain features lookahead models and reports word posterior scores.
  • spanish-tdnnLookaheadRnnlm-tel-v1
    • Spanish domain trained with conversational telephony speech. This domain features lookahead models and reports word posterior scores.

Inputs

For scoring, an audio buffer or file. There is no verification performed by OLIVE or by ASR plugins that the audio passed as input is actually being spoken in the language that the domain is capable of recognizing. The burden lies on the user to manually or automatically screen this audio before attempting to recognize.

Outputs

ASR plugins are region scorers, and as such return a list of words in the order they are spoken. Each detected word consists of a timestamp region in seconds (a start and end time pair) with an accompanying score, along with the 'value' of that word. Each output word must be part of the vocabulary that the specific language's domain was trained with. At this point, out-of-vocabulary words are not supported, so uncommon words, slang, names, and some other vocabulary may not be recognizable by these plugins. If interested in this feature in the future, please contact us to start a conversation about adding such functionality.

Note that all current ASR plugin domains will output words in their 'native' script. This means that for languages like English and Spanish, each word will be in ASCII text, in the Latin alphabet. For Mandarin Chinese, Russian, and Farsi, however, words will be composed of Unicode characters in the native script.

An example output excerpt for an English domain:

    input-audio.wav 0.000 0.190 and 43.00000000
    input-audio.wav 0.210 0.340 we're 44.00000000
    input-audio.wav 0.330 0.460 going 97.00000000
    input-audio.wav 0.450 0.520 to 97.00000000
    input-audio.wav 0.510 0.940 fly 66.00000000
    input-audio.wav 1.080 1.300 was 31.00000000
    input-audio.wav 1.290 1.390 that 24.00000000
    input-audio.wav 1.290 1.390 it 22.00000000
    input-audio.wav 1.380 1.510 we're 27.00000000
    input-audio.wav 1.500 1.660 going 97.00000000
    input-audio.wav 1.650 1.720 to 98.00000000
    input-audio.wav 1.710 1.930 fly 94.00000000
    input-audio.wav 1.920 2.110 over 79.00000000
    input-audio.wav 2.100 2.380 saint 93.00000000
    input-audio.wav 2.370 2.950 louis 96.00000000

An example output excerpt for a Mandarin Chinese domain:

    input-audio.wav 0.280 0.610 战斗 99.00000000
    input-audio.wav 0.600 0.880 爆发 98.00000000
    input-audio.wav 0.870 0.970 的 99.00000000
    input-audio.wav 0.960 1.420 居民区 86.00000000
    input-audio.wav 1.410 2.120 有很多 93.00000000
    input-audio.wav 2.110 2.590 忠于 99.00000000
    input-audio.wav 2.580 3.140 萨德尔 100.00000000
    input-audio.wav 3.130 3.340 的 100.00000000
    input-audio.wav 3.330 3.720 武装 55.00000000
    input-audio.wav 3.710 4.190 份子 53.00000000

Note that for languages that read from right to left, the direction that text is rendered may appear to 'flip' when viewing the bare text output in a terminal or text editor that doesn't properly handle the mid-line orientation switch on some operating systems. This can cause the 'word' and 'score' fields to appear reversed relative to the output of left-to-right languages, as in this Farsi output example:

    input-audio.wav 0.000 0.480 58.00000000 خوب 
    input-audio.wav 0.470 0.740 51.00000000 ای 
    input-audio2.wav 0.000 0.320 100.00000000 آره 
    input-audio2.wav 0.310 0.460 99.00000000 می 
    input-audio2.wav 0.450 0.680 99.00000000 گم 
    input-audio2.wav 0.670 0.880 73.00000000 چند 
    input-audio2.wav 0.870 1.330 50.00000000 داره

This is a rendering problem only; when interacting with OLIVE through the API, all ordering is properly preserved. Most methods of automatically parsing the raw text output will also handle the ordering correctly, as column-based operators like awk are not affected by the visual order.
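The whitespace-delimited output shown above can be parsed directly. A minimal sketch (the record layout is taken from the examples above; this is not an official OLIVE parsing utility), which works for right-to-left scripts as well because the fields are stored in logical order regardless of how a terminal renders the line:

```python
from typing import NamedTuple

class WordRegion(NamedTuple):
    audio_id: str
    start: float   # seconds
    end: float     # seconds
    word: str
    score: float

def parse_line(line: str) -> WordRegion:
    """Parse one 'audio start end word score' record."""
    audio_id, start, end, word, score = line.split()
    return WordRegion(audio_id, float(start), float(end), word, float(score))

region = parse_line("input-audio.wav 0.000 0.190 and 43.00000000")
```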

Functionality (Traits)

The functions of this plugin are defined by its Traits and implemented API messages. A list of these Traits is below, along with the corresponding API messages and additional implementation details for each.

  • REGION_SCORER – Scores all submitted audio, returning labeled regions within the submitted audio, where each region is a recognized word with its timestamps and corresponding score.

Compatibility

OLIVE 5.2+

Limitations

There are several known limitations that will impact the usage of this plugin.

ASR is language dependent and also largely audio-domain dependent. The domains that a plugin can effectively cover are largely determined by the data used to train the speech recognition system's acoustic model. The individual words that the plugin is capable of recognizing are determined by the vocabulary that the corresponding language model was trained with. This means that some uncommon or unofficial words, like slang or other colloquial speech, as well as names or places, may not be recognizable by a plugin out of the box. Several factors contribute to what might limit the vocabulary of a language model, including the age of the text data used during development and the source or domain of this data (such as broadcast news transcripts versus social media posts).

Regarding performance, ASR plugins are especially sensitive to the tradeoffs between accuracy performance and realistic resource requirement and runtime constraints. Typically larger, more resource-hungry models are capable of achieving greater accuracy overall, but at the expense of longer runtimes/slower performance, and higher memory requirements. SRI is able to tune our domains to balance these constraints and performance based on customer needs.

Language Dependence

Each domain of this ASR plugin is language specific, and is only capable of transcribing speech in a single language. There is no filter or any sort of verification performed by OLIVE to ensure that the speech passed to this domain is indeed of the correct language - this burden lies on the users to either manually or automatically triage input audio if the source language is unknown. An OLIVE Language ID plugin could be used as a front-end to triage out-of-domain languages before passing audio to the ASR plugin.

Out of Vocabulary (OOV) Words, Names

The individual words that the plugin is capable of recognizing are determined by the vocabulary that the corresponding language model was trained with. This means that some uncommon or unofficial words, like slang or other colloquial speech, as well as names or places, may not be recognizable by a plugin out of the box. Several factors contribute to what might limit the vocabulary of a language model, including the age of the text data used during development, the source or domain of this data (such as broadcast news transcripts versus social media posts), or pruning the vocabulary to increase processing speed and/or reduce memory requirements.

At this time there is no provision in OLIVE for adding words or names to the vocabulary. Some of the models underlying this plugin support vocabulary addition like this, so it may be possible for this feature to be supported in the future with OLIVE modifications.

Confidences vs Word Posteriors

Some of the domains that will be delivered for this plugin will report DNN confidence scores as part of the output, as a likelihood measure that the predicted word is correct. For other domains, these DNN confidence measures are not available, and word posterior scores are provided instead. Generally, the NN confidences are a more reliable score measure, but they require additional networks to be trained; these confidence networks are specific to the rest of the models within the domain and must be retrained alongside them. The domains initially delivered with this plugin report only word posterior scores.

Overlap

To keep memory usage of the plugin under control, incoming audio is segmented into smaller chunks for processing, preventing large audio files from overwhelming system RAM. This makes plugin performance less dependent on the length of the input audio, but transcription errors near these break points become much more likely, especially if a word happens to be split in the process. Often when chunking audio like this, the chunks are made to overlap slightly, minimizing the chance of a split word causing errors and giving the recognizer another chance to correctly identify the word or words spoken near the break points. This typically requires merging logic to reconcile differences in recognition output within the overlapped sections. That feature has not yet been added to the plugin and will be addressed in a future release. Due to the lack of overlap/conflict resolution, this plugin is currently configured with zero overlap between consecutive segments, and there may be transcription errors resulting from these chunk break points.
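The chunking scheme described above can be sketched in a few lines. The 10-second chunk length below is a made-up parameter for illustration; the plugin's actual segment size is internal, and as noted, its current overlap is zero:

```python
def chunk_bounds(total_s: float, chunk_s: float, overlap_s: float = 0.0):
    """Yield (start, end) times that tile an audio stream. With
    overlap_s == 0 the chunks abut exactly; a positive overlap gives the
    recognizer a second look at words near each break point."""
    step = chunk_s - overlap_s
    start = 0.0
    while start < total_s:
        yield (start, min(start + chunk_s, total_s))
        start += step

# with zero overlap, each chunk starts exactly where the last ended
bounds = list(chunk_bounds(total_s=25.0, chunk_s=10.0))
```

With a positive overlap, consecutive chunks share a region of audio, which is why merging logic is then needed to reconcile duplicate hypotheses in the shared span.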

Resources

The Russian domain, rus-tdnnChain-tel-v2, thanks to its large TDNN architecture, complex language model, and the large vocabulary resulting from Russian's agglutinative morphology, is currently tuned more for maximum accuracy than for speed or resource management. As a result, it has a rather high minimum memory requirement for execution relative to other plugins: roughly 9 GB of free system memory is required as a baseline for performing recognition with this domain.

Arabic Script Languages

Note that for the Farsi and Arabic domains, and for other languages that read from right to left, the direction that text is rendered may appear to 'flip' when viewing the bare text output in a terminal or text editor that doesn't properly handle the mid-line orientation switch on some operating systems. This can cause the 'word' and 'score' fields to appear reversed relative to the output of left-to-right languages, as in this Farsi output example:

    input-audio.wav 0.000 0.480 58.00000000 خوب 
    input-audio.wav 0.470 0.740 51.00000000 ای 
    input-audio2.wav 0.000 0.320 100.00000000 آره 
    input-audio2.wav 0.310 0.460 99.00000000 می 
    input-audio2.wav 0.450 0.680 99.00000000 گم 
    input-audio2.wav 0.670 0.880 73.00000000 چند 
    input-audio2.wav 0.870 1.330 50.00000000 داره

This is a rendering problem only; when interacting with OLIVE through the API, all ordering is properly preserved. Most methods of automatically parsing the raw text output will also handle the ordering correctly, as column-based operators like awk are not affected by the visual order.

Non-Verbal Recognizer Output

Each recognizer's vocabulary may contain non-verbal annotations that are valid word candidates and can show up in the transcription output. These include annotations like @reject@ for words the recognizer cannot form a hypothesis for, as well as notations for phenomena like hesitations or filled pauses. These may or may not be useful for a given user's task or use case, so it is currently left to the end user to decide how to process these non-verbal outputs/notations.
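Users who do not want these annotations can filter them in post-processing. A minimal sketch: @reject@ comes from the text above, but treating every fully @-wrapped, angle-bracketed, or square-bracketed token as non-verbal is an assumption about typical recognizer notation, not a list this plugin is documented to emit:

```python
import re

# @reject@ is documented above; the other patterns are illustrative
# guesses at common non-verbal token shapes.
NON_VERBAL = re.compile(r"^(@.*@|<.*>|\[.*\])$")

def keep_verbal(words):
    """Drop tokens that look like non-verbal annotations."""
    return [w for w in words if not NON_VERBAL.match(w)]

words = ["@reject@", "we're", "going", "to", "fly"]
cleaned = keep_verbal(words)
```

Inspect your own domain's actual output before committing to a pattern; the safest approach is an explicit allow/deny list of the tokens you observe.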

Minimum Speech Duration

The system will not attempt to perform speech recognition unless 0.31 seconds of speech or more is found in the file.

Comments

Global Options

The following options are available to this plugin, adjustable in the plugin's configuration file: plugin_asr_dynapy_v1_config.py.

  • sad_threshold
    • Speech detection threshold; higher values result in less audio being detected as speech and processed. Default: 0.0. Expected range: -4.0 to 4.0.
  • unicode_normalization
    • Enable or disable Unicode normalization of the plugin output for Arabic-script languages. Default: None. Expected values: None (no normalization), "NFC", "NFD", "NFKC", or "NFKD".
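The normalization forms accepted by unicode_normalization are the standard Unicode forms exposed by Python's unicodedata module. For example, NFKC folds Arabic presentation-form ligatures into their base letters (the specific code points below are illustrative, not taken from plugin output):

```python
import unicodedata

# U+FEFB is the isolated lam-alef presentation ligature; NFKC
# decomposes it into the two base letters lam (U+0644) + alef (U+0627)
ligature = "\ufefb"
folded = unicodedata.normalize("NFKC", ligature)
```

Downstream text matching (e.g. keyword search against the transcript) generally behaves more predictably when both sides are normalized to the same form.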