lid-embedplda-v2 (Language Identification)

Version Changelog

Plugin Version	Change
v2.0.0	Initial plugin release, functionally identical to v1.0.0, but updated to be compatible with OLIVE 5.0.0
v2.0.1 (latest)	Updated to be compatible with OLIVE 5.1.0

Description

LID plugins detect one or more language or dialect classes in an audio segment, reporting a single global score per class. A plugin domain may cover 50 or more languages and dialects, or as few as one for use cases where the customer is focused on a single target class. Some plugin domains focus solely on dialect or sub-language recognition, such as languages of China. Several LID plugins allow users to add new classes, or to augment existing classes with more data to improve accuracy.

Language recognition plugin for clean telephone or microphone data, based on a language embeddings DNN fed with acoustic DNN bottleneck features, with language classification using a PLDA backend and duration-aware calibration. This plugin has been reconfigured to allow enrollment and addition of new classes. Unsupervised adaptation through target mean normalization, as well as supervised PLDA and calibration updates from enrollments, are implemented via the update function. These updates must be invoked by the user via the API.

Domains

multi-v1
- Generic domain for most close-talking conditions with a signal-to-noise ratio above 10 dB. Currently set up with 10 languages configured (optionally configurable to up to 63 languages). See below for the currently configured and available languages. See the Configuring Languages section for instructions on reconfiguring the available languages if necessary.

Inputs

Audio file or buffer and an optional identifier.

Outputs

Generally, a list of scores for all classes in the domain, for the entire segment. As with SAD and SID, scores are generally log-likelihood ratios, where a score greater than 0 is considered a detection. Plugins may be altered to return only detections, rather than a list of classes and scores, but this filtering is generally done on the client side for the sake of flexibility.

An example output excerpt:

    input-audio.wav amh -19.9012123573
    input-audio.wav ara -15.8882738579
    input-audio.wav cmn -15.5530382622
    input-audio.wav eng -14.1870705116
    input-audio.wav fas -17.3224474419
    input-audio.wav fre 10.1847232353
    input-audio.wav hau -15.1134468544
    input-audio.wav jpn -21.0655495155
    input-audio.wav kor -18.3601671684
    input-audio.wav pus -16.2738787163
    input-audio.wav rus -10.4046117294
    input-audio.wav spa -18.1588427055
    input-audio.wav tur -14.0825478065
    input-audio.wav urd -20.4127785194
    input-audio.wav vie -18.552107476
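Since reducing the score list to detections is usually left to the client, the following is a minimal Python sketch of that step. It assumes the plain-text layout of the excerpt above (file name, class label, and score separated by whitespace, with no spaces in the file name). The function names `parse_scores` and `detections` are illustrative only and are not part of the OLIVE API.

```python
from typing import Dict, List, Tuple


def parse_scores(lines: List[str]) -> Dict[str, float]:
    """Map each class label to its score, assuming 'file label score' rows."""
    scores: Dict[str, float] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        _, label, score = parts
        scores[label] = float(score)
    return scores


def detections(scores: Dict[str, float],
               threshold: float = 0.0) -> List[Tuple[str, float]]:
    """Return (label, score) pairs above the threshold, highest score first."""
    hits = [(label, s) for label, s in scores.items() if s > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Three rows taken from the example excerpt above.
example = [
    "input-audio.wav eng -14.1870705116",
    "input-audio.wav fre 10.1847232353",
    "input-audio.wav rus -10.4046117294",
]
print(detections(parse_scores(example)))
```

With these three rows, only `fre` scores above 0, so it is the only class reported as a detection.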

Functionality (Traits)

The functions of this plugin are defined by its Traits and implemented API messages. These Traits are listed below, along with the corresponding API messages for each.

GLOBAL_SCORER – Score all submitted audio, returning a single score for the entire audio segment for each of the enrolled and enabled languages of interest.
- GlobalScorerRequest
CLASS_MODIFIER – Enroll new language models or augment existing language models with additional data.
- ClassModificationRequest
- ClassRemovalRequest

Compatibility

OLIVE 5.1+

Limitations

Known or potential limitations of the plugin are outlined below.

All current LID plugins assume that an audio segment contains only a single language and may be scored as a unit. If a segment contains multiple languages, the entire segment will still be scored as a unit. In many cases, a minimum speech duration of 2 seconds is required in order to output scores. This value can optionally be overridden, but scores provided for such short segments will be volatile.

Minimum Speech Duration

The system will only attempt to perform language identification if the submitted audio segment contains more than 2 seconds of detected speech.

Languages of Low Confidence

Many of the language models that are included in the domain's data model but hidden (disabled by default) were not trained with enough data for reliable detection of the corresponding language. They are included solely to help with score calibration and with differentiating other languages. If in doubt regarding whether an enrolled language should be used for detection or not, please reach out to SRI for clarification.

Comments

Language/Dialect Detection Granularity

LID plugins that are capable of dialect detection typically include functionality to fall back to the base language class in the case of limited confidence. This is typically done by outputting scores for all dialects (e.g. ara-arz, ara-apc, and ara-arb) as well as the base language (e.g. ara). Note that a language with dialect information does not have its base class enrolled; instead, the base language score is derived from the maximum of the dialect detectors available for that base language within the plugin (whether exposed or not). If a dialect score is sufficiently high, the base language score is set to 0.001 lower than the highest-scoring dialect; otherwise, the base language score is set to 0.001 higher than the highest-scoring dialect. In this way, labelling the audio sample with the maximum-scoring class will indicate a specific dialect if confident, and otherwise the base language. This default mode is defined as BASEAPPEND. There are two alternate modes available that can optionally be set:

BASEAPPEND – Default behavior, described above.
BASEONLY – Output only base language scores formed by the maximum of the dialect-specific scores for a given base language.
STANDARD – Output scores based on enrolled classes without producing a base language summarization for dialect-compatible detectors.
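The base-language fallback described above can be sketched in Python as follows. This is an illustration of the described behavior, not the plugin's implementation: the text does not state what counts as "sufficiently high", so `confidence_threshold` is a hypothetical parameter, and the function names are made up for this example.

```python
from typing import Dict

EPSILON = 0.001  # offset between base and best dialect score, per the text


def base_append(dialect_scores: Dict[str, float], base: str,
                confidence_threshold: float) -> Dict[str, float]:
    """BASEAPPEND-style output: dialect scores plus a derived base score.

    confidence_threshold is an assumed stand-in for "sufficiently high".
    """
    best = max(dialect_scores.values())
    if best >= confidence_threshold:
        base_score = best - EPSILON  # confident: the dialect wins the argmax
    else:
        base_score = best + EPSILON  # not confident: the base language wins
    out = dict(dialect_scores)
    out[base] = base_score
    return out


def base_only(dialect_scores: Dict[str, float], base: str) -> Dict[str, float]:
    """BASEONLY-style output: only the base score, the max over dialects."""
    return {base: max(dialect_scores.values())}


confident = base_append({"ara-arz": 5.0, "ara-apc": 1.0}, "ara", 3.0)
uncertain = base_append({"ara-arz": 1.0, "ara-apc": 0.5}, "ara", 3.0)
print(max(confident, key=confident.get))  # ara-arz
print(max(uncertain, key=uncertain.get))  # ara
```

Taking the maximum-scoring label therefore yields the dialect in the first case and the base language in the second, which is the labelling behavior the default mode is meant to produce.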

Enrollments

Some recent LID plugins allow class modifications. A class modification is essentially an enrollment capability similar to SID. A new enrollment is created with the first class modification request (sending the system audio with a language label, generally 30 seconds or more per cut), and becomes usable when sufficient cuts have been provided (approximately 10). In general, 30 minutes from around 30 samples is the minimum amount of data required to produce a reasonable language model. This enrollment can be augmented with subsequent class modification requests by adding more audio from the same language to an existing class, again, like SID or SDD. In addition to user-enrolled languages, most LID plugins are supplied with several pre-enrolled languages. Users can augment these existing languages with their own data by enrolling audio with the same label as an existing language.
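As a rough aid for planning enrollment data, the guidance above can be turned into a small client-side check. The sketch below is only an interpretation of those rules of thumb: the constants mirror the approximate figures in the text, ignoring cuts shorter than 30 seconds is an assumption made here, and `enrollment_status` is a hypothetical helper, not an OLIVE API call.

```python
from typing import List

MIN_CUT_SECONDS = 30.0               # "generally 30 seconds or more per cut"
MIN_CUTS_USABLE = 10                 # "approximately 10" cuts to become usable
RECOMMENDED_TOTAL_SECONDS = 30 * 60  # "30 minutes from around 30 samples"


def enrollment_status(cut_durations: List[float]) -> str:
    """Classify a set of enrollment cuts against the documented guidance."""
    # Assumption: cuts shorter than the per-cut guideline are not counted.
    usable = [d for d in cut_durations if d >= MIN_CUT_SECONDS]
    if len(usable) < MIN_CUTS_USABLE:
        return "not yet usable"
    if sum(usable) < RECOMMENDED_TOTAL_SECONDS:
        return "usable, below recommended data"
    return "meets guidance"


print(enrollment_status([60.0] * 5))   # not yet usable
print(enrollment_status([30.0] * 10))  # usable, below recommended data
print(enrollment_status([60.0] * 30))  # meets guidance
```

The actual point at which an enrollment becomes usable is decided by the plugin; this check only helps estimate whether more audio should be collected before enrolling.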

Configuring Languages

Most LID plugins have the ability to re-configure the languages available in a domain. Configuring languages in the domain can be done by entering the domain directory of interest within the plugin folder using the command line interface and calling

    $ ./configure_languages.py

to get all languages or

    $ ./configure_languages.py lang1,lang2,…,langN

for a subset of available languages. Please note that running ./configure_languages.py without any arguments should be done with extreme care. This will enable all languages and dialects in the domain, including those that were included solely for their utility in score calibration and that may not have enough training data to act as reliable detectors. Enabling all languages may adversely affect the plugin’s performance. This plugin supports adjusting the language detection granularity discussed above, though this is for advanced users only. An example of changing this setting using the configure_languages.py script is

    $ ./configure_languages.py lang1,lang2,...,langN BASEONLY

where the final argument is one of the granularity modes described above under Language/Dialect Detection Granularity.
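When the configuration step is scripted, it can help to assemble the command with a guard against accidentally enabling every language. The Python sketch below only builds the argument list in the form shown in the examples above; `configure_command` is a hypothetical helper, and passing BASEAPPEND or STANDARD as the final argument in the same position as BASEONLY is an assumption.

```python
from typing import List, Optional

# Granularity modes named in this document.
MODES = {"BASEAPPEND", "BASEONLY", "STANDARD"}


def configure_command(languages: List[str],
                      mode: Optional[str] = None) -> List[str]:
    """Build the argument list for ./configure_languages.py."""
    if not languages:
        # Running the script with no arguments enables ALL languages.
        raise ValueError("refusing to build a command with no languages")
    if mode is not None and mode not in MODES:
        raise ValueError(f"unknown granularity mode: {mode}")
    cmd = ["./configure_languages.py", ",".join(languages)]
    if mode is not None:
        cmd.append(mode)
    return cmd


print(configure_command(["eng", "fre", "spa"], "BASEONLY"))
# ['./configure_languages.py', 'eng,fre,spa', 'BASEONLY']
```

The resulting list could then be run from the domain directory, for example with `subprocess.run(cmd, cwd=domain_dir, check=True)`, where `domain_dir` is the path to the domain of interest within the plugin folder.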

Default Enabled Languages

The following languages are identified as high-confidence languages, supported by a sufficient amount of training data to make them reliable language detectors. As such, they are enabled by default in the plugin as-delivered, and serve as a general purpose base language set.

Language Code	Language Name
amh	Amharic
arz	Egyptian Arabic
apc	North Levantine Arabic
arb	Modern Standard Arabic
cmn	Mandarin Chinese
yue	Yue Chinese
eng	English
fas	Farsi
fre	French
jpn	Japanese
kor	Korean
pus	Pashto
rus	Russian
spa	Spanish
tgl	Tagalog
tha	Thai
tur	Turkish
urd	Urdu
vie	Vietnamese

Supported Languages

The full list of languages that exist as enrolled classes within this plugin as delivered is provided in the chart below. Note that, as mentioned previously, not all of these languages were enrolled with enough data to serve as reliable detectors; they remain in the domain because they help differentiate other languages and support score calibration. If in doubt regarding whether an enrolled language should be used for detection or not, please reach out to SRI for clarification.

Language Code	Language Name
alb	Albanian
amh	Amharic
arz	Egyptian Arabic
apc	North Levantine Arabic
arb	Modern Standard Arabic
aze	Azerbaijani
bel	Belarusian
ben	Bengali
bos	Bosnian
bul	Bulgarian
cmn	Mandarin Chinese
yue	Yue Chinese
eng	English
fas	Farsi
fre	French
geo	Georgian
ger	German
gre	Greek
hau	Hausa
hrv	Croatian
ind	Indonesian
ita	Italian
jpn	Japanese
khm	Khmer
kor	Korean
mac	Macedonian
mya	Burmese
nde	Ndebele
orm	Oromo
pan	Punjabi
pol	Polish
por	Portuguese
prs	Dari
pus	Pashto
ron	Romanian
rus	Russian
sna	Shona
som	Somali
spa	Spanish
srp	Serbian
swa	Swahili
tam	Tamil
tgl	Tagalog
tha	Thai
tib	Tibetan
tir	Tigrinya
tur	Turkish
ukr	Ukrainian
urd	Urdu
uzb	Uzbek
vie	Vietnamese

Global Options

This plugin does not feature user-configurable option parameters. It does, however, offer configurable language models and language-reporting granularity. For details, refer to the Configuring Languages section above.