lid-embedplda-v2 (Language Identification)
Version Changelog
Plugin Version | Change |
---|---|
v2.0.0 | Initial plugin release, functionally identical to v1.0.0, but updated to be compatible with OLIVE 5.0.0 |
v2.0.1 (latest) | Updated to be compatible with OLIVE 5.1.0 |
Description
LID plugins detect one or more language or dialect classes in an audio segment, producing a global score for each class. A plugin domain may cover 50 or more languages and dialects, or as few as one when the customer is focused on a single target class. Some plugin domains focus solely on dialect or sub-language recognition, such as the languages of China. Several LID plugins allow users to add new classes, or to augment existing classes with more data to improve accuracy.
Language recognition plugin for clean telephone or microphone data. It is based on a language-embedding DNN fed with acoustic DNN bottleneck features, with language classification performed by a PLDA backend and duration-aware calibration. This plugin has been reconfigured to allow enrollment and the addition of new classes. Unsupervised adaptation through target mean normalization, as well as supervised PLDA and calibration updates from enrollments, is implemented via the update function. These updates must be invoked by the user through the API.
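As a rough illustration of the idea behind target mean normalization, the sketch below estimates a mean embedding from unlabeled target-domain audio and subtracts it from each new embedding before backend scoring. This is a generic sketch, not the plugin's implementation: the function names are hypothetical, and real embeddings would come from the plugin's DNN rather than from hand-written lists.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(embeddings: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty set of equal-length embeddings."""
    if not embeddings:
        raise ValueError("need at least one embedding to estimate a mean")
    dim = len(embeddings[0])
    totals = [0.0] * dim
    for emb in embeddings:
        if len(emb) != dim:
            raise ValueError("all embeddings must have the same dimension")
        for i, value in enumerate(emb):
            totals[i] += value
    return [t / len(embeddings) for t in totals]


def target_mean_normalize(embedding: Sequence[float], target_mean: Sequence[float]) -> Vector:
    """Center an embedding on the target-domain mean (hypothetical helper)."""
    return [x - m for x, m in zip(embedding, target_mean)]


adaptation = [[1.0, 2.0], [3.0, 4.0]]  # unlabeled target-domain embeddings
mu = mean_vector(adaptation)  # [2.0, 3.0]
print(target_mean_normalize([2.5, 2.0], mu))  # [0.5, -1.0]
```

Because the mean is estimated without labels, this kind of adaptation can run on in-domain audio that has not been annotated, which is what distinguishes it from the supervised PLDA and calibration updates.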
Domains
- multi-v1
- Generic domain for most close-talking conditions with a signal-to-noise ratio above 10 dB. Currently set up with 10 languages configured (optionally configurable to up to 63 languages). See below for the currently configured and available languages, and see the Configuring Languages section for instructions on reconfiguring the available languages if necessary.
Inputs
Audio file or buffer and an optional identifier.
Outputs
Generally, a list of scores for all classes in the domain, for the entire segment. As with SAD and SID, scores are generally log-likelihood ratios, where a score greater than 0 is considered a detection. Plugins may be altered to return only detections rather than a list of classes and scores, but this filtering is generally done on the client side for the sake of flexibility.
An example output excerpt:
input-audio.wav amh -19.9012123573
input-audio.wav ara -15.8882738579
input-audio.wav cmn -15.5530382622
input-audio.wav eng -14.1870705116
input-audio.wav fas -17.3224474419
input-audio.wav fre 10.1847232353
input-audio.wav hau -15.1134468544
input-audio.wav jpn -21.0655495155
input-audio.wav kor -18.3601671684
input-audio.wav pus -16.2738787163
input-audio.wav rus -10.4046117294
input-audio.wav spa -18.1588427055
input-audio.wav tur -14.0825478065
input-audio.wav urd -20.4127785194
input-audio.wav vie -18.552107476
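Because the plugin returns a score for every class and leaves thresholding to the client, a client-side filter might look like the sketch below. It assumes the whitespace-separated `<audio> <class> <score>` layout shown above and uses 0 as the default detection threshold; `parse_scores` and `detections` are hypothetical helper names, not part of the OLIVE API.

```python
from typing import Dict, List, Tuple


def parse_scores(text: str) -> Dict[str, Dict[str, float]]:
    """Parse '<audio> <class> <score>' lines into {audio: {class: score}}."""
    results: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from the right so audio names containing spaces still parse.
        audio, cls, score = line.rsplit(None, 2)
        results.setdefault(audio, {})[cls] = float(score)
    return results


def detections(scores: Dict[str, float], threshold: float = 0.0) -> List[Tuple[str, float]]:
    """Classes whose log-likelihood ratio exceeds the threshold, best first."""
    hits = [(cls, s) for cls, s in scores.items() if s > threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


example = """\
input-audio.wav eng -14.1870705116
input-audio.wav fre 10.1847232353
input-audio.wav rus -10.4046117294
"""
parsed = parse_scores(example)
print(detections(parsed["input-audio.wav"]))  # [('fre', 10.1847232353)]
```

Keeping the threshold as a parameter lets each client trade misses against false alarms without changing the plugin.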
Functionality (Traits)
The functions of this plugin are defined by its Traits and implemented API messages. A list of these Traits is below.
- GLOBAL_SCORER – Score all submitted audio, returning a single score for the entire audio segment for each of the enrolled and enabled languages of interest.
- CLASS_MODIFIER – Enroll new language models or augment existing language models with additional data.
Compatibility
OLIVE 5.1+
Limitations
Known or potential limitations of the plugin are outlined below.
All current LID plugins assume that an audio segment contains only a single language and may be scored as a unit; if a segment contains multiple languages, the entire segment is still scored as a unit. In many cases, a minimum speech duration of 2 seconds is required in order to output scores. This value can optionally be overridden, but scores produced for such short segments will be volatile.
Minimum Speech Duration
The system will only attempt to perform language identification if the submitted audio segment contains more than 2 seconds of detected speech.
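The rule can be expressed as a simple client-side check, for example to skip submissions that would return no scores. The sketch assumes speech regions are already available as non-overlapping `(start, end)` pairs in seconds (for instance from a SAD plugin); the helper names and the `min_speech_seconds` parameter are hypothetical, and the comparison is strict to match "more than 2 seconds" above.

```python
from typing import Iterable, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds) of detected speech


def speech_duration(segments: Iterable[Segment]) -> float:
    """Total detected speech in seconds, assuming non-overlapping segments."""
    total = 0.0
    for start, end in segments:
        if end < start:
            raise ValueError(f"segment ends before it starts: {(start, end)}")
        total += end - start
    return total


def should_score(segments: Iterable[Segment], min_speech_seconds: float = 2.0) -> bool:
    """Score only if detected speech strictly exceeds the minimum duration."""
    return speech_duration(segments) > min_speech_seconds


print(should_score([(0.0, 1.2), (3.0, 3.5)]))  # False: 1.7 s of speech
print(should_score([(0.0, 1.5), (2.0, 3.0)]))  # True: 2.5 s of speech
```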
Languages of Low Confidence
Many of the language models included in the domain's data model are hidden (disabled by default) because they were trained on too little data to serve as reliable detectors for their language; they are included solely to help with score calibration and with differentiating other languages. If in doubt about whether an enrolled language should be used for detection, please reach out to SRI for clarification.
Comments
Language/Dialect Detection Granularity
LID plugins that are capable of dialect detection typically include functionality to fall back to the base language class when confidence is limited. This is done by outputting scores for all dialects (e.g. ara-arz, ara-apc, and ara-arb) as well as for the base language (e.g. ara). Note that no language with dialect information has its base class enrolled; instead, the base language score is derived from the maximum of the dialect detector scores for that base language available within the plugin (whether exposed or not). If the top dialect score is sufficiently high, the base language score is set 0.001 lower than the highest-scoring dialect; otherwise, it is set 0.001 higher than the highest-scoring dialect. In this way, labeling the audio sample with the maximum-scoring class indicates a specific dialect when the plugin is confident, and the base language otherwise. This default mode is called BASEAPPEND. Two alternate modes can optionally be set:
- BASEAPPEND - Default behavior, described above.
- BASEONLY – Output only base language scores formed by the maximum of the dialect-specific scores for a given base language.
- STANDARD – Output scores based on enrolled classes without producing a base language summarization for dialect-compatible detectors.
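The three modes can be sketched as a post-processing step over a dictionary of class scores. This is an illustration under stated assumptions, not the plugin's code: the dialect-to-base mapping is hypothetical, and because the text does not define "sufficiently high", the sketch assumes a dialect counts as confident when its score exceeds a `confidence_threshold` of 0 (the usual log-likelihood-ratio detection point).

```python
from typing import Dict, Mapping

# Hypothetical mapping for illustration; the plugin derives this internally.
DIALECT_TO_BASE: Dict[str, str] = {"arz": "ara", "apc": "ara", "arb": "ara"}
OFFSET = 0.001  # documented margin between the base score and the best dialect


def apply_granularity(
    scores: Mapping[str, float],
    mode: str = "BASEAPPEND",
    dialect_to_base: Mapping[str, str] = DIALECT_TO_BASE,
    confidence_threshold: float = 0.0,  # assumed meaning of "sufficiently high"
) -> Dict[str, float]:
    if mode == "STANDARD":
        # Enrolled classes only; no base language summarization.
        return dict(scores)

    # Best dialect score per base language.
    best: Dict[str, float] = {}
    for cls, score in scores.items():
        base = dialect_to_base.get(cls)
        if base is not None and score > best.get(base, float("-inf")):
            best[base] = score

    if mode == "BASEONLY":
        # Keep non-dialect classes; replace dialects by their base language.
        out = {c: s for c, s in scores.items() if c not in dialect_to_base}
        out.update(best)
        return out

    if mode == "BASEAPPEND":
        out = dict(scores)
        for base, top in best.items():
            # Confident dialect: base sits just below it, so the dialect wins.
            # Otherwise: base sits just above it, so the base language wins.
            out[base] = top - OFFSET if top > confidence_threshold else top + OFFSET
        return out

    raise ValueError(f"unknown granularity mode: {mode}")


scores = {"arz": 2.0, "apc": -1.0, "arb": 0.5, "eng": -3.0}
out = apply_granularity(scores)
print(max(out, key=out.get))  # arz (ara is set to 1.999)
```

With weak dialect scores (for example all below 0), the same `max` lookup returns `ara` instead, which is the fallback behavior described above.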
Enrollments
Some recent LID plugins allow class modifications. A class modification is essentially an enrollment capability, similar to SID. A new enrollment is created with the first class modification request (sending the system audio with a language label, generally 30 seconds or more per cut) and becomes usable once sufficient cuts have been provided (approximately 10). In general, 30 minutes of audio from around 30 samples is the minimum amount of data required to produce a reasonable language model. An enrollment can be augmented with subsequent class modification requests that add more audio from the same language to the existing class, again like SID or SDD. In addition to user-enrolled languages, most LID plugins are supplied with several pre-enrolled languages. Users can augment these existing languages with their own data by enrolling audio under the same label as an existing language.
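Before sending class modification requests, a client might check a batch of cuts against the rough figures above. The sketch below is a hypothetical pre-flight helper that works only on cut durations in seconds; the thresholds restate the approximate guidance in this section and are not limits enforced by the plugin.

```python
from dataclasses import dataclass
from typing import Sequence

# Approximate figures from the enrollment guidance; not enforced by the plugin.
MIN_CUT_SECONDS = 30.0
USABLE_CUTS = 10
RECOMMENDED_CUTS = 30
RECOMMENDED_TOTAL_SECONDS = 30 * 60


@dataclass
class EnrollmentStatus:
    short_cuts: int  # cuts shorter than MIN_CUT_SECONDS
    usable: bool  # roughly enough cuts for the class to become usable
    recommended: bool  # meets the ~30 cuts / ~30 minutes guideline
    total_minutes: float


def check_enrollment(cut_seconds: Sequence[float]) -> EnrollmentStatus:
    """Summarize a batch of cut durations against the rough guidance."""
    total = sum(cut_seconds)
    return EnrollmentStatus(
        short_cuts=sum(1 for d in cut_seconds if d < MIN_CUT_SECONDS),
        usable=len(cut_seconds) >= USABLE_CUTS,
        recommended=len(cut_seconds) >= RECOMMENDED_CUTS
        and total >= RECOMMENDED_TOTAL_SECONDS,
        total_minutes=total / 60.0,
    )


print(check_enrollment([45.0] * 12))
# EnrollmentStatus(short_cuts=0, usable=True, recommended=False, total_minutes=9.0)
```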
Configuring Languages
Most LID plugins can reconfigure the languages available in a domain. To do so, enter the directory of the domain of interest within the plugin folder from the command line and call
$ ./configure_languages.py
to get all languages or
$ ./configure_languages.py lang1,lang2,…,langN
for a subset of available languages. Please note that running ./configure_languages.py without any arguments should be done with extreme care. This enables all languages and dialects in the domain, including those that were included solely for their utility in score calibration and may not have enough training data to act as reliable detectors. Enabling all languages may adversely affect the plugin’s performance. This plugin also supports adjusting the language detection granularity discussed above, though this is intended for advanced users only. An example of changing this setting using the configure_languages.py script is
$ ./configure_languages.py lang1,lang2,...,langN BASEONLY
where the setting is one of BASEAPPEND, BASEONLY, or STANDARD, as described under Language/Dialect Detection Granularity above.
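For scripted deployments, the command lines above can be wrapped so that the risky no-argument form is never issued by accident. This is a hypothetical Python wrapper, not a tool shipped with the plugin; it assumes `configure_languages.py` is executable in the domain directory and accepts the comma-separated language list and optional mode exactly as shown above. `domain_dir` is whatever path the domain folder has in your installation.

```python
import subprocess
from pathlib import Path
from typing import List, Optional, Sequence

GRANULARITY_MODES = {"BASEAPPEND", "BASEONLY", "STANDARD"}


def build_configure_args(languages: Sequence[str], mode: Optional[str] = None) -> List[str]:
    """Build the configure_languages.py command line for a language subset."""
    if not languages:
        # An empty list would enable every language; require that to be explicit.
        raise ValueError("refusing to enable all languages; pass codes explicitly")
    if any("," in code or not code.strip() for code in languages):
        raise ValueError("language codes must be non-empty and comma-free")
    args = ["./configure_languages.py", ",".join(languages)]
    if mode is not None:
        if mode not in GRANULARITY_MODES:
            raise ValueError(f"unknown granularity mode: {mode}")
        args.append(mode)
    return args


def configure_domain(domain_dir: Path, languages: Sequence[str], mode: Optional[str] = None) -> None:
    """Run the script from inside the domain directory; raise if it fails."""
    subprocess.run(build_configure_args(languages, mode), cwd=domain_dir, check=True)


print(build_configure_args(["eng", "fre", "rus"], "BASEONLY"))
# ['./configure_languages.py', 'eng,fre,rus', 'BASEONLY']
```

Enabling every language remains possible by calling the script directly, which keeps that deliberate choice outside the automated path.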
Default Enabled Languages
The following languages are identified as high-confidence languages, supported by a sufficient amount of training data to make them reliable language detectors. As such, they are enabled by default in the plugin as-delivered, and serve as a general purpose base language set.
Language Code | Language Name |
---|---|
amh | Amharic |
arz | Egyptian Arabic |
apc | North Levantine Arabic |
arb | Modern Standard Arabic |
cmn | Mandarin Chinese |
yue | Yue Chinese |
eng | English |
fas | Farsi |
fre | French |
jpn | Japanese |
kor | Korean |
pus | Pashto |
rus | Russian |
spa | Spanish |
tgl | Tagalog |
tha | Thai |
tur | Turkish |
urd | Urdu |
vie | Vietnamese |
Supported Languages
The full list of languages that exist as enrolled classes within this plugin as delivered is provided in the table below. As mentioned previously, not all of these languages were enrolled with enough data to serve as reliable detectors; they remain in the domain because they help differentiate other languages and aid score calibration. If in doubt about whether an enrolled language should be used for detection, please reach out to SRI for clarification.
Language Code | Language Name |
---|---|
alb | Albanian |
amh | Amharic |
arz | Egyptian Arabic |
apc | North Levantine Arabic |
arb | Modern Standard Arabic |
aze | Azerbaijani |
bel | Belarusian |
ben | Bengali |
bos | Bosnian |
bul | Bulgarian |
cmn | Mandarin Chinese |
yue | Yue Chinese |
eng | English |
fas | Farsi |
fre | French |
geo | Georgian |
ger | German |
gre | Greek |
hau | Hausa |
hrv | Croatian |
ind | Indonesian |
ita | Italian |
jpn | Japanese |
khm | Khmer |
kor | Korean |
mac | Macedonian |
mya | Burmese |
nde | Ndebele |
orm | Oromo |
pan | Punjabi |
pol | Polish |
por | Portuguese |
prs | Dari |
pus | Pashto |
ron | Romanian |
rus | Russian |
sna | Shona |
som | Somali |
spa | Spanish |
srp | Serbian |
swa | Swahili |
tam | Tamil |
tgl | Tagalog |
tha | Thai |
tib | Tibetan |
tir | Tigrinya |
tur | Turkish |
ukr | Ukrainian |
urd | Urdu |
uzb | Uzbek |
vie | Vietnamese |
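Before running `configure_languages.py` with a subset, the requested codes can be checked against the two tables above. The sketch below hard-codes those codes (so it must be kept in sync with the plugin actually installed) and flags anything outside the default-enabled set as a candidate to confirm with SRI before relying on it as a detector; `check_subset` is a hypothetical helper name.

```python
from typing import Iterable, List, Tuple

# Codes copied from the Default Enabled Languages table.
DEFAULT_ENABLED = {
    "amh", "arz", "apc", "arb", "cmn", "yue", "eng", "fas", "fre", "jpn",
    "kor", "pus", "rus", "spa", "tgl", "tha", "tur", "urd", "vie",
}
# Supported Languages table = default-enabled codes plus these.
SUPPORTED = DEFAULT_ENABLED | {
    "alb", "aze", "bel", "ben", "bos", "bul", "geo", "ger", "gre", "hau",
    "hrv", "ind", "ita", "khm", "mac", "mya", "nde", "orm", "pan", "pol",
    "por", "prs", "ron", "sna", "som", "srp", "swa", "tam", "tib", "tir",
    "ukr", "uzb",
}


def check_subset(codes: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split requested codes into (unknown, needs_review) lists."""
    requested = list(codes)
    unknown = [c for c in requested if c not in SUPPORTED]
    needs_review = [c for c in requested if c in SUPPORTED and c not in DEFAULT_ENABLED]
    return unknown, needs_review


print(check_subset(["eng", "alb", "xyz"]))  # (['xyz'], ['alb'])
```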
Global Options
This plugin does not feature user-configurable option parameters. It does, however, offer configurable language models and language-reporting granularity; for details, see the Configuring Languages and Language/Dialect Detection Granularity sections above.