tmt-neural-v1 (Text Machine Translation)

Version Changelog

Plugin Version	Change
v1.0.0	Initial plugin release with OLIVE 5.4.0
v1.1.0	Updated plugin that adds GPU support, minor bug fixes, and additional domains. Tested and released with OLIVE 5.5.0
v1.1.1	Updated to add Iraqi Arabic to English as an available domain. Tested and released with OLIVE 5.5.1

Description

Text Machine Translation plugins perform translation of text from one language to another, typically from one language, specified by the domain, to English. All TMT domains are language-dependent, and each one is meant to work only with a single, specific language. TMT plugins may have some limitations or special processing considerations depending on the language(s) involved and their native alphabets, as well as the training data used to train the underlying models.

This is the first neural Text Machine Translation plugin released for the OLIVE architecture. It offers neural-model based translation of text, using a pure C++ neural machine translation toolkit as its base. Each domain provides translation for a single language pair; the currently available pairs are listed under Domains (Supported Languages). The input and output formats match those shown in the examples below.

The goal of this plugin is to ingest text in one language, defined by the specified domain, and translate it to another language. For most of the currently available domains, the destination language is English.

An example input string, in Spanish:

    manual de fotografía para aprender todo lo esencial sobre fujifilm cámaras

Note that some punctuation and special characters will be stripped from the input during preprocessing.

An example output string translation of the above example, as provided by the spa-eng-generic-v2 domain:

    manual of photography to learn everything essential about unk cameras

Words that the system does not recognize or can't translate will be marked up with an unk tag, as can be seen in the example output above. Note that all system output will be lowercase for case-sensitive languages, and apart from the unk tag, all output will be devoid of punctuation.

Domains (Supported Languages)

irq-eng-nmt-v1
- Translates Iraqi Arabic text into English text.
cmn-eng-nmt-v1
- Translates Mandarin text into English text.
rus-eng-nmt-v1
- Translates Russian text into English text.
spa-eng-nmt-v3
- Translates Spanish text into English text.
ukr-eng-nmt-v3
- Translates Ukrainian text into English text.
eng-cmn-nmt-v1
- Translates English text into Mandarin text.
eng-rus-nmt-v1
- Translates English text into Russian text.
eng-spa-nmt-v3
- Translates English text into Spanish text.
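The domain names listed above appear to follow a source-target-nmt-version pattern. The sketch below shows one way a client script might pick a domain for a language pair under that assumption. The find_domain helper and the DOMAINS list are hypothetical conveniences for illustration and are not part of the OLIVE API.

```python
from typing import List, Optional

# Domain names copied from the list above.
DOMAINS: List[str] = [
    "irq-eng-nmt-v1",
    "cmn-eng-nmt-v1",
    "rus-eng-nmt-v1",
    "spa-eng-nmt-v3",
    "ukr-eng-nmt-v3",
    "eng-cmn-nmt-v1",
    "eng-rus-nmt-v1",
    "eng-spa-nmt-v3",
]


def find_domain(source: str, target: str,
                domains: List[str] = DOMAINS) -> Optional[str]:
    """Return the first domain whose name starts with '<source>-<target>-'.

    Assumes the naming pattern inferred from the list above; returns None
    when no domain covers the requested language pair.
    """
    prefix = f"{source}-{target}-"
    for name in domains:
        if name.startswith(prefix):
            return name
    return None
```

For example, find_domain("spa", "eng") selects the Spanish-to-English domain, while a pair with no matching domain, such as ("eng", "ukr"), yields None.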

Inputs

For scoring, a text string or text-populated file is required. Neither OLIVE nor the TMT plugins verify that the text passed as input is actually in the language that the domain is capable of translating. The burden lies on the user to manually or automatically screen this text before attempting to translate it. Note that translation may fail or the output may be very confusing if the input language does not match the domain's capabilities.

An example input string:

    manual de fotografía para aprender todo lo esencial sobre fujifilm cámaras

Some preprocessing of the input occurs before a string is sent to translation: all input is lower-cased, and frequent punctuation marks such as commas, exclamation marks, and question marks are stripped from the string. However, no spelling error correction of any kind is performed.
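To get a feel for what the plugin is likely to see after preprocessing, the behavior described above can be approximated on the client side. This is only a rough sketch: the plugin's exact punctuation set is not documented here, so the characters in FREQUENT_PUNCTUATION (including the inverted Spanish marks) and the whitespace cleanup are assumptions, and approximate_preprocess is a hypothetical helper name.

```python
# ASSUMPTION: the text above names commas, exclamation marks, and question
# marks; the inverted Spanish marks are added here as a guess.
FREQUENT_PUNCTUATION = ",!?¡¿"

# Translation table that deletes every character in FREQUENT_PUNCTUATION.
_STRIP_TABLE = str.maketrans("", "", FREQUENT_PUNCTUATION)


def approximate_preprocess(text: str) -> str:
    """Lower-case the text and strip the assumed punctuation set."""
    lowered = text.lower()
    stripped = lowered.translate(_STRIP_TABLE)
    # Collapse runs of whitespace left behind by removed characters
    # (this cleanup step is an assumption, not documented behavior).
    return " ".join(stripped.split())


if __name__ == "__main__":
    print(approximate_preprocess("Manual de Fotografía, para aprender todo!"))
```

Because no spelling correction is performed, a sketch like this is mainly useful for previewing input, not for repairing it.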

Outputs

The output format for TMT plugins is simply text. Words that the system does not recognize or can't translate will be marked up with an unk tag, as can be seen in the example output below. Note that all system output will be lowercase, and apart from the unk tag, all output will be devoid of punctuation.

An example output string translation of the input example:

    manual of photography to learn everything essential about unk cameras

Functionality (Traits)

The functions of this plugin are defined by its Traits and implemented API messages. The Traits are listed below, along with the corresponding API messages for each. Additional implementation details are available for each message listed.

TextTransformer – Plugin accepts and analyzes a text string input, and produces a new text string as output. In the case of TMT plugins, a text string in the source language should be provided as input, with the expectation that the output will be a text string translated to the desired destination language.
- TextTransformRequest
- TextTransformResult

Compatibility

OLIVE 5.4+

Limitations

As the debut neural TMT plugin release for OLIVE, there are several known limitations that will impact the usage of this plugin.

Like ASR, TMT plugins are language dependent and also largely text domain dependent. The domains that a plugin can effectively cover are largely determined by the data used to train the translation models underlying each domain. Several factors can limit the vocabulary of a translation model, including the age of the text data used during development and the source or domain of this data (such as broadcast news transcripts versus social media posts).

Regarding performance, TMT plugins are especially sensitive to the tradeoff between accuracy and realistic resource requirements and runtime constraints. The size-on-disk of each domain can be quite large depending on the technology of the individual plugin, and some technologies/plugins may have additional constraints. See the individual plugin detail pages for more information.

Multi-Language Memory Constraints

Each domain currently uses a significant amount of memory once its translation models have been loaded. Without substantial memory resources available on the machine hosting OLIVE, it's possible to quickly run out of available memory and run into strange behavior and/or failures - especially when running TMT alongside other memory-intensive plugins like ASR.

Language Dependence

Each domain of this TMT plugin is language specific, and is only capable of translating text from one single language to one other. There is no filter or any sort of verification performed by OLIVE to ensure that the text passed to a domain is indeed in the correct language. This burden lies on the user to either manually or automatically triage input text if the source language is unknown, though it is often easy to spot mismatched input languages by the high number of output words that appear with the "unk" tag.
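As a rough illustration of spotting a mismatched input language, a client script could measure the share of unk tokens in the returned translation. This is a minimal sketch: the unk_ratio and looks_mismatched helpers are hypothetical, the exact form of the tag in real output is assumed to be the bare token "unk" as in the example output, and the 0.5 threshold is an arbitrary placeholder that would need tuning on real data.

```python
def unk_ratio(translation: str, unk_token: str = "unk") -> float:
    """Return the fraction of whitespace-separated tokens equal to unk_token.

    ASSUMPTION: the tag appears as the bare token "unk"; pass a different
    unk_token if the real output formats it differently.
    """
    tokens = translation.split()
    if not tokens:
        return 0.0
    unk_count = sum(1 for token in tokens if token == unk_token)
    return unk_count / len(tokens)


def looks_mismatched(translation: str, threshold: float = 0.5) -> bool:
    """Flag output whose unk share meets an (arbitrary) threshold."""
    return unk_ratio(translation) >= threshold
```

Applied to the example output in this document, which contains one unk among ten tokens, unk_ratio gives a low value and the translation is not flagged.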

Spelling Errors

Note that no spell-checking or other spelling-related preprocessing is performed on the input data. Therefore, any spelling mistakes in the input are likely to cause the system to output unk tags.

Out of Vocabulary (OOV) Words, Names

The individual words that the plugin is capable of recognizing and translating are determined by the vocabulary that the corresponding language model was trained with. This means that some uncommon or unofficial words, like slang or other types of colloquial speech, as well as names or places, may not be translatable by a plugin out-of-the-box. Several factors can limit the vocabulary of a language model, including the age of the text data used during development, the source or domain of this data (such as broadcast news transcripts versus social media posts), or pruning of the vocabulary to increase processing speed and/or reduce memory requirements.

At this time there is no provision in OLIVE for adding words or names to the vocabulary or to the translation model. Some of the models underlying this plugin support vocabulary addition like this, so it may be possible for this feature to be supported in the future with OLIVE modifications.

Input Expectations

This plugin has been trained to translate a single sentence at a time - to maximize the performance of this plugin, each scoring request input should be a single sentence. When longer segments, like paragraphs, pages, or even whole files are input, performance may be less than optimal, but the plugin will 'split' the data into 25-word chunks that it treats as sentences to minimize the negative impact. This is less than ideal, but performs much better on the whole than treating larger inputs as a single sentence and leaving them as-is.
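The chunking behavior described above can be pictured with a short sketch. The plugin's actual splitting logic is not documented here, so the whitespace-based word boundaries and the chunk_words helper name are assumptions; languages written without spaces, such as Mandarin, would need a different notion of a word.

```python
from typing import List


def chunk_words(text: str, max_words: int = 25) -> List[str]:
    """Split text into consecutive chunks of at most max_words words.

    ASSUMPTION: words are separated by whitespace. The 25-word default
    mirrors the chunk size mentioned in the text above.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]
```

Since a fixed-size chunk can cut a sentence in the middle, splitting input at real sentence boundaries before submitting it, one sentence per request, is likely to give better results than relying on this fallback.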

Comments

GPU Support

Please refer to the OLIVE GPU Installation and Support documentation page for instructions on how to enable and configure GPU capability in supported plugins. By default this plugin will run on CPU only.

Global Options

This plugin does not expose any options to the user.