tmt-statistical-v1 (Text Machine Translation)

Version Changelog

Plugin Version     Change
v1.0.0             Initial plugin release with OLIVE 5.0.0
v1.0.1 (latest)    Updated to be compatible with OLIVE 5.1.0

Description

Text Machine Translation (TMT) plugins translate text from a source language, specified by the domain, into a target language, typically English. All TMT domains are language-dependent, and each one is meant to work only with a single, specific source language. TMT plugins may have limitations or special processing considerations depending on the language(s) involved and their native alphabets, as well as on the data used to train the underlying models.

This is the first Text Machine Translation plugin released for the OLIVE architecture. It offers statistical-model based translation of text, using SRI's SRInterp engine. Each domain provides translation from one language into English, with the two source languages currently available being Spanish and French. The input and output formats match those shown in the examples below.

The goal of this plugin is to ingest text in one language, defined by the specified domain, and translate it to another language. In both currently available domains, the destination language is English.

An example input string, in Spanish:

    manual de fotografía para aprender todo lo esencial sobre fujifilm cámaras

Note that some punctuation and special characters will be stripped from the input during preprocessing.

An example output string translation of the above example, as provided by the spa-eng-generic-v2 domain:

    manual of photography to learn everything essential about __UNKNOWN:fujifilm cameras

Words that the system does not recognize or cannot translate are marked with an __UNKNOWN: tag, as can be seen in the example output above. Note that all system output is lowercase for case-sensitive languages and, apart from the __UNKNOWN: tag, contains no punctuation.

Domains

  • spa-eng-generic-v2
    • Translates Spanish text into English text. Trained mostly on OpenSubtitles data.
  • fre-eng-generic-v1
    • Translates French text into English text. Trained mostly on OpenSubtitles data.

Inputs

For scoring, a text string or text-populated file is required. Neither OLIVE nor the TMT plugins verify that the text passed as input is actually in the language that the domain is capable of translating. The burden lies on the user to manually or automatically screen this text before attempting to translate it. Note that translation may fail or the output may be very confusing if the input language does not match the domain's capabilities.

An example input string:

    manual de fotografía para aprender todo lo esencial sobre fujifilm cámaras

Some preprocessing is applied to the input before a string is sent to translation: all input is lower-cased, and frequent punctuation marks such as commas, exclamation marks, and question marks are stripped from the string. However, no spelling error correction of any kind is performed.
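To anticipate what the plugin will actually translate, the preprocessing described above can be approximated on the client side. The following is a minimal sketch, not the plugin's own implementation: the function name `approximate_preprocess` and the exact set of stripped characters in `STRIPPED_MARKS` are assumptions, since the plugin's full list of removed characters is not documented here.

```python
# Assumption: the plugin's exact list of stripped characters is not documented;
# this sketch removes only a few common marks for illustration.
STRIPPED_MARKS = ",!?¡¿"


def approximate_preprocess(text: str) -> str:
    """Lower-case the text and remove a few common punctuation marks."""
    lowered = text.lower()
    without_marks = lowered.translate(str.maketrans("", "", STRIPPED_MARKS))
    # Collapse any runs of whitespace left behind by removed characters.
    return " ".join(without_marks.split())


if __name__ == "__main__":
    print(approximate_preprocess("¡Manual de Fotografía, para aprender!"))
```

Because no spelling correction is performed, a helper like this is only useful for previewing casing and punctuation changes; it will not repair misspelled input.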

Outputs

The output format for TMT plugins is simply text. Words that the system does not recognize or can't translate will be marked up with an __UNKNOWN: tag, as can be seen in the example output below. Note that all system output will be lowercase, and apart from the __UNKNOWN: tag, all output will be devoid of punctuation.

An example output string translation of the input example:

    manual of photography to learn everything essential about __UNKNOWN:fujifilm cameras
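Downstream code may want to separate untranslated words from the rest of the output. The sketch below assumes that the __UNKNOWN: tag is attached directly to the affected word with no intervening whitespace, as in the example above; the function name `split_unknown` is hypothetical and not part of any OLIVE API.

```python
from typing import List, Tuple

UNKNOWN_PREFIX = "__UNKNOWN:"


def split_unknown(output: str) -> Tuple[str, List[str]]:
    """Return (text with tags removed, list of untranslated words)."""
    clean_tokens: List[str] = []
    unknown_words: List[str] = []
    for token in output.split():
        if token.startswith(UNKNOWN_PREFIX):
            # Keep the original word in the text, but record it separately.
            word = token[len(UNKNOWN_PREFIX):]
            unknown_words.append(word)
            clean_tokens.append(word)
        else:
            clean_tokens.append(token)
    return " ".join(clean_tokens), unknown_words


if __name__ == "__main__":
    example = ("manual of photography to learn everything essential "
               "about __UNKNOWN:fujifilm cameras")
    text, unknown = split_unknown(example)
    print(text)
    print(unknown)
```

Keeping the untagged word in the cleaned text is a design choice that suits names and brands, which are often left untranslated anyway.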

Functionality (Traits)

The functions of this plugin are defined by its Traits and implemented API messages. These Traits are listed below, along with the corresponding API messages for each.

  • TextTransformer – Plugin accepts and analyzes text string inputs, and produces a new text string as output. In the case of TMT plugins, a text string in the source language should be provided as input, with the expectation that the output will be a text string translated into the desired destination language.

Compatibility

OLIVE 5.1+

Limitations

As the debut TMT plugin release for OLIVE, there are several known limitations that will impact the usage of this plugin.

Like ASR, TMT plugins are language dependent and also largely text domain dependent. The domains that a plugin can effectively cover are largely determined by the data used to train the translation models underlying each domain. Several factors can limit the vocabulary of a translation model, including the age of the text data used during development and the source or domain of this data (such as broadcast news transcripts versus social media posts).

Regarding performance, TMT plugins are especially sensitive to the tradeoff between accuracy on one hand and realistic resource requirements and runtime constraints on the other. The size-on-disk of each domain can be quite large depending on the technology of the individual plugin, and some technologies/plugins may have additional constraints. See the individual plugin detail pages for more information.

Multi-Job Threading

This plugin is currently not capable of multi-threading or parallel processing. If using this plugin with an OLIVE server, the server must be run in single-worker mode, by specifying a maximum number of jobs of 1:

    scenicserver -j 1

or

    scenicserver --workers 1

Alternatively, if the connected client only submits jobs synchronously, waiting for the completion and response of each job before submitting additional queries, problems will be avoided.
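One way to guarantee synchronous submission from client code is to serialize all translation calls behind a lock. The sketch below is illustrative only: `SerializedTranslator` and the `translate_one` callable are hypothetical stand-ins for whatever blocking client call is used to submit a single request, and are not part of the OLIVE API.

```python
import threading
from typing import Callable, Iterable, List


class SerializedTranslator:
    """Ensure only one translation request is in flight at a time."""

    def __init__(self, translate_one: Callable[[str], str]) -> None:
        # translate_one is assumed to block until the server has responded.
        self._translate_one = translate_one
        self._lock = threading.Lock()

    def translate(self, text: str) -> str:
        # The lock makes concurrent callers wait for the previous job.
        with self._lock:
            return self._translate_one(text)

    def translate_lines(self, lines: Iterable[str]) -> List[str]:
        # Sequential submission also keeps output order equal to input order.
        return [self.translate(line) for line in lines]


if __name__ == "__main__":
    # str.upper stands in for a real, blocking translation call.
    translator = SerializedTranslator(str.upper)
    print(translator.translate_lines(["hola", "mundo"]))
```

This only serializes requests issued through one such object in one client process; if several clients share a server, running the server in single-worker mode remains necessary.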

Due to this limitation, when performing machine translation with this plugin using the CLI tools (localanalyze) on a multi-line input file, the same restriction of limiting the number of active workers to 1 must be applied to ensure that the line order of the output text file matches the line order of the input file.

Language Dependence

Each domain of this TMT plugin is language specific, and is only capable of translating text from one single language to one other. OLIVE performs no filtering or verification to ensure that the text passed to a domain is indeed in the correct language. This burden lies on the user, who must either manually or automatically triage input text if the source language is unknown. That said, mismatched input languages are often easy to spot by the high number of output words that appear with the "__UNKNOWN:" tag.
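The observation that mismatched input tends to produce many tagged words can be turned into a rough automatic check. This is a heuristic sketch under stated assumptions: the function names are hypothetical, and the default `threshold` of 0.5 is an arbitrary illustrative value that would need tuning on real data.

```python
def unknown_ratio(output: str) -> float:
    """Fraction of whitespace-separated tokens carrying the __UNKNOWN: tag."""
    tokens = output.split()
    if not tokens:
        return 0.0
    tagged = sum(1 for token in tokens if token.startswith("__UNKNOWN:"))
    return tagged / len(tokens)


def looks_mismatched(output: str, threshold: float = 0.5) -> bool:
    """Flag output whose share of unknown words meets the threshold.

    Assumption: 0.5 is only a placeholder, not a recommended value.
    """
    return unknown_ratio(output) >= threshold


if __name__ == "__main__":
    good = ("manual of photography to learn everything essential "
            "about __UNKNOWN:fujifilm cameras")
    print(unknown_ratio(good), looks_mismatched(good))
```

A high ratio can also result from heavy misspelling or many names in otherwise correct-language input, so a flagged result is a prompt for review rather than proof of a language mismatch.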

Spelling Errors

Note that no spell-checking or other spelling-related preprocessing is performed on the input data. Any spelling mistakes in the input are therefore likely to cause the system to output __UNKNOWN: tags.

Out of Vocabulary (OOV) Words, Names

The individual words that the plugin is capable of recognizing and translating are determined by the vocabulary that the corresponding language model was trained with. This means that some uncommon or unofficial words, such as slang or other colloquial speech, as well as names of people or places, may not be translatable by a plugin out-of-the-box. Several factors can limit the vocabulary of a language model, including the age of the text data used during development, the source or domain of this data (such as broadcast news transcripts versus social media posts), and pruning of the vocabulary to increase processing speed and/or reduce memory requirements.

At this time there is no provision in OLIVE for adding words or names to the vocabulary or to the translation model. Some of the models underlying this plugin support vocabulary addition like this, so it may be possible for this feature to be supported in the future with OLIVE modifications.

Input Limit

There is a maximum limit on the size of the input that a single request/query can contain. This may depend on the resources available and the total number of characters, but currently appears to be roughly 600 words. Very large inputs that may approach or exceed this number should be split into multiple job submissions.
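Splitting a large input into several submissions can be done with a simple word-count chunker. The sketch below is one possible approach, not an OLIVE utility: the function name `split_into_chunks` is hypothetical, and the default of 500 words is an assumed safety margin below the approximate 600-word limit.

```python
from typing import List


def split_into_chunks(text: str, max_words: int = 500) -> List[str]:
    """Split text into chunks of at most max_words whitespace-separated words.

    Assumption: 500 is an illustrative margin below the rough 600-word limit.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


if __name__ == "__main__":
    long_text = " ".join(["palabra"] * 1200)
    chunks = split_into_chunks(long_text)
    print([len(chunk.split()) for chunk in chunks])
```

Note that this splits purely on word count and may cut a sentence in two, which can degrade translation quality at chunk boundaries; splitting on line or sentence boundaries first, where the input has them, is likely preferable.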

Resources (disk space)

Because it is based on statistical MT, this plugin's performance generally corresponds directly to the size of the models it uses: as these models grow, the system has been exposed to more and more data to learn from. As a result, the included models are very large. Please ensure you have adequate disk space available before attempting to use this plugin.

Future TMT plugins will be based on a neural MT architecture that will not have such extreme model size requirements.

Comments

Global Options

This plugin does not expose any options to the user.