tmt-ctranslate-v1 (Text Machine Translation)
Version Changelog
Plugin Version | Change |
---|---|
v1.0.0 | Initial plugin release with OLIVE 5.6.0. Carries over the functionality of several domains from tmt-neural-v1 with improvements to speed, memory usage, hardware compatibility, and stability, and adds new language capabilities as well |
v1.2.0 | New domains released, including improved Mandarin to English translation based on text directly (rather than speech), meant for chat-style translation, as well as Ukrainian to English, English to Ukrainian, and an updated English to Russian model. Updated character limits to be domain specific to be more compatible with Mandarin and provide better translation. Tested and released for OLIVE 5.6.0 |
v1.3.0 | Improved CPU/GPU device handling and compatibility between OLIVE's multi-processing approach and ctranslate2's. Tested and released with OLIVE 5.7.0 |
Description
Text Machine Translation plugins perform translation of text from one language to another, typically from one language, specified by the domain, to English. All TMT domains are language-dependent, and each one is meant to work only with a single, specific language. TMT plugins may have some limitations or special processing considerations depending on the language(s) involved and their native alphabets, as well as the training data used to train the underlying models.
The goal of this plugin is to ingest text in one language, defined by the specified domain, and translate it to another language. Both languages are specified in the domain name: the first language listed is the "source" language, which the plugin translates from, and the second is the "target" language, which the plugin attempts to translate to.
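The naming convention described above can be illustrated with a short sketch. The helper `parse_domain` and the `DomainInfo` type are hypothetical names introduced for illustration only and are not part of OLIVE; the sketch assumes that domain names are hyphen-separated, with the source language first, the target language second, and the version last.

```python
from typing import NamedTuple


class DomainInfo(NamedTuple):
    """Pieces of a TMT domain name (hypothetical helper type, not an OLIVE API)."""
    source: str
    target: str
    version: str


def parse_domain(name: str) -> DomainInfo:
    """Split a domain name such as 'spanish-english-v1' into its parts.

    Assumes the layout <source>-<target>[-<qualifier>...]-<version>,
    as suggested by the domain names listed in this document.
    """
    parts = name.split("-")
    if len(parts) < 3:
        raise ValueError(f"unexpected domain name: {name!r}")
    return DomainInfo(source=parts[0], target=parts[1], version=parts[-1])


if __name__ == "__main__":
    print(parse_domain("spanish-english-v1"))
    print(parse_domain("iraqiArabic-english-v1"))
```

Any qualifier between the target language and the version (for example `text`) is ignored by this sketch.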
An example input string, in Spanish:
manual de fotografía para aprender todo lo esencial sobre fujifilm cámaras
Note that some punctuation and special characters will be stripped from the input during preprocessing.
An example output string translation of the above example, as provided by the spanish-english-v1 domain:
manual of photography to learn everything essential about unk cameras
Words that the system does not recognize or can't translate may be marked up with an unk tag for some domains, as can be seen in the example output above. Note that all system output will be lowercase for case-sensitive languages, and apart from the unk tag, all output will be devoid of punctuation, because the models have been trained primarily on punctuation-free ASR output and are meant for translating ASR output more than text documents. Most domains perform best when given individual sentences as input.
An important exception to the above punctuation statement is the mandarin-english-text-v1 domain listed below. This is a newer, more advanced domain that has been trained with more text awareness. It can both consider punctuation in its input and produce punctuation in the resulting translation.
Domains (Supported Languages)
english-mandarin-v2
- Translates English text into Mandarin text.
english-russian-v2
- Translates English text into Russian text.
english-spanish-v2
- Translates English text into Spanish text.
english-ukrainian-v2
- Translates English text into Ukrainian text.
iraqiArabic-english-v1
- Translates Iraqi Arabic text into English text.
mandarin-english-text-v2
- Translates Mandarin text into English text. Capable of leveraging punctuation information during translation and producing punctuation in the output. Best able to deal with multi-sentence input.
mandarin-english-v2
- Translates Mandarin text into English text.
russian-english-v2
- Translates Russian text into English text.
spanish-english-v2
- Translates Spanish text into English text.
ukrainian-english-v2
- Translates Ukrainian text into English text.
Inputs
For scoring, a text string or text-populated file is required. Neither OLIVE nor the TMT plugins verify that the text passed as input is actually in the language that the domain is capable of translating. The burden lies on the user to manually or automatically screen this text before attempting to translate it. Note that translation may fail, or the output may be very confusing, if the input language does not match the domain's capabilities.
An example input string:
manual de fotografía para aprender todo lo esencial sobre fujifilm cámaras
Some preprocessing is applied to the input before a string is sent to translation: all input is lower-cased, and frequent punctuation marks such as commas, exclamation marks, and question marks are stripped from the string. However, no spelling error correction of any kind is performed.
As described above, one exception to this is the mandarin-english-text-v1 domain, which bypasses this preprocessing because it can handle punctuation in the input.
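To give a rough idea of what this kind of preprocessing does to an input string, here is a minimal sketch. It is not the plugin's actual implementation: the function name `preprocess` is hypothetical, and the exact set of punctuation marks removed by the plugin is not documented here, so the set below (commas, exclamation marks, and question marks, plus a few other common marks) is an assumption.

```python
import re

# Assumed punctuation set. The document only names commas, exclamation marks
# and question marks; the other characters here are a guess for illustration.
_PUNCTUATION = re.compile(r"[,!?¿¡.;:]")


def preprocess(text: str) -> str:
    """Lower-case the input and strip common punctuation (illustrative only)."""
    lowered = text.lower()
    # Replace punctuation with a space so adjacent words are not glued together.
    without_punct = _PUNCTUATION.sub(" ", lowered)
    # Collapse any runs of whitespace left behind by the substitution.
    return " ".join(without_punct.split())


if __name__ == "__main__":
    print(preprocess("¡Hola, Mundo! ¿Qué tal?"))
```

Note that, as in the plugin, nothing in this sketch attempts to correct spelling.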
Outputs
The output format for TMT plugins is simply text. Words that the system does not recognize or can't translate will be marked up with an unk tag, as can be seen in the example output below. Note that all system output will be lowercase, and apart from the unk tag, all output will be devoid of punctuation.
An example output string translation of the input example:
manual of photography to learn everything essential about unk cameras
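Because unrecognized words surface as unk tags, a caller can measure the share of unk tokens in an output string as a rough quality signal. The sketch below is an illustration only: `unk_ratio` is a hypothetical helper, the tag is assumed to appear as the whitespace-separated token `unk` (as in the example above), and any threshold for acting on the ratio would need to be tuned by the user.

```python
def unk_ratio(translation: str, unk_token: str = "unk") -> float:
    """Return the fraction of whitespace-separated tokens equal to unk_token.

    The token spelling is an assumption based on the example output in this
    document; pass a different unk_token if a domain marks unknowns differently.
    """
    tokens = translation.split()
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token == unk_token)
    return unknown / len(tokens)


if __name__ == "__main__":
    output = "manual of photography to learn everything essential about unk cameras"
    print(f"unk ratio: {unk_ratio(output):.2f}")
```

A high ratio may point to input in the wrong language for the domain, or to input containing many misspelled or out-of-vocabulary words.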
Functionality (Traits)
The functions of this plugin are defined by its Traits and implemented API messages. A list of these Traits is below, along with the corresponding API messages for each. Click a message name to be brought to additional implementation details.
- TextTransformer – Plugin accepts and analyzes text string inputs, and outputs a new text string as output. In the case of TMT plugins, a text string in the source language should be provided as input, with the expectation that the output will be a text string translated to the desired destination language.
Compatibility
OLIVE 5.6+
Limitations
Like ASR, TMT plugins are language dependent and also largely text domain dependent. The domains that a plugin can effectively cover are largely determined by the data used to train the translation models underlying each domain. Several factors can limit the vocabulary of a translation model, including the age of the text data used during development and the source or domain of this data (such as broadcast news transcripts versus social media posts).
Regarding performance, TMT plugins are especially sensitive to the tradeoffs between accuracy and realistic resource requirements and runtime constraints. The size-on-disk of each domain can be quite large depending on the technology of the individual plugin, and some technologies/domains may have additional constraints.
Multi-Language Memory Constraints
Each domain currently uses a significant amount of memory once the translation models have been loaded by using the plugin. Without substantial memory resources available on the machine hosting OLIVE, it's possible to quickly run out of available memory and run into strange behavior and/or failures - especially when running TMT alongside other memory-intensive plugins like ASR.
Language Dependence
Each domain of this TMT plugin is language specific, and is only capable of translating text from one single language to one other. There is no filter or any sort of verification performed by OLIVE to ensure that the text passed to this domain is indeed of the correct language. This burden lies on the users to either manually or automatically triage input text if the source language is unknown, though it is often easy to spot mismatched input languages by the high number of output words that appear with the "unk" tag for domains that support it.
Spelling Errors
Note that no spell-checking or other spelling-related preprocessing is performed on the input data. Therefore, any spelling mistakes in the input are likely to cause the system to output unk tags.
Out of Vocabulary (OOV) Words, Names
The individual words that the plugin is capable of recognizing and translating are determined by the vocabulary that the corresponding language model was trained with. This means that the plugin may not be able to translate some uncommon or unofficial words, like slang or other types of colloquial speech, as well as names or places, out-of-the-box. Several factors can limit the vocabulary of a language model, including the age of the text data used during development, the source or domain of this data (such as broadcast news transcripts versus social media posts), and pruning of the vocabulary to increase processing speed and/or reduce memory requirements.
At this time there is no provision in OLIVE for adding words or names to the vocabulary or to the translation model. Some of the models underlying this plugin support vocabulary addition like this, so it may be possible for this feature to be supported in the future with OLIVE modifications.
Input Expectations
This plugin has been trained to translate a single sentence at a time - to maximize the performance of this plugin, each scoring request input should be a single sentence. When longer segments, like paragraphs, pages, or even whole files are input, performance may be less than optimal, but the plugin will 'split' the data into 25-word chunks that it treats as sentences to minimize the negative impact. This is less than ideal, but performs much better on the whole than treating larger inputs as a single sentence and leaving them as-is.
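The fallback chunking described above can be approximated with a few lines of code. This is only a sketch under stated assumptions, not the plugin's internal logic: `chunk_words` is a hypothetical helper, words are assumed to be whitespace-separated, and the 25-word limit is taken from the paragraph above (the changelog notes that limits became domain specific in v1.2.0, so the real behavior may differ per domain).

```python
from typing import List


def chunk_words(text: str, max_words: int = 25) -> List[str]:
    """Split text into consecutive chunks of at most max_words words.

    Illustrative approximation of the fallback described above; word
    boundaries are assumed to be whitespace.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


if __name__ == "__main__":
    paragraph = " ".join(f"word{i}" for i in range(60))
    for chunk in chunk_words(paragraph):
        print(len(chunk.split()), chunk[:40])
```

Where the input contains reliable sentence boundaries, splitting it into sentences before submitting each one as its own scoring request is likely to give better results than relying on fixed-size chunks.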
Comments
GPU Support
Please refer to the OLIVE GPU Installation and Support documentation page for instructions on how to enable and configure GPU capability in supported plugins. By default this plugin will run on CPU only.
Global Options
This plugin does not expose any options to the user.