tpd-fusion-v1 (Topic Detection)

Version Changelog

Plugin Version	Change
v1.0.0	Initial plugin release, tested to be compatible with OLIVE 5.2.0
v1.0.1	Released with OLIVE 5.3.0, minor parameter change to improve output stability

Description

Topic Detection (TPD) plugins detect and label regions within a submitted audio segment where one or more enrolled topics are discussed. TPD plugins are capable of handling multiple topics within a single audio segment, and require topics of interest to be enrolled into the system by the end user.

This plugin features a bimodal TPD framework that combines word (XLM-RoBERTa) and acoustic topic embeddings before PLDA backend scoring and multi-class calibration. The word and acoustic embeddings are extracted separately using the embedding extractors of the unimodal TPD plugins tpd-dynapy and tpd-embed, respectively, and are fused by applying linear discriminant analysis before PLDA scoring.

The system discriminates better as more enrollment data is added for each topic, or as new topics are added; that is, topics are NOT independent. All domains are tailored toward conversational telephone speech, but may be suitable for other domains (lightly tested).

Before a topic can be detected by the TPD plugin, it must be enrolled into the system by providing annotated audio examples of the desired topic. TPD requires substantial enrollment data per topic and is very unlikely to perform well with fewer than ten examples from ten separate conversations.

There are several pre-enrolled background classes for each domain. These background topics give the system a sort of baseline for calibration and comparison when attempting to detect topics from the pool of target user-enrolled topics.

Take care when enrolling new topics for detection. If one of these 'background' topics is similar to a topic the user is interested in, the identical topic name should be used when enrolling examples of this topic, so the system knows to replace the similar 'background' topic with the topic of interest, using the enrollment audio provided.

Domains

This plugin includes two Mandarin Chinese domains. Both use monolingual word embedding extractors and differ in embedding size: 768 for cmn-tel-v1 vs. 1024 for cmn-tel-large-v1. The base domain is significantly faster and uses significantly less memory during processing than the 'large' domain. In SRI's pilot experiments, the larger word embeddings used by the 'large' domain gave slightly better TPD performance than the standard model, and they may improve performance further on user data.

eng-tel-v1
- English domain focused on conversational telephony speech. There are no pre-enrolled classes for detection.
rus-tel-v1
- Russian domain focused on conversational telephony speech. There are no pre-enrolled classes for detection.
cmn-tel-v1
- Mandarin Chinese domain focused on conversational telephony speech. There are no pre-enrolled classes for detection.
cmn-tel-large-v1
- Same as above, except this one uses higher-dimensional word embeddings.

Inputs

For enrollment, an audio file or buffer and time-annotated regions corresponding to a given topic identifier/label are required. For scoring, an audio buffer or file is required. IMPORTANT: TPD requires substantial enrollment data per class to function properly, or error rates will be very high. The system begins to perform well with ten or more examples of the topic, from different conversations. These examples should be annotated to include only the parts of the conversation that are "on-topic". Each conversational example should be at least 30 seconds in duration.
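
The enrollment guidance above can be turned into a quick pre-flight check before data is submitted. The following is a minimal sketch, not part of the plugin or the OLIVE API: the tuple layout, the function name check_enrollment, and the constant names are all assumptions made for illustration. Only the thresholds (ten conversations, 30 seconds) come from the text.

```python
from collections import defaultdict

# Thresholds taken from the guidance above; the names are illustrative only.
MIN_CONVERSATIONS_PER_TOPIC = 10
MIN_EXAMPLE_SECONDS = 30.0


def check_enrollment(examples):
    """Return {topic: [problems]} for topics that fall short of the guidance.

    `examples` is a list of (topic, conversation_id, start_s, end_s) tuples,
    a hypothetical in-memory form of the time-annotated enrollment regions.
    """
    conversations = defaultdict(set)
    short_examples = defaultdict(list)
    for topic, conv_id, start_s, end_s in examples:
        conversations[topic].add(conv_id)
        if end_s - start_s < MIN_EXAMPLE_SECONDS:
            short_examples[topic].append(conv_id)

    problems = {}
    for topic, convs in conversations.items():
        issues = []
        if len(convs) < MIN_CONVERSATIONS_PER_TOPIC:
            issues.append(
                f"only {len(convs)} conversation(s), "
                f"want {MIN_CONVERSATIONS_PER_TOPIC} or more"
            )
        if short_examples[topic]:
            issues.append(
                f"{len(short_examples[topic])} example(s) shorter than "
                f"{MIN_EXAMPLE_SECONDS:.0f} s"
            )
        if issues:
            problems[topic] = issues
    return problems
```

An empty result means every topic in the list meets both minimums; it says nothing about annotation quality, which still needs to be reviewed by hand.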

Outputs

TPD is a region scorer. When a topic is detected in the input audio, it returns a list of detections, each consisting of a timestamp region in seconds, an accompanying score, and the name of the previously enrolled topic that the score belongs to. Topics may overlap, more than one topic may be detected within the same audio file/segment, and the same topic may be detected in more than one region.

/data/eng/jRVUhtd_O1M.wav 352.380 424.550 WINE_MAKING 2.60474730
/data/eng/jRVUhtd_O1M.wav 431.180 442.280 WINE_MAKING 2.52452803
/data/eng/mueQ8-wABmg.wav 48.670 131.130 FISHING 5.71391582
/data/eng/mueQ8-wABmg.wav 142.250 167.950 FISHING 9.55547237
/data/eng/mueQ8-wABmg.wav 172.220 182.370 FISHING 9.55547237
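
For downstream processing, lines in the layout shown above (audio path, start, end, topic, score, separated by whitespace) can be parsed with a few lines of Python. This is a sketch that assumes exactly that five-field text layout; the Detection type and the function names are hypothetical and are not part of the OLIVE API.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    audio: str
    start_s: float
    end_s: float
    topic: str
    score: float


def parse_detection(line):
    """Parse one 'path start end topic score' line into a Detection."""
    # Split from the right so a path containing spaces stays intact.
    audio, start_s, end_s, topic, score = line.strip().rsplit(None, 4)
    return Detection(audio, float(start_s), float(end_s), topic, float(score))


def load_detections(lines):
    """Parse every non-blank line of region-scorer text output."""
    return [parse_detection(line) for line in lines if line.strip()]
```

For example, parsing the third line above yields a Detection whose topic is FISHING, with start_s 48.67 and end_s 131.13.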

Enrollments

Topic Detection plugins allow class modifications. A class modification is essentially the capability to enroll a class with one or more samples of that class's speech; in this case, a new topic. Adding examples to existing topics proceeds exactly like enrollment for other plugins, such as speaker detection. TPD enrollment requires an audio sample with regions that are labeled for the desired topic(s). If no time-offset annotations are provided, the entire file is assumed to be on-topic. A new enrollment is created with the first class modification, which essentially consists of sending the system an audio sample with a topic annotation. This enrollment can be augmented with subsequent class modification requests that add more audio with region labels for the same topic(s).

Note that you should never enroll the same audio in two or more different topics, even if the topics are closely related. Each recording should only ever have one label - audio should never be reused in enrollment unless it is first deleted from an existing topic.
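
One way to guard against accidental reuse is to scan the planned enrollment list before submitting it. The sketch below assumes the list is available as (audio path, topic) pairs; that representation and the name find_reused_audio are illustrative assumptions, not an OLIVE facility.

```python
from collections import defaultdict


def find_reused_audio(enrollments):
    """Return {audio_path: sorted topics} for audio listed under 2+ topics.

    `enrollments` is an iterable of (audio_path, topic) pairs, a hypothetical
    stand-in for whatever enrollment list is being prepared.
    """
    topics_by_audio = defaultdict(set)
    for audio_path, topic in enrollments:
        topics_by_audio[audio_path].add(topic)
    return {
        path: sorted(topics)
        for path, topics in topics_by_audio.items()
        if len(topics) > 1
    }
```

Any path that appears in the result should be assigned to a single topic before enrollment proceeds.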

Functionality (Traits)

The functions of this plugin are defined by its Traits and implemented API messages. These Traits are listed below, along with the corresponding API messages for each. Click a message name for additional implementation details.

REGION_SCORER – Score all submitted audio, returning labeled regions within the submitted audio, where each region includes a detected topic of interest and corresponding score for this topic.
- RegionScorerRequest
- RegionScorerStereoRequest?
CLASS_MODIFIER – Enroll new topic models or augment existing topic models with additional data.
- ClassModificationRequest
- ClassRemovalRequest

Compatibility

OLIVE 5.2+

Limitations

There are four main limitations that will impact the usage of this plugin.

Labeling Resolution

Region scoring is performed using a sliding window. If a large window is used (60 seconds) and a single topic exists in that window, the system is robust. If multiple topics exist in a single window, or a smaller window (20 seconds) is used, the results will be noisier. There is a balance to be found between resolution of topic change and robustness. Testing a dataset known to have short time-span topics (say 20 seconds) with a 60 second window will not perform as well as matching the window to the expected topic duration of 20 seconds.
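
The trade-off can be made concrete by asking how much of any one window a topic region fills. The sketch below is an illustration only: the hop size step_s is an assumption (the plugin's actual hop is not stated here), and both function names are hypothetical.

```python
def sliding_windows(duration_s, window_s=60.0, step_s=10.0):
    """Return (start, end) windows covering [0, duration_s].

    step_s is an assumed hop size, chosen only for illustration.
    """
    if duration_s <= window_s:
        return [(0.0, float(duration_s))]
    windows = []
    start = 0.0
    while start + window_s < duration_s:
        windows.append((start, start + window_s))
        start += step_s
    # Align the final window to the end of the audio.
    windows.append((duration_s - window_s, float(duration_s)))
    return windows


def best_topic_fraction(topic_start, topic_end, windows):
    """Largest share of any single window that the topic region fills."""
    best = 0.0
    for w_start, w_end in windows:
        overlap = min(topic_end, w_end) - max(topic_start, w_start)
        if overlap > 0:
            best = max(best, overlap / (w_end - w_start))
    return best
```

Under these assumptions, a 20 second topic in a 120 second file fills at most one third of any 60 second window, but can fill an entire 20 second window, which is why matching the window to the expected topic duration helps.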

Low Enrollment Data

While it is possible to enroll a topic on a single file, the system thrives on ample enrollment data per topic and benefits from having multiple topics from the target domain. If a given topic has few samples (such as 10), it will not perform as well as if it had 50 samples (recommended). If one topic is enrolled, the system will not perform as well as when 10 or 20 topics are enrolled for the target domain. This is because the enrolled topics actively discriminate against each other.

Minimum Speech Duration

The system will not attempt to perform topic detection unless 10 seconds of speech or more is found in the file (configurable as min_speech). This minimum amount of speech must also occur in a relatively short timespan, with no more than a 4 second gap between islands of speech.
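
The rule above can be read as: chain speech segments together while the silence between neighbours is at most 4 seconds, and require that some chain adds up to min_speech. The following sketch implements that reading. It is an interpretation of the description, not the plugin's code; the function name and the segment representation are assumptions, and segments are assumed not to overlap.

```python
def has_enough_speech(segments, min_speech=10.0, max_gap=4.0):
    """True if some run of speech segments totals at least min_speech seconds.

    `segments` is a list of non-overlapping (start_s, end_s) speech regions,
    e.g. from a speech activity detector. Segments are chained into one run
    while the silence between neighbours is no more than max_gap seconds.
    """
    run_total = 0.0
    prev_end = None
    for start_s, end_s in sorted(segments):
        if prev_end is not None and start_s - prev_end > max_gap:
            run_total = 0.0  # gap too long: start a new run
        run_total += end_s - start_s
        if run_total >= min_speech:
            return True
        prev_end = end_s
    return False
```

For instance, three segments of 4, 4, and 3 seconds separated by 2 second pauses pass the check, while three 4 second segments separated by 6 second pauses do not.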

Language Dependence

TPD is language dependent and also largely domain dependent. The domains that a plugin can effectively cover are largely determined by the data used to train the speech recognition system (for text-based plugins like tpd-dynapy-v1) or the acoustic embeddings extractor (as with tpd-embed-v1/v2). Older TPD plugins that are based on ASR technology can also be very slow and heavy on resource utilization, making them cumbersome to use. Newer TPD plugins are based on a topic embeddings framework that is much easier to port to new languages, since it is based on more language-independent technology. It is also significantly faster than the ASR-based approach. Topic embeddings plugins do have trade-offs, though: as a newer and less mature technology, their topic detection accuracy often does not yet match the traditional ASR approach.

It is up to the user to ensure that the audio passed to this plugin is in the appropriate language. Each domain can recognize audio in a single, specific language, and has no capability to detect or reject input audio that does not fit this language.

Comments or Usage Notes

The system retrains the discriminative space on loading and after any enrollment is added or removed. For this purpose, some base data is included in the plugin, with enrollment data used to complement the space. As a result, topic detection enrollments depend on each other (i.e., unenrolling topic A will change topic B scores for a previously run file).

Similarly, there are 10 or 16 base topics for each domain, used to improve calibration when limited topics have been enrolled. Data from a base topic is not used if the user enrolls a topic with the same label. It is therefore important for the user to match the naming of these topics when enrolling a very similar or identical topic, to prevent competition between the same/similar topics. The base topics for each domain are listed in a file 'model.classes' within each domain, and are reproduced below.

Additionally, the min_speech requirement of 10.0 seconds means that speech segments must add up to 10.0 seconds of continuous speech, where segments are deemed continuous when no more than 4 seconds of silence exists between them.

Base Topic Data

eng-cts-v1

BUYING_A_CAR
CAPITAL_PUNISHMENT
DRUG_TESTING
EXERCISE_AND_FITNESS
FAMILY_FINANCE
JOB_BENEFITS
NEWS_MEDIA
PETS
PUBLIC_EDUCATION
RECYCLING

rus-cts-v1

ACTIVITIES
BIRTHDAY-WISHES
BOOKS
BUYING
CHILDREN
COMPUTER
EDUCATION
ENTERTAINMENT
FOOD
FRIENDS
GET-TOGETHER
HEALTH
HOLIDAY
HOME-MAINTENANCE
IMMIGRATION
LANGUAGE
LIFE
LOCATION
MARRIAGE
MOVING
MUSIC
PERFORMANCE
PERSONAL
PETS
POLITICS
PROJECT
REAL-ESTATE
SPEECH-COLLECTION
SPORTS
TRANSPORTATION
TRAVEL
TV
WEATHER
WORK

cmn-cts-v1

ACTIVITIES
BUYING
CITIZENSHIP
COLLECTION
CORRESPONDENCE
FINANCES
FRIENDS
HEALTH
HOME
LIFE
LOCATION
PERSONAL
SCHOOL
TRANSPORT
TRAVEL
WORK
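
Because a user topic only replaces a base topic when the labels are identical, it can help to compare a planned topic name against the lists above before enrolling. The sketch below uses Python's standard difflib to flag near-identical spellings of the eng-cts-v1 names. It is an illustrative helper, not part of the plugin; the function names are hypothetical, and string similarity will only catch spelling variants, not topics that are similar in meaning but named differently.

```python
import difflib

# Base topics copied from the eng-cts-v1 list above.
ENG_BASE_TOPICS = [
    "BUYING_A_CAR", "CAPITAL_PUNISHMENT", "DRUG_TESTING",
    "EXERCISE_AND_FITNESS", "FAMILY_FINANCE", "JOB_BENEFITS",
    "NEWS_MEDIA", "PETS", "PUBLIC_EDUCATION", "RECYCLING",
]


def normalize_topic(name):
    """Upper-case and underscore-join a name, as in the English list."""
    return "_".join(name.strip().upper().split())


def suggest_base_topics(name, base_topics, cutoff=0.8):
    """Return base topic labels that closely match the proposed name."""
    return difflib.get_close_matches(
        normalize_topic(name), base_topics, n=3, cutoff=cutoff
    )
```

If a suggestion is returned and it describes the same subject, enrolling under that exact label avoids competition with the base topic. The Russian list uses hyphens rather than underscores, so the normalization step would need adjusting for that domain.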

Global Options

The following options are available to this plugin, adjustable in the plugin's configuration file: plugin_config.py.

Option Name	Description	Default	Expected Range
threshold	Detection threshold: A higher value results in fewer detections being output.	0.0	-10.0 to 10.0
window	Size of sliding window over speech: Each window is assumed to contain a single topic. Longer is more robust at the cost of resolution.	60.0	10.0 to 120.0
min_speech	Required amount of speech in file to enroll and perform topic detection.	10.0	2.0 to 30.0
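
Before editing the configuration, it may be useful to check proposed values against the defaults and ranges in the table. The sketch below is a standalone helper; it does not reproduce the actual layout of plugin_config.py, which is not shown here, and the names OPTION_SPECS and resolve_options are hypothetical.

```python
# (default, low, high) copied from the table above.
OPTION_SPECS = {
    "threshold": (0.0, -10.0, 10.0),
    "window": (60.0, 10.0, 120.0),
    "min_speech": (10.0, 2.0, 30.0),
}


def resolve_options(overrides=None):
    """Merge overrides onto the defaults, rejecting unknown or out-of-range values."""
    options = {name: spec[0] for name, spec in OPTION_SPECS.items()}
    for name, value in (overrides or {}).items():
        if name not in OPTION_SPECS:
            raise KeyError(f"unknown option: {name}")
        _, low, high = OPTION_SPECS[name]
        if not low <= value <= high:
            raise ValueError(
                f"{name}={value} is outside the expected range {low} to {high}"
            )
        options[name] = float(value)
    return options
```

Calling resolve_options({"window": 20}) keeps the other defaults and sets a 20 second window, while a value such as window=5 raises ValueError because it falls below the expected range.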