OLIVE Command Line Interface Guide (Legacy)

Disclaimer

Note that the tools described below are legacy tools that are mostly used for internal testing and development. With docker-based deliveries, these utilities are difficult to access and carry many performance tradeoffs versus using the OLIVE server through a client; they should not be used for integration, only for very basic experimentation. The functionality they offer should instead be accessed through the provided Java example clients (OliveAnalyze, OliveEnroll, etc.) or Python example clients (olivepyanalyze, olivepyenroll, etc.). Documentation for those utilities is under construction and will be provided soon, but each utility has a help statement that provides instructions for running it.

Introduction

This document describes running the OLIVE (formerly SCENIC) system from a command line. Our command line applications are geared toward a variety of specialized users such as researchers, system evaluators (e.g. Leidos for the DARPA RATS program), and testers. Casual users should consider using our graphical application instead. The command line applications can function as general-purpose tools, but may require specially formatted files such as the RATS XML files for audio analysis and LDC-format TSV files for training annotations.

1: Overview

OLIVE command line interface (CLI) tools include:

  • localenroll – Used to enroll ‘targets’ into the system, such as a target speaker for speaker identification (SID), a topic of interest for topic identification (TID), or a keyword or phrase of interest for query-by-example keyword spotting (QBE).
  • localanalyze – Used to query the OLIVE server to score audio: finding speech with a speech activity detection (SAD) plugin, reporting scores for potential speakers or languages of interest with SID or language identification (LID) plugins, and reporting the likelihood and location(s) of conversation topics or keywords of interest (TID, QBE, KWS).
  • localtrain – Used to train or adapt plugins that support the LearningTrait (SupervisedAdapter, SupervisedTrainer, or UnsupervisedAdapter) with examples of new audio conditions to improve performance in those conditions. Also used to add new language recognition capabilities to a LID plugin and to retrain the background models of a SID plugin to prepare it for new audio conditions. Training and adaptation are not available in all plugins; please refer to the individual plugin documentation or the plugin capabilities matrix to verify availability.

2: Command Line Testing and Analysis

A. Enrollment with localenroll

The localenroll command is used to enroll audio for SID and TID. It can be invoked from a BASH or C-shell terminal. It takes a simply formatted text file as input and does not produce an output file.

The audio enrollment list input file is formatted as one or more newline-separated lines containing a path to an audio file and a class or model ID, which can be a speaker name, topic name, or query name for SID, TID, and QBE respectively. A general example is given below, and more details and plugin-specific enrollment information are provided in the appropriate section in the Plugin Appendix.

Format:

<audio_path> <model_id>

Example enrollment list file (SID):

/data/speaker1/audiofile1.wav speaker1
/data/speaker1/audiofile2.wav speaker1
/data/speaker7/audiofile1.wav speaker7
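
When enrollment audio is organized one speaker per directory, as in the example above, a list in this format can be generated with a short script. The sketch below is illustrative only; the directory layout and the helper name build_enrollment_list are assumptions, not part of OLIVE:

```python
import os

def build_enrollment_list(root_dir, out_path):
    """Write '<audio_path> <model_id>' lines, one per .wav file,
    using each immediate subdirectory name as the model id."""
    lines = []
    for speaker in sorted(os.listdir(root_dir)):
        spk_dir = os.path.join(root_dir, speaker)
        if not os.path.isdir(spk_dir):
            continue
        for fname in sorted(os.listdir(spk_dir)):
            if fname.endswith(".wav"):
                lines.append(f"{os.path.join(spk_dir, fname)} {speaker}")
    with open(out_path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines
```

The resulting file can then be passed directly to localenroll as the enrollment_file argument.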

The basic syntax for calling localenroll (more details and options below) is:

$ ./localenroll <path_to_plugin_domain> <path_to_enrollment_file>

For example:

$ ./localenroll plugins/sid-embed-v1/domains/multi-v1/ /data/sid/smoke_enroll.lst

The numerous options available in localenroll can be seen by executing localenroll --help, the output of which is shown below:

usage: localenroll [-h] [--options OPTIONS_PATH] [--timeout TIMEOUT]
                   [--version] [--log LOG_FILE] [--work_dir WORK_DIR]
                   [--jobs JOBS] [--debug] [--nochildren] [--quiet]
                   [--verbose] [--purge]
                   plugin_domain_path enrollment_file

positional arguments:
  plugin_domain_path    path to plugin domain used for analysis
  enrollment_file       List of enrollments of the form <audio_path>
                        <class_id> OR a PEM formatted file of the form
                        <audio_path> <channel> <class_id> <start> <end>

optional arguments:
  -h, --help            show this help message and exit
  --options OPTIONS_PATH
                        Optional file containing name/value pairs. The option
                        file must have a configuration section named
                        enrollment. Only values from the enrollment section
                        are read
  --timeout TIMEOUT     timeout, in seconds, for all jobs regardless of the
                        audio duration. otherwise the job will timeout based
                        on the duration of audio to process and the domain's
                        timeout_weight
  --version             show program's version number and exit
  --log LOG_FILE, -l LOG_FILE
                        path to output log file
  --work_dir WORK_DIR, -w WORK_DIR
                        path to work dir
  --jobs JOBS, -j JOBS  specify number of parallel JOBS to run; default is the
                        number of local processors
  --debug, -d           debug mode prevents deletion of logs and intermediate
                        files on success
  --nochildren          Perform sequential (local) processing without creating
                        jobs as sub processes for ease of debugging (IPython)
                        at the expense of speed/parallelization
  --quiet, -q           turn off progress bar display
  --verbose, -v         Print logging to the terminal. Overrides --quiet
  --purge, -p           purge the work dir before starting

B. Scoring and processing with localanalyze

I. Invoking localanalyze

The localanalyze utility is used to perform OLIVE scoring and analysis with most plugins (SAD, SID, SDD, LID, KWS, QBE, GID, TID), or processing with an ENH plugin, all on list-based input files. It can be invoked from a BASH or C-shell terminal. A path to a valid OLIVE plugin and domain as well as an audio paths input file are required for all tasks. For some plugins, like LID and SID, an optional IDs input file can be specified via the --class_ids argument to limit which languages or speakers are scored. This IDs input file is also how a keyword spotting plugin is informed what the keywords of interest are for a given analysis.

The exact details for invoking localanalyze will depend upon the plugin technology being used, and may vary slightly depending upon the options available to each individual plugin, but the general format for running this utility is:

$ localanalyze <path_to_plugin_domain> <list_of_files_to_analyze_or_process>

With an example (SID):

$ localanalyze plugins/sid-embed-v1/domains/multi-v1/ /data/sid/test_data.lst

The format of the audio input file is simply a list of one or more newline-separated lines containing a path to an audio file:

<audio_path>

Example audio input file:

/data/sid/test/unknownSpkr1.wav
/data/sid/test/unknownSpkr27.wav

As mentioned above, if you would only like to score a subset of the enrolled speakers or languages, you can optionally pass a list of these identifiers as a newline-separated list text file, with the --class_ids command line argument. This same argument is how you select keywords to search for when running localanalyze with a keyword spotting plugin (see KWS section in the Plugin Appendix for more information).

IDs List Format:

<id_1>
<id_2 (opt)>
<id_N (opt)>

A Speaker Identification IDs example:

Chris 
Billy
Spkr3

A Keyword Spotting IDs example:

turn left
torpedo
watermelon

Example (KWS) of a localanalyze call with the --class_ids argument:

$ localanalyze --class_ids search_list.lst plugins/kws-batch-v9/domains/eng-tel-v1/ /data/kws/test/test-audio.lst

Note that re-running localanalyze will overwrite the contents of the output.txt file or OUTPUT directory, depending on what type of plugin is being run.

The OLIVE usage/help statement for localanalyze:

usage: localanalyze [-h] [--output OUTPUT_PATH] [--thresholds THRESHOLDS]
                [--class_ids ID_LIST_PATH] [--options OPTIONS_PATH]
                [--regions REGION_PATH] [--timeout TIMEOUT] [--version]
                [--log LOG_FILE] [--work_dir WORK_DIR] [--jobs JOBS]
                [--debug] [--nochildren] [--quiet] [--verbose] [--purge]
                plugin_domain_path audio_paths_file

positional arguments:
  plugin_domain_path    path to plugin domain used for analysis
  audio_paths_file      List of audio files to analyze OR a PEM formatted file
                        of the form <audio_path> <channel> <class_id> <start>
                        <end>

optional arguments:
  -h, --help            show this help message and exit
  --output OUTPUT_PATH, -o OUTPUT_PATH
                        path to output file or directory
  --thresholds THRESHOLDS
                        Optional comma-separated threshold values to apply to
                        frame-level scores, e.g. 0.0,1.5. Use syntax '--
                        thresholds=' for negative values, e.g
                        --thresholds=-2.0,-1.0
  --class_ids ID_LIST_PATH, -i ID_LIST_PATH
                        Optional file that specifies class ids to be scored,
                        e.g. to limit the speakers that are scored.
  --options OPTIONS_PATH
                        Optional file containing plugin specific name/value
                        pairs. The option file may have one or more section
                        headings, one for each plugin type. Common section
                        names are 'frame scoring', 'global scoring' or
                        'region scoring'
  --regions REGION_PATH, -r REGION_PATH
                        Optional flag indicating that the audio paths file
                        should be supplemented with regions from a PEM
                        formatted file; it is up to the plugin to utilize
                        these regions to supplement its scoring. This flag is
                        ignored if the audio input list (audio_paths_file) is
                        a PEM formatted file.
  --timeout TIMEOUT     timeout, in seconds, for all jobs regardless of the
                        audio duration. otherwise the job will timeout based
                        on the duration of audio to process and the domain's
                        timeout_weight
  --version             show program's version number and exit
  --log LOG_FILE, -l LOG_FILE
                        path to output log file
  --work_dir WORK_DIR, -w WORK_DIR
                        path to work dir
  --jobs JOBS, -j JOBS  specify number of parallel JOBS to run; default is the
                        number of local processors
  --debug, -d           debug mode prevents deletion of logs and intermediate
                        files on success
  --nochildren          Perform sequential (local) processing without creating
                        jobs as sub processes for ease of debugging (IPython)
                        at the expense of speed/parallelization
  --quiet, -q           turn off progress bar display
  --verbose, -v         Print logging to the terminal. Overrides --quiet
  --purge, -p           purge the work dir before starting

II. Output

Plugin Scoring Types

In general, the output format and location of a call to localanalyze will depend on the type of ‘scorer’ the plugin being used is. There are currently four types of plugins in OLIVE:

  • Global scorer
    • Any plugin that reports a single score for a given model over the entire test audio file is a global scoring plugin. Currently SID, LID, and GID are the only global scoring plugins. Every input test audio file will be assigned a single score for each enrolled target model, as measured by looking at the entire file at once.
    • Example – sid-embed-v1, lid-embed-v1
  • Region scorer
    • Region scoring plugins are capable of considering each audio file in small pieces at a time. Scores are reported for enrolled target models along with the location within that audio file that they are thought to occur. This allows OLIVE to pinpoint individual keywords or phrases or pick out one specific speaker in a recording where several people may be talking. TID, SDD, QBE, and KWS are all region scorers.
    • Example – sdd-embed-v1, qbe-tdnn-v4, kws-batch-v9
  • Frame scorer
    • A frame scoring plugin provides a score for every ‘frame’ of audio within every test file passed to localanalyze. This allows OLIVE to find distinct regions of speech with high precision in recordings with noise and/or silence. SAD is a frame scoring plugin. It is also possible to apply a threshold to a frame scoring plugin at run-time to report regions of detection instead of frame scores. For a plugin like SAD, this allows OLIVE to provide output in the form of speech regions. A frame is a short segment of audio that typically consists of 10 milliseconds of audio (100 frames per second).
    • Example – sad-dnn-v4
  • Audio to audio
    • An audio-to-audio plugin takes an audio file as input and returns an audio file as output. Currently the only plugins that fall into this category are speech/audio enhancement plugins, where the goal is removing noise and distortion from an audio file to improve the human listening experience and intelligibility.
    • Example – enh-mmse-v1

Global Scorer Output

In the case of global scorers like LID and SID, the output file, which by default is called output.txt, contains one or more lines containing the audio path, speaker/language ID (class id), and the score:

<audio_path> <class_id> <score>

The name and location of the output file can be overridden by passing it as the argument to the -o or --output argument when calling localanalyze. To see specific examples for each plugin type, please refer to the appropriate section of the Plugin Appendix.

Region Scorer Output

Region scoring plugins generate a single output file, also called output.txt by default, just like global scorers. The file looks very similar to a global scorer’s output, but each line includes a temporal component representing the start and end of the scored region. In practice, this looks like:

<audio_path> <region_start_timestamp> <region_end_timestamp> <class_id> <score>

Each test file can have multiple regions where scores are reported, depending on the individual plugin. The region boundary timestamps are in seconds. Specific examples can be found in the Plugin Appendix at the end of this document.
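
Output files in this form are easy to post-process. The sketch below groups region-scorer output lines by audio file; the function name parse_region_scores is illustrative, not part of OLIVE:

```python
from collections import defaultdict

def parse_region_scores(path):
    """Group region-scorer output lines by audio file.
    Each line: <audio_path> <start> <end> <class_id> <score>."""
    regions = defaultdict(list)
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 5:
                continue  # skip blank or malformed lines
            audio, start, end, class_id, score = parts
            regions[audio].append(
                (float(start), float(end), class_id, float(score)))
    return regions
```

From the returned dictionary you can, for instance, keep only regions whose score exceeds a chosen operating threshold.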

Frame Scorer Output

In the case of frame scorers like SAD, an output file is generated for each audio input file, containing one score per frame of the input audio, with one frame score per line. Alternatively, segmentation output can be produced from SAD results by using the --thresholds argument, in which case the output file adheres to the standard 5-column PEM format. Without a threshold supplied to localanalyze, the frame scorer output looks like this:

<frame_1_score>
<frame_2_score>
…
<frame_N_score>

When a threshold is provided, the output file will resemble the following:

<filename> <channel> <label (speech)> <speech region start time (seconds)> <end time (seconds)>

Audio to Audio Output

An audio-to-audio plugin takes an audio file as input and returns a corresponding audio file as output. Currently, this plugin type is used to supply enhancement capabilities to OLIVE, to allow OLIVE to improve the quality, intelligibility, or just general human listening experience for an audio file. By default, each output audio file is created in an OUTPUT directory in the location that localanalyze was invoked. Within the OUTPUT directory, the folder structure of the original input audio file is preserved.

This means that if the input audio file was:

/data/enhancement-test/test_file1.wav

Then the enhanced output file will be found by default in:

./OUTPUT/data/enhancement-test/test_file1.wav
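
In other words, the default output location mirrors the input path beneath ./OUTPUT. A minimal sketch of that mapping (illustrative helper, not OLIVE code):

```python
import os

def enhanced_output_path(input_path, output_root="OUTPUT"):
    """Mirror an absolute input path under the OUTPUT directory,
    preserving the original folder structure."""
    return os.path.join(output_root, input_path.lstrip(os.sep))

# e.g. enhanced_output_path("/data/enhancement-test/test_file1.wav")
#      -> "OUTPUT/data/enhancement-test/test_file1.wav"
```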

3: Command Line Field Adaptation

A. Command Line Field Adaptation Overview

In general, training and adaptation are very resource and time intensive operations. Very large amounts of RAM are used at certain steps in training. When attempting to train or adapt, the machine should be dedicated to that operation.

If the plugin path contains a domain then adaptation is implied, otherwise training is implied. The high-level difference between training and adaptation is that adaptation will use the new data supplied during adaptation in addition to the data already used to train the model used by the plugin/domain. Training, on the other hand, ignores the data originally used for training a model and retrains from scratch using only the new data provided. When performing training, none of the data in the base plugin will be used, but the feature configs will. Check the plugin’s traits to determine if full training and/or adaptation are supported.

B. Invoking localtrain

Not to be confused with enrollment, the localtrain command line application is used to perform field adaptations for SAD, LID & SID. localtrain takes a plugin or plugin_domain path, and one or more data input files formatted for:

  • Unsupervised data - a newline separated list of audio file paths
  • Supervised data with file-level annotations - a newline separated list of audio file paths with a class ID (e.g. “audio_file1.flac fas\n”)
  • Supervised data with region-level annotations - a newline separated list of audio file paths, start time (seconds), end time (seconds), and class ID (e.g. “audio_file1.flac 1.25 3.5 fas\n”)

If multiple data files are specified then they must all use the same annotation format.
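
Since the three forms differ only in column count (1 for unsupervised, 2 for file-level, 4 for region-level), consistency across data files can be checked up front. This is a hedged sketch under that column-count assumption; detect_form and check_consistent are illustrative names, not OLIVE utilities:

```python
FORMS = {1: "unsupervised", 2: "file-level", 4: "region-level"}

def detect_form(path):
    """Return the annotation form of a localtrain data file by its
    per-line column count (assumes paths contain no spaces)."""
    with open(path) as f:
        counts = {len(line.split()) for line in f if line.strip()}
    if len(counts) != 1:
        raise ValueError(f"mixed column counts in {path}")
    n = counts.pop()
    if n not in FORMS:
        raise ValueError(f"unrecognized annotation format in {path}")
    return FORMS[n]

def check_consistent(paths):
    """Verify all data files use the same annotation format."""
    forms = {detect_form(p) for p in paths}
    if len(forms) != 1:
        raise ValueError("all data files must use the same annotation format")
    return forms.pop()
```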

The localtrain utility outputs a new domain in the plugin path. The details of the localtrain executable are below:

usage: localtrain [-h] --domain-id DOMAIN_ID [--overwrite] [--preprocess]
                  [--finalize] [--unique] [--options OPTIONS_PATH]
                  [--timeout TIMEOUT] [--version] [--log LOG_FILE]
                  [--work_dir WORK_DIR] [--jobs JOBS] [--debug] [--nochildren]
                  [--quiet] [--verbose] [--purge]
                  plugin_or_domain_path data [data ...]

Train or adapt OLIVE audio recognition systems

positional arguments:
  plugin_or_domain_path
                        Path to the plugin or domain. A plugin path implies
                        full training. A domain path implies adaptation of the
                        specified domain.
  data                  paths to data files for training/adaptation. The files
                        can have one of three forms. 1: <audio_path>\n 2:
                        <audio_path> <class_id>\n 3: <audio_path> <class_id>
                        <start> <end> \n. The first form has no annotations
                        and implies unsupervised. The second form provides for
                        file-level annotations while the third form supports
                        region-level annotations. Start and end times should
                        be in seconds. If multiple files are specified, they
                        must have the same form.

optional arguments:
  -h, --help            show this help message and exit
  --domain-id DOMAIN_ID
                        The id of the new domain you're creating through
                        training or adaptation. Should be a string that is
                        somewhat descriptive of the conditions
  --overwrite           Forcefully overwrite an existing domain
  --preprocess          Pre-process audio only, do not finalize
                        training/adaptation
  --finalize            Finalize training/adaptation using previously
                        pre-processed audio
  --unique              guarantees log files are written to unique
                        directories/files. Helpful when running in SGE mode
  --options OPTIONS_PATH
                        Optional file containing plugin specific name/value
                        pairs. The option file must have one or more sections
                        for each plugin type. Common section names are
                        'supervised trainer', 'supervised adapter',
                        'unsupervised trainer' or 'unsupervised adapter'
  --timeout TIMEOUT     timeout, in seconds, for all jobs regardless of the
                        audio duration. otherwise the job will timeout based
                        on the duration of audio to process and the domain's
                        timeout_weight
  --version             show program's version number and exit
  --log LOG_FILE, -l LOG_FILE
                        path to output log file
  --work_dir WORK_DIR, -w WORK_DIR
                        path to work dir
  --jobs JOBS, -j JOBS  specify number of parallel JOBS to run; default is the
                        number of local processors
  --debug, -d           debug mode prevents deletion of logs and intermediate
                        files on success
  --nochildren          Perform sequential (local) processing without creating
                        jobs as sub processes for ease of debugging (IPython)
                        at the expense of speed/parallelization
  --quiet, -q           turn off progress bar display
  --verbose, -v         Print logging to the terminal. Overrides --quiet
  --purge, -p           purge the work dir before starting

In order to use an adapted system plugin, simply pass the full path of the domain generated by localtrain to localenroll or localanalyze as the plugin_domain_path argument. For training, do not include the domain in the plugin path.

When running on an SGE, you may split the audio processing from the finalization step by using the --preprocess flag to first pre-process audio files, then invoking localtrain with the --finalize argument to finalize training.

Guidelines for the minimum amount of audio data required to successfully execute localtrain are listed in the table below.

Task   Operation              Speech Duration
SAD    Adapt to new channel   1h
LID    Adapt to new channel   20m
LID    Train a new language   3h
SID    Adapt to new channel   1h

i. Examples

SAD Adaptation Example:
$ localtrain ./plugins/sad-dnn-v1/domains/ptt-v1/ adaptation-data.lst

Where each line of adaptation-data.lst has the following format:

/path/to/audio.wav label

C. LID Training/Adaptation

When training new channel conditions, it is recommended to train all supported languages in the LID model to produce the best results. The out-of-set language is labeled ‘xxx’. Use this language ID when training to add languages that are known to be in the test dataset but that you do not want to target in the LID task.

4: Log Files

a. OLIVE Command Line Logging

When executing localtrain, localenroll, and localanalyze, there are three named log files that may be of interest should something go awry.

  1. The top-level log file: This log file corresponds to the -l option to the localtrain, localenroll, and localanalyze utilities. By default, it is named the same as the utility being used with “.log” appended (i.e. localanalyze.log when running localanalyze) and will be written to the directory from which you executed the utility.
  2. The pool executor log file: This file will be written to [work_directory]/logs/pool_executor.log, where work_directory corresponds to the -w option to localtrain/localenroll/localanalyze and defaults to your current directory/WORK. The pool executor log file is the best log file to look at if unexpected errors occur. It corresponds to our internal job scheduler also known as the pool executor.
  3. The pool monitor log file: This file will be written to [work_directory]/logs/pool_monitor.log, where work_directory corresponds to the -w option to the localtrain, localenroll, localanalyze utilities and defaults to your current directory/WORK/. This log contains stats about memory and CPU utilization.

All three of these log files will exhibit log rotation behavior.

In the event of errors, [work_directory]/logs may also contain log files named [order_id].failed, where order_id generally corresponds to the file names of the audio files being used for adaptation/training, enrollment, or analysis. The id can be used to tie errors in the pool executor log file to the “.failed” log files.

If you run the OLIVE CLI utilities in debug mode (-d), all log files will be maintained, even if they were successful.

b. Rotating Log Files

OLIVE employs rotating log files in many places. In this context, rotating refers to a log file that is rewritten each time the application is run. The old log file, if any, is renamed with an integer suffix denoting how many invocations in the past it corresponds to. For instance, if you run localanalyze and don’t specify a -l option, you’ll get the default localanalyze.log file. If localanalyze.log already exists, it is moved to localanalyze.log.1. The system will keep the 10 most recent log files. A file named localanalyze.log.8 means that the file corresponds to eight invocations ago.

5: Plugin Appendix

Plugin Types and Acronyms

Currently, OLIVE supports the plugin technologies listed in the following list. For operating instructions that apply to only a specific technology, refer to that section within this appendix.

  • SAD – Speech activity detection.
  • SID – Speaker identification.
  • LID – Language identification.
  • KWS – Keyword spotting.
  • QBE – Query by example based keyword spotting.
  • TID – Topic identification.
  • SDD – Speaker diarization and detection.
  • GID – Gender identification.
  • ENH – Speech and audio enhancement.

Speech Activity Detection (SAD)

SAD plugins are frame scorers that take an audio list file and annotate the presence and location of speech in each audio file in that list. In standard operation, SAD plugins produce a single output file for each input file, by default in a directory called OUTPUT in the location localanalyze was called from. Output files carry the name of the original input file, but with a new extension “.scores” – for example, audioFile1.wav will become audioFile1.wav.scores, saved inside OUTPUT/. The format of these results files is a newline separated list of numerical values representing the likelihood that each 10ms frame of the audio file contains speech. Typically, a score above 0 represents speech detection, and a score below 0 represents no speech.

SAD analysis example:

$ localanalyze /plugins/sad-dnn-v4/domains/ptt-v1/ /data/sad/test/test-audio.lst

Output files:

OUTPUT/audioFile1.wav.scores
OUTPUT/audioFile2.wav.scores

Example audioFile1.wav.scores contents:

-0.22624
-0.10081
0.00925
0.12365

Alternatively, SAD plugins can be run with the --thresholds flag to have localanalyze automatically convert the frame scores to regions of speech, by applying the provided threshold.

SAD analysis example using thresholds:

$ localanalyze --thresholds=0.0 /plugins/sad-dnn-v4/domains/ptt-v1/ /data/sad/test/test-audio.lst

This will produce a single output file in the OUTPUT directory corresponding to the provided threshold: 0.0.pem. If more than one threshold is provided, a PEM file is placed into OUTPUT for each provided threshold.

Example PEM output:

/data/sad/test/audioFile1.wav 1 speech 63.110 66.060
/data/sad/test/audioFile1.wav 1 speech 66.510 69.230
/data/sad/test/audioFile1.wav 1 speech 93.480 96.090
/data/sad/test/audioFile1.wav 1 speech 96.570 100.760
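
The conversion from frame scores to these PEM speech regions amounts to thresholding the per-frame scores and merging consecutive frames that pass, at the 10 ms (100 frames/second) frame rate described earlier. A hedged sketch of that logic (frames_to_regions is an illustrative name, not OLIVE code):

```python
def frames_to_regions(scores, threshold=0.0, frame_rate=100):
    """Merge consecutive frames scoring above `threshold` into
    (start_seconds, end_seconds) speech regions."""
    regions, start = [], None
    for i, s in enumerate(scores):
        if s > threshold and start is None:
            start = i  # a speech region begins
        elif s <= threshold and start is not None:
            regions.append((start / frame_rate, i / frame_rate))
            start = None  # the region ends
    if start is not None:  # file ends mid-region
        regions.append((start / frame_rate, len(scores) / frame_rate))
    return regions
```

Each resulting (start, end) pair would then be written out with the audio path, channel, and "speech" label to form a PEM line.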

Note that if negative thresholds are to be used, it is very important to specify the thresholds using an ‘=’ character. For example, this threshold specification is valid:

--thresholds=-2.0,4.0

And this is not valid:

--thresholds -2.0,4.0

If only thresholds of 0 or above are going to be used, it is acceptable to omit the equals sign.

Speaker Identification (SID)

SID plugins are global scorers that take an audio list file and return a score for each enrolled speaker model scored against the audio in each input audio file. Generally, a score above 0 for an enrolled speaker model represents that speaker being detected in the respective audio file. In order to perform analysis on a file with a SID plugin you must first enroll one or more target speakers.

The enrollment list file for a SID plugin follows this format for each line:

<audio_file_path> <speaker_id>

An example enroll.lst:

/data/spkr_example_audio_5760.wav UIM1
/data/spkr_example_audio_5761.wav UIM1
/data/spkr_example_audio_5762.wav John
/data/spkr_example_audio_5763.wav John

Enrolling these speakers with localenroll:

$ localenroll /path/to/plugins/sid-embed-v2/domains/multi-v1/ enroll.lst

Example localanalyze call:

$ localanalyze ./plugins/sid-embed-v2/domains/multi-v1/ ./data/sid/test/testAudio.lst

By default, the output of this call is written to output.txt in the directory the command was run from.

The format of output.txt contains one line for each enrolled speaker model, for each input audio file, and the corresponding score:

<audio_file_path> <speaker_id> <score>

Example output.txt:

/data/sid/audio/file1.wav speaker1 -0.5348
/data/sid/audio/file1.wav speaker2 3.2122
/data/sid/audio/file1.wav speaker3 -5.5340
/data/sid/audio/file2.wav speaker1 0.5333
/data/sid/audio/file2.wav speaker2 -4.9444
/data/sid/audio/file2.wav speaker3 -2.6564
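
A common follow-up is to pick the best-scoring enrolled speaker per file from output.txt. The parser below is an illustrative sketch (top_scores is not an OLIVE utility):

```python
def top_scores(output_path):
    """Return {audio_path: (class_id, score)} with the highest-scoring
    enrolled model per audio file, from a global-scorer output.txt."""
    best = {}
    with open(output_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            audio, class_id, score = parts[0], parts[1], float(parts[2])
            if audio not in best or score > best[audio][1]:
                best[audio] = (class_id, score)
    return best
```

Recall that, generally, only a best score above 0 indicates the speaker was actually detected in the file.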

Trial-based Calibration Speaker Identification (SID TBC)

Trial-based Calibration SID plugins are identified by ‘tbc’ in the plugin name. They are used very similarly to a basic SID plugin, with localenroll and localanalyze invoked just as in the previous SID examples. The benefit of TBC plugins is that they allow OLIVE to perform calibration at test time, based on the actual data conditions encountered, rather than being forced to use a single, global calibration model trained a priori. The basics of TBC enrollment and testing follow the previous SID examples; the additional options and outputs available to TBC are detailed below.

The standard approach to calibration uses a “one size fits all” calibration model based on the developer’s best understanding of potential operating conditions. This is problematic when the user doesn’t know ahead of time what the likely conditions are, or when operating conditions vary widely. Trial-based calibration was developed as a means of providing calibration that is responsive to the particular conditions of a trial, adapting its calibration model based on the conditions encountered.

We have developed two ways to do this. The first draws from a pool of available data (either provided by the developer or augmented with user-provided data) and uses measures of the conditions found within this data and the trial conditions to build an ideal calibration set on the fly. This approach can also determine when a trial CANNOT be calibrated, and measure the success of calibration when it is possible. Its clear downside is that it is quite slow. The second approach uses a DNN that has learned to predict both calibration parameters and confidence from large sets of trials and available calibration data. This approach is very fast (about 5000 times faster than the first) but has the downside that the calibration set cannot be expanded with the user’s data. The TBC plug-in provides both approaches, as two separate domains.

In addition to the output score file detailed in the SID section, TBC plugins have additional possible outputs.

Speech Detection Output

Segmentation files are used to label time regions in the speech signal.

We use this format for voice activity detection (VAD).

If an output_ivs_dump_path is provided as an option to localenroll or localanalyze, the system will produce this file, for each registered waveform, in a folder corresponding to the wav_id.

The format is the following:

md5sum start end (in seconds)

Example:

b5ae06002383da6c74a7b8424c5fb9282859cca36750565bfb80a13ab732fc57 0.060 0.060
b5ae06002383da6c74a7b8424c5fb9282859cca36750565bfb80a13ab732fc57 0.090 0.090
b5ae06002383da6c74a7b8424c5fb9282859cca36750565bfb80a13ab732fc57 0.110 0.170
b5ae06002383da6c74a7b8424c5fb9282859cca36750565bfb80a13ab732fc57 0.200 0.200
b5ae06002383da6c74a7b8424c5fb9282859cca36750565bfb80a13ab732fc57 0.560 3.550
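The three-column segmentation format above is easy to consume programmatically. As an illustration only (not part of the OLIVE tooling), a minimal Python sketch that sums the detected speech per waveform hash might look like:

```python
from collections import defaultdict

def total_speech_seconds(path):
    """Sum (end - start) per waveform hash from a VAD segmentation file.

    Assumes the three-column format shown above: <md5sum> <start> <end>,
    with times in seconds.
    """
    totals = defaultdict(float)
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            md5, start, end = parts
            totals[md5] += float(end) - float(start)
    return dict(totals)
```

Note that some segments in the example above have identical start and end times; those contribute zero duration under this accounting.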
Persistent I-Vectors

For both localenroll and localanalyze, if the output_ivs_dump_path is defined via an options file with the --options flag, an i-vector from each audio file is saved in this directory for re-use. This avoids unnecessary computation when the lists of wave files to be processed overlap. For instance, if the enroll and test wave file lists are identical (i.e., an exhaustive comparison of a set of files), i-vector persistence will reduce overall computation by almost a factor of 2, since i-vector extraction consumes most of the computation required for an evaluation. I-vectors will be saved in a sub-directory of output_ivs_dump_path based on the base name of the wave file.

In addition to this optional feature, the enrollment vectors are loaded into memory prior to verification, and if the md5sum of a test audio file matches one used in the enrollment process, the corresponding vector is used instead of re-processing the audio. This is possible because vector extraction is identical between enrollment and verification.
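Because this reuse is keyed on md5sums of the audio content, you can estimate ahead of time how much of a test list will hit the cache by hashing both lists yourself. The helper below is a hypothetical sketch, not an OLIVE utility:

```python
import hashlib

def md5sum(path, chunk=1 << 20):
    """Compute the md5 hex digest of a file, reading in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def shared_audio(enroll_paths, test_paths):
    """Return test paths whose content matches an enrollment file by md5."""
    enrolled = {md5sum(p) for p in enroll_paths}
    return [p for p in test_paths if md5sum(p) in enrolled]
```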

Trial-based Calibration

Trial-based calibration (TBC) does not change the way calibration works but changes the way calibration is used. It relaxes the constraint on the system developers to train a calibration model that is ideally matched to the end use conditions. Rather than train a calibration model a priori, the system postpones this training until the conditions of the particular verification trial are known to the system; a trial consists of comparing test audio to an enrolled speaker model. The goal of trial-based calibration is to use information about the trial to generate an ideal calibration set for the trial conditions using the reservoir of possible calibration audio files available. Using this set, a calibration model tailored to the conditions of the trial can be trained and used to effectively calibrate the verification score.

The TBC operation and output differ from those of traditional SID plugins; the plugin may choose to reject a trial and NOT output a score if insufficient data is available for calibrating for those conditions. For instance, the output may look similar to the following:

waves/T6_ACK2.sph T6 0.0 -inf Unable to calibrate with only 8 relevant target trials (20 needed with similarity above 3.00)
waves/T6_ACK3.sph T1 0.0 -inf Unable to calibrate with only 12 relevant target trials (20 needed with similarity above 3.00)
waves/T6_ACK3.sph T2 0.0 -inf Unable to calibrate with only 3 relevant target trials (20 needed with similarity above 3.00)
waves/T6_ACK3.sph T3 0.0 -inf Unable to calibrate with only 0 relevant target trials (20 needed with similarity above 3.00)
waves/T6_ACK3.sph T4 0.0 -inf Unable to calibrate with only 2 relevant target trials (20 needed with similarity above 3.00)
waves/T6_ACK3.sph T5 0.0 -inf Unable to calibrate with only 8 relevant target trials (20 needed with similarity above 3.00)
waves/T6_ACK3.sph T6 0.0 -inf Unable to calibrate with only 9 relevant target trials (20 needed with similarity above 3.00)
waves/T4_ACK3.sph T5 0.96153318882 4.04821968079 Used 95 target trials in calibration
waves/T4_RC.sph T4 3.46785068512 4.07499170303 Used 95 target trials in calibration
waves/T5_Tip2.sph T5 8.90770149231 4.07352733612 Used 98 target trials in calibration
waves/T5_RC.sph T5 10.2386112213 4.03855705261 Used 47 target trials in calibration
waves/T4_Tip2.sph T4 10.8663234711 4.07404613495 Used 218 target trials in calibration
waves/T4_Tip1.sph T4 11.793006897 3.98730397224 Used 164 target trials in calibration
waves/T4_ACK2.sph T4 11.8091144562 3.90610170364 Used 119 target trials in calibration
waves/T4_ACK1.sph T4 12.2115001678 4.16342687607 Used 208 target trials in calibration
waves/T5_ACK1.sph T5 13.8099250793 3.99625587463 Used 99 target trials in calibration
waves/T5_Tip1.sph T5 14.9411458969 3.96994686127 Used 83 target trials in calibration
waves/T4_ACK3.sph T4 16.003446579 4.05554199219 Used 146 target trials in calibration

The output follows the structure:

<testwave> <modelid> <score> <confidence> <info>

When insufficient calibration segments are located, a score of 0.0 and a calibration confidence of -inf are given. In contrast, when sufficient data is found for calibration, the number of target trials used in calibration is reported.
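Downstream tooling will usually want to separate calibrated scores from rejected trials; the -inf confidence marker makes this mechanical. A minimal illustrative parser (not an OLIVE utility) might look like:

```python
import math

def split_tbc_output(path):
    """Split TBC output lines into (scored, rejected) lists.

    Each line: <testwave> <modelid> <score> <confidence> <info...>.
    A confidence of -inf marks a trial the plugin declined to calibrate.
    """
    scored, rejected = [], []
    with open(path) as f:
        for line in f:
            parts = line.split(None, 4)
            if len(parts) < 4:
                continue  # skip blank or malformed lines
            wave, model = parts[0], parts[1]
            score, conf = float(parts[2]), float(parts[3])
            info = parts[4].rstrip("\n") if len(parts) == 5 else ""
            rec = (wave, model, score, conf, info)
            (rejected if math.isinf(conf) and conf < 0 else scored).append(rec)
    return scored, rejected
```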

There are three options for applying calibration with the current plugin: DNN-assisted TBC, normal TBC, and global calibration. Each of these options uses duration information to reduce the impact of duration variation on calibration performance. Note that changing calibration domains does NOT require re-enrollment of models, as enrollment is done in a domain-independent way for any TBC-enabled plugin.

DNN-assisted Trial-based Calibration

DNN-assisted TBC is invoked by passing the tbcdnn-v1 domain to localanalyze. This newly pioneered SRI approach reduces the computation needed to apply dynamic calibration methods to speaker recognition: it operates with very low overhead compared to global calibration, yet significantly improves calibration performance in varying conditions or in conditions that differ from the development conditions.

localanalyze ... <plugin>/domains/tbcdnn-v1 test.lst
Normal Trial-based Calibration

TBC is applied by default with the 'sid-embedDnnTbc-v1' plugin. The data within the domain (such as 'tbc-v1') is used as candidate calibration data.

localanalyze ... <plugin>/domains/tbc-v1 test.lst

TBC is applied to verification scores on a trial-by-trial basis. As such, verification using TBC will run much slower than global or DNN-assisted TBC, depending on the size and make-up of the TBC data. This should be considered when using TBC in a cluster environment, where it is the number of trials (model vs. test comparisons) that determines the running time, rather than the number of test files.

Global Calibration

Each domain can be used to invoke global calibration. This is particularly useful for user-defined data as it provides a rapid means of improving calibration performance without a dramatic increase in computation time. In this case, verification will operate at a much faster pace, since TBC is essentially disabled and the global calibration model parameters are applied to all scores. In order to invoke global calibration, an optional parameter must be passed to localanalyze via an options file:

echo "[global scoring]
global_calibration = True" > options.lst
localanalyze --options options.lst ... <plugin>/domains/tbc-v1 test.lst
Optional Parameters

The TBC-based plugins offer several tunable parameters via the options parameter to localenroll or localanalyze. These can be passed to the enrollment phase or verification phase by preceding the options with the appropriate section header in an ASCII text file, as follows:

$ cat options.lst
[enrollment]
...enrollment options per line...
[global scoring]
...verification options per line...

The optional parameters and their purpose are provided below.

tbc_confidence_threshold = 3.0, # Similarity threshold for processing a trial with TBC
score_threshold = 0.0,          # Score offset subtracted from output LLRs to assist in making 0 threshold output
tgt_max = 300,                  # The maximum number of target trials used for TBC of a trial
imp_max = 3000,                 # The maximum number of impostor trials used for TBC of a trial
tgt_imp_min = 20,               # The minimum number of relevant target and impostor calibration trials needed to use TBC (rejected otherwise)
global_calibration = False,     # Apply global calibration instead of TBC
ivs_dump_path = None,           # Output path for dumping vectors and meta information
sad_threshold   = 0.5,          # Threshold for speech activity detection (higher results in less speech)
sad_filter      = 1,            # Smoothing of LLRs from SAD DNN prior to thresholding
sad_interpolate = 1,            # If > 1, a speed up of SAD by interpolating values between frames (4 works well)

Utilizing these parameters in an options file may look like this:

echo "[enrollment]
sad_threshold = 1.0
ivs_dump_path = ./embeddings
[global scoring]
sad_threshold = 1.0
ivs_dump_path = ./embeddings
tgt_max = 100
tgt_imp_min = 50" > options.lst

localenroll --options options.lst ... <plugin>/domains/<domain> enroll.lst
localanalyze --options options.lst ... <plugin>/domains/<domain> test.lst
Verification Trial Output

The format for the verification trial is the following. Note that for global calibration, the optional parameters (calibration_confidence and calibration_remarks) are not output.

Output format:

wav_id speaker_id score [calibration_confidence calibration_remarks]

Here is an example of scores produced with global calibration:

waves/T1_ACK1.sph T6 5.19274568558
waves/T1_ACK1.sph T4 1.204241395
waves/T1_ACK1.sph T5 1.69025540352

Here is an example of scores executed with DNN-assisted TBC:

waves/T1_ACK1.sph T6 5.19274568558 4.42 Used DNN-assisted TBC with confidence 5.751
waves/T1_ACK1.sph T4 1.204241395   5.12 Used DNN-assisted TBC with confidence 3.122
waves/T1_ACK1.sph T5 0.0 -inf Unable to calibrate with confidence above threshold (3.00)

Here is an example of scores executed with normal TBC:

waves/T1_ACK1.sph T6 5.19274568558 4.42 Used 67 target trials in calibration
waves/T1_ACK1.sph T4 1.204241395   5.12 Used 73 target trials in calibration
waves/T1_ACK1.sph T5 0.0 -inf Unable to calibrate with only 8 relevant target trials (20 needed with similarity above 3.00)

Speaker Diarization and Detection (SDD)

The overall goal of SDD plugins is to detect regions of speech within an audio recording that are associated with different speakers, and then identify those speakers if possible. SDD plugins have three different modes of operation, as outlined below. Changing the execution mode for SDD is done by passing an options file to localanalyze as an argument to the --options flag. The main behavior and premise of the technology and plugin remain the same, but each mode changes the format and information contained in the output file.

Running the SDD plugin is very similar to running SID plugins, with the same syntax for enrollment and testing. Currently, training or adaptation through localtrain is not supported, but enrolling new speakers and testing against enrolled speakers is as simple as:

$ localenroll /path/to/plugins/sdd-embed-v1/domains/multi-v1/ enrollmentAudio.lst 
$ localanalyze /path/to/plugins/sdd-embed-v1/domains/multi-v1/ testAudio.lst

The enrollment and test audio file lists in this example follow the same format as the lists used by SID plugins, described above. By default, if run as above with no options, the plugin will run in Speaker Detection mode, and provide the output described above. In order to run in SID or SID Exhaustive mode, you will need to provide an options file to specify that behavior:

$ localanalyze --options options.lst /path/to/plugins/sdd-embed-v1/domains/multi-v1/ testAudio.lst

Where options.lst is a text file with contents similar to:

[region scoring]
mode: SID_EXHAUSTIVE
sad_threshold: 0.0
diarization_max_num_speakers: 2

The [region scoring] header alerts the plugin that the options are being passed for scoring, and all of the parameters shown above (sad_threshold, diarization_max_num_speakers, mode) are optional. The mode option controls the output behavior; the possible values are SID, SID_EXHAUSTIVE, and SPEAKER_DETECTION, as described directly below.

The sad_threshold defaults to 2.0, and is used to fine tune the threshold for the internal speech activity detection plugin if necessary. The parameter diarization_max_num_speakers defaults to 4, and is the largest number of speakers the plugin will attempt to consider when clustering segments within the file.

SDD Execution Modes

To facilitate the understanding of each mode’s output, consider that the speech in an audio file is made up of clusters of speakers, and each cluster will have one or more contiguous segments of speech.

  • SPEAKER_DETECTION
    • The goal of Speaker Detection is to show the most probable speaker model for each segment of the input audio file. As output, this mode gives one line per segment within the file, along with the top scoring enrolled model for the cluster that segment belongs to, and that cluster's score for the given model. Note that many scores will be repeated in the output file, since each segment in the cluster shares the same score for a given speaker model. This mode is performed by default if no options file with a mode override is given.
  • SID
    • This mode is meant for triaging large amounts of audio files when the main goal is just finding which of these files may contain speech from one of the enrolled speakers. The output is the maximum score for each enrolled speaker within the audio file after scoring against each cluster in the file, as well as the timestamps for the beginning and end of the longest segment within the cluster that scored the highest for that model. This gives a specific segment to spot check and evaluate the plugin's decision if needed.
  • SID_EXHAUSTIVE
    • When using SID Exhaustive, each diarized cluster is scored against each enrolled model. The output is a complete listing: for every speech segment of the input audio file, the score from testing every enrolled model against the cluster that the segment belongs to. Many scores will be repeated in the output file, since each segment in the cluster shares the same score.

Language Identification (LID)

LID plugins are global scorers that act very similarly to SID with respect to scoring, except that each score corresponds to a language model rather than a speaker model. In most cases, LID plugins will be delivered from SRI with a set of languages already enrolled. Languages can be added to some plugins by the user if enough appropriate data is available, through the localtrain CLI call. Details on this will be added to this document in a later revision.

Example localanalyze call:

$ localanalyze /path/to/plugins/lid-embed-v2/domains/multi-v1/ testAudio.lst

Output format:

<audio_file_path> <language_id> <score>

Output example:

/data/lid/audio/file1.wav fre -0.5348
/data/lid/audio/file1.wav eng 3.2122
/data/lid/audio/file1.wav spa -5.5340
/data/lid/audio/file1.wav rus 0.5333
/data/lid/audio/file1.wav ara -4.9444
/data/lid/audio/file2.wav fre -2.6564
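Since LID emits one score per enrolled language for each file, a common post-processing step is picking the top-scoring language per file. A minimal illustrative sketch (not an OLIVE utility) for the three-column output above:

```python
def top_language_per_file(path):
    """Pick the highest-scoring language for each audio file.

    Expects lines of <audio_file_path> <language_id> <score>.
    Returns {audio_file: (language_id, score)}.
    """
    best = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            audio, lang, score = parts[0], parts[1], float(parts[2])
            if audio not in best or score > best[audio][1]:
                best[audio] = (lang, score)
    return best
```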

Keyword Spotting (KWS)

KWS is an automatic speech recognition (ASR) based approach to detecting spoken keywords in audio. Rather than enrolling target keywords from audio, as you would with query-by-example, telling the plugin what keywords to search for is done by passing an IDs file to localanalyze. The format of the IDs file is:

IDs List Format:

<id_1>
<id_2 (opt)>
<id_N (opt)>

A Keyword Spotting IDs example, search_list.lst:

remote
torpedo
voice recognition

Example KWS localanalyze call with the --class_ids argument:

$ localanalyze --class_ids search_list.lst /path/to/plugins/kws-batch-v9/domains/eng-tel-v1/ /data/kws/test/testAudio.lst

The output format for KWS plugins is identical to that of QBE. It is written to output.txt by default and follows this format:

<audio_file_path> <start_time_s> <end_time_s> <keyword_id> <score>

Example:

/data/kws/testFile1.wav 7.170 7.570 remote 1.0
/data/kws/testFile1.wav 10.390 10.930 remote 0.693357
/data/kws/testFile1.wav 1.639 2.549 voice recognition 1.0
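When triaging KWS output, it is often useful to keep only hits above a score threshold. Because keyword IDs may contain spaces (e.g. "voice recognition"), a parser should treat everything between the end time and the trailing score as the keyword. A hypothetical sketch, not an OLIVE utility:

```python
def keyword_hits(path, min_score=0.5):
    """Filter KWS output to hits at or above min_score.

    Expects <audio> <start> <end> <keyword...> <score> per line; the
    keyword is everything between the end time and the trailing score.
    """
    hits = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 5:
                continue  # skip blank or malformed lines
            audio, start, end = parts[0], float(parts[1]), float(parts[2])
            score = float(parts[-1])
            keyword = " ".join(parts[3:-1])
            if score >= min_score:
                hits.append((audio, start, end, keyword, score))
    return hits
```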

Automatic Speech Recognition (ASR)

Automatic Speech Recognition plugins perform speech-to-text conversion of speech contained within a submitted audio segment to create a transcript of what is being said. Currently the outputs are based on word-level transcriptions. ASR plugins do not perform any translation; they simply convert speech to text in the native language. All ASR domains are language-dependent, and each one is meant to work only with a single, specific language.

ASR plugins do not require any enrollment of words (like QBE) or specification of words of interest (like traditional KWS), but instead rely on the vocabulary model built into the domain to define the list of available words. All that is necessary for scoring an audio file for ASR is a list of input files to be scored, which follows the format below.

Generic input audio list format:

<audioFile_1>
<audioFile_2 (opt)>
...
<audioFile_N (opt)>

A specific example of this, called testAudio.lst, might look like:

/data/asr/testFile1.wav
/data/asr/testFile2.wav
/data/asr/testFile3.wav

Note that if the files are not contained within the directory that localanalyze is being run from, or if a relative path from that location is not provided, the full file path to each file is necessary.

An example ASR localanalyze call:

$ localanalyze /home/user/oliveAppData/plugins/asr-dynapy-v1/domains/eng-tdnnChain-tel-v1/ /data/asr/test/testAudio.lst

The output format for ASR plugins is identical to that of QBE and other region-scoring OLIVE plugins. ASR plugins are region scorers, and as such return a list of words in the order they are spoken. Each detected word consists of a timestamp region in seconds (a start and end time pair) with an accompanying score, along with the 'value' of that word. Each output word must be part of the vocabulary that the specific language's domain was trained with. At this point, out-of-vocabulary words are not supported, so uncommon words, slang, names, and some other vocabulary may not be recognized by these plugins.

Note that all current ASR plugin domains output words in their 'native' script. For languages like English and Spanish, each word will be in ASCII text, in the Latin alphabet. For Mandarin Chinese, Russian, and Farsi, however, words will be composed of Unicode characters in the native script.

Output is written to output.txt by default and follows this format:

<audio_file_path> <start_time_s> <end_time_s> <word> <score>

An example in English:

/data/asr/testEnglish1.wav 0.000 0.190 and 43.00000000
/data/asr/testEnglish1.wav 0.210 0.340 we're 44.00000000
/data/asr/testEnglish1.wav 0.330 0.460 going 97.00000000
/data/asr/testEnglish1.wav 0.450 0.520 to 97.00000000
/data/asr/testEnglish1.wav 0.510 0.940 fly 66.00000000
/data/asr/testEnglish1.wav 1.080 1.300 was 31.00000000
/data/asr/testEnglish1.wav 1.290 1.390 that 24.00000000
/data/asr/testEnglish1.wav 1.290 1.390 it 22.00000000
/data/asr/testEnglish1.wav 1.380 1.510 we're 27.00000000
/data/asr/testEnglish1.wav 1.500 1.660 going 97.00000000
/data/asr/testEnglish1.wav 1.650 1.720 to 98.00000000
/data/asr/testEnglish1.wav 1.710 1.930 fly 94.00000000
/data/asr/testEnglish1.wav 1.920 2.110 over 79.00000000
/data/asr/testEnglish1.wav 2.100 2.380 saint 93.00000000
/data/asr/testEnglish1.wav 2.370 2.950 louis 96.00000000

An example output excerpt for a Mandarin Chinese domain:

/data/asr/testMandarin1.wav 0.280 0.610 战斗 99.00000000
/data/asr/testMandarin1.wav 0.600 0.880 爆发 98.00000000
/data/asr/testMandarin1.wav 0.870 0.970 的 99.00000000
/data/asr/testMandarin1.wav 0.960 1.420 居民区 86.00000000
/data/asr/testMandarin1.wav 1.410 2.120 有很多 93.00000000
/data/asr/testMandarin1.wav 2.110 2.590 忠于 99.00000000
/data/asr/testMandarin1.wav 2.580 3.140 萨德尔 100.00000000
/data/asr/testMandarin1.wav 3.130 3.340 的 100.00000000
/data/asr/testMandarin1.wav 3.330 3.720 武装 55.00000000
/data/asr/testMandarin1.wav 3.710 4.190 份子 53.00000000

Note that for languages that read from right to left, the direction that text is rendered may appear to 'flip' when viewing the bare text output in a terminal or text editor that doesn't properly deal with the orientation switch mid-line in some operating systems. This can cause the order of the 'word' and 'score' fields to reverse relative to the output of left-to-right read languages, and appear like this Farsi output example:

/data/asr/testFarsi1.wav 0.000 0.480 58.00000000 خوب 
/data/asr/testFarsi1.wav 0.470 0.740 51.00000000 ای 
/data/asr/testFarsi1.wav 0.000 0.320 100.00000000 آره 
/data/asr/testFarsi1.wav 0.310 0.460 99.00000000 می 
/data/asr/testFarsi1.wav 0.450 0.680 99.00000000 گم 
/data/asr/testFarsi1.wav 0.670 0.880 73.00000000 چند 
/data/asr/testFarsi1.wav 0.870 1.330 50.00000000 داره

This is a rendering problem only; if interacting with OLIVE through the API, all ordering is properly preserved. Most methods of automatically parsing the raw text output will also handle the ordering correctly, as column-based operators like awk are not affected by the visual order.
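Since the words arrive one per line, already in spoken order, the word-level output can be folded back into a plain transcript by column position, independent of how the terminal renders the script. A minimal illustrative sketch (not an OLIVE utility):

```python
def transcript(path, audio_file):
    """Rebuild a plain-text transcript for one audio file from ASR output.

    Expects <audio> <start> <end> <word> <score> per line, in spoken
    order, so joining the fourth column in file order suffices.
    """
    words = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 5 and parts[0] == audio_file:
                words.append(parts[3])
    return " ".join(words)
```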

Query by Example Keyword Spotting (QBE)

Query by example is a specific type of keyword spotting plugin that searches for keywords matching a spoken word or phrase example, rather than from a text example like traditional KWS. This means that it is necessary to enroll keywords into the system from audio examples with localenroll before using QBE to search audio for these keywords or phrases. Enrollment follows the same format as enrolling speakers into a SID plugin, with the enrollment audio list following this format:

<audio_file_path> <keyword_id>

Example:

/data/qbe/enroll/watermelon_example1.wav Watermelon
/data/qbe/enroll/watermelon_example2.wav Watermelon
/data/qbe/enroll/airplane_example.wav Airplane
/data/qbe/enroll/keyword_example.wav Keyword

Note that currently each enrollment audio file must contain ONLY the keyword that is desired to be enrolled. Also note that the text label in the second column of the enrollment file is only for user readability and is not used by the system when determining what to search the audio file for.

The output of QBE follows the format of the traditional KWS output exactly:

<audio_file_path> <start_time_s> <end_time_s> <keyword_id> <score>

Example:

/data/qbe/test/testFile1.wav 0.630 1.170 Airplane 4.37324614709
/data/qbe/test/testFile2.wav 0.350 1.010 Watermelon -1.19732598006

Topic Identification (TID)

Topic Identification plugins attempt to detect and categorize the topic being discussed within an audio recording from a known and pre-enrolled set of available topics, and report this topic (if any are detected) to the user. Each domain of a TID plugin is language-dependent, and should display the target language as the first string of the domain’s name.

Some TID plugins may be delivered with pre-enrolled topics – please consult the documentation that accompanied your delivery if you are unsure if this is the case. If no topics are enrolled, however, or if you wish to enroll new topics, the format is very similar to enrolling new speakers into a SID plugin, and follows the same CLI call structure, with one caveat.

$ localenroll --local $domain $enroll_list

Due to limitations with the current TID technology, enrollment must be performed with the --local flag set. This limits OLIVE to serialized processing, which will process the enrollment slightly slower, but avoid resource-competition issues that may cause the system to become unstable and crash.

Enrollment audio lists for TID follow the same format as SID, but substitute a topic name for a speaker label. Note: in the current version, we require the user to provide audio examples that are primarily about the topic of interest as enrollment examples. If there are significant portions of an audio file that are off-topic, we suggest the file be cut and fed as separate examples.

Enroll list format:

<audio_file_path> <topic_id>

Example:

/tid/enroll/topic_example_audio_5760.wav Travel
/tid/enroll/topic_example_audio_5761.wav Travel
/tid/enroll/topic_example_audio_5762.wav Travel
/tid/enroll/topic_example_audio_5763.wav Travel

To run TID, once target topics have been enrolled, the call to localanalyze is very similar to other plugin types, with the current plugin again requiring the --local flag.

$ localanalyze --local /path/to/plugins/tid-svm-v2/domains/r-tel-v1/ testAudio.lst

As with the SID and LID plugins, by default the plugin’s output will be written to the file “output.txt” in the directory localanalyze was called from. This can be overridden by passing localanalyze the -o flag, as well as an alternate file to save the results to. The TID results structure is very similar to KWS, with the following format for each line:

<audio_file_path> <start_time_s> <end_time_s> <topic_id> <confidence_score>

Example:

/data/tid-example_wavs/20110511_GET-TOGETHER.wav 18.790 55.300 transportation 0.0222
/data/tid-example_wavs/20110511_TRANSPORTATION.wav 4.010 19.140 transportation 0.4532

The start and end timestamps above are provided in seconds. The <topic_id> will be displayed as it was supplied in the second column of the enrollment list. The <confidence_score> will be between 0 and 1, and marks the confidence of the system in the decision for this topic.

Please note that output.txt will be overwritten by each successive experiment. Please back it up or use the -o option to localanalyze if you would like to save the results.

Also note that the start and end times for each topic refer to the chunk in each audio that has the highest probability of being about that topic. Currently, we report only ONE such segment per file in order to help the user locate the most useful part. The score associated with that segment is global, in that it represents the likelihood that this topic is present anywhere in the document.

Important Background Example Information

In order to train a topic detector, we currently use an SVM classifier. This model uses "positive" examples of the topic as provided by the user during the enrollment phase, as well as "negative" examples to model what is not the topic. Those negative examples can be crucial to the performance of the final system. Currently, those "negative" examples come pre-processed as a Python NumPy archive and cannot be modified by the user explicitly.

We do provide two different numpy archives that can be tried by a user:

  1. BG_RUS001-train-acc+neg-no-travel.npz (default)
  2. BG_RUS001-train-test-random-all-chunk-plugin.npz

Archive 2) includes only data from the RUS001 conversational corpus, which didn't have very topic-specific prompts. Archive 1) includes a mix of RUS001 data as well as a subset of the RU_CTS conversational corpus which was topic annotated.

We excluded examples pertaining to 'TRAVEL' in this archive, but this archive contains conversations about the following (loosely defined) topics:

  • ACTIVITIES
  • BIRTHDAY_WISHES
  • CHILDREN
  • ECONOMY
  • EDUCATION
  • ENTERTAINMENT
  • FOOD_DRINK
  • FRIENDS_RELATIVES
  • GET-TOGETHER
  • HEALTH
  • HOME
  • HOME_MAINTENANCE
  • IMMIGRATION
  • LANGUAGE_COMMUNICATION
  • LEISURE
  • LIFE_PHILOSOPHY_RELATIONSHIPS
  • LOCATION_DESCRIPTION
  • MARRIAGE
  • MOOD_PHYSICAL
  • MOVING_HOMES
  • MUSIC
  • PERFORMANCE_REHEARSAL
  • PETS
  • POLITICS
  • PROJECT
  • READING_WRITING
  • RELIGION_HOLIDAY
  • SPEECH_COLLECTION_PROJECT
  • TECHNOLOGY
  • TRANSPORTATION
  • TV_MOVIES
  • WEATHER_CLIMATE
  • WORK

If the topic you are training for is very similar to a topic listed above, it may be worth trying archive (2) as well. In the future, we will provide the ability for users to supply their own negative examples.

Gender Identification (GID)

Gender ID plugins allow for triage of audio files to identify only those containing speakers of a certain gender. For scoring files, gender identification plugins operate in the same manner as SID and LID plugins. GID plugins are delivered with two pre-enrolled classes: ‘m’ and ‘f’, for male and female, respectively, so user-side enrollment is not necessary. To score files with a GID plugin using localanalyze, use the following syntax:

$ localanalyze /path/to/plugins/gid-gb-v1/domains/clean-v1/  testAudio.lst

Where the output follows this format:

<audio file 1> m <male likelihood score>
<audio file 1> f <female likelihood score>
… 
<audio file N> m <male likelihood score>
<audio file N> f <female likelihood score>

Example:

/data/gender/m-testFile1.wav m 0.999999777927
/data/gender/m-testFile1.wav f -0.22073142865
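Because each file receives exactly one 'm' and one 'f' score, a per-file gender decision reduces to comparing the pair. A hypothetical sketch of that triage step, not an OLIVE utility:

```python
def gender_decisions(path):
    """Decide 'm' or 'f' per file by comparing the two likelihood scores.

    Expects paired lines of <audio> <m|f> <score>; returns
    {audio_file: 'm' or 'f'} for files that have both scores.
    """
    scores = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            scores.setdefault(parts[0], {})[parts[1]] = float(parts[2])
    return {a: max(s, key=s.get)
            for a, s in scores.items() if {"m", "f"} <= s.keys()}
```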

Enhancement (ENH)

Enhancement or AudioConverter plugins are audio-to-audio plugins that take an audio file as input and provide a second audio file as output. Currently they are used to enhance the input audio file, for listening comfort and/or intelligibility. By default, each output audio file is created in an OUTPUT directory in the location that localanalyze was invoked. Within the OUTPUT directory, the folder structure of the original input audio file is preserved.

This means that if the input audio file was:

/data/enhancement-test/test_file1.wav

Then the enhanced output file will be found by default in:

./OUTPUT/data/enhancement-test/test_file1.wav
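This input-to-output path mapping can be computed ahead of time when scripting over many files. A minimal sketch of the mapping described above, assuming the default OUTPUT directory and POSIX-style paths (this helper is illustrative, not part of OLIVE):

```python
import os

def enhanced_output_path(input_path, output_root="OUTPUT"):
    """Map an input audio path to the default enhancement output location.

    The plugin recreates the input file's directory structure under an
    OUTPUT directory in the current working directory.
    """
    relative = input_path.lstrip(os.sep)  # drop the leading '/' so the path nests
    return os.path.join(".", output_root, relative)
```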

Running speech enhancement is as simple as:

$ localanalyze /path/to/plugins/enh-mmse-v1/domains/multi-v1/ inputAudio.lst

In addition, you can pass an optional PEM file that specifies optional regions to the plugin – the current enhancement plugin uses this file to pass the ‘noise regions’ to OLIVE to allow the plugin to build a noise profile for more accurate characterization and removal of the noise present in that audio file.

6: Testing

In this section you will find a description of benchmarking that is performed with each release of OLIVE to measure the performance of each plugin with respect to speed and memory usage. The hardware and software details used for these results are provided in the next section, with each of the current plugins’ memory and speed results following.

a. Benchmarking Setup (Hardware/Software)

Each data point below was obtained by running the OLIVE 4.9.1 software with a 4.8.0 runtime that has been patched to 4.9.1 (contains libraries needed by the new Topic Identification and Enhancement plugins). Tests were performed on the CentOS 7 operating system, on a Gigabyte BRIX GB-BXI7-5500. This machine has 16GB RAM available, and an Intel i7-5500U processor, which is a dual core (quad-thread) processor that runs at 2.4 GHz base (3.0 GHz turbo).

b. Plugin Memory Usage Results

These results were generated by running several files through each plugin via the OLIVE CLI using localanalyze, one at a time, while measuring the memory used by the localanalyze utility. The files vary in length from 10 seconds through 4 hours, and this allows us to see how each plugin handles scaling the audio file length up, and also compare overall resource utilization between individual plugins. The values reported for each plugin and audio file are the peak memory usage, in MB – lower values are better. Note that this report is for processing a single file at a time and is representative of memory utilization that can be expected for serialized processing, or processing on a machine with a single processing core. Parallel processing can cause memory usage to rise.

TODO: Need to put the charts/results in here!

SAD

SAD Memory Usage (MB)
Plugin sad-dnn-v4 sad-dnn-v4 sad-dnn-v4
Domain digPtt-v1 ptt-v1 tel-v1
10s 142 142 142
1 min 158 158 158
10 min 221 221 221
30 min 408 408 408
2 hr 1,262 1,262 1,262
4 hr 2,377 2,377 2,377

c. Plugin Speed Analysis Results

The following charts show the speed performance of the current release of each OLIVE plugin. Values are reported as the speed of the plugin in ‘times faster than real time’ and represent how fast the plugin is able to process the input audio data, with respect to the length of that data – higher is better. Each plugin is fed 10 hours of total data consisting of roughly 4-minute audio cuts to achieve this measurement. For this test, OLIVE has been limited to using a single core for processing, in order to keep measurements and results consistent. Note that enabling parallel processing if multiple CPU cores are available will improve performance.

Plugin Speed Statistics Reported in Times Faster than Real Time

Plugin Domain Speed (x RT)
sad-dnn-v4 digPtt-v1 104.8
sad-dnn-v4 ptt-v1 111.5
sad-dnn-v4 tel-v1 117.0
sid-embed-v2 multi-v1 51.6
sid-embedDnnTbc-v1 tbc-v1 43.9
sid-embedDnnTbc-v1 tbcdnn-v1 64.5
sid-embedTbc-v1 tbc-v1 40.2
lid-embed-v2 multi-v1 39.7
kws-batch-v9 eng-tel-v1 1.11
kws-batch-v9 eng-tel-v2 1.09
kws-batch-v9 f-tel-v1 1.23
kws-batch-v9 r-tel-v1 1.61
kws-batch-v9 r-tel-v2 1.61
qbe-tdnn-v4 digPtt-v1 15.2
qbe-tdnn-v4 multi-v1 16.1