emo-whisper-commercial-v1 (Emotion Detection)

Version Changelog

Plugin Version	Change
v1.0.0	Initial plugin release, with OLIVE 6.0.0

Description

The goal of emotion detection is to determine which emotion label best represents the emotion portrayed in the speech signal being analyzed. This plugin performs global scoring (one score per emotion label per input file), which assumes that a single emotion is represented across the entire speech signal.

Internally, the plugin relies on embeddings extracted from a Whisper Medium model, and therefore requires a GPU to operate faster than real time. Embeddings are compared using a Gaussian backend that produces comparison scores as likelihood ratios. The plugin also supports enrolling user-labelled emotion data, which either complements existing emotion classes or adds new emotion classes to be detected.
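To illustrate how a Gaussian backend can turn an embedding into per-class scores, the following is a minimal sketch and not the plugin's actual implementation. It assumes a shared diagonal covariance, toy two-dimensional embeddings, and scores expressed as log-likelihood ratios against the average of the competing classes; the names log_gaussian, log_likelihood_ratios, class_means and shared_var are hypothetical.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_likelihood_ratios(x, class_means, shared_var):
    """Score x for each class against the average of the other classes."""
    ll = {label: log_gaussian(x, mean, shared_var)
          for label, mean in class_means.items()}
    scores = {}
    for label, target in ll.items():
        others = [v for k, v in ll.items() if k != label]
        # Log of the mean competing likelihood, via log-sum-exp for stability.
        peak = max(others)
        alt = peak + math.log(sum(math.exp(v - peak) for v in others) / len(others))
        scores[label] = target - alt
    return scores


# Toy values (assumption): real Whisper embeddings have far more dimensions.
class_means = {"happy": [1.0, 0.0], "sad": [-1.0, 0.0], "neutral": [0.0, 1.0]}
shared_var = [1.0, 1.0]
scores = log_likelihood_ratios([0.9, 0.1], class_means, shared_var)
```

In this sketch a positive score favors the class over its competitors and a negative score favors the competitors, which is one way that scores for some labels can come out negative.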

Pre-enrolled Emotion Classes

Out of the box, the plugin has the following classes available to report scores against:

angry
disgust
fear
happy
neutral
other
sad
surprise
tired

Domains

embed-v1
- General purpose domain for emotion detection.

Functionality (Traits)

The functions of this plugin are defined by its Traits and implemented API messages. These Traits are listed below, along with the corresponding API messages for each.

GLOBAL_SCORER – Score all submitted audio, returning emotion labels and corresponding scores in the form of a likelihood ratio. Scores are output for every emotion label, including labels whose score is negative.
- GlobalScorerRequest
CLASS_MODIFIER – Enroll new emotion models or augment existing emotion models with additional data.
- ClassModificationRequest
- ClassRemovalRequest
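Because a global scoring result contains one score per emotion label, a client typically ranks the labels before reporting them. The sketch below assumes the scores have already been read into a plain dictionary; it does not use the OLIVE API, and rank_emotions and example_scores are hypothetical names with made-up values.

```python
def rank_emotions(scores):
    """Return (label, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up scores for illustration only; negative values are kept, not dropped.
example_scores = {"angry": -1.2, "happy": 2.4, "neutral": 0.3, "sad": -3.1}
ranked = rank_emotions(example_scores)
top_label, top_score = ranked[0]
```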

Compatibility

OLIVE 6.0+

Limitations

Known or potential limitations of the plugin are outlined below.

Processing Speed

Because this plugin uses a medium-sized Whisper network, a GPU is required to process faster than real time. A CPU can be used, but processing will be significantly slower than real time.

Minimum Speech Duration

The system will only attempt to perform emotion detection on segments of speech that are longer than a minimum duration (configurable as min_speech; 0.2 seconds by default). It is highly recommended to use more than 10 seconds of speech, because emotion detection is not a high-performing technology.
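A pre-check like the following can flag audio that is unlikely to be scored, or that falls short of the recommended duration, before it is submitted. This is a sketch under stated assumptions: speech regions are given as (start, end) pairs in seconds, the check is applied to the total speech duration, and the names total_speech and check_speech are hypothetical. How the plugin itself measures speech duration is not specified here.

```python
MIN_SPEECH = 0.2            # plugin default for min_speech, in seconds
RECOMMENDED_SPEECH = 10.0   # recommended amount of speech, in seconds


def total_speech(segments):
    """Sum the durations of (start, end) speech regions, in seconds."""
    return sum(end - start for start, end in segments)


def check_speech(segments, min_speech=MIN_SPEECH):
    """Classify audio as 'skip', 'short', or 'ok' by its amount of speech."""
    duration = total_speech(segments)
    if duration <= min_speech:
        return "skip"    # not longer than min_speech: would not be scored
    if duration <= RECOMMENDED_SPEECH:
        return "short"   # scorable, but below the recommended duration
    return "ok"
```

For example, check_speech([(0.0, 4.0), (5.0, 8.0)]) reports "short" because only 7 seconds of speech are present.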

Comments

GPU Support

This plugin was designed and developed to run optimally on GPU hardware. It is capable of running on CPU in the absence of an available GPU or the proper configuration, but it will do so at a significantly reduced speed.

Please refer to the OLIVE GPU Installation and Support documentation page for instructions on how to enable and configure GPU capability in supported plugins. By default this plugin will run on CPU only.

Enrollment Of User-Defined Emotions

While the emotion models within the plugin aim to be of general use, a user can enroll domain-specific audio with emotion labels to fine-tune the system for their purpose. In this case, the emotion labels used when enrolling must be identical to those already in the system. Otherwise, they will be treated as new emotions, and if they in fact represent an existing emotion, this will cause significant confusion in the model.

Additionally, a user may wish to enroll new emotions beyond the base emotions. In this case, it is suggested to use at least 30 samples per emotion, drawn from a variety of speakers (at least 15). Enrolling or augmenting at least two emotions is recommended, so that the user's domain is represented by some variation of emotions. This prevents the model from treating all audio from the new domain as the single emotion that was augmented.
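The enrollment guidance above can be checked before any audio is submitted. The following sketch assumes the enrollment set is described as (label, speaker_id) pairs; review_enrollment is a hypothetical helper, not part of OLIVE. It applies the 30-sample and 15-speaker suggestions to labels that are not pre-enrolled, and flags labels that differ from a pre-enrolled label only by case or surrounding whitespace.

```python
from collections import defaultdict

PRE_ENROLLED = {"angry", "disgust", "fear", "happy", "neutral",
                "other", "sad", "surprise", "tired"}
MIN_SAMPLES = 30    # suggested samples per new emotion
MIN_SPEAKERS = 15   # suggested distinct speakers per new emotion


def review_enrollment(samples):
    """Return a list of warnings for an iterable of (label, speaker_id) pairs."""
    counts = defaultdict(int)
    speakers = defaultdict(set)
    for label, speaker in samples:
        counts[label] += 1
        speakers[label].add(speaker)

    warnings = []
    for label in sorted(counts):
        if label in PRE_ENROLLED:
            continue  # augments an existing class; label matches exactly
        normalized = label.strip().lower()
        if normalized in PRE_ENROLLED:
            warnings.append(f"'{label}' is not identical to pre-enrolled "
                            f"'{normalized}' and would create a new class")
        if counts[label] < MIN_SAMPLES:
            warnings.append(f"'{label}' has fewer than {MIN_SAMPLES} samples")
        if len(speakers[label]) < MIN_SPEAKERS:
            warnings.append(f"'{label}' has fewer than {MIN_SPEAKERS} speakers")
    if len(counts) < 2:
        warnings.append("fewer than two emotions are enrolled or augmented")
    return warnings
```

An empty list means the set meets the suggestions as interpreted here; it does not guarantee good detection performance.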

Global Options

The following options are available to this plugin and can be adjusted in the plugin's configuration file, plugin_config.py.

Option Name	Description	Default	Expected Range
sad_threshold	Detection threshold: a higher value results in fewer detections being output, but of higher reliability.	0.0	-10.0 to 10.0
min_speech	The minimum duration of speech, in seconds, that a segment must contain in order to be scored/analyzed for the presence of enrolled emotions.	0.2	0.2 to 15.0
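When overriding these options, it can help to check values against the documented ranges first. The sketch below is only an illustration: it does not reflect the actual layout of plugin_config.py, and OPTION_RANGES, DEFAULTS and validate_options are hypothetical names. The defaults and ranges are taken from the table above.

```python
# Documented ranges and defaults from the options table.
OPTION_RANGES = {
    "sad_threshold": (-10.0, 10.0),
    "min_speech": (0.2, 15.0),
}
DEFAULTS = {"sad_threshold": 0.0, "min_speech": 0.2}


def validate_options(overrides):
    """Merge overrides into the defaults, rejecting unknown or out-of-range values."""
    merged = dict(DEFAULTS)
    for name, value in overrides.items():
        if name not in OPTION_RANGES:
            raise KeyError(f"unknown option: {name}")
        low, high = OPTION_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low} to {high}")
        merged[name] = value
    return merged
```

For example, validate_options({"min_speech": 1.0}) keeps the default sad_threshold and raises the minimum speech duration to 1 second.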