emo-whisper-commercial-v1 (Emotion Detection)
Version Changelog
| Plugin Version | Change |
|---|---|
| v1.0.0 | Initial plugin release, with OLIVE 6.0.0 |
Description
The goal of emotion detection is to determine which emotion label best represents the emotion portrayed in the speech signal being analyzed. This plugin provides global scoring (one score per emotion label per input file), which assumes that a single emotion is represented across the entire speech signal.
Internally, the plugin relies on embeddings extracted from a Whisper Medium model, and therefore requires a GPU to operate faster than real time. Embeddings are compared using a Gaussian Backend, which produces comparison scores as likelihood ratios. The plugin can also enroll user-labelled emotion data, which either complements existing emotion classes or adds new emotion classes to be detected.
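To make the scoring step more concrete, the following is a minimal sketch of a Gaussian backend in Python. It is not the plugin's implementation: the function names (`fit_gaussian_backend`, `log_likelihood_ratios`), the toy 2-D vectors standing in for Whisper embeddings, the shared diagonal variance, and the choice to compare each class against the average likelihood of the other classes are all assumptions made for illustration. The scores it returns are log-likelihood ratios.

```python
import math
from collections import defaultdict

def fit_gaussian_backend(embeddings, labels):
    """Estimate one mean vector per class and a diagonal variance shared by all classes."""
    by_class = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        by_class[label].append(vector)
    dim = len(embeddings[0])
    means = {
        label: [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
        for label, vectors in by_class.items()
    }
    # Pool squared deviations from each class mean; the floor avoids division by zero.
    variance = [
        max(1e-6, sum((v[d] - means[l][d]) ** 2
                      for v, l in zip(embeddings, labels)) / len(embeddings))
        for d in range(dim)
    ]
    return means, variance

def log_likelihood_ratios(x, means, variance):
    """Score x per class: log p(x | class) minus log of the mean p(x | other classes)."""
    # With a shared variance, the Gaussian normalising constant cancels in the ratio.
    loglik = {
        label: -0.5 * sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, variance))
        for label, mean in means.items()
    }
    scores = {}
    for label, value in loglik.items():
        others = [v for k, v in loglik.items() if k != label]
        peak = max(others)  # subtract the peak before exponentiating for stability
        log_mean_others = peak + math.log(sum(math.exp(v - peak) for v in others) / len(others))
        scores[label] = value - log_mean_others
    return scores

# Toy 2-D vectors standing in for real embeddings (made up for illustration).
train = [[0.0, 0.2], [0.3, -0.1], [-0.2, 0.1], [3.0, 3.1], [2.8, 3.3], [3.2, 2.9]]
train_labels = ["neutral"] * 3 + ["angry"] * 3
means, variance = fit_gaussian_backend(train, train_labels)
scores = log_likelihood_ratios([2.9, 3.1], means, variance)
```

In this toy setup the test vector lies close to the "angry" examples, so its "angry" score is positive and its "neutral" score is negative. Adding labelled vectors for a new class to the training data simply adds one more mean, which is the intuition behind enrolling new emotion classes.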
Pre-enrolled Emotion Classes
Out of the box, the plugin has the following classes available to report scores against:
- angry
- disgust
- fear
- happy
- neutral
- other
- sad
- surprise
- tired
Domains
- embed-v1
  - General purpose domain for emotion detection.
Functionality (Traits)
The functions of this plugin are defined by its Traits and implemented API messages. The Traits supported by this plugin are listed below, with a short description of each.
- GLOBAL_SCORER – Score all submitted audio, returning emotion labels and corresponding scores in the form of likelihood ratios. Scores are output for all emotion labels, including those with negative scores.
- CLASS_MODIFIER – Enroll new emotion models or augment existing emotion models with additional data.
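Because GLOBAL_SCORER returns a score for every emotion label, a caller typically ranks the labels itself. The sketch below assumes the scores for one file have already been collected into a plain dictionary; that dictionary, the `rank_emotions` helper, and the score values are all hypothetical and do not represent the OLIVE API or real plugin output.

```python
def rank_emotions(scores):
    """Return (label, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up scores for a single input file, for illustration only.
example_scores = {"angry": 2.1, "happy": -0.4, "neutral": 0.7, "sad": -3.2}

ranked = rank_emotions(example_scores)
top_label, top_score = ranked[0]
```

Negative scores remain in the ranked list, so the caller decides how many labels to keep.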
Compatibility
OLIVE 6.0+
Limitations
Known or potential limitations of the plugin are outlined below.
Processing Speed
As this plugin uses a medium-size Whisper network, a GPU is required to process faster than real time. A CPU can be used, but processing time will be significantly longer than real time.
Minimum Speech Duration
The system will only attempt to perform emotion detection on segments of speech that are longer than a minimum duration (configurable as min_speech, 0.2 seconds by default). Because emotion detection is not a high-performing technology, it is highly recommended to provide more than 10 seconds of speech.
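The two duration limits above can be sketched as a simple pre-check. This is an illustration only: the `speech_duration_status` helper and its return values are hypothetical, and it assumes the amount of detected speech (in seconds) is already known to the caller.

```python
MIN_SPEECH = 0.2           # default min_speech, in seconds
RECOMMENDED_SPEECH = 10.0  # recommended amount of speech, in seconds

def speech_duration_status(speech_seconds, min_speech=MIN_SPEECH):
    """Classify a speech duration against the minimum and recommended amounts."""
    if speech_seconds <= min_speech:
        return "too_short"          # would not be scored
    if speech_seconds <= RECOMMENDED_SPEECH:
        return "below_recommended"  # scored, but results may be less reliable
    return "ok"
```

For example, `speech_duration_status(5.0)` reports that the audio would be scored but falls short of the recommended amount.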
Comments
GPU Support
This plugin was designed and developed to run optimally on GPU hardware. It is capable of running on CPU in the absence of an available GPU or the proper configuration, but it will do so at a significantly reduced speed.
Please refer to the OLIVE GPU Installation and Support documentation page for instructions on how to enable and configure GPU capability in supported plugins. By default this plugin will run on CPU only.
Enrollment Of User-Defined Emotions
While the emotion models within the plugin aim to be of general use, a user can enroll domain-specific audio with emotion labels to fine-tune the system for their purpose. In this case, the emotion labels used when enrolling must be identical to those already existing in the system. Otherwise, they will be treated as a new emotion, and if they in fact represent the same emotion, this will cause significant confusion in the model. Additionally, a user may wish to enroll new emotions beyond the base emotions. In this case, it is suggested to use at least 30 samples per emotion from a variety of speakers (at least 15). Enrolling or augmenting at least two emotions is recommended, in order to provide some variation of emotions for the user's domain and to prevent the model from treating all audio from the new domain as the single emotion that was augmented.
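The enrollment guidance above can be turned into a quick check to run before submitting data. The sketch below is an assumption-laden illustration: `review_enrollment` and the `(label, speaker_id)` input format are hypothetical and not part of the plugin; only the class names and the 30-sample, 15-speaker, and two-emotion figures come from this page.

```python
from collections import defaultdict

PRE_ENROLLED = {"angry", "disgust", "fear", "happy", "neutral",
                "other", "sad", "surprise", "tired"}
MIN_SAMPLES = 30   # suggested samples per new emotion
MIN_SPEAKERS = 15  # suggested distinct speakers per new emotion

def review_enrollment(samples):
    """Return warnings for an iterable of (label, speaker_id) enrollment samples."""
    speakers_by_label = defaultdict(list)
    for label, speaker in samples:
        speakers_by_label[label].append(speaker)

    warnings = []
    for label, speakers in sorted(speakers_by_label.items()):
        if label in PRE_ENROLLED:
            continue  # augments an existing class; the label matches exactly
        if label.strip().lower() in PRE_ENROLLED:
            warnings.append(f"'{label}' is not identical to an existing class "
                            "and would be enrolled as a new emotion")
        elif len(speakers) < MIN_SAMPLES or len(set(speakers)) < MIN_SPEAKERS:
            warnings.append(f"'{label}' has {len(speakers)} samples from "
                            f"{len(set(speakers))} speakers; at least {MIN_SAMPLES} "
                            f"samples from {MIN_SPEAKERS} speakers are suggested")
    if len(speakers_by_label) < 2:
        warnings.append("fewer than two emotions are being enrolled or augmented")
    return warnings

issues = review_enrollment([("Happy", "spk1"), ("bored", "spk2")])
```

Here "Happy" is flagged because it differs in case from the existing "happy" class, and "bored" is flagged for having too little data.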
Global Options
The following options are available to this plugin and can be adjusted in the plugin's configuration file, plugin_config.py.
| Option Name | Description | Default | Expected Range |
|---|---|---|---|
| sad_threshold | Detection threshold: a higher value results in fewer detections being output, but with higher reliability. | 0.0 | -10.0 to 10.0 |
| min_speech | The minimum duration, in seconds, that a speech segment must have in order to be scored/analyzed for the presence of enrolled emotions. | 0.2 | 0.2 to 15.0 |
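When changing these options, it can help to check values against the expected ranges first. The sketch below does not show the format of plugin_config.py, which is not described here; `validate_options` is a hypothetical helper, and only the option names, defaults, and ranges are taken from the table above.

```python
DEFAULTS = {"sad_threshold": 0.0, "min_speech": 0.2}
EXPECTED_RANGES = {"sad_threshold": (-10.0, 10.0), "min_speech": (0.2, 15.0)}

def validate_options(overrides):
    """Merge overrides into the defaults, rejecting unknown or out-of-range values."""
    merged = dict(DEFAULTS)
    for name, value in overrides.items():
        if name not in EXPECTED_RANGES:
            raise KeyError(f"unknown option: {name}")
        low, high = EXPECTED_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside the expected range {low} to {high}")
        merged[name] = value
    return merged

options = validate_options({"min_speech": 10.0})
```

Raising min_speech to 10.0, as in this example, matches the recommendation to score more than 10 seconds of speech.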