dfa-global-v1 (Deep Fake Audio Detection)
Version Changelog
Plugin Version | Change |
---|---|
v1.0.0 | Initial plugin release. Compatible with OLIVE 5.5.0 |
Description
Deepfake audio plugins classify whether a given speech sample comes from a human or from a synthetic generator. Speech generators often leave acoustic and phonetic artifacts in the audio. Traditional synthetic speech detectors typically rely on deep learning models trained with acoustic features alone, and therefore do not leverage any phonetic information when determining whether a sample was generated or real. This plugin integrates both phonetic and acoustic information into the model to detect whether a given audio sample came from a person or from a synthetic system.
The plugin uses a Residual Neural Network (ResNet) trained on phonetically-rich bottleneck features from bi-lingual automatic speech recognition (ASR) models along with Linear Frequency Cepstral Coefficients (LFCC) acoustic features to train deep learning models for better discrimination between real and synthetic speech. The LFCC features expose the network to the complete frequency range to reveal potential artifacts left by a synthetic generator. This is in contrast to the MFCC features used in most OLIVE tasks that focus more on the frequency range of human speech.
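The distinction between LFCC and MFCC features comes down to the filterbank spacing: LFCCs use linearly spaced filters so high frequencies, where vocoder artifacts often appear, are weighted the same as the speech band. As a rough illustration only (not the plugin's actual feature extractor; frame sizes, filter counts, and windowing here are assumptions), an LFCC computation can be sketched as:

```python
import numpy as np
from scipy.fft import dct

def lfcc(signal, n_fft=512, hop=160, n_filters=20, n_ceps=20):
    """Linear Frequency Cepstral Coefficients: like MFCCs, but the
    triangular filters are linearly spaced rather than mel-spaced,
    so the full frequency range is weighted evenly."""
    # Frame the signal and take the magnitude spectrum of each frame.
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop: i * hop + n_fft] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1))

    # Linearly spaced triangular filterbank covering 0 .. Nyquist.
    edges = np.linspace(0, n_fft // 2, n_filters + 2).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, c, hi = edges[m - 1], edges[m], edges[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)

    # Log filterbank energies, then a DCT to decorrelate into cepstra.
    energies = np.log(spec @ fbank.T + 1e-10)
    return dct(energies, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

With 1 second of 8 kHz audio and these assumed parameters, this yields one 20-dimensional feature vector per 20 ms hop.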
Domains
This plugin has two domains, differentiated by the backend each uses for classification. multi-plda-v1 tends to offer better discrimination, while multi-gb-v1 is typically better calibrated across conditions. This means that multi-plda-v1 will likely provide better performance when the audio conditions are clean and are conditions the model has been exposed to. Conditions that vary widely from what the model was trained on may benefit from using the multi-gb-v1 domain instead.
- multi-plda-v1: This domain uses a PLDA backend scorer trained on multiple deep fake approaches, such as voice conversion and synthetic speech, to output log-likelihood ratios. It can handle audio with an 8 kHz sampling rate or higher. The PLDA backend tends to provide better discrimination than the Gaussian backend domain.
- multi-gb-v1: This domain uses a Gaussian backend (GB) scorer trained on multiple deep fake approaches, such as voice conversion and synthetic speech, to output log-likelihood ratios. It can handle audio with an 8 kHz sampling rate or higher. The GB backend tends to offer better calibration across conditions, including unseen conditions, due to its multi-class calibration.
Inputs
Audio file or buffer and an optional identifier.
Outputs
The scores represent whether a given audio sample is detected as "synthetic", generated speech or not. The score itself comes from the learned embedding in a ResNet network. The embedding is passed through a backend scorer (PLDA/GB), which outputs a log-likelihood ratio (as commonly seen in other OLIVE plugins) representing the probability that the audio is fake and not from a human speaker. A score greater than 0 is considered a "synthetic" or deep-fake audio detection, and a score below 0 is considered audio from a bona fide human talker.
The threshold in the configuration file determines the final classification. For example, if the threshold is equal to 0, then positive scores are classified as synthetic while negative scores are classified as bona fide.
An example output excerpt:
input-audio-1.wav synthetic -1.9423
input-audio-2.wav synthetic 1.2817
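Assuming the convention above, where positive scores indicate synthetic speech and negative scores indicate a bona fide talker, the final decision step can be sketched as follows (`classify` and its parameters are illustrative; only the `score_offset` name comes from the plugin's configuration options):

```python
def classify(llr_score, threshold=0.0, score_offset=0.0):
    """Turn a backend log-likelihood ratio into a label.

    Scores above the threshold (after applying the offset) indicate
    synthetic speech; scores below it indicate a bona fide human talker.
    `score_offset` mirrors the plugin_config.py option of the same name:
    raising it shifts decisions toward "synthetic".
    """
    shifted = llr_score + score_offset
    return "synthetic" if shifted > threshold else "bonafide"

print(classify(-1.9423))  # bonafide
print(classify(1.2817))   # synthetic
```

So the first example line above (score -1.9423) would be treated as bona fide speech, and the second (score 1.2817) as a synthetic detection.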
Functionality (Traits)
The functions of this plugin are defined by its Traits and implemented API messages. A list of these Traits is below, along with the corresponding API messages for each. Click a message name to go to additional implementation details.
- GLOBAL_SCORER – Score all submitted audio, returning a single score for the entire audio segment for each of the enrolled speakers of interest.
Compatibility
OLIVE 5.5+
Limitations
Known or potential limitations of the plugin are outlined below.
Minimum Audio Duration
Audio samples are assumed to be at least 1 second long. Any segment shorter than 1 second is extended with "repeat" padding, where the waveform is repeated as many times as necessary to reach at least 1 second.
If the audio is longer than 1 second, the plugin steps through the waveform with 1-second windows using a step size determined in the config file. In this case the final output is the average score across all of the windows.
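The padding and windowing behavior described above can be sketched as follows (a minimal illustration, not the plugin's implementation; the function names, the 0.5-second default step, and the `score_fn` callback are assumptions):

```python
import numpy as np

def repeat_pad(wave, min_len):
    """Tile a short waveform until it reaches at least min_len samples."""
    if len(wave) >= min_len:
        return wave
    reps = -(-min_len // len(wave))  # ceiling division
    return np.tile(wave, reps)[:min_len]

def windowed_score(wave, sr, score_fn, win_s=1.0, step_s=0.5):
    """Pad if needed, score each 1-second window, and average the scores."""
    win, step = int(win_s * sr), int(step_s * sr)
    wave = repeat_pad(wave, win)
    starts = range(0, len(wave) - win + 1, step)
    scores = [score_fn(wave[s: s + win]) for s in starts]
    return float(np.mean(scores))
```

A half-second clip at 8 kHz would be tiled out to 8000 samples and scored once; a 2-second clip would be scored over three overlapping windows and the results averaged.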
Types of Speech Generators
The plugin performs best on TTS generators that are based on Neural Networks. The plugin has more difficulty with Voice Conversion generators that use lower level, waveform-specific manipulations.
Impersonators
This plugin has been trained with deep-fake data and the model detects vocoders and synthetic speech. Impressions from professional impersonators are not within the scope of this plugin.
Comments
Detecting fake speech generators is a cat-and-mouse game. The field is (as of writing) moving incredibly fast with new advances and models constantly released. We strove to develop a system that is robust to unseen generators but cannot guarantee that we captured the universe of in-the-wild deepfake audio generators.
Processing Speed and Memory
Adaptation is computationally expensive and requires more resources than global calibration.
Global Options
The following options are available to this plugin, adjustable in the plugin's configuration file, plugin_config.py.
Option Name | Description | Default | Expected Range |
---|---|---|---|
min_speech | The minimum amount of speech a segment must contain in order to be scored/analyzed. | 1.0 | 0.5 - 4.0 |
sad_threshold | Threshold for the Speech Activity Detector. The higher the threshold, the less speech is selected. | 1.0 | 0.0 - 3.0 |
score_offset | An offset added to scores to allow 0.0 to be a tuned decision point. Higher score offset values will shift output scores towards "synthetic", making speech more likely to be detected as deep fake. | 0.0 | -10.0 - 10.0 |
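For reference, these options might appear in plugin_config.py along these lines (the option names and defaults come from the table above; the surrounding file structure is an assumption):

```python
# Hypothetical excerpt of plugin_config.py -- option names and defaults
# are taken from the options table; layout is illustrative only.
min_speech = 1.0     # minimum speech (seconds) required to score; range 0.5 - 4.0
sad_threshold = 1.0  # higher values select less audio as speech; range 0.0 - 3.0
score_offset = 0.0   # positive values shift scores toward "synthetic"; range -10.0 - 10.0
```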