dfa-end2end-v1 (Deep Fake Audio Detection)

Version Changelog

Plugin Version	Change
v1.0.0	Initial plugin release. Compatible with OLIVE 5.7.1

Description

Deepfake audio plugins classify whether a given speech sample comes from a human or from a synthetic speech generator. Synthetic speech is artificially generated audio that mimics human speech using machine learning techniques. Unlike real speech, which originates from an actual human voice, synthetic speech is created by machines and can be manipulated to sound convincingly like any individual. Types of synthetic speech include text-to-speech (TTS), voice cloning, and neural audio synthesis. Detecting synthetic speech is crucial to preventing misinformation and safeguarding against malicious activities such as impersonation and fraud.

This plugin detects whether a given audio sample came from a person or from a synthetic system. The plugin uses a model combining a fine-tuned pretrained large speech model (Multi-Resolution HuBERT) and an AASIST spectro-temporal graph attention network to extract embedding representations from the input audio. These embeddings are then fed into a calibrated PLDA backend, which scores the audio sample as either synthetic (from TTS or voice conversion, VC) or coming from a person.

Domains

This plugin has one domain: multicond-v1.

multicond-v1
- This domain uses a PLDA backend scorer, trained on multiple deepfake approaches such as voice conversion and synthetic speech, to output log-likelihood ratios. It can handle audio with a sampling rate of 8kHz or higher, although it was optimized for a 16kHz sampling rate.

Inputs

Audio file or buffer and an optional identifier.

Outputs

The scores indicate whether a given audio sample is detected as synthetic (generated) speech. An embedding is extracted from the audio by the Multi-Resolution HuBERT AASIST network and passed through a PLDA backend scorer, which outputs a log-likelihood ratio (as is common in other OLIVE plugins). The log-likelihood ratio expresses how much more likely the audio is to be synthetic than to come from a human speaker. A score greater than 0 is considered a synthetic (deepfake) audio detection, and a score below 0 is considered audio from a bona fide human talker.
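A log-likelihood ratio is not itself a probability, but it can be turned into one once a prior is chosen. The following is a minimal sketch of that standard conversion, assuming the score is a natural-log ratio. The function name and the prior value are illustrative choices and are not part of the plugin or of OLIVE.

```python
import math


def llr_to_posterior(llr: float, prior_synthetic: float = 0.5) -> float:
    """Convert a log-likelihood ratio into P(synthetic | audio).

    Assumptions (not stated by the plugin documentation):
    - the LLR uses the natural logarithm;
    - prior_synthetic is supplied by the caller for their own use case.
    """
    if not 0.0 < prior_synthetic < 1.0:
        raise ValueError("prior_synthetic must be strictly between 0 and 1")
    # Posterior log-odds = LLR + prior log-odds (Bayes' rule in log form).
    prior_log_odds = math.log(prior_synthetic / (1.0 - prior_synthetic))
    log_odds = llr + prior_log_odds
    # Numerically stable logistic function.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)
```

With an even prior, a score of 0 maps to a posterior of 0.5, which is why 0 is the natural decision point for calibrated scores.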

The threshold in the configuration file determines the final classification. For example, if the threshold is equal to 0, then positive scores are classified as synthetic while negative scores are classified as real.

An example output excerpt:

    input-audio-1.wav synthetic -1.9423
    input-audio-2.wav synthetic 1.2817
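As a rough illustration of how such output lines might be consumed, the sketch below parses lines of the form shown above (file name, class name, score) and applies a decision threshold. The function name, the "human" label, and the handling of a score exactly equal to the threshold are assumptions made for this example, not behavior defined by the plugin.

```python
def classify_scores(lines, threshold=0.0):
    """Parse 'filename class score' lines and label each file.

    Hypothetical helper: scores above the threshold are labeled
    "synthetic"; everything else is labeled "human". Treating a score
    exactly at the threshold as "human" is an assumption.
    """
    results = []
    for line in lines:
        # Split from the right so file names containing spaces survive.
        parts = line.rsplit(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        filename, _class_name, score_text = parts
        score = float(score_text)
        label = "synthetic" if score > threshold else "human"
        results.append((filename, score, label))
    return results


example = [
    "input-audio-1.wav synthetic -1.9423",
    "input-audio-2.wav synthetic 1.2817",
]
for filename, score, label in classify_scores(example):
    print(f"{filename}: {score:+.4f} -> {label}")
```

With a threshold of 0, the first file (negative score) is labeled human and the second (positive score) is labeled synthetic.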

Functionality (Traits)

The functions of this plugin are defined by its Traits and implemented API messages. These Traits are listed below, along with the corresponding API messages for each.

GLOBAL_SCORER – Score all submitted audio, returning a single score for the entire audio segment that indicates whether the audio is synthetic.
- GlobalScorerRequest

Compatibility

OLIVE 5.7.1+

Limitations

Known or potential limitations of the plugin are outlined below.

Minimum Audio Duration

Audio samples are assumed to be at least 4 seconds long. Any segment shorter than 4 seconds is extended with zero padding to reach at least 4 seconds.

If the audio is longer than 4 seconds, the plugin steps through the waveform in 4-second windows and computes an embedding for each window. The embeddings are then averaged to compute a single final score.
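The padding and windowing behavior described above can be sketched as follows. This is an illustration only: the window hop (non-overlapping here), the zero-padding of a trailing partial window, and the `embed` callable are all assumptions, since the plugin's internal implementation is not documented here.

```python
WINDOW_SECONDS = 4.0


def split_into_windows(samples, sample_rate):
    """Split samples into fixed 4-second windows, zero-padding as needed.

    Assumptions: windows do not overlap, and a trailing partial window
    is zero-padded to full length.
    """
    window = int(WINDOW_SECONDS * sample_rate)
    padded = list(samples)
    if len(padded) < window:
        # Short audio is extended with zeros to reach 4 seconds.
        padded.extend([0.0] * (window - len(padded)))
    windows = []
    for start in range(0, len(padded), window):
        chunk = padded[start:start + window]
        if len(chunk) < window:
            chunk = chunk + [0.0] * (window - len(chunk))
        windows.append(chunk)
    return windows


def average_embedding(samples, sample_rate, embed):
    """Average per-window embeddings into one vector.

    `embed` is a hypothetical stand-in for the embedding network: it
    takes one window of samples and returns a list of floats.
    """
    embeddings = [embed(w) for w in split_into_windows(samples, sample_rate)]
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
```

In this sketch, a 10-second file at 16kHz would yield three windows (the last one zero-padded), and their three embeddings would be averaged before scoring.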

Minimum Sampling Rate

This plugin supports audio sampling rates of 8kHz and higher. The recommended sampling rate for input audio is 16kHz or higher. Audio with a sampling rate below 16kHz is internally upsampled to 16kHz. However, upsampling carries some risks: the upper frequencies, which could be useful for deepfake detection, are missing from the original audio and may contain artifacts after upsampling.

Types of Speech Generators

The plugin performs best on TTS generators that are based on neural networks. It has more difficulty with voice conversion generators that use lower-level, waveform-specific manipulations.

Impersonators

This plugin has been trained with deep-fake data and the model detects vocoders and synthetic speech. Impressions from professional impersonators are not within the scope of this plugin.

Processing Speed and Memory

The plugin requires more resources than other audio deepfake plugins. It reaches slightly better than 1xRT when run on CPU (single thread), and much higher speeds (16xRT or more) when run on GPU.
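To make the real-time (RT) factors concrete, the short sketch below estimates processing time from audio duration and a speed factor. The function is a hypothetical helper for back-of-the-envelope planning. Actual throughput depends on hardware and configuration.

```python
def estimated_processing_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Estimate wall-clock processing time for a given real-time factor.

    A speed_factor of 1.0 (1xRT) means one second of audio takes about
    one second to process; 16.0 (16xRT) means it takes 1/16 of a second.
    """
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


ten_minutes = 600.0
print(estimated_processing_seconds(ten_minutes, 1.0))   # CPU at about 1xRT
print(estimated_processing_seconds(ten_minutes, 16.0))  # GPU at about 16xRT
```

Under these figures, ten minutes of audio would take roughly ten minutes on a single CPU thread and well under a minute on GPU.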

Comments

Detecting fake speech generators is a cat-and-mouse game. The field is (as of writing) moving incredibly fast with new advances and models constantly released. We strove to develop a system that is robust to unseen generators but cannot guarantee that we captured the universe of in-the-wild deepfake audio generators.

GPU Support

This plugin was designed and developed to run optimally on GPU hardware. It is capable of running on CPU in the absence of an available GPU or the proper configuration, but it will do so at a significantly reduced speed.

Please refer to the OLIVE GPU Installation and Support documentation page for instructions on how to enable and configure GPU capability in supported plugins. By default, this plugin will run on GPU only.

Global Options

The following global scoring options are available to this plugin and can be adjusted in the plugin's configuration file, plugin_config.py.

Option Name	Description	Default	Expected Range
min_speech	The minimum amount of speech that an audio segment must contain in order to be scored/analyzed.	1.0	0.5 - 4.0
sad_threshold	Threshold for the Speech Activity Detector. The higher the threshold, the less speech is selected.	1.0	0.0 - 3.0
score_offset	An offset added to scores to allow 0.0 to be a tuned decision point. Higher score offset values will shift output scores towards "synthetic", making speech more likely to be detected as deep fake.	0.0	-10.0 - 10.0
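The effect of score_offset on the decision point can be illustrated with a small sketch. The function below is hypothetical and only mirrors the description in the table (the offset is added to the score, and 0.0 remains the decision point). It is not the plugin's actual code.

```python
def apply_score_offset(raw_score: float, score_offset: float = 0.0) -> float:
    """Add the configured offset to a raw score.

    The allowed range mirrors the 'Expected Range' column above.
    """
    if not -10.0 <= score_offset <= 10.0:
        raise ValueError("score_offset is expected to be within -10.0 and 10.0")
    return raw_score + score_offset


raw = -0.5
print(apply_score_offset(raw))        # below 0: treated as human speech
print(apply_score_offset(raw, 1.0))   # above 0: treated as synthetic
```

A positive offset therefore makes the plugin more willing to flag audio as synthetic, while a negative offset makes it more conservative.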