dfa-spoofnet-v1 (Deep Fake Audio Detection)

Version Changelog

Plugin Version   Change
v1.0.0           Initial plugin release with OLIVE 5.3.0

Description

Deepfake audio plugins classify whether a given speech sample comes from a human or from a synthetic generator. A speech sample from a human is labeled as “natural” while generated speech is labeled as “synthetic”.

There are two broad classes of speech generators: Text-to-Speech (TTS) systems and Voice Conversion (VC) systems. In TTS systems, a generator takes a text string as input and returns raw audio in the target voice. In VC systems, a non-target speech recording is manipulated until it matches the speech of a given target. The DeepFake audio plugins differentiate between genuine human speech and samples created by TTS and VC systems.

This plugin detects whether a given audio sample came from a person or from a synthetic system. The plugin uses a Convolutional Neural Network (CNN) trained on Linear Frequency Filterbank (LFB) features. LFB features expose the network to the complete frequency range to reveal potential artifacts left by a synthetic generator, in contrast to the MFCC features used in most OLIVE tasks, which concentrate on the frequency range most informative for human speech. The network produces an embedding that is fed into a calibrated PLDA backend to score the audio sample as either “natural” (from a person) or “synthetic” (from a synthetic system).
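
As a rough illustration of the LFB front end, the sketch below computes log linear-frequency filterbank energies with NumPy and SciPy. All parameter values (sample rate, FFT size, hop, filter count) are illustrative assumptions, not the plugin's actual settings:

    # Minimal LFB feature extraction sketch (illustrative; not the plugin's code).
    import numpy as np
    from scipy.signal import stft

    def lfb_features(waveform, sr=16000, n_fft=512, hop=160, n_filters=70):
        # Power spectrogram via the short-time Fourier transform.
        _, _, Z = stft(waveform, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
        power = np.abs(Z) ** 2  # shape: (n_fft // 2 + 1, n_frames)

        # Triangular filters spaced *linearly* in Hz, so the upper band
        # (where generator artifacts may appear) keeps full resolution;
        # mel spacing would compress it instead.
        edges = np.linspace(0.0, sr / 2.0, n_filters + 2)
        bin_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
        fbank = np.zeros((n_filters, bin_freqs.size))
        for i in range(n_filters):
            lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
            rise = (bin_freqs - lo) / (mid - lo)
            fall = (hi - bin_freqs) / (hi - mid)
            fbank[i] = np.maximum(0.0, np.minimum(rise, fall))

        # Log filterbank energies: (n_filters, n_frames).
        return np.log(fbank @ power + 1e-10)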

Domains

  • multi-v1
    • A generic domain where the system was developed based on its performance on an unseen set of speech generators. Both the type of speech generators and their audio conditions were unseen, representing the worst-case scenario of encountering a completely novel synthetic generator in the field.

Inputs

Audio file or buffer and an optional identifier.

Outputs

The scores represent whether a given audio sample is "natural" (from a real person) or "synthetic". Each score derives from the embedding learned by the CNN; the embedding is passed through a calibrated PLDA backend, which outputs log-likelihood ratios (as commonly seen in other OLIVE plugins) for the two classes: natural and synthetic.
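
For reference, the score reported for a class is the standard log-likelihood ratio of that class against the alternative (stated here for clarity), where x denotes the sample's embedding; for the natural class:

    \mathrm{LLR}(x) = \log p(x \mid \text{natural}) - \log p(x \mid \text{synthetic})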

The threshold in the configuration file determines the final classification. For example, if the threshold is equal to 0, then positive scores are classified as natural while negative scores are classified as synthetic.

An example output excerpt:

    input-audio.wav synthetic -2.81980443
    input-audio.wav natural 0.94153970
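
As a hypothetical illustration of applying the threshold described above, the snippet below reads lines in the format of this excerpt and accepts or rejects each class score; the file name and threshold value are assumptions:

    # Illustrative post-processing sketch; not part of the plugin itself.
    THRESHOLD = 0.0  # assumed value from the plugin configuration

    with open("scores.txt") as f:  # hypothetical score file in the format above
        for line in f:
            audio, label, score = line.split()
            decision = "accept" if float(score) > THRESHOLD else "reject"
            print(f"{audio}: {label} -> {decision}")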

Functionality (Traits)

The functions of this plugin are defined by its Traits and implemented API messages. A list of these Traits, along with the corresponding API messages for each, is below. Click a message name to go to additional implementation details.

  • GLOBAL_SCORER – Score all submitted audio, returning a single score for the entire audio segment for each of the two classes (natural and synthetic).

Compatibility

OLIVE 5.3+

Limitations

Known or potential limitations of the plugin are outlined below.

Minimum Audio Duration

Audio samples are assumed to be at least 3 seconds long. Any segment shorter than 3 seconds is extended with “repeat” padding, where the waveform is repeated as many times as necessary to reach at least 3 seconds.

If the audio is longer than 3 seconds, the plugin steps through the waveform with 3-second windows using a step size determined in the config file. In this case the final output is the average score across all of the windows.
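
The sketch below mirrors this padding and windowing behavior under stated assumptions: the sample rate, step size, and function names are illustrative, not the plugin's actual implementation:

    # Minimal sketch of repeat padding and sliding-window score averaging.
    import numpy as np

    SR = 16000    # assumed sample rate
    WIN = 3 * SR  # 3-second analysis window
    STEP = SR     # illustrative stand-in for the configured step size

    def pad_repeat(wav):
        # Repeat short waveforms until they reach at least 3 seconds.
        if wav.size >= WIN:
            return wav
        reps = int(np.ceil(WIN / wav.size))
        return np.tile(wav, reps)

    def windowed_score(wav, score_fn):
        # Score each 3-second window and return the average, as described above.
        wav = pad_repeat(wav)
        starts = range(0, wav.size - WIN + 1, STEP)
        scores = [score_fn(wav[s:s + WIN]) for s in starts]
        return float(np.mean(scores))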

Types of Speech Generators

The plugin performs best on TTS generators that are based on neural networks. The plugin has more difficulty with Voice Conversion generators that use lower-level, waveform-specific manipulations.

Comments

Detecting fake speech generators is a cat-and-mouse game. The field is (as of writing) moving incredibly fast with new advances and models constantly released. We strove to develop a system that is robust to unseen generators but cannot guarantee that we captured the universe of in-the-wild deepfake audio generators.

Global Options

The following options are available to this plugin, adjustable in the plugin's configuration file, plugin_config.py.

Option Name     Description                                                           Default   Expected Range
min_speech      Minimum duration (in seconds) of detected speech that a segment       0.3       0.3 - 4.0
                must contain in order to be scored/analyzed.
sad_threshold   SAD threshold for selecting the audio used in metadata extraction.    1.0       -5.0 - 6.0
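
For orientation, a hypothetical excerpt of plugin_config.py is shown below; the defaults follow the table above, but the exact variable names and file layout are assumptions:

    # plugin_config.py (illustrative excerpt; the real file may differ)
    min_speech = 0.3     # minimum seconds of detected speech required for scoring
    sad_threshold = 1.0  # SAD threshold for selecting audio for metadata extraction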