
dfa-speakerSpecific-v1 (Deep Fake Audio Detection - Speaker Specific)

Version Changelog

Plugin Version   Change
v1.0.0           Initial plugin release. Compatible with OLIVE 5.3.0.

Description

Common “deep fake” detection techniques apply general approaches based on detecting synthetic speech or audio artifacts. While deep fakes very often target known individuals, these common techniques do not leverage actual known samples of the target’s speech (the target is also called the speaker of interest) when determining whether a sample is generated or real. This information is very valuable both for detecting deep-fake speech from a specific individual and for building an effective system for the general population.

This DeepFake Audio Detection plugin incorporates information about the speaker of interest into the model to defend against attacks that target specific, vulnerable individuals. It detects whether an audio sample is a bona fide recording or a deep-fake representation of a specific enrolled speaker (the speaker of interest). Like Speaker Identification (SID), DFA-SpeakerSpecific needs information from the speaker of interest, in the form of a speaker-specific enrollment. In this version, the number of speakers analyzed per scoring query is limited to one.

Finally, the plugin allows the user to adapt the model to the speaker of interest when there is enough enrollment data for that speaker. Currently, a minimum of four enrolled utterances is required to perform this adaptation.

Domains

  • multicondition-v1
  • Domain trained with multiple deep-fake approaches, such as voice conversion and synthetic speech.

Inputs

For enrollment, an audio file or buffer with a corresponding speaker identifier/label. Multiple speakers and multiple utterances of the same speaker can be enrolled at the same time.

For scoring, an audio file or buffer to evaluate, and a label indicating the speaker of interest (enrolled speaker) to compare against. Exactly one speaker-of-interest class must be provided; if more than one speaker is present in the list at score time, the plugin stops processing and returns an error message.
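To make the single-speaker constraint concrete, here is a minimal sketch of score-time validation. The score_request function below is a hypothetical illustration, not part of the actual OLIVE API; only the one-speaker-per-query rule comes from this documentation.

    # Hypothetical score-time validation sketch; not actual OLIVE client code.
    def score_request(audio_path, speakers_of_interest):
        """Validate and build a scoring request for DFA-SpeakerSpecific.

        The plugin accepts exactly one enrolled speaker-of-interest class
        per scoring query; anything else is an error.
        """
        if len(speakers_of_interest) != 1:
            raise ValueError(
                "Exactly one speaker of interest must be provided, "
                f"got {len(speakers_of_interest)}"
            )
        return {"audio": audio_path, "speaker_of_interest": speakers_of_interest[0]}

    # Example: one enrolled speaker per query.
    request = score_request("unknown_clip.wav", ["JohnTravolta"])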

Outputs

The DFA-speakerSpecific plugin returns a list of files with a score for the enrolled speaker. As with SID, scores are log-likelihood ratios: a score greater than 0 indicates a bona fide file, and a score below 0 indicates deep-fake audio. By default, the DFA-speakerSpecific plugin adapts the score calibration if there are enough utterances of the enrolled speaker. Otherwise, the user can either fall back to global score calibration by disabling the enable_soi_adaptation option (described below) or add more speaker-of-interest data.

Example output:

Airplane-imTrEFnrVCs_spk0.wav JohnTravolta -240.64154053
Car-ajtyqj81b6E_spk0.wav JohnTravolta -135.29077148
Texas-7n1qnUOz4Uk_spk0_30sec_026.wav JohnTravolta 78.67844391
Texas-7n1qnUOz4Uk_spk0_30sec_011.wav JohnTravolta 97.31453705
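As a minimal sketch of how this output might be consumed, the snippet below parses score lines of the form shown above and applies the sign-of-the-LLR decision rule. The whitespace-separated line format is an assumption based on the example output, not a documented interface.

    def label_scores(lines):
        """Parse '<file> <speaker> <score>' lines and label each file.

        A log-likelihood ratio above 0 is treated as bona fide, and one
        below 0 as deep fake, following the plugin's scoring convention.
        """
        results = []
        for line in lines:
            file_name, speaker, score = line.split()
            verdict = "bona fide" if float(score) > 0 else "deep fake"
            results.append((file_name, speaker, float(score), verdict))
        return results

    example = [
        "Airplane-imTrEFnrVCs_spk0.wav JohnTravolta -240.64154053",
        "Texas-7n1qnUOz4Uk_spk0_30sec_026.wav JohnTravolta 78.67844391",
    ]
    for file_name, speaker, score, verdict in label_scores(example):
        print(f"{file_name}: {verdict} ({score:+.2f})")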

Enrollments

Speaker Detection plugins allow class modifications. A class modification is essentially the capability to enroll a class with one or more samples of that class's speech - in this case, a new speaker. A new enrollment is created with the first class modification, which essentially consists of sending the system an audio sample from a speaker, generally 30 seconds or more, along with a label for that speaker. This enrollment can be augmented with subsequent class-modification requests that add more audio under the same speaker label. Note that at least four audio samples are needed to perform the model adaptation for the speaker of interest.
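The sketch below illustrates the adaptation gate described above: speaker-specific calibration adaptation becomes available only once an enrollment holds at least four utterances. The bookkeeping class is hypothetical; only the four-utterance threshold comes from this documentation.

    MIN_UTTERANCES_FOR_ADAPTATION = 4  # per this plugin's documentation

    class EnrollmentStore:
        """Hypothetical bookkeeping for per-speaker enrollments."""

        def __init__(self):
            self._utterances = {}  # speaker label -> list of audio samples

        def add_class_modification(self, speaker_label, audio_sample):
            # The first modification creates the enrollment; later ones augment it.
            self._utterances.setdefault(speaker_label, []).append(audio_sample)

        def adaptation_available(self, speaker_label):
            # Speaker-of-interest adaptation requires at least four utterances;
            # otherwise the plugin falls back to global calibration.
            count = len(self._utterances.get(speaker_label, []))
            return count >= MIN_UTTERANCES_FOR_ADAPTATION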

Functionality (Traits)

The functions of this plugin are defined by its Traits and implemented API messages. A list of these Traits, along with the corresponding API messages for each, appears below. Click a message name to jump to additional implementation details.

Compatibility

OLIVE 5.3+

Limitations

Known or potential limitations of the plugin are outlined below.

Speaker ID functionality

Unlike SID, this plugin has not been designed to identify speakers; therefore, scoring candidate utterances against models from obviously different speakers can produce unexpected or undesired results.

Impersonators

This plugin has been trained on deep-fake data, and the model detects vocoders and synthetic speech. Impersonations by professional impersonators are not within the scope of this plugin.

Processing Speed and Memory

Adaptation is computationally expensive and requires more resources than global calibration.

Minimum Speech Duration

The system will only attempt to perform speaker detection on segments of speech that are longer than a minimum duration, configurable via the min_speech option (see Global Options below).

Global Options

The following options are available to this plugin, adjustable in the plugin's configuration file, plugin_config.py.

  • min_speech: The minimum duration of speech, in seconds, that a segment must contain in order to be scored/analyzed for the presence of enrolled speakers. Default: 1.0. Expected range: 0.5 - 4.0.
  • sad_threshold: Threshold for the Speech Activity Detector; the higher the threshold, the less speech is selected. Default: 1.0. Expected range: 0.0 - 3.0.
  • threshold: Threshold used to determine whether audio is labeled 'real' or 'fake'; a higher value results in more audio labeled as fake. Default: 3.0. Expected range: 0.0 - 10.0.
  • enable_soi_adaptation: Speaker-of-interest calibration adaptation. If True, the plugin adapts the model using enrolled data from the speaker of interest. Default: True. Expected values: True, False.
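For illustration, the defaults above might appear in plugin_config.py roughly as follows. The structure of that file is an assumption; only the option names and default values come from the table above.

    # plugin_config.py -- illustrative sketch only; the real file's structure
    # may differ. Option names and defaults follow the table above.

    # Minimum speech duration (seconds) a segment must contain to be scored.
    min_speech = 1.0            # expected range: 0.5 - 4.0

    # Speech Activity Detector threshold; higher values select less speech.
    sad_threshold = 1.0         # expected range: 0.0 - 3.0

    # Decision threshold for labeling audio 'real' vs 'fake'; higher values
    # label more audio as fake.
    threshold = 3.0             # expected range: 0.0 - 10.0

    # If True, adapt calibration using enrolled speaker-of-interest data
    # (requires at least four enrolled utterances).
    enable_soi_adaptation = True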