
dfa-speakerSpecific-phonetic-v1 (Deep Fake Audio Detection - Speaker Specific)

Version Changelog

Plugin Version   Change
v1.0.0           Initial plugin release. Compatible with OLIVE 5.5.0.
v1.0.2           Bug fix for input audio files < 0.5s.

Description

Deep fake speech generators often leave many acoustic and phonetic artifacts in the audio. Traditional synthetic speech detectors typically rely on deep learning models trained with acoustic features alone to classify whether a given speech sample comes from a human or from a synthetic generator. While "deep fakes" often target known individuals, common detection techniques do not leverage any phonetic or speaker-specific information when determining whether a sample was generated or real. This plugin integrates phonetic and speaker information into the model to counter spoofing attacks aimed at the most vulnerable individuals: it detects whether an audio file is bonafide or a deep-fake representation of a specific enrolled speaker (the Speaker Of Interest). Like Speaker Identification (SID), DFA-SpeakerSpecific-Phonetic needs information from the Speaker Of Interest in the form of a speaker-specific enrollment. In this version, each scoring query is limited to a single speaker.

Finally, the plugin allows the user to adapt the model towards the Speaker Of Interest if there is enough enrollment data for that speaker. Currently, at least three utterances must be enrolled to perform the optional Speaker Of Interest adaptation.

Domains

  • multicondition-16k-v1
    • Domain trained with multiple deep fake approaches, such as voice conversion and synthetic speech, that can handle audio with a 16kHz sampling rate or higher. Downsamples all input to 16kHz.
  • multicondition-8k-v1
    • Domain trained with multiple deep fake approaches, such as voice conversion and synthetic speech, that can handle audio with an 8kHz sampling rate or higher. Downsamples all input to 8kHz; this downsampling is sketched below.
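
Both domains resample internally, so users do not need to downsample beforehand. As a rough sketch of that behavior, assuming soundfile and scipy as stand-in libraries and a hypothetical input file name, the reduction to the domain rate might look like this:

# Illustration of the internal downsampling described above: input above
# the domain rate is reduced to 16kHz (or 8kHz for the 8k domain).
# soundfile/scipy are stand-ins; the plugin's own resampler may differ.
from math import gcd

import soundfile as sf
from scipy.signal import resample_poly

audio, rate = sf.read("input_48k.wav")  # hypothetical input file
target = 16000  # multicondition-16k-v1; use 8000 for multicondition-8k-v1
if rate > target:
    g = gcd(target, rate)
    audio = resample_poly(audio, target // g, rate // g)
    rate = target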

Inputs

For enrollment, an audio file or buffer with a corresponding speaker identifier/label. Multiple speakers and multiple utterances of the same speaker can be enrolled at the same time.

For scoring, an audio buffer or file to evaluate, and a label indicating the speaker-of-interest (enrolled speaker) to compare against. One and only one speaker-of-interest class must be provided; if more than one speaker is in the list at score time, the plugin stops processing and returns an error message.
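
The single-speaker restriction amounts to a simple request check. A minimal sketch, assuming a hypothetical helper name and error text (not the plugin's actual code):

# Sketch of the score-time rule above: exactly one speaker-of-interest
# label may accompany the audio to be evaluated. Names are illustrative.

def check_score_request(audio, speakers):
    """Reject requests that do not name exactly one enrolled speaker."""
    if len(speakers) != 1:
        raise ValueError(
            f"expected exactly one speaker-of-interest, got {len(speakers)}")
    return audio, speakers[0]

check_score_request("candidate.wav", ["Adam"])            # accepted
# check_score_request("candidate.wav", ["Adam", "Kylo"])  # raises ValueError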

Outputs

The dfa-speakerSpecific-phonetic plugin returns a list of files with a score for the enrolled speaker. As with SID, scores are log-likelihood ratios: a score greater than 0 indicates a bonafide file and a score below 0 indicates deep-fake audio. The plugin uses a global score calibration by default. If desired, the user can instead set the enable_soi_adaptation option (described below) to adapt the score calibration using speaker-of-interest-specific information, as long as there are three or more enrollments for the target speaker.

Example output:

My_journey_from_Marine_to_actor___Adam_Driver-nCwwVjPNloY_spk0_30sec_006.wav Adam 0.73312998
My_journey_from_Marine_to_actor___Adam_Driver-nCwwVjPNloY_spk0_30sec_012_8k.wav Adam 0.92221069
LAI_VoiceJ_Kylo_Explains_Star_Wars-_ZZlYYC24LY_spk0.wav Adam -0.44476557
LAI_VoiceJ_Kylo_Ren_in_Star_Trek-T-yyabjI_RE_spk0.wav Adam -1.08671188
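
The decision rule is a sign test on the log-likelihood ratio. A minimal sketch of consuming output lines like those above, assuming the whitespace-separated file/speaker/score layout shown:

# Parse output lines like the example above and apply the 0.0 decision
# point: scores above 0 are bonafide, scores below 0 are deep fake.

output = """\
My_journey_from_Marine_to_actor___Adam_Driver-nCwwVjPNloY_spk0_30sec_006.wav Adam 0.73312998
LAI_VoiceJ_Kylo_Explains_Star_Wars-_ZZlYYC24LY_spk0.wav Adam -0.44476557
"""

for line in output.strip().splitlines():
    filename, speaker, score = line.rsplit(maxsplit=2)
    verdict = "bonafide" if float(score) > 0.0 else "deep fake"
    print(f"{filename} ({speaker}): {verdict}, LLR = {score}")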

Enrollments

Speaker Detection plugins allow class modifications. A class modification is the capability to enroll a class using one or more samples of that class's speech - in this case, a new speaker. A new enrollment is created with the first class modification, which sends the system an audio sample from a speaker, generally 30 seconds or more, along with a label for that speaker. The enrollment can be augmented with subsequent class modification requests that add more audio under the same speaker label. Note that at least three audio samples are needed to perform the model adaptation for the speaker-of-interest.
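
As a concrete picture of this bookkeeping, the sketch below accumulates audio samples under a speaker label and reports when the three-sample adaptation threshold is met; the dictionary store and function name are stand-ins, not the plugin's internal state:

# Illustrative class-modification bookkeeping: the first call creates an
# enrollment, later calls augment it, and three samples make
# speaker-of-interest adaptation possible. Not the plugin's real code.

enrollments = {}  # speaker label -> list of enrolled audio samples

def modify_class(speaker, audio):
    """Create or augment the enrollment for the given speaker label."""
    enrollments.setdefault(speaker, []).append(audio)
    return len(enrollments[speaker])

for utt in ["adam_utt1.wav", "adam_utt2.wav", "adam_utt3.wav"]:
    count = modify_class("Adam", utt)
print("SOI adaptation available:", count >= 3)  # True after three samples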

Functionality (Traits)

The functions of this plugin are defined by its Traits and implemented API messages. A list of these Traits, along with the corresponding API messages for each, is below. Click a message name to see additional implementation details.

Compatibility

OLIVE 5.5+

Limitations

Known or potential limitations of the plugin are outlined below.

Speaker ID functionality

Unlike SID, this plugin has not been designed to identify speakers; scoring candidate utterances against models from obviously different speakers can therefore produce unexpected or undesired results.

Impersonators

This plugin has been trained with deep-fake data, and the model detects vocoders and synthetic speech. Impressions from professional impersonators are not within the scope of this plugin.

Processing Speed and Memory

Adaptation is computationally expensive and requires more resources than global calibration.

Minimum Speech Duration

The system will only attempt to perform speaker detection on segments of speech that are longer than a minimum duration (configurable via the min_speech option; 1 second by default).
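
In effect, short segments are dropped before scoring, along these lines (segment times, names, and the strict comparison are illustrative assumptions):

# Sketch of the minimum-duration rule above: speech segments shorter
# than min_speech seconds are skipped rather than scored.

MIN_SPEECH = 1.0  # default value of the min_speech option

def scorable_segments(segments, min_speech=MIN_SPEECH):
    """Keep only (start, end) segments long enough to analyze."""
    return [(s, e) for (s, e) in segments if e - s > min_speech]

print(scorable_segments([(0.0, 0.4), (1.0, 3.2), (5.0, 5.9)]))
# -> [(1.0, 3.2)]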

Global Options

The following options are available to this plugin, adjustable in the plugin's configuration file, plugin_config.py.

  • min_speech
    • The minimum amount of speech a segment must contain in order to be scored/analyzed for the presence of enrolled speakers. Default: 1.0. Expected range: 0.5 - 4.0.
  • sad_threshold
    • Threshold for the Speech Activity Detector; the higher the threshold, the less speech is selected. Default: 1.0. Expected range: 0.0 - 3.0.
  • threshold
    • An offset applied to scores so that 0.0 can serve as a tuned decision point. Higher threshold values lower output scores and therefore result in more audio labeled "fake". Default: 1.5. Expected range: 0.0 - 10.0.
  • enable_soi_adaptation
    • Speaker Of Interest calibration adaptation. If True, the plugin adapts the model using enrolled data from the speaker-of-interest. Default: False. Expected values: True or False.
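
For reference, these defaults would appear in plugin_config.py as ordinary Python assignments, roughly as sketched below; the option names are taken from the list above, but the real file's layout may differ:

# Hypothetical plugin_config.py excerpt matching the defaults above;
# the real file's layout and surrounding contents may differ.

min_speech = 1.0               # seconds; expected range 0.5 - 4.0
sad_threshold = 1.0            # SAD threshold; expected range 0.0 - 3.0
threshold = 1.5                # score offset; expected range 0.0 - 10.0
enable_soi_adaptation = False  # True enables speaker-of-interest adaptation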