Vox Inanis — sratzel.dev

THE_IDEA

What if you could embed a hidden message into an audio file? One that's completely inaudible to human ears, but that an AI speech-to-text model like OpenAI's Whisper would confidently transcribe.

And what if you could flip that around? Add a subtle layer of noise to your voice recordings that makes them untranscribable, effectively anonymizing your speech against automated surveillance.

That's Vox Inanis (Latin for "Voice of the Void"). A research toolset exploring both sides of adversarial audio: injection (forcing a model to hear a target phrase) and disruption (preventing a model from transcribing anything useful).

WHAT_IT_DOES

Vox Inanis uses gradient-based adversarial attacks against OpenAI's Whisper model. It takes any piece of audio (speech, music, ambient noise) and computes a tiny perturbation that changes what Whisper transcribes when added to the original signal.

The perturbation is bounded. Each audio sample can change by at most a few percent of full scale, controlled by an L∞ budget (typically 3-5%). The goal is audio that sounds identical to human ears while Whisper transcribes something completely different.

TARGETED INJECTION

Force Whisper to transcribe a specific phrase. Feed in a 22-second job interview recording and Whisper outputs "hello world!" instead of the actual speech. The original words are gone. Only the injected payload remains.

ADVERSARIAL ANONYMIZATION

The inverse problem. Instead of injecting a specific phrase, the perturbation is optimized to maximize transcription error. Whisper produces garbled nonsense, random tokens, or nothing at all. Your voice is effectively anonymized against any downstream STT pipeline.

Both modes aim for audio that sounds identical to the original when played back through speakers. The changes are meant to live purely in the statistical patterns that neural networks rely on.

RELATED_WORK

Adversarial attacks on neural networks have been studied extensively in the image domain since Szegedy et al. (2013) and Goodfellow et al.'s Fast Gradient Sign Method (2014). Vox Inanis builds on several lines of prior work in audio:

Carlini & Wagner, 2018

"Audio Adversarial Examples" Demonstrated that targeted adversarial examples against DeepSpeech are feasible by optimizing directly over the raw waveform. Established the basic paradigm of minimizing CTC loss with an L∞ constraint. Vox Inanis adapts this approach to Whisper's encoder-decoder architecture and cross-entropy loss.

Schönherr et al., 2019

"Adversarial Attacks Against ASR via Psychoacoustic Hiding" Introduced perceptual masking into adversarial audio attacks, using psychoacoustic models to push perturbation energy into frequency bands where the human ear is least sensitive. My STFT-domain masking loss is a simplified variant of this idea.

Madry et al., 2018

"Towards Deep Learning Models Resistant to Adversarial Attacks" Formalized Projected Gradient Descent (PGD) as the canonical first-order adversarial attack. My sign-PGD optimizer with L∞ projection follows this framework directly.

Dong et al., 2018

"Boosting Adversarial Attacks with Momentum" Proposed MI-FGSM, which adds momentum to the gradient update to escape poor local optima and improve transferability. I use MI-FGSM with a decay factor of 0.9 as the default optimizer.

Athalye et al., 2018

"Obfuscated Gradients Give a False Sense of Security" Demonstrated that defenses relying on non-differentiable transformations can be bypassed with Backward Pass Differentiable Approximation (BPDA), a generalization of the Straight-Through Estimator (STE). My codec-in-the-loop training uses the same trick: the forward pass runs the real ffmpeg AAC codec, the backward pass treats it as the identity.

Yakura & Sakuma, 2019

"Robust Audio Adversarial Example for a Physical Attack" Showed that audio adversarial examples can survive over-the-air playback using band-pass filtering and impulse response simulation during optimization. My codec-in-the-loop approach applies the same principle to digital lossy compression rather than physical channel effects.

To my knowledge, Vox Inanis is the first open implementation that combines MI-FGSM optimization, STFT-domain perceptual masking, and codec-in-the-loop training with a Straight-Through Estimator in a single pipeline targeting Whisper's encoder-decoder architecture.

THREAT_MODEL

ATTACKER MODEL

Whitebox access to one or more Whisper checkpoints (full gradient access through the frozen model). Compute budget for per-input PGD optimization (~30s to a few minutes on a consumer GPU). Knowledge of the target STT model family and its expected language. No access to the deployed downstream system, no real-time capability, no arbitrary query access assumed.

TARGET PIPELINE

The attack targets the transcription step of a pipeline: (1) audio capture/upload, (2) optional codec round-trip (YouTube AAC at 128 kbps), (3) STT model producing transcript, (4) optional downstream LLM consuming transcript. The perturbation lives in step 1. Survival through step 2 is a design goal. Manipulation of step 4 via injected tokens is the natural downstream threat but is not directly evaluated here.

PERTURBATION CONSTRAINTS

Must be audibly transparent (L∞ budget + STFT masking). Must be self-contained (sample-aligned additive signal, no side-channel needed for decoding). Must be codec-robust (survive AAC at 128 kbps / 44.1 kHz).

SCOPE LIMITATIONS

Mono audio only. Audio ≤ 30 seconds (Whisper's encoder context limit). Per-audio whitebox optimization (no real-time or universal perturbations yet). Evaluated on English and German, not systematically across all 99 Whisper languages.

METHODOLOGY

DIFFERENTIABLE WHISPER

Whisper's model.transcribe() API is not differentiable. It decodes autoregressively (greedy or beam search) and selects discrete tokens at every step. I bypass it and call the encoder and decoder directly:

audio (16 kHz mono, len ≤ 30s)
  → whisper.audio.pad_or_trim
  → whisper.audio.log_mel_spectrogram   // differentiable STFT + mel + log
     shape: (1, n_mels, 3000)
  → model.encoder                       // frozen transformer
     shape: (1, 1500, n_audio_state)
  → model.decoder(input_ids, audio_features)
  → logits, shape (1, S, vocab)
  → cross-entropy loss vs. target tokens
  → backpropagate gradients to raw audio samples

The log_mel_spectrogram function is pure PyTorch ops (STFT, matmul, log, clamp), so gradients flow through it. The encoder and decoder are frozen (requires_grad=False on every parameter). Only the input audio waveform accumulates gradient. I'm not training the model. I'm training the input.
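The shapes in the pipeline above follow from a few constants. The sketch below reproduces that arithmetic in plain Python. The constant values (16 kHz sample rate, 30 s chunk, mel hop of 160 samples, a stride-2 convolution in the encoder) are my reading of the Whisper reference implementation, and `whisper_shapes` is a hypothetical helper, not part of Whisper or Vox Inanis.

```python
SAMPLE_RATE = 16_000      # Hz, assumed from the Whisper reference code
CHUNK_SECONDS = 30        # fixed encoder context
HOP_LENGTH = 160          # mel-spectrogram hop in samples (10 ms)
ENCODER_STRIDE = 2        # temporal downsampling inside the encoder


def whisper_shapes(n_samples):
    """Return the time dimensions a clip takes on its way through the model."""
    padded = SAMPLE_RATE * CHUNK_SECONDS            # pad_or_trim target length
    mel_frames = padded // HOP_LENGTH               # time axis of the log-mel
    encoder_positions = mel_frames // ENCODER_STRIDE
    return {
        "padded_samples": padded,
        "zero_padding": padded - min(n_samples, padded),
        "mel_frames": mel_frames,
        "encoder_positions": encoder_positions,
    }


shapes = whisper_shapes(360_281)  # a 22.52 s clip at 16 kHz
print(shapes)
```

Only the first `n_samples` positions of the padded waveform carry the perturbation. The zero padding that follows is fixed by `pad_or_trim`.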

FORCED-PREFIX DECODING

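Language sensitivity is critical. Whisper's decoder is conditioned on a prefix of special tokens that includes a language token, and the attack optimizes against one specific language prefix. If validation uses Whisper's automatic language detection and the detected language differs from the attack language, the attack fails silently: Whisper decodes with a different prefix and never enters the high-probability region the optimizer found. Validation therefore has to force the same prefix that the attack used.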
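One way to picture the forced prefix is as a teacher-forcing layout: the decoder input is the prefix followed by the target, and the loss is computed only where a target token is being predicted. The sketch below uses token strings instead of integer ids to stay readable. The special-token names are Whisper's; the tokenization of the target and the helper `build_teacher_forcing` are illustrative assumptions, not the Vox Inanis implementation.

```python
IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross-entropy


def build_teacher_forcing(prefix, target, eot="<|endoftext|>"):
    """Build (decoder_input, labels) so that labels[i] is the token after decoder_input[i]."""
    sequence = list(prefix) + list(target) + [eot]
    decoder_input = sequence[:-1]
    labels = sequence[1:]
    # Positions that would predict prefix tokens carry no loss.
    n_masked = len(prefix) - 1
    labels = [IGNORE_INDEX] * n_masked + labels[n_masked:]
    return decoder_input, labels


prefix_en = ["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"]
target = ["hello", " world", "!"]  # hypothetical tokenization of the payload
decoder_input, labels = build_teacher_forcing(prefix_en, target)

for inp, lab in zip(decoder_input, labels):
    print(f"{inp!r:>26} -> {lab!r}")
```

Swapping `<|en|>` for another language token changes every conditional probability downstream, which is why a perturbation tuned for one prefix says little about what happens under another.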

MI-FGSM OPTIMIZER

I use MI-FGSM (Momentum Iterative Fast Gradient Sign Method) with L∞ projection:

# Normalize the current gradient by its mean absolute value (stabilizes across steps)
grad_norm = grad / (grad.abs().mean() + 1e-12)
# Accumulate with momentum (decay = 0.9)
grad_momentum = 0.9 * grad_momentum + grad_norm
# Sign step with cosine-scheduled learning rate
delta = delta - effective_lr * grad_momentum.sign()
# Project back into the L-inf epsilon-ball
delta = delta.clamp(-eps, eps)

The cosine learning rate schedule decays from lr_max to lr_min over the run. Large steps early for exploration, small steps late for refinement. The perturbation is initialized with uniform random noise within the epsilon-ball rather than at zero, which empirically improves convergence by avoiding symmetric gradient cancellation.
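To make the update rule and the schedule concrete, here is a minimal sketch in plain Python that runs the same loop on a toy objective. The quadratic `toy_grad` stands in for the gradient of the Whisper loss, and the function names are illustrative assumptions. The cosine formula is one common form and may differ in detail from the one used in Vox Inanis.

```python
import math
import random


def cosine_lr(step, total_steps, lr_max=0.0025, lr_min=0.0001):
    """Cosine decay from lr_max (first step) to lr_min (last step)."""
    progress = step / max(total_steps - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def sign(value):
    return (value > 0) - (value < 0)


def mi_fgsm(grad_fn, n_samples, eps=0.05, steps=300, decay=0.9, seed=0):
    rng = random.Random(seed)
    # Random start inside the epsilon-ball instead of all zeros.
    delta = [rng.uniform(-eps, eps) for _ in range(n_samples)]
    momentum = [0.0] * n_samples
    for step in range(steps):
        grad = grad_fn(delta)
        scale = sum(abs(g) for g in grad) / n_samples + 1e-12  # mean |grad|
        momentum = [decay * m + g / scale for m, g in zip(momentum, grad)]
        lr = cosine_lr(step, steps)
        delta = [d - lr * sign(m) for d, m in zip(delta, momentum)]
        delta = [max(-eps, min(eps, d)) for d in delta]        # L-inf projection
    return delta


# Toy objective: the loss is minimized when delta equals `wanted`.
wanted = [0.03, -0.40, 0.00]


def toy_grad(delta):
    return [2.0 * (d - w) for d, w in zip(delta, wanted)]


delta = mi_fgsm(toy_grad, n_samples=3)
print(delta)
```

The second coordinate "wants" to reach -0.40, but the projection pins it at -eps. That is the budget doing its job: the optimizer can only use what fits inside the ball.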

I also apply logit temperature scaling (T = 0.8) to sharpen the decoder's output distribution before computing cross-entropy. This provides stronger gradients early in optimization when the model is still uncertain.
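The effect of the temperature can be checked on a toy example. The sketch below computes softmax cross-entropy and its gradient with respect to the logits for a three-token vocabulary. The logits are made up, and the closed-form gradient `(p - onehot) / T` is the standard result for cross-entropy over `softmax(z / T)`; nothing here is taken from the Vox Inanis code.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index, temperature=1.0):
    return -math.log(softmax(logits, temperature)[target_index])


def grad_wrt_logits(logits, target_index, temperature=1.0):
    """d CE / d logits for CE computed on softmax(logits / temperature)."""
    probs = softmax(logits, temperature)
    return [
        (p - (1.0 if i == target_index else 0.0)) / temperature
        for i, p in enumerate(probs)
    ]


logits = [2.0, 1.0, 0.5]   # hypothetical decoder logits; the target token is index 1
target_index = 1

grad_plain = grad_wrt_logits(logits, target_index, temperature=1.0)
grad_sharp = grad_wrt_logits(logits, target_index, temperature=0.8)
print(abs(grad_plain[target_index]), abs(grad_sharp[target_index]))
```

In this toy case the target is not yet the top token, and T = 0.8 gives a larger gradient on the target logit than T = 1.0. That matches the early-optimization situation described above, though a single example is not a general proof.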

PERCEPTUAL MASKING LOSS

The L∞ constraint bounds peak amplitude but says nothing about audibility. A perturbation spread uniformly across all samples is far more audible than the same energy concentrated where the cover audio is already loud. I enforce perceptual hiding with an STFT-domain masking loss:

cover_mag    = |STFT(x)|
delta_mag    = |STFT(delta)|
threshold    = cover_mag * 10^(mask_db / 20)
masking_loss = mean(ReLU(delta_mag - threshold)^2)

L_total = L_CE + lambda * L_mask

The STFT uses a Hann window with n_fft = 1024 and hop length = 256 at 16 kHz, giving a 16 ms hop and a 64 ms window. At mask_db = -25, the perturbation may be at most ~1/18 of the local cover amplitude in each time-frequency bin. Bins exceeding this threshold contribute a quadratic penalty. Bins below it contribute zero.
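The threshold arithmetic and the hinge-style penalty can be sketched without an STFT by working directly on magnitude values. In the snippet below, `cover_mag` and `delta_mag` are assumed to be flat lists of per-bin magnitudes that some STFT has already produced, and both function names are illustrative.

```python
def masking_ratio(mask_db):
    """Convert a masking depth in dB to a linear amplitude ratio."""
    return 10.0 ** (mask_db / 20.0)


def masking_loss(cover_mag, delta_mag, mask_db=-25.0):
    """Mean squared excess of the perturbation over the per-bin threshold."""
    ratio = masking_ratio(mask_db)
    penalties = [
        max(d - c * ratio, 0.0) ** 2      # ReLU(delta_mag - threshold)^2
        for c, d in zip(cover_mag, delta_mag)
    ]
    return sum(penalties) / len(penalties)


ratio = masking_ratio(-25.0)
print(f"threshold ratio: {ratio:.4f} (about 1/{1.0 / ratio:.1f})")

# Two bins with the same cover magnitude: one perturbation under the threshold, one over.
loss = masking_loss(cover_mag=[1.0, 1.0], delta_mag=[0.05, 0.10])
print(loss)
```

Only the second bin contributes here. Near-silent bins have a threshold close to zero, so almost any perturbation energy in them is penalized.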

The tradeoff: lambda = 0 gives a pure CE attack (converges fastest, fully audible). Lambda in [10, 200] is the practical range. Lambda = 50 works well for covers with high-entropy texture (noise, music, ambient). Speech with quiet pauses needs higher lambda to keep the silent gaps clean.

CODEC-IN-THE-LOOP (STE)

A perturbation optimized against raw audio will not survive AAC compression at 128 kbps. To produce codec-robust attacks, I route the perturbed audio through ffmpeg inside the optimization loop. ffmpeg has no usable Jacobian, so I wrap it in a Straight-Through Estimator: the forward pass runs the real ffmpeg pipeline (resample to 44.1 kHz, AAC encode, decode, resample back to 16 kHz), the backward pass treats the codec as the identity function.

The STE is mathematically loose (it ignores the codec's actual local behavior) but empirically effective. To make this fast enough for iterative use, I replaced the file-based ffmpeg pipeline with a pipe-based version that streams audio through stdin/stdout, eliminating all disk I/O.
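The straight-through idea can be shown on a scalar without any autograd library. In the sketch below, `codec` is a deliberately crude stand-in for the ffmpeg round-trip: it snaps values to a coarse grid, so its true derivative is zero almost everywhere. The backward step ignores that and passes the gradient through as if the codec were the identity. All names are illustrative assumptions.

```python
def codec(x, grid=0.125):
    """Stand-in for a lossy codec: snap the value to a coarse grid."""
    return round(x / grid) * grid


def ste_descent(x, target, lr=0.05, steps=100):
    """Minimize (codec(x) - target)^2 by gradient descent with a straight-through backward pass."""
    for _ in range(steps):
        y = codec(x)                  # forward: the real, non-differentiable transform
        grad_y = 2.0 * (y - target)   # dL/dy for L = (y - target)^2
        grad_x = grad_y * 1.0         # backward: STE pretends dy/dx = 1
        x -= lr * grad_x
    return x


x_final = ste_descent(0.0, target=0.5)
print(x_final, codec(x_final))
```

With the true derivative the update would never move, because the gradient is zero on every plateau of the grid. In PyTorch the same trick is commonly written as `x + (codec(x) - x).detach()`, which evaluates to `codec(x)` in the forward pass and has the gradient of `x` in the backward pass.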

GRADIENT CHECKPOINTING

PGD requires the full backward pass through encoder and decoder. Storing all intermediate activations approximately doubles peak memory vs. inference-only. To run Whisper small (460 MB) on GPUs with limited VRAM, I wrap each transformer block in torch.utils.checkpoint.checkpoint(use_reentrant=False). Activations inside each block are discarded after the forward pass and recomputed during the backward pass. Trades ~1.4x compute for ~3x lower peak memory.

EXPERIMENT

RUN 24EC741A // TARGETED INJECTION ON REAL SPEECH

Cover audio: ElevenLabs-generated speech, 22.52 seconds, 360,281 samples at 16 kHz mono. A mock job interview in German. The speaker introduces themselves, mentions a preference for strong coffee, and describes their approach to complex systems. Fluent, natural German speech with varied prosody and natural pauses.

Target text: "hello world! how are you" (English)

HYPERPARAMETERS

MODEL

Whisper model	small · 244M params · 460 MB
Device	Apple MPS
Language	en (forced)

OPTIMIZER

Algorithm	MI-FGSM
Steps	300
Momentum decay μ	0.9
LR schedule	Cosine · 0.0025 → 0.0001
Initialization	Uniform random within ε-ball
Logit temperature	0.8

PERTURBATION

Epsilon (L∞)	0.05 · 5% of full amplitude
Masking λ	50.0
Masking depth	-25 dB
STFT window	Hann · n_fft=1024 · hop=256

CODEC

Codec-in-loop	Disabled
Codec (post-attack)	AAC @ 128 kbps via ffmpeg

RESULT

Wall time	322.95 s · ~5.4 min

LISTEN FOR YOURSELF

CLEAN // ORIGINAL AUDIO (GERMAN)

The unmodified interview recording in German. Whisper transcribes the full speech correctly.

PERTURBED // AFTER PGD INJECTION

Listen for the crackling and white noise layer. Whisper now transcribes: "hello world!" instead of the German speech.

AFTER YOUTUBE // 128 KBPS AAC ROUND-TRIP

After lossy compression. The noise is still present. Whisper outputs: "Yes, my lord, ollh good!" The original German is still gone.

TRANSCRIPTION ACROSS STAGES

CLEAN // ORIGINAL GERMAN AUDIO

“Good day! First of all, thank you for taking the time today. Why am I the right person for this job? Probably not because of my legendary preference for too much strong coffee. But because I see complex systems as a lively puzzle. And I give rest only when every part is perfectly in its place.”

Whisper auto-translates the German speech to English in its output.

PERTURBED // AFTER PGD INJECTION

"hello world!"

AFTER YOUTUBE // 128 KBPS AAC CODEC

"Yes, my lord, ollh good!"

The clean transcription is accurate and complete. Whisper small has no difficulty with the ElevenLabs voice. After injection, the original 58-word transcript is replaced entirely by the two-word phrase "hello world!", the opening of the target text. After the YouTube codec round-trip, the injected payload degrades to garbled fragments, but the original speech is not recovered. Whisper produces only 6 tokens of nonsensical output.

Current state: The injection works, but it's audible. You can clearly hear crackling and a white noise layer on the perturbed files. The next goal is to push the perturbation further below the audibility threshold and see how invisible I can actually make this.

CONVERGENCE

Initial CE loss	5.109
Final CE loss	0.279
Initial masking loss	0.298
Final masking loss	0.278
Initial total loss	20.028
Final total loss	14.175
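The reported totals can be cross-checked against L_total = L_CE + lambda * L_mask with lambda = 50. The sketch below does that arithmetic and allows for the fact that each number in the table is rounded to three decimals, where the rounding error of the masking loss is amplified by lambda. The helper names and the tolerance rule are my own assumptions.

```python
LAMBDA = 50.0


def total_loss(ce, mask, lam=LAMBDA):
    return ce + lam * mask


def consistent(ce, mask, reported_total, lam=LAMBDA, decimals=3):
    """True if the reported total matches CE + lam * mask up to rounding of all three values."""
    half_unit = 0.5 * 10 ** (-decimals)
    tolerance = half_unit * (1.0 + lam + 1.0)  # CE, mask (scaled by lam), and the total
    return abs(total_loss(ce, mask, lam) - reported_total) <= tolerance


initial = total_loss(5.109, 0.298)
final = total_loss(0.279, 0.278)
print(f"initial: {initial:.3f} (reported 20.028)")
print(f"final:   {final:.3f} (reported 14.175)")
print(consistent(5.109, 0.298, 20.028), consistent(0.279, 0.278, 14.175))
```

Both rows agree within rounding. The numbers also show where the total goes: at the end of the run the masking term (about 13.9) dominates the cross-entropy term (0.279).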

SIGNAL METRICS

14.97 dB SNR

Clean-to-perturbed signal-to-noise ratio. The perturbation is audible as crackling/noise at this level. This is what I want to improve.

L∞ = 0.05

Hits exactly the epsilon budget. Each sample moves by at most 5% of full amplitude. The codec increases this to 0.077 as compression smears the perturbation.
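For reference, both metrics are short formulas. The sketch below uses the usual definitions: SNR as the ratio of clean-signal energy to perturbation energy in dB, and L∞ as the largest per-sample difference. The function names and the four-sample signal are illustrative and not taken from the run.

```python
import math


def snr_db(clean, perturbed):
    """Signal-to-noise ratio in dB, treating (perturbed - clean) as the noise."""
    signal_energy = sum(c * c for c in clean)
    noise_energy = sum((p - c) ** 2 for c, p in zip(clean, perturbed))
    return 10.0 * math.log10(signal_energy / noise_energy)


def linf(clean, perturbed):
    """Largest absolute per-sample change."""
    return max(abs(p - c) for c, p in zip(clean, perturbed))


clean = [0.5, -0.5, 0.5, -0.5]
perturbed = [c + 0.05 for c in clean]   # every sample moved by the full 0.05 budget

print(f"SNR:  {snr_db(clean, perturbed):.2f} dB")
print(f"Linf: {linf(clean, perturbed):.3f}")
```

The two metrics measure different things: L∞ caps the worst single sample, while SNR depends on total perturbation energy relative to the cover, so quiet passages lower the SNR even when the budget is unchanged.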

322.95s WALL TIME

~5.4 minutes for 300 steps on Apple MPS. About 1.08 seconds per step including forward pass, loss computation, and backpropagation through the full Whisper model.

DIAGNOSTIC VISUALIZATIONS

FIG_02 // PGD_LOSS_CURVE

The optimizer's confidence over time. The line dropping means Whisper is becoming more and more certain that the audio says 'hello world' instead of the original German speech.

FIG_03 // CLEAN_VS_PERTURBED

Side-by-side of original (left) and injected version (right). Waveforms look nearly identical, but the frequency content has shifted in ways only the model notices.

FIG_04 // WAVEFORM_OVERLAY

Red = the injected perturbation, yellow = codec artifacts. The injection is small compared to the original signal but present throughout the entire clip.

FIG_05 // SPECTROGRAM_DIFF

Red = frequency energy added by the attack, blue = energy removed. The large red block is the white noise you can hear in the perturbed audio. This is what I want to reduce next.

FIG_06 // CODEC_SURVIVAL

Compares original with the version after YouTube's AAC codec. The perturbation is affected by compression but still prevents correct transcription.

FIG_07 // STACKED_WAVEFORMS

All three versions stacked. Top = original, middle = after injection, bottom = after YouTube codec. Waveforms look almost identical to the eye.

FIG_08 // THREE_SPECTROGRAMS

Frequency content over time for all three versions. The spectrograms show which parts of the frequency spectrum are active at each moment. Differences here are what Whisper picks up on.

FIG_09 // POWER_SPECTRAL_DENSITY

Power distribution across frequencies. The perturbation adds energy mostly in mid-to-high frequencies where it's harder for humans to notice, but still clearly visible as the gap between the lines.

KEY FINDING: DISRUPTION SURVIVES THE CODEC

The target phrase "hello world! how are you" did not survive the YouTube codec. The perturbed-stage transcription reproduces the opening of the target ("hello world!"), but the after-YouTube transcription produces "Yes, my lord, ollh good!" instead.

However, the original speech was not recovered either. The 128 kbps AAC round-trip destroyed the precise adversarial pattern that encoded the target phrase, but it also destroyed the cover audio's original statistical features that Whisper relied on for accurate transcription. The result is complete transcript annihilation: neither the target phrase nor the original speech survives.

This has direct implications for the defensive use case. An adversarial perturbation optimized for injection also functions as a disruption tool, even when the injection itself fails. The codec strips the payload but leaves the damage.

ADVERSARIAL_ANONYMIZATION

Every voice recording you make (video calls, voice messages, podcast appearances, ambient audio from smart devices) can be fed into STT models and transcribed without your knowledge or consent. The resulting text is searchable, indexable, and permanent. Your spoken words become data.

Run 24ec741a was designed as a targeted injection experiment, but its after-codec result demonstrates the anonymization principle empirically: the perturbation completely prevented Whisper from recovering the original 58-word interview transcript, even after lossy compression stripped the injected payload. What remains is 6 tokens of garbled nonsense.

UNTRANSCRIBABLE

Whisper (and likely other STT models) cannot recover your words. The audio sounds normal to human listeners.

CODEC-ROBUST

The disruption survives lossy compression. YouTube, podcast hosting, messaging apps. No cooperation from the platform required.

CLIENT-SIDE

Apply the perturbation before upload. No server-side integration needed. Works with any existing audio pipeline.

ACOUSTIC CAMOUFLAGE

Your voice passes through to human ears unchanged, but the machine listening layer sees only noise. Dazzle camouflage for the audio domain.

NOTE: Dedicated anonymization experiments (untargeted disruption, entropy maximization) have not yet been run. The anonymization finding here is inferred from the injection experiment's after-codec behavior. Systematic evaluation with WER measurements is planned.

DEFENSE_CONSIDERATIONS

Any responsible publication of attack capabilities must be paired with defense analysis. The following defenses are under investigation:

D1 // INPUT DETECTION

Flag adversarial structure before the audio reaches STT. Spectral residuals, statistical anomalies in windowed sample distributions, or a learned classifier trained on (clean, perturbed) pairs from my own attack runs.

D2 // DEFENSIVE RE-ENCODING

Apply lossy processing (8-bit quantization, additional codec round-trip, low-pass filtering) to destroy adversarial features before transcription. The mirror of why my attacks need codec-in-the-loop training.

D3 // RANDOMIZED SMOOTHING

Apply random small perturbations of the same magnitude as epsilon at inference time and average transcriptions over multiple samples. Classic adversarial robustness technique from the image domain.
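A minimal sketch of this defense, assuming a `transcribe` callable that maps a list of samples to a string. Text cannot be averaged directly, so the sketch aggregates by majority vote, which is one possible choice. The brittle fake transcriber exists only to make the example self-contained.

```python
import random
from collections import Counter


def smoothed_transcribe(audio, transcribe, sigma=0.05, n_samples=8, seed=0):
    """Transcribe several randomly perturbed copies and return the most common result."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        noisy = [max(-1.0, min(1.0, s + rng.uniform(-sigma, sigma))) for s in audio]
        votes[transcribe(noisy)] += 1
    return votes.most_common(1)[0][0]


# Fake STT model for illustration: it only emits the payload on the exact adversarial input.
adversarial = [0.10, -0.20, 0.30, -0.40]


def brittle_transcribe(samples):
    return "injected payload" if samples == adversarial else "original speech"


print(brittle_transcribe(adversarial))
print(smoothed_transcribe(adversarial, brittle_transcribe))
```

Real adversarial perturbations are not this brittle, and the added noise also costs accuracy on clean audio. Both effects would have to be measured, ideally against an attacker who optimizes with the smoothing in the loop.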

D4 // PROMPT HARDENING

If the threat is "STT emits attacker-controlled tokens, LLM acts on them," the LLM side can defend by treating transcripts as untrusted input. Explicit framing, output format constraints, or two-stage classification.

D5 // ADAPTIVE EVALUATION

A defense evaluated only against the original non-adaptive attack is overestimated (Athalye, Carlini, Wagner 2018). Real-world deployment must assume the attacker adapts to whatever defense is deployed.

STATUS

All defense evaluations are currently stubs. No quantitative defense measurements have been taken yet. This is next on the roadmap after the ablation sweeps.

OPEN_QUESTIONS

CROSS-MODEL TRANSFER

Do perturbations optimized against Whisper small transfer to Whisper large? To completely different STT engines like Vosk, Google Speech-to-Text, or Azure?

MINIMUM DETECTABLE PERTURBATION

How small can epsilon be before an input-side anomaly detector (that doesn't run STT) can reliably flag the audio as adversarial?

WORD ERROR RATE

Current evaluation is boolean: hit or miss on the target phrase. Proper WER measurement against the clean transcript is needed for quantitative claims.
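Word error rate is the word-level edit distance between reference and hypothesis, divided by the number of reference words. A minimal sketch using the standard dynamic-programming recurrence follows. The normalization (lowercasing and whitespace splitting) is a simplifying assumption; real evaluations usually also strip punctuation.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] is the edit distance between the ref words seen so far and the first j hyp words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)


print(word_error_rate("hello world how are you", "hello world"))
print(word_error_rate("a b", "x y z"))
```

WER can exceed 1.0 when the hypothesis is longer than the reference, which matters for disruption experiments where the model may hallucinate long outputs.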

ABLATION STUDIES

Systematic sweeps over epsilon (0.01, 0.03, 0.05, 0.08), lambda (0, 10, 50, 100, 200), and optimizer (sign-PGD vs. MI-FGSM vs. Adam) to map the parameter space.
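The sweep described above is a full Cartesian product, which can be enumerated up front to see how many runs it implies. The dictionary keys and optimizer labels below are illustrative names, not the project's actual configuration schema.

```python
from itertools import product

EPSILONS = (0.01, 0.03, 0.05, 0.08)
LAMBDAS = (0, 10, 50, 100, 200)
OPTIMIZERS = ("sign-pgd", "mi-fgsm", "adam")


def ablation_grid():
    """Enumerate every (epsilon, lambda, optimizer) combination as a config dict."""
    return [
        {"epsilon": eps, "lambda": lam, "optimizer": opt}
        for eps, lam, opt in product(EPSILONS, LAMBDAS, OPTIMIZERS)
    ]


grid = ablation_grid()
print(len(grid), "runs")
print(grid[0])
```

At roughly 5.4 minutes per run, as measured for the single run above, a 60-run grid on one clip would take on the order of five and a half hours, assuming run time does not vary much across settings.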

CODEC-IN-LOOP VS. POST-HOC

Does enabling codec-in-the-loop during optimization produce post-codec target hits? Or does the after-codec result always degrade to garbled output?

PERCEPTUAL QUALITY

ABX listening tests to rigorously measure human-perceived audio quality degradation across different playback conditions and listener populations.

SANDBOX_UI

Vox Inanis ships with an interactive web interface for experimenting with attacks in real time. Built on FastAPI with server-sent events for live progress streaming.

Upload any audio clip or generate synthetic pink noise. Configure all attack parameters (epsilon, optimizer, masking depth, codec simulation). Watch the PGD loss converge in real time. Compare clean vs. perturbed vs. post-codec audio side by side. Inspect 8 diagnostic visualizations per run. Runs on Apple Silicon MPS or CUDA, completing in about 5 minutes for a 22-second clip at 300 steps.

FIG_10 // ATTACK_CONFIG

The attack configuration panel. Set target text, choose Whisper model size, adjust epsilon budget, masking parameters, and codec simulation options.

FIG_11 // LIVE_WORKFLOW

The live optimization view. Each of the 12 pipeline phases streams its status in real time via server-sent events as the attack progresses.

FIG_12 // RESULTS_VIEW

The results panel after a completed run. Shows Whisper transcriptions at each stage (clean, perturbed, after codec) with side-by-side comparison and signal metrics.

FIG_13 // FULL_INTERFACE

The complete Vox Inanis interface showing the diagnostic visualization gallery with waveform overlays, spectrogram diffs, and loss curves generated per run.

TECH_STACK

PyTorch + torchaudio // Differentiable audio processing + gradient computation

OpenAI Whisper // Target model (any size, tiny through large)

ffmpeg (pipe-based) // Real codec simulation via stdin/stdout streaming

scipy + librosa // Signal analysis, STFT, resampling

FastAPI // Web server with SSE streaming for live progress

Apple MPS / CUDA // GPU acceleration for the optimization loop

Python 3.11 // Language

matplotlib // 8 diagnostic visualizations per run

WHATS_NEXT

REDUCE AUDIBILITY

The current perturbation is audible as crackling and white noise. Push it further below the audibility threshold. The spectrogram diff (FIG_05) shows the target: shrink that red block.

ANONYMIZATION MODE

Implement untargeted disruption (entropy maximization) and measure WER degradation systematically. Currently inferred from injection results only.

ABLATION SWEEPS

Systematic parameter grid searches over epsilon (0.01-0.08), lambda (0-200), and optimizer (sign-PGD vs. MI-FGSM vs. Adam). Persist metadata per run for reproducible results tables.

CODEC-IN-LOOP TEST

Compare target survival rates with and without codec-in-the-loop training enabled. Does it produce post-codec target hits?

BLACKBOX TRANSFER

Test whether perturbations optimized against Whisper also fool other STT engines. Google Speech-to-Text, Vosk, Azure.

LIVE ANONYMIZATION

Real-time audio filter that adversarially perturbs your microphone output before it reaches any recording or transcription system. Requires replacing per-input optimization with a generator network that can run on streaming audio in real time.

TAKEAWAYS

Building Vox Inanis taught me that the boundary between "what humans hear" and "what machines hear" is surprisingly thin, and exploitable in both directions. A bounded perturbation, a few percent of full scale per sample, can completely rewrite a transcript. The same technique that enables injection also enables defense.

The adversarial ML literature is full of image-domain attacks, but audio is a different beast. The signal is one-dimensional, the relevant features live in the frequency domain, and real-world deployment means surviving lossy compression. Making all of this differentiable (including the codec) was the core engineering challenge.

The most surprising finding from this phase was not that the injection worked. It was that the disruption survived the codec even when the injection didn't. That accident points toward the most impactful application: not putting words in Whisper's mouth, but taking them away.

DISCLAIMER: Vox Inanis is a research project for evaluating adversarial robustness of speech-to-text models. Attack payloads are constrained to owner-ID format strings. The codebase enforces strict isolation between research and production paths. Responsible disclosure guidelines apply to all findings.