Figure 1 — Model Overview
Input audio is represented as variable-Q transform (VQT) spectrograms and encoded into a compact latent timbre representation. The latent is then decoded into RPM and torque gain curves, which serve as the shared parametrization for end-to-end training and for direct DSP export at inference. The gain curves are projected onto temporal soft masks to yield time-varying amplitude envelopes, which drive a differentiable harmonic synthesizer (with f₀ derived from the RPM trajectory) and an ERB noise bank. Training minimizes a combined multi-resolution STFT and harmonic loss against the target audio.
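To make the synthesis stage concrete, the following is a minimal NumPy sketch of an additive harmonic synthesizer driven by an RPM trajectory. It is an illustration under stated assumptions, not the model's actual implementation: the function name `harmonic_synth`, the `order` parameter, the sample rate, and the mapping f₀ = order × RPM / 60 are all hypothetical choices made here, and the differentiable training path and the ERB noise bank are left out.

```python
import numpy as np

def harmonic_synth(rpm, amps, sr=16000, order=1.0):
    """Additive synthesis of harmonics of an RPM-derived fundamental.

    rpm:   (T,) engine speed at every audio sample, in revolutions per minute
    amps:  (K, T) amplitude envelope for each of K harmonics, lowest first
    order: assumed scale factor between crank rotation rate and f0
    """
    rpm = np.asarray(rpm, dtype=float)
    amps = np.asarray(amps, dtype=float)
    f0 = order * rpm / 60.0                        # assumption: Hz = rev/s * order
    phase = 2.0 * np.pi * np.cumsum(f0) / sr       # integrate frequency to phase
    k = np.arange(1, amps.shape[0] + 1)[:, None]   # harmonic numbers 1..K
    audible = (k * f0[None, :]) < sr / 2           # mute harmonics above Nyquist
    return np.sum(amps * audible * np.sin(k * phase[None, :]), axis=0)

# Example: a one-second sweep from 1000 to 3000 RPM with four equal harmonics.
sr = 16000
rpm = np.linspace(1000.0, 3000.0, sr)
amps = np.full((4, sr), 0.25)
audio = harmonic_synth(rpm, amps, sr=sr)
```

Accumulating phase with a running sum, rather than evaluating sin(2π f₀ t) directly, keeps the waveform continuous while the RPM changes over time.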
Audio Examples
Stimuli comprise three conditions:
Target — ground-truth engine recordings;
EONE (ours) — full reconstructions from the proposed model; and
EOE — truncated reconstructions retaining only the 36 lowest harmonics, omitting broadband components. EOE represents a baseline for conventional automotive sound design.
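As a rough illustration of how an EOE-style stimulus relates to a full reconstruction, the sketch below keeps only the lowest harmonics and silences the noise component. It assumes, hypothetically, that a reconstruction is parametrized by a harmonic amplitude matrix and a matrix of noise-band gains, and that "broadband components" corresponds to the noise bank; the function name `truncate_to_harmonics` and the array layouts are not taken from the source.

```python
import numpy as np

def truncate_to_harmonics(harmonic_amps, noise_gains, keep=36):
    """Return copies with only the `keep` lowest harmonics and no noise.

    harmonic_amps: (K, T) amplitude envelopes, row 0 is the first harmonic
    noise_gains:   (B, T) per-band noise gains (assumed layout)
    """
    trimmed = np.array(harmonic_amps, dtype=float, copy=True)
    trimmed[keep:, :] = 0.0                      # drop harmonics above `keep`
    silent = np.zeros_like(noise_gains, dtype=float)
    return trimmed, silent

# Example: 64 harmonics and 32 noise bands over 10 frames.
amps = np.ones((64, 10))
noise = np.ones((32, 10))
eoe_amps, eoe_noise = truncate_to_harmonics(amps, noise)
```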
Spectrograms are shown as a visual reference. Click ▶ to play, or click the waveform to seek.