Saturday, August 22, 2020

The perfect time-frequency transform

A while back I took a music class. We had to do a final project, for which I researched time-frequency transforms. These take a sound signal in the time domain and produce a 2D graph (a spectrogram) of the signal's intensity at each time/frequency pair. For example, there is the short-time Fourier transform (STFT), which takes sliding windowed slices of the signal, runs the Fourier transform over each slice to get a frequency spectrum, and lays the resulting spectra side by side to form the 2D image.
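
As a concrete reference point, here is a minimal sketch of that pipeline using SciPy; the sample rate, chirp, and window length are arbitrary choices of mine:

```python
import numpy as np
from scipy.signal import stft

fs = 8000                        # sample rate in Hz (arbitrary)
t = np.arange(0, 2.0, 1 / fs)    # two seconds of signal
x = np.sin(2 * np.pi * (440 + 200 * t) * t)   # a linear chirp

# Window the signal, FFT each windowed slice, and collect the
# per-slice spectra as columns of a 2D array.
freqs, seg_times, Zxx = stft(x, fs=fs, nperseg=512)
spectrogram = np.abs(Zxx) ** 2   # intensity at each (frequency, time) pair
```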

The problem with the STFT is its resolution: it requires picking a window size for the sliding slices, and that window restricts the frequency information that can be obtained (a short window localizes events well in time but blurs nearby frequencies together, while a long window does the opposite). The right way to handle sampling is to construct an ideal signal using the Whittaker–Shannon interpolation formula, transform that into an ideal time-frequency spectrum, and then compute the average/integral of the spectrum over the area corresponding to each pixel.
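
For concreteness, here is a small sketch of the Whittaker–Shannon reconstruction; the function name and the dense O(N·M) evaluation are my own choices, and a practical version would truncate or window the sum:

```python
import numpy as np

def whittaker_shannon(samples, fs, t):
    """Ideal bandlimited reconstruction of samples x[n], taken at times
    n / fs, evaluated at an array of arbitrary times t."""
    n = np.arange(len(samples))
    # np.sinc is the normalized sinc sin(pi x) / (pi x), which is exactly
    # the interpolation kernel in the Whittaker-Shannon formula.
    return np.sinc(fs * t[:, None] - n[None, :]) @ samples
```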

So, handling sampling is easy: it just requires summing, one term per sample, a double integral of the transform of a shifted sinc function. But what is the actual transform? Reviewing the scientific literature I found many different transforms; the two most interesting were the minimum cross-entropy time-frequency distribution (MCE-TFD) and the tomography time-frequency transform (TTFT). The MCE method uses an iterative process to minimize the entropy of the spectrum. The TTFT uses the fractional Fourier transform to get intensity projections along each angle in the time-frequency plane, and then uses the inverse Radon transform to turn these projections into a 2D spectrum.
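
To make the TTFT pipeline concrete, here is a sketch. It relies on the standard fact that the squared magnitude of the fractional Fourier transform at angle \(\alpha\) is the projection of the time-frequency plane along that angle, i.e. one column of a sinogram. The frft_naive helper is my own O(N²) Riemann-sum discretization of the continuous FrFT kernel, not a production discrete FrFT; the inverse Radon step comes from scikit-image:

```python
import numpy as np
from skimage.transform import iradon

def frft_naive(x, ts, alpha):
    """Naive O(N^2) Riemann-sum discretization of the continuous fractional
    Fourier transform at angle alpha (radians); valid away from alpha = 0, pi."""
    dt = ts[1] - ts[0]
    cot, csc = 1 / np.tan(alpha), 1 / np.sin(alpha)
    amp = np.sqrt((1 - 1j * cot) / (2 * np.pi))
    u, t = np.meshgrid(ts, ts, indexing="ij")
    kernel = amp * np.exp(1j * (0.5 * cot * (u**2 + t**2) - csc * u * t))
    return (kernel * dt) @ x

# A single linear chirp sampled symmetrically around t = 0.
N = 256
ts = np.linspace(-4, 4, N)
x = np.sin((1.0 + 2.0 * ts) * ts)

# |FrFT|^2 at angle alpha is the projection of the time-frequency plane
# along direction alpha, i.e. one column of a sinogram.
angles_deg = np.linspace(1.0, 179.0, 120)   # avoid the degenerate angles
sinogram = np.column_stack(
    [np.abs(frft_naive(x, ts, np.deg2rad(a))) ** 2 for a in angles_deg]
)

# Invert the projections to recover the 2D time-frequency spectrum.
spectrum = iradon(sinogram, theta=angles_deg)
```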

The TTFT perfectly captures linear chirps: a linear chirp \(\sin((a+b t) t)\), whose instantaneous frequency is \(a+2bt\), appears as a straight line in the time-frequency spectrum. But when two or more chirps are present, the TTFT shows interference between them, because the underlying distribution is quadratic in the signal and a cross-term appears for each pair of components. The MCE minimizes entropy, not the cross-term, so it too has interference, although less of it. So the question is: can we get rid of the interference entirely? That amounts to requiring the transform to be linear: the transform of a weighted sum is the weighted sum of the spectra.
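
To see where the cross-term comes from, note that the distribution whose projections the TTFT inverts is the Wigner distribution, which is quadratic in the signal; for a two-component signal,

\[ W_{f+g} = W_f + W_g + 2\,\operatorname{Re} W_{f,g}, \qquad W_{f,g}(t,\omega) = \int f\!\left(t+\tfrac{\tau}{2}\right)\, \overline{g\!\left(t-\tfrac{\tau}{2}\right)}\, e^{-i\omega\tau}\, d\tau, \]

and the cross-term \(W_{f,g}\) oscillates midway between the two components, which is exactly the interference seen in the reconstruction.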

We also want "locality" or clamping: a waveform that is zero everywhere except on a small time slice should have a spectrum that is zero everywhere except in that slice. Similarly, if the frequencies are limited to a band, then the spectrum should also be limited to that band. Additionally we want time-shifting, so that the spectrum of \(f(t+k)\) is the spectrum of \(f\) shifted by \(k\).

So to review the properties:

  • Linearity: \(T(a f + b g) = a T(f)+b T(g) \)
  • Chirps: \(T(\sin((a+b t) t))(t,\omega) = \delta ( \omega - (a+2 b t) ) \) where \(\delta\) is the Dirac delta
  • Locality: \(T(H (t-k) f(t))(t,\omega) = H ( t-k )\, T(f(t))(t,\omega) \) where \(H\) is the Heaviside step function
  • Bandlimiting: if for all \( \omega \in [a,b] \), \( \hat f (\omega) = 0 \), then \( T (f) (t,\omega) = 0 \) for all \( \omega \in [a,b] \)
  • Shifting: \(T(f(t+k))(t,\omega) = T(f(t))(t+k,\omega)\)
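
These conditions are easy to test numerically against any candidate transform. As a sanity check of the wishlist (using my own tiny test harness, not any published proposal), here is a sketch showing that the ordinary spectrogram fails the linearity requirement, which is precisely the interference problem above:

```python
import numpy as np
from scipy.signal import stft

def candidate_T(x, fs):
    """Candidate transform T: the STFT spectrogram. It is quadratic in
    the signal, so it should fail the linearity test below."""
    _, _, Zxx = stft(x, fs=fs, nperseg=256)
    return np.abs(Zxx) ** 2

fs = 8000
t = np.arange(0, 1.0, 1 / fs)
f = np.sin(2 * np.pi * (300 + 400 * t) * t)   # chirp one
g = np.sin(2 * np.pi * (900 - 200 * t) * t)   # chirp two

# Linearity demands T(f + g) == T(f) + T(g); the residual below is the
# cross-term interference the wishlist is trying to eliminate.
residual = candidate_T(f + g, fs) - candidate_T(f, fs) - candidate_T(g, fs)
print(np.abs(residual).max())   # noticeably nonzero for the spectrogram
```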

The question now is whether such a transform exists. It seems that even these conditions are not sufficient to specify the result, because writing the function as a sum of chirps isn't enough to make the resulting graph converge.