The blocks of audio samples you created in the previous steps will be multiplied by a Hanning window function, which looks like this:
window=dsp.arm_hanning_f32(winLength)
plt.plot(window)
plt.show()
Figure 4. Hanning Window
The slices we created are overlapping. By applying a Hanning window function and summing the slices, you can reconstruct the original signal.
Indeed, summing two Hanning windows shifted by half the width of the sample block gives:
Figure 5. Summed Hanning Window
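This constant-overlap-add property can be checked numerically. The sketch below is self-contained (it computes a periodic Hann window explicitly instead of calling `dsp.arm_hanning_f32`, and the `winLength`/`winOverlap` values are illustrative):

```python
import numpy as np

winLength = 256
winOverlap = winLength // 2

# Periodic Hann window: copies shifted by half the window length
# sum exactly to 1 (the constant-overlap-add property).
n = np.arange(winLength)
window = 0.5 - 0.5 * np.cos(2 * np.pi * n / winLength)

# Sum two copies shifted by half the window length.
total = np.zeros(winLength + winOverlap)
total[:winLength] += window
total[winOverlap:] += window

# In the fully overlapped middle region the sum is constant (1.0).
print(np.allclose(total[winOverlap:winLength], 1.0))  # True
```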
As a result, if you multiply the overlapping blocks of samples by Hanning windows and sum the result, you can reconstruct the original signal:
offsets = range(0, len(data),winOverlap)
offsets=offsets[0:len(slices)]
res=np.zeros(len(data))
i=0
for n in offsets:
    res[n:n+winLength] += slices[i]*window
    i=i+1
plt.plot(res)
plt.show()
You can now listen to the recombined signal:
audio2=Audio(data=res,rate=samplerate,autoplay=False)
audio2
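The reconstruction can also be checked numerically outside the notebook. The sketch below uses a synthetic signal and hypothetical `winLength`/`winOverlap` values standing in for the notebook's variables:

```python
import numpy as np

winLength = 256
winOverlap = winLength // 2

# Hypothetical test signal standing in for the recorded audio.
data = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 4 * winLength))

# Cut the signal into half-overlapping slices.
offsets = range(0, len(data) - winLength + 1, winOverlap)
slices = [data[n:n + winLength] for n in offsets]

# Periodic Hann window: shifted copies sum to 1, so windowing each
# slice and overlap-adding reconstructs the signal (away from the edges).
window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(winLength) / winLength)

res = np.zeros(len(data))
for i, n in enumerate(offsets):
    res[n:n + winLength] += slices[i] * window

# Interior samples match the original; the edges are attenuated because
# they are covered by only one window.
interior = slice(winOverlap, len(data) - winLength)
print(np.allclose(res[interior], data[interior]))  # True
```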
This means you can process each slice independently and then recombine them at the end to produce the output signal.
The algorithm works in the spectral domain, so an FFT will be used. When there is no speech (as detected with the VAD), the noise level in each frequency band is estimated. When speech is detected, this noise estimate is used to attenuate the noise.
Noise filtering in each band uses a simplified Wiener filter.
A gain is applied to the signal, defined as follows:
$$H(f) = \frac{S(f)}{S(f) + N(f)}$$

For this tutorial, we assume a high SNR. The VAD relies on this assumption: the signal energy is sufficient to detect speech. With a high signal-to-noise ratio, the transfer function can be approximated as:

$$H(f) \approx 1 - \frac{N(f)}{S(f)}$$

You don’t have access to \(S(f)\), only to the measured \(S(f) + N(f)\), which will be used under the assumption that the noise is small, making the approximation acceptable:

$$H(f) \approx 1 - \frac{N(f)}{S(f) + N(f)}$$

with \(S(f) + N(f) = E(f)\), it can be rewritten as:

$$H(f) \approx \frac{E(f) - N(f)}{E(f)}$$

In the Python code below, you’ll see this formula implemented as:
scaling = (energy - self._noise)/energy
(Don’t evaluate this Python code in your Jupyter notebook—it will be run later as part of the full implementation.)
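To see how this gain behaves, here is a small standalone numerical example (the per-band energies below are made up for illustration; in the real code they come from the FFT):

```python
import numpy as np

# Hypothetical per-band energies: speech energy well above the noise
# floor in the first bands, close to it in the last.
energy = np.array([100.0, 10.0, 2.0, 1.0])
noise = np.array([1.0, 1.0, 1.0, 1.0])

# The approximate Wiener gain: close to 1 where the signal dominates,
# close to 0 where only noise remains.
scaling = (energy - noise) / energy
print(scaling)  # [0.99 0.9  0.5  0.  ]
```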
The entire algorithm will be packaged as a Python class. The class functions are explained below using Python code that should not be evaluated in the Jupyter notebook.
You should only evaluate the full class definition in the Jupyter notebook—not the code snippets used for explanation.
NoiseSuppression is a shared class used by both the float reference implementation and the Q15 version.
class NoiseSuppression():
    def __init__(self,slices):
        self._windowLength=len(slices[0])
        self._fftLen,self._fftShift=fft_length(self._windowLength)
        self._padding_left=(self._fftLen - self._windowLength)//2
        self._padding_right=self._fftLen - self._windowLength - self._padding_left
        self._signal=[]
        self._slices=slices
        self._window=None
The constructor for NoiseSuppression:
The FFT length must be a power of 2. The slice length is not necessarily a power of 2. The constructor computes the closest usable power of 2. The audio slices are padded with zeros on both sides to match the required FFT length.
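As a sketch of that computation, the snippet below reproduces the power-of-two search and the padding split (the 120-sample slice length is just an illustrative value):

```python
def fft_length(length):
    # Smallest power of two >= length, plus its log2 (the "shift").
    result = 2
    fft_shift = 1
    while result < length:
        result = 2 * result
        fft_shift = fft_shift + 1
    return (result, fft_shift)

# For a 120-sample slice the FFT length is 128, so 4 zeros are added
# on each side of the slice.
winLength = 120
fftLen, fftShift = fft_length(winLength)
padLeft = (fftLen - winLength) // 2
padRight = fftLen - winLength - padLeft
print(fftLen, fftShift, padLeft, padRight)  # 128 7 4 4
```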
class NoiseSuppressionReference(NoiseSuppression):
    def __init__(self,slices):
        NoiseSuppression.__init__(self,slices)
        # Compute the vad signal
        self._vad=clean_vad([signal_vad(w) for w in slices])
        self._noise=np.zeros(self._fftLen)
        # The Hann window
        self._window=dsp.arm_hanning_f32(self._windowLength)
The constructor for NoiseSuppressionReference first calls the NoiseSuppression constructor. It then computes the VAD signal for all slices, initializes the noise estimate to zero, and creates the Hann window.
def subnoise(self,v):
    # This is a Wiener estimate.
    # Take the real part: the energy is real by construction, and NumPy
    # does not support comparing a complex array with 0.
    energy = np.real(v * np.conj(v)) + 1e-6
    scaling = (energy - self._noise)/energy
    scaling[scaling<0] = 0
    return(v * scaling)
This function computes the approximate Wiener gain.
If the gain is negative, it is set to 0.
A small value is added to the energy to avoid division by zero.
This function is applied to all frequency bands of the FFT. The v argument is a vector.
def remove_noise(self,w):
    # We pad the signal with zeros. This assumes the padding is divisible by 2.
    # A more robust implementation would also handle the odd-length case.
    # The FFT length is greater than the window length and must be a power of 2.
    sig=self.window_and_pad(w)
    # FFT
    fft=np.fft.fft(sig)
    # Noise suppression
    fft = self.subnoise(fft)
    # IFFT
    res=np.fft.ifft(fft)
    # We assume the result should be real, so we ignore the imaginary part.
    res=np.real(res)
    # We remove the padding.
    res=self.remove_padding(res)
    return(res)
The function computes the FFT (with padding) and reduces noise in the frequency bands using the approximate Wiener gain.
def estimate_noise(self,w):
    # Compute the padded signal.
    sig=self.window_and_pad(w)
    fft=np.fft.fft(sig)
    # Estimate the noise energy.
    self._noise = np.abs(fft)*np.abs(fft)
    # Remove the noise.
    fft = self.subnoise(fft)
    # Perform the IFFT, assuming the result is real, so we ignore the imaginary part.
    res=np.fft.ifft(fft)
    res=np.real(res)
    res=self.remove_padding(res)
    return(res)
This function is very similar to the previous one. It’s used when no speech is detected. It updates the noise estimate before reducing the noise.
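Note that the reference code overwrites the noise estimate with the energy of the most recent non-speech frame. A common variant, not used in this tutorial, smooths the estimate with exponential averaging so that a single atypical silent frame does not dominate it. A minimal sketch (alpha is a hypothetical smoothing factor):

```python
import numpy as np

def update_noise(noise, fft, alpha=0.9):
    # Exponential averaging of the per-band noise energy:
    # keep a fraction alpha of the old estimate and blend in the
    # energy of the current non-speech frame.
    frame_energy = np.abs(fft) ** 2
    return alpha * noise + (1 - alpha) * frame_energy

# Starting from a zero estimate, one update contributes 10% of the
# current frame energy.
noise = np.zeros(4)
noise = update_noise(noise, np.array([1.0, 2.0, 0.0, 1.0]))
print(noise)  # [0.1 0.4 0.  0.1]
```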
def nr(self):
    for (w,v) in zip(self._slices,self._vad):
        result=None
        if v==1:
            # If voice is detected, we only remove the noise.
            result=self.remove_noise(w)
        else:
            # If no voice is detected, we update the noise estimate.
            result=self.estimate_noise(w)
        self._signal.append(result)
The main function: it removes noise from each slice. If a slice does not contain speech, the noise estimate is updated before reducing noise in each frequency band.
The filtered slices are recombined:
def overlap_and_add(self):
    offsets = range(0, len(self._signal)*winOverlap,winOverlap)
    offsets=offsets[0:len(self._signal)]
    res=np.zeros(len(data))
    i=0
    for n in offsets:
        res[n:n+winLength]+=self._signal[i]
        i=i+1
    return(res)
You can evaluate this code in your Jupyter notebook.
def fft_length(length):
    result=2
    fft_shift=1
    while result < length:
        result = 2*result
        fft_shift = fft_shift + 1
    return(result,fft_shift)

class NoiseSuppression():
    def __init__(self,slices):
        self._windowLength=len(slices[0])
        self._fftLen,self._fftShift=fft_length(self._windowLength)
        self._padding_left=(self._fftLen - self._windowLength)//2
        self._padding_right=self._fftLen - self._windowLength - self._padding_left
        self._signal=[]
        self._slices=slices
        self._window=None

    def window_and_pad(self,w):
        if w.dtype==np.int32:
            w=dsp.arm_mult_q31(w,self._window)
        elif w.dtype==np.int16:
            w=dsp.arm_mult_q15(w,self._window)
        else:
            w = w*self._window
        sig=np.hstack([np.zeros(self._padding_left,dtype=w.dtype),w,np.zeros(self._padding_right,dtype=w.dtype)])
        return(sig)

    def remove_padding(self,w):
        return(w[self._padding_left:self._padding_left+self._windowLength])

class NoiseSuppressionReference(NoiseSuppression):
    def __init__(self,slices):
        # In a better version, this could be computed from the signal length by
        # taking the smallest power of two greater than the signal length.
        NoiseSuppression.__init__(self,slices)
        # Compute the vad signal
        self._vad=clean_vad([signal_vad(w) for w in slices])
        self._noise=np.zeros(self._fftLen)
        # The Hann window
        self._window=dsp.arm_hanning_f32(self._windowLength)

    # Subtract the noise
    def subnoise(self,v):
        # This is a Wiener estimate.
        # Take the real part: the energy is real by construction, and NumPy
        # does not support comparing a complex array with 0.
        energy = np.real(v * np.conj(v)) + 1e-6
        scaling = (energy - self._noise)/energy
        scaling[scaling<0] = 0
        return(v * scaling)

    def remove_noise(self,w):
        # We pad the signal with zeros. This assumes the padding is divisible by 2.
        # A more robust implementation would also handle the odd-length case.
        # The padding is required because the FFT length is greater than the
        # length of the window.
        sig=self.window_and_pad(w)
        # FFT
        fft=np.fft.fft(sig)
        # Noise suppression
        fft = self.subnoise(fft)
        # IFFT
        res=np.fft.ifft(fft)
        # We assume the result should be real, so we ignore the imaginary part.
        res=np.real(res)
        # We remove the padding.
        res=self.remove_padding(res)
        return(res)

    def estimate_noise(self,w):
        # Compute the padded signal.
        sig=self.window_and_pad(w)
        fft=np.fft.fft(sig)
        # Estimate the noise energy.
        self._noise = np.abs(fft)*np.abs(fft)
        # Remove the noise.
        fft = self.subnoise(fft)
        # Perform the IFFT, assuming the result is real, so we ignore the imaginary part.
        res=np.fft.ifft(fft)
        res=np.real(res)
        res=self.remove_padding(res)
        return(res)

    # Process all the slices using the VAD detection
    def nr(self):
        for (w,v) in zip(self._slices,self._vad):
            result=None
            if v==1:
                # If voice is detected, we only remove the noise.
                result=self.remove_noise(w)
            else:
                # If no voice is detected, we update the noise estimate.
                result=self.estimate_noise(w)
            self._signal.append(result)

    # Overlap and add to rebuild the signal
    def overlap_and_add(self):
        offsets = range(0, len(self._signal)*winOverlap,winOverlap)
        offsets=offsets[0:len(self._signal)]
        res=np.zeros(len(data))
        i=0
        for n in offsets:
            res[n:n+winLength]+=self._signal[i]
            i=i+1
        return(res)
You can now test this algorithm on the original signal:
n=NoiseSuppressionReference(slices)
n.nr()
cleaned=n.overlap_and_add()
plt.plot(cleaned)
plt.show()
Figure 6. Cleaned signal
You can now listen to the result:
audioRef=Audio(data=cleaned,rate=samplerate,autoplay=False)
audioRef