Speech Recognition
Speech recognition is the process of understanding the words spoken by humans. The speech signals are captured using a microphone, and the system tries to recognize the words being spoken. Speech recognition is used extensively in human-computer interaction, smartphones, speech transcription, biometric systems, security, and more.
Researchers work on various aspects and applications of speech, such as
understanding spoken words, identifying who the speaker is, recognizing emotions, and identifying accents.
Speech recognition represents an important step in the field of human-
computer interaction. If we want to build cognitive robots that can interact with humans, they need to talk to us in natural language. This is the reason that automatic speech recognition has been the center of attention for many researchers in recent years. Let’s go ahead and see how to deal with speech signals and build a speech recognizer.
Visualizing audio signals
Let’s see how to visualize an audio signal. We will learn how to read an audio
signal from a file and work with it. This will help us understand how an audio
signal is structured. When audio is recorded using a microphone, the actual audio signal is sampled and the digitized version is stored. The real audio signal is a continuous-valued wave, which means we cannot store it as it is. We need to sample the signal at a certain frequency and convert it into a discrete numerical form. Most commonly, speech signals are sampled at 44,100 Hz. This means that each second of the speech signal is broken down into 44,100 parts, and the value at each of these timestamps is stored in an output file. In other words, we save the value of the audio signal every 1/44,100 seconds, and we say that the sampling frequency of the audio signal is 44,100 Hz. With a sufficiently high sampling frequency, the audio signal sounds continuous to human listeners. Let's go ahead and visualize an audio signal.
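As a quick back-of-the-envelope sketch (the 3-second clip length here is just a made-up example), we can check what this sampling rate implies:
# Quick sanity check: how many samples a 3-second clip holds at 44,100 Hz
sampling_freq = 44100   # samples per second
duration = 3            # hypothetical clip length in seconds
num_samples = sampling_freq * duration
print('Samples stored:', num_samples)                          # 132300
print('Time between samples:', 1 / sampling_freq, 'seconds')   # ~0.0000227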
Create a new Python file and import the following packages:
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
Read the input audio file using the wavfile.read method. It returns two values – the sampling frequency and the audio signal:
# Read the audio file
sampling_freq, signal = wavfile.read('random_sound.wav')
Print the shape of the signal, the datatype, and the duration of the audio signal:
# Display the params
print('\nSignal shape:', signal.shape)
print('Datatype:', signal.dtype)
print('Signal duration:', round(signal.shape[0] / float(sampling_freq), 2), 'seconds')
Normalize the signal. The samples are 16-bit signed integers, so dividing by 2^15 scales them to the range [-1, 1]:
# Normalize the signal
signal = signal / np.power(2, 15)
Extract the first 50 values from the numpy array for plotting:
# Extract the first 50 values
signal = signal[:50]
Construct the time axis in milliseconds:
# Construct the time axis in milliseconds
time_axis = 1000 * np.arange(0, len(signal), 1) / float(sampling_freq)
Plot the audio signal:
# Plot the audio signal
plt.plot(time_axis, signal, color='black')
plt.xlabel('Time (milliseconds)')
plt.ylabel('Amplitude')
plt.title('Input audio signal')
plt.show()
Transforming audio signals to the frequency domain
In order to analyze audio signals, we need to understand the underlying frequency components. This gives us insights into how to extract meaningful information from this signal. Audio signals are composed of a mixture of sine waves of varying frequencies, phases, and amplitudes. If we dissect the frequency components, we can identify a lot of characteristics. Any
given audio signal is characterized by its distribution in the frequency spectrum. In order to convert a time domain signal into the frequency domain, we need to use a mathematical tool such as the Fourier Transform. If you need a quick refresher on the Fourier Transform, check out this link: http://www.thefouriertransform.com. Let’s see how to transform an audio signal from the time domain to the frequency domain.
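Before working with a real recording, here is a small self-contained sketch of the idea. The 1 kHz sampling rate and the 50 Hz and 120 Hz tones are arbitrary values chosen for illustration; the point is that the FFT of the mixture shows peaks exactly at the frequencies of the two sine waves we put in:
import numpy as np

# Mix two sine waves and recover their frequencies with the FFT
sampling_freq = 1000    # in Hz
t = np.arange(0, 1, 1.0 / sampling_freq)
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

# Magnitude spectrum and the frequency value of each bin
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / sampling_freq)

# The two largest peaks sit at the frequencies of the original sine waves
peaks = freqs[np.argsort(spectrum)[-2:]]
print('Dominant frequencies (Hz):', sorted(peaks))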
Create a new Python file and import the following packages:
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
Read the input audio file using the wavfile.read method. It returns two values – the sampling frequency and the audio signal:
# Read the audio file
sampling_freq, signal = wavfile.read('spoken_word.wav')
Normalize the audio signal:
# Normalize the values
signal = signal / np.power(2, 15)
Extract the length and half-length of the signal:
# Extract the length of the audio signal
len_signal = len(signal)
# Extract the half length
len_half = np.ceil((len_signal + 1) / 2.0).astype(int)
Apply the Fourier transform to the signal:
# Apply Fourier transform
freq_signal = np.fft.fft(signal)
Normalize the frequency domain signal and take the square:
# Normalization
freq_signal = abs(freq_signal[0:len_half]) / len_signal
# Take the square
freq_signal **= 2
Adjust the Fourier-transformed signal for even and odd cases:
# Extract the length of the frequency transformed signal
len_fts = len(freq_signal)
# Adjust the signal for even and odd cases
if len_signal % 2:
    freq_signal[1:len_fts] *= 2
else:
    freq_signal[1:len_fts-1] *= 2
Extract the signal power in dB:
# Extract the power value in dB
signal_power = 10 * np.log10(freq_signal)
Build the X axis, which is frequency measured in kHz in this case:
# Build the X axis
x_axis = np.arange(0, len_half, 1) * (sampling_freq / len_signal) / 1000.0
Plot the figure:
# Plot the figure
plt.figure()
plt.plot(x_axis, signal_power, color='black')
plt.xlabel('Frequency (kHz)')
plt.ylabel('Signal power (dB)')
plt.show()
Generating audio signals
Now that we know how audio signals work, let’s see how we can generate one such signal. We can use the NumPy package to generate various audio signals. Since audio signals are mixtures of sinusoids, we can use this to generate an audio signal with some predefined parameters.
Create a new Python file and import the following packages:
import numpy as np
import matplotlib.pyplot as plt
from scipy.io.wavfile import write
Define the output audio file’s name:
# Output file where the audio will be saved
output_file = 'generated_audio.wav'
Specify the audio parameters, such as duration, sampling frequency, tone frequency,
minimum value, and maximum value:
# Specify audio parameters
duration = 4 # in seconds
sampling_freq = 44100 # in Hz
tone_freq = 784
min_val = -4 * np.pi
max_val = 4 * np.pi
Generate the audio signal using the defined parameters:
# Generate the audio signal
t = np.linspace(min_val, max_val, duration * sampling_freq)
signal = np.sin(2 * np.pi * tone_freq * t)
Add some noise to the signal:
# Add some noise to the signal
noise = 0.5 * np.random.rand(duration * sampling_freq)
signal += noise
Normalize and scale the signal:
# Scale it to 16-bit integer values
scaling_factor = np.power(2, 15) - 1
signal_normalized = signal / np.max(np.abs(signal))
signal_scaled = np.int16(signal_normalized * scaling_factor)
Save the generated audio signal in the output file:
# Save the audio signal in the output file
write(output_file, sampling_freq, signal_scaled)
Extract the first 200 values for plotting:
# Extract the first 200 values from the audio signal
signal = signal[:200]
Construct the time axis in milliseconds:
# Construct the time axis in milliseconds
time_axis = 1000 * np.arange(0, len(signal), 1) / float(sampling_freq)
Plot the audio signal:
# Plot the audio signal
plt.plot(time_axis, signal, color='black')
plt.xlabel('Time (milliseconds)')
plt.ylabel('Amplitude')
plt.title('Generated audio signal')
plt.show()
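As an optional sanity check, we can read the file we just wrote back in and confirm that its sampling frequency and duration match what we generated (this assumes the script above has already produced generated_audio.wav):
# Read the generated file back and verify its parameters
from scipy.io import wavfile
sampling_freq_check, signal_check = wavfile.read(output_file)
print('Sampling frequency:', sampling_freq_check, 'Hz')
print('Duration:', round(signal_check.shape[0] / float(sampling_freq_check), 2), 'seconds')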
Extracting speech features
We learned how to convert a time domain signal into the frequency domain.
Frequency domain features are used extensively in all speech recognition systems.
The concept we discussed earlier is an introduction to the idea, but real-world
frequency domain features are a bit more complex. Once we convert a signal into the
frequency domain, we need to ensure that it's usable in the form of a feature vector. This is where the concept of Mel Frequency Cepstral Coefficients (MFCCs) becomes relevant. MFCC is a tool that's used to extract frequency domain features from a given audio signal.
In order to extract the frequency features from an audio signal, MFCC first extracts the power spectrum. It then uses filter banks and a Discrete Cosine Transform (DCT) to extract the features. If you are interested in exploring MFCCs further, check out this link: http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs
We will be using a package called python_speech_features to extract the MFCC features. The package is available here: http://python-speech-features.readthedocs.org/en/latest
Create a new Python file and import the following packages:
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from python_speech_features import mfcc, logfbank
Read the input audio file and extract the first 10,000 samples for analysis:
# Read the input audio file
sampling_freq, signal = wavfile.read('random_sound.wav')
# Take the first 10,000 samples for analysis
signal = signal[:10000]
Extract the MFCC:
# Extract the MFCC features
features_mfcc = mfcc(signal, sampling_freq)
Print the MFCC parameters:
# Print the parameters for MFCC
print('\nMFCC:\nNumber of windows =', features_mfcc.shape[0])
print('Length of each feature =', features_mfcc.shape[1])
Plot the MFCC features:
# Plot the features
features_mfcc = features_mfcc.T
plt.matshow(features_mfcc)
plt.title('MFCC')
Extract the filter bank features:
# Extract the Filter Bank features
features_fb = logfbank(signal, sampling_freq)
Print the parameters for the filter bank:
# Print the parameters for Filter Bank
print('\nFilter bank:\nNumber of windows =', features_fb.shape[0])
print('Length of each feature =', features_fb.shape[1])
Plot the features:
# Plot the features
features_fb = features_fb.T
plt.matshow(features_fb)
plt.title('Filter bank')
plt.show()
Recognizing spoken words
Now that we have learned all the techniques to analyze speech signals, let’s go ahead and see how to recognize spoken words. Speech recognition systems take audio signals as input and recognize the words being spoken. Hidden Markov Models (HMMs) will be used for this task.
As we discussed earlier, HMMs are great at analyzing sequential
data. An audio signal is a time series signal, which is a manifestation of sequential data. The assumption is that the outputs are being generated by the system going through a series of hidden states. Our goal is to find out what these hidden states are so that we can identify the words in our signal. If you are interested in digging deeper, check out this link: https://web.stanford.edu/~jurafsky/slp3/A.pdf.
We will be using a package called hmmlearn to build our speech recognition system. You can learn more about it here: http://hmmlearn.readthedocs.org/en/latest.
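As a rough illustration of how the pieces fit together, here is a minimal sketch of a word recognizer built with hmmlearn and the MFCC features from the previous section. The folder layout (one sub-folder of training .wav clips per word under a data directory), the test file test_word.wav, and the hyperparameters (4 hidden states, diagonal covariances) are assumptions made for this sketch, not taken from the linked code. For simplicity, all training clips of a word are stacked into a single feature matrix; a more careful implementation would pass per-clip sequence lengths to fit.
import os
import glob
import numpy as np
from scipy.io import wavfile
from hmmlearn import hmm
from python_speech_features import mfcc

# Assumed layout: one sub-folder per word, e.g. data/apple/*.wav, data/banana/*.wav
input_folder = 'data'

# Train one HMM per word on the MFCC features of its training clips
hmm_models = {}
for word in os.listdir(input_folder):
    subfolder = os.path.join(input_folder, word)
    if not os.path.isdir(subfolder):
        continue
    features = np.array([])
    for filepath in glob.glob(os.path.join(subfolder, '*.wav')):
        sampling_freq, signal = wavfile.read(filepath)
        features_mfcc = mfcc(signal, sampling_freq)
        features = features_mfcc if features.size == 0 else np.vstack((features, features_mfcc))
    model = hmm.GaussianHMM(n_components=4, covariance_type='diag', n_iter=1000)
    model.fit(features)
    hmm_models[word] = model

# Classify a new recording by picking the model with the highest log likelihood
sampling_freq, signal = wavfile.read('test_word.wav')
features_mfcc = mfcc(signal, sampling_freq)
scores = {word: model.score(features_mfcc) for word, model in hmm_models.items()}
print('Predicted word:', max(scores, key=scores.get))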
The full code for this section is available here: https://colab.research.google.com/drive/1yvnhkuczfhD8RHVKpnUtklygLjqueuBB?usp=sharing
That's all, folks. I would love to hear your feedback!