In this competition, we were challenged to use the Speech Commands Dataset to build an algorithm that understands simple spoken commands.By improving the recognition accuracy of open-sourced voice interface tools, we can improve product effectiveness and their accessibility.
The training set provided contains a few informational files and a folder of audio files. The audio folder contains subfolders with 1 second clips of voice commands, with the folder name being the label of the audio clip. There are more labels that should be predicted. The labels that we need to predict in Test are yes, no, up, down, left, right, on, off, stop, go. Everything else should be considered either unknown or silence. The folder background_noise contains longer clips of "silence" that we can break up and use as training input.
The test set contains an audio folder with 150,000+ files in the format clip_000044442.wav. The task is to predict the correct label.
Exploratory data analysis showed that spectrogram representations of audio files could serve as better input to a deep learning network
compared to raw audio inputs or fourier representations of the audio files. Further exploration showed that Mel power spectrograms
, which better mimic the way the human ear perceives sound, could potentially be better than log power spectrograms. MFCC's which further abstract
away information from the mel spectrogram can also be potentially used.
I generated log spectrograms, mel spectrograms and mfccs and stored them in the form of numpy arrays. I had previously tried to convert log spectrograms
into color images and store them to be used as input. But I found storing them as numpy arrays to be more memory efficient and it yielded
the same results.
So ultimately I had three numpy arrays 1) Log Spec 2) Mel Spec 3) MFCC corresponding to different representations of the input
def log_specgram(audio, sample_rate, window_size=20,
step_size=10, eps=1e-10):
nperseg = int(round(window_size * sample_rate / 1e3))
noverlap = int(round(step_size * sample_rate / 1e3))
freqs, times, spec = signal.spectrogram(audio,
fs=sample_rate,
window='hann',
nperseg=nperseg,
noverlap=noverlap,
detrend=False)
return np.log(spec.T.astype(np.float32) + eps)
def mel_specgram(samples, sample_rate):
S = librosa.feature.melspectrogram(samples, sr=sample_rate, n_mels=128)
# Convert to log scale (dB). We'll use the peak power (max) as reference.
log_S = librosa.power_to_db(S, ref=np.max)
return log_S
def mfcc(samples,sample_rate):
S_spec = librosa.feature.melspectrogram(samples, sr=sample_rate, n_mels=128)
log_S= librosa.power_to_db(S_spec, ref=np.max)
mfcc = librosa.feature.mfcc(S=log_S, n_mfcc=13)
return mfcc
Mel Spectrograms of Randomly Selected audiofiles labeled 'yes'
I started out with a very simple CNN architecture and used log spectrograms as input.
input_shape = (99, 81, 1)
nclass = 12
inp = Input(shape=input_shape)
norm_inp = BatchNormalization()(inp)
img_1 = Convolution2D(8, kernel_size=2, activation=activations.relu)(norm_inp)
img_1 = Convolution2D(8, kernel_size=2, activation=activations.relu)(img_1)
img_1 = MaxPooling2D(pool_size=(2, 2))(img_1)
img_1 = Dropout(rate=0.2)(img_1)
img_1 = Convolution2D(16, kernel_size=3, activation=activations.relu)(img_1)
img_1 = Convolution2D(16, kernel_size=3, activation=activations.relu)(img_1)
img_1 = MaxPooling2D(pool_size=(2, 2))(img_1)
img_1 = Dropout(rate=0.2)(img_1)
img_1 = Convolution2D(32, kernel_size=3, activation=activations.relu)(img_1)
img_1 = MaxPooling2D(pool_size=(2, 2))(img_1)
img_1 = Dropout(rate=0.2)(img_1)
img_1 = Flatten()(img_1)
dense_1 = BatchNormalization()(Dense(128, activation=activations.relu)(img_1))
dense_1 = BatchNormalization()(Dense(128, activation=activations.relu)(dense_1))
dense_1 = Dense(nclass, activation=activations.softmax)(dense_1)
model = models.Model(inputs=inp, outputs=dense_1)
opt = optimizers.Adam()
callbacks = [EarlyStopping(monitor='val_acc', patience=4, verbose=1, mode='max')]
model.compile(optimizer=opt, loss=losses.binary_crossentropy)
model.summary()
#x_train, x_valid, y_train, y_valid = train_test_split(x_train, y_train, test_size=0.1, random_state=2017)
history=model.fit_generator(generator=generator(train_file_map,batch_size=250, mode='train'),steps_per_epoch=int(np.ceil(len(sample_map_df_train)/250)), validation_data=generator(train_file_map,batch_size=250, mode='val'), validation_steps=int(np.ceil(len(sample_map_df_val)/250)), epochs=3, shuffle=True, callbacks=callbacks)
model.save(os.path.join(model_path, 'light_cnn.model'))
During training, I achieved high accuracies on the validation set using this light cnn model but this accuracy did not translate to good performance on the hold out test set at kaggle. When I read the forums to understand why this may be, it was suggested that the test set may be from a different distribution especially when it came to unknown uknowns. For instance, here is the confusion matrix on the validation set which I got during training. While accuracies on training and validation sets were close to 96%, the accuracy I achieved on the test set was only about 74%.
Confusion Matrix on Validation Set: Light CNN
Since this simple CNN architecture was not generalizing well on the test set, I decided to use a Resnet style architecture. Resnets use residual blocks and the presence of residual blocks can help build deeper networks.
class ResNet():
"""
Usage:
sr = ResNet([4,8,16], input_size=(50,50,1), output_size=12)
sr.build()
followed by sr.m.compile(loss='categorical_crossentropy',
optimizer='adadelta', metrics=["accuracy"])
save plotted model with:
keras.utils.plot_model(sr.m, to_file = '<location>.png',
show_shapes=True)
"""
def __init__(self,
filters_list=[],
input_size=None,
output_size=None,
initializer='glorot_uniform'):
self.filters_list = filters_list
self.input_size = input_size
self.output_size = output_size
self.initializer = initializer
self.m = None
def _block(self, filters, inp):
""" one residual block in a ResNet
Args:
filters (int): number of convolutional filters
inp (tf.tensor): output from previous layer
Returns:
tf.tensor: output of residual block
"""
layer_1 = BatchNormalization()(inp)
act_1 = Activation('relu')(layer_1)
conv_1 = Conv2D(filters, (3,3),
padding = 'same',
kernel_initializer = self.initializer)(act_1)
layer_2 = BatchNormalization()(conv_1)
act_2 = Activation('relu')(layer_2)
conv_2 = Conv2D(filters, (3,3),
padding = 'same',
kernel_initializer = self.initializer)(act_2)
return(conv_2)
def build(self):
"""
Returns:
keras.engine.training.Model
"""
i = Input(shape = self.input_size, name = 'input')
x = Conv2D(self.filters_list[0], (3,3),
padding = 'same',
kernel_initializer = self.initializer)(i)
x = MaxPooling2D(padding = 'same')(x)
x = Add()([self._block(self.filters_list[0], x),x])
x = Add()([self._block(self.filters_list[0], x),x])
x = Add()([self._block(self.filters_list[0], x),x])
if len(self.filters_list) > 1:
for filt in self.filters_list[1:]:
x = Conv2D(filt, (3,3),
strides = (2,2),
padding = 'same',
activation = 'relu',
kernel_initializer = self.initializer)(x)
x = Add()([self._block(filt, x),x])
x = Add()([self._block(filt, x),x])
x = Add()([self._block(filt, x),x])
x = GlobalAveragePooling2D()(x)
x = Dense(self.output_size, activation = 'softmax')(x)
self.m = Model(i,x)
return self.m
With a Resnet architecture, I was able to generalize much better on the test set. In order to improve generalization, I used an ensemble of three models and took a majority vote. This resulted in a score of 0.87 on the public leaderboard.