I'm trying to extract MFCC features from the audio of a video file. I tri


Correct wav format? about speechpy HOT 3 CLOSED

astorfi commented on May 30, 2024

Correct wav format?

from speechpy.

Comments (3)

astorfi commented on May 30, 2024

Would you please tell me why you are using signal[:,0]?
The error does not seem to come from the package. It's a shape mismatch on the Python side. In the test example of this repository, signal[:,0] is used because the example WAV files are dual-channel (stereo). Yours may only have a single channel. Try removing signal = signal[:,0] and running the code again.
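A minimal sketch of the shape check described above, assuming the signal is a NumPy array such as the one returned by scipy.io.wavfile.read (1-D for mono, 2-D with shape (samples, channels) for multi-channel audio). The helper name first_channel is hypothetical and not part of speechpy:

```python
import numpy as np

def first_channel(signal):
    """Return a 1-D signal: mono audio unchanged, or channel 0 of multi-channel audio."""
    signal = np.asarray(signal)
    if signal.ndim == 1:
        # Mono file: indexing with [:, 0] here would raise an IndexError.
        return signal
    if signal.ndim == 2:
        # Multi-channel file with shape (samples, channels): keep the first channel.
        return signal[:, 0]
    raise ValueError(f"unexpected signal shape: {signal.shape}")

# Synthetic stand-ins for a mono and a stereo recording (1 second at 16 kHz).
mono = np.zeros(16000, dtype=np.int16)
stereo = np.zeros((16000, 2), dtype=np.int16)

print(first_channel(mono).shape)
print(first_channel(stereo).shape)
```

Checking signal.ndim before indexing lets the same code handle both the repository's example files and single-channel recordings.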


taewookim commented on May 30, 2024

Hey @astorfi. Ah, you were right. I was blindly copy-pasting the tutorial code and completely forgot to check the shape. Thank you.

Apologies in advance, but I'm new to audio processing. I'm trying to do this in syncnet, where the audio inputs are MFCC features and the audio model takes this input (according to model.summary()):

conv1_audio (Conv2D) (None, 13, 20, 64) 640

According to the owner of syncnet repo,

I have used a library called speechpy to extract MFCC features. The function to extract MFCC features from a .wav file according to the instructions using speechpy is:

speechpy.feature.mfcc(signal, sampling_frequency, frame_length=0.010, frame_stride=0.010, num_cepstral=13)
Audio features are computed over a duration of audio. In the paper, it is mentioned that features are computed at 100 Hz => for every 0.010 seconds. Hence, frame_length=0.010, frame_stride=0.010 (no overlap).

According to the paper, audio features and video features are extracted for every 0.2 seconds.
Lip: 0.2 seconds => 0.2 * 25fps = 5 video frames
Audio: 0.2 seconds => 0.2 / 0.01(frame duration) = 20 audio frames

Hence, a 112x112x5 matrix is input to the lips model, and a 13x20 matrix is input to the audio model.

Can you help me understand how to shape the speechpy.feature.mfcc return value into a 13x20 matrix? Or is this a simple .reshape()? (I was thinking that originally, but it is probably wrong, especially since I'm completely blind in the world of audio processing, even with all the tutorials I've read.)

PS: the original syncnet paper:

The input audio data is MFCC values. This is a representation of the short-term power spectrum of a sound on a non-linear mel scale of frequency. 13 mel frequency bands are used at each time step. The features are computed at a sampling rate of 100Hz, giving 20 time steps for a 0.2-second input signal.


astorfi commented on May 30, 2024

The package output is available in the official documentation. I think you should read more about the MFCC or speech features in general. A good tutorial is as follows:
Mel Frequency Cepstral Coefficients (mfccs)
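As a rough illustration of the 13x20 question raised earlier in the thread: the speechpy documentation describes the mfcc output as an array of shape (num_frames, num_cepstral), so with a 0.010 s stride, 0.2 seconds corresponds to 20 consecutive rows of 13 coefficients. The sketch below uses a placeholder array instead of real MFCC output and assumes non-overlapping 0.2-second windows; the helper name mfcc_to_windows, the window layout, and the trailing channel axis are assumptions, not something confirmed by the syncnet repository. Note that each window is transposed, because a plain .reshape() of a (20, 13) block to (13, 20) would scramble time steps and coefficients.

```python
import numpy as np

def mfcc_to_windows(features, frames_per_window=20):
    """Split (num_frames, num_cepstral) features into (num_windows, num_cepstral, frames_per_window)."""
    features = np.asarray(features)
    num_frames, num_cepstral = features.shape
    # Drop trailing frames that do not fill a complete window.
    num_windows = num_frames // frames_per_window
    trimmed = features[: num_windows * frames_per_window]
    # Group consecutive frames, then swap axes so coefficients come first.
    windows = trimmed.reshape(num_windows, frames_per_window, num_cepstral)
    return windows.transpose(0, 2, 1)

# Placeholder for MFCC output: 105 frames (1.05 s at 100 Hz) x 13 coefficients.
features = np.arange(105 * 13, dtype=float).reshape(105, 13)

windows = mfcc_to_windows(features)
print(windows.shape)

# Assumption: the Conv2D audio model expects a single trailing channel axis.
batch = windows[..., np.newaxis]
print(batch.shape)
```

Whether the windows should overlap, and by how much, depends on how the model was trained, so that choice is worth checking against the syncnet instructions.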
