
Comments (3)

astorfi commented on May 30, 2024

Could you tell me why you are using the signal[:,0] format?
The error does not seem to come from the package; it is an array-indexing mismatch in Python. In this repository's test example, signal[:,0] is used because the example WAV files have two channels, so the loaded signal is a 2-D array and signal[:,0] selects the first channel. Yours may have only a single channel, in which case the signal is already 1-D. Try removing signal = signal[:,0] and run the code again.
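
A minimal sketch of a guard that handles both cases, assuming the signal is a NumPy array as returned by a loader such as scipy.io.wavfile.read (1-D for single-channel files, 2-D with shape (samples, channels) otherwise). The helper name to_mono is hypothetical and not part of speechpy:

```python
import numpy as np

def to_mono(signal):
    """Return a 1-D signal: the first channel of a 2-D array, or the input unchanged.

    Hypothetical helper for illustration; not part of speechpy.
    """
    signal = np.asarray(signal)
    if signal.ndim == 2:
        # Multi-channel audio, shape (samples, channels): keep the first channel.
        return signal[:, 0]
    # Already single-channel; indexing with [:, 0] here would raise an IndexError.
    return signal

# Synthetic stand-ins for one second of audio loaded from a WAV file.
mono = np.zeros(16000, dtype=np.int16)
stereo = np.zeros((16000, 2), dtype=np.int16)

print(to_mono(mono).shape)
print(to_mono(stereo).shape)
```

Both calls yield a 1-D array of 16000 samples, so the same downstream code works for either kind of file.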

taewookim commented on May 30, 2024

Hey @astorfi. Ah, you were right. I was blindly copy-pasting the tutorial code and completely forgot to check the shape. Thank you.

Apologies in advance, but I'm new to audio processing. I'm trying to do this in syncnet, where the audio inputs are MFCC features and the first layer of the audio model looks like this (according to model.summary()):

conv1_audio (Conv2D) (None, 13, 20, 64) 640

According to the owner of syncnet repo,

I have used a library called speechpy to extract MFCC features. The function to extract MFCC features from a .wav file according to the instructions using speechpy is:

speechpy.feature.mfcc(signal, sampling_frequency, frame_length=0.010, frame_stride=0.010, num_cepstral=13)
Audio features are computed over a duration of audio. In the paper, it is mentioned that features are computed at 100 Hz => for every 0.010 seconds. Hence, frame_length=0.010, frame_stride=0.010 (no overlap).

According to the paper, audio features and video features are extracted for every 0.2 seconds.
Lip: 0.2 seconds => 0.2 * 25fps = 5 video frames
Audio: 0.2 seconds => 0.2 / 0.01(frame duration) = 20 audio frames

Hence, a 112x112x5 matrix is input to the lips model, and a 13x20 matrix is input to the audio model.

Can you help me understand how to shape the speechpy.feature.mfcc return value into a 13x20 matrix? Or is this a simple .reshape()? (That was my first thought, but I suspect it is wrong, especially since I'm completely new to audio processing, even with all the tutorials I've read.)

PS: the original syncnet paper:

The input audio data is MFCC values. This is a representation of the short-term power spectrum of a sound on a non-linear mel scale of frequency. 13 mel frequency bands are used at each time step. The features are computed at a sampling rate of 100Hz, giving 20 time steps for a 0.2-second input signal.

astorfi commented on May 30, 2024

The format of the package output is described in the official documentation. I think you should read more about MFCCs and speech features in general. A good tutorial is as follows:
Mel Frequency Cepstral Coefficients (mfccs)
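
On the 13x20 question above, a minimal sketch, assuming speechpy.feature.mfcc returns an array of shape (num_frames, num_cepstral) as its documentation describes, and assuming the audio is cut into non-overlapping 0.2-second windows (syncnet itself may slide the window differently). The helper name mfcc_to_windows is hypothetical, and a random array stands in for real MFCC output so the example runs without any audio file:

```python
import numpy as np

def mfcc_to_windows(mfcc, frames_per_window=20):
    """Split (num_frames, num_cepstral) features into
    (n_windows, num_cepstral, frames_per_window) blocks.

    Hypothetical helper for illustration; not part of speechpy or syncnet.
    """
    mfcc = np.asarray(mfcc)
    n_windows = mfcc.shape[0] // frames_per_window
    # Drop trailing frames that do not fill a whole window.
    trimmed = mfcc[: n_windows * frames_per_window]
    # Group consecutive frames: (n_windows, frames_per_window, num_cepstral).
    windows = trimmed.reshape(n_windows, frames_per_window, mfcc.shape[1])
    # Swap the last two axes so each window is coefficients x time (13 x 20).
    return windows.transpose(0, 2, 1)

# Stand-in for speechpy.feature.mfcc(...) output: 105 frames, 13 coefficients.
features = np.random.rand(105, 13)

windows = mfcc_to_windows(features)
print(windows.shape)

# Assumption: the Conv2D layer expects a trailing channel axis of size 1.
batch = windows[..., np.newaxis]
print(batch.shape)
```

Here the 105 frames give 5 windows of shape 13x20, and column t of a window holds the 13 coefficients of frame t. Calling .reshape(13, 20) directly on a 20x13 block would keep the same numbers but mix coefficients from different frames into each row, which is why the sketch transposes instead.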
