
Comments (5)

xinjli commented on August 12, 2024

Hi,

Thanks for your comments!
The underlying recognizer was trained with CTC, which means the timestamps might not be very accurate, so any timestamp here is only an approximation.
Extracting them should not take much effort, though.
You can modify the following part of lm/decoder.py, where the decoding takes place:

        # find all emitting frames
        for i in range(len(logits)):

            logit = logits[i]
            logit[0] /= blank_factor

            arg_max = np.argmax(logit)

            # this is an emitting frame
            if arg_max != cur_max_arg and arg_max != 0:
                emit_frame_idx.append(i)
                cur_max_arg = arg_max

Basically, the iterator i is the frame index, which serves as the timestamp indicator; each frame has a duration of 200 ms and shifts by 100 ms.

For each recognized phone there is usually a corresponding emitting frame. You can find its frame index and compute the timestamp using the 100 ms shift.
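Assuming the fixed per-frame shift described above, the collected frame indices could be converted to approximate start times like this (a sketch, not part of the library; `frames_to_timestamps` is a hypothetical helper name):

```python
# Convert emitting-frame indices to approximate timestamps, assuming a
# fixed shift per frame (100 ms here, per the explanation above).
FRAME_SHIFT_S = 0.1  # assumed shift in seconds

def frames_to_timestamps(emit_frame_idx, frame_shift=FRAME_SHIFT_S):
    """Map each emitting frame index to an approximate time in seconds."""
    return [round(i * frame_shift, 3) for i in emit_frame_idx]

print(frames_to_timestamps([3, 10, 42]))  # → [0.3, 1.0, 4.2]
```

Each recognized phone's approximate onset is then the timestamp of its emitting frame.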

from allosaurus.

eremingt commented on August 12, 2024

Thanks! Will try this.


eremingt commented on August 12, 2024

Started looking into this - returning emit_frame_idx through lm/decoder.compute() and then app.recognize(). However, I'm getting much higher indexes than I'd expect for a 100 ms shift per frame (they yield times much longer than the length of the sample).

Is it possible that the frame duration is actually 75 ms with a shift of 30 ms? Looking at pm/feature.mfcc(), the default winstep is 0.01 and winlen is 0.025. Then in pm/mfcc.compute(), there is this block:

    # subsampling and windowing
    if self.feature_window == 3:
        feat = feature_window(feat)

Does pm/utils.feature_window() concatenate frames in groups of three?


xinjli commented on August 12, 2024

Hi,

Yeah, sorry, you are correct.
The pretrained model concatenates 3 frames into 1, so each observed frame actually has a 66 ms duration and a 33 ms shift.
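One way such 3-into-1 frame concatenation could be sketched is below. This is an assumption about the general technique, not the actual pm/utils.feature_window code:

```python
import numpy as np

def stack_frames(feat, window=3):
    """Concatenate every `window` consecutive feature frames into one wider
    frame (a sketch of 3-to-1 subsampling; not the real feature_window code)."""
    n, d = feat.shape
    n_out = n // window                        # drop any trailing remainder
    return feat[:n_out * window].reshape(n_out, d * window)

feat = np.arange(12, dtype=float).reshape(6, 2)  # 6 frames, 2 dims each
print(stack_frames(feat).shape)  # → (2, 6): 3x fewer frames, 3x wider
```

Under this kind of subsampling, one output frame spans three input windows, which is why the effective duration and shift are roughly tripled.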


eremingt commented on August 12, 2024

I decided to return (approximate) relative position, rather than index, which I assume will be robust to any change in step size parameters.

    # find all emitting frames
    for i in range(len(logits)):

        logit = logits[i]
        logit[0] /= blank_factor

        arg_max = np.argmax(logit)

        # this is an emitting frame
        if arg_max != cur_max_arg and arg_max != 0:
            emit_frame_idx.append(i)
            cur_max_arg = arg_max

    # Position of emitted frame in recording (don't need to know step size)
    emit_frame_position = [idx/len(logits) for idx in emit_frame_idx]
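On the caller's side, these relative positions can then be scaled by the clip length to recover absolute times (a sketch; `positions_to_times` and `audio_duration_s` are hypothetical names, not library API):

```python
# Scale relative positions (0..1) to absolute times, given the clip length.
def positions_to_times(emit_frame_position, audio_duration_s):
    """Convert fractional positions within the recording to seconds."""
    return [round(p * audio_duration_s, 3) for p in emit_frame_position]

print(positions_to_times([0.25, 0.5, 0.75], audio_duration_s=4.0))  # → [1.0, 2.0, 3.0]
```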

Thanks again for your help!

