
Comments (5)

xinjli commented on August 12, 2024

Hi,

Thanks for your comments!
The underlying recognizer was trained with CTC, which means the timestamps might not be very accurate, so any timestamp here is only an approximation.
Extracting them should not take much effort, though.
You can modify the following part of lm/decoder.py, where the decoding takes place:

        # find all emitting frames
        for i in range(len(logits)):

            logit = logits[i]
            logit[0] /= blank_factor

            arg_max = np.argmax(logit)

            # this is an emitting frame
            if arg_max != cur_max_arg and arg_max != 0:
                emit_frame_idx.append(i)
                cur_max_arg = arg_max

Basically, the iterator i is the frame index, which serves as the timestamp indicator; each frame has a duration of 200 ms and shifts by 100 ms.

For each recognized phone there is usually a corresponding emitting frame. You can find its frame index and compute the timestamp using the 100 ms shift.
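Assuming the fixed per-frame shift described above, the collected frame indices could be converted to approximate start times like this (a sketch, not part of the library; `frames_to_timestamps` is a hypothetical helper name):

```python
# Convert emitting-frame indices to approximate timestamps, assuming a
# fixed shift per frame (100 ms here, per the explanation above).
FRAME_SHIFT_S = 0.1  # assumed shift in seconds

def frames_to_timestamps(emit_frame_idx, frame_shift=FRAME_SHIFT_S):
    """Map each emitting frame index to an approximate time in seconds."""
    return [round(i * frame_shift, 3) for i in emit_frame_idx]

print(frames_to_timestamps([3, 10, 42]))  # → [0.3, 1.0, 4.2]
```

Each recognized phone's approximate onset is then the timestamp of its emitting frame.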

from allosaurus.

eremingt commented on August 12, 2024

Thanks! Will try this.


eremingt commented on August 12, 2024

Started looking into this - returning emit_frame_idx through lm/decoder.compute() and then app.recognize(). However, I'm getting much higher indexes than I'd expect for a 100 ms shift per frame (they yield times much longer than the length of the sample).

Is it possible that the frame duration is actually 75 ms with a shift of 30 ms? Looking at pm/feature.mfcc(), the default winstep is 0.01 and winlen is 0.025. Then in pm/mfcc.compute(), there is this block:

    # subsampling and windowing
    if self.feature_window == 3:
        feat = feature_window(feat)

Does pm/utils.feature_window() concatenate frames in groups of three?


xinjli commented on August 12, 2024

Hi,

Yeah, sorry, you are correct.
The pretrained model concatenates 3 frames into 1, so each observed frame actually has a 66 ms duration and a 33 ms shift.
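One way such 3-into-1 frame concatenation could be sketched is below. This is an assumption about the general technique, not the actual pm/utils.feature_window code:

```python
import numpy as np

def stack_frames(feat, window=3):
    """Concatenate every `window` consecutive feature frames into one wider
    frame (a sketch of 3-to-1 subsampling; not the real feature_window code)."""
    n, d = feat.shape
    n_out = n // window                        # drop any trailing remainder
    return feat[:n_out * window].reshape(n_out, d * window)

feat = np.arange(12, dtype=float).reshape(6, 2)  # 6 frames, 2 dims each
print(stack_frames(feat).shape)  # → (2, 6): 3x fewer frames, 3x wider
```

Under this kind of subsampling, one output frame spans three input windows, which is why the effective duration and shift are roughly tripled.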


eremingt commented on August 12, 2024

I decided to return (approximate) relative position, rather than index, which I assume will be robust to any change in step size parameters.

    # find all emitting frames
    for i in range(len(logits)):

        logit = logits[i]
        logit[0] /= blank_factor

        arg_max = np.argmax(logit)

        # this is an emitting frame
        if arg_max != cur_max_arg and arg_max != 0:
            emit_frame_idx.append(i)
            cur_max_arg = arg_max

    # Position of emitted frame in recording (don't need to know step size)
    emit_frame_position = [idx/len(logits) for idx in emit_frame_idx]
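On the caller's side, these relative positions can then be scaled by the clip length to recover absolute times (a sketch; `positions_to_times` and `audio_duration_s` are hypothetical names, not library API):

```python
# Scale relative positions (0..1) to absolute times, given the clip length.
def positions_to_times(emit_frame_position, audio_duration_s):
    """Convert fractional positions within the recording to seconds."""
    return [round(p * audio_duration_s, 3) for p in emit_frame_position]

print(positions_to_times([0.25, 0.5, 0.75], audio_duration_s=4.0))  # → [1.0, 2.0, 3.0]
```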

Thanks again for your help!

