Comments (5)
Hi,
Thanks for your comments!
The underlying recognizer was trained with CTC, which means the timestamps might not be very accurate, so any timestamp here
is just an approximation.
But it should not take much effort to do this.
You can modify the following part in lm/decoder.py, where the decoding takes place.
    # find all emitting frames
    for i in range(len(logits)):
        logit = logits[i]
        logit[0] /= blank_factor
        arg_max = np.argmax(logit)
        # this is an emitting frame
        if arg_max != cur_max_arg and arg_max != 0:
            emit_frame_idx.append(i)
            cur_max_arg = arg_max
Basically, the iterator i is the frame index, which indicates the timestamp. Each frame has a duration of 200ms and shifts by 100ms per frame.
For each recognized phone, there is usually one emitting frame corresponding to it. You can find its frame index and compute the timestamp using the 100ms shift.
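A minimal sketch of that computation, assuming emit_frame_idx has been collected as in the loop above. The helper name frames_to_seconds and the sample indices are hypothetical, and the frame shift is passed in as a parameter because the right value depends on the feature settings:

```python
def frames_to_seconds(emit_frame_idx, frame_shift_s):
    """Convert emitting-frame indices to approximate start times in seconds.

    frame_shift_s is the time between the starts of consecutive frames.
    """
    return [round(idx * frame_shift_s, 3) for idx in emit_frame_idx]


# Hypothetical emitting frames, using the 100ms shift mentioned above.
timestamps = frames_to_seconds([2, 5, 9], frame_shift_s=0.1)
print(timestamps)
```

Since CTC peaks do not necessarily align with phone onsets, the resulting times are only approximate.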
from allosaurus.
Thanks! Will try this.
Started looking into this - returning emit_frame_idx through lm/decoder.compute() and then app.recognize(). However I'm getting much higher indexes than I'd expect for 100 ms shift per frame (ones that give times much longer than the length of the sample).
Is it possible that the frame duration is actually 75 ms with a shift of 30 ms? Looking at pm/feature.mfcc(), the default winstep is 0.01 and winlen is 0.025. Then in pm/mfcc.compute(), there is this block:
    # subsampling and windowing
    if self.feature_window == 3:
        feat = feature_window(feat)
Does pm/utils.feature_window() concatenate frames in groups of three?
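To make the question concrete, here is a rough sketch of what grouping frames in threes would do to the frame count and the effective shift. This is an assumption for illustration only, not the actual pm/utils.feature_window() code; stack_frames and the toy feature values are hypothetical:

```python
def stack_frames(frames, group=3):
    """Concatenate consecutive frames in non-overlapping groups.

    frames is a list of per-frame feature vectors (lists of floats).
    An incomplete group at the end is dropped.
    """
    stacked = []
    for start in range(0, len(frames) - group + 1, group):
        merged = []
        for frame in frames[start:start + group]:
            merged.extend(frame)
        stacked.append(merged)
    return stacked


winstep = 0.01  # default winstep seen in pm/feature.mfcc()
frames = [[float(i), float(i)] for i in range(10)]  # 10 toy frames, 2 dims each
stacked = stack_frames(frames, group=3)

# Each output frame starts 3 input frames after the previous one.
effective_shift = winstep * 3
print(len(stacked), len(stacked[0]), effective_shift)
```

Under this assumption, the number of frames seen by the decoder drops to about a third and the shift between them becomes roughly 30 ms.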
Hi,
Yeah, sorry, you are correct.
The pretrained model concatenates 3 frames into 1 frame, so each observed frame actually has a 66ms duration and a 33ms shift.
I decided to return the (approximate) relative position rather than the index, which I assume will be robust to any change in the step size parameters.
    # find all emitting frames
    for i in range(len(logits)):
        logit = logits[i]
        logit[0] /= blank_factor
        arg_max = np.argmax(logit)
        # this is an emitting frame
        if arg_max != cur_max_arg and arg_max != 0:
            emit_frame_idx.append(i)
            cur_max_arg = arg_max
    # Position of emitted frame in recording (don't need to know step size)
    emit_frame_position = [idx/len(logits) for idx in emit_frame_idx]
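If the length of the recording is known, those relative positions can be mapped back to seconds. A small sketch, assuming a hypothetical helper positions_to_seconds and made-up example values (the duration would come from the audio file, e.g. number of samples divided by sample rate):

```python
def positions_to_seconds(emit_frame_position, duration_s):
    """Scale relative positions in [0, 1) by the recording length in seconds."""
    return [pos * duration_s for pos in emit_frame_position]


# Hypothetical example: 60 frames in total, 2.0 s of audio.
num_frames = 60
emit_frame_idx = [3, 12, 30]
emit_frame_position = [idx / num_frames for idx in emit_frame_idx]

times = positions_to_seconds(emit_frame_position, duration_s=2.0)
print(times)
```

This treats the frames as evenly spread over the whole recording, so it ignores any padding or dropped frames at the edges.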
Thanks again for your help!