mpariente / pystoi

Python implementation of the Short Term Objective Intelligibility measure

License: MIT License

Python 27.23% MATLAB 72.77%

pystoi's Introduction

Hey 👋, I'm Manuel Pariente

I'm the co-founder of Pulse Audition. We're bringing back intelligibility in noise to the people who need it! 👂 💡

⚒️ Python / Cython / Bash / Git / LaTeX / Markdown ⚒️
📦 Building 🔥 asteroid 🔥, asteroid-filterbanks, pystoi, torch_stoi, torch-audiomentations and more 📦
✏️ Linux & Bash & VSCode ✏️
🎯 Efficiency at work 🎯

pystoi's Issues

Float64 input

Given the example on the README, I assume pystoi is meant to work with float64 audio (the default format for soundfile.read), but just to be sure, is this indeed the case? Because int16 and int32 are also common encodings and they seem to work fine, but of course don't give the same score as their float64 counterpart. Thanks!
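For reference, a minimal sketch of scaling signed integer PCM to float64 before scoring, so that both encodings describe the same waveform on the same scale. Whether pystoi expects this scaling is not confirmed in this thread, and the helper name `pcm_to_float` is hypothetical.

```python
import numpy as np

def pcm_to_float(x):
    """Scale signed integer PCM to float64 in [-1, 1); pass other dtypes through as float64.

    Hypothetical helper, not part of pystoi.
    """
    x = np.asarray(x)
    if np.issubdtype(x.dtype, np.signedinteger):
        # Divide by the magnitude of the most negative representable value,
        # e.g. 32768 for int16.
        scale = float(-np.iinfo(x.dtype).min)
        return x.astype(np.float64) / scale
    return x.astype(np.float64)

samples_int16 = np.array([-32768, 0, 16384], dtype=np.int16)
print(pcm_to_float(samples_int16))  # [-1.   0.   0.5]
```

With soundfile, the same float64 data can be requested directly through the `dtype` argument of `soundfile.read`.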

The two signals do not match Error

Hello, I tried to use STOI to evaluate the speech signals that I get after passing the data through a room simulation, but the shapes are not the same for the clean (80000,), noisy (81360,) and enhanced (82959,) signals. Is there any way to use STOI in that case?
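One possible workaround, sketched under the assumption that the three signals start at the same instant and the extra samples are only a tail (for example a reverberation tail): trim everything to the shortest length. If the simulation also introduces a delay, the signals would need to be time-aligned first, which this sketch does not do. The helper name `trim_to_common_length` is hypothetical.

```python
import numpy as np

def trim_to_common_length(*signals):
    """Truncate 1-D signals to the length of the shortest one (hypothetical helper)."""
    n = min(len(s) for s in signals)
    return tuple(np.asarray(s)[:n] for s in signals)

clean = np.zeros(80000)
noisy = np.zeros(81360)
enhanced = np.zeros(82959)
clean, noisy, enhanced = trim_to_common_length(clean, noisy, enhanced)
print(clean.shape, noisy.shape, enhanced.shape)  # (80000,) (80000,) (80000,)
```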

Is resampling really required?

Hi,
The original paper (http://cas.et.tudelft.nl/pubs/Taal2010.pdf) mentions at the start of section 2 that the metric is supposed to be used on audio at a sampling rate of 10000 Hz. Is this really necessary? I get fairly similar results regardless of whether or not I resample my audio.

Thanks!
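For anyone who does want to resample to 10 kHz explicitly, a small sketch of the integer up/down factors involved. The function name `resample_factors` is hypothetical; the factors are the kind of arguments a polyphase resampler such as `scipy.signal.resample_poly(x, up, down)` takes. Whether pystoi already resamples internally is not settled in this thread.

```python
from math import gcd

def resample_factors(fs_in, fs_out=10000):
    """Return (up, down) so that fs_in * up / down == fs_out (hypothetical helper)."""
    g = gcd(fs_in, fs_out)
    return fs_out // g, fs_in // g

print(resample_factors(16000))  # (5, 8)
print(resample_factors(8000))   # (5, 4)
```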

Extended version

Hi,

Are you sure that the if clause at #L69 is correct? I would expect the extended version to be the one in the else clause, which uses the normalized segments.

Ability for batched tensor

Thank you for the code!
I have a question: can this code compute STOI for a batched tensor of size [B, num_of_samples], or even [B, num_speaker, num_of_samples]?
I checked and I think it cannot, right?
Could you extend the implementation to cover this scenario, please?
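Until something like this exists in the library, a loop over the leading dimensions is one way to get per-item scores. The sketch below is generic: `batched_metric` is a hypothetical helper that takes any function with the call shape `metric_fn(x, y, fs)` and applies it to each 1-D signal pair. It is not vectorized, so it is no faster than calling the metric in a Python loop.

```python
import numpy as np

def batched_metric(metric_fn, ref, est, fs):
    """Apply metric_fn(ref_1d, est_1d, fs) over all leading dimensions.

    ref and est must share the shape [..., num_samples].
    Returns an array of scores with shape ref.shape[:-1].
    """
    ref = np.asarray(ref)
    est = np.asarray(est)
    if ref.shape != est.shape:
        raise ValueError("ref and est must have the same shape")
    lead_shape = ref.shape[:-1]
    scores = np.empty(lead_shape, dtype=np.float64)
    for idx in np.ndindex(*lead_shape):
        scores[idx] = metric_fn(ref[idx], est[idx], fs)
    return scores

# Stand-in metric for illustration only (not STOI).
def mean_product(x, y, fs):
    return float(np.mean(x * y))

scores = batched_metric(mean_product, np.ones((2, 3, 8)), np.ones((2, 3, 8)), 10000)
print(scores.shape)  # (2, 3)
```

Passing `pystoi.stoi` as `metric_fn` should fit this call shape, given the `stoi(x, y, fs_sig, extended)` signature visible in the tracebacks on this page.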

Python 3

Thanks very much for this code. I found that for Python 3 I had to change the line

import utils

to

from . import utils

in pystoi.py.

remove_silent_frames

I think the way the signal is being reconstructed in remove_silent_frames is not quite right. The evidence for this is that if you start with a signal that has no silence, the output of remove_silent_frames should be the same as the input, but it's not.

The problem is that the window does not satisfy the COLA constraint so the overlap-add technique needs to be modified to compensate.

I know how to fix this but I wanted to get your opinion before preparing a PR. The issue is that the problem lies in the original STOI MATLAB code, not in pystoi. So if it is fixed, tests that compare to the output of MATLAB will fail.

The magnitude of the error is probably not large, so it's a tradeoff. Is it better to have correct silence removal or consistency with MATLAB output?
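The COLA point can be checked numerically. The sketch below assumes a frame length of 256 with a hop of 128 and a symmetric Hann window with its zero endpoints trimmed; the window pystoi actually uses is not shown in this issue, so treat these as illustrative assumptions. A periodic Hann window is included for comparison because it sums to a constant at 50% overlap.

```python
import numpy as np

def overlap_sum(w, hop):
    """Sum the hop-spaced shifts of w over one hop period (steady-state region)."""
    total = np.zeros(hop)
    for start in range(0, len(w), hop):
        total += w[start:start + hop]
    return total

framelen, hop = 256, 128
w_symmetric = np.hanning(framelen + 2)[1:-1]  # symmetric Hann, zero endpoints trimmed
w_periodic = np.hanning(framelen + 1)[:-1]    # periodic Hann

# Peak-to-peak variation of the summed windows; zero means COLA holds.
ripple_symmetric = np.ptp(overlap_sum(w_symmetric, hop))
ripple_periodic = np.ptp(overlap_sum(w_periodic, hop))
print(ripple_symmetric, ripple_periodic)
```

Under these assumptions the symmetric window shows a ripple of well under one percent, while the periodic one is flat up to rounding, which is consistent with the remark that the error is probably not large.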

STOI in pystoi==0.4.0 is inconsistent with the MATLAB implementation

For the audio examples in audio_speech.zip, the STOI computed by MATLAB is 0.6739, by pystoi==0.3.3 it is 0.6739177895331301, and by pystoi==0.4.0 it is 0.672038201650704. The difference between the MATLAB implementation and pystoi==0.4.0 is close to 0.002, which is larger than 1e-3.

import soundfile as sf
import pystoi

r, fs = sf.read('audio_speech.wav')
p, fs = sf.read("audio_speech_bab_0dB.wav")
v = pystoi.stoi(r, p, fs) 
print(v)  # 0.6739177895331301 for 0.3.3, 0.672038201650704 for 0.4.0, MATLAB gives 0.6739

audio_speech.zip
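The comparison against the 1e-3 tolerance can be written out with the numbers reported above. Note that the MATLAB value is only given to four decimals, so this is a check at that precision.

```python
import math

matlab = 0.6739  # value reported above, four decimals
scores = {"0.3.3": 0.6739177895331301, "0.4.0": 0.672038201650704}

# True when the pystoi score is within an absolute tolerance of 1e-3 of MATLAB.
within = {version: math.isclose(score, matlab, abs_tol=1e-3)
          for version, score in scores.items()}
print(within)  # {'0.3.3': True, '0.4.0': False}
```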

Is there any difference in Resample() between Matlab and Octave?

This code is really helpful for my study. Thank you for this awesome work! There is no issue I am going to raise. Just some questions about MATLAB and Octave.

The sample rate of my audio is 16000 Hz. According to the README file, the test will fail if I use Python to do the resampling. My question is: how does the resample() function compare between MATLAB and Octave? Are they equivalent? It would be much appreciated if someone could answer this.

remove_silent_frames() modifies input even if no frames are silent

So I think there are 3 issues that I spotted.

1. You don't divide by sqrt(N) like here:
https://github.com/mpariente/pystoi/blob/master/tests/matlab/stoi.m#L158
Though I am not sure why sqrt(N) is correct and not something window dependent like sqrt(sum(window)).
2. You don't divide by the window during overlap-add. Since you multiplied the input with the window here, to preserve energy you probably need to divide by w here.
3. The output dimensions are not preserved, i.e. the output is shorter than the input. I think this can be solved via padding at the start and cropping afterwards, similar to this:

orig_len = x.shape[1]
pad = framelen - orig_len % framelen
pad_front, mod = divmod(pad, 2)
pad_end = pad_front + mod
x_padded = np.pad(x, (pad_front, pad_end))

# ...

# Check if the first frame was kept
if mask[0]:  # mask is true if frame is *not* silent and we keep it
    # then there will be the 0-padding in the beginning of length pad_front
    x_sil = x_sil[pad_front:]
    y_sil = y_sil[pad_front:]
# Cut the trailing padding if the last frame was kept
if mask[-1]:
    x_sil = x_sil[:-pad_end]
    y_sil = y_sil[:-pad_end]

I think a test would be good that tests if the input is equal (or close) to the output given a loud enough signal.
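A sketch of what such a round-trip test could look like, independent of pystoi. It assumes a frame length of 256, a hop of 128 and a symmetric Hann window with trimmed endpoints; the overlap-add result is divided by the summed window so that samples covered by at least one frame come back unchanged. `frame_and_reconstruct` is a hypothetical helper, not the library's function.

```python
import numpy as np

def frame_and_reconstruct(x, framelen=256, hop=128):
    """Window, overlap-add, and normalize by the summed window (hypothetical helper).

    Returns the reconstruction and a boolean mask of samples covered by a frame.
    """
    w = np.hanning(framelen + 2)[1:-1]  # strictly positive symmetric Hann
    out = np.zeros(len(x))
    norm = np.zeros(len(x))
    for i in range(0, len(x) - framelen + 1, hop):
        out[i:i + framelen] += w * x[i:i + framelen]
        norm[i:i + framelen] += w
    covered = norm > 0
    out[covered] /= norm[covered]
    return out, covered

rng = np.random.default_rng(0)
x = rng.standard_normal(256 + 10 * 128)
x_rec, covered = frame_and_reconstruct(x)
print(np.allclose(x_rec[covered], x[covered]))  # True
```

Without the division by `norm`, the same comparison fails, which is the behaviour described in points 1 and 2 above.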

TensorFlow

I am working on a version of STOI in TensorFlow. I am posting this here just in case @mpariente or someone else is also working on that. We could avoid duplicating effort.

AxisError when signal contains silence

The stoi function produces an error if a reference signal only contains a short piece of speech. This seems to be caused by the removal of silent frames.

This is a minimal example using WSJ0-2mix data. Replace wsj0_2mix_root with the root of the WSJ0-2mix data. You might have to remove the suffix _2 if you have a newer version of the WSJ0-2mix database:

from pathlib import Path
from pystoi.stoi import stoi
import soundfile as sf

wsj0_2mix_root = Path('<path to WSJ0-2mix root dir>')

observation = sf.read(str(wsj0_2mix_root / 'data/2speakers/wav8k/min/cv/mix/40ba0112_1.2757_01nc0218_-1.2757.wav'))[0]
target = sf.read(str(wsj0_2mix_root / 'data/2speakers/wav8k/min/cv/s2/40ba0112_1.2757_01nc0218_-1.2757_2.wav'))[0]

stoi(target, observation, 8000)

---------------------------------------------------------------------------
AxisError                                 Traceback (most recent call last)
<ipython-input-167-eb5a1701f57b> in <module>
      9 
     10 
---> 11 stoi(target, observation, 8000)

.../python3.7/site-packages/pystoi/stoi.py in stoi(x, y, fs_sig, extended)
     75         # Find normalization constants and normalize
     76         normalization_consts = (
---> 77             np.linalg.norm(x_segments, axis=2, keepdims=True) /
     78             (np.linalg.norm(y_segments, axis=2, keepdims=True) + utils.EPS))
     79         y_segments_normalized = y_segments * normalization_consts

.../python3.7/site-packages/numpy/linalg/linalg.py in norm(x, ord, axis, keepdims)
   2479             # special case for speedup
   2480             s = (x.conj() * x).real
-> 2481             return sqrt(add.reduce(s, axis=axis, keepdims=keepdims))
   2482         else:
   2483             try:

AxisError: axis 2 is out of bounds for array of dimension 1

Is this a bug in the implementation or a general flaw of the STOI metric? Do you have a suggestion on how to handle this issue?
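One plausible mechanism for this kind of AxisError, shown without pystoi: when a list comprehension over frames produces no items, `np.array` returns a 1-D array of shape `(0,)` instead of an empty 2-D or 3-D array, so any reduction over axis 1 or 2 fails. Whether this is exactly what happens inside `stoi` for the files above is an assumption; `stack_frames` is a hypothetical stand-in for the framing code.

```python
import numpy as np

def stack_frames(x, framelen, hop):
    """Stack frames with a list comprehension plus np.array (hypothetical stand-in)."""
    return np.array([x[i:i + framelen] for i in range(0, len(x) - framelen, hop)])

short = np.zeros(100)  # shorter than one frame, so no frames are produced
frames = stack_frames(short, framelen=256, hop=128)
print(frames.shape)  # (0,)

raised = False
try:
    np.linalg.norm(frames, axis=1)
except ValueError as err:  # AxisError is a subclass of ValueError
    raised = True
    print(type(err).__name__)
```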

Future warnings raised

Hi,
I'm running stoi(signal1, signal2, sr, extended=True)
where signal1 and signal2 are both numpy.ndarray

and I'm getting the following future warning:
/usr/lib/python3/dist-packages/scipy/signal/signaltools.py:2383: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use arr[tuple(seq)] instead of arr[seq]. In the future this will be interpreted as an array index, arr[np.array(seq)], which will result either in an error or a different result.
return y[keep]

Any idea how to avoid this from happening?
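If the warning only needs to be silenced on the caller's side, the standard `warnings` module can do that locally. This is a sketch: `call_quietly` is a hypothetical helper, and it hides the message rather than fixing its cause. Since the warning is raised inside SciPy, a newer SciPy release may remove it at the source, which is not verified here.

```python
import warnings

def call_quietly(fn, *args, **kwargs):
    """Call fn while ignoring FutureWarning only for the duration of the call."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=FutureWarning)
        return fn(*args, **kwargs)

# Stand-in for a function that emits a FutureWarning.
def noisy_add(a, b):
    warnings.warn("this behaviour will change", FutureWarning)
    return a + b

print(call_quietly(noisy_add, 1, 2))  # 3
```

Usage with the call from this issue would look like `call_quietly(stoi, signal1, signal2, sr, extended=True)`.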

Thanks

Missing code for third-octave calculation

In thirdoct.m, there is a bit of code that is not mirrored in the pystoi implementation. Can you clarify why it was left out?

rnk         = sum(A, 2);
numBands    = find((rnk(2:end)>=rnk(1:(end-1))) & (rnk(2:end)~=0)~=0, 1, 'last' )+1;
A           = A(1:numBands, :);
cf          = cf(1:numBands);
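For discussion, a NumPy transcription of the four MATLAB lines above. This is a sketch of what the snippet computes, not code from pystoi; `trim_bands` is a hypothetical name. One deliberate difference: where MATLAB's `find` would return an empty result, this version raises an error.

```python
import numpy as np

def trim_bands(A, cf):
    """Keep bands up to the last one whose row sum is nonzero and not decreasing."""
    A = np.asarray(A)
    cf = np.asarray(cf)
    rnk = A.sum(axis=1)                            # rnk = sum(A, 2)
    keep = (rnk[1:] >= rnk[:-1]) & (rnk[1:] != 0)  # condition passed to find(...)
    hits = np.flatnonzero(keep)
    if hits.size == 0:
        raise ValueError("no band satisfies the rank condition")
    # MATLAB: find(..., 1, 'last') + 1 is a 1-based count of bands to keep.
    # The last 0-based hit is one less than MATLAB's index, hence + 2.
    num_bands = hits[-1] + 2
    return A[:num_bands, :], cf[:num_bands]

A = np.array([[1, 1, 0],
              [1, 1, 1],
              [1, 1, 1],
              [1, 0, 0],
              [0, 0, 0]])
A_trim, cf_trim = trim_bands(A, np.arange(5))
print(A_trim.shape, cf_trim)  # (3, 3) [0 1 2]
```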

One to two frames are discarded at the end of the signal

In two places, what seem to be remnants of a faulty MATLAB translation cause up to 2 frames to be discarded at the end of the signal.

In remove_silent_frames:

pystoi/pystoi/utils.py

Lines 148 to 151 in 9ff1cfa

x_frames = np.array(
    [w * x[i:i + framelen] for i in range(0, len(x) - framelen, hop)])
y_frames = np.array(
    [w * y[i:i + framelen] for i in range(0, len(x) - framelen, hop)])

The indexing variable should iterate range(0, len(x) - framelen + 1, hop) instead. Currently the last frame is discarded if the signal fits an integer number of frames. The corresponding MATLAB line is:

pystoi/tests/matlab/removeSilentFrames.m

Line 6 in 9ff1cfa

frames = 1:K:(length(x)-N);

But in MATLAB the variable can reach the end value as opposed to Python's range, so here we need the +1.
Similarly in stft:

pystoi/pystoi/utils.py

Lines 98 to 99 in 9ff1cfa

stft_out = np.array([np.fft.rfft(w * x[i:i + win_size], n=fft_size)
                     for i in range(0, len(x) - win_size, hop)])

The indexing variable should iterate range(0, len(x) - win_size + 1, hop) for the same reason. The corresponding MATLAB line is:

pystoi/tests/matlab/stdft.m

Line 3 in 9ff1cfa

frames = 1:K:(length(x)-N);

Note that since remove_silent_frames removes trailing samples if they do not fit a frame, the subsequent _overlap_and_add always produces a signal that fits an integer number of frames. Because of that, stft ALWAYS discards one frame! In conclusion, the last two frames are discarded if the signal fits an integer number of frames, else only the last one.

This can be easily verified:

import numpy as np
from pystoi.stoi import stoi

n = 256 + 30*128  # exactly 31 frames
x = np.random.randn(n)
stoi(x, x, 10000)  # raises the RuntimeWarning about not enough frames

A breakpoint here shows the shape of x_spec is (257, 29) instead of (257, 31), i.e. two frames are discarded. Fixing the two indexings produces the expected shape.
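The frame counts involved are plain arithmetic and can be checked without pystoi. The sketch below counts start indices for the current range, the proposed range with + 1, and an emulation of the MATLAB colon expression `1:K:(length(x)-N)`, which has a 1-based start and an inclusive end. The function names are hypothetical.

```python
def n_frames_current(n, framelen, hop):
    """Start indices produced by range(0, n - framelen, hop)."""
    return len(range(0, n - framelen, hop))

def n_frames_proposed(n, framelen, hop):
    """Start indices produced by range(0, n - framelen + 1, hop)."""
    return len(range(0, n - framelen + 1, hop))

def n_frames_matlab(n, framelen, hop):
    """Start indices of MATLAB's 1:K:(n - N): 1-based start, inclusive end."""
    return len(range(1, n - framelen + 1, hop))

n = 256 + 30 * 128  # a signal that fits exactly 31 frames
print(n_frames_current(n, 256, 128),
      n_frames_proposed(n, 256, 128),
      n_frames_matlab(n, 256, 128))  # 30 31 30
```

For this length, the 1-based MATLAB expression yields the same count as the Python range without + 1. Adding + 1 therefore keeps every frame, but scores may then stop matching the MATLAB reference exactly, so the choice is between full coverage and parity with MATLAB.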

Weird STOI Output

Hi,

Recently I was trying to evaluate some signals by calculating the STOI of each signal with this package. I used the pystoi.stoi.stoi function to calculate it. When I pass two identical signals as ref_signal and processed_signal, it outputs 1, as expected. However, when I replace the processed signal with microphone signals that I recorded with and without background music playing, the STOI of the signal with background music is always higher, which makes no sense.
I'm wondering whether I'm using the function the wrong way, or whether there is something wrong with my audio files or my understanding of STOI.

I've uploaded my audio files at the following website as well as my code to evaluate STOI.
https://github.com/nanaChang/stoiCheckFile

Thank you!

Numpy.dtype Size change

Is this the expected error message when I run the function with 16,000 Hz wav files, as opposed to 10kHz?

RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88

Exception For Small Inputs

Currently pystoi.stoi doesn't support small inputs, and throws an uninformative error:

In [28]:  pystoi.stoi(np.arange(100), np.arange(100), 32000, extended=False)
---------------------------------------------------------------------------
AxisError                                 Traceback (most recent call last)
<ipython-input-28-3f8d814254e5> in <module>
----> 1 pystoi.stoi(np.arange(100), np.arange(100), 32000, extended=False)

~/venv/py3/lib/python3.7/site-packages/pystoi/stoi.py in stoi(x, y, fs_sig, extended)
     56 
     57     # Remove silent frames
---> 58     x, y = utils.remove_silent_frames(x, y, DYN_RANGE, N_FRAME, int(N_FRAME/2))
     59 
     60     # Take STFT

~/venv/py3/lib/python3.7/site-packages/pystoi/utils.py in remove_silent_frames(x, y, dyn_range, framelen, hop)
    122 
    123     # Compute energies in dB
--> 124     x_energies = 20 * np.log10(np.linalg.norm(x_frames, axis=1) + EPS)
    125 
    126     # Find boolean mask of energies lower than dynamic_range dB

<__array_function__ internals> in norm(*args, **kwargs)

~/venv/py3/lib/python3.7/site-packages/numpy-1.19.2-py3.7-linux-x86_64.egg/numpy/linalg/linalg.py in norm(x, ord, axis, keepdims)
   2559             # special case for speedup
   2560             s = (x.conj() * x).real
-> 2561             return sqrt(add.reduce(s, axis=axis, keepdims=keepdims))
   2562         # None of the str-type keywords for ord ('fro', 'nuc')
   2563         # are valid for vectors

AxisError: axis 1 is out of bounds for array of dimension 1
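A clearer failure mode could be a length check before any framing happens. The sketch below is one way to phrase it. The frame length of 256 and hop of 128 match the numbers used elsewhere on this page, while `min_frames=30` and the helper name `check_enough_frames` are assumptions made for illustration, not values taken from the library.

```python
def check_enough_frames(n_samples, framelen=256, hop=128, min_frames=30):
    """Raise a descriptive error if a signal is too short to be framed.

    n_samples is the length after any resampling. All defaults are
    illustrative assumptions, not pystoi's actual constants.
    """
    if n_samples < framelen:
        n_frames = 0
    else:
        n_frames = (n_samples - framelen) // hop + 1
    if n_frames < min_frames:
        raise ValueError(
            f"signal has {n_samples} samples ({n_frames} frames of {framelen}); "
            f"at least {min_frames} frames are needed"
        )
    return n_frames

print(check_enough_frames(256 + 30 * 128))  # 31
```

Calling `check_enough_frames(100)` raises a ValueError that names the sample and frame counts instead of an AxisError from deep inside NumPy.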

np.AxisError when too many silent frames

Once in a blue moon, for some combinations of audio files, we get:
numpy.AxisError: axis 2 is out of bounds for array of dimension 1

This is probably due to signals with too many silent frames (Thx nfurnon!)

	x_frames = np.array(
	    [w * x[i:i + framelen] for i in range(0, len(x) - framelen, hop)])
	y_frames = np.array(
	    [w * y[i:i + framelen] for i in range(0, len(x) - framelen, hop)])

	stft_out = np.array([np.fft.rfft(w * x[i:i + win_size], n=fft_size)
	                     for i in range(0, len(x) - win_size, hop)])
