mimbres / neural-audio-fp Goto Github PK

View Code? Open in Web Editor NEW

167.0 7.0 22.0 12.16 MB

Home Page: https://mimbres.github.io/neural-audio-fp

License: MIT License

Python 98.59% Dockerfile 0.70% Jupyter Notebook 0.71%

audio-fingerprinting music-information-retrieval deep-learning fingerprint audio-search

neural-audio-fp's Introduction

Neural Audio Fingerprint for High-specific Audio Retrieval based on Contrastive Learning

About

This is an official code and dataset release by authors (since July 2021) for reproducing neural audio fingerprint.
Previously, there was a PyTorch implementation by Yi-Feng Chen.
✳️ Sound DEMO available now.

Requirements

Minimum:

NVIDIA GPU with CUDA 10+
25 GB of free SSD space for mini dataset experiments

More info

System requirements to reproduce the ICASSP result

CPU with 8+ threads
NVIDIA GPU with 11+ GB V-memory
SSD free space 500+ GB for full-scale experiment
tar extraction temporarily requires additional free space 440 GB.

Recommended batch-size for GPU

Device	Recommended BSZ
1080ti, 2080ti (11GB), Titan X, Titan V (12GB), AWS/GCP V100(16 GB)	320
Quadro RTX 6000 (24 GB), 3090 (24GB)	640
V100v2 (32GB), AWS/GCP A100 (40 GB)	1280
~~TPU~~	~~5120~~

The larger the BSZ, the higher the performance.
To allow the use of a larger BSZ than actual GPU memory, one trick is to remove allow_gpu_memory_growth() from the run.py.

Install

Docker

# CUDA 10.1-based image
docker pull mimbres/neural-audio-fp:latest
# CUDA 11.2-based image for RTX 30x0 and later
docker pull mimbres/neural-audio-fp:cuda11.2.0-cudnn8

Create a custom image from Dockerfile

Requirements

NVIDIA driver >= 450.80.02
Docker > 20.0

Create

You can create an image through Dockerfile and environment.yml.

git clone https://github.com/mimbres/neural-audio-fp.git
cd neural-audio-fp
docker build -t neural-audio-fp .

Further information

Intel CPU users can remove libopenblas from Dockerfile.
Faiss and Numpy are optimized for Intel MKL.
Image size is about 12 GB or 6.43 GB (compressed).
To optimize GPU-based search speed, install from the source.

Conda

Create a virtual environment via .yml

Requirements

NVIDIA driver >= 450.80.02, CUDA >= 11.0 and cuDNN 8 (Compatiability)
NVIDIA driver >= 440.33, CUDA == 10.2 and cuDNN 7 (Compatiability)
Anaconda3 or Miniconda3 with Python >= 3.6

Create

After checking the requirements,

git clone https://github.com/mimbres/neural-audio-fp.git
cd neural-audio-fp
conda env create -f environment.yml
conda activate fp

Create a virtual environment without .yml

# Python 3.8: installing in the same virtual environment
conda create -n YOUR_ENV_NAME 
conda install -c anaconda -c pytorch tensorflow=2.4.1=gpu_py38h8a7d6ce_0 cudatoolkit faiss-gpu=1.6.5
conda install pyyaml click matplotlib
conda install -c conda-forge librosa
pip install kapre wavio

If your installation fails at this point and you don't want to build from source...:thinking:

Try installing tensorflow and faiss-gpu=1.6.5 (not 1.7.1) in separate environments.

#After creating a tensorflow environment for training...
conda create -n YOUR_ENV_NAME
conda install -c pytorch faiss-gpu=1.6.5
conda install pyyaml, click

Now you can run search & evaluation by

python eval/eval_faiss.py --help

Dataset

	Dataset-mini v1.1 (11.2 GB)	Dataset-full v1.1 (443 GB)
tar	✳️kaggle / gdrive	dataport(open-access)
raw	gdrive	gdrive

The only difference between these two datasets is the size of 'test-dummy-db'. So you can first train and test with Dataset-mini. Dataset-full is for testing in 100x larger scale.
You can download the Dataset-mini via kaggle CLI (recommended).
- Sign in kaggle -> Account -> API -> Create New Token -> download kaggle.json

pip install --user kaggle
cp kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
kaggle datasets download -d mimbres/neural-audio-fingerprint

100%|███████████████████████████████████| 9.84G/9.84G [02:28<00:00, 88.6MB/s]

Dataset installation

This dataset includes all music sources, background noises, impulse-reponses (IR) samples that can be used for reproducing the ICASSP results.

Directory location

The default directory of the dataset is ../neural-audio-fp-dataset. You can change the directory location by modifying config/default.yaml.

.
├── neural-audio-fp-dataset
└── neural-audio-fp

Structure of dataset

neural-audio-fp-dataset/
├── aug
│   ├── bg         <=== Audioset, Pub/cafe etc. for background noise mix
│   ├── ir         <=== IR data for microphone and room reverb simulatio
│   └── speech     <=== subset of common-voice, NOT USED IN THE PAPER RESULT
├── extras
│   └── fma_info   <=== Meta data for music sources.
└── music
    ├── test-dummy-db-100k-full  <== 100K songs of full-lengths
    ├── test-query-db-500-30s    <== 500 songs (30s) and 2K synthesized queries
    ├── train-10k-30s            <== 10K songs (30s) for training
    └── val-query-db-500-30s     <== 500 songs (30s) for validation/mini-search

The data format is 16-bit 8000 Hz PCM Mono WAV. README.md and LICENSE is included in the dataset for more details.

Checksum for Dataset-full

Install checksumdir.

pip install checksumdir

Compare checksum.

checksumdir -a md5 neural-audio-fp-dataset
# aa90a8fbd3e6f938cac220d8aefdb134

checksumdir -a sha1 neural-audio-fp-dataset
# 5bbeec7f5873d8e5619d6b0de87c90e180363863d

Quickstart

There are 3 basic COMMAND s for each step.

# Train
python run.py train CHECKPOINT_NAME

# Generate fingreprint
python run.py generate CHECKPOINT_NAME

# Search & Evalutaion (after generating fingerprint)
python run.py evaluate CHECKPOINT_NAME CHECKPOINT_INDEX

Help for run.py client and its commands.

python run.py --help
python run.py COMMAND --help

More Features

Click to expand each topic.

Managing Checkpoint

python run.py train CHECKPOINT_NAME CHECKPOINT_INDEX

If CHECKPOINT_INDEX is not specified, the training will resume from the latest checkpoint.
In default configuration, all checkpoints are stored in logs/checkpoint/CHECKPOINT_NAME/ckpt-CHECKPOINT_INDEX.index.

Training

python run.py train CHECKPOINT --max_epoch=100 -c default

Notes:

Check batch-size that fits on your device first.
The default config is set TR_BATCH_SZ=120 with OPTIMIZER=Adam.
For TR_BATCH_SZ >= 240, OPTIMIZER=LAMB is recommended.
For TR_BATCH_SZ >= 1280, LR=1e-4 can be too small.
In NTxent loss function, the best temperature parameter TAU is in the range of [0.05, 0.1].
Augmentation strategy is quite important. This topic deserves further discussion.

Config File

The config file is located in config/CONFIG_NAME.yaml. You can edit directory location, data selection, hyperparameters for model and optimizer, batch-size, strategies for time-domain and spectral-domain augmentation chain, etc. After training, it is important to keep the config file in order to restore the model.

python run.py COMMAND -c CONFIG

When using generate command, it is important to use the same config that was used in training.

Fingerprint Generation

python run.py generate CHECKPOINT_NAME # from the latest checkpoint
python run.py generate CHECKPOINT_NAME CHECKPOINT_INDEX -c CONFIG_NAME

# Location of the generated fingerprint
.
└──logs
   └── emb
       └── CHECKPOINT_NAME
           └── CHECKPOINT_INDEX
               ├── db.mm
               ├── db_shape.npy
               ├── dummy_db.mm
               ├── dummy_db_shape.npy
               ├── query.mm
               └── query_shape.npy

By default config, generate will generate embeddings (or fingerprints) from 'dummy_db', test_query and test_db. The generated embeddings will be located in logs/emb/CHECKPOINT_NAME/CHECKPOINT_INDEX/**.mm and **.npy.

dummy_db is generated from the 100K full-length dataset.
In the DATASEL section of config, you can select options for a pair of db and query generation. The default is unseen_icassp, which uses a pre-defined test set.
It is possilbe to generate only the db and query pairs by --skip_dummy option. This is a frequently used option to avoid overwriting the most time-consuming dummy_db fingerprints in every experiment.
It is also possilbe to generate embeddings (or fingreprints) from your custom source.

python run.py generate --source SOURCE_ROOT_DIR --output FP_OUTPUT_DIR --skip_dummy # for custom audio source
python run.py generate --help # more details...

Search & Evaluation

The following command will construct a faiss.index from the generated embeddings or fingerprints located at logs/emb/CHECKPOINT_NAME/CHECKPOINT_INDEX/.

# faiss-gpu
python run.py evaluate CHECKPOINT_NAME CHECKPOINT_INDEX [OPTIONS]

# faiss-cpu
python run.py evaluate CHECKPOINT_NAME CHECKPOINT_INDEX --nogpu

In addition, you can choose one of the --index_type (default is IVFPQ) from the table below:

Type of index	Description
`l2`	L2 distance
`ivf`	Inverted File Index (IVF)
`ivfpq`	Product Quantization (PQ) with IVF 📖
`ivfpq-rr`	IVF-PQ with re-ranking
~~`ivfpq-rr-ondisk`~~	~~IVF-PQ with re-ranking on disk search~~
`hnsw`	Hierarchical Navigable Small World 📖

python run.py evaluate CHECKPOINT_NAME CHECKPOINT_INDEX --index_type IVFPQ

Currently, few options for Faiss settings are available in run.py client. Instead, you can directly run:

python eval/eval_faiss.py EMB_DIR --index_type IVFPQ --kprobe 20 --nogpu
python eval/eval_faiss.py --help

Note that eval_faiss.py does not require Tensorflow.

Tensorboard

Tensorboard is enabled by default in the ['TRAIN'] section of the config file.

# Run Tensorboard
tensorboard --logdir=logs/fit --port=8900 --host=0.0.0.0

Build DB & Search

Here is an overview of the system for building and retrieving the database. The system and 'matcher' algorithm are not detailed in the paper. But it's very simple as in this code.

Plan

Now working on tf.data-based new data pipeline for multi-GPU and TPU support.
One page Colab demo.
This project is currently based on Faiss, which provides the fastest large-scale vector searches.
Milvus is also worth watching as it is an active project aimed at industrial scale vector search.

Augmentation Demo and Scoreboard

Augmentation demo was generated by dataset2wav.py.

External links

(Unofficial) PyTorch implementation by Yi-Feng Chan.

Acknowledgement

This project has been supported by the TPU Research Cloud (TRC) program.

Cite

@conference {chang2021neural,
    author={Chang, Sungkyun and Lee, Donmoon and Park, Jeongsoo and Lim, Hyungui and Lee, Kyogu and Ko, Karam and Han, Yoonchang},
    title={Neural Audio Fingerprint for High-specific Audio Retrieval based on Contrastive Learning},
    booktitle={International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021)},
    year = {2021}
}

neural-audio-fp's People

Contributors

Stargazers

Watchers

neural-audio-fp's Issues

New feature for database adaptation

After thinking a bit more, it would be nice to add a few lines to allow loading from other devices. Someone like @Mihonarium might want to load a pre-trained checkpoint and further train with different music data. As mentioned in the paper section 4.2, I've seen small performance gains by using 10% of test-dummy-set as training data. I call this use case as "seen database" to distinguish with general cases of fully "unseen" test set.

...............................................................................................................................................................................................................................................
What's even more strange, the issue with a lot of warnings appears only with run.py train and doesn't appear for generate.

The notebook: https://gist.github.com/Mihonarium/e3fd355cb560b82373fd2186139f1bc2 (the last cells show that generate and training from scratch work).

Originally posted by @Mihonarium in #10 (comment)

Getting training loss value as nan and val loss as nan

I'm using NVIDIA RTX A5000 .
I'm getting loss values as nan during training with default values.I'm getting this ..please clarify and solve the issue
W tensorflow/stream_executor/gpu/asm_compiler.cc:235] Your CUDA software stack is old. We fallback to the NVIDIA driver for some compilation. Update your CUDA version to get the best performance. The ptxas error was: ptxas fatal : Value 'sm_86' is not defined for option 'gpu-name'

Questions about inquiries

Hello, I would like to know if the 500 test data you mentioned in your paper were randomly selected?Did you choose it only once? Or take the average of multiple times?
Is your database set up to ensure that it contains 500 test data?Because in general, in the audio query task, there should be a difference between the recall rate and the precision rate. I understand that the accuracy rate in your paper is the precision rate.
I'm using a program implemented by someone else in pytorch, have you checked the program's implementation and results?
When I tested, I found that the accuracy of 1-second and 2-second query tests fluctuated greatly with different data. Did you find this out?
And I think that there is also the possibility of muted segments for 1-second and 2-second queries, which will have a greater impact on the results. Therefore, your 1-second and 2-second accuracy rates are as stated in your paper. Is it right? Or is it more volatile?

UnboundLocalError in run.py during training

Hello,
I executed the following command:

python run.py train CHECKPOINT --max_epoch=100 -c default

The code is throwing the following error on line 207:
if cfg['TRAIN']['SAVE_IMG'] and sim_mtx is not None:
UnboundLocalError: local variable 'sim_mtx' referenced before assignment

I am facing this issue and not sure how to fix it. Any insights on what might be causing this error?

Comparing short audio files

Hi,
I'm interested in finding near-duplicate audio files. My dataset is about 3000 thousands short audio files, between 0.5 seconds to 5 seconds. Unlike Shazam, both the "target" audio (i.e. the songs in Shazam's case) and the user input are short, and both might contain noise.

Can this library help?
If so, are there any recommendations for tuning parameters?

N.B - if a file is matched to multiple other files, it's fine - I have a less efficient algorithm that can verify which match is correct. In other words, I can handle some amount of false positives, but I don't want false negatives.

Whether to check the similarity before audio storage

Whether to check the similarity before audio storage?
Is there a simple demo for query and other operations ?

How to align and get the begin and end time of the query in database？

If I query the start and end positions of an audio, through faiss.search, it is possible that top n is a non-continuous value, how can I align it?

Why the loss computed during training is Nan？

I followed the actions in the readme documentation to configure the environment（Create a virtual environment via .yml), and downloaded the Dataset-mini v1.1 to ../ . But the loss calculated when running run.py training is nan. When debugging, I found that when the data passed through the front_conv layer of the FingerPrinter model, the values of the calculated tensor were all 0 or nan. What‘s wrong and why is this happening?

Sound demo

Upload a few examples with Top1 score

Image
Audio
Demo page

Fingerprint generation from custom dataset

Hi, first of all great work. I am trying to use this model to generate some fingerprints for a single movie audio (let's say 2 hours long). I trained the model, following your instructions, and for the -s argument in the python run.py generate command I used the directory containing the movie audio. Then the fingerprints are generated, all good. I then proceed to read the produced memmap, I convert it into a numpy array and it's of size (16353,128) with duration=1 and hop=0.5 as parameters. All good until now. However, when I inspect the produced fingerprints, they are repeated every 125 fingerprints, meaning that it produces 16353/125=131 batches which are the same. Why is that? Don't you crop the audio file into segments and then pass it through the model, or should I do that beforehand? Thanks!

question: my modle train loss:nan

My model gradient explosion

Code and data set availability

Could you give a date on which the code and data set will be available.

Regards

Reported train, val, test split sizes vs actual FMA size

First of all, thank you very much for the work, having access to the repository gave me a jump start to the field.

After reading the paper, downloading the dataset from IEEE dataport and running some experiments I noticed an issue.

The paper reports that train, val, test_query, and test_dummy sets are disjoint with corresponding sizes 10,000, 500, 500, and 100,000. When you get the union of these disjoint sets, the total size is 110,000. When you consider the NAFP dataset to be a subset of the FMA dataset, this can not be true as FMA dataset (fma_full) contains 106,574 tracks. You can check this information at the FMA Github Repository.
In fact, the IEEE dataport downloaded neural-audio-fp-dataset/music/test-dummy-db-100k-full/fma_full directory contains 93,458 wav tracks. This can be verified with running find -type f -name "*.wav" | wc -l from the directory mentioned previously.
Therefore, running the checksum does not match the provided value at this repo.

Could you clarify this, please?

I think this can be the reason why the reported metrics on the paper and the metrics that we obtain by training an Adam N=120 model or evaluating the provided 640_lamb model are inconsistent. See: Issue related.

I did some experiments to find these 6,542 tracks. I will report them on the corresponding issue.

Speed of generating fingereprints from custom source

Hi, it might be related to this, but I'm trying to generate fingerprints from custom source using the pretranied model you shared here: #10 (comment) and I was wondering if you could tell me what's the expectated time for generating a fingerprint from a single query? Since it took 1629seconds to generate fingerprints corresponding to 2 queries (1-min length) [even though in the source directory there are 3 wav files, I'm studying why this as well ]
From the CLI Output: 2/2 [==============================] - 1629s 47ms/step

I'm using a 40-cpu server with a RTX3090.

Also, can you help me understanding the shape of the resulting db? I understand that the shape is n_items x d, and n_items is #num audios x batchsize. I don't see what this batchsize mean and therefore, the resulting db shape.

Thanks in advance!

Originally posted by @guillemcortes in #8 (comment)

Evaluation with custom dataset

Hello, I am using a pretrained model that you commented in one of your issues, and I want to use it for my own dataset, I modified the dataset class:

def get_custom_db_ds(self, source_root_dir):
    """ Construc DB (or query) from custom source files. """

    query_path = sorted(
        glob.glob(source_root_dir + 'query/*.wav'))
    db_path = sorted(
        glob.glob(source_root_dir + 'db/*.wav'))
    
    _ts_n_anchor = self.ts_batch_sz # Only anchors...

    query_ds = genUnbalSequence(
        db_path,
        self.ts_batch_sz,
        _ts_n_anchor,
        self.dur,
        self.hop,
        self.fs,
        shuffle=False,
        random_offset_anchor=False,
        drop_the_last_non_full_batch=False) # No augmentations, No drop-samples.
    db_ds = genUnbalSequence(
        query_path,
        self.ts_batch_sz,
        _ts_n_anchor,
        self.dur,
        self.hop,
        self.fs,
        shuffle=False,
        random_offset_anchor=False,
        drop_the_last_non_full_batch=False)
    return query_ds, db_ds

and the get_data_source function in generate.py:

def get_data_source(cfg, source_root_dir, skip_dummy):
    dataset = Dataset(cfg)
    ds = dict()
    if source_root_dir:
        ds['custom_source'] = dataset.get_custom_db_ds(source_root_dir)
    else:
        if skip_dummy:
            tf.print("Excluding \033[33m'dummy_db'\033[0m from source.")
            pass
        else:
            ds['dummy_db'] = dataset.get_test_dummy_db_ds()

        if dataset.datasel_test_query_db in ['unseen_icassp', 'unseen_syn']:
            ds['query'], ds['db'] = dataset.get_test_query_db_ds()
        elif dataset.datasel_test_query_db == 'custom_dataset':
            ds['query'], ds['db'] = dataset.get_custom_db_ds(cfg['DIR']['SOURCE_ROOT_DIR'])
        else:
            raise ValueError(dataset.datasel_test_query_db)

    tf.print(f'\x1b[1;32mData source: {ds.keys()}\x1b[0m',
             f'{dataset.datasel_test_query_db}')
    return ds

I am using a sample of 3 db audios files and 2 query audio files.

Generating the fingerprint gave a warning that the size does not match also note that I am using skip dummy since I am using my own dataset. I want to ask if this is fine.
also regarding search and evaluation, a part of the code depend on the dummy_db like training the faiss index, how should I adapt it for my case.
The test_seq_len argument is regarding how many segment from the query we should consider to match against the db?
can this model help me get from which timestamp in the top1 detected db audio does a start from.
If a query contains parts from two database audios is it possible to get a probability score so that I can condition on a threshold to detect if it comes from more than one audio db or not.

Thank in advance, and thanks for you greate work

FileNotFoundError: [Errno 2] No such file or directory: './logs/emb/CHECKPOINT_NAME/CHECKPOINT_INDEX/query_shape.npy'

While typing this command
$python run.py evaluate CHECKPOINT_NAME CHECKPOINT_INDEX

FileNotFoundError: [Errno 2] No such file or directory: './logs/emb/CHECKPOINT_NAME/CHECKPOINT_INDEX/query_shape.npy'

Im getting this error, please help me for search and evaluation process.
Thank you!

loss to convergence

how much the loss when the model convergence

where is the code ?

hello, Can I refer to your code for my advisement retrival ?

minor fix

config file
- default tau: 0.1 --> 0.05
- added new preset for BSZ=640
run.py
- correct evaluate command's default to 'icassp'
eval/eval_faiss.py
- test_ids='all' works
model/train.py:
- validatation size 250 --> 500

permission denied for building Docker Custom Image

dinesh@fourier:~/neural-audio-fp$ docker build -t neural-audio-fp .
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/build?buildargs=%7B%7D&cachefrom=%5B%5D&cgroupparent=&cpuperiod=0&cpuquota=0&cpusetcpus=&cpusetmems=&cpushares=0&dockerfile=Dockerfile&labels=%7B%7D&memory=0&memswap=0&networkmode=default&rm=1&shmsize=0&t=neural-audio-fp&target=&ulimits=null&version=1": dial unix /var/run/docker.sock: connect: permission denied

finetuning on short audios

hi, I have about 900 short audios of users speaking, after training the main model for 20 epochs(converging very well) I try to finetune it on my own data but after 130 epochs the loss does not converge and stays in the range of 0.99-0.95 . Are there any configurations I am missing that need to be altered in the case of nonmusic data

Could you please provide the checksum of fma_full only?

or checksum of other files separately so that I can locate the different files...

positve pairs and negitive pairs?

Lets say dog is barking in two seconds by more than 4 times,So that each time sound is similar to remaning barks in the same file.We have defined postive pairs are ORIGINAL and It's AUGMENTED REPLICAS.But here bark sounds from one dog is almost similar then how can they be differentiate with augmented or else will they also become postive pairs(REAL SAMPLES)?In pairwise similarity simply real circle (not dashed) presented more than ones in a row.What is the case at this moment?

Dataset update

Dataset update

Early downloaders (before July 8, 2021, 2:31 PM GMT) need this update via Dataport.
The test set included in the initial package was labeld as SNR [0,10] dB, but it was actually [10,10] dB (easier test). Mistake during directory cleanup. It's fixed now.
Now v1.1 with SNR= [0,10]dB, 0dB, -3dB queries.
I found 24 duplicate (between the dummy and test sets) songs in the previous data set for publication results. I also had to replace a few songs for the training set. Datasets v1-v1.1 is very clean, as I double-checked.
One problem (?) is that after the data set correction, the performance improved by almost 5 to 6 percent for 1 second query.

In progress:

Fast download with command-line interface (kaggle public dataset?)
md5/sha1 checksum
Upload full dataset after correcting dataset duplicates

Model Definition Front Strides

Hey team! I'm looking into the model definition here.

Should

 front_strides=[[(1,2), (2,1)], [(1,2), (2,1)],
                [(1,2), (2,1)], [(1,2), (2,1)],
                [(1,1), (2,1)], [(1,2), (2,1)],
                [(1,1), (2,1)], [(1,2), (2,1)]],

 front_strides=[[(1,2), (2,1)], [(1,2), (2,1)],
                [(1,2), (2,1)], [(1,2), (2,1)],
                [(1,2), (2,1)], [(1,2), (2,1)],
                [(1,2), (2,1)], [(1,2), (2,1)]],

Or is the former intended and, if so, why?

News & updates

Hi all, sorry for my lateness!! I've been replying to emails, but I saw the issue board for the first time today. My bad...

For someone in a hurry, I am releasing a working code (training + mini evaluation) in advance now.
I can complete everything (dataset upload, easy setup for FAISS-GPU, fully repoducing all experiments, etc.) by the end of this week.

Just wait a few more days. I'm working on this project full time now... :)

Losses coming as NaN on training with default settings

I cloned this repo.
Downloaded the dataset mini from the google drive.
Created the docker container by pulling the image provided.
Ran python run.py train test123 --max_epoch=100 -c default inside the docker.

I am seeing all NaNs in the tensorboard logs for training loss. I am running this on a machine with 3 NVIDIA GeForce 3090 GPUs. What might be going wrong?

Code of now playing model

Is it possible include code for the Google's now playing model here? Even if it's not functional.

Pretrained model

While it's relatively easy to train the model on the Dataset-mini (even Colab allows that), it's not as easy to reproduce the paper's results with the Dataset-full. It would be great if you could publish a model trained on the full dataset.

(By the way, congratulations on the paper, and thanks for publishing the work, it's really cool!)

Unexpectedly high (over 10%) search performance

I'm tring to reproduce your paper results.

In my resource, I can use only TR_BATCH_SZ=120.
So your paper said top-1 exact match rate 55.9% and near match rate 62.3% at query length 1s in Table 3.
But I get 67.95% of exact match rate and 73.0% of near match rate. (see below)
Is it an applicable result?

$ time -p python run.py evaluate test 100
cli: Configuration from ./config/default.yaml
Load 29,500 items from ./logs/emb/test/100/query.mm.
Load 29,500 items from ./logs/emb/test/100/db.mm.
Load 54,336,000 items from ./logs/emb/test/100/dummy_db.mm.
Creating index: ivfpq
Copy index to GPU.
Training index using 18.40 % of data...
Elapsed time: 31.04 seconds.
54336000 items from dummy DB
29500 items from reference DB
Added total 54365500 items to DB. 104.26 sec.
Created fake_recon_index, total 54365500 items. 0.11 sec.
test_id: icassp,  n_test: 2000
========= Top1 hit rate (%) of segment-level search =========
               ---------------- Query length ----------------
   segments      1        3       5       9       11      19     
   seconds      (1s)     (2s)    (3s)    (5s)    (6s)   (10s)    

  Top1 exact   67.95    88.90   94.30   97.35   98.10   99.15
  Top1 near    73.00    90.30   94.70   97.40   98.15   99.20
  Top3 exact   76.65    92.05   95.65   98.25   98.85   99.55
 Top10 exact   80.00    92.65   96.05   98.45   99.00   99.60
=============================================================
average search + evaluation time 18.77 ms/query
Saved test_ids and raw score to ./logs/emb/test/100/.
real 364.41
user 544.26
sys 115.40

I use dataset-full v1.1 from ieee-dataport.org.

https://ieee-dataport.org/open-access/neural-audio-fingerprint-dataset#files

I change configuration TEST_DUMMY_DB=100k_full_icassp.
Then train & generate, evaluate

Get fingerprints for custom wav input files

I want to generate just the fingerprints for custom wav file as input. I just want to generate fingerprint for one single wav file. Could someone point me to a script which can do so.
e.g. If I have a wav file of 10s and my seg length is 1s and hop 0.5s, then I want a fingerprint output of size (19 x fp_dim).

Questions Regarding Custom Data Testing in Audio Fragment Identification

Hello, I found this paper quite interesting and encountered some issues during the testing process with my custom data. I apologize if my questions seem naive, as I lack some knowledge in this area.

In my case, I don't need an algorithm to identify a specific audio from various songs using audio fragments. I have a single audio file (2-3 hours) and need to find where in this file a certain audio fragment (3-5 seconds) begins (I understand from the issue check that I need to customize the process for obtaining the start timestamp).

Is this code suitable for such a scenario?
I trained using the provided dataset mini. Then I used the command python run.py generate --source CUSTOM_SOURCE_ROOT_DIR --output FP_OUTPUT_DIR --skip_dummy to generate fingerprints for my custom data, which is an audio file. Afterward, I wanted to evaluate a short audio fragment (3-5 second wav file) but wasn't sure how to proceed. Also, is this a meaningful process?
Should my custom audio data also be included in the training?

Thank you.

Unable to open the file "../demo_template.ipynb"

Dimension of Zt

We generate segment-wise embeddings zt∈Z that can represent a unit segment of audio from the acoustic features S at time step t.
In this line do each zt is of dimension d or dimension 1.

IS THIS REPOSITORY HELPFUL FOR FOLLOWING SITUATION

Is this repository helpful for the below situation?
There could be situations where we need to detect the presence of new
sound sources for whose prerecorded acoustic signatures are available.
In order to handle such novel situations, the proposed system will be
equipped with audio search capability, which performs template
matching to detect the sound sources which match the given acoustic
signature (pre-recorded sound from that source).

Query generation questions

I tried to reproduce your paper, and my code is https://github.com/stdio2016/pfann .
In my code, I generate queries by:

Randomly slice a x second segment from test music, x is query length
Add one noise file to this segment
Add 2 IRs to this segment, one is for room reverb, and the other is for microphone IR
Save this segment as query file

It seems that your code does these:

Split test music into 1 second segments
For each segment:
Randomly time shift the segment within +/-0.2s
Add one noise segment to the segment. The added noises of each segment seem to maintain time order.
Add one IR file to the segment.
Concatenate these segments and save as query file

My question:

Why do you add different IR to different 1s segments of the same query file? I do not think that the reverb environment would change every 1s.
I use random slicing to simulate query start time, while you randomly shift each 1s segment independently. Isn't uniform time shifting enough?
In your paper, you said "microphone and room impulse response (IR) are sequentially applied by convolution operation." However, I can only find one convolution operation per segment. How do you apply 2 IRs (microphone and room IRs) in one convolution? Do you merge these two datasets, or preprocess so that every new IR is a combination of one microphone and one room IR?

mimbres / neural-audio-fp Goto Github PK

neural-audio-fp's Introduction

Neural Audio Fingerprint for High-specific Audio Retrieval based on Contrastive Learning

About

Requirements

System requirements to reproduce the ICASSP result

Recommended batch-size for GPU

Install

Requirements

Create

Further information

Requirements

Create

Dataset

Directory location

Structure of dataset

Quickstart

More Features

Build DB & Search

Plan

Augmentation Demo and Scoreboard

External links

Acknowledgement

Cite

neural-audio-fp's People

Contributors

Stargazers

Watchers

Forkers

neural-audio-fp's Issues

Recommend Projects

Recommend Topics

Recommend Org