
crowsetta's Introduction



A core package for acoustic communication research in Python

Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.


There are many great software tools for researchers studying acoustic communication in animals¹. But our research groups work with a wide range of data formats: for audio, for array data, for annotations. This means we write a lot of low-level code to deal with those formats, and our code for analyses becomes tightly coupled to those formats. In turn, this makes it hard for other groups to read our code, and it takes a real investment to understand our analyses, workflows and pipelines. It also means significant work is required to translate an analysis worked out by a scientist-coder in a Jupyter notebook into a generalized, robust service provided by an application.

In particular, acoustic communication researchers working with the Python programming language face these problems. How can our scripts and libraries talk to each other? Luckily, Python is a great glue language! Let's use it to solve these problems.

The goals of VocalPy are to:

  • make it easy to work with a wide array of data formats: audio, array (spectrograms, features), annotation
  • provide classes that represent commonly-used data types: audio, spectrograms, features, annotations
  • provide classes that represent common processes and steps in pipelines: segmenting audio, computing spectrograms, extracting features
  • make it easier for scientist-coders to flexibly and iteratively build datasets, without needing to deal directly with a database if they don't want to
  • make it possible to re-use code you have already written for your own research group
  • and finally:
    • make code easier to share and read across research groups, by providing these classes, and idiomatic ways of coding with them; think of VocalPy as an interoperability layer and a common language
    • facilitate collaboration between scientist-coders writing imperative analysis scripts and research software engineers developing libraries and applications

A paper introducing VocalPy and its design has been accepted at Forum Acusticum 2023 as part of the session "Open-source software and cutting-edge applications in bio-acoustics", and will be published in the proceedings.

Features

Data types for acoustic communication data: audio, spectrogram, annotations, features

The vocalpy.Sound data type

  • Works with a wide array of audio formats, thanks to soundfile.
  • Also works with the cbin audio format saved by the LabVIEW app EvTAF used by many neuroscience labs studying birdsong, thanks to evfuncs.
>>> import vocalpy as voc
>>> data_dir = ('tests/data-for-tests/source/audio_wav_annot_birdsongrec/Bird0/Wave/')
>>> wav_paths = voc.paths.from_dir(data_dir, 'wav')
>>> audios = [voc.Sound.read(wav_path) for wav_path in wav_paths]
>>> print(audios[0])
vocalpy.Sound(data=array([3.0517...66210938e-04]), samplerate=32000, channels=1),
path=tests/data-for-tests/source/audio_wav_annot_birdsongrec/Bird0/Wave/0.wav)

The vocalpy.Spectrogram data type

  • Save expensive-to-compute spectrograms to array files, so you don't regenerate them over and over again (a sketch of saving follows the example below)
>>> import vocalpy as voc
>>> data_dir = ('tests/data-for-tests/generated/spect_npz/')
>>> spect_paths = voc.paths.from_dir(data_dir, 'wav.npz')
>>> spects = [voc.Spectrogram.read(spect_path) for spect_path in spect_paths]
>>> print(spects[0])
vocalpy.Spectrogram(data=array([[3.463...7970774e-14]]), frequencies=array([    0....7.5, 16000. ]), times=array([0.008,...7.648, 7.65 ]), 
path=PosixPath('tests/data-for-tests/generated/spect_npz/0.wav.npz'), audio_path=None)
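For the saving half of that workflow, here is a minimal sketch. It assumes your own spectrogram code already produced the data, frequencies, and times arrays shown in the repr above, and it assumes a write method on Spectrogram; that method name is an assumption here, not something this README confirms, so check the vocalpy docs for the exact call.

>>> import vocalpy as voc
>>> # data, frequencies, times: arrays from your own spectrogram code (assumed to exist)
>>> spect = voc.Spectrogram(data=data, frequencies=frequencies, times=times)
>>> spect.write('tests/data-for-tests/generated/spect_npz/0.wav.npz')  # assumed method; verify in the docs
>>> # later, re-use it without recomputing
>>> spect = voc.Spectrogram.read('tests/data-for-tests/generated/spect_npz/0.wav.npz')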

The vocalpy.Annotation data type

  • Load many different annotation formats using the pyOpenSci package crowsetta
>>> import vocalpy as voc
>>> data_dir = ('tests/data-for-tests/source/audio_cbin_annot_notmat/gy6or6/032312/')
>>> notmat_paths = voc.paths.from_dir(data_dir, '.not.mat')
>>> annots = [voc.Annotation.read(notmat_path, format='notmat') for notmat_path in notmat_paths]
>>> print(annots[1])
Annotation(data=Annotation(annot_path=PosixPath('tests/data-for-tests/source/audio_cbin_annot_notmat/gy6or6/032312/gy6or6_baseline_230312_0809.141.cbin.not.mat'), 
notated_path=PosixPath('tests/data-for-tests/source/audio_cbin_annot_notmat/gy6or6/032312/gy6or6_baseline_230312_0809.141.cbin'), 
seq=<Sequence with 57 segments>), path=PosixPath('tests/data-for-tests/source/audio_cbin_annot_notmat/gy6or6/032312/gy6or6_baseline_230312_0809.141.cbin.not.mat'))

Classes for common steps in your pipelines and workflows

A Segmenter for segmentation into sequences of units

>>> import evfuncs
>>> import vocalpy as voc
>>> data_dir = ('tests/data-for-tests/source/audio_cbin_annot_notmat/gy6or6/032312/')
>>> cbin_paths = voc.paths.from_dir(data_dir, 'cbin')
>>> audios = [voc.Sound.read(cbin_path) for cbin_path in cbin_paths]
>>> segment_params = {'threshold': 1500, 'min_syl_dur': 0.01, 'min_silent_dur': 0.006}
>>> segmenter = voc.Segmenter(callback=evfuncs.segment_song, segment_params=segment_params)
>>> seqs = segmenter.segment(audios, parallelize=True)
[  ########################################] | 100% Completed | 122.91 ms
>>> print(seqs[1])
Sequence(units=[Unit(onset=2.19075, offset=2.20428125, label='-', audio=None, spectrogram=None),
                Unit(onset=2.35478125, offset=2.38815625, label='-', audio=None, spectrogram=None),
                Unit(onset=2.8410625, offset=2.86715625, label='-', audio=None, spectrogram=None),
                Unit(onset=3.48234375, offset=3.49371875, label='-', audio=None, spectrogram=None),
                Unit(onset=3.57021875, offset=3.60296875, label='-', audio=None, spectrogram=None),
                Unit(onset=3.64403125, offset=3.67721875, label='-', audio=None, spectrogram=None),
                Unit(onset=3.72228125, offset=3.74478125, label='-', audio=None, spectrogram=None),
                Unit(onset=3.8036875, offset=3.8158125, label='-', audio=None, spectrogram=None),
                Unit(onset=3.82328125, offset=3.83646875, label='-', audio=None, spectrogram=None),
                Unit(onset=4.13759375, offset=4.16346875, label='-', audio=None, spectrogram=None),
                Unit(onset=4.80278125, offset=4.814, label='-', audio=None, spectrogram=None),
                Unit(onset=4.908125, offset=4.922875, label='-', audio=None, spectrogram=None),
                Unit(onset=4.9643125, offset=4.992625, label='-', audio=None, spectrogram=None),
                Unit(onset=5.039625, offset=5.0506875, label='-', audio=None, spectrogram=None),
                Unit(onset=5.10165625, offset=5.1385, label='-', audio=None, spectrogram=None),
                Unit(onset=5.146875, offset=5.16203125, label='-', audio=None, spectrogram=None),
                Unit(onset=5.46390625, offset=5.49409375, label='-', audio=None, spectrogram=None),
                Unit(onset=6.14503125, offset=6.1565625, label='-', audio=None, spectrogram=None),
                Unit(onset=6.31003125, offset=6.346125, label='-', audio=None, spectrogram=None),
                Unit(onset=6.38996875, offset=6.4018125, label='-', audio=None, spectrogram=None),
                Unit(onset=6.46053125, offset=6.4796875, label='-', audio=None, spectrogram=None),
                Unit(onset=6.83525, offset=6.8643125, label='-', audio=None, spectrogram=None)], method='segment_song',
         segment_params={'threshold': 1500, 'min_syl_dur': 0.01, 'min_silent_dur': 0.006},
         audio=vocalpy.Sound(data=None, samplerate=None, channels=None), path=tests/data-for-tests/source/audio_cbin_annot_notmat/gy6or6/032312/gy6or6_baseline_230312_0809.141.cbin), spectrogram=None)

A SpectrogramMaker for computing spectrograms

>>> import vocalpy as voc
>>> data_dir = ('tests/data-for-tests/source/audio_wav_annot_birdsongrec/Bird0/Wave/')
>>> wav_paths = voc.paths.from_dir(data_dir, 'wav')
>>> audios = [voc.Sound.read(wav_path) for wav_path in wav_paths]
>>> spect_params = {'fft_size': 512, 'step_size': 64}
>>> spect_maker = voc.SpectrogramMaker(spect_params=spect_params)
>>> spects = spect_maker.make(audios, parallelize=True)

Datasets you flexibly build from pipelines and convert to databases

  • The vocalpy.dataset module contains classes that represent common types of datasets
  • You make these classes with the outputs of your pipelines, e.g. a list of vocalpy.Sequence or vocalpy.Spectrogram instances
  • Because of the design of vocalpy, these datasets capture key metadata from your pipeline:
    • parameters and data provenance details; e.g., what parameters did you use to segment? What audio file did this sequence come from?
  • Then you can save the dataset, along with its metadata, to a database, and later load it back from the database

A SequenceDataset for common analyses of sequences of units

>>> import evfuncs
>>> import vocalpy as voc
>>> data_dir = 'tests/data-for-tests/source/audio_cbin_annot_notmat/gy6or6/032312/'
>>> cbin_paths = voc.paths.from_dir(data_dir, 'cbin')
>>> audios = [voc.Sound.read(cbin_path) for cbin_path in cbin_paths]
>>> segment_params = {
  'threshold': 1500,
  'min_syl_dur': 0.01,
  'min_silent_dur': 0.006,
}
>>> segmenter = voc.Segmenter(
  callback=evfuncs.segment_song,
  segment_params=segment_params
)
>>> seqs = segmenter.segment(audios)
>>> seq_dataset = voc.dataset.SequenceDataset(sequences=seqs)
>>> seq_dataset.to_sqlite(db_name='gy6or6-032312.db', replace=True)
>>> print(seq_dataset)
SequenceDataset(sequences=[Sequence(units=[Unit(onset=2.18934375, offset=2.21, label='-', audio=None, spectrogram=None),
                                           Unit(onset=2.346125, offset=2.373125, label='-', audio=None, spectrogram=None),
                                           Unit(onset=2.50471875, offset=2.51546875, label='-', audio=None, spectrogram=None),
                                           Unit(onset=2.81909375, offset=2.84740625, label='-', audio=None, spectrogram=None),
                                           ...
>>> # test that we can load the dataset
>>> seq_dataset_loaded = voc.dataset.SequenceDataset.from_sqlite(db_name='gy6or6-032312.db')
>>> seq_dataset_loaded == seq_dataset
True

Installation

With pip

$ conda create -n vocalpy python=3.10
$ conda activate vocalpy
$ pip install vocalpy

With conda

$ conda create -n vocalpy python=3.10
$ conda activate vocalpy    
$ conda install vocalpy -c conda-forge

For more detail see Getting Started - Installation

Support

To report a bug or request a feature (such as a new annotation format), please use the issue tracker on GitHub:
https://github.com/vocalpy/vocalpy/issues

To ask a question about vocalpy, discuss its development, or share how you are using it, please start a new topic on the VocalPy forum with the vocalpy tag:
https://forum.vocalpy.org/

Contribute

Code of conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Contributing Guidelines

Below we provide some quick links, but you can learn more about how you can help and give feedback
by reading our Contributing Guide.

To ask a question about vocalpy, discuss its development, or share how you are using it, please start a new "Q&A" topic on the VocalPy forum with the vocalpy tag:
https://forum.vocalpy.org/

To report a bug, or to request a feature, please use the issue tracker on GitHub:
https://github.com/vocalpy/vocalpy/issues

CHANGELOG

You can see project history and work in progress in the CHANGELOG

License

The project is licensed under the BSD license.

Citation

If you use vocalpy, please cite the DOI:

Contributors ✨

Thanks goes to these wonderful people (emoji key):

  • Ralph Emilio Peterson: 🤔 📓 📖 🐛 💻
  • Tetsuo Koyama: 📖

This project follows the all-contributors specification. Contributions of any kind welcome!

Footnotes

  1. For a curated collection, see https://github.com/rhine3/bioacoustics-software.


crowsetta's Issues

add `format2seq_func` parameter to `seq2csv`

so that users can avoid writing their own format2csv function

The argument will be a function such as notmat2seq; if it is not None, then seq2csv will take the seq argument and run it through this format2seq_func, like so:

def seq2csv(seq, ..., format2seq_func=None):
    if format2seq_func is not None:
        seq = format2seq_func(seq)

add `header_segment_map` parameter to `csv2seq` function

so that if a user has a csv whose header differs from the Segment fields, they can just provide a mapping (i.e., a dict) that specifies which header fields (csv columns) correspond to Segment attributes

so with this header
Onsets, Offsets, Filename, SegmentLabel
you'd use

header_segment_map = {
    'Onsets': 'onsets_s',
    'Offsets': 'offsets_s',
    'Filename': 'file',
    'SegmentLabel': 'label'
    }
crowsetta.csv.csv2seq(csv_filename='my.csv', header_segment_map=header_segment_map)

add `from_excel` function / module

mainly as an easier way to get data out of SAP (Sound Analysis Pro)?
Would be a convenience wrapper around csv2seq that knows to use the Excel dialect and look for SAP field names; see the sketch below.
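A rough sketch of what such a wrapper could look like, reusing the header_segment_map idea from the issue above. The default mapping here just copies that example; the SAP column names are placeholders, not verified field names, and it assumes csv2seq grows the header_segment_map parameter proposed above.

import crowsetta

def from_excel(csv_filename, header_segment_map=None):
    """convenience wrapper around csv2seq for spreadsheets exported from SAP (sketch only)"""
    if header_segment_map is None:
        # placeholder mapping; the actual column names SAP writes would need to be checked
        header_segment_map = {
            'Onsets': 'onsets_s',
            'Offsets': 'offsets_s',
            'Filename': 'file',
            'SegmentLabel': 'label',
        }
    return crowsetta.csv.csv2seq(csv_filename=csv_filename,
                                 header_segment_map=header_segment_map)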

make Annotation class

that can have a Stack attribute or a Sequence attribute

mainly because it feels weird and counterintuitive to write

annot : crowsetta.Sequence

in docstrings. No-one will get why an annotation is a Sequence.

should have a mandatory `annot_file` attribute
and optional `audio_file` and `spect_file` attributes
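A minimal sketch of what that class could look like, written with attrs (which crowsetta already uses, as the validator errors elsewhere in this tracker show). The defaults and the seq attribute are assumptions based on the issue text, not a settled API.

import attr

@attr.s
class Annotation:
    """ties an annotation to the file(s) it annotates (proposed; sketch only)"""
    annot_file = attr.ib()               # mandatory: path to the annotation file itself
    seq = attr.ib(default=None)          # a crowsetta.Sequence (or, later, a Stack)
    audio_file = attr.ib(default=None)   # optional: audio file that the annotation annotates
    spect_file = attr.ib(default=None)   # optional: spectrogram file, if annotating arrays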

have Sequence.segments return a "pretty printed" version?

Seems like __repr__ should be something like

Sequence(segments=15)

and then a pretty_print method would give something like

Sequence with 15 segments:
    Segment 1: label='a', onset_Hz=16000, offset_Hz=17500, onset_s=None, offset_s=None, file='0.wav'
    Segment 2: label='b', onset_Hz=18000, offset_Hz=19500, onset_s=None, offset_s=None, file='0.wav'
   ...
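A sketch of that idea; the method name pretty_print and the max_segments cutoff are illustrative, not a settled API, and the Segment attribute names are taken from the example output above.

class Sequence:
    def __init__(self, segments):
        self.segments = segments

    def __repr__(self):
        return f"Sequence(segments={len(self.segments)})"

    def pretty_print(self, max_segments=3):
        """return a human-readable, multi-line summary of this Sequence"""
        lines = [f"Sequence with {len(self.segments)} segments:"]
        for num, seg in enumerate(self.segments[:max_segments], start=1):
            lines.append(
                f"    Segment {num}: label={seg.label!r}, onset_Hz={seg.onset_Hz}, "
                f"offset_Hz={seg.offset_Hz}, onset_s={seg.onset_s}, "
                f"offset_s={seg.offset_s}, file={seg.file!r}"
            )
        if len(self.segments) > max_segments:
            lines.append("    ...")
        return "\n".join(lines)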

rename `Annotation` to `Vocalization`

because it's not really an "annotation"

it's the high-level abstract object that lets us associate an annotation file with the sequence of annotated segments within that file, and the file that the annotation annotates, e.g. an audio file

so it should be something like:
Vocalization, with attributes `sequence`, `annot_path`, and (optionally) `source_path`

allow for user-defined `tiers` for a Segment, like Praat?

Praat allows for multiple user-defined tiers per segment, e.g. "phoneme", "syllable", "word", "sentence".

http://www.fon.hum.uva.nl/praat/manual/Intro_7__Annotation.html

Not sure if that would be easy to add for Crowsetta.
I was thinking it would require the ability to dynamically add attributes to the Segment class, but I guess there could be an optional tiers attribute that's a dict mapping an annotation to each tier for any instance of a Segment.
But even then seq2csv would have to be able to handle mapping these extra tiers. I guess that's not too painful though if we're iterating over Segments anyway. Just would have to make sure all Segments have the same tiers.
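A rough sketch of that optional tiers attribute; the attributes other than label are simplified here, and the dict default is an assumption.

import attr

@attr.s
class Segment:
    label = attr.ib()
    onset_s = attr.ib(default=None)
    offset_s = attr.ib(default=None)
    # Praat-style user-defined tiers, e.g. {'phoneme': 'a', 'word': 'cat'}
    tiers = attr.ib(factory=dict)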

add logo

  • "crowsetta stone" image?
    • in doc/index.rst & README.md
  • maybe also image showing GUI with labeling | Filenames | Sequence objects | csv output

add Stack class

programmatically instantiated attrs class where each attribute is a Sequence.
A Stack is made up of 2 or more Sequences
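One way that could work, sketched with attrs' make_class; the factory-function name and the keyword-based tiers are illustrative, not a settled API.

import attr

def make_stack(**name_to_sequence):
    """make a Stack whose attributes are Sequences, one per named tier (sketch only)"""
    if len(name_to_sequence) < 2:
        raise ValueError("a Stack is made up of 2 or more Sequences")
    Stack = attr.make_class("Stack", list(name_to_sequence.keys()))
    return Stack(**name_to_sequence)

# e.g. stack = make_stack(syllables=syllable_seq, phrases=phrase_seq)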

change default value for `koumura2annot.Wave` parameter

causing vak to crash because the default is written relative to the current working directory, ./Wave

This only works if the user is in the right place.

Instead, the default should be written relative to the Annotation.xml path, which will always be in the parent directory of the Wave directory, unless someone is actually using the same format somewhere outside this dataset, in which case they could specify the correct location with a non-default Wave argument.
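A minimal sketch of that fix; the helper name is made up here, and koumura2annot's actual parameter handling may differ.

from pathlib import Path

def default_wave_dir(annot_file):
    """resolve the Wave directory relative to Annotation.xml instead of the current working directory"""
    return Path(annot_file).parent / 'Wave'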

Add utils module

with utility functions for labels and annotations:
  • for labels, just steal from vak
  • annotation utilities would be, e.g., the duration of all annotations (see the sketch below)
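A sketch of one such annotation utility; it assumes Segments carry onset_s / offset_s in seconds and that an annotation exposes its segments as annot.seq.segments, both of which are assumptions for illustration.

import numpy as np

def duration(annot):
    """total duration in seconds spanned by an annotation's segments (sketch only)"""
    onsets = np.asarray([seg.onset_s for seg in annot.seq.segments])
    offsets = np.asarray([seg.offset_s for seg in annot.seq.segments])
    return float(offsets.max() - onsets.min())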

add formats module, with info about each format in that format's module docstring?

In the spirit of DRY, instead of having a separate dict in the data module,
the top-level docstring for each format's module should have this metadata,
and there should be a formats module that knows how to parse it.

Better if this could be linked with the internal config.ini somehow.

Maybe a Makefile that generates the config.ini?

Or ... each formats module has its own config_dict at the top, and then that gets used through an entry point maybe?

change annot_file / audio_file attributes of annotation to be Path objects

to not get

TypeError: ("'annot_file' must be <class 'str'> (got PosixPath('/home/ildefonso/Documents/repos/coding/birdsong/tweetynet/tests/test_data/mat/llb3_annot_subset.mat') that is a <class 'pathlib.PosixPath'>).", Attribute(name='annot_file', default=NOTHING, validator=<instance_of validator for type <class 'str'>>, repr=True, eq=True, order=True, hash=None, init=True, metadata=mappingproxy({}), type=None, converter=None, kw_only=False), <class 'str'>, PosixPath('/home/ildefonso/Documents/repos/coding/birdsong/tweetynet/tests/test_data/mat/llb3_annot_subset.mat'))
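One possible fix, sketched with an attrs converter so that either a str or a Path is accepted and stored as a Path; this is an illustration of the idea, not necessarily the patch crowsetta ended up with.

import attr
from pathlib import Path

@attr.s
class Annotation:
    annot_file = attr.ib(converter=Path)  # str or Path in, Path stored
    audio_file = attr.ib(converter=attr.converters.optional(Path), default=None)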

fix circular import bug in .formats

  • should import within function show() when called (see the sketch after this list)
  • have a similar function load() that does this and then show() calls load() if formats not loaded
  • and then build these function calls into Transcriber
  • how to test?
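A sketch of the lazy-import idea described in the bullets above; the FORMATS dict and the exact modules imported are assumptions for illustration.

FORMATS = {}

def load():
    """import format modules only on first use, avoiding the circular import"""
    if not FORMATS:
        from crowsetta import koumura, notmat  # imported here, not at module level
        FORMATS.update({'koumura': koumura, 'notmat': notmat})
    return FORMATS

def show():
    """print the names of formats that crowsetta can work with"""
    for name in load():
        print(name)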

add `seqID` attribute to `Segment`

This will get used when one annotation file contains multiple sequences, and/or each sequence does not correspond to one audio file.
E.g., in the Koumura data set there are multiple sequences per audio file.
Similarly, canary song can be annotated by phrase and the user might want to preserve this annotation.

add `unique_labels` function?

def unique_labels(seqs):
    all_labels = [a_seq.labels.tolist() for a_seq in seqs]
    all_labels = [label for label_list in all_labels for label in label_list]
    return set(all_labels)

`koumura2annot` throws an error when annot_file is a Path not a str

Traceback (most recent call last):
  File "/home/art/anaconda3/envs/vak-dev/bin/vak", line 11, in <module>
    load_entry_point('vak', 'console_scripts', 'vak')()
  File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/__main__.py", line 43, in main
    config_file=args.configfile)
  File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/cli/cli.py", line 18, in cli
    prep(toml_path=config_file)
  File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/cli/prep.py", line 162, in prep
    logger=logger)
  File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/io/dataframe.py", line 124, in from_files
    annot_list = scribe.from_file(annot_file=annot_file)
  File "/home/art/anaconda3/envs/vak-dev/lib/python3.6/site-packages/crowsetta/koumura.py", line 53, in koumura2annot
    if not annot_file.endswith('.xml'):
AttributeError: 'PosixPath' object has no attribute 'endswith'
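One straightforward fix, shown as a sketch (the actual patch may differ): coerce annot_file to a Path at the top of koumura2annot and use Path methods instead of str.endswith.

from pathlib import Path

def koumura2annot(annot_file='Annotation.xml'):  # other parameters elided in this sketch
    annot_file = Path(annot_file)  # accept either a str or a Path
    if annot_file.suffix != '.xml':
        raise ValueError(f'annot_file should be an .xml file, but was: {annot_file}')
    # ... parse the xml as before ...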

add csv as a format?

so user of e.g. vak can specify 'csv' as format

this would be a place where `to_annot` would have to return a list of annotations, though (see #54)

have to_annot functions only return single annot

  • less testing required for different returned types
  • clearer what the expected return type is for downstream users -- they won't have to test their code for both an Annot and a list of Annots

users will be able to, e.g., write a list comprehension, so it's not actually that useful to include this extra functionality
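For example, with from_file as it appears in the traceback in the issue above (the annot_files list here is hypothetical):

annots = [scribe.from_file(annot_file=annot_file) for annot_file in annot_files]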

add "why" and "how" at top of docs

  • why:

    • club project for people studying vocalizations
    • tool for munging datasets of vocalizations that have annotated segments
      • so that when working with the dataset, there is no need to be aware of where different files are,
        e.g., the annotation file or files, the audio files, etc.
    • assumes you care about the "segments" part
      • need to include illustration of annotated segments right at top of docs
  • how:

    • Python classes that facilitate representing these datasets
      • a Vocalization that consists of its annotation and the files associated with it
    • end product: a .csv / DataFrame where each row is an annotated segment

make `user_config` less fragile

module crashes with a relative path like ./mymodule.py

`to_csv` and `to_format` have to be `'None'` (if not using them), not `None`, which is annoying to type
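A sketch of one way to handle the relative-path crash; the function and parameter names here are assumptions, not crowsetta's actual user_config API.

from pathlib import Path

def resolve_module_path(module):
    """turn a user-supplied path like './mymodule.py' into an absolute path, failing clearly if it does not exist"""
    module_path = Path(module).expanduser().resolve()
    if not module_path.exists():
        raise FileNotFoundError(f'could not find module specified in config: {module}')
    return module_path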
