marl / jams Goto Github PK

View Code? Open in Web Editor NEW

178.0 16.0 25.0 30.47 MB

A JSON Annotated Music Specification for Reproducible MIR Research

License: ISC License

Python 99.25% Shell 0.44% Reason 0.19% C++ 0.11%

jams's Introduction

jams

A JSON Annotated Music Specification for Reproducible MIR Research.

Please, refer to documentation for a comprehensive description of JAMS.

What

JAMS is a JSON-based music annotation format.

We provide:

A formal JSON schema for generic annotations
The ability to store multiple annotations per file
Schema definitions for a wide range of annotation types (beats, chords, segments, tags, etc.)
Error detection and validation for annotations
A translation layer to interface with mir eval for evaluating annotations

Why

Music annotations are traditionally provided as plain-text files employing simple formatting schema (comma or tab separated) when possible. However, as the field of MIR has continued to evolve, such annotations have become increasingly complex, and more often custom conventions are employed to represent this information. And, as a result, these custom conventions can be unwieldy and non-trivial to parse and use.

Therefore, JAMS provides a simple, structured, and sustainable approach to representing rich information in a human-readable, language agnostic format. Importantly, JAMS supports the following use-cases:

multiple types annotations
multiple annotations for a given task
rich file level and annotation level metadata

How

This library is offered as a proof-of-concept, demonstrating the promise of a JSON-based schema to meet the needs of the MIR community. To install, clone the repository into a working directory and proceed thusly.

The full documentation can be found here.

Who

To date, the initial JAMS effort has evolved out of internal needs at MARL@NYU, with some great feedback from our friends at LabROSA.

If you want to get involved, do let us know!

Details

JAMS is proposed in the following publication:

[1] Eric J. Humphrey, Justin Salamon, Oriol Nieto, Jon Forsyth, Rachel M. Bittner, and Juan P. Bello, "JAMS: A JSON Annotated Music Specification for Reproducible MIR Research", Proceedings of the 15th International Conference on Music Information Retrieval, 2014.

The JAMS schema and data representation used in the API were overhauled significantly between versions 0.1 (initial proposal) and 0.2 (overhauled), see the following technical report for details:

[2] B. McFee, E. J. Humphrey, O. Nieto, J. Salamon, R. Bittner, J. Forsyth, J. P. Bello, "Pump Up The JAMS: V0.2 And Beyond", Technical report, October 2015.

jams's People

Contributors

Stargazers

Watchers

jams's Issues

jams_to_lab.py

To be good citizens of the MIR world, we should provide a script that can crunch a jams file out to one or more .lab files. (Of course, such a thing shouldn't be necessary, but it will be useful as an intermediate solution until the rest of the universe gets up to speed.)

I'm thinking the following:

$ jams_to_lab.py my_jam.jams prefix

will create a set of lab files of the form

prefix__namespace__index.lab

where index counts off how many annotations of type namespace there are. For example, a SALAMI jams would give

prefix__segment_salami_function__0.lab
prefix__segment_salami_function__1.lab
prefix__segment_salami_upper__0.lab
prefix__segment_salami_upper__1.lab
prefix__segment_salami_lower__0.lab
prefix__segment_salami_lower__1.lab

each lab file would have JAMSy column headers, and comment headers including the file metadata and annotation metadata. (Sandboxes will be omitted, I guess.)

Desired namespaces or annotators may be user-specified on the command line.

jams vs pyjams?

This may be a dumb question, but why is the python module called "pyjams" instead of "jams"? Was there a conflict with an existing module name?

On Datatypes

The original JAMS proposal set out what we dubbed four atomic datatypes: Observation, Event, Range and TimeSeries. Somewhat unfortunately, at least in the sense consistent with chemistry, Event and Range are actually molecules composed of Observation atoms, so we were wrong in this regard. Not super crucial, but worth noting.

During development, we (or at least I) thought, "oh noes, we don't have a Table type!" This raised concerns about rows, headers, dimensionality, and all of these other concerns.

Having conversations with folks at ISMIR about datatypes, it first seemed like we'd really only need two fundamental datatypes: an interval, for sparse phenomena, and a series, for dense data. In theory, everything could be stored as sparse points, but the main argument against it is that logical groups should be kept together, e.g. the frequency values of a pitch contour.

An effort to wrangle a modified schema, along with a validated JSON, can be found at c0dbcc6:

However, spelling it out this way, it's interesting to consider that information can be sparse or dense in both over time and instantaneously:

onset times of various drums are sparse in time and voicing
pitch contours of multiple voices are dense in time, but sparsely voiced
guitar chords are sparse in time, but should be stored as an array of six values
instrument activations are dense in both time and voicing

For what it's worth, MIDI has solved these problems somewhat, by making everything a sparse (numerically coded) event.

One challenge in building a schema around these ideas is whether to strictly assign tasks to types, probably through inheritance, or falling back to namespaces and letting something more powerful than a schema parse the data. It's an appealing idea to perform all validation in the schema, but it every conceivable structure would need to be defined explicitly.

RFC: chord_roman and augmented triads

[summarizing an offline conversation with @ejhumphrey and soliciting feedback from others, eg @jonforsyth @rabitt @justinsalamon ]

The chord_roman namespace is ported over from the loose spec defined in the rock corpus, but constrained to cover only the quality/categorical symbols used in the database.

An issue that popped up in writing documentation #23 is that the spec doesn't quite match the data. Specifically, the spec says that augmented chords are marked with 'a' (eg superstition_tdc), but they seem to also be marked by '+' (eg fast_car_tdc).

Two questions:

Should we eliminate one of these formats to reduce confusion? If so, which one? The resulting trc parser will need to handle translation, but this doesn't seem like too big of a deal. (Note: the original spec allowed for arbitrary strings, which I really would like to avoid.)
Or should we keep both, allowing for redundancy/ambiguity, but keeping the translation simple?

My personal vote is for adhering to the intended spec, dropping '+' and keeping 'a', but I could be persuaded either way.

Validation error should be more informative

A couple of weeks ago, every time I tried to validate a non-valid JAMS file I got an error reporting the exact source of the problem.

However, now I only get this:

...
  File "/home/uri/Projects/jams/jams/pyjams.py", line 1091, in save
    self.validate(strict=strict)
  File "/home/uri/Projects/jams/jams/pyjams.py", line 1124, in validate
    valid &= ann.validate(strict=strict)
  File "/home/uri/Projects/jams/jams/pyjams.py", line 781, in validate
    six.raise_from(SchemaError, invalid)
  File "/usr/local/lib/python2.7/dist-packages/six-1.9.0-py2.7.egg/six.py", line 692, in raise_from
    raise value
jams.exceptions.SchemaError

Is it possible to print out the SchemaError message by default?

Version 2 Design Considerations

Following the presentation and discussion at / after ISMIR2014, a variety of ideas have been flying around regarding next steps and possible improvements to JAMS:

"Group-by-task" versus "explicit typing", or somewhere in between
Namespaces for interpreting the meaning of observations
"One general type with conditional validation" versus "Many specific types with singular validation"
How, if at all, should this interface with linked data?
How, if at all, should this be influenced / consistent with the MPEG-7 specification?
What is the value / goal of readability in the raw JSON?
JAMS Quick Look: "Text editors" versus "specialized tools"

These are the high-level ones, and I'd like this issue to simply sponge topics for posterity; each individual topic should probably be broken out as separate conversations.

namespace corrections in general, and segment_tut

It looks to me like segment_tut's namespace has inherited some typos from the annotations that it's meant to describe. I'm particularly talking about this sort of thing, which looks like an obvious transcription error to me.

The question to you all: should we codify this kind of thing in the namespace definition? Or instead, correct the namespace definition and use it to detect (and correct) errors in the source files?

(I vote for the latter, obviously.)

Publication of jams2

@justinsalamon has repeatedly kicked around the idea of writing a paper to document the changes in jams over the last year. I think this is probably a good idea, if only for publicity.

As for venues, the next obvious one is ismir2015 late breaking session (deadline: 2015-10-26). This would be an extended abstract (2 pages), which clearly doesn't suffice to document all the restructuring.

It would be nice to have a longer document as well, either a tech report or an arxiv preprint, which the late breaking session abstract can refer back to for more details. This means we'd have to crank out the paper by october, which should be plenty doable.

Who's in?

Security flaw in JSON deserialization

This had occurred to me recently, and finally remembered to mention it. There's a pretty big security flaw in JSON deserialization: https://github.com/marl/jams/blob/pandas/pyjams/pyjams.py#L749

For example (don't load):

{
    "object_type": "import os;os.system('rm -rf ~/')"
}

One fix is to register objects in the module with a dictionary, and then deserialize through look-ups in that dictionary; all the functionality you want with none of the SQL-injection style hakz.

JSON support may conflict with other libraries

In providing somewhat seamless JSON integration, we're over-riding the built-in encode/decode methods of the library. However, if any other libraries attempt to do the same thing, one of the two will break as a result.

Rather than globally altering the json module, we should use a context manager to wrap calls to json and only achieve the desired functionality when it's needed.

validation fails hard with jsonschema 2.5.1

Title says it all. This doesn't show up in travis because conda still has jsonschema 2.4.0.

Running the test suite locally, I get the following:

======================================================================
ERROR: jams_test.test_jams_validate_good
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/bmcfee/git/jams/jams/tests/jams_test.py", line 539, in test_jams_validate_good
    j1.validate()
  File "/home/bmcfee/git/jams/jams/core.py", line 1187, in validate
    valid = super(JAMS, self).validate(strict=strict)
  File "/home/bmcfee/git/jams/jams/core.py", line 461, in validate
    jsonschema.validate(self.__json__, schema.JAMS_SCHEMA)
  File "/usr/local/lib/python2.7/dist-packages/jsonschema/validators.py", line 478, in validate
    cls(schema, *args, **kwargs).validate(instance)
  File "/usr/local/lib/python2.7/dist-packages/jsonschema/validators.py", line 122, in validate
    for error in self.iter_errors(*args, **kwargs):
  File "/usr/local/lib/python2.7/dist-packages/jsonschema/validators.py", line 98, in iter_errors
    for error in errors:
  File "/usr/local/lib/python2.7/dist-packages/jsonschema/_validators.py", line 291, in properties_draft4
    schema_path=property,
  File "/usr/local/lib/python2.7/dist-packages/jsonschema/validators.py", line 114, in descend
    for error in self.iter_errors(instance, schema):
  File "/usr/local/lib/python2.7/dist-packages/jsonschema/validators.py", line 98, in iter_errors
    for error in errors:
  File "/usr/local/lib/python2.7/dist-packages/jsonschema/_validators.py", line 42, in items
    for error in validator.descend(item, items, path=index):
  File "/usr/local/lib/python2.7/dist-packages/jsonschema/validators.py", line 114, in descend
    for error in self.iter_errors(instance, schema):
  File "/usr/local/lib/python2.7/dist-packages/jsonschema/validators.py", line 98, in iter_errors
    for error in errors:
  File "/usr/local/lib/python2.7/dist-packages/jsonschema/_validators.py", line 199, in ref
    scope, resolved = validator.resolver.resolve(ref)
  File "/usr/local/lib/python2.7/dist-packages/jsonschema/validators.py", line 336, in resolve
    return url, self._remote_cache(url)
  File "/usr/local/lib/python2.7/dist-packages/functools32/functools32.py", line 400, in wrapper
    result = user_function(*args, **kwds)
  File "/usr/local/lib/python2.7/dist-packages/jsonschema/validators.py", line 346, in resolve_from_url
    raise RefResolutionError(exc)
RefResolutionError: unknown url type:

A schema for collections?

Going back to this comment, we punted on the idea of managing extrinsic data (eg, file paths) explicitly from within a JAMS object. Now that the dust has settled a bit on JAMS schema, I'm wondering if we can come up with a better solution than sandboxing this stuff.

I bring this up because maintaining links between audio content and annotations is still kind of a pain, and I'd prefer to not solve it over and over again.

How do people feel about introducing an interface/schema for managing collections of jamses? At the most basic level, this would provide a simple index of audio content, jams content, and collection-level information. (It might also be useful to index which annotation namespaces are present in each jams file.) This kind of thing can spiral out of control easily, so if we do it, we should keep it tightly scoped.

Schema error when saving .jams file from jams.util.import_lab()

I was trying to save SALAMI .lab files into .jams files but encountered this error.
The code was like this

for lab in labs:
    jam, annotation = jams.util.import_lab('segment_open',lab_path+lab)
    jam.save(dump_path+'SALAMI_'+lab[:-4]+'.jams')

I tried 'segment_salami_function' and 'segment_salami_upper'. Both return the same error. Please let me know if I misunderstood the usage of jams or any pointers. Thanks!!

Error messages:

SchemaError Traceback (most recent call last)
in ()
1 for lab in labs:
2 jam, annotation = jams.util.import_lab('segment_open',lab_path+lab)
----> 3 jam.save(dump_path+'SALAMI_'+lab[:-4]+'.jams')

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/jams-0.2.0-py2.7.egg/jams/core.pyc in save(self, path_or_file, strict, fmt)
1168 """
1169
-> 1170 self.validate(strict=strict)
1171
1172 with _open(path_or_file, mode='w', fmt=fmt) as fdesc:

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/jams-0.2.0-py2.7.egg/jams/core.pyc in validate(self, strict)
1198
1199 '''
-> 1200 valid = super(JAMS, self).validate(strict=strict)
1201
1202 for ann in self.annotations:

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/jams-0.2.0-py2.7.egg/jams/core.pyc in validate(self, strict)
476 except jsonschema.ValidationError as invalid:
477 if strict:
--> 478 raise SchemaError(str(invalid))
479 else:
480 warnings.warn(str(invalid))

SchemaError: None is not of type u'number'

Failed validating u'type' in schema[u'properties'][u'file_metadata'][u'properties'][u'duration']:
{u'minimum': 0.0, u'type': u'number'}

On instance[u'file_metadata'][u'duration']:
None

Python API doesn't validate intervals when created / written.

Caught a bug (I think?) in the Isophonics import, where a start time > end time:

          {
          "start": {
            "value": 188.854
          }, 
          "end": {
            "value": 188.827
          }, 
          "label": {
            "value": "silence"
          }

https://github.com/marl/jams/blob/master/datasets/Isophonics/The%20Beatles/10CD1_-_The_Beatles/CD1_-_04_-_Ob-La-Di%2C_Ob-La-Da.jams#L4126

Not really sure how to fix / hack this. The duration of the file I've got is 188.839, which doesn't match the annotation, nor is it smaller than this final interval.

extensible namespaces

Add functionality to the namespace submodule to support dynamic extensions.

JAMS object not JSON serializable?

I've been trying to update the SALAMI parser to support the current version of JAMS and SALAMI v2.0.

However, I get the following error when trying to write the JAMS file into a new file:

Traceback (most recent call last):
  File "./salami_parser.py", line 211, in <module>
    process(args.in_dir, args.out_dir)
  File "./salami_parser.py", line 189, in process
    os.path.basename(metadata[0]) + ".jams"))
  File "./salami_parser.py", line 170, in create_JAMS
    json.dump(jam, f, indent=2)
  File "/usr/lib/python2.7/json/__init__.py", line 189, in dump
    for chunk in iterable:
  File "/usr/lib/python2.7/json/encoder.py", line 442, in _iterencode
    o = _default(o)
  File "/usr/lib/python2.7/json/encoder.py", line 184, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <JAMS: sandbox, annotations, file_metadata> is not JSON serializable

I believe I might be doing something wrong when creating the new JAMS object. Or maybe this has something to do with #27?

JAMS annotation querying/filtering

It would be useful to have basic query functionality for selecting annotations from a jams object.

The top-level groupings are useful for organizing data into tasks, but provide no mechanism for filtering by namespace. As a concrete example, each SALAMI jam has segment annotations from 3 namespaces (function, upper, and lower), and we should have an interface to pull out only the data relating to one of them.

More generally, one might want to select out multiple namespaces ['salami_upper', 'salami_lower'] or 'segment_*'. This might be most simply accomplished by allowing regular expression queries.

As long as we're allowing querying on namespace, would it make sense to support querying on other fields? Annotator/curator? Metadata? Everything? Where does the madness stop?

It's probably good to implement this both at the level of AnnotationArray first, and then propagate up to JAMS.

Any comments/suggestions before I dive in and start hacking this out?

Namespaces

I'm starting a new issue to manage namespace definition/schema construction.

[Tagging @ejhumphrey @urinieto @justinsalamon @rabitt @jonforsyth ]

To recap the (offline) discussion from this morning, a jams namespace will be used to define the syntax and (some) semantics of the values field for an observation array, as well as the preferred packing format (dense or sparse).

Rather than bake each namespace into the schema directly, we'll take a more modular approach, and allow each namespace to sit in its own json file. This will serve to decouple task semantics from jams syntax, and allow users to extend or modify the collection of namespaces without modifying the schema.

I'm thinking that the namespace files can be loosely grouped by task, and structured in the package as

schema/namespaces
schema/namespaces/chord
schema/namespaces/chord/harte.json
schema/namespaces/tag
schema/namespaces/tag/open.json
schema/namespaces/tag/gtzan.json
schema/namespaces/tag/cal500.json
...

Each namespace object is a dict, keyed by its (unique) identifier. As a simple example, here's the specification for an open vocabulary tag annotation:

{
    "tag_open": {
        "value": {
            "type": "string"
        },
        "dense": false,
        "descrpition": "Open tag vocabularies allow all strings"
    }
}

Namespaces must define the following fields:

value : schema-style type declaration to define the type of the value field
dense : whether to store the observation array as a dense (column-wise) array or sparse (row-wise) array
description : a short description of the namespace.

I've checked in a few examples already, demonstrating various characteristics:

melody_hz
This one allows numeric observations with dense packing
chord_harte
This one allows string observations that validate against the chord-matching regular expression
tag_gtzan
This one only allows strings from a pre-determined vocabulary

To flesh this out, we'll need to accomplish the following:

Specify a reasonable collection of namespaces for each task:
- beat
  - position
- chord
  - harte
  - roman
- ~~[ ] genre~~
- key
- pitch
  - pitch_hz
  - pitch_midi
  - pitch_class
- mood
  - mood_thayer : time-varying positional coordinates in valence-arousal space
- onset
- pattern
- segment
  - isophonics
  - TUT
  - salami_functions
  - salami_upper
  - salami_lower
  - open
- ~~[ ] source~~
- tag
  - gtzan
  - cal500
  - cal10k
  - open
  - ... ?
- lyrics
Implement the namespace management system in pyjams
Start re-translating datasets

JAMS Quick Look

From #2:

JAMS Quick Look: "Text editors" versus "specialized tools"

There are loads of json editors out there, and some even accept schema! Check out http://jeremydorn.com/json-editor/ (https://github.com/jdorn/json-editor) I bet it wouldn't be hard to fork that one to do dynamic namespace validation as well.

So the main question here is how/should we provide convenience tools for (batch)editing jams files (outside the python API)

Namespace for numeric values (eg regression targets)

Just like the mood_thayer namespace, but for data of arbitrary dimension.

The use-cases here are things like latent factor prediction.

[pyjams] discussion: data frame layer?

I've been trying to work with pyjams, and keep running into the same stumbling blocks that we've discussed offline. To summarize:

It's a little too cumbersome to get at the data programmatically. For instance, to interface with mir_eval currently requires a considerable amount of slicing, dicing, and repacking.
Modifying a jams (pyjams) object in memory is similarly difficult. What if I trim a track's audio file and want to modify all time boundaries to match? There's no simple way to do this right now.

After digging around a bit, I think it makes a lot of sense to use pandas DataFrames as an in-memory representation for several observation types. Rather than go into detail of how exactly this might look, I mocked up a notebook illustrating the main concepts with a chord annotation example here.

The key features are:

Hierarchical indexing, so that a field like "start" can have both "value" and "confidence" measurements associated with it
Mixed data type storage
All the pandas goodies (support for querying/slicing, missing data, etc)
Distinguishing time-typed and categorical data from other numeric values. (This bit would have to be enforced by the schema, but I think that's probably okay.)
Easy to slice and reinterpret a frame as a numpy array, allowing easy interaction with mir_eval (among others).

The way I see this playing out is that the json objects would stay pretty much the same on disk (modulo any schema changes we cook up for convenience). However, when serializing/deserializing, certain arrays get translated into pandas dataframes, which is the primary backing store in memory.

For example, a chord annotation record my_jams.chord[0].data, rather than being a list of Range objects, can instead be a single DataFrame which encapsulates all measurements asociated with the annotation. The jams object still retains hierarchy and multi-annotation collections, but each annotation becomes a single object, rather than a list of more atomic types.

If people are on board with this idea, I'd like to propose adopting a couple of conventions to make life easier:

Any measurement value corresponding to time (eg, event or interval boundary markers) should use the timedelta[ns] data type, rather than a float. Aside from making time values easy to find, pandas also provides some nice functionality for working with time series data.
Anything that could be interpreted as a "label" (eg, chord labels, segment ids, pitch class, instrument, etc) should use dtype 'category' rather than 'str' or 'object'.
If a value is not measured, use nan rather than hallucinating data or leaving the record empty. (I'm looking at you, confidence values.)

I realize that this adds yet another dependency and source of complexity to something that's already pretty monstrous, but I think it'll be worth it in the long run. Of course, this is all intended as a long-winded question/request for comments, so feel free to shoot it all down.

RFC: Forking vs expanding the chord_harte namespace?

[started offline in discussion with @ejhumphrey ]

The chord_harte namespace is arguably the most obvious representation for chord data, but we've had to modify it a bit from its original definition. (For instance, we added support for suspended chords.)

It's also currently missing other qualities, eg, 13. Since we've already modified the original namespace, I don't have a problem with adding more options. However, this is somewhat open-ended, and may be misleading down the road.

I see two options here:

Continue adding/correcting the existing chord_harte namespace as needed.
Fork the namespace. Keep chord_harte as originally defined in its paper, and make a separate namespace chord_??? which includes any expansions or corrections we see fit to include. (Another candidate for correction is mixing of flats and sharps in a note specification; harte allows this, but I think we probably don't want it.)

What do people think about these options? On the one hand, it's good to have clarity and solid reference material for what the namespaces do. On the other hand, if we fork it, I suspect that nobody will ever use the original chord_harte namespace since it's a strict subset of valid chord labels.

Remove JamsFrame.factory()

... In favor of a proper constructor. This was always intended as a temporary hack, and now that we have proper tests in place, it should be much quicker to slice it out.

number type in namespace definition allows complex numbers

The number type in namespace definitions accepts complex numbers. This means that this test for an invalid pitch_hz sequence will not raise an exception for all the test cases, because 1j is of type number. Admittedly this is a bit of a marginal case, but do we want to explicitly disallow complex numbers in the namespace definitions? (and can we?)

Duration-agnostic measurements

From #13 @justinsalamon ...

Is there a thread where this is being discussed?

Okay team. What do we want to do about measurements that are intended to span the entire track? Is that a null-duration event, do we explicitly include the track duration?

Arguments for null-duration markup:

easy to recognize
semi-obvious semantics

Arguments for explicit durations:

the annotations are more complete
makes it easy to validate that the audio matches the annotation

Arguments for zero-duration:

no good ones :)
we should keep this convention solely for instantaneous events

A third option might combine the first two: use null-duration for "weak supervision" (ie, the tag applies somewhere in the track but I don't know where), and full-duration for "strong supervision" (ie, this entire track is hip-hop).

spellcheck the docstrings

@ejhumphrey noticed some typos while browsing docs yesterday. We should fix this before pushing the final 0.2.

Namespace converters

Wherever it makes sense, we should have converter scripts to move between namespaces.

This could get pretty messy.

From the current set, the following seem feasible:

(I'm not sure about that last one; is it possible?)

There are some issues to hammer out with regards to additional parameters of the conversion. For instance, pitch_class_to_midi would probably require an octave indicator, unless we just make everything start at C4 or something.

Of course, all of this is up for discussion.

Fix annotations in import_lab?

I am trying to use import_lab, but there are some files in the Isophonics dataset that have empty range annotations (e.g., chordlab/Carole King/06 Way Over Yonder.lab). This causes the JAMS validation to fail.

So, my question is: should we modify import_lab to such that it "fixes" these annotations or should this function not alter the lab at all (and maybe modify the "broken" annotations in the parsers)?

I would go for the former, and maybe raise a warning every time an alteration of the original lab occurs.

Encapsulate exceptions?

Should we provide encapsulated exceptions for jams? For background, read this.

Currently, we throw a mixture of TypeError, RuntimeError, ValueError, and jsonschema.ValidationError, depending on the situation. This makes it difficult to separate exceptions raised by JAMS from those raised in other packages.

Proposed solution:

Define a root exception class JamsError to encapsulate our own exceptions. Then derive subclasses for the various types of exceptions we may encounter:

NamespaceError Incorrect namespace for an annotation, eg, in the eval module
ValidationError as in jsonschema
other types, as needed?

Retro-tag

Could someone go back and tag the repo at the revision corresponding to the ISMIR paper? And maybe even the initial submission, if you can track it down?

These tags can be really helpful when tracking changes between published and current work, especially when the work in question is a proposed standard.

rewrite unit tests

... since the schema's all different, and much of the (python) codebase has been simplified or rewritten, this is probably necessary.

@ejhumphrey Any objection to switching from unittest.TestCase objects (old and busted) over to nosetest functions (new hotness)? This will make tests easier to write, and play nicely with #25 .

Do we still need top-level groupings of annotations?

Do we, @ejhumphrey @justinsalamon @urinieto ?

At this point, they're redundant with the namespace fields.

Eliminating the top-level groupings would simplify the search logic #15 and make the schema automatically extensible for new tasks by adding new schema entries.

What we lose is a little coherence/human readability, but it's still encoded by the namespace field, so maybe it's not so bad?

Vote?

local namespace repository

Dynamically adding custom namespaces on import is a real drag.

Instead, we should allow custom namespaces to be stored in a directory specified by an environment variable or rc file.

JAMS structure: Task-major vs Annotation-major

The initial proposal suggested that the JAMS hierarchy be task-major, as follows:

JAMS object/
    beat/
        annotation0/
            annotation_metadata/
            data/
                event
                ...
    chord/
        ...
    ...

I'm concerned that this isn't very future proof, and presents some practical issues:

Though annotations have been constrained to single "tasks" in the past, I'd contend this is more a function of conventional research methodology than how things should be done going forward. Carving out concept from the start pre-supposes that (a) every annotation you collect should serve an existing task and (b) these observations are somehow independent of each other.
A single annotation containing information for multiple "tasks" must either (a) be divided when writing to file, and all annotation metadata must be duplicated, or (b) written into a separate top-level bucket linked via unique annotation keys. In the former scenario, it becomes possible for the same information to become inconsistent if one of the annotations is modified without the rest. In the latter, data integrity is forced to rely on linking conventions which, in JSON, is somewhat of a hack. Either way, this arguably results in a deficient representation, because you undermine the ability to enforce the schema.

Two alternatives include "Annotation-major, task-minor" or "Annotation-major, task-agnostic".

In "Annotation-major, task-minor", the structure would look like the following:

JAMS object/
    annotation0/
        annotation_metadata/
        beat/
            observations/
                interval
                ...
        chord/
            ...
        ...
    ...

In this case, the entire annotation is kept together, and the data are grouped internally by task. However, what I would love to prevent more than anything is the wonky scenario where "genre" and "tags" and "mood" and "usage" are different buckets. They're all just tags (and, more often than not, unfortunate ones at that). But this is a slippery slope, because, by the same logic, so are chord labels ... and so are structure labels, and beats, and onsets, and, well, just about everything except for melody. And then what's the point of splitting them out? They're all the same datatype with different namespaces / meanings. What's more, I don't know that we want to bake this namespace structure into the format, because it's not all that flexible should the schema need to evolve.

What does seem more flexible, is to introduce namespaces internal to the observations themselves and have either "type-minor" (group by schema type) or "type/task-agnostic" grouping inside of an annotation. Ignoring usability for a moment and focusing solely on forwards / backwards format compatibility, it seems this gives JAMS the best chance to evolve gracefully over a decade.

Arguably, the file format / schema should be focused on being strict today and flexible tomorrow, while the programmatic interface / API should be easy to use, always. But who knows, maybe there's a compromise to be made here.

Namespace required fields

~~[ref #13, #26]~~

The current namespace implementation does not provide a way to indicate whether value or confidence are required fields. This is ultimately namespace-dependent, since beat and onset do not require values, but something like chord_harte certainly does. Without explicit required fields, it's impossible to properly unit-test the namespaces.

In general json-schema, you would indicate required fields in the type definition with the required = [ ... ] field. However, since the namespace schemas aren't complete definitions (only the specs for the value and confidence fields), this isn't directly possible by adding required to the ns schema. We can work around this by adding required to the namespace defs and re-routing it to the correct place from within jams.ns.ns_schema.

Disregard this issue. All observations should have all fields always.

Why does JObject inherit from object instead of dict?

(paging @ejhumphrey )

Legit question, I'm curious if this was a conscious decision, and if so, what the reasons are/were.

Seems like inheriting from dict would let us ditch several one-liners which reimplement (most of) the dict interface?

mir_eval integration

Quick thoughts about making integration with mir_eval simple:

Provide functionality to generate and populate Annotation objects from .lab files
Provide wrapper functions for all mir_eval submodules
- This can be implemented via a handful of transparent decorators
- The decorated evaluators accept two Annotation objects (ref_annotation, est_annotation), does the necessary unpacking into mir_eval array/list format, and returns the result
- We may need multiple decorators: events, labeled events, intervals, and labeled intervals. Possibly more may be necessary depending on the task/metric.
- We may also need to do docstring mangling to change the function signature while retaining the description.

Validation caching

It would be nice if we could somehow checksum and cache expensive operations, such as schema validation.

Once we validate a jams object, we shouldn't need to validate it again if it hasn't changed.

licensing

Everyone's favorite topic!

I was just poking through the code, and noticed what seem to be some ambiguity in licensing, comparing the top-level license to other bits of the code which seem to be GPL (eg, in parsers/*.py).

Maybe it's worth thinking about picking a single license to use across the board?

GPL is probably not a great idea here, given that it's a library that could be embedded in all kinds of things. I would recommend a BSD-style license, having just gone through all this for librosa. See also: http://choosealicense.com/ .

Maybe put it to a vote? Just opening up the discussion here.

MATJAMS?

There was some discussion over lunch about updating, forking, or abandoning the matlab jams implementation. This mostly stemmed from a total lack of interest in developing or maintaining such a thing, and the difficulty of working with json in matlab.

Can we discuss this more thoroughly before committing to a plan?

What do people think about JSONlab? It might not have all the bells and whistles we've grown to love in python, but it might suffice for simple io.

jams.load not working?

I tried to use jams.load on this file and I get the following error:

ParameterError: Unknown JAMS extension format: "jams"

I am pretty sure this file could be loaded with pyjams just a few weeks ago. Is there some problem with this specific JAMS file itself? The function can load properly files like this.

If there's a problem with the file, the error should be more specific about why JAMS can't open it (the error message is clearly wrong, since jams should be a well-known format).

Refactoring schema functionality?

What do people think about moving all schema management functionality into a single submodule? We currently have the ns module for handling dynamic namespace schemata, and the core jams schema is handled directly in pyjams.

Since these two things are not entirely independent -- validation code needs to touch both, for example -- I'm thinking it would be simpler if we move jams.ns to jams.schema and then have all schema loading/manipulation done from there.

The reason to do this comes from the validation method for annotations, which has to supply the schema for a SparseObservation object to the namespace schema constructor. This is the only value ever passed in to ns.schema(), so it's silly to have a parameter and condition checking for it. We can't just hard-wire the current implementation though, since it would introduce a circular dependency in the submodules.

Moving pyjams.__SCHEMA__ to reside in the same submodule fixes the dependency cycle and simplifies some api and internal logic. Also, in case we do decide to include other schemata later on (eg, collections), having a catch-all submodule for schema handling makes a lot of sense.

From the end-user perspective, nothing important should change.

Some of the internals to pyjams will need to change if we do this, but it's simply a matter of pointing to the new location of the schema object. Similarly, tests and docs will have to change, but that's pretty trivial.

Opinions?

JAMS v2 Schema

@ejhumphrey I've been snooping around the v2 schema, and wanted to discuss the series type.

If it has an array of of "values", why should it also have time and duration fields?
is there any way to identify what the series actually represents? i.e. how do I know if this is a time series or an frequency series?

travis-ci

We should set up continuous integration tests for jams. Aside from being generally useful, it will specifically let us test and validate on a variety of different host configurations.

IO compression and alternative serialization stores

JAMS files can get pretty large in plain text. It would be nice to support compression of some kind, eg, by allowing a gzip file handle instead of a filename. This will probably take a little bit of refactoring to do properly, but I see no downside in directly supporting jams.gz (or jamz, if you will) as a format.

While we're at it, what are folks' opinions on generalizing the backend from JSON? I can imagine use-cases where pickle or bson might be preferable. Ideally, this would all be transparent to the user, and all load/save operations would work out of the box.

Arguments against doing this:

Supporting multiple serialization backends might break interoperability
Loss of plaintext interpretability

Arguments in favor:

Greater flexibility for users
Binary formats would be more efficient (on disk) than json

In all cases, we'd still use json schema validation, so functionally nothing would change.

So... thoughts?

Rewrite Parsers

It is time to rewrite the dataset parsers, so that they work again with the current JAMS version.

@bmcfee already (re?)wrote the SMC one, and I'm working on SALAMI and Isophonics.

I'm also gonna assign this to some of you, because my roots come from Spain, and we are used to ************(s) there.

documentation update

we need to update the documentation to match the new version.

JAMS beyond music?

Just opening up a separate thread here (rather than the already bloated #13): is it worth considering designing JAMS to be extensible into domains outside of music/time-series annotation?

I think the general architecture is flexible enough to make this possible with roughly zero overhead, and it might be a good idea.

From what I can tell, all that we'd have to do is restructure the schema a little so that "*Observation" is slightly more generic. We currently define two (arguably redundant) observation types that both encode tuples of (time, duration, value, confidence). It wouldn't be hard to extend this into multiple observation forms, say for images with bounding-box annotations, we would have (x, x_extent, y, y_extent, value, confidence). For video, we would have (x, x_extent, y, y_extent, t, duration, value, confidence), etc.

Within the schema, nothing would really change, except that we change "DenseObservation" to "DenseTimeObservation" (and analogous for Sparse), and then some time down the road, allow other observation schema to be added.

I don't think we need to tackle this for the immediate (next) release, except insofar as we can design to support it in the future in a backwards-compatible way.

Opinions?

Streamlining the API

Still in the interest of minimizing the effort for #26 , I'm looking for ~~corners to cut~~ places in which the API can be streamlined, IMO.

Since this keeps recurring, I figured we should have a dedicated thread.

AnnotationArray.create_annotation() -- construct a new Annotation, append it to the end of an existing AnnotationArray, and return the new annotation.

This is a minor convenience method. Since both the AnnotationArray and the new Annotation will be in the caller's scope afterwards, I don't see why it's necessary. For instance, the following are equivalent:

>>> new_ann = jams.Annotation(namespace='foo', data='bar')
>>> existing_jam.annotations.append(new_ann)

vs.

>>> new_ann = existing_jam.annotations.create_annotation(namespace='foo', data='bar')

I prefer the former because it's entirely explicit about what's created and where. In the latter case, appending to the annotation array is a side-effect of construction, which feels a little messy.

The only case in which it would make a qualitative difference is when the return value of create_annotation is ignored, but you can get the same effect by saying

>>> existing_jam.annotations.append(jams.Annotation(namespace='foo', data='bar'))

which I think is also more clear, and more "pythonic" in that it uses existing API (inherited from list) rather than creating new terminology.

Any counter-arguments?

import_lab crashes

import_lab crashes when it tries to parse something like:

0.432471655 New Point
0.873650793 New Point
1.358004535 1
1.822993197 2
2.287006802 3
2.740000000 4

(this example is copied from the Isophonics beat annotations for the track When I Get Home, from the album A Hard Day's Night, by The Beatles)

Pandas complains about not getting a float, but a string:

  File "./isophonics_parser.py", line 183, in <module>
    process(args.in_dir, args.out_dir)
  File "./isophonics_parser.py", line 128, in process
    jam=jam)
  File "/home/uri/Projects/jams/jams/util.py", line 81, in import_lab
    data.loc[:, 1] -= data[0]
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 182, in f
    result = method(self, other)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 524, in wrapper
    arr = na_op(lvalues, rvalues)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 475, in na_op
    result[mask] = op(x[mask], _values_from_object(y[mask]))
TypeError: unsupported operand type(s) for -: 'str' and 'float'

Should we default to 0 when a char/string is found in the last column?