marl / jams
A JSON Annotated Music Specification for Reproducible MIR Research
License: ISC License
We need to update the documentation to match the new version.
We should set up continuous integration tests for jams. Aside from being generally useful, it will specifically let us test and validate on a variety of different host configurations.
This may be a dumb question, but why is the python module called "pyjams" instead of "jams"? Was there a conflict with an existing module name?
Following the presentation and discussion at / after ISMIR2014, a variety of ideas have been flying around regarding next steps and possible improvements to JAMS:
These are the high-level ones, and I'd like this issue to simply sponge up topics for posterity; each individual topic should probably be broken out as a separate conversation.
@ejhumphrey I've been snooping around the v2 schema, and wanted to discuss the series type.
Could someone go back and tag the repo at the revision corresponding to the ISMIR paper? And maybe even the initial submission, if you can track it down?
These tags can be really helpful when tracking changes between published and current work, especially when the work in question is a proposed standard.
It is time to rewrite the dataset parsers, so that they work again with the current JAMS version.
@bmcfee already (re?)wrote the SMC one, and I'm working on SALAMI and Isophonics.
I'm also gonna assign this to some of you, because my roots come from Spain, and we are used to ************(s) there.
In providing somewhat seamless JSON integration, we're overriding the built-in encode/decode methods of the library. However, if any other libraries attempt to do the same thing, one of the two will break as a result.
Rather than globally altering the json module, we should use a context manager to wrap calls to json and only achieve the desired functionality when it's needed.
(paging @ejhumphrey )
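A minimal sketch of the context-manager idea (the `__json__` hook name and helper name here are assumptions, not the actual pyjams API): patch the encoder's default on entry, restore it on exit, so the global json module is only altered inside the `with` block.

```python
import json
from contextlib import contextmanager


@contextmanager
def jams_encoding():
    """Temporarily install a JAMS-aware default encoder, restoring the
    library's stock behavior on exit."""
    original_default = json.JSONEncoder.default

    def jams_default(self, obj):
        # Objects exposing a __json__ property serialize through it;
        # everything else falls back to the original encoder.
        if hasattr(obj, '__json__'):
            return obj.__json__
        return original_default(self, obj)

    json.JSONEncoder.default = jams_default
    try:
        yield
    finally:
        json.JSONEncoder.default = original_default
```

Inside `with jams_encoding(): json.dumps(my_jam)` the custom behavior applies; outside the block, other libraries see the untouched json module.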
Legit question, I'm curious if this was a conscious decision, and if so, what the reasons are/were.
Seems like inheriting from dict would let us ditch several one-liners which reimplement (most of) the dict interface?
Still in the interest of minimizing the effort for #26, I'm looking for corners to cut: places in which the API can be streamlined, IMO.
Since this keeps recurring, I figured we should have a dedicated thread.
AnnotationArray.create_annotation() -- construct a new Annotation, append it to the end of an existing AnnotationArray, and return the new annotation.
This is a minor convenience method. Since both the AnnotationArray and the new Annotation will be in the caller's scope afterwards, I don't see why it's necessary. For instance, the following are equivalent:
>>> new_ann = jams.Annotation(namespace='foo', data='bar')
>>> existing_jam.annotations.append(new_ann)
vs.
>>> new_ann = existing_jam.annotations.create_annotation(namespace='foo', data='bar')
I prefer the former because it's entirely explicit about what's created and where. In the latter case, appending to the annotation array is a side-effect of construction, which feels a little messy.
The only case in which it would make a qualitative difference is when the return value of create_annotation is ignored, but you can get the same effect by saying
>>> existing_jam.annotations.append(jams.Annotation(namespace='foo', data='bar'))
which I think is also more clear, and more "pythonic" in that it uses existing API (inherited from list) rather than creating new terminology.
Any counter-arguments?
... In favor of a proper constructor. This was always intended as a temporary hack, and now that we have proper tests in place, it should be much quicker to slice it out.
The original JAMS proposal set out what we dubbed four atomic datatypes: Observation, Event, Range and TimeSeries. Somewhat unfortunately, at least if we keep the chemistry analogy consistent, Event and Range are actually molecules composed of Observation atoms, so we were wrong in this regard. Not super crucial, but worth noting.
During development, we (or at least I) thought, "oh noes, we don't have a Table type!" This raised concerns about rows, headers, dimensionality, and all of these other concerns.
Having conversations with folks at ISMIR about datatypes, it first seemed like we'd really only need two fundamental datatypes: an interval, for sparse phenomena, and a series, for dense data. In theory, everything could be stored as sparse points, but the main argument against it is that logical groups should be kept together, e.g. the frequency values of a pitch contour.
An effort to wrangle a modified schema, along with a validated JSON, can be found at c0dbcc6:
However, spelling it out this way, it's interesting to consider that information can be sparse or dense both over time and instantaneously:
For what it's worth, MIDI has solved these problems somewhat, by making everything a sparse (numerically coded) event.
One challenge in building a schema around these ideas is whether to strictly assign tasks to types, probably through inheritance, or to fall back on namespaces and let something more powerful than a schema parse the data. It's an appealing idea to perform all validation in the schema, but then every conceivable structure would need to be defined explicitly.
The current namespace implementation does not provide a way to indicate whether value or confidence are required fields. This is ultimately namespace-dependent, since beat and onset do not require values, but something like chord_harte certainly does. Without explicit required fields, it's impossible to properly unit-test the namespaces.
In general json-schema, you would indicate required fields in the type definition with the required = [ ... ] field. However, since the namespace schemas aren't complete definitions (only the specs for the value and confidence fields), this isn't directly possible by adding required to the ns schema. We can work around this by adding required to the namespace defs and re-routing it to the correct place from within jams.ns.ns_schema.
Disregard this issue. All observations should have all fields always.
Everyone's favorite topic!
I was just poking through the code, and noticed what seems to be some ambiguity in licensing, comparing the top-level license to other bits of the code which appear to be GPL (eg, in parsers/*.py).
Maybe it's worth thinking about picking a single license to use across the board?
GPL is probably not a great idea here, given that it's a library that could be embedded in all kinds of things. I would recommend a BSD-style license, having just gone through all this for librosa. See also: http://choosealicense.com/ .
Maybe put it to a vote? Just opening up the discussion here.
The number type in namespace definitions accepts complex numbers. This means that this test for an invalid pitch_hz sequence will not raise an exception for all the test cases, because 1j is of type number. Admittedly this is a bit of a marginal case, but do we want to explicitly disallow complex numbers in the namespace definitions? (and can we?)
This had occurred to me recently, and finally remembered to mention it. There's a pretty big security flaw in JSON deserialization: https://github.com/marl/jams/blob/pandas/pyjams/pyjams.py#L749
For example (don't load):
{
"object_type": "import os;os.system('rm -rf ~/')"
}
One fix is to register objects in the module with a dictionary, and then deserialize through look-ups in that dictionary; all the functionality you want with none of the SQL-injection style hakz.
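A sketch of the registry idea (all names here are hypothetical, not the actual pyjams internals): object types are looked up in an explicit dict, so an arbitrary string in the JSON payload can never reach eval/exec.

```python
# Registry of deserializable classes, keyed by class name.
OBJECT_REGISTRY = {}


def serializable(cls):
    """Class decorator: register a type as safe to deserialize by name."""
    OBJECT_REGISTRY[cls.__name__] = cls
    return cls


@serializable
class Annotation(object):
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


def deserialize(record):
    """Rebuild an object from a JSON record via registry lookup only."""
    record = dict(record)
    # Unknown object_type strings raise KeyError instead of executing code.
    cls = OBJECT_REGISTRY[record.pop('object_type')]
    return cls(**record)
```

A malicious `object_type` like the one above simply fails the lookup, rather than being interpreted.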
It looks to me like segment_tut's namespace has inherited some typos from the annotations that it's meant to describe. I'm particularly talking about this sort of thing, which looks like an obvious transcription error to me.
The question to you all: should we codify this kind of thing in the namespace definition? Or instead, correct the namespace definition and use it to detect (and correct) errors in the source files?
(I vote for the latter, obviously.)
It would be nice if we could somehow checksum and cache expensive operations, such as schema validation.
Once we validate a jams object, we shouldn't need to validate it again if it hasn't changed.
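One cheap way to do this (a sketch; the function names are made up): hash the canonical JSON serialization of the object, and skip re-validation when the checksum has been seen before.

```python
import hashlib
import json

# Checksums of objects that have already passed validation.
_validation_cache = set()


def validate_cached(jam_dict, validator):
    """Run validator(jam_dict) only if this exact content hasn't
    already been validated, keyed by a sha256 of its canonical
    serialization."""
    key = hashlib.sha256(
        json.dumps(jam_dict, sort_keys=True).encode('utf-8')).hexdigest()
    if key not in _validation_cache:
        validator(jam_dict)  # raises on invalid input
        _validation_cache.add(key)
```

Any mutation of the object changes the checksum, so the expensive validation runs again exactly when it needs to.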
@ejhumphrey noticed some typos while browsing docs yesterday. We should fix this before pushing the final 0.2.
I'm starting a new issue to manage namespace definition/schema construction.
[Tagging @ejhumphrey @urinieto @justinsalamon @rabitt @jonforsyth ]
To recap the (offline) discussion from this morning, a jams namespace will be used to define the syntax and (some) semantics of the values
field for an observation array, as well as the preferred packing format (dense or sparse).
Rather than bake each namespace into the schema directly, we'll take a more modular approach, and allow each namespace to sit in its own json file. This will serve to decouple task semantics from jams syntax, and allow users to extend or modify the collection of namespaces without modifying the schema.
I'm thinking that the namespace files can be loosely grouped by task, and structured in the package as
schema/namespaces
schema/namespaces/chord
schema/namespaces/chord/harte.json
schema/namespaces/tag
schema/namespaces/tag/open.json
schema/namespaces/tag/gtzan.json
schema/namespaces/tag/cal500.json
...
Each namespace object is a dict, keyed by its (unique) identifier. As a simple example, here's the specification for an open vocabulary tag annotation:
{
"tag_open": {
"value": {
"type": "string"
},
"dense": false,
"description": "Open tag vocabularies allow all strings"
}
}
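As a rough illustration of how such a definition could drive validation (a hand-rolled type check standing in for real json-schema validation; names here are illustrative):

```python
# Map JSON-schema primitive type names onto Python types; a real
# implementation would hand the "value" sub-schema to jsonschema instead.
TYPE_MAP = {"string": str, "number": (int, float), "boolean": bool}

NAMESPACES = {
    "tag_open": {
        "value": {"type": "string"},
        "dense": False,
        "description": "Open tag vocabularies allow all strings",
    }
}


def validate_value(ns_key, value):
    """Check an observation's value against its namespace's value spec."""
    expected = TYPE_MAP[NAMESPACES[ns_key]["value"]["type"]]
    if not isinstance(value, expected):
        raise ValueError("%r is not valid for namespace %r" % (value, ns_key))
    return True
```

Because each namespace file only specifies the value (and confidence) fields, the loader is responsible for splicing these sub-schemas into the full observation schema.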
Namespaces must define the following fields:
- value: schema-style type declaration to define the type of the value field
- dense: whether to store the observation array as a dense (column-wise) array or sparse (row-wise) array
- description: a short description of the namespace.
I've checked in a few examples already, demonstrating various characteristics:
To flesh this out, we'll need to accomplish the following:
Going back to this comment, we punted on the idea of managing extrinsic data (eg, file paths) explicitly from within a JAMS object. Now that the dust has settled a bit on JAMS schema, I'm wondering if we can come up with a better solution than sandboxing this stuff.
I bring this up because maintaining links between audio content and annotations is still kind of a pain, and I'd prefer to not solve it over and over again.
How do people feel about introducing an interface/schema for managing collections of jamses? At the most basic level, this would provide a simple index of audio content, jams content, and collection-level information. (It might also be useful to index which annotation namespaces are present in each jams file.) This kind of thing can spiral out of control easily, so if we do it, we should keep it tightly scoped.
Just opening up a separate thread here (rather than the already bloated #13): is it worth considering designing JAMS to be extensible into domains outside of music/time-series annotation?
I think the general architecture is flexible enough to make this possible with roughly zero overhead, and it might be a good idea.
From what I can tell, all that we'd have to do is restructure the schema a little so that "*Observation" is slightly more generic. We currently define two (arguably redundant) observation types that both encode tuples of (time, duration, value, confidence). It wouldn't be hard to extend this into multiple observation forms; say, for images with bounding-box annotations, we would have (x, x_extent, y, y_extent, value, confidence). For video, we would have (x, x_extent, y, y_extent, t, duration, value, confidence), etc.
Within the schema, nothing would really change, except that we change "DenseObservation" to "DenseTimeObservation" (and analogous for Sparse), and then some time down the road, allow other observation schema to be added.
I don't think we need to tackle this for the immediate (next) release, except insofar as we can design to support it in the future in a backwards-compatible way.
Opinions?
From #2:
JAMS Quick Look: "Text editors" versus "specialized tools"
There are loads of json editors out there, and some even accept schema! Check out http://jeremydorn.com/json-editor/ (https://github.com/jdorn/json-editor) I bet it wouldn't be hard to fork that one to do dynamic namespace validation as well.
So the main question here is whether (and how) we should provide convenience tools for (batch) editing jams files outside the python API.
[summarizing an offline conversation with @ejhumphrey and soliciting feedback from others, eg @jonforsyth @rabitt @justinsalamon ]
The chord_roman namespace is ported over from the loose spec defined in the rock corpus, but constrained to cover only the quality/categorical symbols used in the database.
An issue that popped up in writing documentation #23 is that the spec doesn't quite match the data. Specifically, the spec says that augmented chords are marked with 'a' (eg superstition_tdc), but they seem to also be marked by '+' (eg fast_car_tdc).
Two questions:
My personal vote is for adhering to the intended spec, dropping '+' and keeping 'a', but I could be persuaded either way.
Do we, @ejhumphrey @justinsalamon @urinieto ?
At this point, they're redundant with the namespace fields.
Eliminating the top-level groupings would simplify the search logic #15 and make the schema automatically extensible for new tasks by adding new schema entries.
What we lose is a little coherence/human readability, but it's still encoded by the namespace field, so maybe it's not so bad?
Vote?
Caught a bug (I think?) in the Isophonics import, where a start time > end time:
{
"start": {
"value": 188.854
},
"end": {
"value": 188.827
},
"label": {
"value": "silence"
}
Not really sure how to fix / hack this. The duration of the file I've got is 188.839, which doesn't match the annotation, nor is it smaller than this final interval.
I've been trying to update the SALAMI parser to support the current version of JAMS and SALAMI v2.0.
However, I get the following error when trying to write the JAMS file into a new file:
Traceback (most recent call last):
File "./salami_parser.py", line 211, in <module>
process(args.in_dir, args.out_dir)
File "./salami_parser.py", line 189, in process
os.path.basename(metadata[0]) + ".jams"))
File "./salami_parser.py", line 170, in create_JAMS
json.dump(jam, f, indent=2)
File "/usr/lib/python2.7/json/__init__.py", line 189, in dump
for chunk in iterable:
File "/usr/lib/python2.7/json/encoder.py", line 442, in _iterencode
o = _default(o)
File "/usr/lib/python2.7/json/encoder.py", line 184, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <JAMS: sandbox, annotations, file_metadata> is not JSON serializable
I believe I might be doing something wrong when creating the new JAMS object. Or maybe this has something to do with #27?
Wherever it makes sense, we should have converter scripts to move between namespaces.
This could get pretty messy.
From the current set, the following seem feasible:
beat_position -> beat
tag_* -> tag_open
segment_* -> segment_label_open
pitch_hz <-> pitch_midi
pitch_class -> pitch_midi, pitch_hz
chord_roman -> chord_harte (?)
(I'm not sure about that last one; is it possible?)
There are some issues to hammer out with regards to additional parameters of the conversion. For instance, pitch_class_to_midi would probably require an octave indicator, unless we just make everything start at C4 or something.
Of course, all of this is up for discussion.
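The pitch_hz <-> pitch_midi direction, at least, is unambiguous under the standard A4 = 440 Hz = MIDI 69 convention; a sketch (function names are made up):

```python
import math


def pitch_hz_to_midi(f_hz):
    """Frequency in Hz to (fractional) MIDI note number, A4 = 440 Hz = 69."""
    return 69.0 + 12.0 * math.log2(f_hz / 440.0)


def pitch_midi_to_hz(midi):
    """Inverse mapping: MIDI note number back to Hz."""
    return 440.0 * 2.0 ** ((midi - 69.0) / 12.0)
```

Returning fractional MIDI numbers keeps the mapping lossless; a converter that snaps to integer notes would need an explicit rounding policy.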
[started offline in discussion with @ejhumphrey ]
The chord_harte namespace is arguably the most obvious representation for chord data, but we've had to modify it a bit from its original definition. (For instance, we added support for suspended chords.)
It's also currently missing other qualities, eg, 13. Since we've already modified the original namespace, I don't have a problem with adding more options. However, this is somewhat open-ended, and may be misleading down the road.
I see two options here:
1. Expand the chord_harte namespace as needed.
2. Keep chord_harte as originally defined in its paper, and make a separate namespace chord_??? which includes any expansions or corrections we see fit to include. (Another candidate for correction is mixing of flats and sharps in a note specification; harte allows this, but I think we probably don't want it.)
What do people think about these options? On the one hand, it's good to have clarity and solid reference material for what the namespaces do. On the other hand, if we fork it, I suspect that nobody will ever use the original chord_harte namespace since it's a strict subset of valid chord labels.
Add functionality to the namespace submodule to support dynamic extensions.
To be good citizens of the MIR world, we should provide a script that can crunch a jams file out to one or more .lab files. (Of course, such a thing shouldn't be necessary, but it will be useful as an intermediate solution until the rest of the universe gets up to speed.)
I'm thinking the following:
$ jams_to_lab.py my_jam.jams prefix
will create a set of lab files of the form
prefix__namespace__index.lab
where index counts off how many annotations of type namespace there are. For example, a SALAMI jams would give
prefix__segment_salami_function__0.lab
prefix__segment_salami_function__1.lab
prefix__segment_salami_upper__0.lab
prefix__segment_salami_upper__1.lab
prefix__segment_salami_lower__0.lab
prefix__segment_salami_lower__1.lab
Each lab file would have JAMSy column headers, and comment headers including the file metadata and annotation metadata. (Sandboxes will be omitted, I guess.)
Desired namespaces or annotators may be user-specified on the command line.
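The naming scheme above is easy to pin down in code; a sketch (the helper name is made up), where the index simply counts off annotations seen so far in each namespace:

```python
from collections import defaultdict


def lab_filenames(prefix, namespaces):
    """Generate prefix__namespace__index.lab names for a sequence of
    annotation namespaces, counting an index per namespace."""
    counts = defaultdict(int)
    names = []
    for ns in namespaces:
        names.append('{}__{}__{}.lab'.format(prefix, ns, counts[ns]))
        counts[ns] += 1
    return names
```

Feeding in a SALAMI-style list of namespaces reproduces the file layout sketched above.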
Dynamically adding custom namespaces on import is a real drag.
Instead, we should allow custom namespaces to be stored in a directory specified by an environment variable or rc file.
I was trying to save SALAMI .lab files into .jams files but encountered this error.
The code was like this
for lab in labs:
jam, annotation = jams.util.import_lab('segment_open',lab_path+lab)
jam.save(dump_path+'SALAMI_'+lab[:-4]+'.jams')
I tried 'segment_salami_function' and 'segment_salami_upper'. Both return the same error. Please let me know if I misunderstood the usage of jams or any pointers. Thanks!!
Error messages:
SchemaError                               Traceback (most recent call last)
in ()
      1 for lab in labs:
      2     jam, annotation = jams.util.import_lab('segment_open',lab_path+lab)
----> 3     jam.save(dump_path+'SALAMI_'+lab[:-4]+'.jams')

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/jams-0.2.0-py2.7.egg/jams/core.pyc in save(self, path_or_file, strict, fmt)
   1168         """
   1169
-> 1170         self.validate(strict=strict)
   1171
   1172         with _open(path_or_file, mode='w', fmt=fmt) as fdesc:

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/jams-0.2.0-py2.7.egg/jams/core.pyc in validate(self, strict)
   1198
   1199         '''
-> 1200         valid = super(JAMS, self).validate(strict=strict)
   1201
   1202         for ann in self.annotations:

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/jams-0.2.0-py2.7.egg/jams/core.pyc in validate(self, strict)
    476         except jsonschema.ValidationError as invalid:
    477             if strict:
--> 478                 raise SchemaError(str(invalid))
    479             else:
    480                 warnings.warn(str(invalid))

SchemaError: None is not of type u'number'

Failed validating u'type' in schema[u'properties'][u'file_metadata'][u'properties'][u'duration']:
    {u'minimum': 0.0, u'type': u'number'}

On instance[u'file_metadata'][u'duration']:
    None
@justinsalamon has repeatedly kicked around the idea of writing a paper to document the changes in jams over the last year. I think this is probably a good idea, if only for publicity.
As for venues, the next obvious one is ismir2015 late breaking session (deadline: 2015-10-26). This would be an extended abstract (2 pages), which clearly doesn't suffice to document all the restructuring.
It would be nice to have a longer document as well, either a tech report or an arxiv preprint, which the late breaking session abstract can refer back to for more details. This means we'd have to crank out the paper by october, which should be plenty doable.
Who's in?
There was some discussion over lunch about updating, forking, or abandoning the matlab jams implementation. This mostly stemmed from a total lack of interest in developing or maintaining such a thing, and the difficulty of working with json in matlab.
Can we discuss this more thoroughly before committing to a plan?
What do people think about JSONlab? It might not have all the bells and whistles we've grown to love in python, but it might suffice for simple io.
The initial proposal suggested that the JAMS hierarchy be task-major, as follows:
JAMS object/
beat/
annotation0/
annotation_metadata/
data/
event
...
chord/
...
...
I'm concerned that this isn't very future proof, and presents some practical issues:
Two alternatives include "Annotation-major, task-minor" or "Annotation-major, task-agnostic".
In "Annotation-major, task-minor", the structure would look like the following:
JAMS object/
annotation0/
annotation_metadata/
beat/
observations/
interval
...
chord/
...
...
...
In this case, the entire annotation is kept together, and the data are grouped internally by task. However, what I would love to prevent more than anything is the wonky scenario where "genre" and "tags" and "mood" and "usage" are different buckets. They're all just tags (and, more often than not, unfortunate ones at that). But this is a slippery slope, because, by the same logic, so are chord labels ... and so are structure labels, and beats, and onsets, and, well, just about everything except for melody. And then what's the point of splitting them out? They're all the same datatype with different namespaces / meanings. What's more, I don't know that we want to bake this namespace structure into the format, because it's not all that flexible should the schema need to evolve.
What does seem more flexible, is to introduce namespaces internal to the observations themselves and have either "type-minor" (group by schema type) or "type/task-agnostic" grouping inside of an annotation. Ignoring usability for a moment and focusing solely on forwards / backwards format compatibility, it seems this gives JAMS the best chance to evolve gracefully over a decade.
Arguably, the file format / schema should be focused on being strict today and flexible tomorrow, while the programmatic interface / API should be easy to use, always. But who knows, maybe there's a compromise to be made here.
Title says it all. This doesn't show up in travis because conda still has jsonschema 2.4.0.
Running the test suite locally, I get the following:
======================================================================
ERROR: jams_test.test_jams_validate_good
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/home/bmcfee/git/jams/jams/tests/jams_test.py", line 539, in test_jams_validate_good
j1.validate()
File "/home/bmcfee/git/jams/jams/core.py", line 1187, in validate
valid = super(JAMS, self).validate(strict=strict)
File "/home/bmcfee/git/jams/jams/core.py", line 461, in validate
jsonschema.validate(self.__json__, schema.JAMS_SCHEMA)
File "/usr/local/lib/python2.7/dist-packages/jsonschema/validators.py", line 478, in validate
cls(schema, *args, **kwargs).validate(instance)
File "/usr/local/lib/python2.7/dist-packages/jsonschema/validators.py", line 122, in validate
for error in self.iter_errors(*args, **kwargs):
File "/usr/local/lib/python2.7/dist-packages/jsonschema/validators.py", line 98, in iter_errors
for error in errors:
File "/usr/local/lib/python2.7/dist-packages/jsonschema/_validators.py", line 291, in properties_draft4
schema_path=property,
File "/usr/local/lib/python2.7/dist-packages/jsonschema/validators.py", line 114, in descend
for error in self.iter_errors(instance, schema):
File "/usr/local/lib/python2.7/dist-packages/jsonschema/validators.py", line 98, in iter_errors
for error in errors:
File "/usr/local/lib/python2.7/dist-packages/jsonschema/_validators.py", line 42, in items
for error in validator.descend(item, items, path=index):
File "/usr/local/lib/python2.7/dist-packages/jsonschema/validators.py", line 114, in descend
for error in self.iter_errors(instance, schema):
File "/usr/local/lib/python2.7/dist-packages/jsonschema/validators.py", line 98, in iter_errors
for error in errors:
File "/usr/local/lib/python2.7/dist-packages/jsonschema/_validators.py", line 199, in ref
scope, resolved = validator.resolver.resolve(ref)
File "/usr/local/lib/python2.7/dist-packages/jsonschema/validators.py", line 336, in resolve
return url, self._remote_cache(url)
File "/usr/local/lib/python2.7/dist-packages/functools32/functools32.py", line 400, in wrapper
result = user_function(*args, **kwds)
File "/usr/local/lib/python2.7/dist-packages/jsonschema/validators.py", line 346, in resolve_from_url
raise RefResolutionError(exc)
RefResolutionError: unknown url type:
JAMS files can get pretty large in plain text. It would be nice to support compression of some kind, eg, by allowing a gzip file handle instead of a filename. This will probably take a little bit of refactoring to do properly, but I see no downside in directly supporting jams.gz (or jamz, if you will) as a format.
While we're at it, what are folks' opinions on generalizing the backend from JSON? I can imagine use-cases where pickle or bson might be preferable. Ideally, this would all be transparent to the user, and all load/save operations would work out of the box.
Arguments against doing this:
Arguments in favor:
In all cases, we'd still use json schema validation, so functionally nothing would change.
So... thoughts?
I am trying to use import_lab, but there are some files in the Isophonics dataset that have empty range annotations (e.g., chordlab/Carole King/06 Way Over Yonder.lab). This causes the JAMS validation to fail.
So, my question is: should we modify import_lab such that it "fixes" these annotations, or should this function not alter the lab at all (and maybe modify the "broken" annotations in the parsers)?
I would go for the former, and maybe raise a warning every time an alteration of the original lab occurs.
From #13 @justinsalamon ...
Is there a thread where this is being discussed?
Okay team. What do we want to do about measurements that are intended to span the entire track? Is that a null-duration event, do we explicitly include the track duration?
Arguments for null-duration markup:
Arguments for explicit durations:
Arguments for zero-duration:
A third option might combine the first two: use null-duration for "weak supervision" (ie, the tag applies somewhere in the track but I don't know where), and full-duration for "strong supervision" (ie, this entire track is hip-hop).
What do people think about moving all schema management functionality into a single submodule? We currently have the ns module for handling dynamic namespace schemata, and the core jams schema is handled directly in pyjams.
Since these two things are not entirely independent -- validation code needs to touch both, for example -- I'm thinking it would be simpler if we move jams.ns to jams.schema and then have all schema loading/manipulation done from there.
The reason to do this comes from the validation method for annotations, which has to supply the schema for a SparseObservation object to the namespace schema constructor. This is the only value ever passed in to ns.schema(), so it's silly to have a parameter and condition checking for it. We can't just hard-wire the current implementation though, since it would introduce a circular dependency in the submodules.
Moving pyjams.__SCHEMA__ to reside in the same submodule fixes the dependency cycle and simplifies some api and internal logic. Also, in case we do decide to include other schemata later on (eg, collections), having a catch-all submodule for schema handling makes a lot of sense.
From the end-user perspective, nothing important should change.
Some of the internals to pyjams will need to change if we do this, but it's simply a matter of pointing to the new location of the schema object. Similarly, tests and docs will have to change, but that's pretty trivial.
Opinions?
It would be useful to have basic query functionality for selecting annotations from a jams object.
The top-level groupings are useful for organizing data into tasks, but provide no mechanism for filtering by namespace. As a concrete example, each SALAMI jam has segment annotations from 3 namespaces (function, upper, and lower), and we should have an interface to pull out only the data relating to one of them.
More generally, one might want to select out multiple namespaces ['salami_upper', 'salami_lower'] or 'segment_*'. This might be most simply accomplished by allowing regular expression queries.
As long as we're allowing querying on namespace, would it make sense to support querying on other fields? Annotator/curator? Metadata? Everything? Where does the madness stop?
It's probably good to implement this at the level of AnnotationArray first, and then propagate up to JAMS.
Any comments/suggestions before I dive in and start hacking this out?
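A minimal sketch of what regex-based namespace filtering could look like at the AnnotationArray level (annotations modeled here as plain dicts; the function name is made up):

```python
import re


def search_namespace(annotations, pattern):
    """Return annotations whose namespace matches the given regex.

    Uses full-match semantics, so 'segment_.*' catches every
    segmentation namespace while 'beat' matches only 'beat'.
    """
    matcher = re.compile(pattern)
    return [ann for ann in annotations
            if matcher.fullmatch(ann['namespace'])]
```

Extending this to other fields (annotator, metadata) would just mean parameterizing which key the matcher is applied to.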
A couple of weeks ago, every time I tried to validate a non-valid JAMS file I got an error reporting the exact source of the problem.
However, now I only get this:
...
File "/home/uri/Projects/jams/jams/pyjams.py", line 1091, in save
self.validate(strict=strict)
File "/home/uri/Projects/jams/jams/pyjams.py", line 1124, in validate
valid &= ann.validate(strict=strict)
File "/home/uri/Projects/jams/jams/pyjams.py", line 781, in validate
six.raise_from(SchemaError, invalid)
File "/usr/local/lib/python2.7/dist-packages/six-1.9.0-py2.7.egg/six.py", line 692, in raise_from
raise value
jams.exceptions.SchemaError
Is it possible to print out the SchemaError message by default?
Quick thoughts about making integration with mir_eval simple: a wrapper that takes (ref_annotation, est_annotation), does the necessary unpacking into mir_eval array/list format, and returns the result.

I've been trying to work with pyjams, and keep running into the same stumbling blocks that we've discussed offline. To summarize:
After digging around a bit, I think it makes a lot of sense to use pandas DataFrames as an in-memory representation for several observation types. Rather than go into detail of how exactly this might look, I mocked up a notebook illustrating the main concepts with a chord annotation example here.
The key features are:
The way I see this playing out is that the json objects would stay pretty much the same on disk (modulo any schema changes we cook up for convenience). However, when serializing/deserializing, certain arrays get translated into pandas dataframes, which is the primary backing store in memory.
For example, a chord annotation record my_jams.chord[0].data, rather than being a list of Range objects, can instead be a single DataFrame which encapsulates all measurements associated with the annotation. The jams object still retains hierarchy and multi-annotation collections, but each annotation becomes a single object, rather than a list of more atomic types.
If people are on board with this idea, I'd like to propose adopting a couple of conventions to make life easier:
- Time values should use the timedelta[ns] data type, rather than a float. Aside from making time values easy to find, pandas also provides some nice functionality for working with time series data.
- Missing values should be encoded as nan rather than hallucinating data or leaving the record empty. (I'm looking at you, confidence values.)
I realize that this adds yet another dependency and source of complexity to something that's already pretty monstrous, but I think it'll be worth it in the long run. Of course, this is all intended as a long-winded question/request for comments, so feel free to shoot it all down.
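To make the two conventions concrete, here's a toy chord annotation built under those rules (illustrative data only, not a real schema):

```python
import pandas as pd

# One row per observation: times stored as timedelta64[ns], and
# missing confidences left as NaN rather than invented.
chords = pd.DataFrame({
    'time': pd.to_timedelta([0.0, 1.5, 3.2], unit='s'),
    'duration': pd.to_timedelta([1.5, 1.7, 2.1], unit='s'),
    'value': ['N', 'C:maj', 'G:maj'],
    'confidence': [None, 0.9, None],
})

# Timedeltas convert back to float seconds for json serialization:
seconds = chords['time'].dt.total_seconds().tolist()
```

The timedelta dtype makes time columns self-describing, and NaN confidences survive serialization round-trips without pretending to be measurements.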
Just like the mood_thayer namespace, but for data of arbitrary dimension.
The use-cases here are things like latent factor prediction.
I tried to use jams.load on this file and I get the following error:
ParameterError: Unknown JAMS extension format: "jams"
I am pretty sure this file could be loaded with pyjams just a few weeks ago. Is there some problem with this specific JAMS file itself? The function can properly load files like this.
If there's a problem with the file, the error should be more specific about why JAMS can't open it (the error message is clearly wrong, since jams should be a well-known format).
... since the schema's all different, and much of the (python) codebase has been simplified or rewritten, this is probably necessary.
@ejhumphrey Any objection to switching from unittest.TestCase objects (old and busted) over to nosetest functions (new hotness)? This will make tests easier to write, and play nicely with #25 .
import_lab crashes when it tries to parse something like:
0.432471655 New Point
0.873650793 New Point
1.358004535 1
1.822993197 2
2.287006802 3
2.740000000 4
(this example is copied from the Isophonics beat annotations for the track When I Get Home, from the album A Hard Day's Night, by The Beatles)
Pandas complains about getting a string instead of a float:
File "./isophonics_parser.py", line 183, in <module>
process(args.in_dir, args.out_dir)
File "./isophonics_parser.py", line 128, in process
jam=jam)
File "/home/uri/Projects/jams/jams/util.py", line 81, in import_lab
data.loc[:, 1] -= data[0]
File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 182, in f
result = method(self, other)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 524, in wrapper
arr = na_op(lvalues, rvalues)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 475, in na_op
result[mask] = op(x[mask], _values_from_object(y[mask]))
TypeError: unsupported operand type(s) for -: 'str' and 'float'
Should we default to 0 when a char/string is found in the last column?
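One possible workaround is to coerce the label column to numeric and let non-numeric entries become NaN (or a zero default, per the question above). This is a hypothetical sketch, not the real import_lab; it uses tab separation for simplicity, since labels like "New Point" contain spaces:

```python
# Sketch of a workaround: parse a .lab-style file and coerce the trailing
# column so mixed labels like "New Point" don't break numeric arithmetic.
import io
import pandas as pd

lab_text = (
    "0.432471655\tNew Point\n"
    "0.873650793\tNew Point\n"
    "1.358004535\t1\n"
    "1.822993197\t2\n"
)

data = pd.read_csv(io.StringIO(lab_text), sep="\t", header=None)

# Non-numeric labels become NaN; use .fillna(0) if a zero default is preferred
labels = pd.to_numeric(data[1], errors="coerce")

# Arithmetic on the time column is now safe
onsets = data[0] - data[0].iloc[0]
```
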
Should we provide encapsulated exceptions for jams? For background, read this.
Currently, we throw a mixture of TypeError, RuntimeError, ValueError, and jsonschema.ValidationError, depending on the situation. This makes it difficult to separate exceptions raised by JAMS from those raised in other packages.
Proposed solution:
Define a root exception class JamsError to encapsulate our own exceptions. Then derive subclasses for the various types of exceptions we may encounter:

- NamespaceError: incorrect namespace for an annotation, e.g. in the eval module
- ValidationError: as in jsonschema
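A minimal sketch of the proposed hierarchy; the class names follow the issue, but the exact design (and the `load_annotation` helper used to demonstrate it) is hypothetical:

```python
# Sketch: a root JamsError lets callers catch all JAMS-specific
# exceptions with a single handler.
class JamsError(Exception):
    """Root exception for all JAMS-specific errors."""

class NamespaceError(JamsError):
    """Raised for an incorrect annotation namespace, e.g. in the eval module."""

class ValidationError(JamsError):
    """Raised when an object fails schema validation (cf. jsonschema)."""

# Hypothetical caller illustrating the single-handler pattern
def load_annotation(namespace):
    if namespace not in ("chord", "beat"):
        raise NamespaceError("Unknown namespace: {}".format(namespace))
    return namespace
```

With this in place, `except JamsError` cleanly separates JAMS failures from exceptions raised by other packages.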