commonsmachinery / catalog
License: GNU Affero General Public License v3.0
Most of the Media operations on works in modules/core/lib/work.js
need to get the work to check permissions, and possibly also Work.media
to be able to manipulate the list. The relevant calls to db.Work.findByIdAsync
should add the parameter to select the minimal set of fields to avoid dragging in annotations etc.
Better to list the permission fields as a constant and refer to it later, to make it easy to add fields as we extend the permission system.
The current list of affected functions is:
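Independent of which functions end up on that list, here is a rough sketch of the constant-plus-projection idea. This is Python/pymongo-style pseudocode for illustration only; the field names and the projection helper are assumptions, not the actual work.js/Mongoose code:

```python
# Hypothetical sketch: keep the permission-related fields in one constant
# so new fields only need to be added in one place as the permission
# system grows. Field names are assumptions based on the issue text.
PERMISSION_FIELDS = ['owner', 'alias', 'public']

def permission_projection(extra_fields=()):
    """Build a MongoDB field projection selecting only the permission
    fields plus whatever else the caller needs (e.g. 'media'), so a
    findById doesn't drag in annotations etc."""
    fields = list(PERMISSION_FIELDS) + list(extra_fields)
    return {f: 1 for f in fields}

# e.g. db.works.find_one({'_id': work_id}, permission_projection(['media']))
```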
At present, it seems the load script only deals with ~1.5M works per day, which means that with a hashing rate of at least 3M per day, the load script will never catch up and just get more and more out of sync.
What can we do to speed up loading massively? @artfwo has suggested loading in parallel. @petli, your thoughts on this would be appreciated if you have some bandwidth soon.
When the same URI appears in both an identifier and a locator, it ends up being shown twice (or more!) in search results by URI, since both URIs get stuffed into lookup.uri. Update the API docs to clarify that /lookup/uri can return more than one result for the same work, and that it's up to the client to sort them out.
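Until the docs are updated, a client-side workaround is easy to describe. A minimal sketch of deduplicating /lookup/uri results, assuming each result carries the work id in an 'id' field:

```python
def dedupe_by_work(results):
    """Collapse multiple /lookup/uri hits that point at the same work,
    keeping the first occurrence. A sketch of what a client could do;
    the result shape (an 'id' field per hit) is an assumption."""
    seen = set()
    unique = []
    for result in results:
        if result['id'] not in seen:
            seen.add(result['id'])
            unique.append(result)
    return unique
```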
A lot of added/updated fields don't have to be set:

- Work.updated_by should not be set on first creation; Work.added_by is sufficient.
- Work.updated_at should not be set on first creation; Work.added_at is sufficient.
- Work.annotations.updated_by should not be set when the annotation is added (or updated) by the same user as Work.added_by.
- Work.annotations.updated_at could be skipped if the time is within e.g. 1 minute (or shorter?) of Work.added_at.
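A sketch of those rules as one helper (the function name, flat field shapes and the 1-minute grace period are illustrative, not the actual core module API):

```python
from datetime import datetime, timedelta

GRACE = timedelta(minutes=1)  # "within e.g. 1 minute" from the issue

def update_fields(added_by, added_at, user, now):
    """Return only the updated_* fields that actually need to be set
    on an update, following the rules above: skip updated_by when the
    updating user is the original adder, and skip updated_at when the
    update falls within the grace period of added_at."""
    fields = {}
    if user != added_by:
        fields['updated_by'] = user
    if now - added_at > GRACE:
        fields['updated_at'] = now
    return fields
```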
We've forgotten to update the data package specification now that we've introduced the collection property. Since it's rather essential, we should include it, along with its supported values, in the specification.
Right now, the collectionLink URLs supported by the browser extension are:
Do we need the "value" field in the data package? It feels like redundant information that we might be able to drop. @artfwo ?
Over time, the catalog should scale to be able to include upwards of ½ billion works, and quite a bit of work may be needed to get us there. We can look at this in a few different stages:
We need to scale both mongodb and the hashdb, while retaining lookup speeds on both. Approaches to scale could include ripping auth/frontend work from the catalog to create a lookup-only endpoint API, which is more lightweight than the r/w endpoint API.
config.search.hashDb should change to a URL, to be able to specify e.g. a Kyoto Tycoon server (likely just tcp://host:port). This requires support for connect() in hmsearch-node first:
commonsmachinery/hmsearch-node#1
It would be incredibly useful to have a web page which can perform some basic sanity checks on images and URLs. When I don't get a match that I would expect, or get a wrong match, I usually follow roughly the same sequence of events, and this could probably be automated greatly:
Given two image URLs, one original, and one which you expect should match that original:
a) Calculate the blockhash of both with the JS and C blockhash implementations; signal an error if there is a >4 bit difference between JS and C
b) Look up the original in the DB by hash; signal an error if it's not found (it's not possible to match something against an original which doesn't exist)
c) Calculate the hamming distance between the original and the copy; signal an error if it's >10 bits (they wouldn't match)
d) Look up the copy hash in the DB; signal an error if it's not found (at this point it should be found, if the original is there and the distance is <10 bits)
Most people surf different images, but caching lookups might be worth exploring at some point.
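Steps b)-d) could be sketched roughly like this. The DB lookup here is a plain stand-in callable; real lookups would go through hmsearch, and the function names are illustrative:

```python
def hamming_distance(hash_a, hash_b):
    """Bit difference between two hex-encoded blockhashes."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count('1')

def check_pair(original_hash, copy_hash, lookup):
    """Sketch of sanity-check steps b)-d): `lookup` is a stand-in for a
    DB hash lookup returning a work id or None."""
    if lookup(original_hash) is None:
        return 'error: original not in DB'
    if hamming_distance(original_hash, copy_hash) > 10:
        return 'error: >10 bits apart, would not match'
    if lookup(copy_hash) is None:
        return 'error: copy not found despite close hash'
    return 'ok'
```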
Improve the structure and design of the work views.
Depends on #76.
When updating a work or source, the updated value is returned as an integer while the creation date is sent as a string.
When getting a work, source or post, date values are strings instead of numbers.
Correct me if I'm wrong, but I think this would be better to address from the backend.
Check the K-Samsök API http://www.ksamsok.se/api/ (in Swedish) to validate whether information we need for inclusion in the catalog is delivered through that API, as well as check on the size of the images that can be retrieved.
When the gravatar email is changed in the user profile, the gravatar itself isn't updated on the page or in the header when saving.
MongoDB enforces a limit on index key size, which results in a broken URI index on search.lookups. This problem breaks loading the WMC datapackage, which contains works with long URLs.
A temporary workaround is to set the failIndexKeyTooLong parameter to false.
A possible long-term solution would be to index URL hashes instead.
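The long-term fix could look something like this: index a fixed-length digest of the URI rather than the URI itself, so no key ever exceeds the index key limit. SHA-1 is an arbitrary choice here for illustration:

```python
import hashlib

def uri_index_key(uri):
    """Return a fixed-length key for the URI index. Any long URL maps
    to a 40-character hex digest, safely under MongoDB's index key
    limit; lookups hash the query URI the same way."""
    return hashlib.sha1(uri.encode('utf-8')).hexdigest()
```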
The array of sources is missing the href property, and added_by and source_work aren't expanded into objects. This is currently returned:
"sources": [{
    "id": "53df9e03a814c23c1c06cfc1",
    "source_work": "53df9e03a814c23c1c06cfbe",
    "added_by": "53df9e02a814c23c1c06cf4c",
    "added_at": "2014-08-04T14:51:47.805Z"
}]
But it should look like this:
"sources": [{
    "id": "6e592d7d63613d7321ee5391",
    "href": "https://catalog.elog.io/works/5396e592d7d163613d7321ee/sources/6e592d7d63613d7321ee5391",
    "source_work": {
        "id": "sourceWorkID",
        "href": "https://catalog.elog.io/works/sourceWorkID"
    },
    "added_by": {
        "id": "anotherUserID",
        "href": "https://catalog.elog.io/users/anotherUserID"
    },
    "added_at": "2014-04-14T02:15:15Z"
}]
Some CC licensed works include specific instructions for how they want to be attributed. This is an example:
(check the Attribution metadata field). It would be interesting to see how this kind of information could be expressed using W3C Media Annotations so that it can be picked up and used by Elog.io.
Commit 21b2c0b moves the replacement of about:resource with the newly generated entry URI to the frontend. This only replaces the subject position, but instances of about:resource in the object position should be replaced too.
If the entry URI already exists as a subject, the predicates for the two subjects must be merged.
As part of the API documentation, it would be useful to have code samples in Python and JavaScript, each of which connects to the API, searches for an image by URL and then by hash, and returns pretty-printed JSON or so from it.
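Something along these lines for the Python sample. The base URL follows the examples elsewhere in this tracker, and the query parameter names ('uri', 'hash') are assumptions to verify against the API docs before publishing:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_BASE = 'https://catalog.elog.io'  # assumed base URL

def lookup_request(kind, value, base=API_BASE):
    """Build a request for /lookup/uri or /lookup/blockhash. The
    parameter name per endpoint is an assumption for illustration."""
    param = 'uri' if kind == 'uri' else 'hash'
    url = '%s/lookup/%s?%s' % (base, kind, urlencode({param: value}))
    return Request(url, headers={'Accept': 'application/json'})

def pretty(response_body):
    """Pretty-print a JSON response body."""
    return json.dumps(json.loads(response_body), indent=2, sort_keys=True)

# Usage (makes a network call):
#     print(pretty(urlopen(lookup_request('uri', some_uri)).read()))
```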
The intention in the data model is that Work.alias should be optional. However, when creating works without an alias it is still included in this index, despite the sparse flag:
Work.index({ 'owner.user': 1, 'alias': 1 }, { unique: true, sparse: true });
This results in a duplicate key error when the same user creates another work without an alias. (A sparse compound index only skips documents that are missing all of the indexed fields; since owner.user is always present, works without an alias are still indexed with alias: null.)
The source_work parameter should be expanded into an id/href pair. This is currently returned:
{
    "id": "53df9e03a814c23c1c06cfc1",
    "href": "http://localhost:8004/works/53df9e03a814c23c1c06cfbb/sources/53df9e03a814c23c1c06cfc1",
    "source_work": "53df9e03a814c23c1c06cfbe",
    "added_by": {
        "href": "http://localhost:8004/users/53df9e02a814c23c1c06cf4c",
        "id": "53df9e02a814c23c1c06cf4c"
    },
    "added_at": "2014-08-04T14:51:47.805Z"
}
But it should look like this:
{
    "id": "6e592d7d63613d7321ee5391",
    "href": "https://catalog.elog.io/works/5396e592d7d163613d7321ee/sources/6e592d7d63613d7321ee5391",
    "source_work": {
        "id": "sourceWorkID",
        "href": "https://catalog.elog.io/works/sourceWorkID"
    },
    "added_by": {
        "id": "anotherUserID",
        "href": "https://catalog.elog.io/users/anotherUserID"
    },
    "added_at": "2014-04-14T02:15:15Z"
}
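A sketch of the expansion the endpoint should perform (the base URL follows the examples; the helper itself is illustrative, not the actual frontend code):

```python
CATALOG_BASE = 'https://catalog.elog.io'  # assumed base URL

def expand_source_work(source, base=CATALOG_BASE):
    """Expand a plain source_work id into the id/href pair shown above.
    Returns a new dict; the href pattern follows the JSON examples."""
    expanded = dict(source)
    work_id = expanded['source_work']
    expanded['source_work'] = {
        'id': work_id,
        'href': '%s/works/%s' % (base, work_id),
    }
    return expanded
```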
Running make test goes without errors, but when getting entries (apitest or browser) I am getting:
[2014-05-08 21:56:59,765: ERROR/MainProcess] Task catalog.tasks.query_works_simple[f1b3e533-82b8-4ea2-a5a0-314251b96538] raised unexpected: RuntimeError('Unknown property type resource',)
Traceback (most recent call last):
File "/---/CommonsMachine/catalog/build/backend/local/lib/python2.7/site-packages/celery-3.1.10-py2.7.egg/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/---/CommonsMachine/catalog/build/backend/local/lib/python2.7/site-packages/celery-3.1.10-py2.7.egg/celery/app/trace.py", line 421, in __protected_call__
return self.run(*args, **kwargs)
File "/---/CommonsMachine/catalog/backend/catalog/tasks.py", line 738, in query_works_simple
return store.query_works_simple(user_uri, offset, limit, query)
File "/---/CommonsMachine/catalog/backend/catalog/store.py", line 919, in query_works_simple
results.append(self.get_work(user_uri=user_uri, work_uri=str(work_subject)))
File "/---/CommonsMachine/catalog/backend/catalog/store.py", line 528, in get_work
work = Work.from_model(self._model, work_uri, user_uri)
File "/---/CommonsMachine/catalog/backend/catalog/store.py", line 151, in from_model
raise RuntimeError("Unknown property type %s" % property_type)
RuntimeError: Unknown property type resource
backend task query_works_simple failed: {"status":"FAILURE","traceback":"Traceback (most recent call last):\n File \"/---/CommonsMachine/catalog/build/backend/local/lib/python2.7/site-packages/celery-3.1.10-py2.7.egg/celery/app/trace.py\", line 240, in trace_task\n R = retval = fun(*args, **kwargs)\n File \"/---/CommonsMachine/catalog/build/backend/local/lib/python2.7/site-packages/celery-3.1.10-py2.7.egg/celery/app/trace.py\", line 421, in __protected_call__\n return self.run(*args, **kwargs)\n File \"/---/CommonsMachine/catalog/backend/catalog/tasks.py\", line 738, in query_works_simple\n return store.query_works_simple(user_uri, offset, limit, query)\n File \"/---/CommonsMachine/catalog/backend/catalog/store.py\", line 919, in query_works_simple\n results.append(self.get_work(user_uri=user_uri, work_uri=str(work_subject)))\n File \"/---/CommonsMachine/catalog/backend/catalog/store.py\", line 528, in get_work\n work = Work.from_model(self._model, work_uri, user_uri)\n File \"/---/CommonsMachine/catalog/backend/catalog/store.py\", line 151, in from_model\n raise RuntimeError(\"Unknown property type %s\" % property_type)\nRuntimeError: Unknown property type resource\n","result":{"exc_message":"Unknown property type resource","exc_type":"RuntimeError"},"task_id":"f1b3e533-82b8-4ea2-a5a0-314251b96538","children":[]}
Though "visibility" is more correct, it is very prone to typos and we've had enough bugs caused by that now.
When getting public information (users, orgs, public works, media/sources/annotations of public works) a CORS header should be added to allow access from third-party web pages:
Access-Control-Allow-Origin: *
There's a bug somewhere that makes this sometimes fail with a "bad hash format". It doesn't happen always, but about every third or fourth time:
jonas@silk:~$ time curl -s -H "Accept: application/json" http://catalog.elog.io/lookup/blockhash?hash=fffffffffffffffffffffffff1fff0fff0840000000000000000000000000000|json_pp
{
"message" : "Bad hash format",
"code" : "bad_hash"
}
The RedisLock now creates a new connection to Redis on each lock attempt. This connection should be reused, probably with thread-local storage, unless the Redis class is thread-safe.
frontend/lib/rest.js: validatePaging throws an Error, resulting in an HTTP 500, on invalid page and per_page parameters.
While page/per_page < 1 is nonsense, it should result in a 400 Bad Request instead of a 500 Server Error. (The middleware function can do a res.send(400) instead of next().)
A per_page > maxWorksPerPage should just be forced to the max value, rather than throwing an error. This is preferable, since the paging links are returned and a client can thus recover from getting an unexpected page size, but it cannot easily recover from just getting an error and having to guess what an acceptable page size might be.
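The desired behaviour can be sketched in a few lines. This is Python pseudocode for the Express middleware described above; the max value is illustrative (the real one comes from config):

```python
MAX_WORKS_PER_PAGE = 100  # illustrative; the real maxWorksPerPage is config

def validate_paging(page, per_page):
    """Reject nonsense values with a 400-style error, but clamp an
    oversized per_page to the max instead of erroring, so clients can
    recover via the returned paging links."""
    if page < 1 or per_page < 1:
        raise ValueError('400 Bad Request: page and per_page must be >= 1')
    return page, min(per_page, MAX_WORKS_PER_PAGE)
```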
We need three views in the index/ that can be used to display search results as well as specific works when requested with text/html instead of application/json. This story is about setting up the basic template structure for this with Jade views and Bootstrap styles. This can later be taken over by Jonas to do parts of the graphical styling.
For now, just create these views in the index frontend itself. Later on we'll merge the full catalog frontend with this.
The views are static, with no JavaScript needed.
The following pages should be created:
/works/:workID
/lookup/uri (including paging links for first/prev/next when available)
/lookup/blockhash (including paging links for first/prev/next when available)

When running ./bootstrap.sh it installs the dependencies in the virtual environment, except the Redis and Redland bindings for Python. It prints the following messages:
Redland not installed, downloading...
./bootstrap.sh: 43: ./bootstrap.sh: curl: not found
gzip: stdin: unexpected end of file
tar: Child returned status 1
tar: Error is not recoverable: exiting now
./bootstrap.sh: 44: cd: can't cd to redland-bindings-1.0.16.1
./bootstrap.sh: 45: ./bootstrap.sh: ./configure: not found
./bootstrap.sh: 46: cd: can't cd to python
make: *** No rule to make target `install'. Stop.
Using Ubuntu 12.04 32-bit.
Work.annotations.property.value should be optional. If not present, the populateAnnotation function should set it based on the link or label (as appropriate for the particular annotation). For unknown annotations, no such population would be done.
The commonshasher script can then stop generating value objects.
Should createWorkAnnotation/updateWorkAnnotation even remove the value property, if present and identical to the main link/label field, before saving the object to the database?
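A sketch of the fallback logic (which field feeds value for which annotation type is an assumption here; unknown annotations are left untouched):

```python
def populate_value(annotation):
    """If 'value' is missing, derive it from 'link' or 'label' - a
    stand-in for what populateAnnotation could do. Annotations with
    neither field (unknown types) pass through unchanged."""
    prop = dict(annotation)
    if 'value' not in prop:
        if 'link' in prop:
            prop['value'] = prop['link']
        elif 'label' in prop:
            prop['value'] = prop['label']
    return prop
```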
The current session info (primarily the email address used to log in) is set as attributes on the body element when rendering the web pages. Since the browser uses ETags, it will just get a 304 when requesting e.g. a work page it has previously seen, and use the cached page. This results in the session code picking up the session info from that cached page, even if it is no longer current.
TBD how to solve this. Alternatives:
Requesting the data URI of the complete metadata of a work returns a 503 status. The only data sent with the request is the authentication string (base64).
This is likely a backend problem.
127.0.0.1 - - [Mon, 28 Apr 2014 02:45:20 GMT] "GET /works/239/completeMetadata HTTP/1.1" 503 110 "-" "-"
[2014-04-27 20:45:20,514: ERROR/MainProcess] Task catalog.tasks.get_complete_metadata[2af29cd0-153f-4bcc-b6bc-663868668d3c] raised unexpected: TypeError("in method 'librdf_model_to_string', argument 3 of type 'char const *'",)
Traceback (most recent call last):
File "---/catalog/build/backend/local/lib/python2.7/site-packages/celery-3.1.10-py2.7.egg/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "---/catalog/build/backend/local/lib/python2.7/site-packages/celery-3.1.10-py2.7.egg/celery/app/trace.py", line 421, in __protected_call__
return self.run(*args, **kwargs)
File "---/catalog/backend/catalog/tasks.py", line 701, in get_complete_metadata
return store.get_complete_metadata(user_uri, work_uri, format)
File "---/catalog/backend/catalog/store.py", line 785, in get_complete_metadata
result = temp_model.to_string(name=format, base_uri=None)
File "---/catalog/build/backend/local/lib/python2.7/site-packages/RDF.py", line 1160, in to_string
return Redland.librdf_model_to_string(self._model, rbase_uri, name, mime_type, rtype_uri)
TypeError: in method 'librdf_model_to_string', argument 3 of type 'char const *'
Right now the frontend expects source_work to be a plain ID, but the POST body should look like this:
{ "source_work": { "href": "https://catalog.elog.io/works/321ee5396e592d7d163613d7" } }
or this:
{ "source_work": { "id": "321ee5396e592d7d163613d7" } }
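Parsing the two accepted forms could look like this (a sketch; the id-format check assumes MongoDB ObjectId-style ids, and the helper name is illustrative):

```python
import re

OBJECT_ID = re.compile(r'^[0-9a-f]{24}$')

def source_work_id(body):
    """Accept either {"source_work": {"id": ...}} or
    {"source_work": {"href": ...}}, taking the id from the last path
    component of the href in the latter case."""
    ref = body['source_work']
    work_id = ref.get('id') or ref['href'].rstrip('/').rsplit('/', 1)[-1]
    if not OBJECT_ID.match(work_id):
        raise ValueError('400 Bad Request: invalid source_work reference')
    return work_id
```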
Some websites, like xkcd.com, update very frequently but have a static design, which (as proposed by @jonasob on Twitter) would make it possible (and possibly more efficient) to have one rule to match the name and URL of the image instead of generating a new JSON object thrice a week.
If that (or something similar) was possible in your architecture, that may be something to think about.
Visibility stays the same after update operation. No error is shown in the server log:
input data:
{ visibility: 'public',
state: 'published',
resource: 'http://localhost:8004/works/106',
updated: undefined,
creator: 'http://localhost:8004/users/dev_1' }
output data:
{ updated: '2014-04-28T16:34:37Z',
resource: 'http://localhost:8004/works/106',
creator: 'http://localhost:8004/users/dev_1',
created: '2014-04-28T16:34:33Z',
visibility: 'private',
metadataGraph: {},
state: 'published',
updatedBy: 'http://localhost:8004/users/dev_1',
type: 'Work',
id: 106,
metadata: 'http://localhost:8004/works/106/metadata' }
Status code: 200
GET /works/ID/annotations/ID should render the standard work view, and have the backbone routing make sure that this particular annotation is highlighted.
command.execute/logEvent should check if the event transfer job is keeping up, and if not, throw an exception that results in the frontend returning a 503 to the client.
An assumption has to be made about the event rate, ensuring that, say, 10 minutes of events fit in the capped collection for temporary event storage in core.
This can then be a fairly simple implementation:
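For instance, something along these lines (a sketch, not the actual core module: the 80% threshold is an arbitrary safety margin, and "age of the oldest untransferred event" stands in for however the transfer lag would actually be measured):

```python
def transfer_is_keeping_up(oldest_pending_age_s, capacity_s=600):
    """With a capped collection sized for ~10 minutes (600s) of events,
    consider the transfer job behind once the oldest still-untransferred
    event gets close to being overwritten."""
    return oldest_pending_age_s < 0.8 * capacity_s

def log_event(event, oldest_pending_age_s, store):
    """Refuse new events (frontend turns this into a 503) when the
    transfer job is falling behind; otherwise append to the capped
    collection (a plain list stands in for it here)."""
    if not transfer_is_keeping_up(oldest_pending_age_s):
        raise RuntimeError('503: event transfer job is falling behind')
    store.append(event)
```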
The old frontend views in catalog, and the new views in the index frontend, should be merged using the new bootstrap layout.
Also consider changing from Jade+Backbone+Stickit to Backbone+React, rendering pages both on server and client.
Apiary helpfully points out that Content-Type ought to be application/json; charset=utf-8.