
catalog's Issues

Fetch minimal set of fields when doing only permissions checks

Most of the Media operations on works in modules/core/lib/work.js need to fetch the work to check permissions, and possibly also Work.media to be able to manipulate the list. The relevant calls to db.Work.findByIdAsync should add a projection parameter selecting the minimal set of fields, to avoid dragging in annotations etc.

Better to list the permission fields in a constant and refer to that later, making them easy to extend as the permission system grows; a sketch follows the function list below.

The currently affected functions are:

  • getWorkMedia
  • createWorkMedia
  • removeMediaFromWork
  • unlinkAllMedia
  • addMediaToWork
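
A minimal sketch of that projection, assuming Mongoose-style field selection; the permission field names here are placeholders, not the actual schema:

// Hypothetical constant listing the fields needed for permission checks.
var PERMISSION_FIELDS = 'owner collabs public media';

db.Work.findByIdAsync(workId, PERMISSION_FIELDS)
    .then(function(work) {
        // work now contains only the selected fields, not annotations etc.
    });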

Speeding up inserts

At present it seems the load script only handles ~1.5M works per day, which means that with a hashing rate of at least 3M per day the load script will never catch up and will just drift further and further out of sync.

What can we do to speed up loading massively? @artfwo has suggested loading in parallel. @petli, we would appreciate your thoughts on this if you have some bandwidth soon.
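
One possible shape for parallel loading, sketched with Bluebird and the MongoDB driver (db is a connected driver handle, batches an array of document arrays; the concurrency value is a placeholder, not a measured recommendation):

var Promise = require('bluebird');

function loadBatch(db, batch) {
    // Unordered inserts let the server process documents within the
    // batch in parallel and continue past individual failures.
    return db.collection('works').insertMany(batch, { ordered: false });
}

// Run several batches concurrently instead of one work at a time.
Promise.map(batches, function(batch) {
    return loadBatch(db, batch);
}, { concurrency: 4 });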

Update API docs for /lookup/uri

When the same URI appears in both an identifier and a locator, it ends up being shown twice (or more!) in the results of a search by URI, since each occurrence gets stuffed into lookup.uri. Update the API docs to clarify that /lookup/uri can return more than one result for the same work, and that it's up to the client to sort them out.

Reduce index size: drop superfluous updated_by/at fields

A lot of the added/updated fields don't have to be set (a sketch follows the list):

  • Work.updated_by should not be set on first creation; Work.added_by is sufficient.
  • Work.updated_at should not be set on first creation; Work.added_at is sufficient.
  • Work.annotations.updated_by should not be set when the annotation is added (or updated) by the same user as Work.added_by.
  • Work.annotations.updated_at could be skipped if the time is within e.g. 1 minute of Work.added_at. (Or shorter?)
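
A minimal sketch of skipping the work-level fields on creation, assuming a Mongoose pre-save hook on the Work schema (the hook itself is an assumption; only the field names come from this issue):

WorkSchema.pre('save', function(next) {
    if (this.isNew) {
        // on first creation, added_by/added_at carry the same information
        this.updated_by = undefined;
        this.updated_at = undefined;
    }
    next();
});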

Do we need value?

Do we need the "value" field in the data package? It feels like redundant information that we could do without. @artfwo ?

Test scaling to ½ billion works

Over time, the catalog should scale to be able to include upwards of ½ billion works, and quite a bit of work may be needed to get us there. We can look at this in a few different stages:

  • 50M works
  • 100M
  • 250M
  • 500M

We need to scale both mongodb and the hashdb while retaining lookup speeds on both. Approaches to scaling could include splitting the auth/frontend work out of the catalog to create a lookup-only endpoint API, which is more lightweight than the read/write endpoint API.

Web based test suite

It would be incredibly useful to have a web page which can perform some basic sanity checks on images and URLs. When I don't get a match that I would expect, or get a wrong match, I usually follow roughly the same sequence of steps, and this could probably be largely automated (see the sketch after the list):

Given two image URLs, one original, and one which you expect should match that original:

a) Calculate the blockhash of both with the JS and the C blockhash implementations; signal an error if there is a >4 bit difference between JS and C.
b) Look up the original in the DB by hash; signal an error if it's not found (it's not possible to match something against an original which doesn't exist).
c) Compute the hamming distance between the original and the copy; signal an error if it is >10 bits (they wouldn't match).
d) Look up the copy hash in the DB; signal an error if it's not found (at this point it should be found, if the original is there and the distance is <10 bits).
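
A sketch of this sequence as code. jsBlockhash, cBlockhash, hammingDistance and lookupByHash are hypothetical helpers standing in for the real implementations:

function checkPair(originalUrl, copyUrl, done) {
    // a) the JS and C implementations should agree to within 4 bits
    var origJs = jsBlockhash(originalUrl);
    var origC = cBlockhash(originalUrl);
    if (hammingDistance(origJs, origC) > 4) {
        return done(new Error('JS and C blockhash differ by >4 bits'));
    }

    // b) the original must already be in the DB
    lookupByHash(origJs, function(err, work) {
        if (err || !work) {
            return done(err || new Error('original not found in DB'));
        }

        // c) original and copy must be close enough to ever match
        var copyHash = jsBlockhash(copyUrl);
        if (hammingDistance(origJs, copyHash) > 10) {
            return done(new Error('>10 bits apart, would not match'));
        }

        // d) looking up the copy should now find the original
        lookupByHash(copyHash, function(err, match) {
            done(err || (match ? null : new Error('copy hash not found')));
        });
    });
}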

Cache hash lookups

Most users will be looking at different images, but caching lookups might be worth exploring at some point.

Incorrect data types on date values (for some requests)

When updating a work or source, the updated value is returned as an integer, while the creation date is sent as a string.

When getting a work, source or post, date values are strings instead of numbers.

Correct me if I'm wrong, but I think this would be better to address in the backend.

K-Samsök

Check the K-Samsök API http://www.ksamsok.se/api/ (in Swedish) to validate whether the information we need for inclusion in the catalog is delivered through that API, as well as to check the size of the images that can be retrieved.

Lookup index fails to build when loading works with long URLs

MongoDB enforces a limit on index key size, which results in a broken URI index on search.lookups. This breaks loading of the WMC datapackage, which contains works with long URLs.

A temporary workaround is to set the failIndexKeyTooLong server parameter to false.

A possible long-term solution would be to index URL hashes instead.
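
The workaround can be applied from the mongo shell, and the long-term fix sketched with a fixed-length digest (sha256 here is an assumption, any stable hash would do):

// Workaround: let documents with too-long keys be skipped by the index
// instead of failing the whole index build.
db.adminCommand({ setParameter: 1, failIndexKeyTooLong: false });

// Possible long-term fix: index a fixed-length hash of the URI instead
// of the URI itself.
var crypto = require('crypto');
var uriHash = crypto.createHash('sha256').update(uri).digest('hex');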

GET /works/X: source objects missing properties

The array of sources is missing the href property, and added_by and source_work aren't expanded into objects. This is currently returned:

"sources" : [{
     "id" : "53df9e03a814c23c1c06cfc1",
     "source_work" : "53df9e03a814c23c1c06cfbe",
     "added_by" : "53df9e02a814c23c1c06cf4c"
     "added_at" : "2014-08-04T14:51:47.805Z",
  }]

But it should look like this:

"sources": [{
        "id": "6e592d7d63613d7321ee5391",
        "href": "https://catalog.elog.io/works/5396e592d7d163613d7321ee/sources/6e592d7d63613d7321ee5391",

        "source_work": {
            "id": "sourceWorkID",
            "href": "https://catalog.elog.io/works/sourceWorkID"
        },

        "added_by": {
            "id": "anotherUserID",
            "href": "https://catalog.elog.io/users/anotherUserID"
        },
        "added_at": "2014-04-14T02:15:15Z"
}],

[frontend] Replace about:resource in the whole graph

Commit 21b2c0b moves the replacement of about:resource with the newly generated entry URI to the frontend. This currently only replaces the subject, but it should also replace instances of about:resource in the object position.

If the entry URI already exists as a subject, the predicates for the two subjects must be merged.
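
A sketch over a plain triple representation (the { subject, predicate, object } shape is an assumption, not the actual frontend graph API):

graph.forEach(function(triple) {
    if (triple.subject === 'about:resource') {
        triple.subject = entryUri;
    }
    // the object position must be replaced too
    if (triple.object === 'about:resource') {
        triple.object = entryUri;
    }
});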

Prepare code samples

As part of the API documentation, it would be useful to have code samples in Python and JavaScript, each of which connects to the API, searches for an image by URL and then by hash, and prints a pretty-printed JSON response.
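
A sketch of what the JavaScript sample could look like (the request module, the host name, and the query parameter name are assumptions; the endpoint follows /lookup/uri above):

var request = require('request');

request({
    url: 'https://catalog.elog.io/lookup/uri',
    qs: { uri: 'http://example.com/image.jpg' },
    json: true
}, function(err, res, body) {
    if (err) { throw err; }
    // pretty-print the JSON response
    console.log(JSON.stringify(body, null, 2));
});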

Works without aliases shouldn't be included in user,alias index

The intention in the data model is that Work.alias should be optional. However, when creating works without an alias, the work is still included in this index despite the sparse flag:

Work.index({ 'owner.user': 1, 'alias': 1 }, { unique: true, sparse: true });

This results in a duplicate key error when the same user creates another work without an alias. (For a compound index, MongoDB considers a document indexable as soon as any of the indexed fields is present, so owner.user alone is enough to include it even with sparse set.)

GET /works/X/sources and GET /works/X/sources/Y: source_work not expanded into object

The source_work parameter should be expanded into an id/href pair. This is currently returned:

{
   "id" : "53df9e03a814c23c1c06cfc1",
   "href" : "http://localhost:8004/works/53df9e03a814c23c1c06cfbb/sources/53df9e03a814c23c1c06cfc1",

   "source_work" : "53df9e03a814c23c1c06cfbe",

   "added_by" : {
      "href" : "http://localhost:8004/users/53df9e02a814c23c1c06cf4c",
      "id" : "53df9e02a814c23c1c06cf4c"
   }
   "added_at" : "2014-08-04T14:51:47.805Z",
}
{
    "id": "6e592d7d63613d7321ee5391",
    "href": "https://catalog.elog.io/works/5396e592d7d163613d7321ee/sources/6e592d7d63613d7321ee5391",

    "source_work": {
        "id": "sourceWorkID",
        "href": "https://catalog.elog.io/works/sourceWorkID"
    },

    "added_by": {
        "id": "anotherUserID",
        "href": "https://catalog.elog.io/users/anotherUserID"
    },
    "added_at": "2014-04-14T02:15:15Z"
}

'Unknown property type resource'

Running make test completes without errors, but when getting entries (via apitest or a browser) I get:

[2014-05-08 21:56:59,765: ERROR/MainProcess] Task catalog.tasks.query_works_simple[f1b3e533-82b8-4ea2-a5a0-314251b96538] raised unexpected: RuntimeError('Unknown property type resource',)
Traceback (most recent call last):
  File "/---/CommonsMachine/catalog/build/backend/local/lib/python2.7/site-packages/celery-3.1.10-py2.7.egg/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/---/CommonsMachine/catalog/build/backend/local/lib/python2.7/site-packages/celery-3.1.10-py2.7.egg/celery/app/trace.py", line 421, in __protected_call__
    return self.run(*args, **kwargs)
  File "/---/CommonsMachine/catalog/backend/catalog/tasks.py", line 738, in query_works_simple
    return store.query_works_simple(user_uri, offset, limit, query)
  File "/---/CommonsMachine/catalog/backend/catalog/store.py", line 919, in query_works_simple
    results.append(self.get_work(user_uri=user_uri, work_uri=str(work_subject)))
  File "/---/CommonsMachine/catalog/backend/catalog/store.py", line 528, in get_work
    work = Work.from_model(self._model, work_uri, user_uri)
  File "/---/CommonsMachine/catalog/backend/catalog/store.py", line 151, in from_model
    raise RuntimeError("Unknown property type %s" % property_type)
RuntimeError: Unknown property type resource
backend task query_works_simple failed: {"status":"FAILURE","traceback":"(same traceback as above)","result":{"exc_message":"Unknown property type resource","exc_type":"RuntimeError"},"task_id":"f1b3e533-82b8-4ea2-a5a0-314251b96538","children":[]}

CORS header

When getting public information (users, orgs, public works, and the media/sources/annotations of public works), a CORS header should be added to allow access from third-party web pages:

Access-Control-Allow-Origin: *
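
A minimal sketch as Express middleware (which routes count as public is an assumption and would need per-route handling):

function allowCors(req, res, next) {
    if (req.method === 'GET') {
        res.set('Access-Control-Allow-Origin', '*');
    }
    next();
}

app.use(allowCors);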

[backend] Reuse Redis connection

The RedisLock currently creates a new connection to Redis on each lock attempt. This connection should be reused, probably via thread-local storage unless the Redis class is thread-safe.

Handle incorrect paging parameters

frontend/lib/rest.js:validatePaging throws an Error, resulting in an HTTP 500, on invalid page and per_page parameters.

While page/per_page < 1 is nonsense, it should result in a 400 Bad Request instead of a 500 Server Error. (The middleware function can do a res.send(400) instead of calling next().)

A per_page > maxWorksPerPage should just be clamped to the max value rather than causing an error. This is preferable because the paging links are returned, so a client can recover from getting an unexpected page size, but it cannot easily recover from getting an error and having to guess what an acceptable page size might be.
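
A sketch of the suggested behaviour (maxWorksPerPage is from the issue; the default values and the req.paging property are assumptions):

function validatePaging(req, res, next) {
    var page = parseInt(req.query.page || '1', 10);
    var perPage = parseInt(req.query.per_page || '25', 10);

    if (isNaN(page) || isNaN(perPage) || page < 1 || perPage < 1) {
        // invalid input from the client, not a server error
        return res.send(400);
    }

    // silently clamp oversized page sizes; the paging links in the
    // response let the client discover the actual page size
    req.paging = { page: page, perPage: Math.min(perPage, maxWorksPerPage) };
    next();
}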

Basic work views

We need three views in index/ that can be used to display search results as well as specific works when they are requested with text/html instead of application/json. This story is about setting up the basic template structure for this with Jade views and Bootstrap styles. This can later be taken over by Jonas for parts of the graphical styling.

For now, just create these views in the index frontend itself. Later on we'll merge the full catalog frontend with this.

The views are static, with no JavaScript needed.

The following pages should be created:

  • /works/:workID
  • /lookup/uri (including paging links for first/prev/next when available)
  • /lookup/blockhash (including paging links for first/prev/next when available)

Redis and Redland bindings not downloading on bootstrap

When running ./bootstrap.sh, it installs the dependencies in the virtual environment except the Redis and Redland bindings for Python, and prints the following messages:

Redland not installed, downloading...
./bootstrap.sh: 43: ./bootstrap.sh: curl: not found

gzip: stdin: unexpected end of file
tar: Child returned status 1
tar: Error is not recoverable: exiting now
./bootstrap.sh: 44: cd: can't cd to redland-bindings-1.0.16.1
./bootstrap.sh: 45: ./bootstrap.sh: ./configure: not found
./bootstrap.sh: 46: cd: can't cd to python
make: *** No rule to make target `install'.  Stop.

Using Ubuntu 12.04 32-bit. (Judging from the log, the immediate cause is that curl is not installed, so the download fails and the subsequent tar/configure steps break.)

Reduce index size: don't store duplicated property value

Work.annotations.property.value should be optional. If it is not present, the populateAnnotation function should set it based on the link or label (as appropriate for the particular annotation). For unknown annotations no such population would be done.

The commonshasher script can then stop generating value objects.

Should createWorkAnnotation/updateWorkAnnotation even remove the value property, if it is present and identical to the main link/label field, before saving the object to the database?
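
A sketch of the suggested fallback (populateAnnotation is named in the issue; the property layout is an assumption):

function populateAnnotation(annotation) {
    var prop = annotation.property;

    if (prop && !prop.value) {
        // fall back to the link or label, whichever the annotation uses
        prop.value = prop.link || prop.label;
    }

    return annotation;
}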

Session handling in catalog frontend doesn't work with caches

The current session info (primarily the email address used to log in) is set as attributes on the body element when rendering the web pages. Since the browser uses ETags, it will just get a 304 when requesting e.g. a work page it has previously seen, and use the cached page. This results in the session code picking up the session info from that cached page, even if it is no longer current.

TBD how to solve this. Alternatives:

  • Only use ETags for REST requests, not for web pages. The mechanism is mainly in place so that clients can make non-conflicting updates. A full catalog web site would have more dynamic parts anyway, in which case aggressive caching doesn't make sense (unless the dynamic parts are all added with javascript).
  • Include session info in the ETag, instead of basing it only on the object ID and version. This should be the email or session id, plus the gravatar hash (so the page is re-rendered if that has changed); see the sketch after this list.
  • Don't store session info in the DOM, but have the client javascript fetch it from the cookies.
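
A sketch of the second alternative (the field names on work and session are assumptions):

var crypto = require('crypto');

function workEtag(work, session) {
    // rerender when the object, the session, or the gravatar changes
    return '"' + crypto.createHash('md5')
        .update(work.id + ':' + work.version + ':' +
                session.email + ':' + session.gravatarHash)
        .digest('hex') + '"';
}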

Getting work complete metadata returns 503 status

Requesting the data URI of the complete metadata of a work returns a 503 status. The only data sent with the request is the authentication string (base64).

This is likely a backend problem.

127.0.0.1 - - [Mon, 28 Apr 2014 02:45:20 GMT] "GET /works/239/completeMetadata HTTP/1.1" 503 110 "-" "-"
[2014-04-27 20:45:20,514: ERROR/MainProcess] Task catalog.tasks.get_complete_metadata[2af29cd0-153f-4bcc-b6bc-663868668d3c] raised unexpected: TypeError("in method 'librdf_model_to_string', argument 3 of type 'char const *'",)
Traceback (most recent call last):
  File "---/catalog/build/backend/local/lib/python2.7/site-packages/celery-3.1.10-py2.7.egg/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "---/catalog/build/backend/local/lib/python2.7/site-packages/celery-3.1.10-py2.7.egg/celery/app/trace.py", line 421, in __protected_call__
    return self.run(*args, **kwargs)
  File "---/catalog/backend/catalog/tasks.py", line 701, in get_complete_metadata
    return store.get_complete_metadata(user_uri, work_uri, format)
  File "---/catalog/backend/catalog/store.py", line 785, in get_complete_metadata
    result = temp_model.to_string(name=format, base_uri=None)
  File "---/catalog/build/backend/local/lib/python2.7/site-packages/RDF.py", line 1160, in to_string
    return Redland.librdf_model_to_string(self._model, rbase_uri, name, mime_type, rtype_uri)
TypeError: in method 'librdf_model_to_string', argument 3 of type 'char const *'

POST /works/X/sources: source_work should be object

Right now the frontend expects source_work to be a plain ID, but the POST body should look like this:

{ "source_work": { "href": "https://catalog.elog.io/works/321ee5396e592d7d163613d7" } }

or this:

{ "source_work": { "id": "321ee5396e592d7d163613d7" } }

RegExp matching

Some websites, like xkcd.com, update very frequently but have a static design, which (as proposed by @jonasob on Twitter) would make it possible (and probably more efficient) to have one rule matching the name and URL of the image, instead of generating a new JSON object thrice a week.

If that (or something similar) were possible in your architecture, it may be something to think about.
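
A purely hypothetical sketch of what such a rule could look like; none of these field names exist in the catalog today:

{
    "pattern": "^https?://imgs\\.xkcd\\.com/comics/.+\\.png$",
    "annotations": {
        "title": "<taken from the matched page>",
        "creator": "xkcd.com"
    }
}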

work visibility isn't being updated

Visibility stays the same after the update operation, and no error is shown in the server log:

input data:

{ visibility: 'public',
  state: 'published',
  resource: 'http://localhost:8004/works/106',
  updated: undefined,
  creator: 'http://localhost:8004/users/dev_1' }

output data:

{ updated: '2014-04-28T16:34:37Z',
  resource: 'http://localhost:8004/works/106',
  creator: 'http://localhost:8004/users/dev_1',
  created: '2014-04-28T16:34:33Z',
  visibility: 'private',
  metadataGraph: {},
  state: 'published',
  updatedBy: 'http://localhost:8004/users/dev_1',
  type: 'Work',
  id: 106,
  metadata: 'http://localhost:8004/works/106/metadata' }

Status code: 200

Return 503 if the event transfer job isn't keeping up with logged events

command.execute/logEvent should check whether the event transfer job is keeping up, and if not, throw an exception that results in the frontend returning a 503 to the client.

An assumption has to be made about the event rate, and we should then ensure that, say, 10 minutes of events fit in the capped collection used for temporary event storage in core.

This can then be a fairly simple implementation (a sketch follows the list):

  • logEvent needs to keep track of the timestamp of the last event logged in the capped collection
  • If it checked on the event transfer within the last N minutes (e.g. 2), it just logs the event
  • If it has been more than N minutes, it checks the timestamp of the last event in the long-term event collection
    • If that timestamp is > M minutes (e.g. 5 minutes) older than the previously logged event, it rejects the command to return a 503.
      • It remembers this status for K minutes (e.g. 1) and rejects all commands during this period, before next checking whether the log transfer is healthy again.
    • Otherwise all is good for another N minutes
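
A sketch of this logic with the example values N=2, M=5, K=1 (all names are assumptions; timestamps are in milliseconds):

var MIN = 60 * 1000;
var lastCheckedAt = 0;   // when we last compared the two collections
var rejectUntil = 0;     // non-zero while we are rejecting commands

function assertTransferHealthy(now, cappedLastEvent, longTermLastEvent) {
    if (now < rejectUntil) {
        throw new Error('event transfer lagging: return 503');
    }

    if (now - lastCheckedAt < 2 * MIN) {
        return;  // checked recently enough, just log the event (N = 2)
    }
    lastCheckedAt = now;

    // compare the last long-term event with the previously logged event
    if (cappedLastEvent - longTermLastEvent > 5 * MIN) {  // M = 5
        rejectUntil = now + 1 * MIN;  // remember the status for K = 1
        throw new Error('event transfer lagging: return 503');
    }
}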

Merge frontend views

The old frontend views in catalog, and the new views in the index frontend, should be merged using the new bootstrap layout.

Also consider changing from Jade+Backbone+Stickit to Backbone+React, rendering pages both on server and client.
