tdt / core
Transform any dataset into an HTTP API with The DataTank
Home Page: http://thedatatank.com
Currently null values are parsed as the string "null" for internal simplicity (as far as I know). However, when performing a comparison (e.g. x > 200) where x = "null", it will return true, because it is a string. This needs to be fixed.
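A possible fix is to normalize these values before any comparison runs. A minimal sketch, assuming the filter logic can call a small helper first (function names are illustrative, not the actual tdt/core code):

function normalizeValue($value) {
    // Treat the literal string "null" as a real null before filtering.
    return (is_string($value) && strtolower($value) === 'null') ? null : $value;
}

function passesGreaterThan($value, $threshold) {
    $value = normalizeValue($value);
    if ($value === null) {
        return false; // null never satisfies a numeric comparison
    }
    return (float) $value > $threshold;
}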
Notice: Undefined property: stdClass::$Tdtadmin in tdt/core/formatters/xmlFormatter.php on line 44
Notice: Undefined property: stdClass::$Tdtadmin in tdt/core/formatters/xmlFormatter.php on line 46
Recommended file permissions:
See comments here: 88b5f9b#commitcomment-2766416
At this moment, you can PUT your {package}/{resource} to TDTAdmin/Resources, both in www-form encoding and in JSON. When requesting TDTAdmin/Resources/{package}/{resource}.json, however, it returns a totally different JSON document which cannot be used to PUT again.
Make sure the JSON it GETs has a structure similar to the JSON that can be PUT.
Currently REST parameters are handled numerically so that a REST parameter can be used to filter out a certain entry. However the index doesn't have to be numeric, it can also be a unique identifier (by defining a PK). This needs to be implemented.
When we want RDF output, we need to have the data enriched with semantics. If a normal formatter would be used, the resulting format would be really complicated to parse.
As a model for these objects we will use ARC graphs. When triples are kept in memory, and a regular formatter is chosen, we will need a special function to print this ARC model. It should first check whether the object is a normal PHP object, or a special ARC graph object and print it accordingly.
At this moment we have an abstract class for formatters (json, xml, csv...) which you can find here:
https://github.com/tdt/formatters/tree/master/src/tdt/formatters
When an ARC model is inserted, we now need to have a different approach to serializing this data, thus a different function.
For semantic output (rdf/xml, rdf/json, turtle, n3...) I would suggest reusing the serializers in ARC.
https://github.com/semsol/arc2/tree/master/serializers
For every strategy we write in tdt/core, we thus need to know whether the data that is retrieved is a graph, or whether it is a standard PHP object model. Maybe it would be interesting to have an RDFStrategy abstract class to implement instead of AStrategy?
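A rough sketch of that dispatch, assuming the semantic strategies hand over triples in the array structure produced by ARC2 parsers (function names are illustrative, not the actual tdt/core API; arc2/ARC2.php is assumed to be loaded):

function serializeData($data, $format = 'turtle') {
    // Is this an ARC2 triple set (an array of s/p/o arrays) or a plain PHP object tree?
    $isTripleSet = is_array($data) && isset($data[0]['s'], $data[0]['p'], $data[0]['o']);

    if ($isTripleSet) {
        // Reuse ARC2's own serializers for the semantic formats.
        $serializer = ($format === 'rdfxml')
            ? ARC2::getRDFXMLSerializer()
            : ARC2::getTurtleSerializer();
        return $serializer->getSerializedTriples($data);
    }

    // Otherwise fall back to a regular formatter (JSON shown as an example).
    return json_encode($data);
}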
Currently, if a subresource is passed via the REST parameters, there's no change in the construct query. This needs to be a filter at extraction time, not at post-processing time.
When your TDT got installed and you go to the first page, a README visualisation is shown explaining how to install The DataTank. The text is a copy of the README in tdt/start.
The first page should be a page which helps the data consumer further. It should explain how to use everything, where to find the documentation when more questions come up (http://thedatatank.com/help/category/consuming/) and it should list all the resources installed in this tdt/core in a fancy way.
First declare a namespace before using other namespaces! NOTE: some editors, like NetBeans, will automatically import namespace usages and put them in the first lines of the PHP file, even though a namespace has been declared.
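A minimal example of the correct order (namespace and class names are illustrative only):

<?php
namespace tdt\core\formatters;   // the file's own namespace comes first

use tdt\core\utility\Request;    // hypothetical import, for illustration

class ExampleFormatter
{
    // ...
}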
Currently meta-data parameters are put in the parameters section, along with optional parameters of a certain type of resource. This is correct, but perhaps we should find a way to show which parameters are optional parameters, and which are meta-data parameters.
Proposed solution:
Next to documentation, parameters, and requiredparameters, add meta_data as a section? If this is the case, the documentation will have to be updated!
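For example, the definition returned for a resource could then be split up like this (field names and values are only a suggestion):

{
    "documentation": "...",
    "parameters": { "...": "..." },
    "requiredparameters": { "...": "..." },
    "meta_data": { "example_uri": "...", "license": "..." }
}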
Support XML again as a resource strategy; it is used quite a bit, but not yet supported.
It would be nice to have a SPECTQL interface on a CSV file. Clauses such as select, where,... can be implemented before the object is created from the file and then given to the SPECTQL tree, or normal Read Controller.
Implement paging in the LD strategy.
Currently only example_uri is displayed as meta-data in the tdtinfo/resources-resource. This should be expanded to all meta-data.
Re-implement the GUI for creating SPECTQL queries with your current resources.
Problem:
When using a SPARQL resource, parameters can only be added in a single value way. An array of values should be supplied, and this should be spun out in the query as a boolean combination. Options are (GET param):
· spontaneous: ?x=1&x=2&x=3
· formatted: ?x=[1,2,3]
· spun-out: ?x.size=3&x.0=1&x.1=2&x.2=3
Solution:
Look into SPARQLscript or automatically parse and alter the sparql query.
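If the query is rewritten automatically, the rewrite could be as simple as appending an OR'ed FILTER built from the supplied values. A hedged sketch, not the actual strategy code:

function buildFilter($variable, array $values) {
    $clauses = array();
    foreach ($values as $value) {
        // Compare the SPARQL variable against each supplied value.
        $clauses[] = sprintf('?%s = "%s"', $variable, addslashes($value));
    }
    return 'FILTER (' . implode(' || ', $clauses) . ')';
}

// e.g. buildFilter('x', array(1, 2, 3)) gives: FILTER (?x = "1" || ?x = "2" || ?x = "3")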
When the KML formatter is run on a graph, it doesn't convert all the geo:location data into KML documents.
Query the RDF loaded in memory for all the locations and add them to the KML output when the input is an ARC2 Graph.
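A sketch of that conversion, assuming the graph is available as an ARC2 triple array and the coordinates use the W3C geo (wgs84_pos) vocabulary; the function name and output details are illustrative:

function triplesToKml(array $triples) {
    $geo = 'http://www.w3.org/2003/01/geo/wgs84_pos#';
    $coords = array(); // subject URI => array('lat' => ..., 'long' => ...)

    foreach ($triples as $t) {
        if ($t['p'] === $geo . 'lat') {
            $coords[$t['s']]['lat'] = $t['o'];
        } elseif ($t['p'] === $geo . 'long') {
            $coords[$t['s']]['long'] = $t['o'];
        }
    }

    $kml = '<?xml version="1.0" encoding="UTF-8"?>'
         . '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>';
    foreach ($coords as $subject => $c) {
        if (isset($c['lat'], $c['long'])) {
            // KML expects longitude,latitude order.
            $kml .= '<Placemark><name>' . htmlspecialchars($subject) . '</name>'
                  . '<Point><coordinates>' . $c['long'] . ',' . $c['lat'] . '</coordinates></Point>'
                  . '</Placemark>';
        }
    }
    return $kml . '</Document></kml>';
}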
Currently you can see the next and previous page link. It would be nice to have a notion of how many pages and how many results there are. Note that the number of pages will vary with the current page_size or limit used in the request.
Currently most, but not all, of the errors and exceptions that can be thrown in the code are caught. We need to catch even fatal errors resulting from invalid SQL states so that we can return them in a proper way.
It should be possible to update resource definitions using the PATCH HTTP method. If you'd like to update the documentation or license parameter in a definition, you could then use PATCH instead of re-PUT'ing the entire resource definition.
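For example, a PATCH request could look like this (illustrative values, following the request-details format used elsewhere in these issues):

Request Url: http://localhost/tdtStart/public/TDTAdmin/Resources/{package}/{resource}
Request Method: PATCH
Params: {
"documentation": "An updated description of this resource"
}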
The Excel strategy has no paging implemented yet.
Implement paging similar to the CSV strategy.
Some data owners want to add meta-data to their resource and the TDTInfo/Resources information.
It is already implemented. We should just make sure that it is documented on thedatatank.com/help.
Document it using the TDTInfo/Admin.about page, including how data owners can find the right info. Also document how to do it when a resource was already added.
The calculation of the limit and offset can be done generally, there's no need to ask every strategy to calculate this for themselves. This needs to be cleaned up.
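A sketch of what that shared calculation could look like; the parameter names follow the page_size and limit/offset parameters mentioned in these issues, and the default page size is an assumption:

function calculateLimitAndOffset(array $params, $defaultPageSize = 500) {
    if (isset($params['limit']) || isset($params['offset'])) {
        // Explicit limit/offset wins.
        $limit  = isset($params['limit'])  ? (int) $params['limit']  : $defaultPageSize;
        $offset = isset($params['offset']) ? (int) $params['offset'] : 0;
    } else {
        // Otherwise derive them from the requested page and page size.
        $page     = isset($params['page']) ? max(1, (int) $params['page']) : 1;
        $pageSize = isset($params['page_size']) ? (int) $params['page_size'] : $defaultPageSize;
        $limit  = $pageSize;
        $offset = ($page - 1) * $pageSize;
    }
    return array($limit, $offset);
}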
How about implementing d3js charts as a native feature for visualisations?
Managing the SPARQL resources in tdtadmin/resources is something that has to change quickly. We should therefore have a script which can quickly change, for instance, a SPARQL query.
Create a bash script which can get all sparql queries when the right password is provided, and which can push all the files back to the server.
A RESTfilter in The DataTank is meant to browse further into a data object. e.g.
localhost/schoolX/teachers provides us with the following:
[{
    "id": 1,
    "name": "John Doe",
    "teaches": "Math, Science"
},
{
    "id": 2,
    "name": "Jane Doe",
    "teaches": "History, Cooking"
}]
If a user goes to localhost/schoolX/teachers/0, it should get the first item, instead of all the items. Currently this processing of RESTfilters, the 0 in the above example, is done in RController, after the entire object has been read. This should be propagated to the readers instead, so they can process it @read-time.
This has only one minor disadvantage, namely that implementing a reader requires to take these parameters into account.
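A sketch of the idea for a file-based reader such as CSV (illustrative, not the actual reader code): the REST parameter is honoured while reading, so the reader can stop as soon as the requested row is found instead of loading everything and filtering afterwards in RController.

function readCsvRow($file, $restIndex) {
    $handle = fopen($file, 'r');
    $header = fgetcsv($handle);
    $index = 0;
    while (($row = fgetcsv($handle)) !== false) {
        // A PK-defined column could be matched here as well, instead of the numeric index.
        if ($index === (int) $restIndex) {
            fclose($handle);
            return array_combine($header, $row);
        }
        $index++;
    }
    fclose($handle);
    return null; // row not found
}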
There are a lot of visualisations possible with The DataTank, but this is poorly documented. No one actually knows what the arguments are and it can't be found anywhere.
http://thedatatank.com/help/category/consuming/
In this category a post has to be added on data visualisation: charts, bar, pie, map, grid...
In strategies, the GET parameters are added to the object with $this, while a property 'rest_params' is added as well, containing an array. This makes it impossible to iterate separately over the parameters. This should be fixed where the restparameters are added. Also, PHP has no support for multivalue request parameters, which is fixed by a function in SPARQL strategy. This function should be moved up so it applies for any strategy.
Look into SPARQL strategy for details.
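The multi-value fix boils down to something like the helper below (a sketch of the approach, not the exact SPARQL strategy code), which could move to a shared utility so every strategy benefits:

function parseMultiValueQueryString($queryString) {
    $params = array();
    foreach (explode('&', $queryString) as $pair) {
        if ($pair === '') continue;
        list($key, $value) = array_pad(explode('=', $pair, 2), 2, '');
        $key = urldecode($key);
        $value = urldecode($value);
        if (isset($params[$key])) {
            // Repeated parameters (?x=1&x=2) become an array instead of overwriting each other.
            $params[$key] = (array) $params[$key];
            $params[$key][] = $value;
        } else {
            $params[$key] = $value;
        }
    }
    return $params;
}

// Usage: $params = parseMultiValueQueryString($_SERVER['QUERY_STRING']);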
Replace the include_once statements. Classes can now be loaded by the autoloader.
Currently issue #16 is fixed, but not in an ideal way.
Problem.
We have to rollback database transactions, and with our current ORM (RedBean) this is possible. It has a built-in transaction system with the normal components such as begin(), commit() and rollback(). Now, here's the tricky part. In order to make this work, the autocommit to the MySQL back-end should be set to false. If not, RedBean provides a function to work-around this by using R::freeze(true). However, since we work in a "fluid" environment (meaning tables get made and adjusted on the fly) using this function is not an option.
(http://stackoverflow.com/questions/10851471/why-arent-redbeans-transaction-functions-working)
Solution(s)
A first solution is to configure the MySQL back-end to auto-commit = false.
(http://stackoverflow.com/questions/2280465/how-do-i-turn-off-autocommit-for-a-mysql-client)
A second solution exists in creating an installation script that you can call upon, and works just like it used to in our previous repository (iRail/The-DataTank), which initializes the databases. After that we can use R::freeze(true). This looks(!!) like a good solution, however with a CLI client you can't use our old approach, namely browsing to our /installation and following the steps. So maybe, a resource should be made that not only initializes the database back-end, but also updates the back-end when database changes are to be applied.
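For reference, once the schema is installed up front and frozen, the rollback pattern with RedBean would look roughly like this (R is the global RedBean facade in the version we use; the bean names are illustrative):

R::freeze(true);   // only safe once the database schema has been installed in advance
R::begin();
try {
    // ... create the package, resource and generic_resource_* records here ...
    R::commit();
} catch (Exception $e) {
    R::rollback();
    throw $e;      // let the error handler turn this into a proper HTTP response
}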
RFC!
Tweaks necessary:
We have several "cores" that may have administrator and info resources. For instance, TDTInfo/Resources or TDTAdmin/Export are part of tdt/core. tdt/input now wants to create a TDTAdmin/Input so we can have a RESTful interface. We could add a route to the cores.json, but then Input will not show up when you'd go to http://data.example.com/TDTAdmin.
We need a collection somewhere where we can register our core resources. Would it be a good idea to make this a constructor array in tdt/core?
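A possible shape for such a registry, purely illustrative:

$core_resources = array(
    'tdt/core'  => array('TDTInfo/Resources', 'TDTAdmin/Export'),
    'tdt/input' => array('TDTAdmin/Input'),
);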
We have the functionality on how to add a remote resource and it has been documented in tdtinfo/admin, but users don't know what a remote resource is and how it works exactly.
Document it in http://thedatatank.com/help/category/publishing-data/
Currently there are still deprecated functions to handle "updates" on resource definitions. However, updating is now done by patching the resource, altering the patched parameters, and putting the entire definition again for validation. This makes "update" obsolete; it needs to be thrown away.
Currently the SPECTQL grammar can't cope with resource names that are a number. Needs to be fixed.
Currently the documentation on TDTInfo/Admin is somewhat outdated, e.g. DELETE on TDTInfo/Resources is no longer applicable; it should be TDTAdmin/Resources.
CC REL is a specification describing how license information may be described using RDF and how license information may be attached to works.
The vocabulary is specified in this paper:
http://wiki.creativecommons.org/images/d/d6/Ccrel-1.0.pdf
Read this paper:
http://ceur-ws.org/Vol-905/VillataAndGandon_COLD2012.pdf
When getting data from several resources to be used in 1 result, we also need a way to add license information, and a way to compose this license information.
Metadata can be added to an HTTP response by using the Link header:
http://www.w3.org/wiki/LinkHeader
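For example, a license could be attached to a response like this (the license URI is just an example):

Link: <http://creativecommons.org/licenses/by/3.0/>; rel="license"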
A standard framework for composing licenses must be available. Two approaches:
Using anchors, we can give several parts of the data different licenses if the data is reused directly.
When the results given over HTTP are a remix and a mash-up of other data, we need to provide the provenance of this data, and each of their license plus the resulting license.
A resource named ResourcE1 should be the same as resource1. Currently this isn't the case, while it's a nice to have/necessary feature.
The title is a little vague, because the bug is vague. When I tried to add a CSV resource, MySQL threw an invalid state, telling me 'pk' already exists. Yes, that's true: RedBean should add a generic_resource_csv object with a property called pk to that datatable. What I found out was that the type of the pk column was set to "set('1')"; after I deleted it and let the RedBean ORM work its magic, it made the same column a tinyint(4) and everything was fine. Now, while this may look like a RedBean issue, I think we can avoid these kinds of problems by making an installer that installs the database schema in advance. This would also allow us to R::freeze() the database, making it possible to rely on RedBean's rollback functionality.
The SHP strategy should be included again. Use the old SHP strategy, but refactor it first to PSR-0 compliance.
Document it at this page: http://thedatatank.com/help/category/consuming/
When sending a PUT request for a SPARQL resource, the table generic_resource_sparql is not created. Strangely, this does happen for LD resources, which use a strategy that just extends the SPARQL one...
Request details:
Request Url: http://localhost/tdtStart/public/tdtAdmin/Resources/sparql/test
Request Method: PUT
Status Code: 500
Params: {
"resource_type": "generic",
"generic_type": "SPARQL",
"documentation": "Test sparql parameters",
"endpoint": "http://157.193.213.125:8890/sparql",
"query": "SELECT * WHERE { ?s ?p ?o }"
}
If you want to use the ResourcesModel, you need to construct it with a certain configuration array. This array is not validated automatically; it is only validated when used elsewhere, and this can give weird errors. The user, however, just wants a message stating that they have done something wrong, and wants to be informed about how to correct the mistake.
Create a JSON schema for your configuration and make ResourcesModel validate the configuration in the constructor using a JSON schema validator (this is installable using composer: justinrainbow/json-schema)
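A hedged sketch of what that constructor validation could look like with justinrainbow/json-schema; the schema filename and the exception handling are assumptions:

use JsonSchema\Validator;

class ResourcesModel
{
    public function __construct(array $config)
    {
        $validator = new Validator();
        // The validator works on objects, so convert the configuration array first.
        $data = json_decode(json_encode($config));
        $schema = json_decode(file_get_contents(__DIR__ . '/resources_config.schema.json'));

        $validator->check($data, $schema);

        if (!$validator->isValid()) {
            $messages = array();
            foreach ($validator->getErrors() as $error) {
                $messages[] = sprintf('[%s] %s', $error['property'], $error['message']);
            }
            throw new Exception("Invalid ResourcesModel configuration:\n" . implode("\n", $messages));
        }

        // ... continue with the normal construction using the validated $config ...
    }
}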
A Rollback should be used when something goes wrong (exception or fatal error) while PUT'ing a resource definition.
Pitfall: RedBean supports this, but we should catch every error/exception that happens while adding a resource; thus, in our exception/error handler we have to know what HTTP request was done (check headers).
When a SPARQL, or LD resource (and others) is added, the Package is inserted twice into the DB, while the resource name is empty.
For example:
PUT .../TDTAdmin/Resources/Airports/Regions/
inserts the following lines into table Package
id | package_name | timestamp  | parent_package | full_package_name
1  | Airports     | 1360674579 | NULL           | Airports
2  | Regions      | 1360674580 | 1              | Airports/Regions
and the following lines into table Resource
id | package_id | resource_name | creation_timestamp | last_update_timestamp | type
1  | 2          | (empty)       | 1360674580         | 1360674580            | generic
Currently when a file (another strategy for example) is added, the documentation will have to be updated. Since no business logic so far is implemented to know if something has been added, we can do two things:
The second option is a bit trickier, as it will delay returning a cached object by however much work it takes to check whether the documentation can still be returned from cache or has to be remade.
Display the URI of a resource in tdtinfo/resources; this would allow for quick referencing, instead of having to manually put together the full URI by looking at the package and resource name.
Document the paging parameters in the API doc.
I add a resource, but an error occurs between the creation of the package/resource and the generic type. This results in a 500, which is not proper error handling. If I, for example, perform a check which executes a query that results in a max memory size error, I should get a "400 Bad Request" with the message "query result too big".
Also, when this error occurs, my Package and Resource are already added to the DB, but my generic resource table is not added/filled. There should be a rollback that deletes all insertions, i.e. DB transactions.