tdt / core
Transform any dataset into an HTTP API with The DataTank
Home Page: http://thedatatank.com
Currently null values are parsed as the string "null" for internal simplicity (as far as I know). However, when performing a comparison (e.g. x > 200) where x = "null", it will return true, because it is a string. This needs to be fixed.
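A possible fix is to normalize these values before any comparison runs. A minimal sketch, assuming the filter logic can call a small helper first (function names are illustrative, not the actual tdt/core code):

function normalizeValue($value) {
    // Treat the literal string "null" as a real null before filtering.
    return (is_string($value) && strtolower($value) === 'null') ? null : $value;
}

function passesGreaterThan($value, $threshold) {
    $value = normalizeValue($value);
    if ($value === null) {
        return false; // null never satisfies a numeric comparison
    }
    return (float) $value > $threshold;
}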
Notice: Undefined property: stdClass::$Tdtadmin in tdt/core/formatters/xmlFormatter.php on line 44
Notice: Undefined property: stdClass::$Tdtadmin in tdt/core/formatters/xmlFormatter.php on line 46
Recommended file permissions:
See comments here: 88b5f9b#commitcomment-2766416
At this moment, you can PUT your {package}/{resource} to TDTAdmin/Resources, both in www-form encoding and in JSON. When requesting TDTAdmin/Resources/{package}/{resource}.json, however, it returns a totally different JSON document which cannot be used to PUT again.
Make sure the JSON it GETs has a structure similar to the JSON that can be PUT.
Currently REST parameters are handled numerically so that a REST parameter can be used to filter out a certain entry. However the index doesn't have to be numeric, it can also be a unique identifier (by defining a PK). This needs to be implemented.
When we want RDF output, we need to have the data enriched with semantics. If a normal formatter would be used, the resulting format would be really complicated to parse.
As a model for these objects we will use ARC graphs. When triples are kept in memory, and a regular formatter is chosen, we will need a special function to print this ARC model. It should first check whether the object is a normal PHP object, or a special ARC graph object and print it accordingly.
At this moment we have an abstract class for formatters (json, xml, csv...) which you can find here:
https://github.com/tdt/formatters/tree/master/src/tdt/formatters
When an ARC model is inserted, we now need to have a different approach to serializing this data, thus a different function.
For semantic output (rdf/xml, rdf/json, turtle, n3...) I would suggest reusing the serializers in ARC.
https://github.com/semsol/arc2/tree/master/serializers
For every strategy we write in tdt/core, we thus need to know whether the data that is retrieved is a graph, or whether it is a standard PHP object model. Maybe it would be interesting to have an RDFStrategy abstract class to implement instead of AStrategy?
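A rough sketch of that dispatch, assuming the semantic strategies hand over triples in the array structure produced by ARC2 parsers (function names are illustrative, not the actual tdt/core API; arc2/ARC2.php is assumed to be loaded):

function serializeData($data, $format = 'turtle') {
    // Is this an ARC2 triple set (an array of s/p/o arrays) or a plain PHP object tree?
    $isTripleSet = is_array($data) && isset($data[0]['s'], $data[0]['p'], $data[0]['o']);

    if ($isTripleSet) {
        // Reuse ARC2's own serializers for the semantic formats.
        $serializer = ($format === 'rdfxml')
            ? ARC2::getRDFXMLSerializer()
            : ARC2::getTurtleSerializer();
        return $serializer->getSerializedTriples($data);
    }

    // Otherwise fall back to a regular formatter (JSON shown as an example).
    return json_encode($data);
}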
Currently, if a subresource is passed via the REST parameters, there's no change in the construct query. This needs to be a filter at extraction time, not at post-processing time.
When your TDT got installed and you go to the first page, a README visualisation is shown explaining how to install The DataTank. The text is a copy of the README in tdt/start.
The first page should be a page which helps the data consumer further. It should explain how to use everything, where to find the documentation when more questions come up (http://thedatatank.com/help/category/consuming/) and it should list all the resources installed in this tdt/core in a fancy way.
First declare a namespace before using other namespaces! NOTE: some editors, like NetBeans, will automatically import namespace usages and put them in the first lines of the PHP file, even though a namespace has been declared.
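A minimal example of the correct order (namespace and class names are illustrative only):

<?php
namespace tdt\core\formatters;   // the file's own namespace comes first

use tdt\core\utility\Request;    // hypothetical import, for illustration

class ExampleFormatter
{
    // ...
}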
Currently meta-data parameters are put in the parameters section, along with optional parameters of a certain type of resource. This is correct, but perhaps we should find a way to show which parameters are optional parameters, and which are meta-data parameters.
Proposed solution:
Next to documentation, parameters, and requiredparameters, add meta_data as a section? If this is the case, the documentation will have to be updated!
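For example, the definition returned for a resource could then be split up like this (field names and values are only a suggestion):

{
    "documentation": "...",
    "parameters": { "...": "..." },
    "requiredparameters": { "...": "..." },
    "meta_data": { "example_uri": "...", "license": "..." }
}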
Support XML again as a resource strategy; it is used quite a bit, but not yet supported.
It would be nice to have a SPECTQL interface on a CSV file. Clauses such as select, where,... can be implemented before the object is created from the file and then given to the SPECTQL tree, or normal Read Controller.
Implement paging in the LD strategy.
Currently only example_uri is displayed as meta-data in the tdtinfo/resources-resource. This should be expanded to all meta-data.
Re-implement the GUI for creating SPECTQL queries with your current resources.
Problem:
When using a SPARQL resource, parameters can only be added in a single value way. An array of values should be supplied, and this should be spun out in the query as a boolean combination. Options are (GET param):
· spontaneous: ?x=1&x=2&x=3
· formatted: ?x=[1,2,3]
· spun-out: ?x.size=3&x.0=1&x.1=2&x.2=3
Solution:
Look into SPARQLscript or automatically parse and alter the sparql query.
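If the query is rewritten automatically, the rewrite could be as simple as appending an OR'ed FILTER built from the supplied values. A hedged sketch, not the actual strategy code:

function buildFilter($variable, array $values) {
    $clauses = array();
    foreach ($values as $value) {
        // Compare the SPARQL variable against each supplied value.
        $clauses[] = sprintf('?%s = "%s"', $variable, addslashes($value));
    }
    return 'FILTER (' . implode(' || ', $clauses) . ')';
}

// e.g. buildFilter('x', array(1, 2, 3)) gives: FILTER (?x = "1" || ?x = "2" || ?x = "3")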
When the KML formatter is run on a graph, it doesn't convert all the geo:location data into KML documents.
Query the RDF loaded in memory for all the locations and add them to the KML output when the input is an ARC2 Graph.
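A sketch of that conversion, assuming the graph is available as an ARC2 triple array and the coordinates use the W3C geo (wgs84_pos) vocabulary; the function name and output details are illustrative:

function triplesToKml(array $triples) {
    $geo = 'http://www.w3.org/2003/01/geo/wgs84_pos#';
    $coords = array(); // subject URI => array('lat' => ..., 'long' => ...)

    foreach ($triples as $t) {
        if ($t['p'] === $geo . 'lat') {
            $coords[$t['s']]['lat'] = $t['o'];
        } elseif ($t['p'] === $geo . 'long') {
            $coords[$t['s']]['long'] = $t['o'];
        }
    }

    $kml = '<?xml version="1.0" encoding="UTF-8"?>'
         . '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>';
    foreach ($coords as $subject => $c) {
        if (isset($c['lat'], $c['long'])) {
            // KML expects longitude,latitude order.
            $kml .= '<Placemark><name>' . htmlspecialchars($subject) . '</name>'
                  . '<Point><coordinates>' . $c['long'] . ',' . $c['lat'] . '</coordinates></Point>'
                  . '</Placemark>';
        }
    }
    return $kml . '</Document></kml>';
}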
Currently you can see the next and previous page link. It would be nice to have a notion of how many pages and how many results there are. Note that the number of pages will vary with the current page_size or limit used in the request.
Currently most, but not all, of the errors and exceptions that can be thrown in the code are caught. We need to catch even fatal errors resulting from invalid SQL states so that we can return them in a proper way.
It should be possible to update resource definitions using the PATCH HTTP method. If you'd like to update the documentation or license parameter in a definition, you could then use PATCH instead of re-PUT'ing the entire resource definition.
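For example, a PATCH request could look like this (illustrative values, following the request-details format used elsewhere in these issues):

Request Url: http://localhost/tdtStart/public/TDTAdmin/Resources/{package}/{resource}
Request Method: PATCH
Params: {
"documentation": "An updated description of this resource"
}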
The Excel strategy has no paging implemented yet.
Implement paging similar to the CSV strategy.
Some data owners want to add meta-data to their resource and the TDTInfo/Resources information.
It is already implemented. We should just make sure that it is documented on thedatatank.com/help.
Document it using the TDTInfo/Admin.about page, including how data owners can find the right info. Also document how to do it when a resource was already added.
The calculation of the limit and offset can be done generally, there's no need to ask every strategy to calculate this for themselves. This needs to be cleaned up.
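A sketch of what that shared calculation could look like; the parameter names follow the page_size and limit/offset parameters mentioned in these issues, and the default page size is an assumption:

function calculateLimitAndOffset(array $params, $defaultPageSize = 500) {
    if (isset($params['limit']) || isset($params['offset'])) {
        // Explicit limit/offset wins.
        $limit  = isset($params['limit'])  ? (int) $params['limit']  : $defaultPageSize;
        $offset = isset($params['offset']) ? (int) $params['offset'] : 0;
    } else {
        // Otherwise derive them from the requested page and page size.
        $page     = isset($params['page']) ? max(1, (int) $params['page']) : 1;
        $pageSize = isset($params['page_size']) ? (int) $params['page_size'] : $defaultPageSize;
        $limit  = $pageSize;
        $offset = ($page - 1) * $pageSize;
    }
    return array($limit, $offset);
}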
How about implementing d3js charts as a native feature for visualisations?
Managing the SPARQL resources in tdtadmin/resources is something that has to change quickly. We should therefore have a script which can quickly change, for instance, a SPARQL query.
Create a bash script which can get all sparql queries when the right password is provided, and which can push all the files back to the server.
A RESTfilter in The DataTank is meant to browse further into a data object. e.g.
localhost/schoolX/teachers provides us with the following:
[{
    "id": 1,
    "name": "John Doe",
    "teaches": "Math, Science"
},
{
    "id": 2,
    "name": "Jane Doe",
    "teaches": "History, Cooking"
}]
If a user goes to localhost/schoolX/teachers/0, it should get the first item, instead of all the items. Currently this processing of RESTfilters, the 0 in the above example, is done in RController, after the entire object has been read. This should be propagated to the readers instead, so they can process it @read-time.
This has only one minor disadvantage, namely that implementing a reader requires to take these parameters into account.
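A sketch of the idea for a file-based reader such as CSV (illustrative, not the actual reader code): the REST parameter is honoured while reading, so the reader can stop as soon as the requested row is found instead of loading everything and filtering afterwards in RController.

function readCsvRow($file, $restIndex) {
    $handle = fopen($file, 'r');
    $header = fgetcsv($handle);
    $index = 0;
    while (($row = fgetcsv($handle)) !== false) {
        // A PK-defined column could be matched here as well, instead of the numeric index.
        if ($index === (int) $restIndex) {
            fclose($handle);
            return array_combine($header, $row);
        }
        $index++;
    }
    fclose($handle);
    return null; // row not found
}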
There are a lot of visualisations possible with The DataTank, but this is poorly documented. No one actually knows what the arguments are and it can't be found anywhere.
http://thedatatank.com/help/category/consuming/
In this category a post has to be added on data visualisation: charts, bar, pie, map, grid...
In strategies, the GET parameters are added to the object with $this, while a property 'rest_params' is added as well, containing an array. This makes it impossible to iterate separately over the parameters. This should be fixed where the restparameters are added. Also, PHP has no support for multivalue request parameters, which is fixed by a function in SPARQL strategy. This function should be moved up so it applies for any strategy.
Look into SPARQL strategy for details.
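The multi-value fix boils down to something like the helper below (a sketch of the approach, not the exact SPARQL strategy code), which could move to a shared utility so every strategy benefits:

function parseMultiValueQueryString($queryString) {
    $params = array();
    foreach (explode('&', $queryString) as $pair) {
        if ($pair === '') continue;
        list($key, $value) = array_pad(explode('=', $pair, 2), 2, '');
        $key = urldecode($key);
        $value = urldecode($value);
        if (isset($params[$key])) {
            // Repeated parameters (?x=1&x=2) become an array instead of overwriting each other.
            $params[$key] = (array) $params[$key];
            $params[$key][] = $value;
        } else {
            $params[$key] = $value;
        }
    }
    return $params;
}

// Usage: $params = parseMultiValueQueryString($_SERVER['QUERY_STRING']);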
Replace the include_once statements. Classes can now be loaded by the autoloader.
Currently issue #16 is fixed, but not in an ideal way.
Problem.
We have to rollback database transactions, and with our current ORM (RedBean) this is possible. It has a built-in transaction system with the normal components such as begin(), commit() and rollback(). Now, here's the tricky part. In order to make this work, the autocommit to the MySQL back-end should be set to false. If not, RedBean provides a function to work-around this by using R::freeze(true). However, since we work in a "fluid" environment (meaning tables get made and adjusted on the fly) using this function is not an option.
(http://stackoverflow.com/questions/10851471/why-arent-redbeans-transaction-functions-working)
Solution(s)
A first solution is to configure the MySQL back-end to auto-commit = false.
(http://stackoverflow.com/questions/2280465/how-do-i-turn-off-autocommit-for-a-mysql-client)
A second solution exists in creating an installation script that you can call upon, and works just like it used to in our previous repository (iRail/The-DataTank), which initializes the databases. After that we can use R::freeze(true). This looks(!!) like a good solution, however with a CLI client you can't use our old approach, namely browsing to our /installation and following the steps. So maybe, a resource should be made that not only initializes the database back-end, but also updates the back-end when database changes are to be applied.
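For reference, once the schema is installed up front and frozen, the rollback pattern with RedBean would look roughly like this (R is the global RedBean facade in the version we use; the bean names are illustrative):

R::freeze(true);   // only safe once the database schema has been installed in advance
R::begin();
try {
    // ... create the package, resource and generic_resource_* records here ...
    R::commit();
} catch (Exception $e) {
    R::rollback();
    throw $e;      // let the error handler turn this into a proper HTTP response
}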
RFC!
Tweaks necessary:
We have several "cores" that may have administrator and info resources. For instance, TDTInfo/Resources or TDTAdmin/Export are part of tdt/core. tdt/input now wants to create a TDTAdmin/Input so we can have a RESTful interface. We could add a route to the cores.json, but then Input will not show up when you'd go to http://data.example.com/TDTAdmin.
We need a collection somewhere where we can register our core resources. Would it be a good idea to make this a constructor array in tdt/core?
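A possible shape for such a registry, purely illustrative:

$core_resources = array(
    'tdt/core'  => array('TDTInfo/Resources', 'TDTAdmin/Export'),
    'tdt/input' => array('TDTAdmin/Input'),
);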
We have the functionality on how to add a remote resource and it has been documented in tdtinfo/admin, but users don't know what a remote resource is and how it works exactly.
Document it in http://thedatatank.com/help/category/publishing-data/
Currently there are still deprecated functions to handle "updates" on resource definitions. However, updating is now done by patching the resource, altering the patched parameters, and putting the entire definition again for validation. This makes "update" obsolete; it needs to be thrown away.
Currently the SPECTQL grammar can't cope with resource names that are a number. Needs to be fixed.
Currently the documentation on TDTInfo/Admin is somewhat outdated, e.g. DELETE on TDTInfo/Resources is no longer applicable; it should be TDTAdmin/Resources.
CC REL is a specification describing how license information may be described using RDF and how license information may be attached to works.
The vocabulary is specified in this paper:
http://wiki.creativecommons.org/images/d/d6/Ccrel-1.0.pdf
Read this paper:
http://ceur-ws.org/Vol-905/VillataAndGandon_COLD2012.pdf
When getting data from several resources to be used in 1 result, we also need a way to add license information, and a way to compose this license information.
Metadata can be added to an HTTP response by using the Link header:
http://www.w3.org/wiki/LinkHeader
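For example, a license could be attached to a response like this (the license URI is just an example):

Link: <http://creativecommons.org/licenses/by/3.0/>; rel="license"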
A standard framework for composing licenses must be available. Two approaches:
Using anchors, we can give several parts of the data different licenses if the data is reused directly.
When the results given over HTTP are a remix and a mash-up of other data, we need to provide the provenance of this data, and each of their license plus the resulting license.
A resource named ResourcE1 should be the same as resource1. Currently this isn't the case, while it's a nice to have/necessary feature.
The title is a little vague, because the bug is vague. When I tried to add a CSV resource, MySQL threw an invalid state, telling me 'pk' already exists. Yes, that's true: RedBean should add a generic_resource_csv object with a property called pk to that datatable. What I found out was that the type of the pk column was set to "set('1')"; after I deleted it and let the RedBean ORM work its magic, it made the same column a tinyint(4) and everything was fine. Now, while this may look like a RedBean issue, I think we can avoid these kinds of problems by making an installer that installs the database schema in advance. This would also allow us to R::freeze() the database, making it possible to rely on RedBean's rollback functionality.
The SHP strategy should be included again. Use the old SHP strategy, but refactor it first to PSR-0 compliance.
Document it at this page: http://thedatatank.com/help/category/consuming/
When sending a PUT request for a SPARQL resource, the table generic_resource_sparql is not created. Strangely, this does happen for LD resources, which use a strategy that just extends the SPARQL one...
Request details:
Request Url: http://localhost/tdtStart/public/tdtAdmin/Resources/sparql/test
Request Method: PUT
Status Code: 500
Params: {
"resource_type": "generic",
"generic_type": "SPARQL",
"documentation": "Test sparql parameters",
"endpoint": "http://157.193.213.125:8890/sparql",
"query": "SELECT * WHERE { ?s ?p ?o }"
}
If you want to use the ResourcesModel, you need to construct it with a certain configuration array. This array is not validated automatically; it is only validated when used elsewhere, and this can give weird errors. The user, however, just wants a message stating that they have done something wrong, and wants to be informed about how to correct the mistake.
Create a JSON schema for your configuration and make ResourcesModel validate the configuration in the constructor using a JSON schema validator (this is installable using composer: justinrainbow/json-schema)
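A hedged sketch of what that constructor validation could look like with justinrainbow/json-schema; the schema filename and the exception handling are assumptions:

use JsonSchema\Validator;

class ResourcesModel
{
    public function __construct(array $config)
    {
        $validator = new Validator();
        // The validator works on objects, so convert the configuration array first.
        $data = json_decode(json_encode($config));
        $schema = json_decode(file_get_contents(__DIR__ . '/resources_config.schema.json'));

        $validator->check($data, $schema);

        if (!$validator->isValid()) {
            $messages = array();
            foreach ($validator->getErrors() as $error) {
                $messages[] = sprintf('[%s] %s', $error['property'], $error['message']);
            }
            throw new Exception("Invalid ResourcesModel configuration:\n" . implode("\n", $messages));
        }

        // ... continue with the normal construction using the validated $config ...
    }
}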
A Rollback should be used when something goes wrong (exception or fatal error) while PUT'ing a resource definition.
Pitfall: RedBean supports this, but we should catch every error/exception that happens while adding a resource; thus, in our exception/error handler we have to know what HTTP request was done (check headers).
When a SPARQL, or LD resource (and others) is added, the Package is inserted twice into the DB, while the resource name is empty.
For example:
PUT .../TDTAdmin/Resources/Airports/Regions/
inserts the following lines into table Package
id | package_name | timestamp  | parent_package | full_package_name
1  | Airports     | 1360674579 | NULL           | Airports
2  | Regions      | 1360674580 | 1              | Airports/Regions
and the following lines into table Resource
id | package_id | resource_name | creation_timestamp | last_update_timestamp | type
1  | 2          | (empty)       | 1360674580         | 1360674580            | generic
Currently when a file (another strategy for example) is added, the documentation will have to be updated. Since no business logic so far is implemented to know if something has been added, we can do two things:
The second option is a bit trickier, as it will delay returning a cached object by however much work it takes to check whether the documentation can still be returned from cache or has to be remade.
Display the URI of a resource in tdtinfo/resources; this would allow for quick referencing, instead of having to manually put together the full URI by looking at the package and resource name.
Document the paging parameters in the API doc.
I add a resource, but an error occurs between the creation of the package/resource and the generic type. This results in a 500, which is not proper error handling. If I, for example, perform a check which executes a query that results in a max memory size error, I should get a "400 Bad Request" with the message "query result too big".
Also, when this error occurs, my Package and Resource are already added to the DB, but my generic resource table is not added/filled. There should be a rollback that deletes all insertions, i.e. DB transactions.