datadotworld / dwapi-spec

data.world API Specifications (a.k.a. Swagger definition)

License: Apache License 2.0

dwapi-spec's Introduction

data.world — Public API specification

This repository contains the OpenAPI (a.k.a. Swagger) specification for the data.world API.

Documentation

Reference documentation and code examples for this API can be found at: https://apidocs.data.world/

Contributing

The data.world API specification is an open-source project. Community participation is encouraged. If you'd like to contribute, please follow the Contributing Guidelines.

License

Apache License 2.0

dwapi-spec's Issues

Document payload expectations for POST:/sql

The current documentation doesn't specify that query must be passed in the POST request for SQL queries.

Document HTTP 429 and rate limiting conventions

Action Required: Fix Renovate Configuration

There is an error with this repository's Renovate configuration that needs to be fixed. As a precaution, Renovate will stop PRs until it is resolved.

Error type: Cannot find preset's package (github>datadotworld/renovate-config)

Unable to add description and labels

Currently, when no description or labels have been added to any of the files in a dataset, attempting to change the description or labels via the API fails with HTTP 404.

This seems to be because the user layer doesn't yet exist.

Title should be optional in PUT:/datasets

When allowing title changes in PUT:/datasets we made title a required attribute. With backwards-compatibility in mind, we need a sensible default (existing title) for it in case it's not provided in the request.

At the moment, this breaks some of our existing integrations (incl. Python, R and CKAN).

Include field title in table schema returned with SQL queries

Currently, the table schema included in SQL responses only includes field names. The schema should include title (original string) as well.

Missing `accessLevel` in `/user/datasets/*` responses

Set up regression tests

Return accessLevel in DatasetSummaryResponse

A new attribute should indicate the access level the current user has to datasets returned via GET:/datasets and GET:/user/datasets/*.

Allow datasets to be created with PUT:/datasets

PUT:/datasets should allow datasets to be created with a dataset id (slug) chosen by the client.

Problems with sort parameters

sort, when applied to GET:/user/datasets/liked, doesn't seem to have any effect on the response. When applied to GET:/user/datasets/own or GET:/user/datasets/contributing, it results in HTTP 500, no matter what value it carries.

Parameterized Queries

I'd like to propose a spec for how parameterized queries should be processed by the query endpoints. There are two parts to that:

How Parameter Values and Types are specified.
How Parameter Names and Values are provided on the URL.

Parameter Values and Types

For both SQL and SPARQL, I think we should support "simple", "safe", and "RDF" parameter values.

RDF parameters allow for complete precision in the same language the underlying query engine understands, at the cost of a pretty verbose and esoteric syntax. If you want total precision, use this syntax.
"safe" parameters allow you to be specific about String/URI parameters by wrapping values in "" or <> - definitely a necessity for user-entered content. If you're building an SDK or integration, this helps make sure that you can be precise about types without the loss of readability in RDF types.
"simple" parameters mean that we'll do the right thing with most values, defaulting to String where we can't make a better guess. This maximizes the chance that ad-hoc queries will return results when the user's meaning is clear.

To parse a value, try each of the following patterns in order and stop at the first match:

"RDF" parameters:

if value matches /^"(.*)"\^\^<([^<>]*)>$/ :

  "abcdef"^^<http://www.w3.org/2001/XMLSchema#string>
  "3"^^<http://www.w3.org/2001/XMLSchema#integer>
  "4.2"^^<http://www.w3.org/2001/XMLSchema#decimal>
  "true"^^<http://www.w3.org/2001/XMLSchema#boolean>

(matches two "groups" - the string type value and the URI of the type)

"Safe" parameters:

if value matches /^"(.*)"$/ :

  "abcdef"                                 <- String
  "3"                                      <- String
  "4.2"                                    <- String
  "true"                                   <- String
  "https://data.world/"                    <- String

(matches one group - the string value)

if value matches /^<(.*)>$/ :

  <https://data.world/>                    <- URI
  <abcdef>                                 <- URI
  <3>                                      <- URI

(matches one group - the URI)

"Simple" parameters:

if value matches /^([0-9]+)$/ :

  3                                        <- Integer

if value matches /^([0-9]*[.][0-9]+)$/ :

  4.2                                      <- Decimal

if value matches /^(true|false)$/ :

  true                                     <- Boolean

if value matches /^([a-z]+:\/\/.*)$/ :

  https://data.world/                      <- URI

(all of the above match one group - the value to interpret as Integer/Decimal/Boolean/URI)

otherwise :

  abcdef                                   <- String

(just treat the whole value as a String if nothing else matches)
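
The parsing order above can be sketched in Python as follows. This is a minimal illustration, not the data.world implementation: the function name `parse_parameter` and the returned type labels are assumptions made for the example, and only the regular expressions come from the proposal.

```python
import re

# Patterns copied from the proposal, listed in the order they are tried.
# The label for each entry is an illustrative name, not part of the spec.
_RDF = re.compile(r'^"(.*)"\^\^<([^<>]*)>$')
_SAFE_AND_SIMPLE = [
    (re.compile(r'^"(.*)"$'), "String"),               # "safe" string
    (re.compile(r'^<(.*)>$'), "URI"),                  # "safe" URI
    (re.compile(r'^([0-9]+)$'), "Integer"),            # "simple" integer
    (re.compile(r'^([0-9]*[.][0-9]+)$'), "Decimal"),   # "simple" decimal
    (re.compile(r'^(true|false)$'), "Boolean"),        # "simple" boolean
    (re.compile(r'^([a-z]+:\/\/.*)$'), "URI"),         # "simple" URI
]


def parse_parameter(value):
    """Return (lexical_value, type) for a raw parameter value.

    For "RDF" parameters the type is the datatype URI given in the value;
    for everything else it is one of the illustrative labels above.
    """
    match = _RDF.match(value)
    if match:
        return match.group(1), match.group(2)
    for pattern, label in _SAFE_AND_SIMPLE:
        match = pattern.match(value)
        if match:
            return match.group(1), label
    # Nothing matched: treat the whole value as a String.
    return value, "String"
```

For example, `parse_parameter('"3"')` yields a String while `parse_parameter('3')` yields an Integer, which is the distinction the "safe" syntax is meant to provide.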

Parameter Names and Values

For SPARQL:

SPARQL supports named parameters, and parameters in queries can be specified either as ?var or $var. It's a very common convention to use ?var for variables that are meant to be matched and $var for variables that are bound to the query execution. Because of that, using the $ syntax in query string parameters is a common way to pass bound variables on an HTTP URL. There's no reason we shouldn't use that syntax here:

  .../sparql/user/dataset?query=<QUERY>&$var1=<VALUE1>&$var2=<VALUE2>

where <VALUE1> and <VALUE2> are values according to the spec above

For SQL:

SQL only supports positional parameters. Luckily, HTTP query parameters have a straightforward way to specify an arbitrary-length sequence of values: simply repeat the same query parameter name, and the multiple instances are treated as a sequence of those values. I'm proposing that we use p as the name of our parameter variable (to keep the URLs nice and short), but we could use param or parameter too:

  .../sql/user/dataset?query=<QUERY>&p=<VALUE1>&p=<VALUE2>

where, again, <VALUE1> and <VALUE2> are values according to the spec above

In both cases (SPARQL and SQL) the way we interpret values is identical. Clearly the values will need to be URL-encoded when actually sent on a URL (as with any value)...
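
As a rough client-side sketch, the two query-string conventions could be built with Python's standard `urllib.parse.urlencode`, which also handles the URL-encoding. The helper names `sparql_query_string` and `sql_query_string` are hypothetical, and the example assumes the proposed `$var` and repeated `p` conventions are adopted as written.

```python
from urllib.parse import urlencode


def sparql_query_string(query, bindings):
    """Encode a SPARQL query plus named bindings as $name=value pairs."""
    pairs = [("query", query)]
    pairs += [("$" + name, value) for name, value in bindings.items()]
    return urlencode(pairs)


def sql_query_string(query, params):
    """Encode a SQL query plus positional parameters as repeated p=value pairs."""
    pairs = [("query", query)]
    pairs += [("p", value) for value in params]  # order is preserved
    return urlencode(pairs)
```

Note that `urlencode` percent-encodes the `$` in parameter names as `%24`; a server that decodes query strings in the usual way sees the name as `$var1` again.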

Upgrade implementation project to swagger-maven-plugin 3.1.5

Not all features of Swagger can be used in this project because the actual API implementation leverages swagger-maven-plugin and is limited by what that can do. Version 3.1.5 was just released and supports useful things, like global consumes and produces settings.

Implement DELETE for datasets

Implement DELETE:/datasets in a way that isn't prone to accidental deletes, by either:

Controlling access with a different token scope
Implementing it as a soft delete
Both

Improve error message for tag constraint violation

Error messages when tag constraints are violated are not helpful at all.
Ensure that message is sufficient for users to understand how to correct invalid input.

Return projectDataset flag in DatasetSummaryResponse

DatasetSummaryResponse should include a projectDataset flag indicating that the dataset in question is the default dataset for a data project.

Add examples for Projects requests/responses

Allow data dictionary updates via API

Add description for properties of ProjectSummaryResponse

Introduce new response type for queries (JSON stream)

Query endpoints should be able to produce a stream of JSON rows, so that clients can process them more efficiently. Ideally, the header (first rows) should be a table schema, complete with column names, types and descriptions.
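
To illustrate how a client might consume such a stream, here is a small Python sketch. The wire format is an assumption made for the example (one JSON document per line, with the first line carrying the table schema); the issue does not specify it, and the function name `read_json_stream` is hypothetical.

```python
import json


def read_json_stream(lines):
    """Split a line-delimited JSON stream into (schema, row iterator).

    Assumes the first line is the table schema and every following
    non-blank line is one row. Rows are decoded lazily.
    """
    iterator = iter(lines)
    schema = json.loads(next(iterator))
    rows = (json.loads(line) for line in iterator if line.strip())
    return schema, rows
```

Because the rows are produced by a generator, a client can start processing the first row before the rest of the response has arrived.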

Create docs section about API clients (existing + swagger codegen)

Introduce page highlighting the fact that clients exist for R and Python and that swagger codegen can be used to generate clients for multiple programming languages.

Improve consistency of responses when rate limiting clients

It seems that under heavy load the API may return different response codes and response payloads (blank, HTML, JSON, etc).

Make sure that load-balancer, web server and app are configured in a compatible fashion.

OAuth /authorize call returning HTTP 500 for invalid input

Currently, for example, if client_id or redirect_uri are invalid, app returns HTTP 500.
Instead, app should return HTTP 400.

Improve OAuth documentation

[x] The OAuth documentation says "redirect_uri Optional" for the authorization endpoint, but I get an error saying client_id and redirect_uri are required. I tested the URL from the docs ( https://data.world/oauth/authorize?code=zac4ZV2XbleQ2e&client_id=3MVG9lKcPoNINVB&client_secret=3iQF9BsWEr6nCf&grant_type=authorization_code ) and got the error {"error":"invalid_request","message":"A client_id and redirect_uri are required"}
[x] Say in the documentation how to obtain client_id and client_secret
[x] Change redirect_url to redirect_uri
[ ] Make code samples for various languages
[ ] Explain what to do if HTML for the login page is returned to you
[ ] Mention that if there's a hash in either client_id or client_secret, you need to change the # to %23
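
The last item in the list above is ordinary percent-encoding. A minimal Python sketch using the standard library follows; the helper name `encode_credential` and the sample value are made up for illustration.

```python
from urllib.parse import quote


def encode_credential(value):
    """Percent-encode a credential for use in a URL query string.

    safe="" makes quote() encode every reserved character, so "#"
    becomes "%23" instead of starting a URL fragment.
    """
    return quote(value, safe="")


# Hypothetical client_id containing a hash character.
print(encode_credential("abc#123"))  # abc%23123
```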

Set up CORS

Consolidate query & download endpoints under api.data.world domain

Review and update/add request and response examples

When Accept header not specified, SQL endpoint should return application/json

Send a custom header & value

Suggestions from @bryonjacob:

For every account, generate an OUTBOUND_AUTH_TOKEN. This can be a version 4 (random) UUID. Store it on the agent record.
Show it to the user on /settings/advanced, right next to their API token(s). Maybe give them the ability to reset it.
On every outbound "webhook" request generated by a user, send that value in a custom header.
The receiving user can use that to check whether the request is really coming from us (and will ignore the header if they don't know what it is).
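
A minimal Python sketch of both sides of this suggestion is shown below. The header name `X-DW-Outbound-Auth-Token` and the function names are hypothetical; the suggestion does not name the header. The receiver compares tokens with `hmac.compare_digest` to avoid leaking information through timing, which is a design choice of this example rather than part of the suggestion.

```python
import hmac
import uuid

# Hypothetical header name, chosen only for this example.
HEADER_NAME = "X-DW-Outbound-Auth-Token"


def generate_outbound_auth_token():
    """Sender side: create a random (version 4) UUID to store per account."""
    return str(uuid.uuid4())


def is_request_from_sender(headers, expected_token):
    """Receiver side: check the custom header against the known token."""
    received = headers.get(HEADER_NAME, "")
    # Compare as bytes in constant time; a missing header never matches.
    return hmac.compare_digest(received.encode("utf-8"),
                               expected_token.encode("utf-8"))
```

A receiver that does not know about the header can simply ignore it, as the suggestion notes.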