
roapi's Introduction

ROAPI


ROAPI automatically spins up read-only APIs for static datasets without requiring you to write a single line of code. It builds on top of Apache Arrow and Datafusion. The core of its design can be boiled down to the following:

  • Query frontends to translate SQL, FlightSQL, GraphQL and REST API queries into Datafusion plans.
  • Datafusion for query plan execution.
  • Data layer to load datasets from a variety of sources and formats with automatic schema inference.
  • Response encoding layer to serialize intermediate Arrow record batches into various formats requested by the client.

See below for a high level diagram:

[roapi design diagram]

Installation

Install pre-built binary

# if you are using homebrew
brew install roapi
# or if you prefer pip
pip install roapi

Check out the GitHub release page for pre-built binaries for each platform. Pre-built Docker images are also available at ghcr.io/roapi/roapi.

Install from source

cargo install --locked --git https://github.com/roapi/roapi --branch main --bins roapi

Usage

Quick start

Spin up APIs for test_data/uk_cities_with_headers.csv and test_data/spacex_launches.json:

roapi \
    --table "uk_cities=test_data/uk_cities_with_headers.csv" \
    --table "test_data/spacex_launches.json"

On Windows, the full scheme (file:// or filesystem://) must be provided, and double quotes (") must be used instead of single quotes (') to work around Windows command-line quoting limitations:

roapi \
    --table "uk_cities=file://d:/path/to/uk_cities_with_headers.csv" \
    --table "file://d:/path/to/test_data/spacex_launches.json"

Or using docker:

docker run -t --rm -p 8080:8080 ghcr.io/roapi/roapi:latest --addr-http 0.0.0.0:8080 \
    --table "uk_cities=test_data/uk_cities_with_headers.csv" \
    --table "test_data/spacex_launches.json"

For MySQL and SQLite, pass --table parameters like these:

--table "table_name=mysql://username:password@localhost:3306/database"
--table "table_name=sqlite://path/to/database"

Want to register data dynamically? Add the -d flag to the command. The --table parameter cannot be omitted for now.

roapi \
    --table "uk_cities=test_data/uk_cities_with_headers.csv" \
    -d

Then POST the table config to /api/table to register the data.

curl -X POST http://172.24.16.1:8080/api/table \
     -H 'Content-Type: application/json' \
     -d '[
       {
         "tableName": "uk_cities2",
         "uri": "./test_data/uk_cities_with_headers.csv"
       },
       {
         "tableName": "table_name",
         "uri": "sqlite://path/to/database"
       }
     ]'
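
Once registered, the new table can be queried like any other; for example, against the uk_cities2 table created above:

curl -X POST -d "SELECT city, lat, lng FROM uk_cities2 LIMIT 2" localhost:8080/api/sql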

Query tables using SQL, GraphQL or REST:

curl -X POST -d "SELECT city, lat, lng FROM uk_cities LIMIT 2" localhost:8080/api/sql
curl -X POST -d "query { uk_cities(limit: 2) {city, lat, lng} }" localhost:8080/api/graphql
curl "localhost:8080/api/tables/uk_cities?columns=city,lat,lng&limit=2"

Get inferred schema for all tables:

curl 'localhost:8080/api/schema'

Config file

You can also configure multiple table sources using a YAML or TOML config file, which supports more advanced, format-specific table options:

addr:
  http: 0.0.0.0:8084
  postgres: 0.0.0.0:5433
tables:
  - name: "blogs"
    uri: "test_data/blogs.parquet"

  - name: "ubuntu_ami"
    uri: "test_data/ubuntu-ami.json"
    option:
      format: "json"
      pointer: "/aaData"
      array_encoded: true
    schema:
      columns:
        - name: "zone"
          data_type: "Utf8"
        - name: "name"
          data_type: "Utf8"
        - name: "version"
          data_type: "Utf8"
        - name: "arch"
          data_type: "Utf8"
        - name: "instance_type"
          data_type: "Utf8"
        - name: "release"
          data_type: "Utf8"
        - name: "ami_id"
          data_type: "Utf8"
        - name: "aki_id"
          data_type: "Utf8"

  - name: "spacex_launches"
    uri: "https://api.spacexdata.com/v4/launches"
    option:
      format: "json"

  - name: "github_jobs"
    uri: "https://web.archive.org/web/20210507025928if_/https://jobs.github.com/positions.json"

To serve tables using a config file:

roapi -c ./roapi.yml # or .toml
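
For reference, the same kind of configuration can also be written in TOML. Below is a minimal sketch mirroring two of the YAML tables above; treat the exact field layout as an assumption and check the config documentation for the authoritative schema:

# roapi.toml -- sketch only; field names mirror the YAML example above
[addr]
http = "0.0.0.0:8084"

[[tables]]
name = "blogs"
uri = "test_data/blogs.parquet"

[[tables]]
name = "spacex_launches"
uri = "https://api.spacexdata.com/v4/launches"
option = { format = "json" }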

See config documentation for more options including using Google spreadsheet as a table source.

Response serialization

By default, ROAPI encodes responses in JSON format, but you can request different encodings by specifying the ACCEPT header:

curl -X POST \
    -H 'ACCEPT: application/vnd.apache.arrow.stream' \
    -d "SELECT launch_library_id FROM spacex_launches WHERE launch_library_id IS NOT NULL" \
    localhost:8080/api/sql
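
Other encodings can be requested the same way. For example, one of the issue reports below requests CSV output with an Accept: application/csv header; a similar query against the quick-start data might look like this (assuming the spacex_launches table is registered):

# assumes the spacex_launches table from the quick start
curl -X POST \
    -H 'ACCEPT: application/csv' \
    -d "SELECT name FROM spacex_launches LIMIT 3" \
    localhost:8080/api/sql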

REST API query interface

You can query tables through the REST API by sending GET requests to /api/tables/{table_name}. Query operators are specified as query params.

REST query frontend currently supports the following query operators:

  • columns
  • sort
  • limit
  • filter

To sort column col1 in ascending order and col2 in descending order, set query param to: sort=col1,-col2.

To find all rows with col1 equal to string 'foo', set query param to: filter[col1]='foo'. You can also do basic comparisons with filters, for example the predicate 0 <= col2 < 5 can be expressed as filter[col2]gte=0&filter[col2]lt=5.
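
Putting these operators together, a single request can combine projection, filtering, sorting and a limit. A sketch using the uk_cities table from the quick start (assuming lat is a numeric column):

curl "localhost:8080/api/tables/uk_cities?columns=city,lat,lng&filter[lat]gte=52&sort=-lat&limit=3"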

GraphQL query interface

To query tables using GraphQL, send the query through POST request to /api/graphql endpoint.

The GraphQL query frontend supports the same set of operators supported by the REST query frontend. Here is how you can apply various operators in a query:

{
  table_name(
    filter: { col1: false, col2: { gteq: 4, lt: 1000 } }
    sort: [{ field: "col2", order: "desc" }, { field: "col3" }]
    limit: 100
  ) {
    col1
    col2
    col3
  }
}
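
As with SQL, the GraphQL query is sent as the POST body. For example, combining sort and limit against the uk_cities table from the quick start:

# assumes the uk_cities table from the quick start
curl -X POST \
    -d 'query { uk_cities(sort: [{ field: "lat", order: "desc" }], limit: 2) { city, lat, lng } }' \
    localhost:8080/api/graphql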

SQL query interface

To query tables using a subset of standard SQL, send the query through POST request to /api/sql endpoint. This is the only query interface that supports table joins.
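
For example, assuming two registered tables cities and countries that share a key (hypothetical tables here; the same setup appears in one of the issue reports below):

# hypothetical cities/countries tables with a COUNTRY_ID -> ID relationship
curl -X POST \
    -d "SELECT c.CITY, co.COUNTRY FROM cities AS c JOIN countries AS co ON c.COUNTRY_ID = co.ID" \
    localhost:8080/api/sql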

Key value lookup

You can pick two columns from a table to use as key and value to create a quick key-value store API by adding the following lines to the config:

kvstores:
  - name: "launch_name"
    uri: "test_data/spacex_launches.json"
    key: id
    value: name

Key value lookup can be done through simple HTTP GET requests:

curl -v localhost:8080/api/kv/launch_name/600f9a8d8f798e2a4d5f979e
Starlink-21 (v1.0)%

Query through Postgres wire protocol

ROAPI can present itself as a Postgres server so users can use Postgres clients to issue SQL queries.

$ psql -h 127.0.0.1
psql (12.10 (Ubuntu 12.10-0ubuntu0.20.04.1), server 13)
WARNING: psql major version 12, server major version 13.
         Some psql features might not work.
Type "help" for help.

houqp=> select count(*) from uk_cities;
 COUNT(UInt8(1))
-----------------
              37
(1 row)
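
If the Postgres listener is bound to a non-default port in the config (for example 0.0.0.0:5433 as in the YAML example above), pass the port to the client explicitly:

psql -h 127.0.0.1 -p 5433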

Features

Query layer:

  • REST API GET
  • GraphQL
  • SQL
  • join between tables
  • access to array elements by index
  • access to nested struct fields by key
  • column index
  • protocol
    • Postgres
    • FlightSQL
  • Key value lookup

Response serialization:

  • JSON application/json
  • Arrow application/vnd.apache.arrow.stream
  • Parquet application/vnd.apache.parquet
  • msgpack

Data layer:

Misc:

  • auto gen OpenAPI doc for rest layer
  • query input type conversion based on table schema
  • stream arrow encoding response
  • authentication layer

Development

The core of ROAPI, including query front-ends and data layer, lives in the self-contained columnq crate. It takes queries and outputs Arrow record batches. Data sources will also be loaded and stored in memory as Arrow record batches.
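
The columnq crate also ships a standalone CLI with an interactive console (it shows up in several of the issue reports below); a quick local sanity check might look like:

# assumes the repo's test data is available locally
columnq console --table "uk_cities=test_data/uk_cities_with_headers.csv"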

The roapi crate wraps columnq with a multi-protocol query layer. It serializes Arrow record batches produced by columnq into different formats based on client request.

Building ROAPI with SIMD optimization requires the nightly Rust toolchain.

Debug

To log all FlightSQL requests in the console, set RUST_LOG=tower_http=trace.
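
For example, prefix the environment variable to the quick-start command:

RUST_LOG=tower_http=trace roapi \
    --table "uk_cities=test_data/uk_cities_with_headers.csv"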

Build Docker image

docker build --rm -t ghcr.io/roapi/roapi:latest .

VS Code DevContainer

Requirements

  • VS Code
  • Ensure the ms-vscode-remote.remote-containers extension is installed in VS Code

Once done, you will see a prompt on the left to reopen the project in a dev container; alternatively, open the command palette and search for "open with remote container". Then:

  1. Install dependencies:
apt-get update && apt-get install --no-install-recommends -y cmake
  2. Connect to the database from your local machine using the DB client of your choice, with the following credentials:
username: user
password: user
database: test

Once connected, create a table so you can map it with the -t argument, or use the sample in .devcontainer/db-migration.sql to populate some tables with data.

  3. Run the cargo command with the MySQL database feature enabled:
cargo run --bin roapi --features database -- -a localhost:8080 -t posts=mysql://user:user@db:3306/test

If you are looking for other features, select the appropriate one from roapi/Cargo.toml.

roapi's People

Contributors

afterthought, alperyilmaz, clemherreman, dandandan, dispanser, ekroon, elliot14a, geoheil, hibuz, holicc, houqp, jeankhawand, jimexist, jychen7, lguzzon, maks-d, messense, mschmo, novemberkilo, prince-mishra, ralfnorthman, rgieseke, ryanrussell, tiphaineruy, tobyhede, togami2864, whatrocks, xezpeleta, yfaming, zemelleong


roapi's Issues

JSON only parses 8192 rows?

I'm looking at the dataset located here: https://covid19.who.int/who-data/countries-by-day.json

It takes a little coercing (keys are a combination of the dimensions and metrics values, the data are in rows) but I've turned it into a CSV file and loaded it into roapi and all is good, 178,935 rows in the resulting API.

However, having similarly converted it to a more conventional JSON structure, roapi doesn't seem to serve all the records, only holding 8,192 rows (I'd presume this was an error in the conversion were it not for the fact that this is the batch_size used internally).

Multi-node deployment

Hi this tool looks awesome! Do you have any experience/tutorial on how to run roapi on multiple nodes?

Add RFC7231 Accept header parsing

I would like to use ROAPI endpoints from a browser. Currently, when accessing the /api/tables/{table_name} endpoint with a browser, you get an error like the following:

{ "code": 400, "error": "unsupported_content_type", "message": "\"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\" is not a supported response content type" }

How to parse the Accept header is specified in RFC7231, and Actix has an implementation for this. How to implement it would be impacted by #76, and if that were implemented it might make switching harder.

An easy quick fix would be to implement a fallback to application/json if the resulting content type is unknown.

Selecting files with Glob pattern / regexp when registering a table

Not sure if it could be interesting, but:

When registering a table:

addr: 0.0.0.0:8084
tables:
  - name: "example"
    uri: "/data/"
    option:
      format: "parquet"
      use_memory_table: false

add in options a glob pattern:

pattern: "file_typev1*.parquet"

or a regexp:

pattern: "\wfile_type\wv1\w*.parquet"

This would allow selecting URIs with different extensions / schema versions.

Expecting qualified names

Hi all,

I'm using the version 0.4.4 and the REST API and was able to successfully use roapi using a projection/sort/limit pattern with the following call:
curl -v "http://127.0.0.1:8084/api/tables/cities?&columns=City,State&sort=LatD&limit=3"

Unfortunately, when I run more than once this call, I'm starting to receive such messages: {"code":400,"error":"query_execution","message":"Failed to execute query: Error during planning: No field named 'cities.LatD'. Valid fields are 'cities.City', 'cities.State'."}* Connection #0 to host 127.0.0.1 left intact

Moreover, I can't successfully qualify the names of the columns; I tried many variations but without success:
curl -v "http://127.0.0.1:8084/api/tables/cities?&sort=LatD&columns=cities.City,cities.State&limit=3" which is misleading regarding that the error statement is claiming that the expected fields are 'cities.City', 'cities.State'.

Any direction will be welcome.

roapi-http can not work on windows

Running roapi-http.exe on Windows with a simple:

roapi-http.exe --table 'tresult=d:/data/out_result.csv'

gives the error:

unsupported extension in uri: d:/data/out_result.csv

Importing parquet file produced by odbc2parquet fails silently with "Good bye!"

Hello,

When I create parquet files using odbc2parquet, columnq cannot read them and fails silently. These same files are able to be read without issue in duckdb and PowerBI. If I import the parquet file into duckdb and export it again with duckdb's default parquet export compression format, columnq generates the error: Error: Error loading Parquet: Failed to create file reader: Parquet error: Invalid Parquet file. Corrupt footer. If I export from duckdb explicitly setting it to use snappy compression, columnq can import that parquet file successfully. I am using the latest version of columnq 0.2.1

Regards,
Chris

PS C:\utils\database\odbc\parquet> odbc2parquet.exe query --column-compression-default snappy --dsn CH files_snappy.parquet "select * from files"

PS C:\utils\database\odbc\parquet> columnq --version
Columnq 0.2.1
PS C:\utils\database\odbc\parquet> columnq console --table "files=files_zstd.parquet"
Good bye!
PS C:\utils\database\odbc\parquet> columnq console --table "files=files_snappy.parquet"
Good bye!
PS C:\utils\database\odbc\parquet> columnq console --table "files=files.parquet"  (this one is gzip)
Good bye!

Error reading Parquet files from S3 with a space in the path.

Hi, I had a problem loading a parquet file from S3 when there's a space in the path. I tried with %20 but it doesn't work either.

Example path:

s3://my-bucket/trusted/receita_socios/version=2021-10-30 09:00:56/my-file.parquet

Error message:

Error: Error loading Parquet: no parquet file found

The same file without space in the path works fine.

Azure Data Lake Gen 2 support

Delta-rs (and soon, datafusion) will support ADL2, and I think it would be great to also have this in ROAPI.

If I had to implement it, I'd have a closer look at the implementation in delta-rs but I'm open to other suggestions.

ReadMe described installation on Windows fails

Hi,

I tried to install roapi according to the ReadMe and the following error occurred:

[screenshot of the error]

Operating System: Windows 10.0.18363

I'm not fluent in Rust and just wanted to test this cool project. So, maybe this is something that an additional command parameter could solve? Would be great if information regarding this problem could be added to the ReadMe. Thanks!

Alias issues on join

I have a simple setup with the following configuration and tables:

roapi.yml

addr: 0.0.0.0:8080
tables:
  - name: "countries"
    uri: "countries.csv"

  - name: "cities"
    uri: "cities.csv"

countries.csv

ID,COUNTRY
1,Germany
2,Sweden
3,Japan

cities.csv

ID,CITY,COUNTRY_ID
1,Hamburg,1
2,Stockholm,2
3,Osaka,3
4,Berlin,1
5,Göteborg,2
6,Tokyo,3
7,Kyoto,3

I can then perform the following JOIN request without any issues:

$ curl -X POST -H 'Accept: application/csv' -d "SELECT c1.CITY, c2.COUNTRY FROM cities AS c1 JOIN countries AS c2 ON c1.COUNTRY_ID = c2.ID" http://127.0.0.1:8080/api/sql

CITY,COUNTRY
Stockholm,Sweden
Göteborg,Sweden
Osaka,Japan
Tokyo,Japan
Kyoto,Japan
Hamburg,Germany
Berlin,Germany

But if I then try to additionally select the city ID, things don't go so well:

$ curl -X POST -H 'Accept: application/json' -d "SELECT c1.ID, c1.CITY, c2.COUNTRY FROM cities AS c1 JOIN countries AS c2 ON c1.COUNTRY_ID = c2.ID" http://127.0.0.1:8080/api/sql

{"code":400,"error":"query_execution","message":"Failed to execute query: Error during planning: The left schema and the right schema have the following columns with the same name without being on the ON statement: {Column { name: \"ID\", index: 0 }}. Consider aliasing them."}%

To me this seems as if the aliasing is not working as expected. I would expect the query engine to be able to distinguish between those two fields since I've aliased the two tables (c1 and c2 respectively). Using e.g. SQLite this query works as expected:

sqlite> SELECT c1.ID, c1.CITY, c2.COUNTRY FROM cities AS c1 JOIN countries AS c2 ON c1.COUNTRY_ID = c2.ID;
1|Hamburg|Germany
2|Stockholm|Sweden
3|Osaka|Japan
4|Berlin|Germany
5|Göteborg|Sweden
6|Tokyo|Japan
7|Kyoto|Japan

Tested on 0.3.4 and main+gf72299f.

GET and POST verbs

Is there a reason the GraphQL and SQL endpoints need to communicate via POST and not GET, like the REST endpoint allows?

Timestamp issues

First of all, thanks for including the Date32 in the latest commit.

I have checked out the code and built it with --release (btw, you might want to update the readme, the cargo command doesn't work because now you have two targets, roapi and columnq). Ubuntu 18, Rust updated to latest stable.

I have a few parquet files in which datetime is stored in various formats, from Scala: Timestamp, DateTime, Long.
If I load the ones with Timestamp and DateTime, the result is the same for select * from tbl limit 1:

[2021-04-12T06:43:04Z INFO  actix_web::middleware::logger] 127.0.0.1:33584 "POST /api/sql HTTP/1.1" 200 29 "-" "curl/7.58.0" 0.012752
thread 'actix-rt|system:0|arbiter:1' panicked at 'Unsupported datatype: Timestamp(
    Nanosecond,
    None,
)', /root/.cargo/git/checkouts/arrow-7d34e669e95701bb/cfec7bc/rust/arrow/src/json/writer.rs:326:13
stack backtrace:
   0: rust_begin_unwind
             at /rustc/673d0db5e393e9c64897005b470bfeb6d5aec61b/library/std/src/panicking.rs:493:5
   1: std::panicking::begin_panic_fmt
             at /rustc/673d0db5e393e9c64897005b470bfeb6d5aec61b/library/std/src/panicking.rs:435:5
   2: arrow::json::writer::set_column_for_json_rows
   3: arrow::json::writer::record_batches_to_json_rows
   4: roapi_http::encoding::json::record_batches_to_bytes
   5: roapi_http::api::encode_record_batches

Now, I was thinking that I can live with Long values, and use the to_timestamp function. But I get the same error. Here's an example from the output:

curl -X POST -d "select to_timestamp('2020-09-08T12:00:00+00:00') from tbl limit 1" localhost:8091/api/sql
[2021-04-12T06:58:35Z INFO  actix_web::middleware::logger] 127.0.0.1:33678 "POST /api/sql HTTP/1.1" 400 170 "-" "curl/7.58.0" 0.003689
thread 'actix-rt|system:0|arbiter:4' panicked at 'Unsupported datatype: Timestamp(
    Nanosecond,
    None,
)', /root/.cargo/git/checkouts/arrow-7d34e669e95701bb/cfec7bc/rust/arrow/src/json/writer.rs:326:13
stack backtrace:
   0: rust_begin_unwind
             at /rustc/673d0db5e393e9c64897005b470bfeb6d5aec61b/library/std/src/panicking.rs:493:5
   1: std::panicking::begin_panic_fmt
             at /rustc/673d0db5e393e9c64897005b470bfeb6d5aec61b/library/std/src/panicking.rs:435:5
   2: arrow::json::writer::set_column_for_json_rows
   3: arrow::json::writer::record_batches_to_json_rows
   4: roapi_http::encoding::json::record_batches_to_bytes
   5: roapi_http::api::encode_record_batches

And the value above comes from the Datafusion tests, which are presumably working : https://github.com/apache/arrow/blob/5b08205f7e864ed29f53ed3d836845fed62d5d4a/rust/datafusion/tests/sql.rs#L1768

I am a little puzzled right now. Do you see a way in which I can use timestamps in conjunction with roapi?
Is my understanding correct that, at this moment, I can use Date, but not DateTime or Timestamp?

Error: no CA certificates found

Hi,

When I configure a Google Spreadsheet using the Docker environment, I get the following error:

[2021-11-12T11:35:25Z INFO  roapi_http::api] loading `uri(https://docs.google.com/spreadsheets/d/<redacted>)` as table `<redacted>`
thread 'main' panicked at 'no CA certificates found', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/hyper-rustls-0.22.1/src/connector.rs:45:13

Installing the package ca-certificates with APT in the Docker image fixed the issue.

Thanks

Fields omitted for null values?

Thanks again for fixing #132. I'm still tinkering and noticed that when validating the CSV vs. JSON outputs the record-count didn't seem to tally when I was checking for certain fields.

It seems that if a field has a null value in the input CSV or JSON, that field is omitted from the output rather than being set to null.

For example:

CSV, integer values, empty field:

a
1
""

(CSV, string values don't really have a null value and work as expected.)

JSON, integer values, null field:

[{"a": 1}, {"a": null}]

JSON, string values, null field:

[{"a": "b"}, {"a": null}]

In all of the above cases the a field is dropped from the output in the latter record:

❯ curl --silent "http://127.0.0.1:8080/api/tables/test" 
[{"a":"b"},{}]
❯ curl --silent "http://127.0.0.1:8080/api/tables/test" 
[{"a":1},{}]

Not sure if this is a bug, expected behaviour or an issue with the underlying data parsing…?

optimize IO with bufreader

When loading datasets from the filesystem and network, we should use BufReader instead of the Read trait to reduce system calls. This means the partitions_from_uri functions' signatures need to be changed to accept a callback that takes BufReader instead of Read.

Support for large data sets (avoiding `MemTable`)

Hi,

Related to #57,
I have a use case where I need to visualize a small part (a few thousand rows) of a large dataset (a few hundred TBs) by using a highly selective query that typically only hits a few parquet files out of a large number of partitions. Using Spark, these queries can be served in a few seconds (assuming that the file metadata fits in memory), even on a single-node system.

I've been playing with using datafusion directly, but actually roapi is doing exactly what I need. Is there any design choice / project direction that prevents the various table implementations from returning some dyn TableProvider instead of MemTable?

I understand that roapi supports various cloud storage systems, which are not supported by datafusion yet, but given the recent developments in apache/datafusion#811 there is hopefully a solution coming.

FWIW, I hammered together a small PoC by ripping out the parquet loader and directly registering the parquet reader over at https://github.com/dispanser/roapi/tree/lazy-load-parquet . This approach seems to work, but is definitely not mergeable :-).

A possible implementation could add a configuration flag that flips from the MemTable to the TableProvider code path, supporting both the existing in-memory scenario and a new code path that registers parquet / delta / ... directly.

Unsupported datatype: Date32

I had an error when querying data that has the Date32 type. I checked the DataFusion library and it works with Date32.

thread 'actix-rt|system:0|arbiter:5' panicked at 'Unsupported datatype: Date32', /Users/runner/.cargo/git/checkouts/arrow-7d34e669e95701bb/f7cf157/rust/arrow/src/json/writer.rs:326:13

Reproduce:
Config:

addr: 0.0.0.0:8080
tables:
- name: han_shop
  option:
    format: csv
  uri: ../examples.csv

SQL query:

SELECT name FROM example

CSV data:

period,name,
2021-03-01,Linh,
2021-03-01,Han,

Schema:

{
    "name": "period",
    "data_type": "Date32",
    "nullable": false,
    "dict_id": 0,
    "dict_is_ordered": false
}

Command:

roapi-http -c ./config-1615885382668.yml

HTTP2 Support

I was wondering if the maintainers could add HTTP2 support as a feature request, so it can be worked on later and added to the framework.
Thanks.

Dataset JOIN query support

Datafusion only has partial SQL join support; we need to extend Datafusion to support full-fledged JOIN queries.

Add HTTP2 integration test in CI

To make sure we don't regress and drop support for HTTP2 going forward. The test can just be a simple HTTP2 curl call to roapi.

Failed to load Delta table with partitions

When I tried to load a Delta table with partitions, it threw the following exception:

Error: DeltaTable error: Failed to apply transaction log: Invalid JSON in log record
Caused by:
0: Failed to apply transaction log: Invalid JSON in log record
1: Invalid JSON in log record
2: invalid type: null, expected a string at line 1 column 195

I believe it fails because the table commit log has an entry

{"add":{"path":"file_year=__HIVE_DEFAULT_PARTITION__/file_month=__HIVE_DEFAULT_PARTITION__/part-00000-4cfe8fb8-3905-4ae6-8b44-e7612753064c.c000.snappy.parquet","partitionValues":{"file_year":null,"file_month":null},"size":3331,"modificationTime":1628075908000,"dataChange":true}}

issue compiling roapi-http v0.1.13 could not compile `actix-cors`

Hello,
I tried to compile roapi-http from the source zip.

I got an error when compiling:

cargo install --verbose --path ./roapi-http --bin roapi-http
building
...
...

error[E0053]: method `error_response` has an incompatible type for trait                                                              │
  --> /usr/local/cargo/git/checkouts/actix-extras-3214827614a42f65/ab3bdb6/actix-cors/src/error.rs:53:5                               │
   |                                                                                                                                  │
53 |     fn error_response(&self) -> HttpResponse {                                                                                   │
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected struct `BaseHttpResponse`, found struct `actix_web::HttpResponse`          │
   |                                                                                                                                  │
   = note: expected fn pointer `fn(&CorsError) -> BaseHttpResponse<actix_web::dev::Body>`                                             │
              found fn pointer `fn(&CorsError) -> actix_web::HttpResponse`                                                            │
                                                                                                                                      │
error: aborting due to previous error                                                                                                 │
                                                                                                                                      │
For more information about this error, try `rustc --explain E0053`.                                                                   │
error: could not compile `actix-cors`                                                                                                 │

Unable to load a large Delta table

I am able to load smaller Delta tables but unable to load a larger Delta table (4.2 billion rows across 1K Parquet files). I tried to bump up batch_size but it did not help. Currently in ROAPI, I do not see an option to pre-filter at the Arrow dataset level, and no option to distribute the dataset across multiple nodes (distributed dataset). Any suggestions on how to achieve this in ROAPI? Are these known issues with ROAPI? Please let me know how to fix it. Thanks.

Timestamps in parquet files

I've run into issues reading timestamp columns from parquet files.

The snapshot DataFusion library did not work at all for converting a string to a timestamp in a query,
e.g. select * from parquet_table where date > CAST('2020-01-01T00:00:00.000Z' as Timestamp).
This required upgrading to datafusion 4.0.0, which has now been released.

However there is still an issue reading timestamps from parquet. All of my timestamps are coming through as a 1970 date. It appears that they are being interpreted as timestamp(ns) rather than timestamp(us). Given that ns/nanoseconds is deprecated in parquet, this is very strange. The file was written using pandas and pyarrow, so in theory it is the same underlying implementation.

Support roapi for AWS Lambda

I want to use roapi for serverless but I have to do some tricks. For example, I have to build a Python image and run roapi as a subprocess, then use the REST API to call it. I checked the repo and saw the subproject columnq. So do you have any plan to split columnq into another repo so that it would be easier to use in serverless or to integrate with other backend code?

Why support serverless?
In my company, we have a lot of teams. Each team has different resources, so I want them to be able to take the initiative to select their own resources. When they invoke the serverless function, it will create a YAML config (with their resources) in the Lambda tmp directory, and another subprocess for roapi will be invoked that reads this config.

How to contribute?
If you have a plan to support serverless, then how can I help you to contribute it? I'm also excited about this project.

Docker image:

FROM public.ecr.aws/lambda/python:3.8
WORKDIR /app
COPY . .
RUN pip3 install -r requirements.txt
CMD ["/app/app.hello"]

Check roapi is running:

import subprocess
import time

import requests


def start_roapi(addr, config_file_name):
    process = subprocess.Popen([
        "roapi-http -c {filename}".format(filename=config_file_name)],
        shell=True,
        # stdout=subprocess.PIPE,
        # stderr=subprocess.PIPE
    )

    url = "http://{addr}".format(addr=addr)
    for n in range(1000):
        try:
            time.sleep(0.5)
            res = requests.request("GET", url)
            print("started", res.text)
            break
        except:
            pass

    return process

Then, I have to query localhost:

url = "http://{addr}{path}?{qs}".format(addr=addr, path=event["rawPath"], qs=event["rawQueryString"])
response = requests.request(method, url, headers=event["headers"], data=data)

Docker image not working

Hi,

I'm just trying roapi.

Docker image with the latest tag (and version >=0.4.3) is not working. It ends with the exit code 132 and no error message.

$ docker run -t --rm -p 8081:8080 ghcr.io/roapi/roapi-http:v0.4.3 --addr 0.0.0.0:8080 --table "uk_cities=test_data/uk_cities_with_headers.csv" --table "test_data/spacex_launches.json"
$ echo $?
132

It works ok with v0.4.2

$ docker run -t --rm -p 8081:8080 ghcr.io/roapi/roapi-http:v0.4.2 --addr 0.0.0.0:8080 --table "uk_cities=test_data/uk_cities_with_headers.csv" --table "test_data/spacex_launches.json"
[2021-11-11T12:31:49Z INFO  roapi_http::api] loading `uri(test_data/uk_cities_with_headers.csv)` as table `uk_cities`
[2021-11-11T12:31:49Z INFO  roapi_http::api] registered `uri(test_data/uk_cities_with_headers.csv)` as table `uk_cities`
[2021-11-11T12:31:49Z INFO  roapi_http::api] loading `uri(test_data/spacex_launches.json)` as table `spacex_launches`
[2021-11-11T12:31:50Z INFO  roapi_http::api] registered `uri(test_data/spacex_launches.json)` as table `spacex_launches`
[2021-11-11T12:31:50Z INFO  actix_server::builder] Starting 2 workers
[2021-11-11T12:31:50Z INFO  actix_server::builder] Starting "actix-web-service-0.0.0.0:8080" service on 0.0.0.0:8080

Thanks

[feature] elasticsearch support

I think integrating this not only with RDBMSs like Postgres (as seen in the roadmap) but also with Elasticsearch would be quite awesome.
