
roapi's Introduction

ROAPI


ROAPI automatically spins up read-only APIs for static datasets without requiring you to write a single line of code. It builds on top of Apache Arrow and Datafusion. The core of its design can be boiled down to the following:

  • Query frontends to translate SQL, FlightSQL, GraphQL and REST API queries into Datafusion plans.
  • Datafusion for query plan execution.
  • Data layer to load datasets from a variety of sources and formats with automatic schema inference.
  • Response encoding layer to serialize intermediate Arrow record batches into various formats requested by the client.

See below for a high level diagram:

[roapi design diagram]

Installation

Install pre-built binary

# if you are using homebrew
brew install roapi
# or if you prefer pip
pip install roapi

Check out the GitHub release page for pre-built binaries for each platform. Pre-built Docker images are also available at ghcr.io/roapi/roapi.

Install from source

cargo install --locked --git https://github.com/roapi/roapi --branch main --bins roapi

Usage

Quick start

Spin up APIs for test_data/uk_cities_with_headers.csv and test_data/spacex_launches.json:

roapi \
    --table "uk_cities=test_data/uk_cities_with_headers.csv" \
    --table "test_data/spacex_launches.json"

On Windows, the full scheme (file:// or filesystem://) must be provided, and double quotes (") must be used instead of single quotes (') to work around Windows command-line quoting limitations:

roapi \
    --table "uk_cities=file://d:/path/to/uk_cities_with_headers.csv" \
    --table "file://d:/path/to/test_data/spacex_launches.json"

Or using docker:

docker run -t --rm -p 8080:8080 ghcr.io/roapi/roapi:latest --addr-http 0.0.0.0:8080 \
    --table "uk_cities=test_data/uk_cities_with_headers.csv" \
    --table "test_data/spacex_launches.json"

For MySQL and SQLite, pass --table parameters like these:

--table "table_name=mysql://username:password@localhost:3306/database"
--table "table_name=sqlite://path/to/database"

Want to register data dynamically? Add the -d flag to the command. The --table parameter cannot be omitted for now.

roapi \
    --table "uk_cities=test_data/uk_cities_with_headers.csv" \
    -d

Then POST the table config to /api/table to register the data.

curl -X POST http://172.24.16.1:8080/api/table \
     -H 'Content-Type: application/json' \
     -d '[
       {
         "tableName": "uk_cities2",
         "uri": "./test_data/uk_cities_with_headers.csv"
       },
       {
         "tableName": "table_name",
         "uri": "sqlite://path/to/database"
       }
     ]'
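
Once registered, the new table can be queried like any other; for example, against the uk_cities2 table created above:

curl -X POST -d "SELECT city, lat, lng FROM uk_cities2 LIMIT 2" localhost:8080/api/sql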

Query tables using SQL, GraphQL or REST:

curl -X POST -d "SELECT city, lat, lng FROM uk_cities LIMIT 2" localhost:8080/api/sql
curl -X POST -d "query { uk_cities(limit: 2) {city, lat, lng} }" localhost:8080/api/graphql
curl "localhost:8080/api/tables/uk_cities?columns=city,lat,lng&limit=2"

Get inferred schema for all tables:

curl 'localhost:8080/api/schema'

Config file

You can also configure multiple table sources using a YAML or TOML config file, which supports more advanced, format-specific table options:

addr:
  http: 0.0.0.0:8084
  postgres: 0.0.0.0:5433
tables:
  - name: "blogs"
    uri: "test_data/blogs.parquet"

  - name: "ubuntu_ami"
    uri: "test_data/ubuntu-ami.json"
    option:
      format: "json"
      pointer: "/aaData"
      array_encoded: true
    schema:
      columns:
        - name: "zone"
          data_type: "Utf8"
        - name: "name"
          data_type: "Utf8"
        - name: "version"
          data_type: "Utf8"
        - name: "arch"
          data_type: "Utf8"
        - name: "instance_type"
          data_type: "Utf8"
        - name: "release"
          data_type: "Utf8"
        - name: "ami_id"
          data_type: "Utf8"
        - name: "aki_id"
          data_type: "Utf8"

  - name: "spacex_launches"
    uri: "https://api.spacexdata.com/v4/launches"
    option:
      format: "json"

  - name: "github_jobs"
    uri: "https://web.archive.org/web/20210507025928if_/https://jobs.github.com/positions.json"

To serve tables using a config file:

roapi -c ./roapi.yml # or .toml
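
For reference, the same kind of configuration can also be written in TOML. Below is a minimal sketch mirroring two of the YAML tables above; treat the exact field layout as an assumption and check the config documentation for the authoritative schema:

# roapi.toml -- sketch only; field names mirror the YAML example above
[addr]
http = "0.0.0.0:8084"

[[tables]]
name = "blogs"
uri = "test_data/blogs.parquet"

[[tables]]
name = "spacex_launches"
uri = "https://api.spacexdata.com/v4/launches"
option = { format = "json" }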

See config documentation for more options including using Google spreadsheet as a table source.

Response serialization

By default, ROAPI encodes responses in JSON format, but you can request different encodings by specifying the ACCEPT header:

curl -X POST \
    -H 'ACCEPT: application/vnd.apache.arrow.stream' \
    -d "SELECT launch_library_id FROM spacex_launches WHERE launch_library_id IS NOT NULL" \
    localhost:8080/api/sql
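
Other encodings can be requested the same way. For example, one of the issue reports below requests CSV output with an Accept: application/csv header; a similar query against the quick-start data might look like this (assuming the spacex_launches table is registered):

# assumes the spacex_launches table from the quick start
curl -X POST \
    -H 'ACCEPT: application/csv' \
    -d "SELECT name FROM spacex_launches LIMIT 3" \
    localhost:8080/api/sql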

REST API query interface

You can query tables through the REST API by sending GET requests to /api/tables/{table_name}. Query operators are specified as query params.

REST query frontend currently supports the following query operators:

  • columns
  • sort
  • limit
  • filter

To sort column col1 in ascending order and col2 in descending order, set query param to: sort=col1,-col2.

To find all rows with col1 equal to string 'foo', set query param to: filter[col1]='foo'. You can also do basic comparisons with filters, for example the predicate 0 <= col2 < 5 can be expressed as filter[col2]gte=0&filter[col2]lt=5.
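
Putting these operators together, a single request can combine projection, filtering, sorting and a limit. A sketch using the uk_cities table from the quick start (assuming lat is a numeric column):

curl "localhost:8080/api/tables/uk_cities?columns=city,lat,lng&filter[lat]gte=52&sort=-lat&limit=3"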

GraphQL query interface

To query tables using GraphQL, send the query through POST request to /api/graphql endpoint.

The GraphQL query frontend supports the same set of operators supported by the REST query frontend. Here is how you can apply various operators in a query:

{
  table_name(
    filter: { col1: false, col2: { gteq: 4, lt: 1000 } }
    sort: [{ field: "col2", order: "desc" }, { field: "col3" }]
    limit: 100
  ) {
    col1
    col2
    col3
  }
}
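
As with SQL, the GraphQL query is sent as the POST body. For example, combining sort and limit against the uk_cities table from the quick start:

# assumes the uk_cities table from the quick start
curl -X POST \
    -d 'query { uk_cities(sort: [{ field: "lat", order: "desc" }], limit: 2) { city, lat, lng } }' \
    localhost:8080/api/graphql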

SQL query interface

To query tables using a subset of standard SQL, send the query through POST request to /api/sql endpoint. This is the only query interface that supports table joins.
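
For example, assuming two registered tables cities and countries that share a key (hypothetical tables here; the same setup appears in one of the issue reports below):

# hypothetical cities/countries tables with a COUNTRY_ID -> ID relationship
curl -X POST \
    -d "SELECT c.CITY, co.COUNTRY FROM cities AS c JOIN countries AS co ON c.COUNTRY_ID = co.ID" \
    localhost:8080/api/sql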

Key value lookup

You can pick two columns from a table to use as key and value to create a quick key-value store API by adding the following lines to the config:

kvstores:
  - name: "launch_name"
    uri: "test_data/spacex_launches.json"
    key: id
    value: name

Key value lookup can be done through simple HTTP GET requests:

curl -v localhost:8080/api/kv/launch_name/600f9a8d8f798e2a4d5f979e
Starlink-21 (v1.0)%

Query through Postgres wire protocol

ROAPI can present itself as a Postgres server so users can use Postgres clients to issue SQL queries.

$ psql -h 127.0.0.1
psql (12.10 (Ubuntu 12.10-0ubuntu0.20.04.1), server 13)
WARNING: psql major version 12, server major version 13.
         Some psql features might not work.
Type "help" for help.

houqp=> select count(*) from uk_cities;
 COUNT(UInt8(1))
-----------------
              37
(1 row)
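
If the Postgres listener is bound to a non-default port in the config (for example 0.0.0.0:5433 as in the YAML example above), pass the port to the client explicitly:

psql -h 127.0.0.1 -p 5433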

Features

Query layer:

  • REST API GET
  • GraphQL
  • SQL
  • join between tables
  • access to array elements by index
  • access to nested struct fields by key
  • column index
  • protocol
    • Postgres
    • FlightSQL
  • Key value lookup

Response serialization:

  • JSON application/json
  • Arrow application/vnd.apache.arrow.stream
  • Parquet application/vnd.apache.parquet
  • msgpack

Data layer:

Misc:

  • auto gen OpenAPI doc for rest layer
  • query input type conversion based on table schema
  • stream arrow encoding response
  • authentication layer

Development

The core of ROAPI, including query front-ends and data layer, lives in the self-contained columnq crate. It takes queries and outputs Arrow record batches. Data sources will also be loaded and stored in memory as Arrow record batches.
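
The columnq crate also ships a standalone CLI with an interactive console (it shows up in several of the issue reports below); a quick local sanity check might look like:

# assumes the repo's test data is available locally
columnq console --table "uk_cities=test_data/uk_cities_with_headers.csv"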

The roapi crate wraps columnq with a multi-protocol query layer. It serializes Arrow record batches produced by columnq into different formats based on client request.

Building ROAPI with SIMD optimization requires the nightly Rust toolchain.

Debug

To log all FlightSQL requests in the console, set RUST_LOG=tower_http=trace.
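
For example, prefix the environment variable to the quick-start command:

RUST_LOG=tower_http=trace roapi \
    --table "uk_cities=test_data/uk_cities_with_headers.csv"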

Build Docker image

docker build --rm -t ghcr.io/roapi/roapi:latest .

VS Code DevContainer

Requirements

  • VS Code
  • Ensure the ms-vscode-remote.remote-containers extension is installed in VS Code

Once done, you will see a prompt on the left to reopen the project in a dev container; alternatively, open the command palette and search for "open with remote container". Then:

  1. Install dependencies:
apt-get update && apt-get install --no-install-recommends -y cmake
  2. Connect to the database from your local machine using the DB client of your choice, with the following credentials:
username: user
password: user
database: test

Once connected, create a table so you can map it with the -t argument, or use the sample in .devcontainer/db-migration.sql to populate some tables with data.

  3. Run the cargo command with the MySQL database feature enabled:
cargo run --bin roapi --features database -- -a localhost:8080 -t posts=mysql://user:user@db:3306/test

If you are looking for other features, select the appropriate one from roapi/Cargo.toml.

roapi's People

Contributors

afterthought, alperyilmaz, clemherreman, dandandan, dispanser, ekroon, elliot14a, geoheil, hibuz, holicc, houqp, jeankhawand, jimexist, jychen7, lguzzon, maks-d, messense, mschmo, novemberkilo, prince-mishra, ralfnorthman, rgieseke, ryanrussell, tiphaineruy, tobyhede, togami2864, whatrocks, xezpeleta, yfaming, zemelleong


roapi's Issues

JSON only parses 8192 rows?

I'm looking at the dataset located here: https://covid19.who.int/who-data/countries-by-day.json

It takes a little coercing (keys are a combination of the dimensions and metrics values, the data are in rows) but I've turned it into a CSV file and loaded it into roapi and all is good, 178,935 rows in the resulting API.

However, having similarly converted it to a more conventional JSON structure, roapi doesn't seem to serve all the records, only holding 8,192 rows (I'd presume this was an error in the conversion were it not for the fact that this is the batch_size used internally).

Multi-node deployment

Hi this tool looks awesome! Do you have any experience/tutorial on how to run roapi on multiple nodes?

Add RFC7231 Accept header parsing

I would like to use ROAPI endpoints from a browser. Currently, when accessing the /api/tables/{table_name} endpoint with a browser, you get an error like the following:

{ "code": 400, "error": "unsupported_content_type", "message": "\"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\" is not a supported response content type" }

How to parse the Accept header is specified in RFC7231, and Actix has an implementation for this. How to implement it would be impacted by #76, and if that were implemented it might make switching harder.

An easy quick fix would be to implement a fallback to application/json if the resulting content type is unknown.

Selecting files with Glob pattern / regexp when registering a table

Not sure if it could be interesting, but:

When registering a table:

addr: 0.0.0.0:8084
tables:
  - name: "example"
    uri: "/data/"
    option:
      format: "parquet"
      use_memory_table: false

add in options a glob pattern:

pattern: "file_typev1*.parquet"

or a regexp:

pattern: "\wfile_type\wv1\w*.parquet"

This would allow selecting URIs with different extensions / schema versions.

Expecting qualified names

Hi all,

I'm using the version 0.4.4 and the REST API and was able to successfully use roapi using a projection/sort/limit pattern with the following call:
curl -v "http://127.0.0.1:8084/api/tables/cities?&columns=City,State&sort=LatD&limit=3"

Unfortunately, when I run more than once this call, I'm starting to receive such messages: {"code":400,"error":"query_execution","message":"Failed to execute query: Error during planning: No field named 'cities.LatD'. Valid fields are 'cities.City', 'cities.State'."}* Connection #0 to host 127.0.0.1 left intact

Moreover, I can't successfully qualify the names of the columns; I tried many variations but without success:
curl -v "http://127.0.0.1:8084/api/tables/cities?&sort=LatD&columns=cities.City,cities.State&limit=3" which is misleading regarding that the error statement is claiming that the expected fields are 'cities.City', 'cities.State'.

Any direction will be welcome.

roapi-http can not work on windows

Running roapi-http.exe on Windows with a simple:

roapi-http.exe --table 'tresult=d:/data/out_result.csv'

gives the error:

unsupported extension in uri: d:/data/out_result.csv

Importing parquet file produced by odbc2parquet fails silently with "Good bye!"

Hello,

When I create parquet files using odbc2parquet, columnq cannot read them and fails silently. These same files are able to be read without issue in duckdb and PowerBI. If I import the parquet file into duckdb and export it again with duckdb's default parquet export compression format, columnq generates the error: Error: Error loading Parquet: Failed to create file reader: Parquet error: Invalid Parquet file. Corrupt footer. If I export from duckdb explicitly setting it to use snappy compression, columnq can import that parquet file successfully. I am using the latest version of columnq 0.2.1

Regards,
Chris

PS C:\utils\database\odbc\parquet> odbc2parquet.exe query --column-compression-default snappy --dsn CH files_snappy.parquet "select * from files"

PS C:\utils\database\odbc\parquet> columnq --version
Columnq 0.2.1
PS C:\utils\database\odbc\parquet> columnq console --table "files=files_zstd.parquet"
Good bye!
PS C:\utils\database\odbc\parquet> columnq console --table "files=files_snappy.parquet"
Good bye!
PS C:\utils\database\odbc\parquet> columnq console --table "files=files.parquet"  (this one is gzip)
Good bye!

Error reading Parquet files from S3 with a space in the path.

Hi, I had a problem loading a parquet file from S3 when there's a space in the path. I tried with %20 but it doesn't work either.

Example path:

s3://my-bucket/trusted/receita_socios/version=2021-10-30 09:00:56/my-file.parquet

Error message:

Error: Error loading Parquet: no parquet file found

The same file without space in the path works fine.

Azure Data Lake Gen 2 support

Delta-rs (and soon, datafusion) will support ADL2, and I think it would be great to also have this in ROAPI.

If I had to implement it, I'd have a closer look at the implementation in delta-rs but I'm open to other suggestions.

ReadMe described installation on Windows fails

Hi,

I tried to install roapi according to the ReadMe and the following error occurred:

[screenshot of the error]

Operating System: Windows 10.0.18363

I'm not fluent in Rust and just wanted to test this cool project. So, maybe this is something that an additional command parameter could solve? Would be great if information regarding this problem could be added to the ReadMe. Thanks!

Alias issues on join

I have a simple setup with the following configuration and tables:

roapi.yml

addr: 0.0.0.0:8080
tables:
  - name: "countries"
    uri: "countries.csv"

  - name: "cities"
    uri: "cities.csv"

countries.csv

ID,COUNTRY
1,Germany
2,Sweden
3,Japan

cities.csv

ID,CITY,COUNTRY_ID
1,Hamburg,1
2,Stockholm,2
3,Osaka,3
4,Berlin,1
5,Göteborg,2
6,Tokyo,3
7,Kyoto,3

I can then perform the following JOIN request without any issues:

$ curl -X POST -H 'Accept: application/csv' -d "SELECT c1.CITY, c2.COUNTRY FROM cities AS c1 JOIN countries AS c2 ON c1.COUNTRY_ID = c2.ID" http://127.0.0.1:8080/api/sql

CITY,COUNTRY
Stockholm,Sweden
Göteborg,Sweden
Osaka,Japan
Tokyo,Japan
Kyoto,Japan
Hamburg,Germany
Berlin,Germany

But if I then try to additionally select the city ID, things don't go so well:

$ curl -X POST -H 'Accept: application/json' -d "SELECT c1.ID, c1.CITY, c2.COUNTRY FROM cities AS c1 JOIN countries AS c2 ON c1.COUNTRY_ID = c2.ID" http://127.0.0.1:8080/api/sql

{"code":400,"error":"query_execution","message":"Failed to execute query: Error during planning: The left schema and the right schema have the following columns with the same name without being on the ON statement: {Column { name: \"ID\", index: 0 }}. Consider aliasing them."}%

To me this seems as if the aliasing is not working as expected. I would expect the query engine to be able to distinguish between those two fields since I've aliased the two tables (c1 and c2 respectively). Using e.g. SQLite this query works as expected:

sqlite> SELECT c1.ID, c1.CITY, c2.COUNTRY FROM cities AS c1 JOIN countries AS c2 ON c1.COUNTRY_ID = c2.ID;
1|Hamburg|Germany
2|Stockholm|Sweden
3|Osaka|Japan
4|Berlin|Germany
5|Göteborg|Sweden
6|Tokyo|Japan
7|Kyoto|Japan

Tested on 0.3.4 and main+gf72299f.

GET and POST verbs

Is there a reason the GraphQL and SQL endpoints need to communicate via POST and not GET, like the REST endpoint allows?

Timestamp issues

First of all, thanks for including the Date32 in the latest commit.

I have checked out the code and built it with --release (btw, you might want to update the readme, the cargo command doesn't work because now you have two targets, roapi and columnq). Ubuntu 18, Rust updated to latest stable.

I have a few parquet files in which datetime is stored in various formats, from Scala: Timestamp, DateTime, Long.
If I load the ones with Timestamp and DateTime, the result is the same for select * from tbl limit 1:

[2021-04-12T06:43:04Z INFO  actix_web::middleware::logger] 127.0.0.1:33584 "POST /api/sql HTTP/1.1" 200 29 "-" "curl/7.58.0" 0.012752
thread 'actix-rt|system:0|arbiter:1' panicked at 'Unsupported datatype: Timestamp(
    Nanosecond,
    None,
)', /root/.cargo/git/checkouts/arrow-7d34e669e95701bb/cfec7bc/rust/arrow/src/json/writer.rs:326:13
stack backtrace:
   0: rust_begin_unwind
             at /rustc/673d0db5e393e9c64897005b470bfeb6d5aec61b/library/std/src/panicking.rs:493:5
   1: std::panicking::begin_panic_fmt
             at /rustc/673d0db5e393e9c64897005b470bfeb6d5aec61b/library/std/src/panicking.rs:435:5
   2: arrow::json::writer::set_column_for_json_rows
   3: arrow::json::writer::record_batches_to_json_rows
   4: roapi_http::encoding::json::record_batches_to_bytes
   5: roapi_http::api::encode_record_batches

Now, I was thinking that I can live with Long values, and use the to_timestamp function. But I get the same error. Here's an example from the output:

curl -X POST -d "select to_timestamp('2020-09-08T12:00:00+00:00') from tbl limit 1" localhost:8091/api/sql
[2021-04-12T06:58:35Z INFO  actix_web::middleware::logger] 127.0.0.1:33678 "POST /api/sql HTTP/1.1" 400 170 "-" "curl/7.58.0" 0.003689
thread 'actix-rt|system:0|arbiter:4' panicked at 'Unsupported datatype: Timestamp(
    Nanosecond,
    None,
)', /root/.cargo/git/checkouts/arrow-7d34e669e95701bb/cfec7bc/rust/arrow/src/json/writer.rs:326:13
stack backtrace:
   0: rust_begin_unwind
             at /rustc/673d0db5e393e9c64897005b470bfeb6d5aec61b/library/std/src/panicking.rs:493:5
   1: std::panicking::begin_panic_fmt
             at /rustc/673d0db5e393e9c64897005b470bfeb6d5aec61b/library/std/src/panicking.rs:435:5
   2: arrow::json::writer::set_column_for_json_rows
   3: arrow::json::writer::record_batches_to_json_rows
   4: roapi_http::encoding::json::record_batches_to_bytes
   5: roapi_http::api::encode_record_batches

And the value above comes from the Datafusion tests, which are presumably working : https://github.com/apache/arrow/blob/5b08205f7e864ed29f53ed3d836845fed62d5d4a/rust/datafusion/tests/sql.rs#L1768

I am a little puzzled right now. Do you see a way in which I can use timestamps in conjunction with roapi?
Is my understanding correct that, at this moment, I can use Date, but not DateTime or Timestamp?

Error: no CA certificates found

Hi,

When I configure a Google Spreadsheet using the Docker environment, I get the following error:

[2021-11-12T11:35:25Z INFO  roapi_http::api] loading `uri(https://docs.google.com/spreadsheets/d/<redacted>)` as table `<redacted>`
thread 'main' panicked at 'no CA certificates found', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/hyper-rustls-0.22.1/src/connector.rs:45:13

Installing the package ca-certificates with APT in the Docker image fixed the issue.

Thanks

Fields omitted for null values?

Thanks again for fixing #132. I'm still tinkering and noticed that when validating the CSV vs. JSON outputs the record-count didn't seem to tally when I was checking for certain fields.

It seems that if a field has a null value in the input CSV or JSON, that field is omitted from the output rather than being set to null.

For example:

CSV, integer values, empty field:

a
1
""

(CSV, string values don't really have a null value and work as expected.)

JSON, integer values, null field:

[{"a": 1}, {"a": null}]

JSON, string values, null field:

[{"a": "b"}, {"a": null}]

In all of the above cases the a field is dropped from the output in the latter record:

❯ curl --silent "http://127.0.0.1:8080/api/tables/test" 
[{"a":"b"},{}]
❯ curl --silent "http://127.0.0.1:8080/api/tables/test" 
[{"a":1},{}]

Not sure if this is a bug, expected behaviour or an issue with the underlying data parsing…?

optimize IO with bufreader

When loading datasets from the filesystem and network, we should use BufReader instead of the Read trait to reduce system calls. This means the partitions_from_uri functions' signatures need to be changed to accept a callback that takes BufReader instead of Read.

Support for large data sets (avoiding `MemTable`)

Hi,

Related to #57,
I have a use case where I need to visualize a small part (a few thousand rows) of a large dataset (a few hundred TBs) by using a highly selective query that typically only hits a few parquet files out of a large number of partitions. Using Spark, these queries can be served in a few seconds (assuming that the file metadata fits in memory), even on a single-node system.

I've been playing with using datafusion directly, but actually roapi is doing exactly what I need. Is there any design choice / project direction that prevents the various table implementations from returning some dyn TableProvider instead of MemTable?

I understand that roapi supports various cloud storage systems, which are not supported by datafusion yet, but given the recent developments in apache/datafusion#811 there is hopefully a solution coming.

FWIW, I hammered together a small PoC by ripping out the parquet loader and directly registering the parquet reader over at https://github.com/dispanser/roapi/tree/lazy-load-parquet . This approach seems to work, but is definitely not mergeable :-).

A possible implementation could add a configuration flag that flips from the MemTable to the TableProvider code path, supporting both the existing in-memory scenario and a new code path that registers parquet / delta / ... directly.

Unsupported datatype: Date32

I had an error when querying data that has the Date32 type. I checked the DataFusion library and it works with Date32.

thread 'actix-rt|system:0|arbiter:5' panicked at 'Unsupported datatype: Date32', /Users/runner/.cargo/git/checkouts/arrow-7d34e669e95701bb/f7cf157/rust/arrow/src/json/writer.rs:326:13

Reproduce:
Config:

addr: 0.0.0.0:8080
tables:
- name: han_shop
  option:
    format: csv
  uri: ../examples.csv

SQL query:

SELECT name FROM example

CSV data:

period,name,
2021-03-01,Linh,
2021-03-01,Han,

Schema:

{
    "name": "period",
    "data_type": "Date32",
    "nullable": false,
    "dict_id": 0,
    "dict_is_ordered": false
}

Command:

roapi-http -c ./config-1615885382668.yml

HTTP2 Support

I was wondering if the maintainers could add HTTP2 support as a feature request, so it can be worked on later and added to the framework.
Thanks.

Dataset JOIN query support

Datafusion only has partial SQL join support; we need to extend Datafusion to support full-fledged JOIN queries.

Add HTTP2 integration test in CI

To make sure we don't regress and drop support for HTTP2 going forward. The test can just be a simple HTTP2 curl call to roapi.

Failed to load Delta table with partitions

When I tried to load a Delta table with partitions, it threw the following exception:

Error: DeltaTable error: Failed to apply transaction log: Invalid JSON in log record
Caused by:
0: Failed to apply transaction log: Invalid JSON in log record
1: Invalid JSON in log record
2: invalid type: null, expected a string at line 1 column 195

I believe it fails because the table commit log has an entry

{"add":{"path":"file_year=__HIVE_DEFAULT_PARTITION__/file_month=__HIVE_DEFAULT_PARTITION__/part-00000-4cfe8fb8-3905-4ae6-8b44-e7612753064c.c000.snappy.parquet","partitionValues":{"file_year":null,"file_month":null},"size":3331,"modificationTime":1628075908000,"dataChange":true}}

issue compiling roapi-http v0.1.13 could not compile `actix-cors`

Hello,
I tried to compile roapi-http from the source zip.

I got an error when compiling:

cargo install --verbose --path ./roapi-http --bin roapi-http
building
...
...

error[E0053]: method `error_response` has an incompatible type for trait                                                              │
  --> /usr/local/cargo/git/checkouts/actix-extras-3214827614a42f65/ab3bdb6/actix-cors/src/error.rs:53:5                               │
   |                                                                                                                                  │
53 |     fn error_response(&self) -> HttpResponse {                                                                                   │
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected struct `BaseHttpResponse`, found struct `actix_web::HttpResponse`          │
   |                                                                                                                                  │
   = note: expected fn pointer `fn(&CorsError) -> BaseHttpResponse<actix_web::dev::Body>`                                             │
              found fn pointer `fn(&CorsError) -> actix_web::HttpResponse`                                                            │
                                                                                                                                      │
error: aborting due to previous error                                                                                                 │
                                                                                                                                      │
For more information about this error, try `rustc --explain E0053`.                                                                   │
error: could not compile `actix-cors`                                                                                                 │

Unable to load a large Delta table

I am able to load smaller Delta tables but unable to load a larger Delta table (4.2 billion rows across 1K Parquet files). I tried to bump up batch_size but it did not help. Currently in ROAPI, I do not see an option to pre-filter at the Arrow dataset level, and no option to distribute the dataset across multiple nodes (distributed dataset). Any suggestions on how to achieve this in ROAPI? Are these known issues with ROAPI? Please let me know how to fix it. Thanks.

Timestamps in parquet files

I've run into issues reading timestamp columns from parquet files.

The snapshot DataFusion library did not work at all for converting a string to a timestamp in a query,
e.g. select * from parquet_table where date > CAST('2020-01-01T00:00:00.000Z' as Timestamp).
This required upgrading to datafusion 4.0.0, which has now been released.

However there is still an issue reading timestamps from parquet. All of my timestamps are coming through as a 1970 date. It appears that they are being interpreted as timestamp(ns) rather than timestamp(us). Given that ns/nanoseconds is deprecated in parquet, this is very strange. The file was written using pandas and pyarrow, so in theory it is the same underlying implementation.

Support roapi for AWS Lambda

I want to use roapi for serverless but I have to do some tricks. For example, I have to build a Python image and run roapi as a subprocess, then use the REST API to call it. I checked the repo and saw the subproject columnq. So do you have any plan to split columnq into another repo so that it would be easier to use in serverless or to integrate with other backend code?

Why support serverless?
In my company, we have a lot of teams. Each team has different resources, so I want them to be able to take the initiative to select their own resources. When they invoke the serverless function, it will create a YAML config (with their resources) in the Lambda tmp directory, and another subprocess for roapi will be invoked that reads this config.

How to contribute?
If you have a plan to support serverless, then how can I help you to contribute it? I'm also excited about this project.

Docker image:

FROM public.ecr.aws/lambda/python:3.8
WORKDIR /app
COPY . .
RUN pip3 install -r requirements.txt
CMD ["/app/app.hello"]

Check roapi is running:

import subprocess
import time

import requests


def start_roapi(addr, config_file_name):
    process = subprocess.Popen([
        "roapi-http -c {filename}".format(filename=config_file_name)],
        shell=True,
        # stdout=subprocess.PIPE,
        # stderr=subprocess.PIPE
    )

    url = "http://{addr}".format(addr=addr)
    for n in range(1000):
        try:
            time.sleep(0.5)
            res = requests.request("GET", url)
            print("started", res.text)
            break
        except:
            pass

    return process

Then, I have to query localhost:

url = "http://{addr}{path}?{qs}".format(addr=addr, path=event["rawPath"], qs=event["rawQueryString"])
response = requests.request(method, url, headers=event["headers"], data=data)

Docker image not working

Hi,

I'm just trying roapi.

Docker image with the latest tag (and version >=0.4.3) is not working. It ends with the exit code 132 and no error message.

$ docker run -t --rm -p 8081:8080 ghcr.io/roapi/roapi-http:v0.4.3 --addr 0.0.0.0:8080 --table "uk_cities=test_data/uk_cities_with_headers.csv" --table "test_data/spacex_launches.json"
$ echo $?
132

It works ok with v0.4.2

$ docker run -t --rm -p 8081:8080 ghcr.io/roapi/roapi-http:v0.4.2 --addr 0.0.0.0:8080 --table "uk_cities=test_data/uk_cities_with_headers.csv" --table "test_data/spacex_launches.json"
[2021-11-11T12:31:49Z INFO  roapi_http::api] loading `uri(test_data/uk_cities_with_headers.csv)` as table `uk_cities`
[2021-11-11T12:31:49Z INFO  roapi_http::api] registered `uri(test_data/uk_cities_with_headers.csv)` as table `uk_cities`
[2021-11-11T12:31:49Z INFO  roapi_http::api] loading `uri(test_data/spacex_launches.json)` as table `spacex_launches`
[2021-11-11T12:31:50Z INFO  roapi_http::api] registered `uri(test_data/spacex_launches.json)` as table `spacex_launches`
[2021-11-11T12:31:50Z INFO  actix_server::builder] Starting 2 workers
[2021-11-11T12:31:50Z INFO  actix_server::builder] Starting "actix-web-service-0.0.0.0:8080" service on 0.0.0.0:8080

Thanks

[feature] elasticsearch support

I think integrating this not only with RDBMSs like Postgres (as seen in the roadmap) but also with Elasticsearch would be quite awesome.
