
datasette-parquet's Introduction

I like all things data. I've worked with data my whole career, from things that would fit in a Tweet all the way up to multi-petabyte data warehouses running in the cloud.

Currently, I'm a founder at SyncWith. We want anyone to be able to get their data into Google Sheets and Looker Studio.

As a hobby, I'm dabbling with making small datasets much more accessible via SQLite and Datasette plugins.

You can find me on the web at cldellow.com or as @cldellow on Hachyderm.

datasette-parquet's People

Contributors

cldellow, mause, mrchrisadams


datasette-parquet's Issues

Sanity checking the handling of WHERE statements with datasette

Hi there.

First of all, thanks for making datasette-parquet - I was very happy to discover it exists.

I'm using it on a project to play around with some recently published open data, and whenever I try adding a "WHERE" clause to a query that would go to duckdb, I'm seeing an error.

When I load a page without a where clause, all is well.

However once I add a where clause like so:

[screenshot: the query with a WHERE clause added]

I end up with a confusing "Binder" error that I'm not sure how to debug:

[screenshot: the Binder error message]

I'm working on getting a test case together, but in the meantime, is rewrite.py the file to check for logic that might be causing this?

https://github.com/cldellow/datasette-parquet/blob/main/datasette_parquet/rewrite.py#L80-L81

I see some logging statements there; knowing where to look would help me create a PR to fix it once I understand the cause of the issue.

canned queries seem to use sqlite?

I was trying to use duckdb to export a parquet file with a writable query like:

{
  "databases": {
    "trove": {
      "queries": {
        "export": {
          "sql": "copy (select geo_id, name, ascii_name, latitude, longitude, feature_class, feature_code, country, cc2, admin1, admin2, admin3, admin4, population, elevation, dem, timezone, updated_at from cities500) to 'output.parquet' (format parquet)",
          "write": true
        }
      }
    }
  }
}

This gave: near "copy": syntax error, and the query didn't seem to go through my rewriter. That error message is what sqlite3 gives if I run the same statement against it.

support duckdb

In addition to directory, we should let you specify file, pointing to an existing DuckDB database.

support query runtime limits

SQLite's Python interface supports interrupting long-running queries. DuckDB's C API supports this too, but it has not yet been exposed to the Python API. See duckdb/duckdb#5938 and duckdb/duckdb#3749.

A really brutal workaround: DuckDB will respond to signals... could we temporarily install a signal handler that prevents SIGINT from killing the server, but lets it percolate through to DuckDB to interrupt the query? That feels fragile and complex; it might be easier to learn how to patch DuckDB.
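For reference, the SQLite capability mentioned above looks like this via Python's stdlib sqlite3 module (a minimal sketch; the unbounded recursive CTE just stands in for any long-running query):

```python
import sqlite3
import threading
import time

# Connection.interrupt() aborts an in-flight query from another thread.
# At the time of this issue, DuckDB's C API had the same capability but
# did not yet expose it in its Python bindings.
conn = sqlite3.connect(":memory:", check_same_thread=False)
result = {}

def run_slow_query():
    try:
        # An unbounded recursive CTE: runs until interrupted.
        conn.execute(
            "WITH RECURSIVE n(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM n) "
            "SELECT count(*) FROM n"
        ).fetchone()
        result["status"] = "finished"
    except sqlite3.OperationalError:
        result["status"] = "interrupted"

t = threading.Thread(target=run_slow_query)
t.start()
time.sleep(0.3)          # give the query time to start
while t.is_alive():
    conn.interrupt()     # abort it from the main thread
    t.join(timeout=0.2)
print(result["status"])  # interrupted
```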

Error with facet suggestions

Absolutely fantastic project, thanks a lot!

As hinted at in the README, disabling facet suggestions might be needed.

Without disabling it, I was getting an error when clicking on a view of a parquet file:

datasette_parquet.exceptions.DoubleQuoteForLiteraValue: ['"unit" must appear in the GROUP BY clause or must be part of an aggregate function.\nEither add it to the GROUP BY list, or use "ANY_VALUE(unit)" if the exact value of "unit"']

When started with

datasette serve --metadata metadata.json --setting suggest_facets off

there is no problem.

Not sure if there is a way to check for or prevent this error without that setting; happy to look into it if you have any pointers!
("unit" is a column of string values.)
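For context, the difference the error describes comes from SQLite's lenient GROUP BY handling, which the facet code implicitly relies on. A small illustration (table and column names invented):

```python
import sqlite3

# SQLite permits a bare, non-grouped column in an aggregate query and
# returns a value from an arbitrary row of each group. DuckDB rejects
# the same query with the Binder error quoted above and suggests
# wrapping the column in ANY_VALUE() instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (unit TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("kg", 1), ("kg", 2), ("lb", 3)],
)

# Fine in SQLite; a Binder error in DuckDB (use ANY_VALUE(qty) there).
rows = conn.execute(
    "SELECT unit, qty, count(*) FROM measurements GROUP BY unit ORDER BY unit"
).fetchall()
print(rows)
```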

not working in conjunction with datasette-graphql

metadata.json

{
  "plugins": {
    "datasette-parquet": {
      "trove": {
        "directory": "/app/data",
        "watch": true
      }
    }
  }
}

run.sh

datasette --metadata /app/metadata.json -h 0.0.0.0 --setting sql_time_limit_ms 30000 --setting suggest_facets off

Dockerfile

FROM datasetteproject/datasette
WORKDIR /app
RUN datasette install datasette-parquet
RUN datasette install datasette-graphql
ADD https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2009-01.parquet ./data/
ADD /run.sh ./
ADD /metadata.json ./
RUN chmod +x ./run.sh
CMD "/app/run.sh"

After running it, the browser lists the view names appropriately, but it errors when I navigate to any view (/trove/yellow_tripdata_2009-01) with:
Catalog Error: unrecognized configuration parameter "recursive_triggers". Did you mean: "user"?

If I remove the graphql package it works again. Something is clashing when both packages are used, so I'm not sure whether the problem is in the graphql package or the parquet package.
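One possible direction, purely a sketch (the class name, method shape and set of ignored settings are all invented, not the plugin's actual API): wrap the DuckDB connection in a shim that silently ignores SQLite-only settings that other plugins or Datasette core apply to every connection.

```python
import sqlite3  # stands in for a DuckDB connection in this sketch

# Hypothetical shim: swallow SQLite-only PRAGMA/SET statements (such as
# recursive_triggers) instead of letting DuckDB raise a Catalog Error.
SQLITE_ONLY_SETTINGS = ("recursive_triggers", "journal_mode")

class TolerantConnection:
    def __init__(self, real_conn):
        self.real = real_conn

    def execute(self, sql, params=None):
        head = sql.strip().lower()
        for name in SQLITE_ONLY_SETTINGS:
            if head.startswith(f"pragma {name}") or head.startswith(f"set {name}"):
                return None  # silently drop the unsupported setting
        return self.real.execute(sql, params or [])

conn = TolerantConnection(sqlite3.connect(":memory:"))
conn.execute("PRAGMA recursive_triggers = on")  # ignored, no error
print(conn.execute("SELECT 1 + 1").fetchone())  # (2,)
```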

see if we can remove named parameters hack

DuckDB's Python bindings now support named parameters!

We might still need to strip out unused parameters (e.g. the csrftoken that leaks in on the canned query page), but I think we can remove my crappy rewriting of named parameters to positional parameters.

is there a way to revive DateFacet and ArrayFacet?

The Facet class gets a reference to its database, so we can generate DuckDB-specific queries.

Another challenge might be expressing the filter: dates in parquet aren't strings, so created LIKE '2023-01-01%' won't work.

I'm a little afraid that we'd need actual proper support from Datasette.

I could imagine doing some crazy hacks to make it work, if we really wanted to: take over all the filter generation, consulting the type of the column to know whether to do a simple string LIKE or a date comparison.
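That type-aware hack could look something like the sketch below (all names are invented; this is neither Datasette's nor the plugin's actual API, and the type lookup itself is assumed to exist):

```python
# Hypothetical helper: pick a filter expression based on the column's
# type, since parquet DATE columns can't be matched with a string LIKE.
def exact_date_filter(column, value, column_type):
    if column_type.upper() in ("DATE", "TIMESTAMP"):
        # Real dates: use a half-open one-day range instead of LIKE.
        sql = (
            f"{column} >= CAST(? AS DATE) "
            f"AND {column} < CAST(? AS DATE) + INTERVAL 1 DAY"
        )
        return sql, [value, value]
    # String-typed columns keep the existing LIKE behaviour.
    return f"{column} LIKE ?", [value + "%"]

sql, params = exact_date_filter("created", "2023-01-01", "DATE")
print(sql)
print(params)  # ['2023-01-01', '2023-01-01']
```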

add flag to install and load httpfs

so that, e.g., this query will work:

select count(*), count(distinct vendorid)
from read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet')
limit 1

Currently if you try this query, you'll get:

IO Error: No files found that match the pattern "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet"

This should be opt-in, though: the ability to make outbound connections is probably a surprising one.
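For reference, what the proposed flag would need to run on each connection is just DuckDB's standard extension commands (the flag itself doesn't exist yet):

```sql
-- Standard DuckDB commands to enable reads over HTTP(S); a new opt-in
-- flag would run these on each connection.
INSTALL httpfs;
LOAD httpfs;
```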

strip unexpected parameters

While implementing, I discovered:

Datasette passes extraneous parameters to the sqlite3 connection: a writable canned query POSTs a csrftoken for security purposes, which ends up among the query parameters. DuckDB is strict about the parameters matching the SQL query, so it fails.

I added a one-off fix for csrftoken. But this happens elsewhere, too: go to https://dux.fly.dev/parquet/geonames_stats and click hide.

Proposal: let's strip parameters that are in the params dict, but not the SQL string
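The proposal could be sketched roughly like this (the function name is invented, and the regex is a simplification that ignores parameter markers inside string literals):

```python
import re

# Keep only the parameters whose names actually appear in the SQL text,
# so DuckDB's strict parameter matching passes. The regex covers
# :name / @name / $name placeholder styles.
def strip_unused_params(sql, params):
    named = set(re.findall(r"[:@$](\w+)", sql))
    return {k: v for k, v in params.items() if k in named}

print(strip_unused_params(
    "select * from t where id = :id",
    {"id": 1, "csrftoken": "abc123"},
))  # {'id': 1}
```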
