dply-rs's Issues

The syntax for pipeline variables is problematic

I was briefly confused about why csv("test.csv") | show didn't show anything. As it turns out, omitting the () in show() just defines a pipeline variable.

I'd consider this syntax to be rather confusing, e.g. in the example given in the documentation:

parquet("nyctaxi.parquet") |
    select(payment_type, contains("amount")) |
    fare_amounts |
    group_by(payment_type) |
    summarize(mean_amount = mean(total_amount)) |
    head()

fare_amounts |
    # etc ...

fare_amounts | also just looks like a function to me ... (and probably anybody else who's used to shell pipes).

But apart from being confusing, I'd also argue that the syntax hurts the readability of queries, since you always have to look at the end of a line (to check if there's a ()) to understand what the line does. So I think some prefix notation would be much preferable to the current suffix notation.

For example PRQL has an into keyword so you can query e.g. from items take 5 into top_5 ... perhaps this would also make sense for dply-rs? So the above example could become:

parquet("nyctaxi.parquet") |
    select(payment_type, contains("amount")) |
    into fare_amounts |
    group_by(payment_type) |
    summarize(mean_amount = mean(total_amount)) |
    head()

fare_amounts |
    # etc ...
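
To illustrate the readability argument, here is a minimal sketch (not dply's actual parser) of how a prefix keyword lets a step be classified by looking only at its start. The StepKind enum and the classify_step function are hypothetical names made up for this example, and the into keyword is only the proposal from this issue.

```rust
/// Hypothetical classification of a single pipeline step.
#[derive(Debug, PartialEq)]
enum StepKind<'a> {
    /// `into name`: store the intermediate result under `name`.
    Store(&'a str),
    /// Anything else, e.g. `select(...)` or `head()`.
    Call(&'a str),
}

/// Decides what a step does by inspecting only its prefix.
fn classify_step(step: &str) -> StepKind<'_> {
    let step = step.trim();
    match step.strip_prefix("into") {
        // Require whitespace after the keyword so that a function whose
        // name merely starts with "into" is not mistaken for it.
        Some(rest) if rest.starts_with(char::is_whitespace) && !rest.trim().is_empty() => {
            StepKind::Store(rest.trim())
        }
        _ => StepKind::Call(step),
    }
}
```

With this shape, classify_step("into fare_amounts") yields Store("fare_amounts"), while head() is treated as a call without having to check the end of the line for ().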

Error: Unknown function: json

I was trying to follow along with your readme and process some json data but I can't seem to get it to work. Am I doing something wrong?

> dply -c 'json("./buildtimes.json") | show()'
Error: Unknonw function: json

Also, when I go into the REPL, I don't see a json( function like the parquet( function.

Enhance: Auto backtick "invalid" column names

The column completion really improves ergonomics!

When a chosen column name needs backticks to be valid (for example because it contains a -), would it be possible to automatically surround it with backticks on insertion?
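
A rough sketch of what the completer could do when inserting a name. This is not taken from dply's code: the function names are made up, and the rule for what counts as a plain identifier (ASCII letter or underscore first, then letters, digits, or underscores) is an assumption.

```rust
/// Returns true when `name` can be written bare in a query.
/// Assumed rule: starts with an ASCII letter or underscore, followed by
/// ASCII letters, digits, or underscores. dply's real rule may differ.
fn is_plain_identifier(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(first) if first.is_ascii_alphabetic() || first == '_' => {
            chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
        }
        _ => false,
    }
}

/// Text to insert for a completed column name: bare when possible,
/// otherwise wrapped in backticks. Names that themselves contain a
/// backtick are not handled in this sketch.
fn completion_text(name: &str) -> String {
    if is_plain_identifier(name) {
        name.to_string()
    } else {
        format!("`{}`", name)
    }
}
```

So column_4 would be inserted as-is, while column-3 would be inserted wrapped in backticks.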

Builtin documentation

The Python REPL has a nice feature which is the help function so e.g. help(print) shows the usage documentation of the print function. R also has a help function. So it would be nice if this CLI had the same.

Document function signatures

The documentation in functions.md is quite example-heavy ... which makes it harder to quickly look up how exactly a function can be used. It would be nice if the documentation of each function started with a short synopsis (as is e.g. commonplace in man pages or API documentation).

Implicit default sink for pipelines

By sink I mean show, glimpse, csv, json, parquet and pipeline variables (which could become into, see #42).

Every step of a pipeline that isn't followed by a sink is just dead code ... completely useless since the results will be discarded.

This is particularly unintuitive since dply currently doesn't print any warning or error about such dead code. It also makes the REPL a bit cumbersome: you have to append | show() to pretty much every single query, and every time you want to append a filter to the pipeline, you press ArrowUp and then have to move the cursor before the | show() to insert the new filter.

I think it would be better if dply detected that there's no sink and just appended a default one. That would be show() for the REPL, but it could perhaps be configured on the command line, so that you could execute the same query with different output formats without having to edit the query file, e.g.:

$ dply example.query --to json
$ dply example.query --to csv
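
A minimal sketch of the detection step, assuming a simplified pipeline representation. The Step enum and with_default_sink are hypothetical and not dply's internals; the list of sink functions is the one from this issue. The first step is never counted as a sink, since csv, json, parquet and a pipeline variable act as sources in that position.

```rust
/// Hypothetical, simplified pipeline step.
#[derive(Debug, Clone, PartialEq)]
enum Step {
    /// A function call such as filter(...) or show(), identified by name.
    Call(String),
    /// A pipeline variable.
    Variable(String),
}

/// Functions that act as sinks when they are not the first step.
const SINK_FUNCTIONS: [&str; 5] = ["show", "glimpse", "csv", "json", "parquet"];

/// Appends `default_sink` when a non-empty pipeline does not end in a sink.
fn with_default_sink(mut steps: Vec<Step>, default_sink: &str) -> Vec<Step> {
    // A single-step pipeline is only a source, so it cannot end in a sink.
    let ends_in_sink = steps.len() >= 2
        && match steps.last() {
            Some(Step::Variable(_)) => true,
            Some(Step::Call(name)) => SINK_FUNCTIONS.iter().any(|s| *s == name.as_str()),
            None => false,
        };
    if !ends_in_sink && !steps.is_empty() {
        steps.push(Step::Call(default_sink.to_string()));
    }
    steps
}
```

The REPL would pass "show" as default_sink, and a --to json option (as proposed above) would pass "json" instead.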

question: can you cast data types

In #47, I learned how to open a json file, as long as it's jsonl/ndjson and named with a .json file extension. Now I have an int64 field that needs to be interpreted as a timestamp (nanoseconds). Is there a way to cast the field to some kind of duration or timestamp datatype?

Support shell pipelines

I'd like to be able to execute e.g.:

$ dply -c "json() | filter(x > 3) | csv()" < test.json > test.csv
# or
$ curl https://example.com/data.csv | dply -c "csv() | filter(x > 3) | json()" | jq .

However, this doesn't work at the moment, since the csv, json and parquet functions all require a file path and don't support working with stdin/stdout.

This would also allow queries stored in files to be run against different files without having to modify the query, e.g.:

$ dply example.query < foo.json
$ dply example.query < bar.json
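
On the input side, the idea could look roughly like this. It is only a sketch using the Rust standard library: open_input is a hypothetical helper, not a dply function, and the real readers for these formats may need more than a plain byte stream.

```rust
use std::fs::File;
use std::io::{self, Read};

/// Opens the input for a source function: the given path, or stdin when
/// the call has no argument (the proposed csv() / json() form).
fn open_input(path: Option<&str>) -> io::Result<Box<dyn Read>> {
    match path {
        Some(p) => Ok(Box::new(File::open(p)?)),
        None => Ok(Box::new(io::stdin())),
    }
}
```

The output side could follow the same pattern with Box<dyn Write> and io::stdout().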

compile for older glibc?

Interesting package, thanks. Could you please consider making this package compile and run on Ubuntu 20.04, i.e. with an older glibc?

ldd --version
ldd (Ubuntu GLIBC 2.31-0ubuntu9.9) 2.31

This would make it much easier to deploy ad hoc in a container etc. (I know I can use cargo, but that's more heavyweight).

Thanks,
Colin

./dply 
./dply: /usr/lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.35' not found (required by ./dply)
./dply: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by ./dply)
./dply: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by ./dply)
./dply: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by ./dply)

Poor error message when attempting to filter with `=` instead of `==`

Nice tool ... I really like that it has autocompletion. I was just very confused by the following error message:

〉csv("data.csv") | filter(name = "test") | show()
::: 
Error: Invalid argument 'name = "test"' for function 'filter'

It would be nice if it instead complained about the operator and perhaps even listed the supported operators.
I found a better error message in the code:

_ => panic!("Unexpected filter operator {op}"),

(besides that being a panic). However the unhelpful error message appears to come from the signatures module ...
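
Something along these lines would already help. It is a sketch only: check_filter_op is a made-up name, and the operator list is an assumption rather than the set dply actually supports.

```rust
/// Assumed list of comparison operators; the set dply really supports
/// may differ.
const SUPPORTED_OPS: [&str; 6] = ["==", "!=", "<", "<=", ">", ">="];

/// Validates a filter operator and returns a descriptive error instead
/// of panicking.
fn check_filter_op(op: &str) -> Result<(), String> {
    if SUPPORTED_OPS.iter().any(|s| *s == op) {
        return Ok(());
    }
    let mut msg = format!(
        "Unexpected filter operator '{}', supported operators: {}",
        op,
        SUPPORTED_OPS.join(", ")
    );
    // Point out the most likely typo explicitly.
    if op == "=" {
        msg.push_str(" (did you mean '=='?)");
    }
    Err(msg)
}
```

For filter(name = "test") this would name the offending operator and list the alternatives, instead of rejecting the whole argument.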

great work!

Just passing on some love here. Great work on this tool! I'm really appreciating your approach to it. Dataframes can be difficult to deal with and you've hidden most of the complexity for users. I'd love to build your stuff into nushell if I could find an easy way to do it.

How to specify select columns with names containing `-`

Thanks for the great tool!! I've been looking for something like this for a while; the pandas-backed options weren't working well for me!

I am wondering how best to specify column names that aren't valid Python names. Currently, if a column name includes for example a minus sign (-), it doesn't appear to be possible to select the column by simply quoting the name.

I was hoping this would work, but I get an error.

$ dply -c 'parquet("myfile.parquet") | select( column1, column2, "column-3", column_4) | show()'
Error: Match failure '"column-3"' must be an identifier

In the meantime I've been able to use the following as a workaround:

$ dply -c 'parquet("myfile.parquet") | select( column1, column2, starts_with("column-3"), column_4) | show()'
