vincev / dply-rs

A dataframe manipulation tool for parquet, csv, and json data.

License: Apache License 2.0
I was briefly confused why `csv("test.csv") | show` didn't show anything. As it turns out, omitting the `()` in `show()` just defines a pipeline variable.
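To make the distinction concrete, the two spellings behave quite differently (per the behavior described above):

```
csv("test.csv") | show      # defines a pipeline variable named `show`; displays nothing
csv("test.csv") | show()    # runs the sink and displays the data
```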
I'd consider this syntax rather confusing, e.g. in the example given in the documentation:

```
parquet("nyctaxi.parquet") |
  select(payment_type, contains("amount")) |
  fare_amounts |
  group_by(payment_type) |
  summarize(mean_amount = mean(total_amount)) |
  head()

fare_amounts |
# etc ...
```

`fare_amounts |` also just looks like a function call to me (and probably to anybody else who's used to shell pipes).
But apart from being confusing, I'd also argue that the syntax hurts the readability of queries, since you always have to look at the end of a line (to check whether there's a `()`) to understand what the line does. So I think some prefix notation would be much preferable over the current suffix notation.
For example, PRQL has an `into` keyword, so you can write e.g. `from items take 5 into top_5` ... perhaps this would also make sense for dply-rs? The above example could then become:
```
parquet("nyctaxi.parquet") |
  select(payment_type, contains("amount")) |
  into fare_amounts |
  group_by(payment_type) |
  summarize(mean_amount = mean(total_amount)) |
  head()

fare_amounts |
# etc ...
```
I was trying to follow along with your readme and process some json data, but I can't seem to get it to work. Am I doing something wrong?

```
> dply -c 'json("./buildtimes.json") | show()'
Error: Unknonw function: json
```

Also, when I go into the repl, I don't see a `json(` function like the `parquet(` function.
The column completion really improves ergonomics! When inserting a chosen column name that requires backticks to be valid (for example, one containing a `-`), would it be possible to automatically surround the column name with backticks?
The Python REPL has a nice feature, the `help` function: e.g. `help(print)` shows the usage documentation of the `print` function. R also has a help function. So it would be nice if this CLI had the same.
The documentation in functions.md is quite example-heavy ... which makes it harder to quickly look up how exactly a function can be used. It would be nice if the documentation of each function started with a short synopsis (as is e.g. commonplace in man pages or API documentation).
By sink I mean `show`, `glimpse`, `csv`, `json`, `parquet` and pipeline variables (which could become `into`, see #42).
Every step of a pipeline that isn't followed by a sink is dead code: its results are simply discarded. This is particularly unintuitive since dply currently doesn't print any warning or error about such dead code. It also makes the REPL a bit cumbersome, since you have to append `| show()` to pretty much every single query, and every time you want to append a filter to the pipeline, you press ArrowUp and then have to move the cursor before the `| show()` to insert the new filter.
I think it would be better if dply detected that there's no sink and appended a default one: `show()` for the REPL, but perhaps configurable on the command line, so that you could execute the same query with different output formats without having to edit the query file, e.g.:

```
$ dply example.query --to json
$ dply example.query --to csv
```
In #47, I learned how to open a json file, as long as it's jsonl/ndjson and named with a .json file extension. Now I have an int64 field that needs to be interpreted as a timestamp (nanoseconds). Is there a way to cast the field to some kind of duration or timestamp datatype?
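For illustration, a `mutate`-based cast might look something like the sketch below. Note that `timestamp_ns` is a purely hypothetical function name; dply doesn't currently document a cast of this kind, which is exactly what this issue is asking about:

```
# hypothetical: cast an int64 nanosecond field to a timestamp
json("./buildtimes.json") |
  mutate(start_time = timestamp_ns(start_time)) |
  show()
```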
This was changed in 96dd5ba without explaining why. If there's a reason why dply currently has to depend on that particular commit of datafusion, it should probably be explained in a comment in `Cargo.toml`.
I'd like to be able to execute e.g.:

```
$ dply -c "json() | filter(x > 3) | csv()" < test.json > test.csv
# or
$ curl https://example.com/data.csv | dply -c "csv() | filter(x > 3) | json()" | jq .
```

However, this doesn't currently work, since the `csv`, `json` and `parquet` functions all require a file path and don't support working with stdin/stdout.

This would also allow queries stored in files to be run against different inputs without having to modify the query, e.g.:

```
$ dply example.query < foo.json
$ dply example.query < bar.json
```
Interesting package, thanks. Could you please consider making this package compile and run on Ubuntu 20.04, e.g. with an older glibc?

```
$ ldd --version
ldd (Ubuntu GLIBC 2.31-0ubuntu9.9) 2.31
```

This would make it much easier to deploy ad hoc in a container etc. (I know I can use cargo, but that's more top-heavy).

Thanks,
Colin

```
$ ./dply
./dply: /usr/lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.35' not found (required by ./dply)
./dply: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by ./dply)
./dply: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by ./dply)
./dply: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by ./dply)
```
Nice tool ... I really like that it has autocompletion. I was just very confused by the following error message:

```
〉csv("data.csv") | filter(name = "test") | show()
Error: Invalid argument 'name = "test"' for function 'filter'
```

It would be nice if it instead complained about the operator and perhaps even listed the supported operators. I found a better error message in the code:

Line 67 in 97d8c74

(besides that being a panic). However, the unhelpful error message appears to come from the `signatures` module ...
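For instance, a friendlier message might look something like the following; the exact operator list shown here is an assumption about what dply's `filter` accepts, not taken from its source:

```
Error: filter: unsupported operator '='; did you mean '=='?
       supported comparison operators: ==, !=, <, <=, >, >=
```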
Just passing on some love here. Great work on this tool! I'm really appreciating your approach to it. Dataframes can be difficult to deal with and you've hidden most of the complexity for users. I'd love to build your stuff into nushell if I could find an easy way to do it.
Thanks for the great tool!! I've been looking for something like this for a while, the pandas-backed options weren't working well for me!
I am wondering how best to specify column names that aren't valid Python names. Currently, if a column name includes for example a minus sign (`-`), it doesn't appear to be possible to select the column by simply quoting the name.

I was hoping this would work, but I get an error:

```
$ dply -c 'parquet("myfile.parquet") | select(column1, column2, "column-3", column_4) | show()'
Error: Match failure '"column-3"' must be an identifier
```

In the meantime I've been able to use the following as a workaround:

```
$ dply -c 'parquet("myfile.parquet") | select(column1, column2, starts_with("column-3"), column_4) | show()'
```
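Assuming dply treats backtick-quoted names as valid identifiers (as the completion discussion elsewhere in this thread suggests), a sketch of the intended query might be; whether `select` actually accepts this form is an assumption:

```
# hypothetical: backtick-quote names that aren't valid identifiers
$ dply -c 'parquet("myfile.parquet") | select(column1, column2, `column-3`, column_4) | show()'
```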
Hi! It seems to me that it would be convenient to automatically add parentheses to completions (e.g. `glimpse()`, `mutate(`). Please consider adding this feature.