vincev / dply-rs

A dataframe manipulation tool for parquet, csv, and json data.

License: Apache License 2.0
I was briefly confused why `csv("test.csv") | show` didn't show anything. As it turns out, omitting the `()` in `show()` just defines a pipeline variable.
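To make the distinction concrete, the two spellings behave quite differently (per the behavior described above):

```
csv("test.csv") | show      # defines a pipeline variable named `show`; displays nothing
csv("test.csv") | show()    # runs the sink and displays the data
```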
I'd consider this syntax rather confusing, e.g. in the example given in the documentation:

```
parquet("nyctaxi.parquet") |
  select(payment_type, contains("amount")) |
  fare_amounts |
  group_by(payment_type) |
  summarize(mean_amount = mean(total_amount)) |
  head()

fare_amounts |
# etc ...
```

`fare_amounts |` also just looks like a function call to me (and probably to anybody else who's used to shell pipes).
But apart from being confusing, I'd also argue that the syntax hurts the readability of queries, since you always have to look at the end of a line (to check whether there's a `()`) to understand what the line does. So I think some prefix notation would be much preferable over the current suffix notation.
For example, PRQL has an `into` keyword, so you can write e.g. `from items take 5 into top_5` ... perhaps this would also make sense for dply-rs? The above example could then become:
```
parquet("nyctaxi.parquet") |
  select(payment_type, contains("amount")) |
  into fare_amounts |
  group_by(payment_type) |
  summarize(mean_amount = mean(total_amount)) |
  head()

fare_amounts |
# etc ...
```
I was trying to follow along with your readme and process some json data, but I can't seem to get it to work. Am I doing something wrong?

```
> dply -c 'json("./buildtimes.json") | show()'
Error: Unknonw function: json
```

Also, when I go into the repl, I don't see a `json(` function like the `parquet(` function.
The column completion really improves ergonomics! When inserting a chosen column name that requires backticks to be valid (for example, one containing a `-`), would it be possible to automatically surround the column name with backticks?
The Python REPL has a nice feature, the `help` function: e.g. `help(print)` shows the usage documentation of the `print` function. R also has a help function. So it would be nice if this CLI had the same.
The documentation in functions.md is quite example-heavy ... which makes it harder to quickly look up how exactly a function can be used. It would be nice if the documentation of each function started with a short synopsis (as is e.g. commonplace in man pages or API documentation).
By sink I mean `show`, `glimpse`, `csv`, `json`, `parquet` and pipeline variables (which could become `into`, see #42).
Every step of a pipeline that isn't followed by a sink is dead code: its results are simply discarded. This is particularly unintuitive since dply currently doesn't print any warning or error about such dead code. It also makes the REPL a bit cumbersome, since you have to append `| show()` to pretty much every single query, and every time you want to append a filter to the pipeline, you press ArrowUp and then have to move the cursor before the `| show()` to insert the new filter.
I think it would be better if dply detected that there's no sink and appended a default one: `show()` for the REPL, but perhaps configurable on the command line, so that you could execute the same query with different output formats without having to edit the query file, e.g.:

```
$ dply example.query --to json
$ dply example.query --to csv
```
In #47, I learned how to open a json file, as long as it's jsonl/ndjson and named with a .json file extension. Now I have an int64 field that needs to be interpreted as a timestamp (nanoseconds). Is there a way to cast the field to some kind of duration or timestamp datatype?
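For illustration, a `mutate`-based cast might look something like the sketch below. Note that `timestamp_ns` is a purely hypothetical function name; dply doesn't currently document a cast of this kind, which is exactly what this issue is asking about:

```
# hypothetical: cast an int64 nanosecond field to a timestamp
json("./buildtimes.json") |
  mutate(start_time = timestamp_ns(start_time)) |
  show()
```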
This was changed in 96dd5ba without explaining why. If there's a reason why dply currently has to depend on that particular commit of datafusion, it should probably be explained in a comment in `Cargo.toml`.
I'd like to be able to execute e.g.:

```
$ dply -c "json() | filter(x > 3) | csv()" < test.json > test.csv
# or
$ curl https://example.com/data.csv | dply -c "csv() | filter(x > 3) | json()" | jq .
```

However, this doesn't currently work, since the `csv`, `json` and `parquet` functions all require a file path and don't support working with stdin/stdout.

This would also allow queries stored in files to be run against different inputs without having to modify the query, e.g.:

```
$ dply example.query < foo.json
$ dply example.query < bar.json
```
Interesting package, thanks. Could you please consider making this package compile and run on Ubuntu 20.04, e.g. with an older glibc?

```
$ ldd --version
ldd (Ubuntu GLIBC 2.31-0ubuntu9.9) 2.31
```

This would make it much easier to deploy ad hoc in a container etc. (I know I can use cargo, but that's more top-heavy).

Thanks,
Colin

```
$ ./dply
./dply: /usr/lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.35' not found (required by ./dply)
./dply: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by ./dply)
./dply: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by ./dply)
./dply: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by ./dply)
```
Nice tool ... I really like that it has autocompletion. I was just very confused by the following error message:

```
〉csv("data.csv") | filter(name = "test") | show()
Error: Invalid argument 'name = "test"' for function 'filter'
```

It would be nice if it instead complained about the operator and perhaps even listed the supported operators. I found a better error message in the code:

Line 67 in 97d8c74

(besides that being a panic). However, the unhelpful error message appears to come from the `signatures` module ...
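For instance, a friendlier message might look something like the following; the exact operator list shown here is an assumption about what dply's `filter` accepts, not taken from its source:

```
Error: filter: unsupported operator '='; did you mean '=='?
       supported comparison operators: ==, !=, <, <=, >, >=
```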
Just passing on some love here. Great work on this tool! I'm really appreciating your approach to it. Dataframes can be difficult to deal with and you've hidden most of the complexity for users. I'd love to build your stuff into nushell if I could find an easy way to do it.
Thanks for the great tool!! I've been looking for something like this for a while, the pandas-backed options weren't working well for me!
I am wondering how best to specify column names that aren't valid Python names. Currently, if a column name includes for example a minus sign (`-`), it doesn't appear to be possible to select the column by simply quoting the name.

I was hoping this would work, but I get an error:

```
$ dply -c 'parquet("myfile.parquet") | select(column1, column2, "column-3", column_4) | show()'
Error: Match failure '"column-3"' must be an identifier
```

In the meantime I've been able to use the following as a workaround:

```
$ dply -c 'parquet("myfile.parquet") | select(column1, column2, starts_with("column-3"), column_4) | show()'
```
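Assuming dply treats backtick-quoted names as valid identifiers (as the completion discussion elsewhere in this thread suggests), a sketch of the intended query might be; whether `select` actually accepts this form is an assumption:

```
# hypothetical: backtick-quote names that aren't valid identifiers
$ dply -c 'parquet("myfile.parquet") | select(column1, column2, `column-3`, column_4) | show()'
```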
Hi! It seems to me that it would be convenient to automatically add parentheses to completions (e.g. `glimpse()`, `mutate(`). Please consider adding this feature.