juliadata / dataframesmeta.jl
Metaprogramming tools for DataFrames
Home Page: https://juliadata.github.io/DataFramesMeta.jl/stable/
License: Other
Creating a column with only one value is only an error if it's the first column:
julia> df = DataFrame(A=1:3)
3x1 DataFrames.DataFrame
│ Row │ A │
┝━━━━━┿━━━┥
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> @select(df, B=1, :A)
ERROR: New columns must have the same length as old columns
in insert_single_column!(::DataFrames.DataFrame, ::DataArrays.DataArray{Int64,1}, ::Symbol) at /home/milan/.julia/DataFrames/src/dataframe/dataframe.jl:309
in setindex!(::DataFrames.DataFrame, ::DataArrays.DataArray{Int64,1}, ::Symbol) at /home/milan/.julia/DataFrames/src/dataframe/dataframe.jl:368
in #select#17(::Array{Any,1}, ::Any, ::DataFrames.DataFrame) at /home/milan/.julia/DataFramesMeta/src/DataFramesMeta.jl:450
[inlined code] from ./boot.jl:307
in (::###8232#20{DataFrames.DataFrame})(::DataArrays.DataArray{Int64,1}) at /home/milan/.julia/DataFramesMeta/src/DataFramesMeta.jl:55
in eval(::Module, ::Any) at ./boot.jl:243
julia> @select(df, :A, B=1)
3x2 DataFrames.DataFrame
│ Row │ A │ B │
┝━━━━━┿━━━┿━━━┥
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 1 │
│ 3 │ 3 │ 1 │
Would it make sense to always recycle the columns so that the new data frame has the same number of rows as the original? Other constructs could be recommended for aggregation operations.
As noted in #48, based_on might not be the best name for this operation. It's hard to distinguish from groupby (and doesn't follow the same naming convention with regard to the underscore). summarize/summarise (like dplyr) might be more explicit. But it's a bit surprising that this operation also allows returning as many rows as in the original data, which is completely different from summarizing.
I wonder whether it couldn't be merged with select, as in SQL. AFAICT select doesn't work on GroupedDataFrame currently, so there would be no conflict.
julia> Pkg.add("DataFramesMeta")
fatal: bad default revision 'HEAD'
ERROR: failed process: Process(`git --git-dir=/home/diego/.julia/v0.4/.cache/DataFramesMeta log --all --format=%H`, ProcessExited(128)) [128]
in pipeline_error at process.jl:555
in readbytes at process.jl:515
in prefetch at pkg/cache.jl:44
in resolve at ./pkg/entry.jl:434
in edit at pkg/entry.jl:26
in anonymous at task.jl:447
in sync_end at ./task.jl:413
[inlined code] from task.jl:422
in add at pkg/entry.jl:46
in add at pkg/entry.jl:73
in anonymous at pkg/dir.jl:31
in cd at file.jl:22
in cd at pkg/dir.jl:31
in add at pkg.jl:23
Just wondering if it would be better to "rotate" the meaning of the expressions, like so:
expression | current | alternative |
---|---|---|
x | variable | column |
:x | column | symbol |
^(:x) | symbol | variable |
I notice that most expressions involve just the column. External variable references are much less frequent. This would reduce some of the visual noise due to the colons.
Of course it may be water under the bridge, but I'm just curious what others think of this.
Cross-posting from https://discourse.julialang.org/t/renaming-columns-in-dataframesmeta-with-spaces/11101
I’ve created the following dataset:
d = DataFrame(x=randn(100))
d[Symbol("Y 2")] = randn(100)
I can rename the second column via
rename!(d, Symbol("Y 2") => :Y)
But I would like to rename it via a DataFramesMeta chain.
I can do
@linq d |> select(X=:x)
But
@linq d |> select(Y=Symbol("Y 2"))
drops the column Y 2 and adds the column Y with only one row with the value “Y 2”
Additionally,
@linq d |> select(X=Symbol("x"))
does something similar. Is there any way to get around this, or do I need to use rename!?
Hi there,
I tried to replicate the example from ?@transform and I see
julia> df = DataFrame(A = 1:3, B = [2, 1, 2])
3×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1 │ 1 │ 2 │
│ 2 │ 2 │ 1 │
│ 3 │ 3 │ 2 │
julia> @transform(df, a = 2 * :A, x = :A + :B)
ERROR: MethodError: no method matching transform(::DataFrames.DataFrame, ::##12#14, ::##13#16)
Closest candidates are:
transform(::Union{Associative, DataFrames.AbstractDataFrame}; kwargs...) at /Users/74097/.julia/v0.6/DataFramesMeta/src/DataFramesMeta.jl:268
I just ran Pkg.update before this session. Julia v0.6. Thanks.
In the same vein, I get
julia> df = DataFrame(A = 1:3, B = [2, 1, 2])
3×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1 │ 1 │ 2 │
│ 2 │ 2 │ 1 │
│ 3 │ 3 │ 2 │
julia> @select(df, :B, :A)
3×2 DataFrames.DataFrame
│ Row │ B │ A │
├─────┼───┼───┤
│ 1 │ 2 │ 1 │
│ 2 │ 1 │ 2 │
│ 3 │ 2 │ 3 │
julia> @select(df, :B, x = :A .+ :B)
ERROR: MethodError: no method matching select(::DataFrames.DataFrame, ::DataArrays.DataArray{Int64,1}; B=[2, 1, 2])
Closest candidates are:
select(::DataFrames.AbstractDataFrame, ::Any) at /Users/74097/.julia/v0.6/DataFramesMeta/src/DataFramesMeta.jl:214 got unsupported keyword argument "B"
select(::Union{Associative, DataFrames.AbstractDataFrame}; kwargs...) at /Users/74097/.julia/v0.6/DataFramesMeta/src/DataFramesMeta.jl:431
Stacktrace:
[1] (::###3798#8{DataFrames.DataFrame})(::DataArrays.DataArray{Int64,1}, ::DataArrays.DataArray{Int64,1}) at /Users/74097/.julia/v0.6/DataFramesMeta/src/DataFramesMeta.jl:55
even though that is exactly what appears in the docstring.
I'm on Julia v0.7 and DataFramesMeta v3.0.0. It appears that @transform and @where are broken. I believe it could be related to some changes to broadcast, but most of the work in this package is done in macros, which are very unfamiliar to me, so I'm struggling to debug it.
@transform is behaving badly. If I just extract the columns and do something like df[:c] = df[:a] .+ df[:b], I have no issues. Sometimes I get nonsense for the result, sometimes not.
julia> dd = DataFrame(a=[1,1,1,1,1,2,2,2,2,2],b=rand(10),c=rand(10))
10×3 DataFrame
│ Row │ a │ b │ c │
├─────┼───┼───────────┼────────────┤
│ 1 │ 1 │ 0.940118 │ 0.00203923 │
│ 2 │ 1 │ 0.443475 │ 0.600234 │
│ 3 │ 1 │ 0.958618 │ 0.984405 │
│ 4 │ 1 │ 0.657766 │ 0.826958 │
│ 5 │ 1 │ 0.0482096 │ 0.249115 │
│ 6 │ 2 │ 0.903136 │ 0.27147 │
│ 7 │ 2 │ 0.319808 │ 0.697216 │
│ 8 │ 2 │ 0.0525784 │ 0.890392 │
│ 9 │ 2 │ 0.223741 │ 0.978436 │
│ 10 │ 2 │ 0.297486 │ 0.176859 │
julia> dd[:b] ./ dd[:c]
10-element Array{Float64,1}:
461.0164605051064
0.7388362802425429
0.9738049508515227
0.7954040678869911
0.19352378037309087
3.3268399796579144
0.4586931077205082
0.059050832041258175
0.22867234620825236
1.6820560963139088
julia> @transform(dd, d = (:b ./ :c))
10×4 DataFrame
│ Row │ a │ b │ c │ d │
├─────┼───┼───────────┼────────────┼───────────┤
│ 1 │ 1 │ 0.940118 │ 0.00203923 │ 461.016 │
│ 2 │ 1 │ 0.443475 │ 0.600234 │ 0.738836 │
│ 3 │ 1 │ 0.958618 │ 0.984405 │ 0.973805 │
│ 4 │ 1 │ 0.657766 │ 0.826958 │ 0.795404 │
│ 5 │ 1 │ 0.0482096 │ 0.249115 │ 0.193524 │
│ 6 │ 2 │ 0.903136 │ 0.27147 │ 3.32684 │
│ 7 │ 2 │ 0.319808 │ 0.697216 │ 0.458693 │
│ 8 │ 2 │ 0.0525784 │ 0.890392 │ 0.0590508 │
│ 9 │ 2 │ 0.223741 │ 0.978436 │ 0.228672 │
│ 10 │ 2 │ 0.297486 │ 0.176859 │ 1.68206 │
julia> @transform(dd, d = (:b ./ :c))
10×4 DataFrame
│ Row │ a │ b │ c │ d │
├─────┼───┼───────────┼────────────┼────────────┤
│ 1 │ 1 │ 0.940118 │ 0.00203923 │ 0.00216912 │
│ 2 │ 1 │ 0.443475 │ 0.600234 │ 1.35348 │
│ 3 │ 1 │ 0.958618 │ 0.984405 │ 1.0269 │
│ 4 │ 1 │ 0.657766 │ 0.826958 │ 1.25722 │
│ 5 │ 1 │ 0.0482096 │ 0.249115 │ 5.16732 │
│ 6 │ 2 │ 0.903136 │ 0.27147 │ 0.300586 │
│ 7 │ 2 │ 0.319808 │ 0.697216 │ 2.18011 │
│ 8 │ 2 │ 0.0525784 │ 0.890392 │ 16.9346 │
│ 9 │ 2 │ 0.223741 │ 0.978436 │ 4.37307 │
│ 10 │ 2 │ 0.297486 │ 0.176859 │ 0.59451 │
@where: sometimes I get an error that makes no sense given what I'm asking it to do. Judging by the error, it seems to be mixing the columns together. If I do one subset at a time, it works.
julia> dd = DataFrame(a = [Date(1998),Date(1999),Date(2000)],b=rand(3))
3×2 DataFrame
│ Row │ a │ b │
├─────┼────────────┼──────────┤
│ 1 │ 1998-01-01 │ 0.100379 │
│ 2 │ 1999-01-01 │ 0.516064 │
│ 3 │ 2000-01-01 │ 0.541581 │
julia> @where(dd, :b .>= 0.2, :a .>= Date(1998))
ERROR: MethodError: no method matching isless(::Float64, ::Date)
Closest candidates are:
isless(::Float64, ::Float64) at float.jl:457
isless(::Missing, ::Any) at missing.jl:62
isless(::AbstractFloat, ::AbstractFloat) at operators.jl:124
...
Stacktrace:
[1] <(::Float64, ::Date) at .\operators.jl:227
[2] <= at .\operators.jl:273 [inlined]
[3] >= at .\operators.jl:297 [inlined]
[4] (::getfield(, Symbol("##331#334")))(::Date, ::Float64, ::Date) at .\<missing>:0
[5] broadcast_nonleaf(::Function, ::Base.Broadcast.VectorStyle, ::Type{Union{}}, ::Tuple{Base.OneTo{Int64}}, ::Array{Date,1}, ::Vararg{Any,N} where N) at .\broadcast.jl:649
[6] broadcast(::Function, ::Base.Broadcast.VectorStyle, ::Type{Union{}}, ::Tuple{Base.OneTo{Int64}}, ::Array{Date,1}, ::Array{Float64,1}, ::Vararg{Any,N} where N) at .\broadcast.jl:626
[7] broadcast at .\broadcast.jl:618 [inlined]
[8] broadcast at .\broadcast.jl:615 [inlined]
[9] (::getfield(, Symbol("###1843#333")))(::Array{Float64,1}, ::Array{Date,1}) at C:\Users\tbeason\.julia\v0.7\DataFramesMeta\src\DataFramesMeta.jl:70
[10] (::getfield(, Symbol("##330#332")))(::DataFrame) at C:\Users\tbeason\.julia\v0.7\DataFramesMeta\src\DataFramesMeta.jl:72
[11] where(::DataFrame, ::getfield(, Symbol("##330#332"))) at C:\Users\tbeason\.julia\v0.7\DataFramesMeta\src\DataFramesMeta.jl:194
[12] top-level scope
I attempted to use transform to convert a text-column to a float-column. However, this fails with
MethodError: Cannot `convert` an object of type Float64 to an object of type String
This may have arisen from a call to the constructor String(...),
since type constructors fall back to convert methods.
in macro expansion at /home/sturm/.julia/v0.5/DataArrays/src/broadcast.jl:60 [inlined]
in macro expansion at ./cartesian.jl:64 [inlined]
in (::DataArrays.#_F_#204)(::DataArrays.DataArray{String,1}, ::DataArrays.DataArray{String,1}) at /home/sturm/.julia/v0.5/DataArrays/src/broadcast.jl:130
in broadcast!(::Function, ::DataArrays.DataArray{String,1}, ::DataArrays.DataArray{String,1}) at /home/sturm/.julia/v0.5/DataArrays/src/broadcast.jl:229
in databroadcast(::Function, ::DataArrays.DataArray{String,1}, ::Vararg{DataArrays.DataArray{String,1},N}) at /home/sturm/.julia/v0.5/DataArrays/src/broadcast.jl:235
in (::###346#72{DataFrames.DataFrame})(::DataArrays.DataArray{String,1}) at /home/sturm/.julia/v0.5/DataFramesMeta/src/DataFramesMeta.jl:55
Here is a minimal working example:
function return_float(x)
42.0
end
function return_string(x)
"42"
end
df = DataFrame(a=["1", "2"])
# works
@transform(df, new_col = return_string.(:a))
# fails
@transform(df, new_col = return_float.(:a))
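For comparison, the conversion itself can be sketched outside of @transform. This is only a minimal illustration in plain Julia (1.x syntax assumed, not the v0.5/DataArrays setup from the report); the vector a is a made-up stand-in for the column. The point is that the broadcast allocates a fresh Float64 vector instead of writing the results into a String container:

```julia
# Stand-in for the text column from the report (hypothetical data).
a = ["1", "2"]

# Broadcasting `parse` builds a new Vector{Float64}; nothing is written
# back into the String vector, so no Float64 -> String conversion is attempted.
new_col = parse.(Float64, a)
```

Whether a given DataFramesMeta release behaves the same way inside @transform depends on the broadcast machinery it uses, so treat this as a reference point rather than a fix.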
Since @where operates by row, I would find it both natural and practical to allow using non-vectorized operators, like == instead of .==. That would be particularly useful for in, which is currently not vectorized and might never be (JuliaLang/julia#5212). For example, it would be great to be able to write:
using DataFrames, DataFramesMeta, RDatasets
iris = dataset("datasets", "iris")
@where iris :Species in ("setosa", "virginica")
instead of as currently:
@where iris (:Species .== "setosa") | (:Species .== "virginica")
Do you think this would be technically possible? (If so, this idea could also be applied to other macros.)
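Until something like this exists, membership can be tested elementwise by broadcasting in over the column. The following is a minimal sketch in plain Julia, assuming Julia 0.7 or later (dot-call syntax and Ref); the species vector is made-up data standing in for the Species column:

```julia
# Hypothetical stand-in for the Species column.
species = ["setosa", "versicolor", "virginica", "setosa"]
wanted = ("setosa", "virginica")

# Ref(wanted) makes broadcasting treat the tuple as a single value,
# so `in` is applied once per element of `species`.
keep = in.(species, Ref(wanted))

# The Bool mask can then be used for ordinary indexing.
selected = species[keep]
```

A mask like keep is the kind of per-row Bool vector that row-filtering macros expect, so it should be usable wherever (:Species .== "setosa") | (:Species .== "virginica") is used today.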
@byrow doesn't seem to be included in the version installed using Pkg.add.
On current master
myd = DataFrame(x=zeros(10), y=ones(10))
@byrow myd begin
println(:x)
end
gives me the error "ERROR: myd not defined"
On commit e84ac6c it works as expected.
This issue parallels JuliaData/DataFrames.jl#369.
@byrow! may be close to what @nalimilan is after in the issue above. Here is a recoding example:
n = nrow(df)
df[:x] = Array(Int, n)
df[:y] = NullableArray(Float64, n)
@byrow! df begin
:x = :a > 1 ? 4 : 9
:y = :a * :b
end
The main irritation with the code above is allocating new columns. That could be handled by extending @transform or making a new macro that does something similar:
@transform(df, x = Array{Int}, y = NullableArray{Float64})
Another option is to add a feature to @byrow! that does this:
@byrow! df let x = Array{Int},
y = NullableArray{Float64}
:x = :a > 1 ? 4 : 9
:y = :a * :b
end
Or:
@byrow! df begin
@newcol x = Array{Int}
@newcol y = NullableArray{Float64}
:x = :a > 1 ? 4 : 9
:y = :a * :b
end
Any opinions on these or other approaches? Any other features that byrow! needs to better recode data?
e.g. dataframe a = DataFrames.DataFrame
│ Row │ variable │ value │ Sites │
├─────┼──────────┼──────────┼──────────┤
│ 1 │ MFB_PM25 │ 1.7166 │ "x1142A" │
│ 2 │ MFB_PM25 │ -13.8486 │ "x1143A" │
│ 3 │ MFB_PM25 │ 11.2938 │ "x1144A" │
│ 4 │ MFB_PM25 │ -8.32452 │ "x1145A" │
I want to select the rows where Sites is x1143A or x1144A. How do I code it?
@where(a, :Sites.== ???
EDIT: Nevermind, this turns out not really to be an issue. See comment below. I'm leaving the following original post unedited for posterity.
This issue arises while trying to rebind an external variable inside of an @with call:
julia> using DataFrames, DataFramesMeta
julia> df = DataFrame(A = 1:3, B = [2, 1, 2])
3x2 DataFrame
| Row | A | B |
|-----|---|---|
| 1 | 1 | 2 |
| 2 | 2 | 1 |
| 3 | 3 | 2 |
julia> y = 1
1
julia> @with df for i in 1:3; :A[i] += y end
julia> df
3x2 DataFrame
| Row | A | B |
|-----|---|---|
| 1 | 2 | 2 |
| 2 | 3 | 1 |
| 3 | 4 | 2 |
julia> @with df for i in 1:3; y += :A[i] end
ERROR: y not defined
in ##24126 at /Users/David/.julia/v0.3/DataFramesMeta/src/DataFramesMeta.jl:1
julia> @with df y += :A[1]
ERROR: y not defined
in ##24128 at /Users/David/.julia/v0.3/DataFramesMeta/src/DataFramesMeta.jl:47
I can't remember the particular feature of Julia's scoping logic that provides this behavior. It seems that inside of a local scope, an external variable can be referenced so long as it is not rebound.
Of course, defining a function through which to pass the external value works fine:
julia> function f(x) @with df x += :A[1] end
f (generic function with 2 methods)
julia> f(y)
6
However, in the interest of making certain kinds of functionality more generally accessible in the REPL (think demographics of users who aren't used to wrapping things in functions), I wonder if we should provide some sort of tool (macro) for doing this more efficiently. Something like
@relay y @with df y += :A[1]
where @relay would generate a function, replace all instances of y in the body @with df y += :A[1] with a variable, and then call that function on y. Incorporating functionality for multiple external variables might be tricky, though at this point of playing around I believe an efficient solution is possible. I'm happy to work on such a feature.
I don't know if this proposal approaches the realm of too much magic, but the core idea seems pretty straightforward. I would also provide thorough documentation on the internals, so lay users would be able to troubleshoot more effectively and understand their code better in general.
Or, we could just tell folks to wrap their code in a function.
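Another option that avoids rebinding altogether is to keep the accumulator in a mutable container and update its contents. The sketch below is plain Julia with no DataFramesMeta involved; the closure body is a hypothetical stand-in for the function that @with appears to generate (an assumption based on the stack traces above), and A is a made-up column:

```julia
A = [1, 2, 3]      # stand-in for the column :A
acc = Ref(1)       # plays the role of the external variable y

# The closure only mutates the contents of `acc`; the name `acc` itself
# is never rebound, so it is resolved from the enclosing scope.
body = col -> begin
    for i in eachindex(col)
        acc[] += col[i]
    end
    acc[]
end

result = body(A)
```

This trades the y += ... spelling for acc[] += ..., which may or may not be acceptable for the REPL users described above.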
When I try to use based_on within a linq call, I get an UndefVarError. Here is an MWE:
using DataFrames, DataFramesMeta
df = DataFrame(id = [1, 1, 2, 2, 3, 3], vals = randn(6))
@linq df |> groupby(:id) |> based_on(totval = sum(vals))
For me, the result is a red "ERROR: UndefVarError: based_on not defined".
Here is the output of Pkg.status(DataFramesMeta):
julia> Pkg.status("DataFramesMeta")
Any idea why this doesn't work?
Hello,
some of the examples give an error on julia 0.4.5
The following code
using DataArrays, DataFrames
using DataFramesMeta
df = DataFrame(x = 1:3, y = [2, 1, 2])
transform(df, newCol = cos(df[:x]), anotherCol = df[:x]^2 + 3*df[:x] + 4)
@transform(df, newCol = cos(:x), anotherCol = :x^2 + 3*:x + 4)
produces
ERROR: MethodError: `*` has no method matching *(::Array{Int64,1}, ::Array{Int64,1})
Replacing ^ with .^ fixes the problem, i.e.:
transform(df, newCol = cos(df[:x]), anotherCol = df[:x].^2 + 3*df[:x] + 4)
@transform(df, newCol = cos(:x), anotherCol = :x.^2 + 3*:x + 4)
Best regards,
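For readers on newer Julia versions, the same computation can be sketched on a plain vector with every elementwise step dotted. This assumes Julia 0.6 or later (dot-call syntax); x is a made-up stand-in for the column, and in Julia 1.x the undotted cos(x) and 3*x + 4 forms no longer work on vectors at all:

```julia
x = [1, 2, 3]                          # stand-in for df[:x]

new_col = cos.(x)                      # elementwise cosine
another_col = x .^ 2 .+ 3 .* x .+ 4    # elementwise polynomial
```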
Not an issue, but as I'm making a version of this package for IndexedTables here, I thought it was important to notify this repository.
I prefer to keep the implementations separate as IndexedTables are very different from DataFrames under the hood (complete type information about columns, fast row iteration), which makes the preferred implementations very different for the two cases.
I'm acknowledging this package in the README as most of the ideas come from here (I think historically this package was the first use of metaprogramming in Julia to simplify table operations, which I believe is a brilliant idea).
On version 0.1.3, I get
using DataFramesMeta
WARNING: Method definition nrow(DataFramesMeta.AbstractCompositeDataFrame) in module
DataFramesMeta at C:\Users\rcnlee\julia\v0.5\DataFramesMeta\src\compositedataframe.jl:108
overwritten at C:\Users\rcnlee\.julia\v0.5\DataFramesMeta\src\compositedataframe.jl:109
There seem to be two definitions for nrow(cdf::AbstractCompositeDataFrame).
Here's an operation I perform often:
true
Maybe something like the following?
filtergroups(data, [:id], df -> size(df, 1) == expected_rows)
filtergroups(data, row -> row[:value] == 42)
Hi. Sorry to be a bother. I'm getting an error saying "UndefVarError @newCol not defined", when running the block in the readme that includes @byrow! and @newCol as displayed below.
df = DataFrame(A = 1:3, B = [2, 1, 2])
df2 = @byrow! df begin
@newcol colX::Array{Float64}
@newcol colY::DataArray{Int}
:colX = :B == 2 ? pi * :A : :B
if :A > 1
:colY = :A * :B
end
end
I'm trying to apply this to my own dataframe to bring separate time and date columns into a single column that contains DateTime elements. I was thinking this would work:
dateForm = Dates.DateFormat("d-u-y H")
newMachineData = @byrow! MachineData begin
@newCol DT::Array{DateTime}
:DT = DateTime(string(:DATE, " ", :HOUR), dateForm)
end
However, the error with @newCol is raised. It's likely I'm missing something simple, though. I am new to Julia.
In DataFrames you can write the following:
names = [:newName1, :newName2, :newName3]
for name in names
df[name] = 5
end
I believe that DataFramesMeta's version of this does not make it possible to do such a thing with the @transform macro. With DataFrames, you can store names as a list of symbols, but there is no equivalent object you can use to store names when you wish to use the @transform(x = :oldVar) command.
Is this missing functionality?
There are multiple failures:
Error During Test
Test threw an exception of type MethodError
Expression: size(df[1:2,:]) == (2,2)
MethodError: broadcast(::DataFramesMeta.##15#16{TestCompositeDataFrames.CompositeDF##349}) is ambiguous. Candidates:
broadcast(f, x::Number...) at broadcast.jl:16
broadcast(f::Function, As::DataArrays.PooledDataArray...) at /Users/ranjan/.julia/v0.5/DataArrays/src/broadcast.jl:318
in nrow at /Users/ranjan/.julia/v0.5/DataFramesMeta/src/compositedataframe.jl:109 [inlined]
in size(::TestCompositeDataFrames.CompositeDF##349) at /Users/ranjan/.julia/v0.5/DataFrames/src/abstractdataframe/abstractdataframe.jl:214
in include_from_node1(::String) at ./loading.jl:426
in macro expansion; at /Users/ranjan/.julia/v0.5/DataFramesMeta/test/runtests.jl:15 [inlined]
in anonymous at ./<missing>:?
in include_from_node1(::String) at ./loading.jl:426
in process_options(::Base.JLOptions) at ./client.jl:262
in _start() at ./client.jl:318
FAILED: compositedataframes.jl
and
Test Failed
Expression: xlinq2[[:meanX,:meanY]] == xlinq[[:meanX,:meanY]]
Evaluated: 3×2 DataFrames.DataFrame
│ Row │ meanX │ meanY │
├─────┼──────────┼─────────┤
│ 1 │ 0.537542 │ 5.37542 │
│ 2 │ 0.521357 │ 5.21357 │
│ 3 │ 0.494239 │ 4.94239 │ == 3×2 DataFrames.DataFrame
│ Row │ meanX │ meanY │
├─────┼──────────┼─────────┤
│ 1 │ 0.521357 │ 5.21357 │
│ 2 │ 0.494239 │ 4.94239 │
│ 3 │ 0.537542 │ 5.37542 │
FAILED: linqmacro.jl
ERROR: LoadError: "Tests failed"
in include_from_node1(::String) at ./loading.jl:426
in process_options(::Base.JLOptions) at ./client.jl:262
in _start() at ./client.jl:318
while loading /Users/ranjan/.julia/v0.5/DataFramesMeta/test/runtests.jl, in expression starting on line 27
Tried adding a minus sign before the column symbol, but got the error msg
MethodError: no method matching -(::String)
...
Is there any workaround? Thanks.
I wonder what's the fundamental difference between transform and select. The most visible difference seems to be that the former always returns as many rows as the input, contrary to select:
julia> df = DataFrame(A=1:3)
3x1 DataFrames.DataFrame
│ Row │ A │
┝━━━━━┿━━━┥
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> @select(df, A=1)
1x1 DataFrames.DataFrame
│ Row │ A │
┝━━━━━┿━━━┥
│ 1 │ 1 │
julia> @transform(df, A=1)
3x1 DataFrames.DataFrame
│ Row │ A │
┝━━━━━┿━━━┥
│ 1 │ 1 │
│ 2 │ 1 │
│ 3 │ 1 │
Are there any other differences? The documentation isn't very explicit about them.
Do we really need two different macros here? I find it a bit confusing, and the names don't really reflect the differences between them. AFAICT, dplyr only offers select and mutate (equivalent of transform) because the former only allows selecting columns, and not transforming them. It sounds like a powerful SQL-like @select (what we have now) is a better solution.
I'm trying to reproduce the examples in the README; however, I found that if the using Lazy statement is executed before using DataArrays, DataFrames, DataFramesMeta, an error will be raised when calling @where(df, :x .> 1): ERROR: wrong number of arguments
I am using Julia v0.4, on Ubuntu 14.04 x64 desktop version.
Following https://stackoverflow.com/questions/50547688/julia-using-groupby-with-dataframesmeta-and-lazy.
Would it be possible to add support for passing qualified function names to the @linq macro?
Hi,
why?
@where((df, :x .> x) & (df, :y .== 3))
ERROR: wrong number of arguments
NB:
df = DataFrame(x = 1:3, y = [2, 1, 2])
x = [2, 1, 0]
p.s.
versioninfo()
Julia Version 0.4.1
Commit cbe1bee (2015-11-08 10:33 UTC)
Platform Info:
System: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz
WORD_SIZE: 64
BLAS: libopenblas (NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: liblapack.so.3
LIBM: libopenlibm
LLVM: libLLVM-3.3
This is a small gripe with DataFramesMeta.
When using the linq macros, you write df = @> begin df
When using other DataFramesMeta macros, you write the begin first with df = @byrow! df begin
Any interest in changing the linq macro to be @> df begin?
When I try to do something very similar to the Lazy.jl example in the docs, I get an error informing me that both Lazy.jl and DataFramesMeta.jl have macros called "with", and as a result, uses of with must be "qualified". Here is the code block I am trying to run. For reference, "ndic" is a dataframe, and dfsplit() is a helper function that more or less calls split() in a particular way.
using DataFrames
using Lazy
using DataFramesMeta
...
pools = @> begin
ndic
@where(!isna(:ProducedPools))
@by(:FileNo, df -> dfsplit(df, :ProducedPools, ','))
@select(:FileNo, poolorder = :splitorder, pool = :splitval)
@where(:pool .!= "Confidential")
end
Here is the error I get:
julia> include("CODE.jl")
WARNING: both DataFramesMeta and Lazy export "@with"; uses of it in module Main must be qualified
ERROR: LoadError: UndefVarError: @with not defined
in include at /usr/local/Cellar/julia/0.4.3/lib/julia/sys.dylib
in include_from_node1 at /usr/local/Cellar/julia/0.4.3/lib/julia/sys.dylib
while loading /PATH_TO_CODE/CODE.jl, in expression starting on line 36
We used to have this "hidden" feature where you could use _DF to access the whole DataFrame (or SubDataFrame in a group operation). It is occasionally nice. I think it got coded out in #73.
I've been playing around with this package, and it's already made some of my real-world work a lot easier. I came across this error just now: based_on does not seem to work when chained together with other operations using @linq.
With R and dplyr, I can do the following:
> iris %>% group_by(Species) %>% summarise(x=mean(Sepal.Width))
but when I try the (I think?) equivalent in Julia, I get an error:
julia> @linq iris |> groupby(:Species) |> based_on(x = mean(SepalWidth))
ERROR: UndefVarError: based_on not defined
Is this something that should work? (Or am I doing something incorrectly?)
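For anyone landing here on a current setup: to my knowledge, later DataFramesMeta releases spell this operation @combine rather than based_on, and re-export @chain for piping. The sketch below assumes such a recent release together with DataFrames 1.x; the tiny iris table is made-up data, not the RDatasets one:

```julia
using DataFrames, DataFramesMeta, Statistics

# Hypothetical miniature of the iris data.
iris = DataFrame(Species = ["setosa", "setosa", "virginica"],
                 SepalWidth = [3.0, 4.0, 2.0])

# Group by species, then reduce each group to one row.
out = @chain iris begin
    groupby(:Species)
    @combine(:x = mean(:SepalWidth))
end
```

Note that the column is referenced as :SepalWidth with a leading colon, as in the other macros. This does not explain the UndefVarError reported above for the @linq form in the release the poster was using.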
A method error occurs when attempting to use more than one filter statement in @where with the most recent release.
Version info:
Code to reproduce error:
Pkg.add("DataFramesMeta", v"0.1.3")
using DataFrames, DataFramesMeta
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,2,2])
@where(df, :name .== "John", :age .== 23)
Error log:
MethodError: &(::DataArrays.DataArray{Bool,1}, ::DataArrays.DataArray{Bool,1}) is ambiguous. Candidates:
&(a::DataArrays.DataArray{Bool,N} where N, b::Union{AbstractArray{Bool,N} where N, Bool}) in DataArrays at /home/juser/.julia/v0.6/DataArrays/src/operators.jl:390
&(b::Union{AbstractArray{Bool,N} where N, Bool}, a::DataArrays.DataArray{Bool,N} where N) in DataArrays at /home/juser/.julia/v0.6/DataArrays/src/operators.jl:390
Possible fix, define
&(::DataArrays.DataArray{Bool,N} where N, ::DataArrays.DataArray{Bool,N} where N)
Suppose I want to filter my dataframe by multiple constraints (which is not rare):
@where(df, :x1 .== 5, :x2 .== 3)
but now it gives back: ERROR: wrong number of arguments. Can it be enhanced to take multiple constraints, as dplyr in R does? :D
Hi there,
I'm using Julia 0.6 and DataFramesMeta 0.3.0.
The following code produces a deprecation warning.
I'm not sure what I need to change to avoid the warning. Any ideas?
using DataFrames
using DataFramesMeta
tbl = DataFrame(a = [22,11,33])
@linq tbl |> orderby(:a)
Many of the LINQ-like operations have a functional form where a function is passed as an argument, like:
where(g, d -> mean(d[:a]) > 0)
This form is quite common with LINQ libraries. The issue is argument ordering. Base Julia functions like map all have the function passed as the first argument. The LINQ standard uses the function as the second argument, and passing the "operations" as the second argument is more consistent with the macro form: @where(g, mean(:a) > 0).
Our choices are: (1) follow Julia base, (2) follow the LINQ standard, or (3) define both.
I was trying to replicate Stata's rowmean function using @byrow!.
In Stata this is
local vars x1 x2 x3
gen t = rowmean(`vars')
Using byrow you would do
vars = [:x1, :x2, :x3]
df[:works] = 0
df[:not_works] = 0
@byrow! df begin
:works = mean([:x1, :x2, :x3])
:not_works = mean(_I_(vars))
end
The reason this doesn't work is that _I_(vars) reflexively returns df[vars]. If vars is more than one column, it returns a dataframe, and mean is not defined for a dataframe.
The obvious solution would be to have _I_ return a DataFrameRow in this context. However, mean still isn't defined for a DataFrameRow. Where do we stand on overloading array functions to work on DataFrameRows?
Unless all of this is tangential to an obvious solution to my original problem. Maybe there is a better way to do rowmean than I know about.
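One way to get a row mean without going through @byrow! is to reduce over the selected columns as a matrix. This is a minimal sketch assuming Julia 1.x and DataFrames 1.x (not the API from the time of this post); the data and the column name t are made up, and vars mirrors the list above:

```julia
using DataFrames, Statistics

# Hypothetical data with the three columns named in the post.
df = DataFrame(x1 = [1.0, 2.0], x2 = [3.0, 4.0], x3 = [5.0, 9.0])
vars = [:x1, :x2, :x3]

# Copy the selected columns into a matrix, average across columns
# (dims = 2), and flatten the resulting n-by-1 matrix into a vector.
df.t = vec(mean(Matrix(df[:, vars]), dims = 2))
```

This copies the selected columns and requires them to share a common numeric type, so it is a workaround rather than an answer to the DataFrameRow question.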
@within (or @transform) could be developed. Here's a hypothetical example:
@within df begin
x = 4
:x = :colA + :colB + x # a new column
:z = 2 * :x # can we create a new column based on another newly created column?
rm(:y) # should we / could we allow deletion of columns?
end
Another option might be to extend DataFrame() to take a DataFrame and keyword arguments to create a new DataFrame as:
df1 = @with df DataFrame(df,
x = :colA + :colB, # a new column
z = 2 * :x) # can we create a new column based on another newly created column?
This goes well with using DataFrame with @with to create a new DataFrame based on another DataFrame. This already works:
df1 = @with df DataFrame(
x = :colA + :colB,
z = 2 * :colA)
Here are several issues that discuss metaprogramming and/or different
approaches to query and manipulate DataFrames:
Please add additional comments here on better approaches to querying DataFrames.
Hi there,
I am creating some new macros for use with the @linq macro. I can create the macros, but I can't get @linq to recognize them. Simple example below. Any ideas?
using DataFrames
using DataFramesMeta
# simple renaming of where macro
macro selectrows(d, args...)
esc(DataFramesMeta.where_helper(d, args...))
end
function linq(::DataFramesMeta.SymbolParameter{:selectrows}, d, args...)
DataFramesMeta.where_helper(d, args...)
end
df = DataFrame(x = 1:3, y = [2, 1, 2]);
@linq df |> @selectrows(:x .> 1) # OK
@linq df |> selectrows(:x .> 1) # Error
The error is:
ERROR: MethodError: no method matching isless(::Int64, ::Symbol)
Closest candidates are:
isless(::Symbol, ::Symbol) at strings/basic.jl:136
isless(::DataArrays.NAtype, ::Any) at /home/jock/.julia/v0.6/DataArrays/src/operators.jl:383
isless(::Real, ::AbstractFloat) at operators.jl:97
...
Stacktrace:
[1] (::##97#98)(::Symbol) at ./<missing>:0
[2] broadcast(::Function, ::Symbol) at ./broadcast.jl:434
@where(datos, :x_13 .== "#0_PHY")
throws the following error because the column has missing values:
LoadError: DataArrays.NAException("cannot index an array with a DataArray containing NA values")
while loading In[8], in expression starting on line 1
in to_index at /home/dzea/.julia/v0.4/DataArrays/src/indexing.jl:76
in getindex at /home/dzea/.julia/v0.4/DataArrays/src/indexing.jl:173
in getindex at /home/dzea/.julia/v0.4/DataFrames/src/dataframe/dataframe.jl:281
in where at /home/dzea/.julia/v0.4/DataFramesMeta/src/DataFramesMeta.jl:164
datos[:x_13] .== "#0_PHY" works and returns a DataArrays.DataArray{Bool,1}, but datos[datos[:x_13] .== "#0_PHY",:] throws the same error.
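The usual way around this is to decide explicitly how missing comparisons should count before indexing. The sketch below uses current Julia, where missing plays the role that DataArrays' NA plays in this report; that substitution is an assumption, and x_13 is a made-up stand-in for the column:

```julia
# Hypothetical column with one missing entry.
x_13 = ["#0_PHY", missing, "other", "#0_PHY"]

# The comparison propagates missingness: the second entry is `missing`.
cmp = x_13 .== "#0_PHY"

# Treat "unknown" as "no match" so the mask is purely Bool.
mask = coalesce.(cmp, false)

kept = x_13[mask]
```

Using isequal.(x_13, "#0_PHY") instead would give the same mask here, since isequal never returns missing.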
I seem to be getting an error and wanted to see if I was using DataFramesMeta incorrectly. Here is the set-up:
julia> a
12-element Array{Float64,1}:
3.1
6.8
9.0
9.0
11.3
16.2
8.7
9.0
10.1
12.1
18.7
23.1
julia> b
12-element Array{Int64,1}:
0
1
0
0
1
0
0
0
1
1
0
1
I wanted to filter to the rows where a is unique. This would result in the following:
julia> unique(df[:a])
10-element DataArrays.DataArray{Float64,1}:
3.1
6.8
9.0
11.3
16.2
8.7
10.1
12.1
18.7
23.1
It seemed that @where would be appropriate, but I got an InexactError().
julia> @where(df, unique(df[:a]))
ERROR: InexactError()
in copy!(::Base.LinearFast, ::Array{Int64,1}, ::Base.LinearFast, ::Array{Float64,1}) at ./abstractarray.jl:559
in getindex(::DataFrames.DataFrame, ::DataArrays.DataArray{Float64,1}) at /Users/alexhallam/.julia/v0.5/DataFrames/src/dataframe/dataframe.jl:234
in where(::DataFrames.DataFrame, ::##34#35) at /Users/alexhallam/.julia/v0.5/DataFramesMeta/src/DataFramesMeta.jl:164
How can I filter -- in the dplyr sense of the word -- by the unique values in a?
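Judging by the stack trace, the Float64 values returned by unique(df[:a]) were treated as row indices, which is why an InexactError appears; a row filter needs one Bool per row instead. A minimal sketch of that, assuming DataFrames 1.x (not the v0.5-era API above) and a shortened, made-up version of the data:

```julia
using DataFrames

# Shortened stand-in for the a and b vectors above.
df = DataFrame(a = [3.1, 6.8, 9.0, 9.0, 11.3], b = [0, 1, 0, 0, 1])

# nonunique marks rows whose `a` value has already been seen;
# negating it gives a Bool mask that keeps the first occurrence.
mask = .!nonunique(df, :a)
first_rows = df[mask, :]
```

unique(df, :a) should return the same rows in one call; the explicit mask is shown because it is the shape of argument that @where works with.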
DataFramesMeta and Lazy work very well together, and personally I like the @> syntax the best out of all the chaining.
However the two packages export conflicting names, which is frustrating.
julia> using Lazy, DataFramesMeta, DataFrames
julia> df = DataFrame(rand(10,10));
julia> @with df begin
y = :x1
end
WARNING: both DataFramesMeta and Lazy export "@with"; uses of it in module Main must be qualified
ERROR: UndefVarError: @with not defined
and
groupby(df, :x1)
WARNING: both DataFrames and Lazy export "groupby"; uses of it in module Main must be qualified
What can be done to make sure these conflicts don't pop up?
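One thing that can be done on the user side is to import only the names that are needed from one of the packages, and to qualify the rest. The sketch below assumes reasonably recent versions of all three packages and that Lazy still provides @>; the data is made up:

```julia
using DataFrames, DataFramesMeta
using Lazy: @>          # bring in only the threading macro from Lazy

df = DataFrame(x1 = [1, 2, 3])

# A fully qualified macro call is unambiguous regardless of what else is loaded.
total = DataFramesMeta.@with df sum(:x1)

# groupby now refers to DataFrames.groupby only, since Lazy's exports
# other than @> were not imported.
groups = groupby(df, :x1)

# Lazy's threading macro is still available.
n = @> df nrow
```

With the selective import, an unqualified @with would also resolve to DataFramesMeta's macro; the qualified form is shown because it keeps working even after a plain using Lazy.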
Cross-posted from Discourse here
One thing that is necessary when working with survey data is being able to create new variables by standardizing or capping existing variables. One workflow might look like this:
local variables_to_cap price mpg headroom trunk weight
foreach variable in `variables_to_cap' {
gen `variable'_std = `variable' // generate a new variable but with a suffix added
sum `variable'_std, detail
replace `variable'_std = r(p99) if `variable'_std > r(p99) & !missing(`variable'_std)
}
Above, the suffix _std indicates to the user that we are working with the standardized variable. The benefit of the above code is that you can add the suffix extremely easily, and use the exact same code to generate many standardized variables at once (Ideally, this would be done using a function.)
Using DataFrames we can’t do this, because the way to refer to an existing variable is with a symbol, while the way you create a variable is different:
@transform(df, newCol = :oldCol) # Command(?) = Symbol works
@transform(df, symbol("newCol") = :oldCol) # error
@transform(df, :newCol) = :oldCol) # error
My ideal functionality would be something along the lines of what Stata does. You have a list of variables you wish to transform:
function CapVariable (x::Array)
...
end
variables_to_cap = [ ... ] # An array of symbols?
for variable in variables_to_cap
    df = @transform(df, suffix(variable, "_std") = CapVariable(variable))
end
How would one go about implementing this functionality?
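For illustration only, here is a minimal sketch of the suffix idea in plain Julia. It is not a DataFramesMeta answer: a Dict of column vectors stands in for the data frame, and the helper cap_at_p99, the name columns, and the sample values are all hypothetical. The one part that should carry over is building the new column name with Symbol(string(variable, "_std")).

```julia
using Statistics  # standard library; provides quantile

# Hypothetical helper (not part of DataFrames or DataFramesMeta):
# cap every value at the 99th percentile of the vector.
cap_at_p99(x) = min.(x, quantile(x, 0.99))

# Stand-in for a data frame: column names (Symbols) mapped to vectors.
# The values are made up for illustration.
columns = Dict{Symbol,Vector{Float64}}(
    :price => [1.0, 2.0, 3.0, 100.0],
    :mpg   => [10.0, 20.0, 30.0, 400.0],
)

variables_to_cap = [:price, :mpg]

for variable in variables_to_cap
    # Build the new name programmatically, e.g. :price becomes :price_std
    new_name = Symbol(string(variable, "_std"))
    columns[new_name] = cap_at_p99(columns[variable])
end
```

Whether the same pattern can be written inside @transform depends on the DataFramesMeta version, so this sketch stays outside the macro on purpose. It also ignores missing values, which the Stata version handles explicitly.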
Hello all. I'm currently using the new NullableArrays version of DataFrames, and have sort of been putting off making my own fork of DataFramesMeta for NullableArrays. For me personally, DataFramesMeta is by far the best solution for querying dataframes in most situations. Before I start work on a fork, has any work been done on this yet? Is there any roadmap for this? I don't see a branch or issue, so I assume the answer to both questions is no.
Hello, I have a float column in my DataFrame and I can't use "*" to multiply two columns together. I can, however, use "+".
leg[:gi]
DataArrays.DataArray{Float64,1}:
This fails.
leg = @transform(leg, gi_sq = :gi * :gi)
MethodError: * has no method matching *(::Array{Float64,1}, ::Array{Float64,1})
This works.
leg = @transform(leg, gi_sq = :gi + :gi)
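A likely explanation, sketched in plain Julia on ordinary vectors rather than on the DataFrame above: for arrays, * means matrix multiplication, which is not defined for two column vectors, whereas + on two arrays of the same shape is already elementwise. The dotted operator .* gives the elementwise product. The vector gi below is made-up sample data; inside @transform the equivalent would presumably be :gi .* :gi.

```julia
# Made-up stand-in for the column leg[:gi]
gi = [1.5, 2.0, 3.0]

# Elementwise product: the dotted operator broadcasts over both vectors
gi_sq = gi .* gi

# Addition of two equally sized vectors is elementwise without a dot
gi_sum = gi + gi

# Plain * is matrix multiplication, which has no method for two vectors
plain_star_fails = try
    gi * gi
    false
catch err
    err isa MethodError
end
```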
It seems as if DataFrames.based_on has been deprecated? I'm very new to Julia, so I might just be confused. The error message I get is "based_on is not defined". Here is an example:
using RDatasets
iris = dataset("datasets", "iris")
@as _ begin
iris
groupby(_, :Species)
@based_on(_, Mean_Petal_Length = mean(:Petal_Length))
end
I think perhaps transform from DataFramesMeta has some problems when the new column contains missing. When I do an operation involving a column of type Union{<:Real,Missing}, the new column will be of type Any rather than Union{<:Real,Missing}.
julia> dd = DataFrame(a=[1,2,3],b=[1,2,missing])
3×2 DataFrames.DataFrame
│ Row │ a │ b │
├─────┼───┼─────────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ missing │
julia> eltypes(dd)
2-element Array{Type,1}:
Int64
Union{Int64, Missings.Missing}
julia> dd=@transform(dd,aa=2*:a,bb=2*:b)
3×4 DataFrames.DataFrame
│ Row │ a │ b │ aa │ bb │
├─────┼───┼─────────┼────┼─────────┤
│ 1 │ 1 │ 1 │ 2 │ 2 │
│ 2 │ 2 │ 2 │ 4 │ 4 │
│ 3 │ 3 │ missing │ 6 │ missing │
julia> eltypes(dd)
4-element Array{Type,1}:
Int64
Union{Int64, Missings.Missing}
Int64
Any
dplyr provides n() which gives the number of rows in a subgroup. In Julia, it would look something like this:
iris2 = @> begin
iris
@by( :Species,
count = n(),
sepal_length = mean(:SepalLength))
end
In order to get what I want, I have to do this:
iris2 = @as _ begin
iris
@transform(count = 1)
@by( :Species,
count = sum(:count),
sepal_length = mean(:SepalLength))
end
It seems like @byrow should be written @byrow!, because it seems like a mutate-in-place function. In fact, it might be useful to have both a mutate and a mutate-in-place version of all the functions in DataFramesMeta.