JuliaData / DataFramesMeta.jl

Metaprogramming tools for DataFrames

Home Page: https://juliadata.github.io/DataFramesMeta.jl/stable/

License: Other

Julia 100.00%

data data-frame dataframes dataframesmeta datasets hacktoberfest julia tabular-data

dataframesmeta.jl's Issues

Non-systematic "New columns must have the same length as old columns" error

Creating a column with only one value is only an error if it's the first column:

julia> df = DataFrame(A=1:3)
3x1 DataFrames.DataFrame
│ Row │ A │
┝━━━━━┿━━━┥
│ 1   │ 1 │
│ 2   │ 2 │
│ 3   │ 3 │

julia> @select(df, B=1, :A)
ERROR: New columns must have the same length as old columns
 in insert_single_column!(::DataFrames.DataFrame, ::DataArrays.DataArray{Int64,1}, ::Symbol) at /home/milan/.julia/DataFrames/src/dataframe/dataframe.jl:309
 in setindex!(::DataFrames.DataFrame, ::DataArrays.DataArray{Int64,1}, ::Symbol) at /home/milan/.julia/DataFrames/src/dataframe/dataframe.jl:368
 in #select#17(::Array{Any,1}, ::Any, ::DataFrames.DataFrame) at /home/milan/.julia/DataFramesMeta/src/DataFramesMeta.jl:450
 [inlined code] from ./boot.jl:307
 in (::###8232#20{DataFrames.DataFrame})(::DataArrays.DataArray{Int64,1}) at /home/milan/.julia/DataFramesMeta/src/DataFramesMeta.jl:55
 in eval(::Module, ::Any) at ./boot.jl:243

julia> @select(df, :A, B=1)
3x2 DataFrames.DataFrame
│ Row │ A │ B │
┝━━━━━┿━━━┿━━━┥
│ 1   │ 1 │ 1 │
│ 2   │ 2 │ 1 │
│ 3   │ 3 │ 1 │

Would it make sense to always recycle the columns so that the new data frame has the same number of rows as the original? Other constructs could be recommended for aggregation operations.

Rename based_on()?

As noted in #48, based_on might not be the best name for this operation. It's hard to distinguish from groupby (and doesn't follow the same naming convention with regard to the underscore).

summarize/summarise (like dplyr) might be more explicit. But it's a bit surprising that this operation also allows returning as many rows as in the original data, which is completely different from summarizing.

I wonder whether it couldn't be merged with select, as in SQL. AFAICT select doesn't work on GroupedDataFrame currently, so there would be no conflict.

ERROR with Pkg.add: bad default revision 'HEAD'

julia> Pkg.add("DataFramesMeta")
fatal: bad default revision 'HEAD'
ERROR: failed process: Process(`git --git-dir=/home/diego/.julia/v0.4/.cache/DataFramesMeta log --all --format=%H`, ProcessExited(128)) [128]
 in pipeline_error at process.jl:555
 in readbytes at process.jl:515
 in prefetch at pkg/cache.jl:44
 in resolve at ./pkg/entry.jl:434
 in edit at pkg/entry.jl:26
 in anonymous at task.jl:447
 in sync_end at ./task.jl:413
 [inlined code] from task.jl:422
 in add at pkg/entry.jl:46
 in add at pkg/entry.jl:73
 in anonymous at pkg/dir.jl:31
 in cd at file.jl:22
 in cd at pkg/dir.jl:31
 in add at pkg.jl:23

rotating the meaning of x, :x, ^(:x)

Just wondering if it would be better to "rotate" the meaning of the expressions, like so:

expression	current	alternative
x	variable	column
:x	column	symbol
^(:x)	symbol	variable

I notice that most expressions involve just the column. External variable references are much less frequent. The alternative reduces some of the visual noise due to the colons.

Of course it may be water under the bridge, but I'm just curious what others think of this.

Renaming columns in DataFramesMeta with spaces

Cross-posting from https://discourse.julialang.org/t/renaming-columns-in-dataframesmeta-with-spaces/11101

I’ve created the following dataset:

d = DataFrame(x=randn(100))
d[Symbol("Y 2")] = randn(100)

I can rename the second column via

rename!(d, Symbol("Y 2") => :Y)

But I would like to rename it via a DataFramesMeta chain.

I can do

@linq d |> select(X=:x)

But

@linq d |> select(Y=Symbol("Y 2"))

drops the column Y 2 and adds the column Y with only one row with the value “Y 2”

Additionally,

@linq d |> select(X=Symbol("x"))

does something similar. Is there any way to get around this, or do I need to use rename!?

@transform and @select error on julia v0.6

hi there,

I tried to replicate the example from ?@transform and i see

julia> df = DataFrame(A = 1:3, B = [2, 1, 2])
3×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ 2 │
│ 2   │ 2 │ 1 │
│ 3   │ 3 │ 2 │

julia> @transform(df, a = 2 * :A, x = :A + :B)
ERROR: MethodError: no method matching transform(::DataFrames.DataFrame, ::##12#14, ::##13#16)
Closest candidates are:
  transform(::Union{Associative, DataFrames.AbstractDataFrame}; kwargs...) at /Users/74097/.julia/v0.6/DataFramesMeta/src/DataFramesMeta.jl:268

I just did Pkg.update before this session. julia v0.6. thanks

In the same vein, i get

julia> df = DataFrame(A = 1:3, B = [2, 1, 2])
3×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ 2 │
│ 2   │ 2 │ 1 │
│ 3   │ 3 │ 2 │

julia> @select(df, :B, :A)
3×2 DataFrames.DataFrame
│ Row │ B │ A │
├─────┼───┼───┤
│ 1   │ 2 │ 1 │
│ 2   │ 1 │ 2 │
│ 3   │ 2 │ 3 │

julia> @select(df, :B, x = :A .+ :B)
ERROR: MethodError: no method matching select(::DataFrames.DataFrame, ::DataArrays.DataArray{Int64,1}; B=[2, 1, 2])
Closest candidates are:
  select(::DataFrames.AbstractDataFrame, ::Any) at /Users/74097/.julia/v0.6/DataFramesMeta/src/DataFramesMeta.jl:214 got unsupported keyword argument "B"
  select(::Union{Associative, DataFrames.AbstractDataFrame}; kwargs...) at /Users/74097/.julia/v0.6/DataFramesMeta/src/DataFramesMeta.jl:431
Stacktrace:
 [1] (::###3798#8{DataFrames.DataFrame})(::DataArrays.DataArray{Int64,1}, ::DataArrays.DataArray{Int64,1}) at /Users/74097/.julia/v0.6/DataFramesMeta/src/DataFramesMeta.jl:55

even though that is exactly what appears in the docstring.

@transform and @where (maybe more) are broken on v0.7

I'm on Julia v0.7 and DataFramesMeta v3.0.0. It appears that @transform and @where are broken. I believe it could be related to some changes to broadcast, but most of the work in this package is done in macros which are very unfamiliar to me, so I'm struggling to debug it.

@transform is behaving badly: sometimes I get nonsense for the result, sometimes not. If I just extract the columns and do something like df[:c] = df[:a] .+ df[:b], I have no issues.

julia> dd = DataFrame(a=[1,1,1,1,1,2,2,2,2,2],b=rand(10),c=rand(10))
10×3 DataFrame
│ Row │ a │ b         │ c          │
├─────┼───┼───────────┼────────────┤
│ 1   │ 1 │ 0.940118  │ 0.00203923 │
│ 2   │ 1 │ 0.443475  │ 0.600234   │
│ 3   │ 1 │ 0.958618  │ 0.984405   │
│ 4   │ 1 │ 0.657766  │ 0.826958   │
│ 5   │ 1 │ 0.0482096 │ 0.249115   │
│ 6   │ 2 │ 0.903136  │ 0.27147    │
│ 7   │ 2 │ 0.319808  │ 0.697216   │
│ 8   │ 2 │ 0.0525784 │ 0.890392   │
│ 9   │ 2 │ 0.223741  │ 0.978436   │
│ 10  │ 2 │ 0.297486  │ 0.176859   │

julia> dd[:b] ./ dd[:c]
10-element Array{Float64,1}:
 461.0164605051064
   0.7388362802425429
   0.9738049508515227
   0.7954040678869911
   0.19352378037309087
   3.3268399796579144
   0.4586931077205082
   0.059050832041258175
   0.22867234620825236
   1.6820560963139088

julia> @transform(dd, d = (:b ./ :c))
10×4 DataFrame
│ Row │ a │ b         │ c          │ d         │
├─────┼───┼───────────┼────────────┼───────────┤
│ 1   │ 1 │ 0.940118  │ 0.00203923 │ 461.016   │
│ 2   │ 1 │ 0.443475  │ 0.600234   │ 0.738836  │
│ 3   │ 1 │ 0.958618  │ 0.984405   │ 0.973805  │
│ 4   │ 1 │ 0.657766  │ 0.826958   │ 0.795404  │
│ 5   │ 1 │ 0.0482096 │ 0.249115   │ 0.193524  │
│ 6   │ 2 │ 0.903136  │ 0.27147    │ 3.32684   │
│ 7   │ 2 │ 0.319808  │ 0.697216   │ 0.458693  │
│ 8   │ 2 │ 0.0525784 │ 0.890392   │ 0.0590508 │
│ 9   │ 2 │ 0.223741  │ 0.978436   │ 0.228672  │
│ 10  │ 2 │ 0.297486  │ 0.176859   │ 1.68206   │

julia> @transform(dd, d = (:b ./ :c))
10×4 DataFrame
│ Row │ a │ b         │ c          │ d          │
├─────┼───┼───────────┼────────────┼────────────┤
│ 1   │ 1 │ 0.940118  │ 0.00203923 │ 0.00216912 │
│ 2   │ 1 │ 0.443475  │ 0.600234   │ 1.35348    │
│ 3   │ 1 │ 0.958618  │ 0.984405   │ 1.0269     │
│ 4   │ 1 │ 0.657766  │ 0.826958   │ 1.25722    │
│ 5   │ 1 │ 0.0482096 │ 0.249115   │ 5.16732    │
│ 6   │ 2 │ 0.903136  │ 0.27147    │ 0.300586   │
│ 7   │ 2 │ 0.319808  │ 0.697216   │ 2.18011    │
│ 8   │ 2 │ 0.0525784 │ 0.890392   │ 16.9346    │
│ 9   │ 2 │ 0.223741  │ 0.978436   │ 4.37307    │
│ 10  │ 2 │ 0.297486  │ 0.176859   │ 0.59451    │

When using @where sometimes I get an error that really makes no sense given what I'm asking it to do. It seems like it is trying to mix the columns together or something, based on the error I'm seeing. If I do one subset at a time, it works.

julia> dd = DataFrame(a = [Date(1998),Date(1999),Date(2000)],b=rand(3))
3×2 DataFrame
│ Row │ a          │ b        │
├─────┼────────────┼──────────┤
│ 1   │ 1998-01-01 │ 0.100379 │
│ 2   │ 1999-01-01 │ 0.516064 │
│ 3   │ 2000-01-01 │ 0.541581 │

julia> @where(dd, :b .>= 0.2, :a .>= Date(1998))
ERROR: MethodError: no method matching isless(::Float64, ::Date)
Closest candidates are:
  isless(::Float64, ::Float64) at float.jl:457
  isless(::Missing, ::Any) at missing.jl:62
  isless(::AbstractFloat, ::AbstractFloat) at operators.jl:124
  ...
Stacktrace:
 [1] <(::Float64, ::Date) at .\operators.jl:227
 [2] <= at .\operators.jl:273 [inlined]
 [3] >= at .\operators.jl:297 [inlined]
 [4] (::getfield(, Symbol("##331#334")))(::Date, ::Float64, ::Date) at .\<missing>:0
 [5] broadcast_nonleaf(::Function, ::Base.Broadcast.VectorStyle, ::Type{Union{}}, ::Tuple{Base.OneTo{Int64}}, ::Array{Date,1}, ::Vararg{Any,N} where N) at .\broadcast.jl:649
 [6] broadcast(::Function, ::Base.Broadcast.VectorStyle, ::Type{Union{}}, ::Tuple{Base.OneTo{Int64}}, ::Array{Date,1}, ::Array{Float64,1}, ::Vararg{Any,N} where N) at .\broadcast.jl:626
 [7] broadcast at .\broadcast.jl:618 [inlined]
 [8] broadcast at .\broadcast.jl:615 [inlined]
 [9] (::getfield(, Symbol("###1843#333")))(::Array{Float64,1}, ::Array{Date,1}) at C:\Users\tbeason\.julia\v0.7\DataFramesMeta\src\DataFramesMeta.jl:70
 [10] (::getfield(, Symbol("##330#332")))(::DataFrame) at C:\Users\tbeason\.julia\v0.7\DataFramesMeta\src\DataFramesMeta.jl:72
 [11] where(::DataFrame, ::getfield(, Symbol("##330#332"))) at C:\Users\tbeason\.julia\v0.7\DataFramesMeta\src\DataFramesMeta.jl:194
 [12] top-level scope

output column of transform cannot be of a different type than input column

I attempted to use transform to convert a text-column to a float-column. However, this fails with

MethodError: Cannot `convert` an object of type Float64 to an object of type String
This may have arisen from a call to the constructor String(...),
since type constructors fall back to convert methods.

 in macro expansion at /home/sturm/.julia/v0.5/DataArrays/src/broadcast.jl:60 [inlined]
 in macro expansion at ./cartesian.jl:64 [inlined]
 in (::DataArrays.#_F_#204)(::DataArrays.DataArray{String,1}, ::DataArrays.DataArray{String,1}) at /home/sturm/.julia/v0.5/DataArrays/src/broadcast.jl:130
 in broadcast!(::Function, ::DataArrays.DataArray{String,1}, ::DataArrays.DataArray{String,1}) at /home/sturm/.julia/v0.5/DataArrays/src/broadcast.jl:229
 in databroadcast(::Function, ::DataArrays.DataArray{String,1}, ::Vararg{DataArrays.DataArray{String,1},N}) at /home/sturm/.julia/v0.5/DataArrays/src/broadcast.jl:235
 in (::###346#72{DataFrames.DataFrame})(::DataArrays.DataArray{String,1}) at /home/sturm/.julia/v0.5/DataFramesMeta/src/DataFramesMeta.jl:55

Here is a minimal working example:

function return_float(x)
    42.0 
end

function return_string(x)
    "42"
end

df = DataFrame(a=["1", "2"])

# works
@transform(df, new_col = return_string.(:a))

# fails
@transform(df, new_col = return_float.(:a))

Support non-vectorized syntax in @where

Since @where operates by row, I would find it both natural and practical to allow using non-vectorized operators, like == instead of .==. That would be particularly useful for in, which is currently not vectorized and might never be (JuliaLang/julia#5212). For example, it would be great to be able to write:

using DataFrames, DataFramesMeta, RDatasets
iris = dataset("datasets", "iris")
@where iris :Species in ("setosa", "virginica")

instead of as currently:

@where iris (:Species .== "setosa") | (:Species .== "virginica")

Do you think this would be technically possible? (If so, this idea could also be applied to other macros.)
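For comparison, a row-wise membership test can be written as a broadcast over plain vectors. This is only a sketch, assuming Julia 1.0 or later (where Ref shields a collection from broadcasting); the species vector is made-up sample data standing in for the :Species column, and nothing here is DataFramesMeta's own API.

```julia
# Hypothetical sample data standing in for the :Species column.
species = ["setosa", "versicolor", "virginica", "setosa"]

# Ref(...) keeps the tuple from being broadcast over, so `in` is
# applied once per element of `species`.
wanted = ("setosa", "virginica")
mask = in.(species, Ref(wanted))

# The Boolean mask can then index the vector (or the rows of a table).
selected = species[mask]
```

The same mask could be passed wherever a Boolean row selector is accepted, which avoids chaining several .== comparisons.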

Publish new version to METADATA?

@byrow doesn't seem to be included in the version installed using Pkg.add

regression with `@byrow`

On current master

myd = DataFrame(x=zeros(10), y=ones(10))
@byrow myd begin
    println(:x)
end

gives me the error "ERROR: myd not defined"

On commit e84ac6c it works as expected.

Ergonomic ways of recoding data

This issue parallels JuliaData/DataFrames.jl#369.

@byrow! may be close to what @nalimilan is after in the issue above. Here is a recoding example:

n = nrow(df)
df[:x] = Array(Int, n)
df[:y] = NullableArray(Float64, n)
@byrow! df begin
    :x = :a > 1 ? 4 : 9
    :y = :a * :b
end

The main irritation with the code above is allocating new columns. That could be handled by extending @transform or making a new macro that does something similar:

@transform(df, x = Array{Int}, y = NullableArray{Float64})

Another option is to add a feature to @byrow! that does this:

@byrow! df let x = Array{Int},
               y = NullableArray{Float64}
    :x = :a > 1 ? 4 : 9
    :y = :a * :b
end

Or:

@byrow! df begin
    @newcol x = Array{Int}
    @newcol y = NullableArray{Float64}
    :x = :a > 1 ? 4 : 9
    :y = :a * :b
end

Any opinions on these or other approaches? Any other features that byrow! needs to better recode data?

@where: how to select multiple rows?

For example, given the data frame a = DataFrames.DataFrame
│ Row │ variable │ value    │ Sites    │
├─────┼──────────┼──────────┼──────────┤
│ 1   │ MFB_PM25 │ 1.7166   │ "x1142A" │
│ 2   │ MFB_PM25 │ -13.8486 │ "x1143A" │
│ 3   │ MFB_PM25 │ 11.2938  │ "x1144A" │
│ 4   │ MFB_PM25 │ -8.32452 │ "x1145A" │

I want to select the rows where Sites is x1143A or x1144A.
How do I write this?

@where(a, :Sites .== ???

Rebinding external vars inside of `@with`

EDIT: Nevermind, this turns out not really to be an issue. See comment below. I'm leaving the following original post unedited for posterity.

This issue arises while trying to rebind an external variable inside of an @with call:

julia> using DataFrames, DataFramesMeta

julia> df = DataFrame(A = 1:3, B = [2, 1, 2])
3x2 DataFrame
| Row | A | B |
|-----|---|---|
| 1   | 1 | 2 |
| 2   | 2 | 1 |
| 3   | 3 | 2 |

julia> y = 1
1

julia> @with df for i in 1:3; :A[i] += y end

julia> df
3x2 DataFrame
| Row | A | B |
|-----|---|---|
| 1   | 2 | 2 |
| 2   | 3 | 1 |
| 3   | 4 | 2 |

julia> @with df for i in 1:3; y += :A[i] end
ERROR: y not defined
 in ##24126 at /Users/David/.julia/v0.3/DataFramesMeta/src/DataFramesMeta.jl:1

julia> @with df y += :A[1]
ERROR: y not defined
 in ##24128 at /Users/David/.julia/v0.3/DataFramesMeta/src/DataFramesMeta.jl:47

I can't remember the particular feature of Julia's scoping logic that provides this behavior. It seems that inside of a local scope, an external variable can be referenced so long as it is not rebound.
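The behavior described here can be reproduced without any data frame. The sketch below is an illustration in plain Julia, not the code that @with generates; the function names are made up. It assumes the macro body ends up inside a function, where assigning to a name makes that name local unless it is declared global.

```julia
y = 1  # a global variable

# Reading a global from inside a function body works.
read_only() = y + 1

# Assigning to `y` makes it a new local for the whole function body, so the
# right-hand side of `y += 1` reads a local that has no value yet.
function rebind()
    y += 1
end

# Declaring the name `global` lets the function rebind the outer variable.
function rebind_global()
    global y
    y += 1
end

read_only()        # fine, reads the global
rebind_global()    # fine, updates the global y
try
    rebind()
catch err
    println(typeof(err))  # an UndefVarError is expected here
end
```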

Of course, defining a function through which to pass the external value works fine:

julia> function f(x) @with df x += :A[1] end
f (generic function with 2 methods)

julia> f(y)
6

However, in the interest of making certain kinds of functionality more generally accessible in the REPL (think demographics of users who aren't used to wrapping things in functions), I wonder if we should provide some sort of tool (macro) for doing this more efficiently. Something like

@relay y @with df y += :A[1]

where @relay would generate a function, replace all instances of y in the body @with df y += :A[1] with a variable, and then call that function on y. Incorporating functionality for multiple external variables might be tricky, though at this point of playing around I believe an efficient solution is possible. I'm happy to work on such a feature.

I don't know if this proposal approaches the realm of too much magic, but the core idea seems pretty straightforward. I would also provide thorough documentation on the internals, so lay users would be able to troubleshoot more effectively and understand their code better in general.

Or, we could just tell folks to wrap their code in a function.

based_on doesn't seem to work with linq

when I try to use based_on within a linq call, I get an undef_var error. here is a MWE:

using DataFrames, DataFramesMeta
df = DataFrame(id = [1, 1, 2, 2, 3, 3], vals = randn(6))
@linq df |> groupby(:id) |> based_on(totval = sum(vals))

for me, the result is a red: "ERROR: UndefVarError: based_on not defined"

here is the output of Pkg.status("DataFramesMeta"):

julia> Pkg.status("DataFramesMeta")
DataFramesMeta 0.1.1

any idea why this doesn't work?

spelling error in one of the examples

Hello,

some of the examples give an error on julia 0.4.5

The following code

using DataArrays, DataFrames
using DataFramesMeta

df = DataFrame(x = 1:3, y = [2, 1, 2])

transform(df, newCol = cos(df[:x]), anotherCol = df[:x]^2 + 3*df[:x] + 4)
@transform(df, newCol = cos(:x), anotherCol = :x^2 + 3*:x + 4)

produces

ERROR: MethodError: * has no method matching *(::Array{Int64,1}, ::Array{Int64,1})

replacing ^ by .^ fixes the problem, ie:

transform(df, newCol = cos(df[:x]), anotherCol = df[:x].^2 + 3*df[:x] + 4)
@transform(df, newCol = cos(:x), anotherCol = :x.^2 + 3*:x + 4)

Best regards,
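The fix reported above is an instance of element-wise (dot) syntax. As a small sketch on a plain vector, assuming Julia 1.0 or later, where functions such as cos also need an explicit dot to act on each element:

```julia
x = [1, 2, 3]

# cos.(x) applies cos to every element of x.
new_col = cos.(x)

# Every operator is dotted so the whole expression is evaluated element by element.
another_col = x .^ 2 .+ 3 .* x .+ 4
```

Writing x^2 without the dot asks for a matrix power, which is not defined for a vector; that is why the undotted form fails.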

Heads up: IndexedTables version in the making

Not an issue, but as I'm making a version of this package for IndexedTables, I thought it was important to notify this repository.

I prefer to keep the implementations separate as IndexedTables are very different from DataFrames under the hood (complete type information about columns, fast row iteration), which makes the preferred implementations very different for the two cases.

I'm acknowledging this package in the README as most of the ideas come from here (I think historically this package was the first use of metaprogramming in Julia to simplify table operations, which I believe is a brilliant idea).

Method definition overwritten warning in Julia v0.5

On version 0.1.3, I get

using DataFramesMeta
WARNING: Method definition nrow(DataFramesMeta.AbstractCompositeDataFrame) in module 
DataFramesMeta at C:\Users\rcnlee\julia\v0.5\DataFramesMeta\src\compositedataframe.jl:108 
overwritten at C:\Users\rcnlee\.julia\v0.5\DataFramesMeta\src\compositedataframe.jl:109

There seem to be two definitions for nrow(cdf::AbstractCompositeDataFrame).

New macro?

Here's an operation I perform often:

1. Split data on one column into groups
2. Assess predicate on each of those groups
3. Create a new DataFrame that only contains data for the groups for which the predicate returned true

Maybe something like the following?

filtergroups(data, [:id], df -> size(df, 1) == expected_rows)

filtergroups(data, row -> row[:value] == 42)
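The three steps can be sketched in plain Julia over a vector of named tuples. Everything below is hypothetical: the filtergroups name is borrowed from the proposal, but the argument order (predicate first, then rows, then a key function) and the use of named tuples instead of a DataFrame are assumptions made only for illustration.

```julia
# Hypothetical helper: keep every row that belongs to a group for which
# `predicate` returns true. `key` maps a row to its group identifier.
function filtergroups(predicate, rows, key)
    groups = Dict{Any, Vector{eltype(rows)}}()
    order = Any[]                      # remember first-seen order of the groups
    for row in rows
        k = key(row)
        if !haskey(groups, k)
            groups[k] = eltype(rows)[]
            push!(order, k)
        end
        push!(groups[k], row)
    end
    kept = eltype(rows)[]
    for k in order
        # the predicate sees the whole group, not a single row
        predicate(groups[k]) && append!(kept, groups[k])
    end
    return kept
end

rows = [(id = 1, value = 42), (id = 1, value = 7), (id = 2, value = 42)]

# Keep only the groups that contain exactly two rows.
kept = filtergroups(g -> length(g) == 2, rows, r -> r.id)
```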

@where cannot use defined symbol variable

No issue when
filtered_data=@where(data, :x .== 2)

but if we define
indicator=:x
filtered_data=@where(data, indicator .== 2)

it returns an error. This makes it difficult to go through a loop using different "indicators".

@newcol in @byrow! not defined

Hi. Sorry to be a bother. I'm getting an error saying "UndefVarError @newCol not defined", when running the block in the readme that includes @byrow! and @newCol as displayed below.

df = DataFrame(A = 1:3, B = [2, 1, 2])
df2 = @byrow! df begin
    @newcol colX::Array{Float64}
    @newcol colY::DataArray{Int}
    :colX = :B == 2 ? pi * :A : :B
     if :A > 1 
        :colY = :A * :B
    end
end

I'm trying to apply this to my own dataframe to bring separate time and date columns into a single column that contains DateTime elements. I was thinking this would work:

dateForm = Dates.DateFormat("d-u-y H")
newMachineData = @byrow! MachineData begin
@newCol DT::Array{DateTime}
:DT = DateTime(string(:DATE, " ", :HOUR), dateForm)
end

However, the error with @newCol is raised. It's likely I'm missing something simple though. I am new to Julia

Feature: use an existing list of names for new variables

In DataFrames you can write the following:

names = [:newName1, :newName2, :newName3]

for name in names 
    df[name] = 5
end

I believe that DataFramesMeta's version of this does not make it possible to do such a thing with the @transform macro. With DataFrames, you can store names as a list of symbols, but there is no equivalent object you can use to store names when you wish to use the @transform(x = :oldVar) command.

Is this a missing functionality?

Tests fail on v0.5

There are multiple failures:

Error During Test
  Test threw an exception of type MethodError
  Expression: size(df[1:2,:]) == (2,2)
  MethodError: broadcast(::DataFramesMeta.##15#16{TestCompositeDataFrames.CompositeDF##349}) is ambiguous. Candidates:
    broadcast(f, x::Number...) at broadcast.jl:16
    broadcast(f::Function, As::DataArrays.PooledDataArray...) at /Users/ranjan/.julia/v0.5/DataArrays/src/broadcast.jl:318
   in nrow at /Users/ranjan/.julia/v0.5/DataFramesMeta/src/compositedataframe.jl:109 [inlined]
   in size(::TestCompositeDataFrames.CompositeDF##349) at /Users/ranjan/.julia/v0.5/DataFrames/src/abstractdataframe/abstractdataframe.jl:214
   in include_from_node1(::String) at ./loading.jl:426
   in macro expansion; at /Users/ranjan/.julia/v0.5/DataFramesMeta/test/runtests.jl:15 [inlined]
   in anonymous at ./<missing>:?
   in include_from_node1(::String) at ./loading.jl:426
   in process_options(::Base.JLOptions) at ./client.jl:262
   in _start() at ./client.jl:318
    FAILED: compositedataframes.jl

and

Test Failed
  Expression: xlinq2[[:meanX,:meanY]] == xlinq[[:meanX,:meanY]]
   Evaluated: 3×2 DataFrames.DataFrame
│ Row │ meanX    │ meanY   │
├─────┼──────────┼─────────┤
│ 1   │ 0.537542 │ 5.37542 │
│ 2   │ 0.521357 │ 5.21357 │
│ 3   │ 0.494239 │ 4.94239 │ == 3×2 DataFrames.DataFrame
│ Row │ meanX    │ meanY   │
├─────┼──────────┼─────────┤
│ 1   │ 0.521357 │ 5.21357 │
│ 2   │ 0.494239 │ 4.94239 │
│ 3   │ 0.537542 │ 5.37542 │
    FAILED: linqmacro.jl
ERROR: LoadError: "Tests failed"
 in include_from_node1(::String) at ./loading.jl:426
 in process_options(::Base.JLOptions) at ./client.jl:262
 in _start() at ./client.jl:318
while loading /Users/ranjan/.julia/v0.5/DataFramesMeta/test/runtests.jl, in expression starting on line 27

How to order descending in `@orderby`?

Tried adding a minus sign before the column symbol, but got the error msg

MethodError: no method matching -(::String)
...

Is there any workaround? Thanks.
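Negation only works for numeric columns, which is why -(::String) has no method. One generic way to get a descending order, shown here on plain vectors, is to compute a row permutation with sortperm and rev = true. This is a sketch with made-up data; whether @orderby accepts such a permutation directly is not something the report establishes.

```julia
# Hypothetical columns: a string key and a numeric payload.
labels = ["beta", "alpha", "gamma"]
scores = [2, 1, 3]

# Permutation that sorts `labels` in descending order.
perm = sortperm(labels, rev = true)

# Applying the same permutation to every column reorders whole rows.
labels_desc = labels[perm]
scores_desc = scores[perm]
```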

transform() vs select()

I wonder what's the fundamental difference between transform and select. The most visible difference seems to be that the former always returns as many rows as the input, contrary to select:

julia> df = DataFrame(A=1:3)
3x1 DataFrames.DataFrame
│ Row │ A │
┝━━━━━┿━━━┥
│ 1   │ 1 │
│ 2   │ 2 │
│ 3   │ 3 │

julia> @select(df, A=1)
1x1 DataFrames.DataFrame
│ Row │ A │
┝━━━━━┿━━━┥
│ 1   │ 1 │

julia> @transform(df, A=1)
3x1 DataFrames.DataFrame
│ Row │ A │
┝━━━━━┿━━━┥
│ 1   │ 1 │
│ 2   │ 1 │
│ 3   │ 1 │

Are there any other differences? The documentation isn't very explicit about them.

Do we really need two different macros here? I find it a bit confusing, and the names don't really reflect the differences between them. AFAICT, dplyr only offers select and mutate (equivalent of transform) because the former only allows selecting columns, and not transforming them. It sounds like a powerful SQL-like @select (what we have now) is a better solution.

Macro Conflict with Lazy.jl

I'm trying to reproduce the examples in the README. However, if the using Lazy statement is executed before using DataArrays, DataFrames, DataFramesMeta, calling @where(df, :x .> 1) raises an error: ERROR: wrong number of arguments

I am using Julia v0.4, on Ubuntu 14.04 x64 desktop version.

Use of qualified function names in @linq

Following https://stackoverflow.com/questions/50547688/julia-using-groupby-with-dataframesmeta-and-lazy.

Would it be possible to add support for passing qualified function names to @linq macro?

@where: wrong number of arguments

Hi,

why?

@where((df, :x .> x) & (df, :y .== 3))
ERROR: wrong number of arguments

NB:

df = DataFrame(x = 1:3, y = [2, 1, 2])
x = [2, 1, 0]

p.s.

versioninfo()
Julia Version 0.4.1
Commit cbe1bee (2015-11-08 10:33 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: liblapack.so.3
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

@> begin df versus @byrow! df begin

This is a small gripe with DataFramesMeta

When using the linq macros, you write df = @> begin df

When using other DataFramesMeta macros, you write the begin first with df = @byrow! df begin

Any interest in changing the linq macro to be @> df begin?

Error using function chaining and Lazy.jl

When I try to do something very similar to the Lazy.jl example in the docs, I get an error informing me that both Lazy.jl and DataFramesMeta.jl have macros called "with", and that, as a result, uses of with must be "qualified". Here is the code block I am trying to run. For reference, "ndic" is a dataframe, and dfsplit() is a helper function that more or less calls split() in a particular way.

using DataFrames
using Lazy
using DataFramesMeta

...

pools = @> begin
ndic
@where(!isna(:ProducedPools))
@by(:FileNo, df -> dfsplit(df, :ProducedPools, ','))
@select(:FileNo, poolorder = :splitorder, pool = :splitval)
@where(:pool .!= "Confidential")
end

Here is the error I get:

julia> include("CODE.jl")
WARNING: both DataFramesMeta and Lazy export "@with"; uses of it in module Main must be qualified
ERROR: LoadError: UndefVarError: @with not defined
in include at /usr/local/Cellar/julia/0.4.3/lib/julia/sys.dylib
in include_from_node1 at /usr/local/Cellar/julia/0.4.3/lib/julia/sys.dylib
while loading /PATH_TO_CODE/CODE.jl, in expression starting on line 36

Support `_DF` (or something) for full DataFrame access

We used to have this "hidden" feature where you could use _DF to access the whole DataFrame (or SubDataFrame in a group operation). It is occasionally nice. I think it got coded out in #73.

based_on not working in @linq macro

I've been playing around with this package, and it's already made some of my real-world work a lot easier. I came across this error just now--based_on does not seem to work when chained together with other operations using @linq.

With R and dplyr, I can do the following:

> iris %>%  group_by(Species) %>%  summarise(x=mean(Sepal.Width))

but when I try the (I think?) equivalent in Julia, I get an error:

julia> @linq iris |> groupby(:Species) |> based_on(x = mean(SepalWidth))
ERROR: UndefVarError: based_on not defined

Is this something that should work? (Or am I doing something incorrectly?)

Multiple statements in @where broken in 0.1.3

A method error occurs when attempting to use more than one filter statement in @where with the most recent release.

Version info:

Julia 0.6.0
DataFramesMeta 0.1.3

Code to reproduce error:

Pkg.add("DataFramesMeta", v"0.1.3")
using DataFrames, DataFramesMeta

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,2,2])

@where(df, :name .== "John", :age .== 23)

Error log:

MethodError: &(::DataArrays.DataArray{Bool,1}, ::DataArrays.DataArray{Bool,1}) is ambiguous. Candidates:
  &(a::DataArrays.DataArray{Bool,N} where N, b::Union{AbstractArray{Bool,N} where N, Bool}) in DataArrays at /home/juser/.julia/v0.6/DataArrays/src/operators.jl:390
  &(b::Union{AbstractArray{Bool,N} where N, Bool}, a::DataArrays.DataArray{Bool,N} where N) in DataArrays at /home/juser/.julia/v0.6/DataArrays/src/operators.jl:390
Possible fix, define
  &(::DataArrays.DataArray{Bool,N} where N, ::DataArrays.DataArray{Bool,N} where N)

where() takes only one expression

Suppose I want to filter my dataframe by multiple constraints (which is not rare):

@where(df, :x1 .== 5, :x2 .== 3)

but now it gives back: ERROR: wrong number of arguments. Can it be enhanced to take multiple constraints, as dplyr in R does? :D
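Several constraints amount to an element-wise AND of the individual Boolean masks. The sketch below shows that combination on plain vectors with made-up values; it assumes Julia 1.0 or later for the .& operator and is not a statement about how @where is implemented.

```julia
# Hypothetical columns.
x1 = [5, 5, 2, 5]
x2 = [3, 1, 3, 3]

# One mask per constraint, combined element by element.
mask = (x1 .== 5) .& (x2 .== 3)

# Indices of the rows that satisfy every constraint.
rows = findall(mask)
```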

How to mitigate deprecation warning when using orderby?

Hi there,

I'm using Julia 0.6 and DataFramesMeta 0.3.0.
The following code produces a deprecation warning.
I'm not sure what I need to change to avoid the warning. Any ideas?

using DataFrames
using DataFramesMeta

tbl = DataFrame(a = [22,11,33])
@linq tbl |> orderby(:a)

Functional forms and argument ordering

Many of the LINQ-like operations have a functional form where a function is passed as an argument, like:

where(g, d -> mean(d[:a]) > 0)

This form is quite common with LINQ libraries. The issue is argument ordering. Base Julia functions like map all have the function passed as the first argument. The LINQ standard uses the function as the second argument, and passing the "operations" as the second argument is more consistent with the macro form: @where(g, mean(:a) > 0).

Our choices are: (1) follow Julia base, (2) follow the LINQ standard, or (3) define both.
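Option (3) can be sketched with two methods that differ only in argument order. The where_rows name and the vector-of-named-tuples input are hypothetical, chosen so the example runs without any package. One practical point in favor of also supporting the function-first form is that it enables Julia's do-block syntax.

```julia
# Function-first form, as in Base (map, filter).
where_rows(f::Function, rows::AbstractVector) = filter(f, rows)

# Function-second form, as in LINQ; it forwards to the method above.
where_rows(rows::AbstractVector, f::Function) = where_rows(f, rows)

rows = [(a = 1,), (a = -2,), (a = 3,)]

r1 = where_rows(r -> r.a > 0, rows)   # Base-style ordering
r2 = where_rows(rows, r -> r.a > 0)   # LINQ-style ordering

# A do block passes its body as the first argument, so it relies on
# the function-first method.
r3 = where_rows(rows) do r
    r.a > 0
end
```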

Make escaping work differently in `@byrow!`

I was trying to replicate Stata's rowmean function using @byrow!.

In stata this is

local vars x1 x2 x3 
gen t = rowmean(`vars')

Using byrow you would do

vars = [:x1, :x2, :x3]
df[:works] = 0 
df[:not_works] = 0
@byrow! df begin
    :works = mean([:x1, :x2, :x3])
    :not_works = mean(_I_(vars))
end

The reason this doesn't work is that _I_(vars) simply returns df[vars]. If vars names more than one column, it returns a data frame, and mean is not defined for a data frame.

The obvious solution would be to have _I_ return a DataFrameRow in this context. However, mean still isn't defined for a DataFrameRow. Where do we stand on overloading array functions to work on DataFrameRows?

Unless all of this is tangential to an obvious solution to my original problem; maybe there is a better way to do rowmean than I know about.
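For reference, a row mean over several columns can be computed without any data frame machinery by iterating over the columns in lockstep. This is a sketch on plain vectors with made-up values; it uses the Statistics standard library (Julia 0.7 and later) and does not address the _I_ escaping question itself.

```julia
using Statistics

# Hypothetical columns x1, x2, x3.
x1 = [1.0, 2.0, 3.0]
x2 = [4.0, 5.0, 6.0]
x3 = [7.0, 8.0, 9.0]

# zip yields one tuple per row; mean accepts any iterable of numbers.
row_means = [mean(row) for row in zip(x1, x2, x3)]
```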

`@within`

@within (or @transform) could be developed. Here's a hypothetical
example:

@within df begin
    x = 4
    :x = :colA + :colB + x # a new column
    :z = 2 * :x     # can we create a new column based on another newly created column?
    rm(:y)          # should we / could we allow deletion of columns?
end

Another option might be to extend DataFrame() to take a DataFrame
and keyword arguments to create a new DataFrame as:

df1 = @with df DataFrame(df, 
    x = :colA + :colB,  # a new column
    z = 2 * :x)      # can we create a new column based on another newly created column?

This goes well with using DataFrame with @with to create a new
DataFrame based on another DataFrame. This already works:

df1 = @with df DataFrame(
    x = :colA + :colB,
    z = 2 * :colA)

Generic discussions about using metaprogramming with DataFrames

Here are several issues that discuss metaprogramming and/or different
approaches to query and manipulate DataFrames:

Please add additional comments here on better approaches to querying DataFrames.

Creating new macros

Hi there,

I am creating some new macros for use with the @linq macro. I can create the macros, but I can't get @linq to recognize them. Simple example below. Any ideas?

using DataFrames
using DataFramesMeta

# simple renaming of where macro
macro selectrows(d, args...)
    esc(DataFramesMeta.where_helper(d, args...))
end

function linq(::DataFramesMeta.SymbolParameter{:selectrows}, d, args...)
    DataFramesMeta.where_helper(d, args...)
end

df = DataFrame(x = 1:3, y = [2, 1, 2]);
@linq df |> @selectrows(:x .> 1)    # OK
@linq df |> selectrows(:x .> 1)     # Error

The error is:

ERROR: MethodError: no method matching isless(::Int64, ::Symbol)
Closest candidates are:
  isless(::Symbol, ::Symbol) at strings/basic.jl:136
  isless(::DataArrays.NAtype, ::Any) at /home/jock/.julia/v0.6/DataArrays/src/operators.jl:383
  isless(::Real, ::AbstractFloat) at operators.jl:97
  ...
Stacktrace:
 [1] (::##97#98)(::Symbol) at ./<missing>:0
 [2] broadcast(::Function, ::Symbol) at ./broadcast.jl:434

@where can't deal with NAs

@where(datos, :x_13 .== "#0_PHY") throws the following error because the column has missing values:

LoadError: DataArrays.NAException("cannot index an array with a DataArray containing NA values")
while loading In[8], in expression starting on line 1

 in to_index at /home/dzea/.julia/v0.4/DataArrays/src/indexing.jl:76
 in getindex at /home/dzea/.julia/v0.4/DataArrays/src/indexing.jl:173
 in getindex at /home/dzea/.julia/v0.4/DataFrames/src/dataframe/dataframe.jl:281
 in where at /home/dzea/.julia/v0.4/DataFramesMeta/src/DataFramesMeta.jl:164

datos[:x_13] .== "#0_PHY" works and returns a DataArrays.DataArray{Bool,1}, but datos[datos[:x_13] .== "#0_PHY",:] throws the same error.
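One possible workaround is to turn the missing comparison results into `false` before indexing. The sketch below uses current Julia, where `missing` plays the role of the DataArrays `NA` in the report; that substitution is an assumption about the reader's setup, and the vector `x_13` is made-up data standing in for the column.

```julia
# Hypothetical stand-in for the column datos[:x_13]; the values are invented.
x_13 = ["#0_PHY", missing, "#1_PHY", "#0_PHY"]

# Comparing against a missing entry yields missing, which cannot be used as an index.
raw = x_13 .== "#0_PHY"

# coalesce replaces each missing with false, leaving a plain Boolean mask.
mask = coalesce.(raw, false)

# Positions of the rows that match.
rows = findall(mask)
```

The same mask could then be used to index the rows of the data frame, so rows with a missing value are simply treated as non-matching.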

Inexact

I seem to be getting an error and wanted to see if I was using DataFramesMeta incorrectly. Here is the set-up

julia> a
12-element Array{Float64,1}:
  3.1
  6.8
  9.0
  9.0
 11.3
 16.2
  8.7
  9.0
 10.1
 12.1
 18.7
 23.1

julia> b
12-element Array{Int64,1}:
 0
 1
 0
 0
 1
 0
 0
 0
 1
 1
 0
 1

I wanted to keep only the rows with unique values of a. This would result in the following

julia> unique(df[:a])
10-element DataArrays.DataArray{Float64,1}:
  3.1
  6.8
  9.0
 11.3
 16.2
  8.7
 10.1
 12.1
 18.7
 23.1

It seemed that @where would be appropriate, but I got an InexactError().

julia> @where(df, unique(df[:a]))
ERROR: InexactError()
 in copy!(::Base.LinearFast, ::Array{Int64,1}, ::Base.LinearFast, ::Array{Float64,1}) at ./abstractarray.jl:559
 in getindex(::DataFrames.DataFrame, ::DataArrays.DataArray{Float64,1}) at /Users/alexhallam/.julia/v0.5/DataFrames/src/dataframe/dataframe.jl:234
 in where(::DataFrames.DataFrame, ::##34#35) at /Users/alexhallam/.julia/v0.5/DataFramesMeta/src/DataFramesMeta.jl:164

How can I filter -- in the dplyr sense of the word -- by the unique values in a?
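The stack trace suggests what goes wrong: `unique(df[:a])` returns the distinct values (floats), and those are then used as row indices, which fails when a value such as 3.1 cannot be converted to an integer. A row filter needs positions or a Boolean mask instead. A minimal sketch in plain Julia, using the vectors `a` and `b` from above and a hypothetical name `first_rows`:

```julia
a = [3.1, 6.8, 9.0, 9.0, 11.3, 16.2, 8.7, 9.0, 10.1, 12.1, 18.7, 23.1]
b = [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1]

# unique(f, itr) keeps the first element of itr for each distinct value of f,
# so this returns the row index of the first occurrence of every value in a.
first_rows = unique(i -> a[i], eachindex(a))

# Select the same rows from both columns.
a_unique = a[first_rows]
b_unique = b[first_rows]
```

On a data frame, recent DataFrames versions should give the equivalent result with `unique(df, :a)`; that call is mentioned here as an assumption about the installed version, not something taken from the report.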

Better interplay with Lazy

DataFramesMeta and Lazy work very well together, and personally I like the @> syntax the best out of all the chaining.

However the two packages export conflicting names, which is frustrating.

julia> using Lazy, DataFramesMeta, DataFrames

julia> df = DataFrame(rand(10,10));

julia> @with df begin
       y = :x1
       end
WARNING: both DataFramesMeta and Lazy export "@with"; uses of it in module Main must be qualified
ERROR: UndefVarError: @with not defined

and

 groupby(df, :x1)
WARNING: both DataFrames and Lazy export "groupby"; uses of it in module Main must be qualified

What can be done to make sure these conflicts don't pop up?
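Until the packages stop exporting the same names, one way around the clash is to import only the names you need from one package and qualify the rest. The sketch below illustrates the mechanism with two toy modules, `PkgA` and `PkgB`, which are hypothetical stand-ins and not the real packages:

```julia
# Two toy modules that both export a function called groupby.
module PkgA
export groupby
groupby(x) = "from PkgA"
end

module PkgB
export groupby
groupby(x) = "from PkgB"
end

# Bring in only the name we want unqualified, and from one module only.
using .PkgA: groupby

unqualified = groupby([1, 2])        # resolves to PkgA without a warning
qualified   = PkgB.groupby([1, 2])   # the other one stays reachable by its full name
```

Applied to the case above, this would mean something like `using Lazy: @>` together with `using DataFramesMeta`, so that `@with` and `groupby` are only ever taken from the data frame packages.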

Feature requests: Adding suffixes to array of existing variables

Cross-posted from Discourse here
One thing that is necessary when working with survey data is being able to create new variables by standardizing or capping existing variables. One workflow might look like this:

local variables_to_cap price mpg headroom trunk weight 

foreach variable in `variables_to_cap' {
    gen `variable'_std = `variable' // generate a new variable but with a suffix added
    sum `variable'_std, detail
    replace `variable'_std = r(p99) if `variable'_std > r(p99) & !missing(`variable'_std)
}

Above, the suffix _std indicates to the user that we are working with the standardized variable. The benefit of the above code is that you can add the suffix extremely easily, and use the exact same code to generate many standardized variables at once (Ideally, this would be done using a function.)

Using DataFrames we can’t do this, because an existing variable is referred to with a symbol, while a newly created variable is named with a bare identifier:

@transform(df, newCol = :oldCol) # Command(?) = Symbol  works
@transform(df, symbol("newCol") = :oldCol) # error
@transform(df, :newCol = :oldCol) # error

My ideal functionality would be something along the lines of what Stata does. You have a list of variables you wish to transform

function CapVariable(x::Array)
    ...
end


variables_to_cap = [ ... ] # An array of symbols?
for variable in variables_to_cap
    df = @transform(df, suffix(variable, "_std") = CapVariable(variable))
end

How would one go about implementing this functionality?
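Outside of the macros, a suffixed column name can be built at run time with `Symbol(name, "_std")`. Here is a rough sketch of the loop in plain Julia. The `Dict` named `cols` is a hypothetical stand-in for a data frame, `cap_at_p99` is a guess at what `CapVariable` is meant to do, the data are invented, and missing values are not handled.

```julia
using Statistics  # standard library; provides quantile

# Cap values above the 99th percentile (an assumed definition of CapVariable).
cap_at_p99(x) = min.(x, quantile(x, 0.99))

# Hypothetical stand-in for a data frame: column name => values.
cols = Dict{Symbol,Vector{Float64}}(
    :price => collect(1.0:100.0),
    :mpg   => [10.0, 20.0, 30.0, 1000.0],
)

variables_to_cap = [:price, :mpg]
for v in variables_to_cap
    new_name = Symbol(v, "_std")          # e.g. :price becomes :price_std
    cols[new_name] = cap_at_p99(cols[v])  # original column is left unchanged
end
```

With a real data frame the assignment inside the loop would presumably become an indexed assignment on the frame, since indexing accepts a symbol computed at run time.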

`NullableArrays` branch

Hello all. I'm currently using the new NullableArrays version of DataFrames, and have sort of been putting off making my own fork of DataFramesMeta for NullableArrays. For me personally, DataFramesMeta is by far the best solution for querying dataframes in most situations. Before I start work on a fork, has any work been done on this yet? Is there any roadmap for this? I don't see a branch or issue, so I assume the answer to both questions is no.

Cannot use transform to square the values of a column

Hello I have a float column in my DataFrame and I can't use "*" to multiply 2 columns together. I can however use "+".

leg[:gi]
DataArrays.DataArray{Float64,1}:

This fails.

leg = @transform(leg, gi_sq = :gi * :gi)
MethodError: * has no method matching *(::Array{Float64,1}, ::Array{Float64,1})

This works

leg = @transform(leg, gi_sq = :gi + :gi)

@based_on not working?

It seems as if DataFrames.based_on has been deprecated? I'm very very new to Julia, so I might just be confused. The error message I get is based_on is not defined. Here is an example:

using RDatasets
iris = dataset("datasets", "iris")

@as _ begin
  iris
  groupby(_, :Species)
  @based_on(_, Mean_Petal_Length = mean(:Petal_Length))
end

new columns via transform and missing values

I think perhaps transform from DataFramesMeta has some problems when the new column contains missing. When I do an operation involving a column of type Union{<:Real,Missing} the new column will be of type Any rather than Union{<:Real,Missing}

julia> dd = DataFrame(a=[1,2,3],b=[1,2,missing])
3×2 DataFrames.DataFrame
│ Row │ a │ b       │
├─────┼───┼─────────┤
│ 1   │ 1 │ 1       │
│ 2   │ 2 │ 2       │
│ 3   │ 3 │ missing │

julia> eltypes(dd)
2-element Array{Type,1}:
 Int64
 Union{Int64, Missings.Missing}

julia> dd=@transform(dd,aa=2*:a,bb=2*:b)
3×4 DataFrames.DataFrame
│ Row │ a │ b       │ aa │ bb      │
├─────┼───┼─────────┼────┼─────────┤
│ 1   │ 1 │ 1       │ 2  │ 2       │
│ 2   │ 2 │ 2       │ 4  │ 4       │
│ 3   │ 3 │ missing │ 6  │ missing │

julia> eltypes(dd)
4-element Array{Type,1}:
 Int64
 Union{Int64, Missings.Missing}
 Int64
 Any
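For comparison, the same arithmetic on plain vectors in current Julia keeps the narrow element type, which may help narrow down where the widening to `Any` happens. This is only a sketch on bare vectors, not a test of the macro itself:

```julia
a = [1, 2, 3]
b = [1, 2, missing]

aa = 2 .* a    # plain integer vector
bb = 2 .* b    # missing propagates through the multiplication

bb_type = eltype(bb)   # stays a small Union rather than Any
```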

dplyr n() equivalent

dplyr provides n() which gives the number of rows in a subgroup. In Julia, it would look something like this:

iris2 = @> begin
  iris
  @by( :Species, 
      count = n(),
      sepal_length = mean(:SepalLength))
end

In order to get what I want, I have to do this:

iris2 = @> begin
    iris
    @transform(count = 1)
    @by( :Species,
        count = sum(:count),
        sepal_length = mean(:SepalLength))
end
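Within a group, the number of rows is simply the length of any column in that group, so something like `count = length(:Species)` might replace the extra `@transform` step (not verified against the version used above). The idea in plain Julia, with invented data and a hypothetical result name `by_species`:

```julia
using Statistics  # standard library; provides mean

species      = ["setosa", "setosa", "virginica", "virginica", "virginica"]
sepal_length = [5.1, 4.9, 6.3, 5.8, 7.1]

# For each distinct species, count its rows and average the matching values.
by_species = Dict(
    s => (n = count(==(s), species),
          mean_sepal = mean(sepal_length[species .== s]))
    for s in unique(species)
)
```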

mutate vs. mutate-in-place

It seems like @byrow should be written @byrow!, because it seems like a mutate-in-place function. In fact, it might be useful to have both a mutate and a mutate-in-place version of all the functions in DataFramesMeta.
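For reference, the naming convention in question looks like this for ordinary functions. The names `add_one` and `add_one!` are made up for illustration:

```julia
# Non-mutating: returns a new vector and leaves the input untouched.
add_one(v) = v .+ 1

# Mutating: updates the argument in place; the trailing ! signals this by convention.
function add_one!(v)
    v .+= 1
    return v
end

x = [1, 2, 3]
y = add_one(x)    # x is still [1, 2, 3] at this point
add_one!(x)       # x itself is now changed
```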
