DecisionTree.jl's Issues

Some questions about `prune_tree`.

The prune_tree function for classification trees internally calls _prune_run, which computes the purity using the zero-one loss. However, decision trees are built using entropy as the purity criterion. I'm not sure whether this is done on purpose or whether it's a bug.

The mismatch can be fixed easily, but we might also address the more general problem and make prune_tree criterion-agnostic by storing the purity of each node in a struct field (it is already a byproduct of tree building) and having the function refer to that field instead of recomputing the node purity. This would also make the same prune_tree function work on both regression and classification trees.
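A minimal sketch of the idea, using hypothetical simplified structs rather than the package's actual Leaf/Node types:

struct PLeaf
    prediction
    purity::Float64
end

struct PNode
    prediction          # majority label / mean at this node, a training byproduct
    purity::Float64     # purity recorded while building the tree, whatever the criterion
    left
    right
end

prune(leaf::PLeaf, threshold) = leaf

function prune(node::PNode, threshold)
    left  = prune(node.left,  threshold)
    right = prune(node.right, threshold)
    # collapse the split when the stored purity is already high enough,
    # without ever recomputing a criterion-specific loss
    if left isa PLeaf && right isa PLeaf && node.purity >= threshold
        return PLeaf(node.prediction, node.purity)
    end
    return PNode(node.prediction, node.purity, left, right)
end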

force/auto-convert to one-hot encoding for categorical features

The current implementation uses lexicographical ordering to calculate splits of string features. In practice this is rarely intended, since categorical features are by definition unordered (for example, it makes no sense that "Blue" < "Red" < "Yellow"). One-hot encoding would decouple the categorical variable from any unintended ordering and would also allow regression on datasets with categorical features, which is not currently possible.
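A minimal sketch of what one-hot expansion of a string column could look like (an illustrative helper, not part of the package):

# expand one string column into 0/1 indicator columns
function onehot(col::AbstractVector)
    levels = sort(unique(col))
    return Float64[v == l for v in col, l in levels], levels
end

X_color, color_levels = onehot(["Blue", "Red", "Yellow", "Blue"])
# X_color is a 4x3 indicator matrix, so no ordering among the colours is implied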

Too many build_tree()

The original DecisionTree.build_tree functions are now just wrappers for the new treeclassifier.build_tree and treeregressor.build_tree routines.
This naming convention makes it unnecessarily confusing and hard to follow.

May I suggest renaming the new functions to treeclassifier.build and treeregressor.build, or similar?

cc @Eight1911

Usage code broken: matrix/vector not defined

Hi Ben,

I'm not sure whether any dependent libraries made breaking changes, but it looks like the sample code no longer works. The matrix(...) and vector(...) functions both fail:

julia> features = matrix(iris[:, 2:5]);
ERROR: matrix not defined

julia> labels = vector(iris[:, "Species"]);
ERROR: vector not defined

This is happening from both the REPL and in Julia Studio. I've tried a fresh install of Julia (removing all settings/libraries in between installs), but I still get this unfortunately.

One can force the features and labels variables to be the right types using the code below:

features = convert(Matrix, hcat(values(iris[:, 2:5])))
isa(features,Matrix) # should print true
labels = hcat(iris[:, "Species"])[1:end]
isa(labels,Vector) # should print true

... however build_tree(...) still ultimately fails:

julia> model = build_tree(labels, features)
ERROR: no method isless(DataArray{Float64,1}, DataArray{Float64,1})
 in sort! at sort.jl:245
 in sort! at sort.jl:277
 in sortperm at sort.jl:325
 in _split_info_gain at /Users/dhruvbhatia/.julia/DecisionTree/src/DecisionTree.jl:85
 in _split at /Users/dhruvbhatia/.julia/DecisionTree/src/DecisionTree.jl:52
 in build_tree at /Users/dhruvbhatia/.julia/DecisionTree/src/DecisionTree.jl:141 (repeats 2 times)

Version info below (I've tried both 0.2 and the 0.3 prerelease):

julia> versioninfo()
Julia Version 0.3.0-prerelease+1202
Commit d4b825a (2014-01-25 06:25 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.0.0)
  CPU: Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm

Adding a new field to the `Leaf` API to support sample weights

I'm thinking of adding support for sample weights, which isn't compatible with the current Leaf struct. More specifically, the values field in the struct lists every label that falls into the leaf, with multiplicity, but does not give the weight of each entry in the list. To add support for sample weights, I propose that we do either of the following (both sketched below):

  • Add a new weights field to the Leaf struct, such that leaf.weights[i] gives the weight associated with the label in leaf.values[i].
  • Change leaf.values into a dictionary mapping each label to the sum of its weights, i.e., leaf.values[d] gives the total weight of the label d.
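A rough sketch of both options, with struct names chosen for illustration (the existing Leaf fields are assumed to be majority and values):

# Option 1: a parallel weights vector
struct WeightedLeaf1{T}
    majority :: T
    values   :: Vector{T}
    weights  :: Vector{Float64}   # weights[i] is the weight associated with values[i]
end

# Option 2: values becomes a label => total-weight dictionary
struct WeightedLeaf2{T}
    majority :: T
    values   :: Dict{T, Float64}  # values[d] is the summed weight of label d
end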

VFDT's based on this package?

Hi, I'm interested in a Julia implementation of Domingos's VFDTs, aka "Hoeffding trees"; see, for example:

http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/HoeffdingTree.html

This is a streaming algorithm for learning decision trees and might be very useful for modelling "big data" such as logs etc.

Are there any plans for implementing streaming algorithms within this package? If not do you think it is feasible on top of the infrastructure provided here, or would a clean/separate implementation/package be better?

Thanks for any input you might have.

Support for DataFrame based data and model formulas

Hello, thanks for writing this. I've benchmarked its use against the default randomForest implementation in R and have found it to be amazingly fast.

I was hoping to be able to use this library with DataFrames, including the model formula API. I know that DataFrames currently doesn't support categorical data columns, but I think that is planned to be integrated.

I can try to help contribute to this, but it would be nice if this project was merged into the JuliaStats project first (I prefer to contribute to projects that are explicitly community owned).

Support Julia 1.0

Currently, import DecisionTree works on Julia 0.6 and Julia 0.7, but fails on Julia 1.0.

Can we add support for using DecisionTree.jl on Julia 1.0?

cc: @bensadeghi

Error on @everywhere using DecisionTree

I want to use DecisionTree across multiple processors.

Code:
addprocs(3)
@everywhere using DecisionTree

This works fine on Julia 0.3.11 (package version 0.3.11), but I get the following error on Julia 0.4.5 (package version 0.4.2):
ERROR: On worker 2:
LoadError: UndefVarError: BaseClassifier not defined

Avoid "regressor" terminology

I think the use of "regressor" for a regression tree is unfortunate, since a regressor is a right-hand-side variable, while what separates a regression tree from a classification tree is the left-hand-side variable. I tried searching for this and it seems like something scikit-learn came up with. The terminology does not appear in, e.g., The Elements of Statistical Learning.

Multi-output (multi-target) problems

Hi, is it possible to do something like the multi-output regression from scikit-learn in Julia? My input matrix has 3 features, and I have 12 different outputs per example.
I tried to train 12 separate trees, but this didn't work.
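For reference, training one tree per target would look roughly like this, assuming X is the n×3 feature matrix and Y the n×12 Float64 target matrix:

models = [build_tree(Y[:, j], X) for j in 1:size(Y, 2)]      # one tree per output column
predictions = hcat([apply_tree(m, X) for m in models]...)    # n×12 matrix of per-target predictions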

inconsistency in argument order between regression and classification's build_tree

regression's build_tree has the following signature

function build_tree(
        labels             :: Vector{T},
        features           :: Matrix{S},
        min_samples_leaf    = 5,
        n_subfeatures       = 0,
        max_depth           = -1,
        min_samples_split   = 2,
        min_purity_increase = 0.0;
        rng                 = Random.GLOBAL_RNG) where {S, T <: Float64}

while classification's build_tree has the following signature

function build_tree(
        labels              :: Vector{T},
        features            :: Matrix{S},
        n_subfeatures        = 0,
        max_depth            = -1,
        min_samples_leaf     = 1,
        min_samples_split    = 2,
        min_purity_increase  = 0.0;
        rng                  = Random.GLOBAL_RNG) where {S, T}

The third argument of the regression version, min_samples_leaf, should be the fifth argument for consistency. This should be changed, since the mismatch may cause silent bugs for users.

weight values for the features.

I want to assign weight values to the features, so that the split function uses the features with larger weights first.

Some of the features are more important than the others, and I want to control the split process.

Calculation of `new_coeff` in adaboost

I'm curious about the calculation of new_coeff in build_adaboost_stumps().

Just about every book and article I've found lists the formula as new_coeff = 0.5 * log((1 - err)/err). I was just curious why the function uses new_coeff = 0.5 * log((1 + err)/(1 - err)). I think this subtle difference might actually make quite an impact in accuracy.
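For reference, a minimal sketch of the textbook coefficient quoted above:

# textbook (discrete AdaBoost) stump coefficient, for comparison
stump_coeff(err) = 0.5 * log((1 - err) / err)
stump_coeff(0.3)   # ≈ 0.424; positive while err < 0.5 and zero at err == 0.5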

I would also note that in the book Boosting by Schapire and Freund they point out that for each boosting round err ought to be approximately 0.5. And the current method does not behave that way; instead err tends towards 1.0, and then becomes NaN for all rounds afterwards.

Is there a good citation you can recommend for the current approach? Or, is this a possible oversight? I might be missing something.

Thanks in advance.

-Paul

P.S.
Thanks for creating this package!

Random forest

My reading of the code for build_tree is that you're taking a 70% subset of the full data rather than a bootstrap sample. This should change the properties of the learner. Can the bootstrap version be implemented?
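For illustration, a minimal sketch of what the bagging step could look like with a true bootstrap sample, assuming labels and features as in the README examples:

n = size(features, 1)
boot = rand(1:n, n)                                  # sample n row indices with replacement
tree = build_tree(labels[boot], features[boot, :])   # grow the tree on the bootstrap sample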

Facing issues with build_adaboost_stumps

I have used the adaboost function multiple times earlier, but now it throws the following error.

ERROR: DomainError
in build_adaboost_stumps at C:\Users\Dinkar\.julia\v0.3\DecisionTree\src\DecisionTree.jl:281

I tried going step by step and found that the function on line 280 of the DecisionTree.jl/src/DecisionTree.jl (reproduced below) was not working.

err = _weighted_error(labels, predictions, weights)

It would be great if you could guide me on how to proceed.

Thank you
Dinker

nfoldCV_tree() generates error "no method matching round()" in 0.5.0; works in 0.4.3

In:

versioninfo()
Julia Version 0.5.0-dev+3123
Commit 01dd5ec* (2016-03-12 05:08 UTC)
Platform Info:
  System: Linux (x86_64-suse-linux)
  CPU: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

code:

# example decision tree
using DataFrames
using DecisionTree
using RDatasets
data = DataFrame(A=[1.0,2,3,2,4,3,5,2,3,1],
                 B=[6.0,4,3,6,5,4,2,7,4,5],
                 C=[2,3,4,5,2,3,2,1,3,4])
#data = dataset("datasets","iris")
X = Array(data[:,1:end-1])
y = Array(data[end])
model = build_tree(y,X)
accuracy = nfoldCV_tree(y, X, 0.9, 3)

produces error:

ERROR: LoadError: MethodError: no method matching round(::TypeVar, ::Float64)
Closest candidates are:
  round{T<:Integer}(::Type{T<:Integer}, ::AbstractFloat)
  round{T<:Integer}(::Type{T<:Integer}, ::AbstractFloat, ::RoundingMode{T})
 [inlined code] from /home/colin/.julia/v0.5/DecisionTree/src/DecisionTree.jl:16
 in _nfoldCV(::Symbol, ::Array{Int64,1}, ::Array{Float64,2}, ::Float64, ::Vararg{Any}) at /home/colin/.julia/v0.5/DecisionTree/src/measures.jl:152
 in nfoldCV_tree(::Array{Int64,1}, ::Array{Float64,2}, ::Float64, ::Int64) at /home/colin/.julia/v0.5/DecisionTree/src/measures.jl:185
 in include(::ASCIIString) at ./boot.jl:264
 in include_from_node1(::ASCIIString) at ./loading.jl:417
 in eval(::Module, ::Any) at ./boot.jl:267
while loading /home/colin/kaggle/tester/dectree.jl, in expression starting on line 12

The code prints the model correctly. I tried with both the sample data and the iris data, with the same result; Julia 0.4.3 completes the nfoldCV correctly from the same code.

Squeeze type piracy

The squeeze definition might break some other code/package in a very hard-to-track way:

squeeze([1], 1)
> 0-dimensional Array{Int64,0}:
> 1

using DecisionTree
squeeze([1], 1)
> 1-element Array{Int64,1}:
>  1

There are several uses of squeeze in DecisionTree, so we'll have to be careful when fixing this.
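One possible direction, sketched under the assumption that the override's only intent is to leave 1-D arrays untouched (not verified against the actual code):

# give the behaviour an internal, non-exported name and leave Base.squeeze alone
_squeeze(A::AbstractArray, dim::Integer) = ndims(A) == 1 ? A : squeeze(A, dim)
# ...then replace the internal calls to squeeze(...) with _squeeze(...)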

better `Leaf` and `Ensemble` structs

I've implemented GradientBoost and a more efficient version of AdaBoost that interface with treeregressor and treeclassifier and am now looking to port them to the light-weight front facing structs we use, i.e., LeafOrNode and Ensemble.

Trying to do so, I realized that one thing I feel somewhat against is the storage of every label in the Leaf struct. So I want to discuss whether it would be possible to change it to something more lightweight. These values are usually unnecessary, take up too much space, and can be recomputed if needed. For a lot of purposes, a single confidence parameter should suffice (e.g., the impurity of that leaf). And if that is not enough, we might add, for example, another function that takes in a set of samples and returns an array containing the indices of the leaves they end up in. This would generalize the current implementation to any dataset, not just the training set, and from that we can very easily rewrite apply_tree_proba.

This may feel like an inconvenience, but for boosting and random forest ensembles that usually use shallow trees, most of the memory cost is in this storage of the samples, and the real inconvenience is dealing with models that are 4GB large or not being able to train your model because you ran out of memory.

One other thing is that the current Ensemble struct doesn't add very much to the current implementation because it's just a list of trees. This means that we have to deal with something like AdaBoost returning both an Ensemble and a list of coefficients that should have been a part of that ensemble in the first place. So I'm also proposing that we encode these coefficients into the model.

tldr; I'm proposing that we use the following structs instead

struct Leaf{S, T}
    label    :: T
    impurity :: Float64
end

struct Ensemble{S, T}
    trees  :: Vector{Node{S, T}}
    coeffs :: Union{Nothing, Vector{Float64}}
    method :: String
end
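The per-sample leaf lookup mentioned above could look roughly like this, assuming internal nodes expose featid, featval, left and right fields:

# walk a single sample down to the leaf it ends up in
function apply_to_leaf(node, x::AbstractVector)
    while !(node isa Leaf)
        node = x[node.featid] < node.featval ? node.left : node.right
    end
    return node
end

From this, leaf indices for a whole dataset (and hence apply_tree_proba) can be rebuilt on demand instead of being stored in every leaf.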

Real type features for Regression

Good work with DecisionTree.jl!

Is there any chance of having Real type features supported for regression?
As I understand, it only works with FloatingPoint at the moment.

"build_forest" does not accept categorical data?

My data can be described as follows:
features - Array{Any,2} - two columns of strings, and 3 columns containing numerical values
labels - Array{String,2} - array of strings

Now, when I use:

model = build_forest(labels, features, 2, 10, 0.5, 6)

I get:

MethodError: no method matching build_forest(::Array{String,2}, ::Array{Any,2}, ::Int64, ::Int64, ::Float64, ::Int64)

I get similar errors if I convert "features" and "labels" to categorical:
model = build_forest(categorical(labels), categorical(features), 2, 10, 0.5, 6)

It works when I use only the 3 numerical columns in the features array as features.

  1. How do you build models using features that contain non-numeric data?
  2. Am I using the "categorical" data incorrectly in the "build_forest" command?

For reference, the packages I am using in my code are:

using DecisionTree
using DataFrames
using CategoricalArrays

Why the type UniqueRanges?

As far as I know, the UniqueRanges type is used in the _split_info_gain function to iterate over the values of a particular feature without repetition. I am wondering whether Base.Set would be enough to do the job.

Problem with adaboost

For some reason, boosting doesn't seem to work. I don't think the issue here is the same as #42. I tried the example from The Elements of Statistical Learning and compared against fastAdaboost in R:

julia> using Distributions, DecisionTree, RCall, DataFrames

julia> # Boosting example from EoSL
       X = randn(1000, 10);

julia> y = Vector{Int64}(vec(sum(abs2, X, 2) .> quantile(Chisq(10), 0.5)));

julia> # Use DecisionTree
       ada1 = DecisionTree.build_adaboost_stumps(y, X, 5);

julia> mean(apply_adaboost_stumps(ada1..., X) .== y)
0.579

julia> # Use fastAdaboost
       R"library(fastAdaboost)";

julia> df = DataFrame(X);

julia> df[:y] = y;

julia> ada2 = R"adaboost(y ~ x1 + x2 + x3 + x4 + x5 + x6 +x7 + x8 + x9 + x10, data = $df, 5)";

julia> rcopy(R"predict($ada2, newdata = $df)$error")
0.021

Furthermore, build_adaboost_stumps is much slower than adaboost from fastAdaboost. It looks like build_adaboost_stumps might not use the same optimizations as build_tree.

categorical features handled "correctly"?

Does this package "correctly" handle categorical variables (e.g. without conversion to numerical encoding schemes like one-hot or ordinal encoding), as that ability is a distinct advantage of decision trees and their progeny? Issues #61 and #13 are related but it is not clear to me what the current status is. Perhaps if they are supported, I could make a documentation PR for a brief mention on the README.

If so, it would be a good reason for some users to switch from scikit-learn's RF implementation, which still requires numerical encoding.

TODO: Out of bag error

It would be awesome if we could compute the out of bag error in the process of training the random forest. We can naturally do a sort of cross validation in random forests by testing on the data that we leave out in the process of training each tree through bagging.
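A minimal sketch of the idea, assuming we keep each tree's bootstrap indices around (none of these helpers exist in the package; the plurality vote is written out inline):

# score every sample only with the trees that did not train on it
function oob_error(trees, bags, labels, features)
    wrong, counted = 0, 0
    for i in 1:length(labels)
        votes = [apply_tree(t, features[i, :])
                 for (t, bag) in zip(trees, bags) if !(i in bag)]
        isempty(votes) && continue
        counted += 1
        wrong += majority(votes) != labels[i]
    end
    return wrong / counted
end

# simple plurality vote over the out-of-bag predictions
function majority(votes)
    counts = Dict{eltype(votes), Int}()
    for v in votes
        counts[v] = get(counts, v, 0) + 1
    end
    best = first(votes)
    for (v, c) in counts
        c > counts[best] && (best = v)
    end
    return best
end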

Code comments

Are you interested in having docstring code comments? I'm asking because I'm currently adding comments to my fork to understand the code better, I'd be happy to put in a PR once I've covered more of the code.

Does not handle Adjoint nor Transposed features

In https://white.ucc.asn.au/2017/01/24/JuliaML-and-TensorFlow-Tuitorial.html
the rows and columns are swapped: columns are the samples instead of rows, so the feature matrix has to be transposed, and build_tree does not accept the resulting Adjoint:

ERROR: LoadError: MethodError: no method matching build_tree(::Array{Bool,1}, ::LinearAlgebra.Adjoint{Int64,Array{Int64,2}}, ...

using DecisionTree
import ScikitLearnBase: fit!

features = [1 2;4 5;3 6]
labels = [true,false]
    
println("As array");fit!(DecisionTreeClassifier(), convert(Array{Float64,2},features'), labels)
println("Adjoint");fit!(DecisionTreeClassifier(), features', labels)

Supporting regression trees

Hi,

I just wanted to ask if you intend to support also regression trees for continuous variables at some point in the future. I know the package is called DecisionTree but almost everything is there to do regression trees, too.

For my work I just replaced the majority_vote and info_gain functions with a mean and RMSE and it worked perfectly well.
However, to support both tree types in one package, there would have to be some changes in user interface...
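Roughly what I mean by that replacement, as a sketch rather than the package's code:

# a regression leaf predicts the mean of its labels,
# and candidate splits are scored by squared error instead of info gain
leaf_value(labels) = sum(labels) / length(labels)

function squared_error(labels)
    m = leaf_value(labels)
    return sum((l - m)^2 for l in labels)
end

# lower total squared error across the two children means a better split
split_score(left_labels, right_labels) = squared_error(left_labels) + squared_error(right_labels)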

Fabian

Missing data

I've been trying out this library as I jump into learning Julia and I'm wondering what support there is for missing values in the dataset. Any recommendations that you have based on your experience for how to deal with these missing values would be very helpful.

julia> model = build_forest(labels, features, 3, 10)

exception on 1: ERROR: no method convert(Type{Bool}, NAtype)
in setindex! at array.jl:298
in bitcache_lt at broadcast.jl:366
in .< at broadcast.jl:382
in build_tree at .julia/v0.3/DecisionTree/src/DecisionTree.jl:153
in build_tree at .julia/v0.3/DecisionTree/src/DecisionTree.jl:171
in anonymous at no file:237
in anonymous at multi.jl:1263
in run_work_thunk at multi.jl:613
in run_work_thunk at multi.jl:622
in anonymous at task.jl:6
ERROR: no method convert(Type{Node}, MethodError)
in copy! at abstractarray.jl:149
in convert at array.jl:209
in build_forest at .julia/v0.3/DecisionTree/src/DecisionTree.jl:239
in build_forest at .julia/v0.3/DecisionTree/src/DecisionTree.jl:232
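One simple workaround is to drop incomplete rows before training. A sketch using current Julia's missing (the error above is from the older NAtype era, so details will differ):

# keep only the rows whose features are all present
complete = [all(!ismissing, features[i, :]) for i in 1:size(features, 1)]
model = build_forest(labels[complete], features[complete, :], 3, 10)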

Simplify DataFrame => Matrix?

I'm really glad to have this package. Thanks for creating it!

The API is still in flux, but I believe the following should now work if you're interested in shrinking your README file:

iris = data("datasets", "iris")
features = matrix(iris[:, 2:5])
labels = iris[:, "Species"]

We're also debating how to handle providing formula interfaces for models like decision trees so that you could just write:

build_tree(Species ~ Petal_Length + Petal_Width, iris)

If you have requests there, please let me know.

Deprecated Dict comprehension warning

On Julia 0.6 (but not Julia 0.4), a warning appears on using DecisionTree (but the code seems to run fine anyway):

WARNING: deprecated syntax "[a=>b for (a,b) in c]".
Use "Dict(a=>b for (a,b) in c)" instead.

This is the entire output, no indication of line numbers or affected files. Using:

Julia Version 0.6.0-dev.233
Commit 62615c3* (2016-08-16 05:34 UTC)
Platform Info:
  System: Linux (x86_64-suse-linux)
  CPU: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

Problems saving random forest model

Hi all,

I'm having a strange problem when saving a random forest model. When using the JLD module to save a model created by the DecisionTree module, it usually takes a huge amount of space on disk. For instance, a model that was 155 MB as a variable in Julia would take more than 2 GB on disk and an eternity to save!

I solved this problem by using the serialize and deserialize commands. After this, I was able to save the models. However, the drawback of this method is that it may not be possible to read the models in other Julia versions. I don't know whether this is an issue with the DecisionTree module or the JLD module, but I would like to know if you have another workaround.
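For reference, the serialize-based workaround looks roughly like this (standard-library API; on Julia 0.7+ it lives in the Serialization stdlib):

using Serialization   # not needed on older Julia, where serialize/deserialize are in Base

open("forest.jls", "w") do io
    serialize(io, model)                 # write the trained model to disk
end

model = open(deserialize, "forest.jls")  # load it back (same Julia version)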

Cheers,

Stratified sampling

Hello, I wasn't quite sure where to write this - it isn't really an issue with your code. Is there a possibility of stratified sampling with this package, or of merging 2 forests (or trees) together? If not, do you plan to add these features?

If I have more questions, where should I write them? I'm new to github.

Parallel Random Forest

I noticed that the random forest classifier is intended to build trees in parallel. However, we must manually add Julia processes by either invoking the -p option or using the addprocs function. I am wondering if the classifier could add processes automatically, perhaps via an option indicating how many processes the user wants to use, with a default value of 1.
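A minimal sketch of what such an option could do internally, using only the standard Distributed API (the target process count here is purely illustrative):

using Distributed   # Julia 0.7+; on older versions addprocs/@everywhere are in Base

nprocs() < 4 && addprocs(4 - nprocs())   # ensure 4 processes in total
@everywhere using DecisionTree           # make the package available on every worker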

Caleb

Saving RandomForest Ensembles

In a production setting, I want to train a model with a fairly large dataset and then run classifications on a per line basis using a persistent model. Can I store the ensemble in a file (or redis) and then load it at runtime?

I've tried writing the resulting model to a file but the following approach:

outfile = open("model.txt", "w")
write(model,outfile)

fails with the following error:

ERROR: write has no method matching write(::Ensemble, ::IOStream)

Any thoughts?

Thanks,

Marc

Heterogeneous features in regression

The package promises "support for mixed nominal and numerical data" but this is currently only true for classification and not regression.

Input checking

This is scary:

tree = fit!(DecisionTreeRegressor(), [1.0 2; 3 4], [10, 24.0])
predict(tree, [])
> 17.0

apply_tree also accepts (and ignores) extra values in the feature_vector without complaining.
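A sketch of the kind of check that could be added, assuming internal nodes expose a featid field (apply_tree_checked is a hypothetical wrapper, not an existing function):

max_featid(leaf::Leaf) = 0
max_featid(node::Node) = max(node.featid, max_featid(node.left), max_featid(node.right))

function apply_tree_checked(tree, x::AbstractVector)
    needed = max_featid(tree)
    length(x) >= needed ||
        throw(DimensionMismatch("feature vector has $(length(x)) entries; the tree uses feature $needed"))
    return apply_tree(tree, x)
end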

varimportance plot

Is there a function or a way to print the importance of the variables based on purity or the Gini index, similar to Breiman's version in R?
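A crude proxy can be computed by hand for a single tree by counting how often each feature is split on (this counts splits rather than summing purity gains, and assumes nodes expose featid, left and right fields):

function split_counts!(counts, node::Node)
    counts[node.featid] = get(counts, node.featid, 0) + 1   # one more split on this feature
    split_counts!(counts, node.left)
    split_counts!(counts, node.right)
    return counts
end
split_counts!(counts, leaf::Leaf) = counts                   # leaves contribute nothing

importance = split_counts!(Dict{Int,Int}(), model)           # feature index => number of splits
# model here is a single tree from build_tree; for a forest, accumulate over its trees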

ERROR: UndefVarError: last_f not defined (Julia v0.7.0-alpha)

There's a strange error when running build_tree in Julia v0.7.0-alpha (for both classification and regression), error traces below. What's peculiar is that the routines run fine in Julia v0.6.x.

For classification, I've tried a hack of adding last_f = Xf[1] before the while hi < n_samples loop (line 130 of tree.jl).
But with this change, I'm seeing some serious tree build performance degradation, both in execution time and memory allocation, on bleeding edge Julia. This performance issue might be due to something unrelated, but finding its root cause would be helpful for the Julia core team.

cc @Eight1911

Classification:

ERROR: UndefVarError: last_f not defined
Stacktrace:
 [1] _split!(::Array{Float64,2}, ::Array{Int64,1}, ::DecisionTree.treeclassifier.NodeMeta, ::Int64, ::Int64, ::Int64, ::Int64, ::Int64, ::Float64, ::Array{Int64,1}, ::Array{Int64,1}, ::Array{Int64,1}, ::Array{Int64,1}, ::Array{Float64,1}, ::Array{Int64,1}, ::Random.MersenneTwister) at /home/bs/.julia/dev/DecisionTree/src/classification/tree.jl:131
 [2] #build_tree#1(::Any, ::Any, ::Array{Float64,2}, ::Array{Int64,1}, ::Int64, ::Int64, ::Int64, ::Int64, ::Float64) at /home/bs/.julia/dev/DecisionTree/src/classification/tree.jl:322
 [3] #build_tree at ./none:0 [inlined]
 [4] #build_tree#6 at /home/bs/.julia/dev/DecisionTree/src/classification/main.jl:86 [inlined]
 [5] build_tree at /home/bs/.julia/dev/DecisionTree/src/classification/main.jl:73 [inlined] (repeats 2 times)
 [6] top-level scope at none:0

Regression:

ERROR: UndefVarError: last_f not defined
Stacktrace:
 [1] _split!(::Array{Float64,2}, ::Array{Float64,1}, ::DecisionTree.treeregressor.NodeMeta, ::Int64, ::Int64, ::Int64, ::Int64, ::Float64, ::Array{Int64,1}, ::Array{Float64,1}, ::Array{Float64,1}, ::Random.MersenneTwister) at /home/bs/.julia/dev/DecisionTree/src/regression/tree.jl:139
 [2] #build_tree#1(::Random.MersenneTwister, ::Any, ::Array{Float64,2}, ::Array{Float64,1}, ::Int64, ::Int64, ::Int64, ::Int64, ::Float64) at /home/bs/.julia/dev/DecisionTree/src/regression/tree.jl:289
 [3] #build_tree at ./none:0 [inlined]
 [4] #build_tree#24 at /home/bs/.julia/dev/DecisionTree/src/regression/main.jl:126 [inlined]
 [5] build_tree at /home/bs/.julia/dev/DecisionTree/src/regression/main.jl:110 [inlined] (repeats 2 times)
 [6] top-level scope at none:0

Convert model to TensorFlow

As we have scaling resources e.g. available from Firebase to use TensorFlow models, it would be nice to find out how to convert a DecisionTree model into a TensorFlow model.

I see that a DecisionTree model can be saved as a JLD2 file, which should be readable by an HDF5 reader. Am I missing just the last step, from HDF5 to MetaGraph (or .GraphDef)?

Or maybe I am missing a completely different way to achieve this?

Thanks

EDIT: I had asked this on the julia discourse but I guess it actually belongs here

minimum size of each leaf

Hi, is there a way to limit size of a leaf to prevent overfitting? Something similar to minbucket in R?
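If you are on a recent version, the classification build_tree signature quoted earlier on this page takes min_samples_leaf as its fifth positional argument, so something like the following should limit leaf size (argument positions assumed from that signature):

# positional args after labels and features: n_subfeatures, max_depth, min_samples_leaf
model = build_tree(labels, features, 0, -1, 5)   # every leaf keeps at least 5 samples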

Thanks a lot for sharing your code!

Turn on precompile

Now that Julia 0.3 isn't supported, is there any reason not to precompile DecisionTree.jl? I just tried it on my side, and everything looked fine.
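For reference, the Julia 0.4-0.6 mechanism is a one-line addition at the top of src/DecisionTree.jl:

__precompile__()

module DecisionTree
# ... existing module body ...
end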

Speed comparison with R

Hello, I tried these two codes in Julia and R respectively:

Ytrain=rand(2000,1)
Xtrain=rand(2000,60)

addprocs(3)
using DecisionTree

@time model = build_forest(Ytrain[:,1],Xtrain,20,200,5,1)

library(randomForest)

X=replicate(60, runif(2000))
Y=runif(2000)

ptm <- proc.time()
rf=randomForest(X, Y, importance = FALSE, ntree=200,do.trace=0,nodesize=5)
proc.time() - ptm

In Julia it takes around 30 seconds and in R it takes around 10. I think the configurations of build_forest and randomForest are the same (randomForest takes one third of the variables for each node, which is exactly 20). And as far as I know, randomForest in R can't use parallelization (at least this library can't), and Julia should be way faster than R.

So, what might be causing the difference in speed?

Intermittent test failure

In test/classification_rand.jl, this code failed once on Travis because depth(model) was 1. We should probably change the == to <=. Or we could call srand to ensure that we always get exactly the same results.

using Base.Test
using DecisionTree

n,m = 10^3, 5 ;
features = rand(n,m);
weights = rand(-1:1,m);
labels = _int(features * weights);

maxdepth = 3
model = build_tree(labels, features, 0, maxdepth)
@test depth(model) == maxdepth

large memory footprint and slow performance on large datasets

Due to the fact that the data matrix is copied at every recursive call, the current implementation has a memory footprint that is unnecessarily large. This is a structural issue and may well require a complete rewrite of the build_tree function. For example, on my MBP, the @time macro prints the following when build_tree is used to split the MNIST training dataset:

  2.866582 seconds (3.31 M allocations: 820.371 MiB, 3.99% gc time).

The time above does not include JIT compile time. If the maintainer(s) give me the green light, I will be making a couple of PRs in the upcoming weeks to fix this issue and hopefully improve the performance of the current implementation. In particular, I will be making a small port of the sklearn trees here compatible with the current API. The library linked has the following footprint:

  0.496638 seconds (21.32 k allocations: 802.953 KiB, 1.00% gc time)

which is approximately a thousand times smaller with 5-6 times the performance and produces the same tree with the same number of nodes when asked to evaluate every feature for each split.
