juliaai / decisiontree.jl
Julia implementation of Decision Tree (CART) and Random Forest algorithms
License: Other
With true multithreading becoming a reality, I'm wondering whether it may be beneficial to use an @threads loop instead of an @distributed loop in the training phase.
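A minimal sketch of what a threaded training loop could look like, assuming each model can be trained independently. The names fit_one and fit_many_threaded are hypothetical stand-ins and not part of DecisionTree.jl; the "model" here is just the mean of a bootstrap sample so that the example stays self-contained.

```julia
using Random

# Hypothetical stand-in for training one tree: the "model" is the mean of a bootstrap sample.
function fit_one(labels::Vector{Float64}, rng::AbstractRNG)
    idx = rand(rng, 1:length(labels), length(labels))
    return sum(labels[idx]) / length(idx)
end

# Train n_models independent models, one loop iteration per model.
function fit_many_threaded(labels::Vector{Float64}, n_models::Int; seed::Int = 1)
    models = Vector{Float64}(undef, n_models)
    Threads.@threads for i in 1:n_models
        # One RNG per iteration, so no mutable RNG state is shared across threads
        # and the result does not depend on how iterations are scheduled.
        rng = MersenneTwister(seed + i)
        models[i] = fit_one(labels, rng)
    end
    return models
end

models = fit_many_threaded(collect(1.0:10.0), 8)
```

Unlike @distributed, this shares the training data in memory between threads instead of copying it to worker processes; Julia has to be started with more than one thread (for example via the JULIA_NUM_THREADS environment variable) for the loop to actually run in parallel.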
The prune_tree function for classification trees internally calls _prune_run, which computes the purity using the zero-one loss. However, decision trees are built using the entropy purity. I'm not sure if this is done on purpose or if it's a bug.
The latter can be fixed easily, but we might also address the more general problem and make prune_tree criterion-agnostic by storing the purity of the node in a struct field (which is already a byproduct of tree building) and, instead of recomputing the node purity, have the function refer to that field. This will also make the same prune_tree function work on both regression and classification trees.
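To make the mismatch between the two criteria concrete, here is a small sketch of both impurity measures computed on the labels of a single node. The function names are illustrative only and are not taken from the package.

```julia
# Count how often each label occurs in a node.
function class_counts(labels)
    counts = Dict{eltype(labels), Int}()
    for l in labels
        counts[l] = get(counts, l, 0) + 1
    end
    return counts
end

# Zero-one loss: fraction of samples that do not belong to the majority class.
zero_one_impurity(labels) = 1 - maximum(values(class_counts(labels))) / length(labels)

# Entropy (natural log) of the empirical class distribution.
function entropy_impurity(labels)
    n = length(labels)
    return -sum(c / n * log(c / n) for c in values(class_counts(labels)))
end

balanced = ["a", "a", "b", "b"]
skewed = ["a", "a", "a", "b"]
```

The two measures rank nodes similarly but are on different scales, so a purity threshold tuned for one does not carry over to the other.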
The current implementation uses the lexicographical ordering to calculate splits of string features. But in practice, this is rarely intended since categorical features are by definition unordered (for example, it wouldn't make any sense that "Blue" < "Red" < "Yellow".) One hot encoding would decouple the categorical variable from any unintended ordering, and allow, as is not currently the case, regression on datasets with categorical features.
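A minimal sketch of one-hot encoding a single categorical column, assuming the column is a plain vector. The function one_hot is hypothetical and not part of the package; the levels are sorted only to make the column order reproducible, not because the order carries meaning.

```julia
# Encode a categorical column as one 0/1 indicator column per distinct level.
function one_hot(column::AbstractVector)
    levels = sort(unique(column))
    encoded = zeros(Float64, length(column), length(levels))
    for (i, v) in enumerate(column)
        encoded[i, findfirst(==(v), levels)] = 1.0
    end
    return levels, encoded
end

levels, X = one_hot(["Red", "Blue", "Red", "Yellow"])
```

The resulting Float64 matrix can be concatenated with the numerical features, which is what would allow regression on such data.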
The original DecisionTree.build_tree functions are now just wrappers for the new treeclassifier.build_tree and treeregressor.build_tree routines.
This naming convention makes the code unnecessarily confusing and hard to follow.
May I suggest renaming the new functions to treeclassifier.build and treeregressor.build, or similar?
cc @Eight1911
Hi Ben,
I'm not sure whether any dependent libraries made breaking changes, but it looks like the sample code no longer works. The matrix(...) and vector(...) functions both fail:
julia> features = matrix(iris[:, 2:5]);
ERROR: matrix not defined
julia> labels = vector(iris[:, "Species"]);
ERROR: vector not defined
This is happening from both the REPL and in Julia Studio. I've tried a fresh install of Julia (removing all settings/libraries in between installs), but I still get this unfortunately.
One can force the features and labels variables to be the right types using the code below:
features = convert(Matrix, hcat(values(iris[:, 2:5])))
isa(features,Matrix) # should print true
labels = hcat(iris[:, "Species"])[1:end]
isa(labels,Vector) # should print true
... however, build_tree(...) still ultimately fails:
julia> model = build_tree(labels, features)
ERROR: no method isless(DataArray{Float64,1}, DataArray{Float64,1})
in sort! at sort.jl:245
in sort! at sort.jl:277
in sortperm at sort.jl:325
in _split_info_gain at /Users/dhruvbhatia/.julia/DecisionTree/src/DecisionTree.jl:85
in _split at /Users/dhruvbhatia/.julia/DecisionTree/src/DecisionTree.jl:52
in build_tree at /Users/dhruvbhatia/.julia/DecisionTree/src/DecisionTree.jl:141 (repeats 2 times)
Version info below (I've tried both 0.2 and the 0.3 prerelease):
julia> versioninfo()
Julia Version 0.3.0-prerelease+1202
Commit d4b825a (2014-01-25 06:25 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin13.0.0)
CPU: Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
LAPACK: libopenblas
LIBM: libopenlibm
I'm thinking of adding support for sample weights, which isn't compatible with the current Leaf struct. More specifically, the field values in the struct lists every label that falls into the leaf with multiplicity, but does not give the weight of each label in the list. To add support for sample weights, I propose that we do either of the following:
1. Add a field weights to the Leaf struct such that leaf.weights[i] gives the sum of the weights of the label in leaf.values[i]
2. Turn leaf.values into a dictionary mapping each label to the sum of its weight, i.e., leaf.values[d] gives the weight of the label d
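The dictionary variant of the proposal can be sketched as follows. WeightedLeaf and majority are hypothetical names chosen for illustration; this is not the package's Leaf struct.

```julia
# Hypothetical weighted leaf: maps each label to the summed weight of its samples.
struct WeightedLeaf{T}
    values::Dict{T, Float64}
end

# Build a leaf from the labels that fall into it and their sample weights.
function WeightedLeaf(labels::AbstractVector{T}, weights::AbstractVector{<:Real}) where {T}
    totals = Dict{T, Float64}()
    for (label, w) in zip(labels, weights)
        totals[label] = get(totals, label, 0.0) + w
    end
    return WeightedLeaf{T}(totals)
end

# The predicted label is the one with the largest total weight.
function majority(leaf::WeightedLeaf)
    best_label, best_weight = first(leaf.values)
    for (label, w) in leaf.values
        if w > best_weight
            best_label, best_weight = label, w
        end
    end
    return best_label
end

leaf = WeightedLeaf(["a", "b", "a"], [0.2, 0.5, 0.1])
```

With unit weights this reduces to a plain label count, so unweighted behaviour would be recoverable from the same structure.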
Hi, I'm interested in a Julia implementation of Domingos's VFDT, a.k.a. "Hoeffding Trees"; see, for example:
http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/HoeffdingTree.html
This is a streaming algorithm for learning decision trees and might be very useful for modelling "big data" such as logs etc.
Are there any plans for implementing streaming algorithms within this package? If not do you think it is feasible on top of the infrastructure provided here, or would a clean/separate implementation/package be better?
Thanks for any input you might have.
Hello, thanks for writing this. I've benchmarked its use against the default randomForest implementation in R and have found it to be amazingly fast.
I was hoping to be able to use this library with DataFrames, including the Model Formula format api. I know that DataFrames currently doesn't support categorical data columns, but I think it is planned to be integrated.
I can try to help contribute to this, but it would be nice if this project was merged into the JuliaStats project first (I prefer to contribute to projects that are explicitly community owned).
Currently, import DecisionTree works on Julia 0.6 and Julia 0.7, but fails on Julia 1.0.
Can we add support for using DecisionTree.jl on Julia 1.0?
cc: @bensadeghi
I want to use DecisionTree across multiple processors.
Code:
addprocs(3)
@everywhere using DecisionTree
works fine on julia 3.11 pkg version- 0.3.11
I get the following error in julia 4.5 pkg version- 0.4.2
ERROR: On worker 2:
LoadError: UndefVarError: BaseClassifier not defined
@JuliaRegistrator register
I think the use of "regressor" for a regression tree is unfortunate, since a regressor is a right-hand-side variable, while what separates the regression tree from a classification tree is the left-hand-side variable. I tried searching for this and it seems like something scikit-learn came up with. The terminology is, for example, not in Elements of Statistical Learning.
Hi, is it possible to do something like the multi-output from scikit-learn in Julia? Cause my input matrix has 3 features, and I have 12 different outputs per example.
I tried to train 12 separate trees, but this didn't work.
Regression's build_tree has the following signature:
function build_tree(
        labels              :: Vector{T},
        features            :: Matrix{S},
        min_samples_leaf    = 5,
        n_subfeatures       = 0,
        max_depth           = -1,
        min_samples_split   = 2,
        min_purity_increase = 0.0;
        rng                 = Random.GLOBAL_RNG) where {S, T <: Float64}
while classification's build_tree has the following signature:
function build_tree(
        labels              :: Vector{T},
        features            :: Matrix{S},
        n_subfeatures       = 0,
        max_depth           = -1,
        min_samples_leaf    = 1,
        min_samples_split   = 2,
        min_purity_increase = 0.0;
        rng                 = Random.GLOBAL_RNG) where {S, T}
The third argument in regression, min_samples_leaf, should be the fifth argument for consistency with classification. This should be changed since the mismatch may cause silent bugs for users.
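The kind of silent bug meant here can be illustrated with two hypothetical stand-in functions that only echo how their positional arguments are bound. They mirror the argument order of the two signatures above but are not the package's functions.

```julia
# Stand-in with the classification order: n_subfeatures, max_depth, min_samples_leaf.
classify_args(n_subfeatures = 0, max_depth = -1, min_samples_leaf = 1) =
    (n_subfeatures = n_subfeatures, max_depth = max_depth, min_samples_leaf = min_samples_leaf)

# Stand-in with the regression order: min_samples_leaf, n_subfeatures, max_depth.
regress_args(min_samples_leaf = 5, n_subfeatures = 0, max_depth = -1) =
    (min_samples_leaf = min_samples_leaf, n_subfeatures = n_subfeatures, max_depth = max_depth)

# The same positional call means two different things:
c = classify_args(0, 3)   # n_subfeatures = 0, max_depth = 3
r = regress_args(0, 3)    # min_samples_leaf = 0, n_subfeatures = 3, max_depth stays -1

# With keyword arguments the binding no longer depends on position.
keyword_args(; n_subfeatures = 0, max_depth = -1, min_samples_leaf = 1) =
    (n_subfeatures = n_subfeatures, max_depth = max_depth, min_samples_leaf = min_samples_leaf)
k = keyword_args(max_depth = 3)
```

Both calls succeed without any error, which is why the mismatch is easy to miss.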
I want to assign weights to the features so that the split function uses the features with larger weights first.
Some of the features are more important than others, and I want to control the split process.
I'm curious about the calculation of new_coeff in build_adaboost_stumps().
Just about every book and article I've found lists the formula as new_coeff = 0.5 * log((1 - err)/err). I was just curious why the function uses new_coeff = 0.5 * log((1 + err)/(1 - err)). I think this subtle difference might actually make quite an impact in accuracy.
I would also note that in the book Boosting by Schapire and Freund they point out that for each boosting round err ought to be approximately 0.5. The current method does not behave that way; instead err tends towards 1.0, and then becomes NaN for all rounds afterwards.
Is there a good citation you can recommend for the current approach? Or, is this a possible oversight? I might be missing something.
Thanks in advance.
-Paul
P.S.
Thanks for creating this package!
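The two coefficient formulas quoted in the AdaBoost question above can be compared numerically. This is only a sketch of the arithmetic; alpha_textbook and alpha_current are illustrative names, and the second formula is taken from the question as written rather than checked against the package source.

```julia
# Formula cited from books and articles: large for small error, zero at err = 0.5.
alpha_textbook(err) = 0.5 * log((1 - err) / err)

# Formula the question attributes to build_adaboost_stumps().
alpha_current(err) = 0.5 * log((1 + err) / (1 - err))

errs = [0.1, 0.3, 0.5]
textbook = alpha_textbook.(errs)
current = alpha_current.(errs)
```

The first formula decreases as the error grows and reaches zero at err = 0.5, while the second increases with the error and diverges as err approaches 1.0, so the two assign opposite relative importance to accurate and inaccurate stumps.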
My reading of the code for build_tree is that you're taking a 70% subset of the full data rather than a bootstrap sample. This should change the properties of the learner. Can the bootstrap version be implemented?
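The difference between the two sampling schemes can be sketched with index vectors alone. Both helper names are hypothetical, and the 0.7 fraction is taken from the question rather than from the package source.

```julia
using Random

# Subsample: a fraction of the rows, drawn without replacement.
subsample_indices(rng::AbstractRNG, n::Int, fraction::Float64) =
    randperm(rng, n)[1:round(Int, fraction * n)]

# Bootstrap: n rows drawn with replacement, so some rows repeat and others are left out.
bootstrap_indices(rng::AbstractRNG, n::Int) = rand(rng, 1:n, n)

rng = MersenneTwister(42)
sub = subsample_indices(rng, 10, 0.7)
boot = bootstrap_indices(rng, 10)
```

Either index vector can then be used to select rows, for example features[boot, :] and labels[boot].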
I have used the adaboost function multiple times earlier, but now it throws the following error.
ERROR: DomainError
in build_adaboost_stumps at C:\Users\Dinkar.julia\v0.3\DecisionTree\src\DecisionTree.jl:281
I tried going step by step and found that the function on line 280 of the DecisionTree.jl/src/DecisionTree.jl (reproduced below) was not working.
err = _weighted_error(labels, predictions, weights)
It would be great if you could guide me on how to proceed.
Thank you
Dinker
In:
versioninfo()
Julia Version 0.5.0-dev+3123
Commit 01dd5ec* (2016-03-12 05:08 UTC)
Platform Info:
System: Linux (x86_64-suse-linux)
CPU: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
code:
# example decision tree
using DataFrames
using DecisionTree
using RDatasets
data = DataFrame(A=[1.0,2,3,2,4,3,5,2,3,1],
B=[6.0,4,3,6,5,4,2,7,4,5],
C=[2,3,4,5,2,3,2,1,3,4])
#data = dataset("datasets","iris")
X = Array(data[:,1:end-1])
y = Array(data[end])
model = build_tree(y,X)
accuracy = nfoldCV_tree(y, X, 0.9, 3)
produces error:
ERROR: LoadError: MethodError: no method matching round(::TypeVar, ::Float64)
Closest candidates are:
round{T<:Integer}(::Type{T<:Integer}, ::AbstractFloat)
round{T<:Integer}(::Type{T<:Integer}, ::AbstractFloat, ::RoundingMode{T})
[inlined code] from /home/colin/.julia/v0.5/DecisionTree/src/DecisionTree.jl:16
in _nfoldCV(::Symbol, ::Array{Int64,1}, ::Array{Float64,2}, ::Float64, ::Vararg{Any}) at /home/colin/.julia/v0.5/DecisionTree/src/measures.jl:152
in nfoldCV_tree(::Array{Int64,1}, ::Array{Float64,2}, ::Float64, ::Int64) at /home/colin/.julia/v0.5/DecisionTree/src/measures.jl:185
in include(::ASCIIString) at ./boot.jl:264
in include_from_node1(::ASCIIString) at ./loading.jl:417
in eval(::Module, ::Any) at ./boot.jl:267
while loading /home/colin/kaggle/tester/dectree.jl, in expression starting on line 12
Code is able to print model correctly, tried with sample data and iris data, same result; 0.4.3 Julia completes the nfoldCV correctly from same code.
The squeeze definition might break some other code/package in a very hard-to-track way:
squeeze([1], 1)
> 0-dimensional Array{Int64,0}:
> 1
using DecisionTree
squeeze([1], 1)
> 1-element Array{Int64,1}:
> 1
There are several uses of squeeze in DecisionTree, so we'll have to be careful when fixing this.
I've implemented GradientBoost and a more efficient version of AdaBoost that interface with treeregressor and treeclassifier, and am now looking to port them to the lightweight front-facing structs we use, i.e., LeafOrNode and Ensemble.
Trying to do so, I realized that one thing I feel somewhat against is the storage of every label in the Leaf struct. So I want to discuss whether it would be possible to change it to something more lightweight. These values are usually unnecessary, take up too much space, and can be recomputed if needed. For a lot of purposes, a single confidence parameter should suffice (e.g., the impurity of that leaf). And if that is not enough, we might add, for example, another function that takes in a set of samples and returns an array containing the indices of the leaves they end up in. This would generalize the current implementation to any dataset and not just the training set, and from that we can very easily rewrite apply_tree_proba.
This may feel like an inconvenience, but for boosting and random forest ensembles that usually use shallow trees, most of the memory cost is in this storage of the samples, and the real inconvenience is dealing with models that are 4GB large or not being able to train your model because you ran out of memory.
One other thing is that the current Ensemble struct doesn't add very much to the current implementation because it's just a list of trees. This means that we have to deal with something like AdaBoost returning both an Ensemble and a list of coefficients that should have been a part of that ensemble in the first place. So I'm also proposing that we encode these coefficients into the model.
tldr; I'm proposing that we use the following structs instead
struct Leaf{S, T}
    label    :: T
    impurity :: Float64
end
struct Ensemble{S, T}
    trees  :: Vector{Node{S, T}}
    coeffs :: Union{Nothing, Vector{Float64}}
    method :: String
end
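One way the optional coeffs field could be used at prediction time is sketched below. The function weighted_vote is hypothetical and takes the per-tree predictions for a single sample directly, so it does not depend on the proposed structs.

```julia
# Combine the per-tree predictions for one sample.
# coeffs === nothing means an unweighted majority vote (e.g. a random forest);
# otherwise each tree's vote counts with its coefficient (e.g. AdaBoost).
function weighted_vote(predictions::Vector{T},
                       coeffs::Union{Nothing, Vector{Float64}}) where {T}
    w = coeffs === nothing ? ones(length(predictions)) : coeffs
    totals = Dict{T, Float64}()
    for (p, c) in zip(predictions, w)
        totals[p] = get(totals, p, 0.0) + c
    end
    best = first(totals)
    for kv in totals
        if kv.second > best.second
            best = kv
        end
    end
    return best.first
end

plain = weighted_vote(["a", "b", "b"], nothing)
boosted = weighted_vote(["a", "b", "b"], [0.9, 0.3, 0.3])
```

Keeping the coefficients next to the trees means a single apply function can serve both ensemble types.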
http://iainnz.github.io/packages.julialang.org/
I'm no longer able to detect your tests. The official convention is to have a master test file test/runtests.jl that runs them all, but I can detect others too - see https://github.com/IainNZ/PackageEvaluator.jl/ readme for more info
Good work with DecisionTree.jl!
Is there any chance of having Real type features supported for regression?
As I understand, it only works with FloatingPoint at the moment.
My data can be described as follows:
features - Array{Any,2} - two columns of strings, and 3 columns containing numerical values
labels - Array{String,2} - array of strings
Now, when I use:
model = build_forest(labels, features, 2, 10, 0.5, 6)
I get:
MethodError: no method matching build_forest(::Array{String,2}, ::Array{Any,2}, ::Int64, ::Int64, ::Float64, ::Int64)
I get similar errors if I convert "features" and "labels" to categorical:
model = build_forest(categorical(labels), categorical(features), 2, 10, 0.5, 6)
It works when I use only the 3 numerical columns in the features array as features.
For reference, the packages I am using in my code are:
using DecisionTree
using DataFrames
using CategoricalArrays
As far as I know, the UniqueRanges type is used in the _split_info_gain function to iterate over the values of a particular feature without repetition. I am wondering if Base.Set is enough to do the job?
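As a point of comparison, here is a sketch of iterating over runs of equal values in a sorted vector. It assumes, without confirmation from the source, that UniqueRanges does something along these lines; unique_ranges is a hypothetical helper.

```julia
# Return the index ranges of consecutive equal values in a sorted vector.
function unique_ranges(sorted::AbstractVector)
    ranges = UnitRange{Int}[]
    lo = 1
    for hi in 1:length(sorted)
        # Close the current run at the last element or when the next value differs.
        if hi == length(sorted) || sorted[hi + 1] != sorted[hi]
            push!(ranges, lo:hi)
            lo = hi + 1
        end
    end
    return ranges
end

values = [1, 1, 2, 3, 3, 3]
ranges = unique_ranges(values)
distinct = Set(values)
```

A Set yields the same distinct values but drops both their order and the positions at which each value occurs, so whether it suffices depends on whether the split search needs those positions.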
For some reason, boosting doesn't seem to work. I don't think the issue here is the same as #42. I tried the example from Elements of Statistical Learning and compared to fastAdaboost in R:
julia> using Distributions, DecisionTree, RCall, DataFrames
julia> # Boosting example from EoSL
X = randn(1000, 10);
julia> y = Vector{Int64}(vec(sum(abs2, X, 2) .> quantile(Chisq(10), 0.5)));
julia> # Use DecisionTree
ada1 = DecisionTree.build_adaboost_stumps(y, X, 5);
julia> mean(apply_adaboost_stumps(ada1..., X) .== y)
0.579
julia> # Use fastAdaboost
R"library(fastAdaboost)";
julia> df = DataFrame(X);
julia> df[:y] = y;
julia> ada2 = R"adaboost(y ~ x1 + x2 + x3 + x4 + x5 + x6 +x7 + x8 + x9 + x10, data = $df, 5)";
julia> rcopy(R"predict($ada2, newdata = $df)$error")
0.021
Furthermore, build_adaboost_stumps is much slower than adaboost from fastAdaboost. It looks like build_adaboost_stumps might not use the same optimizations as build_tree.
Does this package "correctly" handle categorical variables (e.g. without conversion to numerical encoding schemes like one-hot or ordinal encoding), as that ability is a distinct advantage of decision trees and their progeny? Issues #61 and #13 are related but it is not clear to me what the current status is. Perhaps if they are supported, I could make a documentation PR for a brief mention on the README.
If so, it would be a good reason for some users to switch from scikit-learn's RF implementation, which still requires numerical encoding.
It would be awesome if we could compute the out of bag error in the process of training the random forest. We can naturally do a sort of cross validation in random forests by testing on the data that we leave out in the process of training each tree through bagging.
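The bookkeeping behind an out-of-bag estimate can be sketched with indices only: each tree is trained on a bootstrap sample and evaluated on the rows that sample left out. The helper bootstrap_with_oob is hypothetical and not part of the package.

```julia
using Random

# Draw a bootstrap sample of row indices and return the rows it left out.
function bootstrap_with_oob(rng::AbstractRNG, n::Int)
    in_bag = rand(rng, 1:n, n)          # sampled with replacement
    out_of_bag = setdiff(1:n, in_bag)   # rows never drawn for this tree
    return in_bag, out_of_bag
end

in_bag, out_of_bag = bootstrap_with_oob(MersenneTwister(7), 20)
```

For each row, the out-of-bag prediction would then aggregate only the trees for which that row appears in out_of_bag, and the out-of-bag error is the error of those aggregated predictions.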
Are you interested in having docstring code comments? I'm asking because I'm currently adding comments to my fork to understand the code better, I'd be happy to put in a PR once I've covered more of the code.
in https://white.ucc.asn.au/2017/01/24/JuliaML-and-TensorFlow-Tuitorial.html
rows and columns are swapped, as columns are labeled samples instead of rows
ERROR: LoadError: MethodError: no method matching build_tree(::Array{Bool,1}, ::LinearAlgebra.Adjoint{Int64,Array{Int64,2}}, ...
using DecisionTree
import ScikitLearnBase: fit!
features = [1 2;4 5;3 6]
labels = [true,false]
println("As array");fit!(DecisionTreeClassifier(), convert(Array{Float64,2},features'), labels)
println("Adjoint");fit!(DecisionTreeClassifier(), features', labels)
Hi,
I just wanted to ask if you intend to support also regression trees for continuous variables at some point in the future. I know the package is called DecisionTree but almost everything is there to do regression trees, too.
For my work I just replaced the majority_vote and info_gain functions with a mean and RMSE and it worked perfectly well.
However, to support both tree types in one package, there would have to be some changes in user interface...
Fabian
I've been trying out this library as I jump into learning Julia and I'm wondering what support there is for missing values in the dataset. Any recommendations that you have based on your experience for how to deal with these missing values would be very helpful.
julia> model = build_forest(labels, features, 3, 10)
exception on 1: ERROR: no method convert(Type{Bool}, NAtype)
in setindex! at array.jl:298
in bitcache_lt at broadcast.jl:366
in .< at broadcast.jl:382
in build_tree at .julia/v0.3/DecisionTree/src/DecisionTree.jl:153
in build_tree at .julia/v0.3/DecisionTree/src/DecisionTree.jl:171
in anonymous at no file:237
in anonymous at multi.jl:1263
in run_work_thunk at multi.jl:613
in run_work_thunk at multi.jl:622
in anonymous at task.jl:6
ERROR: no method convert(Type{Node}, MethodError)
in copy! at abstractarray.jl:149
in convert at array.jl:209
in build_forest at .julia/v0.3/DecisionTree/src/DecisionTree.jl:239
in build_forest at .julia/v0.3/DecisionTree/src/DecisionTree.jl:232
I'm really glad to have this package. Thanks for creating it!
The API is still in flux, but I believe the following should now work if you're interested in shrinking your README file:
iris = data("datasets", "iris")
features = matrix(iris[:, 2:5])
labels = iris[:, "Species"]
We're also debating how to handle providing formula interfaces for models like decision trees so that you could just write:
build_tree(Species ~ Petal_Length + Petal_Width, iris)
If you have requests there, please let me know.
Using Julia6 (but not Julia4), a warning appears on using DecisionTree (but code seems to run fine anyway):
WARNING: deprecated syntax "[a=>b for (a,b) in c]".
Use "Dict(a=>b for (a,b) in c)" instead.
This is the entire output, no indication of line numbers or affected files. Using:
Julia Version 0.6.0-dev.233
Commit 62615c3* (2016-08-16 05:34 UTC)
Platform Info:
System: Linux (x86_64-suse-linux)
CPU: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
Hi all,
I'm having a strange problem when saving a random forest model. When using the JLD module to save the model created by the DecisionTree module, it usually takes a huge amount of space on disk. For instance, a model that had a size of 155 MB as a variable in Julia would take more than 2 GB on disk and an eternity to save!
I solved this problem by using the serialize and deserialize commands. After this, I was able to save the models. However, the drawback of this method is that it may not be possible to read the models in other Julia versions. I don't know if this is an issue with the DecisionTree module or the JLD module, but I would like to know if you have another workaround.
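The serialize/deserialize workaround described above looks roughly like this with the Serialization standard library of Julia 1.x. The Dict below is only a stand-in for a trained model.

```julia
using Serialization

# Stand-in for a trained forest; any Julia value can be serialized the same way.
model = Dict("n_trees" => 3, "thresholds" => [0.5, 1.5, 2.5])

path = tempname()
serialize(path, model)        # write the object to disk
restored = deserialize(path)  # read it back in a later session
```

As the post notes, the serialized format is not guaranteed to be readable across Julia versions, so this suits short-term storage better than long-term archiving.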
Cheers,
Hello, I wasn't quite sure where to write this, as it isn't really an issue with your code. Is there a possibility of stratified sampling with this package, or of merging two forests (or trees) together? If not, do you plan to add these features?
If I have more questions, where should I write them? I'm new to github.
I noticed that the random forest classifier is intended to build trees in parallel. However, we must manually add Julia processes by either invoking the -p option or using the addprocs function. I am wondering if the classifier can add processes automatically, perhaps through an option indicating how many processes the user wants to use, with a default value of 1.
Caleb
In a production setting, I want to train a model with a fairly large dataset and then run classifications on a per line basis using a persistent model. Can I store the ensemble in a file (or redis) and then load it at runtime?
I've tried writing the resulting model to a file but the following approach:
outfile = open("model.txt", "w")
write(model,outfile)
fails with the following error:
ERROR: write has no method matching write(::Ensemble, ::IOStream)
Any thoughts?
Thanks,
Marc
The package promises "support for mixed nominal and numerical data" but this is currently only true for classification and not regression.
This is scary:
tree = fit!(DecisionTreeRegressor(), [1.0 2; 3 4], [10, 24.0])
predict(tree, [])
> 17.0
apply_tree also accepts (and ignores) extra values in the feature_vector without complaining.
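A length check of the following kind would reject both the empty input and the over-long feature vector. The helper checked_features is hypothetical; where such a check would live in the package is not specified by the source.

```julia
# Reject feature vectors whose length differs from the number of training features.
function checked_features(x::AbstractVector, n_features::Int)
    length(x) == n_features ||
        throw(DimensionMismatch("expected $n_features features, got $(length(x))"))
    return x
end

ok = checked_features([1.0, 2.0], 2)

# An empty vector is rejected instead of silently producing a prediction.
rejected = try
    checked_features(Float64[], 2)
    false
catch e
    e isa DimensionMismatch
end
```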
Is there a function or a way to print the importance of the variables based on purity or Gini index, similar to Breiman's version in R?
There's a strange error when running build_tree in Julia v0.7.0-alpha (for both classification and regression), error traces below. What's peculiar is that the routines run fine in Julia v0.6.x.
For classification, I've tried a hack of adding last_f = Xf[1] before the while hi < n_samples loop (line 130 of tree.jl).
But with this change, I'm seeing some serious tree build performance degradation, both in execution time and memory allocation, on bleeding edge Julia. This performance issue might be due to something unrelated, but finding its root cause would be helpful for the Julia core team.
cc @Eight1911
Classification:
ERROR: UndefVarError: last_f not defined
Stacktrace:
[1] _split!(::Array{Float64,2}, ::Array{Int64,1}, ::DecisionTree.treeclassifier.NodeMeta, ::Int64, ::Int64, ::Int64, ::Int64, ::Int64, ::Float64, ::Array{Int64,1}, ::Array{Int64,1}, ::Array{Int64,1}, ::Array{Int64,1}, ::Array{Float64,1}, ::Array{Int64,1}, ::Random.MersenneTwister) at /home/bs/.julia/dev/DecisionTree/src/classification/tree.jl:131
[2] #build_tree#1(::Any, ::Any, ::Array{Float64,2}, ::Array{Int64,1}, ::Int64, ::Int64, ::Int64, ::Int64, ::Float64) at /home/bs/.julia/dev/DecisionTree/src/classification/tree.jl:322
[3] #build_tree at ./none:0 [inlined]
[4] #build_tree#6 at /home/bs/.julia/dev/DecisionTree/src/classification/main.jl:86 [inlined]
[5] build_tree at /home/bs/.julia/dev/DecisionTree/src/classification/main.jl:73 [inlined] (repeats 2 times)
[6] top-level scope at none:0
Regression:
ERROR: UndefVarError: last_f not defined
Stacktrace:
[1] _split!(::Array{Float64,2}, ::Array{Float64,1}, ::DecisionTree.treeregressor.NodeMeta, ::Int64, ::Int64, ::Int64, ::Int64, ::Float64, ::Array{Int64,1}, ::Array{Float64,1}, ::Array{Float64,1}, ::Random.MersenneTwister) at /home/bs/.julia/dev/DecisionTree/src/regression/tree.jl:139
[2] #build_tree#1(::Random.MersenneTwister, ::Any, ::Array{Float64,2}, ::Array{Float64,1}, ::Int64, ::Int64, ::Int64, ::Int64, ::Float64) at /home/bs/.julia/dev/DecisionTree/src/regression/tree.jl:289
[3] #build_tree at ./none:0 [inlined]
[4] #build_tree#24 at /home/bs/.julia/dev/DecisionTree/src/regression/main.jl:126 [inlined]
[5] build_tree at /home/bs/.julia/dev/DecisionTree/src/regression/main.jl:110 [inlined] (repeats 2 times)
[6] top-level scope at none:0
As we have scaling resources e.g. available from Firebase to use TensorFlow models, it would be nice to find out how to convert a DecisionTree model into a TensorFlow model.
I see that a DecisionTree model can be saved as a JLD2 file, which should be readable in an HDF5 reader. Am I missing just the last step, from HDF5 to MetaGraph (or .GraphDef)?
Or maybe I am missing a completely different way to achieve this?
Thanks
EDIT: I had asked this on the julia discourse but I guess it actually belongs here
Hi, is there a way to limit the size of a leaf to prevent overfitting? Something similar to minbucket in R?
Thanks a lot for sharing your code!
With large decision trees, it is very helpful if one can interactively explore the tree (interactively expand/collapse nodes). For example: http://blog.revolutionanalytics.com/2016/12/interactive-decision-trees-with-microsoft-r.html
Perhaps it is worth adding support for this, for example through D3Trees.jl. It probably makes sense as a separate package but I wanted to throw the idea out there.
Now that Julia 0.3 isn't supported, is there any reason not to precompile DecisionTree.jl? I just tried it on my side, and everything looked fine.
If you look at the list of DecisionTree releases, you'll see that the most recent release was v0.8.2, released on May 10, 2019.
However, if you look at the General registry, the most recent release registered there was v0.8.1.
Could you register DecisionTree v0.8.2 in the General registry?
Hello, I tried these two codes in Julia and R respectively:
Ytrain=rand(2000,1)
Xtrain=rand(2000,60)
addprocs(3)
using DecisionTree
@time model = build_forest(Ytrain[:,1],Xtrain,20,200,5,1)
library(randomForest)
X=replicate(60, runif(2000))
Y=runif(2000)
ptm <- proc.time()
rf=randomForest(X, Y, importance = FALSE, ntree=200,do.trace=0,nodesize=5)
proc.time() - ptm
In Julia it takes around 30 seconds and in R it takes around 10. I think the configurations of "build_forest" and "randomForest" are the same (as rF takes one third of the variables for each node, which is exactly 20). As far as I know, rF in R can't use parallelization (at least this library can't), and Julia should be way faster than R.
So, what might be causing the difference in speed?
In test/classification_rand.jl, this code failed once on Travis because depth(model) was 1. We should probably change the == to <=? Or we could srand to ensure that we always get exactly the same results.
using Base.Test
using DecisionTree
n,m = 10^3, 5 ;
features = rand(n,m);
weights = rand(-1:1,m);
labels = _int(features * weights);
maxdepth = 3
model = build_tree(labels, features, 0, maxdepth)
@test depth(model) == maxdepth
Because the data matrix is copied at every recursive call, the current implementation has an unnecessarily large memory footprint. This is a structural issue and may well require a complete rewrite of the build_tree function. For example, on my MBP, the @time macro prints the following when used on the build_tree function to split on the MNIST training dataset:
2.866582 seconds (3.31 M allocations: 820.371 MiB, 3.99% gc time).
This above time does not include the JIT compile time. If the maintainer(s) gives me the green light, I will be making a couple of PR's in the upcoming weeks to fix this issue and hopefully up the performance of the current implementation. In particular, I will be making a small port of the sklearn trees here compatible with the current API. The library linked has the following footprint
0.496638 seconds (21.32 k allocations: 802.953 KiB, 1.00% gc time)
which is approximately a thousand times smaller with 5-6 times the performance and produces the same tree with the same number of nodes when asked to evaluate every feature for each split.
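The difference between copying rows and referring to them can be sketched with a small matrix. This only illustrates the general allocation pattern in Julia, not the actual implementation in either library.

```julia
X = reshape(collect(1.0:20.0), 5, 4)
rows = [1, 3, 5]

copied = X[rows, :]         # allocates a new 3×4 matrix holding copies of the rows
viewed = @view X[rows, :]   # no copy: a lightweight wrapper that reads from X

# Changing X shows which of the two shares memory with it.
X[1, 1] = 100.0
```

A tree builder that passes index vectors or views down the recursion, rather than indexed copies, avoids re-allocating the data at every split.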