Comments (5)
Bump. Any updates on these changes?
from decisiontree.jl.
Hi. Any updates on this? Would love to see the structs get more lightweight and efficient...
How about this? We change the structure so that storing which leaf each label ends up in is optional. The advantage is that users now get a choice, without the disadvantage of having to use awkward APIs for fit_X_proba, or having to change most of the user-facing APIs at all. I'm thinking of something like this:
struct Leaf{T}
    label :: T
    # the location in tree.labels
    # where the labels in this leaf are stored
    region :: UnitRange{Int}
    # one or two summary statistics for fast retrieval
    # - probability of inaccuracy for classification
    # - variance of leaf for regression
    deviation :: Float64
end

struct Node{S, T}
    featid :: Int
    featval :: S
    left :: Union{Leaf{T}, Node{S, T}}
    right :: Union{Leaf{T}, Node{S, T}}
end

struct Tree{S, T}
    root :: Union{Leaf{T}, Node{S, T}}
    # nothing if we choose not to store
    # otherwise, a vector of labels
    labels :: Union{Nothing, Vector{T}}
    # ... other helpful metadata like how
    # the tree was made, etc.
end
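To make the optional-storage idea concrete, here is a minimal sketch of how class probabilities could be recomputed on demand from a leaf's region. The `leaf_proba` helper and the `classes` argument are illustrative assumptions, not part of the package's API, and the `Node` half of the union is elided:

```julia
struct Leaf{T}
    label     :: T
    region    :: UnitRange{Int}   # indices into tree.labels
    deviation :: Float64
end

struct Tree{T}
    root   :: Any                          # Leaf or Node, elided here
    labels :: Union{Nothing, Vector{T}}
end

# Recompute class probabilities for one leaf from the stored labels.
# Assumes `classes` is sorted and tree.labels is not `nothing`.
function leaf_proba(tree::Tree, leaf::Leaf, classes)
    counts = zeros(Float64, length(classes))
    for i in leaf.region
        counts[searchsortedfirst(classes, tree.labels[i])] += 1
    end
    counts ./ length(leaf.region)
end
```

Because the leaf only stores a range, trees built with `labels = nothing` pay no per-leaf storage cost; only users who actually call the proba functions need to keep the label vector around.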
A little unrelated, but if you want something even more adventurous, you could also change how nodes refer to their children: instead of linking them directly, have the tree store the nodes in an array and refer to children by index. That is,
struct Node{S, T}
    featid :: Int
    featval :: S
    left :: Int # index of the left child in `tree.nodes`
    right :: Int # index of the right child in `tree.nodes`
end

struct Tree{S, T}
    nodes :: Vector{Union{Leaf{T}, Node{S, T}}}
    # nothing if we choose not to store
    # otherwise, a vector of labels
    labels :: Union{Nothing, Vector{T}}
    # ... other helpful metadata like how
    # the tree was made, the purity criterion used,
    # etc.
end
The advantage is that it should be easier this way to store the tree in a linear format, which is helpful for porting to other languages. length(tree) is now O(1), as it should be (though length(node) is still O(n)).
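A minimal traversal sketch under the index-based layout might look like the following; the `apply` helper and the convention that the root lives at `nodes[1]` are assumptions for illustration:

```julia
struct Leaf{T}
    label :: T
end

struct Node{S}
    featid  :: Int
    featval :: S
    left    :: Int   # index of the left child in tree.nodes
    right   :: Int   # index of the right child in tree.nodes
end

struct Tree{S, T}
    nodes :: Vector{Union{Leaf{T}, Node{S}}}
end

# Walk from the root (conventionally nodes[1]) down to a leaf,
# going left when the feature value is below the split threshold.
function apply(tree::Tree, x::AbstractVector)
    i = 1
    while true
        n = tree.nodes[i]
        n isa Leaf && return n.label
        i = x[n.featid] < n.featval ? n.left : n.right
    end
end
```

Since the whole tree is one flat vector of small structs plus integer indices, it serializes to plain arrays, which is exactly what makes the linear format easy to port.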
On another note, how important is preserving backwards compatibility for build_adaboost_stumps? I think it's not intuitive to have it return something that is not used at all, especially when we will be adding another boosting algorithm that will use the same API.
I'm open to both approaches for the new structs. It'd be good to experiment a bit and see which is more efficient, in terms of model training execution time and memory allocation, and also how well models write to disk.
Regarding build_adaboost_stumps, it'll be fine to break the API, for the sake of having a consistent API across the boosting algorithms.
@Eight1911 Great to hear about your implementations of GradientBoost and AdaBoost! And yes, totally agree with you that the current Leaf struct is unnecessarily bloated and wasteful when it comes to generating ensembles.
For a new Leaf struct, it would be great to maintain the same functionality for apply_X_proba (for multi-class tree, forest, stumps, ...) and print_tree (where leaf majority matches are presented). We could potentially store the label counts in an array or tuple.
We could also consider having new and completely independent Leaf and Node types for classification and regression purposes, optimized for their corresponding tasks.
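As a rough illustration of what task-specific types might look like (all names and fields here are hypothetical, not the package's actual definitions), a classification leaf could carry per-class counts for apply_X_proba and print_tree, while a regression leaf carries only a mean and a variance:

```julia
# Classification: counts per class, aligned with a sorted class vector.
struct ClassLeaf{T}
    majority :: T            # majority label, for plain apply
    counts   :: Vector{Int}  # per-class label counts
end

# Probabilities fall straight out of the counts.
leaf_proba(leaf::ClassLeaf) = leaf.counts ./ sum(leaf.counts)

# What print_tree might show: "majority : matches/total".
describe(leaf::ClassLeaf) =
    string(leaf.majority, " : ", maximum(leaf.counts), "/", sum(leaf.counts))

# Regression: just the prediction and its spread.
struct RegLeaf
    mean :: Float64   # prediction
    var  :: Float64   # leaf variance, the natural deviation statistic
    n    :: Int       # number of training samples in the leaf
end
```

A tuple of counts would also work for a fixed, small number of classes; a vector keeps the type simple when the class count is only known at training time.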
For a new Ensemble struct, it would be great to have the coeffs bundled in. We just need to make sure not to break the existing API for build_adaboost_stumps (perhaps by redundantly outputting the coeffs, as is currently done?).
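One way to bundle the coeffs is sketched below; this is a hypothetical shape, not the package's actual definition:

```julia
# Hypothetical ensemble: boosting coefficients live inside the struct,
# with `nothing` for models (e.g. plain forests) that don't need them.
struct Ensemble{T}
    stumps :: Vector{T}
    coeffs :: Union{Nothing, Vector{Float64}}  # boosting weights
end
```

During a deprecation period, build_adaboost_stumps could then keep returning the (ensemble, coeffs) pair redundantly, with the second element just being the field, so existing call sites don't break.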