Comments (20)
I invented the protocol and tricks myself; maybe you can just call it xgboost. The general algorithm, however, fits into the framework of gradient boosting.
Sincerely,
Tianqi Chen
Computer Science & Engineering, University of Washington
from xgboost.
Well - if values are not provided, it takes them as missing. So are all 0 values also treated as missing?
Example: a column has 25 values; 15 are 1, 5 are missing/NA, and 5 are 0.
Are the 5 + 5 = 10 treated as missing?
from xgboost.
Normally, it is fine to treat missing and zero both as zero :)
from xgboost.
XGBoost naturally accepts the sparse feature format: you can feed the data in directly as a sparse matrix that contains only the non-missing values.
That is, features that are not present in the sparse feature matrix are treated as 'missing'. XGBoost handles this internally, and you do not need to do anything about it.
from xgboost.
Internally, XGBoost will automatically learn the best direction to go when a value is missing. Equivalently, this can be viewed as automatically "learning" the best imputation for missing values, based on the reduction in training loss.
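The idea can be sketched in a few lines of plain Python (hypothetical helper names, squared-error loss; the real implementation is the sparsity-aware split finding described in the xgboost paper): for each candidate threshold, try routing the rows with a missing feature value to each side and keep whichever direction reduces the loss more.

```python
def sse(ys):
    # Sum of squared errors around the group mean; 0 for an empty group.
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(present, missing_targets):
    """present: list of (feature_value, target) for rows where the feature
    is observed; missing_targets: targets of rows where it is absent.
    Returns (threshold, default_direction, score)."""
    best = None
    values = sorted({v for v, _ in present})
    for i in range(1, len(values)):
        thr = (values[i - 1] + values[i]) / 2
        left = [y for v, y in present if v < thr]
        right = [y for v, y in present if v >= thr]
        # Try routing the missing rows to each side; keep the cheaper option.
        for default in ("left", "right"):
            l = left + (missing_targets if default == "left" else [])
            r = right + (missing_targets if default == "right" else [])
            score = sse(l) + sse(r)
            if best is None or score < best[2]:
                best = (thr, default, score)
    return best
```

For example, `best_split([(1.0, 1.0), (2.0, 1.1), (10.0, 5.0)], [5.2])` picks the threshold 6.0 with default direction "right": the missing row's target (5.2) fits the right-hand group, so that becomes the learned default.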
from xgboost.
I haven't done a formal comparison with other methods, but I think it should be comparable, and it also gives a computational benefit when your feature matrix is sparse.
from xgboost.
It will depend on how you present the data. If you put the data in LIBSVM format and explicitly list zero-valued features there, they will not be treated as missing.
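To illustrate the distinction, here is a toy parser (not xgboost's own loader) for LIBSVM's `label index:value` lines: an explicit `idx:0` token is a present zero, while an index that never appears on the line is missing.

```python
def parse_libsvm_line(line):
    """Parse one 'label idx:value ...' LIBSVM line into (label, features).

    Absent indices are missing; an explicit 'idx:0' is a present zero."""
    parts = line.split()
    feats = {}
    for tok in parts[1:]:
        idx, val = tok.split(":")
        feats[int(idx)] = float(val)
    return float(parts[0]), feats

label, feats = parse_libsvm_line("1 0:0 2:3.5")
# Feature 0 is an explicit zero (present); feature 1 is missing entirely.
```

Here `feats` is `{0: 0.0, 2: 3.5}`: feature 0 would go through the ordinary comparison against the split threshold, while feature 1 would take the learned default direction.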
from xgboost.
It may be extremely difficult to list the 0 features in the case of sparse data. So should we avoid xgboost in cases where there is missing data and many 0 features?
from xgboost.
Just gave a quick glance at the code (it is beautiful, by the way). It is very interesting the way you treat missing values - the choice depends on what makes the tree better. Does this method/algorithm have a name?
from xgboost.
I am not surprised by the speed of xgboost, but the score is also better than sklearn's GBR. The trick with missing values might be one of the reasons.
Have you published any paper on the boosting algorithm you used for xgboost? Unlike random forest, I could not find much code for boosting with a parallel algorithm - I may need to improve my Google skills, though.
from xgboost.
I haven't published any paper describing xgboost yet.
For parallel boosted-tree code, the only one I am aware of so far is http://machinelearning.wustl.edu/pmwiki.php/Main/Pgbrt. You can try it out and compare it with xgb if you are interested.
from xgboost.
A follow-up question:
While I understand how XGBoost handles missing values for discrete variables, I'm not sure how it handles continuous (numeric) variables.
Can you please explain?
from xgboost.
For continuous features, a missing (default) direction is learned for missing-value data to go into, so when the value of the specific feature is missing, the example goes in the default direction.
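At prediction time this amounts to a simple rule during tree traversal: if the feature needed at a node is absent, follow the stored default branch. A minimal sketch with a hypothetical dict-based node layout (not xgboost's internal representation):

```python
def predict_one(node, row):
    """Route one example through a tree; `row` maps feature index -> value,
    and an absent key means the feature is missing."""
    while "leaf" not in node:
        v = row.get(node["feature"])        # None when the feature is missing
        if v is None:
            node = node[node["default"]]    # learned default direction
        elif v < node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["leaf"]

# A stump that sends missing values down the right branch.
tree = {"feature": 0, "threshold": 2.0, "default": "right",
        "left": {"leaf": -1.0}, "right": {"leaf": 1.0}}
```

`predict_one(tree, {0: 1.5})` returns -1.0 via the ordinary comparison, while `predict_one(tree, {})` follows the default branch and returns 1.0.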
from xgboost.
Thanks Tianqi.
And what about missing continuous features in generalized linear models?
from xgboost.
Hi Tianqi,
I am looking for an algorithm which does no internal imputation of missing values and yet works. How does xgboost handle missing values internally (can you give some basic idea)?
from xgboost.
See https://arxiv.org/abs/1603.02754, Sec. 3.4.
from xgboost.
Does xgboost also work in the presence of categorical features? Do we need to preprocess them (binarization, etc.)? For example, my dataset has a feature called city with the values "Milan", "Rome", "Venice". Can I present them to xgboost without any preprocessing at all?
from xgboost.
Tianqi,
I have a question about the xgb.importance function. When I run it and look at the Real Cover, it seems that if there is any missing data in a feature, the Real Cover is NA. Is there any way to deal with this issue to get a co-occurrence count for each split?
Rex
from xgboost.
Hi Tianqi,
My processing pipeline includes normalizing the features before learning. Also, I have a lot of indicators which are missing, rather than zero, for negative indication.
As a result, my normalized indicators are 0 for indicated values and missing for non-indicated values.
Will xgboost handle such behavior properly? (Changing non-existent features to 0 would cause problems...)
Thanks,
Nadav
from xgboost.
@tqchen
You wrote:
Internally, XGBoost will automatically learn what is the best direction to go when a value is missing. Equivalently, this can be viewed as automatically "learn" what is the best imputation value for missing values based on reduction on training loss.
What about the case where the training set has no missing values, but the test set does?
from xgboost.