Hello! I'm trying to make predictions with FastForest, but the outputs don't match t

XGBoost and FastForest predictions don't match (xgboost-fastforest, closed issue)

guitargeek commented on August 22, 2024

XGBoost and FastForest predictions don't match

from xgboost-fastforest.

Comments (11)

guitargeek commented on August 22, 2024

Hi @farjasju, sorry for the late answer but unfortunately I don't seem to get email notifications when issues get opened here. I need to do something about that.

From the code you are sharing it seems you are doing everything correctly. Do you have some complete code so I can reproduce the problem? Right now I can only say that you might have done something wrong with the order of the features that you pass to the classifier. Unfortunately, the original order of the features as you pass them to the python model is not stored in the text dump from xgboost, so the fastforest class can't recover it. There is more in the README about this. Did you consider that?

Cheers!

guitargeek commented on August 22, 2024

Only this looks a bit strange:

float raw_output = fastForest(input.data(sample));

I don't know how you organize your input data, but if input is a std::vector like in the README, calling input.data(sample) will not work, because data() takes no arguments. Maybe your problem is related to that? Did you check that the input features are the same for the xgboost and fastforest cases?

farjasju commented on August 22, 2024

Hi and thanks for the answer!

Concerning the features, I'm passing the default ones to FastForest, i.e. f0, f1, f2, f3... (as they appear in the .txt dump file of the model).
Concerning the input data, I have a std::vector<float> input = {}; for each sample that I fill in from a csv. Isn't fastForest(input.data()) the right way to get the prediction?

Here is my (simplified) code:

// Initializing FastForest
std::vector<std::string> features{"f0", "f1", "f2", "f3", "f4", "f5", ... "f304", "f305", "f306", "f307", "f308", "f309"};
FastForest fastForest("../xgb_trained_model.txt", features);

// Reading the csv dataset
std::ifstream csv_input("../dataset.csv");
std::string delimiter = ",";
for (std::string line; getline(csv_input, line);)
{
  std::vector<float> input = {}; // the sample vector
  size_t pos = 0;
  std::string token;
  while ((pos = line.find(delimiter)) != std::string::npos)
  {
    token = line.substr(0, pos);
    input.push_back(::atof(token.c_str()));
    line.erase(0, pos + delimiter.length());
  }
  // Predicting the probability
  float output = fastForest(input.data());
  float score = 1. / (1. + std::exp(-(output)));
}

guitargeek commented on August 22, 2024

What does your CSV file look like? Taking a quick glance at your code, it seems to me that the last feature in each row of the CSV is not read if a row looks like I would expect (with no trailing comma):

f0,f1,..,f309

I think your code only adds a feature to the input vector when there is a comma after the value, hence you would miss the last one. Did you check that input.size() is what you would expect?
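One way to avoid dropping the last column is to split each row with std::getline and a delimiter, which also returns the text after the final comma. The following is only a sketch: the helper name parseCsvRow is hypothetical, and it assumes plain numeric fields without quoting, embedded commas or empty cells (std::stof throws on an empty field).

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split one CSV row into floats.
// std::getline with a delimiter also yields the text after the last comma,
// so the final column is kept even without a trailing comma.
std::vector<float> parseCsvRow(const std::string& line, char delimiter = ',') {
    std::vector<float> values;
    std::stringstream row(line);
    std::string token;
    while (std::getline(row, token, delimiter)) {
        values.push_back(std::stof(token));  // throws on non-numeric or empty fields
    }
    return values;
}
```

With this helper, a row such as "1.5,2,3" gives a vector of three values, whereas the find/erase loop from the code above would stop after the second one.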

Unfortunately fastforest doesn't check whether the number of passed features is correct, because it just takes an array pointer as input. I should change the interface so that the number of passed features is validated.
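Until the interface validates this itself, the check can be done on the caller's side. The sketch below is not part of fastforest: predictChecked is a hypothetical wrapper that works with any callable taking a const float* (the call style used in this thread), and the expected feature count has to be supplied by the caller.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not part of fastforest: verify the sample size before
// handing the raw pointer to a predictor callable that accepts `const float*`.
template <class Predictor>
float predictChecked(const Predictor& predictor,
                     const std::vector<float>& input,
                     std::size_t nFeatures) {
    if (input.size() != nFeatures) {
        throw std::invalid_argument("expected " + std::to_string(nFeatures) +
                                    " features, got " + std::to_string(input.size()));
    }
    return predictor(input.data());
}
```

A call like predictChecked(fastForest, input, features.size()) would then throw on a short row instead of silently reading past the end of the vector.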

farjasju commented on August 22, 2024

Well, you're totally right, it's missing the last feature. Thank you, I'm not used to working in C++.

Oddly, this doesn't seem to solve the problem, and outputs are still between 0 and 2 whereas they are around 4 with XGBoost...

guitargeek commented on August 22, 2024

I'm sorry, in that case I can't spot the problem based on the code you shared. One more thing that I can think of concerning the CSV file: maybe you exported it with the pandas to_csv() method with the default parameters? In that case, each row would start with an index instead of the first feature, which would give you the wrong result. Other than that, I'm afraid I can't help you further without having the files to reproduce the problem myself. And I would very much like to solve the problem, because an open issue "XGBoost and FastForest predictions don't match" is not good advertising for my library, even if it is potentially the user's fault :)

farjasju commented on August 22, 2024

I understand and I apologize for this bad advertising!
I uploaded my files to this repo if you want to reproduce the problem. The csv has been generated by myself in a really simple way.

guitargeek commented on August 22, 2024

Forget what I wrote here in the beginning, it turned out that the original fastforest behaviour was correct and that the < instead of a <= sign in the text dump from xgboost is incorrect and was misleading me.

Thanks a lot for that. I see you also have many categorical features in there that only take integer values. That helped me to spot a trivial bug in fastforest, which had the wrong behaviour when a feature was equal to the cut value:
~~4e9814f#diff-e7d15b6e1cd8e66791b5ad8ca099fc81R211~~

~~A new test case was added for this as well.~~

~~Independently from this little bug,~~ I can't reproduce the problem that you describe:

> Oddly, this doesn't seem to solve the problem and outputs are still between 0 and 2 where they are around 4 with XGBoost...

However, when I run your code from the repository I get values around 4 like you seem to expect (here the first 10 lines):

label: 1 - score: 1 (0.989885, primary output:4.583563)
label: 1 - score: 1 (0.989885, primary output:4.583563)
label: 1 - score: 1 (0.989892, primary output:4.584249)
label: 1 - score: 1 (0.989885, primary output:4.583563)
label: 1 - score: 1 (0.989892, primary output:4.584249)
label: 1 - score: 1 (0.989035, primary output:4.502018)
label: 1 - score: 1 (0.989885, primary output:4.583563)
label: 1 - score: 1 (0.987967, primary output:4.408020)
label: 1 - score: 1 (0.985730, primary output:4.235198)
label: 1 - score: 1 (0.989885, primary output:4.583563)
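The number in parentheses is the raw output passed through the logistic function, the same formula as the score line in the code posted earlier. A minimal sketch for checking individual lines by hand (the function name sigmoid is chosen here for illustration):

```cpp
#include <cmath>

// Logistic function: maps a raw margin (log-odds), such as the output of a
// model trained with objective "binary:logitraw", to a probability in (0, 1).
inline double sigmoid(double rawOutput) {
    return 1.0 / (1.0 + std::exp(-rawOutput));
}
```

For the first line, sigmoid(4.583563) evaluates to roughly 0.989885, which agrees with the printed value.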

Maybe the problem is just that the rows are ordered differently when you compare to your python output? Would it be possible to also add a binary file of the xgboost model to your repository so that I can load the model in python and do the comparison? You know, the file that you get as follows:

model = XGBClassifier(n_estimators=100, objective="binary:logitraw").fit(X, y)
model._Booster.save_model("model.bin")

Cheers!

guitargeek commented on August 22, 2024

Okay, I could reproduce the problem myself by taking your CSV file (which actually has 311 columns, so be careful with the feature names, which should go up to f310) and training an xgboost classifier with it. This is now implemented in a unit test:
https://github.com/guitargeek/XGBoost-FastForest/blob/master/test/create_test_data.py#L43
https://github.com/guitargeek/XGBoost-FastForest/blob/master/test/test.cpp#L96

I took your file 1000_samples.csv from your repo, compressed it and renamed it to manyfeatures.csv.gz for the unit test, I hope that was ok.
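Since a hand-written list of 311 names is easy to get wrong by one, the default names can also be generated from the column count. This is just a sketch with a hypothetical helper name, assuming the default f0, f1, ... naming used in the text dump:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Build the default feature names "f0", "f1", ..., "f<nFeatures-1>".
std::vector<std::string> makeFeatureNames(std::size_t nFeatures) {
    std::vector<std::string> names;
    names.reserve(nFeatures);
    for (std::size_t i = 0; i < nFeatures; ++i) {
        names.push_back("f" + std::to_string(i));
    }
    return names;
}
```

For a file with 311 columns, makeFeatureNames(311) ends with "f310".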

Being able to reproduce the disagreement, it became apparent what the problem was: fastforest stored the indices of the features used at each split in an unsigned char, which is of course very efficient but did not allow for more than 256 features. In this issue, we were dealing with 311 features.

The solution is of course to change it to unsigned int, which causes the unit test based on your problem to pass. Please let me know if this works for you so that I can close the issue.
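To illustrate the failure mode in isolation (this is not the fastforest source, only a sketch with a hypothetical function name): converting an index to unsigned char keeps only the value modulo 256 on platforms with 8-bit chars, so any feature index above 255 silently points at a different feature.

```cpp
// Store a feature index in an unsigned char and read it back.
// Conversion to an unsigned type wraps modulo 2^N, with N = 8 bits for
// unsigned char on typical platforms, so indices above 255 are corrupted.
inline unsigned int roundTripThroughUnsignedChar(unsigned int featureIndex) {
    const unsigned char stored = static_cast<unsigned char>(featureIndex);
    return stored;
}
```

With 8-bit chars, index 255 survives the round trip, while 300 comes back as 44 and 310 as 54, so splits on the last 55 features of a 311-feature model would read the wrong inputs.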

guitargeek commented on August 22, 2024

Hi @farjasju, does it work for you now? I would like to close the issue as soon as possible.

guitargeek commented on August 22, 2024

Issue solved by not having a maximum number of 256 features anymore in commit 3fc24b4.

@farjasju, feel very free to open another issue if you have further problems!
