
Comments (11)

guitargeek commented on August 22, 2024

Hi @farjasju, sorry for the late answer but unfortunately I don't seem to get email notifications when issues get opened here. I need to do something about that.

From the code you are sharing it seems you are doing everything correctly. Do you have some complete code so I can reproduce the problem? Right now I can only say that you might have done something wrong with the order of the features that you pass to the classifier. Unfortunately, the original order of the features as you pass them to the python model is not stored in the text dump from xgboost, so the fastforest class can't recover it. There is more in the README about this. Did you consider that?

Cheers!

from xgboost-fastforest.

guitargeek commented on August 22, 2024

Only this looks a bit strange:

float raw_output = fastForest(input.data(sample));

I don't know how you organize your input data, but if input is a vector like in the README, this will not work. Maybe your problem is related to that? Did you check that the input features are the same for the xgboost and fastforest cases?


farjasju commented on August 22, 2024

Hi and thanks for the answer!

  • Concerning the features, I'm passing the default ones to FastForest, i.e. f0, f1, f2, f3... (as they appear in the .txt dump file of the model).
  • Concerning the input data, I have a std::vector<float> input = {}; for each sample that I fill in from a csv. Isn't fastForest(input.data()) the right way to get the prediction?

Here is my (simplified) code:

// Initializing FastForest
std::vector<std::string> features{"f0", "f1", "f2", "f3", "f4", "f5", ... "f304", "f305", "f306", "f307", "f308", "f309"};
FastForest fastForest("../xgb_trained_model.txt", features);

// Reading the csv dataset
std::ifstream csv_input("../dataset.csv");
std::string delimiter = ",";
for (std::string line; getline(csv_input, line);)
{
  std::vector<float> input = {}; // the sample vector
  size_t pos = 0;
  std::string token;
  while ((pos = line.find(delimiter)) != std::string::npos)
  {
    token = line.substr(0, pos);
    input.push_back(::atof(token.c_str()));
    line.erase(0, pos + delimiter.length());
  }
  // Predicting the probability
  float output = fastForest(input.data());
  float score = 1. / (1. + std::exp(-(output)));
}


guitargeek commented on August 22, 2024

What does your CSV file look like? Taking a quick glance at your code, it seems to me that the last feature in each row of the CSV is not read, if a row looks like I would expect (with no trailing comma):

f0,f1,..,f309

I think your code only adds a feature to the input vector while there is a comma after the value, hence you would miss the last one. Did you check that input.size() is as you would expect?

Unfortunately, fastforest doesn't check whether the number of passed features is correct, because it just takes an array pointer as input. I should change the interface so that the number of passed features is validated.
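To illustrate the two points above, here is a minimal sketch of reading a row so that the last field is included, plus a size check done by the caller before handing the pointer over. Both parseRow and checkFeatureCount are hypothetical helper names made up for this example; they are not part of fastforest.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of fastforest): split one CSV row into floats,
// including the final field that has no trailing delimiter.
std::vector<float> parseRow(const std::string& line, char delimiter = ',') {
    std::vector<float> values;
    std::stringstream row(line);
    std::string token;
    // std::getline also yields the last token, which ends at end-of-line
    // rather than at a delimiter.
    while (std::getline(row, token, delimiter)) {
        values.push_back(std::stof(token));  // throws on an empty or non-numeric field
    }
    return values;
}

// Hypothetical guard: fail loudly if the row width differs from the number of
// feature names, since a raw pointer carries no length information.
void checkFeatureCount(const std::vector<float>& input, std::size_t expected) {
    if (input.size() != expected) {
        throw std::invalid_argument("expected " + std::to_string(expected) +
                                    " features, got " + std::to_string(input.size()));
    }
}
```

In the loop from the code above, one would call checkFeatureCount(input, features.size()) right before fastForest(input.data()), so that a short row is reported instead of silently read past its end.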


farjasju commented on August 22, 2024

Well, you're totally right, it's missing the last feature. Thank you, I'm not used to working in C++.

Oddly, this doesn't seem to solve the problem and outputs are still between 0 and 2 where they are around 4 with XGBoost...


guitargeek commented on August 22, 2024

I'm sorry, in that case I can't spot the problem based on the code you shared. One more thing that I can think of concerning the CSV file: maybe you exported it with the pandas to_csv() method with the default parameters? In that case, each row would start with an index instead of the first feature, which would give you the wrong result. Other than that, I'm afraid I can't help you further without having the files to reproduce the problem myself. And I would very much like to solve the problem, because an open issue "XGBoost and FastForest predictions don't match" is not good advertising for my library, even if it potentially is the user's fault :)


farjasju commented on August 22, 2024

I understand and I apologize for this bad advertising!
I uploaded my files to this repo if you want to reproduce the problem. The csv has been generated by myself in a really simple way.


guitargeek commented on August 22, 2024

Forget what I wrote here in the beginning: it turned out that the original fastforest behaviour was correct, and that the < instead of a <= sign in the text dump from xgboost is incorrect and was misleading me.

Thanks a lot for that. I see you also have many categorical features in there that only take integer values. That helped me to spot a trivial bug in fastforest, which had the wrong behaviour when a feature was equal to the cut value:
4e9814f#diff-e7d15b6e1cd8e66791b5ad8ca099fc81R211

A new test case was added for this as well.

Independently from this little bug, I can't reproduce the problem that you describe:

Oddly, this doesn't seem to solve the problem and outputs are still between 0 and 2 where they are around 4 with XGBoost...

However, when I run your code from the repository I get values around 4 like you seem to expect (here the first 10 lines):

label: 1 - score: 1 (0.989885, primary output:4.583563)
label: 1 - score: 1 (0.989885, primary output:4.583563)
label: 1 - score: 1 (0.989892, primary output:4.584249)
label: 1 - score: 1 (0.989885, primary output:4.583563)
label: 1 - score: 1 (0.989892, primary output:4.584249)
label: 1 - score: 1 (0.989035, primary output:4.502018)
label: 1 - score: 1 (0.989885, primary output:4.583563)
label: 1 - score: 1 (0.987967, primary output:4.408020)
label: 1 - score: 1 (0.985730, primary output:4.235198)
label: 1 - score: 1 (0.989885, primary output:4.583563)
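For reference, the value in parentheses appears to be the logistic function applied to the primary output, the same transformation as in the score line of the code posted earlier in this thread. A minimal sketch, where the function name logistic is chosen just for this example:

```cpp
#include <cmath>

// Logistic function: maps a raw margin (log-odds) to a probability in (0, 1).
double logistic(double rawOutput) {
    return 1.0 / (1.0 + std::exp(-rawOutput));
}
```

Evaluating logistic(4.583563) gives roughly 0.98988, which is consistent with the first line of the output above.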

Maybe the problem is just that the rows are ordered differently when you compare to your Python output? Would it be possible to also add a binary file of the xgboost model to your repository, so that I can load the model in Python and do the comparison? You know, the file that you get as follows:

model = XGBClassifier(n_estimators=100, objective="binary:logitraw").fit(X, y)
model._Booster.save_model("model.bin")

Cheers!


guitargeek commented on August 22, 2024

Okay, I could reproduce the problem myself by taking your CSV file (which actually has 311 columns, so be careful with the feature names, which should go up to f310) and training an xgboost classifier with it. This is now implemented in a unit test:
https://github.com/guitargeek/XGBoost-FastForest/blob/master/test/create_test_data.py#L43
https://github.com/guitargeek/XGBoost-FastForest/blob/master/test/test.cpp#L96

I took your file 1000_samples.csv from your repo, compressed it and renamed it to manyfeatures.csv.gz for the unit test. I hope that was OK.

Being able to reproduce the disagreement, it became apparent what the problem was: fastforest stored the indices of the features used at each split in an unsigned char, which is of course very efficient but did not allow for more than 256 features. In this issue, we were dealing with 311 features.

The solution is of course to change it to unsigned int, which causes the unit test based on your problem to pass. Please let me know if this works for you so that I can close the issue.
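To make the failure mode concrete, here is a small sketch of what happens when a feature index above 255 is stored in an unsigned char. The names NarrowIndex, WideIndex and roundTrip are illustrative only and are not taken from the fastforest sources; the example also assumes the usual 8-bit char.

```cpp
// Hypothetical stand-ins for the per-split feature index type.
using NarrowIndex = unsigned char;  // 8 bits on common platforms: holds 0..255
using WideIndex = unsigned int;     // at least 16 bits, typically 32

// Store a feature index in the given type and read it back.
template <typename Index>
unsigned int roundTrip(unsigned int featureIndex) {
    // Conversion to an unsigned type wraps modulo 2^N, so no error is raised.
    Index stored = static_cast<Index>(featureIndex);
    return static_cast<unsigned int>(stored);
}
```

With an 8-bit char, roundTrip<NarrowIndex>(310) reads back as 54, so a split on f310 would silently look at f54 instead, while roundTrip<WideIndex>(310) stays 310.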


guitargeek commented on August 22, 2024

Hi @farjasju, does it work for you now? I would like to close the issue as soon as possible.


guitargeek commented on August 22, 2024

Issue solved by removing the limit of 256 features in commit 3fc24b4.

@farjasju, feel very free to open another issue if you have further problems!

