Comments (11)

scorebot commented on July 18, 2024

@DiegoSong Thanks for the info. Could you attach your model here? Then we can reproduce the issue easily.

from pypmml-spark.

DiegoSong commented on July 18, 2024

@scorebot Thanks for your reply. I sent the PMML to this address: [email protected].

scorebot commented on July 18, 2024

@DiegoSong Thanks, I have got the PMML model. I tested it in my environment with the latest pypmml 0.9.6 installed:

>>> from pypmml import Model
>>> model = Model.load('model.pmml')
>>> model.predict({'cProb': 0.06, 'pProb': 0.3, 'oProb': 0.5, 'T3Prob': 0.1, 'cEmrgProb': 0.06, 'pEmrgProb': 0.3, 'oEmrgProb': 0.5})
{'probability(1)': 0.18478507571419323, 'probability(0)': 0.8152149242858068}

The input record contains all valid values for those input fields. I did not find anything wrong. What about the results from jpmml and pipeline.predict_proba?

DiegoSong commented on July 18, 2024

https://github.com/DiegoSong/CodeBox/blob/master/jpmml_pypmml.csv
The data shows the differences between jpmml and pypmml. The results are equal when no feature is NaN.

scorebot commented on July 18, 2024

I debugged both pmml4s (used by pypmml) and jpmml. The differences are caused by different missing value handling policies. Take the first line of jpmml_pypmml.csv as an example:
The values of both fields oProb and oEmrgProb are missing, and the PMML specifies missingValueReplacement="NaN", so both take the value NaN. We then need to compute the values of the derived fields cut(oProb) and cut(oEmrgProb):

  1. jpmml always returns the last interval for the NaN, cut(oProb)=0.444133, cut(oEmrgProb)=0.480491
  2. pypmml always returns a missing value for the NaN because there is no defaultValue defined, cut(oProb)=missing, cut(oEmrgProb)=missing

When evaluating imputer(cut(oProb)) and imputer(cut(oEmrgProb)), jpmml and pypmml return different values:

  1. jpmml: imputer(cut(oProb))=0.444133, imputer(cut(oEmrgProb))=0.480491
  2. pypmml: imputer(cut(oProb))=0.06101, imputer(cut(oEmrgProb))=-8.09E-4; the imputers are defined in the PMML.

That's why the final results are different. I think pypmml is more reasonable: the double value NaN (not a number) should be treated as a missing value because no interval contains such a value.
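
The two policies described above can be sketched in plain Python. This is only an illustration of the behavior reported in this thread, not the code of jpmml or pmml4s; the bin edges, bin values, imputed value, and the names BINS, IMPUTED, cut, and imputer are all hypothetical.

```python
import math

# Hypothetical bins as (low, high, bin_value); not taken from the model in this thread.
BINS = [
    (float("-inf"), 0.2, 0.10),
    (0.2, 0.6, 0.25),
    (0.6, float("inf"), 0.44),
]
IMPUTED = 0.06  # hypothetical replacement applied by the imputer step


def cut(value, nan_policy):
    """Map a value to its bin value; None stands for a missing result."""
    if value is None:
        return None
    if math.isnan(value):
        # NaN lies in no interval, so the engine's policy decides the outcome.
        if nan_policy == "last_interval":
            return BINS[-1][2]
        if nan_policy == "missing":
            return None
        raise ValueError(f"unknown policy: {nan_policy}")
    for low, high, bin_value in BINS:
        if low <= value < high:
            return bin_value
    return None


def imputer(value):
    """Replace a missing result with the imputed constant."""
    return IMPUTED if value is None else value


nan = float("nan")
print(imputer(cut(nan, "last_interval")))  # 0.44, the imputer never fires
print(imputer(cut(nan, "missing")))        # 0.06, the imputer replaces the missing value
```

With the "last_interval" policy the imputer is bypassed, while with the "missing" policy it supplies the value, which is the kind of divergence seen in jpmml_pypmml.csv.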

BTW, what are the results of the native Python model? If they are the same as jpmml's, the generated PMML model probably has a problem, and you need to file a bug against JPMML-SkLearn.

vruusmann commented on July 18, 2024

@DiegoSong Don't ever specify Domain.missing_value_replacement = NaN. Use some meaningful constant.

I think pypmml is more reasonable: the double value NaN (not a number) should be treated as a missing value because no interval contains such a value.

NaN is an invalid value, not a missing value.

See http://dmg.org/pmml/v4-4/Transformations.html

notANumber. This is the value returned from such meaningless expressions as the logarithm of a negative number. It should not be used as a generic missing value, as missing values are properly those which are unknown, indeterminate, or non-applicable. All arithmetic expressions involving NaN will evaluate to NaN.

We may agree that a Discretize element should return NaN when given NaN as input, but returning a Discretize@defaultValue is definitely not correct.

scorebot commented on July 18, 2024

@vruusmann Thanks for your comments. Yes, NaN (not a number) is an invalid value, not a missing value. Suppose Discretize returns NaN when the input is NaN, what's the final result? NaN or a normal result based on the imputers?

vruusmann commented on July 18, 2024

Suppose Discretize returns NaN when the input is NaN, what's the final result?

It seems to me that the NaN should be propagated between expression and model elements till the end - resulting in a NaN prediction. For example, when one numeric predictor is NaN (whether supplied directly by the end user, or computed during data pre-processing), then RegressionModel should also predict NaN.
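
As a rough illustration of that propagation, here is how IEEE 754 arithmetic behaves in plain Python. This is not a PMML engine; linear_predict and its coefficients are hypothetical stand-ins for a regression model.

```python
import math


def linear_predict(coefficients, intercept, features):
    """Weighted sum of the features plus an intercept."""
    return intercept + sum(
        coefficients[name] * features[name] for name in coefficients
    )


coefficients = {"x1": 2.0, "x2": -1.0}  # hypothetical model
features = {"x1": 0.3, "x2": math.nan}  # one predictor is NaN

prediction = linear_predict(coefficients, 0.5, features)
print(math.isnan(prediction))  # True: the NaN input carries through to the result
```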

This issue manifests itself with NaN right now. But it should apply to all missing values. For example, consider the following field definition:

<DataField name="x" dataType="double">
  <Value value="0" property="invalid"/>
</DataField>

When the end user supplies x = 0, then it is classified as a missing value, and should trigger the same chain of handlers as x = NaN would.

The handling of invalid values is almost completely unspecified by DMG.org. They should clarify.

DiegoSong commented on July 18, 2024

Thank you for your reply!
Missing values also carry information. I want to be able to replace a missing value with a meaningful value derived from the target 'y'. If I fillna with some constant like -9999, it may get merged into another bin. In my case that causes a 0.5% reduction in AUC.

pypmml always returns a missing value for the NaN

This seems more reasonable to me.
Here is what may be a safe way to handle missing values when using sklearn2pmml:

ContinuousDomain(invalid_value_treatment='as_missing', missing_value_replacement=None)

instead of

ContinuousDomain(invalid_value_treatment='as_missing', missing_value_replacement=np.nan)

The missingValueReplacement attribute will then not be generated in the PMML file.
After that, jpmml, pypmml, and sklearn2pmml give the same output.
See: https://github.com/DiegoSong/CodeBox/blob/master/jpmml_pypmml.csv
Sorry for the 000000|999999.

vruusmann commented on July 18, 2024

The missingValueReplacement attribute will then not be generated in the PMML file.

@DiegoSong The expression ContinuousDomain(missing_value_replacement=None) is a default no-op instruction, hence there's no PMML markup generated. However, you can keep it in your Python code for extra clarity.

DiegoSong commented on July 18, 2024

The expression ContinuousDomain(missing_value_replacement=None) is a default no-op instruction, hence there's no PMML markup generated.

It may be helpful when predicting on a CSV file, to avoid this error:

Exception in thread "main" org.jpmml.evaluator.InvalidResultException (at or around line 26 of the PMML document): Field "oProb" cannot accept user input value ""
at org.jpmml.evaluator.InputFieldUtil.performInvalidValueTreatment(InputFieldUtil.java:235)
at org.jpmml.evaluator.InputFieldUtil.prepareScalarInputValue(InputFieldUtil.java:151)
at org.jpmml.evaluator.InputFieldUtil.prepareInputValue(InputFieldUtil.java:111)
at org.jpmml.evaluator.InputField.prepare(InputField.java:70)
at org.jpmml.evaluator.example.EvaluationExample.execute(EvaluationExample.java:413)
at org.jpmml.evaluator.example.Example.execute(Example.java:92)
at org.jpmml.evaluator.example.EvaluationExample.main(EvaluationExample.java:262)
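
The error above shows an empty CSV cell reaching the evaluator as the string "". A different technique from the ContinuousDomain setting discussed in this thread is to convert empty cells to None on the Python side before scoring. This is a minimal sketch: read_records is a hypothetical helper, and whether a particular evaluator treats None as a missing value is an assumption that needs checking.

```python
import csv
import io


def read_records(csv_text, numeric_fields):
    """Parse CSV text into dicts, mapping empty cells to None instead of ''."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        record = {}
        for name in numeric_fields:
            cell = (row.get(name) or "").strip()
            # An empty cell becomes None rather than an invalid empty string.
            record[name] = float(cell) if cell else None
        records.append(record)
    return records


sample = "cProb,oProb\n0.06,\n0.1,0.5\n"
records = read_records(sample, ["cProb", "oProb"])
print(records[0])  # {'cProb': 0.06, 'oProb': None}
```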
