I built a PMML model using PMMLPipeline = DataFrameMapper + LogisticRegression Data

Output unmatch between jpmml and pypmml about pypmml-spark HOT 11 CLOSED

autodeployai commented on July 18, 2024

Output unmatch between jpmml and pypmml

from pypmml-spark.

Comments (11)

scorebot commented on July 18, 2024

@DiegoSong Thanks for your info. Could you attach your model here? Then we can reproduce the issue easily.

DiegoSong commented on July 18, 2024

@scorebot Thanks for your reply. I have sent the PMML file to this address: [email protected].

scorebot commented on July 18, 2024

@DiegoSong Thanks, I have received the PMML model. I tested it in my environment with the latest pypmml 0.9.6 installed:

>>> from pypmml import Model
>>> model = Model.load('model.pmml')
>>> model.predict({'cProb': 0.06, 'pProb': 0.3, 'oProb': 0.5, 'T3Prob': 0.1, 'cEmrgProb': 0.06, 'pEmrgProb': 0.3, 'oEmrgProb': 0.5})
{'probability(1)': 0.18478507571419323, 'probability(0)': 0.8152149242858068}

The input record contains all valid values for those input fields. I did not find anything wrong. What about the results from jpmml and pipeline.predict_proba?

DiegoSong commented on July 18, 2024

https://github.com/DiegoSong/CodeBox/blob/master/jpmml_pypmml.csv
The data shows the differences between jpmml and pypmml. The outputs are equal when no feature is NaN.

scorebot commented on July 18, 2024

I debugged both pmml4s (used by pypmml) and jpmml; the differences are caused by their different missing value handling policies. Take the first line of jpmml_pypmml.csv as an example:
The values of both fields oProb and oEmrgProb are missing, and missingValueReplacement="NaN" is applied as specified in the PMML, so both take the value NaN. We then need to compute the values of the derived fields cut(oProb) and cut(oEmrgProb):

jpmml always returns the last interval for NaN: cut(oProb)=0.444133, cut(oEmrgProb)=0.480491
pypmml always returns a missing value for NaN because there is no defaultValue defined: cut(oProb)=missing, cut(oEmrgProb)=missing

When evaluating imputer(cut(oProb)) and imputer(cut(oEmrgProb)), jpmml and pypmml return different values:

jpmml: imputer(cut(oProb))=0.444133, imputer(cut(oEmrgProb))=0.480491
pypmml: imputer(cut(oProb))=0.06101, imputer(cut(oEmrgProb))=-8.09E-4; the imputers are defined in the PMML.

That's the reason why the final results are different. I think pypmml is more reasonable: the double value NaN (not a number) should be treated as a missing value because no interval contains such a value.
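The two policies described above can be sketched in plain Python. This is only an illustration under stated assumptions: the interval boundaries and the first two bin values are hypothetical (the PMML file is not shown in this thread), the function names are made up, and the "fall through to the last bin" logic is just one way the jpmml result could arise, not jpmml's actual code. Only 0.444133 and 0.06101 are taken from the numbers reported above.

```python
import math

# Hypothetical right-closed intervals (left, right, bin_value).
# Only the last bin value (0.444133) comes from the thread.
BINS = [
    (-math.inf, 0.2, -0.31),
    (0.2, 0.6, 0.05),
    (0.6, math.inf, 0.444133),
]
# Replacement value reported for imputer(cut(oProb)) under pypmml.
IMPUTER_REPLACEMENT = 0.06101


def cut_last_interval(x):
    """Assumed 'last interval' policy: anything that fails the earlier
    interval tests, including NaN, ends up in the last bin."""
    for left, right, value in BINS[:-1]:
        if left < x <= right:
            return value
    return BINS[-1][2]


def cut_strict(x):
    """Strict policy: no matching interval and no defaultValue gives a
    missing result, represented here as None."""
    for left, right, value in BINS:
        if left < x <= right:
            return value
    return None


def imputer(value):
    """Replace a missing (None) value; pass everything else through."""
    return IMPUTER_REPLACEMENT if value is None else value


nan = math.nan
print(imputer(cut_last_interval(nan)))  # bin value of the last interval
print(imputer(cut_strict(nan)))         # imputer replacement value
```

Every comparison against NaN is false, so the strict version finds no interval and hands a missing value to the imputer, while the fall-through version never reaches the imputer's replacement branch. For ordinary inputs the two functions agree.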

BTW, what are the results of the native Python model? If they are the same as jpmml's, the generated PMML model likely has a problem, and you should file a bug against JPMML-SkLearn.

vruusmann commented on July 18, 2024

@DiegoSong Don't ever specify Domain.missing_value_replacement = NaN. Use some meaningful constant.

I think pypmml is more reasonable, the double value NaN (not a number) should be treated as a missing value because there is no interval contains such value.

NaN is an invalid value, not a missing value.

See http://dmg.org/pmml/v4-4/Transformations.html

notANumber. This is the value returned from such meaningless expressions as the logarithm of a negative number. It should not be used as a generic missing value, as missing values are properly those which are unknown, indeterminate, or non-applicable. All arithmetic expressions involving NaN will evaluate to NaN.

We may agree that a Discretize element should return NaN when given NaN as input, but returning a Discretize@defaultValue is definitely not correct.

scorebot commented on July 18, 2024

@vruusmann Thanks for your comments. Yes, NaN (not a number) is an invalid value, not a missing value. Suppose Discretize returns NaN when the input is NaN, what's the final result? NaN or a normal result based on the imputers?

vruusmann commented on July 18, 2024

Suppose Discretize returns NaN when the input is NaN, what's the final result?

It seems to me that the NaN should be propagated between expression and model elements till the end - resulting in a NaN prediction. For example, when one numeric predictor is NaN (whether supplied directly by the end user, or computed during data pre-processing), then RegressionModel should also predict NaN.
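As a minimal sketch of that propagation, assuming a plain logistic regression with made-up coefficients (none of these numbers come from the model in this thread), a single NaN predictor is enough to make the predicted probability NaN:

```python
import math


def predict_proba(features, coefficients, intercept):
    """Logistic regression score: sigmoid of the linear predictor.
    Any NaN feature makes the sum, and therefore the result, NaN."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and inputs, for illustration only.
coefficients = [1.0, 2.0]
intercept = 0.1

print(predict_proba([0.5, 0.25], coefficients, intercept))      # a probability in (0, 1)
print(predict_proba([0.5, math.nan], coefficients, intercept))  # nan
```

No special handling is needed for this behavior: IEEE 754 arithmetic already carries NaN through the sum and the sigmoid.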

This issue manifests itself with NaN right now. But it should apply to all missing values. For example, consider the following field definition:

<DataField name="x" dataType="double">
  <Value value="0" property="invalid"/>
</DataField>

When the end user supplies x = 0, then it is classified as a missing value, and should trigger the same chain of handlers as x = NaN would.

The handling of invalid values is almost completely unspecified by DMG.org. They should clarify.
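For illustration, here is a rough Python sketch of how an evaluator might dispatch on a declared invalid value before any derived field is computed. The treatment names mirror the PMML invalidValueTreatment options returnInvalid, asIs and asMissing; the function name, the use of None for "missing", and the exact control flow are assumptions made for this sketch, not the behavior of any particular evaluator.

```python
MISSING = None  # sentinel used in this sketch for a missing value


def prepare_input(value, invalid_values, treatment="returnInvalid"):
    """Apply an invalid value treatment to one user-supplied value.

    invalid_values: values declared with property="invalid" for the field.
    treatment: one of "returnInvalid", "asIs", "asMissing".
    """
    if value is MISSING or value not in invalid_values:
        return value  # valid or already missing: nothing to do here
    if treatment == "asMissing":
        return MISSING  # hand over to the missing value handlers
    if treatment == "asIs":
        return value  # pass the invalid value through unchanged
    raise ValueError(f"invalid input value: {value!r}")


# x = 0 is declared invalid, as in the DataField example above.
print(prepare_input(0, {0}, "asMissing"))  # None
print(prepare_input(0, {0}, "asIs"))       # 0
print(prepare_input(1.5, {0}))             # 1.5
```

With asMissing, an input of x = 0 enters the same chain of missing value handlers as an absent value would; with the default returnInvalid, evaluation stops with an error.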

DiegoSong commented on July 18, 2024

Thank you for your reply!
Missing values also carry information. I meant to replace a missing value with a meaningful value derived from the target 'y'. If I fill NaN with some constant like -9999, it may get merged into another bin. In my case that causes a 0.5% reduction in AUC.

pypmml always returns a missing value for the NaN

This is more reasonable to me.
Here is what may be a safe way to handle missing values when using sklearn2pmml:

ContinuousDomain(invalid_value_treatment='as_missing', missing_value_replacement=None)

instead of

ContinuousDomain(invalid_value_treatment='as_missing', missing_value_replacement=np.nan)

The missingValueReplacement attribute will then not be generated in the PMML file.
After that, jpmml, pypmml, and sklearn2pmml give the same output.
See: https://github.com/DiegoSong/CodeBox/blob/master/jpmml_pypmml.csv
Sorry for the 000000|999999

vruusmann commented on July 18, 2024

Parameter missingValueReplacement not generates in pmml file.

@DiegoSong The expression ContinuousDomain(missing_value_replacement=None) is a default no-op instruction, hence there's no PMML markup generated. However, you can keep it in your Python code for extra clarity.

DiegoSong commented on July 18, 2024

The expression ContinuousDomain(missing_value_replacement=None) is a default no-op instruction, hence there's no PMML markup generated.

It may be helpful when predicting on a CSV file, to avoid this error:

Exception in thread "main" org.jpmml.evaluator.InvalidResultException (at or around line 26 of the PMML document): Field "oProb" cannot accept user input value ""
at org.jpmml.evaluator.InputFieldUtil.performInvalidValueTreatment(InputFieldUtil.java:235)
at org.jpmml.evaluator.InputFieldUtil.prepareScalarInputValue(InputFieldUtil.java:151)
at org.jpmml.evaluator.InputFieldUtil.prepareInputValue(InputFieldUtil.java:111)
at org.jpmml.evaluator.InputField.prepare(InputField.java:70)
at org.jpmml.evaluator.example.EvaluationExample.execute(EvaluationExample.java:413)
at org.jpmml.evaluator.example.Example.execute(Example.java:92)
at org.jpmml.evaluator.example.EvaluationExample.main(EvaluationExample.java:262)
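Independently of the PMML settings, empty CSV cells can also be normalized before the records reach an evaluator. The following is a small standard-library sketch; the helper name read_records and the sample data are hypothetical, and whether a given evaluator treats None as a missing value is an assumption that should be checked against its documentation.

```python
import csv
import io


def read_records(csv_text, numeric_fields):
    """Parse CSV text into dicts, turning empty cells into None
    instead of passing the empty string "" to the evaluator."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {}
        for name in numeric_fields:
            cell = (row.get(name) or "").strip()
            record[name] = float(cell) if cell else None
        records.append(record)
    return records


# Made-up sample: the first row has an empty oProb cell.
sample = "cProb,oProb\n0.06,\n0.1,0.5\n"
for record in read_records(sample, ["cProb", "oProb"]):
    print(record)
```

This keeps the decision about what "missing" means in one place, rather than relying on how each evaluator parses an empty string.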
