Comments (11)
@DiegoSong Thanks for your info. Could you attach your model here? then we can reproduce the issue easily.
from pypmml-spark.
@scorebot Thanks for your reply. I sent the PMML file to this address: [email protected].
@DiegoSong Thanks, I have received the PMML model. I tested it in my environment with the latest pypmml 0.9.6 installed:
>>> from pypmml import Model
>>> model = Model.load('model.pmml')
>>> model.predict({'cProb': 0.06, 'pProb': 0.3, 'oProb': 0.5, 'T3Prob': 0.1, 'cEmrgProb': 0.06, 'pEmrgProb': 0.3, 'oEmrgProb': 0.5})
{'probability(1)': 0.18478507571419323, 'probability(0)': 0.8152149242858068}
The input record contains valid values for all of those input fields, and I did not find anything wrong. What are the results from jpmml and pipeline.predict_proba?
https://github.com/DiegoSong/CodeBox/blob/master/jpmml_pypmml.csv
The data shows the differences between jpmml and pypmml. The results are equal when no feature is NaN.
I debugged both pmml4s (used by pypmml) and jpmml. The differences are caused by their different missing value handling policies. Take the first line of jpmml_pypmml.csv as an example:
The values of both fields oProb and oEmrgProb are missing. Because the PMML specifies missingValueReplacement="NaN", both take the value NaN. We then need to compute the values of both derived fields cut(oProb) and cut(oEmrgProb):
- jpmml always returns the last interval for NaN: cut(oProb)=0.444133, cut(oEmrgProb)=0.480491
- pypmml always returns a missing value for NaN because there is no defaultValue defined: cut(oProb)=missing, cut(oEmrgProb)=missing
When evaluating imputer(cut(oProb)) and imputer(cut(oEmrgProb)), jpmml and pypmml return different values:
- jpmml: imputer(cut(oProb))=0.444133, imputer(cut(oEmrgProb))=0.480491
- pypmml: imputer(cut(oProb))=0.06101, imputer(cut(oEmrgProb))=-8.09E-4 (the imputers are defined in the PMML)
That's why the final results are different. I think pypmml is more reasonable: the double value NaN (not a number) should be treated as a missing value because no interval contains such a value.
BTW, what are the results of the native Python model? If they are the same as jpmml's, the generated PMML model probably has a problem, and you need to file a bug against JPMML-SkLearn.
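The two policies described above can be imitated in a few lines of Python. This is only an illustrative sketch: the bin edges, bin scores, and imputer constant are hypothetical, and neither function is the actual jpmml or pmml4s code. The "last interval" variant simply shows one way a NaN could end up there, since every comparison against NaN is false.

```python
import math
from bisect import bisect_right

# Hypothetical bin edges and per-bin scores; NOT the values from the real model.
EDGES = [0.2, 0.5]              # bins: (-inf, 0.2), [0.2, 0.5), [0.5, +inf)
BIN_VALUES = [-0.3, 0.1, 0.44]
IMPUTED = 0.06                  # hypothetical imputer replacement for a missing value


def cut_last_interval(x):
    """NaN fails every comparison, so it falls through to the last interval."""
    for edge, value in zip(EDGES, BIN_VALUES):
        if x < edge:            # always False when x is NaN
            return value
    return BIN_VALUES[-1]


def cut_missing(x):
    """NaN matches no interval; with no default value the result is missing (None)."""
    if x is None or math.isnan(x):
        return None
    return BIN_VALUES[bisect_right(EDGES, x)]


def impute(value):
    """Replace a missing (None) value with the imputer's constant."""
    return IMPUTED if value is None else value


nan = float("nan")
print(impute(cut_last_interval(nan)))  # 0.44, the imputer never fires
print(impute(cut_missing(nan)))        # 0.06, the imputer replaces the missing value
```

For any non-NaN input the two functions agree, which matches the observation that the outputs only differ on rows with missing features.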
@DiegoSong Don't ever specify Domain.missing_value_replacement = NaN. Use some meaningful constant.
"I think pypmml is more reasonable, the double value NaN (not a number) should be treated as a missing value because no interval contains such a value."
NaN is an invalid value, not a missing value.
See http://dmg.org/pmml/v4-4/Transformations.html:
"notANumber. This is the value returned from such meaningless expressions as the logarithm of a negative number. It should not be used as a generic missing value, as missing values are properly those which are unknown, indeterminate, or non-applicable. All arithmetic expressions involving NaN will evaluate to NaN."
We may agree that a Discretize element should return NaN when given NaN as input, but returning a Discretize@defaultValue is definitely not correct.
@vruusmann Thanks for your comments. Yes, NaN (not a number) is an invalid value, not a missing value. Suppose Discretize returns NaN when the input is NaN: what is the final result? NaN, or a normal result based on the imputers?
"Suppose Discretize returns NaN when the input is NaN, what's the final result?"
It seems to me that NaN should be propagated between expression and model elements until the end, resulting in a NaN prediction. For example, when one numeric predictor is NaN (whether supplied directly by the end user, or computed during data pre-processing), then RegressionModel should also predict NaN.
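A minimal sketch of that propagation, using ordinary floating-point arithmetic rather than any PMML engine. The function, coefficients, and field names are hypothetical and serve only to show that a single NaN predictor makes the whole linear prediction NaN.

```python
import math


def linear_predict(intercept, coefficients, row):
    """Plain linear regression: intercept + sum(coefficient * value)."""
    return intercept + sum(c * row[name] for name, c in coefficients.items())


# Hypothetical coefficients and records, for illustration only.
coefficients = {"x1": 0.5, "x2": -1.25}
clean = {"x1": 2.0, "x2": 1.0}
dirty = {"x1": 2.0, "x2": float("nan")}

print(linear_predict(1.0, coefficients, clean))              # 0.75
print(math.isnan(linear_predict(1.0, coefficients, dirty)))  # True
```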
This issue manifests itself with NaN right now, but it should apply to all invalid values. For example, consider the following field definition:
<DataField name="x" dataType="double">
  <Value value="0" property="invalid"/>
</DataField>
When the end user supplies x = 0, it is classified as an invalid value, and should trigger the same chain of handlers as x = NaN would.
The handling of invalid values is almost completely unspecified by DMG.org. They should clarify.
Thank you for your reply!
Missing values also carry information. I want to be able to replace a missing value with a meaningful value derived from the target 'y'. If I fill NaN with a constant such as -9999, it may be merged into another bin. In my case that reduces AUC by 0.5%.
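To illustrate the concern about a constant fill value, here is a small sketch with hypothetical bin edges (not taken from the actual model). Once a missing value is replaced by -9999, it becomes indistinguishable from genuinely small values and lands in the lowest bin, whereas keeping it as missing lets it get its own category.

```python
from bisect import bisect_right

EDGES = [0.2, 0.5]  # hypothetical bin edges: (-inf, 0.2), [0.2, 0.5), [0.5, +inf)


def bin_index(x):
    """Return the bin index, or the label 'missing' when the value is None."""
    if x is None:
        return "missing"
    return bisect_right(EDGES, x)


values = [0.05, None, 0.7]
filled = [-9999 if v is None else v for v in values]

print([bin_index(v) for v in filled])  # [0, 0, 2], the missing row merged into bin 0
print([bin_index(v) for v in values])  # [0, 'missing', 2], the missing row kept separate
```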
"pypmml always returns a missing value for the NaN"
This seems more reasonable to me.
Here is what may be a safe way to handle missing values when using sklearn2pmml:
ContinuousDomain(invalid_value_treatment='as_missing', missing_value_replacement=None)
instead of
ContinuousDomain(invalid_value_treatment='as_missing', missing_value_replacement=np.nan)
The missingValueReplacement attribute will then not be generated in the PMML file.
After that, jpmml, pypmml, and sklearn2pmml give the same output.
See: https://github.com/DiegoSong/CodeBox/blob/master/jpmml_pypmml.csv
Sorry for the 000000|999999.
"Parameter missingValueReplacement will not be generated in the PMML file."
@DiegoSong The expression ContinuousDomain(missing_value_replacement=None) is a default no-op instruction, hence no PMML markup is generated. However, you can keep it in your Python code for extra clarity.
"The expression ContinuousDomain(missing_value_replacement=None) is a default no-op instruction, hence there's no PMML markup generated."
It may be helpful when predicting on a CSV file, to avoid this error:
Exception in thread "main" org.jpmml.evaluator.InvalidResultException (at or around line 26 of the PMML document): Field "oProb" cannot accept user input value ""
    at org.jpmml.evaluator.InputFieldUtil.performInvalidValueTreatment(InputFieldUtil.java:235)
    at org.jpmml.evaluator.InputFieldUtil.prepareScalarInputValue(InputFieldUtil.java:151)
    at org.jpmml.evaluator.InputFieldUtil.prepareInputValue(InputFieldUtil.java:111)
    at org.jpmml.evaluator.InputField.prepare(InputField.java:70)
    at org.jpmml.evaluator.example.EvaluationExample.execute(EvaluationExample.java:413)
    at org.jpmml.evaluator.example.Example.execute(Example.java:92)
    at org.jpmml.evaluator.example.EvaluationExample.main(EvaluationExample.java:262)