
jpmml-sparkml's Introduction

Java API for producing and scoring models in Predictive Model Markup Language (PMML).

IMPORTANT

This is a legacy codebase.

Starting from March 2014, this project has been superseded by the [JPMML-Model](https://github.com/jpmml/jpmml-model) and [JPMML-Evaluator](https://github.com/jpmml/jpmml-evaluator) projects.

Features

Class model

  • Full support for PMML 3.0, 3.1, 3.2, 4.0 and 4.1 schemas:
    • Class hierarchy.
    • Schema version annotations.
  • Fluent API:
    • Value constructors.
  • SAX Locator information.
  • [Visitor pattern](http://en.wikipedia.org/wiki/Visitor_pattern):
    • Validation agents.
    • Optimization and transformation agents.

Evaluation engine

Installation

JPMML library JAR files (together with accompanying Java source and Javadocs JAR files) are released via the [Maven Central Repository](http://repo1.maven.org/maven2/org/jpmml/). Please join the [JPMML mailing list](https://groups.google.com/forum/#!forum/jpmml) for release announcements.

The current version is 1.0.22 (17 February, 2014). The ${jpmml.version} placeholder in the snippets below should resolve to this version (for example, via a <jpmml.version> property in the POM).

Class model

<!-- Class model classes -->
<dependency>
	<groupId>org.jpmml</groupId>
	<artifactId>pmml-model</artifactId>
	<version>${jpmml.version}</version>
</dependency>
<!-- Class model annotations -->
<dependency>
	<groupId>org.jpmml</groupId>
	<artifactId>pmml-schema</artifactId>
	<version>${jpmml.version}</version>
</dependency>

Evaluation engine

<dependency>
	<groupId>org.jpmml</groupId>
	<artifactId>pmml-evaluator</artifactId>
	<version>${jpmml.version}</version>
</dependency>

Usage

Class model

The class model consists of two types of classes. There is a small number of manually crafted classes that are used for structuring the class hierarchy. They are permanently stored in the Java sources directory /pmml-model/src/main/java. Additionally, there is a much greater number of automatically generated classes that represent actual PMML elements. They can be found in the generated Java sources directory /pmml-model/target/generated-sources/xjc after a successful build operation.

All class model classes descend from class org.dmg.pmml.PMMLObject. Additional class hierarchy levels, if any, represent common behaviour and/or features. For example, all model classes descend from class org.dmg.pmml.Model.

There is not much documentation accompanying the class model classes. Application developers should consult the [PMML specification](http://www.dmg.org/v4-1/GeneralStructure.html) for descriptions of individual PMML elements and attributes.
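
As an illustrative sketch (constructor and method signatures may vary slightly between class model versions), the fluent API's value constructors allow PMML elements to be built directly:

import org.dmg.pmml.DataDictionary;
import org.dmg.pmml.DataField;
import org.dmg.pmml.DataType;
import org.dmg.pmml.FieldName;
import org.dmg.pmml.OpType;

// A value constructor initializes all required attributes in a single call
DataField dataField = new DataField(new FieldName("x1"), OpType.CONTINUOUS, DataType.DOUBLE);

DataDictionary dataDictionary = new DataDictionary();
dataDictionary.getDataFields().add(dataField);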

Example applications

Evaluation engine

A model evaluator class can be instantiated directly when the contents of the PMML document are known:

PMML pmml = ...;

ModelEvaluator<TreeModel> modelEvaluator = new TreeModelEvaluator(pmml);

Otherwise, a PMML manager class should be instantiated first, which will inspect the contents of the PMML document and instantiate the right model evaluator class later:

PMML pmml = ...;

PMMLManager pmmlManager = new PMMLManager(pmml);
 
ModelEvaluator<?> modelEvaluator = (ModelEvaluator<?>)pmmlManager.getModelManager(null, ModelEvaluatorFactory.getInstance());

Model evaluator classes follow functional programming principles. Model evaluator instances are cheap enough to be created and discarded as needed (i.e. not worth the pooling effort).

It is advisable for application code to work against the org.jpmml.evaluator.Evaluator interface:

Evaluator evaluator = (Evaluator)modelEvaluator;

An evaluator instance can be queried for the definition of active (i.e. independent), predicted (i.e. primary dependent) and output (i.e. secondary dependent) fields:

List<FieldName> activeFields = evaluator.getActiveFields();
List<FieldName> predictedFields = evaluator.getPredictedFields();
List<FieldName> outputFields = evaluator.getOutputFields();

The PMML scoring operation must be invoked with valid arguments. Otherwise, the behaviour of the model evaluator class is unspecified.

The preparation of field values:

Map<FieldName, FieldValue> arguments = new LinkedHashMap<FieldName, FieldValue>();

List<FieldName> activeFields = evaluator.getActiveFields();
for(FieldName activeField : activeFields){
	// The raw (i.e. user-supplied) value could be any Java primitive value
	Object rawValue = ...;

	// The raw value is passed through: 1) outlier treatment, 2) missing value treatment, 3) invalid value treatment and 4) type conversion
	FieldValue activeValue = evaluator.prepare(activeField, rawValue);

	arguments.put(activeField, activeValue);
}

The scoring:

Map<FieldName, ?> results = evaluator.evaluate(arguments);

Typically, a model has exactly one predicted field, which is called the target field:

FieldName targetName = evaluator.getTargetField();
Object targetValue = results.get(targetName);

The target value is either a Java primitive value (as a wrapper object) or an instance of org.jpmml.evaluator.Computable:

if(targetValue instanceof Computable){
	Computable computable = (Computable)targetValue;

	Object primitiveValue = computable.getResult();
}

The target value may implement interfaces that descend from interface org.jpmml.evaluator.ResultFeature:

// Test for "entityId" result feature
if(targetValue instanceof HasEntityId){
	HasEntityId hasEntityId = (HasEntityId)targetValue;
	HasEntityRegistry<?> hasEntityRegistry = (HasEntityRegistry<?>)evaluator;
	BiMap<String, ? extends Entity> entities = hasEntityRegistry.getEntityRegistry();
	Entity winner = entities.get(hasEntityId.getEntityId());

	// Test for "probability" result feature
	if(targetValue instanceof HasProbability){
		HasProbability hasProbability = (HasProbability)targetValue;
		Double winnerProbability = hasProbability.getProbability(winner.getId());
	}
}
Example applications

Additional information

Please contact [[email protected]](mailto:[email protected])

jpmml-sparkml's People

Contributors

vruusmann

jpmml-sparkml's Issues

java.lang.IllegalArgumentException when calling ConverterUtil.toPMML

Hello,

I'm trying to use PMML export on a very small data sample used with DecisionTreeRegressor, and I am getting a java.lang.IllegalArgumentException error when calling ConverterUtil.toPMML.
Here is the code:

val sourceData = session.read.format("myformat").
  load(DataFileURL)

val assembler = new VectorAssembler()
  .setInputCols(Array("X1", "X2", "X3"))
  .setOutputCol("features")

val dt = new DecisionTreeRegressor()
  .setLabelCol("Y")
  .setFeaturesCol("features")
  .setImpurity("variance")
  .setMaxDepth(30)
  .setMaxBins(32)

val pipeline = new Pipeline().setStages(Array(assembler, dt))

val model = pipeline.fit(sourceData)

val pmml = ConverterUtil.toPMML(sourceData.schema, model)

X1 and X2 have a NominalAttribute attribute, while X3 has a NumericAttribute attribute.

If I print the DecisionTreeRegressionModel, I get this result:

DecisionTreeRegressionModel (uid=dtr_8e66c6f292fc) of depth 7 with 57 nodes
  If (feature 2 <= 20.0)
   If (feature 1 in {0.0,1.0})
    If (feature 1 in {1.0})
     If (feature 0 in {0.0})
      Predict: 309.38
     Else (feature 0 not in {0.0})
      Predict: 569.6666666666669
    Else (feature 1 not in {1.0})
     If (feature 0 in {0.0})
      Predict: 583.5585714285714
     Else (feature 0 not in {0.0})
      Predict: 591.8775
   Else (feature 1 not in {0.0,1.0})
    If (feature 0 in {1.0})
     Predict: 1882.7800000000002
    Else (feature 0 not in {1.0})
     Predict: 2435.3799999999997
  Else (feature 2 > 20.0)
   If (feature 1 in {0.0})
    If (feature 2 <= 22.0)
     If (feature 0 in {0.0})
      If (feature 2 <= 21.0)
       Predict: 160.80599999999998
      Else (feature 2 > 21.0)
       Predict: 418.02833333333336
     Else (feature 0 not in {0.0})
      If (feature 2 <= 21.0)
       Predict: 636.2533333333334
      Else (feature 2 > 21.0)
       Predict: 273.82000000000005
    Else (feature 2 > 22.0)
     If (feature 2 <= 24.0)
      If (feature 0 in {0.0})
       If (feature 2 <= 23.0)
        Predict: 196.11
       Else (feature 2 > 23.0)
        Predict: 214.44
      Else (feature 0 not in {0.0})
       Predict: 303.5300000000001
     Else (feature 2 > 24.0)
      Predict: 152.13000000000002
   Else (feature 1 not in {0.0})
    If (feature 2 <= 22.0)
     If (feature 1 in {2.0})
      If (feature 2 <= 21.0)
       If (feature 0 in {1.0})
        Predict: 238.91666666666666
       Else (feature 0 not in {1.0})
        Predict: 244.89999999999998
      Else (feature 2 > 21.0)
       Predict: 333.3599999999999
     Else (feature 1 not in {2.0})
      If (feature 2 <= 21.0)
       If (feature 0 in {0.0})
        Predict: 387.8825
       Else (feature 0 not in {0.0})
        Predict: 446.2525
      Else (feature 2 > 21.0)
       If (feature 0 in {1.0})
        Predict: 316.75
       Else (feature 0 not in {1.0})
        Predict: 402.85714285714283
    Else (feature 2 > 22.0)
     If (feature 2 <= 24.0)
      If (feature 0 in {0.0})
       If (feature 2 <= 23.0)
        Predict: 239.59000000000003
       Else (feature 2 > 23.0)
        If (feature 1 in {2.0})
         Predict: 541.51
        Else (feature 1 not in {2.0})
         Predict: 1087.8500000000001
      Else (feature 0 not in {0.0})
       If (feature 2 <= 23.0)
        If (feature 1 in {1.0})
         Predict: 842.3125000000001
        Else (feature 1 not in {1.0})
         Predict: 1059.3700000000003
       Else (feature 2 > 23.0)
        Predict: 384.6300000000001
     Else (feature 2 > 24.0)
      If (feature 1 in {1.0})
       Predict: 350.58000000000004
      Else (feature 1 not in {1.0})
       Predict: 477.96000000000004

What am I missing to be able to get a PMML model?

java.lang.NoSuchMethodError on org.dmg.pmml.MiningField.setUsageType

Dear Sir or Madam,
When I try to export my model, I encounter the following error.
I am using jpmml-sparkml 1.1.6 and Spark 2.0.2.

scala> val pmml = ConverterUtil.toPMML(df.schema, model)
java.lang.NoSuchMethodError: org.dmg.pmml.MiningField.setUsageType(Lorg/dmg/pmml/MiningField$UsageType;)Lorg/dmg/pmml/MiningField;
  at org.jpmml.converter.ModelUtil.createMiningField(ModelUtil.java:73)
  at org.jpmml.converter.ModelUtil.createMiningSchema(ModelUtil.java:57)
  at org.jpmml.converter.ModelUtil.createMiningSchema(ModelUtil.java:46)
  at org.jpmml.sparkml.model.RandomForestClassificationModelConverter.encodeModel(RandomForestClassificationModelConverter.java:45)
  at org.jpmml.sparkml.model.RandomForestClassificationModelConverter.encodeModel(RandomForestClassificationModelConverter.java:33)
  at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:131)
  ... 52 elided

Thanks in advance for helping.
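
A NoSuchMethodError like this typically indicates that an older, incompatible version of the PMML class model is being loaded (for example, one bundled into the Spark assembly). A minimal diagnostic sketch, assuming access to the same classpath, prints which JAR file the class is loaded from:

// If the printed location points at a Spark assembly JAR rather than a
// pmml-model JAR, the error is caused by a classpath version conflict
java.net.URL location = org.dmg.pmml.MiningField.class
	.getProtectionDomain()
	.getCodeSource()
	.getLocation();
System.out.println(location);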

training model using spark and predicting same data using jpmml-evaluator, but got low accuracy

@vruusmann, when training a model using Spark and predicting the same data using jpmml-evaluator, I got low accuracy. What's wrong with my code?

  1. Spark trains a GBDT model and saves it in PMML format; training accuracy is 0.8516456322692403:
val Array(training, test) = data.toDF("label","degree","tcNum","pageRank","commVertexNum","normQ","gtRate","eqRate","ltRate").randomSplit(Array(1.0 - fracTest, fracTest), 1234)
// Set up Pipeline
    val stages = new mutable.ArrayBuffer[PipelineStage]()
    // (1) For classification, re-index classes.
    val labelColName = if (algo == "classification") "indexedLabel" else "label"
    if (algo == "classification") {
      val labelIndexer = new StringIndexer()
        .setInputCol("label")
        .setOutputCol(labelColName)
      stages += labelIndexer
    }

    val vectorAssember = new VectorAssembler()
    vectorAssember.setInputCols(Array("degree","tcNum","pageRank","commVertexNum","normQ","gtRate","eqRate","ltRate"))
    vectorAssember.setOutputCol("features")
    val vectorData = vectorAssember.transform(training)

//    val vectorData = vectorAssember.transform(training)

    stages += vectorAssember
    // (3) Learn GBT.
    val dt = algo match {
      case "classification" =>
        new GBTClassifier()
          .setLabelCol(labelColName)
          .setFeaturesCol("features")
          .setMaxDepth(params.maxDepth)
          .setMaxBins(params.maxBins)
          .setMinInstancesPerNode(params.minInstancesPerNode)
          .setMinInfoGain(params.minInfoGain)
          .setCacheNodeIds(params.cacheNodeIds)
          .setCheckpointInterval(params.checkpointInterval)
          .setMaxIter(params.maxIter)
      case "regression" =>
        new GBTRegressor()
          .setFeaturesCol("features")
          .setLabelCol(labelColName)
          .setMaxDepth(params.maxDepth)
          .setMaxBins(params.maxBins)
          .setMinInstancesPerNode(params.minInstancesPerNode)
          .setMinInfoGain(params.minInfoGain)
          .setCacheNodeIds(params.cacheNodeIds)
          .setCheckpointInterval(params.checkpointInterval)
          .setMaxIter(params.maxIter)
      case _ => throw new IllegalArgumentException("Algo ${params.algo} not supported.")
    }
    stages += dt
    val pipeline = new Pipeline().setStages(stages.toArray)

    // Fit the Pipeline.
    val startTime = System.nanoTime()
    val pipelineModel = pipeline.fit(training)
    val elapsedTime = (System.nanoTime() - startTime) / 1e9
    println(s"Training time: $elapsedTime seconds")

    /**
      * write model pmml format to hdfs
      */
    val modelPmmlPath = "sjmei/pmmlmodel"
    val pmml = ConverterUtil.toPMML(training.schema, pipelineModel);
//    val conf = new Configuration();
//    HadoopFileUtil.deleteFile(modelPmmlPath)
//    val path = new Path(modelPmmlPath);
//    val fs = path.getFileSystem(conf);
//    val out = fs.create(path);
    MetroJAXBUtil.marshalPMML(pmml, new FileOutputStream(modelPmmlPath));

2. Load the PMML model and use jpmml-evaluator to predict the data, but the prediction accuracy is only:

acc count:4537
error count:5553
acc rate:0.44965312190287415
public class ScoreTest {

    public static void main(String[] args) throws Exception {
        PMML pmml = readPMML(new File("sjmei/pmmlmodel/rf.pmml"));
        ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
//        System.out.println(pmml.getModels().get(0));
//        Evaluator evaluator = modelEvaluatorFactory.newModelEvaluator(pmml);
        ModelEvaluator evaluator = new MiningModelEvaluator(pmml);

        List<InputField> inputFields = evaluator.getInputFields();

        InputStream is = new FileInputStream(new File("jrdm-dm/data/graph.result.final.vertices.wide.tbl/part-00000"));
        BufferedReader br = new BufferedReader(new InputStreamReader(is));
        String line;

        int diffDelta = 0;
        int sameDelta = 0;
        while((line = br.readLine()) != null) {
            String[] splits = line.split("\t",-1);

            String label = splits[14];

            Map<FieldName, FieldValue> arguments = readArgumentsFromLine(splits, inputFields);

            Map<FieldName, ?> results = evaluator.evaluate(arguments);
//            System.out.println(results);
            List<TargetField> targetFields = evaluator.getTargetFields();
            for(TargetField targetField : targetFields){
                FieldName targetFieldName = targetField.getName();
                Object targetFieldValue = results.get(targetFieldName);

                ProbabilityDistribution nodeMap = (ProbabilityDistribution)targetFieldValue;
                Object result = nodeMap.getResult();
                if(String.valueOf(transToDouble(label)).equalsIgnoreCase(result.toString())){
                    sameDelta +=1;
                }else{
                    diffDelta +=1;
                }
            }
        }

        System.out.println("acc count:"+sameDelta);
        System.out.println("error count:"+diffDelta);
        System.out.println("acc rate:"+(sameDelta*1.0d/(sameDelta+diffDelta)));

    }

    /**
     * Reads the PMML model file from a file
     * @param file
     * @return
     * @throws Exception
     */
    public static PMML readPMML(File file) throws Exception {


        String pmmlString = new Scanner(file).useDelimiter("\\Z").next();
        InputStream is = new ByteArrayInputStream(pmmlString.getBytes());
        InputSource source = new InputSource(is);
        SAXSource transformedSource = ImportFilter.apply(source);

        return JAXBUtil.unmarshalPMML(transformedSource);
    }

    /**
     * Constructs the model input feature fields
     * @param splits
     * @param inputFields
     * @return
     */
    public static Map<FieldName, FieldValue> readArgumentsFromLine(String[] splits, List<InputField> inputFields) {

        List<Double> lists = new ArrayList<Double>();
        lists.add(Double.valueOf(splits[3]));
        lists.add(Double.valueOf(splits[4]));
        lists.add(Double.valueOf(splits[5]));
        lists.add(Double.valueOf(splits[7]));
        lists.add(Double.valueOf(splits[8]));
        lists.add(Double.valueOf(splits[9]));
        lists.add(Double.valueOf(splits[10]));
        lists.add(Double.valueOf(splits[11]));

        Map<FieldName, FieldValue> arguments = new LinkedHashMap<FieldName, FieldValue>();

        int i = 0;
        for(InputField inputField : inputFields){
            FieldName inputFieldName = inputField.getName();
            Object rawValue = lists.get(i);
            FieldValue inputFieldValue = inputField.prepare(rawValue);

            arguments.put(inputFieldName, inputFieldValue);
            i+=1;
        }

        return arguments;
    }

    public static Double transToDouble(String label) {
        try {
            return Double.valueOf(label);
        }catch (Exception e){
            return Double.valueOf(0);
        }
    }
}

How to convert spark rdd based gbdt model to pmml model?

Jpmml-sparkml can only convert Spark ML pipelines to PMML. But I trained a Spark RDD-based GBDT MLlib model; how can I convert this MLlib model to a PMML model?

PMML model export - RDD-based API shows that only KMeansModel, LinearRegressionModel, RidgeRegressionModel, LassoModel, SVMModel and binary LogisticRegressionModel can be converted to PMML. What about the GBDT model? Is there no method to convert it to PMML?

Can anyone help me?

Exception in thread "main" java.lang.IllegalArgumentException: skip

I am using Spark 2.1.1 and jpmml 1.2.12; execution reports the following error:

Exception in thread "main" java.lang.IllegalArgumentException: skip
	at org.jpmml.sparkml.feature.StringIndexerModelConverter.encodeFeatures(StringIndexerModelConverter.java:65)
	at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:47)
	at org.jpmml.sparkml.PMMLBuilder.build(PMMLBuilder.java:114)
	at com.nubia.train.Ad_ctr_train$.main(Ad_ctr_train.scala:182)
	at com.nubia.train.Ad_ctr_train.main(Ad_ctr_train.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:745)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
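
The "skip" in the exception message suggests that a StringIndexer stage has its handleInvalid parameter set to "skip", which this converter version does not appear to support. A hypothetical illustration (the column names are made up):

import org.apache.spark.ml.feature.StringIndexer;

// StringIndexerModelConverter appears to reject the "skip" invalid-value
// policy; the default "error" policy is convertible
StringIndexer indexer = new StringIndexer()
	.setInputCol("category")
	.setOutputCol("categoryIndex")
	.setHandleInvalid("error"); // not "skip"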

Custom transformers for configuring continuous and categorical feature information

As requested in jpmml/jpmml-evaluator#56

The JPMML-SkLearn project defines two custom transformation types, sklearn2pmml.decoration.CategoricalDomain and sklearn2pmml.decoration.ContinuousDomain, which provide the ability to configure missing value, invalid value, etc. treatments. For example:

mapper = DataFrameMapper([
  ("Sepal.Length", ContinuousDomain(missing_value_treatment = "as_is", invalid_value_treatment = "as_is"))
])

JPMML-SparkML should provide identical functionality.

setup with sbt rather than maven

Hello,

How should I include exclusions and shading within my build.sbt?

My current build.sbt :

scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
  "org.apache.spark" % "spark-sql_2.11" % "2.1.0",
  "org.apache.hbase" % "hbase-common" % "1.2.2",
  "org.apache.spark" % "spark-mllib_2.11" % "2.1.0",
  "org.jpmml" % "jpmml-sparkml" % "1.2.6"
)

excludeDependencies += "org.jpmml" % "pmml-model"

Thanks

Unsupported vector type on datasource that provides it

Hello,

We are using Spark with a custom datasource that directly gives a (label, vector of features) dataframe, which saves us from using a VectorAssembler in the pipeline.
While this works just fine to train ML models, we can't export them to PMML using jpmml-sparkml because we receive this error
java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got vector type

Looking around on various sites, I see that it comes from the fact that jpmml-sparkml does not know how to handle our dataframe. What metadata are we missing so that our models can be exported to PMML?

As a workaround, we can have "split" data and use a VectorAssembler but it uses some computation time that we feel is a bit wasted.

UnsupportedOperationException when exporting StringIndexer with LogisticRegression

Hi,

I'm testing a very simple case just to evaluate the library and ran into an issue. Here's the code:

        // Load training data
        Dataset training = getTrainingData(jsc, sqlContext);
        StructType schema = training.schema();

        // Define the pipeline
        StringIndexer countryIndexer = new StringIndexer()
                .setInputCol("country")
                .setOutputCol("country_index");

        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"country_index", "a", "b"})
                .setOutputCol("features");

        LogisticRegression lr = new LogisticRegression()
                .setMaxIter(10)
                .setRegParam(0.3)
                .setElasticNetParam(0.8);

        Pipeline pipeline = new Pipeline();
        pipeline.setStages(new PipelineStage[]{countryIndexer, assembler, lr});

        // Fit the model
        PipelineModel pipelineModel = pipeline.fit(training);

        // Predict
        Dataset testing = getTestingData(jsc, sqlContext);
        Dataset predictions = pipelineModel.transform(testing);
        predictions.show();

        // Export to PMML
        PMML pmml = ConverterUtil.toPMML(schema, pipelineModel);

Here's a piece of relevant output (predictions.show() and the exception):

+-----+-------+---+----+-------------+--------------+--------------------+--------------------+----------+
|label|country|  a|   b|country_index|      features|       rawPrediction|         probability|prediction|
+-----+-------+---+----+-------------+--------------+--------------------+--------------------+----------+
|  0.0|     FR|1.0|-0.2|          0.0|[0.0,1.0,-0.2]|[0.43756144584300...|[0.60767781895595...|       0.0|
|  1.0|     DE|0.9| 0.5|          1.0| [1.0,0.9,0.5]|[-0.7827870058785...|[0.31371953355157...|       1.0|
+-----+-------+---+----+-------------+--------------+--------------------+--------------------+----------+

Exception in thread "main" java.lang.UnsupportedOperationException
	at org.jpmml.converter.CategoricalFeature.toContinuousFeature(CategoricalFeature.java:63)
	at org.jpmml.converter.regression.RegressionModelUtil.createRegressionTable(RegressionModelUtil.java:232)
	at org.jpmml.converter.regression.RegressionModelUtil.createBinaryLogisticClassification(RegressionModelUtil.java:113)
	at org.jpmml.converter.regression.RegressionModelUtil.createBinaryLogisticClassification(RegressionModelUtil.java:87)
	at org.jpmml.sparkml.model.LogisticRegressionModelConverter.encodeModel(LogisticRegressionModelConverter.java:52)
	at org.jpmml.sparkml.model.LogisticRegressionModelConverter.encodeModel(LogisticRegressionModelConverter.java:39)
	at org.jpmml.sparkml.ModelConverter.registerModel(ModelConverter.java:165)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:81)
	at com.vika.pmml.PmmlExample.run(PmmlExample.java:99)
	at com.vika.pmml.PmmlExample.main(PmmlExample.java:40)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

the training data:

    private static final StructType SCHEMA = new StructType(new StructField[]{
            createStructField("label", DoubleType, false),
            createStructField("country", StringType, false),
            createStructField("a", DoubleType, false),
            createStructField("b", DoubleType, false)
    });

    private Dataset getTrainingData(JavaSparkContext jsc, SQLContext sqlContext) {

        JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
                RowFactory.create(1.0, "DE", 1.1, 0.1),
                RowFactory.create(0.0, "FR", 1.0, -1.0),
                RowFactory.create(0.0, "FR", 1.3, 1.0),
                RowFactory.create(1.0, "DE", 1.2, -0.5)
        ));
        return sqlContext.createDataFrame(jrdd, SCHEMA);
    }

The exception is thrown when the country feature is handled in RegressionModelUtil.createRegressionTable().

Am I doing something wrong? Or does using StringIndexer with LogisticRegression simply not work right?

By the way, I also tried the same code with library version 1.0.9 and Spark 1.6; it did get exported:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
    <Header>
        <Application name="JPMML-SparkML" version="1.0.9"/>
        <Timestamp>2017-07-14T16:20:50Z</Timestamp>
    </Header>
    <DataDictionary>
        <DataField name="country" optype="categorical" dataType="string">
            <Value value="FR"/>
            <Value value="DE"/>
        </DataField>
        <DataField name="a" optype="continuous" dataType="double"/>
        <DataField name="b" optype="continuous" dataType="double"/>
        <DataField name="label" optype="categorical" dataType="double">
            <Value value="0"/>
            <Value value="1"/>
        </DataField>
    </DataDictionary>
    <RegressionModel functionName="classification" normalizationMethod="softmax">
        <MiningSchema>
            <MiningField name="label" usageType="target"/>
            <MiningField name="country"/>
            <MiningField name="a"/>
            <MiningField name="b"/>
        </MiningSchema>
        <Output>
            <OutputField name="probability_0" feature="probability" value="0"/>
            <OutputField name="probability_1" feature="probability" value="1"/>
        </Output>
        <RegressionTable intercept="-0.4375614458430096" targetCategory="1">
            <NumericPredictor name="country" coefficient="1.2203484517215881"/>
            <NumericPredictor name="a" coefficient="0.0"/>
            <NumericPredictor name="b" coefficient="0.0"/>
        </RegressionTable>
        <RegressionTable intercept="0.0" targetCategory="0"/>
    </RegressionModel>
</PMML>

However, evaluating this PMML didn't work:

Exception in thread "main" org.jpmml.evaluator.TypeCheckException: Expected DOUBLE, but got STRING (FR)
	at org.jpmml.evaluator.TypeUtil.toDouble(TypeUtil.java:617)
	at org.jpmml.evaluator.TypeUtil.cast(TypeUtil.java:424)
	at org.jpmml.evaluator.FieldValue.getValue(FieldValue.java:320)
	at org.jpmml.evaluator.FieldValue.asNumber(FieldValue.java:269)
	at org.jpmml.evaluator.RegressionModelEvaluator.evaluateRegressionTable(RegressionModelEvaluator.java:194)
	at org.jpmml.evaluator.RegressionModelEvaluator.evaluateClassification(RegressionModelEvaluator.java:146)
	at org.jpmml.evaluator.RegressionModelEvaluator.evaluate(RegressionModelEvaluator.java:70)
	at org.jpmml.evaluator.ModelEvaluator.evaluate(ModelEvaluator.java:346)

Thank you very much beforehand!

Load PMML to spark

Great, I succeeded in my demo.
Also, I want to load PMML into Spark; have you considered this?

Problem with underscore when using RegexTokenizer()

Hello,
Here is an issue I'm facing when using RegexTokenizer:
When using RegexTokenizer in a Spark pipeline, jpmml-sparkml allows two types of patterns:
"\s+" and "\W+".
When using "\W+" with gaps=True, non-alphanumeric characters are removed, but underscores ("_") are not, because the regex class \W matches any character outside [a-zA-Z0-9_].
However, when underscores appear in the text, the function toPMMLBytes returns an error related to the underscore.
So it looks like underscores cannot be removed, but also cannot be left in.

Thanks
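
A minimal Java sketch (illustrative only, not jpmml-sparkml code) of why underscores survive a "\W+" split:

public class UnderscoreDemo {

	public static void main(String... args){
		// \W matches any character NOT in [a-zA-Z0-9_], so underscores survive
		System.out.println("foo_bar,baz".replaceAll("\\W+", " ")); // foo_bar baz
		// Extending the character class with an underscore removes it as well
		System.out.println("foo_bar,baz".replaceAll("[\\W_]+", " ")); // foo bar baz
	}
}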

java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got vector type

Hello,

I encountered some problems when using the JPMML model transformation. This is my data source:
val trainingDataFrame = spark.read.format("libsvm").load(libsvmDataPath).toDF("label", "features")
The schema of "trainingDataFrame" contains the VectorUDT type, so when I use ConverterUtil.toPMML(newSchema, loadedModel), it throws a java.lang.IllegalArgumentException.
Here is the code:

  val training = spark.read.format("libsvm").load(libsvmDataPath).toDF("label", "features")

  val vi = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexed")
      .setMaxCategories(693)

   val pca = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(3)

   val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.3)
      .setElasticNetParam(0.8)
      .setProbabilityCol("myProbability")

    val pipeline = new Pipeline().setStages(Array(vi, pca, lr))

    val model = pipeline.fit(training)

    model.write.overwrite().save(modelSavePath)

    training.show(10)
    println("==========================")
    println("traing dataframe's schema is:  " + training.schema.mkString)
    println("==========================")
    val schema = training.schema
    val pmml = ConverterUtil.toPMML(schema, model)
    JAXBUtil.marshalPMML(pmml, new StreamResult(System.out))

The full stack trace is:

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|
|  1.0|(692,[152,153,154...|
|  1.0|(692,[151,152,153...|
|  0.0|(692,[129,130,131...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[99,100,101,...|
|  0.0|(692,[154,155,156...|
|  0.0|(692,[127,128,129...|
+-----+--------------------+
only showing top 10 rows

==========================
traing dataframe's schema is: 	
StructField(label,DoubleType,true)StructField(features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)
==========================
Exception in thread "main" java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got vector type
	at org.jpmml.sparkml.SparkMLEncoder.createDataField(SparkMLEncoder.java:160)
	at org.jpmml.sparkml.SparkMLEncoder.getFeatures(SparkMLEncoder.java:73)
	at org.jpmml.sparkml.feature.VectorIndexerModelConverter.encodeFeatures(VectorIndexerModelConverter.java:56)
	at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:47)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:75)
	at com.myhexin.oryx.batchlayer.TestPMML$.trainModel(TestPMML.scala:138)
	at com.myhexin.oryx.batchlayer.TestPMML$.main(TestPMML.scala:29)
	at com.myhexin.oryx.batchlayer.TestPMML.main(TestPMML.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

What should I do to solve this VectorUDT unsupported problem?

Add VectorToScalar transformer class

Use case: A classification model is returning a probability distribution. The data scientist wants to extract the probability of a specific class out of it, and apply further transformations to it ("decision engineering").

The probability distribution is returned as VectorUDT. It is possible to splice it into a one-element VectorUDT using ml.feature.VectorSlicer. However, most common transformer classes (e.g. ml.feature.Bucketizer) refuse to accept a vector as input.

The VectorToScalar pseudo-transformer class would simply unwrap a single-element vector to a scalar numeric value (i.e. int, float or double). The data type of the output column can be manually overridden.
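
Until such a class exists, a Spark-side workaround can be sketched with a plain UDF (the spark session, predictions dataframe and column names below are hypothetical). Note that a UDF is not convertible to PMML, which is exactly why a dedicated pseudo-transformer class is being requested:

import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

// Unwrap the single element of a one-element vector column to a double
spark.udf().register("vectorToScalar",
	(UDF1<Vector, Double>) vector -> vector.apply(0),
	DataTypes.DoubleType);

Dataset<Row> result = predictions.withColumn("classProbability",
	callUDF("vectorToScalar", col("slicedProbability")));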

spark gbtmodel Segmentation, MiningField as feature?

Hello, my PMML is as follows. I do not know why "label" has usageType="target" in the top-level MiningSchema, but "label" is an active field in MiningModel/Segmentation/Segment[@id=1]?

<DataDictionary>
		<DataField name="label" optype="categorical" dataType="double">
			<Value value="0.0"/>
			<Value value="1.0"/>
		</DataField>
		<DataField name="feature_442" optype="continuous" dataType="double"/>
		<DataField name="feature_443" optype="continuous" dataType="double"/>
		<DataField name="feature_481" optype="continuous" dataType="double"/>
		<DataField name="feature_894" optype="continuous" dataType="double"/>
		<DataField name="feature_1862" optype="continuous" dataType="double"/>
	</DataDictionary>
	<MiningModel functionName="classification">
		<MiningSchema>
			<MiningField name="label" usageType="target"/>
			<MiningField name="feature_442"/>
			<MiningField name="feature_443"/>
			<MiningField name="feature_481"/>
			<MiningField name="feature_894"/>
			<MiningField name="feature_1862"/>
		</MiningSchema>
		<Segmentation multipleModelMethod="modelChain">
			<Segment id="1">
				<True/>
				<MiningModel functionName="regression">
					<MiningSchema>
						<MiningField name="feature_442"/>
						<MiningField name="feature_443"/>
						<MiningField name="feature_481"/>
						<MiningField name="feature_894"/>
						<MiningField name="feature_1862"/>
						<MiningField name="label"/>
					</MiningSchema>
					<Output>
						<OutputField name="gbtValue" optype="continuous" dataType="double" feature="predictedValue" isFinalResult="false"/>
						<OutputField name="binarizedGbtValue" optype="continuous" dataType="double" feature="transformedValue" isFinalResult="false">
							<Apply function="if">
								<Apply function="greaterThan">
									<FieldRef field="gbtValue"/>
									<Constant dataType="double">0</Constant>
								</Apply>
								<Constant dataType="double">-1</Constant>
								<Constant dataType="double">1</Constant>
							</Apply>
						</OutputField>
					</Output>
					<Segmentation multipleModelMethod="sum">
						<Segment id="1">
							<True/>
							<TreeModel functionName="regression" splitCharacteristic="binarySplit">
								<MiningSchema>
									<MiningField name="label"/>
								</MiningSchema>
								<Node score="-0.08980349484734046">
									<True/>
									<Node score="-1">
										<SimplePredicate field="label" operator="lessOrEqual" value="0"/>
									</Node>
									<Node score="1">
										<SimplePredicate field="label" operator="greaterThan" value="0"/>
									</Node>
								</Node>
							</TreeModel>
						</Segment>
						<Segment id="2">
							<True/>
							<TreeModel functionName="regression" splitCharacteristic="binarySplit">
								<MiningSchema>
									<MiningField name="feature_442"/>
									<MiningField name="feature_443"/>
									<MiningField name="feature_481"/>
									<MiningField name="feature_894"/>
									<MiningField name="feature_1862"/>
									<MiningField name="label"/>
								</MiningSchema>
								<Targets>
									<Target rescaleFactor="0.1"/>
								</Targets>
								<Node score="-0.04281935597440249">
									<True/>
									<Node score="-0.47681168808845653">
										<SimplePredicate field="label" operator="lessOrEqual" value="0"/>
										<Node score="-0.47681168808847174">
											<SimplePredicate field="feature_442" operator="lessOrEqual" value="-0.5888127277121523"/>
											<Node score="-0.4768116880884725">
												<SimplePredicate field="feature_894" operator="lessOrEqual" value="-0.6830283900955506"/>
											</Node>
											<Node score="-0.47681168808847285">
												<SimplePredicate field="feature_894" operator="greaterThan" value="-0.6830283900955506"/>
											</Node>
										</Node>
										<Node score="-0.47681168808847096">
											<SimplePredicate field="feature_442" operator="greaterThan" value="-0.5888127277121523"/>
											<Node score="-0.47681168808847013">
												<SimplePredicate field="feature_443" operator="lessOrEqual" value="-1.2352702594745397"/>
											</Node>
											<Node score="-0.4768116880884723">
												<SimplePredicate field="feature_443" operator="greaterThan" value="-1.2352702594745397"/>
											</Node>
										</Node>
									</Node>
									<Node score="0.47681168808845853">
										<SimplePredicate field="label" operator="greaterThan" value="0"/>
										<Node score="0.47681168808846963">
											<SimplePredicate field="feature_1862" operator="lessOrEqual" value="-1.38258310890975"/>
											<Node score="0.4768116880884702">
												<SimplePredicate field="feature_481" operator="lessOrEqual" value="-1.128558484240802"/>
											</Node>
											<Node score="0.4768116880884703">
												<SimplePredicate field="feature_481" operator="greaterThan" value="-1.128558484240802"/>
											</Node>
										</Node>
										<Node score="0.47681168808847163">
											<SimplePredicate field="feature_1862" operator="greaterThan" value="-1.38258310890975"/>
										</Node>
									</Node>
								</Node>
							</TreeModel>
						</Segment>
					</Segmentation>
				</MiningModel>
			</Segment>
...

The StringIndexerModelConverter stores labels instead of indices in .pmml file, right?

Sorry to bother you again.

The schema of training data is as follows:
column: a1; data type: double; role: feature
column: a2; data type: double; role: feature
column: a3; data type: double; role: label

And there are only two values(-1.0, 1.0) in the label column(a3).

In order to train a pipeline, I put StringIndexer, VectorIndexer and Decision Tree Classifier together.
new Pipeline() .setStages(Array(labelIndexer, vectorIndexer, classifier))

After fitting the pipeline, the model is transformed to PMML.
ConverterUtil.toPMML(schema, model.model.asInstanceOf[PipelineModel])

What confused me is that StringIndexerModelConverter stores the labels ("-1.0" and "1.0") in the PMML file instead of the indices of the labels ("0" and "1"). Is that right? Then how does jpmml-sparkml transform the labels to indices? I just cannot find the related code. Sad...

<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
    <Header>
        <Application/>
        <Timestamp>2017-02-20T03:17:32Z</Timestamp>
    </Header>
    <DataDictionary>
        <DataField name="a3" optype="categorical" dataType="double">
            <Value value="-1.0"/>
            <Value value="1.0"/>
        </DataField>
        <DataField name="a1" optype="continuous" dataType="double"/>
        <DataField name="a2" optype="continuous" dataType="double"/>
        <DataField name="prediction" optype="categorical" dataType="double"/>
    </DataDictionary>
    <TreeModel functionName="classification" splitCharacteristic="binarySplit">
        <MiningSchema>
            <MiningField name="a3" usageType="target"/>
        </MiningSchema>
        <Output>
            <OutputField name="probability_-1.0" feature="probability" value="-1.0"/>
            <OutputField name="probability_1.0" feature="probability" value="1.0"/>
        </Output>
        <Node score="-1.0" recordCount="5300.0">
            <True/>
            <ScoreDistribution value="-1.0" recordCount="2924.0"/>
            <ScoreDistribution value="1.0" recordCount="2376.0"/>
        </Node>
    </TreeModel>
</PMML>

Another question: does jpmml-spark support loading a pipeline model (including transformers and a classifier or a regressor) from a PMML file? I used jpmml-spark to load the above pipeline model from the PMML file, but it seems the StringIndexerModel doesn't work correctly.

Looking forward to your reply. Thanks a lot!

Classloading problem

I'm facing an unsolvable classloading problem with HDP 2.4.

The Spark assembly in HDP 2.4 ships an incompatible version of JPMML that is not shaded.

For the line ConverterUtil.toPMML(schema, pipe), we get this error:

java.lang.NoSuchMethodError: org.dmg.pmml.DataField.setOpType(Lorg/dmg/pmml/OpType;)Lorg/dmg/pmml/DataField;
        at org.jpmml.sparkml.FeatureMapper.toContinuous(FeatureMapper.java:185)
        at org.jpmml.sparkml.FeatureMapper.createSchema(FeatureMapper.java:135)
        at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:123)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:32)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:41)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:43)
        at $iwC$$iwC$$iwC.<init>(<console>:45)
        at $iwC$$iwC.<init>(<console>:47)
        at $iwC.<init>(<console>:49)
        at <init>(<console>:51)
        at .<init>(<console>:55)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
        at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
        at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Support for `boolean` target fields

A binary classification model with boolean target field cannot be converted:

val formula = new org.apache.spark.ml.feature.RFormula().setFormula("ResultConverted ~ distancevaluefromcentrallocation + availabilitynext3days")

Schema:

root
|-- ResultConverted: boolean (nullable = true)
|-- distancevaluefromcentrallocation: double (nullable = true)
|-- availabilitynext3days: long (nullable = true)

The exception is:

Exception in thread "main" java.lang.IllegalArgumentException: Expected 2 target categories, got 0 target categories
      at org.jpmml.sparkml.ModelConverter.encodeSchema(ModelConverter.java:134)
      at org.jpmml.sparkml.ModelConverter.registerModel(ModelConverter.java:161)
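
A possible workaround sketch, assuming the boolean label can be re-expressed before it enters the RFormula (the df variable is hypothetical; the column name is taken from the schema above):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Cast the boolean target to a string so that it is encoded as a
// two-category categorical field (an untested workaround sketch)
Dataset<Row> prepared = df.withColumn("ResultConverted",
	df.col("ResultConverted").cast("string"));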

Float not supported by SparkMLEncoder

The code appears to accept only fields of data type String, Integer, Double, or Boolean.

My use case includes float columns and generates the exception below:

Caused by: java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got float type
	at org.jpmml.sparkml.SparkMLEncoder.createDataField(SparkMLEncoder.java:303)
	at org.jpmml.sparkml.SparkMLEncoder.getFeatures(SparkMLEncoder.java:232)
	at org.jpmml.sparkml.feature.VectorAssemblerConverter.encodeFeatures(VectorAssemblerConverter.java:43)
	at org.jpmml.sparkml.SparkMLEncoder.append(SparkMLEncoder.java:74)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:123)
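
A possible workaround sketch until the float type is supported: cast float columns to double before fitting and converting (the df variable and column name are hypothetical):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Cast a float column to double so that SparkMLEncoder accepts it
Dataset<Row> casted = df.withColumn("floatColumn",
	df.col("floatColumn").cast("double"));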

mvn: Importing the multiarray numpy extension module failed

When I try to build the jpmml-sparkml package with the pyspark profile:
mvn -Ppyspark clean package

I am getting an error:

Traceback (most recent call last):
  File "setup.py", line 3, in <module>
    from jpmml_sparkml import __license__, __version__
  File "/home/bluedata/jpmml-sparkml-package/target/egg-sources/jpmml_sparkml/__init__.py", line 4, in <module>
    from pyspark.ml.common import _py2java
  File "/usr/lib/spark/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/__init__.py", line 22, in <module>
    from pyspark.ml.base import Estimator, Model, Transformer
  File "/usr/lib/spark/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/base.py", line 21, in <module>
    from pyspark.ml.param import Params
  File "/usr/lib/spark/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/param/__init__.py", line 26, in <module>
    import numpy as np
  File "/usr/lib64/python3.4/site-packages/numpy/__init__.py", line 142, in <module>
    from . import add_newdocs
  File "/usr/lib64/python3.4/site-packages/numpy/add_newdocs.py", line 13, in <module>
    from numpy.lib import add_newdoc
  File "/usr/lib64/python3.4/site-packages/numpy/lib/__init__.py", line 8, in <module>
    from .type_check import *
  File "/usr/lib64/python3.4/site-packages/numpy/lib/type_check.py", line 11, in <module>
    import numpy.core.numeric as _nx
  File "/usr/lib64/python3.4/site-packages/numpy/core/__init__.py", line 26, in <module>
    raise ImportError(msg)
ImportError:
Importing the multiarray numpy extension module failed.  Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control).  Otherwise reinstall numpy.

Original error was: **cannot import name multiarray**

[ERROR] Command execution failed.
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
        at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404)
        at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166)
        at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:764)
        at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:711)
        at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:289)
        at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
        at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
        at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
        at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.jav
        at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
        at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:309)
        at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:194)
        at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:107)
        at org.apache.maven.cli.MavenCli.execute(MavenCli.java:993)
        at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:345)
        at org.apache.maven.cli.MavenCli.main(MavenCli.java:191)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
        at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
        at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
        at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)

My PYTHONPATH variable:
/usr/lib64/python3.4/site-packages:/usr/lib/spark/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip:/usr/lib/spark/spark-2.1.1-bin-hadoop2.7/python/:/opt/bluedata/vagent/vagent/python:/opt/bluedata/vagent/vagent/python

I uninstalled numpy (numpy-1.13.0) and installed again - no progress.

This error does not appear when I build without the pyspark profile:
mvn clean package
However, no EGG file is created, and when I try to run the code from Zeppelin:

from jpmml_sparkml import toPMMLBytes
pmmlBytes = toPMMLBytes(sc, vectorized_CV_data, CV_model)
print(pmmlBytes.decode("UTF-8"))

I am getting:
Traceback (most recent call last): File "/tmp/zeppelin_pyspark-1803236182413842559.py", line 337, in <module> exec(code) File "<stdin>", line 1, in <module> ImportError: No module named 'jpmml_sparkml'

Any help on how to solve this issue would be appreciated.

Thanks, Michal

How to get features in encodeFeatures method of VectorIndexerModelConverter?

Suppose we obtained a VectorIndexerModel with a param inputCol = "features", which specifies the name of the input column (its type is Vector). Now, how does one get the features in the encodeFeatures method of VectorIndexerModelConverter?

It seems that this project doesn't support the Vector type.

Looking forward to your reply. Thanks!

Support for Decimal Types?

Not sure if this is a noob question, but I'm wondering why there seems to be no support for DecimalType inputs to models?

When my featuresDF includes these types, I get the following error:

IllegalArgumentExceptionTraceback (most recent call last)
<ipython-input-46-d8430332ffa6> in <module>()
      1 from jpmml_sparkml import toPMMLBytes
----> 2 pmmlBytes = toPMMLBytes(spark, DF, pipelineModel)
      3 print(pmmlBytes)

/home/hadoop/pyenv/eggs/jpmml_sparkml-1.1rc0-py2.7.egg/jpmml_sparkml/__init__.pyc in toPMMLBytes(sc, df, pipelineModel)
     17         if(not isinstance(javaConverter, JavaClass)):
     18                 raise RuntimeError("JPMML-SparkML not found on classpath")
---> 19         return javaConverter.toPMMLByteArray(javaSchema, javaPipelineModel)

/usr/lib/spark/python/lib/py4j-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     77                 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
     78             if s.startswith('java.lang.IllegalArgumentException: '):
---> 79                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
     80             raise
     81     return deco

IllegalArgumentException: u'Expected string, integral, double or boolean type, got decimal(18,0) type'

I was eventually able to address this in pyspark with the following pre-model hack:

DF = ...
# convert all decimals to double
for f in DF.schema.fields:
    d = json.loads(f.json())
    if 'decimal' in d["type"]:
        DF = DF.withColumn(d['name'], DF[d["name"]].cast("double"))

However, I'm curious why DecimalType, which is effectively synonymous with DoubleType, is not natively supported?

Add support for `FPGrowth` model type

Getting the following exception -

java.lang.IllegalArgumentException: Transformer class org.apache.spark.ml.fpm.FPGrowthModel is not supported

I am trying to convert an FPGrowth model into PMML. Is it not supported?

Support for 32-bit float type?

Hi @vruusmann,
I get java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got float type when the training data's schema contains a float type.
I think we should add FloatType to cover more general circumstances; maybe the code here could be changed?

switch(dataType){
	case STRING:
		feature = new WildcardFeature(this, dataField);
		break;
	case INTEGER:
	case DOUBLE:
		feature = new ContinuousFeature(this, dataField);
		break;
	case BOOLEAN:
		feature = new BooleanFeature(this, dataField);
		break;
	default:
		throw new IllegalArgumentException("Data type " + dataType + " is not supported");
}

Thank you!
Bests,
Yuanda
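
A hedged sketch of the suggested change (assuming the DataType enum provides a FLOAT constant and that float features can be treated as continuous):

switch(dataType){
	case STRING:
		feature = new WildcardFeature(this, dataField);
		break;
	case FLOAT: // hypothetical addition: treat float like the other continuous types
	case INTEGER:
	case DOUBLE:
		feature = new ContinuousFeature(this, dataField);
		break;
	case BOOLEAN:
		feature = new BooleanFeature(this, dataField);
		break;
	default:
		throw new IllegalArgumentException("Data type " + dataType + " is not supported");
}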

Add "dummy" estimator classes

Hello:
When I use the converter like this:

val oneHotPMML = ConverterUtil.toPMML(onehotSource.schema, oneHotModel)

I get an error like this:

Exception in thread "main" java.lang.IllegalArgumentException: Expected a pipeline with one or more models, got a pipeline with zero models
	at com.netease.mail.yanxuan.rms.utils.ConverterUtil.toPMML(ConverterUtil.java:118)
	at com.netease.mail.yanxuan.rms.scala.nn.feature.FeatureModelExport$.main(FeatureModelExport.scala:29)
	at com.netease.mail.yanxuan.rms.scala.nn.feature.FeatureModelExport.main(FeatureModelExport.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

After debugging, I found the reason: there is no ModelConverter stage in my pipeline.
Is it necessary that my PipelineModel contains at least one model backed by a ModelConverter?

Request matching mode and setMinTokenLength support for RegexTokenizer

Hello Villu,
Thank you for this great package for exporting Spark ML models. But this package does not seem easy to work with:

My input: a column named 'sentence'
My output: a column named 'prediction' produced by logistic classification for the column 'sentence'
My pipeline: RegexTokenizer -> NGram -> CountVectorizer -> IDF -> VectorAssembler -> LogisticRegression

Problem 1:
My RegexTokenizer code is as below:

tokenizer = feature.RegexTokenizer() \
  .setGaps(False) \
  .setPattern("\\b[a-zA-Z]{3,}\\b") \
  .setInputCol("sentence") \
  .setOutputCol("words")

But it throws an error:

IllegalArgumentException: 'Expected splitter mode, got token matching mode'

So I thought to implement the tokenizer myself and pass a column of token arrays as input, but then I got:

Problem 2:

IllegalArgumentException: Expected string, integral, double or boolean type, got vector type

After tracking some issues, I understand that the vector type is not supported, so I have to consider building the pipeline from the tokenizer again. I then changed my tokenizer to splitter mode:

tokenizer = feature.RegexTokenizer() \
  .setGaps(True) \
  .setPattern("\\s+") \
  .setInputCol("sentence") \
  .setOutputCol("words")

Then I got:

Problem 3:

 java.lang.IllegalArgumentException: .
	at org.jpmml.sparkml.feature.CountVectorizerModelConverter.encodeFeatures(CountVectorizerModelConverter.java:118)
	at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:48)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:80)

After checking the source code at line 118, there is a requirement that a token cannot start or end with punctuation. But this happens a lot. For example, in "This is a sentence.", after splitting by space, the last token ends with a period ( . ). In this case, if matching mode were supported, the pattern "\b[a-zA-Z]{3,}\b" could extract 'clean' tokens easily.
I had no choice but to continue hacking. I then tried to split the sentence by the pattern \\b[^a-zA-Z]{0,}\\b, which splits the text on non-English letters, and then filtered the tokens by setting the minimum token length to 3. This works fine in Spark, but when I export the pipeline, I get another error:

Problem 4:

java.lang.IllegalArgumentException: Expected 1 as minimum token length, got 3 as minimum token length
	at org.jpmml.sparkml.feature.RegexTokenizerConverter.encodeFeatures(RegexTokenizerConverter.java:51)

As it reads, setMinTokenLength is not supported in jpmml-sparkml (only the default minimum token length of 1 is accepted).

I'm really frustrated, since this is a simple and typical task. I've tried different means to overcome it, but all failed. Could you please point me in the right direction? Thank you.

java.lang.ClassNotFoundException: org.jpmml.sparkml.feature.NGramConverter

Hello,
I'm trying to use PMML export on a Spark ML model and I am getting a java.lang.ClassNotFoundException error when calling ConverterUtil.toPMML.

I dealt with those conflicts by referring to README.md and employing the Maven Shade Plugin.

Here is my pom.xml file:

  <dependency>
  	<groupId>org.jpmml</groupId>
  	<artifactId>jpmml-sparkml</artifactId>
  	<version>1.3.3</version>
  	<scope>compile</scope>
  </dependency>

  <build>
  	<resources>
  		<resource>
  			<directory>src/main/resources</directory>
  			<excludes>
  				<exclude>**/*.xml</exclude>
  			</excludes>
  			<filtering>true</filtering>
  		</resource>
  	</resources>
  	<plugins>
  		<plugin>
  			<groupId>net.alchim31.maven</groupId>
  			<artifactId>scala-maven-plugin</artifactId>
  			<version>3.2.1</version>
  			<executions>
  				<execution>
  					<id>compile</id>
  					<goals>
  						<goal>compile</goal>
  					</goals>
  					<phase>process-resources</phase>
  				</execution>
  			</executions>
  			<configuration>
  				<scalaVersion>${scala.version}</scalaVersion>
  			</configuration>
  		</plugin>
  		<plugin>
  			<groupId>org.apache.maven.plugins</groupId>
  			<artifactId>maven-shade-plugin</artifactId>
  			<version>${maven.shade.version}</version>
  			<executions>
  				<execution>
  					<phase>package</phase>
  					<goals>
  						<goal>shade</goal>
  					</goals>
  					<configuration>
  						<relocations>
  							<relocation>
  								<pattern>org.dmg.pmml</pattern>
  								<shadedPattern>org.shaded.dmg.pmml</shadedPattern>
  							</relocation>
  							<relocation>
  								<pattern>org.jpmml</pattern>
  								<shadedPattern>org.shaded.jpmml</shadedPattern>
  							</relocation>
  						</relocations>
  					</configuration>
  				</execution>
  			</executions>
  		</plugin>
  	</plugins>
  </build>

Here is the error stack trace:

18/03/06 21:40:01 WARN sparkml.ConverterUtil: Failed to load transformer converter class
java.lang.ClassNotFoundException: org.jpmml.sparkml.feature.NGramConverter
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at org.shaded.jpmml.sparkml.ConverterUtil.init(ConverterUtil.java:351)
	at org.shaded.jpmml.sparkml.ConverterUtil.init(ConverterUtil.java:318)
	at org.shaded.jpmml.sparkml.ConverterUtil.<clinit>(ConverterUtil.java:369)
	at com.nubia.train.Ad_ctr_train_PMML$.main(Ad_ctr_train_PMML.scala:157)
	at com.nubia.train.Ad_ctr_train_PMML.main(Ad_ctr_train_PMML.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/03/06 21:40:01 WARN sparkml.ConverterUtil: Failed to load transformer class
java.lang.ClassNotFoundException: org.apache.spark.ml.feature.MaxAbsScalerModel
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at org.shaded.jpmml.sparkml.ConverterUtil.init(ConverterUtil.java:341)
	at org.shaded.jpmml.sparkml.ConverterUtil.init(ConverterUtil.java:318)
	at org.shaded.jpmml.sparkml.ConverterUtil.<clinit>(ConverterUtil.java:369)
	at com.nubia.train.Ad_ctr_train_PMML$.main(Ad_ctr_train_PMML.scala:157)
	at com.nubia.train.Ad_ctr_train_PMML.main(Ad_ctr_train_PMML.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
......

Is it because of the Maven Shade Plugin?
Thanks in advance :)

Support transformed labels

Running Spark 2.1.2, using jpmml-sparkml 1.2.7.

While attempting to run the following pyspark code in order to convert a simple pipeline with a RandomForestClassifier model with either toPMMLByteArray or toPMML, I'm receiving a NullPointerException.

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import *

def updateFlightsSchema(dataSet):
    return ( dataSet.withColumn("DepDelay_Double",  dataSet["DepDelay"].cast("Double"))
                    .withColumn("DepDelay",         dataSet["DepDelay"].cast("Double"))
                    .withColumn("ArrDelay",         dataSet["ArrDelay"].cast("Double"))
                    .withColumn("Month",            dataSet["Month"].cast("Double"))
                    .withColumn("DayofMonth",       dataSet["DayofMonth"].cast("Double"))
                    .withColumn("CRSDepTime",       dataSet["CRSDepTime"].cast("Double"))
                    .withColumn("Distance",         dataSet["Distance"].cast("Double"))
                    .withColumn("AirTime",          dataSet["AirTime"].cast("Double"))
            )
    
data2007 = updateFlightsSchema(sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("mode", "DROPMALFORMED").load("2007_short.csv"))

removeCancelled = SQLTransformer(statement="select * from __THIS__ where Cancelled = \"0\" AND Diverted = \"0\"")
data2007 = removeCancelled.transform(data2007)

binarizer = Binarizer(threshold=15.0, inputCol="DepDelay_Double", outputCol="DepDelay_Bin")
featuresAssembler = VectorAssembler(inputCols=["Month", "CRSDepTime", "Distance"], outputCol="features")
rfc3 = RandomForestClassifier(labelCol="DepDelay_Bin", featuresCol="features", numTrees=3, maxDepth=5, seed=10305)

pipelineRF3 = Pipeline(stages=[binarizer, featuresAssembler, rfc3])

model3 = pipelineRF3.fit(data2007)

from py4j.java_gateway import JavaClass
from pyspark.ml.common import _py2java

javaDF = _py2java(sc, data2007)
javaSchema = javaDF.schema.__call__()

jvm = sc._gateway.jvm

javaConverter = sc._gateway.jvm.org.jpmml.sparkml.ConverterUtil
if(not isinstance(javaConverter, JavaClass)):
    raise RuntimeError("JPMML-SparkML not found on classpath")

pmml = jvm.org.jpmml.sparkml.ConverterUtil.toPMMLByteArray(javaSchema, model3._to_java())

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.jpmml.sparkml.ConverterUtil.toPMMLByteArray.
: java.lang.NullPointerException
	at org.jpmml.converter.CategoricalLabel.<init>(CategoricalLabel.java:35)
	at org.jpmml.sparkml.ModelConverter.encodeSchema(ModelConverter.java:82)
	at org.jpmml.sparkml.ModelConverter.registerModel(ModelConverter.java:162)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:86)
	at org.jpmml.sparkml.ConverterUtil.toPMMLByteArray(ConverterUtil.java:142)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)

Following #22 I attempted to use the different Indexers on features and label columns to try and hint that these are categorical, but this resulted in the same error. Further, when I print the final tree, I do not see categorical feature declarations.

Dataset used, and tree output attached.
2007_short.zip
rfc.txt

Handling columns with null values

Exception in thread "main" java.lang.IllegalArgumentException: Field a1 has valid values [b, a]
	at org.jpmml.converter.PMMLEncoder.toCategorical(PMMLEncoder.java:189)
	at org.jpmml.sparkml.feature.VectorIndexerModelConverter.encodeFeatures(VectorIndexerModelConverter.java:98)
	at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:48)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:96)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:68)

I get the above exception when the column has null values. Any ideas on how to resolve this? Please comment if further details are needed.
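
A common pre-conversion workaround is to drop or fill the null values before fitting, so that the fitted transformer never records a null category. A minimal sketch using the Spark Java API (whether dropping or filling fits the modeling semantics is application-specific):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Drop rows where the problematic column is null...
Dataset<Row> cleaned = dataset.na().drop(new String[]{"a1"});
// ...or replace nulls with an explicit placeholder category
Dataset<Row> filled = dataset.na().fill("missing", new String[]{"a1"});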

jpmml-sparkml with python 3

How do I get mvn -Ppyspark clean package to use Python 3.5 instead of Python 2.7?

Thank you for your help.

Error: org.apache.spark.ml.feature.VectorAssembler is not supported

I ran the program in spark-local mode successfully. But when I ran the code on spark-yarn online, the following error message occurred (I have shaded org.jpmml to org.shaded.jpmml):

java.lang.IllegalArgumentException: Transformer class org.apache.spark.ml.feature.VectorAssembler is not supported
	at org.shaded.jpmml.sparkml.ConverterFactory.newConverter(ConverterFactory.java:53)
	at org.shaded.jpmml.sparkml.PMMLBuilder.build(PMMLBuilder.java:109)
	at com.iqiyi.columbus.joint_prevent.model.pmml.PMMLModelLocal$.saveModelAndEvaluate(PMMLModelLocal.scala:93)
	at com.iqiyi.columbus.joint_prevent.model.pmml.PMMLModelLocal$.main(PMMLModelLocal.scala:134)
	at com.iqiyi.columbus.joint_prevent.model.pmml.PMMLModelLocal.main(PMMLModelLocal.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)

Here are the stages of the PipelineModel and the code for building the PMML:
val pipeline = new Pipeline().setStages(Array(vectorAssem, labelIndexer, featureIndexer, gbt, labelConverter))

val pipelineModel = trainPipelineModel(data, trainingData)
val pmml = new PMMLBuilder(schema, pipelineModel).build()

I used Spark 2.2.0, jpmml-sparkml 1.3.8, pmml-model 1.4.3(or 1.4.2, both failed).

Add support for `TrainValidationSplitModel` transformation type

Most Spark ML tutorials include this (pseudo-)transformation type in sample workflows. From the PMML perspective this is a no-op transformation, which can simply be skipped.

Currently, users have to manually "re-package" their fitted pipeline models, which is prone to error. Example issue - repackaging a fitted pipeline model, and neglecting label and feature column definitions: #18 (comment)
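
For reference, a sketch of that manual re-packaging in Java (variable names are illustrative; the same pattern appears verbatim in a later issue in this document):

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.Transformer;
import org.apache.spark.ml.tuning.TrainValidationSplitModel;

// Unwrap the no-op TrainValidationSplitModel stage and re-package its best
// model into a new PipelineModel that the converter can handle
TrainValidationSplitModel tvsm = (TrainValidationSplitModel)pipelineModel.stages()[0];

List<Transformer> stages = new ArrayList<>();
stages.add(tvsm.bestModel());

PipelineModel repackagedPipelineModel = new PipelineModel(UUID.randomUUID().toString(), stages);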

Custom Estimator to add to JPMML-SPARKML

Hi

I have a custom estimator that merges rare categorical values into a single 'RARE' value, so that I can group all the rare labels together. I would like to know whether it is possible, and how, to add my custom model converter, as you did for the standard Spark ML features.

To give an example: my custom estimator handles rare values in categorical columns. So if there are 1000 categories and only 30 of them are used most of the time, the remaining 970 categories will be marked as RARE. In my model I only save the rare labels. If needed, I can paste the code itself as well.

Even if I manage it, I am not sure if JPMML-Evaluator will be able to run it.

val pmml_model = new PMMLBuilder(schema,pipeline_model).build() => error skip

When I train the sample, I use StringIndexer and execute the following statement:
val pmml_model = new PMMLBuilder(schema, pipeline_model).build()

The error is:

Exception in thread "main" java.lang.IllegalArgumentException: skip
	at org.jpmml.sparkml.feature.StringIndexerModelConverter.encodeFeatures(StringIndexerModelConverter.java:65)
	at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:47)
	at org.jpmml.sparkml.PMMLBuilder.build(PMMLBuilder.java:114)
	at com.nubia.train.Ad_ctr_train$.main(Ad_ctr_train.scala:182)
	at com.nubia.train.Ad_ctr_train.main(Ad_ctr_train.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:745)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

java.lang.NoClassDefFoundError: org/dmg/pmml/mining/MiningModel

I'm testing JPMML-SparkML in the spark-shell. We are running it on top of YARN, using Spark 2.0.1 and Scala 2.11.

I built the JAR for the package and start the session like this:

$SPARK_HOME/bin/spark-shell --jars jpmml-sparkml-1.1-SNAPSHOT.jar --packages com.databricks:spark-avro_2.11:3.0.1 --master yarn --deploy-mode client

However, I get an error when exporting a pipeline with toPMMLByteArray.

import org.jpmml.sparkml.ConverterUtil

// ... all of the code to create the pipeline

val sparkPipelinePMMLEstimator = new Pipeline().setStages( categoricalFeatureIndexers.union(categoricalFeatureOneHotEncoders.union(Seq(featureAssemblerLr, featureAssemblerRf) )) :+ randomForest)

val sparkPipelinePMML = sparkPipelinePMMLEstimator.fit(df)
val pmmlBytes = org.jpmml.sparkml.ConverterUtil.toPMMLByteArray(df.schema, sparkPipelinePMML)

This fails with the following error:

scala> val pmmlBytes = org.jpmml.sparkml.ConverterUtil.toPMMLByteArray(df.schema, sparkPipelinePMML)
java.lang.NoClassDefFoundError: org/dmg/pmml/mining/MiningModel
  ... 48 elided
Caused by: java.lang.ClassNotFoundException: org.dmg.pmml.mining.MiningModel
  at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 48 more

It seems like the package org.dmg.pmml.mining is missing. How can I fix this issue?

Label field is required and Features field cannot be Vector for Random Forest Regression

I have a generated RandomForestRegressionModel which was created somewhat similarly to what is done here: https://spark.apache.org/docs/2.1.0/ml-classification-regression.html#random-forest-regression . The main difference with mine is that the features vector is created using a VectorAssembler, and only the generated RandomForestRegressionModel is in the pipeline that I'm trying to export to PMML.

PipelineModel model = pipeline.fit(trainingData);

TrainValidationSplitModel tvsm = (TrainValidationSplitModel) model.stages()[0];
RandomForestRegressionModel rfrm = (RandomForestRegressionModel) tvsm.bestModel();

List<Transformer> stages = new ArrayList<>();
stages.add(rfrm);

final PipelineModel pipelineModel = new PipelineModel(
    UUID.randomUUID().toString(),
    stages);

StructType schema = testData.schema();

PMML pmml = ConverterUtil.toPMML(schema, pipelineModel);

When trying to export the model to PMML, I get the following exception stating that the label field doesn't exist.

Exception in thread "main" java.lang.IllegalArgumentException: Field "label" does not exist.
	at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
	at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
	at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
	at scala.collection.AbstractMap.getOrElse(Map.scala:59)
	at org.apache.spark.sql.types.StructType.apply(StructType.scala:263)
	at org.jpmml.sparkml.SparkMLEncoder.createDataField(SparkMLEncoder.java:139)
	at org.jpmml.sparkml.SparkMLEncoder.getFeatures(SparkMLEncoder.java:73)
	at org.jpmml.sparkml.SparkMLEncoder.getOnlyFeature(SparkMLEncoder.java:60)
	at org.jpmml.sparkml.ModelConverter.encodeSchema(ModelConverter.java:66)
	at org.jpmml.sparkml.ModelConverter.registerModel(ModelConverter.java:161)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:76)
	at org.loesoft.playground.datascience.mllib.Foo.main(Foo.java:144)

Although I know it's probably wrong to do this, I can get around it by adding the "label" field to the schema. However, I then get the following exception further down, when the "features" field is parsed:

Exception in thread "main" java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got vector type
	at org.jpmml.sparkml.SparkMLEncoder.createDataField(SparkMLEncoder.java:160)
	at org.jpmml.sparkml.SparkMLEncoder.getFeatures(SparkMLEncoder.java:73)
	at org.jpmml.sparkml.ModelConverter.encodeSchema(ModelConverter.java:140)
	at org.jpmml.sparkml.ModelConverter.registerModel(ModelConverter.java:161)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:76)
	at org.loesoft.playground.datascience.mllib.Foo.main(Foo.java:146)

So with this, two things stand out as issues to me. The first is the requirement of the "label" field. I do not believe this is used in the execution of the model so I'm not sure why it is required. The other is the requirement that the "features" field is expected to be a string, integral, double, or boolean when the RandomForestRegressionModel requires it to be a vector.

I'm using version 1.2.2 of jpmml-sparkml and 2.1.1 of spark-mllib_2.11

Bear in mind that I'm fairly new to Spark ML and JPMML so if I am incorrect on this matter, then I would appreciate some education as to where I am wrong.

Maven Dependency For JPMML Libraries

I want to use the JPMML libraries in my project through Maven repositories. Currently I am using them by installing the JPMML JAR into my local .m2 directory, but I want direct Maven coordinates.
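
For reference, the jpmml-sparkml coordinates used elsewhere in this document look like this (the version number is illustrative; check Maven Central for the current release):

<dependency>
	<groupId>org.jpmml</groupId>
	<artifactId>jpmml-sparkml</artifactId>
	<version>1.3.3</version>
</dependency>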

Warning about JPMML-SparkML and Apache Spark ML version incompatibility

The JPMML-SparkML project contains three active development branches (1.1.X, 1.2.X and 1.3.X), which target specific Apache Spark ML versions (2.0.X, 2.1.X and 2.2.X, respectively).

Depending on the complexity of the pipeline, the following scenarios may take place when there's a version mismatch between the two:

  1. The conversion fails (eg. by throwing some sort of exception).
  2. The conversion succeeds, but the resulting PMML document is incorrect in the sense that it contains "outdated" prediction logic, so that (J)PMML and Apache Spark ML predictions differ.
  3. The conversion succeeds, and the resulting PMML document is correct.

The JPMML-SparkML library should contain special logic to rule out the first two scenarios. It should detect the version of the Apache Spark ML environment, and refuse to execute if it's not the correct one (eg. by throwing an exception that states "This version of JPMML-SparkML is compatible with Apache Spark ML version 2.X, but the current execution environment is Apache Spark ML 2.Y").
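
A minimal sketch of such a guard (illustrative names, not actual library API; at runtime the version string could be obtained eg. from JavaSparkContext#version()):

public class SparkMLVersionGuard {

	// Apache Spark ML version prefix supported by this JPMML-SparkML branch
	private static final String SUPPORTED_PREFIX = "2.2.";

	public static void checkVersion(String sparkVersion){

		if(!sparkVersion.startsWith(SUPPORTED_PREFIX)){
			throw new IllegalArgumentException("This version of JPMML-SparkML is compatible with Apache Spark ML version " + SUPPORTED_PREFIX + "X, but the current execution environment is Apache Spark ML " + sparkVersion);
		}
	}
}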

Customizing the "missing value handling"-mode of models

Hi,

I am using "jpmml-sparkml, version 1.2.4" to generate pmml models in spark (using Scala) and saving that output to local file system, but can't figure out how to set the following properties

<xs:attribute name="missingValueStrategy" type="MISSING-VALUE-STRATEGY" default="none"/>
<xs:attribute name="missingValuePenalty" type="PROB-NUMBER" default="1.0"/>
<xs:attribute name="noTrueChildStrategy" type="NO-TRUE-CHILD-STRATEGY" default="returnNullPrediction"/>

I have searched online but haven't found any clues.

Really appreciate the help.

Thanks,
Raj
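
One possible direction for the question above is to post-process the converted PMML object with the JPMML Visitor pattern before marshalling it. The following is a sketch only; the setter and enum member names are assumptions based on the schema attributes quoted earlier, so verify them against the JPMML-Model version in use:

import org.dmg.pmml.Visitor;
import org.dmg.pmml.VisitorAction;
import org.dmg.pmml.tree.TreeModel;
import org.jpmml.model.visitors.AbstractVisitor;

// Visit every TreeModel in the converted PMML object and override the
// attributes that the converter left at their schema defaults
Visitor visitor = new AbstractVisitor(){

	@Override
	public VisitorAction visit(TreeModel treeModel){
		treeModel.setMissingValueStrategy(TreeModel.MissingValueStrategy.DEFAULT_CHILD);
		treeModel.setNoTrueChildStrategy(TreeModel.NoTrueChildStrategy.RETURN_LAST_PREDICTION);
		return super.visit(treeModel);
	}
};
visitor.applyTo(pmml);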

Tests fail when installing

I have Spark pipelines from Spark 2.1, so I checked out the tag 1.2.7 to build.

The build fails; I don't think it is related to the conflicts mentioned in the README.

These are the logs

mvn clean install

[INFO] Error stacktraces are turned on.
[INFO] Scanning for projects...
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] Building JPMML-SparkML 1.2.7
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ jpmml-sparkml ---
[INFO] Deleting /home/delhivery/dev/jpmml-sparkml/target
[INFO] 
[INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-maven) @ jpmml-sparkml ---
[INFO] 
[INFO] --- maven-enforcer-plugin:1.4.1:enforce (default) @ jpmml-sparkml ---
[INFO] 
[INFO] --- jacoco-maven-plugin:0.7.9:prepare-agent (pre-unit-test) @ jpmml-sparkml ---
[INFO] jacoco.agent set to -javaagent:/home/delhivery/.m2/repository/org/jacoco/org.jacoco.agent/0.7.9/org.jacoco.agent-0.7.9-runtime.jar=destfile=/home/delhivery/dev/jpmml-sparkml/target/jacoco.exec
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ jpmml-sparkml ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 1 resource
[INFO] 
[INFO] --- maven-compiler-plugin:3.5.1:compile (default-compile) @ jpmml-sparkml ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 53 source files to /home/delhivery/dev/jpmml-sparkml/target/classes
[INFO] /home/delhivery/dev/jpmml-sparkml/src/main/java/org/jpmml/sparkml/WeightedTermFeature.java: Some input files use or override a deprecated API.
[INFO] /home/delhivery/dev/jpmml-sparkml/src/main/java/org/jpmml/sparkml/WeightedTermFeature.java: Recompile with -Xlint:deprecation for details.
[INFO] /home/delhivery/dev/jpmml-sparkml/src/main/java/org/jpmml/sparkml/ConverterUtil.java: /home/delhivery/dev/jpmml-sparkml/src/main/java/org/jpmml/sparkml/ConverterUtil.java uses unchecked or unsafe operations.
[INFO] /home/delhivery/dev/jpmml-sparkml/src/main/java/org/jpmml/sparkml/ConverterUtil.java: Recompile with -Xlint:unchecked for details.
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ jpmml-sparkml ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 62 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.5.1:testCompile (default-testCompile) @ jpmml-sparkml ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 5 source files to /home/delhivery/dev/jpmml-sparkml/target/test-classes
[INFO] 
[INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @ jpmml-sparkml ---
[INFO] Surefire report directory: /home/delhivery/dev/jpmml-sparkml/target/surefire-reports

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Exception in thread "main" java.lang.reflect.InvocationTargetException
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:564)
	at java.instrument/sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:510)
	at java.instrument/sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:522)
Caused by: java.lang.RuntimeException: Class java/util/UUID could not be instrumented.
	at org.jacoco.agent.rt.internal_8ff85ea.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:140)
	at org.jacoco.agent.rt.internal_8ff85ea.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:101)
FATAL ERROR in native method: processing of -javaagent failed
	at org.jacoco.agent.rt.internal_8ff85ea.PreMain.createRuntime(PreMain.java:55)
	at org.jacoco.agent.rt.internal_8ff85ea.PreMain.premain(PreMain.java:47)
	... 6 more
Caused by: java.lang.NoSuchFieldException: $jacocoAccess
	at java.base/java.lang.Class.getField(Class.java:1958)
	at org.jacoco.agent.rt.internal_8ff85ea.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:138)
	... 9 more
Aborted (core dumped)

Results :

Tests run: 0, Failures: 0, Errors: 0, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 8.266 s
[INFO] Finished at: 2018-05-23T16:26:03+05:30
[INFO] Final Memory: 79M/270M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test (default-test) on project jpmml-sparkml: Execution default-test of goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test failed: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
[ERROR] Command was /bin/sh -c cd /home/delhivery/dev/jpmml-sparkml && /usr/lib/jvm/java-11-openjdk-amd64/bin/java -javaagent:/home/delhivery/.m2/repository/org/jacoco/org.jacoco.agent/0.7.9/org.jacoco.agent-0.7.9-runtime.jar=destfile=/home/delhivery/dev/jpmml-sparkml/target/jacoco.exec -jar /home/delhivery/dev/jpmml-sparkml/target/surefire/surefirebooter236796039577445911.jar /home/delhivery/dev/jpmml-sparkml/target/surefire/surefire9119811316382229884tmp /home/delhivery/dev/jpmml-sparkml/target/surefire/surefire_010927105446816610308tmp
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test (default-test) on project jpmml-sparkml: Execution default-test of goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test failed: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
Command was /bin/sh -c cd /home/delhivery/dev/jpmml-sparkml && /usr/lib/jvm/java-11-openjdk-amd64/bin/java -javaagent:/home/delhivery/.m2/repository/org/jacoco/org.jacoco.agent/0.7.9/org.jacoco.agent-0.7.9-runtime.jar=destfile=/home/delhivery/dev/jpmml-sparkml/target/jacoco.exec -jar /home/delhivery/dev/jpmml-sparkml/target/surefire/surefirebooter236796039577445911.jar /home/delhivery/dev/jpmml-sparkml/target/surefire/surefire9119811316382229884tmp /home/delhivery/dev/jpmml-sparkml/target/surefire/surefire_010927105446816610308tmp
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:213)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:154)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:146)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:51)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:309)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:194)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:107)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:955)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:290)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:194)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:564)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:289)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:229)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:415)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:356)
Caused by: org.apache.maven.plugin.PluginExecutionException: Execution default-test of goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test failed: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
Command was /bin/sh -c cd /home/delhivery/dev/jpmml-sparkml && /usr/lib/jvm/java-11-openjdk-amd64/bin/java -javaagent:/home/delhivery/.m2/repository/org/jacoco/org.jacoco.agent/0.7.9/org.jacoco.agent-0.7.9-runtime.jar=destfile=/home/delhivery/dev/jpmml-sparkml/target/jacoco.exec -jar /home/delhivery/dev/jpmml-sparkml/target/surefire/surefirebooter236796039577445911.jar /home/delhivery/dev/jpmml-sparkml/target/surefire/surefire9119811316382229884tmp /home/delhivery/dev/jpmml-sparkml/target/surefire/surefire_010927105446816610308tmp
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:145)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:208)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:154)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:146)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:51)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:309)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:194)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:107)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:955)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:290)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:194)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:564)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:289)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:229)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:415)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:356)
Caused by: java.lang.RuntimeException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
Command was /bin/sh -c cd /home/delhivery/dev/jpmml-sparkml && /usr/lib/jvm/java-11-openjdk-amd64/bin/java -javaagent:/home/delhivery/.m2/repository/org/jacoco/org.jacoco.agent/0.7.9/org.jacoco.agent-0.7.9-runtime.jar=destfile=/home/delhivery/dev/jpmml-sparkml/target/jacoco.exec -jar /home/delhivery/dev/jpmml-sparkml/target/surefire/surefirebooter236796039577445911.jar /home/delhivery/dev/jpmml-sparkml/target/surefire/surefire9119811316382229884tmp /home/delhivery/dev/jpmml-sparkml/target/surefire/surefire_010927105446816610308tmp
    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork (ForkStarter.java:590)
    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork (ForkStarter.java:460)
    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run (ForkStarter.java:229)
    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run (ForkStarter.java:201)
    at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider (AbstractSurefireMojo.java:1026)
    at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked (AbstractSurefireMojo.java:862)
    at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute (AbstractSurefireMojo.java:755)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:134)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:208)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:154)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:146)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:51)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:309)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:194)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:107)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:955)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:290)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:194)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:564)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:289)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:229)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:415)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:356)
[ERROR] 
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException

Spark pipeline was not converted properly

I am writing a Scala project using a Spark pipeline with a GBDT model and StringIndexer to convert nominal columns, and I convert the pipeline model to PMML. But when I use the PMML model to predict, there is an input data mismatch. I think that is because StringIndexer is not converted to PMML, but JPMML supports the Spark StringIndexer operation, so I don't know why. Exception messages follow:

Caused by: org.jpmml.evaluator.InvalidResultException (at or around line 23)
	at org.jpmml.evaluator.FieldValueUtil.performInvalidValueTreatment(FieldValueUtil.java:178)
	at org.jpmml.evaluator.FieldValueUtil.prepareInputValue(FieldValueUtil.java:90)
	at org.jpmml.evaluator.InputField.prepare(InputField.java:64)

What would be involved in supporting Tokenizer, IDF, and HashingTF features?

Hi, first thanks for this excellent project (and also pmml-evaluator)! I have a Spark ML pipeline which uses Tokenizer, HashingTF, and IDF in order to feed a column containing text to a multiclass classifier which predicts a category. How feasible / hard would it be to support such a pipeline in jpmml-sparkml? I was thinking about taking a shot at it. Should Tokenizer get converted to an org.dmg.pmml.DocumentTermMatrix, or something else? And what about HashingTF and IDF? What pmml objects should those be converted to? Thanks in advance

Can it support Imputer?

Imputer is supported in JPMML-SkLearn, where it produces 'missingValueReplacement' and 'missingValueTreatment' attributes in the PMML. Can jpmml-sparkml support it too?
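
For reference, this is the kind of PMML markup in question: a MiningField carrying both attributes (values are illustrative):

<MiningField name="x1" missingValueReplacement="0.0" missingValueTreatment="asMean"/>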

ConverterUtil cannot transform models that were trained on sparse data

As we discussed in emails last week, I hope that this project can transform models to PMML in the second way, as shown in the following code.
Thanks.

Hello,

Sometimes my training data is sparse, in libsvm format, and the DataFrame is then suitable to be formatted as follows, rather than using RFormula in mllib.

root
|-- features: vector (nullable = false)
|-- label: double (nullable = false)

This kind of "data layout" contains very little feature information.
Sure, it could be converted to PMML, but in that case the "feature"
column would be expanded into n double columns "x1", "x2", .., "x_n".

You could open a feature request in the JPMML-SparkML issue tracker
(https://github.com/jpmml/jpmml-sparkml/issues), and I would take care
of it then. Also, please include a reproducible sample code.

VR

  def testPMML(sc: SparkContext) = {
    val rdd = sc.makeRDD(Seq((1.0, 2.0, 3.0, 0.0), (0.0, 2.0, 0.0, 3.0) , (1.0, 0.0, 0.0, 2.0)))
      .map(a => Row(a._1, Vectors.dense(Array(a._2, a._3, a._4)).toSparse))
    val schema = StructType(List(StructField("label", DoubleType), StructField("features", new VectorUDT)))
    val sqlContext = new SQLContext(sc)
    val irisData = sqlContext.createDataFrame(rdd, schema)

    val classifier = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")

    // the first way
    val pipeline = new Pipeline()
      .setStages(Array(classifier))
    val pipelineModel = pipeline.fit(irisData)
    var pmml = ConverterUtil.toPMML(schema, pipelineModel)
    JAXBUtil.marshalPMML(pmml, new StreamResult(System.out))

    // the second way
    val lrModel = classifier.fit(irisData)
    pmml = ConverterUtil.toPMML(schema, lrModel)
    JAXBUtil.marshalPMML(pmml, new StreamResult(System.out))
  }

Add support for multinomial `LogisticRegression` models

The current version of the SparkML encoder supports only scalar columns as inputs, while the majority of current implementations use vectors to describe inputs. In many cases those vectors are created using VectorAssembler, which creates mapping metadata from the vector to the individual columns from which it was produced. This metadata can be used for mapping the vector back to the individual columns. I am enclosing sample code of such an implementation for your consideration.

The other issue that I have encountered is usage of the converter for logistic regression. SparkML currently supports multiple labels, while the exporter limits this to 2. Any plans on extending this?

SparkMLEncoder.java.zip
