load pmml (generated by sklearn) in spark to predict but get error

I am following the jpmml-evaluator-spark instructions to load a local PMML model.
My code is below:
import java.io.File
import org.jpmml.evaluator.spark._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
// load pmml
val pmmlFile = new File(getClass.getClassLoader.getResource("random_forest.pmml").getFile)
// create evaluator
val evaluator = EvaluatorUtil.createEvaluator(pmmlFile)
I cannot show the error message directly, so I put it here
Guesses:
Here are some reasons that I think may cause this problem:
1. "jpmml-evaluator-spark" does not support PMML 4.3, even though the author said the new version 1.1.0 already supports PMML 4.3.
2. There is something wrong with my "random_forest.pmml", because this file came from someone else.
Note:
Development environment:
Spark 2.1.1
Scala 2.11.8
I am running locally on a Mac; the system version is OS X El Capitan 10.11.6.

You are using Apache Spark 2.0, 2.1 or 2.2, which has prepended a legacy version of the JPMML-Model library (1.2.15, to be precise) to your application classpath. This issue is documented in SPARK-15526.
Solution: fix your application classpath as described in the JPMML-Evaluator-Spark documentation (alternatively, consider switching to Apache Spark 2.3.0 or newer).
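Once the classpath is fixed and EvaluatorUtil.createEvaluator succeeds, scoring a DataFrame goes through the library's transformer builder. A minimal Scala sketch, assuming the TransformerBuilder API shown in the JPMML-Evaluator-Spark README and a hypothetical input DataFrame inputDf whose columns match the PMML input fields:
import org.jpmml.evaluator.spark.{EvaluatorUtil, TransformerBuilder}
// Build a PMML evaluator from the local file (as in the question)
val evaluator = EvaluatorUtil.createEvaluator(pmmlFile)
// Wrap the evaluator in a Spark ML Transformer; withTargetCols/withOutputCols
// expose the model's target and output fields as result columns
val pmmlTransformer = new TransformerBuilder(evaluator)
  .withTargetCols()
  .withOutputCols()
  .exploded(true)
  .build()
// inputDf is a hypothetical DataFrame with the model's input columns
val scoredDf = pmmlTransformer.transform(inputDf)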

Another option for using PMML in Spark is PMML4S-Spark, which supports the latest PMML 4.4. For example:
import org.pmml4s.spark.ScoreModel
// Load the PMML model from a local file
val model = ScoreModel.fromFile(pmmlFile)
// Score an input DataFrame; the result contains the model's prediction columns
val scoreDf = model.transform(df)

Related

Cannot import Cosmosdb in databricks

I set up a new cluster on Databricks using Databricks Runtime version 10.1 (includes Apache Spark 3.2.0, Scala 2.12). I also installed azure_cosmos_spark_3_2_2_12_4_6_2.jar in Libraries.
I created a new notebook with Scala:
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark.CosmosDBSpark
import com.microsoft.azure.cosmosdb.spark.config.Config
But I still get the error: object cosmosdb is not a member of package com.microsoft.azure
Does anyone know which step I am missing?
Thanks
It looks like the imports you are using are for the older Spark connector (https://github.com/Azure/azure-cosmosdb-spark).
For the Spark 3.2 Connector, you might want to follow the quickstart guides: https://learn.microsoft.com/azure/cosmos-db/sql/create-sql-api-spark
The official repository is: https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/cosmos/azure-cosmos-spark_3-2_2-12
Complete Scala sample: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3_2-12/Samples/Scala-Sample.scala
Here is the configuration reference: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3_2-12/docs/configuration-reference.md
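For reference, here is a minimal Scala sketch of reading a container with the Spark 3 connector, using the notebook's built-in spark session and assuming the cosmos.oltp format and the configuration keys from the configuration reference above; the endpoint, key, database and container values are placeholders:
// Placeholder connection details; replace with your own account values
val cosmosConfig = Map(
  "spark.cosmos.accountEndpoint" -> "https://<account>.documents.azure.com:443/",
  "spark.cosmos.accountKey" -> "<account-key>",
  "spark.cosmos.database" -> "<database>",
  "spark.cosmos.container" -> "<container>"
)
// Read the container as a DataFrame using the Spark 3 connector
val df = spark.read
  .format("cosmos.oltp")
  .options(cosmosConfig)
  .load()
df.show()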
You may be missing the pip install step:
pip install azure-cosmos

java.lang.NoSuchMethodError: com.mongodb.internal.connection.Cluster.selectServer

I am new to Apache Spark, and I am using Scala and MongoDB to learn it.
https://docs.mongodb.com/spark-connector/current/scala-api/
I am trying to read an RDD from my MongoDB database; my notebook script is below:
import com.mongodb.spark.config._
import com.mongodb.spark._
val readConfig = ReadConfig(Map("uri" -> "mongodb+srv://root:root@mongodbcluster.td5gp.mongodb.net/test_database.test_collection?retryWrites=true&w=majority"))
val testRDD = MongoSpark.load(sc, readConfig)
print(testRDD.collect)
At the print(testRDD.collect) line, I got this error:
java.lang.NoSuchMethodError:
com.mongodb.internal.connection.Cluster.selectServer(Lcom/mongodb/selector/ServerSelector;)Lcom/mongodb/internal/connection/Server;
followed by more than 10 "at ..." stack-trace lines.
Used libraries:
org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
org.mongodb.scala:mongo-scala-driver_2.12:4.2.3
Is this a problem with MongoDB's internal libraries, and how could I fix it?
Many thanks
I suspect that there is a conflict between mongo-spark-connector and mongo-scala-driver. The former uses Mongo Java driver 4.0.5, but the latter is based on version 4.2.3. I would recommend trying with only mongo-spark-connector, as sketched below.
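A minimal build sketch of that suggestion, assuming sbt and the connector coordinates from the question (the connector pulls in a compatible Mongo Java driver transitively):
// build.sbt: declare only the Spark connector and drop mongo-scala-driver,
// so its 4.2.x driver cannot shadow the driver bundled with the connector
libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" % "3.0.1"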
I was also facing the same problem. I solved it using the mongo-spark-connector_2.12:3.0.1 jar, and with that I also added the Scalaj HTTP 2.4.2 jar. It's working fine now.

How to install kafka module in pyspark

The problem I have when importing KafkaUtils is:
No module named 'pyspark.streaming.kafka'
But I don't know how to install the kafka module.
I use Python 3.6.8, Spark 2.2.0 and kafka_2.12-2.5.0.
As it turns out, KafkaUtils is being deprecated and replaced with Spark Structured Streaming. This means you have two paths forward:
Redesign your application to use Structured Streaming instead (see https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html for a primer, and the sketch after this list).
Downgrade your version of Spark to a version that still includes KafkaUtils as part of the distribution (you'll find that KafkaUtils won't need to be installed separately).
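A minimal Scala sketch of the Structured Streaming route (the PySpark API is analogous); the broker address and topic name are placeholders, and the spark-sql-kafka-0-10 package must be on the classpath:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("KafkaStructuredStreaming").getOrCreate()
// Subscribe to a Kafka topic as a streaming DataFrame
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "my_topic")
  .load()
// Kafka keys and values arrive as binary; cast them to strings
val messages = kafkaDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
// Write to the console sink for a quick check
messages.writeStream
  .format("console")
  .start()
  .awaitTermination()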

Why does from_json fail with “not found : value from_json"? (2)

I have already read the answers to the existing question about this on SO. None of those fixes solve my problem.
I am unable to call the function "from_json".
I already had the following in my code:
import org.apache.spark.sql.functions._
I also tried adding:
import org.apache.spark.sql.Column
I am running Scala/Spark through Eclipse. Scala Version 2.11.11, Spark Version 2.0.0.
Any ideas?
The from_json function isn't available in Spark 2.0.
It is available from Spark 2.1 onwards.
The release notes of Spark 2.1 mention the addition of from_json functionality.
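For reference, a minimal sketch of calling from_json on Spark 2.1+, assuming a DataFrame df with a JSON string column named json_col (both hypothetical):
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
// Schema of the JSON payload (example fields)
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)
))
// Parse the JSON string column into a struct column
val parsed = df.withColumn("parsed", from_json(df("json_col"), schema))
parsed.select("parsed.id", "parsed.name").show()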

SparkContext cannot be initialized in 'yarn-client' mode called from Scala-IDE

I have installed the Cloudera VM (single node), and inside this VM I have Spark running on top of YARN. I would like to use the Eclipse IDE (with the Scala plugin) for testing/learning with Spark.
If I instantiate SparkContext as follows, everything works as I expect:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
val sparkConf = new SparkConf().setAppName("TwitterPopularTags").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
However, if I now want to connect to the local cluster by changing the master to 'yarn-client', it does not work:
val master = "yarn-client"
val sparkConf = new SparkConf().setAppName("TwitterPopularTags").setMaster(master)
Specifically, I'm getting the following errors:
Error details displayed in the Eclipse console:
Error details from the NodeManager logs:
Here are the things I have tried so far:
1. Dependencies
I added all the dependencies through the Maven repository.
The Cloudera version is 5.5, the corresponding Hadoop version is 2.6.0, and the Spark version is 1.5.0.
2. Configurations
I added 3 path variables to the Eclipse classpath:
SPARK_CONF_DIR=/etc/spark/conf/
HADOOP_CONF_DIR=/usr/lib/hadoop/
YARN_CONF_DIR=/etc/hadoop/conf.cloudera.yarn/
Can anybody clarify what the problem is here and how to solve it?
I worked around it! I still don't understand what the exact problem is, but I created a folder with my username in HDFS, i.e. the /user/myusername directory, and it worked. Anyway, I have now switched to the Hortonworks distribution, and I found it much smoother to get started with than the Cloudera distribution.
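For anyone hitting the same thing, a minimal Scala sketch of creating that HDFS home directory programmatically (the equivalent of hdfs dfs -mkdir -p /user/myusername), assuming the Hadoop configuration directories above are on the classpath and myusername is a placeholder:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
// Picks up the cluster settings from the Hadoop configuration on the classpath
val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)
// Create the user's home directory in HDFS (no-op if it already exists)
fs.mkdirs(new Path("/user/myusername"))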