org.jpmml.sparkml.PMMLBuilder does not exist in the JVM - pyspark

Thanks a lot for any help.
My goal is to save a trained model in XML (PMML) format, and I'm really struggling with the following error and warnings:
---------------------------------------------------------------------------
Exception in thread "Thread-4" java.lang.ExceptionInInitializerError
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:398)
at py4j.reflection.CurrentThreadClassLoadingStrategy.classForName(CurrentThreadClassLoadingStrategy.java:40)
at py4j.reflection.ReflectionUtil.classForName(ReflectionUtil.java:51)
at py4j.reflection.TypeUtil.forName(TypeUtil.java:243)
at py4j.commands.ReflectionCommand.getUnknownMember(ReflectionCommand.java:175)
at py4j.commands.ReflectionCommand.execute(ReflectionCommand.java:87)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.IllegalArgumentException: Expected Apache Spark ML version 3.1, got version 3.2 (3.2.0)
at org.jpmml.sparkml.ConverterFactory.checkVersion(ConverterFactory.java:114)
at org.jpmml.sparkml.PMMLBuilder.init(PMMLBuilder.java:481)
at org.jpmml.sparkml.PMMLBuilder.<clinit>(PMMLBuilder.java:545)
... 10 more
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/home/mbg/.local/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9.2-src.zip/py4j/clientserver.py", line 480, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/mbg/.local/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1038, in send_command
response = connection.send_command(command)
File "/home/mbg/.local/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9.2-src.zip/py4j/clientserver.py", line 503, in send_command
raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
---------------------------------------------------------------------------
Py4JError Traceback (most recent call last)
/tmp/ipykernel_20251/3496938591.py in <module>
----> 1 pmmlBuilder = PMMLBuilder(sc, df_train, rfModel)
~/.local/lib/python3.8/site-packages/pyspark2pmml/__init__.py in __init__(self, sc, df, pipelineModel)
10 javaSchema = javaDf.schema.__call__()
11 javaPipelineModel = pipelineModel._to_java()
---> 12 javaPmmlBuilderClass = sc._jvm.org.jpmml.sparkml.PMMLBuilder
13 if(not isinstance(javaPmmlBuilderClass, JavaClass)):
14 raise RuntimeError("JPMML-SparkML not found on classpath")
~/.local/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py in __getattr__(self, name)
1647 answer[proto.CLASS_FQN_START:], self._gateway_client)
1648 else:
-> 1649 raise Py4JError("{0} does not exist in the JVM".format(new_fqn))
1650
1651
Py4JError: org.jpmml.sparkml.PMMLBuilder does not exist in the JVM
My code is the following:
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession
conf = SparkConf().setAppName("SparkApp_ETL_ML").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)
spark = SparkSession.builder.getOrCreate()
import pandas as pd
df=pd.read_parquet("https://s3.eu-de.cloud-object-storage.appdomain.cloud/cloud-object-storage-yy-cos-standard-js4/data.parquet")
sdf = spark.createDataFrame(df)
from pyspark.sql.types import DoubleType
sdf = sdf.withColumn("x", sdf.x.cast(DoubleType()))
sdf = sdf.withColumn("y", sdf.y.cast(DoubleType()))
sdf = sdf.withColumn("z", sdf.z.cast(DoubleType()))
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
input_columns = ["x", "y", "z"] # input columns to consider
train, test = sdf.randomSplit([0.8, 0.2], seed=1)
indexer = StringIndexer(inputCol="class", outputCol="label")
vectorAssembler = VectorAssembler(inputCols=input_columns, outputCol="features")
normalizer = MinMaxScaler(inputCol="features", outputCol="features_norm")
pipeline = Pipeline(stages=[indexer, vectorAssembler, normalizer])
binEval = MulticlassClassificationEvaluator().setMetricName("accuracy").setPredictionCol("prediction"). \
setLabelCol("label")
df_train = pipeline.fit(train).transform(train)
df_test = pipeline.fit(test).transform(test)
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(featuresCol='features_norm', labelCol='label', maxDepth=20, numTrees=7, seed=1)
rfModel = rf.fit(df_train)
from pyspark2pmml import PMMLBuilder
model_target = "HMP_frModel.xml"
pmmlBuilder = PMMLBuilder(sc, df_train, rfModel)
Everything works fine until the last line of the code.
I tried all the solutions I found on the internet, but unfortunately without success.
I am working in a Jupyter notebook (not Anaconda), installed PySpark with pip, and added these variables to my .bashrc:
export PATH=$PATH:~/.local/bin
export SPARK_HOME=~/.local/lib/python3.8/site-packages/pyspark
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9.2-src.zip
export PATH=$SPARK_HOME/bin:$SPARK_HOME/python:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
I also downloaded the JAR files jpmml-sparkml-executable-1.7.2.jar and jpmml-sparkml-executable-1.8.0.jar and put them in the directory ~/.local/lib/python3.8/site-packages/pyspark/jars.
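Judging by the Caused by line in the traceback above, the PMMLBuilder class fails its static initialization because the JPMML-SparkML build that gets loaded expects Spark ML 3.1 while the session runs Spark 3.2, which is presumably why the Python side then reports that the class does not exist in the JVM. A minimal sketch of an alternative setup (the path, and the assumption that the 1.8.0 build is the one targeting Spark 3.2, are unverified assumptions):

# Sketch only: keep a single JPMML-SparkML build that matches the running Spark
# release and hand it to Spark explicitly, instead of dropping several versions
# into the pyspark/jars folder where the wrong one may be picked up first.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

jpmml_jar = "/home/mbg/.local/lib/python3.8/site-packages/pyspark/jars/jpmml-sparkml-executable-1.8.0.jar"

conf = (SparkConf()
        .setAppName("SparkApp_ETL_ML")
        .setMaster("local[*]")
        .set("spark.jars", jpmml_jar))  # only the build matching this Spark version

sc = SparkContext.getOrCreate(conf)
spark = SparkSession.builder.getOrCreate()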

Related

show() raises an error after applying a pandas UDF to a DataFrame

I am having trouble getting this trial code to work. The final line df.select(plus_one(col("x"))).show() fails. I also tried saving the result in a variable (vardf = df.select(plus_one(col("x")))) and then calling vardf.show(), but that fails too.
import pyspark
import pandas as pd
from typing import Iterator
from pyspark.sql.functions import col, pandas_udf, struct
spark = pyspark.sql.SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")
pdf = pd.DataFrame([1, 2, 3], columns=["x"])
df = spark.createDataFrame(pdf)
df.show()
@pandas_udf("long")
def plus_one(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for s in batch_iter:
        yield s + 1
df.select(plus_one(col("x"))).show()
Error message (parts of it):
File "C:\bigdatasetup\anaconda3\envs\pyspark-env\lib\site-packages\spyder_kernels\py3compat.py", line 356, in compat_exec
exec(code, globals, locals)
File "c:\bigdatasetup\dataanalysiswithpythonandpyspark-trunk\code\ch09\untitled0.py", line 24, in
df.select(plus_one(col("x"))).show()
File "C:\bigdatasetup\anaconda3\envs\pyspark-env\lib\site-packages\pyspark\sql\dataframe.py", line 494, in show
print(self._jdf.showString(n, 20, vertical))
File "C:\bigdatasetup\anaconda3\envs\pyspark-env\lib\site-packages\py4j\java_gateway.py", line 1321, in call
return_value = get_return_value(
File "C:\bigdatasetup\anaconda3\envs\pyspark-env\lib\site-packages\pyspark\sql\utils.py", line 117, in deco
raise converted from None
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
...
...
ERROR 2022-04-21 09:48:24,423 7608 org.apache.spark.scheduler.TaskSetManager [task-result-getter-0] Task 0 in stage 3.0 failed 1 times; aborting job
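The PythonException above is truncated before the worker-side traceback, so the root cause is not visible here. Since iterator pandas UDFs move batches through Arrow, one quick thing to rule out (a sketch under that assumption, not a diagnosis) is a missing or mismatched pyarrow in the Python environment the workers use:

# Quick environment check: pandas UDFs serialize data through Arrow, so
# pyspark needs compatible pandas and pyarrow installs in the interpreter
# that the workers run.
import pandas as pd

try:
    import pyarrow
    print("pandas", pd.__version__, "pyarrow", pyarrow.__version__)
except ImportError:
    print("pyarrow is not installed in this environment")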

How to fix this error in PySpark with the select method?

I'm following an example from the internet, but it gives me an error that I can't solve. The code is the following:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from pyspark import SparkContext
from IPython.display import display, HTML
from pyspark.sql import SQLContext
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.sql import Column as c
from pyspark.sql.functions import array, udf, lit, col as c
import pyspark.sql.functions as f
pd.set_option('max_colwidth',100)
plt.style.use('seaborn-paper')
try:
    sc = SparkContext('local[*]')
except:
    sc = SparkContext.getOrCreate('local[*]')
sqlContext = SQLContext(sc)
# Reading the data frames
whiteWinnePath = 'winequality-White.csv'
redWinnePath = 'winequality-Red.csv'
"""
df_p = pd.read_csv(whiteWinnePath, sep =";")
print("Mostrar el data frame")
display(df_p)
"""
whiteWinneDF = sqlContext.createDataFrame(pd.read_csv(whiteWinnePath, sep = ";")).withColumn('type',lit(0))
redWinneDF = sqlContext.createDataFrame(pd.read_csv(redWinnePath, sep = ";")).withColumn('type',lit(1))
whiteWinneDF.printSchema()
# Splitting into training and test sets
whiteTrainingDF, whiteTestingDF = whiteWinneDF.randomSplit([0.7,0.3])
redTrainingDF, redTestingDF = redWinneDF.randomSplit([0.7,0.3])
trainingDF = whiteTrainingDF.union(redTrainingDF)
testingDF = whiteTestingDF.union(redTestingDF)
# Preparing the dataframe for PCA
idCol = ['type']
features = [column for column in redWinneDF.columns if column not in idCol]
p = len(features)
meanVector = trainingDF.describe().where(c('summary')==lit('mean')).toPandas()[[0][1:p+1]].values
"""
meanVector2= meanVector1.toPandas()
#print("="*50)
#print(type(meanVector2))
#meanVector= meanVector2.as_matrix()#[0][1:p+1]
meanVector= meanVector2[[0][1:p+1]].values
"""
labeledVectorsDF = trainingDF.select(features+['type']).rdd\
.map(lambda x:(Vectors.dense(x[0:p]-Vectors.dense(meanVector)),x[p]))\
.toDF(['features','type'])
labeledVectorsDF.limit(5).toPandas()
When I run the code, this is the error I get:
File "D:\UGR\Investigación\Cosas de Reinaldo\mis script\Seleccion_caracteristicas\PCA_wine_quality.py", line 78, in <module>
labeledVectorsDF = trainingDF.select(features+['type']).rdd\
File "C:\spark-3.0.1-bin-hadoop2.7\python\pyspark\sql\session.py", line 61, in toDF
return sparkSession.createDataFrame(self, schema, sampleRatio)
File "C:\spark-3.0.1-bin-hadoop2.7\python\pyspark\sql\session.py", line 605, in createDataFrame
return self._create_dataframe(data, schema, samplingRatio, verifySchema)
File "C:\spark-3.0.1-bin-hadoop2.7\python\pyspark\sql\session.py", line 628, in _create_dataframe
rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
File "C:\spark-3.0.1-bin-hadoop2.7\python\pyspark\sql\session.py", line 425, in _createFromRDD
struct = self._inferSchema(rdd, samplingRatio, names=schema)
File "C:\spark-3.0.1-bin-hadoop2.7\python\pyspark\sql\session.py", line 396, in _inferSchema
first = rdd.first()
File "C:\spark-3.0.1-bin-hadoop2.7\python\pyspark\rdd.py", line 1464, in first
rs = self.take(1)
File "C:\spark-3.0.1-bin-hadoop2.7\python\pyspark\rdd.py", line 1446, in take
res = self.context.runJob(self, takeUpToNumLeft, p)
File "C:\spark-3.0.1-bin-hadoop2.7\python\pyspark\context.py", line 1118, in runJob
sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
File "C:\spark-3.0.1-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "C:\spark-3.0.1-bin-hadoop2.7\python\pyspark\sql\utils.py", line 128, in deco
return f(*a, **kw)
File "C:\spark-3.0.1-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 17, LAPTOP-3G67L0HS, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
I don't know if the error is in the select method or the rdd method. I verified that features contains all the column names of the dataframe except type.
How could I solve it?
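One detail that looks suspicious to me: toPandas()[[0][1:p+1]] slices the literal list [0] (which yields []), so it selects no columns and meanVector comes back empty, which would explain the failure inside the worker. A sketch of what I assume was intended, staying close to the original line (the cast to float is needed because describe() returns strings; this is an assumption about the intent, not a verified fix):

# Sketch: keep only the "mean" row, select the feature columns, and cast the
# string values that describe() returns back to float.
mean_row = trainingDF.describe().where(c('summary') == lit('mean')).toPandas()
meanVector = mean_row[features].astype(float).values[0]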

ModuleNotFoundError: No module named 'pyspark.dbutils'

I am running PySpark from an Azure Machine Learning notebook. I am trying to move a file using the dbutils module.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
def get_dbutils(spark):
    try:
        from pyspark.dbutils import DBUtils
        dbutils = DBUtils(spark)
    except ImportError:
        import IPython
        dbutils = IPython.get_ipython().user_ns["dbutils"]
    return dbutils
dbutils = get_dbutils(spark)
dbutils.fs.cp("file:source", "dbfs:destination")
I got this error:
ModuleNotFoundError: No module named 'pyspark.dbutils'
Is there a workaround for this?
Here is the error in another Azure Machine Learning notebook:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-183f003402ff> in get_dbutils(spark)
4 try:
----> 5 from pyspark.dbutils import DBUtils
6 dbutils = DBUtils(spark)
ModuleNotFoundError: No module named 'pyspark.dbutils'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-1-183f003402ff> in <module>
10 return dbutils
11
---> 12 dbutils = get_dbutils(spark)
<ipython-input-1-183f003402ff> in get_dbutils(spark)
7 except ImportError:
8 import IPython
----> 9 dbutils = IPython.get_ipython().user_ns["dbutils"]
10 return dbutils
11
KeyError: 'dbutils'
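For reference, a defensive variant of the same helper (a sketch only; as far as I know pyspark.dbutils ships with the Databricks runtime, so outside Databricks both lookups are expected to fail, and using .get() avoids the KeyError shown above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def get_dbutils(spark):
    """Return a dbutils handle if one is available in this environment, else None."""
    try:
        from pyspark.dbutils import DBUtils  # only present on Databricks runtimes
        return DBUtils(spark)
    except ImportError:
        import IPython
        ip = IPython.get_ipython()
        # In a Databricks notebook "dbutils" is injected into the user namespace;
        # .get() returns None instead of raising the KeyError seen above.
        return ip.user_ns.get("dbutils") if ip is not None else None

dbutils = get_dbutils(spark)
if dbutils is None:
    print("dbutils is not available in this environment")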

How to limit FPGrowth itemsets to just 2 or 3

I am running the FPGrowth algorithm with PySpark on Python 3.6 in a Jupyter notebook. When I try to save the association rules, the generated output is huge, so I want to limit the number of consequents. Here is the code I have tried. I also changed the Spark context parameters.
Maximum Pattern Length fpGrowth (Apache) PySpark
from pyspark.sql.functions import col, size
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
conf = SparkConf().setAppName("App")
conf = (conf.setMaster('local[*]')
        .set('spark.executor.memory', '100G')
        .set('spark.driver.memory', '400G')
        .set('spark.driver.maxResultSize', '200G'))
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
R = Row('ID', 'items')
df=spark.createDataFrame([R(i, x) for i, x in enumerate(lol)])
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.7, minConfidence=0.9)
model = fpGrowth.fit(df)
ar = model.associationRules.where(size(col('antecedent')) == 2).where(size(col('consequent')) == 1)
ar.cache()
ar.toPandas().to_csv('output.csv')
It gives an error:
TypeError Traceback (most recent call last)
<ipython-input-1-f90c7a9f11ae> in <module>
---> 73 ar=model.associationRules.where(size(col('antecedent')) == 2).where(size(col('consequent')) == 1)
TypeError: 'str' object is not callable
Can someone help me solve this issue?
Here lol is a list of lists of transactions: [['a','b'], ['c','a','e'], ...]
Python: 3.6.5
PySpark
Windows 10
The above discussion and following this link helped me resolve the problem.
'str' object is not callable TypeError
import pyspark.sql.functions as func
model.associationRules.where(func.size(func.col('antecedent')) == 1).where(func.size(func.col('consequent')) == 1).show()
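Building on that fix, a sketch of the filter for capping the combined itemset at two or three items (treating "2 or 3" as antecedents of one or two items plus a single-item consequent is my assumption):

# Sketch: antecedents of 1-2 items plus a 1-item consequent keep the combined
# itemset at 2-3 items. The func. prefix sidesteps the original TypeError,
# which looks like a local variable shadowing the imported col/size functions.
import pyspark.sql.functions as func

ar = (model.associationRules
      .where(func.size(func.col('antecedent')).isin(1, 2))
      .where(func.size(func.col('consequent')) == 1))
ar.toPandas().to_csv('output.csv')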

Multiclass Classification Evaluator in PySpark

from pyspark.ml.classification import MultilayerPerceptronClassifier
inputneurons = len(pipe_df.columns)
nn = MultilayerPerceptronClassifier(layers = [inputneurons,20,2])
nn_model = nn.fit(train_data)
results = nn_model.transform(test_data)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator()
mlp_accuracy = evaluator.evaluate(results)
When I run it, it shows these errors:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
in ()
23 evaluator = MulticlassClassificationEvaluator()
24
---> 25 mlp_accuracy = evaluator.evaluate(results)
26
27
I also tried BinaryClassificationEvaluator, but it doesn't work either.
Does anyone know what's wrong here? I am new to PySpark...
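Since the Py4JJavaError above is cut off before the Java stack trace, the actual cause is not visible, but a sketch with the evaluator's columns and metric spelled out (the column names are assumptions about what the pipeline produces) at least rules out a column-name mismatch:

# Sketch only: be explicit about the columns the evaluator reads. The defaults
# are labelCol="label" and predictionCol="prediction"; if the dataset uses
# different names, evaluate() fails on the JVM side.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(
    labelCol="label",            # assumed label column in test_data
    predictionCol="prediction",  # added by nn_model.transform(test_data)
    metricName="accuracy",
)
mlp_accuracy = evaluator.evaluate(results)
print(mlp_accuracy)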