org.jpmml.sparkml.PMMLBuilder does not exist in the JVM - pyspark
Thanks a lot for any help.
My goal is to save a trained model in XML (PMML) format, and I'm really struggling with this error and these warnings:
---------------------------------------------------------------------------
Exception in thread "Thread-4" java.lang.ExceptionInInitializerError
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:398)
at py4j.reflection.CurrentThreadClassLoadingStrategy.classForName(CurrentThreadClassLoadingStrategy.java:40)
at py4j.reflection.ReflectionUtil.classForName(ReflectionUtil.java:51)
at py4j.reflection.TypeUtil.forName(TypeUtil.java:243)
at py4j.commands.ReflectionCommand.getUnknownMember(ReflectionCommand.java:175)
at py4j.commands.ReflectionCommand.execute(ReflectionCommand.java:87)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.IllegalArgumentException: Expected Apache Spark ML version 3.1, got version 3.2 (3.2.0)
at org.jpmml.sparkml.ConverterFactory.checkVersion(ConverterFactory.java:114)
at org.jpmml.sparkml.PMMLBuilder.init(PMMLBuilder.java:481)
at org.jpmml.sparkml.PMMLBuilder.<clinit>(PMMLBuilder.java:545)
... 10 more
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/home/mbg/.local/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9.2-src.zip/py4j/clientserver.py", line 480, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/mbg/.local/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1038, in send_command
response = connection.send_command(command)
File "/home/mbg/.local/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9.2-src.zip/py4j/clientserver.py", line 503, in send_command
raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
---------------------------------------------------------------------------
Py4JError Traceback (most recent call last)
/tmp/ipykernel_20251/3496938591.py in <module>
----> 1 pmmlBuilder = PMMLBuilder(sc, df_train, rfModel)
~/.local/lib/python3.8/site-packages/pyspark2pmml/__init__.py in __init__(self, sc, df, pipelineModel)
10 javaSchema = javaDf.schema.__call__()
11 javaPipelineModel = pipelineModel._to_java()
---> 12 javaPmmlBuilderClass = sc._jvm.org.jpmml.sparkml.PMMLBuilder
13 if(not isinstance(javaPmmlBuilderClass, JavaClass)):
14 raise RuntimeError("JPMML-SparkML not found on classpath")
~/.local/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py in __getattr__(self, name)
1647 answer[proto.CLASS_FQN_START:], self._gateway_client)
1648 else:
-> 1649 raise Py4JError("{0} does not exist in the JVM".format(new_fqn))
1650
1651
Py4JError: org.jpmml.sparkml.PMMLBuilder does not exist in the JVM
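From the Java side of the stack trace, the root cause appears to be the version check in org.jpmml.sparkml.ConverterFactory: the JPMML-SparkML JAR that gets loaded expects Apache Spark ML 3.1, while my running Spark is 3.2.0, so the PMMLBuilder class fails to initialize. This is the minimal check (just a sketch, run in the same notebook session) that I use to confirm which versions are actually in play:
import pyspark
print(pyspark.__version__)  # version of the installed PySpark package
print(sc.version)           # version reported by the running SparkContext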
My code is the following:
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession
conf = SparkConf().setAppName("SparkApp_ETL_ML").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)
spark = SparkSession.builder.getOrCreate()
import pandas as pd
df=pd.read_parquet("https://s3.eu-de.cloud-object-storage.appdomain.cloud/cloud-object-storage-yy-cos-standard-js4/data.parquet")
sdf = spark.createDataFrame(df)
from pyspark.sql.types import DoubleType
sdf = sdf.withColumn("x", sdf.x.cast(DoubleType()))
sdf = sdf.withColumn("y", sdf.y.cast(DoubleType()))
sdf = sdf.withColumn("z", sdf.z.cast(DoubleType()))
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
input_columns = ["x", "y", "z"] # input columns to consider
train, test = sdf.randomSplit([0.8, 0.2], seed=1)
indexer = StringIndexer(inputCol="class", outputCol="label")
vectorAssembler = VectorAssembler(inputCols=input_columns, outputCol="features")
normalizer = MinMaxScaler(inputCol="features", outputCol="features_norm")
pipeline = Pipeline(stages=[indexer, vectorAssembler, normalizer])
binEval = MulticlassClassificationEvaluator().setMetricName("accuracy").setPredictionCol("prediction"). \
setLabelCol("label")
df_train = pipeline.fit(train).transform(train)
df_test = pipeline.fit(test).transform(test)
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(featuresCol='features_norm', labelCol='label', maxDepth=20, numTrees=7, seed=1)
rfModel = rf.fit(df_train)
from pyspark2pmml import PMMLBuilder
model_target = "HMP_frModel.xml"
pmmlBuilder = PMMLBuilder(sc, df_train, rfModel)
Everything works fine until the last line of the code.
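For completeness, the next step would be to actually write the model to model_target; assuming the standard pyspark2pmml API, that would look like the line below, but the code never gets that far because the PMMLBuilder constructor itself fails:
pmmlBuilder.buildFile(model_target)  # never reached; PMMLBuilder(...) raises first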
I tried all the solutions I found on the internet, but unfortunately without success.
I am working with Jupyter Notebook (not Anaconda), installed PySpark with pip, and added the following variables to my .bashrc:
export PATH=$PATH:~/.local/bin
export SPARK_HOME=~/.local/lib/python3.8/site-packages/pyspark
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9.2-src.zip
export PATH=$SPARK_HOME/bin:$SPARK_HOME/python:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
I also downloaded the JAR files jpmml-sparkml-executable-1.7.2.jar and jpmml-sparkml-executable-1.8.0.jar and put them in the directory ~/.local/lib/python3.8/site-packages/pyspark/jars.
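For reference, my understanding is that instead of relying on the pyspark/jars directory, a single JPMML-SparkML JAR could also be passed explicitly through the Spark configuration. A minimal sketch of what I mean (using my local path and only one of the two JAR versions; how exactly this should be wired up is my assumption):
from pyspark import SparkConf, SparkContext
# Sketch: put exactly one jpmml-sparkml-executable JAR on the driver/executor classpath
jpmml_jar = "/home/mbg/.local/lib/python3.8/site-packages/pyspark/jars/jpmml-sparkml-executable-1.7.2.jar"
conf = (SparkConf()
        .setAppName("SparkApp_ETL_ML")
        .setMaster("local[*]")
        .set("spark.jars", jpmml_jar))
sc = SparkContext.getOrCreate(conf)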