I have trained a model in PySpark:
##Model
gbt = GBTClassifier(maxIter=10)
gbtModel = gbt.fit(train)
predictions = gbtModel.transform(test)
Here I save the pipeline and the model:
#Save pipeline
pipelineModel.write().overwrite().save("s3://data-production/pipelineModel_v1")
#Save Model
gbtModel.save("s3://data-production/first_trade.model_v0")
Now, in production, I load the pipeline and the model to score future datasets:
pipelineModel = PipelineModel.load("s3://data-production/pipelineModel_v1")
new_test= pipelineModel.transform(new_df1)
model = GBTClassifier.load("s3://data-production/first_trade.model_v0")
I get this error when loading the model:
Py4JJavaError: An error occurred while calling o4701.load.
: java.lang.NoSuchMethodException: org.apache.spark.ml.classification.GBTClassificationModel.<init>(java.lang.String)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getConstructor(Class.java:1825)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:496)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError('An error occurred while calling o4701.load.\n', JavaObject id=o4702), <traceback object at 0x7f247a3eb9c8>)
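The saved artifact is the fitted model, a GBTClassificationModel, not the GBTClassifier estimator that produced it. Loading it through GBTClassifier.load goes through the estimator's reader, which cannot construct a GBTClassificationModel, hence the NoSuchMethodException above. Import the model class in the production code and load the model with it:
from pyspark.ml.classification import GBTClassifier, GBTClassificationModel
model = GBTClassificationModel.load("s3://data-production/first_trade.model_v0")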
After reading the documentation and some Stack Overflow posts, I tried two different methods, but I suspect I am doing something wrong before that step.
To start off, this is what some_list_without_header.csv looks like:
12345
23456
34567
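from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType
schema_for_list = StructType([StructField('_c0', StringType())])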
some_list = spark.read.csv("some_list_without_header.csv", header=False, schema = schema_for_list).collect()
###
schema_for_data = StructType([
StructField('COLUMN_TO_CHECK', StringType()),
StructField('COLUMN2', StringType())
])
data = spark.read.csv("data.csv", header = True, schema = schema_for_data)
### Tried a few ways; here are a couple of examples, neither works
data.filter(data.COLUMN_TO_CHECK.isin(some_list))
data.where(F.col('COLUMN_TO_CHECK').isin(some_list))
Here is the error message:
Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
: org.apache.spark.SparkRuntimeException: The feature is not supported: literal for '[EXAMPLE_NUMBER_STRING]' of class java.util.ArrayList.
at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:296)
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:101)
at org.apache.spark.sql.functions$.lit(functions.scala:125)
at org.apache.spark.sql.functions.lit(functions.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
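A likely cause, judging from the java.util.ArrayList in the message: collect() returns a list of Row objects rather than a list of strings, so isin receives whole rows instead of scalar values. Below is a minimal sketch of flattening the rows first. The helper names first_column_values and filter_by_values are hypothetical, and the sketch relies only on the fact that a Row can be indexed like a tuple.

```python
def first_column_values(rows):
    """Return the first field of every row as a plain Python list.

    pyspark.sql.Row is a tuple subclass, so indexing with [0] works on the
    objects returned by DataFrame.collect(); plain tuples behave the same.
    """
    return [row[0] for row in rows]


def filter_by_values(data, column_name, values):
    """Keep the rows of `data` whose `column_name` is one of `values`.

    PySpark is imported lazily so the helper above stays usable without it.
    """
    from pyspark.sql import functions as F

    return data.filter(F.col(column_name).isin(values))


# Intended use with the names from the question (not executed here):
#   values = first_column_values(some_list)
#   filtered = filter_by_values(data, "COLUMN_TO_CHECK", values)
```

For a large list, joining data against the list DataFrame instead of collecting it may scale better; whether that matters depends on the size of the file.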
After finishing the data processing, I need to save the data back to the PostgreSQL DB.
I am using the same JDBC properties that I used to read the data out, but the write fails with an error. The JDBC config is not read-only: I can insert data from DataGrip with the same config.
I cannot insert the data with the logic below:
glueContext.write_dynamic_frame.from_options(
frame=DynamicFrame.fromDF(dataFrame, glueContext,"tableName"),
connection_type="postgresql",
connection_options={
"url": jdbcConfig["url"],
"user": jdbcConfig["user"],
"password": jdbcConfig["password"],
"dbtable": "tableName"
}
)
Below is the error I got (from the CloudWatch log):
ERROR [Thread-9] postgresql.Driver (Driver.java:connect(233)): Error in url: jdbc:postgresql://cc2-data-analytics-cluster-ndec.cluster-abcdefg.us-west-2.rds.amazonaws.com:5440
2022-04-13 22:36:26,838 INFO [Thread-9] log.GlueLogger (GlueLogger.scala:info(8)): Exception as An error occurred while calling o203.pyWriteDynamicFrame.
: java.lang.Exception: Connection can not be established. Please check your JDBC url, username and password.
at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$12.apply(JDBCUtils.scala:968)
at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$12.apply(JDBCUtils.scala:964)
at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1$$anonfun$apply$8.apply(JDBCUtils.scala:920)
at scala.Option.getOrElse(Option.scala:121)
at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1.apply(JDBCUtils.scala:920)
at scala.Option.getOrElse(Option.scala:121)
at com.amazonaws.services.glue.util.JDBCWrapper$.connectWithSSLAttempt(JDBCUtils.scala:920)
at com.amazonaws.services.glue.util.JDBCWrapper$.connectionProperties(JDBCUtils.scala:963)
at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties$lzycompute(JDBCUtils.scala:734)
at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties(JDBCUtils.scala:734)
at com.amazonaws.services.glue.util.JDBCWrapper.writeDF(JDBCUtils.scala:883)
at com.amazonaws.services.glue.sinks.PostgresDataSink.writeDynamicFrame(PostgresDataSink.scala:41)
at com.amazonaws.services.glue.DataSink.pyWriteDynamicFrame(DataSink.scala:65)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I also tried another method, but it fails with the same error:
dataFrame.select(*igDataDfSchema).write.format("jdbc")\
.option("url", jdbcConfig['url']) \
.option("driver", "org.postgresql.Driver").option("dbtable", "tableName") \
.option("user", jdbcConfig['user']).option("password",jdbcConfig['password']).save()
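One thing worth checking, going by the log line "Error in url: ...:5440": the URL ends at the port and carries no "/database" part, which the PostgreSQL JDBC driver may reject when parsing. The sketch below is a small, assumption-laden check using only the standard library. The helper names jdbc_database and with_database, the host example-host, and the database name analytics are all hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

JDBC_PREFIX = "jdbc:"


def jdbc_database(url):
    """Return the database name embedded in a JDBC URL, or None if absent."""
    if not url.startswith(JDBC_PREFIX):
        raise ValueError("not a JDBC URL: " + url)
    # Drop the "jdbc:" prefix so the rest parses like an ordinary URL.
    parts = urlsplit(url[len(JDBC_PREFIX):])
    name = parts.path.lstrip("/")
    return name or None


def with_database(url, database):
    """Return `url` unchanged if it names a database, else add `/database`."""
    if jdbc_database(url) is not None:
        return url
    parts = urlsplit(url[len(JDBC_PREFIX):])
    # Replace the empty (or bare "/") path; any query string is preserved.
    return JDBC_PREFIX + urlunsplit(parts._replace(path="/" + database))


# Intended use with the question's config (not executed here):
#   jdbcConfig["url"] = with_database(jdbcConfig["url"], "analytics")
```

If jdbc_database returns None for the configured URL, appending the database name before passing it to either write path is a cheap experiment.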
Good day,
I am using sample code from the Snowflake documentation to connect to it from PySpark:
sfparams = {
"sfURL": "SOME_URL",
"sfUser": "SOME_USER",
"sfPassword": "SOME_PASSWORD",
"sfDatabase": "SOME_DB",
"sfSchema": "SOME_SCHEMA",
"sfWarehouse": "SOME_WH",
"sfRole": "sysadmin"
}
df = self.spark_sql_context\
.read\
.format('snowflake')\
.options(**sfparams)\
.option('query', "SELECT * FROM TABLE1 LIMIT 10")\
.load()
df.show(truncate = False)
I have downloaded the required jar files (snowflake-jdbc-3.9.2.jar and spark-snowflake_2.11-2.9.3-spark_2.4.jar) and put them inside the Spark jars directory. I have also added the following to the Spark config:
.set("spark.jars", "/path_to/spark-2.4.5-bin-without-hadoop/jars/snowflake-jdbc-3.4.2.jar") \
.set("spark.jars", "/path_to/spark-2.4.5-bin-without-hadoop/jars/spark-snowflake_2.11-2.9.3-spark_2.4.jar")
However, whenever I try to run the code above, the following exception shows up:
: java.lang.NoClassDefFoundError: net/snowflake/client/jdbc/SnowflakeLoggedFeatureNotSupportedException
at net.snowflake.spark.snowflake.DefaultSource.createRelation(DefaultSource.scala:68)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: net.snowflake.client.jdbc.SnowflakeLoggedFeatureNotSupportedException
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 16 more
Couldn't find anything anywhere on how to deal with it so here I am.
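Two things in the config above are worth checking. First, calling .set("spark.jars", ...) twice keeps only the second value, because the configuration holds one value per key; spark.jars expects a single comma-separated list. Second, the text mentions snowflake-jdbc-3.9.2.jar while the config points at snowflake-jdbc-3.4.2.jar, and a JDBC jar that is too old for the connector can be missing classes the connector expects. The sketch below covers only the first point; the helper names spark_jars_value and build_conf and the example paths are hypothetical.

```python
def spark_jars_value(jar_paths):
    """Join jar paths into the comma-separated form that spark.jars expects."""
    cleaned = [p.strip() for p in jar_paths if p and p.strip()]
    if not cleaned:
        raise ValueError("at least one jar path is required")
    return ",".join(cleaned)


def build_conf(jar_paths):
    """Return a SparkConf with all jars registered in one spark.jars entry.

    PySpark is imported lazily so spark_jars_value can be used on its own.
    """
    from pyspark import SparkConf

    return SparkConf().set("spark.jars", spark_jars_value(jar_paths))


# Intended use (paths are placeholders, not executed here):
#   conf = build_conf([
#       "/path_to/jars/snowflake-jdbc.jar",
#       "/path_to/jars/spark-snowflake.jar",
#   ])
```

Whether this alone resolves the NoClassDefFoundError depends on the jar versions being compatible with each other, so the version mismatch should be ruled out as well.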
An educated guess: since the query is a plain SELECT * (.option('query', "SELECT * FROM TABLE1 LIMIT 10")), the table may contain a column with an unsupported data type such as BLOB/BINARY. If that is the case, listing the columns explicitly and omitting that column should help.
Using the Spark Connector - From Snowflake to Spark SQL
I have a DSE Graph in a validation/prod environment.
The problem occurs when I try to launch a DSEGraphFrame query using Spark in Scala.
val graph = spark.dseGraph("my_graph")
generates the following exception:
Exception in thread "main"
com.datastax.driver.core.exceptions.InvalidQueryException: The method
DseGraphRpc.getSchemaBlob does not exist. Make sure that the required
component for that method is active/enabled
at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:40)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:26)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:284)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:37)
at com.sun.proxy.$Proxy27.execute(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:37)
at com.sun.proxy.$Proxy28.execute(Unknown Source)
at com.datastax.bdp.util.rpc.RpcUtil.callInternal(RpcUtil.java:57)
at com.datastax.bdp.util.rpc.RpcUtil.call(RpcUtil.java:40)
at com.datastax.bdp.graph.spark.DseGraphRpc.callGetSchema(DseGraphRpc.java:45)
at com.datastax.bdp.graph.spark.graphframe.DseGraphFrame$$anonfun$getSchemaFromServer$1.apply(DseGraphFrame.scala:586)
at com.datastax.bdp.graph.spark.graphframe.DseGraphFrame$$anonfun$getSchemaFromServer$1.apply(DseGraphFrame.scala:586)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:115)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:114)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:158)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:114)
at com.datastax.bdp.graph.spark.graphframe.DseGraphFrame$.getSchemaFromServer(DseGraphFrame.scala:586)
at com.datastax.bdp.graph.spark.graphframe.DseGraphFrameBuilder$.apply(DseGraphFrameBuilder.scala:257)
at com.datastax.bdp.graph.spark.graphframe.SparkSessionFunctions.dseGraph(SparkSessionFunctions.scala:20)
What could I do to run DSEGraphFrame properly?
The problem comes from a node in the DSE cluster on which Graph is not enabled.
I want to apply an md5 function to an RDD[(String, Array[Double])], but I get a NullPointerException. I found a related question on Stack Overflow: "call of distinct and map together throws NPE in spark library".
My code:
import java.security.MessageDigest
def md5(s: String) = {
MessageDigest.getInstance("MD5").digest(s.getBytes).
map("%02x".format(_)).mkString.substring(0,8)
}
val rdd=sc.makeRDD(Array(1,8,6,4,9,3,76,4))//.collect().foreach(println)
val rdd2 = rdd.map(r=>(r+"s",Array(1.0,2.0)))
rdd2.map{
case(a,b) => (md5(a)+"_"+a,b)
}.foreach(println)
In local mode it works, but in cluster mode it fails with:
java.lang.NullPointerException
Is there another way to do this? Thanks :)
error:
Exception in thread "main" java.lang.NullPointerException
at no1.no1$.no1$no1$$md5$1(no1.scala:139)
at no1.no1$$anonfun$8.apply(no1.scala:143)
at no1.no1$$anonfun$8.apply(no1.scala:141)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at no1.no1$.main(no1.scala:141)
at no1.no1.main(no1.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The code above is only an example, but it looks correct to me. I am confused.
I see no way for the RDD to provide a null string to your MD5 function, and the failure is clearly inside it:
java.lang.NullPointerException
at no1.no1$.no1$no1$$md5$1(no1.scala:139) <-- here!
My money would be that the static call MessageDigest.getInstance("MD5") is returning null on the executors. That or the .digest call. Check for conditions under which that can happen; maybe the inputs you are trying locally do not contain the failure cases.