I have created and registered a temporary UDF in Snowflake using Snowpark Scala. My code is attached below, and I am getting an error when I try to run it.
I am assuming there is something else I need to do with the Snowflake UDF. Any leads would be of great help.
session.udf.registerTemporary("lookupFromArrayUdf", (s: Array[String], schema: String, table: String, config: String, colName: String, action: String) => {
  implicit val formats: DefaultFormats.type = net.liftweb.json.DefaultFormats
  ....code goes here....
})
The error I am getting is below:
Exception in thread "main" net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error: Found a function matching HASHSENSITIVESTRINGUDF, but IMPORTS or TARGET_PATH could not be resolved.
at net.snowflake.client.jdbc.SnowflakeUtil.checkErrorAndThrowExceptionSub(SnowflakeUtil.java:127)
at net.snowflake.client.jdbc.SnowflakeUtil.checkErrorAndThrowException(SnowflakeUtil.java:67)
at net.snowflake.client.core.StmtUtil.pollForOutput(StmtUtil.java:442)
at net.snowflake.client.core.StmtUtil.execute(StmtUtil.java:345)
at net.snowflake.client.core.SFStatement.executeHelper(SFStatement.java:487)
at net.snowflake.client.core.SFStatement.executeQueryInternal(SFStatement.java:198)
at net.snowflake.client.core.SFStatement.executeQuery(SFStatement.java:135)
at net.snowflake.client.core.SFStatement.describe(SFStatement.java:154)
at net.snowflake.client.jdbc.SnowflakePreparedStatementV1.describeSqlIfNotTried(SnowflakePreparedStatementV1.java:96)
at net.snowflake.client.jdbc.SnowflakePreparedStatementV1.getMetaData(SnowflakePreparedStatementV1.java:549)
at com.snowflake.snowpark.internal.ServerConnection.$anonfun$getResultAttributes$1(ServerConnection.scala:500)
at com.snowflake.snowpark.internal.ServerConnection.withValidConnection(ServerConnection.scala:810)
at com.snowflake.snowpark.internal.ServerConnection.getResultAttributes(ServerConnection.scala:478)
at com.snowflake.snowpark.Session.getResultAttributes(Session.scala:841)
at com.snowflake.snowpark.internal.SchemaUtils$.analyzeAttributes(SchemaUtils.scala:44)
at com.snowflake.snowpark.internal.analyzer.SnowflakePlan.attributes$lzycompute(SnowflakePlan.scala:39)
at com.snowflake.snowpark.internal.analyzer.SnowflakePlan.attributes(SnowflakePlan.scala:38)
at com.snowflake.snowpark.internal.analyzer.SnowflakePlan.output$lzycompute(SnowflakePlan.scala:94)
at com.snowflake.snowpark.internal.analyzer.SnowflakePlan.output(SnowflakePlan.scala:94)
at com.snowflake.snowpark.DataFrame.output$lzycompute(DataFrame.scala:2580)
at com.snowflake.snowpark.DataFrame.output(DataFrame.scala:2580)
at com.snowflake.snowpark.DataFrame.withColumns(DataFrame.scala:1877)
at com.snowflake.snowpark.DataFrame.withColumn(DataFrame.scala:1834)
at us.company.snowpark.app.SnowparkAppUdfDriver$.$anonfun$processVariantColumn$2(SnowparkAppUdfDriver.scala:364)
at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
at scala.collection.immutable.List.foldLeft(List.scala:91)
at us.company.snowpark.app.SnowparkAppUdfDriver$.processVariantColumn(SnowparkAppUdfDriver.scala:364)
at us.company.snowpark.app.SnowparkAppUdfDriver$.$anonfun$applyHashing$6(SnowparkAppUdfDriver.scala:425)
at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)
at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.MapLike$DefaultKeySet.foreach(MapLike.scala:181)
at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)
at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)
at scala.collection.AbstractTraversable.foldLeft(Traversable.scala:108)
at us.company.snowpark.app.SnowparkAppUdfDriver$.applyHashing(SnowparkAppUdfDriver.scala:425)
at us.company.snowpark.app.SnowparkAppUdfDriver$.process(SnowparkAppUdfDriver.scala:454)
at us.company.snowpark.app.SnowparkAppUdfDriver$.main(SnowparkAppUdfDriver.scala:178)
at us.company.snowpark.app.SnowparkAppUdfDriver.main(SnowparkAppUdfDriver.scala)
Problem: I had uploaded only the dependency jars to the stage, but the application jar itself also needs to be present in the Snowflake stage.
Resolution: The issue was resolved once I uploaded the entire application jar to the stage as well.
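For anyone hitting the same error, here is a minimal sketch of how the jars can be made visible to Snowpark before registering the UDF (the paths and stage name below are placeholders, not the real ones):
// Sketch only: declare every jar the UDF body needs, including the application jar itself.
// A local path is uploaded by Snowpark; an @stage path points at a jar already on a stage.
session.addDependency("/path/to/my-application.jar")    // hypothetical local application jar
session.addDependency("@my_stage/dependency-lib.jar")   // hypothetical dependency jar on a stage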
Related
I wrote a UDF in Spark (3.0.0) to do an MD5 hash of columns that looks like this:
def Md5Hash(text: String): String = {
  java.security.MessageDigest.getInstance("MD5")
    .digest(text.getBytes())
    .map(0xFF & _)
    .map("%02x".format(_))
    .foldLeft("") { _ + _ }
}

val md5Hash: UserDefinedFunction = udf(Md5Hash(_))
This function has worked fine for me for months, but it is now failing at runtime:
org.apache.spark.SparkException: Failed to execute user defined function(UDFs$$$Lambda$3876/1265187815: (string) => string)
....
Caused by: java.lang.ArrayIndexOutOfBoundsException
at sun.security.provider.DigestBase.engineUpdate(DigestBase.java:116)
at sun.security.provider.MD5.implDigest(MD5.java:109)
at sun.security.provider.DigestBase.engineDigest(DigestBase.java:207)
at sun.security.provider.DigestBase.engineDigest(DigestBase.java:186)
at java.security.MessageDigest$Delegate.engineDigest(MessageDigest.java:592)
at java.security.MessageDigest.digest(MessageDigest.java:365)
at java.security.MessageDigest.digest(MessageDigest.java:411)
It still works on some small datasets, but I have another larger dataset (10Ms of rows, so not terribly huge) that fails here. I couldn't find any indication that the data I'm trying to hash are bizarre in any way -- all input values are non-null, ASCII strings. What might cause this error when it previously worked fine? I'm running in AWS EMR 6.1.0 with Spark 3.0.0.
I want to rename HDFS files in parallel using Spark, but I am getting a serialization exception; I have included the exception after my code.
I am getting this issue while using spark.sparkContext.parallelize. I am able to rename all the files when doing it in a plain loop.
def renameHdfsToS3(spark: SparkSession, hdfsFolder: String, outputFileName: String,
                   renameFunction: (String, String) => String, bktOutput: String,
                   folderOutput: String, kmsKey: String): Boolean = {
  try {
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val path = new Path(hdfsFolder)
    val files = fs.listStatus(path)
      .filter(fs => fs.isFile)
    val parallelRename = spark.sparkContext.parallelize(files).map(
      f => {
        parallelRenameHdfs(fs, outputFileName, renamePartFileWithTS, f)
      }
    )
    val hdfsTopLevelPath = fs.getWorkingDirectory() + "/" + hdfsFolder
    return true
  } catch {
    case NonFatal(e) => {
      e.printStackTrace()
      return false
    }
  }
}
Below is the exception I am getting
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)
Caused by: java.io.NotSerializableException: org.apache.hadoop.fs.LocalFileSystem
Serialization stack:
- object not serializable (class: org.apache.hadoop.fs.LocalFileSystem, value: org.apache.hadoop.fs.LocalFileSystem@1d96d872)
- field (class: ...)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
The approach is incorrect, as sc.parallelize is for processing data via RDDs. You need to do this work at the file-system level instead, outside of RDD operations. Many such posts exist.
Something like this should suffice, blended with your own logic; note .par, which allows parallel processing, e.g.:
originalpath.par.foreach(e => hdfs.rename(e, e.suffix("finish")))
You need to check how parallelism is defined with .par. See https://docs.scala-lang.org/overviews/parallel-collections/configuration.html
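To make that concrete, here is a rough driver-side sketch under assumptions of my own (the folder argument and the renaming rule are made up); the point is that the FileSystem handle never leaves the driver, and a Scala parallel collection provides the parallelism instead of sc.parallelize:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

def renameAllInParallel(spark: SparkSession, hdfsFolder: String, outputFileName: String): Unit = {
  // FileSystem is obtained and used only on the driver, so nothing has to be serialized
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val files = fs.listStatus(new Path(hdfsFolder)).filter(_.isFile).map(_.getPath)

  // .par turns the local array into a parallel collection; each rename runs on a driver-side thread
  files.par.foreach { src =>
    val dst = new Path(src.getParent, s"$outputFileName-${src.getName}") // example naming rule only
    fs.rename(src, dst)
  }
}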
I created a project 'spark-udf' and wrote a Hive UDF as below:
package com.spark.udf

import org.apache.hadoop.hive.ql.exec.UDF

class UpperCase extends UDF with Serializable {
  def evaluate(input: String): String = {
    input.toUpperCase
  }
}
I built it and created a jar for it, then tried to use this UDF in another Spark program:
spark.sql("CREATE OR REPLACE FUNCTION uppercase AS 'com.spark.udf.UpperCase' USING JAR '/home/swapnil/spark-udf/target/spark-udf-1.0.jar'")
But the following line gives me an exception:
spark.sql("select uppercase(Car) as NAME from cars").show
Exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: No
handler for UDAF 'com.spark.udf.UpperCase'. Use
sparkSession.udf.register(...) instead.; line 1 pos 7 at
org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeFunctionExpression(SessionCatalog.scala:1105)
at
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1085)
at
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1085)
at
org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:115)
at
org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1247)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$16$$anonfun$applyOrElse$6$$anonfun$applyOrElse$52.apply(Analyzer.scala:1226)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$16$$anonfun$applyOrElse$6$$anonfun$applyOrElse$52.apply(Analyzer.scala:1226)
at
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
Any help around this is really appreciated.
As mentioned in the comments, it's better to write a Spark UDF:
val uppercaseUDF = spark.udf.register("uppercase", (s : String) => s.toUpperCase)
spark.sql("select uppercase(Car) as NAME from cars").show
The main cause is that you didn't call enableHiveSupport when creating the SparkSession. In that situation the default SessionCatalog is used, and its makeFunctionExpression function scans only for user-defined aggregate functions. Since your function is not a UDAF, it cannot be found.
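For reference, a minimal sketch of creating the session with Hive support enabled (the app name is just a placeholder):
import org.apache.spark.sql.SparkSession

// With Hive support enabled, the Hive-aware catalog is used, which can resolve
// functions registered via CREATE FUNCTION ... USING JAR.
val spark = SparkSession.builder()
  .appName("hive-udf-example") // placeholder name
  .enableHiveSupport()
  .getOrCreate()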
I created a Jira task to implement this.
The issue is that the class needs to be public. A top-level Scala class is public by default (Scala has no public modifier), so make sure the UDF class is not declared private or nested inside another class:
package com.spark.udf

import org.apache.hadoop.hive.ql.exec.UDF

class UpperCase extends UDF with Serializable {
  def evaluate(input: String): String = {
    input.toUpperCase
  }
}
I am trying to pass a Word2Vec model object to my Spark UDF. Basically I have a test set with movie IDs, and I want to pass the IDs along with the model object to get an array of recommended movies for each row.
def udfGetSynonyms(model: org.apache.spark.ml.feature.Word2VecModel) =
  udf((col: String) => {
    model.findSynonymsArray("20", 1)
  })
However, this gives me a null pointer exception. When I run model.findSynonymsArray("20", 1) outside the UDF I get the expected answer. For some reason the call fails inside the UDF but works outside it.
Note: I added "20" here just to get a fixed answer to see if that would work. It does the same when I replace "20" with col.
Thanks for the help!
StackTrace:
SparkException: Job aborted due to stage failure: Task 0 in stage 23127.0 failed 4 times, most recent failure: Lost task 0.3 in stage 23127.0 (TID 4646648, 10.56.243.178, executor 149): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$udfGetSynonyms1$1: (string) => array<struct<_1:string,_2:double>>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:111)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:350)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Word2VecModel.findSynonymsArray(Word2Vec.scala:273)
at linebb57ebe901e04c40a4fba9fb7416f724554.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$udfGetSynonyms1$1.apply(command-232354:7)
at linebb57ebe901e04c40a4fba9fb7416f724554.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$udfGetSynonyms1$1.apply(command-232354:4)
... 12 more
The SQL and UDF API is a bit limited, and I am not sure there is a way to use custom types as columns or as inputs to UDFs. A bit of googling didn't turn up anything too useful.
Instead, you can use the Dataset or RDD API and just use a regular Scala function instead of a UDF, something like:
val model: Word2VecModel = ...
val inputs: Dataset[String] = ...
inputs.map(movieId => model.findSynonymsArray(movieId, 10))
Alternatively, I guess you could serialize the model to and from a string, but that seems much uglier.
I think this issue happens because wordVectors is a transient variable
class Word2VecModel private[ml] (
    @Since("1.4.0") override val uid: String,
    @transient private val wordVectors: feature.Word2VecModel)
  extends Model[Word2VecModel] with Word2VecBase with MLWritable {
I solved this by broadcasting w2vModel.getVectors and re-creating the Word2VecModel inside each partition.
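Roughly, the workaround looks like the sketch below; the column order of getVectors and the use of the mllib Word2VecModel constructor to rebuild the model are my assumptions, so adapt it to your own pipeline:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{Dataset, SparkSession}

def recommendSynonyms(spark: SparkSession,
                      mlModel: org.apache.spark.ml.feature.Word2VecModel,
                      movieIds: Dataset[String]): Dataset[Array[(String, Double)]] = {
  import spark.implicits._

  // Collect the (word, vector) pairs on the driver and broadcast them.
  val vectorMap: Map[String, Array[Float]] = mlModel.getVectors
    .collect()
    .map(row => row.getString(0) -> row.getAs[Vector](1).toArray.map(_.toFloat))
    .toMap
  val bcVectors = spark.sparkContext.broadcast(vectorMap)

  // Rebuild a lightweight mllib model once per partition from the broadcast map,
  // so no transient state from the ml model is needed on the executors.
  movieIds.mapPartitions { ids =>
    val localModel = new org.apache.spark.mllib.feature.Word2VecModel(bcVectors.value)
    ids.map(id => localModel.findSynonyms(id, 10))
  }
}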
I have a simple word count program packed as object:
object MyApp {
  val path = "file:///home/sergey/spark/spark-2.2.0/README.md"
  val readMe = sc.textFile(path)
  val stop = List("to", "the", "a")
  val res = (readMe
    .flatMap(_.split("\\W+"))
    .filter(_.length > 0)
    .map(_.toLowerCase)
    .filter(!stop.contains(_))
    .map((_, 1))
    .reduceByKey(_ + _)
    .sortBy(-_._2)
  )
  println(res.take(3).mkString)
}
When I try to execute it I get:
scala> MyApp
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:387)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:386)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.filter(RDD.scala:386)
... 51 elided
Caused by: java.io.NotSerializableException: MyApp$
Serialization stack:
- object not serializable (class: MyApp$, value: MyApp$@7bd44868)
- field (class: MyApp$$anonfun$5, name: $outer, type: class MyApp$)
- object (class MyApp$$anonfun$5, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
(The culprit is the .filter(!stop.contains(_)) line.)
However, when I execute the same code line by line it runs well and produces the expected results.
I would really appreciate answers to two questions:
1. What is so different between line-by-line execution and singleton execution that one runs while the other fails?
2. What solutions are there other than packing the !stop.contains(_) closure together with the stop list into another object?
Generally speaking, your program is structured a little unusually.
Let me illustrate the details; I hope it helps.
You said you got the correct answer when the program was executed line by line. From your description, I guess this happened in spark-shell, right?
One important thing to note is that when you open spark-shell you are in a REPL environment where a SparkContext (sc) has already been constructed for you; making your own SparkContext there will not work.
For example,
$ ./bin/spark-shell --master local[4]
Now you want a Spark application, represented by program text such as your file MyApp.scala. For a parallel operation on an RDD (here readMe, an RDD[String]), Spark breaks the job triggered by the take action into many tasks and forwards those tasks to the workers to execute.
But now pay attention to your code: to run the operation, Spark has to construct the closure (the variables and methods that must be visible for the executors to perform their computations on the RDD), and in your code that closure pulls in the entire MyApp object.
However, the singleton object MyApp holds the SparkContext sc, which cannot and should not be serializable, because it lives only on the driver node and tells Spark how to access the cluster, so serializing the closure fails.
I have revised the code below; you can run it on your machine. For your purpose, the revised code should be submitted with the spark-submit script.
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {

  private var sc: SparkContext = _

  def init(): Unit = {
    val sparkConf = new SparkConf().setAppName(this.getClass.getName)
    sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    sc = new SparkContext(sparkConf)
  }

  def mission(): Unit = {
    val path = "file:///home/sergey/spark/spark-2.2.0/README.md"
    val readMe = sc.textFile(path)
    val stop = List("to", "the", "a")
    val res = readMe
      .flatMap(_.split("\\W+"))
      .filter(_.length > 0)
      .map(_.toLowerCase)
      .filter(!stop.contains(_))
      .map((_, 1))
      .reduceByKey(_ + _)
      .sortBy(-_._2)
    println(res.take(3).mkString)
  }

  def main(args: Array[String]): Unit = {
    init()
    mission()
  }
}
Please refer to the spark-submit usage documentation; sooner or later you will have to use it.
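For example, an invocation along these lines (the jar path and Scala version are placeholders for whatever your build produces):
$ ./bin/spark-submit \
    --class MyApp \
    --master local[4] \
    target/scala-2.11/myapp_2.11-1.0.jar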