Spark UDF now throws ArrayIndexOutOfBoundsException - scala

I wrote a UDF in Spark (3.0.0) to do an MD5 hash of columns that looks like this:
def Md5Hash(text: String): String = {
  java.security.MessageDigest.getInstance("MD5")
    .digest(text.getBytes())
    .map(0xFF & _)
    .map("%02x".format(_))
    .foldLeft("") { _ + _ }
}

val md5Hash: UserDefinedFunction = udf(Md5Hash(_))
This function has worked fine for me for months, but it is now failing at runtime:
org.apache.spark.SparkException: Failed to execute user defined function(UDFs$$$Lambda$3876/1265187815: (string) => string)
....
Caused by: java.lang.ArrayIndexOutOfBoundsException
at sun.security.provider.DigestBase.engineUpdate(DigestBase.java:116)
at sun.security.provider.MD5.implDigest(MD5.java:109)
at sun.security.provider.DigestBase.engineDigest(DigestBase.java:207)
at sun.security.provider.DigestBase.engineDigest(DigestBase.java:186)
at java.security.MessageDigest$Delegate.engineDigest(MessageDigest.java:592)
at java.security.MessageDigest.digest(MessageDigest.java:365)
at java.security.MessageDigest.digest(MessageDigest.java:411)
It still works on some small datasets, but I have another larger dataset (10Ms of rows, so not terribly huge) that fails here. I couldn't find any indication that the data I'm trying to hash are bizarre in any way -- all input values are non-null, ASCII strings. What might cause this error when it previously worked fine? I'm running in AWS EMR 6.1.0 with Spark 3.0.0.
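For reference, the same hex digest can also be produced without a UDF via Spark's built-in md5 function; a minimal sketch, assuming a DataFrame named df with a string column named text:
import org.apache.spark.sql.functions.{col, md5}

// Built-in MD5 over the column's bytes; the cast to binary makes the input type explicit
val hashed = df.withColumn("text_md5", md5(col("text").cast("binary")))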

Related

Unable to get the value of first column from a row in dataset using spark scala

I'm trying to iterate over a DataFrame using foreachPartition to insert values into a database. I group the rows within each partition and use foreach to iterate over each row. Please find my code below:
import java.sql.DriverManager
import org.apache.spark.sql.Row

val endDF = spark.read.parquet(path).select("pc").filter(col("pc").isNotNull)

endDF.foreachPartition((partition: Iterator[Row]) => {
  Class.forName(driver)
  val con = DriverManager.getConnection(jdbcurl, user, pwd)
  partition.grouped(100).foreach(batch => {
    val st = con.createStatement()
    batch.foreach(row => {
      val pc = row.get(0).toString()
      val in = s"""insert tshdim (pc) values(${pc})""".stripMargin
      st.addBatch(in)
    })
    st.executeLargeBatch
  })
  con.close()
})
When I try to get the pc value from the row (val pc = row.get(0).toString()), it throws the following exception. I'm doing this in spark-shell:
org.apache.spark.SparkException: Task not serializable ...
Caused by: java.io.NotSerializableException:
org.apache.spark.sql.Dataset$RDDQueryExecution$
Serialization stack:
- object not serializable
  (class: org.apache.spark.sql.Dataset$RDDQueryExecution$, value:
  org.apache.spark.sql.Dataset$RDDQueryExecution$#jfaf)
- field (class: org.apache.spark.sql.Dataset, name: RDDQueryExecutionModule, type:
  org.apache.spark.sql.Dataset$RDDQueryExecution$)
- object (class: org.apache.spark.sql.Dataset, [pc: String])
The function passed to foreachPartition needs to be serialized and sent to the executors.
So, in your case, Spark is trying to serialize the DriverManager class and everything needed for your JDBC connection, and some of that is not serializable.
foreachPartition works without the DriverManager:
endDF.foreachPartition((partition: Iterator[Row]) => {
  partition.grouped(100).foreach(batch => {
    batch.foreach(row => {
      val pc = row.get(0)
      println(pc)
    })
  })
})
To save it in your DB, first do a .collect and write from the driver.
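A minimal sketch of that collect-then-write approach, assuming the same tshdim table as above and placeholder JDBC settings (jdbcUrl, user and pwd are illustrative names):
import java.sql.DriverManager

// Bring the filtered values back to the driver, then insert them over a single JDBC connection.
val pcs = endDF.collect().map(_.get(0).toString)

val con = DriverManager.getConnection(jdbcUrl, user, pwd) // placeholder connection settings
try {
  val st = con.prepareStatement("insert into tshdim (pc) values (?)")
  pcs.grouped(100).foreach { batch =>
    batch.foreach { pc =>
      st.setString(1, pc)
      st.addBatch()
    }
    st.executeBatch() // one round trip per 100 rows
  }
} finally {
  con.close()
}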

Spark-Scala: Parse Fixed width line into Dataframe Api with exception handling

I am a beginner learning Spark with Scala (pardon my broken English). I need to write a program that parses delimited and fixed-width files into a DataFrame using the Spark/Scala DataFrame API. If the input data is corrupted, the program must handle it in one of the given ways:
A: ignore the corrupted input
B: investigate the error in the input
C: stop on error
To accomplish the above goal, I have successfully done the parsing with exception handling for the delimited file using the DataFrame API options, but I have no idea how to apply the same technique to a fixed-width file. I am using Spark version 2.4.3.
// predefined schema used in the program
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("empno", IntegerType, true)
  .add("ename", StringType, true)
  .add("designation", StringType, true)
  .add("manager", StringType, true)
  .add("hire_date", StringType, true)
  .add("salary", DoubleType, true)
  .add("deptno", IntegerType, true)
  .add("_corrupt_record", StringType, true)

// parse the CSV file with the DataFrame API
// option("mode", "PERMISSIVE") is used to handle corrupt records
val textDF = sqlContext.read.format("csv").option("header", "true").schema(schema).option("mode", "PERMISSIVE").load("empdata.csv")
textDF.show
// program for the fixed-width lines
// lsplit splits a line into a list of tokens based on the given column widths
def lsplit(pos: List[Int], str: String): List[String] = {
  val (rest, result) = pos.foldLeft((str, List[String]())) {
    case ((s, res), curr) =>
      if (s.length() <= curr) {
        val split = s.substring(0).trim()
        val rest = ""
        (rest, split :: res)
      }
      else if (s.length() > curr) {
        val split = s.substring(0, curr).trim()
        val rest = s.substring(curr)
        (rest, split :: res)
      }
      else {
        val split = ""
        val rest = ""
        (rest, split :: res)
      }
  }
  // the list is built in reverse order, so reverse it back
  result.reverse
}
// case class to hold the parsed data
case class EMP(empno: Int, ename: String, designation: String, manager: String, hire_dt: String, salary: Double, deptno: Int)

// column widths
val sizeOfColumn = List(4, 4, 5, 4, 10, 8, 2)

// transform each line into a case class record
val ttRdd = textDF.map { x =>
  val row = lsplit(sizeOfColumn, x.mkString)
  EMP(row(0).toInt, row(1), row(2), row(3), row(4), row(5).toDouble, row(6).toInt)
}
The code works fine for proper data but fails if incorrect data comes in the file.
For example, if the "empno" column contains some non-integer data, the program throws a NumberFormatException.
The program must handle cases where the actual data in the file does not match the specified schema, just as is handled for the delimited file.
Kindly help me here. I need to use the same method for the fixed-width file as I used for the delimited file.
It's sort of obvious: you are blending your own approach with the API's "permissive" option.
Permissive mode will pick up errors such as a wrong data type, but your own lsplit process still executes afterwards and can hit a null/format exception. E.g. if I put "YYY" in empno, this is clearly observable.
If the data type is OK but the length is wrong, you process most cases correctly, but the fields come out garbled.
Your lsplit needs to be more robust, and you need to either check for errors inside it or check before invoking it and skip the bad row.
First case
+-----+-----+---------------+
|empno|ename|_corrupt_record|
+-----+-----+---------------+
| null| null| YYY,Gerry|
| 5555|Wayne| null|
+-----+-----+---------------+
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 30.0 failed 1 times, most recent failure: Lost task 0.0 in stage 30.0 (TID 30, localhost, executor driver): java.lang.NumberFormatException: For input string: "null"
Second case
+------+-----+---------------+
| empno|ename|_corrupt_record|
+------+-----+---------------+
|444444|Gerry| null|
| 5555|Wayne| null|
+------+-----+---------------+
res37: Array[EMP] = Array(EMP(4444,44Ger), EMP(5555,Wayne))
In short, there is some work to do, and in fact no need for a header.
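As one possible direction, the typed conversions can be wrapped in Try and rows that the permissive read already flagged can be skipped. A minimal sketch only, reusing the schema and widths above; the toEmp helper is an illustrative name, not part of the original code:
import scala.util.Try
import org.apache.spark.sql.functions.col

// Convert one already-split row into an EMP only if every typed field parses cleanly;
// otherwise return None so the bad line can be dropped or routed elsewhere for inspection.
def toEmp(row: List[String]): Option[EMP] =
  for {
    empno  <- Try(row(0).toInt).toOption
    salary <- Try(row(5).toDouble).toOption
    deptno <- Try(row(6).toInt).toOption
  } yield EMP(empno, row(1), row(2), row(3), row(4), salary, deptno)

val ttDs = textDF
  .cache()                                // caching sidesteps the restriction some Spark versions place on filtering the internal corrupt-record column
  .filter(col("_corrupt_record").isNull)  // skip rows the permissive read already flagged
  .flatMap(x => toEmp(lsplit(sizeOfColumn, x.mkString)).toSeq)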

select query fails on large dataset in sqlContext

My code reads data through sqlContext. The table has 20 million records in it. I want to calculate the totalCount in the table:
val finalresult = sqlContext.sql("SELECT movieid, tagname, occurrence AS eachTagCount, count AS totalCount FROM result ORDER BY movieid")
I want to calculate the total count of one column without using groupBy and save it in a text file. How can I change the way I save the file so there is no additional ]?
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import sqlContext._

case class DataClass(UserId: Int, MovieId: Int, Tag: String)

// Create an RDD of DataClass objects and register it as a table.
val Data = sc.textFile("file:///usr/local/spark/dataset/tagupdate")
  .map(_.split(","))
  .map(p => DataClass(p(0).trim.toInt, p(1).trim.toInt, p(2).trim))
  .toDF()
Data.registerTempTable("tag")

val orderedId = sqlContext.sql("SELECT MovieId AS Id, Tag FROM tag ORDER BY MovieId")
orderedId.rdd
  .map(_.toSeq.map(_ + "").reduce(_ + ";" + _))
  .saveAsTextFile("/usr/local/spark/dataset/algorithm3/output")
// orderedId.write.parquet("ordered.parquet")

val eachTagCount = orderedId.groupBy("Tag").count()
// eachTagCount.show()
eachTagCount.rdd
  .map(_.toSeq.map(_ + "").reduce(_ + ";" + _))
  .saveAsTextFile("/usr/local/spark/dataset/algorithm3/output2")
ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 604)
java.lang.ArrayIndexOutOfBoundsException: 1
  at tags$$anonfun$6.apply(tags.scala:46)
  at tags$$anonfun$6.apply(tags.scala:46)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
The NumberFormatException is probably thrown in this place:
p(1).trim.toInt
It is thrown because you're trying to parse 10] which is obviously not a valid number.
You could try to find that problematic place in your file and just remove additional ].
You could also try to catch an error and provide a default value in case there are any problems with parsing:
import scala.util.Try
Try(p(1).trim.toInt).getOrElse(0) // return 0 in case there is a problem with parsing
Another thing you could do is to remove characters, which are not digits from the string you're trying to parse:
// filter out everything which is not a digit
p(1).filter(_.isDigit).toInt
It might also fail if everything is filtered out and an empty string is left, so it might be a good idea to wrap this in Try as well.
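Putting those suggestions together, a minimal sketch of a defensive version of the original map (safeInt is an illustrative helper name):
import scala.util.Try

// Parse an Int defensively: strip non-digit characters first, then fall back to 0 if nothing parses.
def safeInt(s: String): Int = Try(s.trim.filter(_.isDigit).toInt).getOrElse(0)

val Data = sc.textFile("file:///usr/local/spark/dataset/tagupdate")
  .map(_.split(","))
  .filter(_.length >= 3) // drop short or garbled lines instead of failing with ArrayIndexOutOfBounds
  .map(p => DataClass(safeInt(p(0)), safeInt(p(1)), p(2).trim))
  .toDF()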

NullPointerException when using Word2VecModel with UserDefinedFunction

I am trying to pass a word2vec model object to my spark udf. Basically I have a test set with movie Ids and I want to pass the ids along with the model object to get an array of recommended movies for each row.
def udfGetSynonyms(model: org.apache.spark.ml.feature.Word2VecModel) =
  udf((col: String) => {
    model.findSynonymsArray("20", 1)
  })
However, this gives me a null pointer exception. When I run model.findSynonymsArray("20", 1) outside the udf I get the expected answer. For some reason the call fails inside the udf but runs fine outside it.
Note: I added "20" here just to get a fixed answer to see if that would work. It does the same when I replace "20" with col.
Thanks for the help!
StackTrace:
SparkException: Job aborted due to stage failure: Task 0 in stage 23127.0 failed 4 times, most recent failure: Lost task 0.3 in stage 23127.0 (TID 4646648, 10.56.243.178, executor 149): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$udfGetSynonyms1$1: (string) => array<struct<_1:string,_2:double>>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:111)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:350)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Word2VecModel.findSynonymsArray(Word2Vec.scala:273)
at linebb57ebe901e04c40a4fba9fb7416f724554.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$udfGetSynonyms1$1.apply(command-232354:7)
at linebb57ebe901e04c40a4fba9fb7416f724554.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$udfGetSynonyms1$1.apply(command-232354:4)
... 12 more
The SQL and udf API is a bit limited, and I am not sure if there is a way to use custom types as columns or as inputs to udfs. A bit of googling didn't turn up anything too useful.
Instead, you can use the Dataset or RDD API and just use a regular Scala function instead of a udf, something like:
val model: Word2VecModel = ...
val inputs: Dataset[String] = ...
inputs.map(movieId => model.findSynonymsArray(movieId, 10))
Alternatively, I guess you could serialize the model to and from a string, but that seems much uglier.
I think this issue happens because wordVectors is a transient variable:
class Word2VecModel private[ml] (
    @Since("1.4.0") override val uid: String,
    @transient private val wordVectors: feature.Word2VecModel)
  extends Model[Word2VecModel] with Word2VecBase with MLWritable {
I have solved this by broadcasting w2vModel.getVectors and re-creating the Word2VecModel inside each partition.
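A minimal sketch of that idea; it assumes a SparkSession named spark, the fitted model as w2vModel, and a Dataset[String] of movie ids named movieIds, and rebuilds the mllib-level model per partition from the broadcast vectors:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.feature.{Word2VecModel => OldWord2VecModel}
import spark.implicits._

// Collect the word vectors on the driver and broadcast them to the executors.
val vectorsMap: Map[String, Array[Float]] = w2vModel.getVectors
  .collect()
  .map(r => r.getAs[String]("word") -> r.getAs[Vector]("vector").toArray.map(_.toFloat))
  .toMap
val bcVectors = spark.sparkContext.broadcast(vectorsMap)

// Rebuild a lightweight mllib Word2VecModel on each partition and query synonyms locally.
val synonyms = movieIds.mapPartitions { ids =>
  val localModel = new OldWord2VecModel(bcVectors.value)
  ids.map(id => id -> localModel.findSynonyms(id, 10))
}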

Share HDInsight SPARK SQL Table saveAsTable does not work

I want to show the data from HDInsight Spark using Tableau. I was following this video where they describe how to connect the two systems and expose the data.
Currently my script itself is very simple, as shown below:
/* csvLines is an RDD of strings, one per line in the CSV file */
val csvLines = sc.textFile("wasb://mycontainer@mysparkstorage.blob.core.windows.net/*/*/*/mydata__000000.csv")

// Define a schema
case class MyData(Timestamp: String, TimezoneOffset: String, SystemGuid: String, TagName: String, NumericValue: Double, StringValue: String)

// Map the values in the .csv file to the schema
val myData = csvLines.map(s => s.split(",")).filter(s => s(0) != "Timestamp").map(
  s => MyData(s(0),
    s(1),
    s(2),
    s(3),
    s(4).toDouble,
    s(5)
  )
).toDF()

// Register as a temporary table called "test_table" and try to persist it
myData.registerTempTable("test_table")
myData.saveAsTable("test_table")
Unfortunately I run into the following error:
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
org.apache.spark.sql.AnalysisException: Table `test_table` already exists.;
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:209)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:198)
I have also tried the following code to overwrite the table if it exists:
import org.apache.spark.sql.SaveMode
myData.saveAsTable("test_table", SaveMode.Overwrite)
but it still fails, now with this error:
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
java.lang.RuntimeException: Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.SparkStrategies$DDLStrategy$.apply(SparkStrategies.scala:416)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
Can someone please help me fix this issue?
I know it was my mistake, but I'll leave it as an answer since it was not readily available in any of the blogs or forum answers; hopefully it will help someone like me starting with Spark.
I figured out that .toDF() actually creates a sqlContext-based DataFrame, not a hiveContext-based one, so I have now updated my code as below:
// Map the values in the .csv file to the schema
val myData = csvLines.map(s => s.split(",")).filter(s => s(0) != "Timestamp").map(
  s => MyData(s(0),
    s(1),
    s(2),
    s(3),
    s(4).toDouble,
    s(5)
  )
)

// Create the DataFrame through the hiveContext, register it as "mydata_stored" and persist it
val myDataFrame = hiveContext.createDataFrame(myData)
myDataFrame.registerTempTable("mydata_stored")
myDataFrame.write.mode(SaveMode.Overwrite).saveAsTable("mydata_stored")
Also make sure that s(4) has a proper double value, otherwise add a try/catch to handle it. I did something like this:
def parseDouble(s: String): Double = try { s.toDouble } catch { case _ => 0.00 }
parseDouble(s(4))
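For completeness, the hiveContext used above can be created from the existing SparkContext; a minimal sketch, assuming the Spark 1.x API that the rest of this code uses:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

// Hive-backed SQL context; needed for persistent (non-TEMPORARY) tables
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._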