Spark-Scala: Parse fixed-width lines into a DataFrame with exception handling - scala

I am a beginner learning Spark with Scala. I need to write a program that parses delimited and fixed-width files into a DataFrame using the Spark-Scala DataFrame API. If the input data is corrupted, the program must handle it in one of the following ways:
A: ignore the corrupt input data
B: investigate the error in the input
C: stop on error
To accomplish this, I have successfully done the parsing with exception handling for the delimited file using DataFrame API options, but I have no idea how to apply the same technique to a fixed-width file. I am using Spark 2.4.3.
// imports needed for the schema definition
import org.apache.spark.sql.types._

// predefined schema used in the program
val schema = new StructType()
  .add("empno", IntegerType, true)
  .add("ename", StringType, true)
  .add("designation", StringType, true)
  .add("manager", StringType, true)
  .add("hire_date", StringType, true)
  .add("salary", DoubleType, true)
  .add("deptno", IntegerType, true)
  .add("_corrupt_record", StringType, true)

// parse the CSV file into a DataFrame
// option("mode", "PERMISSIVE") is used to handle corrupt records
val textDF = sqlContext.read
  .format("csv")
  .option("header", "true")
  .schema(schema)
  .option("mode", "PERMISSIVE")
  .load("empdata.csv")

textDF.show
// program for fixed-width lines
// lsplit splits a line into a list of tokens based on the given column widths
def lsplit(pos: List[Int], str: String): List[String] = {
  val (rest, result) = pos.foldLeft((str, List[String]())) {
    case ((s, res), curr) =>
      if (s.length() <= curr) {
        val split = s.substring(0).trim()
        val rest = ""
        (rest, split :: res)
      }
      else if (s.length() > curr) {
        val split = s.substring(0, curr).trim()
        val rest = s.substring(curr)
        (rest, split :: res)
      }
      else {
        val split = ""
        val rest = ""
        (rest, split :: res)
      }
  }
  // the accumulator is built in reverse, so reverse it back
  result.reverse
}
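To make the splitting logic concrete, here is a quick illustration (my own example, not part of the original program) with a hypothetical two-column layout of 4 and 5 characters:
// illustrative call with a hypothetical 4 + 5 character layout
lsplit(List(4, 5), "7839KING ")
// returns List("7839", "KING"); each slice is trimmed before it is collected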
// case class to hold the parsed data
case class EMP(empno: Int, ename: String, designation: String, manager: String, hire_dt: String, salary: Double, deptno: Int)

// column widths of the fixed-width layout
val sizeOfColumn = List(4, 4, 5, 4, 10, 8, 2)

// transform each line into a case class record
val ttRdd = textDF.map { x =>
  val row = lsplit(sizeOfColumn, x.mkString)
  EMP(row(0).toInt, row(1), row(2), row(3), row(4), row(5).toDouble, row(6).toInt)
}
The code works fine for proper data but fails if incorrect data comes in the file. For example, if the "empno" column contains non-integer data, the program throws a NumberFormatException. The program must handle the case where the actual data in the file does not match the specified schema, just as it is handled for the delimited file.
Kindly help me here; I need to use the same approach for the fixed-width file as I used for the delimited file.

It's sort of obvious: you are blending your own approach with the API's "PERMISSIVE" option.
Permissive mode will pick up errors such as a wrong data type, but your own lsplit process still executes afterwards and can hit a null/number-format exception. E.g. if I put "YYY" in empno, this is clearly observable.
If the data type is OK but the length is wrong, you process the line in most cases without an error, but the fields come out garbled.
Your lsplit needs to be more robust, and you need to check whether an error exists inside it, or before invoking it, and skip the record if so (see the sketch after the examples below).
First case
First case
+-----+-----+---------------+
|empno|ename|_corrupt_record|
+-----+-----+---------------+
| null| null| YYY,Gerry|
| 5555|Wayne| null|
+-----+-----+---------------+
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 30.0 failed 1 times, most recent failure: Lost task 0.0 in stage 30.0 (TID 30, localhost, executor driver): java.lang.NumberFormatException: For input string: "null"
Second case
+------+-----+---------------+
| empno|ename|_corrupt_record|
+------+-----+---------------+
|444444|Gerry| null|
| 5555|Wayne| null|
+------+-----+---------------+
res37: Array[EMP] = Array(EMP(4444,44Ger), EMP(5555,Wayne))
In short, there is some work to do, and in fact no need for a header.
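A minimal sketch of that extra check (an illustration only, reusing lsplit, EMP and sizeOfColumn from the question): first drop the rows that PERMISSIVE mode has already flagged via _corrupt_record, then wrap the conversion in scala.util.Try so a malformed fixed-width record is skipped (case A) instead of aborting the job.
import org.apache.spark.sql.functions.col
import scala.util.Try

val parsed = textDF
  .filter(col("_corrupt_record").isNull)   // ignore rows Spark already marked as corrupt
  .flatMap { x =>
    val row = lsplit(sizeOfColumn, x.mkString)
    // conversion errors (e.g. a non-integer empno) simply drop the record
    Try(EMP(row(0).toInt, row(1), row(2), row(3), row(4), row(5).toDouble, row(6).toInt)).toOption.toSeq
  }
For case B the failures could be returned as an Either and collected separately, and for case C you would simply let the exception propagate.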

Related

Spark UDF now throws ArrayIndexOutOfBoundsException

I wrote a UDF in Spark (3.0.0) to do an MD5 hash of columns that looks like this:
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

def Md5Hash(text: String): String = {
  java.security.MessageDigest.getInstance("MD5")
    .digest(text.getBytes())
    .map(0xFF & _)
    .map("%02x".format(_))
    .foldLeft("") { _ + _ }
}

val md5Hash: UserDefinedFunction = udf(Md5Hash(_))
This function has worked fine for me for months, but it is now failing at runtime:
org.apache.spark.SparkException: Failed to execute user defined function(UDFs$$$Lambda$3876/1265187815: (string) => string)
....
Caused by: java.lang.ArrayIndexOutOfBoundsException
at sun.security.provider.DigestBase.engineUpdate(DigestBase.java:116)
at sun.security.provider.MD5.implDigest(MD5.java:109)
at sun.security.provider.DigestBase.engineDigest(DigestBase.java:207)
at sun.security.provider.DigestBase.engineDigest(DigestBase.java:186)
at java.security.MessageDigest$Delegate.engineDigest(MessageDigest.java:592)
at java.security.MessageDigest.digest(MessageDigest.java:365)
at java.security.MessageDigest.digest(MessageDigest.java:411)
It still works on some small datasets, but I have another larger dataset (10Ms of rows, so not terribly huge) that fails here. I couldn't find any indication that the data I'm trying to hash are bizarre in any way -- all input values are non-null, ASCII strings. What might cause this error when it previously worked fine? I'm running in AWS EMR 6.1.0 with Spark 3.0.0.

Spark Structured Streaming: Importing image files to create a simple ML app

I would like to build a Structured Streaming application whose purpose is to retrieve images from a URL and build a pretty simple ML model that classifies them based on the content of the image.
I have a URL (http://129.102.240.235/extraits/webcam/webcam.jpg) which is updated every X units of time with a new image. My goal is first to store those images, or to import them directly using a readStream object (if that is possible?). I know that since Spark 2.x we can directly use the "image" format to read content into a DataFrame. I was hesitating between different approaches:
using a message bus solution (such as Kafka) that would produce my content to be consumed in Spark; I thought this would not be bad because Kafka can replicate files, so the risk of data loss is lower.
directly using the readStream object to read the image (this is what I tried to do, see below).
The purpose of my Scala code below is just to show the content of the image, but it throws different errors when I test it using spark-shell; I will comment on the errors in the corresponding parts of the code.
scala> val url = "http://129.102.240.235/extraits/webcam/webcam.jpg"
url: String = http://129.102.240.235/extraits/webcam/webcam.jpg
scala> spark.sparkContext.addFile(url)
scala> val image_df = spark.read.format("image").load("file://"+SparkFiles.get("webcam.jpg"))
image_df: org.apache.spark.sql.DataFrame = [image: struct<origin: string, height: int ... 4 more fields>]
scala> image_df.select("image.origin").show(false)
19/10/25 13:33:26 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.SparkException: File /tmp/spark-28741963-fd2d-44c2-8a6b-a489fdaae96d/userFiles-95b99fde-a1e2-4da6-9a17-382bfd2292c4/webcam.jpg exists and does not match contents of http://129.102.240.235/extraits/webcam/webcam.jpg
I also tried with using readStream:
scala> val scheme = " origin STRING, height INT, width INT, nChannels INT, mode INT, data BINARY"
scheme: String = " origin STRING, height INT, width INT, nChannels INT, mode INT, data BINARY"
scala> val image_df = spark.readStream.format("image").schema(scheme).load("file://"+SparkFiles.get("webcam.jpg"))
image_df: org.apache.spark.sql.DataFrame = [origin: string, height: int ... 4 more fields]
scala> val query_show = image_df.collect.foreach(println).writeStream.format("console").start()
<console>:26: error: value writeStream is not a member of Unit
val query_show = image_df.collect.foreach(println).writeStream.format("console").start()
// Based on what I read in a StackOverflow question, I suppose this error might be caused because
// .writeStream should be on the next line, so I tried to put it on 2 lines but..
scala> val query_show = image_df.collect.foreach(println).
| writeStream.format("console").start()
<console>:27: error: value writeStream is not a member of Unit
possible cause: maybe a semicolon is missing before `value writeStream'?
writeStream.format("console").start()
// Also tried without declaring query_show, but it returns the same error..
// I know that if I make it work I will have to add query_show.awaitTermination()
Any help on debugging this code or idea to build my streaming pipeline would be highly appreciated!
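As a side note on that compile error (my own observation): writeStream is defined on a streaming DataFrame/Dataset, whereas collect.foreach(println) evaluates to Unit, which is why the compiler rejects it. The usual console-sink pattern, assuming the same image_df, would be:
val query_show = image_df.writeStream
  .format("console")
  .start()
query_show.awaitTermination()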
I managed to find a way to show() my DataFrame read using the "image" format.
I did it in 2 steps:
1/ run a Python script that saves the jpg image from the URL; the script is:
import urllib.request
import shutil

filename = '~/dididi/bpi-spark/images'
url = "http://129.102.240.235/extraits/webcam/webcam.jpg"

with urllib.request.urlopen(url) as response, open(filename, 'ab') as out_file:
    shutil.copyfileobj(response, out_file)
2/ Then, using spark-shell I just executed those 2 lines:
val image_df = spark.read.format("image").option("inferSchema", true).load("bpi-spark/images").select($"image.origin",$"image.height",$"image.width", $"image.mode", $"image.data")
scala> image_df.show()
+--------------------+------+-----+----+--------------------+
| origin|height|width|mode| data|
+--------------------+------+-----+----+--------------------+
|file:///home/niki...| 480| 720| 16|[3C 3D 39 3C 3D 3...|
+--------------------+------+-----+----+--------------------+

select query fails on large dataset in sqlContext

My code reads data through sqlContext. The table has 20 million records in it. I want to calculate the totalCount in the table.
val finalresult = sqlContext.sql("SELECT movieid, tagname, occurrence AS eachTagCount, count AS totalCount FROM result ORDER BY movieid")
I want to calculate the total count of one column without using groupBy and save it to a text file. I changed how I save the file so it is written without the additional ].
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import sqlContext._

case class DataClass(UserId: Int, MovieId: Int, Tag: String)

// Create an RDD of DataClass objects and register it as a table.
val Data = sc.textFile("file:///usr/local/spark/dataset/tagupdate")
  .map(_.split(","))
  .map(p => DataClass(p(0).trim.toInt, p(1).trim.toInt, p(2).trim))
  .toDF()
Data.registerTempTable("tag")

val orderedId = sqlContext.sql("SELECT MovieId AS Id, Tag FROM tag ORDER BY MovieId")
orderedId.rdd
  .map(_.toSeq.map(_ + "").reduce(_ + ";" + _))
  .saveAsTextFile("/usr/local/spark/dataset/algorithm3/output")
// orderedId.write.parquet("ordered.parquet")

val eachTagCount = orderedId.groupBy("Tag").count()
// eachTagCount.show()
eachTagCount.rdd
  .map(_.toSeq.map(_ + "").reduce(_ + ";" + _))
  .saveAsTextFile("/usr/local/spark/dataset/algorithm3/output2")
ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 604) java.lang.ArrayIndexOutOfBoundsException: 1 at tags$$anonfun$6.apply(tags.scala:46) at tags$$anonfun$6.apply(tags.scala:46) at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
The NumberFormatException is probably thrown in this place:
p(1).trim.toInt
It is thrown because you're trying to parse 10], which is obviously not a valid number.
You could try to find that problematic place in your file and just remove the additional ].
You could also catch the error and provide a default value in case there are any problems with parsing:
import scala.util.Try
Try(p(1).trim.toInt).getOrElse(0) // return 0 in case there is a problem with parsing
Another thing you could do is to remove the characters which are not digits from the string you're trying to parse:
// filter out everything which is not a digit
p(1).filter(_.isDigit).toInt
It might also fail in case everything is filtered out and an empty string is left, so it might be a good idea to also wrap it in Try.
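Putting the two suggestions together, a sketch (my own combination, reusing the DataClass case class and file path from the question) could guard both the index access and the parsing:
import scala.util.Try

// keep only digits, then fall back to 0 when nothing parseable remains
def safeInt(s: String): Int = Try(s.trim.filter(_.isDigit).toInt).getOrElse(0)

val Data = sc.textFile("file:///usr/local/spark/dataset/tagupdate")
  .map(_.split(","))
  .filter(_.length >= 3)   // also avoids the ArrayIndexOutOfBoundsException on short lines
  .map(p => DataClass(safeInt(p(0)), safeInt(p(1)), p(2).trim))
  .toDF()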

How to read file names from column in DataFrame to process using SparkContext.textFile?

I'm very new to Spark and I'm stuck with this issue: from a DataFrame that I have created, called reportesBN, I want to get the value of a field in order to use it to read a text file from a specific path, and after that apply a specific process to that file.
I have developed this code, but it's not working:
reportesBN.foreach { x =>
  val file = x(0)
  val insumo = sc.textFile(s"$file")
  val firstRow = insumo.first.split("\\|", -1)
  // Get the values of the next rows
  val nextRows = insumo.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
  val dfNextRows = nextRows.map(a => a.split("\\|"))
    .map(x => BalanzaNextRows(x(0), x(1), x(2), x(3), x(4)))
    .toDF()
  val validacionBalanza = new RevisionCampos(sc)
  validacionBalanza.validacionBalanza(firstRow, dfNextRows)
}
The error log indicates that it is because of serialization.
7/06/28 18:55:45 INFO SparkContext: Created broadcast 0 from textFile at ValidacionInsumos.scala:56
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
Is this problem caused by the Spark Context (sc) that is inside the foreach?
Is there another way to implement this?
Regards.
You asked a very similar question before, and it is the same issue: you cannot use SparkContext inside an RDD or DataFrame transformation or action. In this case, you use sc.textFile(s"$file") inside reportesBN.foreach, where reportesBN, as you said, is a DataFrame:
From a DataFrame that I have created; called reportesBN
You should rewrite your transformation to take the file names out of the DataFrame first and read each file afterwards, on the driver.
// This is val file = x(0)
// I assume that the column name is `files`
val files = reportesBN.select("files").as[String].collect
Once you have the collection of file names to process, you execute the code from your block over it (see the sketch below):
files.foreach { x =>
  ...
}
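Put together, a sketch of the rewritten loop (an illustration reusing the hypothetical `files` column name and the BalanzaNextRows / RevisionCampos classes from the question) could look like this:
// collect the file paths on the driver, then read each file with sc.textFile
val files = reportesBN.select("files").as[String].collect

files.foreach { file =>
  val insumo = sc.textFile(file)
  val firstRow = insumo.first.split("\\|", -1)
  val nextRows = insumo.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
  val dfNextRows = nextRows.map(_.split("\\|"))
    .map(x => BalanzaNextRows(x(0), x(1), x(2), x(3), x(4)))
    .toDF()
  val validacionBalanza = new RevisionCampos(sc)
  validacionBalanza.validacionBalanza(firstRow, dfNextRows)
}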

NullPointerException applying a function to spark RDD that works on non-RDD

I have a function that I want to apply to every row of a .csv file:
def convert(inString: Array[String]): String = {
  val country = inString(0)
  val sellerId = inString(1)
  val itemID = inString(2)
  try {
    val minidf = sqlContext.read.json(sc.makeRDD(inString(3) :: Nil))
      .withColumn("country", lit(country))
      .withColumn("seller_id", lit(sellerId))
      .withColumn("item_id", lit(itemID))
    val finalString = minidf.toJSON.collect().mkString(",")
    finalString
  } catch {
    case e: Exception =>
      println("AN EXCEPTION " + inString.mkString(","))
      ("this is an exception " + e + " " + inString.mkString(","))
  }
}
This function transforms an entry of the sort:
CA 112578240 132080411845 [{"id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
Where I have 4 columns, the 4th being a json blob, into
[{"country":"CA", "seller":112578240", "product":112578240, "id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
which is the json object where the first 3 columns have been inserted into the fourth.
Now, this works:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).collect().map(x => convert(x))
or this:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).take(10).map(x => convert(x))
but this does not
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).map(x => convert(x))
The last one throws a java.lang.NullPointerException.
I included a try/catch clause to see where exactly this is failing, and it's failing for every single row.
What am I doing wrong here?
You cannot put sqlContext or sparkContext in a Spark map, since those objects can only exist on the driver node. Essentially they are in charge of distributing your tasks.
You could rewrite the JSON parsing bit using one of these libraries in pure Scala: https://manuel.bernhardt.io/2015/11/06/a-quick-tour-of-json-libraries-in-scala/
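As an alternative to a pure-Scala JSON library, the same transformation can be sketched entirely with Spark's built-in JSON functions (my own illustration, assuming Spark 2.2+, a tab-separated input at path_to_file, and the JSON layout from the sample row):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// schema of the JSON array held in the fourth column, taken from the sample record
val eventSchema = ArrayType(new StructType()
  .add("id", StringType)
  .add("severity", StringType)
  .add("timestamp", StringType))

val converted = spark.read.option("sep", "\t").csv(path_to_file)   // columns _c0 .. _c3
  .withColumn("event", explode(from_json(col("_c3"), eventSchema)))
  .select(to_json(struct(
    col("_c0").as("country"),
    col("_c1").as("seller"),
    col("_c2").as("product"),
    col("event.id"),
    col("event.severity"),
    col("event.timestamp"))).as("json"))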