Best approach for parsing a large structured file with Apache Spark - Scala

I have a huge text file (in GBs) with plain text data in each line, which needs to be parsed and extracted into a structure for further processing. Each line is 200 characters long, and I have a regular expression to parse each line and split it into groups, which will later be saved as flat column data.
Data sample
1759387ACD06JAN1910MAR191234567ACRT
RegExp
(.{7})(.{3})(.{7})(.{7})(.{7})(.{4})
Data Structure
Customer ID, Code, From Date, To Date, Transaction ID, Product Code
1759387, ACD, 06JAN19, 10MAR19, 1234567, ACRT
Please suggest the best approach to parse this huge data set and push it to an in-memory grid, which will be used again by Spark jobs for further processing when the respective APIs are invoked.

You can use the DataFrame (DF) approach. Copy the serial file to HDFS using the -copyFromLocal command
and use the code below to parse each record.
I'm assuming the sample records in gireesh.txt are as below:
1759387ACD06JAN1910MAR191234567ACRT
2759387ACD08JAN1910MAY191234567ACRY
3759387ACD03JAN1910FEB191234567ACRZ
The Spark code:
import org.apache.log4j.{Level, Logger}
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.Encoders._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
object Gireesh {
  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark = SparkSession.builder().appName("Operations..").master("local[*]").getOrCreate()
    import spark.implicits._

    // one capturing group per fixed-width field
    val pat = """(.{7})(.{3})(.{7})(.{7})(.{7})(.{4})""".r
    val headers = List("custid", "code", "fdate", "tdate", "tranid", "prdcode")

    val rdd = spark.sparkContext.textFile("in/gireesh.txt")
      .map { x =>
        // collect the captured groups of each line and build a typed tuple
        val y = scala.collection.mutable.ArrayBuffer[String]()
        pat.findAllIn(x).matchData.foreach(m => y.appendAll(m.subgroups))
        (y(0).toLong, y(1), y(2), y(3), y(4).toLong, y(5))
      }

    val df = rdd.toDF(headers: _*)
    df.printSchema()
    df.show(false)
  }
}
This gives the results below:
root
|-- custid: long (nullable = false)
|-- code: string (nullable = true)
|-- fdate: string (nullable = true)
|-- tdate: string (nullable = true)
|-- tranid: long (nullable = false)
|-- prdcode: string (nullable = true)
+-------+----+-------+-------+-------+-------+
|custid |code|fdate |tdate |tranid |prdcode|
+-------+----+-------+-------+-------+-------+
|1759387|ACD |06JAN19|10MAR19|1234567|ACRT |
|2759387|ACD |08JAN19|10MAY19|1234567|ACRY |
|3759387|ACD |03JAN19|10FEB19|1234567|ACRZ |
+-------+----+-------+-------+-------+-------+
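Since the question also mentions reusing the parsed data, note that within the same Spark application you can keep the parsed DataFrame in executor memory with cache (a sketch; pushing it to an external in-memory grid would need that grid's own Spark connector, which is not covered here):
df.cache()  // keep the parsed rows in memory for later actions in this application
df.count()  // an action to materialize the cache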
EDIT1:
You can have the map "transformation" in a separate function like below.
def parse(record: String) = {
  val y = scala.collection.mutable.ArrayBuffer[String]()
  pat.findAllIn(record).matchData.foreach(m => y.appendAll(m.subgroups))
  (y(0).toLong, y(1), y(2), y(3), y(4).toLong, y(5))
}

val rdd = spark.sparkContext.textFile("in/gireesh.txt")
  .map(x => parse(x))

val df = rdd.toDF(headers: _*)
df.printSchema()
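If you prefer to stay in the DataFrame API, the same fixed-width split can also be expressed with the built-in regexp_extract column function instead of an RDD map. A minimal sketch, assuming the same input file, pattern, and column names as above:
val pattern = "(.{7})(.{3})(.{7})(.{7})(.{7})(.{4})"
val parsed = spark.read.text("in/gireesh.txt")   // one string column named "value"
  .select(
    regexp_extract($"value", pattern, 1).cast("long").as("custid"),
    regexp_extract($"value", pattern, 2).as("code"),
    regexp_extract($"value", pattern, 3).as("fdate"),
    regexp_extract($"value", pattern, 4).as("tdate"),
    regexp_extract($"value", pattern, 5).cast("long").as("tranid"),
    regexp_extract($"value", pattern, 6).as("prdcode")
  )
parsed.printSchema()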

You need to tell Spark which file to read and how to process the content while reading it.
Here is an example:
val numberOfPartitions = 5 // this needs to be optimized based on the size of the file and the available resources (e.g. memory)
val someObjectsRDD: RDD[SomeObject] =
  sparkContext.textFile("/path/to/your/file", numberOfPartitions)
    .mapPartitions(
      { stringsFromFileIterator =>
        stringsFromFileIterator.map(stringFromFile => /* process the raw string here and return the result */ ???)
      },
      preservesPartitioning = true
    )
In the code snippet, SomeObject is a type with the data structure from the question.
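For illustration, here is a minimal sketch of how SomeObject could be filled in, assuming a case class with the fields from the question and fixed-width offsets matching the sample line (both the field names and the offsets are assumptions; import org.apache.spark.rdd.RDD is also needed):
case class SomeObject(custId: Long, code: String, fromDate: String,
                      toDate: String, tranId: Long, prodCode: String)

val someObjectsRDD: RDD[SomeObject] =
  sparkContext.textFile("/path/to/your/file", numberOfPartitions)
    .mapPartitions(
      { stringsFromFileIterator =>
        stringsFromFileIterator.map { line =>
          SomeObject(
            line.substring(0, 7).toLong,    // Customer ID
            line.substring(7, 10),          // Code
            line.substring(10, 17),         // From Date
            line.substring(17, 24),         // To Date
            line.substring(24, 31).toLong,  // Transaction ID
            line.substring(31, 35)          // Product code
          )
        }
      },
      preservesPartitioning = true
    )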

Related

Scala explode followed by UDF on a dataframe fails

I have a Scala dataframe with the following schema:
root
|-- time: string (nullable = true)
|-- itemId: string (nullable = true)
|-- itemFeatures: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
I want to explode the itemFeatures column and then send my dataframe to a UDF. But as soon as I include the explode, calling the UDF results in this error:
org.apache.spark.SparkException: Task not serializable
I can't figure out why???
Environment: Scala 2.11.12, Spark 2.4.4
Full example:
val dataList = List(
("time1", "id1", "map1"),
("time2", "id2", "map2"))
val df = dataList.toDF("time", "itemId", "itemFeatures")
val dfExploded = df.select(col("time"), col("itemId"), explode("itemFeatures"))
val doNextThingUDF: UserDefinedFunction = udf(doNextThing _)
val dfNextThing = dfExploded.withColumn("nextThing", doNextThingUDF(col("time"))
where my UDF looks like this:
val doNextThing(time: String): String = {
time+"blah"
}
If I remove the explode, everything works fine, and if I don't call the UDF after the explode, everything works fine. I could imagine Spark is somehow unable to send each row to a UDF if it is dynamically executing the explode and doesn't know how many rows there are going to be, but even when I add e.g. dfExploded.cache() and dfExploded.count() I still get the error. Is this a known issue? What am I missing?
I think the issue comes from how you define your doNextThing function. Also,
there are a couple of typos in your "full example".
In particular, the itemFeatures column is a string in your example; I understand it should be a Map.
But here is a working example:
val dataList = List(
  ("time1", "id1", Map("map1" -> 1)),
  ("time2", "id2", Map("map2" -> 2)))
val df = dataList.toDF("time", "itemId", "itemFeatures")
val dfExploded = df.select(col("time"), col("itemId"), explode($"itemFeatures"))
val doNextThing = (time: String) => {time+"blah"}
val doNextThingUDF = udf(doNextThing)
val dfNextThing = dfExploded.withColumn("nextThing", doNextThingUDF(col("time")))

Processing Array[Byte] in Spark dataframes

I have a dataframe df1 as below with schema:
scala> df1.printSchema
root
|-- filecontent: binary (nullable = true)
|-- filename: string (nullable = true)
The DF has filename and its content. The content is GZIPped. I could use something like the below to unzip the data in filecontent and save it to HDFS.
def decompressor(origRow: Row) = {
  val filename = origRow.getString(1)
  val filecontent = serialise(origRow.getString(0))
  val unzippedData = new GZIPInputStream(new ByteArrayInputStream(filecontent))
  val hadoop_fs = FileSystem.get(sc.hadoopConfiguration)
  val filenamePath = new Path(filename)
  val fos = hadoop_fs.create(filenamePath)
  org.apache.hadoop.io.IOUtils.copyBytes(unzippedData, fos, sc.hadoopConfiguration)
  fos.close()
}
My objective:
Since the filecontent column data in df1 is binary, i.e. Array[Byte], I shouldn't distribute the data; I need to keep it together and pass it to the function so that it can decompress it and save it to a file.
My Question:
How do I not distribute the data (column data)?
How do I make sure the processing happens for 1 row at a time?

Unable to find encoder for type stored in a Dataset, in Spark Structured Streaming

I am trying an example of Spark Structured Streaming given on the Spark website, but it is throwing these errors:
1. Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
2. not enough arguments for method as: (implicit evidence$2: org.apache.spark.sql.Encoder[data])org.apache.spark.sql.Dataset[data].
Unspecified value parameter evidence$2.
val ds: Dataset[data] = df.as[data]
Here is my code
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._
import org.apache.spark.sql.Encoders
object final_stream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("kafka-consumer")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    spark.sparkContext.setLogLevel("WARN")

    case class data(name: String, id: String)

    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "172.21.0.187:9093")
      .option("subscribe", "test")
      .load()

    println(df.isStreaming)

    val ds: Dataset[data] = df.as[data]
    val value = ds.select("name").where("id > 10")

    value.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}
Any help on how to make this work? I want the final output like this:
+-----+---+
| name| id|
+-----+---+
|Jacek|  1|
+-----+---+
The reason for the error is that you are dealing with Array[Byte] values coming from Kafka and there are no fields to match the data case class.
scala> println(schema.treeString)
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
Change the line df.as[data] to the following:
df.
select($"value" cast "string").
map(value => ...parse the value to get name and id here...).
as[data]
I strongly recommend using select and the functions object to deal with the incoming data.
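For example, if the Kafka value happened to carry a simple comma-separated name,id payload (an assumption, the actual message format is not shown in the question) and the data case class were defined at the top level (so spark.implicits._ can derive its encoder), the parse step could look like:
val ds: Dataset[data] = df.
  select($"value" cast "string").
  as[String].
  map { value =>
    val Array(name, id) = value.split(",", 2)  // assumed "name,id" payload
    data(name, id)
  }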
The error is due to a mismatch between the columns in the dataframe and your case class.
You have [topic, timestamp, value, key, offset, timestampType, partition] columns in the dataframe,
whereas your case class has only two columns:
case class data(name: String, id: String)
You can display the content of the dataframe with:
val display = df.writeStream.format("console").start()
Sleep for some seconds and then
display.stop()
And also use option("startingOffsets", "earliest") as mentioned here
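For example (a sketch reusing the bootstrap servers and topic from the question):
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "172.21.0.187:9093")
  .option("subscribe", "test")
  .option("startingOffsets", "earliest")  // read the topic from the beginning
  .load()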
Then create a case class as per your data.
Hope this helps!

Scala and Spark UDF function

I made a simple UDF to convert or extract some values from a time field in a temp table in Spark. I register the function, but when I call the function using SQL it throws a NullPointerException. Below is my function and the process of executing it. I am using Zeppelin. Strangely, this was working yesterday but it stopped working this morning.
Function
def convert(time: String): String = {
  val sdf = new java.text.SimpleDateFormat("HH:mm")
  val time1 = sdf.parse(time)
  return sdf.format(time1)
}
Register the Function
sqlContext.udf.register("convert",convert _)
Test the function without SQL -- This works
convert("12:12:12") -> returns 12:12
Test the function with SQL in Zeppelin -- this FAILS.
%sql
select convert(time) from temptable limit 10
Structure of temptable
root
|-- date: string (nullable = true)
|-- time: string (nullable = true)
|-- serverip: string (nullable = true)
|-- request: string (nullable = true)
|-- resource: string (nullable = true)
|-- protocol: integer (nullable = true)
|-- sourceip: string (nullable = true)
Part of the stack trace that I am getting:
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:643)
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:652)
at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:54)
at org.apache.spark.sql.hive.HiveContext$$anon$3.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:376)
at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:44)
Use udf instead of defining a function directly.
import org.apache.spark.sql.functions._
val convert = udf[String, String](time => {
  val sdf = new java.text.SimpleDateFormat("HH:mm")
  val time1 = sdf.parse(time)
  sdf.format(time1)
})
A udf's input parameters are Columns (one or more), and its return type is Column:
case class UserDefinedFunction protected[sql] (
    f: AnyRef,
    dataType: DataType,
    inputTypes: Option[Seq[DataType]]) {

  def apply(exprs: Column*): Column = {
    Column(ScalaUDF(f, dataType, exprs.map(_.expr), inputTypes.getOrElse(Nil)))
  }
}
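Once defined this way, the UDF is applied to a Column rather than to a plain String, e.g. (a sketch, assuming df is the DataFrame behind your temptable):
df.select(convert(col("time")).as("time_hm")).show()  // "time_hm" is just an arbitrary alias for the converted column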
You have to define your function as a UDF.
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
val convertUDF: UserDefinedFunction = udf((time: String) => {
  val sdf = new java.text.SimpleDateFormat("HH:mm")
  val time1 = sdf.parse(time)
  sdf.format(time1)
})
Next you would apply your UDF on your DataFrame.
// assuming your DataFrame is already defined
dataFrame.withColumn("time", convertUDF(col("time"))) // using the same name replaces existing
Now, as to your actual problem: one reason you are receiving this error could be that your DataFrame contains rows with null values in the time column. If you filter them out before you apply the UDF, you should be able to continue with no problem.
dataFrame.filter(col("time").isNotNull)
I'm curious what else causes a NullPointerException when running a UDF, other than it encountering a null; if you found a reason different from my suggestion, I'd be glad to know.
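As an alternative sketch (not from the answers above), you could also make the UDF itself null-safe by wrapping the input in an Option, so rows with a null time simply yield a null result instead of an exception:
// hypothetical null-safe variant of convertUDF
val convertSafeUDF = udf((time: String) =>
  Option(time).map { t =>
    val sdf = new java.text.SimpleDateFormat("HH:mm")
    sdf.format(sdf.parse(t))
  }.orNull
)

dataFrame.withColumn("time", convertSafeUDF(col("time")))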

How to apply a function to a column of a Spark DataFrame?

Let's assume that we have a Spark DataFrame
df.getClass
Class[_ <: org.apache.spark.sql.DataFrame] = class org.apache.spark.sql.DataFrame
with the following schema
df.printSchema
root
|-- rawFV: string (nullable = true)
|-- tk: array (nullable = true)
| |-- element: string (containsNull = true)
Given that each row of the tk column is an array of strings, how do I write a Scala function that will return the number of elements in each row?
You don't have to write a custom function because there is one:
import org.apache.spark.sql.functions.size
df.select(size($"tk"))
If you really want to, you can write a UDF:
import org.apache.spark.sql.functions.udf
val size_ = udf((xs: Seq[String]) => xs.size)
or even create a custom expression, but there is really no point in that.
One way is to access the elements using SQL, like below:
df.registerTempTable("tab1")
val df2 = sqlContext.sql("select tk[0], tk[1] from tab1")
df2.show()
To get the size of the array column:
val df3 = sqlContext.sql("select size(tk) from tab1")
df3.show()
If your Spark version is older, you can use HiveContext instead of Spark's SQL Context.
I would also try something that traverses the array.