Inconsistent saveAsTextFiles when using mapWithState in spark streaming - scala

To identify duplicate records i am using mapWithState with my row RDD. The code is supposed to check the state of each key it receives and return value and the state. Looking at the key i should be able to identify if i already processed the key(duplicate) or if it is a new key even across rdds in a dstream.
Here is the relevant part of my code.
object StreamingDeduplicate{
def main(args:Array[String]): Unit = {
val sparkConf = new SparkConf().setAppName("StreamingDeduplication")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Minutes(2))
ssc.checkpoint("streamingDedupCheckpoint")
val input = ssc.textFileStream("/test/spark/deduplicate")
val rowRdd = input.map(line => (line.split(",")(0), line))
val checkIfKeyExists = (key: String, value: Option[String], state: State[String]) => {
val currentState = state.getOption.getOrElse("new")
state.update("existing")
//value is the whole record
(value, currentState)
}
val checkDuplicates = rowRdd.mapWithState(StateSpec.function(checkIfKeyExists))
// STORE:1
checkDuplicates.saveAsTextFiles("/test/spark/out_all")
// STORE:2
rowRdd.saveAsTextFiles("/test/spark/out_temp")
ssc.start()
ssc.awaitTerminationOrTimeout(1000*60*10)
}}
The issue i am facing is when i run it and feed some files, in one run it picks up a few of them and the next time i clear the input files and clear the checkpoint and then feed it the same files it might pick fewer records or more records than the first time and it is not consistent.
Another finding is that when i COMMENT STORE1 and run i can see the same number of records in output(location2) as in input. this verifies that spark is able to pick up all the newly placed files but when i UNCOMMENT both i get the same behaviour as explained above.
The following would regenerate the issue in my environment
Delete both the output directories(/test/spark/out_duplicates and /test/spark/out_new)
Clear the checkpoint location(./streamingDedupCheckpoint)
Clear the streaming input folder(/test/spark/deduplicate)
Submit the spark job(spark-submit --master yarn-cluster --class "com.datametica.warehouse.deduplication.StreamingDeduplicate" Spark_DW-1.0-SNAPSHOT-jar-with-dependencies.jar)
Once the application state changes to running copy a bunch of new files of a total of few hundred MB to /test/spark/deduplicate(i made sure the copy completes within first two minutes)
Here are some 3 sample records from one of the the input data files
705,2320,Ron,64
111,2326,Sienna,29
918,3678,Morton,45

Related

What is the difference between saveAsObjectFile and persist in apache spark?

I am trying to compare java and kryo serialization and on saving the rdd on disk, it is giving the same size when saveAsObjectFile is used, but on persist it shows different in spark ui. Kryo one is smaller then java, but the irony is processing time of java is lesser then that of kryo which was not expected from spark UI?
val conf = new SparkConf()
.setAppName("kyroExample")
.setMaster("local[*]")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.registerKryoClasses(
Array(classOf[Person],classOf[Array[Person]])
)
val sparkContext = new SparkContext(conf)
val personList: Array[Person] = (1 to 100000).map(value => Person(value + "", value)).toArray
val rddPerson: RDD[Person] = sparkContext.parallelize(personList)
val evenAgePerson: RDD[Person] = rddPerson.filter(_.age % 2 == 0)
case class Person(name: String, age: Int)
evenAgePerson.saveAsObjectFile("src/main/resources/objectFile")
evenAgePerson.persist(StorageLevel.MEMORY_ONLY_SER)
evenAgePerson.count()
persist and saveAsObjectFile solve different needs.
persist has a misleading name. It is not supposed to be used to permanently persist the rdd result. Persist is used to temporarily persist a computation result of the rdd during a spark workfklow. The user has no control, over the location of the persisted dataframe. Persist is just caching with different caching strategy - memory, disk or both. In fact cache will just call persist with a default caching strategy.
e.g
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
// Runs again on the full dataframe of all the lines , repeats the above operation
errors.filter(col("line").like("%MySQL%")).count()
vs
val errors = df.filter(col("line").like("%ERROR%"))
errors.persist()
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
// Runs only on the errors tmp result containing only the filtered error lines
errors.filter(col("line").like("%MySQL%")).count()
saveAsObjectFile is for permanent persistance. It is used for serializing the final result of a spark job onto a persistent and usually distributed filessystem like hdfs or amazon s3
Spark persist and saveAsObjectFile is not the same at all.
persist - persist your RDD DAG to the requested StorageLevel, thats mean from now and on any transformation apply on this RDD will be calculated only from the persisted DAG.
saveAsObjectFile - just save the RDD into a SequenceFile of serialized object.
saveAsObjectFile doesn`t use "spark.serializer" configuration at all.
as you can see the below code:
/**
* Save this RDD as a SequenceFile of serialized objects.
*/
def saveAsObjectFile(path: String): Unit = withScope {
this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
.map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
.saveAsSequenceFile(path)
}
saveAsObjectFile uses Utils.serialize to serialize your object, when serialize method defenition is:
/** Serialize an object using Java serialization */
def serialize[T](o: T): Array[Byte] = {
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(o)
oos.close()
bos.toByteArray
}
saveAsObjectFile use the Java Serialization always.
On the other hand, persist will use the configured spark.serializer you configure him.

Why does Spark Streaming execute foreachRDD even when no new data is available?

I have the following Spark streaming example:
val conf = new SparkConf().setAppName("Name").setMaster("local")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(2))
val directoryStream = ssc.textFileStream("""file:///C:/Users/something/something""")
directoryStream.foreachRDD(file => {
println(file.count())
})
ssc.start()
ssc.awaitTermination()
Even when the folder is empty it keeps printing 0 every 2 seconds like if an empty file was in the folder. I would like it to only go into the foreachRDD when a new file is present in the folder. Is there something I'm doing wrong?
I am using Spark 1.6 and Scala 2.10.7.
As your batch duration is 2 seconds the job will trigger for each and every 2 seconds, basically the triggering point is not the data availability it is batch duration, if the data present at the time the DStream contains the data otherwise it will be empty ( use below code to check for the same)
dstream.foreachRDD{ rdd => if (!rdd.isEmpty) {// do something } }

Spark Multiple Output Paths result in Multiple Input Reads

First, apologies for the title, I wasn't sure how to eloquently describe this succinctly.
I have a spark job that parses logs into JSON, and then using spark-sql converts specific columns into ORC and writes to various paths. For example:
val logs = sc.textFile("s3://raw/logs")
val jsonRows = logs.mapPartitions(partition => {
partition.map(log => {
logToJson.parse(log)
}
}
jsonRows.foreach(r => {
val contentPath = "s3://content/events/"
val userPath = "s3://users/events/"
val contentDf = sqlSession.read.schema(contentSchema).json(r)
val userDf = sqlSession.read.schema(userSchema).json(r)
val userDfFiltered = userDf.select("*").where(userDf("type").isin("users")
// Save Data
val contentWriter = contentDf.write.mode("append").format("orc")
eventWriter.save(contentPath)
val userWriter = userDf.write.mode("append").format("orc")
userWriter.save(userPath)
When I wrote this I expected that the parsing would occur one time, and then it would write to the respective locations afterward. However, it seems that it is executing all of the code in the file twice - once for content and once for users. Is this expected? I would prefer that I don't end up transferring the data from S3 and parsing twice, as that is the largest bottleneck. I am attaching an image from the Spark UI to show the duplication of tasks for a single Streaming Window. Thanks for any help you can provide!
Okay, this kind of nested DFs is a no go. DataFrames are meant to be a data structure for big datasets that won't fit into normal data structures (like Seq or List) and that needs to be processed in a distributed way. It is not just another kind of array. What you are attempting to do here is to create a DataFrame per log line, which makes little sense.
As far as I can tell from the (incomplete) code you have posted here, you want to create two new DataFrames from your original input (the logs) which you then want to store in two different locations. Something like this:
val logs = sc.textFile("s3://raw/logs")
val contentPath = "s3://content/events/"
val userPath = "s3://users/events/"
val jsonRows = logs
.mapPartitions(partition => {
partition.map(log => logToJson.parse(log))
}
.toDF()
.cache() // Or use persist() if dataset is larger than will fit in memory
jsonRows
.write
.format("orc")
.save(contentPath)
jsonRows
.filter(col("type").isin("users"))
.write
.format("orc")
.save(userPath)
Hope this helps.

Issue in inserting data to Hive Table using Spark and Scala

I am new to Spark. Here is something I wanna do.
I have created two data streams; first one reads data from text file and register it as a temptable using hivecontext. The other one continuously gets RDDs from Kafka and for each RDD, it it creates data streams and register the contents as temptable. Finally I join these two temp tables on a key to get final result set. I want to insert that result set in a hive table. But I am out of ideas. Tried to follow some exmples but that only create a table with one column in hive and that too not readable. Could you please show me how to insert results in a particular database and table of hive. Please note that I can see the results of join using show function so the real challenge lies with insertion in hive table.
Below is the code I am using.
imports.....
object MSCCDRFilter {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Flume, Kafka and Spark MSC CDRs Manipulation")
val sc = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val cgiDF = sc.textFile("file:///tmp/omer-learning/spark/dim_cells.txt").map(_.split(",")).map(p => CGIList(p(0).trim, p(1).trim, p(2).trim,p(3).trim)).toDF()
cgiDF.registerTempTable("my_cgi_list")
val CGITable=sqlContext.sql("select *"+
" from my_cgi_list")
CGITable.show() // this CGITable is a structure I defined in the project
val streamingContext = new StreamingContext(sc, Seconds(10)
val zkQuorum="hadoopserver:2181"
val topics=Map[String, Int]("FlumeToKafka"->1)
val messages: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(streamingContext,zkQuorum,"myGroup",topics)
val logLinesDStream = messages.map(_._2) //获取数据
logLinesDStream.print()
val MSCCDRDStream = logLinesDStream.map(MSC_KPI.parseLogLine) // change MSC_KPI to MCSCDR_GO if you wanna change the class
// MSCCDR_GO and MSC_KPI are structures defined in the project
MSCCDRDStream.foreachRDD(MSCCDR => {
println("+++++++++++++++++++++NEW RDD ="+ MSCCDR.count())
if (MSCCDR.count() == 0) {
println("==================No logs received in this time interval=================")
} else {
val dataf=sqlContext.createDataFrame(MSCCDR)
dataf.registerTempTable("hive_msc")
cgiDF.registerTempTable("my_cgi_list")
val sqlquery=sqlContext.sql("select a.cdr_type,a.CGI,a.cdr_time, a.mins_int, b.Lat, b.Long,b.SiteID from hive_msc a left join my_cgi_list b"
+" on a.CGI=b.CGI")
sqlquery.show()
sqlContext.sql("SET hive.exec.dynamic.partition = true;")
sqlContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict;")
sqlquery.write.mode("append").partitionBy("CGI").saveAsTable("omeralvi.msc_data")
val FilteredCDR = sqlContext.sql("select p.*, q.* " +
" from MSCCDRFiltered p left join my_cgi_list q " +
"on p.CGI=q.CGI ")
println("======================print result =================")
FilteredCDR.show()
streamingContext.start()
streamingContext.awaitTermination()
}
}
I have had some success writing to Hive, using the following:
dataFrame
.coalesce(n)
.write
.format("orc")
.options(Map("path" -> savePath))
.mode(SaveMode.Append)
.saveAsTable(fullTableName)
Our attempts to use partitions weren't followed through with, because I think there was some issue with our desired partitioning column.
The only limitation is with concurrent writes, where the table does not exist yet, then any task tries to create the table (because it didn't exist when it first attempted to write to the table) will Exception out.
Be aware, that writing to Hive in streaming applications is usually bad design, as you will often write many small files, which is very inefficient to read and store. So if you write more often than every hour or so to Hive, you should make sure you include logic for compaction, or add an intermediate storage layer more suited to transactional data.

Spark is duplicating work

I am facing a strange behaviour from Spark. Here's my code:
object MyJob {
def main(args: Array[String]): Unit = {
val sc = new SparkContext()
val sqlContext = new hive.HiveContext(sc)
val query = "<Some Hive Query>"
val rawData = sqlContext.sql(query).cache()
val aggregatedData = rawData.groupBy("group_key")
.agg(
max("col1").as("max"),
min("col2").as("min")
)
val redisConfig = new RedisConfig(new RedisEndpoint(sc.getConf))
aggregatedData.foreachPartition {
rows =>
writePartitionToRedis(rows, redisConfig)
}
aggregatedData.write.parquet(s"/data/output.parquet")
}
}
Against my intuition the spark scheduler yields two jobs for each data sink (Redis, HDFS/Parquet). The problem is the second job is also performing the hive query and doubling the work. I assumed both write operations would share the data from aggregatedData stage. Is something wrong or is it behaviour to be expected?
You've missed a fundamental concept of spark: Lazyness.
An RDD does not contain any data, all it is is a set of instructions that will be executed when you call an action (like writing data to disk/hdfs). If you reuse an RDD (or Dataframe), there's no stored data, just store instructions that will need to be evaluated everytime you call an action.
If you want to reuse data without needing to reevaluate an RDD, use .cache() or preferably persist. Persisting an RDD allows you to store the result of a transformation so that the RDD doesn't need to be reevaluated in future iterations.