Update the date for a DataFrame and join it with live Kafka stream data in Spark (Scala)

I have a Kafka stream source and a map table, which I want to join and then write the result to another Kafka topic. This job runs 24/7 without stopping.
My issue is that the map table I want to join is partitioned by date, and every day I need the newly updated map table for the join.
But when the code runs, it keeps using the same old map table day after day, without ever updating it.
import java.text.SimpleDateFormat

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object joiningDF {

  def newDate: String = {
    val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
    dateFormat.format(System.currentTimeMillis)
  }
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()

    var date = newDate

    val source = spark.readStream.
      format("kafka").
      option("kafka.bootstrap.servers", "....").
      option("subscribe", "....").
      option("startingOffsets", "latest").
      load()

    // MAP TABLE: the date variable is used to pick up the new partition daily
    var map = spark.read.parquet("path/day=" + date)

    val joinDF = source.join(map, Seq("id"), "left")
    val outQ = joinDF.
      writeStream.
      outputMode("append").
      format("kafka").
      option("kafka.bootstrap.servers", "...").
      option("topic", "...").
      option("checkpointLocation", "...").
      trigger(Trigger.ProcessingTime("300 seconds")).
      start()

    outQ.awaitTermination()
  }
}
Is there a way to resolve this, or a workaround?

You could use a FileSystem source in continuous processing mode that is watching a directory into which you atomically move new versions of the file when they are ready. This will give you an updating stream to join with.
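Another workaround, different from the file-source approach above and only a minimal sketch with placeholder paths and topics: write the stream with foreachBatch (Spark 2.4+) and re-read the map table for the current date on every trigger, so each micro-batch joins against the latest partition. The join key and the Kafka value encoding below mirror the question and are assumptions.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

def joinAndWrite(batch: DataFrame, batchId: Long): Unit = {
  // Re-resolve the partition path on every micro-batch so the join
  // always uses today's map table.
  val map = batch.sparkSession.read.parquet("path/day=" + newDate)

  // The Kafka sink expects string/binary "key" and "value" columns.
  batch.join(map, Seq("id"), "left")
    .selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "...")
    .option("topic", "...")
    .save()
}

val outQ = source.writeStream
  .option("checkpointLocation", "...")
  .trigger(Trigger.ProcessingTime("300 seconds"))
  .foreachBatch(joinAndWrite _)
  .start()

outQ.awaitTermination()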

Related

Why is it required to use a new Spark Session after writing a streaming dataframe into an Iceberg table to show new changes?

If you use a Spark session to create an Iceberg table with Spark/Scala in batch mode, and afterwards you run a writeStream process with a MERGE INTO operation, it is not possible to see the new changes through the Spark session used in the batch process.
You need to create a new session to see the actual state of the table. Why does this happen?
The batch write is done like this:
df.writeTo(tableName).createOrReplace()
The streaming write is done using these methods:
override def write(df: DataFrame): StreamingQuery = {
  implicit val spark: SparkSession = df.sparkSession
  df.writeStream
    .format("iceberg")
    .trigger(Trigger.Once())
    .option("fanout-enabled", "true")
    .option("checkpointLocation", checkpointLocation)
    .foreachBatch(mergeDFIntoTable _)
    .outputMode("update")
    .start()
}
private def mergeDFIntoTable(df: DataFrame, batchId: Long): Unit = {
  df.createOrReplaceTempView("source_table")

  val mergeSQL: String =
    s"""
       |MERGE INTO $TableIdentifier t
       |USING source_table s
       |ON s.identifier = t.identifier
       |WHEN MATCHED THEN UPDATE SET *
       |WHEN NOT MATCHED THEN INSERT *
       |""".stripMargin

  logger.info(s"BATCH ID: $batchId | SQL: $mergeSQL")

  // The streaming query uses a cloned Spark session from the original session that created it.
  df.sparkSession.sql(mergeSQL)
  df.unpersist()
}
If I use the first Spark session that was created and read the table after the batch process and again after the streaming process, show() returns the same result (without the added or updated rows):
spark.table(tableName).show()
However, if I create a new Spark session after the streaming process, show() does reflect the changes:
val newSparkSession = spark.newSession()
newSparkSession.table(tableName).show()
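Not an explanation of the session-cloning behaviour, just a hedged sketch of a possible alternative to creating a whole new session: refreshing the cached table metadata on the original session. Whether this is sufficient depends on the Iceberg catalog and its caching configuration, so treat it as an assumption to test rather than a confirmed fix.

// Assumption: invalidating the cached metadata for the table on the original
// session may be enough to make the merged rows visible.
spark.sql(s"REFRESH TABLE $tableName")
// or, depending on the catalog in use:
// spark.catalog.refreshTable(tableName)
spark.table(tableName).show()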

Spark multi-thread write does not work properly in cluster-mode

I have several Hive tables to convert to Parquet format and write to HDFS. Basically, I load each of these tables into a Spark DataFrame and write it back in Parquet format. To parallelize the writing phase even further (a single DataFrame write should already be parallelized, since I am using 12 executors and 5 cores per executor), I've tried to spawn several threads; each thread writes a subset of the tables.
val outputPath = "/user/xyz/testdb/tables"
val db = "testdb"
val partitionField = "nominal_time"
val partition = "20180729"
val numQueues = 6

for (i <- 0 until numQueues) {
  val thread = new Thread {
    override def run(): Unit = {
      val tablesToProcess = queues(i)
      tablesToProcess.foreach(t => {
        val table = t.trim.toUpperCase
        val tempTable = "tmp_" + table
        val destTable = table

        var dataFrame = sqc.read.table(s"$db.$tempTable")

        // write
        dataFrame.write.
          mode("overwrite").
          format("parquet").
          save(s"$outputPath/$destTable/$partitionField=$partition")

        println(s"\n\nWrite completed for table $table\n")
      })
    }
  }
  thread.start()
}
This code works fine in YARN-CLIENT mode, and I can observe a significant reduction in the time required for the process to complete.
The thing I don't understand is that when I launch the same code in YARN-CLUSTER mode, the job completes very fast (too fast, I have to say) but it does not write anything.
Am I missing something fundamental here that causes a multi-threaded Spark program not to work properly in cluster mode?
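One thing worth checking, though it is an assumption on my part since the rest of the job is not shown: in yarn-cluster mode the application can be marked finished as soon as main() returns, even while the spawned writer threads are still running, whereas in yarn-client mode the client JVM keeps running until its non-daemon threads finish. Keeping references to the threads and joining them before main() exits rules that out. A minimal sketch:

// Sketch: start all writer threads, then block until every one of them
// has finished before main() returns.
val threads = (0 until numQueues).map { i =>
  val thread = new Thread {
    override def run(): Unit = {
      queues(i).foreach { t =>
        // ... same per-table read/write logic as in the question ...
      }
    }
  }
  thread.start()
  thread
}

threads.foreach(_.join())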

Issue in inserting data to Hive Table using Spark and Scala

I am new to Spark. Here is what I want to do.
I have created two data streams. The first one reads data from a text file and registers it as a temp table using HiveContext. The other one continuously gets RDDs from Kafka, and for each RDD it creates a DataFrame and registers the contents as a temp table. Finally, I join these two temp tables on a key to get the final result set. I want to insert that result set into a Hive table, but I am out of ideas. I tried to follow some examples, but they only create a table with one column in Hive, and even that is not readable. Could you please show me how to insert the results into a particular Hive database and table? Note that I can see the results of the join using the show function, so the real challenge lies in the insertion into the Hive table.
Below is the code I am using.
imports.....

object MSCCDRFilter {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Flume, Kafka and Spark MSC CDRs Manipulation")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    val cgiDF = sc.textFile("file:///tmp/omer-learning/spark/dim_cells.txt").map(_.split(",")).map(p => CGIList(p(0).trim, p(1).trim, p(2).trim, p(3).trim)).toDF()
    cgiDF.registerTempTable("my_cgi_list")

    val CGITable = sqlContext.sql("select *" +
      " from my_cgi_list")
    CGITable.show() // this CGITable is a structure I defined in the project
    val streamingContext = new StreamingContext(sc, Seconds(10))
    val zkQuorum = "hadoopserver:2181"
    val topics = Map[String, Int]("FlumeToKafka" -> 1)
    val messages: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(streamingContext, zkQuorum, "myGroup", topics)

    val logLinesDStream = messages.map(_._2) // extract the message payload
    logLinesDStream.print()

    val MSCCDRDStream = logLinesDStream.map(MSC_KPI.parseLogLine) // change MSC_KPI to MCSCDR_GO if you wanna change the class
    // MSCCDR_GO and MSC_KPI are structures defined in the project
    MSCCDRDStream.foreachRDD(MSCCDR => {
      println("+++++++++++++++++++++NEW RDD =" + MSCCDR.count())

      if (MSCCDR.count() == 0) {
        println("==================No logs received in this time interval=================")
      } else {
        val dataf = sqlContext.createDataFrame(MSCCDR)
        dataf.registerTempTable("hive_msc")
        cgiDF.registerTempTable("my_cgi_list")

        val sqlquery = sqlContext.sql("select a.cdr_type,a.CGI,a.cdr_time, a.mins_int, b.Lat, b.Long,b.SiteID from hive_msc a left join my_cgi_list b"
          + " on a.CGI=b.CGI")
        sqlquery.show()

        sqlContext.sql("SET hive.exec.dynamic.partition = true;")
        sqlContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict;")
        sqlquery.write.mode("append").partitionBy("CGI").saveAsTable("omeralvi.msc_data")

        val FilteredCDR = sqlContext.sql("select p.*, q.* " +
          " from MSCCDRFiltered p left join my_cgi_list q " +
          "on p.CGI=q.CGI ")
        println("======================print result =================")
        FilteredCDR.show()
      }
    })

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
I have had some success writing to Hive, using the following:
dataFrame
.coalesce(n)
.write
.format("orc")
.options(Map("path" -> savePath))
.mode(SaveMode.Append)
.saveAsTable(fullTableName)
Our attempts to use partitions weren't followed through, because I think there was some issue with our desired partitioning column.
The only limitation is with concurrent writes where the table does not exist yet: any task that tries to create the table (because it didn't exist when it first attempted to write to the table) will throw an exception.
Be aware that writing to Hive from streaming applications is usually bad design, as you will often write many small files, which are very inefficient to read and store. So if you write to Hive more often than every hour or so, make sure you include logic for compaction, or add an intermediate storage layer better suited to transactional data.
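For the compaction point, here is a minimal sketch of what such a periodic batch step could look like; the partition layout, the coalesce factor, and the final directory swap are assumptions and depend on your environment:

import org.apache.spark.sql.SaveMode

// Hypothetical compaction pass: read one partition written by the streaming
// job, coalesce its many small files into a few larger ones, and write the
// result to a staging path. Swapping the staging path in for the original
// partition (rename / ALTER TABLE ... SET LOCATION) is environment-specific
// and left out of this sketch.
val partitionPath = s"$savePath/dt=$day"   // hypothetical partition layout
val compacted = spark.read.format("orc").load(partitionPath).coalesce(4)

compacted.write
  .format("orc")
  .mode(SaveMode.Overwrite)
  .save(s"$partitionPath.compacted")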

Spark is duplicating work

I am facing a strange behaviour from Spark. Here's my code:
object MyJob {
def main(args: Array[String]): Unit = {
val sc = new SparkContext()
val sqlContext = new hive.HiveContext(sc)
val query = "<Some Hive Query>"
val rawData = sqlContext.sql(query).cache()
val aggregatedData = rawData.groupBy("group_key")
.agg(
max("col1").as("max"),
min("col2").as("min")
)
val redisConfig = new RedisConfig(new RedisEndpoint(sc.getConf))
aggregatedData.foreachPartition {
rows =>
writePartitionToRedis(rows, redisConfig)
}
aggregatedData.write.parquet(s"/data/output.parquet")
}
}
Against my intuition, the Spark scheduler yields two jobs, one for each data sink (Redis, HDFS/Parquet). The problem is that the second job also performs the Hive query, doubling the work. I assumed both write operations would share the data from the aggregatedData stage. Is something wrong, or is this the expected behaviour?
You've missed a fundamental concept of Spark: laziness.
An RDD does not contain any data; all it is is a set of instructions that will be executed when you call an action (like writing data to disk/HDFS). If you reuse an RDD (or DataFrame), there is no stored data, just stored instructions that will need to be evaluated every time you call an action.
If you want to reuse data without needing to re-evaluate an RDD, use .cache() or, preferably, .persist(). Persisting an RDD lets you store the result of a transformation so that the RDD doesn't need to be re-evaluated by future actions.
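A minimal sketch of that suggestion applied to the job above: persist the aggregated DataFrame once so both sinks reuse the materialized partitions instead of re-running the Hive query. The StorageLevel choice is an assumption; pick one that fits your memory budget.

import org.apache.spark.storage.StorageLevel

val aggregatedData = rawData.groupBy("group_key")
  .agg(
    max("col1").as("max"),
    min("col2").as("min")
  )
  .persist(StorageLevel.MEMORY_AND_DISK) // materialized on first action, reused afterwards

// The first action populates the cache while writing to Redis...
aggregatedData.foreachPartition { rows =>
  writePartitionToRedis(rows, redisConfig)
}

// ...and the Parquet write then reads from the cached partitions.
aggregatedData.write.parquet("/data/output.parquet")

aggregatedData.unpersist() // free the cached data once both sinks are done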

Inconsistent saveAsTextFiles when using mapWithState in spark streaming

To identify duplicate records I am using mapWithState on my row DStream. The code is supposed to check the state of each key it receives and return the value together with the state. Looking at the returned state I should be able to tell whether I have already processed the key (a duplicate) or whether it is a new key, even across RDDs in a DStream.
Here is the relevant part of my code.
object StreamingDeduplicate {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("StreamingDeduplication")
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Minutes(2))
    ssc.checkpoint("streamingDedupCheckpoint")

    val input = ssc.textFileStream("/test/spark/deduplicate")
    val rowRdd = input.map(line => (line.split(",")(0), line))

    val checkIfKeyExists = (key: String, value: Option[String], state: State[String]) => {
      val currentState = state.getOption.getOrElse("new")
      state.update("existing")
      // value is the whole record
      (value, currentState)
    }

    val checkDuplicates = rowRdd.mapWithState(StateSpec.function(checkIfKeyExists))

    // STORE:1
    checkDuplicates.saveAsTextFiles("/test/spark/out_all")
    // STORE:2
    rowRdd.saveAsTextFiles("/test/spark/out_temp")

    ssc.start()
    ssc.awaitTerminationOrTimeout(1000 * 60 * 10)
  }
}
The issue I am facing: when I run it and feed it some files, in one run it picks up a few of them; the next time, after I clear the input files and the checkpoint and then feed it the same files, it might pick up fewer or more records than the first time. It is not consistent.
Another finding: when I comment out STORE:1 and run, I can see the same number of records in the output (location 2) as in the input. This verifies that Spark is able to pick up all the newly placed files, but when both are uncommented I get the same behaviour as explained above.
The following reproduces the issue in my environment:
1. Delete both output directories (/test/spark/out_duplicates and /test/spark/out_new).
2. Clear the checkpoint location (./streamingDedupCheckpoint).
3. Clear the streaming input folder (/test/spark/deduplicate).
4. Submit the Spark job (spark-submit --master yarn-cluster --class "com.datametica.warehouse.deduplication.StreamingDeduplicate" Spark_DW-1.0-SNAPSHOT-jar-with-dependencies.jar).
5. Once the application state changes to RUNNING, copy a bunch of new files totalling a few hundred MB to /test/spark/deduplicate (I made sure the copy completes within the first two minutes).
Here are 3 sample records from one of the input data files:
705,2320,Ron,64
111,2326,Sienna,29
918,3678,Morton,45