Spark streaming with broadcast joins - scala

I have a Spark Streaming use case where I plan to keep a dataset broadcast and cached on each executor. Every micro-batch creates a DataFrame out of the incoming RDD and joins it with the broadcast dataset. My test code below performs the broadcast operation for each batch. Is there a way to broadcast it just once?
val testDF = sqlContext.read.format("com.databricks.spark.csv")
  .schema(schema).load("file:///shared/data/test-data.txt")

val lines = ssc.socketTextStream("DevNode", 9999)

lines.foreachRDD((rdd, timestamp) => {
  val recordDF = rdd.map(_.split(",")).map(l => Record(l(0).toInt, l(1))).toDF()
  val resultDF = recordDF.join(broadcast(testDF), "Age")
  resultDF.write.format("com.databricks.spark.csv").save("file:///shared/data/output/streaming/" + timestamp)
})
For every batch, this file was read and the broadcast was performed:
16/02/18 12:24:02 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:27+28
16/02/18 12:24:02 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:0+27
16/02/18 12:25:00 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:27+28
16/02/18 12:25:00 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:0+27
Any suggestions on how to broadcast the dataset only once?

It looks like, for now, broadcast tables are not reused across batches. See: SPARK-3863
Perform the broadcast outside the foreachRDD loop:
val testDF = broadcast(sqlContext.read.format("com.databricks.spark.csv")
  .schema(schema).load(...))

lines.foreachRDD((rdd, timestamp) => {
  val recordDF = ???
  val resultDF = recordDF.join(testDF, "Age")
  resultDF.write.format("com.databricks.spark.csv").save(...)
})
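If the repeated file read shown in the logs is also a concern, a minimal sketch (my own addition, not part of the answer above) is to cache the lookup DataFrame so it is materialized once; the broadcast exchange itself may still be repeated per batch, but the CSV is no longer re-read:
// Hedged sketch, reusing the names from the question: cache the lookup DataFrame once,
// then join each micro-batch against the broadcast-hinted, cached data.
val testDF = sqlContext.read.format("com.databricks.spark.csv")
  .schema(schema)
  .load("file:///shared/data/test-data.txt")
  .cache()   // materialize once; later batches read the cached blocks instead of the file

lines.foreachRDD((rdd, timestamp) => {
  val recordDF = rdd.map(_.split(",")).map(l => Record(l(0).toInt, l(1))).toDF()
  recordDF.join(broadcast(testDF), "Age")
    .write.format("com.databricks.spark.csv")
    .save("file:///shared/data/output/streaming/" + timestamp)
})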

Related

Improve performance of slow running spark streaming process that uses micro-batching

I am trying to create an application to process 10 million JSON files, where the size of a JSON file can vary from 1 MB to 50 MB.
To avoid overburdening the driver, I am using the Structured Streaming API to process 100,000 JSON files at a time rather than loading all the source files at once.
mySchema
val mySchema: StructType =
  StructType(Array(
    StructField("ID", StringType, true),
    StructField("StartTime", DoubleType, true),
    StructField("Data", ArrayType(
      StructType(Array(
        StructField("field1", DoubleType, true),
        StructField("field2", LongType, true),
        StructField("field3", LongType, true),
        StructField("field4", DoubleType, true),
        StructField("field5", DoubleType, true),
        StructField("field6", DoubleType, true),
        StructField("field7", LongType, true),
        StructField("field8", LongType, true)
      )), true), true)
  ))
Create a streaming DataFrame by picking up 100,000 files at a time:
val readDF = spark.readStream
  .format("json")
  .option("maxFilesPerTrigger", 100000)
  .option("pathGlobFilter", "*.json")
  .option("recursiveFileLookup", "true")
  .schema(mySchema)
  .load("/mnt/source/2020/*")
Use writeStream to start the streaming computation:
val sensorFileWriter = readDF
  .writeStream
  .queryName("myStream")
  .format("delta")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .outputMode("append")
  .option("checkpointLocation", "/mnt/dir/checkpoint")
  .foreachBatch(
    (batchDF: DataFrame, batchId: Long) => {
      batchDF.persist()
      val parseDF = batchDF
        .withColumn("New_Set", expr("transform(Data, x -> (x.field1 as f1, x.field2 as f2, x.field3 as field3))"))
        .withColumn("Data_New", addCol(uuid(), to_json($"New_Set")))
        .withColumn("data_size", size(col("Data")))
        .withColumn("requestid", uuid())
        .withColumn("start_epoch_double", bround($"StartTime").cast("long"))
        .withColumn("Start_date", from_unixtime($"start_epoch_double", "YYYYMMdd"))
        .withColumn("request", concat(lit("start"), col("Data_New"), lit("end")))
        .persist()
      val requestDF = parseDF
        .select($"Start_date", $"request")
      requestDF.write
        .partitionBy("Start_date")
        .mode("append")
        .parquet("/mnt/Target/request")
    }
  )
In the above "addCol" is a user defined function that adds new StructField to Array of StructFields
val addCol = udf((id: String, json: String) => {
  import org.json4s._
  import org.json4s.jackson.JsonMethods._
  import org.json4s.JsonDSL._
  implicit val formats = DefaultFormats
  // parse the JSON array, prepend a "requestid" entry to every element, and re-serialize
  compact(parse(json).extract[List[Map[String, String]]].map(m => Map("requestid" -> id) ++ m))
})
"uuid" is another udf that generates a unique id
val uuid = udf(() => java.util.UUID.randomUUID().toString)
Databricks cluster config:
Apache Spark 2.4.5
70 Workers: 3920.0 GB Memory, 1120 Cores (i.e. 56.0 GB Memory and 16 Cores per Worker)
1 Driver: 128.0 GB Memory, 32 Cores
Writing each batch of 100,000 files takes more than an hour, and the entire process takes days to finish the 10 million JSON files.
How can I make this streaming process run faster?
Should I be setting "spark.sql.shuffle.partitions"? If so, what is a good value for this property?
In most cases, you want a 1-to-1 mapping of partitions to cores for streaming applications (per the Azure Databricks Structured Streaming guidance).
To optimize mapping of your partitions to cores, try this:
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)
This will give you a 1-to-1 mapping of your partitions to cores.
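As a quick sanity check (a hedged sketch, assuming a Databricks or spark-shell session where spark and sc are already in scope), you can compare the default shuffle partition count against the cluster's parallelism before and after setting the property:
// Hedged sketch: inspect the values this answer is tuning.
println(sc.defaultParallelism)                            // roughly the total cores available to the application
println(spark.conf.get("spark.sql.shuffle.partitions"))   // 200 by default unless overridden

spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)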

Why can't a dataframe be accessed inside a UDF? [Apache Spark Scala] [duplicate]

This question was closed as a duplicate of: Why does accessing a DataFrame from a UDF result in a NullPointerException?
I am currently working on a streaming project using Apache Spark. I have two data sources: the first is news data from Kafka, which is updated continuously; the second is a masterWord dictionary, a DataFrame containing words and their unique keys.
I want to process the news data and convert each Seq of words into a Seq of word IDs by matching against the masterWord dictionary. However, I run into problems when accessing the masterWord DataFrame in my UDF. When I try to access the dataframe inside the UDF, Spark returns this error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 i
n stage 4.0 (TID 4, localhost, executor driver): java.lang.NullPointerException
Why can't the dataframe be accessed inside the UDF?
What is the best practice for getting values from another dataframe?
This is my code:
// read data stream from Kafka
val kafka = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", PropertiesLoader.kafkaBrokerUrl)
  .option("subscribe", PropertiesLoader.kafkaTopic)
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", "100")
  .load()

// Transform data stream to Dataframe
val kafkaDF = kafka.selectExpr("CAST(value AS STRING)").as[String]
  .select(from_json($"value", ColsArtifact.rawSchema).as("data"))
  .select("data.*")
  .withColumn("raw_text", concat(col("title"), lit(" "), col("text"))) // add column aggregating title and text

// read master word dictionary
val readConfig = ReadConfig(Map("uri" -> "mongodb://10.252.37.112/prayuga", "database" -> "prayuga", "collection" -> "master_word_2"))
var masterWord = MongoSpark.load(spark, readConfig)

// call UDF
val aggregateDF = kafkaDF.withColumn("text_aggregate", aggregateMongo(col("text_selected")))

// UDF
val aggregateMongo = udf((content: Seq[String]) => {
  masterWord.show()
  ...
  // code that queries masterWord to check whether each element of content exists in the dictionary
})
The dataframe lives in the Spark context, and as such it is only available on the driver.
Each task sees only its fraction (partition) of the data and can work with that. If you want to make the data in the dataframe available inside a UDF, you have to collect it on the driver and then broadcast it (or pass it as a parameter, which does essentially the same thing) to the UDF, in which case Spark will send the whole thing to each instance of the UDF that is running.
If you want to use DataFrames inside UDFs, you must create a Broadcast:
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
import spark.implicits._

val df_name = Seq("Raphael").toDF("name")
val bc_df_name: Broadcast[DataFrame] = spark.sparkContext.broadcast(df_name)
// use df_name inside udf
val udf_doSomething = udf(() => bc_df_name.value.as[String].first())
Seq(1,2,3)
.toDF("i")
.withColumn("test",udf_doSomething())
.show()
gives
+---+-------+
| i| test|
+---+-------+
| 1|Raphael|
| 2|Raphael|
| 3|Raphael|
+---+-------+
This at least works in local mode; I am not sure whether it also works on clusters. Anyway, I would not recommend this approach. Better to convert (collect) the content of the dataframe into a Scala data structure on the driver (e.g. a Map), broadcast that variable, or use a join instead.
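A minimal sketch of the recommended alternative, assuming the dictionary fits in driver memory; the column names "word" and "word_id" and the ID type are assumptions for illustration, not from the question:
// Hedged sketch: collect the dictionary to a plain Scala Map on the driver,
// broadcast it, and look words up inside the UDF.
val wordToId: Map[String, Long] = masterWord
  .select("word", "word_id")          // assumed column names
  .collect()
  .map(r => r.getString(0) -> r.getLong(1))
  .toMap

val bcWordToId = spark.sparkContext.broadcast(wordToId)

val aggregateMongo = udf((content: Seq[String]) =>
  content.flatMap(bcWordToId.value.get)   // keep only words present in the dictionary
)

val aggregateDF = kafkaDF.withColumn("text_aggregate", aggregateMongo(col("text_selected")))
Alternatively, a join between kafkaDF and masterWord (optionally with a broadcast hint) avoids the UDF entirely, which matches the answer's suggestion to use a join instead.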

How can I parallelize different SparkSQL execution efficiently?

Environment
Scala
Apache Spark: Spark 2.2.1
EMR on AWS: emr-5.12.1
Content
I have one large DataFrame, like below:
val df = spark.read.option("basePath", "s3://some_bucket/").json("s3://some_bucket/group_id=*/")
There are ~1 TB of JSON files at s3://some_bucket, and they include 5000 partitions of group_id.
I want to execute a conversion using SparkSQL, and it differs for each group_id.
The Spark code looks like this:
// Create view
val df = spark.read.option("basePath", "s3://data_lake/").json("s3://data_lake/group_id=*/")
df.createOrReplaceTempView("lakeView")
// one of queries like this:
// SELECT
// col1 as userId,
// col2 as userName,
// .....
// FROM
// lakeView
// WHERE
// group_id = xxx;
val queries: Seq[String] = getGroupIdMapping
// ** Want to know better ways **
queries.par.foreach(query => {
val convertedDF: DataFrame = spark.sql(query)
convertedDF.write.save("s3://another_bucket/")
})
par parallelizes up to Runtime.getRuntime.availableProcessors threads, which equals the number of driver cores.
But this seems awkward and not efficient enough, because it has nothing to do with Spark's parallelization.
I really want to do something like groupBy in scala.collection.Seq.
This is not valid Spark code:
df.groupBy(groupId).foreach((groupId, parDF) => {
parDF.createOrReplaceTempView("lakeView")
val convertedDF: DataFrame = spark.sql(queryByGroupId)
convertedDF.write.save("s3://another_bucket")
})
1) First of all, if your data is already stored in files per group id, there is no reason to mix it all together and then group by id using Spark.
It's much simpler and more efficient to load only the relevant files for each group id.
2) Spark itself parallelizes the computation, so in most cases there is no need for external parallelization.
But if you feel that Spark doesn't utilize all resources, you can:
a) if each individual computation takes less than a few seconds, then task scheduling overhead is comparable to task execution time, so it's possible to get a boost by running a few tasks in parallel;
b) if the computation takes a significant amount of time but resources are still underutilized, then most probably you should increase the number of partitions of your dataset.
3) If you finally decide to run several tasks in parallel, it can be achieved this way (a sketch applying it to the question's queries follows the template):
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

val parallelism = 10
val executor = Executors.newFixedThreadPool(parallelism)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(executor)

val tasks: Seq[String] = ???
val results: Seq[Future[Int]] = tasks.map(query => {
  Future {
    // spark stuff here
    0
  }
})
val allDone: Future[Seq[Int]] = Future.sequence(results)
// wait for results
Await.result(allDone, Duration.Inf)
executor.shutdown() // otherwise the JVM will probably not exit
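As a follow-up sketch (my own addition, not from the original answer), the question's per-group queries could be plugged into that pattern roughly like this, reusing the ExecutionContext set up above; getGroupIdMapping and the output path come from the question:
// Hedged sketch: each Future runs one per-group_id SQL conversion on the driver,
// while Spark still parallelizes the work of each query across the cluster.
val queries: Seq[String] = getGroupIdMapping

val futures: Seq[Future[Unit]] = queries.map { query =>
  Future {
    val convertedDF: DataFrame = spark.sql(query)
    convertedDF.write.save("s3://another_bucket/")
  }
}

Await.result(Future.sequence(futures), Duration.Inf)
executor.shutdown()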

How can I save an RDD into HDFS and later read it back?

I have an RDD whose elements are of type (Long, String). For some reason, I want to save the whole RDD to HDFS and later read that RDD back in a Spark program. Is it possible to do that? And if so, how?
It is possible.
An RDD has saveAsObjectFile and saveAsTextFile functions. Tuples are stored as (value1, value2), so you can later parse them.
Reading back can be done with the textFile function from SparkContext, followed by a .map to strip the parentheses and parse the fields.
So:
Version 1:
rdd.saveAsTextFile("hdfs:///test1/")

// later, in another program
val newRdds = sparkContext.textFile("hdfs:///test1/part-*").map { line =>
  // strip the surrounding "(" and ")", then split on the first comma
  // (a simple parse that assumes the string values contain no newlines)
  val Array(l, s) = line.stripPrefix("(").stripSuffix(")").split(",", 2)
  (l.toLong, s)
}
Version 2:
rdd.saveAsObjectFile("hdfs:///test1/")

// later, in another program - you get the tuples back out of the box :)
val newRdds = sparkContext.objectFile[(Long, String)]("hdfs:///test1/part-*")
I would recommend using a DataFrame if your RDD is in tabular format. A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable and each row contains one case.
A DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query,
whereas an RDD is a Resilient Distributed Dataset, more of a black box or core abstraction over data that cannot be optimized in the same way.
However, you can go from a DataFrame to an RDD via the rdd method, and from an RDD to a DataFrame (if the RDD is in tabular format) via the toDF method.
The following is an example of creating and storing a DataFrame in CSV and Parquet format in HDFS:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("Spark-HDFS-Read-Write")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // for toDF

val hdfs = "hdfs:///"
val df = Seq((1, "Name1")).toDF("id", "name")
// Writing file in CSV format
df.write.format("com.databricks.spark.csv").mode("overwrite").save(hdfs + "user/hdfs/employee/details.csv")
// Writing file in PARQUET format
df.write.format("parquet").mode("overwrite").save(hdfs + "user/hdfs/employee/details")
// Reading CSV files from HDFS
val dfIncsv = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load(hdfs + "user/hdfs/employee/details.csv")
// Reading PARQUET files from HDFS
val dfInParquet = sqlContext.read.parquet(hdfs + "user/hdfs/employee/details")
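Tying this back to the question's (Long, String) RDD, a minimal sketch of the round trip through a DataFrame might look like this (the column names and the Parquet path are assumptions):
// Hedged sketch: save an RDD[(Long, String)] via a DataFrame and read it back as an RDD.
val rdd: org.apache.spark.rdd.RDD[(Long, String)] = sc.parallelize(Seq((1L, "a"), (2L, "b")))

// RDD -> DataFrame (needs sqlContext.implicits._ in scope), written as Parquet
rdd.toDF("key", "value").write.mode("overwrite").parquet(hdfs + "user/hdfs/pairs")

// read back and go DataFrame -> RDD[(Long, String)]
val restored: org.apache.spark.rdd.RDD[(Long, String)] =
  sqlContext.read.parquet(hdfs + "user/hdfs/pairs")
    .rdd
    .map(r => (r.getLong(0), r.getString(1)))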

Create a large RDD from RDD's in DStream

Does anyone know a way to create a large RDD from the sequence of RDDs in a DStream for a specific batch interval? For example, in the code below:
For example in the code below:
def createLargeRDD() {
  val sparkConf = new SparkConf().setAppName("Test").setMaster("local[2]")
  val sc = new SparkContext(sparkConf)
  val ssc = new StreamingContext(sc, Seconds(3))
  val DStream = KafkaUtilHelper.RetrieveDStream(ssc)

  DStream.transform { rdd =>
    /* Form an RDD with all of the RDDs that were put into the
       DStream variable above for the 3-second batch interval */
    rdd
  }
}
So RDDs are being added to that DStream variable every 3 seconds. Is there a way I can aggregate all of the RDDs that are in the DStream for that 3-second time period into one large RDD and save that RDD to HBase or some other external source?
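Purely as a hedged sketch of one common approach (my assumption, not from this thread): each batch interval of a DStream already surfaces a single RDD, so foreachRDD provides the combined data for that 3-second window and a place to save it; writing to HBase would go through a connector-specific API, and the HDFS text sink below is only a stand-in:
// Hedged sketch: each micro-batch of the DStream is already one RDD covering the interval.
DStream.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    // write the batch somewhere durable; replace with an HBase connector write as needed
    rdd.saveAsTextFile(s"hdfs:///output/batch-${time.milliseconds}")
  }
}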