Joining large RDDs in Scala Spark - scala

I want to join a large (1 TB) RDD with a medium-sized (10 GB) RDD. An earlier processing job on the large data alone completed in 8 hours. I then added a join with the medium-sized data to get one piece of information that needs to be added to the output (it's a simple join that takes the value of the second column and appends it to the processed output of the large data). With the join, the job has been running for more than a day. How do I optimize it? I looked at existing solutions like Refer, but those are for Spark DataFrames. How do I optimize this for RDDs?
Large dataset
Medium size data
Code that I have:
rdd1.join(rdd2).map(a => a.x)
val result = input
.map(x => {
val row = x.split(",")
(row(0), row(2))
}).filter(x => x._1 != null && !x._1.isEmpty)


Spark Dataframe changing through execution

I'm fairly new to Spark, so I most likely have a huge gap in my understanding. Apologies in advance if what you see here is very silly. What I'm trying to achieve is:
Get a set of rows from a table in Hive (let's call it T_A) and save them in a DataFrame (let's call it DF_A). This is done.
Get extra information from another Hive table (T_B) and join it with DF_A to get a new Dataframe (DF_B). And then cache it. This is also done.
val DF_A = sparkSession.sql("select * from T_A where whatever=something").toDF()
val extraData = sparkSession.sql("select * from T_B where whatever=something").toDF()
val DF_B = DF_A.join(extraData,
col("something_else") === col("other_thing"), "left")
Now, this is me assuming Spark + Hive works similarly to a regular Java app + SQL, which is where I might need a hard course correction.
Here, I attempt to store in one of the Hive Tables I used before (T_B), partitioned by column X, whatever N rows I transformed (Tr1(DF_B)) from DF_B. I use:
val DF_C = => {
After saving it to this table, I want to reuse the information from DF_B (not the transformed data reinserted in T_B, but the joined data in DF_B based on previous state of T_B) to make a second transformation over it (Tr2(DF_B)).
I want to update the same N rows written in T_B with the data transformed by previous step, using an "INSERT OVERWRITE" operation and the same partition column X.
val DF_D = => {
What I expect:
T_B having N rows.
Having DF_B unchanged, with N rows.
What is happening:
DF_B having 3*N rows.
T_B having 3*N rows.
Now, after some debugging, I found that DF_B has 3*N rows after the DF_C write finishes. So DF_D will have 3*N rows too, and that will cause T_B to have 3*N rows as well.
So, my question is... Is there a way to retain the original DF_B data and use it for the second map transformation, since it relies on the original DF_B state for the transformation process? Is there a reference somewhere I can read to know why this happens?
EDIT: I don't know if this is useful information, but I log the count of records before and after doing the first write. And I get the following
val DF_C = => {
}).toDF()"DF_C.count {} - DF_B.count {}"...
DF_C.write.mode(SaveMode.Overwrite).insertInto("T_B")"DF_C.count {} - DF_B.count {}"...
With persist(MEMORY_AND_DISK) or no persist at all (instead of cache), and 3 test rows, I get:
DF_C.count 3 - DF_B.count 3
DF_C.count 3 - DF_B.count 9
With cache, I get:
DF_C.count 3 - DF_B.count 3
DF_C.count 9 - DF_B.count 9
Any idea?
Thank you so much.
In Spark, execution is lazy: nothing runs until an action is called.
So when you call an action twice on the same DataFrame (DF_B in your case), that DataFrame is built and transformed from scratch twice, once per action.
Try persisting DF_B before calling the first action; you can then use the same DataFrame for both Tr1 and Tr2.
Once persisted, the DataFrame is stored in memory/disk and can be reused multiple times.
You can learn more about persistence here
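A minimal sketch of that advice, assuming a local SparkSession and small in-memory stand-ins for T_A and T_B (the column names and the two transformations are invented for illustration). Because persist is itself lazy, the sketch calls count() once so that DF_B is materialized before it is reused:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[2]").appName("persist-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for the rows selected from T_A and T_B.
val dfA = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name")
val extraData = Seq((1, 10), (2, 20)).toDF("id", "score")

// persist() only marks the plan for caching; nothing is computed yet.
val dfB = dfA.join(extraData, Seq("id"), "left").persist(StorageLevel.MEMORY_AND_DISK)

// The first action materializes the cached data.
val rowsInB = dfB.count()

// Both transformations now start from the cached DF_B (illustrative Tr1 and Tr2).
val tr1 = dfB.withColumn("score_doubled", $"score" * 2)
val tr2 = dfB.filter($"score".isNotNull)
```

Keep in mind that a cache is not a snapshot guarantee: partitions that are evicted or lost are recomputed from the source. If the source table is overwritten in between, writing DF_B to a separate location first may be the safer option.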

Why doesn't clustering seem to work in Spark's cogroup function

I have two clustered Hive tables, t1 and t2:
`t1_req_id` string,
PARTITIONED BY (`t1_stats_date` string)
// t2 looks similar with same amount of buckets
The code looks like the following:
val t1 = spark.table("t1").as[T1].rdd.map(v => (v.t1_req_id, v))
val t2 = spark.table("t2").as[T2].rdd.map(v => (v.t2_req_id, v))
val outRdd = t1.cogroup(t2)
.flatMap { coGroupRes =>
val key = coGroupRes._1
val value: (Iterable[T1], Iterable[T2])= coGroupRes._2
val t3List = // create a list with some logic on Iterable[T1] and Iterable[T2]
I made sure that both t1 and t2 have the same number of partitions, and spark-submit sets the
spark.sql.sources.bucketing.enabled=true and spark.sessionState.conf.bucketingEnabled=true flags.
But the Spark DAG doesn't show any effect of the clustering. It seems there is still a full data shuffle.
What am I missing? Are there other configurations or tunings? How can I make sure there is no full data shuffle?
My spark version is 2.3.1
And it shouldn't show.
Any logical optimizations are limited to the DataFrame API. Once you push data to the black-box functional Dataset API (see Spark 2.0 Dataset vs DataFrame) and later to the RDD API, no more information is pushed back to the optimizer.
You could partially utilize bucketing by performing the join first, with something along these lines:
.join(spark.table("t2"), $"t1.t1_req_id" === $"t2.t2_req_id", "outer")
.groupBy($"t1.v.t1_req_id", $"t2.t2_req_id")
.agg(...) // For example collect_set($"t1.v"), collect_set($"t2.v")
However, unlike cogroup, this will generate full Cartesian products within groups, and might not be applicable in your case.
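The join-then-aggregate shape can be sketched as follows, staying entirely in the DataFrame API. This is only an illustration under stated assumptions: the two tables are small in-memory DataFrames (so no real bucketing is involved), and the value columns t1_val and t2_val are invented names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, collect_set}

val spark = SparkSession.builder().master("local[2]").appName("join-agg-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for the bucketed tables t1 and t2.
val t1 = Seq(("r1", "a"), ("r1", "b"), ("r2", "c")).toDF("t1_req_id", "t1_val")
val t2 = Seq(("r1", "x"), ("r3", "y")).toDF("t2_req_id", "t2_val")

// Full outer join on the request id, then regroup by whichever side has the key.
val grouped = t1
  .join(t2, $"t1_req_id" === $"t2_req_id", "outer")
  .groupBy(coalesce($"t1_req_id", $"t2_req_id").as("req_id"))
  .agg(
    collect_set($"t1_val").as("t1_vals"), // nulls from the outer join are skipped
    collect_set($"t2_val").as("t2_vals")
  )
```

Within each key the join still pairs every t1 row with every t2 row before the aggregation collapses them again, which is the Cartesian cost mentioned above. collect_set also removes duplicate values, so collect_list may be closer to cogroup when duplicates matter.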

How to get the row top 1 in Spark Structured Streaming?

I have an issue with Spark Streaming (Spark 2.2.1). I am developing a real-time pipeline where I first get data from Kafka, then join the result with another table, then send the DataFrame to an ALS model (Spark ML), which returns a streaming DataFrame with one additional column, prediction. The problem is when I tried to get the row with the highest score, I couldn't find a way to resolve it.
I tried:
Apply SQL functions like Limit, Take, sort
dense_rank() function
search in StackOverflow
I read Unsupported Operations, but there doesn't seem to be much there.
Additionally, I would like to send the row with the highest score to a Kafka queue.
My code is as follows:
val result = lines.selectExpr("CAST(value AS STRING)")
.select(from_json($"value", mySchema).as("data"))
.selectExpr("cast(data.largo as int) as largo","cast(data.stock as int) as stock","data.verificavalormax","data.codbc","data.ide","data.timestamp_cli","data.tef_cli","data.nombre","data.descripcion","data.porcentaje","data.fechainicio","data.fechafin","data.descripcioncompleta","data.direccion","data.coordenadax","data.coordenaday","data.razon_social","data.segmento_app","data.categoria","data.subcategoria")
val model = ALSModel.load("ALSParaTiDos")
val fullPredictions = model.transform(result)
//fullPredictions is a streaming dataframe with a extra column "prediction", here i need the code to get the first row
val query = fullPredictions.writeStream.format("console").outputMode(OutputMode.Append()).option("truncate", "false").start()
Maybe I was not clear, so I'm attaching an image of my problem. I also wrote a simpler piece of code to complement it:
detailing what I need. I would appreciate any help :)
TL;DR Use stream-stream inner joins (Spark 2.3.0) or use a memory sink (or a Hive table) for temporary storage.
I think that the following sentence describes your case very well:
The problem is when I tried to get the row with the highest score, I couldn't find a way to resolve it.
Machine learning aside (it just gives you a streaming Dataset with predictions), the real problem here is finding the maximum value of a column in a streaming Dataset.
The first step is to calculate the max value as follows (copied directly from your code):
streaming.groupBy("idCustomer").agg(max("score") as "maxscore")
With that, you have two streaming Datasets that you can join as of Spark 2.3.0 (which was released a few days ago):
In Spark 2.3, we have added support for stream-stream joins, that is, you can join two streaming Datasets/DataFrames.
Inner joins on any kind of columns along with any kind of join conditions are supported.
Inner join the streaming Datasets and you're done.
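The memory-sink alternative from the TL;DR can be sketched as follows. This is a rough illustration, not the pipeline from the question: it assumes Spark 2.x or 3.x, uses the internal MemoryStream test utility in place of the Kafka source, and the query name max_scores is made up. The idea is that the streaming aggregation writes its full result to an in-memory table, which can then be queried like any static table.

```scala
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder().master("local[2]").appName("memory-sink-sketch").getOrCreate()
import spark.implicits._
implicit val sqlContext: SQLContext = spark.sqlContext

// Stand-in for the Kafka source: an in-memory stream of (idCustomer, score) pairs.
val input = MemoryStream[(String, Double)]
val streaming = input.toDF().toDF("idCustomer", "score")

// Streaming aggregation written to a memory sink in complete mode.
val query = streaming
  .groupBy("idCustomer")
  .agg(max("score").as("maxscore"))
  .writeStream
  .format("memory")
  .queryName("max_scores") // hypothetical name of the in-memory table
  .outputMode("complete")
  .start()

input.addData(("c1", 0.2), ("c1", 0.9), ("c2", 0.5))
query.processAllAvailable() // block until the data added so far has been processed

// The sink is now an ordinary table that batch queries can read.
val maxScores = spark.table("max_scores").as[(String, Double)].collect().toMap
query.stop()
```

The memory sink keeps the whole result on the driver, so it only suits small aggregates such as one maximum per customer.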
Try this:
Implement a function that extracts the max value of the column and then filters your DataFrame by that max:
def getDataFrameMaxRow(df:DataFrame , col:String):DataFrame = {
// get the maximum value
val list_prediction =
.map { x => gson.fromJson[JsonObject](x, classOf[JsonObject])}
.map { x => x.get(col).getAsString.toInt}
val max = getMaxFromList(list_prediction)
// filter dataframe by the maximum value
val df_filtered = df.filter(df(col) === max.toString())
return df_filtered
def getMaxFromList(xs: List[Int]): Int = xs match {
case List(x: Int) => x
case x :: y :: rest => getMaxFromList( (if (x > y) x else y) :: rest )
And in the body of your code add:
import org.apache.spark.sql.DataFrame
val fullPredictions = model.transform(result)
val df_with_row_max = getDataFrameMaxRow(fullPredictions, "prediction")
Good Luck !!

Split data frame into smaller ones and push a big dataframe to all executors?

I'm implementing the following logic using Spark.
Get the result of a table with 50K rows.
Get another table (about 30K rows).
For every combination of rows from (1) and (2), do some work and get a value.
How about pushing the data frame of (2) to all executors, partitioning (1), and running each portion on a separate executor? How can I implement that?
val getTable(t String) ="jdbc").options(Map(
"driver" -> "",
"url" -> jdbcSqlConn,
"dbtable" -> s"$t"
.select("col1", "col2", "col3")
val table1 = getTable("table1")
val table2 = getTable("table2")
// Split the rows in table1 and make N, say 32, data frames
val partitionedTable1 : List[Dataset[Row]] = splitToSmallerDFs(table1, 32) // How to implement it?
val result = => {
val value = doWork(x, table2) // Is it good to send table2 to executors like this?
How to break a big data frame into small data frames? (repartition?)
Is it good to send table2 (pass a big data frame as a parameter) to executors like this?
How to break a big data frame into small data frames? (repartition?)
The simple answer is yes, repartition can be a solution.
The harder question is whether repartitioning a dataframe into smaller partitions would improve the overall operation.
Dataframes are already distributed by nature. That means the operations you perform on dataframes (join, groupBy, aggregations, functions and many more) are all executed where the data resides. But for operations that need a shuffle, such as join, groupBy and aggregations, repartitioning beforehand would be void, because:
groupBy operation would shuffle dataframe such that distinct groups would be in the same executor.
partitionBy in Window function performs the same way as groupBy
join operation would shuffle data in the same manner.
Is it good to send table2 (pass a big data frame as a parameter) to executors like this?
It's not good to pass the dataframe as you did. Because you are using the dataframe inside a transformation, table2 would not be usable on the executors.
I would suggest using a broadcast variable.
You can do it as below:
val table2 = sparkContext.broadcast(getTable("table2").collect()) // broadcast the collected rows, not the DataFrame handle
val result = => {
val value = doWork(x, table2.value)
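The broadcast-variable suggestion can be sketched end to end as follows. A DataFrame handle cannot be used inside executor-side code, so the sketch collects the smaller table to the driver and broadcasts the resulting rows. The data, the column layout and the body of doWork are invented for illustration; the object and method names are hypothetical too.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastSketch {
  // Returns (id from table1, id from table2, computed value) for every pair of rows.
  def allCombinations(spark: SparkSession): Set[(Int, Int, Int)] = {
    import spark.implicits._

    // Hypothetical stand-ins for table1 (about 50K rows) and table2 (about 30K rows).
    val table1 = Seq((1, 10), (2, 20), (3, 30)).toDS()
    val table2 = Seq((100, 1), (200, 2)).toDS()

    // Collect the smaller table on the driver and ship one copy to each executor.
    val table2Rows = spark.sparkContext.broadcast(table2.collect())

    // Placeholder for the real per-pair computation.
    val doWork: ((Int, Int), (Int, Int)) => Int = (a, b) => a._2 * b._2

    table1
      .repartition(4) // spread table1 over 4 partitions; tune to the cluster
      .flatMap(a => table2Rows.value.iterator.map(b => (a._1, b._1, doWork(a, b))))
      .collect()
      .toSet
  }
}

val spark = SparkSession.builder().master("local[2]").appName("broadcast-sketch").getOrCreate()
val combos = BroadcastSketch.allCombinations(spark)
```

There is no need to split table1 into separate data frames by hand: each partition is already processed as its own task, and repartition only controls how many of them there are.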

Best way to gain performance when doing a join count using spark and scala

I have a requirement to validate an ingest operation. Basically, I have two big files in HDFS: one is Avro formatted (the ingested files) and the other is Parquet formatted (the consolidated file).
Avro file has this schema:
filename, date, count, afield1,afield2,afield3,afield4,afield5,afield6,...afieldN
Parquet file has this schema:
If I load both files into DataFrames and then use a naive join-where, the job on my local machine takes more than 24 hours, which is unacceptable.
ingestedDF.join(consolidatedDF).where($"filename" === $"fileName").count()
Which is the best way to achieve this? Dropping columns from the DataFrames before doing the join-where-count? Calculating the counts per DataFrame and then joining and summing?
I was reading about the map-side join technique, but it looks like it would only work for me if one of the files were small enough to fit in RAM, and I can't guarantee that. So I would like to know the community's preferred way to achieve this.
I would approach this problem by stripping down the data to only the field I'm interested in (filename), making a unique set of the filename with the source it comes from (the origin dataset).
At this point, both intermediate datasets have the same schema, so we can union them and just count. This should be orders of magnitude faster than using a join on the complete data.
// prepare some random dataset
val data1 = (1 to 100000).filter(_ => scala.util.Random.nextDouble<0.8).map(i => (s"file$i", i, "rubbish"))
val data2 = (1 to 100000).filter(_ => scala.util.Random.nextDouble<0.7).map(i => (s"file$i", i, "crap"))
val df1 = sparkSession.createDataFrame(data1).toDF("filename", "index", "data")
val df2 = sparkSession.createDataFrame(data2).toDF("filename", "index", "data")
// select only the column we are interested in and tag it with the source.
// Lets make it distinct as we are only interested in the unique file count
val df1Filenames = df1.select("filename").withColumn("df", lit("df1")).distinct
val df2Filenames = df2.select("filename").withColumn("df", lit("df2")).distinct
// union both dataframes
val union = df1Filenames.union(df2Filenames).toDF("filename","source")
// let's count the occurrences of filename, by using a groupby operation
val occurrenceCount = union.groupBy("filename").count
// we're interested in the count of those files that appear in both datasets (with a count of 2)