Spark join performance on EMR cluster - Scala

We have a 3-node Spark EMR cluster (m3.xlarge). We are trying to join some big tables of about 4 GB each (250+ columns) with around 15 small reference tables that have 2-3 columns each. We rely on Spark dynamic allocation, which is enabled by default on EMR.
Writing the result to HDFS takes more than an hour (this is because we use coalesce(1) on the final DataFrame).
We also tried broadcast joins, but no luck yet. How can we improve the performance of the above?
What would be an optimal final execution time for the above process?
What are the possible ways to improve performance?
Any help will be appreciated!
Here is my join function:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast
import scala.collection.mutable.MutableList

def multiJoins(MasterTablesDF: DataFrame, tmpReferenceTablesDF_List: MutableList[DataFrame],
               tmpReferenceTableJoinDetailsList: MutableList[Array[String]], DrivingTable: String): DataFrame = {
  // Final output of the driving table joined with every reference table
  var final_df: DataFrame = null
  if (MasterTablesDF != null) {
    if (!MasterTablesDF.head(1).isEmpty && tmpReferenceTablesDF_List.length >= 1) {
      for (i <- 0 until tmpReferenceTablesDF_List.length) {
        val eachReferenceTableDF = tmpReferenceTablesDF_List(i)
        val eachJoinDetails = tmpReferenceTableJoinDetailsList(i)
        if (i == 0) {
          // First reference table: join directly against the driving (master) table
          if (eachJoinDetails(0).equals(eachJoinDetails(1))) {
            // Driving table and ref table share the join column name: use Seq() so the duplicate column is dropped
            println("Driving table and ref table join columns are the same; joining driving table ==> " + DrivingTable + " with ref table ==> " + eachJoinDetails(3))
            final_df = MasterTablesDF.join(broadcast(eachReferenceTableDF), Seq(eachJoinDetails(0)), eachJoinDetails(2)) //.select(ReqCols.head, ReqCols.tail: _*)
          } else {
            // Join column names differ: join on driving-table column === ref-table column
            println("Driving table and ref table join columns are not the same; joining driving table ==> " + DrivingTable + " with ref table ==> " + eachJoinDetails(3))
            final_df = MasterTablesDF.join(broadcast(eachReferenceTableDF), MasterTablesDF(eachJoinDetails(0)) === eachReferenceTableDF(eachJoinDetails(1)), eachJoinDetails(2))
          }
        } else {
          // Remaining reference tables: join onto the accumulated result
          if (eachJoinDetails(0).equals(eachJoinDetails(1))) {
            println("Driving table and ref table join columns are the same; joining driving table ==> " + DrivingTable + " with ref table ==> " + eachJoinDetails(3))
            final_df = final_df.join(broadcast(eachReferenceTableDF), Seq(eachJoinDetails(0)), eachJoinDetails(2)) //.select(ReqCols.head, ReqCols.tail: _*)
            // final_df.unpersist()
          } else {
            println("Driving table and ref table join columns are not the same; joining driving table ==> " + DrivingTable + " with ref table ==> " + eachJoinDetails(3))
            final_df = final_df.join(broadcast(eachReferenceTableDF), MasterTablesDF(eachJoinDetails(0)) === eachReferenceTableDF(eachJoinDetails(1)), eachJoinDetails(2))
          }
        }
      }
    }
  }
  // Writing is too slow:
  // final_df.coalesce(1).write.format("com.databricks.spark.csv")
  //   .option("delimiter", "|").option("header", "true")
  //   .csv(hdfsPath)
  final_df
}
Is this fine? Is the slowness due to the looping?

Probably Spark can't optimize your very long execution plan as well as it could. I had the same situation and we did a series of optimizations:
1) remove all unnecessary columns and filter as early as possible
2) "materialize" some tables before joining; it helps Spark break the lineage and optimize your flow (in our case, two sort-merge joins were replaced by broadcast joins, because Spark realized the dataframes were very small)
3) we partitioned all datasets by the same key and into the same number of partitions (right after reading)
and some other optimizations. It reduced the job time from 45 minutes to 4. You need to look closely at the Spark UI; we found a lot of helpful insights there (for example, only one of our executors was working instead of 10, because all the data had been partitioned into a single partition). Good luck!
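A rough, hedged sketch of those three steps against the question's dataframes follows; the column names ("id", "colA", "colB", "refValue") and the partition count 200 are placeholders, not taken from the question, so treat it as an illustration rather than a drop-in fix:
import org.apache.spark.sql.functions.{broadcast, col}

// 1) keep only the columns that are actually needed and filter early
//    ("id", "colA", "colB", "refValue" are illustrative placeholders)
val masterSlim = MasterTablesDF.select("id", "colA", "colB").filter(col("colA").isNotNull)

// 2) "materialize" a pruned reference table: cache it and force evaluation,
//    then hint the broadcast explicitly since it is known to be small
val refSlim = eachReferenceTableDF.select("id", "refValue").cache()
refSlim.count()

// 3) repartition the big side by the join key, using the same partition count for all inputs
val masterPart = masterSlim.repartition(200, col("id"))

val joined = masterPart.join(broadcast(refSlim), Seq("id"), "left")

// write without coalesce(1); merge the part files afterwards if a single file is really required
joined.write.option("delimiter", "|").option("header", "true").csv(hdfsPath)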

Related

Pyspark - iterate on a big dataframe

I'm using the following code
events_df = []
for i in df.collect():
    v = generate_event(i)
    events_df.append(v)
events_df = spark.createDataFrame(events_df, schema)
to go over each dataframe item and add an event header calculated in the generate_event function
def generate_event(delta_row):
    header = {
        "id": 1,
        ...
    }
    row = Row(Data=delta_row)
    return EntityEvent(header, row)

class EntityEvent:
    def __init__(self, _header, _payload):
        self.header = _header
        self.payload = _payload
It works fine locally for a df with few items (even with 1,000,000 items), but when we have more than 6 million rows the AWS Glue job fails.
Note: the RDD approach seems to be better, but I can't use it because I have a problem with dates < 1900-01-01 (issue).
Is there a way to chunk the dataframe and consolidate at the end?
The best solution we can propose is to use Spark's built-in functions, like adding new columns using the struct and create_map functions...
events_df = (
    df
    .withColumn(
        "header",
        f.create_map(
            f.lit("id"),
            f.lit(1)
        )
    )
    ...
So we can create as many columns as we need and apply transformations to get the required header structure.
PS: this solution (adding new columns to the dataframe rather than iterating over it) avoids using the RDD API and brings a big performance advantage!

Iterate Through Rows of a Dataframe

Since I am a bit new to Spark Scala, I am finding it difficult to iterate through a Dataframe.
My dataframe contains 2 columns: one is path and the other is ingestiontime.
Example -
Now I want to iterate through this dataframe and use the data in the path and ingestiontime columns to prepare a Hive query and run it, such that the queries that are run look like -
ALTER TABLE <hiveTableName> ADD PARTITION (ingestiontime=<Ingestiontime_From_the_DataFrame_ingestiontime_column>) LOCATION (<Path_From_the_dataFrames_path_column>)
To achieve this, I used -
allOtherIngestionTime.collect().foreach {
  row =>
    var prepareHiveQuery = "ALTER TABLE myhiveTable ADD PARTITION (ingestiontime = " + row.mkString("<SomeCustomDelimiter>").split("<SomeCustomDelimiter>")(1) + ") LOCATION ( " + row.mkString("<SomeCustomDelimiter>").split("<SomeCustomDelimiter>")(0) + " )"
    spark.sql(prepareHiveQuery)
}
But I feel this can be very dangerous, i.e. when my data itself contains the same delimiter. I am very interested in finding other ways of iterating through the rows/columns of a Dataframe.
Check below code.
df
  .withColumn("query", concat_ws("", lit("ALTER TABLE myhiveTable ADD PARTITION (ingestiontime="), col("ingestiontime"), lit(") LOCATION (\""), col("path"), lit("\")")))
  .select("query")
  .as[String]
  .collect
  .foreach(q => spark.sql(q))
In order to access your columns path and ingestiontime you can use row.getString(0) and row.getString(1).
DataFrames
val allOtherIngestionTime: DataFrame = ???
allOtherIngestionTime.foreach {
  row =>
    val prepareHiveQuery = "ALTER TABLE myhiveTable ADD PARTITION (ingestiontime = " + row.getString(1) + ") LOCATION ( " + row.getString(0) + " )"
    spark.sql(prepareHiveQuery)
}
Datasets
If you use Datasets instead of Dataframes you will be able to use row.path and row.ingestiontime in an easier way.
case class myCaseClass(path: String, ingestionTime: String)
val ds: Dataset[myCaseClass] = ???
ds.foreach({ row =>
  val prepareHiveQuery = "ALTER TABLE myhiveTable ADD PARTITION (ingestiontime = " + row.ingestionTime + ") LOCATION ( " + row.path + " )"
  spark.sql(prepareHiveQuery)
})
In any case, to iterate over a Dataframe or a Dataset you can use foreach, or map if you want to convert the content into something else.
Also, using collect() brings all the data to the driver, which is not recommended; you could use foreach or map without collect().
If what you want is to iterate over the row fields, you can make it a Seq and iterate:
row.toSeq.foreach{column => ...}
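As a small sketch of that last point, iterating over the fields of each collected row as a Seq (the println is just a placeholder for whatever per-field logic you need):
allOtherIngestionTime.collect().foreach { row =>
  // pair each field value with its column name and process it
  row.toSeq.zip(allOtherIngestionTime.columns).foreach { case (value, name) =>
    println(s"$name = $value")
  }
}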

Best way to update a dataframe in Spark scala

Consider two DataFrames, data_df and update_df. These two dataframes have the same schema (key, update_time, bunch of columns).
I know two (main) ways to "update" data_df with update_df:
full outer join
I join the two dataframes (on key) and then pick the appropriate columns (according to the value of update_time).
max over partition
Union both dataframes, compute the max update_time by key and then keep only the rows that equal this maximum.
Here are the questions:
Is there any other way?
Which one is the best way, and why?
I've already done the comparison with some open data.
Here is the join code
var join_df = data_df.alias("data").join(maj_df.alias("maj"), Seq("key"), "outer")
var res_df = join_df.where($"data.update_time" > $"maj.update_time" || $"maj.update_time".isNull)
  .select(col("data.*"))
  .union(
    join_df.where($"data.update_time" < $"maj.update_time" || $"data.update_time".isNull)
      .select(col("maj.*")))
And here is window code
import org.apache.spark.sql.expressions._
val byKey = Window.partitionBy($"key") // orderBy is implicit here
res_df = data_df.union(maj_df)
.withColumn("max_version", max("update_time").over(byKey))
.where($"update_time" === $"max_version")
I can paste the DAGs and plans here if needed, but they are pretty large.
My first guess is that the join solution might be the best way, but it only works if the update dataframe has only one version per key.
PS: I'm aware of the Apache Delta solution, but sadly I'm not able to use it.
Below is one way of doing it that joins only on the keys, in an effort to minimize the amount of memory used by the filters and the join commands.
///Two records, one with a change, one no change
val originalDF = spark.sql("select 'aa' as Key, 'Joe' as Name").unionAll(spark.sql("select 'cc' as Key, 'Doe' as Name"))
///Two records, one change, one new
val updateDF = spark.sql("select 'aa' as Key, 'Aoe' as Name").unionAll(spark.sql("select 'bb' as Key, 'Moe' as Name"))
///Make new DFs of each just for Key
val originalKeyDF = originalDF.selectExpr("Key")
val updateKeyDF = updateDF.selectExpr("Key")
///Find the keys that are similar between both
val joinKeyDF = updateKeyDF.join(originalKeyDF, updateKeyDF("Key") === originalKeyDF("Key"), "inner")
///Turn the known keys into an Array
val joinKeyArray = joinKeyDF.select(originalKeyDF("Key")).rdd.map(x=>x.mkString).collect
///Filter the rows from original that are not found in the new file
val originalNoChangeDF = originalDF.where(!($"Key".isin(joinKeyArray:_*)))
///Update the output with unchanged records, update records, and new records
val finalDF = originalNoChangeDF.unionAll(updateDF)
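If the update dataframe itself can contain several versions per key (the caveat raised in the question), a minimal hedged sketch for first reducing it to the latest row per key, using the key and update_time columns from the question's schema:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// keep only the most recent row per key in the update dataframe
val latestPerKey = Window.partitionBy(col("key")).orderBy(col("update_time").desc)
val dedupUpdateDF = update_df
  .withColumn("rn", row_number().over(latestPerKey))
  .where(col("rn") === 1)
  .drop("rn")

// dedupUpdateDF can then be fed into either the join or the window approach above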

How to Compare columns of two tables using Spark?

I am trying to compare two tables by reading them as DataFrames. For each common column in those tables, I use the concatenation of a primary key, say order_id, with the other columns like order_date, order_name, order_event.
The Scala code I am using:
val primaryKey = "order_id"
for (i <- commonColumnsList) {
  val column_name = i
  val tempDataFrameForNew = newDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
  val tempDataFrameOld = oldDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
  // Get those records which are common to both the old and new tables
  matchCountCalculated = tempDataFrameForNew.intersect(tempDataFrameOld)
  // Get those records which aren't common to both the old and new tables
  nonMatchCountCalculated = tempDataFrameOld.unionAll(tempDataFrameForNew).except(matchCountCalculated)
  // Total null/non-null counts in both the old and new tables
  nullsCountInNewDataFrame = newDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
  nullsCountInOldDataFrame = oldDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
  nonNullsCountInNewDataFrame = newDFCount - nullsCountInNewDataFrame
  nonNullsCountInOldDataFrame = oldDFCount - nullsCountInOldDataFrame
  // Put the result for a given column in a Seq variable, later converted to a DataFrame
  tempSeq = tempSeq :+ Row(column_name, matchCountCalculated.count().toString, nonMatchCountCalculated.count().toString,
    (nullsCountInNewDataFrame - nullsCountInOldDataFrame).toString,
    (nonNullsCountInNewDataFrame - nonNullsCountInOldDataFrame).toString)
}
// Final step: create a DataFrame from the Seq and a schema
spark.createDataFrame(spark.sparkContext.parallelize(tempSeq), schema)
The above code works fine for a medium-sized dataset, but as the number of columns and records in my new and old tables increases, the execution time goes up. Any sort of advice is appreciated.
Thank you in advance.
You can do the following:
1. Outer join the old and new dataframes on the primary key
val joined_df = df_old.join(df_new, Seq(primary_key), "outer")
2. Cache it if you possibly can. This will save you a lot of time
3. Now you can iterate over the columns and compare them using Spark functions (isNull for not matched, === for matched, etc.):
for (col <- df_new.columns) {
  val matchCount = joined_df.filter(df_new(col).isNotNull && df_old(col).isNotNull).count()
  val nonMatchCount = ...
}
This should be considerably faster, especially when you can cache your dataframe. If you can't, it might be a good idea to save the joined df to disk in order to avoid a shuffle each time.
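A rough end-to-end sketch of that advice, reusing the names from the question; the exact match/non-match semantics (both sides non-null and equal vs. anything else) are an assumption and may need adapting:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

// outer join once on the primary key and cache the result so every per-column pass reuses it
val joinedDF = oldDataFrame.alias("old")
  .join(newDataFrame.alias("new"), col("old.order_id") === col("new.order_id"), "outer")
  .cache()

val stats = commonColumnsList.filterNot(_ == "order_id").map { c =>
  val oldCol = col(s"old.$c")
  val newCol = col(s"new.$c")
  // assumed definition: a match is both sides present and equal
  val matchCount = joinedDF.filter(oldCol.isNotNull && newCol.isNotNull && oldCol === newCol).count()
  // assumed definition: a non-match is a missing side or differing values
  val nonMatchCount = joinedDF.filter(oldCol.isNull || newCol.isNull || oldCol =!= newCol).count()
  Row(c, matchCount.toString, nonMatchCount.toString)
}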

Spark - move data from tables to new table with extra column

So we have a Cassandra project that requires us to migrate a large number of tables, merging data from 3 separate tables into one.
e.g. table_d_abc, table_m_abc, table_w_abc to table_t_abc
Essentially, the data needs to be moved into this new table with an extra column whose value was encoded in the source table's name.
There are hundreds of tables like this, so you can imagine what a huge job it would be to hand-write a migration script, and naturally I thought Spark should be able to do the job.
e.g.:
var tables = List("table_*_abc", "table_*_def") // etc
var periods = List('d', 'w', 'm')
for (table <- tables) {
  for (period <- periods) {
    var rTable = table.replace('*', period)
    var nTable = table.replace('*', 't')
    try {
      var t = sc.cassandraTable("data", rTable)
      var fr = t.first
      var columns = fr.toMap.keys.toArray :+ "period"
      var data = t.map(_.iterator.toArray :+ period)
      // This line does not work, as data is an RDD of Array[Any] and not an RDD of tuples
      // How to ???
      data.saveToCassandra("data", nTable, SomeColumns(columns.map(ColumnName(_)): _*))
    } //catch {}
  }
}
versus:
var periods = List('d', 'w', 'm')
for (period <- periods) {
  sc.cassandraTable("data", "table_" + period + "_abc")
    .map(v => (v.getString("a"), v.getInt("b"), v.getInt("c"), period))
    .saveToCassandra("data", "table_t_abc", SomeColumns("a", "b", "c", "period"))
  // ... 100s of other scripts like this
}
Is what I'm trying to do possible?
Is there a way to programmatically save an extra column from a source with an unknown number of columns and datatypes?
The issue here is that the RDD's objects must be of a type which has a "RowWriter" defined. This maps the data in the object to C*-insertable buffers.
RDD World
Using "CassandraRow" objects this is possible. These objects allow for generic contents and can be constructed on the file. They are also the default output so making a new one from an old one should be relatively cheap.
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/CassandraRow.scala
You would make a single RowMetadata (basically schema info) for each table with the additional column, then populate the row with the values of the input row + the new period variable.
Dataframe World
If you wanted to switch to Dataframes this would be easier, as you could just use withColumn to add the column before saving:
cassandraDF.withColumn("period",lit("Value based on first row"))
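A hedged sketch of that DataFrame route, assuming a SparkSession named spark, the spark-cassandra-connector DataFrame source, and that the target table_t_* tables already exist (the names mirror the loop from the question):
import org.apache.spark.sql.functions.lit

val tables = List("table_*_abc", "table_*_def") // etc.
val periods = List("d", "w", "m")

for (table <- tables; period <- periods) {
  val sourceTable = table.replace("*", period)
  val targetTable = table.replace("*", "t")

  // read the source table through the connector's DataFrame source;
  // the schema (however many columns and types) is discovered automatically
  val df = spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "data", "table" -> sourceTable))
    .load()

  // add the period column derived from the table name and append to the merged table
  df.withColumn("period", lit(period))
    .write
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "data", "table" -> targetTable))
    .mode("append")
    .save()
}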