Spark performance issues when iteratively adding multiple columns - scala

I am seeing performance issues when iteratively adding columns (around 100) to a DataFrame.
I know that it is more efficient to add multiple columns with a single select. However, I have to add the columns in order, because column 2 may depend on column 1, and so on.
The columns are added after a join that produced skew, so I have explicitly repartitioned by a salt key to distribute the data evenly across the cluster.
When I ran locally, I was seeing OOM errors even for fairly small (100 rows, 500 columns) datasets.
I was able to get the job running locally by checkpointing after the addition of every x columns, so I suspect Spark lineage issues are causing my problems. However, I am still unable to run the job at scale on the cluster.
Any advice on where to look, or on best practice in this scenario, would be gratefully received.
At a high level my job looks like this:
val df1 = ??? // Millions of rows, ~500 cols, from parquet
val df2 = ??? // 1000 rows, from parquet
val newExpressions = ??? // 100 rows, from Oracle
val joined = df1.join(broadcast(df2), <join expr>)
val newColumns = newExpressions.collectAsList.map(<get columnExpr and columnName>)
val salted = joined.withColumn("salt", rand()).repartition(x, col("salt"))
newColumns.foldLeft(salted) {
case (df, row) => df.withColumn(row.name, expr(row.expression))
} // Checkpointing after every x columns seems to help
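A minimal sketch of the "checkpoint after every x columns" idea, written as a generic fold so it can be tried without a cluster. The helper name foldWithPeriodicStep and its parameters are hypothetical, not part of Spark; the truncation step is passed in as a function.

```scala
// Fold `items` into `init` with `step`, and apply `truncate` to the
// accumulator after every `every` items (hypothetical helper, not a Spark API).
def foldWithPeriodicStep[A, B](init: A, items: Seq[B], every: Int)(
    step: (A, B) => A)(truncate: A => A): A = {
  require(every > 0, "every must be positive")
  items.zipWithIndex.foldLeft(init) { case (acc, (item, i)) =>
    val next = step(acc, item)
    // i is zero-based, so (i + 1) counts how many items have been applied so far
    if ((i + 1) % every == 0) truncate(next) else next
  }
}
```

With Spark, and assuming each element of newColumns exposes name and expression fields, this could be called roughly as foldWithPeriodicStep(salted, newColumns, 10)((df, c) => df.withColumn(c.name, expr(c.expression)))(_.localCheckpoint()). Dataset.localCheckpoint() (Spark 2.3+) materializes the data and truncates the logical plan; checkpoint() does the same more reliably but needs a checkpoint directory set on the SparkContext. The batch size of 10 is an arbitrary placeholder.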
Cheers
Terry

Related

Unnecessary Extra forEachPartition causing extra time to complete the Job

I am reading data from HBase within a time range, so I divide the time range into chunks and scan the HBase columns within each chunk.
Suppose I have a time range from Jun to Aug. I divide it into weeks, which gives a list of 8 weekly time ranges.
From that list, I scan the HBase columns via repartition and mapPartitions like this:
sparkSession.sparkContext.parallelize(chunkedTimeRange.toList).repartition(noOfCores).mapPartitions{
// Scan Cols of Hbase Logic
// This gives DF as output
}
I get a DF from the above and apply some filters to it using mapPartitions and foreachPartition, like this:
df.mapPartitions{
rows => {
rows.toList.par.foreach(
cols => {
json.filter(condition).foreach(//code)
anotherJson.filter(condition).foreach(//code)
}
)
}
// returns DF
}
This DF is used by other methods. Since mapPartitions is lazy, I call an action after the above, like this:
df.persist(StorageLevel.MEMORY_AND_DISK)
df.foreachPartition((x: Iterator[org.apache.spark.sql.Row]) => x: Unit)
This foreachPartition unnecessarily executes twice. One stage takes around 2.5 min (128 tasks) and the other 40 s (200 tasks), which is not necessary.
200 is the value set in the Spark config:
spark.sql.shuffle.partitions=200
How can I avoid this unnecessary foreachPartition? Is there any way I can still improve performance?
I found a similar question. Unfortunately, I didn't get much information from it.
[Screenshot: foreachPartition happening twice for the same DF]
If any clarification is needed, please mention it in a comment.
You need to "reuse" the persisted Dataframe:
val df2 = df.persist(StorageLevel.MEMORY_AND_DISK)
df2.foreachPartition((x: Iterator[org.apache.spark.sql.Row]) => x: Unit)
Otherwise when running the foreachPartition, it runs on a DF which has not been persisted and it's doing every step of the DF computation again.

Spark Dataframe join heap space issue and too many partitions

I am trying to join DataFrames, filtering them in advance for simple testing. After the filter, each DataFrame has only 4 rows. There are 190 columns per DataFrame.
I run this code locally and it runs super fast (albeit with only 22 columns). Also, the partition count when I check locally is just 1. The join is a simple one on 2 key columns, and I made sure that there is no Cartesian product.
When I run this on my Dev/UAT cluster, it takes forever and fails partway through. Also, I see that around 40,000 partitions are created per join. I am printing the count using resultDf.rdd.partitions.size.
I've split the join up like this, and it didn't help.
var joinCols = Seq("subjectid","componenttype")
val df1 = mainDf1.filter("metricname = 'NPV'").withColumnRenamed("my_attr","df1attr").withColumnRenamed("value","df1val")
val df2 = mainDf2.filter("metricname = 'NPV'").withColumnRenamed("my_attr","df2attr").withColumnRenamed("value","df2val")
val df3 = mainDf3.filter("metricname = 'NPV'").withColumnRenamed("my_attr","df3attr").withColumnRenamed("value","df3val")
val df4 = mainDf2.filter("metricname = 'NPV'").withColumnRenamed("my_attr","df4attr").withColumnRenamed("value","df4val")
var resultDf = df1.as("dft").join(df2,joinCols,"inner").select("dft.*","df2attr","df2val")
//check partition size here and show the dataframe to make sure we are getting 4 rows only as expected. I am getting 4 rows but 40,000 partitions and takes lot of time here itself.
resultDf = resultDf.as("dfi").join(df3,joinCols,"inner").select("dfi.*","df3attr","df3val")
//Mostly by here my program comes out either with heap space error or with exception exitCode=56
resultDf = resultDf.as("dfa").join(df4,joinCols,"inner").select("dfa.*","df4attr","df4val")
The names used here are all dummies for the purpose of posting the code, so please don't mind them.
Any inputs or help to point me in the right direction?

Spark DataFrame row count is inconsistent between runs

When I run my Spark job (version 2.1.1) on EMR, each run counts a different number of rows in a DataFrame. I first read data from S3 into 4 different DataFrames; these counts are always consistent. Then, after joining the DataFrames, the result of the join has a different count on each run. Afterwards I also filter the result, and that also has a different count on each run. The variations are small, a difference of 1-5 rows, but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
.join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
.join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
.join(chartSiteInstance, impJoinKey, "left")
.withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
.withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using the filter method instead of where, but with the same results.
Any thoughts?
Thanks
Nir
Is it possible that one of the data sources changes over time?
Since impressionsJoined is not cached, Spark will re-evaluate it from scratch on every action, and that includes reading the data again from the source.
Try caching impressionsJoined after the join.

writing 2 data frames in parallel in scala

For example, I am doing a lot of calculations and am finally down to 3 DataFrames:
val mainQ = spark.sql("select * from employee")
mainQ.createOrReplaceTempView("mainQ")
val mainQ1 = spark.sql("select state,count(1) from mainQ group by state")
val mainQ2 = spark.sql("select dept_id,sum(salary) from mainQ group by dept_id")
val mainQ3 = spark.sql("select dept_id,state , sum(salary) from mainQ group by dept_id,state")
//Basically I want to run the writes below in parallel. I could put them into
//different files, but that is not what I am looking for. Once all computation is done, I want to write the data in parallel.
mainQ1.write.mode("overwrite").save("/user/h/mainQ1.txt")
mainQ2.write.mode("overwrite").save("/user/h/mainQ2.txt")
mainQ3.write.mode("overwrite").save("/user/h/mainQ3.txt")
Normally there is no benefit using multi-threading in the driver code, but sometimes it can increase performance. I had some situations where launching parallel spark jobs increased performance drastically, namely when the individual jobs do not utilize the cluster resources well (e.g. due to data skew, too few partitions etc). In your case you can do:
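// ParSeq is in the standard library up to Scala 2.12; from 2.13 it needs the scala-parallel-collections module
import scala.collection.parallel.immutable.ParSeq
ParSeq(
(mainQ1,"/user/h/mainQ1.txt"),
(mainQ2,"/user/h/mainQ2.txt"),
(mainQ3,"/user/h/mainQ3.txt")
).foreach{case (df,filename) =>
df.write.mode("overwrite").save(filename)
}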
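If parallel collections are not available, the same idea can be sketched with plain Futures on a dedicated thread pool. The helper name runAllInParallel and the thread count are assumptions for illustration, not part of Spark.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Run every task on its own pool thread and wait for all of them.
// Results come back in the same order as the input tasks.
def runAllInParallel[A](tasks: Seq[() => A], threads: Int): Seq[A] = {
  val pool = Executors.newFixedThreadPool(threads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    val futures = tasks.map(task => Future(task()))
    // Blocks the calling (driver) thread until every task has finished;
    // if any task fails, its exception is rethrown here.
    Await.result(Future.sequence(futures), Duration.Inf)
  } finally {
    pool.shutdown()
  }
}
```

In the driver, each task would wrap one write, for example () => mainQ1.write.mode("overwrite").save("/user/h/mainQ1.txt"), with threads set to the number of writes. Each write is submitted as a separate Spark job, so whether they actually overlap depends on free cluster resources and the scheduler configuration.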

merge multiple small files in to few larger files in Spark

I am using Hive through Spark. I have an INSERT INTO partitioned table query in my Spark code. The input data is 200+ GB. When Spark writes to the partitioned table, it produces very small files (a few KB each), so the output partitioned table folder now has 5000+ small files. I want to merge these into a few larger files of around 200 MB each. I tried using the Hive merge settings, but they don't seem to work.
val result7A = hiveContext.sql("set hive.exec.dynamic.partition=true")
val result7B = hiveContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
val result7C = hiveContext.sql("SET hive.merge.size.per.task=256000000")
val result7D = hiveContext.sql("SET hive.merge.mapfiles=true")
val result7E = hiveContext.sql("SET hive.merge.mapredfiles=true")
val result7F = hiveContext.sql("SET hive.merge.sparkfiles = true")
val result7G = hiveContext.sql("set hive.aux.jars.path=c:\\Applications\\json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar")
val result8 = hiveContext.sql("INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table")
The above Hive settings work in a MapReduce Hive execution and produce files of the specified size. Is there any option to do this in Spark or Scala?
I had the same issue. The solution was to add a DISTRIBUTE BY clause with the partition columns. This ensures that the data for one partition goes to a single reducer. Example in your case:
INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table DISTRIBUTE BY date
You may want to try using the DataFrame.coalesce method; it returns a DataFrame with the specified number of partitions (each of which becomes a file on insertion). So using the number of records you are inserting and the typical size of each record, you can estimate how many partitions to coalesce to if you want files of ~200MB.
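The estimate described above is simple ceiling division. A minimal sketch, where the function name estimatePartitions and its parameters are hypothetical and the average row size is something you would have to measure or guess:

```scala
// Estimate how many partitions (and therefore output files) to coalesce to,
// given a row count, an assumed average row size, and a target file size.
def estimatePartitions(rowCount: Long, avgRowBytes: Long, targetFileBytes: Long): Int = {
  require(rowCount >= 0 && avgRowBytes > 0 && targetFileBytes > 0, "sizes must be positive")
  val totalBytes = rowCount * avgRowBytes
  // Ceiling division, with a floor of one partition for tiny inputs
  math.max(1L, (totalBytes + targetFileBytes - 1) / targetFileBytes).toInt
}

// One billion rows of roughly 200 bytes each, aiming for 200 MB files
val partitions = estimatePartitions(1000000000L, 200L, 200L * 1024 * 1024)
```

The result would then be passed to coalesce before the insert. Treat it as a starting point only: compressed or columnar output formats usually produce files smaller than the in-memory row size suggests.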
The dataframe repartition(1) method works in this case.