Skewed Window Function & Hive Source Partitions? - scala

The data I am reading via Spark is a highly skewed Hive table, with the following partition stats (size / record count) from the Spark UI:
MIN:    1506.0 B / 0
25TH:   232.4 KB / 27288
MEDIAN: 247.3 KB / 29025
75TH:   371.0 KB / 42669
MAX:    269.0 MB / 27197137
I believe it is causing problems downstream in the job when I perform some window functions and pivots.
I tried the following parameter to limit the partition size, however nothing changed and the partitions are still skewed upon read:
spark.conf.set("spark.sql.files.maxPartitionBytes", 67108864)  # value in bytes; 64 MB here as an example
Also, when I cache this DataFrame with the Hive table as the source, it takes a few minutes and even causes some GC to show up in the Spark UI, most likely because of the skew as well.
Does spark.sql.files.maxPartitionBytes work on Hive tables, or only on files?
What is the best course of action for handling this skewed Hive source?
Would something like a stage-barrier write to Parquet, or salting, be suitable for this problem?
I would like to avoid .repartition() on read, as it adds yet another layer to what is already a roller-coaster of a job.
Thank you
==================================================
After further research, it appears the window function is also producing skewed data, and this is where the Spark job hangs.
I am performing some time-series filling via a double window function (a forward fill then a backward fill, to impute all the null sensor readings) and am trying to follow this article to apply a salt method and distribute the data evenly; however, the following code produces all null values, so the salt method is not working.
I am also not sure why I am getting skew after the window in the first place, since each measure value I am partitioning by has roughly the same number of records (checked via .groupBy()) ... so why would salting be needed?
+--------------------+-------+
| measure | count|
+--------------------+-------+
| v1 |5030265|
| v2 |5009780|
| v3 |5030526|
| v4 |5030504|
...
salt post => https://medium.com/appsflyer/salting-your-spark-to-scale-e6f1c87dd18
nSaltBins = 300  # based on the number of "measure" values
df_fill = df_fill.withColumn("salt", (F.rand() * nSaltBins).cast("int"))

# FILLS [FORWARD + BACKWARD]
window = Window.partitionBy('measure')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)

# FORWARD FILLING IMPUTER
ffill_imputer = F.last(df_fill['new_value'], ignorenulls=True)\
    .over(window)
fill_measure_DF = df_fill.withColumn('value_impute_temp', ffill_imputer)\
    .drop("value", "new_value")

window = Window.partitionBy('measure')\
    .orderBy('measure', 'date')\
    .rowsBetween(0, Window.unboundedFollowing)

# BACKWARD FILLING IMPUTER
bfill_imputer = F.first(fill_measure_DF['value_impute_temp'], ignorenulls=True)\
    .over(window)
df_fill = fill_measure_DF.withColumn('value_impute_final', bfill_imputer)\
    .drop("value_impute_temp")

Salting might be helpful in the case where a single partition is big enough not to fit in memory on a single executor. This can happen even when all the keys are equally distributed (as in your case).
You have to include the salt column in the partitionBy clause you are using to create the Window:
window = Window.partitionBy('measure', 'salt')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)
Then you have to create another window, which will operate on the intermediate result:
window1 = Window.partitionBy('measure')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)
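Putting the two together, here is a minimal sketch of the wiring (names follow the question's code; note that with a random salt an order-sensitive fill is not guaranteed to match the unsalted result exactly, so verify the output on your data):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

nSaltBins = 300
df_fill = df_fill.withColumn("salt", (F.rand() * nSaltBins).cast("int"))

# Pass 1: partial forward fill inside each (measure, salt) slice
window = Window.partitionBy('measure', 'salt')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)
df_partial = df_fill.withColumn(
    'value_partial', F.last('new_value', ignorenulls=True).over(window))

# Pass 2: finish the fill on the intermediate result, over 'measure' alone
window1 = Window.partitionBy('measure')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)
df_ffilled = df_partial.withColumn(
    'value_impute_temp', F.last('value_partial', ignorenulls=True).over(window1))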

Hive-based solution:
You can enable skew join optimization using Hive configuration. The applicable settings are:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=500000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
See the Databricks tips for this: skew hints may work in this case.
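For reference, a hedged sketch of what a skew hint can look like on a Databricks runtime (the SKEW hint is Databricks-specific and its availability depends on the runtime version; the table and column names below are placeholders; on open-source Spark an unrecognized hint is simply ignored with a warning):
# Assumption: a Databricks runtime that recognizes the SKEW hint; 'fact_table'/'dim_table' are placeholders.
skew_hinted = spark.sql("""
    SELECT /*+ SKEW('f') */ f.*, d.attr
    FROM fact_table f
    JOIN dim_table d
      ON f.key = d.key
""")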

Related

Spark Scala - Comparing Datasets Column by Column

I'm just getting started with Spark; I've previously used Python with pandas. One of the things I do very regularly is compare datasets to see which columns have differences. In Python/pandas this looks something like this:
merged = df1.merge(df2, on="by_col")
for col in cols:
    diff = merged[col + "_x"] != merged[col + "_y"]
    if diff.sum() > 0:
        print(f"{col} has {diff.sum()} diffs")
I'm simplifying this a bit, but this is the gist of it, and of course after this I'd drill down and look at, for example:
col = "col_to_compare"
diff = merged[col+"_x"] != merged[col+"_y"]
print(merged[diff][[col+"_x",col+"_y"]])
Now in Spark/Scala this is turning out to be extremely inefficient. The same logic works, but the dataset is roughly 300 columns wide, and the following code takes about 45 minutes to run on a 20 MB dataset, because it submits 300 different Spark jobs in sequence, not in parallel, so I seem to be paying the job startup cost 300 times. For reference, the pandas version takes something like 300 ms.
for (col <- cols) {
  val cnt = merged.filter(merged("dev_" + col) <=> merged("prod_" + col)).count
  if (cnt != merged.count) {
    println(col + " = " + cnt + " / " + merged.count)
  }
}
What's the faster, more Spark-like way of doing this type of thing? My understanding is that I want this to be a single Spark job that creates one plan. I was looking at transposing to a very tall dataset, and while that could potentially work, it ends up being very complicated and the code is not straightforward at all. Also, although this example fits in memory, I'd like to be able to use this function across datasets, and we have a few that are multiple terabytes, so it needs to scale for large datasets as well, whereas with Python/pandas that would be a pain.
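One way to get a single plan is to fold all of the per-column comparisons into one aggregation, e.g. summing a mismatch indicator per column. A hedged PySpark sketch (cols and the dev_/prod_ prefixes are assumed from the snippet above; the Scala Dataset API has equivalent calls):
from pyspark.sql import functions as F

# cols: list of base column names that exist in `merged` as dev_<col> / prod_<col>
diff_exprs = [
    F.sum(
        F.when(F.col("dev_" + c).eqNullSafe(F.col("prod_" + c)), 0).otherwise(1)
    ).alias(c)
    for c in cols
]

# One job, one plan: every column's mismatch count comes back in a single row.
diff_counts = merged.agg(*diff_exprs).first().asDict()
for c, n in diff_counts.items():
    if n and n > 0:
        print(f"{c} has {n} diffs")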

Performing rolling average on streaming data from Kafka using PySpark and without using window

I have been trying to perform aggregation on streaming data and am getting the following error with the window approach: 'Non-time-based windows are not supported on streaming DataFrames/Datasets'.
I am looking for an alternative to the window approach for performing aggregation on streaming data.
w = (Window
     .partitionBy("orig_time")
     .orderBy(F.col("epoch").cast('long'))
     .rangeBetween(-minutes(5), 0))
#windowedDeviceDF = deviceDF.withColumn('rolling_average', F.avg("tag_value").over(w))

windowSpec5 = Window.partitionBy("orig_time").orderBy(F.col("epoch").cast('long')).rangeBetween(-minutes(5), 0)
windowSpec10 = Window.partitionBy("orig_time").orderBy(F.col("epoch").cast('long')).rangeBetween(-minutes(10), 0)

windowedDeviceDF = deviceDF.withColumn("avg5", F.avg("tag_value").over(windowSpec5))\
    .withColumn("avg10", F.avg("tag_value").over(windowSpec10))\
    .withColumn('occurrences_in_5_min', F.count('epoch').over(w))\
    .withColumn('rolling_average', F.avg("tag_value").over(w))\
    .select("tag_name", "epoch", "avg5", "avg10", "occurrences_in_5_min", "rolling_average")

windowedDeviceDF = deviceDF.groupBy(deviceDF.tag_name, deviceDF.tag_value,
                                    window(deviceDF.orig_time, windowDuration, slideDuration)).avg()
This is not the same as a sliding window, but it avoids having to keep the recent data around at all.
Use an "exponential moving average":
avg += fact * (xn - avg)
Where
avg is the current average; this is the only variable that needs to be held from one row to the next (as opposed to the last N values).
fact is a constant fraction that controls the smoothness of the averaging -- 0.01 is very slow to respond to changes; 0.5 responds quite rapidly.
xn is the value (in the current row) being averaged.
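A minimal plain-Python sketch of the recurrence above (outside Spark), just to show that only the running average has to be carried from one row to the next:
def ema(values, fact=0.1):
    """Exponential moving average: avg += fact * (xn - avg)."""
    avg = None
    for xn in values:
        avg = xn if avg is None else avg + fact * (xn - avg)
        yield avg

# A small fact responds slowly to changes; a larger one responds quickly.
print(list(ema([10.0, 10.0, 20.0, 20.0], fact=0.5)))   # [10.0, 10.0, 15.0, 17.5]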

Spark union of dataframes does not give counts?

I am trying to union these dataframes. I used "G_ID is not null or MCOM.T_ID is not null" and used trim, but the count does not come back; it has been running for 1 hour and there are only 3 tasks remaining out of 300. Please suggest how I can debug this. Are nulls causing the issue, and how can I debug that?
val table1 = spark.sql(""" SELECT trim(C_ID) AS PC_ID FROM ab.CIDS WHERE
_UPDT_TM >= '2020-02-01 15:14:39.527' """)
val table2 = spark.sql(""" SELECT trim(C_ID) AS PC_ID FROM ab.MIDS MCOM INNER
JOIN ab.VD_MBR VDBR
ON Trim(MCOM.T_ID) = Trim(VDBR.T_ID) AND Trim(MCOM.G_ID) = Trim(VDBR.G_ID)
AND Trim(MCOM.C123M_CD) IN ('BBB', 'AAA') WHERE MCOM._UPDT_TM >= '2020-02-01 15:14:39.527'
AND Trim(VDBR.BB_CD) IN ('BBC') """)
var abc=table1.select("PC_ID").union(table2.select("PC_ID"))
I even tried this --> filtered = abc.filter(row => !row.anyNull)
It looks like you have a data skew problem. Looking at the "Summary Metrics" it's clear that (at least) three quarters of your partitions are empty, so you are eliminating most of the potential parallelization that spark can provide for you.
Though it will cause a shuffle step (where data gets moved over the network between different executors), a .repartition() will help to balance the data across all of the partitions and create more valid units of work to be spread among the available cores. This would most likely provide a speedup of your count().
As a rule of thumb, you'd likely want to call .repartition() with the parameter set to at least the number of cores in your cluster. Setting it higher will result in tasks getting completed more quickly (it's fun to watch the progress), though it adds some management overhead to the overall time the job will take to run. If the tasks are too small (i.e. not enough data per partition), then sometimes the scheduler gets confused and won't use the entire cluster either. On the whole, finding the right number of partitions is a balancing act.
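A minimal sketch of that suggestion (written in PySpark for brevity; repartition and count are the same calls in the Scala API, and 200 is only a placeholder partition count):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Simplified stand-in for the question's union; the join/filter logic is omitted here.
table1 = spark.table("ab.CIDS").select(F.trim("C_ID").alias("PC_ID"))
table2 = spark.table("ab.MIDS").select(F.trim("C_ID").alias("PC_ID"))

# Spread the rows across the cluster before the expensive action;
# pick at least as many partitions as there are executor cores.
abc = table1.union(table2).repartition(200)
print(abc.count())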
You have aliased the column "C_ID" as "PC_ID", and after that you are looking for "C_ID".
Also, a union can only be performed on tables with the same number of columns; check that your table1 and table2 have the same column count,
otherwise you will get: org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns
Please take care of these two scenarios first.

Is spark persist() (then action) really persisting?

I always understood that persist() or cache(), followed by an action to activate the DAG, will calculate and keep the result in memory for later use. A lot of threads here will tell you to cache to enhance the performance of frequently used dataframes.
Recently I did a test and was confused because that does not seem to be the case.
temp_tab_name = "mytablename";
x = spark.sql("select * from " +temp_tab_name +" limit 10");
x = x.persist()
x.count() #action to activate all the above steps
x.show() #x should have been persisted in memory here, DAG evaluated, no going back to "select..." whenever referred to
x.is_cached #True
spark.sql("drop table "+ temp_tab_name);
x.is_cached #Still true!!
x.show() # Error, table not found here
So it seems to me that x is never calculated and persisted. The next reference to x still goes back to evaluating its DAG definition "select...". Is there anything I missed here?
cache and persist don't completely detach the computation result from the source.
They just make a best effort to avoid recalculation. So, generally speaking, deleting the source before you are done with the dataset is a bad idea.
What could go wrong in your particular case (off the top of my head):
1) show doesn't need all records of the table, so maybe it triggers computation only for a few partitions. So most of the partitions are still not calculated at this point.
2) Spark needs some auxiliary information from the table (e.g. for partitioning).
The correct syntax is below. Here is some additional documentation for "uncaching" tables => https://spark.apache.org/docs/latest/sql-performance-tuning.html. You can confirm in the Spark UI, under the "Storage" tab, that the objects in the examples below are being "cached" and "uncached".
# df method
df = spark.range(10)
df.cache() # cache
# df.persist() # acts same as cache
df.count() # action to materialize df object in ram
# df.foreach(lambda x: x) # another action to materialize df object in ram
df.unpersist() # remove df object from ram
# temp table method
df.createOrReplaceTempView("df_sql")
spark.catalog.cacheTable("df_sql") # cache
spark.sql("select * from df_sql").count() # action to materialize temp table in ram
spark.catalog.uncacheTable("df_sql") # remove temp table from ram

Reducing shuffle disk usage in Spark aggregations

I have a table in Hive which is 100 GB in size. I am grouping, counting, and storing the result as a Hive table. My hard disk has 600 GB of space; by the time the job reaches 70%, all of the disk space is occupied, so the job fails. How can I minimize the shuffle data writes?
hiveCtx.sql("select * from gm.final_orc")
.repartition(300)
.groupBy('col1, 'col2).count
.orderBy('count desc)
.write.saveAsTable("gm.result")
[image: spark_memory]
In cloud-based execution environments adding more disk is usually a very easy and cheap option. If your environment does not allow this and you've verified that your shuffle settings are reasonable, e.g., compression (on by default) is not changed, then there is only one solution: implement your own staged map-reduce using the fact that counts can be re-aggregated via sum.
Partition your data in any way that seems fit (by date, by directory, by number of files, etc.)
Perform the counting by col1 and col2 as separate Spark actions.
Re-group and re-aggregate.
Sort.
For simplicity, let's assume that col1 is an integer. Here is how I'd break up processing into 8 separate jobs, re-aggregating their output. If col1 is not an integer, you can hash it or you can use another column.
def splitTableName(i: Int) = s"tmp.gm.result.part-$i"

// Source data
val df = hiveCtx.sql("select col1, col2 from gm.final_orc")

// Number of splits
val splits = 8

// Materialize partial aggregations
val tables = for {
  i <- 0 until splits
  tableName = splitTableName(i)
  // If col1 % splits will create very skewed data, hash it first, e.g.,
  // hash(col1) % splits. hash() uses Murmur3.
  _ = df.filter('col1 % splits === i)
    // repartition only if you need to, e.g., massive partitions are causing OOM
    // better to increase the number of splits and/or hash to un-skew skewed data
    .groupBy('col1, 'col2).count
    .write.saveAsTable(tableName)
} yield hiveCtx.table(tableName)

// Final aggregation
tables.reduce(_ union _)
  .groupBy('col1, 'col2)
  .agg(sum('count).as("count"))
  .orderBy('count.desc)
  .write.saveAsTable("gm.result")

// Cleanup temporary tables
(0 until splits).foreach { i =>
  hiveCtx.sql(s"drop table ${splitTableName(i)}")
}
If col1 and col2 are so diverse and/or so large that the partial aggregation storage is causing disk space issues then you have to consider one of the following:
Smaller number of splits will generally use less disk space.
Sorting on col1 will help (because of Parquet run length encoding) but that would slow down execution.
Consider how to create splits that are independent, e.g., find the distinct values of col1 and partition those into groups.
If you are extremely short on disk space you'd have to implement multi-step re-aggregation. The simplest approach is to generate the splits one at a time and keep a running aggregate. The execution would be much slower but it will use a lot less disk space.
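If you go down that road, here is a rough, hedged sketch of one reading of the "one split at a time, keep a running aggregate" idea (written in PySpark for brevity; the temporary table names are placeholders, and the same caveat about skewed col1 % splits from the comments above applies):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
splits = 8
running = None

for i in range(splits):
    # As in the answer's code: assumes col1 is an integer; hash it first if col1 % splits is skewed.
    part = (spark.table("gm.final_orc")
            .where(F.col("col1") % splits == i)
            .groupBy("col1", "col2").count())
    # Fold the new split into the running aggregate (counts re-aggregate via sum).
    running = part if running is None else (
        running.union(part)
               .groupBy("col1", "col2")
               .agg(F.sum("count").alias("count")))
    # Materialize the running aggregate so the next step starts from a table on disk
    # rather than an ever-growing lineage. "tmp_running_<i>" is a placeholder name.
    running.write.mode("overwrite").saveAsTable(f"tmp_running_{i}")
    running = spark.table(f"tmp_running_{i}")

running.orderBy(F.desc("count")).write.saveAsTable("gm.result_running")  # placeholder output table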
Hope this helps!