Reducing shuffle disk usage in Spark aggregations - scala

I have a Hive table that is 100 GB in size. I am grouping, counting, and storing the result as another Hive table. My hard disk has 600 GB of space, but by the time the job reaches 70% all of the disk space is occupied, so the job fails.
How can I minimize the shuffle data writes?
hiveCtx.sql("select * from gm.final_orc")
  .repartition(300)
  .groupBy('col1, 'col2).count
  .orderBy('count.desc)
  .write.saveAsTable("gm.result")

In cloud-based execution environments, adding more disk is usually a very easy and cheap option. If your environment does not allow this and you've verified that your shuffle settings are reasonable, e.g., compression (on by default) has not been disabled, then there is only one solution: implement your own staged map-reduce using the fact that counts can be re-aggregated via sum.
Partition your data in any way that seems fit (by date, by directory, by number of files, etc.)
Perform the counting by col1 and col2 as separate Spark actions.
Re-group and re-aggregate.
Sort.
For simplicity, let's assume that col1 is an integer. Here is how I'd break up processing into 8 separate jobs, re-aggregating their output. If col1 is not an integer, you can hash it or you can use another column.
def splitTableName(i: Int) = s"tmp.gm.result.part-$i"

// Source data
val df = hiveCtx.sql("select col1, col2 from gm.final_orc")
// Number of splits
val splits = 8
// Materialize partial aggregations
val tables = for {
  i <- 0 until splits
  tableName = splitTableName(i)
  // If col1 % splits will create very skewed data, hash it first, e.g.,
  // hash(col1) % splits. hash() uses Murmur3.
  _ = df.filter('col1 % splits === i)
    // repartition only if you need to, e.g., massive partitions are causing OOM
    // better to increase the number of splits and/or hash to un-skew skewed data
    .groupBy('col1, 'col2).count
    .write.saveAsTable(tableName)
} yield hiveCtx.table(tableName)
// Final aggregation
tables.reduce(_ union _)
  .groupBy('col1, 'col2)
  .agg(sum('count).as("count"))
  .orderBy('count.desc)
  .write.saveAsTable("gm.result")

// Cleanup temporary tables
(0 until splits).foreach { i =>
  hiveCtx.sql(s"drop table ${splitTableName(i)}")
}
If col1 and col2 are so diverse and/or so large that storing the partial aggregations is itself causing disk space issues, then you have to consider one of the following:
A smaller number of splits will generally use less disk space.
Sorting on col1 will help (because of Parquet run-length encoding), but that would slow down execution.
Creating splits that are independent, e.g., finding the distinct values of col1 and partitioning those into groups.
If you are extremely short on disk space you'd have to implement multi-step re-aggregation. The simplest approach is to generate the splits one at a time and keep a running aggregate. The execution would be much slower but it will use a lot less disk space.
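A minimal sketch of that running-aggregate idea, reusing df, splits and splitTableName from above; the gm.running_counts and gm.running_counts_tmp table names are placeholders I am introducing only for illustration:
// Sketch only: process one split at a time and fold it into a running aggregate
// kept on disk. The running_counts table names are illustrative placeholders.
(0 until splits).foreach { i =>
  val partial = df.filter('col1 % splits === i)
    .groupBy('col1, 'col2).count

  if (i == 0) {
    partial.write.saveAsTable("gm.running_counts")
  } else {
    // Re-aggregate the running totals with the new partial counts into a
    // temporary table, then swap it in (you cannot overwrite a table you read from).
    hiveCtx.table("gm.running_counts")
      .union(partial)
      .groupBy('col1, 'col2)
      .agg(sum('count).as("count"))
      .write.saveAsTable("gm.running_counts_tmp")
    hiveCtx.sql("drop table gm.running_counts")
    hiveCtx.sql("alter table gm.running_counts_tmp rename to gm.running_counts")
  }
}

// One final pass for the global sort
hiveCtx.table("gm.running_counts")
  .orderBy('count.desc)
  .write.saveAsTable("gm.result")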
Hope this helps!

Related

Spark performance issues when iteratively adding multiple columns

I am seeing performance issues when iteratively adding columns (around 100) to a Dataframe.
I know that it is more efficient to use select to add multiple columns; however, I have to add the columns in order because column 2 may depend on column 1, and so on.
The columns are being added following a join which resulted in skew so I have explicitly repartitioned by a salt key to evenly distribute data on the cluster.
When I ran locally I was seeing OOM errors even for fairly small (100 row, 500 column) datasets.
I was able to get the job running locally by checkpointing after the addition of every x columns, so I suspect Spark lineage issues are causing my problems; however, I am still unable to run the job at scale on the cluster.
Any advice on where to look, or on best practice in this scenario, would be greatly appreciated.
At a high level my job looks like this:
val df1 = ??? // Millions of rows, ~500 cols, from parquet
val df2 = ??? // 1000 rows, from parquet
val newExpressions = ??? // 100 rows, from Oracle
val joined = df1.join(broadcast(df2), <join expr>)
val newColumns = newExpressions.collect().map(<get columnExpr and columnName>)
val salted = joined.withColumn("salt", rand()).repartition(x, col("salt"))
newColumns.foldLeft(salted) {
  case (df, row) => df.withColumn(row.name, expr(row.expression))
} // Checkpointing after every x columns seems to help
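For reference, a rough sketch of the checkpoint-every-x-columns workaround (this assumes a checkpoint directory has already been set via spark.sparkContext.setCheckpointDir, and the interval of 10 is just an illustrative value):
// Sketch: truncate the lineage every few added columns with an eager checkpoint.
val checkpointEvery = 10 // illustrative value, not from the original job

newColumns.zipWithIndex.foldLeft(salted) {
  case (df, (row, i)) =>
    val withCol = df.withColumn(row.name, expr(row.expression))
    if ((i + 1) % checkpointEvery == 0) withCol.checkpoint() else withCol
}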
Cheers
Terry

Spark union of dataframes does not give counts?

I am trying to union these dataframes. I used G_ID is not Null or MCOM.T_ID is not null and used trim, but the count does not come up; it has been running for an hour and there are only 3 tasks remaining out of 300 tasks. Please suggest how I can debug this. Is a null causing the issue, and how can I debug that?
val table1 = spark.sql(""" SELECT trim(C_ID) AS PC_ID FROM ab.CIDS WHERE
_UPDT_TM >= '2020-02-01 15:14:39.527' """)
val table2 = spark.sql(""" SELECT trim(C_ID) AS PC_ID FROM ab.MIDS MCOM INNER
JOIN ab.VD_MBR VDBR
ON Trim(MCOM.T_ID) = Trim(VDBR.T_ID) AND Trim(MCOM.G_ID) = Trim(VDBR.G_ID)
AND Trim(MCOM.C123M_CD) IN ('BBB', 'AAA') WHERE MCOM._UPDT_TM >= '2020-02-01 15:14:39.527'
AND Trim(VDBR.BB_CD) IN ('BBC') """)
var abc = table1.select("PC_ID").union(table2.select("PC_ID"))
I even tried this: val filtered = abc.filter(row => !row.anyNull)
It looks like you have a data skew problem. Looking at the "Summary Metrics" it's clear that (at least) three quarters of your partitions are empty, so you are eliminating most of the potential parallelization that spark can provide for you.
Though it will cause a shuffle step (where data gets moved over the network between different executors), a .repartition() will help to balance the data across all of the partitions and create more valid units of work to be spread among the available cores. This would most likely provide a speedup of your count().
As a rule of thumb, you'd likely want to call .repartition() with the parameter set to at least the number of cores in your cluster. Setting it higher will result in tasks getting completed more quickly (it's fun to watch the progress), though it adds some management overhead to the overall time the job takes to run. If the tasks are too small (i.e. not enough data per partition), then sometimes the scheduler gets confused and won't use the entire cluster either. On the whole, finding the right number of partitions is a balancing act.
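As an illustrative sketch only (the 200 here is a made-up partition count; pick something at least equal to your cluster's core count):
// Rebalance the unioned data across partitions before counting.
val balanced = abc.repartition(200) // 200 is an example value
balanced.count()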
You have aliased the column "C_ID" as "PC_ID", and after that you are looking for "C_ID".
Also, a union can only be performed on the same number of columns, and your table1 and table2 differ in column count;
otherwise you will get: org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns
Please take care of these two scenarios first.

Skewed Window Function & Hive Source Partitions?

The data I am reading via Spark is a highly skewed Hive table with the following stats.
(MIN, 25TH, MEDIAN, 75TH, MAX) via Spark UI (partition size / record count):
1506.0 B / 0, 232.4 KB / 27288, 247.3 KB / 29025, 371.0 KB / 42669, 269.0 MB / 27197137
I believe it is causing problems downstream in the job when I perform some Window Funcs, and Pivots.
I tried exploring this parameter to limit the partition size; however, nothing changed and the partitions are still skewed upon read:
spark.conf.set("spark.sql.files.maxPartitionBytes")
Also, when I cache this DF with the Hive table as source, it takes a few minutes and even causes some GC in the Spark UI, most likely because of the skew as well.
Does this spark.sql.files.maxPartitionBytes work on Hive tables or only files?
What is the best course of action for handling this skewed Hive source?
Would something like a stage barrier write to parquet or Salting be suitable for this problem?
I would like to avoid .repartition() on read, as it adds another layer to what is already a roller-coaster of a job.
Thank you
==================================================
After further research it appears the Window Function is causing skewed data too and this is where the Spark Job hangs.
I am performing some time series filling via a double Window function (a forward then a backward fill to impute all the null sensor readings) and am trying to follow this article's salting method to evenly distribute the data ... however the following code produces all null values, so the salt method is not working.
Not sure why I am getting skew after the Window, since each measure item I am partitioning by has roughly the same number of records (checked via .groupBy()) ... thus why would salt be needed?
+--------------------+-------+
| measure | count|
+--------------------+-------+
| v1 |5030265|
| v2 |5009780|
| v3 |5030526|
| v4 |5030504|
...
salt post => https://medium.com/appsflyer/salting-your-spark-to-scale-e6f1c87dd18
nSaltBins = 300 # based off number of "measure" values
df_fill = df_fill.withColumn("salt", (F.rand() * nSaltBins).cast("int"))
# FILLS [FORWARD + BACKWARD]
window = Window.partitionBy('measure')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)
# FORWARD FILLING IMPUTER
ffill_imputer = F.last(df_fill['new_value'], ignorenulls=True)\
    .over(window)
fill_measure_DF = df_fill.withColumn('value_impute_temp', ffill_imputer)\
    .drop("value", "new_value")
window = Window.partitionBy('measure')\
    .orderBy('measure', 'date')\
    .rowsBetween(0, Window.unboundedFollowing)
# BACKWARD FILLING IMPUTER
bfill_imputer = F.first(df_fill['value_impute_temp'], ignorenulls=True)\
    .over(window)
df_fill = df_fill.withColumn('value_impute_final', bfill_imputer)\
    .drop("value_impute_temp")
Salting might be helpful in the case where a single partition is big enough not to fit in memory on a single executor. This can happen even if all the keys are equally distributed (as in your case).
You have to include the salt column in your partitionBy clause which you are using to create the Window.
window = Window.partitionBy('measure', 'salt')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)
Then you have to create another window which will operate on the intermediate result:
window1 = Window.partitionBy('measure')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)
Hive-based solution:
You can enable skew join optimization using Hive configuration. The applicable settings are:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=500000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
See the Databricks tips for this:
Skew hints may work in this case.
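If you are on Databricks, the skew hint is expressed through the DataFrame hint API; as a sketch (this hint is Databricks-specific rather than open-source Spark, and both df and the "measure" column name here are placeholders):
// Databricks-only skew join hint; df and "measure" are placeholders.
val hinted = df.hint("skew", "measure")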

Caching Large Dataframes in Spark Effectively

I am currently working with 11,000 files. Each file will generate a dataframe which will be unioned with the previous one. Below is the code:
var df1 = sc.parallelize(Array(("temp", 100))).toDF("key", "value").withColumn("Filename", lit("Temp"))
files.foreach( filename => {
  val a = filename.getPath.toString()
  val m = a.split("/")
  val name = m(6)
  println("FILENAME: " + name)
  if (name == "_SUCCESS") {
    println("Cannot Process '_SUCCESS' Filename")
  } else {
    val freqs = doSomething(a).toDF("key", "value").withColumn("Filename", lit(name))
    df1 = df1.unionAll(freqs)
  }
})
First, I got a java.lang.StackOverflowError on the 11,000 files. Then I added the following line after df1 = df1.unionAll(freqs):
df1=df1.cache()
It resolves the problem, but after each iteration it gets slower. Can somebody please suggest what should be done to avoid the StackOverflowError without the slowdown?
Thanks!
The issue is that Spark manages a dataframe as a set of transformations. It begins with the "toDF" of the first dataframe, then performs the transformations on it (e.g. withColumn), then unionAll with the previous dataframe, etc.
The unionAll is just another such transformation, and the tree becomes very long (with 11K unionAlls you have an execution tree of depth 11K). Building up that information in the unionAll can get to a stack overflow situation.
The caching doesn't solve this. However, I imagine you are adding some action along the way (otherwise nothing would run besides building the transformations). When you perform caching, Spark might skip some of the steps and therefore the stack overflow would simply arrive later.
You can go back to RDDs for iterative processing (your example is actually not iterative but purely parallel; you can simply save each separate dataframe along the way, then convert them to RDDs and use RDD union).
Since your case seems to be unioning a bunch of dataframes without true iterations, you can also do the union in a tree manner (i.e. union pairs, then union pairs of pairs, etc.); this would change the depth from O(N) to O(log N), where N is the number of unions.
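A sketch of that pairwise union, assuming the per-file dataframes have first been collected into a sequence (dfs is a name introduced here, not from the original code):
import org.apache.spark.sql.DataFrame

// Union dataframes pairwise so the plan depth is O(log N) instead of O(N).
// Assumes dfs is a non-empty Seq[DataFrame] of the per-file dataframes.
def treeUnion(dfs: Seq[DataFrame]): DataFrame =
  if (dfs.size == 1) dfs.head
  else treeUnion(dfs.grouped(2).map {
    case Seq(a, b) => a.unionAll(b)
    case Seq(a)    => a
  }.toSeq)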
Lastly, you can write the dataframe to disk and read it back. The idea is that after every X (e.g. 20) unions, you would do df1.write.parquet(filex) and then df1 = spark.read.parquet(filex). When you read it back, the lineage of the single dataframe is just the file read itself. The cost, of course, is the writing and reading of the file.
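And a sketch of that periodic write/read variant, where flushEvery and the path are example values of my own choosing:
import org.apache.spark.sql.DataFrame

// Every flushEvery unions, persist df1 to parquet and read it back so its
// lineage collapses to a plain file scan. Interval and path are examples.
val flushEvery = 20
var unionsDone = 0

def maybeFlush(df: DataFrame): DataFrame = {
  unionsDone += 1
  if (unionsDone % flushEvery == 0) {
    val path = s"/tmp/df1_checkpoint_$unionsDone"
    df.write.parquet(path)
    spark.read.parquet(path)
  } else df
}

// In the original loop, replace df1 = df1.unionAll(freqs) with:
// df1 = maybeFlush(df1.unionAll(freqs))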

Removing duplicate lines from a large dataset

Let's assume that I have a very large dataset that cannot fit into memory. There are millions of records in the dataset, and I want to remove duplicate rows (while keeping one row from each set of duplicates).
What's the most efficient approach in terms of space and time complexity?
What I thought of:
1. Using a Bloom filter. I am not sure how it's implemented, but I guess the side effect is false positives; in that case, how can we find out whether a row is REALLY a duplicate or not?
2. Using hash values. In this case, if we have a small number of duplicate values, the number of unique hash values would be large and again we may have a problem with memory.
Your solution 2 (using hash values) doesn't force a memory problem. You just have to partition the hash space into slices that fit into memory. More precisely:
Consider a hash table storing the set of records, where each record is represented only by its index in the table. Say, for example, that such a hash table would be 4 GB. Then you split your hash space into k = 4 slices. Depending on the last two bits of the hash value, each record goes into one of the slices. So the algorithm goes roughly as follows:
let k = 2^M
for i from 0 to k-1:
    t = new table
    for each record r on the disk:
        h = hashvalue(r)
        if (the M last bits of h == i) {
            insert r into t with respect to hash value h >> M
        }
    search t for duplicates and remove them
    delete t from memory
The drawback is that you have to hash each record k times. The advantage is that it can trivially be distributed.
Here is a prototype in Python:
# Fake huge database on the disk
records = ["askdjlsd", "kalsjdld", "alkjdslad", "askdjlsd"] * 100

M = 2
# Mask selecting the M lowest bits of the hash. (The original 2**(M+1)-1 would
# select M+1 bits and never match buckets 2**M and above, missing records.)
mask = 2**M - 1

class HashLink(object):
    def __init__(self, idx):
        self._idx = idx
        self._hash = hash(records[idx])  # file access

    def __hash__(self):
        return self._hash >> M

    # hashlinks are equal if they link to equal objects
    def __eq__(self, other):
        return records[self._idx] == records[other._idx]  # file access

    def __repr__(self):
        return str(records[self._idx])

to_be_deleted = list()
for i in range(2**M):
    t = set()
    for idx, rec in enumerate(records):
        h = hash(rec)
        if (h & mask) == i:
            if HashLink(idx) in t:
                to_be_deleted.append(idx)
            else:
                t.add(HashLink(idx))
The result is:
>>> [records[idx] for idx in range(len(records)) if idx not in to_be_deleted]
['askdjlsd', 'kalsjdld', 'alkjdslad']
Since you need to delete duplicate items without sorting or indexing, you may end up scanning the entire dataset for every delete, which is unbearably costly in terms of performance. Given that, you may think of some external sorting for this, or a database.
If you don't care about the ordering of the output dataset: create n files, each storing a subset of the input dataset according to a hash of the record or of the record's key. Take the hash modulo n to get the right output file to store the record in. Since the size of every output file is now small, your delete operation will be very fast; for the output file you could use a normal file, or sqlite / Berkeley DB (I would recommend sqlite/bdb, though). To avoid a scan for every write to an output file, you can put a front-end Bloom filter in front of each output file; Bloom filters aren't that difficult, and lots of libraries are available. Calculating n depends on your main memory; go with a pessimistic, huge value for n. Once the work is done, concatenate all the output files into a single one.
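A simplified sketch of that n-file approach (plain text files with one record per line, and an in-memory set per bucket instead of the Bloom filter / sqlite front end suggested above; the file names and the value of n are made up for illustration):
import java.io.{File, PrintWriter}
import scala.io.Source

// Phase 1: route each record to one of n bucket files by hash of the record.
val n = 64 // pick pessimistically based on available memory
val writers = Array.tabulate(n)(i => new PrintWriter(new File(s"bucket_$i.txt")))
for (line <- Source.fromFile("input.txt").getLines())
  writers(Math.floorMod(line.hashCode, n)).println(line)
writers.foreach(_.close())

// Phase 2: each bucket now fits in memory, so dedupe it with a set and append
// the unique records to the final output.
val out = new PrintWriter(new File("deduped.txt"))
for (i <- 0 until n) {
  val seen = scala.collection.mutable.HashSet[String]()
  for (line <- Source.fromFile(s"bucket_$i.txt").getLines())
    if (seen.add(line)) out.println(line)
}
out.close()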