shuffle last task taking too much time to complete - scala

I have around 80 GB of data. Everything runs smoothly until the last shuffle task comes up: all the other tasks finish within 30 minutes, but the last task takes more than 2 hours to complete.
Joins: (left join)
I am joining 3 tables; one of them is relatively small (about 2 MB of data), so I raised the broadcast threshold for it. Even when I removed that 3rd table, it did not resolve my issue.
Below are the parameters I configured:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "904857600")
spark.conf.set("spark.cleaner.referenceTracking.blocking", "false")
spark.conf.set("spark.cleaner.periodicGC.interval", "5min")
spark.conf.set("spark.default.parallelism","6000")
spark.conf.set("spark.sql.shuffle.partitions","2000")
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

You are suffering from data skew. Essentially, most of the work is being done by one node instead of being distributed across multiple nodes.
This is why I wanted clarification of job or [task/stage].
You should consider adding a salt to your join key to help distribute the work across multiple nodes. It will require more shuffles, but it will lessen the impact of one node doing all the work.
Add a salt column to each table in the join.
Do your 3-way join with the salt column included.
Then do a secondary group by to remove the salt from the query.
This will better distribute the work; a rough sketch follows.
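A minimal sketch of that salting pattern, where the dataframe names, join key, and aggregate column are placeholders rather than anything taken from the question:

import org.apache.spark.sql.functions._

val saltBuckets = 16  // tune to the degree of skew
// Random salt on the large, skewed side of the join.
val saltedLeft = leftDF.withColumn("salt", (rand() * saltBuckets).cast("int"))
// Replicate the other side once per salt value so every (key, salt) pair can match.
val saltedRight = rightDF.withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))
// Join on the original key plus the salt, then aggregate again to drop the salt.
val joined = saltedLeft.join(saltedRight, Seq("join_key", "salt"), "left")
val result = joined.groupBy("join_key").agg(sum("value").as("value"))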

Related

Pyspark Dataframe count taking too long

So we have a PySpark dataframe which has around 25k records. We are trying to perform a count/empty check on it, and it is taking too long. We tried:
df.count()
df.rdd.isEmpty()
len(df.head(1))==0
Converted to Pandas and tried pandas_df.empty()
Tried the arrow option
df.cache() and df.persist() before the counts
df.repartition(n)
Tried writing the df to DBFS, but writing is also taking quite a long time (cancelled after 20 mins)
Could you please help us understand what we are doing wrong?
Note: there are no duplicate values in the df, and we have done multiple joins to form it.
Without looking at df.explain() it's challenging to know the specific issue, but it certainly seems like you could have a skewed data set.
(Skew usually shows up in the Spark UI as one task taking a lot longer than the others to finish.) If you are on a recent version of Spark, there are tools to help with this out of the box:
spark.sql.adaptive.enabled = true
spark.sql.adaptive.skewJoin.enabled = true
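If it helps, these can be set from code in the same way as the spark.conf.set calls shown in the first question (Spark 3.x):

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")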
Count is not taking too long; it's taking the time it needs to complete what you asked Spark to do. To refine what it's doing, you should do things you are likely already doing: filter the data before joining so only the critical data is transferred into the joins, and review your data for skew and program around it if you can't use adaptive query execution.
Convince yourself this is a data issue. Limit your source [data/tables] to 1000 or 10000 records and see if it runs fast. Then, one at a time, remove the limit from only one [table/data source] (keeping the limit on all the others) and find the table that is the source of your problem. Then study that [table/data source] and figure out how you can work around the issue (if you can't use adaptive query execution to fix it).
(Finally, if you are using Hive tables, you should make sure the table stats are up to date.)
ANALYZE TABLE mytable COMPUTE STATISTICS;
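The same can also be run from Spark, and column-level statistics on the join keys (the column name here is just a placeholder) give the optimizer even more to work with:

spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS join_key")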

Performance Improvement in scala dataframe operations

I am using a table that is partitioned by the load_date column and is optimized weekly with the delta OPTIMIZE command as the source dataset for my use case.
The table schema is as shown below:
+-----------------+--------------------+------------+---------+--------+---------------+
| ID| readout_id|readout_date|load_date|item_txt| item_value_txt|
+-----------------+--------------------+------------+---------+--------+---------------+
Later this table is pivoted on the columns item_txt and item_value_txt, and many operations are applied using multiple window functions, as shown below:
val windowSpec = Window.partitionBy("id","readout_date")
val windowSpec1 = Window.partitionBy("id","readout_date").orderBy(col("readout_id") desc)
val windowSpec2 = Window.partitionBy("id").orderBy("readout_date")
val windowSpec3 = Window.partitionBy("id").orderBy("readout_date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val windowSpec4 = Window.partitionBy("id").orderBy("readout_date").rowsBetween(Window.unboundedPreceding, Window.currentRow-1)
These window functions are used to implement several pieces of logic on the data. There are also a few joins used to process the data.
The final table is partitioned by readout_date and id, and I can see the performance is very poor, as it takes a long time even for 100 ids and 100 readout_dates.
If I do not partition the final table, I get the error below.
Job aborted due to stage failure: Total size of serialized results of 129 tasks (4.0 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.
The expected count of ids in production is in the billions, and I expect far worse throttling and performance issues when processing the complete data.
The cluster configuration and utilization metrics are provided below.
Please let me know if anything is wrong with the way I am repartitioning, and about any methods to improve cluster utilization and performance...
Any leads appreciated!
spark.driver.maxResultSize is just a setting; you can increase it. BUT it's set at 4 GiB to warn you that you are doing something bad and you should optimize your work. You are doing the correct thing by asking for help to optimize.
The first thing I suggest, if you care about performance, is to get rid of the windows. The first three windows you use could be achieved with a groupBy, and this will perform better. The last two windows are definitely harder to reframe as a group by, but with some reframing of the problem you might be able to do it. The trick could be to use multiple queries instead of one. You might think that would perform worse, but I'm here to tell you that if you can avoid using a window, you will get better performance almost every time. Windows aren't bad things; they are a tool to be used, but they do not perform well on unbounded data. (Can you do anything as an intermediate step to reduce the data the window needs to examine?) Or can you use aggregate functions to complete the work without having to use a window? You should explore your options.
Given your other answers, you should be grouping by id, not windowing by id, and likely using aggregates (sum) by week of year/month. This would likely give you really speedy performance at the cost of some granularity, and it would give you enough insight to decide whether to look into something deeper... or not.
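A hedged sketch of what that group-by reframing could look like; the dataframe, the value column, and the chosen aggregates are assumptions rather than anything from the question:

import org.apache.spark.sql.functions._

// Per id and readout_date, aggregate directly instead of windowing.
val perReadout = pivotedDF
  .groupBy("id", "readout_date")
  .agg(max("readout_id").as("latest_readout_id"),
       sum("item_value").as("item_value_sum"))

// Coarser time buckets (week of year) trade granularity for speed.
val weekly = pivotedDF
  .groupBy(col("id"), weekofyear(col("readout_date")).as("week"))
  .agg(sum("item_value").as("weekly_value"))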
If you want more accuracy, I'd suggest:
Converting your nulls to 0s.
val windowSpec1 = Window.partitionBy("id").orderBy(col("readout_date") asc) // asc is important as it flips the relationship so that it groups the previous nulls
Then create a running total on the SIG_XX VAL or whatever signal you want to look into. Call the new column 'null-partitions'.
This will effectively allow you to group the numbers (by null-partitions), and you can then run aggregate functions using group by to complete your calculations. Window and group by can do the same thing; windows are just more expensive in how they move data, slowing things down. Group by uses more of the cluster to do the work and speeds up the process.
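A hedged sketch of that running-total trick; the dataframe name and the signal column SIG_XX_VAL are placeholders:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Ascending order, as noted above; null readings (now 0s) keep the running total
// of the last non-null value, so they fall into its group.
val w = Window.partitionBy("id").orderBy(col("readout_date").asc)
val withGroups = pivotedDF
  .withColumn("sig_val", coalesce(col("SIG_XX_VAL"), lit(0)))       // nulls converted to 0s
  .withColumn("null_partitions", sum(col("sig_val")).over(w))       // running total

// A plain group by can now finish the calculation without further windowing.
val aggregated = withGroups
  .groupBy("id", "null_partitions")
  .agg(count(lit(1)).as("rows_in_group"), max("sig_val").as("sig_value"))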

Spark dataframe seems to be recomputed twice

SOLVED: I solved the issue; it was due to a very stupid, silly, idiotic mistake in one of the first steps of the flow.
Basically, I was computing a dataframe that was written to a Hive table; this dataframe then needed to be used to create the temporaryDF after many steps, but I was originally querying the table from scratch instead of using a copy of the dataframe to be written into the table. The mistake lies in the fact that the just-computed dataframe was missing previous partitions (due to the specific logic of the flow), whereas the next computations to create temporaryDF also needed at least two previous partitions. I don't know why, and I can't remember when, but I decided to cache the just-computed one, thus losing information and getting an empty one under Oozie (in Spark-Shell I was always using at least three partitions, since I was manually updating the table after some time and a new partition arrived every 15 minutes). I was probably in a late-night working sprint and my brain decided it was worth messing things up.
I upvoted and accepted @thebluephantom's answer because he is very right within the specific circumstance I was describing.
Original:
I'm seeing strange behaviour when using spark-submit with Spark v2.2.0.2.6.4.105-1 (using Scala) on Hadoop 2 under an Oozie workflow, compared with using Spark-Shell.
I have a Hive table that contains records that keep track of some processes every 15 minutes. The table is overwritten every time with new records or 'old' records that still satisfy the logic of the processes of interest.
I keep track of 'the age' of the records through a column that I will here call times_investigated, which ranges from 1 to 9.
I create a temporary dataframe, let's call it temporaryDF, that contains both the old and the new entries (both types need to be present to run useful computations). This temporaryDF is then split between the new entries and the old ones, based on $"times_investigated" === 1 and $"times_investigated" > 1 (or =!= 1).
Then, the processed entries are merged with a union in a final dataframe that is then written into the original Hive table.
// Before, I run the query on the 'old' Hive table and the logic over old and new entries.
// I now have a temporary dataframe
val temporaryDF = previousOtherDF
.withColumn("original_col_new", conditions)
.withColumn("original_other_col_new", otherConditions)
.withColumn("times_investigated_new", nvl($"times_investigated" + 1, 1))
.select(
previousColumns,
$"original_col_new".as("original_col"),
$"original_other_col_new".as("original_other_col"),
$"times_investigated_new".as("times_investigated"))
.cache
// Now I need to split the temporaryDF in 2 to run some other logic on the new entries.
val newEntriesDF = temporaryDF
.filter($"times_investigated" === 1)
.join(neededDF, conditions, "leftouter")
.join(otherNeededDF, conditions, "leftouter")
.groupBy(cols)
.agg(min(colOne),
max(colTwo),
min(colThree),
max(colFour))
.withColumn("original_col_five_new",
when(conditions).otherwise(somethingElse))
.withColumn("original_col_six_new",
when(conditions).otherwise(somethingElse))
.select(orderedColumns)
val oldEntriesDF = temporaryDF.filter($"times_investigated" > 1)
val finalTableDF = oldEntriesDF.union(newEntriesDF)
// Now I write the table
finalTableDF.createOrReplaceTempView(tempFinalTableDF)
sql("""INSERT OVERWRITE TABLE $finalTableDF
SELECT * FROM tempFinalTableDF """)
// I would then need to re-use the newly-computed table to process further information...
The Problem:
The Hive table does not contain the new entries with times_investigated = 1; the flow just processes the old ones, so after the 9 times an entry can stay inside the table, the table ends up completely empty.
I ran some tests within Spark-Shell and everything worked perfectly for many iterations; even manually writing the Hive table from the shell produced the expected results. But when I launched the workflow under Oozie, the strange behavior appeared again.
What I noticed within Spark-Shell is that, after writing the Hive table, if I went to compute a temporaryDF.show(), the new entries would be updated to $"times_investigated" = 2!
I tried to create a copy of temporaryDF to work on separate dataframes with the new and the old entries, but this copyOfTemporaryDF also gets updated after writing the Hive table.
It seems that this re-computation is happening before writing the Hive table under Oozie.
I know that I can compute the operations in a different manner, but I need to find a quick temporary fix on the current flow if possible.
Above all, I would love to understand what is happening under the hood, in order to avoid getting myself in such a circumstance later on.
Do you guys have any clue and/or advice?
I tried caching the intermediate dataframes, but without success.
P.S. Sorry for the probably bad coding practices
EDIT. More context: the temporaryDF comes from other intermediate dataframes, used just once to compute this one of interest. The last steps that create temporaryDF are withColumn operations, where $"times_investigated" is updated with a custom nvl function (which works exactly like the SQL one) and never gave problems in older versions of the flow (see the code above for these steps).
Edit 2: I also tried to merge the operations on new and old entries into one long chained series, so that temporaryDF is actually the final dataframe to be written into the Hive table, but the new entries with times_investigated = 1 are still not considered (yet I have no issues via Spark-Shell, and .show-ing the dataframe after writing to the table makes it re-compute, so the times_investigated are +1).
Use .cache, otherwise you will get re-computation. You should do this for the appropriate dataframe or RDD if the RDD or DF is to be used multiple times in a single Spark App. It is not even Action dependent; sometimes you get "skipped stages".
val temporaryDF = previousOperations...cache()
Two vals use temporaryDF, and without caching the recomputations will happen as you are seeing, and they may well give different results. That dataframe should be cached.
Of course, if a Worker dies or a partition is evicted, some recomputing is needed.
.cache may not be ideal for datasets larger than the available cluster memory. Each partition that is evicted will be rebuilt from source, and that is a costly affair.
Also, using suitable partitioning and iterating a few times may be better than persisting/caching; but it all depends.
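For completeness, a small sketch of the caching advice applied to the question's temporaryDF; the explicit storage level is a suggestion, not part of the answer above:

import org.apache.spark.storage.StorageLevel

// Persist once, reuse for both branches, release after the table has been written.
temporaryDF.persist(StorageLevel.MEMORY_AND_DISK)
// ... build newEntriesDF and oldEntriesDF, write finalTableDF ...
temporaryDF.unpersist()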

Spark reduceByKey works poorly

I'm writing a Spark program in Scala. It's used to count the numbers of keys. Here is an example of the data:
Name Fruit Place
A apple China
A apple China
A apple U.S
A banana U.K
B apple Japan
B orange Chile
C apple French
It's a dataframe with many columns, but I only care about the above three, so there may be some repeated records. I would like to count, for example, the number of production places of the fruit eaten by A.
import scala.collection.mutable.{ArrayBuffer, Map => MutableMap}
val res = data.select("name", "fruit", "place").rdd
  .map(v => ((v.getString(0), v.getString(1)), ArrayBuffer(v.getString(2))))
  .reduceByKey((a, b) => a ++= b)
  .map(v => (v._1._1, MutableMap(v._1._2 -> v._2.toSet.size)))
  .reduceByKey((a, b) => a ++= b)
I first select the columns I need and then use ("name", "fruit") as the key to collect the production places in one ArrayBuffer for each kind of fruit eaten by each person. Then I use "name" as the key to collect the number of production places for each fruit in a map like {"apple": 2}. So the result is informally like RDD[("name",Map("fruit"->"places count"))].
In the program I did this kind of work about 3 times to calculate information similar to the above example, for example, to compute the number of different fruits from one production place eaten by each person.
The size of the data is about 80 GB and I run the job on 50 executors. Each executor has 4 cores and 24 GB of memory. Moreover, the data is repartitioned into 200 partitions, so I expected this job to finish in a very short period of time. However, it took more than one day to run and failed with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 10 and java.lang.OutOfMemoryError: GC overhead limit exceeded.
I did a lot of things to optimize this program, like resetting spark.mesos.executor.memoryOverhead and using mutable maps to minimize the GC cost of frequently creating and cleaning objects. I even tried to use reduceByKey to move the data with the same key into one partition to boost performance, but it helped little. The code is like:
val new_data = data.rdd
  .map(v => (v.getAs[String]("name"), ArrayBuffer((v.getAs[String]("fruit"), v.getAs[String]("place")))))
  .reduceByKey((a, b) => a ++= b)
  .cache()
Then I don't need to shuffle the data each time I do similar calculations, and the later work can be done on the basis of new_data. However, this optimization doesn't seem to work.
Finally, I found that about 50% of the data has the same value in the field "name", say "H". I removed the data with name "H" and the job finished in 1 hour.
Here are my questions:
Why does the distribution of keys have such a great impact on the performance of reduceByKey? I use the word "distribution" to mean the number of occurrences of the different keys. In my case, the size of the data is not big, but one key dominates it, so the performance is greatly affected. I assume this is a problem with reduceByKey; am I wrong?
If I have to keep the records with name "H", how can I avoid the performance issue?
Is it possible to use reduceByKey to repartition the data and put the records with the same key ("name") into one partition?
Does it really help to move the records with the same key ("name") into one partition to improve performance? I know it may cause memory issues, but I have to run similar code in the program several times, so I guess it may help the later work. Am I right?
Thanks for the help!
What you can do to avoid the big shuffle is to first build a dataframe mapping fruit to places.
val fruitToPlaces = data.groupBy("fruit").agg(collect_set("place").as("places"))
This dataframe should be small (i.e. it fits in memory).
You do fruitToPlaces.cache.count to make sure it's OK.
Then you do a join on fruit.
data.join(fruitToPlaces, Seq("fruit"), "left_outer")
Spark should be smart enough to do a broadcast hash join (and not a shuffle join).
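If you want to make that explicit rather than rely on the optimizer, a small sketch with a broadcast hint (this only pays off while fruitToPlaces stays small):

import org.apache.spark.sql.functions.broadcast

val joined = data.join(broadcast(fruitToPlaces), Seq("fruit"), "left_outer")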

Cassandra: Load large data fast

We're currently working with Cassandra on a single-node cluster to test application development on it. Right now, we have a really huge data set consisting of approximately 70M lines of text that we would like to dump into Cassandra.
We have tried all of the following:
Line-by-line insertion using the Python Cassandra driver
Cassandra's COPY command
Setting sstable compression to none
We have explored the option of the sstable bulk loader, but we don't have an appropriate .db format for this. Our text file to be loaded has 70M lines that look like:
2f8e4787-eb9c-49e0-9a2d-23fa40c177a4 the magnet programs succeeded in attracting applicants and by the mid-1990s only #about a #third of students who #applied were accepted.
The column family that we're intending to insert into has this creation syntax:
CREATE TABLE post (
postid uuid,
posttext text,
PRIMARY KEY (postid)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={};
Problem:
Loading the data into even a simple column family is taking forever: 5 hours for the 30M lines inserted so far. We were wondering if there is any way to expedite this, as loading 70M lines of the same data into MySQL takes approximately 6 minutes on our server.
Have we missed something? Or could someone point us in the right direction?
Many thanks in advance!
The sstableloader is the fastest way to import data into Cassandra. You have to write the code to generate the sstables, but if you really care about speed this will give you the most bang for your buck.
This article is a bit old, but the basics still apply to how you generate the SSTables.
If you really don't want to use the sstableloader, you should be able to go faster by doing the inserts in parallel. A single node can handle multiple connections at once, and you can scale out your Cassandra cluster for increased throughput.
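For the parallel-insert route, a rough Scala sketch against the DataStax Java driver (2.x/3.x-style API); the keyspace, file name, and batch size are assumptions, not details from the question:

import com.datastax.driver.core.Cluster
import java.util.UUID

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("mykeyspace")
val insert  = session.prepare("INSERT INTO post (postid, posttext) VALUES (?, ?)")

// Fire asynchronous inserts in bounded groups so only a limited number are in flight at once.
scala.io.Source.fromFile("posts.txt").getLines().grouped(256).foreach { batch =>
  val futures = batch.map { line =>
    val (id, text) = line.splitAt(36)   // the uuid is the first 36 characters of each line
    session.executeAsync(insert.bind(UUID.fromString(id), text.trim))
  }
  futures.foreach(_.getUninterruptibly())   // wait for this group before sending the next
}
session.close()
cluster.close()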
I have a two-node Cassandra 2.? cluster (each node is an i7-4200MQ laptop, 1 TB HDD, 16 GB RAM). I have imported almost 5 billion rows using the COPY command. Each CSV file is about 63 GB with approx 275 million rows, and it takes about 8-10 hours to complete the import per file.
That is approx 6,500 rows per second.
The YAML file is set to use 10 GB of RAM, in case that helps.