Spark Dataframe Join shuffle

Spark Dataframe Join shuffle - scala

Spark version 1.6.0
I'm using the join function between two dataframes which have 100 partitions, the app is running on a cluster where I'm using 5 cores for each 20 executor with total 100 cores.
My problem is that when I do the join, all the records are computed on one executor, while the other executors are not used as picture below:
This cause a decrease in performance because all data is computed with one executor against the other 19 executors available.
It looks like spark join "bring" all the record in only one partitions, is there a way to avoid this?
To be sure that it doesn't repartion to 1 I also set this spark property: spark.sql.shuffle.partitions=100 indeed the two input dataframe have 100 partitions same as the output dataframe

Short answer:
This is because of your data, not because of spark.
Long answer:
In order to perform join operation spark need to move data with same keys (values of columns that you're joining on) to the same workers. E.g. if you join column A with column B the rows that contains same values in both tables will be moved to same workers and then joined.
In addition - rows with different keys also might be moved to same node - this depends on Partitioner that you have. You can read more here - but the general idea that there are to default partitioners - HashPartitioner and RangePartitioner. Despite which one is used - it decides on which worker row goes. As an example - if you have RangePartitioner with ranges [0, 5)[5. 7)[7, 10] then keys 1, 2, 3, 4 will all go to same worker. And if you have only these keys in your data - only one worker will be utilized.

Related

How spark shuffle partitions and partition by tag along with each other

I am reading a set of 10,000 parquet files of 10 TB cumulative size from HDFS and writing it back to HDFS in partitioned manner using following code
spark.read.orc("HDFS_LOC").repartition(col("x")).write.partitionBy("x").orc("HDFS_LOC_1")
I am using
spark.sql.shuffle.partitions=8000
I see that spark had written 5000 different partitions of "x" to HDFS(HDFS_LOC_1) . How is shuffle partitions of "8000" is being used in this entire process. I see that there are only 15,000 files got written across all partitions of "x". Does it mean that spark tried to create 8000 files at every partition of "X" and found during write time that there were not enough data to write 8000 files at each partition and ended up writing fewer files ? Can you please help me understand this?

The setting spark.sql.shuffle.partitions=8000 will set the default shuffling partition number of your Spark programs. If you try to execute a join or aggregations just after setting this option, you will see this number taking effect (you can confirm that with df.rdd.getNumPartitions()). Please refer here for more information.
In your case though, you are using this setting with repartition(col("x") and partitionBy("x"). Therefore your program will not be affected by this setting without using a join or an aggregation transformation first. The difference between repartition and partitionBy is that, the first will partition the data in memory, creating cardinality("x") number of partitions, when the second one will write approximately the same number of partitions to HDFS. Why approximately? Well because there are more factors that determine the exact number of output files. Please check the following resources to get a better understanding over this topic:
Difference between df.repartition and DataFrameWriter partitionBy?
pyspark: Efficiently have partitionBy write to same number of total partitions as original table
So the first thing to consider when using repartitioning by column repartition(*cols) or partitionBy(*cols), is the number of unique values (cardinality) that the column (or the combination of columns) has.
That being said, if you want to ensure that you will create 8000 partitions i.e output files, use repartition(partitionsNum, col("x")) where partitionsNum == 8000 in your case then call write.orc("HDFS_LOC_1"). Otherwise, if you want to keep the number of partitions close to the cardinality of x, just call partitionBy("x") to your original df and then write.orc("HDFS_LOC_1") for storing the data to HDFS. This will create cardinality(x) folders with your partitioned data.

Performance tuning in spark

I am running a spark job which processes about 2 TB of data. The processing involves:
Read data (avrò files)
Explode on a column which is a map type
OrderBy key from the exploded column
Filter the DataFrame (I have a very small(7) set of keys (call it keyset) that I want to filter the df for). I do a df.filter(col("key").isin(keyset: _*) )
I write this df to a parquet (this dataframe is very small)
Then I filter the original dataframe again for all the key which are not in the keyset
df.filter(!col("key").isin(keyset: _*) ) and write this to a parquet. This is the larger dataset.
The original avro data is about 2TB. The processing takes about 1 hr. I would like to optimize it. I am caching the dataframe after step 3, using shuffle partition size of 6000. min executors = 1000, max = 2000, executor memory = 20 G, executor core = 2. Any other suggestions for optimization ? Would a left join be better performant than filter ?

All look right to me.
If you have small dataset then isin is okay.
1) Ensure that you can increase the number of cores. executor core=5
More than 5 cores not recommended for each executor. This is based on a study where any application with more than 5 concurrent threads would start hampering the performance.
2) Ensure that you have good/uniform partition strucutre.
Example (only for debug purpose not for production):
import org.apache.spark.sql.functions.spark_partition_id
yourcacheddataframe.groupBy(spark_partition_id).count.show()
This is will print spark partition number and how many records
exists in each partition. based on that you can repartition, if you wanot more parllelism.
3) spark.dynamicAllocation.enabled could be another option.
For Example :
spark-submit --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=100 --conf spark.shuffle.service.enabled=true
along with all other required props ..... thats for that job. If you give these props in spark-default.conf it would be applied for all jobs.
With all these aforementioned options your processing time might lower.

On top of what has been mentioned, a few suggestions depending on your requirements and cluster:
If the job can run at 20g executor memory and 5 cores, you may be able to fit more workers by decreasing the executor memory and keeping 5 cores
Is the orderBy actually required? Spark ensures that rows are ordered within partitions, but not between partitions which usually isn't terribly useful.
Are the files required to be in specific locations? If not, adding a
df.withColumn("in_keyset", when( col('key').isin(keyset), lit(1)).otherwise(lit(0)). \
write.partitionBy("in_keyset").parquet(...)
may speed up the operation to prevent the data from being read in + exploded 2x. The partitionBy ensures that the items in the keyset are in a different directory than the other keys.

spark.dynamicAllocation.enabled is enabled
partition sizes are quite uneven (based on the size of output parquet part files) since I am doing an orderBy key and some keys are more frequent than others.
keyset is a really small set (7 elements)

Is it possible to reorder tasks in Spark stage

My question is about an order of tasks in a Stage in Spark.
Context:
I have a Spark dataframe divided into 3000 partitions. Partitioning is done on one a specific Key. I use mapPartitionsWithIndex to get an id of a partition and number of elements it contains. For example:
df.rdd
.mapPartitionsWithIndex((i,rows) => Iterator((i,rows.size)))
.toDF("id", "numElements")
When Spark runs its calculation on my dataframe, I see in Spark UI (I also did some tests to make sure it is the case) that the task index corresponds to partition id, exactly the same as id obtained with mapPartitionsWithIndex above. So, tasks are executed in order of increasing id of partition on given executor.
I see a clear correlation between number of rows in a partition and the execution time of a task. Due to a skewed nature of my dataset which can't be changed, I have several partitions with much higher number of elements (>8000) than the average (~3000). The time of execution of an average partition is 10-20 min, the larger ones can go above 3 hours. Some of my largest partitions have high id and therefore the corresponding tasks are executed almost at the end of a stage. As a consequence, one of Spark Stages hangs for 3 hours on last 5 tasks.
Question:
Is there a way to reorder id of partitions so that the tasks from largest partitions are executed first? Or equivalently, is there a way to change order of execution of tasks?
Note:
I don't need to move partitions to other nodes or executors, just change order of execution.
I can't change the key of partitioning
I can change number of partitions but the problem will stay
My setup: Spark 2.2 with Mesos running with spark-submit. The job is run on 60 CPUs with 12 executors each with 5 CPUs.

No, there is not. If so, it would be in the docs by now.
You can not control the ordering (/prioritization) of Tasks - since
the Spark Task Scheduler does not have an interface to define such
order/prioritization.
Spark works differently to say Informatica. A Stage - thus all tasks - must complete entirely before next Stage can commence for a given Action.
8000 seems to take a long time.

Spark dataframe Join issue

Below code snippet works fine. (Read CSV, Read Parquet and join each other)
//Reading csv file -- getting three columns: Number of records: 1
df1=spark.read.format("csv").load(filePath)
df2=spark.read.parquet(inputFilePath)
//Join with Another table : Number of records: 30 Million, total
columns: 15
df2.join(broadcast(df1), col("df2col1") === col("df1col1") "right")
Its weired that below code snippet doesnt work. (Read Hbase, Read Parquet and join each other)(Difference is reading from Hbase)
//Reading from Hbase (It read from hbase properly -- getting three columns: Number of records: 1
df1=read from Hbase code
// It read from Hbase properly and able to show one record.
df1.show
df2=spark.read.parquet(inputFilePath)
//Join with Another table : Number of records: 50 Million, total
columns: 15
df2.join(broadcast(df1), col("df2col1") === col("df1col1") "right")
Error: Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 56 tasks (1024.4 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
Then I have added spark.driver.maxResultSize=5g, then another error started occuring, Java Heap space error (run at ThreadPoolExecutor.java). If I observe memory usage in Manager I see that usage just keeps going up until it reaches ~ 50GB, at which point the OOM error occurs. So for whatever reason the amount of RAM being used to perform this operation is ~10x greater than the size of the RDD I'm trying to use.
If I persist df1 in memory&disk and do a count(). Program works fine. Code snippet is below
//Reading from Hbase -- getting three columns: Number of records: 1
df1=read from Hbase code
**df1.persist(StorageLevel.MEMORY_AND_DISK)
val cnt = df1.count()**
df2=spark.read.parquet(inputFilePath)
//Join with Another table : Number of records: 50 Million, total
columns: 15
df2.join(broadcast(df1), col("df2col1") === col("df1col1") "right")
It works with file even it has the same data but not with Hbase. Running this on 100 worknode cluster with 125 GB of memory on each. So memory is not the problem.
My question here is both the file and Hbase has same data and both read and able to show() the data. But why only Hbase is failing. I am struggling to understand what might be going wrong with this code. Any suggestions will be appreciated.

When the data is being extracted spark is unaware of number of rows which are retrieved from HBase, hence the strategy is opted would be sort merge join.
thus it tries to sort and shuffle the data across the executors.
to avoid the problem, we can use broadcast join at the same time we don't wont to sort and shuffle the data across the from df2 using the key column, which shows the last statement in your code snippet.
however to bypass this (since it is only one row) we can use Case expression for the columns to be padded.
example:
df.withColumn(
"newCol"
,when(col("df2col1").eq(lit(hbaseKey))
,lit(hbaseValueCol1))
.otherwise(lit(null))

I'm sometimes struggling with this error too. Often this occurs when spark tries to broadcast a large table during a join (that happens when spark's optimizer underestimates the size of the table, or the statistics are not correct). As there is no hint to force sort-merge join (How to hint for sort merge join or shuffled hash join (and skip broadcast hash join)?), the only option is to disable broadcast joins by setting spark.sql.autoBroadcastJoinThreshold= -1

When I have problem with memory during a join it usually means one of two reasons:
You have too few partitions in dataframes (partitions are too big)
There are many duplicates in the two dataframes on the key on which you join, and the join explodes your memory.
Ad 1. I think you should look at number of partitions you have in each table before join. When Spark reads a file it does not necessarily keep the same number of partitions as was the original table (parquet, csv or other). Reading from csv vs reading from HBase might create different number of partitions and that is why you see differences in performance. Too large partitions become even larger after join and this creates memory problem. Have a look at the Peak Execution Memory per task in Spark UI. This will give you some idea about your memory usage per task. I found it best to keep it below 1 Gb.
Solution: Repartition your tables before the join.
Ad. 2 Maybe not the case here but worth checking.

Slow count of >1 billion rows from Cassandra via Apache Spark [duplicate]

I have setup Spark 2.0 and Cassandra 3.0 on a local machine (8 cores, 16gb ram) for testing purposes and edited spark-defaults.conf as follows:
spark.python.worker.memory 1g
spark.executor.cores 4
spark.executor.instances 4
spark.sql.shuffle.partitions 4
Next I imported 1.5 million rows in Cassandra:
test(
tid int,
cid int,
pid int,
ev list<double>,
primary key (tid)
)
test.ev is a list containing numeric values i.e. [2240,2081,159,304,1189,1125,1779,693,2187,1738,546,496,382,1761,680]
Now in the code, to test the whole thing I just created a SparkSession, connected to Cassandra and make a simple select count:
cassandra = spark.read.format("org.apache.spark.sql.cassandra")
df = cassandra.load(keyspace="testks",table="test")
df.select().count()
At this point, Spark outputs the count and takes about 28 seconds to finish the Job, distributed in 13 Tasks (in Spark UI, the total Input for the Tasks is 331.6MB)
Questions:
Is that the expected performance? If not, what am I missing?
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job in. If I am setting the spark.sql.shuffle.partitions to 4, why is creating 13 Tasks? (Also made sure the number of partitions calling rdd.getNumPartitions() on my DataFrame)
Update
A common operation I would like to test over this data:
Query a large data set, say, from 100,000 ~ N rows grouped by pid
Select ev, a list<double>
Perform an average on each member, assuming by now each list has the same length i.e df.groupBy('pid').agg(avg(df['ev'][1]))
As #zero323 suggested, I deployed a external machine (2Gb RAM, 4 cores, SSD) with Cassandra just for this test, and loaded the same data set. The result of the df.select().count() was an expected greater latency and overall poorer performance in comparison with my previous test (took about 70 seconds to finish the Job).
Edit: I misunderstood his suggestion. #zero323 meant to let Cassandra perform the count instead of using Spark SQL, as explained in here
Also I wanted to point out that I am aware of the inherent anti-pattern of setting a list<double> instead a wide row for this type of data, but my concerns at this moment are more the time spent on retrieval of a large dataset rather than the actual average computation time.

Is that the expected performance? If not, what am I missing?
It looks slowish but it is not exactly unexpected. In general count is expressed as
SELECT 1 FROM table
followed by Spark side summation. So while it is optimized it still rather inefficient because you have fetch N long integers from the external source just to sum these locally.
As explained by the docs Cassandra backed RDD (not Datasets) provide optimized cassandraCount method which performs server side counting.
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job in. If I am setting the spark.sql.shuffle.partitions to (...), why is creating (...) Tasks?
Because spark.sql.shuffle.partitions is not used here. This property is used to determine number of partitions for shuffles (when data is aggregated by some set of keys) not for Dataset creation or global aggregations like count(*) (which always use 1 partition for final aggregation).
If you interested in controlling number of initial partitions you should take a look at spark.cassandra.input.split.size_in_mb which defines:
Approx amount of data to be fetched into a Spark partition. Minimum number of resulting Spark partitions is 1 + 2 * SparkContext.defaultParallelism
As you can see another factor here is spark.default.parallelism but it is not exactly a subtle configuration so depending on it in general is not an optimal choice.

I see that it is very old question but maybe someone needs it now.
When running Spark on local machine it is very important to set into SparkConf master "local[*]" that according to documentation allows to run Spark with as many worker threads as logical cores on your machine.
It helped me to increase performance of count() operation by 100% on local machine comparing to master "local".

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse