I have a text file in HDFS, which has about 10 million records. I am trying to read the file do some transformations on that data. I am trying to uniformly partition the data before I do the processing on it. here is the sample code
var myRDD = sc.textFile("input file location")
myRDD = myRDD.repartition(10000)
and when I do my transformations on this re-partitioned data, I see that one partition has abnormally large number of records and others have very little data. (image of the distribution)
So the load is high on only one executor
I also tried and got the same result
myRDD.coalesce(10000, shuffle = true)
is there a way to uniformly distribute records among partitions.
Attached is the shuffle read size/ number of records on that particular executor
the circled one has a lot more records to process than the others
any help is appreciated thank you.
To deal with the skew, you can repartition your data using distribute by(or using repartition as you used). For the expression to partition by, choose something that you know will evenly distribute the data.
You can even use the primary key of the DataFrame(RDD).
Even this approach will not guarantee that data will be distributed evenly between partitions. It all depends on the hash of the expression by which we distribute.
Spark : how can evenly distribute my records in all partition
Salting can be used which involves adding a new "fake" key and using alongside the current key for better distribution of data.
(here is link for salting)
For small data I have found that I need to enforce uniform partitioning myself. In pyspark the difference is easily reproducible. In this simple example I'm just trying to parallelize a list of 100 elements into 10 even partitions. I would expect each partition to hold 10 elements. Instead, I get an uneven distribution with partitions sizes anywhere from 4 to 22:
my_list = list(range(100))
rdd = spark.sparkContext.parallelize(my_list).repartition(10)
rdd.glom().map(len).collect()
# Outputs: [10, 4, 14, 6, 22, 6, 8, 10, 4, 16]
Here is the workaround I use, which is to index the data myself and then mod the index to find which partition to place the row in:
my_list = list(range(100))
number_of_partitions = 10
rdd = (
spark.sparkContext
.parallelize(zip(range(len(my_list)), my_list))
.partitionBy(number_of_partitions, lambda idx: idx % number_of_partitions)
)
rdd.glom().map(len).collect()
# Outputs: [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
Related
We are working in Spark streaming .
Our DataFrame contains the following columns
[unitID,source,avrobyte,schemeType]
The unitID values are [ 10, 76, 510, 269 , 7, 0, 508, , 509 ,511 , 507]
We active the following command :
val dfGrouped :KeyValueGroupedDataset [Int,Car] = dfSource.groupByKey(car1=> ca1.unitID)
val afterLogic : DataSet[CarLogic]= dfGrouped.flatMapGroups{
case(unitID: Int , messages:Iterator[Car])=> performeLogic(...)
}
We allocate 8 Spark executers .
In our Dataset we have 10 different units so we have 10 different unitID,
so we excepted that job processing will split on all over the executers in equal manner, but when we looking on the executers performance via the UI I see that only 2 executers are working and all the other are idle during the mission....
What are we doing wrong? or how we can divide the job over all the executers to be more or the less equal...
What you are seeing can be explained by the low cardinality of your key space. Spark uses a HashPartitioner (by default) to assign keys to partitions (by default 200 partitions). On a low cardinality key space this is rather problematic and requires careful attention as each collision has a massive impact. Even further, these partitions then have to be assigned to executors. At the end of this process it's not surprising to end up with a rather sub-optimal distribution of data.
You have a few options:
If applicable, attempt to increase the cardinality of your keys, e.g. by salting them (appending some randomness temporarily). That has the advantage that you can also better handle skew in the data (when the amount of data per keys is not equally distributed). In a following step you can then remove the random part again and combine the partial results.
If you absolutely require a partition per key (and the key space is static and well-known), you should configure spark.sql.shuffle.partitions to match the cardinality n of your keys space and assign each key a partition id in [0, n) ahead of time (to avoid collisions when hashing). Then you can use this partition id in your groupBy.
Just for completeness, using the RDD API you could provide you own custom partitioner that does the same as described above: rdd.partitionBy(n, customPartitioner)
Though, one final word: Even following one of the latter two options above, using 8 executors for 10 keys (equals 10 non-empty partitions) is a poor choice. If your data is equally distributed, you will still end up with 2 executors doing double the work. If your data is skewed things might even be worse (or you are accidentally lucky) - in any case, it's out of your control.
So it's best to make sure that the number of partitions can be equally distributed among your executors.
I have a DataFrame (df) with more than 1 billion rows
df.coalesce(5)
.write
.partitionBy("Country", "Date")
.mode("append")
.parquet(datalake_output_path)
From the above command I understand only 5 worker nodes in my 100 worker node cluster (spark 2.4.5) will be performing all the tasks. Using coalesce(5) takes the process 7 hours to complete.
Should I try repartition instead of coalesce?
Is there a more faster/ efficient way to write out 128 MB size parquet files or do I need to first calculate the size of my dataframe to determine how many partitions are required.
For example if the size of my dataframe is 1 GB and spark.sql.files.maxPartitionBytes = 128MB should I first calculate No. of partitions required as 1 GB/ 128 MB = approx(8) and then do repartition(8) or coalesce(8) ?
The idea is to maximize the size of parquet files in the output at the time of writing and be able to do so quickly (faster).
You can get the size (dfSizeDiskMB) of your dataframe df by persisting it and then checking the Storage tab on the Web UI as in this answer. Armed with this information and an estimate of the expected Parquet compression ratio you can then estimate the number of partitions you need to achieve your desired output file partition size e.g.
val targetOutputPartitionSizeMB = 128
val parquetCompressionRation = 0.1
val numOutputPartitions = dfSizeDiskMB * parquetCompressionRatio / targetOutputPartitionSizeMB
df.coalesce(numOutputPartitions).write.parquet(path)
Note that spark.files.maxPartitionBytes is not relevant here as it is:
The maximum number of bytes to pack into a single partition when reading files.
(Unless df is the direct result of reading an input data source with no intermediate dataframes created. More likely the number of partitions for df is dictated by spark.sql.shuffle.partitions, being the number of partitions for Spark to use for dataframes created from joins and aggregations).
Should I try repartition instead of coalesce?
coalesce is usually better as it can avoid the shuffle associated with repartition, but note the warning in the docs about potentially losing parallelism in the upstream stages depending on your use case.
Coalesce is better if you are coming from higher no of partitions to lower no. However, if before writing the df, your code isn't doing shuffle , then coalesce will be pushed down to the earliest point possible in DAG.
What you can do is process your df in say 100 partitions or whatever number you seem appropriate and then persist it before writing your df.
Then bring your partitions down to 5 using coalesce and write it. This should probably give you a better performance
I am running a spark job which processes about 2 TB of data. The processing involves:
Read data (avrò files)
Explode on a column which is a map type
OrderBy key from the exploded column
Filter the DataFrame (I have a very small(7) set of keys (call it keyset) that I want to filter the df for). I do a df.filter(col("key").isin(keyset: _*) )
I write this df to a parquet (this dataframe is very small)
Then I filter the original dataframe again for all the key which are not in the keyset
df.filter(!col("key").isin(keyset: _*) ) and write this to a parquet. This is the larger dataset.
The original avro data is about 2TB. The processing takes about 1 hr. I would like to optimize it. I am caching the dataframe after step 3, using shuffle partition size of 6000. min executors = 1000, max = 2000, executor memory = 20 G, executor core = 2. Any other suggestions for optimization ? Would a left join be better performant than filter ?
All look right to me.
If you have small dataset then isin is okay.
1) Ensure that you can increase the number of cores. executor core=5
More than 5 cores not recommended for each executor. This is based on a study where any application with more than 5 concurrent threads would start hampering the performance.
2) Ensure that you have good/uniform partition strucutre.
Example (only for debug purpose not for production):
import org.apache.spark.sql.functions.spark_partition_id
yourcacheddataframe.groupBy(spark_partition_id).count.show()
This is will print spark partition number and how many records
exists in each partition. based on that you can repartition, if you wanot more parllelism.
3) spark.dynamicAllocation.enabled could be another option.
For Example :
spark-submit --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=100 --conf spark.shuffle.service.enabled=true
along with all other required props ..... thats for that job. If you give these props in spark-default.conf it would be applied for all jobs.
With all these aforementioned options your processing time might lower.
On top of what has been mentioned, a few suggestions depending on your requirements and cluster:
If the job can run at 20g executor memory and 5 cores, you may be able to fit more workers by decreasing the executor memory and keeping 5 cores
Is the orderBy actually required? Spark ensures that rows are ordered within partitions, but not between partitions which usually isn't terribly useful.
Are the files required to be in specific locations? If not, adding a
df.withColumn("in_keyset", when( col('key').isin(keyset), lit(1)).otherwise(lit(0)). \
write.partitionBy("in_keyset").parquet(...)
may speed up the operation to prevent the data from being read in + exploded 2x. The partitionBy ensures that the items in the keyset are in a different directory than the other keys.
spark.dynamicAllocation.enabled is enabled
partition sizes are quite uneven (based on the size of output parquet part files) since I am doing an orderBy key and some keys are more frequent than others.
keyset is a really small set (7 elements)
Spark version 1.6.0
I'm using the join function between two dataframes which have 100 partitions, the app is running on a cluster where I'm using 5 cores for each 20 executor with total 100 cores.
My problem is that when I do the join, all the records are computed on one executor, while the other executors are not used as picture below:
This cause a decrease in performance because all data is computed with one executor against the other 19 executors available.
It looks like spark join "bring" all the record in only one partitions, is there a way to avoid this?
To be sure that it doesn't repartion to 1 I also set this spark property: spark.sql.shuffle.partitions=100 indeed the two input dataframe have 100 partitions same as the output dataframe
Short answer:
This is because of your data, not because of spark.
Long answer:
In order to perform join operation spark need to move data with same keys (values of columns that you're joining on) to the same workers. E.g. if you join column A with column B the rows that contains same values in both tables will be moved to same workers and then joined.
In addition - rows with different keys also might be moved to same node - this depends on Partitioner that you have. You can read more here - but the general idea that there are to default partitioners - HashPartitioner and RangePartitioner. Despite which one is used - it decides on which worker row goes. As an example - if you have RangePartitioner with ranges [0, 5)[5. 7)[7, 10] then keys 1, 2, 3, 4 will all go to same worker. And if you have only these keys in your data - only one worker will be utilized.
Imagine I have a RDD with 100 records and I partitioned it with 10, so each partition is now having 10 records I am just converting to rdd to key value pair rdd and saving it to a file now my output data is divided into 10 partitions which is ok to me, but is it best practise to use coalesce function before saving output data to file ? for example rdd.coalesce(1) this gives just one file as output does it not shuffles data insides nodes ? want to know where coalesce should be used.
Thanks
Avoid coalesce if you don't need it. Only use it to reduce the amount of files generated.
As with anything, depends on your use case; coalesce() can be used to either increase or decrease the number of partitions but there is a cost associated with it.
If you are attempting to increase the number of partitions (in which the shuffle parameter must be set to true), you will incur the cost of redistributing data through a HashPartitioner. If you are attempting to decrease the number of partitions, the shuffle parameter can be set to false but the number of nodes actively grabbing from the current set of partitions will be the number of partitions you are coalescing to. For example, if you are coalescing to 1 partition, only 1 node will be active in pulling data from the parent partitions (this can be dangerous if you are coalescing a large amount of data).
Coalescing can be useful though as sometimes you can make your job run more efficiently by decreasing your partition set size (e.g. after a filter or a sparse inner join).
you can simply use it like this
rdd.coalesce(numberOfPartition)
It doesn't shuffle data if you decease partitions but its shuffle data if you increase partitions. Its according to use cases.But we careful to use it because if you decrease partition less than or not equal to number of cores in your cluster then its cant use full resources of your cluster. And Sometimes less shuffle data or network IO like you decrease rdd partition but equal to number of partition so its increase performance of your system.