Interaction of Spark configurations with input structure - Scala

Spark has many configurable options. Here, I would like to know what the optimal configuration is under certain constraints.
I have seen many of these posts and do not think that an approach which neglects the structure of the data can yield a satisfactory solution.
Cluster Config
We will keep the already established --executor-cores 5, following the previous research on this. Let us add another constraint: --executor-memory is capped at 60 GB, which may be expressed as --executor-memory = min(60 GB, EM).
We fix the number of nodes in our cluster to N_0, which implicitly regulates --num-executors (equal to N_0 * average cores per node / 5).
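As a rough illustration of that arithmetic (all concrete numbers below are placeholders of my own, standing in for the symbols above):
// Placeholder values; N_0, cores per node and EM are assumptions for illustration
val n0 = 10                // N_0, number of nodes
val avgCoresPerNode = 20   // average cores per node
val requestedMemGb = 80    // EM, the memory we would otherwise request
val executorCores = 5                                     // fixed constraint
val executorMemGb = math.min(60, requestedMemGb)          // --executor-memory = min(60 GB, EM)
val numExecutors = n0 * avgCoresPerNode / executorCores   // --num-executors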
Data Config
We are presented with data in the form of FN_0 text files of equal size FS (approx. 1 GB) loaded into an RDD. This RDD initially has a partition number PN equal to FN_0. Loading all the files into the RDD yields RN = RDD.count() records.
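For reference, a minimal sketch of that setup (the input path is a placeholder; the variable names map onto the symbols above):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("input-structure").getOrCreate()
val sc = spark.sparkContext
// FN_0 text files of roughly FS = 1 GB each (placeholder path)
val rdd = sc.textFile("hdfs:///data/input/*.txt")
val pn = rdd.getNumPartitions   // PN, which the question takes to equal FN_0
val rn = rdd.count()            // RN, the total record count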
Question
I would like to find a qualitative expression or optimal solution for --executor-memory, --num-executors and the partition number PN for an Input -> Map -> Filter -> Action job, in terms of N_0, FN_0, FS and RN. What is their inter-dependency?
My assumption is that the partition number would ideally be RN (approx. 100,000), so that every record gets its own task, but the resulting shuffle would scale astronomically. I would also appreciate any thoughts on the relationship between the product FN_0 * FS and --executor-memory.

Related

Performance tuning in Spark

I am running a spark job which processes about 2 TB of data. The processing involves:
Read data (avro files)
Explode on a column which is a map type
OrderBy key from the exploded column
Filter the DataFrame (I have a very small(7) set of keys (call it keyset) that I want to filter the df for). I do a df.filter(col("key").isin(keyset: _*) )
I write this df to a parquet (this dataframe is very small)
Then I filter the original dataframe again for all the keys which are not in the keyset
df.filter(!col("key").isin(keyset: _*) ) and write this to a parquet. This is the larger dataset.
The original avro data is about 2 TB. The processing takes about 1 hr and I would like to optimize it. I am caching the dataframe after step 3, using a shuffle partition size of 6000, min executors = 1000, max = 2000, executor memory = 20 G, executor cores = 2. Any other suggestions for optimization? Would a left join perform better than the filter?
It all looks right to me.
If you have a small dataset then isin is okay.
1) Ensure that you can increase the number of cores per executor: executor cores = 5.
More than 5 cores per executor is not recommended. This is based on a study showing that any application with more than 5 concurrent threads per executor starts to hamper performance.
2) Ensure that you have a good/uniform partition structure.
Example (only for debug purpose not for production):
import org.apache.spark.sql.functions.spark_partition_id
yourcacheddataframe.groupBy(spark_partition_id()).count().show()
This will print each Spark partition number and how many records exist in that partition. Based on that you can repartition if you want more parallelism (see the sketch after this list).
3) spark.dynamicAllocation.enabled could be another option.
For example:
spark-submit --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=100 --conf spark.shuffle.service.enabled=true
along with all the other required properties ... that applies to that job only. If you set these properties in spark-defaults.conf, they will be applied to all jobs.
With all of these options applied, your processing time should come down.
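Following up on point 2), if the per-partition counts come back very uneven, a repartition is the usual next step. A sketch (the target of 2000 is only an example; a multiple of the total executor cores is a common choice):
// 2000 is an arbitrary example target partition count
val rebalanced = yourcacheddataframe.repartition(2000)
rebalanced.groupBy(spark_partition_id()).count().show()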
On top of what has been mentioned, a few suggestions depending on your requirements and cluster:
If the job can run at 20 G executor memory and 5 cores, you may be able to fit more workers by decreasing the executor memory and keeping 5 cores.
Is the orderBy actually required? Spark ensures that rows are ordered within partitions, but not between partitions, which usually isn't terribly useful.
Are the files required to be in specific locations? If not, adding a
df.withColumn("in_keyset", when(col("key").isin(keyset: _*), lit(1)).otherwise(lit(0)))
  .write.partitionBy("in_keyset").parquet(...)
may speed up the operation by preventing the data from being read in and exploded twice. The partitionBy ensures that the items in the keyset end up in a different directory than the other keys.
spark.dynamicAllocation.enabled is enabled
partition sizes are quite uneven (based on the size of output parquet part files) since I am doing an orderBy key and some keys are more frequent than others.
keyset is a really small set (7 elements)

How to optimize Spark for writing large amounts of data to S3

I do a fair amount of ETL using Apache Spark on EMR.
I'm fairly comfortable with most of the tuning necessary to get good performance, but I have one job that I can't seem to figure out.
Basically, I'm taking about 1 TB of parquet data - spread across tens of thousands of files in S3 - and adding a few columns and writing it out partitioned by one of the date attributes of the data - again, parquet formatted in S3.
I run like this:
spark-submit --conf spark.dynamicAllocation.enabled=true --num-executors 1149 --conf spark.driver.memoryOverhead=5120 --conf spark.executor.memoryOverhead=5120 --conf spark.driver.maxResultSize=2g --conf spark.sql.shuffle.partitions=1600 --conf spark.default.parallelism=1600 --executor-memory 19G --driver-memory 19G --executor-cores 3 --driver-cores 3 --class com.my.class path.to.jar <program args>
The size of the cluster is dynamically determined based on the size of the input data set, and the num-executors, spark.sql.shuffle.partitions, and spark.default.parallelism arguments are calculated based on the size of the cluster.
The code roughly does this:
val df = (read from s3 and add a few columns like timestamp and source file name)
val dfPartitioned = df.coalesce(numPartitions)
val sqlDFProdDedup = spark.sql(s"""(query to dedup against prod data)""")
sqlDFProdDedup.repartition($"partition_column")
.write.partitionBy("partition_column")
.mode(SaveMode.Append).parquet(outputPath)
When I look at the ganglia chart, I get a huge resource spike while the de-dup logic runs and some data shuffles, but then the actual writing of the data only uses a tiny fraction of the resources and runs for several hours.
I don't think the primary issue is partition skew, because the data should be fairly distributed across all the partitions.
The partition column is essentially a day of the month, so each job typically only has 5-20 partitions, depending on the span of the input data set. Each partition typically has about 100 GB of data across 10-20 parquet files.
I'm setting spark.sql.files.maxRecordsPerFile to manage the size of those output files.
So, my big question is: how can I improve the performance here?
Simply adding resources doesn't seem to help much.
I've tried making the executors larger (to reduce shuffling) and also to increase the number of CPUs per executor, but that doesn't seem to matter.
Thanks in advance!
Zack, I have a similar use case with 'n' times more files to process on a daily basis. I am going to assume that you are using the code above as-is and trying to improve the performance of the overall job. Here are a couple of my observations:
Not sure what the coalesce(numPartitions) number actually is, or why it is being used before the de-duplication process. Your spark-submit shows you are creating 1600 partitions, and that's good enough to start with.
If you are going to repartition before the write, then the coalesce above may not be beneficial at all, since the repartition will shuffle the data anyway.
Since you say you are writing 10-20 parquet files, it means you are only using 10-20 cores for the write in the last part of your job, which is the main reason it is slow. Based on the 100 GB estimate, each parquet file is roughly 5-10 GB, which is really huge; I doubt anyone would be able to open them on a local laptop or EC2 machine unless they use EMR or similar (with huge executor memory, whether reading the whole file or spilling to disk), because the memory requirement is too high. I would recommend creating parquet files of around 1 GB to avoid any of those issues.
Also, if you create 1 GB parquet files, you will likely speed up the process 5 to 10 times, as you will be using more executors/cores to write them in parallel. You can run an experiment by simply writing the dataframe with the default partitions.
Which brings me to the point that you really don't need to repartition just because you want to call write.partitionBy("partition_date"). Your repartition() call is actually forcing the dataframe to have at most 30-31 partitions, depending on the number of days in that month, which is what is driving the number of files being written. write.partitionBy("partition_date") writes the data into S3 partitions regardless, and if your dataframe has, say, 90 partitions it will write about 3 times faster (3 * 30). df.repartition() is what is slowing it down. Do you really need to have 5 GB or larger files?
Another major point is that Spark's lazy evaluation is sometimes too smart. In your case it will most likely size the whole program around the repartition(number). Instead, try df.cache() -> df.count() and then df.write(), as sketched below. What this does is force Spark to use all available executor cores. I am assuming you are reading files in parallel; in your current implementation you are likely using only 20-30 cores. One point of caution: as you are using r4/r5 machines, feel free to raise your executor memory to 48 G with 8 cores. I have found 8 cores to be faster for my task than the standard 5-core recommendation.
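A minimal sketch of that cache-then-count pattern, reusing the names from the question (the write itself is unchanged from the original code):
import org.apache.spark.sql.SaveMode

// Materialize the deduplicated dataframe across all executor cores before writing
sqlDFProdDedup.cache()
sqlDFProdDedup.count()   // forces the full computation in parallel
sqlDFProdDedup.write
  .partitionBy("partition_column")
  .mode(SaveMode.Append)
  .parquet(outputPath)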
Another pointer is to try ParallelGC instead of G1GC. For a use case like this, where you are reading thousands of files, I have noticed it performs better than, or at least no worse than, G1GC. Please give it a try.
In my workload, I use a coalesce(n)-based approach, where 'n' gives me a 1 GB parquet file. I read files in parallel using ALL the cores available on the cluster. Only during the write part are my cores idle, but there's not much you can do to avoid that.
I am not sure how spark.sql.files.maxRecordsPerFile works in conjunction with coalesce() or repartition(), but I have found that 1 GB files seem acceptable for pandas, Redshift Spectrum, Athena, etc.
Hope it helps.
Charu
Here are some optimizations for faster running.
(1) File committer - this controls how Spark writes the part files out to the S3 bucket. Each operation is distinct and is controlled by:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
Description
This will write the files directly to their final part files instead of initially writing them to temp files and copying them over to their end-state part files.
(2) For file size, you can derive it from the average number of bytes per record. Below I am working out the number of bytes per record in order to estimate how many records fit in 1024 MB. I would try 1024 MB per partition first, then move upwards (one way to act on the resulting estimate is sketched after these points).
import org.apache.spark.util.SizeEstimator

// Estimate the in-memory size of the dataset, then derive how many records fit in ~1024 MB
val numberBytes: Long = SizeEstimator.estimate(inputDF.rdd)
val reduceBytesTo1024MB = numberBytes / 1073741824L   // 1024 MB = 1073741824 bytes
val numberRecords = inputDF.count
val recordsFor1024MB = (numberRecords / reduceBytesTo1024MB).toInt + 1
(3) [I haven't tried this] EMR committer - if you are using EMR 5.19 or higher and are outputting Parquet, you can set the Parquet-optimized committer to TRUE:
spark.sql.parquet.fs.optimized.committer.optimization-enabled true
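Going back to point (2), one way to act on that estimate (my own follow-up, not part of the original answer) is to feed it into spark.sql.files.maxRecordsPerFile before the write:
// recordsFor1024MB comes from the estimate in point (2); path and mode are placeholders
spark.conf.set("spark.sql.files.maxRecordsPerFile", recordsFor1024MB.toLong)
inputDF.write.mode("overwrite").parquet("s3://your-bucket/output/")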

Slow count of >1 billion rows from Cassandra via Apache Spark [duplicate]

I have setup Spark 2.0 and Cassandra 3.0 on a local machine (8 cores, 16gb ram) for testing purposes and edited spark-defaults.conf as follows:
spark.python.worker.memory 1g
spark.executor.cores 4
spark.executor.instances 4
spark.sql.shuffle.partitions 4
Next I imported 1.5 million rows in Cassandra:
CREATE TABLE test (
    tid int,
    cid int,
    pid int,
    ev list<double>,
    PRIMARY KEY (tid)
)
test.ev is a list containing numeric values i.e. [2240,2081,159,304,1189,1125,1779,693,2187,1738,546,496,382,1761,680]
Now in the code, to test the whole thing I just created a SparkSession, connected to Cassandra and ran a simple select count:
cassandra = spark.read.format("org.apache.spark.sql.cassandra")
df = cassandra.load(keyspace="testks",table="test")
df.select().count()
At this point, Spark outputs the count and takes about 28 seconds to finish the job, distributed across 13 tasks (in the Spark UI, the total input for the tasks is 331.6 MB).
Questions:
Is that the expected performance? If not, what am I missing?
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job into. If I am setting spark.sql.shuffle.partitions to 4, why is it creating 13 tasks? (I also checked the number of partitions by calling rdd.getNumPartitions() on my DataFrame.)
Update
A common operation I would like to test over this data:
Query a large data set, say, from 100,000 ~ N rows grouped by pid
Select ev, a list<double>
Perform an average on each member, assuming by now each list has the same length i.e df.groupBy('pid').agg(avg(df['ev'][1]))
As #zero323 suggested, I deployed an external machine (2 GB RAM, 4 cores, SSD) with Cassandra just for this test, and loaded the same data set. The result of the df.select().count() was, as expected, greater latency and overall poorer performance compared with my previous test (it took about 70 seconds to finish the job).
Edit: I misunderstood his suggestion. #zero323 meant letting Cassandra perform the count instead of using Spark SQL, as explained here.
I also wanted to point out that I am aware of the inherent anti-pattern of using a list<double> instead of a wide row for this type of data, but my concern at the moment is more the time spent retrieving a large dataset than the actual average computation time.
Is that the expected performance? If not, what am I missing?
It looks slowish, but it is not exactly unexpected. In general, count is expressed as
SELECT 1 FROM table
followed by a Spark-side summation. So while it is optimized, it is still rather inefficient, because you have to fetch N long integers from the external source just to sum them locally.
As explained in the docs, Cassandra-backed RDDs (not Datasets) provide an optimized cassandraCount method which performs server-side counting.
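A minimal sketch with the DataStax spark-cassandra-connector's Scala API (keyspace and table names are taken from the question; details may vary by connector version):
import com.datastax.spark.connector._

// sc is the underlying SparkContext (spark.sparkContext);
// cassandraCount() pushes the count down to Cassandra instead of fetching rows into Spark
val serverSideCount = sc.cassandraTable("testks", "test").cassandraCount()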
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job in. If I am setting the spark.sql.shuffle.partitions to (...), why is creating (...) Tasks?
Because spark.sql.shuffle.partitions is not used here. This property is used to determine the number of partitions for shuffles (when data is aggregated by some set of keys), not for Dataset creation or global aggregations like count(*) (which always use 1 partition for the final aggregation).
If you are interested in controlling the number of initial partitions, you should take a look at spark.cassandra.input.split.size_in_mb, which defines:
Approx amount of data to be fetched into a Spark partition. Minimum number of resulting Spark partitions is 1 + 2 * SparkContext.defaultParallelism
As you can see, another factor here is spark.default.parallelism, but it is not exactly a subtle configuration, so depending on it is in general not an optimal choice.
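For example, the split size can be lowered to produce more input partitions (the 64 MB value is purely illustrative):
import org.apache.spark.sql.SparkSession

// Smaller splits -> more initial Spark partitions when scanning Cassandra
val spark = SparkSession.builder()
  .config("spark.cassandra.input.split.size_in_mb", "64")
  .getOrCreate()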
I see that this is a very old question, but maybe someone needs it now.
When running Spark on a local machine, it is very important to set the SparkConf master to "local[*]", which according to the documentation runs Spark with as many worker threads as there are logical cores on your machine.
It helped me increase the performance of the count() operation by 100% on my local machine compared to master "local".
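A minimal sketch of that configuration (the app name is a placeholder):
import org.apache.spark.{SparkConf, SparkContext}

// local[*] runs Spark with as many worker threads as there are logical cores
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("local-count-test")
val sc = new SparkContext(conf)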

Spark performs task with not enough parallelism

I am a beginner in Spark and I am a bit confused about its behaviour.
I am developing an algorithm in Scala; in this method I create an RDD with a number of partitions specified by the user, in this way:
val fichero = sc.textFile(file, numPartitions)
I am developing on a cluster with 12 workers and 216 cores available (18 per node). But when I go to the Spark UI to debug the application, I see the following event timeline for a given stage:
Sorry for the quality of the image, but I had to lower the zoom a lot. In this execution there are 128 partitions. But, as can be observed in the image, the whole RDD is executed on only two of the twelve available executors, so some tasks are executed sequentially, and I don't want that behaviour.
So the question is: what is happening here? Can I use all the workers in order to execute the tasks in parallel? I have seen the option:
spark.default.parallelism
But this option is overridden when choosing the number of partitions to use. I am launching the application with the default parameters of the spark-submit script.
You should set --num-executors to a higher number (the default is 2); you should also look at --executor-cores, which is 1 by default. Try e.g. --num-executors 128.
Make sure that your number of partitions is a multiple (I normally use 2 or 4, depending on the resources needed) of "the number of executors times the number of cores per executor".
See spark-submit --help, and for further reading I can recommend having a look at this (especially "tuning parallelism"): http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
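As a sketch of that sizing arithmetic against the 12-node, 18-cores-per-node cluster from the question (the exact split into executors is my assumption):
// Illustrative only: 2 executors per node, 9 cores each -> 216 cores in total
val numExecutors = 12 * 2
val executorCores = 18 / 2
val partitionMultiple = 2   // 2-4x the total core count, per the advice above
val numPartitions = numExecutors * executorCores * partitionMultiple   // 432
val fichero = sc.textFile(file, numPartitions)   // file and sc as in the question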
numPartitions is a hint, not a requirement.
It is eventually passed to the InputFormat:
https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/mapred/FileInputFormat.html#getSplits(org.apache.hadoop.mapred.JobConf, int)
You can always check the actual number of partitions with
val fichero = sc.textFile(file, numPartitions)
fichero.partitions.size

Last 2/3 tasks take a huge amount of time compared to all other tasks in Spark

I am trying to do sentiment analysis of comments. The program runs successfully on Spark, but the problem I am facing is that, out of 70 partitions, 68 partitions return results in around 20% of the time taken by the last 2 partitions. I have checked that my data is equally distributed across all partitions, and I have even checked with different sample data.
I have also run the code using persist(StorageLevel.MEMORY_AND_DISK_SER) for all dataframes, and I unpersist these dataframes as soon as they are no longer required.
I also tried increasing and decreasing the number of partitions, but the last 2 tasks still take a huge amount of time. The following is the current config I am using:
--master yarn \
--deploy-mode client \
--num-executors 15 \
--executor-cores 5 \
--executor-memory 32g \
--driver-memory 8g \
--driver-cores 8 \
I have used KryoSerializer. The following is set in sparkConf:
sparkConf.set("spark.driver.allowMultipleContexts", "true")
sparkConf.set("spark.scheduler.mode", "FAIR")
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.kryoserializer.buffer.max", "1024m")
How can I optimise this so that the last 2 partitions don't take so much time?
Thanks
Try using Kryo serialization; the default Java serialization is a bit slow.
When Spark is writing back to disk and reading the data back, it is transferring huge amounts of data, so you need to use a better serialization technique.
You are running in client mode; try running in cluster mode. In cluster mode the driver is launched within the cluster as another process.
Executor processing time is mostly uneven, depending on how you partitioned your data and how busy the worker nodes are at the time. Try revisiting your partitioning logic (increasing the partition number) and increasing the number of cores and executor memory. This is the kind of tuning that is specific to your use case. In my case, increasing the partition number from 100 to 1000, together with 10 executors and 2 cores each, solved the problem. Again, it also depends on how large your dataset is.
NLP tasks can take variable amounts of time depending on the data factors. For example, Stanford's CoreNLP takes a very long time to do NER on long sentences (time goes up with the square of the number of tokens in the sentence).
If it is a different partition each time, then I would guess that your cluster is sized in an inconvenient way: the number of executors does not divide evenly into the number of partitions. An easy way to check this is to repartition your 70 partitions into 700 partitions and see whether the slowness is still confined to 2 or 3 partitions, or whether the time evens out. If it is still two or three partitions, then it is likely a data issue. If the time evens out, then it was due to a partition/executor mismatch.
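A quick way to run that check (commentsDF is a hypothetical name for the dataframe feeding the sentiment analysis stage):
// Repartition 70 -> 700; if 2-3 tasks are still slow, suspect data skew,
// if the times even out, it was a partition/executor mismatch
val finerGrained = commentsDF.repartition(700)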