Spark: sc.WholeTextFiles takes a long time to execute - scala

I have a cluster and I execute wholeTextFiles, which should pull about a million text files that sum up to approximately 10GB in total.
I have one NameNode and two DataNodes with 30GB of RAM and 4 cores each. The data is stored in HDFS.
I don't set any special parameters and the job takes 5 hours just to read the data. Is that expected? Are there any parameters that should speed up the read (Spark configuration, partitions, number of executors)?
I'm just starting and I've never had the need to optimize a job before.
EDIT: Additionally, can someone explain exactly how the wholeTextFiles function works (not how to use it, but how it was programmed)? I'm very interested in understanding the partition parameter, etc.
EDIT 2: benchmark assessment
So I tried repartition after the wholeTextFiles call. The problem is the same, because the first read still uses the pre-defined number of partitions, so there is no performance improvement. Once the data is loaded the cluster performs really well. I get the following warning message when dealing with the data (for 200k files) in the wholeTextFiles stage:
15/01/19 03:52:48 WARN scheduler.TaskSetManager: Stage 0 contains a task of very large size (15795 KB). The maximum recommended task size is 100 KB.
Could that be a reason for the bad performance? How do I avoid it?
Additionally, when doing a saveAsTextFile, my speed according to the Ambari console is 19MB/s. When doing a read with wholeTextFiles, I am at 300KB/s.
It seems that by increasing the number of partitions in wholeTextFiles(path, partitions), I am getting better performance. But still only 8 tasks are running at the same time (my number of CPUs). I'm benchmarking to observe the limit.

To summarize my recommendations from the comments:
HDFS is not a good fit for storing many small files. First of all, the NameNode stores metadata in memory, so the number of files and blocks you can have is limited (~100 million blocks is the maximum for a typical server). Next, each time you read a file you first query the NameNode for block locations, then connect to the DataNode storing the file. The overhead of these connections and responses is really huge.
Default settings should always be reviewed. By default Spark starts on YARN with 2 executors (--num-executors) with 1 thread each (--executor-cores) and 512m of RAM (--executor-memory), giving you only 2 threads with 512MB of RAM each, which is really small for real-world tasks.
So my recommendation is:
Start Spark with --num-executors 4 --executor-memory 12g --executor-cores 4, which would give you more parallelism: 16 threads in this particular case, which means 16 tasks running in parallel.
Use sc.wholeTextFiles to read the files and then dump them into a compressed sequence file (for instance, with Snappy block-level compression). Here's an example of how this can be done: http://0x0fff.com/spark-hdfs-integration/. This will greatly reduce the time needed to read them on the next iteration.
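A minimal sketch of the "pack the small files once" step, in case the linked example is unavailable. The object name, function names and the choice of the Hadoop property for block-level compression are my own assumptions, not taken from the linked post; Snappy also needs to be available to Hadoop on your cluster.

```scala
import org.apache.hadoop.io.compress.{CompressionCodec, SnappyCodec}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of Spark.
object PackSmallFiles {
  // Reads every file under inputDir as a (path, content) pair and writes the
  // pairs into a SequenceFile under outputDir.
  def pack(
      sc: SparkContext,
      inputDir: String,
      outputDir: String,
      minPartitions: Int,
      codec: Option[Class[_ <: CompressionCodec]] = Some(classOf[SnappyCodec])): Unit = {
    // Ask Hadoop for block-level rather than record-level compression.
    sc.hadoopConfiguration.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK")
    sc.wholeTextFiles(inputDir, minPartitions).saveAsSequenceFile(outputDir, codec)
  }

  // Later jobs read the packed (path, content) pairs back from a few large files.
  def unpack(sc: SparkContext, packedDir: String): RDD[(String, String)] =
    sc.sequenceFile[String, String](packedDir)
}
```

The slow wholeTextFiles read then happens only once; every later run would start from PackSmallFiles.unpack instead of touching a million small files.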

Related

How can I write dataframe to csv file using one partition although the file size exceeds executors memory

I am working on an Apache Spark standalone cluster with 2 executors, each having 1g of heap space and 8 cores.
I load an input file of size 2.7GB into a dataframe df. This was done successfully using 21 tasks, that is, I used 21 partitions in total across my whole cluster.
Now I tried writing this out to csv using only 1 partition, so that I get all my records in 1 csv file.
df.coalesce(1).write.option("header","true").csv("output.csv")
I expected to get an OOM error since the total usable memory for an executor is less than 2.7GB. But this did not happen.
How did my task not break despite the data being larger than a single partition? What exactly is happening here under the hood?
The original csv file is 2.7GB in its raw format (text-based, no compression). When you read that file with Spark, it splits the data into multiple partitions based on the configuration spark.sql.files.maxPartitionBytes, which defaults to 128MB. Doing the math gives 2700MB / 128MB ≈ 21 partitions.
Spark keeps the data in memory, but in its own storage format, which is called "Vectorized Parquet" and uses "lz4" as its default compression.
Therefore, the 2.7GB will fit into the provided 1GB memory.
Keep in mind that not all of the 1GB is available for data storage and processing. The executor's memory has a well-defined layout that can be tuned through the configurations spark.memory.fraction and spark.memory.storageFraction. I have written an article on Medium about the Executor Memory Layout.
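To put rough numbers on that layout, here is a small back-of-the-envelope sketch. It assumes the unified memory manager defaults I know from Spark 2.x and later (300MB reserved, spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5); the object and function names are made up for illustration, and the real figures will be a bit lower because the JVM reports slightly less than the configured heap.

```scala
// Hypothetical helper for estimating the executor memory regions.
object ExecutorMemory {
  // Fixed amount Spark sets aside before applying spark.memory.fraction.
  val ReservedMb: Double = 300.0

  // Memory shared by execution and storage:
  // (heap - reserved) * spark.memory.fraction
  def unifiedMb(heapMb: Double, memoryFraction: Double = 0.6): Double =
    (heapMb - ReservedMb) * memoryFraction

  // Part of the unified region in which cached blocks are protected from eviction:
  // unified * spark.memory.storageFraction
  def storageMb(
      heapMb: Double,
      memoryFraction: Double = 0.6,
      storageFraction: Double = 0.5): Double =
    unifiedMb(heapMb, memoryFraction) * storageFraction
}
```

For the 1g executors in the question, ExecutorMemory.unifiedMb(1024) comes out at roughly 434MB, of which about half is the protected storage region.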

Repartitioning Large Files in Spark

I am very new to Spark and have a 1 TB file to process.
My system specification is:
Each node: 64 GB RAM
Number Of nodes:2
Cores per node: 5
As far as I know, I have to repartition the data for better parallelism, since by default Spark only creates a number of partitions based on the total number of cores (times 2, 3 or 4).
But in my case, since the data file is huge, I have to repartition the data into a number of partitions such that it can be processed efficiently.
How do I choose the number of partitions to pass to repartition? How should I calculate it? What approach should I take to solve this?
Thanks a lot in advance.
Partitions and parallelism are two different things, per my understanding. However, both go hand in hand when it comes to parallel execution of tasks in Spark.
Parallelism is number of executors * number of cores per executor, which in your case is 2 * 5 = 10. So at any given moment you could have at most 10 tasks running.
If your data is divided into 10 partitions, then all of it would be processed at once. However, if you have 20 partitions, then Spark would start processing 10 partitions and, as each task finishes, schedule the next partition to process. This continues until all the partitions have been processed.
By default one partition is one block of data. I am guessing your 1 TB of data is stored on HDFS. If the underlying block size is 256MB, then you would have 1TB / 256MB = 4096 blocks, which in turn become partitions.
Please note that once the data is read you can always repartition it based on your requirements.
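The arithmetic above can be written down as a tiny helper. This is only a sketch with names I made up (PartitionMath, taskSlots, waves, blocks); it assumes binary units (1TB = 1024^4 bytes) and does not touch Spark at all.

```scala
// Hypothetical helper that mirrors the reasoning above.
object PartitionMath {
  // Maximum number of tasks that can run at the same moment.
  def taskSlots(numExecutors: Int, coresPerExecutor: Int): Int =
    numExecutors * coresPerExecutor

  // Number of scheduling "waves" needed to run every partition once,
  // assuming roughly equal task durations (ceiling division).
  def waves(numPartitions: Long, slots: Int): Long =
    (numPartitions + slots - 1) / slots

  // Number of blocks (and therefore default input partitions) for a file.
  def blocks(totalBytes: Long, blockBytes: Long): Long =
    (totalBytes + blockBytes - 1) / blockBytes
}
```

With the numbers from the question, PartitionMath.taskSlots(2, 5) gives 10 slots, a 1TB file with 256MB blocks gives 4096 partitions, and those 4096 partitions would run in 410 waves of at most 10 tasks each.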
How do I choose the number of partitions to pass to repartition? How should I calculate it? What approach should I take to solve this?
You need to see how your Spark application holds up with the current partition size and then determine whether you can decrease or increase that number. Executor memory is another consideration: if a partition is too big, you can run into OutOfMemory errors. These are just guidelines and not an exhaustive list.
This multipart series has a more detailed discussion of partitions and executors: https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-1/

How to optimize Spark for writing large amounts of data to S3

I do a fair amount of ETL using Apache Spark on EMR.
I'm fairly comfortable with most of the tuning necessary to get good performance, but I have one job that I can't seem to figure out.
Basically, I'm taking about 1 TB of parquet data - spread across tens of thousands of files in S3 - and adding a few columns and writing it out partitioned by one of the date attributes of the data - again, parquet formatted in S3.
I run like this:
spark-submit --conf spark.dynamicAllocation.enabled=true --num-executors 1149 --conf spark.driver.memoryOverhead=5120 --conf spark.executor.memoryOverhead=5120 --conf spark.driver.maxResultSize=2g --conf spark.sql.shuffle.partitions=1600 --conf spark.default.parallelism=1600 --executor-memory 19G --driver-memory 19G --executor-cores 3 --driver-cores 3 --class com.my.class path.to.jar <program args>
The size of the cluster is dynamically determined based on the size of the input data set, and the num-executors, spark.sql.shuffle.partitions, and spark.default.parallelism arguments are calculated based on the size of the cluster.
The code roughly does this:
val df = (read from s3 and add a few columns like timestamp and source file name)
val dfPartitioned = df.coalesce(numPartitions)
val sqlDFProdDedup = spark.sql(s""" (query to dedup against prod data) """)
sqlDFProdDedup.repartition($"partition_column")
  .write.partitionBy("partition_column")
  .mode(SaveMode.Append).parquet(outputPath)
When I look at the ganglia chart, I get a huge resource spike while the de-dup logic runs and some data shuffles, but then the actual writing of the data only uses a tiny fraction of the resources and runs for several hours.
I don't think the primary issue is partition skew, because the data should be fairly distributed across all the partitions.
The partition column is essentially a day of the month, so each job typically only has 5-20 partitions, depending on the span of the input data set. Each partition typically has about 100 GB of data across 10-20 parquet files.
I'm setting spark.sql.files.maxRecordsPerFile to manage the size of those output files.
So, my big question is: how can I improve the performance here?
Simply adding resources doesn't seem to help much.
I've tried making the executors larger (to reduce shuffling) and also to increase the number of CPUs per executor, but that doesn't seem to matter.
Thanks in advance!
Zack, I have a similar use case with 'n' times more files to process on a daily basis. I am going to assume that you are using the code above as is and trying to improve the performance of the overall job. Here are a couple of my observations:
I'm not sure what the coalesce(numPartitions) number actually is or why it's being used before the de-duplication process. Your spark-submit shows you are creating 1600 partitions, and that's good enough to start with.
If you are going to repartition before the write, then the coalesce above may not be beneficial at all, as repartition will shuffle the data anyway.
Since you say you are writing 10-20 parquet files, you are only using 10-20 cores in the last, writing part of your job, which is the main reason it's slow. Based on the 100 GB estimate, each parquet file is approximately 5 GB to 10 GB, which is really huge. I doubt anyone will be able to open them on a local laptop or EC2 machine unless they use EMR or similar (with huge executor memory if reading the whole file, or with spilling to disk), because the memory requirement will be too high. I recommend creating parquet files of around 1GB to avoid these issues.
Also, if you create 1GB parquet files, you will likely speed up the process 5 to 10 times, as you will be using more executors/cores to write them in parallel. You can run an experiment by simply writing the dataframe with its default partitions.
Which brings me to the point that you really don't need the repartition, since you already call write.partitionBy("partition_date"). Your repartition() call forces the dataframe to have at most 30-31 non-empty partitions, depending on the number of days in that month, and that is what drives the number of files being written. write.partitionBy("partition_date") is what actually lays the data out in S3 partitions, and if your dataframe has, say, 90 partitions, it will write 3 times faster (3 * 30). df.repartition() is slowing it down. Do you really need to have 5GB or larger files?
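To illustrate why repartitioning by the date column caps the number of busy write tasks, here is a small simulation in plain Scala. It is only a sketch: it uses the ordinary hashCode as a stand-in for Spark's own hash partitioning, and the extra "salt" value is my assumption for illustration, not something from the question.

```scala
// Hypothetical simulation; Spark's real partitioner uses a different hash function.
object WriteTaskCount {
  // Number of shuffle partitions that receive at least one key when keys are
  // hash-partitioned into numShufflePartitions buckets.
  def busyPartitions[K](keys: Seq[K], numShufflePartitions: Int): Int =
    keys.map(k => java.lang.Math.floorMod(k.##, numShufflePartitions)).distinct.size

  // One key per day of the month, as with repartition on the date column alone.
  val days: Seq[Int] = 1 to 30
  val byDay: Int = busyPartitions(days, 1600)

  // Adding a salt in 0..2 to the key gives up to three buckets per day.
  val saltedDays: Seq[(Int, Int)] = for (d <- days; s <- 0 until 3) yield (d, s)
  val byDayAndSalt: Int = busyPartitions(saltedDays, 1600)
}
```

Even with 1600 shuffle partitions, byDay can never exceed 30, so at most 30 tasks have anything to write. Partitioning by (day, salt) raises that ceiling to 90, at the cost of up to three output files per date directory. In Spark the equivalent would be passing both columns to repartition before write.partitionBy.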
Another major point is that Spark's lazy evaluation is sometimes too smart. In your case it will most likely size the parallelism of the whole program based on the repartition(number) call. Instead you should try df.cache() -> df.count() and then df.write(). This forces Spark to use all available executor cores. I am assuming you are reading the files in parallel. In your current implementation you are likely using only 20-30 cores. One more point: as you are using r4/r5 machines, feel free to raise your executor memory to 48G with 8 cores. I have found 8 cores to be faster for my task than the standard 5-core recommendation.
Another pointer is to try ParallelGC instead of G1GC. For a use case like this, where you are reading thousands of files, I have noticed it performs better than, or at least no worse than, G1GC. Please give it a try.
In my workload, I use coalesce(n) based approach where 'n' gives me a 1GB parquet file. I read files in parallel using ALL the cores available on the cluster. Only during the write part my cores are idle but there's not much you can do to avoid that.
I am not sure how spark.sql.files.maxRecordsPerFile works in conjunction with coalesce() or repartition() but I have found 1GB seems acceptable with pandas, Redshift spectrum, Athena etc.
Hope it helps.
Charu
Here are some optimizations for faster running.
(1) File committer - this is how Spark will write the part files out to the S3 bucket. Each operation is distinct and will be based upon
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
Description
This will write the files directly to part files instead of initially writing them to temp files and copying them over to their end-state part files.
(2) For file size, you can derive the number of records per file from the average number of bytes per record. Below I work out the number of bytes per record to get the number of records for 1024 MB. I would try it first with 1024 MB per partition, then move upwards.
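import org.apache.spark.util.SizeEstimator
// SizeEstimator reports the JVM heap footprint of the object it is given, so treat this as a rough figure
val numberBytes: Long = SizeEstimator.estimate(inputDF.rdd)
// Number of 1024 MB (1073741824-byte) chunks; at least 1 so the division below cannot fail on small inputs
val reduceBytesTo1024MB: Long = math.max(1L, numberBytes / (1024L * 1024L * 1024L))
val numberRecords: Long = inputDF.count
val recordsFor1024MB: Int = (numberRecords / reduceBytesTo1024MB).toInt + 1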
(3) [I haven't tried this] EMR committer - if you are using EMR 5.19 or higher, then since you are outputting Parquet you can set the Parquet optimized committer to true.
spark.sql.parquet.fs.optimized.committer.optimization-enabled true
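As a follow-up to point (2), the records-per-file calculation can be isolated into a small pure function, which makes it easy to check without a cluster. This is a sketch with names of my own choosing; it assumes you already know (or can estimate) the total data size in bytes and the total record count.

```scala
// Hypothetical helper for sizing output files by record count.
object FileSizing {
  val OneGiB: Long = 1024L * 1024L * 1024L

  // Number of records that fit into one file of targetFileBytes, given the
  // average record size implied by totalBytes / totalRecords.
  def recordsPerFile(
      totalBytes: Long,
      totalRecords: Long,
      targetFileBytes: Long = OneGiB): Long = {
    require(totalBytes > 0 && totalRecords > 0 && targetFileBytes > 0)
    // BigInt avoids overflow when multiplying two large Long values.
    val records = (BigInt(targetFileBytes) * totalRecords / totalBytes).toLong
    math.max(1L, records)
  }
}
```

For example, 10GiB spread over one million records gives FileSizing.recordsPerFile(10 * FileSizing.OneGiB, 1000000L) == 100000, a value that could then be used for spark.sql.files.maxRecordsPerFile, the setting mentioned in the question.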

How spark loads the data into memory

I am totally confused about the Spark execution process. I have referred to many articles and tutorials, but nobody discusses it in detail. I might be misunderstanding Spark. Please correct me.
I have my file of 40GB distributed across 4 nodes (10GB each node) of the 10 node cluster.
When I say spark.read.textFile("test.txt") in my code, will it load the data (40GB) from all 4 nodes into the driver program (master node)?
Or will this RDD be loaded on each of the 4 nodes separately? In that case, each node's RDD should hold 10GB of physical data, is that right?
And does each RDD hold 10GB of data and perform tasks for each partition (i.e. 128MB in Spark 2.0), and finally shuffle the output to the driver program (master node)?
And I read somewhere that "number of cores in the cluster = number of partitions". Does that mean Spark will move the partitions of one node to all 10 nodes for processing?
Spark doesn't have to read the whole file into memory at once. That 40GB file is split into many 128MB (or whatever your partition size is) partitions. Each of those partitions is a processing task. Each core will only work on one task at a time, with a preference for tasks whose data partition is stored on the same node. Only the 128MB partition that is being worked on needs to be read; the rest of the file is not read. Once the task completes (and produces some output), the 128MB for the next task can be read in and the data read in for the first task can be freed from memory. Because of this, only the small amount of data being processed at a time needs to be loaded into memory, not the entire file at once.
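The splitting described above can be sketched in a few lines of plain Scala. This is an illustration only: the Split case class and the splits function are invented names, and real input formats also take block boundaries and record boundaries into account.

```scala
// Hypothetical model of how a file is cut into fixed-size input splits.
object InputSplits {
  final case class Split(index: Int, offset: Long, length: Long)

  // Lazily enumerates the splits; each one corresponds to one task, and a task
  // only needs the byte range [offset, offset + length) of the file.
  def splits(fileBytes: Long, splitBytes: Long): Iterator[Split] =
    Iterator
      .iterate(0L)(_ + splitBytes)
      .takeWhile(_ < fileBytes)
      .zipWithIndex
      .map { case (offset, i) =>
        Split(i, offset, math.min(splitBytes, fileBytes - offset))
      }
}
```

For the 40GB file from the question and 128MB splits this yields 320 splits, so a core working on one task holds roughly 128MB of input at a time rather than 40GB.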
Also, strictly speaking, spark.read.textFile("test.txt") does nothing. It reads no data and does no processing. It creates an RDD (more precisely a Dataset[String]; sc.textFile is the call that returns an RDD), but an RDD doesn't contain any data. An RDD is just an execution plan. spark.read.textFile("test.txt") declares that the file test.txt will be read and used as a source of data if and when the RDD is evaluated, but it doesn't do anything on its own.
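The same "plan now, read later" behaviour can be seen in plain Scala with an Iterator, which is a loose analogy rather than Spark itself. The names below are made up for the illustration; the counter plays the role of "lines actually read from the file".

```scala
// Hypothetical analogy: a lazy pipeline over an in-memory "file".
object LazyPlanDemo {
  var linesRead: Int = 0

  // Stands in for the contents of test.txt.
  val source: Seq[String] = Seq("a", "bb", "ccc")

  // Declaring the pipeline reads nothing, like declaring a transformation.
  val plan: Iterator[Int] = source.iterator.map { line =>
    linesRead += 1
    line.length
  }
  val readBeforeAction: Int = linesRead

  // Consuming the pipeline is what triggers the reads, like an action.
  val total: Int = plan.sum
  val readAfterAction: Int = linesRead
}
```

Here readBeforeAction is 0 and readAfterAction is 3: the map call only described the work, and the sum is what actually pulled the lines through.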

How to distribute data to worker nodes

I have a general question regarding Apache Spark and how to distribute data from driver to executors.
I load a file with 'scala.io.Source' into a collection. Then I parallelize the collection with 'SparkContext.parallelize'. Here the issue begins: when I don't specify the number of partitions, the number of workers is used as the partition count, the task is sent to the nodes and I get the warning that the recommended task size is 100kB while my task size is e.g. 15MB (60MB file / 4 nodes). The computation then ends with an 'OutOfMemory' exception on the nodes. When I parallelize to more partitions (e.g. 600 partitions, to get 100kB per task), the computations are performed successfully on the workers, but an 'OutOfMemory' exception is raised after some time in the driver. In this case, I can open the Spark UI and observe how the driver's memory is slowly consumed during the computation. It looks like the driver holds everything in memory and doesn't store the intermediate results on disk.
My questions are:
Into how many partitions should the RDD be divided?
How to distribute data 'the right way'?
How to prevent memory exceptions?
Is there a way how to tell driver/worker to swap? Is it a configuration option or does it have to be done 'manually' in program code?
Thanks
How to distribute data 'the right way'?
You will need a distributed file system, such as HDFS, to host your file. That way, each worker can read a piece of the file in parallel. This will deliver better performance than serializing and sending the data from the driver.
How to prevent memory exceptions?
Hard to say without looking at the code. Most operations will spill to disk. If I had to guess, I'd say you are using groupByKey?
Into how many partitions should the RDD be divided?
I think the rule of thumb (for optimal parallelism) is 2-4x the number of cores available for your job. As you have done, you can trade time for memory usage.
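Both ways of picking a partition count mentioned in this thread (the 2-4x cores rule of thumb and the "keep each task near 100kB" approach from the question) are simple enough to write down. A sketch with made-up names, using decimal units so that the question's own numbers come out exactly:

```scala
// Hypothetical helpers for choosing a partition count.
object PartitionCount {
  // Rule of thumb: 2 to 4 partitions per available core.
  def byCores(totalCores: Int, factor: Int = 3): Int = {
    require(totalCores > 0 && factor >= 2 && factor <= 4)
    totalCores * factor
  }

  // Smallest partition count that keeps each task at or below maxTaskBytes.
  def byTaskSize(dataBytes: Long, maxTaskBytes: Long): Long = {
    require(dataBytes > 0 && maxTaskBytes > 0)
    (dataBytes + maxTaskBytes - 1) / maxTaskBytes
  }
}
```

For a 60MB collection and a 100kB task size, PartitionCount.byTaskSize(60000000L, 100000L) gives the 600 partitions tried in the question, while the cores-based rule with, say, 8 cores gives 16 to 32. The gap between the two is a hint that the data is better read from a distributed file system than shipped inside tasks.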
Is there a way how to tell driver/worker to swap? Is it a configuration option or does it have to be done 'manually' in program code?
Shuffle spill behavior is controlled by the property spark.shuffle.spill. It's true (= spill to disk) by default. (From Spark 1.6 on, this property is ignored and shuffles always spill to disk when necessary.)