Spark out of memory - scala

I have a folder with 150 G of txt files (around 700 files, on average each 200 MB).
I'm using scala to process the files and calculate some aggregate statistics in the end. I see two possible approaches to do that:
manually loop through all the files, do the calculations per file and merge the results in the end
read the whole folder to one RDD, do all the operations on this single RDD and let spark do all the parallelization
I'm leaning towards the second approach as it seems cleaner (no need for parallelization specific code), but I'm wondering if my scenario will fit the constraints imposed by my hardware and data. I have one workstation with 16 threads and 64 GB of RAM available (so the parallelization will be strictly local between different processor cores). I might scale the infrastructure with more machines later on, but for now I would just like to focus on tunning the settings for this one workstation scenario.
The code I'm using:
- reads TSV files, and extracts meaningful data to (String, String, String) triplets
- afterwards some filtering, mapping and grouping is performed
- finally, the data is reduced and some aggregates are calculated
I've been able to run this code with a single file (~200 MB of data), however I get a java.lang.OutOfMemoryError: GC overhead limit exceeded
and/or a Java out of heap exception when adding more data (the application breaks with 6GB of data but I would like to use it with 150 GB of data).
I guess I would have to tune some parameters to make this work. I would appreciate any tips on how to approach this problem (how to debug for memory demands). I've tried increasing the 'spark.executor.memory' and using a smaller number of cores (the rational being that each core needs some heap space), but this didn't solve my problems.
I don't need the solution to be very fast (it can easily run for a few hours even days if needed). I'm also not caching any data, but just saving them to the file system in the end. If you think it would be more feasible to just go with the manual parallelization approach, I could do that as well.

Me and my team had processed a csv data sized over 1 TB over 5 machine #32GB of RAM each successfully. It depends heavily what kind of processing you're doing and how.
If you repartition an RDD, it requires additional computation that
has overhead above your heap size, try loading the file with more
paralelism by decreasing split-size in
TextInputFormat.SPLIT_MINSIZE and TextInputFormat.SPLIT_MAXSIZE
(if you're using TextInputFormat) to elevate the level of
paralelism.
Try using mapPartition instead of map so you can handle the
computation inside a partition. If the computation uses a temporary
variable or instance and you're still facing out of memory, try
lowering the number of data per partition (increasing the partition
number)
Increase the driver memory and executor memory limit using
"spark.executor.memory" and "spark.driver.memory" in spark
configuration before creating Spark Context
Note that Spark is a general-purpose cluster computing system so it's unefficient (IMHO) using Spark in a single machine

To add another perspective based on code (as opposed to configuration): Sometimes it's best to figure out at what stage your Spark application is exceeding memory, and to see if you can make changes to fix the problem. When I was learning Spark, I had a Python Spark application that crashed with OOM errors. The reason was because I was collecting all the results back in the master rather than letting the tasks save the output.
E.g.
for item in processed_data.collect():
print(item)
failed with OOM errors. On the other hand,
processed_data.saveAsTextFile(output_dir)
worked fine.

Yes, PySpark RDD/DataFrame collect() function is used to retrieve all the elements of the dataset (from all nodes) to the driver node. We should use the collect() on smaller dataset usually after filter(), group(), count() etc. Retrieving larger dataset results in out of memory.

Related

Spark/Synapse Optimal handling huge gzipped xml file (+600mb compressed size)

For a task we need to process huge transactional xml files which are gz(ipped). Each line in the uncompressed file can be interpreted as its own xml record.
When working with small files like 100 MiB this works fine. The moment à collect() is performed on the huge input file it tends to fail OOM and the jvm crashes.
As this is a compressed (gz) file it can not be processed in parallel (AFAIK).
I was thinking about
using the toLocalIterator() to split it first up into smaller packets of 200K xml entries which are distributed to the other nodes for their cost om processing. Apparently the toLocalIterator() does also the collect() first (to test)
Other option is to use the some kind of index value and filter on it ("index > 5000") and set the limit(5000) to simulate paging through the 2 Million or more entries.
But I have no clue to what I should pay attention to parralize. Any tips are welcome.
Settings to pay attention and how to apply them in Azure Synapse etc.
how to push the read xml over the nodes to be processed in their executor/tasks.
could streaming a single file be an option?
any tips are welcome
Currently my code is done in scala due the fact the java libraries are easily accessible to convert the xml to json and extract the values I need.
Many thanks in advance (also for reading this)
TL;DR suggestion:
Step 1. Increase driver memory and test
Step 2. Increase executor memory and test if first step fails
Slightly longer version:
The fact that it gives OOM on collect() operation doesn't indicate whether it is OOM on spark.read operation or df.collect()
Spark scheduler will run the DAG when it encounters an Action but not when it is a Transformation.
So if collect is your first action, it is at that point it actually runs the DAG and the OOM may even be on read but manifesting as OOM on collect
Spark UI will provide insights on where the OOM happens
You are right that uncompressing gzip wont be parallelised. On read operation, it will use a single executor and even a single core. So I would increase executor memory until there is sufficient memory to gzip the whole file into memory - not just to the exact file size, leave the usual 400MB / 0.7% buffer.
If the error is indeed happening on the collect() operation, then you need to sufficiently increase driver memory.
Your app will not be parallelised at read. Your app will not be parallelised at collect()
You app can be parallelised during the transformations in between them and you can force the parallelisation to the extent you want to tune it by repartitioning your dataframe / dataset / rdd further.
Finally, I would consider again whether you do need the collect or whether you can store the output as a number of partitioned files?
I think the zipped file is always going to be a bottleneck so one alternative is to unzip it and see if that helps. I would also consider loading up the xml to a table in Synapse (which can deal with gzipped files). This would have the effect of unzipping it, you could then pass it into a Synapse Notebook with the synapsesql method, eg in Scala:
// Get the table with the XML column from the database and expose as temp view
val df = spark.read.synapsesql("yourPool.dbo.someXMLTable")
df.createOrReplaceTempView("someXMLTable")
You could process the XML as I have done here and then write it back to the Synapse dedicated SQL pool as an internal table:
val df2 = spark.sql("""
SELECT
colA,
colB,
xpath_string(pkData,'/DataSet/EnumObject[name="Inpatient"]/value') xvalue
FROM someXMLTable
""")
// Write that dataframe back to the dedicated SQL pool
df2.write.synapsesql("yourPool.dbo.someXMLTable_processed", Constants.INTERNAL)
This would ensure you are keeping things in parallel, no collect required. NB there are a couple of assumptions in there around uploading the gzipped files to a dedicated SQL pool and that the xpath_string does what you need which need to be checked and confirmed. The proposed pattern:

How to optimize Spark for writing large amounts of data to S3

I do a fair amount of ETL using Apache Spark on EMR.
I'm fairly comfortable with most of the tuning necessary to get good performance, but I have one job that I can't seem to figure out.
Basically, I'm taking about 1 TB of parquet data - spread across tens of thousands of files in S3 - and adding a few columns and writing it out partitioned by one of the date attributes of the data - again, parquet formatted in S3.
I run like this:
spark-submit --conf spark.dynamicAllocation.enabled=true --num-executors 1149 --conf spark.driver.memoryOverhead=5120 --conf spark.executor.memoryOverhead=5120 --conf spark.driver.maxResultSize=2g --conf spark.sql.shuffle.partitions=1600 --conf spark.default.parallelism=1600 --executor-memory 19G --driver-memory 19G --executor-cores 3 --driver-cores 3 --class com.my.class path.to.jar <program args>
The size of the cluster is dynamically determined based on the size of the input data set, and the num-executors, spark.sql.shuffle.partitions, and spark.default.parallelism arguments are calculated based on the size of the cluster.
The code roughly does this:
va df = (read from s3 and add a few columns like timestamp and source file name)
val dfPartitioned = df.coalesce(numPartitions)
val sqlDFProdDedup = spark.sql(s""" (query to dedup against prod data """);
sqlDFProdDedup.repartition($"partition_column")
.write.partitionBy("partition_column")
.mode(SaveMode.Append).parquet(outputPath)
When I look at the ganglia chart, I get a huge resource spike while the de-dup logic runs and some data shuffles, but then the actual writing of the data only uses a tiny fraction of the resources and runs for several hours.
I don't think the primary issue is partition skew, because the data should be fairly distributed across all the partitions.
The partition column is essentially a day of the month, so each job typically only has 5-20 partitions, depending on the span of the input data set. Each partition typically has about 100 GB of data across 10-20 parquet files.
I'm setting spark.sql.files.maxRecordsPerFile to manage the size of those output files.
So, my big question is: how can I improve the performance here?
Simply adding resources doesn't seem to help much.
I've tried making the executors larger (to reduce shuffling) and also to increase the number of CPUs per executor, but that doesn't seem to matter.
Thanks in advance!
Zack, I have a similar use case with 'n' times more files to process on a daily basis. I am going to assume that you are using the code above as is and trying to improve the performance of the overall job. Here are couple of my observations:
Not sure what the coalesce(numPartitions) number actually is and why its being used before de-duplication process. Your spark-submit shows you are creating 1600 partitions and thats good enough to start with.
If you are going to repartition before write then the coalesce above may not be beneficial at all as re-partition will shuffle data.
Since you claim writing 10-20 parquet files it means you are only using 10-20 cores in writing in the last part of your job which is the main reason its slow. Based on 100 GB estimate the parquet file ranges from approx 5GB to 10 GB, which is really huge and I doubt one will be able to open them on their local laptop or EC2 machine unless they use EMR or similar (with huge executor memory if reading whole file or spill to disk) because the memory requirement will be too high. I will recommend creating parquet files of around 1GB to avoid any of those issues.
Also if you create 1GB parquet file, you will likely speed up the process 5 to 10 times as you will be using more executors/cores to write them in parallel. You can actually run an experiment by simply writing the dataframe with default partitions.
Which brings me to the point that you really don't need to use re-partition as you want to write.partitionBy("partition_date") call. Your repartition() call is actually forcing the dataframe to only have max 30-31 partitions depending upon the number of days in that month which is what is driving the number of files being written. The write.partitionBy("partition_date") is actually writing the data in S3 partition and if your dataframe has say 90 partitions it will write 3 times faster (3 *30). df.repartition() is forcing it to slow it down. Do you really need to have 5GB or larger files?
Another major point is that Spark lazy evaluation is sometimes too smart. In your case it will most likely only use the number of executors for the whole program based on the repartition(number). Instead you should try, df.cache() -> df.count() and then df.write(). What this does is that it forces spark to use all available executor cores. I am assuming you are reading files in parallel. In your current implementation you are likely using 20-30 cores. One point of caution, as you are using r4/r5 machines, feel free to up your executor memory to 48G with 8 cores. I have found 8cores to be faster for my task instead of standard 5 cores recommendation.
Another pointer is to try ParallelGC instead of G1GC. For the use case like this when you are reading 1000x of files, I have noticed it performs better or not any worse than G1Gc. Please give it a try.
In my workload, I use coalesce(n) based approach where 'n' gives me a 1GB parquet file. I read files in parallel using ALL the cores available on the cluster. Only during the write part my cores are idle but there's not much you can do to avoid that.
I am not sure how spark.sql.files.maxRecordsPerFile works in conjunction with coalesce() or repartition() but I have found 1GB seems acceptable with pandas, Redshift spectrum, Athena etc.
Hope it helps.
Charu
Here are some optimizations for faster running.
(1) File committer - this is how Spark will read the part files out to the S3 bucket. Each operation is distinct and will be based upon
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
Description
This will write the files directly to part files instead or initially loading them to temp files and copying them over to their end-state part files.
(2) For file size you can derive it based upon getting the average number of bytes per record. Below I am figuring out the number of bytes per record to figure the number of records for 1024 MBs. I would try it first with 1024MBs per partition, then move upwards.
import org.apache.spark.util.SizeEstimator
val numberBytes : Long = SizeEstimator.estimate(inputDF.rdd)
val reduceBytesTo1024MB = numberBytes/123217728
val numberRecords = inputDF.count
val recordsFor1024MB = (numberRecords/reduceBytesTo1024MB).toInt + 1
(3) [I haven't tried this] EMR Committer - if you are using EMR 5.19 or higher, since you are outputting Parquet. You can set the Parquet optimized writer to TRUE.
spark.sql.parquet.fs.optimized.committer.optimization-enabled true

Caching one big RDD or many small RDDs

I have a large RDD (R) which i cut it into 20 chunks (C_1, C_2, ..., C_20) such that:
If the time it takes to cache only depends on the size of the RDD (e.g. 10 second per MB) then caching the individual chunks is better.
However, i suspect there is some additional overhead i'm not aware of, like seek time in case of persisting to disk.
So, my questions are:
Are there any additional overheads when writing to memory?
Is it better to cache (i.e. in memory) the large RDD (R) or the 20 individual chunks?
EDIT: To give some more context, i'm currently running the application on my computer but at the end it will run on a cluster consisting of 10 nodes, each of which has 8 cores. However, since we only have access to the cluster for a small amount of time, i wanted to already experiment locally on my computer.
From my understanding, the application won't need a lot of shuffling as i can partition it rather nicely, such that each chunk runs on a single node.
However, i'm still thinking about the partitioning, so it is not yet 100% decided.
Spark performs the computations in memory. So there is no real extra overhead when you cache data to memory. Caching to memory essentially says, reuse these intermediate results. The only issue that you can run into is having too much data in memory and then it spills to disk. There you will incur disk read time costs. unpersist() will be needed for swapping things out of memory as you get finished with the various intermediate results, if you run into memory limitations.
When determining where to cache your data you need to look at the flow of your data. If you read in a file and then filter it 3 times and write out each one of those filters separately, without caching you will end up reading in that file 3 times.
val data = spark.read.parquet("file:///testdata/").limit(100)
data.select("col1").write.parquet("file:///test1/")
data.select("col2").write.parquet("file:///test2/")
data.select("col3").write.parquet("file:///test3/")
If you read in the file, cache it, then you filter 3 times and write out the results. You will read in the file once and then write out each result.
val data = spark.read.parquet("file:///testdata/").limit(100).cache()
data.select("col1").write.parquet("file:///test4/")
data.select("col2").write.parquet("file:///test5/")
data.select("col3").write.parquet("file:///test6/")
The general test that you can use as to what to cache is, "Am I performing multiple actions on the same RDD?" If yes, cache it. In your example if you break the large RDD into chunks and the large RDD isn't cached you will most likely be recalculating the large RDD every time that you perform an action on it. Then if you don't cache the chunks and you perform multiple actions on those then you will have to recalculate those chunks every time.
Is it better to cache (i.e. in memory) the large RDD (R) or the 20 individual chunks?
So to answer that, it all depends on what you are doing with each intermediate result. It looks like you will definitely want to properly repartition your large RDD according to the number of executors and then cache it. Then, if you perform more than one action on each one of the chunks that you create from the large RDD, you may want to cache those.

Why does Spark RDD partition has 2GB limit for HDFS?

I have get an error when using mllib RandomForest to train data. As my dataset is huge and the default partition is relative small. so an exception thrown indicating that "Size exceeds Integer.MAX_VALUE" ,the orignal stack trace as following,
15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in
stage 6.0 (TID 120, 10.215.149.47):
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828) at
org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123) at
org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132) at
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
at
org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
at
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
The Integer.MAX_SIZE is 2GB, it seems that some partition out of memory. So i repartiton my rdd partition to 1000, so that each partition could hold far less data as before. Finally, the problem is solved!!!
So, my question is :
Why partition size has the 2G limit? It seems that there is no configure set for the limit in the spark
The basic abstraction for blocks in spark is a ByteBuffer, which unfortunately has a limit of Integer.MAX_VALUE (~2GB).
It is a critical issue which prevents use of spark with very large datasets.
Increasing the number of partitions can resolve it (like in OP's case), but is not always feasible, for instance when there is large chain of transformations part of which can increase data (flatMap etc) or in cases where data is skewed.
The solution proposed is to come up with an abstraction like LargeByteBuffer which can support list of bytebuffers for a block. This impacts overall spark architecture, so it has remained unresolved for quite a while.
the problem is when using datastores like Casandra, HBase, or Accumulo the block size is based on the datastore splits (which can be over 10 gig). when loading data from these datastores you have to immediately repartitions with 1000s of partitions so you can operated the data without blowing the 2gig limit.
most people that use spark are not really using large data; to them if it is bigger that excel can hold or tableau is is big data to them; mostly data scientist who use quality data or use a sample size small enough to work with the limit.
when processing large volumes of data, i end of having to go back to mapreduce and only used spark once the data has been cleaned up. this is unfortunate however, the majority of the spark community has no interest in addressing the issue.
a simple solution would be to create an abstraction and use bytearray as default; however, allow to overload a spark job with an 64bit data pointer to handle the large jobs.
The Spark 2.4.0 release removes this limit by replicating block data as a stream. See Spark-24926 for details.

How to distribute data to worker nodes

I have a general question regarding Apache Spark and how to distribute data from driver to executors.
I load a file with 'scala.io.Source' into collection. Then I parallelize the collection with 'SparkContext.parallelize'. Here begins the issue - when I don't specify the number of partitions, then the number of workers is used as the partitions value, task is sent to nodes and I got the warning that recommended task size is 100kB and my task size is e.g. 15MB (60MB file / 4 nodes). The computation then ends with 'OutOfMemory' exception on nodes. When I parallelize to more partitions (e.g. 600 partitions - to get the 100kB per task). The computations are performed successfully on workers but the 'OutOfMemory' exceptions is raised after some time in the driver. This case, I can open spark UI and observe how te memory of driver is slowly consumed during the computation. It looks like the driver holds everything in memory and doesn't store the intermediate results on disk.
My questions are:
Into how many partitions to divide RDD?
How to distribute data 'the right way'?
How to prevent memory exceptions?
Is there a way how to tell driver/worker to swap? Is it a configuration option or does it have to be done 'manually' in program code?
Thanks
How to distribute data 'the right way'?
You will need a distributed file system, such as HDFS, to host your file. That way, each worker can read a piece of the file in parallel. This will deliver better performance than serializing and the data.
How to prevent memory exceptions?
Hard to say without looking at the code. Most operations will spill to disk. If I had to guess, I'd say you are using groupByKey ?
Into how many partitions to divide RDD?
I think the rule of thumbs (for optimal parallelism) is 2-4x the amount of cores available for your job. As you have done, you can compromise time for memory usage.
Is there a way how to tell driver/worker to swap? Is it a configuration option or does it have to be done 'manually' in program code?
Shuffle spill behavior is controlled by the property spark.shuffle.spill. It's true (=spill to disk) by default.