Caching one big RDD or many small RDDs - scala

I have a large RDD (R) which i cut it into 20 chunks (C_1, C_2, ..., C_20) such that:
If the time it takes to cache only depends on the size of the RDD (e.g. 10 second per MB) then caching the individual chunks is better.
However, i suspect there is some additional overhead i'm not aware of, like seek time in case of persisting to disk.
So, my questions are:
Are there any additional overheads when writing to memory?
Is it better to cache (i.e. in memory) the large RDD (R) or the 20 individual chunks?
EDIT: To give some more context, i'm currently running the application on my computer but at the end it will run on a cluster consisting of 10 nodes, each of which has 8 cores. However, since we only have access to the cluster for a small amount of time, i wanted to already experiment locally on my computer.
From my understanding, the application won't need a lot of shuffling as i can partition it rather nicely, such that each chunk runs on a single node.
However, i'm still thinking about the partitioning, so it is not yet 100% decided.

Spark performs the computations in memory. So there is no real extra overhead when you cache data to memory. Caching to memory essentially says, reuse these intermediate results. The only issue that you can run into is having too much data in memory and then it spills to disk. There you will incur disk read time costs. unpersist() will be needed for swapping things out of memory as you get finished with the various intermediate results, if you run into memory limitations.
When determining where to cache your data you need to look at the flow of your data. If you read in a file and then filter it 3 times and write out each one of those filters separately, without caching you will end up reading in that file 3 times.
val data = spark.read.parquet("file:///testdata/").limit(100)
data.select("col1").write.parquet("file:///test1/")
data.select("col2").write.parquet("file:///test2/")
data.select("col3").write.parquet("file:///test3/")
If you read in the file, cache it, then you filter 3 times and write out the results. You will read in the file once and then write out each result.
val data = spark.read.parquet("file:///testdata/").limit(100).cache()
data.select("col1").write.parquet("file:///test4/")
data.select("col2").write.parquet("file:///test5/")
data.select("col3").write.parquet("file:///test6/")
The general test that you can use as to what to cache is, "Am I performing multiple actions on the same RDD?" If yes, cache it. In your example if you break the large RDD into chunks and the large RDD isn't cached you will most likely be recalculating the large RDD every time that you perform an action on it. Then if you don't cache the chunks and you perform multiple actions on those then you will have to recalculate those chunks every time.
Is it better to cache (i.e. in memory) the large RDD (R) or the 20 individual chunks?
So to answer that, it all depends on what you are doing with each intermediate result. It looks like you will definitely want to properly repartition your large RDD according to the number of executors and then cache it. Then, if you perform more than one action on each one of the chunks that you create from the large RDD, you may want to cache those.

Related

Spark/Synapse Optimal handling huge gzipped xml file (+600mb compressed size)

For a task we need to process huge transactional xml files which are gz(ipped). Each line in the uncompressed file can be interpreted as its own xml record.
When working with small files like 100 MiB this works fine. The moment à collect() is performed on the huge input file it tends to fail OOM and the jvm crashes.
As this is a compressed (gz) file it can not be processed in parallel (AFAIK).
I was thinking about
using the toLocalIterator() to split it first up into smaller packets of 200K xml entries which are distributed to the other nodes for their cost om processing. Apparently the toLocalIterator() does also the collect() first (to test)
Other option is to use the some kind of index value and filter on it ("index > 5000") and set the limit(5000) to simulate paging through the 2 Million or more entries.
But I have no clue to what I should pay attention to parralize. Any tips are welcome.
Settings to pay attention and how to apply them in Azure Synapse etc.
how to push the read xml over the nodes to be processed in their executor/tasks.
could streaming a single file be an option?
any tips are welcome
Currently my code is done in scala due the fact the java libraries are easily accessible to convert the xml to json and extract the values I need.
Many thanks in advance (also for reading this)
TL;DR suggestion:
Step 1. Increase driver memory and test
Step 2. Increase executor memory and test if first step fails
Slightly longer version:
The fact that it gives OOM on collect() operation doesn't indicate whether it is OOM on spark.read operation or df.collect()
Spark scheduler will run the DAG when it encounters an Action but not when it is a Transformation.
So if collect is your first action, it is at that point it actually runs the DAG and the OOM may even be on read but manifesting as OOM on collect
Spark UI will provide insights on where the OOM happens
You are right that uncompressing gzip wont be parallelised. On read operation, it will use a single executor and even a single core. So I would increase executor memory until there is sufficient memory to gzip the whole file into memory - not just to the exact file size, leave the usual 400MB / 0.7% buffer.
If the error is indeed happening on the collect() operation, then you need to sufficiently increase driver memory.
Your app will not be parallelised at read. Your app will not be parallelised at collect()
You app can be parallelised during the transformations in between them and you can force the parallelisation to the extent you want to tune it by repartitioning your dataframe / dataset / rdd further.
Finally, I would consider again whether you do need the collect or whether you can store the output as a number of partitioned files?
I think the zipped file is always going to be a bottleneck so one alternative is to unzip it and see if that helps. I would also consider loading up the xml to a table in Synapse (which can deal with gzipped files). This would have the effect of unzipping it, you could then pass it into a Synapse Notebook with the synapsesql method, eg in Scala:
// Get the table with the XML column from the database and expose as temp view
val df = spark.read.synapsesql("yourPool.dbo.someXMLTable")
df.createOrReplaceTempView("someXMLTable")
You could process the XML as I have done here and then write it back to the Synapse dedicated SQL pool as an internal table:
val df2 = spark.sql("""
SELECT
colA,
colB,
xpath_string(pkData,'/DataSet/EnumObject[name="Inpatient"]/value') xvalue
FROM someXMLTable
""")
// Write that dataframe back to the dedicated SQL pool
df2.write.synapsesql("yourPool.dbo.someXMLTable_processed", Constants.INTERNAL)
This would ensure you are keeping things in parallel, no collect required. NB there are a couple of assumptions in there around uploading the gzipped files to a dedicated SQL pool and that the xpath_string does what you need which need to be checked and confirmed. The proposed pattern:

Fixed size window for Apache Beam

How to define window of fixed size (fixed number of items) in Apache Beam?
I know that we have
(FixedWindows.of(Duration.standardMinutes(10))
but I do not care about time-only about number of items.
More details:
I am writing significant amount of data (53 gigabytes) to S3. Currently my proces uses
FileIO.<KV<...>writeDynamic()
.by(kv -> kv.getKey())
(grouping by key). This causes serve performance bottleneck, because of skewed key distribution. My total data size is 53Gb, but size of data for one key is 37Gb. This single key takes an hour to write (writing occurs on single executor, single thread, rest of cluster waits idle).
I do not need any special grouping. Ideally I want uniform distribution of data, so writing will happen concurrently and finish as soon as possible.
Guaranteeing exactly equal sized grouping is fairly hard, but you can get pretty close by using hashes of your data modulo some constant as the keys. For example:
FileIO.<KV<...>writeDynamic()
.by(kv -> kv.hashCode() % 530)
This will give roughly equal 100MB partitions.
Additionally, if you are using the DataflowRunner, you don't need to specify keys at all; the system will automatically group up the data, and dynamically rebalance the load to avoid stragglers. For this, use FileIO.write() instead of FileIO.writeDynamic().

Getting the count of records in a data frame quickly

I have a dataframe with as many as 10 million records. How can I get a count quickly? df.count is taking a very long time.
It's going to take so much time anyway. At least the first time.
One way is to cache the dataframe, so you will be able to more with it, other than count.
E.g
df.cache()
df.count()
Subsequent operations don't take much time.
The time it takes to count the records in a DataFrame depends on the power of the cluster and how the data is stored. Performance optimizations can make Spark counts very quick.
It's easier for Spark to perform counts on Parquet files than CSV/JSON files. Parquet files store counts in the file footer, so Spark doesn't need to read all the rows in the file and actually perform the count, it can just grab the footer metadata. CSV / JSON files don't have any such metadata.
If the data is stored in a Postgres database, then the count operation will be performed by Postgres and count execution time will be a function of the database performance.
Bigger clusters generally perform count operations faster (unless the data is skewed in a way that causes one node to do all the work, leaving the other nodes idle).
The snappy compression algorithm is generally faster than gzip cause it is splittable by Spark and faster to inflate.
approx_count_distinct that's powered by HyperLogLog under the hood will be more performant for distinct counts, at the cost of precision.
The other answer suggests caching before counting, which will actually slow down the count operation. Caching is an expensive operation that can take a lot more time that counting. Caching is an important performance optimization at times, but not if you just want a simple count.

How to distribute data to worker nodes

I have a general question regarding Apache Spark and how to distribute data from driver to executors.
I load a file with 'scala.io.Source' into collection. Then I parallelize the collection with 'SparkContext.parallelize'. Here begins the issue - when I don't specify the number of partitions, then the number of workers is used as the partitions value, task is sent to nodes and I got the warning that recommended task size is 100kB and my task size is e.g. 15MB (60MB file / 4 nodes). The computation then ends with 'OutOfMemory' exception on nodes. When I parallelize to more partitions (e.g. 600 partitions - to get the 100kB per task). The computations are performed successfully on workers but the 'OutOfMemory' exceptions is raised after some time in the driver. This case, I can open spark UI and observe how te memory of driver is slowly consumed during the computation. It looks like the driver holds everything in memory and doesn't store the intermediate results on disk.
My questions are:
Into how many partitions to divide RDD?
How to distribute data 'the right way'?
How to prevent memory exceptions?
Is there a way how to tell driver/worker to swap? Is it a configuration option or does it have to be done 'manually' in program code?
Thanks
How to distribute data 'the right way'?
You will need a distributed file system, such as HDFS, to host your file. That way, each worker can read a piece of the file in parallel. This will deliver better performance than serializing and the data.
How to prevent memory exceptions?
Hard to say without looking at the code. Most operations will spill to disk. If I had to guess, I'd say you are using groupByKey ?
Into how many partitions to divide RDD?
I think the rule of thumbs (for optimal parallelism) is 2-4x the amount of cores available for your job. As you have done, you can compromise time for memory usage.
Is there a way how to tell driver/worker to swap? Is it a configuration option or does it have to be done 'manually' in program code?
Shuffle spill behavior is controlled by the property spark.shuffle.spill. It's true (=spill to disk) by default.

Spark out of memory

I have a folder with 150 G of txt files (around 700 files, on average each 200 MB).
I'm using scala to process the files and calculate some aggregate statistics in the end. I see two possible approaches to do that:
manually loop through all the files, do the calculations per file and merge the results in the end
read the whole folder to one RDD, do all the operations on this single RDD and let spark do all the parallelization
I'm leaning towards the second approach as it seems cleaner (no need for parallelization specific code), but I'm wondering if my scenario will fit the constraints imposed by my hardware and data. I have one workstation with 16 threads and 64 GB of RAM available (so the parallelization will be strictly local between different processor cores). I might scale the infrastructure with more machines later on, but for now I would just like to focus on tunning the settings for this one workstation scenario.
The code I'm using:
- reads TSV files, and extracts meaningful data to (String, String, String) triplets
- afterwards some filtering, mapping and grouping is performed
- finally, the data is reduced and some aggregates are calculated
I've been able to run this code with a single file (~200 MB of data), however I get a java.lang.OutOfMemoryError: GC overhead limit exceeded
and/or a Java out of heap exception when adding more data (the application breaks with 6GB of data but I would like to use it with 150 GB of data).
I guess I would have to tune some parameters to make this work. I would appreciate any tips on how to approach this problem (how to debug for memory demands). I've tried increasing the 'spark.executor.memory' and using a smaller number of cores (the rational being that each core needs some heap space), but this didn't solve my problems.
I don't need the solution to be very fast (it can easily run for a few hours even days if needed). I'm also not caching any data, but just saving them to the file system in the end. If you think it would be more feasible to just go with the manual parallelization approach, I could do that as well.
Me and my team had processed a csv data sized over 1 TB over 5 machine #32GB of RAM each successfully. It depends heavily what kind of processing you're doing and how.
If you repartition an RDD, it requires additional computation that
has overhead above your heap size, try loading the file with more
paralelism by decreasing split-size in
TextInputFormat.SPLIT_MINSIZE and TextInputFormat.SPLIT_MAXSIZE
(if you're using TextInputFormat) to elevate the level of
paralelism.
Try using mapPartition instead of map so you can handle the
computation inside a partition. If the computation uses a temporary
variable or instance and you're still facing out of memory, try
lowering the number of data per partition (increasing the partition
number)
Increase the driver memory and executor memory limit using
"spark.executor.memory" and "spark.driver.memory" in spark
configuration before creating Spark Context
Note that Spark is a general-purpose cluster computing system so it's unefficient (IMHO) using Spark in a single machine
To add another perspective based on code (as opposed to configuration): Sometimes it's best to figure out at what stage your Spark application is exceeding memory, and to see if you can make changes to fix the problem. When I was learning Spark, I had a Python Spark application that crashed with OOM errors. The reason was because I was collecting all the results back in the master rather than letting the tasks save the output.
E.g.
for item in processed_data.collect():
print(item)
failed with OOM errors. On the other hand,
processed_data.saveAsTextFile(output_dir)
worked fine.
Yes, PySpark RDD/DataFrame collect() function is used to retrieve all the elements of the dataset (from all nodes) to the driver node. We should use the collect() on smaller dataset usually after filter(), group(), count() etc. Retrieving larger dataset results in out of memory.