How to distribute data to worker nodes - scala

I have a general question regarding Apache Spark and how to distribute data from driver to executors.
I load a file with 'scala.io.Source' into collection. Then I parallelize the collection with 'SparkContext.parallelize'. Here begins the issue - when I don't specify the number of partitions, then the number of workers is used as the partitions value, task is sent to nodes and I got the warning that recommended task size is 100kB and my task size is e.g. 15MB (60MB file / 4 nodes). The computation then ends with 'OutOfMemory' exception on nodes. When I parallelize to more partitions (e.g. 600 partitions - to get the 100kB per task). The computations are performed successfully on workers but the 'OutOfMemory' exceptions is raised after some time in the driver. This case, I can open spark UI and observe how te memory of driver is slowly consumed during the computation. It looks like the driver holds everything in memory and doesn't store the intermediate results on disk.
My questions are:
Into how many partitions to divide RDD?
How to distribute data 'the right way'?
How to prevent memory exceptions?
Is there a way how to tell driver/worker to swap? Is it a configuration option or does it have to be done 'manually' in program code?
Thanks

How to distribute data 'the right way'?
You will need a distributed file system, such as HDFS, to host your file. That way, each worker can read a piece of the file in parallel. This will deliver better performance than serializing and the data.
How to prevent memory exceptions?
Hard to say without looking at the code. Most operations will spill to disk. If I had to guess, I'd say you are using groupByKey ?
Into how many partitions to divide RDD?
I think the rule of thumbs (for optimal parallelism) is 2-4x the amount of cores available for your job. As you have done, you can compromise time for memory usage.
Is there a way how to tell driver/worker to swap? Is it a configuration option or does it have to be done 'manually' in program code?
Shuffle spill behavior is controlled by the property spark.shuffle.spill. It's true (=spill to disk) by default.

Related

Flink reduce shuffling overhead and memory consumption

My Flink job is frequently going OOM with one or the other task manager. I have enough memory and storage for my job (2 JobManagers/16 TaskManagers - each with 15core and 63GB RAM). Sometimes the job runs 4 days before throwing OOM, sometimes job goes into OOM in 2 days. But the traffic is consistent compared to previous days.
I have a received a suggestion not to pass through objects in streaming pipeline and instead use primitives to reduce shuffling overhead and memory consumption.
The flink job I work is written in Java. Lets say below is my pipeline
Kafka source
deserialize (converted bytes to java object, the object contains String, int, long types)
FirstKeyedWindow (the above serialized java objects received here)
reduce
SecondKeyedWindow (the above reduced java objects received here)
reduce
Kafka sink (above java objects are serialized into bytes and are produced to kafka)
My question is what all should I consider to reduce the overhead and memory consumption?
Will replacing String with char array helps reduce overhead a bit? or
Should I only deal with bytes all through the pipeline?
If I serialize the object between the KeyedWindows, will it help reduce the overhead? but if I have to read the bytes back, then I need to deserialize, use as required and then serialize it. Wouldn't it create more overhead of serializing/deserializing?
Appreciate your suggestions. Headsup, I am talking about 10TB of data received per day.
Update 1:
The exception I see for OOM is as below:
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'host/host:port'. This might indicate that the remote task manager was lost.
Answering to David Anderson comments below:
The Flink version used is v1.11 The state backend used is RocksDB, file system based. The job is running out of heap memory. Each message from Kafka source is sized up-to 300Bytes.
The reduce function does deduplication (removes duplicates within the same group), the second reduce function does aggregation (updates the count within the object).
Update 2:
After thorough exploration, I found that Flink uses Kyro default serializer which is inefficient. I understood custom_serializers can help reduce overhead if we define one instead of using Kyro default. I am now trying out google-protobuf to see if it performs any better.
And, I am also looking forward to increase taskmanager.network.memory.fraction which suits to my job parallelism. Yet to find out the right calculation to set the above configuration.
I am answering my own question here after what I tried has worked for me. I have found extra metrics in Grafana that is tied to my Flink job. Two of the metrics are GC time and GC count. I have seen some good spikes in GC (Garbage Collection) metrics. The reason for that could possibly be is, I have some new object creations going in the job pipeline. And considering the TBs of data I am dealing with and 20 Billion records per day, this object creations went haywire. I have optimized it to reuse the objects as much as I can and that reduced the amount of memory consumption.
And I have increased the taskmanager.network.memory to the required value which is set to 1GB default.
In my question above, I talked about custom serializers to reduce network overhead. I tried implementing protobuf serializer with Kyro and the protobuf generated classes are final. If I have to update the objects, I have to create new objects which will create spikes in GC metrics. So, avoided using it. May be I can further change the protobuf generate classes to suit my needs. Will consider that step if things are inconsistent.

How spark loads the data into memory

I have total confusion in the spark execution process. I have referred may articles and tutorials, nobody is discussing in detailed. I might be wrongly understanding spark. Please correct me.
I have my file of 40GB distributed across 4 nodes (10GB each node) of the 10 node cluster.
When I say spark.read.textFile("test.txt") in my code, will it load data(40GB) from all the 4 nodes into driver program (master node)?
Or this RDD will be loaded in all the 4 nodes separately. In that case, each node RDD should hold 10GB of physical data, is it?
And the whole RDD holds 10GB data and perform tasks for each partition i.e 128MB in spark 2.0. And finally shuffles the output to the driver program (master node)
And I read somewhere "numbers of cores in Cluster = no. of partitions" does it mean, the spark will move the partitions of one node to all 10 nodes for processing?
Spark doesn't have to read the whole file into memory at once. That 40GB file is split into many 128MB (or whatever your partition size is) partitions. Each of those partitions is a processing task. Each core will only work on one task at a time, with a preference to work on tasks where the data partition is stored on the same node. Only the 128MB partition that is being worked on needs to be read, the rest of the file is not read. Once the task completes (and produces some output) then the 128MB for the next task cab be read in and the data read in for the first task can be freed from memory. Because of this only the small amount of data being processed at a time needs to be loaded in to memory and not the entire file at once.
Also strictly speaking spark.read.textFile("test.txt") does nothing. It reads no data and does no processing. It creates an RDD but an RDD doesn't contain any data. And RDD is just an execution plan. spark.read.textFile("test.txt") declared that the file test.txt will be read an used as a source of data if and when the RDD is evaluated but doesn't do anything on its own.

ChannelOutboundBuffer allocating more than 2GB and OOMing my executors

I'm running spark on a hadoop cluster. During a (huge) shuffle phase, some executors are crashing due to Out Of Memory.
While investigating, I see that the OOM is triggered in a netty thread, due to a ChannelOutboundBuffer trying to grow to really huge values (I think it's 2GB and trying to double it).
I have looked at some source code for Spark and Netty and have the following questions:
is it allowed for a ChannelOutboundBuffer to grow that big? (ie over 2GB)
during a shuffle, what is supposed to be the size of the outbound buffers that are generated for netty ? According to this it seems that one buffer will contain all the data is needed by the executor requesting the shuffle (so no split even if this is a lot of data). Is this correct?
Even if all the data is put into a single netty buffer, I'm assuming that netty doesn't require to have a full copy of the data in it ChannelOutboundBuffer before sending it, as it could stream it. In that case why would it try to cache so much of the data in the ChannelOutboundBuffer before sending it?

Scala concurrency performance issues

I have a data mining app.
There is 1 Mining Actor which receives and processes a Json containing 1000 objects. I put this into a list and foreach, I log the data by sending it to 1 Logger Actor which logs data into many files.
Processing the list sequentially, my app uses 700MB and takes ~15 seconds of 20% cpu power to process (4 core cpu). When I parallelize the list, my app uses 2GB and ~ the same amount of time and cpu to process.
My questions are:
Since I parallelized the list and thus the computation, shouldn't the compute-time decrease?
I think having only one Logger Actor is a bottleneck in this case. The computation may be faster but the bottleneck hides the speed increase. So if I add more Loggers to the pool, the app time should decrease?
Why does the memory usage jump to 2GB? Does the JVM have to store the entire collection in memory to parallelize it? And after the computation is done, the JVM garbage collector should deal with it?
Without more details, any answer is a guess. However, even a guess might point you to the right direction.
Parallelized execution should decrease the running time but your problem might lie elsewhere. For some reason, your CPU is idling a lot even in the single-threaded mode. You do not specify whether you read the input from disk or the network or where you write your output to. You explicitly say that you write logs to a lot of files. Disk and network reading/writing might in your case take much longer than data processing. Most probably your process is idle due to this I/O waiting. You should not expect any speedups from parallelizing a job that spends 80% of its time waiting on I/O. I therefore also suspect that loggers are not the bottleneck here.
The memory usage might jump if your threads allocate a lot of memory each. In that case, the more threads you have the more memory will be required. I don't know what kind of collection you are parallelizing on, but most are stored in memory, completely. Yes, the garbage collector will free any resources that do not require you to explicitly free them, such as files.
How many threads for reading and writing to the hard disk?
The memory increases because I send messages faster than the Logger can write, so the Mailbox balloons in size until the Logger has processed the messages and the GC kicks in.
I solved this by writing state to a protocol buffer file. Before doing any writes, I compare with the protobuf file because reads are significantly cheaper than writes. My resource usage is now 10% for 2 seconds, and less than 400MB RAM.

Spark out of memory

I have a folder with 150 G of txt files (around 700 files, on average each 200 MB).
I'm using scala to process the files and calculate some aggregate statistics in the end. I see two possible approaches to do that:
manually loop through all the files, do the calculations per file and merge the results in the end
read the whole folder to one RDD, do all the operations on this single RDD and let spark do all the parallelization
I'm leaning towards the second approach as it seems cleaner (no need for parallelization specific code), but I'm wondering if my scenario will fit the constraints imposed by my hardware and data. I have one workstation with 16 threads and 64 GB of RAM available (so the parallelization will be strictly local between different processor cores). I might scale the infrastructure with more machines later on, but for now I would just like to focus on tunning the settings for this one workstation scenario.
The code I'm using:
- reads TSV files, and extracts meaningful data to (String, String, String) triplets
- afterwards some filtering, mapping and grouping is performed
- finally, the data is reduced and some aggregates are calculated
I've been able to run this code with a single file (~200 MB of data), however I get a java.lang.OutOfMemoryError: GC overhead limit exceeded
and/or a Java out of heap exception when adding more data (the application breaks with 6GB of data but I would like to use it with 150 GB of data).
I guess I would have to tune some parameters to make this work. I would appreciate any tips on how to approach this problem (how to debug for memory demands). I've tried increasing the 'spark.executor.memory' and using a smaller number of cores (the rational being that each core needs some heap space), but this didn't solve my problems.
I don't need the solution to be very fast (it can easily run for a few hours even days if needed). I'm also not caching any data, but just saving them to the file system in the end. If you think it would be more feasible to just go with the manual parallelization approach, I could do that as well.
Me and my team had processed a csv data sized over 1 TB over 5 machine #32GB of RAM each successfully. It depends heavily what kind of processing you're doing and how.
If you repartition an RDD, it requires additional computation that
has overhead above your heap size, try loading the file with more
paralelism by decreasing split-size in
TextInputFormat.SPLIT_MINSIZE and TextInputFormat.SPLIT_MAXSIZE
(if you're using TextInputFormat) to elevate the level of
paralelism.
Try using mapPartition instead of map so you can handle the
computation inside a partition. If the computation uses a temporary
variable or instance and you're still facing out of memory, try
lowering the number of data per partition (increasing the partition
number)
Increase the driver memory and executor memory limit using
"spark.executor.memory" and "spark.driver.memory" in spark
configuration before creating Spark Context
Note that Spark is a general-purpose cluster computing system so it's unefficient (IMHO) using Spark in a single machine
To add another perspective based on code (as opposed to configuration): Sometimes it's best to figure out at what stage your Spark application is exceeding memory, and to see if you can make changes to fix the problem. When I was learning Spark, I had a Python Spark application that crashed with OOM errors. The reason was because I was collecting all the results back in the master rather than letting the tasks save the output.
E.g.
for item in processed_data.collect():
print(item)
failed with OOM errors. On the other hand,
processed_data.saveAsTextFile(output_dir)
worked fine.
Yes, PySpark RDD/DataFrame collect() function is used to retrieve all the elements of the dataset (from all nodes) to the driver node. We should use the collect() on smaller dataset usually after filter(), group(), count() etc. Retrieving larger dataset results in out of memory.