Flink reduce shuffling overhead and memory consumption - apache-kafka

My Flink job frequently goes OOM on one task manager or another. I have enough memory and storage for the job (2 JobManagers and 16 TaskManagers, each with 15 cores and 63GB RAM). Sometimes the job runs for 4 days before throwing an OOM, sometimes it fails after 2 days, even though the traffic is consistent with previous days.
I have received a suggestion not to pass objects through the streaming pipeline and to use primitives instead, to reduce shuffling overhead and memory consumption.
The Flink job I work on is written in Java. Let's say my pipeline is as below:
Kafka source
deserialize (converts bytes to a Java object; the object contains String, int and long fields)
FirstKeyedWindow (receives the deserialized Java objects from above)
reduce
SecondKeyedWindow (the above reduced java objects received here)
reduce
Kafka sink (above java objects are serialized into bytes and are produced to kafka)
My question is: what should I consider to reduce the overhead and memory consumption?
Will replacing String with a char array help reduce the overhead a bit? Or
Should I only deal with bytes all through the pipeline?
If I serialize the object between the KeyedWindows, will it help reduce the overhead? but if I have to read the bytes back, then I need to deserialize, use as required and then serialize it. Wouldn't it create more overhead of serializing/deserializing?
Appreciate your suggestions. Heads up: I am talking about 10TB of data received per day.
Update 1:
The exception I see for OOM is as below:
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'host/host:port'. This might indicate that the remote task manager was lost.
Answering David Anderson's comments below:
The Flink version used is v1.11. The state backend used is RocksDB, file system based. The job is running out of heap memory. Each message from the Kafka source is up to 300 bytes in size.
The first reduce function does deduplication (removes duplicates within the same group), the second reduce function does aggregation (updates the count within the object).
Update 2:
After thorough exploration, I found that Flink falls back to the default Kryo serializer, which is inefficient. I understand that custom serializers can help reduce the overhead if we define one instead of using the Kryo default. I am now trying out google-protobuf to see if it performs any better.
I am also planning to increase taskmanager.network.memory.fraction to suit my job's parallelism. I have yet to find the right calculation for setting this configuration.

I am answering my own question here, since what I tried has worked for me. I found extra metrics in Grafana tied to my Flink job, two of which are GC time and GC count. I saw sizeable spikes in these GC (garbage collection) metrics. The likely reason is that the job pipeline was creating new objects; with the TBs of data and 20 billion records per day I am dealing with, this object creation went haywire. I optimized the job to reuse objects as much as I can, and that reduced the memory consumption.
I also increased taskmanager.network.memory to the required value; it is set to 1GB by default.
In my question above, I talked about custom serializers to reduce network overhead. I tried implementing a protobuf serializer with Kryo, but the protobuf-generated classes are final. If I have to update an object, I have to create a new one, which creates spikes in the GC metrics. So I avoided using it. Maybe I can change the protobuf-generated classes further to suit my needs; I will consider that step if things are inconsistent.
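For a sense of scale, the record rate implied by the volume above can be worked out roughly. This is a back-of-the-envelope sketch in Python: the 20 billion records per day and the 16 TaskManagers come from the posts above, while the even spread across TaskManagers and the one-allocation-per-record figure are assumptions made only for illustration.

```python
# Rough rate estimate; an even load across TaskManagers is an assumption.
RECORDS_PER_DAY = 20_000_000_000   # from the answer above
TASK_MANAGERS = 16                 # from the question
SECONDS_PER_DAY = 24 * 60 * 60

records_per_second = RECORDS_PER_DAY / SECONDS_PER_DAY
records_per_tm_per_second = records_per_second / TASK_MANAGERS

# Hypothetical: the pipeline allocates this many new objects per record.
ALLOCATIONS_PER_RECORD = 1
allocations_per_tm_per_second = records_per_tm_per_second * ALLOCATIONS_PER_RECORD

print(f"{records_per_second:,.0f} records/s in total")
print(f"{records_per_tm_per_second:,.0f} records/s per TaskManager")
print(f"{allocations_per_tm_per_second:,.0f} allocations/s per TaskManager")
```

Every additional short-lived object per record multiplies the last figure, which would be consistent with object reuse showing up so clearly in the GC metrics.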

Related

Size of the Kafka Streams In Memory Store

I am doing an aggregation on a Kafka topic stream and saving to an in memory state store. I would like to know the exact size of the accumulated in memory data, is this possible to find?
I looked through the jmx metrics on jconsole and Confluent Control Centre but nothing seemed relevant, is there anything I can use to find this out please?
You can get the number of stored key-value-pairs of an in-memory store, via KeyValueStore#approximateNumEntries() (for the default in-memory-store implementation, this number is actually accurate). If you can estimate the byte size per key-value pair, you can do the math.
However, estimating the byte size of an object is pretty hard to do in general in Java. The problem is, that Java does not provide any way to receive the actual size of an object. Also, objects can be nested making it even harder. Finally, besides the actual data, there is always some metadata overhead per object, and this overhead is JVM implementation dependent.
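The "do the math" step can be sketched as follows, in Python for brevity. The entry count would come from KeyValueStore#approximateNumEntries(); the average key and value sizes and the per-entry overhead are hypothetical numbers that you would have to estimate or measure for your own JVM and data.

```python
def estimate_store_bytes(num_entries: int,
                         avg_key_bytes: int,
                         avg_value_bytes: int,
                         per_entry_overhead_bytes: int) -> int:
    """Rough size of an in-memory key-value store.

    per_entry_overhead_bytes stands in for object headers, references and
    map-entry bookkeeping; it is JVM dependent, so treat it as a guess.
    """
    per_entry = avg_key_bytes + avg_value_bytes + per_entry_overhead_bytes
    return num_entries * per_entry


# Hypothetical figures: 1 million entries, 16-byte keys, 64-byte values,
# 48 bytes of assumed overhead per entry.
estimated = estimate_store_bytes(1_000_000, 16, 64, 48)
print(f"about {estimated / 1_000_000:.0f} MB")
```

The result is only as good as the per-entry estimate, so it is best read as an order-of-magnitude figure.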

Why does Spark RDD partition has 2GB limit for HDFS?

I got an error when using MLlib RandomForest to train on my data. My dataset is huge and the default number of partitions is relatively small, so an exception was thrown indicating "Size exceeds Integer.MAX_VALUE". The original stack trace is as follows:
15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in stage 6.0 (TID 120, 10.215.149.47): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
    at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
    at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
    at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
Integer.MAX_VALUE corresponds to about 2GB, so it seems that some partition ran out of memory. So I repartitioned my RDD into 1000 partitions, so that each partition holds far less data than before. Finally, the problem was solved!
So, my question is:
Why does the partition size have the 2GB limit? There seems to be no configuration setting for this limit in Spark.
The basic abstraction for blocks in spark is a ByteBuffer, which unfortunately has a limit of Integer.MAX_VALUE (~2GB).
It is a critical issue which prevents use of spark with very large datasets.
Increasing the number of partitions can resolve it (like in OP's case), but is not always feasible, for instance when there is large chain of transformations part of which can increase data (flatMap etc) or in cases where data is skewed.
The solution proposed is to come up with an abstraction like LargeByteBuffer which can support list of bytebuffers for a block. This impacts overall spark architecture, so it has remained unresolved for quite a while.
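As a rough illustration of the workaround of increasing the number of partitions, the minimum partition count that keeps every block under the ByteBuffer limit can be computed as below. This is a sketch that assumes data is spread evenly across partitions, which is exactly what fails under skew; the function name and the 128 MiB target are hypothetical.

```python
INT_MAX = 2**31 - 1  # Integer.MAX_VALUE in Java, the ByteBuffer size limit


def min_partitions(total_bytes: int, target_bytes: int = INT_MAX) -> int:
    """Smallest partition count so that each partition is <= target_bytes,
    assuming perfectly even distribution (an assumption; skew breaks it)."""
    if total_bytes <= 0:
        return 1
    return -(-total_bytes // target_bytes)  # integer ceiling division


ten_gib = 10 * 2**30
at_hard_limit = min_partitions(ten_gib)
at_128_mib = min_partitions(ten_gib, target_bytes=128 * 2**20)
print(at_hard_limit, at_128_mib)
```

In practice you would aim well below the hard limit, since a single skewed key or an expanding transformation can push one partition over it.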
The problem is that when using datastores like Cassandra, HBase, or Accumulo, the block size is based on the datastore splits (which can be over 10GB). When loading data from these datastores, you have to immediately repartition into thousands of partitions so you can operate on the data without blowing the 2GB limit.
Most people who use Spark are not really using large data; to them, anything bigger than Excel or Tableau can hold is big data. They are mostly data scientists who use quality data, or a sample size small enough to work within the limit.
When processing large volumes of data, I end up having to go back to MapReduce and only use Spark once the data has been cleaned up. This is unfortunate; however, the majority of the Spark community has no interest in addressing the issue.
A simple solution would be to create an abstraction that uses a byte array by default, but allows a Spark job to be overloaded with a 64-bit data pointer to handle the large jobs.
The Spark 2.4.0 release removes this limit by replicating block data as a stream. See Spark-24926 for details.

How to distribute data to worker nodes

I have a general question regarding Apache Spark and how to distribute data from driver to executors.
I load a file with 'scala.io.Source' into a collection. Then I parallelize the collection with 'SparkContext.parallelize'. Here the issue begins: when I don't specify the number of partitions, the number of workers is used as the partition count, the tasks are sent to the nodes, and I get a warning that the recommended task size is 100kB while my task size is e.g. 15MB (60MB file / 4 nodes). The computation then ends with an 'OutOfMemory' exception on the nodes. When I parallelize into more partitions (e.g. 600 partitions, to get 100kB per task), the computations are performed successfully on the workers, but an 'OutOfMemory' exception is raised after some time in the driver. In this case, I can open the Spark UI and observe how the memory of the driver is slowly consumed during the computation. It looks like the driver holds everything in memory and doesn't store the intermediate results on disk.
My questions are:
Into how many partitions to divide RDD?
How to distribute data 'the right way'?
How to prevent memory exceptions?
Is there a way how to tell driver/worker to swap? Is it a configuration option or does it have to be done 'manually' in program code?
Thanks
How to distribute data 'the right way'?
You will need a distributed file system, such as HDFS, to host your file. That way, each worker can read a piece of the file in parallel. This will deliver better performance than serializing and sending the data from the driver.
How to prevent memory exceptions?
Hard to say without looking at the code. Most operations will spill to disk. If I had to guess, I'd say you are using groupByKey ?
Into how many partitions to divide RDD?
I think the rule of thumb (for optimal parallelism) is 2-4x the number of cores available for your job. As you have done, you can trade time for memory usage.
Is there a way how to tell driver/worker to swap? Is it a configuration option or does it have to be done 'manually' in program code?
Shuffle spill behavior is controlled by the property spark.shuffle.spill. It's true (=spill to disk) by default.
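The partition arithmetic from the question and the 2-4x rule of thumb above can be checked with a few lines of Python. The 60MB file, 4 nodes and 600 partitions are from the question; the 16-core figure is purely hypothetical, and the helper names are made up for this sketch.

```python
def task_size_bytes(file_bytes: int, num_partitions: int) -> float:
    """Approximate data shipped per task when a driver-side collection
    is split evenly into num_partitions."""
    return file_bytes / num_partitions


def recommended_partitions(total_cores: int) -> tuple:
    """Rule-of-thumb range: 2x to 4x the cores available to the job."""
    return (2 * total_cores, 4 * total_cores)


FILE_BYTES = 60_000_000  # 60MB file from the question

per_task_4 = task_size_bytes(FILE_BYTES, 4)      # one partition per node
per_task_600 = task_size_bytes(FILE_BYTES, 600)  # the 600-partition run
low, high = recommended_partitions(16)           # hypothetical 16 cores

print(per_task_4, per_task_600, (low, high))
```

Note that the two goals pull in different directions here: hitting the 100kB task size needs far more partitions than the core-based rule suggests, which is one more reason to let workers read the file from a distributed file system instead of shipping it from the driver.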

Spark out of memory

I have a folder with 150 G of txt files (around 700 files, on average each 200 MB).
I'm using scala to process the files and calculate some aggregate statistics in the end. I see two possible approaches to do that:
manually loop through all the files, do the calculations per file and merge the results in the end
read the whole folder to one RDD, do all the operations on this single RDD and let spark do all the parallelization
I'm leaning towards the second approach as it seems cleaner (no need for parallelization-specific code), but I'm wondering if my scenario will fit the constraints imposed by my hardware and data. I have one workstation with 16 threads and 64 GB of RAM available (so the parallelization will be strictly local between different processor cores). I might scale the infrastructure with more machines later on, but for now I would just like to focus on tuning the settings for this one-workstation scenario.
The code I'm using:
- reads TSV files, and extracts meaningful data to (String, String, String) triplets
- afterwards some filtering, mapping and grouping is performed
- finally, the data is reduced and some aggregates are calculated
I've been able to run this code with a single file (~200 MB of data); however, I get a java.lang.OutOfMemoryError: GC overhead limit exceeded and/or a Java out of heap exception when adding more data (the application breaks with 6GB of data, but I would like to use it with 150 GB of data).
I guess I would have to tune some parameters to make this work. I would appreciate any tips on how to approach this problem (how to debug for memory demands). I've tried increasing 'spark.executor.memory' and using a smaller number of cores (the rationale being that each core needs some heap space), but this didn't solve my problems.
I don't need the solution to be very fast (it can easily run for a few hours even days if needed). I'm also not caching any data, but just saving them to the file system in the end. If you think it would be more feasible to just go with the manual parallelization approach, I could do that as well.
My team and I have successfully processed CSV data of over 1 TB across 5 machines with 32GB of RAM each. It depends heavily on what kind of processing you're doing and how.
- If you repartition an RDD, it requires additional computation whose overhead can exceed your heap size. Try loading the file with more parallelism instead, by decreasing the split size in TextInputFormat.SPLIT_MINSIZE and TextInputFormat.SPLIT_MAXSIZE (if you're using TextInputFormat) to raise the level of parallelism.
- Try using mapPartitions instead of map so you can handle the computation inside a partition. If the computation uses a temporary variable or instance and you're still running out of memory, try lowering the amount of data per partition (increasing the number of partitions).
- Increase the driver memory and executor memory limits using "spark.executor.memory" and "spark.driver.memory" in the Spark configuration before creating the Spark context.
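The per-partition processing suggested above can be sketched in plain Python, without Spark, to show the idea: a helper object is built once per partition and records are yielded lazily instead of being collected into a list. The class and function names are hypothetical; in PySpark a function shaped like process_partition is what you would pass to rdd.mapPartitions.

```python
from typing import Iterable, Iterator, Tuple

setup_calls = 0  # counts how often the per-partition helper is built


class TsvParser:
    """Stand-in for a temporary instance that is costly to create."""

    def __init__(self) -> None:
        global setup_calls
        setup_calls += 1

    def parse(self, line: str) -> Tuple[str, ...]:
        # Keep the first three tab-separated fields as a triplet.
        return tuple(line.split("\t")[:3])


def process_partition(lines: Iterable[str]) -> Iterator[Tuple[str, ...]]:
    parser = TsvParser()          # built once per partition, not per record
    for line in lines:
        yield parser.parse(line)  # lazy: one record in flight at a time


# Two simulated partitions of TSV lines.
partitions = [["a\tb\tc", "d\te\tf"], ["g\th\ti"]]
results = [row for part in partitions for row in process_partition(part)]
print(results, setup_calls)
```

Because the function is a generator, memory use inside a partition stays proportional to one record plus the helper, as long as nothing downstream materializes the whole partition.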
Note that Spark is a general-purpose cluster computing system, so it's inefficient (IMHO) to use Spark on a single machine.
To add another perspective based on code (as opposed to configuration): Sometimes it's best to figure out at what stage your Spark application is exceeding memory, and to see if you can make changes to fix the problem. When I was learning Spark, I had a Python Spark application that crashed with OOM errors. The reason was because I was collecting all the results back in the master rather than letting the tasks save the output.
E.g.
for item in processed_data.collect():
    print(item)
failed with OOM errors. On the other hand,
processed_data.saveAsTextFile(output_dir)
worked fine.
Yes, the PySpark RDD/DataFrame collect() function retrieves all the elements of the dataset (from all nodes) to the driver node. collect() should be used on smaller datasets, usually after filter(), groupBy(), count() etc. Retrieving a larger dataset results in an out-of-memory error.

How to build a fault-tolerant app in Storm?

The short version of the question: how to build a fail-safe word count program (topology) in Twitter Storm that produces accurate results even when failure occurs? Is that even possible?
Long version: I am studying Twitter Storm and trying to understand how it should be used. I have followed the tutorial and find it a very simple concept. But the word count example outlined in the tutorial is not fault tolerant (because bolts save some data in memory). Saving the same data in back-end DB however leads to double counting if an event is re-submitted to the start of chain (which happens when some of the bolts fail).
Should I see Twitter Storm as real-time platform for producing partially accurate results and still depend on MapReduce to get the accurate ones?
It really depends on what kind of failure you're trying to hedge against. There are a few things that you can do:
Storm bolts are supposed to ack a tuple only after they have processed it. If you write your spouts and bolts and topology to use this, you can implement an "exactly one time" system which will guarantee accuracy.
Kafka can be a good way to put data into Storm because it uses disk persistence to keep messages around for a long time even after they are consumed. This means you can retrieve them if there's a failure by a consumer down the line.
In general though, it's difficult to guarantee that things are processed exactly once in any streaming system. This is a known problem, and it is a very difficult problem to solve efficiently.
Storm has the concept of transactional topologies. In practice, this means you will want to process items in batches, then commit to your database at the end of the batch, storing the transaction ID in the database alongside a count. This also has the practical benefit of reducing the load on your database with fewer inserts.
Batches are processed in parallel and may be replayed on failure, but are guaranteed to be committed in order. This is important because it makes it safe to write code that fetches the current count row, checks the stored transaction ID against the one in memory, and, if the two differ (meaning it is an uncommitted batch), adds the new count to the existing one and commits that updated count.
See the following link for much more information and code examples:
https://github.com/nathanmarz/storm/wiki/Transactional-topologies
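The transaction-ID check described above can be sketched in a few lines of Python. This is not Storm code: a dict stands in for the database, commit_batch is a hypothetical name, and the sketch relies on the stated guarantee that batches are committed in order.

```python
from typing import Dict, Optional, Tuple

# word -> (count, txid of the last batch applied to this row)
Database = Dict[str, Tuple[int, Optional[int]]]


def commit_batch(db: Database, txid: int, batch_counts: Dict[str, int]) -> None:
    """Apply one batch of partial word counts idempotently."""
    for word, delta in batch_counts.items():
        count, last_txid = db.get(word, (0, None))
        if last_txid == txid:
            # Same transaction ID: this batch was already committed for
            # this row, so a replay must not be counted again.
            continue
        db[word] = (count + delta, txid)


db: Database = {}
commit_batch(db, 1, {"storm": 2, "kafka": 1})
commit_batch(db, 1, {"storm": 2, "kafka": 1})  # replay after a failure
commit_batch(db, 2, {"storm": 3})
print(db)
```

Storing the transaction ID in the same row as the count is what makes the check and the update a single atomic write in a real database.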