I have a Spark job that blows up our CDH cluster in one of two ways depending on how I partition things. The purpose of this job is to generate anywhere between 1 and 210,094,780,875 sets of four integers. The job is being submitted via spark-submit, with the master set to YARN. Below is the code snippet germane to this question:
// build rdd and let cluster build up the ngram list
val streamList_rdd = sc.parallelize(streamList).repartition(partitionCount)
val rdd_results = streamList_rdd.flatMap { x => x.toList }
println(rdd_results.count())
streamList is a list of generators that have been seeded with a floor/ceiling value (a tuple containing two Ints) that will generate sets of four integers bounded by floor/ceiling. The idea is to farm out the generation work across the cluster and that's where the front falls off. If partitionCount is too low (and thus the size of each partition is large), the workers blow up due to lack of memory. If partitionCount is high (and thus the size of each partition is manageable from a memory perspective), you start seeing errors like this one:
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
The memory issue I understand; what I don't understand is why a high partition count (~100k or more) causes these problems. Is there a way for me to make this work while preserving YARN's role in managing cluster resources?
Given the amount of data, and the presence of memory errors, I think you need to assign more cluster resources.
Increasing partitions improves parallelism, but at the cost of consuming more resources on an already insufficiently sized cluster. I also suspect the repartition operation causes a shuffle, which is an expensive operation at the best of times and catastrophic when you have enough data to run out of memory. But without logs, that is conjecture.
The likely cause of the connection/heartbeat failures is either that an executor is under such heavy load that it fails to respond in time, or that the process has crashed or been killed by YARN...
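Without logs this is equally conjecture, but two things in the snippet itself are worth trying: give parallelize the partition count up front (so the repartition, and its shuffle, disappear) and have each generator emit its tuples lazily as an Iterator instead of a fully built List. A rough sketch, where Generator, the seed values, and partitionCount are placeholders standing in for the question's own types, and sc is the existing SparkContext:

case class Generator(floor: Int, ceiling: Int) {
  // Lazily enumerate four-integer tuples instead of materializing a List per generator.
  def tuples: Iterator[(Int, Int, Int, Int)] =
    for {
      a <- Iterator.range(floor, ceiling)
      b <- Iterator.range(floor, ceiling)
      c <- Iterator.range(floor, ceiling)
      d <- Iterator.range(floor, ceiling)
    } yield (a, b, c, d)
}

val partitionCount = 10000                                   // placeholder, tune for the cluster
val streamList = Seq(Generator(0, 100), Generator(100, 200)) // placeholder seeds

// Passing the slice count to parallelize avoids the separate repartition (and its shuffle)...
val streamList_rdd = sc.parallelize(streamList, partitionCount)

// ...and returning an Iterator from flatMap keeps each partition's output streaming
// rather than fully materialized in memory before being counted.
val rdd_results = streamList_rdd.flatMap(_.tuples)
println(rdd_results.count())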
I think I'm familiar with the Spark cluster mode components described here, and with one of their major disadvantages: shuffling. Hence, IMHO, a best practice would be to have one executor on each worker node; having more than that would unnecessarily increase shuffling (between the executors).
What am I missing here? In which cases would I prefer to have more than one executor on each worker?
In which case would we like to have more than one executor in each worker?
Whenever possible: if a job requires fewer resources per executor than a worker node has available, then Spark should try to start other executors on the same worker to use all of its resources.
But that's Spark's role, not our call. When deploying Spark apps, it is up to Spark to decide how many executors (JVM processes) are started on each worker node (machine), and that depends on the executor resources (cores and memory) required by the Spark jobs (the spark.executor.* configs). We often don't know what resources are available per worker, since a cluster is usually shared by multiple apps/people. So we configure the executor count and the required resources, and let Spark decide whether or not to run them on the same worker.
Now, your question is maybe: "should we have fewer executors with lots of cores and memory, or distribute the work across several small executors?"
Having fewer but bigger executors clearly reduces shuffling. But there are several reasons to also prefer distribution:
It is easier to start small executors.
Having a big executor means the cluster needs all the required resources free on a single worker.
Small executors are especially useful when using dynamic allocation, which starts and kills executors based on runtime usage.
Several small executors improve resilience: if our code is unstable and occasionally kills an executor, with one big executor everything is lost and restarted, while with several small executors only part of the work is.
I ran into a case where the code used in the executor wasn't thread safe. That's a bad thing, but it wasn't intentional. So until (or instead of :\ ) fixing it, we distributed the code across many 1-core executors.
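For what it's worth, a minimal sketch of what asking for many small 1-core executors can look like; the numbers are placeholders, and in practice these spark.executor.* settings are usually supplied at submit time rather than hard-coded:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Placeholder values; adjust to your workload and cluster.
val conf = new SparkConf()
  .set("spark.executor.instances", "50") // many executors...
  .set("spark.executor.cores", "1")      // ...each with a single core
  .set("spark.executor.memory", "2g")
// Or let Spark size the fleet itself with dynamic allocation:
//   .set("spark.dynamicAllocation.enabled", "true")
//   .set("spark.shuffle.service.enabled", "true")

val spark = SparkSession.builder.config(conf).appName("many-small-executors").getOrCreate()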
I'm naively testing for concurrency in local mode, with the following spark context
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("local-mode-spark")
  .master("local[*]")
  .config("spark.executor.instances", 4)
  .config("spark.executor.cores", 2)
  .config("spark.network.timeout", "10000001")            // to avoid shutdown during debug, avoid otherwise
  .config("spark.executor.heartbeatInterval", "10000000") // to avoid shutdown during debug, avoid otherwise
  .getOrCreate()
and a mapPartitions API call like follows:
import org.apache.spark.sql.DataFrame
import spark.implicits._

val inputDF: DataFrame = spark.read.parquet(inputFile)
val resultDF: DataFrame =
  inputDF.as[T].mapPartitions(sparkIterator => new MyIterator).toDF
This did surface one concurrency bug in my code inside MyIterator (not a bug in Spark's code). However, I'd like to see my application crunch all available machine resources, both in production and during this testing, so that the chances of spotting additional concurrency bugs improve.
That is clearly not the case for me so far: my machine is only at very low CPU utilization throughout the heavy processing of the inputDF, while there's plenty of free RAM and the JVM Xmx poses no real limitation.
How would you recommend testing for concurrency on a local machine? The objective is to verify that in production, Spark will not hit thread-safety or other concurrency issues in my code as applied by Spark from within MyIterator.
Can Spark, even in local mode, process separate partitions of my input DataFrame in parallel? Can I get Spark to work concurrently on the same DataFrame on a single machine, preferably in local mode?
Max parallelism
You are already running spark in local mode using .master("local[*]").
local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to know the number).
Max memory available to all executors/threads
I see that you are not setting the driver memory explicitly. By default the driver memory is 512M. If your local machine can spare more than this, set this explicitly. You can do that by either:
setting it in the properties file (default is spark-defaults.conf),
spark.driver.memory 5g
or by supplying configuration setting at runtime
$ ./bin/spark-shell --driver-memory 5g
Note that this cannot be achieved by setting it in the application, because it is already too late by then, the process has already started with some amount of memory.
Nature of Job
Check the number of partitions in your DataFrame. That will essentially determine how much parallelism you can use.
inputDF.rdd.partitions.size
If the output of this is 1, your DataFrame has only one partition, so you won't get concurrency when you operate on it. In that case, you might have to tweak the configuration or repartition to create more partitions so that tasks can run concurrently.
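For example, a sketch assuming the inputDF from the question; the multiplier is just a starting point, not a rule:

// How many partitions does the DataFrame currently have?
val currentPartitions = inputDF.rdd.partitions.size

// If it is 1 (or very low), repartition so local[*] actually has parallel work to do.
val cores = Runtime.getRuntime.availableProcessors()
val repartitionedDF = inputDF.repartition(cores * 3)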
Running in local mode cannot simulate a production environment, for the following reasons.
There is a lot of code that gets bypassed when running in local mode that would normally run under any other cluster manager. Among various issues, a few things I can think of:
a. Inability to detect bugs in the way shuffles are handled (shuffle data is handled in a completely different way in local mode).
b. Inability to detect serialization-related issues, since all code is available to the driver and tasks run inside the driver itself, so nothing ever gets serialized.
c. No speculative tasks (especially relevant for write operations).
d. Networking-related issues: all tasks are executed in the same JVM, so you cannot detect problems such as driver/executor communication or codegen-related issues.
Concurrency in local mode
a. The maximum concurrency that can be attained equals the number of cores in your local machine (link to code).
b. The Job, Stage, and Task metrics shown in the Spark UI are not accurate, since they incur the overhead of running in the same JVM as the driver.
c. As for CPU/memory utilization, it depends on the operation being performed. Is the operation CPU- or memory-intensive?
When to use local mode
a. Testing of code that will run only on the driver
b. Basic sanity testing of the code that will get executed on the executors
c. Unit testing
tl;dr: Concurrency bugs that occur in local mode might not even be present under other cluster resource managers, since there is a lot of special handling for local mode in Spark's code (many places check isLocal and branch into a different code path altogether).
Yes!
Achieving parallelism in local mode is quite possible.
Check the amount of memory and CPU available on your local machine and supply values to the driver-memory and driver-cores conf when submitting your Spark job.
Increasing executor-memory and executor-cores will not make a difference in this mode.
Once the application is running, open the Spark UI for the job. You can go to the Executors tab to check the amount of resources your Spark job is actually utilizing.
You can monitor the various tasks that get generated, and the number of tasks your job runs concurrently, using the Jobs and Stages tabs.
In order to process data which is way larger than the resources available, ensure that you break your data into smaller partitions using repartition. This should allow your job to complete successfully.
Increase the default shuffle partitions in case your job has aggregations or joins. Also, ensure sufficient space on the local file system, since Spark creates intermediate shuffle files and writes them to disk.
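A quick sketch of both knobs (the values are illustrative, and spark / inputDF are assumed to be your session and input):

// More shuffle partitions for aggregations/joins (the default is 200).
spark.conf.set("spark.sql.shuffle.partitions", "400")

// Break the input into smaller partitions before the heavy processing.
val smallerPartitionsDF = inputDF.repartition(400)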
Hope this helps!
We are running a 5-node Flink cluster (1.6.3) on Kubernetes, with a 5-partition Kafka topic as the source.
5 jobs are reading from that topic (each with a different consumer group), each with parallelism = 5.
Each task manager runs with 10 GB of RAM, and the task manager heap size is limited to 2 GB.
The ingestion load is rather small (100-200 msgs per second) and the average message size is ~4-8 KB.
All jobs run fine for a few hours; after a while we suddenly see one or more jobs failing with:
java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:666)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:241)
at sun.nio.ch.IOUtil.read(IOUtil.java:195)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.kafka.common.network.PlaintextTransportLayer.read(PlaintextTransportLayer.java:110)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:97)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:169)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:150)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:355)
at org.apache.kafka.common.network.Selector.poll(Selector.java:303)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:349)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:226)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1047)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995)
at org.apache.flink.streaming.connectors.kafka.internal.KafkaConsumerThread.run(KafkaConsumerThread.java:257)
Flink restarts the job, but it keeps failing with that exception.
We've tried reducing the max poll records as suggested here:
Kafka Consumers throwing java.lang.OutOfMemoryError: Direct buffer memory
We also tried increasing the Kafka heap size as suggested here:
Flink + Kafka, java.lang.OutOfMemoryError when parallelism > 1, although I'm failing to understand how a failure to allocate memory in the Flink process has anything to do with the JVM memory of the Kafka broker process, and I don't see anything indicating an OOM in the broker logs.
What might be the cause of that failure? What else should we check?
Thanks!
One thing that you may have underestimated is that having a parallelism of 5 means there are 5+4+3+2+1 = 15 pairs of combinations. If we compare this to the linked thread, there were likely only 3+2+1 = 6 combinations.
In the linked thread the problem was resolved by setting the max poll records to 250, hence my first thought would be to set it to 80 here (or even to 10) and see if that resolves the problem.
(I am not sure the requirements are shaped this way, but the only noticeable difference is the parallelism going from 3 to 5, so that seems like a good place to compensate.)
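If it helps, max.poll.records is just another entry in the Properties object handed to the Flink Kafka consumer; a hedged sketch (the broker address, group id, and value itself are placeholders for your setup):

import java.util.Properties

val props = new Properties()
props.setProperty("bootstrap.servers", "kafka:9092") // placeholder
props.setProperty("group.id", "my-consumer-group")   // placeholder
props.setProperty("max.poll.records", "80")          // try 80, or even 10, and observe

// Pass `props` to the FlinkKafkaConsumer exactly as you do today; only the extra property changes.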
I am working with a several-hundred-GB dataset (around 2B rows). One of the operations is to reduce an RDD of Scala case objects (containing doubles, maps, sets) into a single entity. Initially my operation used groupByKey, but it was slow and caused heavy GC, so I converted it to aggregateByKey and later even to reduceByKey, hoping to avoid the high user memory allocations, shuffle activity, and GC issues I was encountering with groupBy.
Application resources: 23 GB executor memory + 4 GB overhead, 20 instances with 6 cores each. Played with the shuffle ratio from 0.2 to 0.4.
Available cluster resources: 10 nodes, 600 GB total for YARN, 32 GB max container size.
2016-05-02 22:38:53,595 INFO [sparkDriver-akka.actor.default-dispatcher-14] org.apache.spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to hdn2.mycorp:45993
2016-05-02 22:38:53,832 INFO [sparkDriver-akka.actor.default-dispatcher-14] org.apache.spark.storage.BlockManagerInfo: Removed broadcast_4_piece0 on 10.250.70.117:52328 in memory (size: 2.1 KB, free: 15.5 MB)
2016-05-02 22:39:03,704 WARN [New I/O worker #5] org.jboss.netty.channel.DefaultChannelPipeline: An exception was thrown by a user handler while handling an exception event ([id: 0xa8147f0c, /10.250.70.110:48056 => /10.250.70.117:38300] EXCEPTION: java.lang.OutOfMemoryError: Java heap space)
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
at org.jboss.netty.buffer.CompositeChannelBuffer.toByteBuffer(CompositeChannelBuffer.java:649)
at org.jboss.netty.buffer.AbstractChannelBuffer.toByteBuffer(AbstractChannelBuffer.java:530)
at org.jboss.netty.channel.socket.nio.SocketSendBufferPool.acquire(SocketSendBufferPool.java:77)
at org.jboss.netty.channel.socket.nio.SocketSendBufferPool.acquire(SocketSendBufferPool.java:46)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.write0(AbstractNioWorker.java:194)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.writeFromTaskLoop(AbstractNioWorker.java:152)
at org.jboss.netty.channel.socket.nio.AbstractNioChannel$WriteTask.run(AbstractNioChannel.java:335)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:366)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:290)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
2016-05-02 22:39:05,783 ERROR [sparkDriver-akka.actor.default-dispatcher-14] org.apache.spark.rpc.akka.ErrorMonitor: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply$mcV$sp(Serializer.scala:129)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.serialization.JavaSerializer.toBinary(Serializer.scala:129)
at akka.remote.MessageSerializer$.serialize(MessageSerializer.scala:36)
at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.remote.EndpointWriter.serializeMessage(Endpoint.scala:842)
at akka.remote.EndpointWriter.writeSend(Endpoint.scala:743)
at akka.remote.EndpointWriter$$anonfun$2.applyOrElse(Endpoint.scala:718)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:411)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2016-05-02 22:39:05,783 ERROR [sparkDriver-akka.actor.default-dispatcher-2] akka.actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply$mcV$sp(Serializer.scala:129)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
About a Job
Read an input dataset with around 20 fields and 1B-2B rows. Create an output dataset aggregated over 10 unique fields, which essentially become the query criteria. However, out of those 10, 3 fields represent various combinations of their values, so that we don't have to query multiple records to get a set. Of those 3 fields, say a, b, and c, each has 11, 2, and 2 possible values respectively, so we could get at most (2^11 - 1) * (2^2 - 1) * (2^2 - 1) combinations for a given key.
// Pseudocode where I use aggregateByKey
case class UserDataSet(salary: Double, members: Int, clicks: Map[Int, Long],
                       businesses: Map[Int, Set[Int]], ...) // about 10 fields, 5 of them maps

def main() = {
  // create combinationRDD of type (String, Set[Set]) from the input dataset,
  // representing all combinations
  // create joinedRdd of type (String, UserDataSet) - the key at this point is already
  // the final key containing the 10 unique fields; the value is a UserDataSet

  // This is where things fail
  val finalDataSet = joinedRdd.aggregateByKey(UserDataSet.getInstance())(processDataSeq, processDataMerge)
}

private def processDataMerge(map1: UserDataSet, map2: UserDataSet) = {
  map1.clicks ++= map2.clicks // deep merge of course, to avoid overwriting map keys
  map1.salary += map2.salary
  map1
}
So the issue was indeed the driver running out of memory, not the executors; hence the error was in the driver logs. Duh. However, it wasn't very clear from the logs. The driver ran out because 1) it was using the default of -Xmx900m, and 2) the Spark driver relies on Akka libraries, and Akka relies on the stubborn JavaSerializer, which uses a byte array instead of a stream to serialize objects. As a temporary solution I increased spark.driver.memory to 4096m in my case, and I haven't seen the memory error since. Thanks everyone for the insights into the problem space though.
To be able to help, you should post the code and also give an explanation of the input data.
Why the data?
When aggregating by key, to achieve optimal parallelism and avoid issues, it's important to have an idea of what the key distribution looks like and also the cardinality.
Let me explain what they are and why they are important.
Let's say you're aggregating by country...there are about 250 countries on earth, so the cardinality of the key is around 250.
Cardinality is important because low cardinality may stifle your parallelism. For instance, if 90% of your data is for the US, and you have 250 nodes, one node will be processing 90% of the data.
That leads to the concept of distribution, that is, when you're grouping by key, how many values you have per key is your value distribution. For optimal parallelism, you ideally want roughly the same number of values for every key.
Now, if the cardinality of your data is very high, but the value distribution is not optimal, statistically things should even out.
For instance, let's say you have apache logs, where most users only visit a few pages, but some visit many (as it's the case with robots).
If the number of users is much greater than the number of your nodes, the users with lots of data get distributed around the nodes so parallelism is not that impacted.
Problems usually arise when you use keys with low cardinality.
If the distribution of the values is not good, it causes issues, not unlike an unbalanced washing machine.
Last but not least, it also depends greatly on what you're doing in the aggregateByKey. You can exhaust memory easily if you're leaking objects in either the map or reduce phase of processing.
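One cheap way to check both cardinality and distribution before the expensive aggregation is to count values per key on a sample. A rough sketch, assuming a pair RDD shaped like the joinedRdd above:

// Approximate the key distribution on a 1% sample to spot skew before aggregating.
val keyCounts = joinedRdd
  .sample(withReplacement = false, fraction = 0.01)
  .map { case (key, _) => (key, 1L) }
  .reduceByKey(_ + _)

println(s"Distinct keys in sample: ${keyCounts.count()}")           // rough cardinality
keyCounts.sortBy(_._2, ascending = false).take(20).foreach(println) // heaviest keys = skew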
I have a general question regarding Apache Spark and how to distribute data from driver to executors.
I load a file with 'scala.io.Source' into a collection, then parallelize the collection with 'SparkContext.parallelize'. Here the issue begins: when I don't specify the number of partitions, the number of workers is used as the partition count, the task is sent to the nodes, and I get the warning that the recommended task size is 100 kB while my task size is e.g. 15 MB (60 MB file / 4 nodes). The computation then ends with an 'OutOfMemory' exception on the nodes. When I parallelize to more partitions (e.g. 600 partitions, to get roughly 100 kB per task), the computations complete successfully on the workers, but the 'OutOfMemory' exception is raised after some time in the driver. In this case, I can open the Spark UI and watch the driver's memory slowly being consumed during the computation. It looks like the driver holds everything in memory and doesn't store intermediate results on disk.
My questions are:
Into how many partitions to divide RDD?
How to distribute data 'the right way'?
How to prevent memory exceptions?
Is there a way to tell the driver/workers to swap to disk? Is it a configuration option, or does it have to be done 'manually' in the program code?
Thanks
How to distribute data 'the right way'?
You will need a distributed file system, such as HDFS, to host your file. That way, each worker can read a piece of the file in parallel. This will deliver better performance than serializing and shipping the data from the driver.
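For example, instead of reading with scala.io.Source on the driver and calling parallelize (the path and partition count are placeholders):

// Each worker reads its own splits of the file directly from HDFS,
// so nothing has to be serialized out from the driver.
val lines = sc.textFile("hdfs:///path/to/input.txt", minPartitions = 600)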
How to prevent memory exceptions?
Hard to say without looking at the code. Most operations will spill to disk. If I had to guess, I'd say you are using groupByKey?
Into how many partitions to divide RDD?
I think the rule of thumb (for optimal parallelism) is 2-4x the number of cores available to your job. As you have done, you can trade time for memory usage.
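As a concrete sketch of that rule of thumb (with collection standing in for the in-driver collection from the question):

// Roughly 2-4x the total cores available to the job; 3x as a middle ground.
val targetPartitions = sc.defaultParallelism * 3
val rdd = sc.parallelize(collection, targetPartitions) // or existingRdd.repartition(targetPartitions)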
Is there a way to tell the driver/workers to swap to disk? Is it a configuration option, or does it have to be done 'manually' in the program code?
Shuffle spill behavior is controlled by the property spark.shuffle.spill. It's true (=spill to disk) by default.