I am working with several hundred GB dataset (around 2B rows). One of the operation is to reduce RDD or scala case objects(containing doubles, maps, sets) into single entity. Initially my operation was performing groupByKey but it was slow and was doing high GC. so I tried convert it to aggregateByKey and later even into reduceByKey in a hope to avoid high user memory allocations, shuffle activity and high gc issue that I was encountering with groupBy.
Application resources: 23GB exec mem + 4GB overhead. 20 instances and 6 cores each. Played with shuffle ration from 0.2 to 0.4
Available cluster resources 10 nodes, 600GB total for yarn, 32GB max container size
2016-05-02 22:38:53,595 INFO [sparkDriver-akka.actor.default-dispatcher-14] org.apache.spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to hdn2.mycorp:45993
2016-05-02 22:38:53,832 INFO [sparkDriver-akka.actor.default-dispatcher-14] org.apache.spark.storage.BlockManagerInfo: Removed broadcast_4_piece0 on 10.250.70.117:52328 in memory (size: 2.1 KB, free: 15.5 MB)
2016-05-02 22:39:03,704 WARN [New I/O worker #5] org.jboss.netty.channel.DefaultChannelPipeline: An exception was thrown by a user handler while handling an exception event ([id: 0xa8147f0c, /10.250.70.110:48056 => /10.250.70.117:38300] EXCEPTION: java.lang.OutOfMemoryError: Java heap space)
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
at org.jboss.netty.buffer.CompositeChannelBuffer.toByteBuffer(CompositeChannelBuffer.java:649)
at org.jboss.netty.buffer.AbstractChannelBuffer.toByteBuffer(AbstractChannelBuffer.java:530)
at org.jboss.netty.channel.socket.nio.SocketSendBufferPool.acquire(SocketSendBufferPool.java:77)
at org.jboss.netty.channel.socket.nio.SocketSendBufferPool.acquire(SocketSendBufferPool.java:46)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.write0(AbstractNioWorker.java:194)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.writeFromTaskLoop(AbstractNioWorker.java:152)
at org.jboss.netty.channel.socket.nio.AbstractNioChannel$WriteTask.run(AbstractNioChannel.java:335)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:366)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:290)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
2016-05-02 22:39:05,783 ERROR [sparkDriver-akka.actor.default-dispatcher-14] org.apache.spark.rpc.akka.ErrorMonitor: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply$mcV$sp(Serializer.scala:129)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.serialization.JavaSerializer.toBinary(Serializer.scala:129)
at akka.remote.MessageSerializer$.serialize(MessageSerializer.scala:36)
at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.remote.EndpointWriter.serializeMessage(Endpoint.scala:842)
at akka.remote.EndpointWriter.writeSend(Endpoint.scala:743)
at akka.remote.EndpointWriter$$anonfun$2.applyOrElse(Endpoint.scala:718)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:411)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2016-05-02 22:39:05,783 ERROR [sparkDriver-akka.actor.default-dispatcher-2] akka.actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply$mcV$sp(Serializer.scala:129)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
67247,1 99%
About a Job
Read input data set having around 20 fields. 1B-2B. Create an output dataset aggregating over 10 unique fields. Which becomes query criteria basically. However out of those 10, 3 fields represent various combinations of them so that we don't have to query multiple records to get a set. out of those 3 fields lets sat a, b, and c each has 11, 2 and 2 possible values. so we could get max of 2^11 -1 * 2^2 - 1 * 2^2 -1 combination for given key.
//pseudo code where I use aggregateByKey
case class UserDataSet(salary: Double, members: Int, clicks: Map[Int, Long],
businesses: Map[Int, Set[Int]])...) //About 10 fileds with 5 of them are maps
def main() = {
create combinationRDD of type (String, Set[Set]) Rdd from input dataset which represent all combination
create a joinedRdd of type (String, UserDataSet) - where key at this point already a final key which contains 10 unique fields; value is a UserDataSet
//This is where things fails
val finalDataSet = joinedRdd.aggregateByKey(UserDataSet.getInstance())(processDataSeq, processDataMerge)
}
private def processDataMerge(map1: UserDataSet, map2: UserDataSet) = {
map1.clicks ++= map2.clicks (deep merge of course to avoid overwriting of map keys)
map1.salary += map2.salary
map1
}
So issue was indeed driver running out of memory and not executor. hence error was in driver logs. duh. however it wasn't very clear from logs. Driver ran out because 1) It was using default of -Xmx900m 2) Spark driver relies on akka libs and akka libs relies on stubborn JavaSerializer which uses Byte array instead of stream to serialize objects. As a temp solution I increased spark.driver.memory to 4096m in my case and I haven't seen memory error since. Thanks everyone for some insights into a problem space though.
To be able to help, you should post the code and also give an explanation of the input data.
Why the data ?
When aggregating by key, to achieve optimal parallelism and avoid issues, it's important to have an idea of what the key distribution looks like and also the cardinality.
Let me explain what they are and why they are important.
Let's say you're aggregating by country...there are about 250 countries on earth, so the cardinality of the key is around 250.
Cardinality is important because low cardinality may stifle your parallelism. For instance, if 90% of your data is for the US, and you have 250 nodes, one node will be processing 90% of the data.
That leads to the concept of distribution, that is, when you're grouping by key, how many values you have per key is your value distribution. For optimal parallelism, you ideally want roughly the same number of values for every key.
Now, if the cardinality of your data is very high, but the value distribution is not optimal, statistically things should even out.
For instance, let's say you have apache logs, where most users only visit a few pages, but some visit many (as it's the case with robots).
If the number of users is much greater than the number of your nodes, the users with lots of data get distributed around the nodes so parallelism is not that impacted.
Problems usually arise when you use keys with low cardinality.
If the distribution of the values is not good, it causes issues not unlikely an unbalanced washing machines.
Last but not least, it also depends greatly on what you're doing on the aggregateByKey. You can exhaust memory easily if you're leaking objects in either the map or reduce phase of processing.
Related
So question is in the subject. I think I dont understand correctly the work of repartition. In my mind when I say somedataset.repartition(600) I expect all data would be partioned by equal size across the workers (let say 60 workers).
So for example. I would have a big chunk of data to load in unbalanced files, lets say 400 files, where 20 % are 2Gb size and others 80% are about 1 Mb. I have the code to load this data:
val source = sparkSession.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter","\t")
.load(mypath)
Than I want convert raw data to my intermediate object, filter irrelevvant records, convert to final object (with additional attributes) and than partition by some columns and write to parquet. In my mind it seems reasonable to balance data (40000 partitions) across workers and than do the work like that:
val ds: Dataset[FinalObject] = source.repartition(600)
.map(parse)
.filter(filter.IsValid(_))
.map(convert)
.persist(StorageLevel.DISK_ONLY)
val count = ds.count
log(count)
val partitionColumns = List("region", "year", "month", "day")
ds.repartition(partitionColumns.map(new org.apache.spark.sql.Column(_)):_*)
.write.partitionBy(partitionColumns:_*)
.format("parquet")
.mode(SaveMode.Append)
.save(destUrl)
But it fails with
ExecutorLostFailure (executor 7 exited caused by one of the running
tasks) Reason: Container killed by YARN for exceeding memory limits.
34.6 GB of 34.3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
When I do not do repartition everything is fine. Where I do not understand repartition correct?
Your logic is correct for repartition as well as partitionBy but before using repartition you need to keep in mind this thing from several sources.
Keep in mind that repartitioning your data is a fairly expensive
operation. Spark also has an optimized version of repartition() called
coalesce() that allows avoiding data movement, but only if you are
decreasing the number of RDD partitions.
If you want that your task must be done then please increase drivers and executors memory
I am processing a stream of 100 Mb/s average load. I have six executors with each having 12 Gb of memory allocated. However, due to data load, I am getting Out of Memory errors (Error 52) in the spark executors in few minutes. It seems even though Spark dataframe is conceptually unbounded it is bounded by total executor memory?
My idea here was to save dataframe/stream as an in parquet in about every five minutes. However, it seems spark won't have a direct mechanism to purge the dataframe after that?
val out = df.
writeStream.
format("parquet").
option("path", "/applications/data/parquet/customer").
option("checkpointLocation", "/checkpoints/customer/checkpoint").
trigger(Trigger.ProcessingTime(300.seconds)).
outputMode(OutputMode.Append).
start
It seems that there is no direct way to do this. As this conflicts with the general Spark model that operations be rerunnable in case of failure.
However I would share the same sentiment of the comment at 08/Feb/18 13:21 on this issue.
I think anyone that has used Spark has ran across OOM errors, and usually the source of the problem can be found easily. However, I am a bit perplexed by this one. Currently, I am trying to save by two different partitions, using the partitionBy function. It looks something like below (made up names):
df.write.partitionBy("account", "markers")
.mode(SaveMode.Overwrite)
.parquet(s"$location$org/$corrId/")
This particular dataframe has around 30gb of data, 2000 accounts and 30 markers. The accounts and markers are close to evenly distributed. I have tried using 5 core nodes and 1 master node driver of amazon's r4.8xlarge (220+ gb of memory) with the default maximize resource allocation setting (which 2x cores for executors and around 165gb of memory). I have also explicitly set the number of cores, executors to different numbers, but had the same issues. When looking at Ganglia, I don't see any excessive memory consumption.
So, it seems very likely that the root cause is the 2gb ByteArrayBuffer issue that can happen on shuffles. I then tried repartitioning the dataframe with various numbers, such as 100, 500, 1000, 3000, 5000, and 10000 with no luck. The job occasionally logs a heap space error, but most of the time gives a node lost error. When looking at the individual node logs, it just seems to suddenly fail with no indication of the problem (which isn't surprising with some oom exceptions).
For dataframe writes, is there a trick to partitionBy's to either get passed the memory heap space error?
I have a spark job that blows up our CDH cluster in one of two ways depending on how I partition things. The purpose of this job is to generate anywhere between 1 and 210,094,780,875 sets of four integers. The job is being submitted via spark-submit, master is set to YARN. Below is the code snip germane to this question:
// build rdd and let cluster build up the ngram list
val streamList_rdd = sc.parallelize(streamList).repartition(partitionCount)
val rdd_results = streamList_rdd.flatMap { x => x.toList }
println(rdd_results.count())
streamList is a list of generators that have been seeded with a floor/ceiling value (a tuple containing two Ints) that will generate sets of four integers bounded by floor/ceiling. The idea is to farm out the generation work across the cluster and that's where the front falls off. If partitionCount is too low (and thus the size of each partition is large), the workers blow up due to lack of memory. If partitionCount is high (and thus the size of each partition is manageable from a memory perspective), you start seeing errors like this one:
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
The memory issue I understand - what I'm not understanding is why there are issues with having a high partition count (~100k or more). Is there a way for me to make this work while preserving YARN's role in managing cluster resources?
Given the amount of data, and the presence of memory errors, I think you need to assign more cluster resources.
Increasing partitions improves parallelism but at the cost of consuming more resources on an already insufficiently sized cluster. I also suspect the repartition operation causes a shuffle which is an expensive operation at the best of times, very bad (catastrophic!) when you have enough data to out-of-memory. But without logs, that is conjecture.
The cause of heartbeat failure is likely either an executor is under such heavy load it fails to respond in time, or the process has crashed / been killed by YARN...
I have get an error when using mllib RandomForest to train data. As my dataset is huge and the default partition is relative small. so an exception thrown indicating that "Size exceeds Integer.MAX_VALUE" ,the orignal stack trace as following,
15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in
stage 6.0 (TID 120, 10.215.149.47):
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828) at
org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123) at
org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132) at
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
at
org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
at
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
The Integer.MAX_SIZE is 2GB, it seems that some partition out of memory. So i repartiton my rdd partition to 1000, so that each partition could hold far less data as before. Finally, the problem is solved!!!
So, my question is :
Why partition size has the 2G limit? It seems that there is no configure set for the limit in the spark
The basic abstraction for blocks in spark is a ByteBuffer, which unfortunately has a limit of Integer.MAX_VALUE (~2GB).
It is a critical issue which prevents use of spark with very large datasets.
Increasing the number of partitions can resolve it (like in OP's case), but is not always feasible, for instance when there is large chain of transformations part of which can increase data (flatMap etc) or in cases where data is skewed.
The solution proposed is to come up with an abstraction like LargeByteBuffer which can support list of bytebuffers for a block. This impacts overall spark architecture, so it has remained unresolved for quite a while.
the problem is when using datastores like Casandra, HBase, or Accumulo the block size is based on the datastore splits (which can be over 10 gig). when loading data from these datastores you have to immediately repartitions with 1000s of partitions so you can operated the data without blowing the 2gig limit.
most people that use spark are not really using large data; to them if it is bigger that excel can hold or tableau is is big data to them; mostly data scientist who use quality data or use a sample size small enough to work with the limit.
when processing large volumes of data, i end of having to go back to mapreduce and only used spark once the data has been cleaned up. this is unfortunate however, the majority of the spark community has no interest in addressing the issue.
a simple solution would be to create an abstraction and use bytearray as default; however, allow to overload a spark job with an 64bit data pointer to handle the large jobs.
The Spark 2.4.0 release removes this limit by replicating block data as a stream. See Spark-24926 for details.