I'm using pyspark.
Previously I had a similar problem: I collected a lot of data in the driver program and didn't realize that the DataFrame's lineage kept the source data in the driver until the action completed. The solution was to checkpoint the DataFrame (so the lineage no longer referenced the data held in the driver).
But recently I hit the same problem, and this time I don't collect data in the driver; I write the result to a database. Are there other heavy objects that could cause this error?
I checked all the logs and didn't find any warnings or suspicious messages. The last action started was a checkpoint. Of course I could increase the driver memory limit, but the reason for such consumption is still unknown and I'm afraid I have a bottleneck somewhere.
Error message:
Current usage: 4.5 GB of 4.5 GB physical memory used; 6.4 GB of 9.4 GB virtual memory used. Killing container.
And it sends SIGKILL to executors
UPDT:
After some investigation I suspect the problem may be in BroadcastHashJoin. We have a lot of small tables and broadcast joins are used to join them, but to broadcast a table Spark needs to materialize it in the driver, so I disabled it by setting spark.sql.autoBroadcastJoinThreshold=-1 (see the sketch below). Driver memory consumption looks lower now, and the AM no longer raises the previous error. But when writing the result table to the database (Postgres), the executors get stuck on the last few tasks of the last stage. In the executor's thread dump I see this stuck task:
Executor task launch worker for task 635633 RUNNABLE Lock(java.util.concurrent.ThreadPoolExecutor$Worker#1287601873}), Monitor(java.io.BufferedOutputStream#1935023563}), Monitor(org.postgresql.core.v3.QueryExecutorImpl#625626056})
It doesn't fail, it just waits. Could this be a lock from Postgres or something?
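A minimal sketch of the two pieces mentioned above, assuming a SparkSession named spark and a result DataFrame named resultDf (both hypothetical); the JDBC URL, table, credentials and batch size are placeholders:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("join-and-write").getOrCreate()

// Disable automatic broadcast joins so small tables are no longer
// materialized on the driver before being broadcast to the executors.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Hypothetical result of the joins; replace with the real computation.
val resultDf: DataFrame = spark.table("joined_result_view")

// Plain JDBC write to Postgres; batchsize controls how many rows go
// into each INSERT batch. The postgresql JDBC jar must be on the classpath.
resultDf.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")  // placeholder
  .option("dbtable", "result_table")                     // placeholder
  .option("user", "writer")                              // placeholder
  .option("password", "secret")                          // placeholder
  .option("driver", "org.postgresql.Driver")
  .option("batchsize", "10000")
  .mode("append")
  .save()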
I have a MongoDB 4.2 cluster with 15 shards; the database stores a sharded collection of 6 GB (i.e., about 400 MB per machine).
I'm trying to read the whole collection from Apache Spark, which runs on the same machine. The Spark application runs with --num-executors 8 and --executor-cores 6; the connection is made through the spark-connector by configuring the MongoShardedPartitioner (a configuration sketch follows at the end of this question).
Besides the reading being very slow (about 1.5 minutes; but, as far as I understand, full scans are generally bad on MongoDB), I'm experiencing this weird behavior in Spark's task allocation:
The issues are the following:
For some reason, only one of the executors starts reading from the database, while all the others wait 25 seconds to begin reading. The red bars correspond to "Task Deserialization Time", but my understanding is that they are simply idle (if there are concurrent stages, these executors work on something else and only come back to this stage after the 25 seconds).
For some other reason, after a while the concurrent allocation of tasks is suspended and then resumes all at once (at about 55 seconds from the start of the job); you can see it in the middle of the picture, where a whole bunch of tasks starts at the same time.
Overall, the full scan could be completed in far less time if tasks were allocated properly.
What is the reason for these behaviors and who is responsible (is it Spark, the spark-connector, or MongoDB)? Is there some configuration parameter that could cause these problems?
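For reference, a minimal sketch of how such a read is usually wired up with the connector; the URI, database, collection and shard key below are placeholders, and the option names are the 2.x-era ones, so adjust them to your connector version:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("mongo-full-scan")
  // Placeholder URI pointing at a mongos router of the sharded cluster.
  .config("spark.mongodb.input.uri", "mongodb://mongos-host:27017/mydb.mycollection")
  .config("spark.mongodb.input.partitioner", "MongoShardedPartitioner")
  // The sharded partitioner splits work by shard key; "_id" is a placeholder.
  .config("spark.mongodb.input.partitionerOptions.shardkey", "_id")
  .getOrCreate()

// Full scan of the collection; roughly one Spark partition per chunk.
val df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
println(s"partitions = ${df.rdd.getNumPartitions}, rows = ${df.count()}")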
I've faced a situation, running a cluster on AWS EMR, where one stage remained 'running' while the execution plan continued to progress. Look at the screenshot from the Spark UI (job 4 has running tasks, yet job 7 is already in progress). My question is how to debug such a situation; are there any hints I can find in the DAG?
My thought is that it could be some memory issue, because the data is heavy and there are a lot of spills to disk. However, I am wondering why Spark stays idle for an hour. Is it related to driver memory issues?
UPD1:
Based on Ravi's requests:
(1) Check how long they have been running and also the GC time. If GC time is >20% of the execution time, it means you are throttled by memory.
No, it is not an issue.
(2) Check the number of active tasks on the same page.
That's really weird: there are executors with more active tasks than their core capacity (3x more for some executors), yet I don't see any executor failures.
(3) See if all executors are spending equal time running the job.
Not an issue
(4) What you showed above is the job; what about the stages etc.? Are they also paused forever?
I think anyone who has used Spark has run into OOM errors, and usually the source of the problem can be found easily. However, I am a bit perplexed by this one. Currently, I am trying to save the output partitioned by two different columns, using the partitionBy function. It looks something like the below (made-up names):
df.write.partitionBy("account", "markers")
.mode(SaveMode.Overwrite)
.parquet(s"$location$org/$corrId/")
This particular DataFrame has around 30 GB of data, 2000 accounts and 30 markers. The accounts and markers are close to evenly distributed. I have tried using 5 core nodes and 1 master/driver node of Amazon's r4.8xlarge type (220+ GB of memory) with the default maximizeResourceAllocation setting (which gives 2x cores per executor and around 165 GB of memory). I have also explicitly set the number of cores and executors to different values, but had the same issues. When looking at Ganglia, I don't see any excessive memory consumption.
So, it seems very likely that the root cause is the 2 GB ByteArrayBuffer issue that can happen on shuffles. I then tried repartitioning the DataFrame with various numbers, such as 100, 500, 1000, 3000, 5000, and 10000, with no luck. The job occasionally logs a heap space error, but most of the time gives a node lost error. When looking at the individual node logs, it just seems to fail suddenly with no indication of the problem (which isn't surprising with some OOM exceptions).
For DataFrame writes, is there a trick with partitionBy to get past the heap space error?
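One workaround that is commonly suggested for OOMs with partitionBy (only a sketch, not confirmed as the fix here) is to repartition by the same columns before the write, so each task writes to only one output partition instead of keeping writers for many partitions open at once:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Co-locate rows for each (account, markers) pair before partitionBy
// splits them into output directories; each task then writes a single
// partition directory instead of buffering writers for many of them.
df.repartition(col("account"), col("markers"))
  .write
  .partitionBy("account", "markers")
  .mode(SaveMode.Overwrite)
  .parquet(s"$location$org/$corrId/")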
I have a general question regarding Apache Spark and how to distribute data from driver to executors.
I load a file with 'scala.io.Source' into a collection, then parallelize the collection with 'SparkContext.parallelize'. Here the issue begins: when I don't specify the number of partitions, the number of workers is used as the partition count, the tasks are sent to the nodes, and I get a warning that the recommended task size is 100 kB while my task size is e.g. 15 MB (60 MB file / 4 nodes). The computation then ends with an 'OutOfMemory' exception on the nodes. When I parallelize into more partitions (e.g. 600 partitions, to get about 100 kB per task), the computations complete successfully on the workers, but after some time an 'OutOfMemory' exception is raised in the driver. In this case I can open the Spark UI and watch the driver's memory being slowly consumed during the computation. It looks like the driver holds everything in memory and doesn't store intermediate results on disk.
My questions are:
Into how many partitions to divide RDD?
How to distribute data 'the right way'?
How to prevent memory exceptions?
Is there a way how to tell driver/worker to swap? Is it a configuration option or does it have to be done 'manually' in program code?
Thanks
How to distribute data 'the right way'?
You will need a distributed file system, such as HDFS, to host your file. That way, each worker can read a piece of the file in parallel. This will deliver better performance than serializing and sending the data from the driver.
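A minimal sketch of the difference, assuming the file has been copied to HDFS (both paths below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("read-example"))

// Driver-side load plus parallelize: the whole file, and the serialized
// tasks that carry its pieces, live in driver memory.
val local = scala.io.Source.fromFile("/local/path/input.txt").getLines().toList
val rddFromDriver = sc.parallelize(local, 600)

// Distributed read: each worker reads its own HDFS blocks directly,
// nothing is shipped from the driver.
val rddFromHdfs = sc.textFile("hdfs:///data/input.txt")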
How to prevent memory exceptions?
Hard to say without looking at the code. Most operations will spill to disk. If I had to guess, I'd say you are using groupByKey?
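To illustrate that guess (reusing the sc from the sketch above; the pairs are made up):

// Hypothetical key-value pairs.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// reduceByKey combines partial sums on each partition before the
// shuffle, so memory stays bounded.
val sums = pairs.reduceByKey(_ + _)

// groupByKey ships every value for a key to a single task and holds
// them all in memory before summing.
val sumsViaGroup = pairs.groupByKey().mapValues(_.sum)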
Into how many partitions to divide RDD?
I think the rule of thumb (for optimal parallelism) is 2-4x the number of cores available to your job. As you have done, you can trade time for memory usage.
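As a rough sketch of that rule of thumb (again reusing sc; the factor of 3 is just a middle value in the 2-4x range):

// Aim for roughly 2-4x the cores the job actually gets.
val numPartitions = sc.defaultParallelism * 3
val rdd = sc.parallelize(1 to 1000000, numPartitions)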
Is there a way how to tell driver/worker to swap? Is it a configuration option or does it have to be done 'manually' in program code?
Shuffle spill behavior is controlled by the property spark.shuffle.spill. It's true (=spill to disk) by default.
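If you want to pin it down explicitly anyway, it is just a configuration entry (note that newer Spark versions may ignore this flag and always spill):

import org.apache.spark.SparkConf

// spark.shuffle.spill defaults to true; set here only for illustration.
val conf = new SparkConf()
  .setAppName("spill-example")
  .set("spark.shuffle.spill", "true")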
I have a single test node with 8 GB of RAM, on which I am loading barely 10 MB of data (from CSV files) into Cassandra (on the same node). I'm trying to process this data using Spark (also running on the same node).
Please note that for SPARK_MEM I'm allocating 1 GB of RAM, and the same for SPARK_WORKER_MEMORY. Allocating any extra memory results in Spark throwing a "Check if all workers are registered and have sufficient memory" error, which is more often than not indicative of Spark looking for extra memory (as per the SPARK_MEM and SPARK_WORKER_MEMORY properties) and coming up short.
When I try to load and process all the data in the Cassandra table using the Spark context object, I get an error during processing. So I'm trying to use a looping mechanism to read chunks of data at a time from one table, process them, and put them into another table.
My source code has the following structure:
var data = sc.cassandraTable("keyspacename", "tablename").where("value=?", 1)
data.map(x => tranformFunction(x)).saveToCassandra("keyspacename", "tablename")

for (i <- 2 to 50000) {
  data = sc.cassandraTable("keyspacename", "tablename").where("value=?", i)
  data.map(x => tranformFunction(x)).saveToCassandra("keyspacename", "tablename")
}
Now, this works for a while, for around 200 loops, and then it throws an error: java.lang.OutOfMemoryError: unable to create a new native thread.
I've got two questions:
Is this the right way to deal with data?
How can processing just 10 MB of data do this to a cluster?
You are running a query inside the for loop. If the 'value' column is not a key/indexed column, Spark will load the table into memory and then filter on the value. This will certainly cause an OOM.
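As an illustration of the alternative (a sketch only, reusing the connector calls and the made-up names from the question): one pass that reads the table once, transforms on the executors, and writes out, instead of 50,000 driver-side loops that each spawn their own jobs and threads:

import com.datastax.spark.connector._   // brings in cassandraTable / saveToCassandra

// Single job: the scan, the transform and the write all run on the
// executors, and the driver submits one lineage instead of 50,000.
sc.cassandraTable("keyspacename", "tablename")
  .map(x => tranformFunction(x))
  .saveToCassandra("keyspacename", "othertable")   // placeholder target table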