Writing using Pyspark and MongoDB Spark Connector gets stuck on Databricks

I'm using the MongoDB Spark Connector (2.12:3.0.1) to write data from a Databricks job (runtime 9.1 LTS ML, Spark 3.1.2, Scala 2.12) run from a notebook with PySpark. The job succeeds when I sample a smaller number of rows, but at full scale (180 M rows) it appears to get stuck after roughly 1.5 hours without throwing any error.
To clarify: the Spark process stays alive, but nothing seems to happen on either the Spark nodes or the MongoDB side. On MongoDB, writes drop to 0, and at the same time Ganglia shows node utilization dropping to near 0. In the Spark UI the job is still in a running state, with only the last few tasks shown as running, but nothing is progressing.
My initial code:
df.write.format("com.mongodb.spark.sql.DefaultSource").mode("overwrite") \
.option("database", database) \
.option("collection", destination_collection).save()
After investigating a bit, it seemed the root cause could be timeouts on the MongoDB side, possibly related to the writeConcern (w/wTimeoutMS). To test this I added the following options with a very small allowed timeout, expecting an exception. I didn't get one, so I guess the options are not being applied correctly.
My refactored code:
df.write.format("com.mongodb.spark.sql.DefaultSource").mode("overwrite") \
.option("database", database) \
.option("collection", destination_collection) \
.option("writeConcern.w", 2) \
.option("writeConcern.wTimeoutMS", 1).save()
Has anyone else encountered this issue and found a proper solution?


Session isn't active Pyspark in an AWS EMR cluster

I have launched an AWS EMR cluster, and in a pyspark3 Jupyter notebook I run this code:
textRdd = sparkDF.select(textColName).rdd.flatMap(lambda x: x)
I got this error:
An error was encountered:
Invalid status code '400' from http://..../sessions/4/statements/7 with error payload: {"msg":"requirement failed: Session isn't active."}
Running the line:
I also created a small subset of the file and all my code runs fine.
What is the problem?
I had the same issue, and the reason for the timeout is the driver running out of memory. Since you run collect(), all the data gets sent to the driver. By default the driver memory is 1000M when creating a Spark application through JupyterHub, even if you set a higher value through config.json. You can see that by executing the code from within a Jupyter notebook
To increase the driver memory, just run:
%%configure -f
{"driverMemory": "6000M"}
This will restart the application with increased driver memory. You might need to use higher values for your data. Hope it helps.
From the answer to this Stack Overflow question, which worked for me:
Judging by the output, if your application is not finishing with a FAILED status, that sounds like a Livy timeout error: your application is likely taking longer than the defined timeout for a Livy session (which defaults to 1h). Even if the Spark app succeeds, your notebook will receive this error when the app runs longer than the Livy session's timeout.
If that's the case, here's how to address it:
1. edit the /etc/livy/conf/livy.conf file (in the cluster's master node)
2. set the livy.server.session.timeout to a higher value, like 8h (or larger, depending on your app)
3. restart Livy to update the setting: sudo restart livy-server on the cluster's master node
4. test your code again
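The edit in steps 1 and 2 could be scripted roughly as follows. This is only a sketch: the function name set_livy_session_timeout is hypothetical, it assumes livy.conf uses "key = value" lines (with defaults possibly commented out with "#"), and it works on the file's text rather than touching /etc/livy/conf/livy.conf directly. Restarting Livy afterwards is still required.

```python
import re

KEY = "livy.server.session.timeout"


def set_livy_session_timeout(conf_text, timeout="8h"):
    """Return conf_text with livy.server.session.timeout set to `timeout`.

    Assumption: entries look like "key = value"; a commented-out default
    such as "# livy.server.session.timeout = 1h" is replaced as well.
    """
    new_line = f"{KEY} = {timeout}"
    out, replaced = [], False
    for line in conf_text.splitlines():
        # Drop a leading comment marker so a commented-out default also matches.
        candidate = line.strip().lstrip("#").strip()
        # Compare the exact key so livy.server.session.timeout-check is untouched.
        name = re.split(r"[=\s]", candidate, maxsplit=1)[0]
        if name == KEY:
            if not replaced:
                out.append(new_line)
                replaced = True
            continue  # skip duplicate entries for the same key
        out.append(line)
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"


sample = (
    "# livy.server.session.timeout = 1h\n"
    "livy.server.session.timeout-check = true\n"
)
print(set_livy_session_timeout(sample), end="")
```

Working on a string keeps the sketch easy to try out before applying the result to the real file on the master node.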
Alternative way to edit this setting - https://allinonescript.com/questions/54220381/how-to-set-livy-server-session-timeout-on-emr-cluster-boostrap
Just a restart helped solve this problem for me. In your Jupyter Notebook, go to Kernel > Restart.
Once done, if you run the cell with "spark" command you will see that a new spark session gets established.
You might get some insights from this similar Stack Overflow thread: Timeout error: Error with 400 StatusCode: "requirement failed: Session isn't active."
A solution might be to increase spark.executor.heartbeatInterval. The default is 10 seconds.
See EMR's official documentation on how to change Spark defaults:
You change the defaults in spark-defaults.conf using the spark-defaults configuration classification or the maximizeResourceAllocation setting in the spark configuration classification.
Insufficient reputation to comment.
I tried increasing the heartbeat interval to a much higher value (100 seconds), with the same result. FWIW, the error shows up in < 9s.
What worked for me is adding {"Classification": "spark-defaults", "Properties": {"spark.driver.memory": "20G"}} to the EMR configuration.
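For reference, the EMR configuration is a JSON list of such classification objects. A minimal sketch that builds it in Python (the helper name spark_defaults_classification is hypothetical; the 20G value is the one quoted above and may need adjusting for your data):

```python
import json


def spark_defaults_classification(properties):
    """Build one EMR configuration entry for the spark-defaults classification."""
    return {
        "Classification": "spark-defaults",
        # EMR expects property values as strings.
        "Properties": {key: str(value) for key, value in properties.items()},
    }


configurations = [spark_defaults_classification({"spark.driver.memory": "20G"})]
print(json.dumps(configurations, indent=2))
```

The printed JSON can be pasted into the software settings when creating the cluster.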

Spark-Submit execution time

I have developed a Scala program on Spark that connects to a MySQL database, pulls about 250K records, and processes them. When I execute the application from the IDE itself (IntelliJ) it takes about 1 minute to complete, whereas if I submit it through spark-submit from my terminal it takes 4 minutes.
Scala Code
val sparkSession = SparkSession.builder().
From Terminal
spark-submit --master local[*] .....
Should I make any changes, or is this normal behaviour? I set local[*] in the code and also supply it from the terminal.
local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
This is from the Spark documentation.
You can adjust the value of K,
for example local[4] or local[8], depending on your CPU.
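One way to pick K is to derive it from the machine's core count. A small sketch, assuming you want to cap the requested thread count at the number of available cores (the helper name local_master_url and the capping policy are illustrative choices, not something Spark requires):

```python
import os


def local_master_url(threads=None):
    """Return a local master URL such as "local[4]".

    With threads=None, return "local[*]", which asks Spark to use as many
    worker threads as there are logical cores.
    """
    if threads is None:
        return "local[*]"
    if threads < 1:
        raise ValueError("threads must be at least 1")
    available = os.cpu_count() or 1
    # Assumption: more threads than cores is rarely useful, so cap the value.
    return f"local[{min(threads, available)}]"


print(local_master_url())
print(local_master_url(1))
```

The resulting string can be passed to --master on the command line or to master(...) when building the session.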

A master url must be set to your configuration (Spark scala on AWS)

This is what I wrote in IntelliJ. I plan on eventually writing larger Spark Scala files.
Anyway, I uploaded it to an AWS cluster that I had created. The "master" line, line 11, was master("local"). I ran into this error.
The second picture is the error that AWS returned when the job did not run successfully. I changed line 11 to "yarn" instead of local (see the first picture for its current state).
It still returns the same error. I passed the following flags when I uploaded it manually:
--steps Type=CUSTOM_JAR,Name="SimpleApp"
It worked two weeks ago. My friend did almost the exact same thing as me. I am not sure why it isn't working.
I am looking for both a brief explanation and an answer. Looks like I need a little more knowledge on how spark works.
I am working with amazon EMR.
I think on line 9 you are creating a SparkContext the "old way", as in Spark 1.6.x and older versions. With that approach you need to set the master in the default configuration file (usually conf/spark-defaults.conf) or pass it to spark-submit (it is required when using new SparkConf())...
On line 10 you are creating the "spark" context with SparkSession, which is the approach in Spark 2.0.0. So in my opinion your problem is line 9: you should remove it and work with SparkSession, or set the required configuration for SparkContext in case you need sc.
You can access the SparkContext with sparkSession.sparkContext.
If you still want to use SparkConf, you need to define the master programmatically:
val sparkConf = new SparkConf().setAppName("spark-application-name").setMaster("local[4]").set("spark.executor.memory", "512m")
or with declarative approach in conf/spark-defaults.conf
spark.master local[4]
spark.executor.memory 512m
or simply at runtime:
./bin/spark-submit --name "spark-application-name" --master local[4] --executor-memory 512m your-spark-job.jar
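When the same property is set in more than one of these places, the Spark configuration documentation gives the order of precedence: values set directly on SparkConf win, then flags passed to spark-submit, then conf/spark-defaults.conf. The following is a toy model of that lookup, not Spark code; the function name and the dictionaries are hypothetical.

```python
def resolve_spark_property(key, in_code=None, submit_flags=None, defaults_file=None):
    """Return the value Spark would be expected to use for `key`, or None if unset."""
    # Highest precedence first: SparkConf in code, then spark-submit flags,
    # then conf/spark-defaults.conf.
    for source in (in_code, submit_flags, defaults_file):
        if source and key in source:
            return source[key]
    return None


master = resolve_spark_property(
    "spark.master",
    in_code={"spark.master": "local[4]"},
    submit_flags={"spark.master": "yarn"},
)
print(master)  # the value hardcoded in the application wins
```

A result of None corresponds to the case where no master is configured anywhere, which is when Spark reports that a master URL must be set. The model also shows why a master hardcoded in the application overrides whatever is passed with --master.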
Try using the below code:
val spark = SparkSession.builder().master("spark://ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com:xxxx").appName("example").getOrCreate()
You need to provide the correct master URL of your AWS cluster.

Spark: long delay between jobs

We are running a Spark job that extracts data, does some expensive data conversion, and writes to several different files. Everything runs fine, but I'm getting random long delays between the end of a resource-intensive job and the start of the next one.
In the picture below, the job scheduled at 17:22:02 took 15 min to finish, so I expected the next job to be scheduled around 17:37:02. However, the next job was scheduled at 22:05:59, more than 4 hours after the previous job succeeded.
When I dig into the next job's Spark UI, it shows <1 sec scheduler delay, so I'm confused about where this 4-hour delay is coming from.
(Spark 1.6.1 with Hadoop 2)
I can confirm that David's answer below is spot on: the way I/O ops are handled in Spark is a bit unexpected. (It makes sense that a file write essentially does a "collect" behind the curtain before it writes, considering ordering and/or other operations.) But I'm a bit uncomfortable with the fact that I/O time is not included in job execution time. I guess you can see it in the "SQL" tab of the Spark UI, as queries are still running even when all jobs have succeeded, but you cannot dive into it at all.
I'm sure there are more ways to improve this, but the two methods below were sufficient for me:
reduce file count
set parquet.enable.summary-metadata to false
I/O operations often come with significant overhead that occurs on the master node. Since this work isn't parallelized, it can take quite a bit of time. And since it is not a job, it does not show up in the resource manager UI. Some examples of I/O tasks that are done by the master node:
Spark will write to temporary s3 directories, then move the files using the master node
Reading of text files often occurs on the master node
When writing parquet files, the master node will scan all the files post-write to check the schema
These issues can be solved by tweaking YARN settings or redesigning your code. If you provide some source code, I might be able to pinpoint your issue.
Discussion of writing I/O Overhead with Parquet and s3
Discussion of reading I/O Overhead "s3 is not a filesystem"
I faced a similar issue when writing parquet data to s3 with pyspark on EMR 5.5.1. All workers would finish writing data to the _temporary directory in the output folder, and the Spark UI would show that all tasks had completed. But the Hadoop Resource Manager UI would neither release resources for the application nor mark it as complete. On checking the s3 bucket, it seemed the Spark driver was moving the files one by one from the _temporary directory to the output bucket, which was extremely slow, and the whole cluster was idle except the driver node.
The solution that worked for me was to use the committer class provided by AWS (EmrOptimizedSparkSqlParquetOutputCommitter) by setting the configuration property spark.sql.parquet.fs.optimized.committer.optimization-enabled to true.
spark-submit ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
pyspark ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
Note that this property is available in EMR 5.19 or higher.
After running the spark job on EMR 5.20.0 using above solution, it did not create any _temporary directory & all the files were directly written to the output bucket, hence job finished very quickly.
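If the job is launched from a script, the --conf flag can be assembled programmatically. A minimal sketch, assuming the command is built as an argument list; the helper name conf_flags and the jar name your-spark-job.jar are placeholders, not part of the original setup.

```python
import shlex

COMMITTER_PROPERTY = "spark.sql.parquet.fs.optimized.committer.optimization-enabled"


def conf_flags(properties):
    """Turn a mapping of Spark properties into spark-submit --conf arguments."""
    flags = []
    for key, value in properties.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


# "your-spark-job.jar" is a placeholder for the actual application.
command = ["spark-submit", *conf_flags({COMMITTER_PROPERTY: "true"}), "your-spark-job.jar"]
print(shlex.join(command))
```

Keeping the properties in a dictionary makes it easy to add further settings without editing the command string by hand.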
For more details:

apache spark: local[K] master URL - job gets stuck

I am using apache spark 0.8.0 to process a large data file and perform some basic .map and .reduceByKey operations on the RDD.
Since I am using a single machine with multiple processors, I mention local[8] in the Master URL field while creating SparkContext
val sc = new SparkContext("local[8]", "Tower-Aggs", SPARK_HOME )
But whenever I specify multiple processors, the job gets stuck (pauses/halts) randomly. There is no definite place where it gets stuck; it's just random. Sometimes it doesn't happen at all. I am not sure whether it continues after that, but it stays stuck for a long time, after which I abort the job.
But when I just use local in place of local[8], the job runs seamlessly without getting stuck ever.
val sc = new SparkContext("local", "Tower-Aggs", SPARK_HOME )
I am not able to understand where the problem is.
I am using Scala 2.9.3 and sbt to build and run the application
I'm using Spark 1.0.0 and met the same problem: if a function passed to a transformation or action waits or loops indefinitely, Spark won't wake it or terminate/retry it by default, in which case you can kill the task.
However, a recent feature (speculative tasks) allows Spark to start replicated tasks if a few tasks take much longer than the average running time of their peers. This can be enabled and configured with the following config properties:
spark.speculation (default: false): If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.
spark.speculation.interval (default: 100): How often Spark will check for tasks to speculate, in milliseconds.
spark.speculation.quantile (default: 0.75): Percentage of tasks which must be complete before speculation is enabled for a particular stage.
spark.speculation.multiplier (default: 1.5): How many times slower a task is than the median to be considered for speculation.
(source: http://spark.apache.org/docs/latest/configuration.html)
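To make the quantile and multiplier settings concrete, here is a simplified model of the rule those descriptions imply. It is not Spark's scheduler code; the function name and the task identifiers are made up for illustration.

```python
from statistics import median


def speculation_candidates(finished, running, quantile=0.75, multiplier=1.5):
    """Return ids of running tasks that would be considered for speculation.

    finished: durations (seconds) of tasks that already completed in the stage
    running:  mapping of task id -> elapsed seconds for tasks still running
    """
    total = len(finished) + len(running)
    # Speculation only starts once the given fraction of tasks has completed.
    if not finished or len(finished) < quantile * total:
        return []
    # A task is a candidate when it is `multiplier` times slower than the median.
    threshold = multiplier * median(finished)
    return [task_id for task_id, elapsed in running.items() if elapsed > threshold]


finished = [10, 12, 11, 9, 10, 11]
running = {"task-6": 40, "task-7": 12}
print(speculation_candidates(finished, running))
```

In this example 6 of 8 tasks have finished, so the 0.75 quantile is met. The median duration is 10.5 seconds, which gives a threshold of 15.75 seconds, so only the task that has been running for 40 seconds is flagged.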