Metrics for latency or performance of checkpointing - spark-structured-streaming

We are using Spark 2.3 and we are facing a weird issue. The streaming job works perfectly fine in one environment, but in another environment it does not.
We suspect that in the low-performing environment the checkpointing of the state is not as fast as in the other.
Are there any metrics emitted by Spark Structured Streaming that we can look into to confirm our theory?
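For reference, Structured Streaming exposes per-micro-batch timings through StreamingQuery.lastProgress / recentProgress (the same JSON is pushed to a StreamingQueryListener on the JVM side), which can be compared across the two environments. A minimal PySpark sketch, using a rate source and a memory sink purely for illustration; the exact keys inside durationMs and stateOperators vary by Spark version:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("progress-metrics").getOrCreate()

# Illustrative query: the source, sink, and checkpoint path are placeholders.
query = (
    spark.readStream.format("rate").load()
    .groupBy(F.window("timestamp", "1 minute")).count()
    .writeStream.format("memory").queryName("agg")
    .option("checkpointLocation", "/tmp/agg-checkpoint")
    .outputMode("complete")
    .start()
)

progress = query.lastProgress  # dict with the most recent micro-batch metrics, or None
if progress:
    print(progress["durationMs"])          # e.g. triggerExecution, addBatch, walCommit, queryPlanning
    print(progress.get("stateOperators"))  # per-operator state metrics (memoryUsedBytes, numRowsTotal, ...)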

Related

PyFlink performance compared to Scala

How does PyFlink performance compare to Flink + Scala?
Big Picture.
The goal is to build Lambda architecture with Cold and Hot Tier.
Cold (Batch) Tier will be implemented with Apache Spark (PySpark).
But with Hot (Streaming) Tier there are different options: Spark Streaming or Flink.
Since Apache Flink is pure streaming rather than Spark's micro-batches, I tend to choose Apache Flink.
But my only point of concern is the performance of PyFlink. Will it have lower latency than PySpark streaming? Is it slower than Flink code written in Scala? In what cases is it slower?
Thank you in advance!
I have implemented something very similar, and from my experience these are a few things to keep in mind:
The performance of the job depends heavily on the type of code you write. If you use custom Python UDFs during extraction, performance will be slower than doing the same thing with Scala-based code, mainly because of the conversion of Python objects to JVM objects and back. The same overhead applies when you are using PySpark (see the sketch after this answer).
Flink is a true streaming process, while Spark's micro-batches are not, so if your use case really needs true streaming, go ahead with Flink.
If you stick to the native functions provided by PyFlink, you will not observe any noticeable difference in performance.
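To illustrate the UDF point above, here is a hedged PySpark sketch (the same reasoning applies to PyFlink): the Python UDF pushes every row through Python<->JVM serialization, while the equivalent built-in function stays inside the JVM. The DataFrame and column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-vs-builtin").getOrCreate()
df = spark.range(1_000_000).withColumn(
    "name", F.concat(F.lit("user_"), F.col("id").cast("string"))
)

# Python UDF: rows are pickled, shipped to a Python worker process, and shipped back.
to_upper_udf = F.udf(lambda s: s.upper() if s else None, StringType())
slow = df.withColumn("name_upper", to_upper_udf("name"))

# Built-in function: executes entirely inside the JVM, no per-row serialization.
fast = df.withColumn("name_upper", F.upper("name"))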

OOM and data loss issues using checkpoints with spark streaming (pyspark) on Databricks

I have encountered many issues using checkpoints with spark streaming on databricks. The code below led to OOM errors on our clusters. Investigating the cluster's memory usage, we could see that the memory was slowly increasing over time, indicating a memory leak (~10 days before OOM, while a batch only lasts a couple of minutes). After deleting the checkpoint so that a new one could be created, the memory leak disappeared, indicating the error originated from the checkpoint. In a similar streaming job, we also had a problem where some data were never being processed (again, fixed after re-creating the checkpoint).
Disclaimer: I do not fully understand the in-depth behaviour of checkpoints, as the online documentation is vague. Hence, I am not sure my configuration is good.
Below is a minimal example of the problem:
pyspark 3.0.1, python 3.7
The json conf of the clusters has the following element:
"spark_conf": {
"spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite": "true",
"spark.databricks.delta.properties.defaults.autoOptimize.autoCompact": "true"
}
code:
import pandas as pd
from pyspark.sql import functions as F

def for_each_batch(data, epoch_id):
    pass

(
    spark.readStream.format("delta")
    .load("path/to/delta")
    .filter(F.col("TIME") > pd.Timestamp.utcnow() - pd.Timedelta(hours=1))
    .writeStream
    .option("ignoreChanges", "true")
    .option("checkpointLocation", "path/to/checkpoint")
    .trigger(processingTime="3 minutes")
    .foreachBatch(for_each_batch)
    .start()
)
PS: If the content of the function 'for_each_batch' or the filtering condition is changed, should I re-create the checkpoint?

having Spark process partitions concurrently, using a single dev/test machine

I'm naively testing for concurrency in local mode, with the following spark context
SparkSession
  .builder
  .appName("local-mode-spark")
  .master("local[*]")
  .config("spark.executor.instances", 4)
  .config("spark.executor.cores", 2)
  .config("spark.network.timeout", "10000001")            // to avoid shutdown during debug, avoid otherwise
  .config("spark.executor.heartbeatInterval", "10000000") // to avoid shutdown during debug, avoid otherwise
  .getOrCreate()
and a mapPartitions API call like follows:
import spark.implicits._

val inputDF: DataFrame = spark.read.parquet(inputFile)
val resultDF: DataFrame =
  inputDF.as[T].mapPartitions(sparkIterator => new MyIterator).toDF
On the face of it, this did surface one concurrency bug in my code inside MyIterator (not a bug in Spark's code). However, I'd like to see my application crunch all available machine resources both in production and during this testing, so that the chances of spotting additional concurrency bugs improve.
That is clearly not the case for me so far: my machine is only at very low CPU utilization throughout the heavy processing of the inputDF, while there's plenty of free RAM and the JVM Xmx poses no real limitation.
How would you recommend testing for concurrency on a local machine? The objective is to verify that in production, Spark will not run into thread-safety or other concurrency issues in the code it applies from within MyIterator.
Can Spark, even in local mode, process separate partitions of my input DataFrame in parallel? Can I get Spark to work concurrently on the same DataFrame on a single machine, preferably in local mode?
Max parallelism
You are already running spark in local mode using .master("local[*]").
local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to know the number).
Max memory available to all executors/threads
I see that you are not setting the driver memory explicitly. By default the driver memory is 512M. If your local machine can spare more than this, set this explicitly. You can do that by either:
setting it in the properties file (default is spark-defaults.conf),
spark.driver.memory 5g
or by supplying configuration setting at runtime
$ ./bin/spark-shell --driver-memory 5g
Note that this cannot be achieved by setting it in the application, because by then it is already too late; the process has already started with some amount of memory.
Nature of Job
Check number of partitions in your dataframe. That will essentially determine how much max parallelism you can use.
inputDF.rdd.partitions.size
If the output of this is 1, your DataFrame has only one partition, so you won't get any concurrency when you operate on it. In that case, you might have to tweak the configuration or repartition to create more partitions so that tasks can run concurrently (see the sketch below).
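A minimal PySpark sketch of that check and fix (the Scala calls are analogous); the input path and the target partition count are placeholders:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")   # one task slot per available core
    .appName("local-parallelism-check")
    .getOrCreate()
)

input_df = spark.read.parquet("path/to/input.parquet")
print(input_df.rdd.getNumPartitions())   # 1 means no concurrency for this DataFrame

input_df = input_df.repartition(spark.sparkContext.defaultParallelism)
print(input_df.rdd.getNumPartitions())   # now roughly one partition per core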
Running in local mode cannot simulate a production environment, for the following reasons.
A lot of code that would normally run under any other cluster manager gets bypassed in local mode. Among various issues, a few things that come to mind:
a. Inability to detect bugs in the way shuffles are handled (shuffle data is handled in a completely different way in local mode).
b. Serialization-related issues will not be detected, since all code is available to the driver and tasks run inside the driver itself, so serialization problems never surface.
c. No speculative tasks (especially relevant for write operations).
d. Networking-related issues: all tasks are executed in the same JVM, so you cannot detect problems such as driver/executor communication failures or codegen-related issues.
Concurrency in local mode
a. The maximum concurrency that can be attained is equal to the number of cores in your local machine.
b. The job, stage, and task metrics shown in the Spark UI are not accurate, since they include the overhead of running in the same JVM as the driver.
c. As for CPU/memory utilization, it depends on the operation being performed. Is the operation CPU- or memory-intensive?
When to use local mode
a. Testing code that will run only on the driver
b. Basic sanity testing of the code that will get executed on the executors
c. Unit testing
tl;dr The concurrency bugs that occur in local mode might not even be present under other cluster resource managers, since there is a lot of special handling for local mode in the Spark code (many places check isLocal and take an entirely different code path).
Yes!
Achieving parallelism in local mode is quite possible.
Check the amount of memory and CPU available on your local machine and supply values for the driver-memory and driver-cores configuration when submitting your Spark job.
Increasing executor-memory and executor-cores will not make a difference in this mode.
Once the application is running, open the Spark UI for the job. You can go to the Executors tab to check the amount of resources your Spark job is actually using.
You can monitor the tasks that get generated, and how many of them run concurrently, using the Jobs and Stages tabs.
In order to process data that is much larger than the available resources, break your data into smaller partitions using repartition. This should allow your job to complete successfully.
Increase the default shuffle partitions if your job has aggregations or joins. Also, ensure there is sufficient space on the local file system, since Spark writes intermediate shuffle files to disk.
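For example, a hedged spark-submit sketch for local mode; the thread count, memory, shuffle-partition value, and script name are all placeholders to adjust to your machine:

# local[8]                     -> up to 8 concurrent tasks (threads)
# --driver-memory 8g           -> heap for the single local JVM running driver and tasks
# spark.sql.shuffle.partitions -> partition count after joins/aggregations
$ ./bin/spark-submit \
    --master "local[8]" \
    --driver-memory 8g \
    --conf spark.sql.shuffle.partitions=64 \
    my_job.py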
Hope this helps!

Mongo Spark connector write issues

We're observing significant increase in duration for writes, which eventually results in timeouts.
We're using replica set based MongoDB cluster.
It only happens during the high peak days of the week due to high volume.
We've tried deploying additional nodes, but it hasn't helped.
Attaching the screenshots.
We're using Mongo connector 2.2.1 on Databricks with Apache Spark 2.2.1.
Any recommendations to optimise write speed will be truly appreciated.
How many workers are there? Please check the DAG and the executor metrics for the job. If all writes are happening from a single executor, try repartitioning the dataset based on the number of executors, e.g.:
MongoSpark.save(dataset.repartition(50), writeConf);
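A hedged PySpark equivalent of the same idea, spreading the write across executors before handing the DataFrame to the Mongo Spark connector; the URI, database, collection, and partition count are placeholders, and the option names follow the 2.x connector's documented write options:

(df.repartition(50)
   .write
   .format("com.mongodb.spark.sql.DefaultSource")  # Mongo Spark connector 2.x data source
   .mode("append")
   .option("uri", "mongodb://host:27017")
   .option("database", "mydb")
   .option("collection", "mycoll")
   .save())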

Is checkpointing necessary in spark streaming

I have noticed that Spark Streaming examples also include code for checkpointing. My question is: how important is that checkpointing? If it's there for fault tolerance, how often do faults actually happen in such streaming applications?
It all depends on your use case. Suppose you are running a streaming job that just reads data from Kafka and counts the number of records. What would you do if your application crashes after a year or so?
If you don't have a backup/checkpoint, you will have to recompute the whole previous year's worth of data before you can resume counting.
If you have a backup/checkpoint, you can simply read the checkpoint data and resume instantly.
On the other hand, if all your streaming application does is Read-Messages-From-Kafka >>> Transform >>> Insert-into-a-Database, you need not worry about it crashing. Even if it crashes, you can simply restart the application without loss of data.
Note: Checkpointing is a process that stores the current state of a Spark application.
As for how often faults happen: you can almost never predict an outage. In companies, there might be
power outages
regular maintenance/upgrades of the cluster
Hope this helps.
There are two cases:
You are doing stateful operations, such as updateStateByKey; then you must use checkpointing, because every state has to be saved. Without setting a checkpoint directory, an exception will be thrown.
You are doing only windowed operations; then yes, you can disable checkpointing. However, I strongly recommend setting a checkpoint directory anyway.
When the driver is killed, you'll lose all your data and progress information. Checkpointing helps you recover the application from such situations.
Is a failure a normal situation? Of course! Imagine that you've got a large cluster with many machines and many components in those machines. If one of those components fails, your application will also fail. When the connection to the driver is lost, your application fails. With checkpointing, you can simply run the application again and it will recover its state.
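A minimal sketch of the stateful case (PySpark DStream API): updateStateByKey requires a checkpoint directory, otherwise the job fails at startup. The host, port, and checkpoint path are placeholders.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "stateful-word-count")
ssc = StreamingContext(sc, batchDuration=10)
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")  # mandatory for stateful operations

def update_count(new_values, running_count):
    # Merge this batch's values into the state carried across batches.
    return sum(new_values) + (running_count or 0)

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .updateStateByKey(update_count))
counts.pprint()

ssc.start()
ssc.awaitTermination()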