Spark streaming data pipelines on Dataproc experiencing sudden frequent socket timeouts - pyspark

I am using Spark streaming on Google Cloud Dataproc for executing a framework (written in Python) which consists of several continuous pipelines, each representing a single job on Dataproc, which basically read from Kafka queues and write the transformed output to Bigtable. All pipelines combined handle several gigabytes of data per day via 2 clusters, one with 3 worker nodes and one with 4.
Running this Spark streaming framework on top of Dataproc had been fairly stable until the beginning of May (the 3rd of May, to be precise), when we started experiencing frequent socket timeout exceptions that terminate our pipelines. It doesn't seem to be related to the load on the cluster, as that has not significantly increased. The timeouts happen at seemingly random times throughout the day, and I have checked possibly related code changes but could not find any. Moreover, this only seems to occur on the cluster with 4 worker nodes, while the pipelines on the cluster with 3 nodes are very similar and experience no timeouts at all. I have already recreated the cluster twice, but the issue remains and it affects all pipelines running on this Dataproc cluster. The cluster with 3 nodes uses the n1-standard-4 machine type, while the troublesome cluster with 4 nodes uses n1-standard-8; other than that, their configuration is identical.
Example output of a pipeline job execution when the problem occurs and the job terminates:
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:645)
16/05/23 14:45:45 ERROR org.apache.spark.streaming.scheduler.JobScheduler: Error running job streaming job 1464014740000 ms.0
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", line 65, in call
r = self.func(t, *rdds)
File "/tmp/b85990ba-e152-4d5b-8977-fb38915e78c4/transformfwpythonfiles.zip/transformationsframework/StreamManager.py", line 138, in process_kafka_rdd
.foreach(lambda *args: None)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 747, in foreach
self.mapPartitions(processPartition).count() # Force evaluation
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1004, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 995, in sum
return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 869, in fold
vals = self.mapPartitions(func).collect()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 772, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 142, in _load_from_socket
for item in serializer.load_stream(rf):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 139, in load_stream
yield self._read_with_length(stream)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 156, in _read_with_length
length = read_int(stream)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 543, in read_int
length = stream.read(4)
File "/usr/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
timeout: timed out
The start of the stacktrace is in our StreamManager module, method process_kafka_rdd: it processes a single discrete RDD within the direct stream of Kafka messages. Our integration of Kafka with Spark streaming is based upon the "direct approach" described on http://spark.apache.org/docs/latest/streaming-kafka-integration.html

In my experience, socket errors in Spark usually mean that some executor has suddenly died. Some other executor communicating with it at the time raises the socket error.
The usual cause of unexpected executor death is running out of some resource, most often memory.
(It's important to tune the amount of memory executors can use. The defaults are typically way too low. But I suspect you are already aware of this.)
I assume Spark is running on top of YARN? Unfortunately, in my experience Spark does a poor job of reporting the cause of the problem when it occurs down in the guts of YARN, so one has to dig into the YARN logs to figure out what actually caused the sudden executor death. Each executor runs in a YARN "container"; somewhere in the YARN logs there should be a record of a container falling over.
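One way an executor can run short of memory on YARN is by outgrowing the container that was requested for it. As a rough illustration of how that request is sized, here is a minimal sketch. It assumes the Spark 1.x-on-YARN default for spark.yarn.executor.memoryOverhead (10% of spark.executor.memory, with a floor of 384 MB); the function name is hypothetical, and your Spark version or cluster settings may differ.

```python
# Rough estimate of the memory Spark requests from YARN for one executor.
# Assumption: Spark 1.x defaults, where spark.yarn.executor.memoryOverhead
# is max(384 MB, 10% of spark.executor.memory) unless it is set explicitly.

OVERHEAD_FACTOR = 0.10   # assumed default fraction of the executor heap
OVERHEAD_MIN_MB = 384    # assumed default floor, in MB


def yarn_container_request_mb(executor_memory_mb, overhead_mb=None):
    """Return executor heap plus memory overhead, in MB (hypothetical helper)."""
    if overhead_mb is None:
        # Fall back to the assumed default: 10% of the heap, at least 384 MB.
        overhead_mb = max(int(OVERHEAD_FACTOR * executor_memory_mb), OVERHEAD_MIN_MB)
    return executor_memory_mb + overhead_mb
```

For example, yarn_container_request_mb(4096) gives 4505 under these assumptions. With PySpark the Python worker processes live outside the JVM heap, so they have to fit into the overhead part of that total; if the YARN logs show a container being killed for exceeding its memory limit, raising the overhead may help more than raising the heap.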


Flink job cant use savepoint in a batch job

Let me start in a generic fashion to see if I somehow missed some concepts: I have a streaming Flink job from which I created a savepoint. A simplified version of this job looks like this:
Pseudo-code:
val flink = StreamExecutionEnvironment.getExecutionEnvironment
val stream = if (batchMode) {
  flink.readFile(path)
} else {
  flink.addKafkaSource(topicName)
}
// keyBy returns a new keyed stream, so the calls have to be chained
val processed = stream
  .keyBy(key)
  .process(new ProcessorWithKeyedState())
CassandraSink.addSink(processed)
This works fine as long as I run the job without a savepoint. If I start the job from a savepoint I get an exception which looks like this
Caused by: java.lang.UnsupportedOperationException: Checkpoints are not supported in a single key state backend
at org.apache.flink.streaming.api.operators.sorted.state.NonCheckpointingStorageAccess.resolveCheckpoint(NonCheckpointingStorageAccess.java:43)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1623)
at org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:362)
at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:292)
at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:249)
I could work around this if I set the option:
execution.batch-state-backend.enabled: false
but this eventually results in another error:
Caused by: java.lang.IllegalArgumentException: The fraction of memory to allocate should not be 0. Please make sure that all types of managed memory consumers contained in the job are configured with a non-negative weight via `taskmanager.memory.managed.consumer-weights`.
at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:160)
at org.apache.flink.runtime.memory.MemoryManager.validateFraction(MemoryManager.java:673)
at org.apache.flink.runtime.memory.MemoryManager.computeMemorySize(MemoryManager.java:653)
at org.apache.flink.runtime.memory.MemoryManager.getSharedMemoryResourceForManagedMemory(MemoryManager.java:526)
Of course I tried to set the config key taskmanager.memory.managed.consumer-weights (I used DATAPROC:70,PYTHON:30), but this doesn't seem to have any effect.
So I wonder whether I have a conceptual error and can't reuse savepoints from a streaming job in a batch job, or whether I simply have a problem in my configuration. Any hints?
After a hint from the Flink user group it turned out that it is NOT possible to reuse a savepoint from the streaming job in batch mode (https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/datastream/execution_mode/#state-backends--state). So instead of running the job in batch mode (flink.setRuntimeMode(RuntimeExecutionMode.BATCH)), I just run it in the default execution mode (STREAMING). This has the minor downside that the job runs forever and has to be stopped by someone once all data has been processed.

Kafka Streams shutdown after IllegalStateException: No current assignment for partition

I have a Kafka Streams application that launches and runs successfully. We have 4 instances of the application running. Occasionally one of our instances of the application is legitimately killed, which causes several rounds of rebalancing until the old node is replaced.
Sometimes during the rebalance, one or more previously healthy nodes fail. The logs indicate that the Streams application transitions into a PENDING_SHUTDOWN state directly after receiving the following exception:
java.lang.IllegalStateException: No current assignment for partition public.chat.message-28
at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:256)
at org.apache.kafka.clients.consumer.internals.SubscriptionState.resetFailed(SubscriptionState.java:418)
at org.apache.kafka.clients.consumer.internals.Fetcher$2.onFailure(Fetcher.java:621)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
at org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
at org.apache.kafka.clients.consumer.internals.RequestFutureAdapter.onFailure(RequestFutureAdapter.java:30)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
at org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:571)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:389)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:297)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:215)
at org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:292)
at org.apache.kafka.clients.consumer.internals.Fetcher.getAllTopicMetadata(Fetcher.java:275)
at org.apache.kafka.clients.consumer.KafkaConsumer.listTopics(KafkaConsumer.java:1849)
at org.apache.kafka.clients.consumer.KafkaConsumer.listTopics(KafkaConsumer.java:1827)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.refreshChangelogInfo(StoreChangelogReader.java:259)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.initialize(StoreChangelogReader.java:133)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.restore(StoreChangelogReader.java:79)
at org.apache.kafka.streams.processor.internals.TaskManager.updateNewAndRestoringTasks(TaskManager.java:328)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:866)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:804)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:773)
Prior to this error we often also seem to receive some informational logs reporting a disconnect exception:
Error sending fetch request (sessionId=568252460, epoch=7) to node 4: org.apache.kafka.common.errors.DisconnectException
I have a feeling the two are related but I'm unable to reason why at present.
Is anyone able to give me some hints as to what may be causing this issue and any possible solutions?
Additional Info:
Kafka 2.2.1
32 partitions spread evenly across the 4 worker nodes
StreamsConfig settings:
kafkaStreamProps.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 2);
kafkaStreamProps.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
kafkaStreamProps.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
kafkaStreamProps.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 120000);
kafkaStreamProps.put(StreamsConfig.TOPOLOGY_OPTIMIZATION, StreamsConfig.OPTIMIZE);
This looks like it could be related to https://issues.apache.org/jira/browse/KAFKA-9073, which has been fixed in Kafka Streams 2.3.2.
If you can't wait for that release, you could try creating a private build using the changeset from this pull request: https://github.com/apache/kafka/pull/7630/files

IllegalStateException: _spark_metadata/0 doesn't exist while compacting batch 9

We have a streaming application implemented using Spark Structured Streaming which reads data from Kafka topics and writes it to an HDFS location.
Sometimes the application fails with the exception:
_spark_metadata/0 doesn't exist while compacting batch 9
java.lang.IllegalStateException: history/1523305060336/_spark_metadata/9.compact doesn't exist when compacting batch 19 (compactInterval: 10)
We are not able to resolve this issue.
The only solution I found is to delete the checkpoint location files, which makes the job read the topic/data from the beginning as soon as we run the application again. However, this is not a feasible solution for a production application.
Does anyone have a solution for this error that does not require deleting the checkpoint, so that I can continue from where the last run failed?
Sample code of application:
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", <server list>)
  .option("subscribe", <topic>)
  .load()
[...] // do some processing
dfProcessed.writeStream
  .format("csv")
  .option("format", "append")
  .option("path", hdfsPath)
  .option("checkpointLocation", "")
  .outputMode("append")
  .start()
The error message
_spark_metadata/n.compact doesn't exist when compacting batch n+10
can show up when you
process some data into a FileSink with checkpoint enabled, then
stop your streaming job, then
change the output directory of the FileSink while keeping the same checkpointLocation, then
restart the streaming job
Quick Solution (not for production)
Just delete the files in checkpointLocation and restart the application.
Stable Solution
As you do not want to delete your checkpoint files, you could simply copy the missing Spark metadata files from the old file sink output path to the new output path. See below to understand what the "missing Spark metadata files" are.
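As a minimal sketch of that copy step: the helper below copies every file from the old path's _spark_metadata folder that is not yet present in the new one. The function name is hypothetical, and the code assumes both paths are on a local filesystem; for HDFS the same idea would have to be carried out with the HDFS tooling instead. Stop the streaming job before copying.

```python
# Sketch: copy metadata log files that exist under the old sink path but are
# missing under the new one. Local filesystem only (assumption); the helper
# name is made up for illustration.
import shutil
from pathlib import Path


def copy_missing_metadata(old_out, new_out):
    """Copy old_out/_spark_metadata/* into new_out/_spark_metadata.

    Files already present in the new folder are left untouched.
    Returns the sorted names of the files that were copied.
    """
    src = Path(old_out) / "_spark_metadata"
    dst = Path(new_out) / "_spark_metadata"
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in src.iterdir():
        target = dst / f.name
        # Never overwrite metadata the restarted job has already written.
        if f.is_file() and not target.exists():
            shutil.copy2(f, target)
            copied.append(f.name)
    return sorted(copied)
```

Skipping files that already exist in the new folder is a deliberate choice here, so that the metadata written after the path change stays untouched.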
Background
To understand why this IllegalStateException is thrown, we need to understand what happens behind the scenes in the provided file output path. Let outPathBefore be the name of this path. While your streaming job is running and processing data, it creates a folder outPathBefore/_spark_metadata. In that folder you will find one file per micro-batch, named after the micro-batch identifier, containing the list of files (partitioned files) the data has been written to, e.g.:
/home/mike/outPathBefore/_spark_metadata$ ls
0 1 2 3 4 5 6 7
In this case we have details for 8 micro batches. The content of one of the files looks like
/home/mike/outPathBefore/_spark_metadata$ cat 0
v1
{"path":"file:///tmp/file/before/part-00000-99bdc705-70a2-410f-92ff-7ca9c369c58b-c000.csv","size":2287,"isDir":false,"modificationTime":1616075186000,"blockReplication":1,"blockSize":33554432,"action":"add"}
By default, on every tenth micro-batch these files are compacted, meaning the contents of the files 0, 1, 2, ..., 9 are stored in a compacted file called 9.compact.
This procedure continues for the subsequent ten batches, i.e. in micro-batch 19 the job aggregates the last 10 files, which are 9.compact, 10, 11, 12, ..., 19.
Now, imagine you had the streaming job running until micro batch 15 which means the job has created the following files:
/home/mike/outPathBefore/_spark_metadata/0
/home/mike/outPathBefore/_spark_metadata/1
...
/home/mike/outPathBefore/_spark_metadata/8
/home/mike/outPathBefore/_spark_metadata/9.compact
/home/mike/outPathBefore/_spark_metadata/10
...
/home/mike/outPathBefore/_spark_metadata/15
After the fifteenth micro batch you stopped the streaming job and changed the output path of the File Sink to, say, outPathAfter. As you keep the same checkpointLocation the streaming job will continue with micro-batch 16. However, it now creates the metadata files in the new out path:
/home/mike/outPathAfter/_spark_metadata/16
/home/mike/outPathAfter/_spark_metadata/17
...
Now, and this is where the exception is thrown: when reaching micro-batch 19, the job tries to compact the ten latest files from the Spark metadata folder. However, it can only find the files 16, 17 and 18; it does not find 9.compact, 10, etc. Hence the error message says:
java.lang.IllegalStateException: history/1523305060336/_spark_metadata/9.compact doesn't exist when compacting batch 19 (compactInterval: 10)
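The scenario above can be replayed with a small model. This is a simplified sketch that follows the description in this answer (compact interval of 10), not Spark's actual implementation; all function names are made up for illustration.

```python
# Simplified model of the file sink's metadata compaction, following the
# description above. Not Spark's actual code; names are hypothetical.

def is_compaction_batch(batch_id, interval=10):
    """Batches 9, 19, 29, ... trigger a compaction."""
    return (batch_id + 1) % interval == 0


def files_needed(batch_id, interval=10):
    """Metadata files that must already exist to compact at batch_id."""
    if not is_compaction_batch(batch_id, interval):
        return []
    start = batch_id - interval  # previous compaction batch, or -1 for the first
    needed = [] if start < 0 else ["%d.compact" % start]
    needed += [str(b) for b in range(start + 1, batch_id)]
    return needed


def missing_files(existing, batch_id, interval=10):
    """Names from files_needed() that are absent from `existing`."""
    present = set(existing)
    return [name for name in files_needed(batch_id, interval) if name not in present]
```

With only 16, 17 and 18 present in the new output path, missing_files(["16", "17", "18"], 19) returns 9.compact and 10 through 15, which are exactly the files that still sit under the old output path.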
Documentation
The Structured Streaming Programming Guide explains under Recovery Semantics after Changes in a Streaming Query:
"Changes to output directory of a file sink are not allowed: sdf.writeStream.format("parquet").option("path", "/somePath") to sdf.writeStream.format("parquet").option("path", "/anotherPath")"
Databricks has also written some details in the article Streaming with File Sink: Problems with recovery if you change checkpoint or output directories
The error is caused by checkpointLocation, because checkpointLocation stores information about old or deleted data. You just need to delete the folder containing checkpointLocation.
For more details, see: https://kb.databricks.com/streaming/file-sink-streaming.html
Example:
df.writeStream
  .format("parquet")
  .outputMode("append")
  .option("checkpointLocation", "D:/path/dir/checkpointLocation")
  .option("path", "D:/path/dir/output")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()
  .awaitTermination()
You need to delete the checkpointLocation directory.
This article introduces the mechanism and gives a good way to recover from a deleted _spark_metadata folder in Spark Structured Streaming:
https://dev.to/kevinwallimann/how-to-recover-from-a-deleted-sparkmetadata-folder-546j
"Create dummy log files:
If the metadata log files are irrecoverable, we could create dummy log files for the missing micro-batches.
In our example, this could be done like this:
for i in {0..1}; do echo v1 > "/tmp/destination/_spark_metadata/$i"; done
This will create the files
/tmp/destination/_spark_metadata/0
/tmp/destination/_spark_metadata/1
Now, the query can be restarted and should finish without errors."
As my previous output folder was no longer recoverable, I tried this dummy-file approach, which can get rid of the IllegalStateException: _spark_metadata/... doesn't exist exception.

io.confluent.ksql.exception.KafkaTopicExistsException: when launching ksql-server-start ksql-server.properties

I've been working with KSQL for quite some time. The Kafka cluster consists of 3 nodes. I've been using UDFs as well, and all looked good until I stopped the servers and started them again.
On server start I'm seeing the following in the logs:
[2019-04-03 11:29:54,381] ERROR Exception encountered running command: A Kafka topic with the name 'czxcorp-structured-data-enriched' already exists, with different partition/replica configuration than required. KSQL expects 4 partitions (topic has 9), and 1 replication factor (topic has 1).. Retrying in 5000 ms (io.confluent.ksql.util.RetryUtil:80)
[2019-04-03 11:29:54,381] ERROR Stack trace: io.confluent.ksql.exception.KafkaTopicExistsException: A Kafka topic with the name 'czxcorp-structured-data-enriched' already exists, with different partition/replica configuration than required. KSQL expects 4 partitions (topic has 9), and 1 replication factor (topic has 1).
at io.confluent.ksql.services.TopicValidationUtil.validateTopicProperties(TopicValidationUtil.java:51)
at io.confluent.ksql.services.TopicValidationUtil.validateTopicProperties(TopicValidationUtil.java:35)
at io.confluent.ksql.services.KafkaTopicClientImpl.validateTopicProperties(KafkaTopicClientImpl.java:292)
at io.confluent.ksql.services.KafkaTopicClientImpl.createTopic(KafkaTopicClientImpl.java:76)
at io.confluent.ksql.planner.plan.KsqlStructuredDataOutputNode.createSinkTopic(KsqlStructuredDataOutputNode.java:244)
at io.confluent.ksql.planner.plan.KsqlStructuredDataOutputNode.buildStream(KsqlStructuredDataOutputNode.java:146)
at io.confluent.ksql.physical.PhysicalPlanBuilder.buildPhysicalPlan(PhysicalPlanBuilder.java:106)
at io.confluent.ksql.QueryEngine.buildPhysicalPlan(QueryEngine.java:113)
at io.confluent.ksql.KsqlEngine$EngineExecutor.execute(KsqlEngine.java:625)
at io.confluent.ksql.KsqlEngine$EngineExecutor.access$800(KsqlEngine.java:577)
at io.confluent.ksql.KsqlEngine.execute(KsqlEngine.java:247)
at io.confluent.ksql.rest.server.computation.StatementExecutor.startQuery(StatementExecutor.java:277)
at io.confluent.ksql.rest.server.computation.StatementExecutor.executeStatement(StatementExecutor.java:191)
at io.confluent.ksql.rest.server.computation.StatementExecutor.handleStatementWithTerminatedQueries(StatementExecutor.java:167)
at io.confluent.ksql.rest.server.computation.StatementExecutor.handleRestore(StatementExecutor.java:101)
at io.confluent.ksql.rest.server.computation.CommandRunner.lambda$null$0(CommandRunner.java:139)
at io.confluent.ksql.util.RetryUtil.retryWithBackoff(RetryUtil.java:63)
at io.confluent.ksql.util.RetryUtil.retryWithBackoff(RetryUtil.java:36)
at io.confluent.ksql.rest.server.computation.CommandRunner.lambda$processPriorCommands$1(CommandRunner.java:135)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at io.confluent.ksql.rest.server.computation.CommandRunner.processPriorCommands(CommandRunner.java:134)
at io.confluent.ksql.rest.server.KsqlRestApplication.buildApplication(KsqlRestApplication.java:414)
at io.confluent.ksql.rest.server.KsqlServerMain.createExecutable(KsqlServerMain.java:80)
at io.confluent.ksql.rest.server.KsqlServerMain.main(KsqlServerMain.java:42)
(io.confluent.ksql.util.RetryUtil:84)
Though I've stopped/terminated all the queries, the log prints all the commands I've executed for my testing from the beginning till date, including create, select and drop. I pulled the UDF .jar out of the /ext folder and the server started, though the log reports that the UDF function I'm using is not available.
This is my ksql-server.properties:
bootstrap.servers=hostname:9092
service.id=cyan_ksql
commit.interval.ms=5000
cache.max.bytes.buffering=20000000
num.stream.threads=10
fail.on.deserialization.error=false
listeners=http://localhost:8088
ksql.extension.dir=/opt/ksql-master/ext/
I'm going nuts with this error. I'm deleting the topic and somehow it's recreated. Someone please help.
Check out the error:
A Kafka topic with the name 'czxcorp-structured-data-enriched' already exists, with different partition/replica configuration than required.
KSQL expects 4 partitions (topic has 9), and 1 replication factor (topic has 1)
If you've deleted the topic then either
it didn't actually get deleted
it got deleted and something else recreated it with nine partitions, and your erroring KSQL query has not specified an override (WITH (PARTITIONS=9)) of the default four
another KSQL command is creating it ahead of the one that errors out, and your erroring KSQL query has not specified an override (WITH (PARTITIONS=9)) of the default four
If you want to blow away your state and start from scratch, simply change your ksql.service.id, which will cause KSQL to use a new command topic (the command topic is what gets replayed when you restart the process).

Spark job using HBase fails

Any Spark job I run that involves HBase access results in the errors below. My own jobs are in Scala, but the supplied Python examples fail the same way. The cluster is Cloudera, running CDH 5.4.4. The same jobs run fine on a different cluster with CDH 5.3.1.
Any help is greatly appreciated!
...
15/08/15 21:46:30 WARN TableInputFormatBase: initializeTable called multiple times. Overwriting connection and table reference; TableInputFormatBase will not close these old references when done.
...
15/08/15 21:46:32 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, some.server.name): java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:163)
...
Caused by: java.lang.IllegalStateException: The input format instance has not been properly initialized. Ensure you call initializeTable either in your constructor or initialize method
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:389)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:158)
... 14 more
Run spark-shell with these parameters:
--driver-class-path .../cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar --driver-java-options "-Dspark.executor.extraClassPath=.../cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar"
Why it works is described here.