Structured Streaming - Could not use FileContext API for managing metadata log files on AWS S3 - scala

I have a StreamingQuery in Spark (v2.2.0), i.e.,
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test")
.load()
val query = df
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("parquet")
.option("checkpointLocation", "s3n://bucket/checkpoint/test")
.option("path", "s3n://bucket/test")
.start()
When I run the query, the data does get saved to AWS S3 and checkpoints are created at s3n://bucket/checkpoint/test. But I am also receiving the following WARNING in the logs:
WARN [o.a.s.s.e.streaming.OffsetSeqLog] Could not use FileContext API for managing metadata log files at path s3n://bucket/checkpoint/test/offsets. Using FileSystem API instead for managing log files. The log may be inconsistent under failures.
I am not able to understand why this WARNING appears. Also, will my checkpoints be inconsistent in case of a failure?
Can anyone help me resolve it?

Looking at the source code, this warning comes from the HDFSMetadataLog class. A comment in the code states:
Note: [[HDFSMetadataLog]] doesn't support S3-like file systems as they don't guarantee listing files in a directory always shows the latest files.
So the problem stems from using AWS S3, which forces the fallback to the FileSystemManager API. Checking the comment for that class, we see:
Implementation of FileManager using older FileSystem API. Note that this implementation cannot provide atomic renaming of paths, hence can lead to consistency issues. This should be used only as a backup option, when FileContextManager cannot be used.
Hence, issues can come up when multiple writers try to perform rename operations concurrently. There is a related ticket here; however, it has been closed since the issue can't be fixed in Spark.
Some things to consider if you need to checkpoint on S3 (a sketch of both options follows below):
To avoid the warning and potential trouble, checkpoint to HDFS and then copy over the results.
Checkpoint to S3, but leave a long gap between checkpoints.
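A minimal sketch of both options, based on the query from the question; the HDFS namenode address and the 10-minute interval are placeholders, not values from the question:

import org.apache.spark.sql.streaming.Trigger

val query = df
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("parquet")
  // Option 1: keep the checkpoint on HDFS, where atomic rename is available,
  // and copy results over to S3 afterwards
  .option("checkpointLocation", "hdfs://namenode:8020/checkpoint/test")
  // Option 2: keep the checkpoint on S3 but trigger less often, so checkpoint
  // files are written and renamed less frequently:
  // .option("checkpointLocation", "s3n://bucket/checkpoint/test")
  // .trigger(Trigger.ProcessingTime("10 minutes"))
  .option("path", "s3n://bucket/test")
  .start()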

Nobody should be using s3n as the connector. It is obsolete and has been removed from Hadoop 3. If you have the Hadoop 2.7.x JARs on the classpath, use s3a.
The issue with rename() is not just consistency: the bigger the file, the longer it takes.
Really, checkpointing to object stores needs to be done differently. If you look closely, there is no real rename() on object stores, yet so much existing code expects it to be an O(1) atomic operation.
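A minimal sketch of switching the question's query to s3a, assuming the hadoop-aws JAR (and the matching AWS SDK) is on the classpath; the bucket name and credential values are placeholders, and credentials can equally come from environment variables or an instance profile:

// s3a picks up credentials from fs.s3a.* settings, environment variables,
// or instance profiles; setting them explicitly is just one option
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

val query = df
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "s3a://bucket/checkpoint/test")
  .option("path", "s3a://bucket/test")
  .start()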

Related

Is it possible to work with Spark Structured Streaming without HDFS?

I have been working with HDFS and Kafka for some time, and I have noticed that Kafka is more reliable than HDFS.
So, now working with Spark Structured Streaming, I'm surprised that checkpointing only works with HDFS.
Checkpointing to Kafka would be faster and more reliable.
So is it possible to work with Spark Structured Streaming without HDFS?
It seems strange that we have to use HDFS just for streaming data that is in Kafka.
Or is it possible to tell Spark to forget about checkpointing and manage it in the program instead?
Spark 2.4.7
Thank you
You are not restricted to using an HDFS path as the checkpoint location.
According to the section Recovering from Failures with Checkpointing in the Spark Structured Streaming Guide, the path has to be "an HDFS compatible file system". Therefore, other file systems will also work. However, it is mandatory that all executors have access to that file system. For example, choosing the local file system of the edge node in your cluster might work in local mode, but in cluster mode this can cause issues.
Also, it is not possible to have Kafka itself handle the offset position with Spark Structured Streaming. I have explained this in more depth in my answer on How to manually set group.id and commit kafka offsets in spark structured streaming?.
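As an illustration (all paths below are placeholders), any location that every executor can reach through an HDFS-compatible implementation can serve as the checkpoint location:

df.writeStream
  .format("parquet")
  // Fine: a shared, HDFS-compatible file system reachable from all executors
  .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/my-query")
  // Also fine, e.g. an object store with a Hadoop connector:
  // .option("checkpointLocation", "s3a://bucket/checkpoints/my-query")
  // Risky: a local path is only visible on one machine, so it is generally
  // only safe in local mode:
  // .option("checkpointLocation", "file:///tmp/checkpoints/my-query")
  .option("path", "hdfs://namenode:8020/output/my-query")
  .start()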

Azure Event Hubs Streaming: Does Checkpointing override setStartingPosition?

If we specify the starting position in EventHub conf like so:
EventHubsConf(ConnectionStringBuilder(eventHubConnectionString).build)
.setStartingPosition(EventPosition.fromStartOfStream)
or
.setStartingPosition(EventPosition.fromEndOfStream)
And also specify the checkpoint location in the StreamWriter:
streamingInputDF
.writeStream
.option("checkpointLocation", checkpointLocation)
...
After a restart, does the setStartingPosition become irrelevant because the checkpoint is always used as the point from where to begin reading?
Thanks.
The information on offsets stored in the checkpoint files will be used when restarting the streaming query.
Interestingly, this is not specifically mentioned in the Structured Streaming Event Hubs integration guide; however, it is in the DStreams guide:
"The connector fully integrates with the Structured Streaming checkpointing mechanism. You can recover the progress and state of your query on failures by setting a checkpoint location in your query. This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query."
Make sure to follow the general guidance on checkpoint recovery.
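Putting the two snippets from the question together, a minimal sketch of the resulting behaviour (the connection string, checkpoint location, and output path are placeholders, and the EventHubsConf API shown is the one from the azure-eventhubs-spark connector already used in the question):

import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf, EventPosition}

val ehConf = EventHubsConf(ConnectionStringBuilder(eventHubConnectionString).build)
  // Only consulted when the query starts without an existing checkpoint
  .setStartingPosition(EventPosition.fromEndOfStream)

val streamingInputDF = spark
  .readStream
  .format("eventhubs")
  .options(ehConf.toMap)
  .load()

streamingInputDF
  .writeStream
  .format("parquet")
  // On restart, the offsets stored here take precedence over setStartingPosition
  .option("checkpointLocation", checkpointLocation)
  .option("path", outputPath) // placeholder output path
  .start()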

Kafka option 'group.id' is not supported in spark structured streaming [duplicate]

I want to use Spark Structured Streaming to read from a secure Kafka. This means that I will need to force a specific group.id. However, as stated in the documentation, this is not possible.
Still, the Databricks documentation https://docs.azuredatabricks.net/spark/latest/structured-streaming/kafka.html#using-ssl says that it is possible. Does this only refer to Azure clusters?
Also, by looking at the documentation on the master branch of the apache/spark repo https://github.com/apache/spark/blob/master/docs/structured-streaming-kafka-integration.md, we can understand that such functionality is intended to be added in a later Spark release. Do you know of any plans for such a stable release that is going to allow setting that consumer group.id?
If not, are there any workarounds for Spark 2.4.0 to be able to set a specific consumer group.id?
Currently (v2.4.0) it is not possible.
You can check the following lines in the Apache Spark project:
https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L81 - generates the group.id
https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L534 - sets it in the properties that are used to create the KafkaConsumer
In the master branch you can find the modification that enables setting a prefix or a particular group.id:
https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L83 - generates the group.id based on a group prefix (groupIdPrefix)
https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L543 - sets the previously generated groupId if kafka.group.id wasn't passed in the properties
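Once that change shipped (in Spark 3.0.0), the prefix can be supplied through the groupIdPrefix source option; a minimal sketch (servers and topic are placeholders):

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  // Generated consumer group ids will start with this prefix;
  // ignored if kafka.group.id is set
  .option("groupIdPrefix", "my-app")
  .load()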
Since Spark 3.0.0
According to the Structured Kafka Integration Guide you can provide the ConsumerGroup as an option kafka.group.id:
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.option("kafka.group.id", "myConsumerGroup")
.load()
However, Spark will not commit any offsets back, so the offsets of your ConsumerGroup will not be stored in Kafka's internal topic __consumer_offsets but rather in Spark's checkpoint files.
Being able to set the group.id is meant to deal with Kafka's newer feature Authorization using Role-Based Access Control, for which your ConsumerGroup usually needs to follow naming conventions.
A full example of a Spark 3.x application setting kafka.group.id is discussed and solved here.
Now with Spark 3.0, you can specify the group.id for Kafka: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#kafka-specific-configurations
The Structured Streaming guide seems to be quite explicit about it:
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:
group.id: Kafka source will create a unique group id for each query automatically.
auto.offset.reset: Set the source option startingOffsets to specify where to start instead.
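So instead of auto.offset.reset, the starting point is controlled through the source option, for example (servers and topic are placeholders):

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1")
  .option("subscribe", "topic1")
  .option("startingOffsets", "earliest") // or "latest", or a per-partition JSON spec
  .load()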

Spark Structured Streaming unable to write parquet data to HDFS

I'm trying to write data to HDFS from Spark Structured Streaming code in Scala,
but I'm unable to do so due to an error that I fail to understand.
In my use case, I'm reading data from a Kafka topic which I want to write to HDFS in Parquet format. Everything in my script works well, no bugs so far.
For doing that I'm using a development Hadoop cluster with 1 namenode and 3 datanodes.
Whatever Hadoop configuration I try, I get the same error (2 datanodes, a single-node setup, and so on...).
Here is the error:
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /test/metadata could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
Here is the code I'm using to write the data:
val query = lbcAutomationCastDf
.writeStream
.outputMode("append")
.format("parquet")
.queryName("lbcautomation")
.partitionBy("date_year", "date_month", "date_day")
.option("checkpointLocation", "hdfs://NAMENODE_IP:8020/test/")
.option("path", "hdfs://NAMENODE_I:8020/test/")
.start()
.awaitTermination()
The Spark Scala code works correctly, because I can write the data to the server's local disk without any error.
I already tried reformatting the Hadoop cluster; it does not change anything.
Have you ever dealt with this case?
UPDATE:
Manually pushing a file to HDFS on the cluster works without issues.

Fault tolerance in Spark streaming

I am trying to explain a fault-tolerance scenario here. Say I have files 1 to 10 in HDFS and Spark Streaming has read these files. Now my Spark Streaming job has unfortunately stopped. Later there are files in HDFS, say 1 to 20, where files 1 to 10 were already parsed by Spark Streaming and 11 to 20 were added newly. Now when I start Spark Streaming again, I can see files 1 to 30. Since I started Spark at the time of the 21st file in HDFS, my Spark Streaming will lose files 11 to 20. How do I get the lost files?
I use fileStream.
The behaviour of fileStream in Spark Streaming is to monitor a folder and pick up new files there. So it will only pick up files that are new after the process has started. In order to process files 11 to 20, you might have to rename them after the process has started.
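For reference, a minimal sketch of directory monitoring with the DStream API (the path and batch interval are placeholders); only files that appear in the directory after the context starts are picked up:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("FileMonitor")
val ssc = new StreamingContext(conf, Seconds(30))

// Only files created in this directory after ssc.start() are processed
val lines = ssc.textFileStream("hdfs://namenode:8020/data/incoming")
lines.foreachRDD(rdd => println(s"Batch record count: ${rdd.count()}"))

ssc.start()
ssc.awaitTermination()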
A better way to handle this scenario is to use messaging queues like Kafka, where you can continue processing streams from any point you like:
https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/
Spark Streaming also provides an option for checkpointing.
If it is enabled, the process saves a checkpoint before the start of every batch (in the specified folder). Then, if the Spark Streaming process crashes for some reason, it can be restarted from the last checkpoint.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def createContext(folderName):
    sc = SparkContext(appName='SparkApplication')
    ssc = StreamingContext(sc, 2)  # 2 second batch interval
    ## Your stream configuration here
    ssc.checkpoint(folderName)
    return ssc

ssc = StreamingContext.getOrCreate('/path/to/checkpoint/directory',
                                   lambda: createContext('/path/to/dir'))
ssc.start()
ssc.awaitTermination()