Fault tolerance in Spark streaming - scala

I am trying to understand fault tolerance here. Say I have files 1 to 10 in HDFS and Spark Streaming has read them. Then my Spark Streaming application unfortunately stops. While it is down, files 11 to 20 are added, so HDFS now holds files 1 to 20, of which only 1 to 10 have been parsed. By the time I restart Spark Streaming, HDFS contains files 1 to 30. Since I started Spark around the time of the 21st file, my Spark Streaming job will lose files 11 to 20. How do I get the lost files?
I use fileStream.

The behaviour of fileStream in Spark Streaming is to monitor a folder and pick up new files there, so it will only pick up files that appear after the process has started. In order to process files 11 to 20, you might have to rename them after the process has started.
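For reference, here is a minimal sketch of a fileStream-based job (paths are placeholders, not from the question). Passing newFilesOnly = false makes Spark also consider files already present in the directory, but only those modified within the remember window (spark.streaming.fileStream.minRememberDuration, 60 seconds by default), so it does not reliably recover files missed during a long outage:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("FileStreamExample")
val ssc = new StreamingContext(conf, Seconds(10))

// newFilesOnly = false also picks up pre-existing files, but only recently modified ones
val files = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "hdfs:///path/to/dir",
  (path: Path) => true,
  newFilesOnly = false
).map(_._2.toString)

files.print()
ssc.start()
ssc.awaitTermination()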
A better way to handle this scenario is to use a messaging queue like Kafka, where you can resume processing the stream from any point you like:
https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/
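As an illustration, here is a minimal sketch of the approach described in that post, assuming the spark-streaming-kafka-0-10 integration (broker address, topic name, group id and offsets are placeholders): the job resumes from offsets you persisted yourself rather than from whatever files happen to be in a folder.

import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val ssc = new StreamingContext(new SparkConf().setAppName("KafkaResumeExample"), Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-group",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// Offsets you stored yourself (e.g. in ZooKeeper, HBase or a database), per partition
val fromOffsets = Map(new TopicPartition("events", 0) -> 42L)

// Resume exactly where the previous run left off
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
)

stream.map(_.value).print()
ssc.start()
ssc.awaitTermination()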

Spark Streaming also provides an option for checkpointing.
If it is enabled, the process saves a checkpoint before the start of every batch (in the specified folder). Then, if the Spark Streaming process crashes for some reason, it can be restarted from the last checkpoint.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def createContext(folderName):
    sc = SparkContext(appName='SparkApplication')
    ssc = StreamingContext(sc, 2)  # 2 second batch interval
    ## Your stream configuration here
    ssc.checkpoint(folderName)
    return ssc

checkpointDir = '/path/to/checkpoint/directory'
ssc = StreamingContext.getOrCreate(checkpointDir,
                                   lambda: createContext(checkpointDir))
ssc.start()
ssc.awaitTermination()
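Since the question is about Scala, a rough Scala equivalent of the same getOrCreate pattern (paths are placeholders) would be:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(checkpointDir: String): StreamingContext = {
  val conf = new SparkConf().setAppName("SparkApplication")
  val ssc = new StreamingContext(conf, Seconds(2)) // 2 second batch interval
  // Your stream configuration here
  ssc.checkpoint(checkpointDir)
  ssc
}

val checkpointDir = "/path/to/checkpoint/directory"
// Recovers from the checkpoint if one exists, otherwise builds a fresh context
val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext(checkpointDir))
ssc.start()
ssc.awaitTermination()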

Related

Is it possible to work with Spark Structured Streaming without HDFS?

I have been working with HDFS and Kafka for some time, and I have noticed that Kafka is more reliable than HDFS.
Now working with Spark Structured Streaming, I'm surprised that checkpointing only works with HDFS.
Checkpointing to Kafka would be faster and more reliable.
So is it possible to work with Spark Structured Streaming without HDFS?
It seems strange that we have to use HDFS just to stream data that is in Kafka.
Or is it possible to tell Spark to forget about checkpointing and manage it in the program instead?
Spark 2.4.7
Thank you
You are not restricted to using an HDFS path as a checkpoint location.
According to the section Recovering from Failures with Checkpointing in the Spark Structured Streaming Guide, the path has to be on "an HDFS compatible file system", so other file systems will work as well. However, it is mandatory that all executors have access to that file system. For example, choosing the local file system of the edge node in your cluster might work in local mode, but in cluster mode this can cause issues.
Also, it is not possible to have Kafka itself handle the offset position with Spark Structured Streaming. I have explained this in more depth in my answer on How to manually set group.id and commit kafka offsets in spark structured streaming?.
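As an illustration, here is a minimal sketch (the bucket name and paths are placeholders, and the rate source is only there to have something to stream) with the checkpoint location on an HDFS-compatible object store rather than on HDFS itself:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CheckpointLocationExample").getOrCreate()

// Placeholder source, just to have a streaming DataFrame
val df = spark.readStream.format("rate").load()

// Any HDFS-compatible file system that every executor can reach works,
// e.g. s3a://, abfss:// or gs:// paths
val query = df.writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/output/")
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/rate-query/")
  .start()

query.awaitTermination()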

Azure Event Hubs Streaming: Does Checkpointing override setStartingPosition?

If we specify the starting position in EventHub conf like so:
EventHubsConf(ConnectionStringBuilder(eventHubConnectionString).build)
  .setStartingPosition(EventPosition.fromStartOfStream)
or
  .setStartingPosition(EventPosition.fromEndOfStream)
And also specify the checkpoint location on the DataStreamWriter:
streamingInputDF
  .writeStream
  .option("checkpointLocation", checkpointLocation)
  ...
After a restart, does the setStartingPosition become irrelevant because the checkpoint is always used as the point from where to begin reading?
Thanks.
The information on offsets stored in the checkpoint files will be used when restarting the streaming query.
Interestingly, this is not specifically mentioned in the Structured Streaming Event Hubs integration guide, but it is in the DStreams guide:
"The connector fully integrates with the Structured Streaming checkpointing mechanism. You can recover the progress and state of your query on failures by setting a checkpoint location in your query. This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query."
Make sure to follow the general guidance on checkpoint recovery.
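To make that concrete, here is a minimal sketch assuming the azure-eventhubs-spark connector (connection string, hub name and paths are placeholders): the starting position only matters on the very first run, while on a restart the offsets stored in the checkpoint take precedence.

import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf, EventPosition}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("EventHubsCheckpointExample").getOrCreate()

val connectionString = ConnectionStringBuilder("Endpoint=sb://...;EntityPath=my-hub").build

// Only consulted when no checkpoint exists yet (i.e. on the first run)
val ehConf = EventHubsConf(connectionString)
  .setStartingPosition(EventPosition.fromEndOfStream)

val events = spark.readStream
  .format("eventhubs")
  .options(ehConf.toMap)
  .load()

// On restart, the offsets stored under checkpointLocation win over setStartingPosition
events.writeStream
  .format("console")
  .option("checkpointLocation", "/path/to/checkpoint")
  .start()
  .awaitTermination()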

Spark Structured Streaming unable to write parquet data to HDFS

I'm trying to write data to HDFS from Spark Structured Streaming code in Scala,
but I'm unable to do so due to an error that I fail to understand.
In my use case, I'm reading data from a Kafka topic, which I want to write to HDFS in Parquet format. Everything else in my script works well, no bugs so far.
For this I'm using a development Hadoop cluster with 1 namenode and 3 datanodes.
Whatever Hadoop configuration I try, I get the same error (2 datanodes, a single-node setup, and so on).
Here is the error:
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /test/metadata could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
Here is the code I'm using to write the data:
val query = lbcAutomationCastDf
  .writeStream
  .outputMode("append")
  .format("parquet")
  .queryName("lbcautomation")
  .partitionBy("date_year", "date_month", "date_day")
  .option("checkpointLocation", "hdfs://NAMENODE_IP:8020/test/")
  .option("path", "hdfs://NAMENODE_I:8020/test/")
  .start()
  .awaitTermination()
The Spark Scala code works correctly, because I can write the data to the server's local disk without any error.
I already tried reformatting the Hadoop cluster; it does not change anything.
Have you ever dealt with this case?
UPDATE:
Manually pushing a file to HDFS on the cluster works without issues.

Structured Streaming with Flume

Hi, can anyone tell me how to read a Flume stream using Spark's new Structured Streaming API?
Example:
val lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
As of Spark 2.1, Spark supports only the File, Kafka and Socket sources. The Socket source is meant for debugging and development and shouldn't be used in production. That leaves the File and Kafka sources.
So, the only options you have are:
a) take the data from Flume and dump it into S3 files. Spark can then get the data from those S3 files. The way the File source works is that it watches a folder, and when a new file appears, it loads it as a micro-batch (see the sketch after this list).
b) funnel your events into a Kafka instance.
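For option (a), a minimal Structured Streaming file-source sketch (the bucket, folder and schema fields are made up for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().appName("FlumeToS3FileSource").getOrCreate()

// File sources require an explicit schema; these fields are placeholders
val schema = new StructType()
  .add("event_id", StringType)
  .add("payload", StringType)

// Watches the folder; every new file that appears is loaded as a micro-batch
val events = spark.readStream
  .schema(schema)
  .json("s3a://my-bucket/flume-dumps/")

events.writeStream
  .format("console")
  .start()
  .awaitTermination()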
Alternatively, the older DStream-based API does support Flume directly:
val flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port]) for the push-based approach, and
val flumeStream = FlumeUtils.createPollingStream(streamingContext, [sink machine hostname], [sink port]) for the pull-based approach.

Spark Streaming on a S3 Directory

So I have thousands of events being streamed through Amazon Kinesis into SQS and then dumped into an S3 directory. About every 10 minutes, a new text file is created to dump the data from Kinesis into S3. I would like to set up Spark Streaming so that it streams the new files being dumped into S3. Right now I have
import org.apache.spark.streaming._
val currentFileStream = ssc.textFileStream("s3://bucket/directory/event_name=accepted/")
currentFileStream.print
ssc.start()
However, Spark Streaming is not picking up the new files being dumped into S3. I think it has something to do with the file write requirements:
The files must have the same data format.
The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read.
Why is Spark streaming not picking up the new files? Is it because AWS is creating the files in the directory and not moving them? How can I make sure Spark picks up the files being dumped into S3?
In order to stream from an S3 bucket, you need to provide the path to the S3 bucket, and it will stream all data from all the files in this bucket. Then, whenever a new file is created in this bucket, it will be streamed. If you are appending data to an existing file that was read before, these new updates will not be read.
Here is a small piece of code that works:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming._

val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
val sc = new SparkContext(conf)
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
// the settings above may be deprecated?
hadoopConf.set("fs.s3n.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3n.awsSecretAccessKey", mySecretKey)

val ssc = new StreamingContext(sc, Seconds(60))
val lines = ssc.textFileStream("s3n://path to bucket")
lines.print()

ssc.start()            // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate

Hope it helps.