So I have thousands of events being streamed through Amazon Kinesis into SQS and then dumped into an S3 directory. About every 10 minutes, a new text file is created to dump the data from Kinesis into S3. I would like to set up Spark Streaming so that it streams the new files being dumped into S3. Right now I have:
import org.apache.spark.streaming._
val currentFileStream = ssc.textFileStream("s3://bucket/directory/event_name=accepted/")
currentFileStream.print
ssc.start()
However, Spark Streaming is not picking up the new files being dumped into S3. I think it has something to do with the file write requirements:
The files must have the same data format.
The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read.
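For reference, a minimal sketch of what "atomically moving" a finished file into the data directory looks like with the Hadoop FileSystem API (the file names here are hypothetical):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Write the file outside the watched directory first, then rename it in.
// Note: on S3 a "rename" is implemented as copy + delete, so it is not truly atomic.
val fs  = FileSystem.get(new java.net.URI("s3://bucket"), new Configuration())
val tmp = new Path("s3://bucket/tmp/events_0001.txt")                             // hypothetical staging file
val dst = new Path("s3://bucket/directory/event_name=accepted/events_0001.txt")   // the watched directory
fs.rename(tmp, dst)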
Why is Spark Streaming not picking up the new files? Is it because AWS is creating the files in the directory rather than moving them into it? How can I make sure Spark picks up the files being dumped into S3?
To stream an S3 bucket, you need to provide the path to the S3 bucket. It will stream all data from all the files in that bucket, and whenever a new file is created in the bucket, it will be streamed. If you are appending data to an existing file that was already read, these new updates will not be read.
Here is a small piece of code that works:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
val sc = new SparkContext(conf)

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
// the fs.s3 settings above may be deprecated, depending on your Hadoop version
hadoopConf.set("fs.s3n.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3n.awsSecretAccessKey", mySecretKey)

val ssc = new StreamingContext(sc, Seconds(60))
val lines = ssc.textFileStream("s3n://path to bucket")
lines.print()

ssc.start()            // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
Hope it helps.
Related
I am using IntelliJ to write Spark code, and I want to access files stored in HDFS on a server. How can I access an HDFS file in Scala Spark code so that it can be loaded as a DataFrame?
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CSV_Import_Example")
  .config("spark.hadoop.yarn.resourcemanager.hostname", "XXX")
  .config("spark.hadoop.yarn.resourcemanager.address", "XXX:8032")
  .config("spark.yarn.access.namenodes", "hdfs://XXXX:8020,hdfs://XXXX:8020")
  .config("spark.yarn.stagingDir", "hdfs://XXXX:8020/user/hduser/")
  .getOrCreate()
The entry point into all functionality in Spark is the SparkSession class.
val sourceDF = spark.read.format("csv").option("header", "true").load("hdfs://192.168.1.1:8020/user/cloudera/example_csvfile.csv")
hdfs://192.168.1.1:8020 here points to the HDFS cluster, and port 8020 is the NameNode's RPC port.
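As a quick follow-up sketch (the output path is hypothetical), the loaded DataFrame can then be inspected or written back to HDFS like any other DataFrame:
sourceDF.printSchema()
sourceDF.show(5)

// Write the data back to HDFS as Parquet (hypothetical output path)
sourceDF.write
  .mode("overwrite")
  .parquet("hdfs://192.168.1.1:8020/user/cloudera/example_csv_parquet")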
I am trying to understand fault tolerance here. Say I have files 1 to 10 in HDFS and Spark Streaming has read them. Then my Spark Streaming job stops unexpectedly. While it is down, files keep arriving, so HDFS now has files 1 to 20, where 1 to 10 were already parsed by Spark Streaming and 11 to 20 were added afterwards. By the time I restart Spark Streaming, I can see files 1 to 30; since I started it around the time the 21st file arrived, my Spark Streaming job will lose files 11 to 20. How do I get the lost files?
I use fileStream.
The behaviour of fileStream in Spark Streaming is to monitor a folder and pick up new files there, so it only picks up files that are new after the process has started. In order to process files 11 to 20, you might have to rename them after the process has started.
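A minimal sketch of that suggestion with the Hadoop FileSystem API (the paths are hypothetical, and whether the restarted job actually picks the files up still depends on their modification times relative to the stream's remember window):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
(11 to 20).foreach { i =>
  val src = new Path(s"/data/archive/file-$i.txt")   // where the missed files sit now
  val dst = new Path(s"/data/incoming/file-$i.txt")  // the folder fileStream is watching
  fs.rename(src, dst)
}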
A better way to handle this scenario is to use messaging queues like Kafka, where you can continue processing streams from any point you like:
https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/
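For illustration, a hedged sketch with the spark-streaming-kafka-0-10 direct stream, supplying the starting offsets explicitly (assuming an existing StreamingContext ssc; the brokers, topic and offset values are hypothetical):
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "file-events",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// Start reading partition 0 of topic "events" from offset 42, e.g. an offset
// you saved yourself before the previous run stopped.
val fromOffsets = Map(new TopicPartition("events", 0) -> 42L)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
)
stream.map(_.value()).print()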
Spark Streaming also provides an option for checkpointing.
If it is enabled, the process saves a checkpoint before the start of every batch (in the specified folder). Then, if the Spark Streaming process crashes for some reason, it can be restarted from the last checkpoint.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def createContext(folderName):
    sc = SparkContext(appName='SparkApplication')
    ssc = StreamingContext(sc, 2)  # 2-second batch interval
    ## Your stream configuration here
    ssc.checkpoint(folderName)
    return ssc

ssc = StreamingContext.getOrCreate('/path/to/checkpoint/directory',
                                   lambda: createContext('/path/to/dir'))
ssc.start()
ssc.awaitTermination()
Is there any way to configure the textFileStream source such that it will process any file added to the source directory regardless of the file create time?
To demonstrate the issue, I created a basic Spark Streaming application that uses textFileStream as a source and prints the stream contents to the console. When an existing file created prior to running the application is copied into the source directory, nothing is printed to the console. When a file created after the application starts running is copied to the source directory, the file contents are printed. Below is my code for reference.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("Streaming Test")
  .setMaster("local[*]")
val spark = new SparkContext(conf)
val ssc = new StreamingContext(spark, Seconds(5))
val fileStream = ssc.textFileStream("/stream-source")
val streamContents = fileStream.flatMap(_.split(" "))
streamContents.print()
ssc.start()
ssc.awaitTermination()
This is the documented behavior of the FileInputDStream.
If we would like to consume existing files in that directory, we can use the Spark API to load these files and apply our desired logic to them.
val existingFiles = sparkContext.textFile(path)
or
val existingFilesDS = sparkSession.read.text(path)
Then set up and start the streaming logic.
We could even use the data from the already existing files when processing the new ones.
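Putting that together, a minimal sketch that reuses spark and ssc from the snippet above (the directory and the word-splitting logic are just stand-ins for your own): run one batch pass over the files that already exist, then let textFileStream handle the new ones.
import org.apache.spark.rdd.RDD

def process(words: RDD[String]): Unit = words.foreach(println)  // replace with your real logic

// 1) Existing files: a one-off batch pass over whatever is already in the folder
process(spark.textFile("/stream-source").flatMap(_.split(" ")))

// 2) New files: handled by the stream from here on
ssc.textFileStream("/stream-source").flatMap(_.split(" ")).foreachRDD(process _)

ssc.start()
ssc.awaitTermination()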
I need to append streaming data to HDFS using Flume. Without overwriting the existing log file, I need to append the streaming data to an existing file in HDFS. Could you please provide links to the MR code for this?
Flume does not overwrite existing data in an HDFS directory by default. The HDFS sink saves incoming data to files whose names have a sink timestamp appended, such as Flume.2345234523, so if you run Flume again against the same directory in HDFS it will create another file under the same HDFS path.
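For illustration, a hedged sketch of an HDFS sink configuration (the agent/sink/channel names and the path are hypothetical); the roll settings control when Flume closes one timestamped file and starts the next:
# Agent "a1" with one HDFS sink; names and path are hypothetical
a1.sinks = k1
a1.channels = c1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
a1.sinks.k1.hdfs.filePrefix = Flume
a1.sinks.k1.hdfs.fileType = DataStream
# Roll a new file every 10 minutes; disable size- and count-based rolling
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0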
Can anyone tell me how to read a Flume stream using Spark's new Structured Streaming API?
Example:
val lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
As of Spark 2.1, Structured Streaming supports only the File, Kafka and Socket sources. The Socket source is meant for debugging and development and shouldn't be used in production. This leaves the File and Kafka sources.
So the only options you have are:
a) Take the data from Flume and dump it into S3 files. Spark can then get the data from those S3 files; the way the File source works is that it watches a folder, and when a new file appears, it loads it as a micro-batch.
b) Funnel your events into a Kafka instance and read them with the Kafka source (both options are sketched below).
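A hedged sketch of both options with Structured Streaming (the bucket path, brokers and topic are hypothetical; the Kafka source additionally needs the spark-sql-kafka-0-10 package on the classpath):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("FlumeEvents").getOrCreate()

// Option a) File source: watch the folder that Flume dumps its files into
val fileEvents = spark.readStream
  .format("text")
  .load("s3a://my-bucket/flume-events/")

// Option b) Kafka source: read the events that were funnelled into Kafka
val kafkaEvents = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "flume-events")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")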
If you can fall back to the older DStream-based API instead, Spark's Flume integration (FlumeUtils) offers:
val flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port]) for the push-based approach, and
val flumeStream = FlumeUtils.createPollingStream(streamingContext, [sink machine hostname], [sink port]) for the pull-based approach.