Is there any way to configure the textFileStream source so that it will process any file added to the source directory, regardless of the file's creation time?
To demonstrate the issue, I created a basic Spark Streaming application that uses textFileStream as a source and prints the stream contents to the console. When an existing file created prior to running the application is copied into the source directory, nothing is printed to the console. When a file created after the application starts running is copied to the source directory, the file contents are printed. Below is my code for reference.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("Streaming Test").setMaster("local[*]")
val spark = new SparkContext(conf)
val ssc = new StreamingContext(spark, Seconds(5))

// Monitor the source directory and print each new file's contents
val fileStream = ssc.textFileStream("/stream-source")
val streamContents = fileStream.flatMap(_.split(" "))
streamContents.print()

ssc.start()
ssc.awaitTermination()
This is the documented behavior of FileInputDStream: it only picks up files that appear in the monitored directory after the streaming context has started, so files with an older modification time are ignored.
If we would like to consume existing files in that directory, we can use the Spark API to load these files and apply our desired logic to them.
val existingFiles = sparkContext.textFile(path)
or
val existingFilesDS = sparkSession.read.text(path)
and only then set up and start the streaming logic.
We could even use the data from the already existing files when processing the new ones, as in the sketch below.
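A minimal sketch of this approach, assuming the same /stream-source directory and word-splitting logic as the question (the transform/union step is optional and only needed if old and new data must be combined):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("Streaming Test").setMaster("local[*]")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(5))

// 1) Load and process the files that already exist, using the batch API
val existingLines = sc.textFile("/stream-source")
existingLines.flatMap(_.split(" ")).take(10).foreach(println)

// 2) Only then set up and start the stream; the pre-loaded RDD can be folded
//    into each new batch via transform() if old and new data must be combined
val streamContents = ssc.textFileStream("/stream-source")
  .transform(batch => batch.union(existingLines))
  .flatMap(_.split(" "))
streamContents.print()

ssc.start()
ssc.awaitTermination()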
Related
I am new to working with HDFS. I am trying to read a CSV file that is stored in a Hadoop cluster using Spark. Every time I try to access it I get the following error:
End of File Exception between local host
I have not set up Hadoop locally since I already had access to the Hadoop cluster.
I may be missing some configuration but I don't know which. I would appreciate the help.
I tried to debug it using this:
link
It did not work for me.
This is the code using Spark:
val conf = new SparkConf().setAppName("Read").setMaster("local")
  .set("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
  .set("fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
val sc = new SparkContext(conf)
val data = sc.textFile("hdfs://<some-ip>/abc.csv")
I expect it to read the CSV and convert it into an RDD.
Instead, I am getting this error:
Exception in thread "main" java.io.EOFException: End of File Exception between local host is:
Run your Spark job on the Hadoop cluster. Use the code below:
val spark = SparkSession.builder().master("local[1]").appName("Read").getOrCreate()
val data = spark.sparkContext.textFile("<filePath>")
or you can use spark-shell as well.
If you want to access HDFS from your local machine, follow this: link
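For reference, here is a minimal sketch of reading that CSV from HDFS into a DataFrame. This is an illustration only, assuming the NameNode at the question's <some-ip> placeholder listens on the default RPC port 8020 and the CSV has a header row:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Read")
  .master("local[1]")
  .getOrCreate()

// Full HDFS URI: scheme, NameNode host, NameNode RPC port, then the file path
val df = spark.read
  .option("header", "true")   // assumption: the CSV has a header row
  .csv("hdfs://<some-ip>:8020/abc.csv")
df.show(5)

// The same data as an RDD of lines, as asked in the question
val rdd = spark.sparkContext.textFile("hdfs://<some-ip>:8020/abc.csv")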
I am using IntelliJ to write Spark code, and I want to access files stored in an HDFS file system on a server. How can I access the HDFS file in Scala Spark code so that it can be loaded as a DataFrame?
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CSV_Import_Example")
  .config("spark.hadoop.yarn.resourcemanager.hostname", "XXX")
  .config("spark.hadoop.yarn.resourcemanager.address", "XXX:8032")
  .config("spark.yarn.access.namenodes", "hdfs://XXXX:8020,hdfs://XXXX:8020")
  .config("spark.yarn.stagingDir", "hdfs://XXXX:8020/user/hduser/")
  .getOrCreate()
The entry point into all functionality in Spark is the SparkSession class.
val sourceDF = spark.read.format("csv").option("header", "true").load("hdfs://192.168.1.1:8020/user/cloudera/example_csvfile.csv")
hdfs://192.168.1.1:8020 here points to the HDFS cluster; port 8020 is the NameNode's RPC port.
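Once loaded, the DataFrame can be used like any other. As a small, assumed follow-up (the output path below is hypothetical, not from the original answer):

// Inspect the inferred columns and a few rows of the freshly loaded CSV
sourceDF.printSchema()
sourceDF.show(5)

// Hypothetical example: write the data back to HDFS in Parquet format
sourceDF.write.mode("overwrite").parquet("hdfs://192.168.1.1:8020/user/cloudera/example_parquet")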
I am trying to describe a fault-tolerance scenario here. Say I have files 1 to 10 in HDFS and Spark Streaming has read these files. Now my Spark Streaming job has unfortunately stopped. Later there are files 1 to 20 in HDFS, where files 1 to 10 were already parsed by Spark Streaming and 11 to 20 were newly added. Now I start Spark Streaming again and I can see files 1-30. Since I started Spark at the time of the 21st file in HDFS, my Spark Streaming will lose files 11-20. How do I get the lost files?
I use fileStream.
The behaviour of fileStream in Spark Streaming is to monitor a folder and pick up new files there, so it will only pick up files that are created after the process has started. In order to process files 11-20, you might have to rename them after the process has started, as in the sketch below.
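A minimal sketch of that workaround, assuming the monitored directory is /data/in and the missed files were parked in /data/missed (both hypothetical paths). Copying them back in (rather than renaming in place) creates fresh files with a new modification time, which the running stream will then pick up:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)

// Hypothetical locations: files 11-20 were parked in /data/missed
val missedDir = new Path("/data/missed")
val watchedDir = new Path("/data/in")

// Copy each missed file into the folder that textFileStream monitors;
// the copies get a new modification time, so the stream treats them as new files
fs.listStatus(missedDir).foreach { status =>
  val target = new Path(watchedDir, status.getPath.getName)
  FileUtil.copy(fs, status.getPath, fs, target, false, hadoopConf)
}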
A better way to handle this scenario is to use messaging queues like Kafka, where you can continue processing streams from any point you like:
https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/
Spark Streaming also provides an option for checkpointing.
If it is enabled, the process will save checkpoints before the start of every batch (in a specified folder). Then, if the Spark Streaming process crashes for some reason, it can be restarted from the last checkpoint.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def createContext(folderName):
    sc = SparkContext(appName='SparkApplication')
    ssc = StreamingContext(sc, 2)  # 2 second batch interval
    ## Your stream configuration here
    ssc.checkpoint(folderName)
    return ssc

ssc = StreamingContext.getOrCreate('/path/to/checkpoint/directory',
                                   lambda: createContext('/path/to/dir'))

ssc.start()
ssc.awaitTermination()
I have a Scala 2.10 / Spark 1.5.0 sbt app that I am developing in Eclipse. In my main method, I have:
val sc = new SparkContext("local[2]", "My Spark App")
// followed by operations to do things with the spark context to count up lines in a file
When I run this application within the Eclipse IDE, it works and outputs the result I expect, but when I change the Spark context to connect to my cluster using:
val master = "spark://My-Hostname-From-The-Spark-Master-Page.local:7077"
val conf = new SparkConf().setAppName("My Spark App").setMaster(master)
val sc = new SparkContext(conf)
I get errors like:
WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster#My-Hostname-From-The-Spark-Master-Page.local:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
What gives? How can I get my job to run with the existing master and slave nodes I started up? I know spark-submit is recommended, but aren't applications like Zeppelin and Notebook designed to use Spark without having to "spark-submit"?
So I have thousands of events being streamed through Amazon Kinesis into SQS and then dumped into an S3 directory. About every 10 minutes, a new text file is created to dump the data from Kinesis into S3. I would like to set up Spark Streaming so that it streams the new files being dumped into S3. Right now I have:
import org.apache.spark.streaming._

// assumed setup, not shown here: sc is an existing SparkContext
val ssc = new StreamingContext(sc, Seconds(60))
val currentFileStream = ssc.textFileStream("s3://bucket/directory/event_name=accepted/")
currentFileStream.print
ssc.start()
However, Spark Streaming is not picking up the new files being dumped into S3. I think it has something to do with the file write requirements:
The files must have the same data format.
The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read.
Why is Spark streaming not picking up the new files? Is it because AWS is creating the files in the directory and not moving them? How can I make sure Spark picks up the files being dumped into S3?
In order to stream an S3 bucket, you need to provide the path to the S3 bucket, and it will stream all data from all the files in this bucket. Then, whenever a new file is created in this bucket, it will be streamed. If you are appending data to an existing file which was read before, these new updates will not be read.
Here is a small piece of code that works:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
val sc = new SparkContext(conf)

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
// ones above this may be deprecated?
hadoopConf.set("fs.s3n.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3n.awsSecretAccessKey", mySecretKey)

val ssc = new StreamingContext(sc, Seconds(60))
val lines = ssc.textFileStream("s3n://path to bucket")
lines.print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
Hope it will help.