Hi, can anyone tell me how to read a Flume stream using Spark's new Structured Streaming API?
val lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
As of Spark 2.1, Structured Streaming supports only the File, Kafka and Socket sources. The Socket source is meant for debugging and development and shouldn't be used in production. This leaves the File and Kafka sources.
So, the only options you have are
a) Take data from Flume and dump it into S3 files. Spark can then read the data from S3. The File source watches a folder, and when a new file appears, it loads it as a micro-batch.
b) Funnel your events into a Kafka instance.
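The two options above could be sketched in PySpark roughly as follows. This is only a sketch under assumptions: the broker address, topic name and directory are placeholders passed in by the caller, Flume is assumed to already be writing to Kafka or to the watched directory, and the Kafka case assumes the spark-sql-kafka-0-10 package is available to the application.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Build options for the Structured Streaming Kafka source.

    The broker list and topic are placeholders supplied by the caller.
    """
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",
    }


def read_from_kafka(spark, bootstrap_servers, topic):
    # Option b): Flume writes events to a Kafka topic and Spark reads the topic.
    opts = kafka_source_options(bootstrap_servers, topic)
    return spark.readStream.format("kafka").options(**opts).load()


def read_from_files(spark, directory):
    # Option a): Flume dumps files into `directory`; files that appear later
    # are loaded in subsequent micro-batches. The text format yields a single
    # string column named `value`.
    return spark.readStream.format("text").load(directory)
```

Both functions expect an existing SparkSession, for example read_from_kafka(spark, "localhost:9092", "flume-events"), where the topic name is made up for illustration.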
val flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port]) for the push-based approach, and
val flumeStream = FlumeUtils.createPollingStream(streamingContext, [sink machine hostname], [sink port]) for the pull-based approach.
I have been working with HDFS and Kafka for some time, and I have noticed that Kafka is more reliable than HDFS.
So, now that I'm working with Spark Structured Streaming, I'm surprised that checkpointing is only available with HDFS.
Checkpointing with Kafka would be faster and more reliable.
So is it possible to use Spark Structured Streaming without HDFS?
It seems strange that we have to use HDFS just to stream data from Kafka.
Or is it possible to tell Spark to skip checkpointing and manage it in the program instead?
Spark 2.4.7
Thank you
You are not restricted to using an HDFS path as the checkpoint location.
According to the section Recovering from Failures with Checkpointing in the Spark Structured Streaming Guide, the path has to be "an HDFS compatible file system". Therefore, other file systems will work as well. However, all executors must have access to that file system. For example, choosing the local file system on the edge node of your cluster might work in local mode, but in cluster mode it can cause issues.
Also, it is not possible to have Kafka itself handle the offset position with Spark Structured Streaming. I have explained this in more depth in my answer on How to manually set group.id and commit kafka offsets in spark structured streaming?.
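To illustrate the point about HDFS-compatible file systems, here is a minimal PySpark sketch that sets the checkpoint location on a streaming query. The function names are made up for this example, and any path you pass (for instance an s3a:// URI) is assumed to be reachable from the driver and all executors with the matching file system connector configured.

```python
def checkpointed_sink_options(checkpoint_path, output_path):
    """Writer options for a file sink with an explicit checkpoint location.

    Both paths are placeholders; they only need to live on a file system
    that every executor can reach.
    """
    return {
        "checkpointLocation": checkpoint_path,
        "path": output_path,
    }


def start_query(stream_df, checkpoint_path, output_path):
    # `stream_df` is assumed to be a streaming DataFrame, e.g. one read from Kafka.
    opts = checkpointed_sink_options(checkpoint_path, output_path)
    return stream_df.writeStream.format("parquet").options(**opts).start()
```

Restarting the query with the same checkpoint location lets Spark resume from the offsets recorded there.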
Is the Kafka Spool Directory connector suitable for loading streaming data (logs) into Kafka in production? Can it be run in distributed mode? Is there any other connector that can be used, since the FileStream source connector is not suitable for production?
Does this match your requirements?
provides the capability to watch a directory for files and read the data as new files are written to the input directory.
Do you have CSV or JSON files?
If so, then you can use the Spooldir connector
It can be argued that something like Flume, Logstash, Filebeat, Fluentd, Syslog, GELF, or another log solution is better suited to collecting logs into Kafka.
I need a Kafka sink connector that lets the user persist topic content as .CSV files. I've been investigating for a while.
Confluent provides the FileSink connector which, as far as I understand, doesn't support the CSV format.
I've been playing a bit with this project, but the sink task does not seem to be implemented. Meanwhile, this one implements only the Source part.
Does a project with this capability currently exist?
I have installed Kafka on Linux, created a topic and published a message to it. Kafka saves the data in the folder /tmp/kafka-logs/topicname-0, and I checked that the local file system type is xfs. Is there any way for Kafka to save its data on an HDFS file system instead? If so, please help me with the configuration or steps.
Kafka runs on top of a local filesystem. It cannot be run on HDFS. If you want to move data from Kafka into HDFS, one option is using a connector to push the data to HDFS https://docs.confluent.io/current/connect/connect-hdfs/docs/index.html
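As a rough illustration of the connector approach, the snippet below builds the JSON payload one might POST to the Kafka Connect REST API (the /connectors endpoint) to create an HDFS sink. The connector name, topic, NameNode address and flush size are placeholders, and the property names should be checked against the connector documentation linked above for your version.

```python
import json


def hdfs_sink_config(name, topic, hdfs_url, flush_size=1000):
    """Build a Kafka Connect payload for the Confluent HDFS sink connector.

    All argument values are placeholders chosen by the caller.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
            "tasks.max": "1",
            "topics": topic,
            "hdfs.url": hdfs_url,
            # Number of records written to a file before it is committed.
            "flush.size": str(flush_size),
        },
    }


payload = hdfs_sink_config("hdfs-sink", "topicname", "hdfs://namenode:8020")
print(json.dumps(payload, indent=2))
```

Kafka itself keeps writing its log segments to the local file system; the connector only copies topic data into HDFS.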
I am trying to understand fault tolerance here. Say I have files 1 to 10 in HDFS and Spark Streaming has read them. Then my Spark Streaming job unfortunately stopped. HDFS now holds files 1 to 20, where files 1 to 10 were already parsed by Spark Streaming and files 11 to 20 were newly added. Now I start Spark Streaming again and I can see files 1-30. Since I started Spark at the time of the 21st file in HDFS, my Spark Streaming job will lose files 11-20. How do I get the lost files?
I use fileStream.
The behaviour of fileStream in Spark streaming is to monitor a folder and pick up new files there. So it would only pick up files that are new after the process has started. In order to process files from 11-20, you might have to rename them after the process started.
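If renaming is not practical, one alternative is to keep your own record of which files have been processed and work out the backlog on restart. The sketch below is plain Python rather than a Spark feature, and the ledger file and function names are hypothetical; it only shows the bookkeeping idea.

```python
import json
import os


def load_processed(ledger_path):
    """Return the set of file names recorded as processed (empty if no ledger yet)."""
    if not os.path.exists(ledger_path):
        return set()
    with open(ledger_path) as f:
        return set(json.load(f))


def save_processed(ledger_path, processed):
    """Persist the processed file names so they survive a restart."""
    with open(ledger_path, "w") as f:
        json.dump(sorted(processed), f)


def pending_files(all_files, processed):
    """Files present in the folder that have not been processed yet."""
    return sorted(f for f in all_files if f not in processed)
```

In the scenario from the question, with files 1 to 10 recorded in the ledger and files 1 to 20 present at restart, pending_files would return files 11 to 20, which could then be processed as a one-off batch before the stream resumes.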
A better way to handle this scenario is to use a messaging queue like Kafka, where you can continue processing streams from any point you like.
Spark Streaming also provides an option for checkpointing.
If it is enabled, the process saves a checkpoint before the start of every batch (in the specified folder). Then, if the Spark Streaming process crashes for some reason, it can be restarted from the last checkpoint.
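from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def createContext(folderName):
    sc = SparkContext(appName='SparkApplication')
    ssc = StreamingContext(sc, 2)  # 2 second batch interval
    ## Your stream configuration here
    return ssc

# Recreate the context from the checkpoint if one exists; otherwise build a new one.
ssc = StreamingContext.getOrCreate('/path/to/checkpoint/directory',
                                   lambda: createContext('/path/to/dir'))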