I have a Spark Structured Streaming job which consumes events from the Azure Event Hubs service. In some cases, some batches are not processed by the streaming job. In those cases the following logging statement can be seen in the structured streaming log:
INFO FileStreamSink: Skipping already committed batch 25
The streaming job persists the incoming events into an Azure Data Lake, so I can check which events have actually been processed/persisted. When the above skipping happens, these events are missing!
It is unclear to me why these batches are marked as already committed, because in the end it seems they were not processed!
Do you have an idea what might cause this behaviour?
Thanks!
I was able to solve the issue. The problem was that I had two different streaming jobs which had different checkpoint locations (which is correct) but used the same base folder for their output. However, the output folder also stores metadata (the _spark_metadata directory), so the two streams shared the information about which batches they had already committed. After using a different base output folder for each job, the issue was fixed.
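A minimal sketch of the fixed layout, assuming two Structured Streaming queries writing Parquet (the "rate" source stands in for the real Event Hubs readers, and all paths are placeholders):

```python
# Hedged sketch: each query gets its own checkpoint location AND its own
# output base path, so their _spark_metadata commit logs are not shared.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("two-streams").getOrCreate()

events_a = spark.readStream.format("rate").load()   # placeholder for stream A's source
events_b = spark.readStream.format("rate").load()   # placeholder for stream B's source

query_a = (events_a.writeStream
    .format("parquet")
    .option("path", "/datalake/output/stream-a")             # distinct output folder
    .option("checkpointLocation", "/datalake/chk/stream-a")  # distinct checkpoint
    .start())

query_b = (events_b.writeStream
    .format("parquet")
    .option("path", "/datalake/output/stream-b")             # distinct output folder
    .option("checkpointLocation", "/datalake/chk/stream-b")  # distinct checkpoint
    .start())

spark.streams.awaitAnyTermination()
```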
We had the same issue, and the Kafka broker had already deleted the data. So to force the Spark application to start fresh (from the latest offset in Kafka), we deleted both the checkpoint and _spark_metadata directories. You can find _spark_metadata in the same path where you write the stream.
I have a Spark Structured Streaming app that reads from Kafka and writes to Elasticsearch and S3. I have enabled checkpointing to an S3 bucket as well (the app runs on AWS EMR). I noticed in the S3 bucket that over time the commits become less frequent and there is an ever-growing delay in the data.
So I want Spark to always process batches with the same amount of data in each batch. I tried setting .option("maxOffsetsPerTrigger", 100) but the batch size didn't become smaller; there was still a huge amount of time between commits.
As I understood it, this option just tells Spark how much data to consume from Kafka per poll, and Spark simply polls multiple times and then writes, so there is no limit on the batch size.
I also tried to use continuous mode, but the submit failed, I guess because the output sink / foreachBatch doesn't support it.
Any ideas are welcome, I will try everything ^^
Actually, each offset contained so much data that I had to limit maxOffsetsPerTrigger to 50, and I also had to delete the old checkpoint folder. I read somewhere that Spark first tries to finish the batch recorded in the checkpoint, and only then applies the maxOffsetsPerTrigger limit.
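For reference, a minimal sketch of where the option goes (assumes the spark-sql-kafka connector is available; the broker, topic, paths and the value 50 are placeholders):

```python
# Hedged sketch: maxOffsetsPerTrigger caps how many Kafka offsets are consumed
# per micro-batch, keeping batch sizes roughly constant. If an old checkpoint
# contains a pending batch, that batch is finished first before the cap applies.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-s3").getOrCreate()

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "events")                     # placeholder topic
    .option("startingOffsets", "latest")
    .option("maxOffsetsPerTrigger", 50)                # cap per micro-batch
    .load())

query = (events.writeStream
    .format("parquet")
    .option("path", "s3://bucket/output")                    # placeholder output path
    .option("checkpointLocation", "s3://bucket/checkpoint")  # placeholder checkpoint
    .start())

query.awaitTermination()
```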
I have a Kafka Streams application with multiple joins on KTables. Finally, I am using the Processor API to build a state store to perform some business logic. Earlier I had a purge job which deleted the Kafka logs and Kafka Streams' state dir every morning, so that only today's data, produced by my producers, would be processed. Up to that point everything was working as expected.
But deleting the Kafka log directory is not a good approach, so I decided to make use of cleanup.policy to delete data from Kafka while still deleting the Kafka Streams state dirs. I think this approach is creating a problem in the state stores, where data is still being restored from the changelog topics on app startup.
Is there a way to purge all data from Kafka and all of Kafka Streams' state stores, along with the changelog topics?
Appreciate your help.
I'm thinking of using a Kafka Connector vs. creating my own Kafka consumers/producers to move some data from/to Kafka, and I see the value Kafka Connectors provide in terms of scalability and fault tolerance. However, I haven't been able to find out how exactly connectors behave if the "Task" fails for some reason. Here are a couple of scenarios:
For a sink connector (S3 Sink), if it (the Task) fails (after all retries) to successfully send the data to the destination (for example due to a network issue), what happens to the worker? Does it crash? Is it able to re-consume the same data from Kafka later on?
For a source connector (JDBC Source), if it fails to send to Kafka, does it re-process the same data later on? Does it depend on what the source is?
Does answer to the above questions depend on which connector we are talking about?
In Kafka 2.0, I think, they introduced the concept of graceful error handling, which can skip over bad messages or write them to a DLQ topic.
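For a sink connector, the relevant settings look roughly like this (a hedged sketch; the property names come from the Connect error-handling feature, and the DLQ topic name is a placeholder):

```properties
# Keep going past records that fail conversion/transformation instead of killing the task
errors.tolerance=all
# Route failed records to a dead-letter-queue topic (sink connectors only)
errors.deadletterqueue.topic.name=my-sink-dlq
# Add headers describing why each record failed
errors.deadletterqueue.context.headers.enable=true
# Also log the failures
errors.log.enable=true
```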
1) The S3 sink can fail, and it'll just stop processing data. However, if you fix the problem (for the various edge cases that may arise), the sink itself gives exactly-once delivery to S3. The consumed offsets are stored as regular consumer offsets and will not be committed to Kafka until the file upload completes. Obviously, though, if you don't fix the issue before the topic's retention period expires, you're losing data.
2) Yes, it depends on the source. I don't know the exact semantics of the JDBC connector, but it really depends on which query mode you're using. For example, with the incrementing/timestamp mode, if you run a query every 5 seconds for all rows within a range, I do not believe it'll retry old, missed time windows.
Overall, the failure-recovery scenarios all depend on the systems being connected to. Some errors are recoverable and some are not (for example, if your S3 access keys get revoked, it won't write files until you get a new credential set).
I am completely new to Big Data; for the last few weeks I have been trying to build a log analysis application.
I read many articles and found that Kafka + Spark Streaming is the most reliable configuration.
Now I am able to process data sent from my simple Kafka Java producer to Spark Streaming.
Can someone please suggest a few things, like:
1) How can I read server logs in real time and pass them to the Kafka broker?
2) Are there any frameworks available to push data from logs to Kafka?
3) Any other suggestions?
Thanks,
Chowdary
There are many ways to collect logs and send them to Kafka. If you are looking to send log files as a stream of events, I would recommend reviewing Logstash/Filebeat - just set up your input as a file input and your output to Kafka.
You may also push data to Kafka using the log4j KafkaAppender, or pipe logs to Kafka using one of the many CLI tools already available.
In case you need to guarantee ordering, pay attention to the partition configuration and the partition selection logic. For example, the log4j appender will distribute messages across all partitions. Since Kafka guarantees ordering per partition only, your Spark streaming jobs may start processing events out of sequence; see the sketch below.
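As an illustration of partition selection, here is a minimal sketch using the kafka-python client (the broker address, topic and log path are placeholders): keying every record by its source host makes all lines from that host land in the same partition, so their relative order is preserved.

```python
# Hedged sketch, assuming the kafka-python client: key each log line by the
# source host so that lines from one host always go to the same partition.
import socket
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",              # placeholder broker
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: v.encode("utf-8"),
)

host = socket.gethostname()
with open("/var/log/app/app.log") as logfile:     # placeholder log file
    for line in logfile:
        # Same key -> same partition -> per-host ordering is preserved.
        producer.send("server-logs", key=host, value=line.rstrip("\n"))

producer.flush()
```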
I am using PySpark Streaming to process a very large streaming log, and because the log is huge I don't want Spark to process old logs if the application fails for any reason.
I can delete the checkpoint directory and get what I want, but I was wondering if there is any way to do it programmatically.
I have already tried KafkaUtils.createStream(..., kafkaParams={'auto.offset.reset': 'largest'}) but had no success.
Any suggestion?
You should use
'auto.offset.reset': 'largest'
if you want to skip old messages in the queue for your consumer group when your application starts. Note that the setting only takes effect when there are no committed offsets for that group (or the committed offsets are out of range).
'largest' means "start from the latest offset", i.e. only receive messages produced after the consumer starts.
'smallest' means "start from the earliest available offset", i.e. read everything still retained in the topic.
Also, for future reference, if anyone wants every available message in the topic when the application starts, use a different consumer group name each time and pass 'smallest' as the offset reset.
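A minimal sketch of how this can look with the legacy receiver-based PySpark Kafka API used in the question (the ZooKeeper quorum, topic name and batch interval are placeholders): generating a fresh consumer group on each start means there are no committed offsets, so auto.offset.reset actually takes effect.

```python
# Hedged sketch, assuming the old pyspark.streaming.kafka receiver API
# (Spark 1.x/2.x). 'largest' skips everything already in the topic;
# 'smallest' would replay everything still retained.
import uuid
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="log-stream")
ssc = StreamingContext(sc, 10)                     # 10-second batches (placeholder)

group_id = "log-processor-" + uuid.uuid4().hex     # fresh group -> no committed offsets

stream = KafkaUtils.createStream(
    ssc,
    "zk-host:2181",                                # ZooKeeper quorum (placeholder)
    group_id,
    {"server-logs": 1},                            # topic -> number of receiver threads
    kafkaParams={"auto.offset.reset": "largest"},  # start from the latest offset
)

stream.pprint()
ssc.start()
ssc.awaitTermination()
```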