how is Kinesis LATEST position recorded - streaming

When Kinesis checkpointing is not used, there is an InitialPositionInStream.LATEST setting that supposedly lets you restart Kinesis receiving from the "latest tip". Could someone explain what exactly that is? If the receiver application was shut down and restarted, where would this latest-tip information be stored for this purpose?

You shouldn't care where the previous worker stopped if you are using LATEST. Using this flag means that you want to start processing records from NOW and you don't care about historical events.
If you want to continue processing from where a previous worker stopped, you should use the sequence number of that worker's last checkpoint.
If you want to process all the historical events still available in the Kinesis stream, you should use the TRIM_HORIZON flag.
See more details in the Kinesis documentation for the shard iterator types: http://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetShardIterator.html
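For illustration, here is a minimal sketch of those three iterator types using the low-level GetShardIterator call from the linked docs (AWS SDK for Java v1; the stream name, shard ID, and sequence number below are placeholders, and a KCL-based receiver would normally drive this through InitialPositionInStream instead):

import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.GetShardIteratorRequest;
import com.amazonaws.services.kinesis.model.ShardIteratorType;

public class ShardIteratorSketch {
    public static void main(String[] args) {
        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

        // LATEST: start at the current tip of the shard; nothing about a previous run is consulted.
        String latest = kinesis.getShardIterator(new GetShardIteratorRequest()
                .withStreamName("my-stream")                       // placeholder
                .withShardId("shardId-000000000000")               // placeholder
                .withShardIteratorType(ShardIteratorType.LATEST))
                .getShardIterator();

        // TRIM_HORIZON: start at the oldest record still retained in the shard.
        String trimHorizon = kinesis.getShardIterator(new GetShardIteratorRequest()
                .withStreamName("my-stream")
                .withShardId("shardId-000000000000")
                .withShardIteratorType(ShardIteratorType.TRIM_HORIZON))
                .getShardIterator();

        // AT_SEQUENCE_NUMBER: resume exactly at a previously checkpointed sequence number.
        String resumed = kinesis.getShardIterator(new GetShardIteratorRequest()
                .withStreamName("my-stream")
                .withShardId("shardId-000000000000")
                .withShardIteratorType(ShardIteratorType.AT_SEQUENCE_NUMBER)
                .withStartingSequenceNumber("<sequence-number-from-your-checkpoint>")) // placeholder
                .getShardIterator();

        System.out.println(latest + " " + trimHorizon + " " + resumed);
    }
}

With LATEST the iterator is computed fresh from the shard's current tip at request time, which is why nothing needs to be stored anywhere between restarts when checkpointing is disabled.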

Related

Does rebuilding state stores in Kafka Streams propagate duplicate records to downstream topics?

I'm currently using Kafka Streams for a stateful application. The state is not stored in a Kafka state store though, but rather just in memory for the time being. This means whenever I restart the application, all state is lost and it has to be rebuilt by processing all records from the start.
After doing some research on Kafka state stores, this seems to be exactly the solution I'm looking for to persist state between application restarts (either in memory or on disk). However, I find the resources online lack some pretty important details, so I still have a couple of questions on how this would work exactly:
If the stream is set to start from offset latest, will the state still be (re)calculated from all the previous records?
If previously already processed records need to be reprocessed in order to rebuild the state, will this propagate records through the rest of the Streams topology (e.g. InputTopic -> stateful processor -> OutputTopic, will this result in duplicated records in the OutputTopic because of rebuilding state)?
State stores are backed by their own changelog topics, and Kafka Streams takes responsibility for loading from them. If your state store is uninitialised, your Kafka Streams app will rehydrate its local state store from the changelog topic using EARLIEST, since it has to read every record.
This means the startup sequence for a brand new instance is roughly:
Observe there is no local state-store cache
Load the local state store by consuming from the changelog topic for the state store (the changelog topic is named <application.id>-<state-store-name>-changelog)
Read each record and update a local rocksDB instance accordingly
Do not emit anything, since this is internal restoration work, not your actual topology processing
Read your consumer group's offsets, using EARLIEST or LATEST according to how you configured the topology. Note this is only a concern if your consumer group doesn't have any offsets yet
Process stuff, emitting records according to the topology
Whether you set your actual topology's auto.offset.reset to LATEST or EARLIEST is up to you. In the event offsets are lost, or you create a new group, it's a balance between potentially skipping records (LATEST) versus handling reprocessing of old records and deduplication (EARLIEST).
Long story short: state restoration is different from processing, and it is handled by Kafka Streams itself.
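As a hedged sketch of what a changelog-backed store looks like in code (the application id, store name, and topic names below are placeholders, not taken from the question), declaring the store via Materialized is enough to get the restore behaviour described above:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;

public class StatefulCountsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-stateful-app");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("InputTopic")                         // placeholder topic
               .groupByKey()
               // Persistent RocksDB store plus a changelog topic named
               // <application.id>-counts-store-changelog, replayed automatically on restart.
               .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store"))
               .toStream()
               .to("OutputTopic", Produced.with(Serdes.String(), Serdes.Long())); // placeholder topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}

On restart, Kafka Streams first replays my-stateful-app-counts-store-changelog into the local store without emitting anything to OutputTopic, and only then resumes processing InputTopic from the group's committed offsets.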
If the stream is set to start from offset latest, will the state still be (re)calculated from all the previous records?
If you are re-launching the same application (e.g. after having stopped it before), then state will not be recalculated by reprocessing the original input data. Instead, the state will be restored from its "backup" (every state store or KTable is durably stored in a Kafka topic, the so-called "changelog topic" of that table/state store for such purposes) so that its data is exactly what it was when the application was stopped. This behavior enables you to seamlessly stop+restart your applications without skipping over records that arrived between "stop" and "restart".
But there is a different caveat that you need to be aware of: The configuration to set the offset start point (latest or earliest) is only used when you run your Kafka Streams application for the first time. Afterwards, whenever you stop+restart your application, it will always continue where it previously stopped. That's because, if the app has run at least once, it has stored its consumer offset information in Kafka, which allows it to know from where to automatically resume operations once it is being restarted.
If you need the different behavior of always (re)starting from e.g. the latest offsets (thus potentially skipping records that arrived in between when you stopped the application and when you restarted it), you must reset your Kafka Streams application. One of the steps the reset tool performs is removing the application's consumer offset information from Kafka, which makes the application think that it was never started before, so to speak.
If previously already processed records need to be reprocessed in order to rebuild the state, will this propagate records through the rest of the Streams topology (e.g. InputTopic -> stateful processor -> OutputTopic, will this result in duplicated records in the OutputTopic because of rebuilding state)?
This reprocessing will not happen by default as explained above. State will be automatically reconstructed to its prior state (pun intended) at the point when the application was stopped.
Reprocessing would only happen if you manually reset your application (see above) and e.g. configure the application to re-read historical data (like setting auto.offset.reset to earliest after you did the reset).
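To make the offset behaviour concrete, here is a minimal configuration sketch (the application id and bootstrap servers are placeholders); auto.offset.reset is only consulted when the group has no committed offsets, i.e. on the very first run or after the application has been reset:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class OffsetResetConfigSketch {
    public static Properties streamsConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-stateful-app");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // Ignored once the consumer group has committed offsets; the app then
        // always resumes from where it previously stopped.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        return props;
    }
}

As the answer above notes, forcing a true restart from scratch means resetting the application, e.g. with the kafka-streams-application-reset tool to clear the committed offsets (and, locally, KafkaStreams#cleanUp() to wipe the state directory) before starting again.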

Skipping of batches in spark structured streaming process

I have a Spark Structured Streaming job which consumes events coming from the Azure Event Hubs service. In some cases, some batches are not processed by the streaming job, and the following logging statement can be seen in the structured streaming log:
INFO FileStreamSink: Skipping already committed batch 25
The streaming job persists the incoming events into an Azure Data Lake, so I can check which events have actually been processed/persisted. When the skipping described above happens, these events are missing!
It is unclear to me, why these batches are marked as already committed, because in the end it seems like they were not processed!
Do you have an idea what might cause this behaviour?
Thanks!
I was able to solve the issue. The problem was that I had two different streaming jobs which had different checkpoint locations (which is correct) but used the same base folder for their output. The output folder also stores the sink's metadata, so the two streams shared the information about which batches they had already committed. After using a different base output folder for each job, the issue was fixed.
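A hedged sketch of that fix in Spark's Java API (the sources, formats, and paths below are placeholders): give each query its own checkpointLocation and its own output path, so their metadata directories cannot collide.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SeparateSinkPathsSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("separate-sinks").getOrCreate();

        Dataset<Row> streamA = spark.readStream().format("rate").load(); // placeholder source
        Dataset<Row> streamB = spark.readStream().format("rate").load(); // placeholder source

        streamA.writeStream()
               .format("parquet")
               .option("checkpointLocation", "/checkpoints/jobA") // placeholder, one per job
               .option("path", "/output/jobA")                    // own base folder for this sink
               .start();

        streamB.writeStream()
               .format("parquet")
               .option("checkpointLocation", "/checkpoints/jobB") // placeholder, one per job
               .option("path", "/output/jobB")                    // own base folder for this sink
               .start();

        spark.streams().awaitAnyTermination();
    }
}

Each file sink writes its commit log under <path>/_spark_metadata, so as long as the output paths differ, one job can no longer see the other job's committed batch IDs.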
We had the same issue and the Kafka broker already deleted the data. So to force the Spark application to start from the beginning (latest offset in Kafka) we deleted both the checkpoint and _spark_metadata directories. You can find _spark_metadata in the same path where you write the stream.

Pentaho Data Integration - Kafka Consumer

I am using the Kafka Consumer plugin for Pentaho CE and would appreciate your help with its usage. I would like to know if any of you have been in a situation where Pentaho failed and you lost messages (based on the official docs there's no way to read a message twice, am I wrong?). If this situation occurs, how do you capture these messages so you can reprocess them?
reference:
http://wiki.pentaho.com/display/EAI/Apache+Kafka+Consumer
Kafka retains messages for the configured retention period whether they've been consumed or not, so it allows consumers to go back to an offset they previously processed and pick up there again.
I haven't used the Kafka plugin myself, but it looks like you can disable auto-commit and manage that yourself. You'll probably need the Kafka system tools from Apache and some command line steps in the job. You'd have to fetch the current offset at the start, get the last offset from the messages you consume and if the job/batch reaches the finish, commit that last offset to the cluster.
It could be that you can also provide the starting offset as a field (message key?) to the plugin, but I can't find any documentation on what that does. In that scenario, you could store the offset with your destination data and go back to the last offset there at the start of each run. A failed run wouldn't update the destination offset, so would not lose any messages.
If you go the second route, pay attention to the auto.offset.reset setting and behavior, as it may happen that the last offset in your destination has already disappeared from the cluster if it's been longer than the retention period.
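As a sketch of the general Kafka pattern described above, using the plain Java consumer rather than the Pentaho plugin (topic and group names are placeholders): disable auto-commit and commit the offsets only once the whole batch has been processed successfully.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitBatchSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "pentaho-batch-job");       // placeholder
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // manage commits ourselves
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");       // see the retention caveat above
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));        // placeholder
            ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(10));
            for (ConsumerRecord<String, String> record : batch) {
                process(record); // write to the destination; if this throws, nothing gets committed
            }
            // Commit only after the whole batch succeeded, so a failed run simply
            // re-reads the same messages on the next start.
            consumer.commitSync();
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("%d -> %s%n", record.offset(), record.value());
    }
}

A failed run never reaches commitSync(), so no messages are lost; they are simply consumed again on the next run.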

Flink Kafka connector - commit offset without checkpointing

I have a question regarding Flink Kafka Consumer (FlinkKafkaConsumer09).
I've been using this version of connector:
flink-connector-kafka-0.9_2.11-1.1.2 (connector version 0.9, Scala version 2.11, Flink version 1.1.2)
I gather communication data from Kafka within 5-minute tumbling windows. From what I've seen, the windows are aligned with system time (for example, windows end at 12:45, 12:50, 12:55, 13:00, etc.).
After a window is closed, its records are processed/aggregated and sent via a sink operator to the database.
Simplified version of my program:
env.addSource(new FlinkKafkaConsumer09<>(topicName, jsonMapper, properties))
   .keyBy("srcIp", "dstIp", "dstPort")
   .window(TumblingEventTimeWindows.of(Time.of(5, TimeUnit.MINUTES)))
   .apply(new CounterSum<>())
   .addSink(new DbSink(...));
However, I need to commit the offsets to Kafka. From what I've read, the only way with FlinkKafkaConsumer09 is to turn on checkpointing. I do it like this:
env.enableCheckpointing(300000); // 5 minutes
Checkpointing stores the state of all operators. After a checkpoint completes, the offsets are committed to Kafka.
My checkpoints are stored via FsStateBackend in the taskmanagers' file systems (the first problem: older checkpoint data is not deleted; I saw some bugs reported for this).
The second problem is when the checkpoint is triggered. If it is triggered at the beginning of the window, the resulting checkpoint file is small; on the other hand, when it is triggered just before the window closes, the resulting state is large (for example 50 MB), because there are already many communication records in the window. The checkpoint process usually takes less than 1-2 s, but when the checkpoint is triggered after the window has closed, while the aggregations and DB sink are being processed, it takes 45 s.
But the whole point is that I don't need state checkpointing at all. All I need is to commit the offsets to Kafka after the window is closed and processed and the resulting data is written to the DB (or at the beginning of the next window). If a failover occurred, Flink would fetch the last offsets from Kafka and read the data from the last 5-minute interval again. Because the last, failed result was not sent to the DB, there would be no duplicate data sent to the DB, and re-reading the last 5-minute interval is not a significant overhead.
So basically I have 2 questions:
Is there any way to turn checkpointing off and only commit offsets as described above?
If not, is there any way to align checkpointing with the start of the window? I read the Flink documentation - there is a feature called savepoints (i.e. manual checkpoints), but it is meant to be used from the command line. I would need to trigger a savepoint from code at window start - the state would be small and the checkpoint process would be quick.
To commit offsets to Kafka without relying on checkpoints, set the property enable.auto.commit=true and a commit interval via auto.commit.interval.ms (e.g. 300000 for 5 minutes) in the Properties you pass to the FlinkKafkaConsumer09 constructor:
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "...");
properties.setProperty("group.id", "...");
properties.setProperty("enable.auto.commit", "true");
properties.setProperty("auto.commit.interval.ms", "300000");
env.addSource(new FlinkKafkaConsumer09<>(topicName, jsonMapper, properties))
This only commits your offsets and does not interfere with checkpointing.

Is it possible to use kafka source in spark streaming without replaying logs from last checkpoint?

I am using PySpark Streaming to process a very large streaming log, and because the log is huge I don't want Spark to process old logs if the application fails for any reason.
I can delete the checkpoint directory and get what I want but I was wondering if there is any way to do it programmatically.
I have already tried KafkaUtils.createStream(..., kafkaParams={'auto.offset.reset': 'largest'}) but with no success.
Any suggestion?
You should use
auto.offset.reset: 'largest'
if you want to skip the old messages in the queue for your consumer group name when your application starts.
'largest' means "start from now on", i.e. only messages produced after the consumer starts.
'smallest' means "start from the earliest offset still retained in the topic", i.e. replay everything available.
Note that auto.offset.reset is only consulted when your consumer group has no committed offsets yet. So, for future reference, if anyone wants every available message in the topic when the application starts, they should use a different consumer group name each time and pass 'smallest' as the offset reset.