Ingesting data from REST APIs into Kafka

I have many REST APIs that pull data from different data sources, and now I want to publish these REST responses to different Kafka topics. I also want to make sure that duplicate data is not produced.
Are there any tools available for this kind of operation?

In general, a Kafka processing pipeline should be able to handle messages that are sent multiple times. Exactly-once delivery of Kafka messages has only been around since mid-2017 (given that I'm writing this in January 2018) and Kafka 0.11, so unless your Kafka installation is very bleeding edge, your pipeline should be able to handle multiple deliveries of the same message.
That's your pipeline, of course. Your actual problem is that you have a data source that may deliver the same message multiple times to your HTTP -> Kafka microservice.
Ideally you should design your pipeline to be idempotent: multiple applications of the same change message should affect the data only once. This is, of course, easier said than done. But if you manage it, then the problem is solved: just send duplicate messages through and it doesn't matter. This is probably the best thing to aim for, regardless of whatever once-only-delivery, CAP-Theorem-bending magic KIP-98 does. (And if you don't see why that is super magic, well, here's a homework topic :) )
Let's say your input data is posts about users. If your posted data includes some kind of updated_at date, you could create a transaction-log Kafka topic: set the key to the user ID and the values to all the (say) updated_at fields applied to that user. When you're processing an HTTP POST, look up the user in a local KTable for that topic and check whether your post has already been recorded. If it has, don't produce the change into Kafka.
Even without the updated_at field you could save the user document in the KTable. If Kafka is a stream of transaction-log data (the database inside out), then KTables are the stream turned right side out: a database again. If the current value in the KTable (the accumulation of all applied changes) matches the object you were given in your post, then you've already applied the changes.
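A minimal Kafka Streams sketch of this idea, assuming topics named "user-posts" and "user-changes", a UserPost type carrying updated_at, and a matching Serde (all of these names are illustrative, not from the question):

    import org.apache.kafka.common.serialization.Serde;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Produced;

    public class DedupTopology {

        // Hypothetical value type; in practice this would be your POST payload.
        public record UserPost(String userId, long updatedAt) {}

        public static Topology build(Serde<UserPost> postSerde) {
            StreamsBuilder builder = new StreamsBuilder();

            // Accumulation of the changes already produced downstream (it lags slightly
            // behind what was just written, which is usually acceptable for dedup).
            KTable<String, UserPost> applied =
                builder.table("user-changes", Consumed.with(Serdes.String(), postSerde));

            KStream<String, UserPost> posts =
                builder.stream("user-posts", Consumed.with(Serdes.String(), postSerde));

            posts
                // leftJoin so users we have never seen pass through (alreadyApplied == null)
                .leftJoin(applied, (incoming, alreadyApplied) ->
                    alreadyApplied == null || incoming.updatedAt() > alreadyApplied.updatedAt()
                        ? incoming : null)
                // anything already recorded became null above and is dropped here
                .filter((userId, post) -> post != null)
                .to("user-changes", Produced.with(Serdes.String(), postSerde));

            return builder.build();
        }
    }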

Related

Schema registry incompatible changes

All the documentation clearly describes how to handle compatible changes in Schema Registry using compatibility types.
But how do you introduce incompatible changes without directly disturbing the downstream consumers, so that they can migrate at their own pace?
We have the following situation (see image) where the producer is producing the same message in both schema versions:
[image: the producer publishes the same message in both schema versions]
The problem is how to migrate the apps and the sink connector in a controlled way, where business continuity is important and the consumers are not allowed to process the same message (in the new format).
"...the consumers are not allowed to process the same message (in the new format)."
Your consumers need to be aware of the old format while consuming the new one; they need to understand what it means to consume the "same message". That's up to you to code, not something Connect or other consumers can automatically determine, with or without a Registry.
In my experience, the best approach to prevent duplicate record processing across various topics is to persist a unique id (UUID) as part of each record, across all schema versions, and then query some source of truth for whether it has already been processed or not. When a record has been processed, insert its id into that system afterwards.
This may require placing a stream-processing application in front of the sink connector that filters already-processed records out of the topic before the connector consumes it.
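A rough sketch of such a filter, assuming the record key carries the UUID, a "records-for-sink" topic that the sink connector reads, and a hypothetical ProcessedIdStore abstraction over whatever source of truth you use:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.KStream;

    public class ProcessedIdFilter {

        // Whatever system you trust to remember processed ids (DB table, cache, ...).
        public interface ProcessedIdStore {
            boolean alreadyProcessed(String uuid);
        }

        public static Topology build(ProcessedIdStore store) {
            StreamsBuilder builder = new StreamsBuilder();

            // Record key = the UUID carried across all schema versions.
            KStream<String, String> input = builder.stream("raw-records");

            input
                .filter((uuid, value) -> !store.alreadyProcessed(uuid))
                .to("records-for-sink"); // the sink connector consumes this topic

            // The downstream system inserts the UUIDs into the source of truth only
            // after it has actually processed the records.
            return builder.build();
        }
    }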
I figure what you are looking for is something like a topic offset, but one spanning multiple topics. Technically this is not provided by Kafka, and for good reasons, I'd like to add. The solution will be very specific to each use case, but I figure it all boils down to introducing your own functional offset attribute in both streams.
Consumers will have to maintain state regarding which messages have already been processed, and when switching to another topic, filter out the messages that were already processed from the other topic. You could use your own sequence numbering or timestamps to keep track of progress across topics. A sequence makes tracking progress easier, as only one value needs to be stored on the consumer end; UUIDs or other non-sequential ids will potentially require a more complex state-keeping mechanism.
Keep in mind that switching to a new topic will probably mean that lots of messages have to be skipped, and depending on the amount, this might cause a delay that you need to be willing to accept.
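For illustration, a sketch of a consumer that keeps such a functional offset, assuming a hypothetical "orders-v2" topic, a sequence number embedded in the payload, and your own code to extract and persist it:

    import java.time.Duration;
    import java.util.List;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class FunctionalOffsetConsumer {

        public static void consumeNewTopic(KafkaConsumer<String, String> consumer,
                                           long lastProcessedSeq) {
            consumer.subscribe(List.of("orders-v2")); // the new topic
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                    long seq = extractSequence(rec.value()); // your own sequence field
                    if (seq <= lastProcessedSeq) {
                        continue; // already handled via the old topic: skip it
                    }
                    process(rec);
                    lastProcessedSeq = seq; // persist this somewhere durable as well
                }
            }
        }

        private static long extractSequence(String value) { /* parse your payload */ return 0L; }

        private static void process(ConsumerRecord<String, String> rec) { /* business logic */ }
    }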

Kafka Streams DSL over Kafka Consumer API

Recently, in an interview, I was asked a question about Kafka Streams. More specifically, the interviewer wanted to know why/when you would use the Kafka Streams DSL over the plain Kafka Consumer API to read and process streams of messages. I could not provide a convincing answer and am wondering if others who use these two styles of stream processing can share their thoughts/opinions. Thanks.
As usual, it depends on the use case when to use the Kafka Streams API and when to use a plain KafkaProducer/KafkaConsumer. I would not dare to pick one over the other in general terms.
First of all, Kafka Streams is built on top of KafkaProducers/KafkaConsumers, so everything that is possible with Kafka Streams is also possible with plain consumers/producers.
I would say the Kafka Streams API is less complex but also less flexible compared to plain consumers/producers. Now we could start long discussions on what "less" means.
When developing with the Kafka Streams API, you can jump directly into your business logic, applying methods like filter, map, join, or aggregate, because all the consuming and producing is abstracted away behind the scenes.
When you are developing applications with plain consumers/producers, you need to think about how you build your clients at the level of subscribe, poll, send, flush, etc.
If you want even less complexity (but also less flexibility), ksqlDB is another option you can choose for building your Kafka applications.
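To make the difference concrete, here is the same trivial job ("uppercase every value and forward it") sketched both ways; topic names are placeholders, and configuration and error handling are omitted:

    import java.time.Duration;
    import java.util.List;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.streams.StreamsBuilder;

    public class DslVsPlainClients {

        // Kafka Streams DSL: consuming, producing, and threading are handled for you.
        static void withStreamsDsl(StreamsBuilder builder) {
            builder.<String, String>stream("input")
                   .mapValues(v -> v.toUpperCase())
                   .to("output");
        }

        // Plain clients: you own the poll loop, forwarding, flushing, commits, ...
        static void withPlainClients(KafkaConsumer<String, String> consumer,
                                     KafkaProducer<String, String> producer) {
            consumer.subscribe(List.of("input"));
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                    producer.send(new ProducerRecord<>("output", rec.key(), rec.value().toUpperCase()));
                }
                producer.flush();
                // plus: offset commits, rebalances, retries, shutdown handling, ...
            }
        }
    }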
Here are some of the scenarios where you might prefer the Kafka Streams over the core Producer / Consumer API:
It allows you to build a complex processing pipeline with much ease. So, let's assume (a contrived example) you have a topic containing customer orders, and you want to filter the orders based on delivery city and save them into a DB table for persistence and an Elasticsearch index for a quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders based on city using the Streams DSL filter function, store the filtered data in a separate Kafka topic (using KStream.to() or KTable.to()), and finally, using Kafka Connect, the messages would be stored in the database table and in Elasticsearch. You can do the same thing using the core Producer / Consumer API as well, but it would take much more coding.
In a data processing pipeline, you can do the consume-process-produce cycle in a single transaction. So, in the above example, Kafka will ensure exactly-once semantics and transactions from the source topic up to the DB and Elasticsearch. There won't be any duplicate messages introduced due to network glitches and retries. This feature is especially useful when you are doing aggregations such as the count of orders per individual product; in such scenarios, duplicates will always give you a wrong result.
You can also enrich your incoming data with very low latency. Let's assume, in the above example, you want to enrich the order data with the customer's email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API for each incoming order over the network, which would definitely be an expensive operation impacting your throughput. In such a case, you might want to store the required customer data in a compacted Kafka topic and load it into the streaming application using a KTable or GlobalKTable. Then all you need to do is a simple local lookup in the KTable for the customer's email address. Note that the KTable data here will be stored in the embedded RocksDB that ships with Kafka Streams, and since the KTable is backed by a Kafka topic, your data in the streaming application will be continuously updated in real time. In other words, there won't be stale data. This is essentially an example of the materialized view pattern. (A rough sketch of this lookup, combined with the city filter from the first scenario, follows this list of scenarios.)
Let's say you want to join two different streams of data. So, in the above example, you want to process only the orders that have successful payments, and the payment data is coming through another Kafka topic. Now, it may happen that the payment gets delayed or that the payment event arrives before the order event. In such a case, you may want to do a one-hour windowed join, so that if the order and the corresponding payment events arrive within a one-hour window, the order is allowed to proceed down the pipeline for further processing. As you can see, you need to store the state for a one-hour window, and that state will be stored in the RocksDB of Kafka Streams.
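A rough sketch of the first and third scenarios combined (filter by delivery city, then a local GlobalKTable lookup for the customer email); the topic names, the Order/Customer types, and the city value are placeholders, and Serde configuration is omitted:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.GlobalKTable;
    import org.apache.kafka.streams.kstream.KStream;

    public class OrderEnrichmentTopology {

        public record Order(String orderId, String customerId, String city) {}
        public record Customer(String customerId, String email) {}
        public record EnrichedOrder(Order order, String email) {}

        public static void build(StreamsBuilder builder) {
            // Compacted topic holding the latest customer record per customer id.
            GlobalKTable<String, Customer> customers = builder.globalTable("customers");

            KStream<String, Order> orders = builder.stream("orders");

            orders
                // keep only the orders for the delivery city we care about
                .filter((orderId, order) -> "Berlin".equals(order.city()))
                // local lookup in the GlobalKTable, no network call per order
                .join(customers,
                      (orderId, order) -> order.customerId(),              // lookup key
                      (order, customer) -> new EnrichedOrder(order, customer.email()))
                .to("orders-enriched"); // Kafka Connect sinks read from here
        }
    }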

Best practice for processing data with dependencies in Kafka?

We are developing an app that takes data from different sources, and once the data is available, we process it, put it together, and then move it to a different topic.
In our case we have 3 topics, and each of these topics brings data that is related to data from one of the other topics. The entities generated may or may not be received at the same time (or within a short period of time), and this is where the problem arises, because we need to join these 3 entities into one before we move the result to the output topic.
Our idea was to create a separate topic containing all the data that has not been processed yet, and then have a separate thread that checks that topic at fixed intervals and also checks whether the dependencies of each entity are available. If they are available, we delete the entity from this separate topic; if not, we keep the entity there until it gets resolved.
After all this explanation, my question is: is it reasonable to do it this way, or are there other good practices or strategies that Kafka provides to solve this kind of scenario?
Kafka messages may get cleaned up after some time based on the retention policy, so you need to store the messages somewhere.
I can see the option below, but as always, every problem has many approaches and solutions:
Process all messages and forward the "not processed" messages to another topic, say A.
Use the Kafka Processor API to consume messages from topic A and store them in a state store.
Schedule a punctuate() method with a time interval.
Iterate over all messages stored in the state store.
Check the dependencies; if they are available, delete the message from the state store and process it, or publish it back to the original topic to get processed again. (A sketch follows the link below.)
You can refer to the link below for reference:
https://kafka.apache.org/10/documentation/streams/developer-guide/processor-api.html
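A rough sketch of such a processor, written against the current Processor API (the linked 1.0 documentation shows a slightly older interface); the store name, the topic wiring, and the dependency check are placeholders:

    import java.time.Duration;
    import org.apache.kafka.streams.processor.PunctuationType;
    import org.apache.kafka.streams.processor.api.Processor;
    import org.apache.kafka.streams.processor.api.ProcessorContext;
    import org.apache.kafka.streams.processor.api.Record;
    import org.apache.kafka.streams.state.KeyValueIterator;
    import org.apache.kafka.streams.state.KeyValueStore;

    public class PendingMessageProcessor implements Processor<String, String, String, String> {

        private ProcessorContext<String, String> context;
        private KeyValueStore<String, String> pending; // must be attached to this processor in the topology

        @Override
        public void init(ProcessorContext<String, String> context) {
            this.context = context;
            this.pending = context.getStateStore("pending-store");
            // Re-check buffered messages every minute.
            context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, ts -> {
                try (KeyValueIterator<String, String> it = pending.all()) {
                    while (it.hasNext()) {
                        var entry = it.next();
                        if (dependenciesAvailable(entry.key)) {
                            context.forward(new Record<>(entry.key, entry.value, ts));
                            pending.delete(entry.key);
                        }
                    }
                }
            });
        }

        @Override
        public void process(Record<String, String> record) {
            if (dependenciesAvailable(record.key())) {
                context.forward(record); // dependencies present: pass it along immediately
            } else {
                pending.put(record.key(), record.value()); // buffer until punctuate() resolves it
            }
        }

        private boolean dependenciesAvailable(String key) {
            return false; // your own lookup against the data from the other topics
        }
    }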

Delayed message consumption in Kafka

How can I produce/consume delayed messages with Apache Kafka? It seems that standard Kafka (and the Java kafka-client) doesn't have this feature. I know that I could implement it myself with the standard wait/notify mechanism, but it doesn't seem very reliable, so any advice and good practices are appreciated.
I found a related question, but it didn't help.
As I see it: Kafka is based on sequential reads from the file system and can only be used to read topics straight through, preserving message ordering. Am I right?
Indeed, Kafka's lowest-level structure is the partition, which is a sequence of events in a queue with incremental offsets - you can't insert a record anywhere other than at the end at the moment you produce it. There is no concept of delayed messages.
What do you want to achieve exactly?
Some possibilities in your case:
You want to push a message at a specific time (for example, a "start job" event). In this case, use a scheduled task (not from Kafka: use some standard mechanism in your OS / language / custom app / whatever) to send the message at the given time - consumers will receive it at the proper time.
You want to send an event now, but it should not be acted upon immediately by consumers. In this case, you can use a custom structure that includes a "time" field in its payload. Consumers will have to understand this field and have custom processing to deal with it, for example: "start job at 2017-12-27T20:00:00Z". You could also use headers for this, but headers are not supported by all clients for now. (A sketch of this option follows this list of possibilities.)
You can change the timestamp of the message you send. Internally, it would still be read in order, but some time-based functions would work differently, and the consumer could use the message's timestamp for its action. This is similar to the previous proposal, except the timestamp is metadata of the event rather than part of the payload itself. I would not use this personally - I only deal with timestamps when I proxy events.
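For the second option, a minimal consumer-side sketch, assuming a "delayed-jobs" topic and a payload that starts with the due time in epoch millis ("dueEpochMillis|body"); both the topic name and the payload layout are made up for illustration:

    import java.time.Duration;
    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class DelayedConsumer {

        private static final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

        public static void run(KafkaConsumer<String, String> consumer) {
            consumer.subscribe(List.of("delayed-jobs"));
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                    long dueAt = Long.parseLong(rec.value().split("\\|", 2)[0]);
                    long delayMs = Math.max(0, dueAt - System.currentTimeMillis());
                    // Defer the work locally; the offset is still committed as usual, so a
                    // crash before the task runs loses it - acceptable only for some use cases.
                    scheduler.schedule(() -> process(rec), delayMs, TimeUnit.MILLISECONDS);
                }
            }
        }

        private static void process(ConsumerRecord<String, String> rec) { /* handle the event */ }
    }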
For your last question: basically, yes, but with some notes:
Topics are actually split into partitions, and order is only preserved within a partition. All messages with the same key are sent to the same partition.
Most of the time you only read from memory, unless you read old events - in that case, since they are read sequentially from disk, it is still very fast.
You can choose where to begin reading - a given offset or a given time - and even change it at runtime.
You can parallelize reads across processes - multiple consumers can read the same topics without ever reading the same message twice (each reading different partitions; see consumer groups).

Looking for alternative to dynamically creating Kafka topics

I have a service that fetches a snapshot of some information about entities in our system and holds on to it for later processing. Currently, in the later processing stages, we fetch the information using HTTP.
I want to use Kafka to store this information by dynamically creating topics so that the snapshots aren't mixed up with each other. When the service fetches the entities, it creates a unique topic, and each entity we fetch is then pushed to that topic. The later processing stages would be passed the topic as a parameter and could then read all the info at their leisure.
The benefits of this would be:
Restarting the later processing stages can be made to simply resume at the offset processed so far.
No need to worry about batching requests (or stream-processing the incoming HTTP response) for the entities if there are a lot of them, since we simply read one at a time.
Multiple consumer groups can easily be added later for other processing purposes.
However, Kafka/ZooKeeper has limits on the total number of topics/partitions it can support. As such, I would need to delete the topics either after the processing is done or after some arbitrary amount of time has passed. In addition, since some of the processors would have to know when all the information has been read, I would need to include some sort of "end of stream" message on the topic.
Two general questions:
Is it bad to dynamically create and delete Kafka topics like this?
Is it bad to include an "End of Stream" type of message?
Main question:
Is there an alternative to the above approach using static topics/partitions that doesn't entail having to hold onto the entities in memory until the processing should occur?
It seems that a single compacted topic can be an alternative.
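A rough sketch of what that could look like: one static compacted topic, records keyed by "<snapshotId>:<entityId>" so snapshots stay separable, and an explicit end-of-snapshot marker. All names, the partition/replication counts, and the key scheme are assumptions for illustration:

    import java.util.Map;
    import java.util.Set;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.config.TopicConfig;

    public class SnapshotPublisher {

        public static void createTopic(Admin admin) throws Exception {
            NewTopic topic = new NewTopic("entity-snapshots", 6, (short) 3)
                .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Set.of(topic)).all().get();
        }

        public static void publish(KafkaProducer<String, String> producer,
                                   String snapshotId, Map<String, String> entities) {
            entities.forEach((entityId, payload) ->
                producer.send(new ProducerRecord<>("entity-snapshots",
                                                   snapshotId + ":" + entityId, payload)));
            // Explicit end-of-snapshot marker so consumers know the snapshot is complete.
            producer.send(new ProducerRecord<>("entity-snapshots",
                                               snapshotId + ":__end__", "END"));
            producer.flush();
        }
    }

Old snapshots could later be cleared by producing tombstone (null-value) records for their keys, which log compaction will eventually remove.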