Kafka vs Kinesis - design questions

I'm studying system design and data streaming and got confused about when to use Kafka versus Kinesis. At first I thought they worked together, but I'm still not sure that's correct.
Let's suppose I have a microservice that calculates a delivery (shipping) fee, handles thousands of requests per second, and needs to reply almost instantly.
My idea was this:
ENDPOINT --> API GATEWAY --> LAMBDA FUNCTION (if I need to do anything with the API) -->
KINESIS (process the data stream) --> KAFKA (publish the events to a queue) -->
MICROSERVICE THAT CONSUMES THE KAFKA EVENTS AND RETURNS TO THE USER
Does this make sense?
Do Kafka and Kinesis work together, or am I misunderstanding the functionality of one of these services?
Should I remove the Lambda function?
PS: there's no code for this problem; I'm just trying to learn more about how to design a system.

They can work together (there is a Kafka connector that can copy the data between them), but it's often pointless, as your gateway / Lambda / app should be able to publish directly to Kafka.
The only reason you might need to duplicate data between the two is while you're migrating from one to the other.
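For example, a bare-bones sketch of publishing directly to Kafka from a Java service; the topic name, broker address and payload here are placeholders, not anything from the question:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class FeeRequestProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; use your cluster's bootstrap servers
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Hypothetical topic and payload for a delivery-fee request
            ProducerRecord<String, String> record = new ProducerRecord<>(
                "delivery-fee-requests", "order-123", "{\"zip\":\"10001\",\"weightKg\":2.5}");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Published to %s-%d@%d%n",
                        metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        }
    }
}
```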

Related

What is the point of using Kafka in this example and why not use DB straightaway?

Here is an example of how Kafka is supposed to run for a social network site.
But it is hard for me to understand the point of Kafka here. We would not want to store posts and likes in Kafka, as they will be deleted after some time. So Kafka would be an intermediate store between the view and the DB.
But why would we need it? Wouldn't it be better to use the DB straight away?
I guess we could use Kafka as some kind of cache, so that data accumulates in Kafka and we can then insert it into the DB in one big batch query. But I am pretty sure that is not the reason Kafka is here.
What's not shown in the diagram are the processes querying the database (RocksDB, in this case). Without using Kafka Streams, you'd need to write some external service to run GROUP BY / SUM on the database. The "website" box on the left is doing some sort of front-end JavaScript, and it is unclear how the Kafka backend consumer sends data to it (perhaps WebSockets?).
With Kafka Streams Interactive Queries, that logic can be moved closer to the actual event source and performed in near real time, rather than as a polling batch job. In a streaming framework, you could also send out individual event hooks (WebSockets, for example) to dynamically update "likes per post", "shares per post", "trends", etc. without requiring the user to refresh the page or having the page make AJAX calls with large API responses for every rendered item.
More specifically, each Kafka Streams instance serves a specific query, rather than the API hitting one database for all queries. Therefore, load is more distributed and fault tolerant.
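For illustration only, a minimal Kafka Streams sketch that counts likes per post and answers an Interactive Query from the local state store; the topic name, store name and post key below are made up:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class LikesPerPost {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "likes-aggregator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        StreamsBuilder builder = new StreamsBuilder();
        // Hypothetical "likes" topic keyed by postId; the value is the user who liked it
        builder.stream("likes", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("likes-per-post"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Interactive Query: read the current count for a post from the local RocksDB-backed store.
        // In practice, wait until the application is in the RUNNING state before querying.
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
            StoreQueryParameters.fromNameAndType("likes-per-post", QueryableStoreTypes.keyValueStore()));
        System.out.println("Likes for post-42: " + store.get("post-42"));
    }
}
```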
Worth pointing out that Apache Pinot loaded from Kafka is more suited for such real time analytical queries than Kafka Streams.
Also, as you pointed out, Kafka or any message queue would act as a buffer ahead of any database (not a cache, although Redis could be added as a cache, just like the search service mentioned later). And there's nothing preventing you from adding another database connected via a Kafka Connect sink. For instance, a popular design is to write data to an RDBMS as well as to Elasticsearch for text-based search indexing. The producer code only cares about one Kafka topic, not every downstream system where the data is needed.

Kafka Connect vs Apache NiFi

Good afternoon. My question is pretty simple. I'm new to Apache Kafka, but I'm doing some work as part of my internship, which is why I have this question.
I will provide as much context as I can, so I hope someone can help me clear up my doubts.
I was asked to develop a pipeline (or workflow), first using Apache NiFi.
The pipeline consisted of the following:
I fetched data from a local MySQL database using NiFi, then the data was sent to a Kafka topic. It was later processed to clean some raw data using the Kafka client with Java (KStream, KTable and some regular expressions) and sent to another Kafka topic.
Once the processing was done, the new data was read again using Apache NiFi and then sent to a new MySQL table.
I've included a picture for better understanding.
General Pipeline
After that, I was asked to do the same thing but using Kafka Connect instead of Apache NiFi, which was even shorter: I only had to use a source connector to read the data from the MySQL database and send it to a Kafka topic, process it with the Kafka client in Java and send the result to a new Kafka topic, and finally use a sink connector to take the processed data from the new topic and write it straight into a new table in the database.
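For reference, a bare-bones sketch of the cleaning step in the middle of that pipeline; the topic names (raw-data, clean-data) and the regular expression are placeholders for whatever the actual data needs:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class CleaningPipeline {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cleaning-pipeline");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw =
            builder.stream("raw-data", Consumed.with(Serdes.String(), Serdes.String()));

        raw.filter((key, value) -> value != null && !value.trim().isEmpty())
           // Example clean-up: strip anything that isn't alphanumeric, whitespace or basic punctuation
           .mapValues(value -> value.replaceAll("[^\\p{Alnum}\\s.,-]", "").trim())
           .to("clean-data", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```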
So, someone in charge asked me when I should use Apache NiFi + Kafka instead of Kafka Connect + Kafka, and to be honest I have no idea.
Let's say the most important goal here is to apply data enrichment, and let's consider two scenarios:
when I have data from different sources but the data is not streaming data, AND when the data is a mix of streaming and non-streaming data.
All of it needs to be processed, integrated, cleaned and finally unified to apply data enrichment.
Given the context above, my questions and doubts are:
When should I use NiFi with Kafka, and when not? And why?
When should I use Kafka Connect with Kafka, and when not? And why?
I think I have a basic idea, and I have been reading in order to answer this for myself, but to be honest, I haven't come up with an acceptable answer or a clear idea of when to use each one.
So, I would really appreciate your help.

Filtering in Kafka and other streaming technologies

I am currently doing some research about which stream processing technology to use. So far I have looked at message queueing technologies and streaming frameworks. I am now leaning towards Apache Kafka or Google Pub/Sub.
The requirements I have:
Deliver, read and process messages/events in real time.
Persistence in the messages/events.
Ability to filter messages/events in real time without having to read the entire topic. For example: if I have a topic called ‘details’, I want to be able to filter the messages/events in that topic where an attribute of an event equals a certain value.
Ability to see if the producer to a certain topic or queue is finished.
Ability to delete messages/events in a topic based on an attribute within an event equaling a certain value.
Ordering in messages/events.
My question is: what is the best framework/technology for these use cases? From what I have read so far, Kafka doesn't provide out-of-the-box filtering for messages/events in topics, while Google Pub/Sub does have a filtering feature.
Any suggestions and experience would be welcome.
As per the requirements you mentioned, Kafka seems like a nice fit. Using Kafka Streams or KSQL you can perform filtering in real time; here is an example: https://kafka-tutorials.confluent.io/filter-a-stream-of-events/confluent.html
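For illustration, a minimal Kafka Streams sketch of that kind of filter, assuming string-valued events on the ‘details’ topic; the attribute, value and output topic are made up, and in practice you would use a JSON or Avro serde and inspect the parsed field rather than doing a substring match:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class FilterDetails {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "details-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        StreamsBuilder builder = new StreamsBuilder();
        // Keep only the events whose value contains the attribute/value we care about
        builder.stream("details", Consumed.with(Serdes.String(), Serdes.String()))
               .filter((key, value) -> value != null && value.contains("\"status\":\"SHIPPED\""))
               .to("details-shipped", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```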
What you need is more than just integration and data transfer; you need something similar to what is known as an ETL tool. Here you can find more about ETL and ETL tools in GCP: https://cloud.google.com/learn/what-is-etl

How to get data from Kafka into a store without Kafka Connect sink?

When reading about Kafka and how to get data from Kafka to a queryable database suited for some specific task, there is usually mention of Kafka Connect sinks.
This sounds like the way to go if I needed to get data from Kafka into a search index like Elasticsearch, or into analytics systems like Hadoop or Spark, where a Kafka Connect sink is available.
But my question is: what is the best way to handle a store that isn't as popular, say MyImaginaryDB, where the only way I can get to it is through some API, and the data needs to be handled securely and reliably, as well as decently transformed before inserting? Is it recommended to:
1) Just have the API consume from Kafka and use the MyImaginaryDB driver to write
2) Figure out how to build a custom Kafka Connect sink (assuming it can handle schemas, authentication/authorization, retries, fault tolerance, transforms and the post-processing needed before landing in MyImaginaryDB)
I have also been reading about Kafka KSQL and Streams and am wondering if that helps with transforming the data before it is sent to the end store.
Option 2, definitely. Just because there isn't an existing sink connector doesn't mean Kafka Connect isn't for you. If you're going to be writing some code anyway, it still makes sense to hook into the Kafka Connect framework. Kafka Connect handles all the common stuff (schemas, serialisation, restarts, offset tracking, scale-out, parallelism, etc.), and leaves you just to implement the bit that gets the data into MyImaginaryDB.
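As a rough sketch of what that custom piece might look like (MyImaginaryDbClient and its config keys are hypothetical stand-ins for whatever API MyImaginaryDB exposes, and a matching SinkConnector class is also required):

```java
import java.util.Collection;
import java.util.Map;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class MyImaginaryDbSinkTask extends SinkTask {
    private MyImaginaryDbClient client; // hypothetical client for MyImaginaryDB's API

    @Override
    public String version() {
        return "0.1.0";
    }

    @Override
    public void start(Map<String, String> config) {
        // Connect to MyImaginaryDB using settings from the connector configuration (hypothetical keys)
        client = new MyImaginaryDbClient(config.get("myimaginarydb.url"),
                                         config.get("myimaginarydb.api.key"));
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        // Kafka Connect hands you batches of records; offset tracking,
        // restarts and scale-out are handled by the framework.
        for (SinkRecord record : records) {
            client.insert(record.topic(), record.key(), record.value());
        }
    }

    @Override
    public void stop() {
        if (client != null) {
            client.close();
        }
    }
}
```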
As regards transformations, the standard pattern is either:
Use Single Message Transforms for lightweight stuff
Use Kafka Streams/KSQL and write back to another topic, which is then routed through Kafka Connect to the target
If you try to build your own app doing (transformation + data sink), then you're munging together responsibilities, and you're reinventing a chunk of a wheel that already exists (integration with an external system in a reliable, scalable way).
You might find this talk useful for background about what Kafka Connect can do: http://rmoff.dev/ksldn19-kafka-connect

Kafka Streams use case

I am building a simple application which does the following, in order:
1) Reads messages from a remote IBM MQ (legacy system only works with IBM MQ)
2) Writes these messages to Kafka Topic
3) Reads these messages from the same Kafka Topic and calls a REST API.
4) There could be other consumers reading from this topic in future.
I came to know that Kafka has the new Streams API, which is supposed to be better than the Kafka consumer in terms of speed/simplicity, etc. Can someone please let me know if the Streams API is a good fit for my use case, and at what point in my process I can plug it in?
It is true that the Kafka Streams API offers a simpler way to consume records compared to the Kafka Consumer API (e.g. you don't need to poll, manage a thread and loop), but it also comes with a cost (e.g. a local data store, if you do stateful processing).
I would say that if you need to consume records one by one and call a REST API, use the Consumer API; if you need stateful processing, to query topic state, etc., use the Streams API.
For more info, take a look at this blog post: https://balamaci.ro/kafka-streams-for-stream-processing/
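A minimal sketch of that Consumer API path, consuming one record at a time and forwarding it to a REST endpoint; the bootstrap servers, topic name, consumer group and endpoint URL are all placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MqBridgeConsumer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "rest-forwarder");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        HttpClient http = HttpClient.newHttpClient();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("mq-messages")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Forward each message to the downstream REST endpoint (placeholder URL)
                    HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create("http://localhost:8080/api/messages"))
                        .POST(HttpRequest.BodyPublishers.ofString(record.value()))
                        .build();
                    http.send(request, HttpResponse.BodyHandlers.ofString());
                }
            }
        }
    }
}
```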
Reads messages from a remote IBM MQ (legacy system only works with IBM MQ)
Writes these messages to Kafka Topic
I'd use Kafka Connect for (1) and (2). It is part of the Kafka project, and there are many free as well as commercial "connectors" available for hundreds of systems.
Reads these messages from the same Kafka Topic and calls a REST API.
You can use Kafka Streams as well as the lower-level Consumer API of Kafka, depending on what you prefer. I'd go with Kafka Streams as it is easier to use and far more powerful. (Both are part of the Kafka project.)
There could be other consumers reading from this topic in future.
This works out-of-the-box -- once data is stored in a Kafka topic according to step 2, many different applications and "consumers" can read this data independently.
Looks like you are not doing any processing/transformation once you consume your message from IBM MQ, or even after your Kafka topic.
First -> from IBM MQ to your Kafka topic is kind of a pipeline, and
Second -> you are just calling the REST API (I assume without any processing).
Considering these facts, it seems to be a good fit for using the simple consumer.
Let's not use a technology only because it's there :)