I would like to harness the speed and power of Apache Kafka to replace REST calls in my Java application.
My app has the following architecture:
Producer P1 writes a search query to topic search
Consumer C1 reads/consumes the search query and produces search results which it writes to another topic search_results.
Both Producer P1 and Consumer C1 are part of a group of producers/consumer living on different physical machines. I need the Producer P1 server to be the one to consume/read the search results output produced by Consumer C1 so it can serve the search results to the client who submitted the search query.
The above example was simplified for demonstration purposes - in reality the process entails several additional intermediate Producers and Consumers where the query may be thrown to/from multiple servers to be processed. The main point is that the value produced by the last Producer needs to be read/consumed by the first Producer.
In the typical Apache Kafka architecture, there's no way to ensure that the final output is read by the server that originally produced the search query - as there are multiple servers reading the same topic.
I do not want to use REST for this purpose because it is far too slow when processing thousands of queries. Apache Kafka can handle millions of queries with 10 millisecond latency, and in my particular use case it is critical that the query is transmitted with sub-millisecond overhead. Scaling with REST is also much more difficult: suppose our traffic increases and we need to add a dozen more servers to intercept client queries. With Apache Kafka it's as simple as spinning up new servers and adding them to the Producer P1 group; with REST it's not so simple. Apache Kafka also provides a level of decoupling that REST does not.
What design/architecture can be used to force a specific server/producer to consume the end result of the initial query?
Thanks
In the typical Apache Kafka architecture, there's no way to ensure that the final output is read by the server that originally produced the search query - as there are multiple servers reading the same topic.
You can use a custom partitioner in your producer to determine which partition each search query (or result) lands in.
Similarly, you can use a custom partition assignor in the consumer to determine which partitions are assigned to which consumer. The relevant consumer configuration is partition.assignment.strategy.
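As a rough sketch of one half of that idea (the class name, key choice and topic name are my assumptions, not something from the question): the final producer keys each result with the id of the server that issued the original query, and a custom partitioner maps every server id to a fixed partition of the search_results topic.

    import java.util.Map;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;

    // Hypothetical partitioner: assumes every result record is keyed with the
    // (non-null) id of the server that produced the original query, e.g. "P1".
    public class ReplyPartitioner implements Partitioner {

        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            // Same origin-server id -> same partition, deterministically.
            return Math.floorMod(key.toString().hashCode(), numPartitions);
        }

        @Override
        public void close() {}

        @Override
        public void configure(Map<String, ?> configs) {}
    }

The last producer would register this class via the partitioner.class producer setting, and each originating server would then read only "its" partition of search_results, either through a matching custom assignor or simply with consumer.assign(...) on that one TopicPartition.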
The fact that Kafka is faster than REST is due to the way it is implemented. What is important here is to decide which pattern works for you - request-response, publish-subscribe, or something else. You can check this answer for REST vs Kafka.
Maybe it makes sense to have multiple topics for the answers - for example one results topic per originating producer (P1, ..., PN) - instead of one big topic.
This way the "results" topics act as "mailboxes": each producer only consumes its own.
Probably you'll need to set auto.create.topics.enable=true since creating topics for all P1,...PN could be complicated.
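A minimal sketch of such a "mailbox" consumer, assuming reply topics named search_results_<serverId> and String queries/results (the names and serializers are assumptions):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    // Hypothetical mailbox consumer: each front-end server reads only its own results topic.
    public class MailboxConsumer {
        public static void main(String[] args) {
            String serverId = "P1"; // assumed id of this front-end instance
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "search-frontend-" + serverId);
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("search_results_" + serverId));
                while (true) {
                    ConsumerRecords<String, String> results = consumer.poll(Duration.ofMillis(100));
                    results.forEach(r ->
                        System.out.println("result for query " + r.key() + ": " + r.value()));
                }
            }
        }
    }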
My question is rather specific, so I will be OK with a general answer that points me in the right direction.
Description of the problem:
I want to deliver specific task data from multiple producers to a particular consumer working on the task (both are docker containers run in k8s). The relation is many to many - any producer can create a data packet for any consumer. Each consumer is processing ~10 streams of data at any given moment, while each data stream consists of 100 messages of 160 b each per second (from different producers).
Current solution:
In our current solution, each producer keeps a cache of task: (IP:PORT) pairs for the consumers and uses UDP packets to send the data directly. It is nicely scalable but rather messy in deployment.
Question:
Could this be realized in the form of a message queue of sorts (Kafka, Redis, RabbitMQ...)? E.g., having a channel for each task where producers send data while the consumer - well, consumes it? How many streams would be feasible for the MQ to handle (I know it would differ - give your best estimate)?
Edit: Would 1000 streams, which equal 100 000 messages per second, be feasible? (Throughput for 1000 streams is 16 Mb/s.)
Edit 2: Fixed packet size to 160 b (typo).
Unless you need disk persistence, do not even look in the message broker direction; you are just adding one problem to another. Direct network code is a proper way to solve audio broadcast. Now, if your code is messy and you want a simplified programming model, a good alternative to raw sockets is the ZeroMQ library. It will give you all the message-broker functionality you care about - (a) discrete messaging instead of streams, (b) client discoverability - without going overboard with another software layer.
When it comes to "feasible": 100 000 messages per second at 160 kb per message is a lot of data - about 16 Gb/sec even without any messaging protocol on top of it. In general Kafka shines at message throughput of small messages, as it batches messages on many layers. Knowing this, the sustained performance of Kafka is usually constrained by disk speed, as Kafka is intentionally written this way (the slowest component is the disk). However, your messages are very large and you need to both write and read them at the same time, so I don't see it happening without a large cluster installation, as your problem is actual data throughput, not the number of messages.
Because you are data limited, even other classic MQ software like ActiveMQ, IBM MQ etc. is actually able to cope very well with your situation. In general, classic brokers are much more "chatty" than Kafka and cannot hit Kafka's message throughput when handling small messages. But as long as you are using large non-persistent messages (and a proper broker configuration) you can expect decent performance in MB/sec from those too. Classic brokers will, with proper configuration, directly connect a producer's socket to a consumer's socket without hitting a disk; in contrast, Kafka will always persist to disk first. So they even have some latency advantages over Kafka.
However, this direct socket-to-socket "optimisation" is just a full circle back to the start of this answer. Unless you need audio stream persistence, all you are doing with a broker in the middle is finding an indirect way of binding producing sockets to consuming ones and then sending discrete messages over this connection. If that is all you need, ZeroMQ is made for this.
There is also a messaging protocol called MQTT which may be of interest to you if you choose to pursue a broker solution, as it is meant to be an extremely scalable solution with low overhead.
A basic approach
From a Kafka perspective, each stream in your problem can map to one topic, and therefore there is one producer-consumer pair per topic.
Con: If you have lots of streams, you will end up with a lot of topics, and IMO the solution can get messier here too as the number of topics grows.
An alternative approach
Alternatively, the best way is to map multiple streams to one topic where each stream is separated by a key (such as the IP:Port combination you already use), and then have multiple consumers, each subscribing to a specific set of partitions as determined by the key. Partitions are the unit of scalability in Kafka.
Con: Though you can increase the no. of partitions, you cannot decrease them.
Type of data matters
If your streams are heterogeneous, in the sense that it would not be apt for all of them to share a common topic, you can create more topics.
Usually, topics are determined by the data they host and/or what their consumers do with the data in the topic. If all of your consumers do the same thing i.e. have the same processing logic, it is reasonable to go for one topic with multiple partitions.
Some points to consider:
Unlike in your current solution (I suppose), a message does not disappear once it is received and processed; rather, it stays in the topic until the configured retention period expires.
Take proper care in determining the keying strategy, i.e. which messages land in which partitions. As said earlier, if all of your consumers do the same thing, all of them can be in one consumer group to share the workload.
Consumers belonging to the same group do a common task and will subscribe to a set of partitions determined by the partition assignor. Each consumer will then get a set of keys - in other words, a set of streams or, in terms of your current solution, a set of one or more IP:Port pairs (see the keyed-producer sketch below).
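A minimal keyed-producer sketch of that idea; the topic name, the "IP:Port" key format and the raw byte payload are assumptions taken from the wording of the question:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Hypothetical producer: every stream is identified by its "IP:Port" string,
    // so all messages of a stream hash to the same partition and are therefore
    // always handled by the same consumer of the group.
    public class StreamProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

            try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
                String streamKey = "10.0.0.5:9000"; // assumed stream identity
                byte[] packet = new byte[160];      // the 160 b data packet from the question
                producer.send(new ProducerRecord<>("task-data", streamKey, packet));
            }
        }
    }

Because the default partitioner hashes the key, all packets of one stream land in the same partition and are consumed in order by exactly one consumer of the group.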
Recently, in an interview, I was asked a question about Kafka Streams; more specifically, the interviewer wanted to know why/when you would use the Kafka Streams DSL over the plain Kafka Consumer API to read and process streams of messages. I could not provide a convincing answer and am wondering if others who have used these two styles of stream processing can share their thoughts/opinions. Thanks.
As usual, it depends on the use case when to use the Kafka Streams API and when to use plain KafkaProducer/Consumer. I would not dare to select one over the other in general terms.
First of all, Kafka Streams is built on top of Kafka producers/consumers, so everything that is possible with Kafka Streams is also possible with plain consumers/producers.
I would say the Kafka Streams API is less complex but also less flexible compared to the plain consumers/producers. Now we could start long discussions on what "less" means.
When developing with the Kafka Streams API you can jump directly into your business logic, applying methods like filter, map, join, or aggregate, because all the consuming and producing is abstracted away behind the scenes.
When you are developing applications with plain Consumer/Producers you need to think about how you build your clients at the level of subscribe, poll, send, flush etc.
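For illustration, here is a minimal consume-transform-produce loop with the plain clients. The topic names, serializers and the trivial "processing" step are assumptions; the point is how much client plumbing surrounds the single line of business logic:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Hypothetical plain-client application: reads "input", upper-cases the value, writes "output".
    public class PlainClientApp {
        public static void main(String[] args) {
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            consumerProps.put("group.id", "plain-app");
            consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092");
            producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                consumer.subscribe(Collections.singletonList("input"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                    for (ConsumerRecord<String, String> record : records) {
                        // The actual business logic is this one line; everything else is plumbing.
                        producer.send(new ProducerRecord<>("output", record.key(), record.value().toUpperCase()));
                    }
                }
            }
        }
    }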
If you want even less complexity (but also less flexibility), ksqlDB is another option you can choose for building your Kafka applications.
Here are some of the scenarios where you might prefer the Kafka Streams over the core Producer / Consumer API:
It allows you to build a complex processing pipeline with much ease. So let's assume (a contrived example) you have a topic containing customer orders, and you want to filter the orders based on delivery city and save them into a DB table for persistence and an Elasticsearch index for a quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders based on city using the Streams DSL filter function, store the filtered data in a separate Kafka topic (using KStream.to(), or KTable.toStream().to() for tables), and finally, using Kafka Connect, the messages would be stored in the database table and Elasticsearch. You can do the same thing using the core Producer / Consumer API, but it would be much more coding. (A rough Streams DSL sketch of this scenario follows after this list.)
In a data processing pipeline, you can do the consume-process-produce in the same transaction. So, in the above example, Kafka will ensure exactly-once semantics and transactional processing from the source topic up to the DB and Elasticsearch. There won't be any duplicate messages introduced due to network glitches and retries. This feature is especially useful when you are doing aggregates, such as counting orders at the level of an individual product; in such scenarios duplicates will always give you a wrong result.
You can also enrich your incoming data with very low latency. Let's assume, in the above example, that you want to enrich the order data with the customer's email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API for each incoming order over the network, which would definitely be an expensive operation impacting your throughput. In such a case, you might want to store the required customer data in a compacted Kafka topic and load it into the streaming application using a KTable or GlobalKTable. Now, all you need to do is a simple local lookup in the KTable for the customer email address. Note that the KTable data is stored in the embedded RocksDB that ships with Kafka Streams, and because the KTable is backed by a Kafka topic, the data in your streaming application is continuously updated in real time. In other words, there won't be stale data. This is essentially an example of the materialized view pattern (see the KTable-join sketch after this list).
Let's say you want to join two different streams of data. In the above example, you want to process only the orders that have successful payments, and the payment data comes through another Kafka topic. Now, it may happen that the payment gets delayed or the payment event arrives before the order event. In such a case, you may want to do a one-hour windowed join, so that if the order and the corresponding payment events arrive within a one-hour window, the order is allowed to proceed down the pipeline for further processing. As you can see, you need to store the state for the one-hour window, and that state is kept in the RocksDB store of Kafka Streams.
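Below are rough sketches of the three Streams scenarios above. The topic names, the plain-String/JSON value format, the key choices and the city/payment conditions are all assumptions made for illustration, the fragments assume default String serdes configured on the application, and the Kafka Connect sinks are not shown.

Filtering orders by delivery city and writing them to a separate topic:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    // Hypothetical topology for the order-filtering example.
    public class OrderFilterApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-filter-app");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> orders = builder.stream("customer-orders");

            // Keep only orders for the delivery city we care about; a Kafka Connect
            // sink can then move the filtered topic into the DB and Elasticsearch.
            orders.filter((orderId, orderJson) -> orderJson.contains("\"city\":\"Berlin\""))
                  .to("orders-berlin");

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }

Enriching orders with customer data through a KTable - a local RocksDB lookup instead of a remote REST call:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    // Hypothetical enrichment step: orders are re-keyed by customerId and customer
    // records live in a compacted "customers" topic loaded as a KTable.
    public class OrderEnrichment {
        static void build(StreamsBuilder builder) {
            KStream<String, String> orders = builder.stream("orders-by-customer");
            KTable<String, String> customers = builder.table("customers");

            // The join is a local state-store lookup; the KTable is continuously
            // updated from the compacted topic, so the data does not go stale.
            orders.join(customers, (orderJson, customerJson) -> orderJson + " | " + customerJson)
                  .to("orders-enriched");
        }
    }

A one-hour windowed join of orders and payments (JoinWindows.ofTimeDifferenceWithNoGrace needs a 3.x client):

    import java.time.Duration;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.JoinWindows;
    import org.apache.kafka.streams.kstream.KStream;

    // Hypothetical join: both topics are assumed to be keyed by orderId.
    public class OrderPaymentJoin {
        static void build(StreamsBuilder builder) {
            KStream<String, String> orders = builder.stream("orders");
            KStream<String, String> payments = builder.stream("payments");

            // Only orders whose payment arrives within one hour (either side) pass on;
            // the window state lives in Kafka Streams' local RocksDB store.
            orders.join(payments,
                        (orderJson, paymentJson) -> orderJson + " | " + paymentJson,
                        JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofHours(1)))
                  .to("paid-orders");
        }
    }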
I want to change the communication between (micro)-services from REST to Kafka.
I'm not sure about the topics and wanted to hear some opinions about that.
Consider the following setup:
I have an API-Gateway that provides CRUD functions via REST for web applications. So I have 4 endpoints which users can call.
The API-Gateway will produce the requests and consume the responses from the second service.
The second service consumes the requests, accesses the database to execute the CRUD operations, and produces the result.
How many topics should I create?
Do I have to create 8 (2 per endpoint (request/response)) or is there a better way to do it?
I would like to hear about your experience, or get links to talks/documentation on that.
The short answer for this question is: it depends on your design.
You can use only one topic for all your operations, or you can use several topics for different operations. However, you must know that:
You have to produce messages to Kafka in the order they were created, and you must consume them in the same order to provide consistency. Messages sent to Kafka are ordered within a topic partition; messages in different topic partitions are not ordered by Kafka. Let's say you created an item and then deleted it. If you try to consume the message related to the delete operation before the message related to the create operation, you get an error. In this scenario, you must send these two messages to the same topic partition to ensure that the delete message is consumed after the create message.
Please note that there is always a trade-off between consistency and throughput. In this scenario, if you use a single topic partition and send all your messages to it, you get consistency but you cannot consume messages fast, because you receive messages from that partition one by one, each only after the previous one has been consumed. To increase throughput, you can use multiple topics or divide the topic into partitions. For both of these solutions you must implement some logic on the producer side to provide consistency: you must send related messages to the same topic partition. For instance, you can partition the topic by entity type and send the messages for CRUD operations on the same entity type to the same partition. I don't know whether that ensures consistency in your scenario, but it can be an alternative. You should find the logic that provides consistency with multiple topics or topic partitions; it depends on your case. If you can find that logic, you get both consistency and throughput.
For your case, I would use a single topic with multiple partitions, and on the producer side I would send related messages to the same topic partition.
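A minimal sketch of that keying idea; the topic name, the entity key and the message format are assumptions:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Hypothetical producer: all operations on the same entity share a key, so they
    // land in the same partition of one "crud-requests" topic and are consumed in
    // the order they were produced.
    public class CrudProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("enable.idempotence", "true"); // keeps per-partition order intact across retries
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String entityKey = "item-42"; // same key -> same partition -> ordered
                producer.send(new ProducerRecord<>("crud-requests", entityKey, "CREATE {\"name\":\"foo\"}"));
                producer.send(new ProducerRecord<>("crud-requests", entityKey, "DELETE item-42"));
            }
        }
    }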
I need data from Kafka brokers, but for fast access I am using multiple consumers with the same group id, known as a consumer group. But after each consumer has read its share, how can we combine the data from the multiple consumers? Is there any logic?
By design, different consumers in the same consumer group process data independently from each other. (This behavior is what allows applications to scale well.)
But after each consumer has read its share, how can we combine the data from the multiple consumers? Is there any logic?
The short but slightly simplified answer, when you use Kafka's "Consumer API" (also called the "consumer client" library), which I think is what you are using based on the wording of your question: if you need to combine data from multiple consumers, the easiest option is to make this (new) input data available in another Kafka topic, where you do the combining in a subsequent processing step. A trivial example: the second Kafka topic would be set up with just 1 partition, so any subsequent processing step would see all the data that needs to be combined.
If this sounds a bit too complicated, I'd suggest using Kafka's Streams API, which makes it much easier to define such processing flows (e.g. joins or aggregations, as in your question). In other words, Kafka Streams gives you a lot of the built-in "logic" you are looking for: https://kafka.apache.org/documentation/streams/
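For instance, here is a minimal Kafka Streams sketch that combines (counts) records per key across all partitions of an assumed "events" topic, so the combining logic lives in the framework rather than in your own consumer code; the topic names and serdes are assumptions:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Produced;

    // Hypothetical aggregation: counts records per key and writes the running counts to another topic.
    public class CountByKeyApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "count-by-key");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> events = builder.stream("events");
            events.groupByKey()
                  .count()
                  .toStream()
                  .to("event-counts", Produced.with(Serdes.String(), Serdes.Long()));

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }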
The aim of Kafka is to provide you with a scalable, performant and fault-tolerant framework. Having a group of consumers reading the data from different partitions asynchronously allows you to achieve the first two goals. Grouping the data back together is a bit outside the scope of the standard Kafka flow - you could use a single partition with a single consumer in the simplest case, but I'm sure that is not what you want.
For aggregating a single state from different consumers, I would recommend applying a solution designed specifically for that sort of goal. If you are working in the Hadoop ecosystem, you can use a Storm Trident bolt, which allows you to aggregate the data from your Kafka spouts. Or you can use Spark Streaming, which lets you do the same in a somewhat different fashion. Or, as an option, you can always implement a custom component with such logic using the standard Kafka libraries.
I have some basic Kafka Streaming code that reads records from one topic, does some processing, and outputs records to another topic.
How does Kafka streaming handle concurrency? Is everything run in a single thread? I don't see this mentioned in the documentation.
If it's single threaded, I would like options for multi-threaded processing to handle high volumes of data.
If it's multi-threaded, I need to understand how this works and how resources, like SQL database connections, should be shared across the different processing threads.
Is Kafka's built-in streaming API not recommended for high volume scenarios relative to other options (Spark, Akka, Samza, Storm, etc)?
Update Oct 2020: I wrote a four-part blog series on Kafka fundamentals that I'd recommend to read for questions like these. For this question in particular, take a look at part 3 on processing fundamentals.
To your question:
How does Kafka streaming handle concurrency? Is everything run in a single thread? I don't see this mentioned in the documentation.
This is documented in detail at http://docs.confluent.io/current/streams/architecture.html#parallelism-model. I don't want to copy-paste this here verbatim, but I want to highlight that IMHO the key element to understand is that of partitions (cf. Kafka's topic partitions, which in Kafka Streams is generalized to "stream partitions" as not all data streams that are being processed will be going through Kafka) because a partition is currently what determines the parallelism of both Kafka (the broker/server side) and of stream processing applications that use the Kafka Streams API (the client side).
If it's single threaded, I would like options for multi-threaded processing to handle high volumes of data.
Processing a partition will always be done by a single "thread" only, which ensures you are not running into concurrency issues. But, fortunately, ...
If it's multi-threaded, I need to understand how this works and how to handle resources, like SQL database connections should be shared in different processing threads.
...because Kafka allows a topic to have many partitions, you still get parallel processing. For example, if a topic has 100 partitions, then up to 100 stream tasks (or, somewhat over-simplified: up to 100 different machines each running an instance of your application) may process that topic in parallel. Again, every stream task would get exclusive access to 1 partition, which it would then process.
Is Kafka's built-in streaming API not recommended for high volume scenarios relative to other options (Spark, Akka, Samza, Storm, etc)?
Kafka's stream processing engine is definitely recommended and also actually being used in practice for high-volume scenarios. Work on comparative benchmarking is still being done, but in many cases a Kafka Streams based application turns out to be faster. See LINE engineer's blog: Applying Kafka Streams for internal message delivery pipeline for an article by LINE Corp, one of the largest social platforms in Asia (220M+ users), where they describe how they are using Kafka and the Kafka Streams API in production to process millions of events per second.
The Kafka Streams config num.stream.threads allows you to override the default of 1 thread. However, it may be preferable to simply run multiple instances of your streaming app, all with the same application id (and thus the same consumer group). That way you can spin up as many instances as you need to get optimal partitioning.
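A minimal configuration sketch of both options, with an assumed application id and bootstrap address; sharing the same application.id is what puts all instances into one consumer group:

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    // Hypothetical config: more threads per instance, and/or more instances sharing the id.
    public class StreamsThreadConfig {
        static Properties props() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app"); // shared by all instances
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4); // default is 1 thread per instance
            return props;
        }
    }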