I want to implement a queue mechanism using Kafka, but I could not find out whether it is possible to just peek at data in the queue created for a topic without moving forward in it.
I want to read data from the queue and, depending on different conditions, either remove the existing message or add another message to the queue. Also, is it possible to use a single Kafka server from different machines?
I referred to tutorialspoint for learning more about it.
Thanks in advance. Any leads would be appreciated.
Keep in mind that Kafka scales with multiple partitions per topic, and it doesn't give any ordering guarantee between partitions; ordering is only guaranteed within a single partition. So don't use Kafka if you want strict ordering across a multi-partition topic. Within a consumer group, if you want n active consumers on a topic, you need at least n partitions.
Consumers don't remove messages; they commit the offset of a message. The default configuration in most clients is to auto-commit offsets periodically as messages are read. You can re-insert messages into the topic at any time. But you cannot skip a message and expect to process it later.
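To make the offset idea concrete, here is a minimal in-memory sketch. It is not the Kafka client API: ToyPartition, handle_next and the "retry-" rule are hypothetical names invented for illustration. It only shows that peeking leaves the offset alone, that committing advances it without deleting anything, and that a re-inserted message lands at the end of the log.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPartition:
    """In-memory stand-in for one Kafka partition: an append-only list."""
    records: list = field(default_factory=list)
    committed: int = 0  # next offset the consumer group will read

    def append(self, value):
        """Producer side: add a record at the end and return its offset."""
        self.records.append(value)
        return len(self.records) - 1

    def peek(self):
        """Return the next unprocessed record without moving the offset."""
        if self.committed < len(self.records):
            return self.records[self.committed]
        return None

    def commit(self):
        """Mark the current record as processed by advancing the offset."""
        if self.committed < len(self.records):
            self.committed += 1


def handle_next(partition, needs_retry):
    """Peek, optionally re-insert the record at the tail, then commit."""
    value = partition.peek()
    if value is None:
        return None
    if needs_retry(value):
        partition.append(value)  # re-inserted copy goes to the end of the log
    partition.commit()           # the original record is kept, only skipped
    return value


log = ToyPartition()
for v in ["a", "retry-b", "c"]:
    log.append(v)

first_peek = log.peek()
second_peek = log.peek()  # peeking twice returns the same record

seen = set()


def needs_retry(value):
    # Hypothetical rule: retry anything marked "retry-" exactly once.
    if value.startswith("retry-") and value not in seen:
        seen.add(value)
        return True
    return False


processed = []
while (value := handle_next(log, needs_retry)) is not None:
    processed.append(value)
```

With a real client the same pattern would presumably mean disabling auto-commit and committing explicitly after each decision, but that part is not shown here.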
You can connect as many machines as you want to a kafka server. Typically, you have multiple servers as a kafka cluster, with replication for fault tolerance.
The topics are dynamically created, and there could be thousands of them. I need a way to detect when messages are produced so I can consume them. Moreover, I need to consume each topic independently so that I can bulk-insert a large number of messages into a database, with each topic corresponding to a different table. So let's say I start consuming a topic: I would consume 1000 messages, bulk-insert them into the database in one operation, then commit the read in Kafka. If I have 10 topics, I could use 10 consumers in parallel. The problem arises if I end up with a large number of topics, most of them idle (empty): I need a way to be notified when some topics suddenly become active, so that I don't have to launch thousands of idle consumers that do nothing most of the time.
The only solution I have thought of so far is using a single signal topic in addition to the real topics, to which the producers would write in addition to the real topic. But I was wondering if there is another solution, like polling the metadata in Kafka, maybe. From what I've seen, though, I would have to iterate through all the topics matching a regex, then check the offsets of the partitions for each. I don't think it's possible to do that efficiently, but maybe I'm wrong.
You could track JMX metrics from the broker for incoming bytes per topic using Prometheus JMX Exporter, for example, then combine that with AlertManager to send some event/webhook upon some threshold of data to a consuming REST service, which would then start some consumers (maybe Kafka Connect tasks for a database?).
Or, like you said, use a signal topic since producer requests can be made to multiple topics at once.
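The signal-topic idea can be sketched with a small in-memory model. Nothing here talks to Kafka: ToyBroker, the topic name "signal" and both helper functions are assumptions made for illustration. The point is only that a dispatcher reading one topic can learn which of many topics became active.

```python
from collections import defaultdict

SIGNAL_TOPIC = "signal"  # hypothetical name for the notification topic


class ToyBroker:
    """In-memory stand-in for a broker: topic name -> list of messages."""

    def __init__(self):
        self.topics = defaultdict(list)

    def produce(self, topic, value):
        self.topics[topic].append(value)


def produce_with_signal(broker, topic, value):
    """Write the payload to its own topic and the topic name to the signal topic."""
    broker.produce(topic, value)
    broker.produce(SIGNAL_TOPIC, topic)


def drain_signals(broker, position):
    """Read signals from `position` onward.

    Returns the sorted list of topics that became active and the new
    read position in the signal topic.
    """
    signals = broker.topics[SIGNAL_TOPIC][position:]
    return sorted(set(signals)), position + len(signals)


broker = ToyBroker()
produce_with_signal(broker, "orders", {"id": 1})
produce_with_signal(broker, "orders", {"id": 2})
produce_with_signal(broker, "users", {"id": 7})

active, position = drain_signals(broker, 0)
active_again, position_again = drain_signals(broker, position)
```

In this model the dispatcher would start a consumer for each topic in `active` and stay idle while the second call returns an empty list.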
If I have 10 topics, I could use 10 consumers in parallel
You can have more parallel consumers if any of those topics have multiple partitions
could be thousands of them
There are reasonable limits on the number of topics a Kafka cluster can support, by the way, but as of the latest releases that limit is upwards of hundreds of thousands. Something to keep in mind, though.
launch thousands of idle consumers that do nothing most of the time.
You could also use solutions like AWS Lambda or Kubernetes KEDA to auto scale up/down based on topic data (lag)
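As a rough illustration of lag-based scaling, the sketch below computes lag per topic and derives a consumer count from it. This is not KEDA's or Lambda's actual algorithm: the function names and the default of 1000 messages per consumer are assumptions, chosen to match the batch size mentioned in the question.

```python
import math


def topic_lag(end_offsets, committed_offsets):
    """Sum, over partitions, of messages written but not yet committed.

    Both arguments map partition id -> offset.
    """
    return sum(
        end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    )


def desired_consumers(end_offsets, committed_offsets, lag_per_consumer=1000):
    """Scale to zero when idle and never exceed the partition count."""
    lag = topic_lag(end_offsets, committed_offsets)
    if lag == 0:
        return 0
    wanted = math.ceil(lag / lag_per_consumer)
    return min(wanted, len(end_offsets))


idle = desired_consumers({0: 50, 1: 70}, {0: 50, 1: 70})
small = desired_consumers({0: 300, 1: 200}, {0: 0, 1: 0})
busy = desired_consumers({0: 5000, 1: 4000, 2: 3000}, {0: 1000, 1: 4000, 2: 500})
```

The cap at the partition count reflects the point made earlier: within one group, consumers beyond the number of partitions would have nothing to read.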
=== Assume everything from consumer point of view ===
I was reading a couple of Kafka articles and saw that the number of partitions is coupled to the number of microservice instances. For example, say I have 1 topic with 1 partition for my serviceA. The producer pushes messages to topic T1, partition P1, and on the consumer side (ServiceA1) I can read from T1, P1. If I spin up a new pod (ServiceA2) for higher throughput, the second instance will never receive any message, because Kafka/ZooKeeper assigns an ID to each consumer and partition P1 is already taken by ServiceA1. So ServiceA2 and any further instances stay idle. To avoid this, Kafka recommends adding more partitions, so that the number of consumers can be increased or decreased as needed.
I was also able to test this from the command line, and service2 never consumed any message. If I shut down service1, service2 was able to pick up new messages. So if I spin up more pods, failover/availability improves but throughput always stays the same.
Is my assumption correct? Am I missing anything? Now I feel like any standard messaging system will have the same problem. How do you scale message-oriented systems themselves?
Every topic has at least one partition; if you don't define the partition count, the broker default applies, which is one partition unless configured otherwise. In your case, you have a consumer group that consists of two consumers. Each consumer reads the log from the partitions assigned to it. The first consumer reads from the first (and only) partition, and there is no partition left to assign to the second consumer, so it stays idle. Only once the first consumer goes down does the second consumer start reading from that partition, beginning at the last committed offset.
Please check the blogs and videos below. They explain topics, consumers, and consumer groups in Kafka.
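The idle-consumer effect described above can be reproduced with a toy assignment function. This is a simplified round-robin sketch, not one of Kafka's real assignors; the function name and the consumer names are only illustrative.

```python
def assign(partitions, consumers):
    """Round-robin partitions over the consumers of one group.

    Returns {consumer: [partitions]}; a consumer with an empty list is idle.
    """
    assignment = {c: [] for c in consumers}
    ordered = sorted(consumers)
    for i, p in enumerate(sorted(partitions)):
        if ordered:
            assignment[ordered[i % len(ordered)]].append(p)
    return assignment


# One partition, two instances: the second instance gets nothing.
one_partition = assign([0], ["serviceA1", "serviceA2"])

# Four partitions, two instances: both instances do work.
four_partitions = assign([0, 1, 2, 3], ["serviceA1", "serviceA2"])
```

Each partition appears in exactly one list, which is why adding instances beyond the partition count adds availability but not throughput.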
https://www.javatpoint.com/apache-kafka-consumer-and-consumer-groups
http://cloudurable.com/blog/kafka-architecture-consumers/index.html
https://docs.confluent.io/platform/current/clients/consumer.html
https://www.youtube.com/watch?v=lAdG16KaHLs
I hope this gives you an idea of how consumers and consumer groups work.
A broad solution to this is to decouple consumption of a message (i.e. receiving a message from Kafka and perhaps deserializing it and validating that it conforms to the schema) from processing it (interpreting the message). If the consumption is simple enough, being limited to no more consuming instances than there are partitions need not constrain throughput.
One way to accomplish this is to have a Kafka consumption service which sends an HTTP request (perhaps through a load balancer or whatever) to a processing service which has arbitrarily many members.
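A minimal sketch of that split, with one substitution stated plainly: instead of an HTTP call to a separate processing service, a local thread pool stands in for the "arbitrarily many" processors. The function names and the uppercase transformation are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process(message):
    """Stand-in for the expensive processing step (hypothetical)."""
    return message.upper()


def consume_and_dispatch(partition_records, workers=8):
    """One consumer reads a partition and fans records out to many workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, even though the workers
        # may finish their tasks in a different order.
        return list(pool.map(process, partition_records))


results = consume_and_dispatch(["a", "b", "c", "d"], workers=4)
```

Note the trade-off: results come back in order, but the processing itself runs concurrently, so any side effects inside `process` are no longer ordered per partition. A consumer built this way would presumably commit offsets only after the whole batch has returned.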
Note that depending on what you're using Kafka for, there may be a requirement that certain messages always be in the same partition as one another in order to ensure that they get handled in a deterministic order (since ordering across partitions is not guaranteed). A typical example of this would be if the messages are change events for a particular record. If you're accomplishing this via some hash of the message key (or a portion of the key if using a custom partitioner), then simply changing the number of partitions might not be viable (you would need to introduce some sort of migration or have the producers know which records have to be routed to the old partitions and only route to the new partitions if the record has never been seen before).
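To see why changing the partition count is disruptive for keyed messages, here is a small sketch. It uses CRC32 as a stand-in hash (Kafka's default partitioner uses a murmur2 hash of the key bytes, so the actual partition numbers would differ); the key names are made up.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32, not Kafka's own)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"record-{i}" for i in range(100)]

# Partition of each key with 4 partitions, then after growing to 6.
before = {k: partition_for(k, 4) for k in keys}
after = {k: partition_for(k, 6) for k in keys}

# Keys whose events would now land in a different partition than
# their earlier events, breaking per-record ordering.
moved = [k for k in keys if before[k] != after[k]]
```

For any key in `moved`, older change events sit in one partition and newer ones in another, which is the situation that calls for the migration or producer-side routing described above.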
We just started replacing our messaging system with Kafka.
In a traditional MQ setup there is a cluster with one or more MQs inside it.
The MQ cluster/coordinator service delivers the messages to clients.
There can be 10 services/clients consuming messages from a single MQ.
So if there are 10 messages in the MQ, each service/consumer/client can read and process 1 message.
As I now understand, this is not possible in Kafka by design.
To achieve similar functionality in Kafka, I have to add at least as many partitions as there are clients/consumers/pods.
Let's say I have a Kafka cluster with several topics spread over several partitions. I also have a cluster of applications that act as clients for Kafka. Each application in that cluster has a client subscribed to the same set of topics, which is identical across the whole cluster. All of these clients also share the same Kafka group ID.
Now, regarding commit mode: I really do not want to specify offsets manually, but I do not want to use autocommit either, because I need to do some handling after I receive my data from Kafka.
With this setup, I expect to run into the "same data received by different consumers" problem, because I do not specify an offset before reading (consuming), and I read data concurrently from different clients.
Now, my question: what are the solutions to get rid of multiple reads? Several options come to mind:
1) Exclusive (sequential) Kafka access. Until one consumer has committed its read, no other consumer accesses Kafka.
2) Somehow specify the offset before each read. I do not even know how to do that, given that a read might fail (and the offset will not be committed); we would need some complicated distributed offset storage.
I'd like to ask people experienced with Kafka to recommend a way to achieve the behavior I need.
Every partition is consumed only by one client - another client with the same group ID won't get access to that partition, so concurrent reads won't occur...
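Since a partition has a single reader within a group, the remaining concern is what happens when handling fails before the commit. The sketch below models "process first, commit afterwards" on a plain list; consume_batch and the flaky handler are hypothetical, and no Kafka client is involved.

```python
def consume_batch(records, committed, handler, batch_size=2):
    """Process up to batch_size records, committing only after success.

    Returns the new committed offset. If handler raises, the offset is
    left unchanged, so the same batch is read again on the next call.
    """
    batch = records[committed:committed + batch_size]
    try:
        for record in batch:
            handler(record)
    except Exception:
        return committed           # nothing committed, batch is redelivered
    return committed + len(batch)  # "commit" once the whole batch succeeded


records = ["a", "b", "c"]
handled = []
fail_once = {"b"}


def handler(record):
    # Hypothetical handler that fails the first time it sees "b".
    if record in fail_once:
        fail_once.discard(record)
        raise RuntimeError("transient failure")
    handled.append(record)


offset = consume_batch(records, 0, handler)  # fails on "b"
after_failure = offset
while offset < len(records):
    offset = consume_batch(records, offset, handler)
```

In this model "a" is handled twice, because it was processed before the failed commit. That is the usual at-least-once trade-off of committing after processing, so the handling step should tolerate duplicates.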
Is it possible to have multiple copies of an application listen to the same Kafka group/topic so that only one is reading it at a time, but the other ones will start working if the main one crashes/stops reading?
I need to make an application highly available but can't tolerate doubling the traffic to the data store on the other end of the application by having multiple copies actively running.
FYI - Technically I'm using MapR streams but it adheres to the Kafka API and functionality, in case anyone knows a MapR stream-specific feature that helps the situation.
It is possible. If multiple consumers are in the same consumer group, then when the group subscribes to a topic, Kafka assigns partitions to the consumers: a partition can be consumed by only one consumer within the same group.
So you could give your topic only one partition; then only one consumer will consume messages and the others will be idle. Once that consumer shuts down, it triggers a group rebalance: Kafka does the partition assignment again. In your case, a new consumer will then take over the work. It will process messages starting from the last offset committed by the old consumer.
And if your use case supports parallel processing, you could run many processes (apps) doing the same work and give the topic multiple partitions. They will be assigned different partitions and process different messages. This speeds up processing and also tolerates failover. As said above, if some consumers fail, Kafka takes care of it for you: it assigns their partitions to the other working consumers. So everything will keep working.
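The active/standby behavior on a single-partition topic can be modelled in a few lines. This is a toy in-memory sketch, not the real group coordinator: ToyGroup, owner and the member names are invented, and the "rebalance" is simply picking the first live member.

```python
def owner(members_alive):
    """With one partition, the group gives it to a single live member."""
    return sorted(members_alive)[0] if members_alive else None


class ToyGroup:
    """One single-partition topic shared by a consumer group (in-memory model)."""

    def __init__(self, records, members):
        self.records = records
        self.members = set(members)
        self.committed = 0     # stored per group, not per consumer
        self.handled_by = []   # (member, record) pairs, for inspection

    def poll_and_commit(self, count):
        """Let the current owner handle up to `count` records and commit them."""
        member = owner(self.members)
        for record in self.records[self.committed:self.committed + count]:
            self.handled_by.append((member, record))
            self.committed += 1

    def crash(self, member):
        """Remove a member; the next poll picks a new owner."""
        self.members.discard(member)


group = ToyGroup(["m0", "m1", "m2", "m3"], ["app-1", "app-2"])
group.poll_and_commit(2)  # app-1 is active, app-2 is idle
group.crash("app-1")
group.poll_and_commit(2)  # app-2 resumes from the committed offset
```

Because the committed offset belongs to the group, the standby continues where the failed instance stopped, and each record reaches the data store once in this model.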
Forgive me, I am a beginner with Kafka. I was reading the Kafka documentation about the difference between traditional messaging systems like ActiveMQ and Kafka.
As the documentation puts it, traditional messaging systems cannot scale message processing, since:
Publish-subscribe allows you broadcast data to multiple processes, but
has no way of scaling processing since every message goes to every
subscriber.
I think this makes sense to me.
But for Kafka, the documentation says that it can scale message processing even in publish-subscribe mode. (Please correct me if I am wrong. Thanks.)
The consumer group concept in Kafka generalizes these two concepts. As
with a queue the consumer group allows you to divide up processing
over a collection of processes (the members of the consumer group). As
with publish-subscribe, Kafka allows you to broadcast messages to
multiple consumer groups.
The advantage of Kafka's model is that every topic has both these
properties—it can scale processing and is also multi-subscriber—there
is no need to choose one or the other.
So my question is: how does Kafka do it? I mean scaling the processing in publish-subscribe mode. Thanks.
The main unique features in Kafka that enable scalable pub/sub are:
1) Partitioning individual topics and spreading the active partitions across multiple brokers in the cluster to take advantage of more machines, disks, and cache memory. Producers and consumers often connect to many or all nodes in the cluster, not just a single master node for a given topic/queue.
2) Storing all messages in a sequential commit log and not deleting them when consumed. This leads to more sequential reads and writes, offloads the broker from having to deal with keeping track of different copies of messages, deleting individual messages, handling fragmentation, tracking which consumer has acknowledged consuming which messages.
3) Enabling smart parallel processing of individual consumers and consumer groups in a way that each parallel message stream can come from the distributed partitions mentioned in #1 while offloading the offset management and partition assignment logic onto the clients themselves. Kafka scales with more consumers because the consumers do some of the work (unlike most other pub/sub brokers where the bulk of the work is done in the broker).
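How a topic can be both a queue and a broadcast at once can be shown with a small model. This is an illustrative sketch only: deliver, the group names and the round-robin rule are assumptions, not Kafka's implementation.

```python
def deliver(partitions, groups):
    """Return {group: {consumer: [messages]}} for a toy topic.

    partitions: {partition_id: [messages]}
    groups:     {group_name: [consumer names]}
    Every group sees every partition (publish-subscribe); inside a group
    each partition goes to exactly one consumer (queue semantics).
    """
    result = {}
    for group, consumers in groups.items():
        received = {c: [] for c in consumers}
        for i, pid in enumerate(sorted(partitions)):
            consumer = consumers[i % len(consumers)]
            received[consumer].extend(partitions[pid])
        result[group] = received
    return result


topic = {0: ["a", "b"], 1: ["c"], 2: ["d", "e"]}
out = deliver(topic, {"billing": ["b1", "b2"], "audit": ["a1"]})
```

Here the "billing" group splits the five messages between two consumers, while the "audit" group independently receives all five, which is the combination the quoted documentation describes.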