Open Source Queuing Solutions for peek, mark as done and then remove [closed]

I am looking at open source queuing platforms that allow me to do the following:
I have multiple producers and multiple consumers putting data into a queue in a multithreaded environment, with this specific use case:
I want consumers to be able to do the following:
Peek at a message from the queue (which should mark the message as invisible on the queue so that other consumers cannot consume the same message).
The consumer works on the message and, if it is able to do the work successfully, marks the message as consumed, which should permanently delete it from the queue.
If the consumer dies abruptly before marking the message as consumed, or fails to acknowledge successful consumption within a certain timeout, the message is made visible on the queue again so that another consumer can pick it up.
I've been looking at RabbitMQ, HornetQ and ActiveMQ, but I'm not sure I can get this functionality out of the box. Any recommendations on a system that gives me this functionality?

RabbitMQ does this out of the box, except for the timeout-based redelivery. If the connection is dropped while a message is unacknowledged, the message will be requeued for delivery to some other consumer of the queue. You can either use pull-mode ("Basic.Get") or push-mode/subscribe-mode ("Basic.Consume") to get the server to feed you messages.
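As a rough sketch of that manual-acknowledgement flow with the RabbitMQ Java client (the host, queue name, and process() method are placeholders, not anything from your setup):

```java
import com.rabbitmq.client.*;

public class Worker {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker address
        Channel channel = factory.newConnection().createChannel();

        // autoAck=false: a delivered message stays "unacked" (invisible to
        // other consumers) until this consumer explicitly acks or nacks it.
        channel.basicConsume("work-queue", false,
            (consumerTag, delivery) -> {
                long tag = delivery.getEnvelope().getDeliveryTag();
                try {
                    process(delivery.getBody());   // your work goes here
                    channel.basicAck(tag, false);  // success: delete for good
                } catch (Exception e) {
                    channel.basicNack(tag, false, true); // failure: requeue
                }
            },
            consumerTag -> { });
        // If this process dies without acking, the broker requeues the
        // unacked message for another consumer once the connection drops.
    }

    static void process(byte[] body) { /* ... */ }
}
```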

This is how HornetQ works in auto-acknowledge mode. It's not really "peeking": a message is delivered to a listener and is not visible to any other listener. If the listener fails to complete the transaction, because it dies, throws an exception, etc., the message reappears on the queue and is redelivered to another listener. If the listener completes successfully, the message is removed from the queue for good.
Sorry, just realized this thread is over a year old. Well, maybe this will help someone...

What you're asking for is standard JMS behaviour - which would be implemented out of the box by any compliant JMS implementation.
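A minimal sketch of that behaviour with the JMS API in CLIENT_ACKNOWLEDGE mode (the ConnectionFactory comes from your provider, and the queue name is a placeholder):

```java
import javax.jms.*;

public class JmsWorker {
    // The factory would come from your provider (ActiveMQ, HornetQ, ...).
    static void consume(ConnectionFactory factory) throws JMSException {
        Connection connection = factory.createConnection();
        // CLIENT_ACKNOWLEDGE: a received message stays "in flight" (not
        // redelivered to others) until acknowledge() is called.
        Session session = connection.createSession(false, Session.CLIENT_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(session.createQueue("work.queue"));
        connection.start();

        Message message = consumer.receive(); // blocks until a message arrives
        try {
            doWork(message);        // application-specific processing
            message.acknowledge();  // success: the message is removed for good
        } catch (Exception e) {
            session.recover();      // failure: the message becomes redeliverable
        }
    }

    static void doWork(Message m) { /* ... */ }
}
```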

By way of introduction, I can say that I've built and designed many message-based systems from the ground up, using many technologies including CORBA, COM and native sockets.
In many of these it is the design that sits on top of the technology that matters.
Bearing this in mind, I would probably start with RabbitMQ and enhance it if needed.
AMQP takes some effort to get your head around, but it is worth the time, and I believe it will let you make this work.
Even if you can't get the exact functionality out of the box, the important question is whether you can make it do this, which I believe I could. It's open source, after all.

Related

Kafka - How to read and process messages in a fault-tolerant way? [closed]

I am new to Kafka, and I am trying to understand how a Kafka Consumer can consume and process a message from a Kafka topic without losing messages if the consumer fails in between.
For Example:
There is a Kafka topic called cart_checkout which just holds an event when the user successfully checks out their cart. All it will do is get an event and send an email to the user that their items are checked out.
This is the happy path:
Consumer gets an event from the topic.
Read offset is committed to Kafka.
Consumer calls app specific function to send email.
Repeat.
What happens if the app fails during Step 3?
If the consumer starts up again, it will miss the event, am I right? (Since the read offset was already committed.)
The consumer can rewind, but how will it know to rewind?
RabbitMQ solution:
It seems easier to solve in RabbitMQ, since we can control the ACKs for a message. But in Kafka, if we commit the read offset we lose the message when the app restarts, and if we don't commit the read offset then the same message is sent to multiple consumers.
What is the best possible way to deal with this problem ?
As you mentioned, a consumer offset can also be managed manually, which you will need in this case to avoid duplications (by default the delivery guarantee is at least once) or missing an email, as mentioned in your use case.
To manually control the offset, auto-commit should be disabled (enable.auto.commit = false).
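As a minimal sketch of that manual-commit pattern for your use case (the broker address, group id and sendEmail() are assumptions for illustration):

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import java.time.Duration;
import java.util.*;

public class CartCheckoutConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "email-sender");
        props.put("enable.auto.commit", "false"); // take control of commits
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("cart_checkout"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    sendEmail(record.value()); // do the work first...
                    // ...then commit this record's offset. If we crash before
                    // the commit, the record is re-delivered on restart
                    // (at-least-once; sendEmail should be idempotent).
                    consumer.commitSync(Collections.singletonMap(
                        new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1)));
                }
            }
        }
    }

    static void sendEmail(String event) { /* app-specific */ }
}
```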
Now regarding the second part of your question where you mention:
if we don't commit the read-offset then the same message is sent to multiple consumers
This understanding is not fully correct. In Kafka, each consumer controls its own offset, and consumers in the same consumer group don't share partitions (each partition is read by a single consumer within the group), so a message will not be processed (in Kafka, consumers poll messages) by other consumers in the same consumer group. It would be read by consumers in a different consumer group, but then the intention is that they also read the message anyway; that's by design in the Kafka clients. It's also relevant to understand that a consumer poll returns multiple messages, and the default commitSync or commitAsync will commit the offsets of all messages returned by that poll call. If you want to avoid possible email duplication, you might want to use a more specific commit; check the API here: https://kafka.apache.org/30/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
As you're learning, this is a common misunderstanding, and there are some nice free resources around that clarify this concept. I suggest the official design section of the documentation: https://kafka.apache.org/documentation/#theconsumer and also this free book chapter: https://www.oreilly.com/library/view/kafka-the-definitive/9781491936153/ch04.html which is specific to what you're doing now. Good luck.

Is it possible to publish to MQTT clients during their sleep?

I am going to implement some sort of home automation system (as my bachelor thesis). I have looked at the MQTT protocol and I have two questions about it.
I have seen this tutorial:
https://www.youtube.com/watch?v=X04yaaydjFo&list=PLeJ_Vi9u6KisKTSNeRRfqsASNZdHSbo8E&index=13
which has materials (code etc.) here:
https://randomnerdtutorials.com/raspberry-pi-publishing-mqtt-messages-to-esp8266/
My first question is:
Is the logic for handling acquired data/messages (due to some topic subscription) in the broker or in the clients? From the tutorial above, it seems that the logic for dealing with a message is in the clients. Should it always be there? Or is it possible to have it in the broker? Sorry if this question is too "abstract"; I am at the beginning of programming, so I have no concrete example. Basically, what I want is to have as light a program on the client as possible (because the broker will have lots of memory and computing capacity, while the clients will be very limited in both).
My second question is:
Is it possible to put an ESP8266 (or any client) to sleep and wake it up, let's say, every 5 minutes? Of course, it should not be a problem if that client only publishes (and never subscribes) to topics. But what about when I have a client which reads some sensor and sends readings to the broker in those 5-minute cycles, but can also control some of its outputs? Is there a way to do that? Or, if a client is not available (if there is some data published to it while it is sleeping), is the message just thrown out? My thought was: is there a way to ask the broker, after the client wakes, whether there were any messages published to it during its sleep?
Thanks for any info! :)
1. Answer
The message processing logic is to be fully implemented by the client. With MQTT there is no "stream" processing as is possible with e.g. Apache Kafka. However, you can of course have intermediary (non-IoT) clients that subscribe to the original topic, prepare a modified message and publish it to a new topic - the one which the IoT device would then subscribe to.
2. Answer
You can tell the broker to retain a message. However, it will retain at most one message per topic.
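As a hedged sketch of publishing a retained message with the Eclipse Paho Java client (the broker URL, client id and topic are placeholders for illustration):

```java
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttMessage;

public class RetainedPublish {
    public static void main(String[] args) throws Exception {
        // assumed broker address and client id
        MqttClient client = new MqttClient("tcp://broker.local:1883", "controller");
        client.connect();

        MqttMessage msg = new MqttMessage("ON".getBytes());
        msg.setQos(1);
        msg.setRetained(true); // broker keeps the latest message on this topic
        client.publish("home/livingroom/relay", msg);
        client.disconnect();
        // A device that wakes up later and subscribes to home/livingroom/relay
        // immediately receives this retained message.
    }
}
```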
P.S. For the future, please stick to 1-question-1-post here on SO.

What is difference between Apache Kafka and GCP PubSub? [closed]

What is the difference between Apache Kafka and GCP Pub/Sub? When should one use Kafka, and when Pub/Sub?
Since you did not provide your use case, I will state below the main characteristics of each tool.
Pub/Sub:
It is an asynchronous cloud messaging service, provided by Google Cloud, that decouples senders and receivers. It offers high availability and consistent performance at scale.
No ops: in Pub/Sub you do not need to worry about partitions and shards.
Scalability: it is built in without any required operation; Pub/Sub handles scaling automatically.
Monitoring: you can monitor your process at a topic and subscription level within Stackdriver.
Access management: you can configure access at a project, topic and subscriber level.
Reliability: it guarantees the message will be delivered at least once. However, it does not guarantee ordering (which can be handled in Dataflow).
Message retention in Pub/Sub: the minimum is 10 minutes and the maximum is 7 days.
Kafka:
It is an open-source distributed publish-subscribe messaging ecosystem. It can be used on-prem or deployed in cloud environments.
Scalability: it does not support automatic scaling. Thus, you need to increase partitions, replication, etc. manually.
Ordering: it supports ordered messages at the partition level.
Reliability: it guarantees no data loss.
Monitoring: it offers various types of built-in monitoring systems.
Notice that I just shared the main characteristics of each tool, though there are many others which may be more relevant for your use case. Here are some links where you can find more information about each one: 1, 2, 3.

Designing a real-time data pipeline for an e-commerce web site [closed]

I want to learn Apache Kafka. I read articles and documents, but I could not figure out how Kafka works. There are lots of questions in my mind :( I want to create a Kafka cluster and develop some code to prepare for data engineering interviews. But I am stuck. Any help would be appreciated. I will try to explain my questions in an example scenario.
For instance, there is a popular e-commerce company. They have a huge amount of web traffic. The web site is running on AWS. The mobile applications are also using AWS services.
The marketing department wants to observe the efficiency of their advertisement actions like email, SMS etc. They also want to follow important real-time metrics (sold products, page views, active users in the last n minutes etc) in a dashboard.
First, the campaign automation system sends personalized campaign emails to target customers. When a user clicks the link in the advertisement email, the browser opens the e-commerce web site.
In the background, the website developers should send a clickstream event to the Kafka cluster with the related parameters (like customer id, advertisement id, source_medium etc).
How can the backend developers send a message to the Kafka cluster when a user loads the web site? Should developers send a POST request or a GET request? Are there other alternatives?
Then data engineers should direct this clickstream message to the storage layer (for example, AWS S3). Will this cause too many small files in AWS S3 buckets? Might this slow down the execution of data flows?
Data engineers need to develop a data pipeline in order to do real-time analysis. Which technologies should data engineers use? (Kafka Connect, Kafka Streams, Producer and Consumer etc)
Kafka topics may have lots of messages. One message can be consumed by different consumers. A consumer reads the message from the Kafka topic. Then, another consumer can read it, even after a while. So data engineers need to manage offsets in order to consume every message once and only once. How can they manage offsets properly?
All clickstream events should be consumed.
All clickstream events should be consumed exactly once. If a product view event is consumed more than once, the dashboard will not show the correct product view count.
Do developers need to manage offsets manually? Or is there any technology/way which manages offsets automatically?
Event order can be important. The marketing department wants to see the category view durations. For instance, a user views 10 books in the ebooks category. Ten events were created. User is on the same category page until his/her first action. So data engineers need to calculate the duration between the first event and the last event.
However, Kafka is a queue and there is no ordering in it. Producers can send data to Kafka asynchronously. How can data engineers calculate the durations correctly?
What happens if a producer sends an event to Kafka after the total elapsed duration has been calculated?
Note: View duration may fit better for content web sites. For example, Netflix marketing users want to analyze content view durations and percentages. If a user opens a movie and watches just five minutes, the marketing department may consider that the user does not like the movie.
Thanks in advance
You have really asked several unrelated questions here. Firstly, Kafka has a lot of free documentation available for it, along with many high-quality 'getting started' guides and paid books and courses. I would definitely start there. You might still have questions, but at least you will have a better awareness of the platform and you can ask questions in a much more focused way, which will hopefully get a much better answer. Start with the official docs. Personally, I learned Kafka by reading the Effective Kafka book, but I'm sure there are many others.
Going through your list of questions.
How can the backend developers send a message to the Kafka cluster when a user loads the web site? Should developers send a POST request or a GET request? Are there other alternatives?
The website would typically publish an event. This is done by opening a client connection to a set of Kafka brokers and publishing a record to some topic. You mention POST/GET requests: this is not how Kafka generally works; the clients establish persistent connections to a cluster of brokers. However, if your preferred programming model is REST, Confluent does provide a Kafka REST Proxy for this use case.
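As a hedged sketch of what that publish might look like with the Kafka Java client (the broker address, topic name and event payload are all assumptions for illustration):

```java
import org.apache.kafka.clients.producer.*;
import java.util.Properties;

public class ClickstreamPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by customer id sends all of one customer's events to the
            // same partition, which also preserves their relative order.
            String customerId = "42"; // hypothetical values
            String event = "{\"customer_id\":42,\"ad_id\":7,\"source_medium\":\"email\"}";
            producer.send(new ProducerRecord<>("clickstream", customerId, event),
                (metadata, exception) -> {
                    if (exception != null) exception.printStackTrace();
                });
        } // close() flushes any buffered records
    }
}
```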
Then data engineers should direct this clickstream message to the storage layer (for example, AWS S3). Will this cause too many small files in AWS S3 buckets? Might this slow down the execution of data flows?
It depends how you write to S3. You may develop a custom consumer application that stages writes in a different persistent layer and then writes to S3 in batches. Kafka Connect also has an Amazon S3 connector that moves data in chunks.
Data engineers need to develop a data pipeline in order to do real-time analysis. Which technologies should data engineers use? (Kafka Connect, Kafka Streams, Producer and Consumer etc)
There is no single correct answer here. All of the technologies you have listed are valid and may be used to similar effect. Both Connect and Streams are quite popular for these types of applications; however, you can just as easily write a custom consumer application for all your needs.
Kafka topics may have lots of messages. One message can be consumed by different consumers. A consumer reads the message from the Kafka topic. Then, another consumer can read it, even after a while. So data engineers need to manage offsets in order to consume every message once and only once. How can they manage offsets properly?
In the simplest case, Kafka offset management is automatic and the default behaviour allows for at-least-once delivery, whereby a record will be delivered again if the first processing attempt failed. This may lead to duplicate effects (counting a clickstream event twice, as you described), but this is addressed by making your consumer idempotent. This is a fairly complex topic; there is a great answer on Quora that covers the issue of exactly-once delivery in detail.
Event order can be important. The marketing department wants to see the category view durations. For instance, a user views 10 books in the ebooks category. Ten events were created. User is on the same category page until his/her first action. So data engineers need to calculate the duration between the first event and the last event.
The concept of order is baked into Kafka. Kafka's topics are sharded into partitions, where each partition is a totally-ordered, unbounded stream of records. Records may be strictly ordered provided they are published to the same partition. This is achieved by assigning them the same key, which the Kafka client hashes behind the scenes to arrive at the partition index. Any two records that have the same key will occupy the same partition, and will therefore be ordered.
Welcome to Stack Overflow! I will answer a few of your questions; however, you should go through the Kafka documentation for such things. If you are facing any problem while implementing it, then you should post here.
How can developers send data to a Kafka cluster? You have talked about producers, but I guess you haven't read about them: the developers will have to use a producer to produce an event to a Kafka topic. You can read more about the Kafka producer in the documentation.
To direct the messages to a storage layer, Kafka consumers will be used.
Note: Kafka Connect can be used instead of a Kafka producer and consumer in some scenarios; Kafka Connect has source connectors and sink connectors instead of producers and consumers.
For real-time data analysis, Kafka Streams or KSQL can be used. These cannot be explained in a single answer; I recommend you go through the documentation.
A single Kafka topic can have multiple consumer groups, and every consumer group has a different offset, you can tweak the configuration to use or not to use these offsets for every consumer group.
You can change various configurations, such as acks=all, to choose between at-least-once and at-most-once semantics. Again, you should go through the documentation to understand this completely.
You can maintain message order in Kafka as well; for that to happen, your producers will have to wait for the acknowledgement from Kafka after every message is sent. Obviously this will slow down the process, but you will have to compromise on one of the two.
I haven't understood your requirements related to the last point, but I suggest you go through the Kafka Streams and KSQL documentation, as you can manage your window size for analysis there.
I have tried to answer most of your questions in brief but to understand it completely, obviously you will have to go through the documentation in detail.
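To make the windowing idea above a bit more concrete, here is a hedged Kafka Streams sketch that counts views per user in tumbling 5-minute windows (the topic names, keying by user id, and serdes are all assumptions, not part of the question's actual setup):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;
import java.time.Duration;
import java.util.Properties;

public class CategoryViewCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "category-view-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("clickstream")  // events keyed by user id (assumed)
               .groupByKey()
               .windowedBy(TimeWindows.of(Duration.ofMinutes(5))) // tumbling 5-minute windows
               .count()                                // views per user per window
               // flatten the windowed key into a plain string for the output topic
               .toStream((window, count) -> window.key() + "@" + window.window().start())
               .to("view-counts-5m", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```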
Agree with the answers above. The questions you ask are reasonably straightforward and likely answered in the official documentation.
As per one of the replies, there are lots of excellent books and tutorials online. I recently wrote a summary of educational resources on Kafka which you might find useful.
Based on your scenario, this will be a straightforward stream processing application with an emitter and a few consumers.
The clickstream event would be published onto the Kafka cluster through a Kafka client library. It's not clear what language the website is written in, but there is likely a library available for that language. The web server connects to Kafka brokers and publishes a message every time the user performs some action of significance.
You mention that order matters. Kafka has inherent support for ordered messages. All you need to do is publish related messages with the same key, for example the username of the customer or their ID. Kafka then ensures that those messages will appear in the order that they were published.
You say that multiple consumers will be reading the same stream. This is easily achieved by giving each set of consumers a different group.id. Kafka keeps a separate set of committed offsets for each consumer group (Kafka's terminology for a related set of consumers), so that one group can process messages independently of another. For committing offsets, the easiest approach is to use the automatic offset commit mode that is enabled by default. This way records will not be committed until your consumer is finished with them, and if a consumer fails midway through processing a batch of records, those records will be redelivered.
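As a hedged sketch of that fan-out (the group name, topic and updateDashboard() are assumptions): a second pipeline, say an S3 archiver, would run the identical code with a different group.id and read the full stream independently.

```java
import org.apache.kafka.clients.consumer.*;
import java.time.Duration;
import java.util.*;

public class DashboardConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed
        // A different group.id gives this consumer its own committed offsets,
        // so it reads the full stream independently of other groups.
        props.put("group.id", "dashboard");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // enable.auto.commit defaults to true: offsets for a polled batch are
        // committed on a later poll(), i.e. after this loop has processed it.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("clickstream"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    updateDashboard(record.value()); // hypothetical metric update
                }
            }
        }
    }

    static void updateDashboard(String event) { /* ... */ }
}
```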

I am evaluating Google Pub/Sub vs Kafka. What are the differences? [closed]

I have not worked with Kafka much, but I want to build a data pipeline in GCE. So I want to compare Kafka and Pub/Sub. Basically, I want to know how message consistency, message availability and message reliability are maintained in both Kafka and Pub/Sub.
Thanks
In addition to Google Pub/Sub being managed by Google and Kafka being open source, the other difference is that Google Pub/Sub is a message queue (like RabbitMQ) whereas Kafka is more of a streaming log. You can't "re-read" or "replay" messages with Pub/Sub. (EDIT: as of February 2019, you CAN replay messages and seek backwards in time to a certain timestamp, per the comment below.)
With Google Pub/Sub, once a message is read out of a subscription and acked, it's gone. In order to have more copies of a message read by different readers, you "fan out" the topic by creating "subscriptions" for that topic, where each subscription gets an entire copy of everything that goes into the topic. But this also increases cost, because Google charges for Pub/Sub usage by the amount of data read out of it.
With Kafka, you set a retention period (I think it's 7 days by default) and the messages stay in Kafka regardless of how many consumers have read them. You can add a new consumer (aka subscriber) and have it start consuming from the front of the topic any time you want. You can also set the retention period to be infinite, and then you can basically use Kafka as an immutable datastore, as described here: http://stackoverflow.com/a/22597637/304262
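As a sketch, creating such an "infinite retention" topic might look like this with Kafka's Java admin client (the broker address, topic name, partition count and replication factor are all assumptions):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.*;

public class CreateRetainedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("clickstream-archive", 3, (short) 1)
                // retention.ms = -1 disables time-based deletion, so the
                // topic behaves like an append-only, replayable log
                .configs(Map.of("retention.ms", "-1"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```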
Amazon AWS Kinesis is a managed version of Kafka, whereas I think of Google Pub/Sub as a managed version of RabbitMQ.
Amazon SNS with SQS is also similar to Google Pub/Sub (SNS provides the fan-out and SQS provides the queueing).
I have been reading the answers above and I would like to complement them, because I think there are some details pending:
Fully managed system: both systems can have a fully managed version in the cloud. Google provides Pub/Sub, and there are fully managed Kafka versions out there that you can configure in the cloud and on-prem.
Cloud vs on-prem: I think this is a real difference between them, because Pub/Sub is only offered as part of the GCP ecosystem, whereas Apache Kafka can be used as both a cloud service and an on-prem service (doing the cluster configuration yourself).
Message duplication: with Kafka you need to keep track of the consumer offsets yourself (older clients stored them in external storage such as Apache Zookeeper, while modern clients store them in an internal Kafka topic). That way you can track the messages read so far by the consumers. Pub/Sub works by acknowledging messages: if your code doesn't acknowledge a message before the deadline, the message is sent again, so you can avoid losing messages; another way to avoid duplicates is to use Cloud Dataflow's PubsubIO. The acknowledgement flow is illustrated in the sketch after this list.
Retention policy: both Kafka and Pub/Sub have options to configure the maximum retention time; by default, I think, it is 7 days.
Consumer groups vs subscriptions: be careful how you read messages in each system. Pub/Sub uses subscriptions: you create a subscription and then start reading messages from it; once a message is read and acknowledged, it is gone from that subscription. Kafka uses the concepts of "consumer group" and "partition": every consumer process belongs to a group, and when a message is read from a specific partition, any other consumer process belonging to the same consumer group will not read that message (because the offset eventually increases). You can see the offset as a pointer which tells the processes which messages have already been read.
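For the acknowledgement behaviour mentioned under "Message duplication" above, here is a hedged sketch using the Google Cloud Pub/Sub Java client (the project and subscription names, and process(), are assumptions):

```java
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class PubSubWorker {
    public static void main(String[] args) {
        ProjectSubscriptionName subscription =
            ProjectSubscriptionName.of("my-project", "my-subscription"); // assumed names
        MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
            try {
                process(message.getData().toStringUtf8()); // app-specific work
                consumer.ack();  // success: Pub/Sub deletes the message
            } catch (Exception e) {
                consumer.nack(); // redelivered; this also happens implicitly
                                 // if the ack deadline expires first
            }
        };
        Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
        subscriber.startAsync().awaitRunning();
    }

    static void process(String data) { /* ... */ }
}
```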
I don't think there is a single correct answer to your question; it really depends on what you need and the constraints you have (below are some example scenarios):
If the solution must be in GCP, obviously use Google Cloud Pub/Sub. You will avoid all the setup effort that Kafka requires, or paying extra for a fully managed Kafka service.
If the solution needs to process data in a streaming way but also (eventually) support batch processing, it is a good idea to use Cloud Dataflow + Pub/Sub.
If the solution requires some Spark processing, you could explore Spark Streaming (which you can configure with Kafka for the stream processing).
In general, both are very solid stream processing systems. The point which makes the huge difference is that Pub/Sub is a cloud service attached to GCP, whereas Apache Kafka can be used in both cloud and on-prem environments.
Update (April 6th 2021):
Finally Kafka without Zookeeper
One big difference between Kafka and Cloud Pub/Sub is that Cloud Pub/Sub is fully managed for you. You don't have to worry about machines, setting up clusters, fine-tuning parameters, etc., which means that a lot of DevOps work is handled for you. This is important, especially when you need to scale.