Vert.x WriteStream setWriteQueueMaxSize

I am new to Vert.x. While reading about WriteStream in the docs at https://vertx.io/docs/vertx-core/java/, I came across the following:
setWriteQueueMaxSize: set the number of object at which the write queue is
considered full, and the method writeQueueFull returns true. Note that,
when the write queue is considered full, if write is called the data will
still be accepted and queued. The actual number depends on the stream
implementation, for Buffer the size represents the actual number of bytes
written and not the number of buffers.
Especially this statement: "Note that, when the write queue is considered full, if write is called the data will still be accepted and queued." From this statement, I have a few questions:
(1) Is there any size limitation when writing to a stream? For example, with the event bus, how many messages can be written to it? Does it depend on memory? Suppose I keep writing messages to the event bus and they are never consumed; will that cause an OutOfMemoryError in Java?
(2) If there is some limit on writing, where and how can I check the default queue size? For example, if I want to know the default queue size of the Vert.x KafkaProducer, where can I check it?
Any ideas are appreciated.

There are actually three separate questions there, but I'll give it a shot anyway:
Is there any size limitation for writing to stream?
That depends on the stream implementation, but I've yet to see an implementation that actually uses writeQueueMaxSize.
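Whatever a given implementation does internally, the way calling code is meant to use writeQueueMaxSize is as a back-pressure signal, not a hard cap. Here is a minimal sketch in Java (the file names are made up; AsyncFile is used only because it implements both ReadStream and WriteStream, and the same pattern applies to any ReadStream/WriteStream pair):

import io.vertx.core.Vertx;
import io.vertx.core.file.AsyncFile;
import io.vertx.core.file.OpenOptions;

public class WriteQueueSketch {
  public static void main(String[] args) {
    Vertx vertx = Vertx.vertx();

    vertx.fileSystem().open("source.dat", new OpenOptions(), srcRes ->
      vertx.fileSystem().open("dest.dat", new OpenOptions().setWrite(true), dstRes -> {
        AsyncFile src = srcRes.result();
        AsyncFile dst = dstRes.result();

        // For Buffer-based streams the size is a number of bytes, not a number of buffers.
        dst.setWriteQueueMaxSize(64 * 1024);

        src.handler(chunk -> {
          dst.write(chunk);            // accepted and queued even when the queue is "full"
          if (dst.writeQueueFull()) {  // so it is up to us to stop feeding it
            src.pause();
            dst.drainHandler(v -> src.resume());
          }
        });
        src.endHandler(v -> dst.close());
      })
    );
  }
}

So nothing stops you from writing past the limit; the queue just keeps growing in memory unless you check writeQueueFull() and back off.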
how many messages can be written to event bus? Does it depend on memory?
EventBus is a special case. If nobody is consuming from the EventBus, it will simply drop messages, so in such a trivial case an infinite number could be written. But if there are consumers and they're slow, then yes, eventually you'll run out of memory. The EventBus implementation currently doesn't do anything with writeQueueMaxSize.
If there is some limitation for writing, where and how can I check default queue size? Such as, I want to know the default queue size of Vert.x KafkaProducer, where can I check it?
All Vert.x projects are open source, so you'll usually find them on GitHub (or another open source repo, but I don't remember any that are not on GitHub, actually).
Particularly for Vert.x KafkaProducer, you can see the code here:
https://github.com/vert-x3/vertx-kafka-client/blob/1720d5a6792f70509fd7249a47ab50b930cee7a7/src/main/java/io/vertx/kafka/client/producer/impl/KafkaProducerImpl.java#L217

Related

Should I add a buffer after a Kafka source in Akka stream

According to this blog post:
If the source of the stream polls an external entity for new messages and the downstream processing is non-uniform, inserting a buffer can be crucial to realizing good throughput. For example, a large buffer inserted after the Kafka Consumer from the Reactive Streams Kafka library can improve performance by an order of magnitude in some situations. Otherwise, the source may not poll Kafka fast enough to keep the downstream saturated with work, with the source oscillating between backpressuring and polling Kafka.
The documentation for the Alpakka Kafka connector doesn't mention that, so I was wondering if it makes sense to use a buffer in this case. Also, does the same thing apply to Kafka sinks (should I add a buffer before them)?
...I was wondering if it makes sense to use a buffer in this case
Consider the following segment from the blog post you quoted:
...the downstream processing is non-uniform....
One of the points of that section of the post is to illustrate the similar effects that a user-defined buffer and an asynchronous boundary can have on a stream. The default behavior, in which there are no buffers or asynchronous boundaries, is to enable operator fusion, which runs a stream in a single actor. This essentially means that for every Kafka message that is consumed, the message must go through the entire pipeline of the stream, from source to sink, before the next message goes through the pipeline. In other words, a message m2 will not go through the pipeline until the preceding message m1 is finished processing.
If the processing that occurs downstream from a Kafka connector source is "non-uniform" (i.e., it can take varying amounts of time: sometimes the processing happens quickly, sometimes it takes a while), then introducing a buffer or an asynchronous boundary could improve the overall throughput. This is because a buffer or asynchronous boundary can allow the source to continue consuming Kafka messages even if the downstream processing happens to take a long time. That is, if m1 takes a long time to process, the source could consume messages m2, m3, etc. (until the buffer is full), without waiting for m1 to finish. As Colin Breck states in his post:
The buffer improves performance by decoupling stages, allowing the upstream or downstream to continue to process elements, on average, even if one of them is busy processing a relatively expensive workload.
This potential performance boost doesn't apply for all situations. Again quoting Breck:
Similar to the async method discussed in the previous section, it should be noted that inserting buffers indiscriminately will not improve performance and simply consume additional resources. If adjacent workloads are relatively uniform, the addition of a buffer will not change the performance, as the overall performance of the stream will simply be dominated by the slowest processing stage.
One obvious way to determine whether using a buffer (i.e., .buffer) makes sense in your case is to try it. You might also try adding an asynchronous boundary (i.e., .async) instead. Compare the following three approaches--(1) the default fused behavior without buffering, (2) .buffer, and (3) .async--and see which one results in the best performance.
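If you want a feel for the three variants before wiring in the actual Kafka source, here is a self-contained sketch using the Akka Streams Java DSL (Akka 2.6+ assumed; Source.range stands in for the Kafka consumer source and expensiveStep stands in for your non-uniform downstream processing):

import java.util.concurrent.CompletionStage;

import akka.Done;
import akka.NotUsed;
import akka.actor.ActorSystem;
import akka.stream.OverflowStrategy;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;

public class BufferVsAsyncSketch {

  public static void main(String[] args) {
    ActorSystem system = ActorSystem.create("demo");

    // (1) Default fused behaviour: every element goes through the whole
    //     pipeline before the next one enters it.
    CompletionStage<Done> fused =
        source().map(BufferVsAsyncSketch::expensiveStep)
                .runWith(Sink.foreach(i -> { }), system);

    // (2) Explicit buffer: the source may run up to 256 elements ahead of
    //     the downstream, back-pressuring only when the buffer is full.
    CompletionStage<Done> buffered =
        source().buffer(256, OverflowStrategy.backpressure())
                .map(BufferVsAsyncSketch::expensiveStep)
                .runWith(Sink.foreach(i -> { }), system);

    // (3) Asynchronous boundary: source and downstream run in separate
    //     actors, with a small internal buffer between them.
    CompletionStage<Done> withAsyncBoundary =
        source().async()
                .map(BufferVsAsyncSketch::expensiveStep)
                .runWith(Sink.foreach(i -> { }), system);
  }

  // Stand-in for the Kafka consumer source from the question.
  private static Source<Integer, NotUsed> source() {
    return Source.range(1, 1000);
  }

  // Stand-in for non-uniform downstream processing.
  private static Integer expensiveStep(Integer i) throws InterruptedException {
    if (i % 10 == 0) {
      Thread.sleep(50); // occasionally slow
    }
    return i * 2;
  }
}

Measuring each variant against your real workload is the most reliable way to decide; as the quotes above point out, if the stages are uniform, the buffer and the async boundary will mostly just cost extra memory and scheduling overhead.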

Sorting Service Bus Queue Messages

I was wondering if there is a way to attach metadata, or even multiple metadata fields, to a Service Bus queue message, to be used later in an application for sorting, while still maintaining FIFO in the queue.
So in short, what I want to do is:
Maintain FIFO (First In, First Out) order in the queue, but since messages are inserted into the queue from different sources, I want to be able to tell which source a message came from, for example via metadata.
I know this is possible with Topics, where you can add a property to the message, but I am also unsure whether it is possible to add multiple properties to a topic message.
I hope I made it clear what I am asking.
I assume you use the .NET API. If that's the case, you can use the Properties dictionary to write and read your custom metadata:
BrokeredMessage message = new BrokeredMessage(body);
message.Properties.Add("Source", mySource);
You are free to add multiple properties too. This is the same for both Queues and Topics/Subscriptions.
I was wondering if there is a way to attach metadata, or even multiple metadata fields, to a Service Bus queue message, to be used later in an application for sorting, while still maintaining FIFO in the queue.
To maintain FIFO in the queue, you'd have to use Message Sessions. Without message sessions you would not be able to maintain FIFO in the queue itself. You would be able to set a custom property, use it in your application, and sort messages once they are received out of order, but you won't receive messages in FIFO order, as you were asking in your original question.
If you drop the requirement of having the order preserved on the queue, then the answer @Mikhail has provided will be suitable for in-process sorting based on custom property(s). Just be aware that in-process sorting will not be a trivial task.

How often put() is triggered in Kafka Connect sink tasks?

Can I control the intervals at which the put() method of my Kafka Connect Sink tasks is triggered? What is the expected behavior of the Kafka Connect framework in this respect? Ideally, I would like to specify, for example, "don't call me unless you have X new records/Y new bytes, or Z milliseconds passed since the last invocation". This could potentially make the batching logic within the sink task simpler (quoting the documentation, "in many cases internal buffering will be useful so an entire batch of records can be sent at once, reducing the overhead of inserting events into the downstream data store").
Today, put from a SinkTask is only called when deliverMessages is invoked in a WorkerSinkTask. The good news is that the only time deliverMessages happens is within poll, so you should have some control over how often you poll for new records by overriding consumer properties.
If you want to do internal buffering, you could have a look at how the HDFSConnector is handling this in its implementation of SinkTask. However, right now, Connect will immediately put any records that get returned by the poll.
All of that said, if you are really looking to batch messages before they hit the downstream system, you might consider looking into offset.flush.interval.ms and offset.flush.timeout.ms, which control how often flush() is invoked.
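To illustrate the internal-buffering idea, here is a minimal sketch of a SinkTask that only accumulates records in put() and writes them out in flush(); writeBatchToStore is a hypothetical placeholder for the real downstream write:

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class BufferingSinkTask extends SinkTask {

  private final List<SinkRecord> batch = new ArrayList<>();

  @Override
  public void start(Map<String, String> props) {
    // read connector configuration here
  }

  @Override
  public void put(Collection<SinkRecord> records) {
    // Called whenever the worker's consumer poll returns records;
    // we only accumulate here instead of writing straight away.
    batch.addAll(records);
  }

  @Override
  public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
    // Invoked on the offset.flush.interval.ms cadence: write the whole
    // accumulated batch to the downstream store in one go, then clear it.
    writeBatchToStore(batch);
    batch.clear();
  }

  @Override
  public void stop() {
    batch.clear();
  }

  @Override
  public String version() {
    return "0.1";
  }

  private void writeBatchToStore(List<SinkRecord> records) {
    // hypothetical placeholder for the real downstream write
  }
}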

How to get Acknowledgement from Kafka

How do I get an acknowledgement from Kafka once a message is consumed or processed? It might sound stupid, but is there any way to know the start and end offset of the message for which the ack has been received?
What I have found so far is that in 0.8 they introduced the following way to choose the offset to start reading from:
kafka.api.OffsetRequest.EarliestTime() finds the beginning of the data in the logs and starts streaming from there, kafka.api.OffsetRequest.LatestTime() will only stream new messages.
Example code:
https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
I am still not sure about the acknowledgement part.
Kafka isn't really structured to do this. To understand why, review the Kafka design documentation.
In order to provide an exactly-once acknowledgement, you would need to create some external tracking system for your application, where you explicitly write acknowledgements and implement locks over the transaction id's in order to ensure things are only ever processed once. The computational cost of implementing such as system is extraordinarily high, and is one of the main reasons that large transactional systems require comparatively exotic hardware and have arguably lower scalability than systems such as Kafka.
If you do not require strong durability semantics, you can use the groups API to keep rough track of when the last message was read. This ensures that every message is read at least once. Note that since the groups API does not give you the ability to explicitly track your application's own processing logic, your actual processing guarantees are fairly weak in this scenario. Schemes that rely on idempotent processing are common in this environment.
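For illustration, with the consumer API from newer clients (0.9+, rather than the 0.8 APIs mentioned in the question; the broker address and topic name below are made up), the usual at-least-once pattern is to disable auto-commit, process each record (whose offset you can read directly), and commit only after processing:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "my-group");
    props.put("enable.auto.commit", "false"); // commit only after processing
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("my-topic"));
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
          // record.offset() tells you exactly which offset this message has
          process(record);
        }
        // Committing after processing gives at-least-once semantics:
        // if the process crashes before the commit, records are re-read.
        consumer.commitSync();
      }
    }
  }

  private static void process(ConsumerRecord<String, String> record) {
    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
  }
}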
Alternatively, you may use the poorly-named SimpleConsumer API (it is quite complex to use), which enables you to explicitly track offsets within your application. This is the highest level of processing guarantee that can be achieved through the native Kafka APIs, since it enables you to track your application's own processing of the data that is read from the queue.

MSMQ as a job queue

I am trying to implement a job queue with MSMQ, to save myself some time over implementing it in SQL. After reading around, I realized MSMQ might not offer what I am after. Could you please advise me on whether my plan is realistic using MSMQ, or recommend an alternative?
I have a number of processes picking up jobs from a queue (I might need to scale out in the future). Once a job is picked up, processing follows; during this time the job is locked to other processes by its status. If needed, the job is chucked back to the queue (status changes again) for further processing, but physically the job still sits in the queue until completed.
MSMQ doesn't let me keep the message in the queue while working on it: I can either peek or read. Read takes the message out of the queue, and peek doesn't allow changing the message (status).
Thank you
Using MSMQ as a datastore is probably bad as it's not designed for storage at all. Unless the queues are transactional the messages may not even get written to disk.
Certainly updating queue items in-situ is not supported for the reasons you state.
If you don't want a full blown relational DB you could use an in-memory cache of some kind, like memcached, or a cheap object db like RavenDB.
Take a look at RabbitMQ, or many of the other messages queues. Most offer this functionality out of the box.
For example. RabbitMQ calls what you are describing, Work Queues. Multiple consumers can pull from the same queue and not pull the same item. Furthermore, if you use acknowledgements and the processing fails, the item is not removed from the queue.
.NET examples:
https://www.rabbitmq.com/tutorials/tutorial-two-dotnet.html
EDIT: After using MSMQ myself, it would probably work very well for what you are doing, as far as I can tell. The key is to use transactions and multiple queues. For example, each status should have its own queue. It's fairly safe to "move" messages from one queue to another since it occurs within a transaction. This moving of messages is essentially your change of status.
We also use the Message Extension byte array for storing message metadata, like status. This way we don't have to alter the actual message when moving it to another queue.
MSMQ, and queues in general, require a different set of patterns than what most programmers are used to. Keep that in mind.
Perhaps, if you can give more information on why you need to peek for messages that are currently in process, there would be a way to handle that scenario with MSMQ. You could always add a database for additional tracking.