Biztalk - How to throttle a streaming disassemble pipeline - streaming

I need to limit the number of orchestration instances spawned while debatching a large message in a streaming disassemble receive pipeline. Let’s say that I have a large xml coming in that contains 100 000 separate "Order" message. The receive pipeline would then debatch it and create 100 000 "ProcessOrder" orchestrations. This is too much and I need to limit that.
Requirements
The debatching needs to be done in a streaming manner so that I only load one "Order" message in memory at a time before sending it to the messagebox;
The debatching needs to be throttled based on the number of current running "ProcessOrder" orchestration instances (say if I already have 100 running instances, the debatching would wait till one is over to send another "Order" message to the messagebox).
Where I'm at
I have the receive pipeline that does the debatching and functional modifications to my messages. It does what it should in a streaming manner and puts individual messages in VirtualStreams;
I have an orchestration and helper methods that can limit the number of “ProcessOrder” orchestration instances.
The problem
I know that I can run a receive pipeline inside an orchestration (and that would solve my problem since on every "getnext" call to the pipeline, I could just hold on if there are too many running orchestration instances) but, digging in biztalk dlls, I noticed that using Microsoft.XLANGs.Pipeline.XLANGPipelineManager still loads up all the messages in memory instead of enumerating them like Microsoft.BizTalk.PipelineOM.PipelineManager does. I know they are putting every messages in VirtualStream but this is still inadequate, memory wise, for such a large message number.
Question
My next step would be to run the receive pipeline directly in the receive port (so it would use Microsoft.BizTalk.PipelineOM.PipelineManager) without having the orchestration that limits the number of “ProcessOrder” instances, but to meet the requirements, I would need to add a delay logic in my pipeline. Is this a viable option? If not, why? and what other alternative do I have?

You should debatch all messages once from pipeline and store those individual messages in MSMQ before even they are processed by orchestration. Use standard pipeline to debatch messages as they are efficient to handle large files debatching. MSMQ is available for free through Turn On Windows Features. Using MSMQ is very easy and does not require any development. Sending to MSMQ will be very fast 100K messages is not issue at all.
Then have a receive location to read from MSMQ. Depending on your orchestration throughput, you can control message flow by using BizTalk receive host throttling or by receiving the messages from MSMQ in Order or using the combination of both. Make sure you have separate host instance for both receive MSMQ and send MSMQ and for your orchestration processing.
This will be done through all configurations without any extra code simplifing your design. Make sure you have orchestration with minimum number of persistent points.

Related

How do I make sure that I process one message at a time at most?

I am wondering how to process one message at a time using Googles pub/sub functionality in Go. I am using the official library for this, https://pkg.go.dev/cloud.google.com/go/pubsub#section-readme. The event is being consumed by a service that runs with multiple instances, so any in memory locking mechanism will not work.
I realise that it's an anti-pattern to do this, so let me explain my use-case. Using mongoDB I store an array of objects as an embedded document for each entity. The event being published is modifying parts of this array and saves it. If I receive more than one event at a time and they start processing exactly at the same time, one of the saves will override the other. So I was thinking a solution for this is to make sure that only one message will be processed at a time, and it would be nice to use any built-in functionality in cloud pub/sub to do so. Otherwise I was thinking of implementing some locking mechanism in the DB but i'd like to avoid that.
Any help would be appreciated.
You can imagine 2 things:
You can use ordering key in PubSub. Like that, all the message in relation with the same object will be delivered in order and one by one.
You can use a PUSH subscription to PubSub, to push to Cloud Run or Cloud Functions. With Cloud Run, set the concurrency to 1 (it's by default with Cloud Functions gen1), and set the max instance to 1 also. Like that you can process only one message at a time, all the other message will be rejected (429 HTTP error code) and will be requeued to PubSub. The problem is that you can parallelize the processing as before with ordering key
A similar thing, and simpler to implement, is to use Cloud Tasks instead of PubSub. With Cloud Tasks you can set a rate limit on a queue, and set the maxConcurrentDispatches to 1 (and you haven't to do the same with Cloud Functions max instances or Cloud Run max instances and concurrency)

Why doesn't my Azure Function scale up?

For a test, I created a new function app. I added two functions, one was an http trigger that when invoked, pushed 500 messages to a queue. The other, a queue trigger to read the messages. The queue trigger function code, was setup to read a message and randomly sleep from 1 to 30 seconds. This was intended to simulate longer running tasks.
I invoked the http trigger to create the messages, then watched the que fill up (messages were processed by the other trigger). I also wired up app insights to this function app, but I did not see is scale beyond 1 server.
Do Azure functions scale up soley on the # of messages in the que?
Also, I implemented these functions in Powershell.
If you're running in the Azure Functions consumption plan, we monitor both the length and the throughput of your queue to determine whether additional VM resources are needed.
Note that a single function app instance can process multiple queue messages concurrently without needing to scale across multiple VMs. So if all 500 messages can be consumed relatively quickly (again, in the consumption plan), then it's possible that you won't scale at all.
The exact algorithm for scaling isn't published (it's subject to lots of tweaking), but generally speaking you can expect the system to automatically scale you out if messages are getting added to the queue faster than your functions can process them. Your app will also scale out if the latency of the first message in the queue is continuously increasing (meaning, messages are sitting idle and not getting processed). The time between VMs getting added is usually in the tens of seconds.
There are some thresholds based on queue count as well. For example, the system tries to ensure that there is at least 1 VM for every 1K queue messages, but usually the scale decisions are based on message throughput as I described earlier.
I think #Chris Gillum put it well, it's hard for us to push the limits of the server to the point that things will start to scale.
Some other options available are:
Use durable functions and scale with Threading:
https://learn.microsoft.com/en-us/azure/azure-functions/durable-functions-cloud-backup
Another method could be to use Event Hubs which are designed for massive scale. Instead of queues, have Function #1 trigger an Event, and your Function #2 subscribed to that Event Hub trigger. Adding Streaming Analytics, could also be an option to more fully expand on capabilities if needed.

MSMQ as a job queue

I am trying to implement job queue with MSMQ to save up some time on me implementing it in SQL. After reading around I realized MSMQ might not offer what I am after. Could you please advice me if my plan is realistic using MSMQ or recommend an alternative ?
I have number of processes picking up jobs from a queue (I might need to scale out in the future), once job is picked up processing follows, during this time job is locked to other processes by status, if needed job is chucked back (status changes again) to the queue for further processing, but physically the job still sits in the queue until completed.
MSMQ doesn't let me to keep the message in the queue while working on it, eg I can peek or read. Read takes message out of queue and peek doesn't allow changing the message (status).
Thank you
Using MSMQ as a datastore is probably bad as it's not designed for storage at all. Unless the queues are transactional the messages may not even get written to disk.
Certainly updating queue items in-situ is not supported for the reasons you state.
If you don't want a full blown relational DB you could use an in-memory cache of some kind, like memcached, or a cheap object db like raven.
Take a look at RabbitMQ, or many of the other messages queues. Most offer this functionality out of the box.
For example. RabbitMQ calls what you are describing, Work Queues. Multiple consumers can pull from the same queue and not pull the same item. Furthermore, if you use acknowledgements and the processing fails, the item is not removed from the queue.
.net examples:
https://www.rabbitmq.com/tutorials/tutorial-two-dotnet.html
EDIT: After using MSMQ myself, it would probably work very well for what you are doing, as far as I can tell. The key is to use transactions and multiple queues. For example, each status should have it's own queue. It's fairly safe to "move" messages from one queue to another since it occurs within a transaction. This moving of messages is essentially your change of status.
We also use the Message Extension byte array for storing message metadata, like status. This way we don't have to alter the actual message when moving it to another queue.
MSMQ and queues in general, require a different set of patterns than what most programmers are use to. Keep that in mind.
Perhaps, if you can give more information on why you need to peek for messages that are currently in process, there would be a way to handle that scenario with MSMQ. You could always add a database for additional tracking.

MSMQ multiple readers

This is my proposed architecture. Process A would create items and add it to a queue A on the local machine and I plan to have multiple instances of windows service( running on different machines) reading from this queue A .Each of these windows service would read a set of messages and then process that batch.
What I want to make sure is that a particular message will not get processed multiple times ( by the different windows service). Does MSMQ by default guarantee single delivery?
Should I make the queue transactional? or would a regular queue suffice.
If you need to make sure that the message is delivered only once, you would want to use a transactional queue. However, when a service reads a message from the queue it is removed from the queue and can only be received once.

Slow Subscriber

I am newbie to ZMQ
ZMQ Version - 2.2.1
Ubuntu - 10.04
I am using the PUB-SUB pattern for communication between multiple publishers and multiple subscribers. A forwarder is used to subscribe data from multiple publishers and the same is published to all the subscribers.
Currently, if three publishers are running and if each publisher sends 1000 messages in 1second via the PUB channel. The subscriber receives the data, stores it and writes to a database every 1second.
Because of the involvement of database, the rate at which subscriber receives the data is getting delayed, as a result the memory usage (RAM) increases by 6-7MB every 1second. Finally the subscriber gets killed by OS due to OOM
I tried using the options ZWQ_HWM & ZMQ_SWAP on both the sockets of forwarder. But still the issue persists.
Is there any solution for this???
Overall your problem is that your database cannot keep up with your publisher. 0MQ cannot solve this for you. You need an architectural solution based on changing the behavior of your system, presumably the way you do inserts.
You have a few options:
Use a faster database
Use a faster database insert method
Write to a log which is processed asynchronously by another process
Change to a socket pattern that lets the receivers tell the senders that they are backed up, so the senders pause (if that's possible)
I think in your case the spool-to-disk-file option is the best.