Interprocess messaging - MSMQ, Service Broker, ...?

I'm in the planning stages of a .NET service which continually processes incoming messages, which involves various transformations, database inserts and updates, etc. As a whole, the service is huge and complicated, but the individual tasks it performs are small, simple, and well-defined.
For this reason, and in order to allow for easy expansion in the future, I want to split the service into several smaller services which each perform part of the processing before passing it on to the next service in the chain.
In order to achieve this, I need some kind of intermediary messaging system that will pass messages from one service to another. I want this to happen in such a way that if a link in the chain crashes or is taken offline briefly, the messages will begin to queue up and get processed once the destination comes back online.
I've always used message queuing for this type of thing, but have recently been made aware of SQL Service Broker which appears to do something similar. Is SQLSB a viable alternative for this scenario and, if so, would I see any performance benefits by using that instead of standard Message Queuing?
Thanks

It sounds to me like you may be after a service bus architecture. This would provide you with the coordination and fault tolerance you are looking for. I'm most familiar with, and partial to, NServiceBus, but there are others, including Mass Transit and Rhino Service Bus.
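As a rough sketch of how that chaining might look with NServiceBus (the message and handler names here are invented, and the exact API depends on the NServiceBus version you pick up):

    using System.Threading.Tasks;
    using NServiceBus;

    // Hypothetical messages for two adjacent steps in the processing chain.
    public class TransformPacket : ICommand
    {
        public string Payload { get; set; }
    }

    public class PersistPacket : ICommand
    {
        public string TransformedPayload { get; set; }
    }

    // One small service: it handles its own step, then hands the work to the next queue.
    // If the next endpoint is offline, the sent message simply waits in that queue.
    public class TransformPacketHandler : IHandleMessages<TransformPacket>
    {
        public async Task Handle(TransformPacket message, IMessageHandlerContext context)
        {
            var transformed = message.Payload.ToUpperInvariant(); // stand-in for the real transformation

            await context.Send(new PersistPacket { TransformedPayload = transformed });
        }
    }

The bus (over MSMQ, the SQL Server transport, etc.) takes care of queuing, retries and the "catch up when the destination comes back online" behaviour you describe.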

If most of these steps initiate from a database state and end up in a database update, then merging your message storage with your data storage makes a lot of sense:
a single product to backup/restore
consistent state backups
a single high-availability/disaster recoverability solution (DB mirroring, clustering, log shipping etc)
database scale storage (IO capabilities, size and capacity limitations etc as per the database product characteristics, not the limits of message store products).
a single product to tune, troubleshoot, administer
In addition there are serious performance considerations: having your message store be the same as the data store means you are not required to do a two-phase commit on every message interaction. Using a separate message store requires you to enlist the message store and the data store in a distributed transaction (even if both are on the same machine), which requires two-phase commit and is much slower than the single-phase commit of database-only transactions.
In addition, a message store in the database, as opposed to an external one, has advantages like queryability (you can run SELECT over the message queues).
Now if we translate the abstract term 'message store in the database' to Service Broker and 'non-database message store' to MSMQ, you can see why SSB will run circles around MSMQ in this scenario.
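To make the single-transaction point concrete, here is a minimal sketch (C# with System.Data.SqlClient; the queue name, table name and connection string are invented) of receiving a Service Broker message and applying a data change in one local transaction:

    using System.Data;
    using System.Data.SqlClient;

    // connectionString is assumed to point at the database hosting both the queue and the data.
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        using (var tx = conn.BeginTransaction())
        {
            // Receive the next message from the (hypothetical) dbo.ProcessingQueue.
            string body;
            using (var receive = new SqlCommand(
                "WAITFOR (RECEIVE TOP (1) @body = CAST(message_body AS NVARCHAR(MAX)) " +
                "FROM dbo.ProcessingQueue), TIMEOUT 5000;", conn, tx))
            {
                var p = receive.Parameters.Add("@body", SqlDbType.NVarChar, -1);
                p.Direction = ParameterDirection.Output;
                receive.ExecuteNonQuery();
                body = p.Value as string;
            }

            if (body != null)
            {
                // Apply the resulting data change in the SAME local transaction.
                using (var insert = new SqlCommand(
                    "INSERT INTO dbo.ProcessedMessages (Body, ProcessedAt) VALUES (@body, SYSUTCDATETIME());",
                    conn, tx))
                {
                    insert.Parameters.AddWithValue("@body", body);
                    insert.ExecuteNonQuery();
                }
            }

            // Single-phase commit: removing the message and writing the data succeed or fail together.
            tx.Commit();
        }
    }

    // Queryability: the queue is just another object you can SELECT from, e.g.
    // SELECT CAST(message_body AS NVARCHAR(MAX)) FROM dbo.ProcessingQueue;

With MSMQ, the equivalent code would have to enlist both the queue and the database in a DTC transaction to get the same guarantee.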

My recent experience with both approaches (starting with SQL Server Service Broker) has left me desperate to get my messages out of SQL Server. The problem is quasi-political, but you might want to consider it: SQL Server in my organisation is managed by a specialized DBA, while application servers (i.e. messaging like NServiceBus) are managed by developers and the network team. Any change to the database servers requires painful performance analysis from the DBA, and is steeped in the fear that our queuing engine, living in the same space, might drag down the standard SQL workloads.
SSSB is pretty difficult to manage (not unlike messaging middleware), but the difference is that I have more room to screw something up in the messaging world (the worst that can happen is a pile of messages building up somewhere and logs filling up), whereas I can't afford any mistakes in the SQL world, where customers' transactional data lives and is vital to the business (including data from legacy systems). I really don't want to get those 'unexpected database growth', 'wait time alert' or 'why is my temp db growing without end' emails anymore.
I've learned that application servers are cheap. Just add message handlers, add machines... easy. Virtually no license costs. With SQL Server it is exactly the opposite. It now appears to me that using Service Broker for messaging is like using an expensive car to plow a potato field. It is much better suited to other things.

Related

Limits of processing data on the client vs. processing data on the server

For a desktop app (ERP-like functionality) I'm wondering what would be wiser to do.
Assuming that both machines are equal in performance and the server has to deal with at most 5-10 clients and no other obligations: is it better to load all data initially (~20,000 objects) and do filtering, sorting, etc. on the client (Electron), or is it better to do the processing on the backend (Golang + Postgres) over Axios? The user interface should be as snappy as possible but also get the data as fast as possible.
A costly operation is filtering 15,000 objects by a reference ID (e.g. a client can have several orders).
So objects that belong to a "parent object" are displayed by querying all those objects by a parentID.
Is there a general answer as to which would be more performant, or the better choice here? Making some assumptions, like a latency of 5 ms on the network + 20 ms for the API + a couple more for filling the store.
At which data size will this operation be slower on the frontend or completely unsustainable?
If it's not a performance problem, are there other reasons I would want to do this on the server?
Edit: Client and Server are on the same local network
You specifically mention an ERP-like software. For such software you have to carefully consider the value of consistency:
Will your software need to show the same data for all clients?
If the answer to this is yes, then the simplest implementation is to do data processing on the server which informs all clients of changing data.
If the answer to this is no, then you should be fine doing most processing on the client software.
There are of course ways to do most of your processing on the client yet still have consistency but they will add complexity to your overall design. One implementation is to broadcast changes on one client to all other clients. This is the architecture behind most multiplayer online games.
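As a rough illustration of that broadcast approach (this assumes an ASP.NET Core SignalR hub, which is not part of the stack described in the question; the names are invented):

    using System.Threading.Tasks;
    using Microsoft.AspNetCore.SignalR;

    // Hypothetical hub: when one client changes a record, push the change to everyone else.
    // Each client applies the change to its local copy, keeping all clients consistent.
    public class ErpDataHub : Hub
    {
        public async Task PublishChange(string entityType, string entityJson)
        {
            // Persist the change server-side first (omitted), then fan it out to the other clients.
            await Clients.Others.SendAsync("EntityChanged", entityType, entityJson);
        }
    }

The same pattern works with any push channel (WebSockets, long polling, a message broker); SignalR is only used here to keep the sketch short.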
Another way to tackle this is implemented by git: the data on each client differs from the others, but there are ways to synchronize each client's data with the server, thus achieving eventual consistency.
Another consideration you have to think about is the size of your data:
Will downloading all the data from the server take more than a few seconds?
If downloading all data from the server takes too long then the UI will be essentially unresponsive when starting.

how to design a realtime database update system?

I am designing a WhatsApp-like messenger application for the desktop using WPF and .NET. Now, when a user creates a group, I want the other members of the group to receive a notification that they were added to the group. My frontend is built in C#/.NET and is connected to a RESTful web service (Ruby on Rails). I am using Postgres for the database. I also have a Redis layer to cache my Rails models.
I am considering the following options.
1) Use Postgres's built-in NOTIFY/LISTEN mechanism, which the clients can subscribe to directly. I foresee two issues here:
i) Postgres might not be able to handle tens of thousands of clients subscribed directly.
ii) There is no guarantee of delivery if the client is disconnected.
2) Use Redis' Pub/Sub mechanism, to which the clients can subscribe. I am still concerned about the lack of a delivery guarantee here.
3) Use a message queue like RabbitMQ. The producer for this queue would be Postgres, which would push messages in through triggers. The consumers, of course, would be the .NET clients.
So far, I am inclined to use the 3rd option.
Does anyone have any suggestions how to design this?
In an application like WhatsApp itself, the client running in your phone is an integral part of a large and complex event-based, distributed system.
Without more context, it would be impossible to point you in the right direction. That said:
For option 1: You seem to imply that each client, as in a WhatsApp client, would directly (or through some web service) communicate with Postgres as an event bus, which is not sound and would not scale because you can only have ONE Postgres instance.
For option 2: You have the same problem as in option 1, with worse failure modes.
For option 3: RabbitMQ seems like a reasonable ally here. It is distributed in nature and scales well. As a matter of fact, it runs on Erlang, just as most of WhatsApp does. Using triggers inside Postgres to publish messages, however, does not make a lot of sense.
You need a message bus because you would have lots of updates to do in the background, not to directly connect your users to each other. As you said, clients can be offline.
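For example, instead of a trigger, the service that creates the group can publish the notification itself and let the broker fan it out to background workers or connected clients. A minimal sketch with the .NET RabbitMQ.Client (shown in C# only because the consumers are .NET; the queue name and payload are invented):

    using System.Text;
    using RabbitMQ.Client;

    var factory = new ConnectionFactory { HostName = "localhost" };
    using var connection = factory.CreateConnection();
    using var channel = connection.CreateModel();

    // Durable queue so pending notifications survive a broker restart.
    channel.QueueDeclare(queue: "group-notifications", durable: true,
                         exclusive: false, autoDelete: false, arguments: null);

    var payload = Encoding.UTF8.GetBytes("{\"event\":\"added_to_group\",\"groupId\":42,\"userId\":7}");
    var props = channel.CreateBasicProperties();
    props.Persistent = true; // write the message to disk, not just to memory

    channel.BasicPublish(exchange: "", routingKey: "group-notifications",
                         basicProperties: props, body: payload);

In the setup described in the question, this publish would actually happen in the Rails web service; the point is only that the application, not a database trigger, produces the event.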
Architecture is more about deferring decisions than taking them.
I suggest that you start simple. Build a small, monolithic, synchronous system first, pushing updates as persisted data to all the involved users. For example: in a group of n users, just write n records to a table. It is already complicated enough to reliably keep track of who has received and read what.
These heavy "group" updates can then be moved to long-running processes using RabbitMQ or the like, but a system with several thousand users can work perfectly well without such a thing, especially because a simple message from user A to user B would not need many writes.
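A minimal sketch of that "write n records" step, assuming a hypothetical notifications table (the schema and the use of Npgsql from .NET are my own invention; in the question's stack this would live in the Rails service):

    using System.Collections.Generic;
    using Npgsql;

    // For a message sent to a group of n members, insert one pending-notification row per member.
    // Clients then fetch (or are pushed) their own pending rows and mark them delivered/read.
    void NotifyGroup(NpgsqlConnection conn, long groupId, long messageId, IEnumerable<long> memberIds)
    {
        using var tx = conn.BeginTransaction();
        foreach (var memberId in memberIds)
        {
            using var cmd = new NpgsqlCommand(
                "INSERT INTO notifications (user_id, group_id, message_id, status) " +
                "VALUES (@user, @group, @message, 'pending')", conn, tx);
            cmd.Parameters.AddWithValue("@user", memberId);
            cmd.Parameters.AddWithValue("@group", groupId);
            cmd.Parameters.AddWithValue("@message", messageId);
            cmd.ExecuteNonQuery();
        }
        tx.Commit();
    }

Once this simple version is in place, the loop body is exactly the piece you can later hand off to a queue-backed background worker.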

How to implement real-time replication of MongoDB (or CouchDB) to many remote clients

I'm considering how to design a mechanism for replicating a (potentially large) MongoDB or other NoSQL (CouchDB, etc) database to dozens of clients at once. The clients would function like a replica set, but the replication would be one-way and the remote clients would belong to other parties. Specifically, I am looking for the following features:
real-time: changes to the master database should be pushed out to the clients as quickly as possible
replication to new clients: a new client must be able to connect, automatically sync the majority of existing data, then receive real-time updates.
efficient: both the initial synchronization/transfer of data and tracking of real-time updates ("diffs", if you will) are computationally efficient, with multiple clients connected.
secure: the master database presents an interface to which remote clients (who do not belong to the same owner or system) can connect: i.e., we cannot just add all the clients to the master's replica set.
robust: a temporarily connection failure between a client and the master database should be easily and efficiently recoverable.
In some sense, the server is publishing a collection of data and the clients are subscribing to it. I realize that this is a hard software engineering problem, and to my knowledge no piece of software has implemented this exactly yet. However, some approaches have come to mind as close, which I'll list below.
Meteor's DDP protocol: It's designed to do this with Mongo-like collections and exactly implements the model of publishing and subscribing to a set of data (rather than a stream of messages). It manages the initial sync and sends along live changes. However, it's still in development, and far from being an industrial-strength solution - current drawbacks are that the server keeps a copy of every client's state in a possibly inefficient way and it is only tested on collections that can fit in the memory of a web app. Also, it appears that DDP cannot efficiently sync an out-of-date database without fetching everything from scratch. If anyone can point to some examples of how large a collection can be synced over DDP, that would be great. (See also: https://stackoverflow.com/q/10128430/586086)
Broadcasting the Mongo oplog: Using a high-throughput message bus like Apache Kafka, one may be able to efficiently send the oplog to many clients at once. This tackles some of the system implementation challenges. However, this requires that the clients start with an initial sync that gets them close enough to the current master state somehow and then start replaying the oplog from the appropriate point.
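A rough sketch of that oplog-broadcast idea, assuming the MongoDB .NET driver and Confluent.Kafka (the topic name is invented, and the initial sync plus resume-point bookkeeping are deliberately left out):

    using Confluent.Kafka;
    using MongoDB.Bson;
    using MongoDB.Driver;

    // Tail the replica-set oplog and republish each entry to a Kafka topic that
    // many remote clients can consume independently, each at its own pace.
    var mongo = new MongoClient("mongodb://localhost:27017");
    var oplog = mongo.GetDatabase("local").GetCollection<BsonDocument>("oplog.rs");

    using var producer = new ProducerBuilder<Null, string>(
        new ProducerConfig { BootstrapServers = "localhost:9092" }).Build();

    var options = new FindOptions<BsonDocument>
    {
        CursorType = CursorType.TailableAwait, // keep the cursor open and wait for new entries
        NoCursorTimeout = true
    };

    using var cursor = await oplog.FindAsync(FilterDefinition<BsonDocument>.Empty, options);
    while (await cursor.MoveNextAsync())
    {
        foreach (var entry in cursor.Current)
        {
            await producer.ProduceAsync("master-oplog",
                new Message<Null, string> { Value = entry.ToJson() });
        }
    }

A real implementation would filter out-of-scope namespaces and record the last applied oplog timestamp per client, so that a reconnecting client can resume from where it left off instead of re-syncing from scratch.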
Continuous replication a la CouchDB: I'm not sure how this is implemented and how robust it is, given the sparsity of the documentation. However, it does seem to work over remote database connections. How efficient is this, though, when multiple clients are trying to replicate at the same time? (A similar hack to this would be to make the clients MongoDB Priority 0 replica set members; however, that seems to be far from its intended use. See also: http://guide.couchdb.org/draft/replication.html)
Please give pointers to software or pieces of software that already implement parts of this, or suggestions on the algorithms/data structures needed to do this efficiently.
If you are looking specifically for real-time replication, I'd recommend you look into SaaS offerings specifically for this purpose, such as https://www.firebase.com/

Is it a good idea to write a gateway for PostgreSQL in Erlang?

I'm writing a web application in Erlang, and want to store my data to PostgreSQL.
There are two kinds of resources in my application. One kind is very important, while the other is not that important.
For the important one, no data loss is allowed.
For the less important one, data loss due to system failure is ok.
I want to gain maximum efficiency and came up with such an idea: write a gateway for PostgreSQL. The gateway is a gen_server, and business logic (BL) parts can talk to the gateway for storing resources.
For storing important resources, BL parts send the resources to be stored to the gateway, and block to receive a message (success or failure), and finally respond to the user with a web page.
For storing less important resources, BL parts only send the resources to the gateway without blocking. After sending the resources, BL parts respond with a web page directly.
What I'm expecting from this idea is less time per request, since most of the resources are less important ones. But I wonder if this is a good idea; in other words, can I really get what I'm expecting?
Please answer according to your experience or some reliable "web search results". Thanks. :-)
I can see two problems with your proposal:
All messages sent to the gen_server gateway will be serialized (blocking or non-blocking)
If the gen_server gateway process crashes, you will lose at least the messages in the process mailbox.
What I would do is create a helper module (not a process) that is responsible for the database interactions. This module would use a PostgreSQL library that supports connection pooling (so the calls to the DB can have some parallelism).
If you wish to do non-blocking DB operations for the less important resources, just spawn a process to do the DB interaction and move on.
Some links to postgresql erlang libraries (I haven't used any):
Postgresql connection pooling in Erlang
http://zotonic.com/page/519/epgsql-postgresql-driver

design high volume MSMQ

We have many communication servers sending data packets. We would like to store the data packets coming from these server programs in MSMQ until an updater processes them. Data loss has been a concern: we would like to not lose any data packets coming from these server programs, and we want an efficient and performant solution.
What will be the best design approach?
Well, there are two basic things you need to do to get started. First, you'll want to modify the default installation to move the storage location to a drive that is mirrored and/or is not the same as the one that the operating system boots from on that server. Also you'll want to ensure there is enough space there to hold messages as they are queued, depending on the volume you're contemplating. This article covers that.
Second, you'll want to use transactions and journaling to ensure reliability. This is both a programming and infrastructure issue, so you can start by looking at this article, and then following up with a general guide on how to program against MSMQ correctly. This for example is a good starting point if you've never used MSMQ, although it's fairly basic. If you're going to use MSMQ as a binding/transport for WCF then you have the plumbing part pretty much covered; it's just a matter of configuring your services to handle the volume and traffic you think you're going to see.
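To make the transactional/journaled send concrete, here is a minimal System.Messaging sketch (the queue path and payload are invented):

    using System.Messaging;

    // Send a packet to a transactional queue as a recoverable, journaled message,
    // so it survives service restarts and leaves an audit copy in the journal.
    const string queuePath = @".\private$\packets";

    if (!MessageQueue.Exists(queuePath))
        MessageQueue.Create(queuePath, transactional: true);

    using (var queue = new MessageQueue(queuePath))
    using (var tx = new MessageQueueTransaction())
    {
        var message = new Message("example packet payload")
        {
            Recoverable = true,      // persist to disk rather than holding it only in memory
            UseJournalQueue = true,  // keep a copy in the queue's journal after it is received
            Label = "data-packet"
        };

        tx.Begin();
        queue.Send(message, tx);
        tx.Commit();
    }

The receiving side should likewise receive inside a MessageQueueTransaction, so a crashed updater never loses a packet it had started to process.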
We have many communication servers sending data packets.
When storing 'data packets', I would recommend writing [Serializable] .NET objects to WCF, mainly because WCF can read/write them transparently to MSMQ. This will be easier to work with, but if your data packets are, say, TCP/IP or binary packets, you will need to turn on 'Ordering' to ensure they go into the queue in the exact order they were placed.
MSMQ also has sessions, so if you want to group items together this is possible. WCF does not make this guarantee. You will need to write custom code for this, but it is only a case of assigning a unique ID to each message in a particular session.
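For the WCF route, the contract ends up looking something like this (a hedged sketch assuming netMsmqBinding; the type and operation names are invented):

    using System;
    using System.ServiceModel;

    // The data packet travels as a serializable .NET object; WCF takes care of
    // writing it to and reading it from the underlying MSMQ queue.
    [Serializable]
    public class DataPacket
    {
        public Guid SessionId { get; set; }  // your own ID for grouping related packets
        public int Sequence { get; set; }
        public byte[] Payload { get; set; }
    }

    [ServiceContract]
    public interface IPacketSink
    {
        // Queued WCF operations must be one-way: the sender does not wait for a reply.
        [OperationContract(IsOneWay = true)]
        void Submit(DataPacket packet);
    }

On the binding you would then set durable and exactlyOnce so the messages land in a transactional queue.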
Data loss has been a concern and we would like to not lose any data packet coming from these server programs
MSMQ can persist the data to disk, so if a server goes down, its queue is preserved. MSMQ can also hold the queue in memory, which is more efficient, but crashes/restarts will not retain the queue contents.
and want an efficient and performant solution
MSMQ is fairly performant. The persistence to disk has a small overhead, but only due to the disk write. If performance includes multi-threading, MSMQ does not offer this feature, as the queue is sequential and must be processed in order. But this is typical of queue technologies.
MSMQ also has a 4 MB maximum message size, so keep in mind what you want to send across the network.
The only other thing is that MSMQ is not massively scalable. Its primary goal is guaranteed delivery. If you post millions of packets, they will get to their destination, but MSMQ does have a finite ability to push the messages to other machines. It operates a ThreadPool-like system, so it will not scale if this is also a requirement.
I have also added info to the #msmq-wcf wiki with a basic example of writing data.