Message queues and database inserts - postgresql

I'm new to message queues and am intrigued by their capabilities and uses. I have an idea about how to use one but wonder if it is the best use of this tool. I have an application that picks up and reads spreadsheets and transforms the data into business objects for database storage. My application needs to read and be able to update several hundred thousand records, but I'm running into performance issues holding onto these objects and bulk inserting into the database.
Would having two different applications (one to read the spreadsheets, one to store the records) connected by a message queue be a proper use of a message queue? Obviously there are some optimizations I need to make in my code, and that is going to be my first step, but I wanted to hear thoughts from those who have used message queues.

It wouldn't be an improper use of the queue, but it's hard to tell whether adding a message queue in your scenario will have any effect on the performance problems you mentioned. We would need more information.
Are you adding one message to a queue to tell a process to convert a spreadsheet, and a second message when the data is ready for loading? Or are you thinking of adding one message per data record? (That might get expensive fast, and probably won't improve performance.)
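To make the difference between one message per record and one message per batch concrete, here is a minimal sketch. It uses Python's standard queue module as a stand-in for a real broker; BATCH_SIZE, chunked, publish_batches and the record shape are all hypothetical names chosen for illustration.

```python
import queue
from itertools import islice

BATCH_SIZE = 1000  # hypothetical; tune against your broker and database


def chunked(records, size):
    """Yield successive lists of at most `size` records."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def publish_batches(records, q, size=BATCH_SIZE):
    """Put one message per batch on the queue and return the message count."""
    count = 0
    for batch in chunked(records, size):
        q.put(batch)
        count += 1
    return count


# Stand-in for a real broker: an in-process queue.
q = queue.Queue()
rows = ({"id": i, "value": i * 2} for i in range(2500))
messages_sent = publish_batches(rows, q)
```

With 2,500 records and a batch size of 1,000 this puts three messages on the queue instead of 2,500, and the consumer can bulk insert each batch in one go.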

Related

Understanding Persistent Entities with streams of data

I want to use Lagom to build a data processing pipeline. The first step in this pipeline is a service using a Twitter client to subscribe to a stream of Twitter messages. For each new message I want to persist the message in Cassandra.
What I don't understand is this: if I model my aggregate root as a List of TwitterMessages, for example, then after running for some time this aggregate root will be several gigabytes in size. There is no need to store all the TwitterMessages in memory, since the goal of this one service is just to persist each incoming message and then publish the message out to Kafka for the next service to process.
How would I model my aggregate root as a Persistent Entity for a stream of messages without it consuming unlimited resources? Is there any example code showing this usage of Lagom?
Event sourcing is a good default go-to, but not the right solution for everything. In your case it may not be the right approach. Firstly, do you need the Tweets persisted, or is it OK to publish them directly to Kafka?
Assuming you need them persisted: aggregates should store in memory whatever they need to validate incoming commands and generate new events. From what you've described, your aggregate doesn't need any data to do that, so your aggregate would not be a list of Twitter messages; rather, it could just be NotUsed. Each time it gets a command, it emits a new event for that Tweet. The thing here is, it's not really an aggregate, because you're not aggregating any state; you're just emitting events in response to commands, with no invariants to enforce. And so you're not really using the Lagom persistent entity API for what it was made for. Nevertheless, it may make sense to use it this way anyway: it's a high-level API that comes with a few useful things, including the streaming functionality. But there are also some gotchas that you should be aware of. If you put all your Tweets in one entity, you limit your throughput to what one core on one node can do sequentially. So maybe you could expect to handle 20 tweets a second; if you ever expect more than that, then you're using the wrong approach, and you'll need to, at a minimum, distribute your tweets across multiple entities.
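One way to distribute tweets across multiple entities is to derive the entity id from a stable hash of the user id. The sketch below only shows that derivation; it is not Lagom API, and NUM_ENTITIES, entity_id_for and the "tweets-N" naming are hypothetical.

```python
import hashlib

NUM_ENTITIES = 16  # hypothetical shard count


def entity_id_for(user_id: str, num_entities: int = NUM_ENTITIES) -> str:
    """Map a user id to one of `num_entities` entity ids, stably across processes."""
    # hashlib is used instead of the built-in hash(), which is salted per process
    # for strings and would send the same user to different entities after a restart.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % num_entities
    return f"tweets-{shard}"
```

All tweets from one user then go to the same entity, so per-user ordering is kept while different users are processed in parallel.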
The other approach would be to simply store the messages directly in Cassandra yourself, and then publish directly to Kafka after doing that. This would be a lot simpler, a lot less mechanics involved, and it should scale very nicely, just make sure you choose your partition key columns in Cassandra wisely - I'd probably partition by user id.

How do I keep the RDBMS and Kafka in sync?

We want to introduce a Kafka event bus into our application, carrying events like EntityCreated or EntityModified, so that other parts of our system can consume from it. The main application uses an RDBMS (i.e. PostgreSQL) under the hood to store the entities and their relationships.
Now the issue is how to make sure that you only send out EntityCreated events on Kafka if you successfully saved to the RDBMS. If you don't make sure that this is the case, you end up with inconsistencies on the consumers.
I see four options, none of which is convincing:
1. Don't care: Very dangerous, as something can go wrong when inserting into an RDBMS.
2. When saving the entity, also save the message which should be sent into its own table. Then have a separate process which consumes from this table, publishes to Kafka, and after a success deletes from this table. This is quite complex to implement and also looks like an anti-pattern.
3. Insert into the RDBMS, keep the (SQL) transaction open until you have written successfully to Kafka, and only then commit. The problem is that you potentially keep the RDBMS transaction open for some time. I don't know how big a problem that is.
4. Do real CQRS which means that you don't save at all to the RDBMS but construct the RDBMS out of the Kafka queue. That seems like the ideal way but is difficult to retrofit to a service. Also there are problems with inconsistencies due to latencies.
I had difficulties finding good solutions on the internet.
Maybe this question is too broad; feel free to point me somewhere it fits better.
When saving the entity, also save the message which should be sent into its own table. Then have a separate process which consumes from this table, publishes to Kafka, and after a success deletes from this table. This is quite complex to implement and also looks like an anti-pattern.
This is, in fact, the solution described by Udi Dahan in his talk: Reliable Messaging without Distributed Transactions. It's actually pretty close to a "best practice"; so it may be worth exploring why you think it is an anti-pattern.
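A minimal sketch of that "save the message in its own table" approach, using sqlite3 from the standard library as a stand-in for the real RDBMS. The table names, create_entity, relay_outbox and the publish callback are all hypothetical; in practice publish would wrap a Kafka producer.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entity (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
)


def create_entity(conn, entity_id, name):
    """Save the entity and its event in ONE transaction."""
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO entity (id, name) VALUES (?, ?)", (entity_id, name))
        event = {"type": "EntityCreated", "id": entity_id, "name": name}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay_outbox(conn, publish):
    """Publish pending events in order; delete each row only after publish succeeds."""
    rows = conn.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # hypothetical: e.g. a Kafka producer send
        with conn:
            conn.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
    return len(rows)


published = []  # stand-in for the Kafka topic
create_entity(conn, 1, "first")
relay_outbox(conn, published.append)
```

If the relay crashes after publishing but before deleting the row, the event is published again on the next run, so this gives at-least-once delivery and consumers should tolerate duplicates.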
Do real CQRS which means that you don't save at all to the RDBMS but construct the RDBMS out of the Kafka queue.
Noooo! That's where the monster is hiding! (see below).
If you were doing "real CQRS", your primary use case would be that your writers make events durable in your book of record, and the consumers would periodically poll for updates. Think "Atom Feed", with the additional constraint that the entries, and the order of entries, are immutable; you can share events, and pages of events; cache invalidation isn't a concern because, since the state doesn't change, the event representations are valid "forever".
This also has the benefit that your consumers don't need to worry about message ordering; the consumers are reading documents of well ordered events with pointers to the prior and subsequent documents.
Furthermore, you've additionally gotten a solution to a versioning story: rather than broadcasting N different representations of the same event, you send out one representation, and then negotiate the content when the consumer polls you.
Now, polling does have latency issues; you can reduce the latency by broadcasting an announcement of the update, and notifying the consumers that new events are available.
If you want to reduce the rate of false polling (waking up a consumer for an event that they don't care about), then you can start adding more information into the notification, so that the consumer can judge whether to pull an update.
Notice that "wake up and maybe poll" is a process that is triggered by a single event in isolation. "Wake up and poll just this message" is another variation on the same idea. We broadcast a thin version of EmailDeliveryScheduled; and the service responsible for that calls back to ask for the email/an enhanced version of the event with the details needed to construct the email.
These are specializations of "wake up and consume the notification". If you have a use case where you can't afford the additional latency required to poll, you can use the state in the representation of the isolated event.
But trying to reproduce an ordered sequence of events when that information is already exposed as a sharable, cacheable document... That's a pretty unusual use case right there. I wouldn't worry about it as a general problem to solve -- my guess is that these cases are rare, and not easily generalized.
Note that all of the above is about messaging, not about Kafka. Notice that messaging and event sourcing are documented as different use cases. Jay Kreps wrote (2013):
I use the term "log" here instead of "messaging system" or "pub sub" because it is a lot more specific about semantics and a much closer description of what you need in a practical implementation to support data replication.
You can think of the log as acting as a kind of messaging system with durability guarantees and strong ordering semantics.
The book of record should be the sole authority for the order of event messages. Any consumer that cares about order should be reading ordered documents from the book of record, rather than reading unordered documents and reconstructing the order.
In your current design....
Now the issue is how to make sure that you only send out EntityCreated events on Kafka if you successfully saved to the RDBMS.
If the RDBMS is the book of record (the source of "truth"), then the Kafka log isn't (yet).
You can get there from here, over a number of gentle steps. Roughly: you add events into the existing database; you read from the existing database to write into Kafka's log; you use Kafka's log as a (time-delayed) source of truth to build a replica of the existing RDBMS; you migrate your read use cases to the replica; you migrate your write use cases to Kafka; and you decommission the legacy database.
Kafka's log may or may not be the book of record you want. Greg Young has been developing Get Event Store for quite some time, and has enumerated some of the tradeoffs (2016). Horses for courses - I wouldn't expect it to be too difficult to switch the log from one of these to the other with a well written code base, but I can't speak at all to the additional coupling that might occur.
There is no perfect way to do this if your requirement is to treat SQL and Kafka as a single node. So the questions should be: "Which failures (power failure, hardware failure) can I afford? Which changes (programming, architecture) am I willing to make to my applications?"
For the points you mentioned:
2. What if the node fails after inserting into Kafka but before deleting from SQL?
3. What if the node fails after inserting into Kafka but before committing the SQL transaction?
4. What if the node fails after inserting into SQL but before committing the Kafka offset?
All of them face the risk of data inconsistency (4 is slightly better if the same data cannot be inserted into SQL more than once, for example because it has a primary key that is not generated by the database).
In terms of required changes, 3 is the smallest; however, it will decrease SQL throughput. 4 is the biggest, because your business logic has to deal with two kinds of database (writing to Kafka through a data encoder, reading from SQL through SQL statements), so it has more coupling than the others.
So the choice depends on what your business is. There is no generic way.

MSMQ as a job queue

I am trying to implement a job queue with MSMQ to save myself the time of implementing it in SQL. After reading around, I realized MSMQ might not offer what I am after. Could you please advise me whether my plan is realistic using MSMQ, or recommend an alternative?
I have a number of processes picking up jobs from a queue (I might need to scale out in the future). Once a job is picked up, processing follows; during this time the job is locked to other processes by its status. If needed, the job is chucked back to the queue (its status changes again) for further processing, but physically the job still sits in the queue until completed.
MSMQ doesn't let me keep the message in the queue while working on it; I can only peek or read. Read takes the message out of the queue, and peek doesn't allow changing the message (status).
Thank you
Using MSMQ as a datastore is probably bad as it's not designed for storage at all. Unless the queues are transactional the messages may not even get written to disk.
Certainly updating queue items in-situ is not supported for the reasons you state.
If you don't want a full blown relational DB you could use an in-memory cache of some kind, like memcached, or a cheap object db like raven.
Take a look at RabbitMQ, or many of the other message queues. Most offer this functionality out of the box.
For example, RabbitMQ calls what you are describing Work Queues. Multiple consumers can pull from the same queue and not pull the same item. Furthermore, if you use acknowledgements and the processing fails, the item is not removed from the queue.
.net examples:
https://www.rabbitmq.com/tutorials/tutorial-two-dotnet.html
EDIT: After using MSMQ myself, it would probably work very well for what you are doing, as far as I can tell. The key is to use transactions and multiple queues. For example, each status should have its own queue. It's fairly safe to "move" messages from one queue to another since it occurs within a transaction. This moving of messages is essentially your change of status.
We also use the Message Extension byte array for storing message metadata, like status. This way we don't have to alter the actual message when moving it to another queue.
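The "one queue per status" idea can be sketched in memory as follows. This is not the MSMQ API: StatusQueues and its methods are hypothetical, and a lock stands in for the transaction that makes the move atomic.

```python
import threading
from collections import deque


class StatusQueues:
    """In-memory stand-in: one queue per job status; a lock mimics the transaction."""

    def __init__(self, statuses):
        self._queues = {status: deque() for status in statuses}
        self._lock = threading.Lock()

    def add(self, status, job):
        with self._lock:
            self._queues[status].append(job)

    def move_next(self, src, dst):
        """Atomically move the oldest job from `src` to `dst`; return it, or None if empty."""
        with self._lock:
            if not self._queues[src]:
                return None
            job = self._queues[src].popleft()
            self._queues[dst].append(job)
            return job

    def move_job(self, job, src, dst):
        """Atomically move one specific job from `src` to `dst`."""
        with self._lock:
            self._queues[src].remove(job)  # raises ValueError if the job is not in `src`
            self._queues[dst].append(job)

    def size(self, status):
        with self._lock:
            return len(self._queues[status])


queues = StatusQueues(["pending", "processing", "done"])
queues.add("pending", "job-1")
claimed = queues.move_next("pending", "processing")
```

A worker claims a job by moving it from "pending" to "processing", and the job's status is simply whichever queue it currently sits in.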
MSMQ, and queues in general, require a different set of patterns than what most programmers are used to. Keep that in mind.
Perhaps, if you can give more information on why you need to peek for messages that are currently in process, there would be a way to handle that scenario with MSMQ. You could always add a database for additional tracking.

MongoDB Schema Design - Real-time Chat

I'm starting a project which I think will be particularly suited to MongoDB due to the speed and scalability it affords.
The module I'm currently interested in is to do with real-time chat. If I was to do this in a traditional RDBMS I'd split it out into:
Channel (A channel has many users)
User (A user has one channel but many messages)
Message (A message has a user)
For the purpose of this use case, I'd like to assume that there will typically be 5 channels active at one time, each handling at most 5 messages per second.
Specific queries that need to be fast:
Fetch new messages (based on a bookmark, a time stamp maybe, or an incrementing counter?)
Post a message to a channel
Verify that a user can post in a channel
Bearing in mind that the document limit with MongoDB is 4mb, how would you go about designing the schema? What would yours look like? Are there any gotchas I should watch out for?
I used Redis, NGINX & PHP-FPM for my chat project. Not super elegant, but it does the trick. There are a few pieces to the puzzle.
There is a very simple PHP script that receives client commands and puts them in one massive LIST. It also checks all room LISTs and the user's private LIST to see if there are messages it must deliver. This is polled by a client written in jQuery, and it's done every few seconds.
There is a command line PHP script that operates server side in an infinite loop, 20 times per second, which checks this list and then processes these commands. The script keeps track of who is in what room, and of permissions, in the script's memory; this info is not stored in Redis.
Redis has a LIST for each room and a LIST for each user which operates as a private queue. It also has multiple counters, one for each room the user is in. If the user's counter is less than the total number of messages in the room, then it gets the difference and sends it to the user.
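The counter logic can be sketched in a few lines. This uses plain Python lists and dicts as stand-ins for the Redis LIST and counters; ChatRoom, post and fetch_new are hypothetical names, not part of my actual scripts.

```python
class ChatRoom:
    """In-memory stand-in for one room's LIST plus the per-user counters."""

    def __init__(self):
        self.messages = []   # stand-in for the room's LIST
        self.delivered = {}  # user -> number of messages already sent to that user

    def post(self, user, text):
        self.messages.append((user, text))

    def fetch_new(self, user):
        """Return the messages the user has not seen yet and advance their counter."""
        seen = self.delivered.get(user, 0)
        new = self.messages[seen:]
        self.delivered[user] = len(self.messages)
        return new


room = ChatRoom()
room.post("ann", "hello")
room.post("bob", "hi ann")
first_poll = room.fetch_new("ann")
second_poll = room.fetch_new("ann")
```

The first poll returns both messages and the second returns nothing, because the user's counter has caught up with the room's total.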
I haven't been able to stress test this solution, but at least from my basic benchmarking it could probably handle many thousands of messages per second. There is also the opportunity to port this over to something like Node.js to increase performance. Redis is also maturing and has some interesting features, like the Publish/Subscribe commands, which might be of interest and could possibly remove the polling on the server side.
I looked into Comet-based solutions, but many of them were complicated, poorly documented, or would require me to learn an entirely new language (e.g. Jetty -> Java, APE -> C). Also, delivery and going through proxies can sometimes be an issue with Comet. So that is why I've stuck with polling.
I imagine you could do something similar with MongoDB: a collection per room, a collection per user, and then a collection which maintains counters. You'll still need to write a back-end daemon or script to handle managing where these messages go. You could also use MongoDB's capped collections, which keep the documents sorted and also automatically clear old messages out, but that could complicate maintaining proper counters.
Why use mongo for a messaging system? No matter how fast the static store is (and mongo is very fast), whether mongo or db, to mimic a message queue you're going to have to use some kind of polling, which is not very scalable or efficient. Granted, you're not doing anything terribly intense, but why not just use the right tool for the job? Use a messaging system like Rabbit or ActiveMQ.
If you must use mongo (maybe you just want to play around with it and this project is a good chance to do that?) I imagine you'll have a collection for users (where each user object has a list of the queues that user listens to). For messages, you could have a collection for each queue, but then you'd have to poll each queue you're interested in for messages. Better would be to have a single collection as a queue, as it's easy in mongo to do "in" queries on a single collection, so it'd be easy to do things like "get all messages newer than X in any queues where queue.name in list [a,b,c]".
You might also consider setting up your collection as a mongo capped collection, which just means that you tell mongo when you set up the collection that your collection should only hold X number of bytes, or X number of items. Adding additional items has First-In, First-Out behavior which is pretty much ideal for a message queue. But again, it's not really a messaging system.
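The First-In, First-Out behavior of a size-limited collection is easy to picture with a bounded deque from the standard library. This is only an in-memory analogy for a collection capped by item count, not MongoDB code, and the limit of 3 is arbitrary.

```python
from collections import deque

# A bounded deque drops its oldest item when a new one is appended at capacity.
capped = deque(maxlen=3)
for n in range(1, 6):
    capped.append(f"msg-{n}")

remaining = list(capped)  # only the three newest messages are left, in insertion order
```

A consumer that falls too far behind therefore never sees the dropped messages, which is one reason this is not really a messaging system.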
1) ape-project.org
2) http://code.google.com/p/redis/
3) after you're through all this, you can dump data into mongodb for logging and store consistent data (users, channels) as well

Memcache-based message queue?

I'm working on a multiplayer game and it needs a message queue (i.e., messages in, messages out, no duplicates or deleted messages assuming there are no unexpected cache evictions). Here are the memcache-based queues I'm aware of:
MemcacheQ: http://memcachedb.org/memcacheq/
Starling: http://rubyforge.org/projects/starling/
Depcached: http://www.marcworrell.com/article-2287-en.html
Sparrow: http://code.google.com/p/sparrow/
I learned the concept of the memcache queue from this blog post:
All messages are saved with an integer as key. There is one key that has the next key and one that has the key of the oldest message in the queue. To access these the increment/decrement method is used as its atomic, so there are two keys that act as locks. They get incremented, and if the return value is 1 the process has the lock, otherwise it keeps incrementing. Once the process is finished it sets the value back to 0. Simple but effective. One caveat is that the integer will overflow, so there is some logic in place that sets the used keys to 1 once we are close to that limit. As the increment operation is atomic, the lock is only needed if two or more memcaches are used (for redundancy), to keep those in sync.
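To check my understanding of the integer-key scheme, here is a minimal single-process sketch. FakeCache is a hypothetical in-memory stand-in for a memcache client (unlike real memcached, its incr treats a missing key as 0), and the key names are made up. The lock keys from the post are left out, so the emptiness check in dequeue is not safe with several consumers.

```python
class FakeCache:
    """In-memory stand-in for a memcache client; a real client would make incr atomic."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

    def incr(self, key):
        # Simplification: a missing key counts as 0.
        self._data[key] = self._data.get(key, 0) + 1
        return self._data[key]


def enqueue(cache, message):
    """Claim the next integer key with incr, then store the message under it."""
    slot = cache.incr("queue:tail")
    cache.set(f"queue:msg:{slot}", message)
    return slot


def dequeue(cache):
    """Return the oldest message, or None if the queue is empty."""
    head = cache.get("queue:head") or 0
    tail = cache.get("queue:tail") or 0
    if head >= tail:
        return None
    slot = cache.incr("queue:head")
    message = cache.get(f"queue:msg:{slot}")
    cache.delete(f"queue:msg:{slot}")
    return message


cache = FakeCache()
enqueue(cache, "move north")
enqueue(cache, "attack")
first = dequeue(cache)
```

If the cache evicts a message key, dequeue returns None for that slot even though the counters say a message exists, which is exactly the eviction risk I mentioned above.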
My question is, is there a memcache-based message queue service that can run on App Engine?
I would be very careful using the Google App Engine Memcache in this way. You are right to be worrying about "unexpected cache evictions".
Google expect you to use the memcache for caching data and not storing it. They don't guarantee to keep data in the cache. From the GAE Documentation:
By default, items never expire, though items may be evicted due to memory pressure.
Edit: There's always Amazon's Simple Queue Service. However, this may not meet price/performance levels either as:
There would be the latency of calling from the Google to Amazon servers.
You'd end up paying twice for all the data traffic - paying for it to leave Google and then paying again for it to go in to Amazon.
I have started a Simple Python Memcached Queue, it might be useful:
http://bitbucket.org/epoz/python-memcache-queue/
If you're happy with the possibility of losing data, by all means go ahead. Bear in mind, though, that although memcache generally has lower latency than the datastore, like anything else, it will suffer if you have a high rate of atomic operations you want to execute on a single element. This isn't a datastore problem - it's simply a problem of having to serialize access.
Failing that, Amazon's SQS seems like a viable option.
Why not use Task Queue:
https://developers.google.com/appengine/docs/python/taskqueue/
https://developers.google.com/appengine/docs/java/taskqueue/
It seems to solve the issue without the likely loss of messages in Memcached-based queue.
Until Google implement a proper job queue, why not use the datastore? As others have said, memcache is just a cache and could lose queue items (which would be... bad).
The data-store should be more than fast enough for what you need - you would just have a simple Job model, which would be more flexible than memcache as you're not limited to key/value pairs