Use of MSMQ to control SQL write operations - msmq

Can I use MSMQ to reduce the number of synchronous write operations to a database and instead have the records written to the database every X number of minutes?

You can't reduce the number of write operations by queuing them, but you can use a message queue to cluster the writes together.
That might be a bit more efficient (by dint of sharing a single connection), and could also let you schedule the writes at a convenient time if you wanted to ('every X minutes' wouldn't do that, but you could perform the writes during low usage times).
The increased complexity of that arrangement will normally outweigh the benefits - what do you really want to achieve?

Related

KDB: parallel insertion to table

I created a multi-threaded connections from Java to KDB then have records inserted to a single table concurrently.
But it seems that the sum of the individual duration and the overall duration is almost the same as if no concurrent insertion happened.
Would you know if KDB supports parallel insertion?
If so, is there any setting I should do?
Does it have a record-level or table-level locking?
kdb does not support parallel inserts into in-memory tables. In fact updates to in-memory data may only be made from the q main thread. This means that tables are 'locked' (can't be amended) essentially to all clients if a q server is started with a negative port, and the issue is irrelevant if the q session is in single threaded mode (as most sessions tend to be). The situation is a little different for tables stored on disk (I can expand on that later if required).
In order to accelerate your inserts I would suggest looking at the following:
a) Are the inserts batched, rather than as a series of single inserts? One insert of 1k rows will take much less time that 1k inserts of one row.
b) Are the inserts sent async or sync? Changing between these two may speed up insertion rates but at the cost of knowing if the inserts executed correctly.
Can you share more about your use case? Is your Java client sending market data? if so would a TP style setup be more appropriate? See kdb+ tick and its derivatives such as TorQ (note that TorQ is developed by my employer).
A KDB process is a single-threaded process in general (except when running in multiple slave thread/process mode) https://code.kx.com/q/ref/cmdline/#-s-slaves
Though you have multiple java threads writing data to q process, the data is getting written in KDB in a sequential manner, hence it is not giving any performance benefit. it does not need the table/row level locking due to this
though I would recommend that you stream the data in async mode (negative handle), this will let your java threads come quickly rather than waiting for KDB to complete the operation, this will definitely improve the performance at the writing side.
While using parallel processing mode(slave threads - positive number), the slave threads are not allowed writing to the global tables/variables; you would need to use multi-process mode to achive that(negative number while launching the q process)

Kafka vs. MongoDB for time series data

I'm contemplating on whether to use MongoDB or Kafka for a time series dataset.
At first sight obviously it makes sense to use Kafka since that's what it's built for. But I would also like some flexibility in querying, etc.
Which brought me to question: "Why not just use MongoDB to store the timestamped data and index them by timestamp?"
Naively thinking, this feels like it has the similar benefit of Kafka (in that it's indexed by time offset) but has more flexibility. But then again, I'm sure there are plenty of reasons why people use Kafka instead of MongoDB for this type of use case.
Could someone explain some of the reasons why one may want to use Kafka instead of MongoDB in this case?
I'll try to take this question as that you're trying to collect metrics over time
Yes, Kafka topics have configurable time retentions, and I doubt you're using topic compaction because your messages would likely be in the form of (time, value), so the time could not be repeated anyway.
Kafka also provides stream processing libraries so that you can find out averages, min/max, outliers&anamolies, top K, etc. values over windows of time.
However, while processing all that data is great and useful, your consumers would be stuck doing linear scans of this data, not easily able to query slices of it for any given time range. And that's where time indexes (not just a start index, but also an end) would help.
So, sure you can use Kafka to create a backlog of queued metrics and process/filter them over time, but I would suggest consuming that data into a proper database because I assume you'll want to be able to query it easier and potentially create some visualizations over that data.
With that architecture, you could have your highly available Kafka cluster holding onto data for some amount of time, while your downstream systems don't necessarily have to be online all the time in order to receive events. But once they are, they'd consume from the last available offset and pickup where they were before
Like the answers in the comments above - neither Kafka nor MongoDB are well suited as a time-series DB with flexible query capabilities, for the reasons that #Alex Blex explained well.
Depending on the requirements for processing speed vs. query flexibility vs. data size, I would do the following choices:
Cassandra [best processing speed, best/good data size limits, worst query flexibility]
TimescaleDB on top of PostgresDB [good processing speed, good/OK data size limits, good query flexibility]
ElasticSearch [good processing speed, worst data size limits, best query flexibility + visualization]
P.S. by "processing" here I mean both ingestion, partitioning and roll-ups where needed
P.P.S. I picked those options that are most widely used now, in my opinion, but there are dozens and dozens of other options and combinations, and many more selection criteria to use - would be interested to hear about other engineers' experiences!

Is it possible to use a cassandra table as a basic queue

Is it possible to use a table in cassandra as a queue, I don't think the strategy I use in mysql works, ie given this table:
create table message_queue(id integer, message varchar(4000), retries int, sending boolean);
We have a transaction that marks the row as "sending", tries to send, and then either deletes the row, or increments the retries count. The transaction ensures that only one server will be attempting to process an item from the message_queue at any one time.
There is an article on datastax that describes the pitfalls and how to get around it, however Im not sure what the impact of having lots of tombstones lying around is, how long do they stay around for?
Don't do this. Cassandra is a terrible choice as a queue backend unless you are very, very careful. You can read more of the reasons in Jonathan Ellis blog post "Cassandra anti-patterns: Queues and queue-like datasets" (which might be the post you're alluding to). MySQL is also not a great choice for backing a queue, us a real queue product like RabbitMQ, it's great and very easy to use.
The problem with using Cassandra as the storage for a queue is this: every time you delete a message you write a tombstone for that message. Every time you query for the next message Cassandra will have to trawl through those tombstones and deleted messages and try to determine the few that have not been deleted. With any kind of throughput the number of read values versus the number of actual live messages will be hundreds of thousands to one.
Tuning GC grace and other parameters will not help, because that only applies to how long tombstones will hang around after a compaction, and even if you dedicated the CPUs to only run compactions you would still have dead to live rations of tens of thousands or more. And even with a GC grace of zero tombstones will hang around after compactions in some cases.
There are ways to mitigate these effects, and they are outlined in Jonathan's post, but here's a summary (and I don't write this to encourage you to use Cassandra as a queue backend, but because it explains a bit more about Cassandra works, and should help you understand why it's a bad fit for the problem):
To avoid the tombstone problem you cannot keep using the same queue, because it will fill upp with tombstones quicker than compactions can get rid of them and your performance will run straight into a brick wall. If you add a column to the primary key that is deterministic and depends on time you can avoid some of the performance problems, since fewer tombstones have time to build up and Cassandra will be able to completely remove old rows and all their tombstones.
Using a single row per queue also creates a hotspot. A single node will have to handle that queue, and the rest of the nodes will be idle. You might have lots of queues, but chances are that one of them will see much more traffic than the others and that means you get a hotspot. Shard the queues over multiple nodes by adding a second column to the primary key. It can be a hash of the message (for example crc32(message) % 60 would create 60 shards, don't use a too small number). When you want to find the next message you read from all of the shards and pick one of the results, ignoring the others. Ideally you find a way to combine this with something that depends on time, so that you fix that problem too while you're at it.
If you sort your messages after time of arrival (for example with TIMEUUID clustering key) and can somehow keep track of the newest messages that has been delivered, you can do a query to find all messages after that message. That would mean less thrawling through tombstones for Cassandra, but it is no panacea.
Then there's the issue of acknowledgements. I'm not sure if they matter to you, but it looks like you have some kind of locking mechanism in your schema (I'm thinking of the retries and sending columns). This will not work. Until Cassandra 2.0 and it's compare-and-swap features there is no way to make that work correctly. To implement a lock you need to read the value of the column, check if it's not locked, then write that it should now be locked. Even with consistency level ALL another application node can do the same operations at the same time, and both end up thinking that they locked the message. With CAS in Cassandra 2.0 it will be possible to do atomically, but at the cost of performance.
There are a couple of more answers here on StackOverflow about Cassandra and queues, read them (start with this: Table with heavy writes and some reads in Cassandra. Primary key searches taking 30 seconds.
The grace period can be defined. Per default it is 10 days:
gc_grace_secondsĀ¶
(Default: 864000 [10 days]) Specifies the time to wait before garbage
collecting tombstones (deletion markers). The default value allows a
great deal of time for consistency to be achieved prior to deletion.
In many deployments this interval can be reduced, and in a single-node
cluster it can be safely set to zero. When using CLI, use gc_grace
instead of gc_grace_seconds.
Taken from the
documentation
On a different note, I do not think that implementing a queue pattern in Cassandra is very useful. To prevent your worker to process one entry twice, you need to enforce "ALL" read consistency, which defeats the purpose of distributed database systems.
I highly recommend looking at specialized systems like messaging systems which support the queue pattern natively. Take a look at RabbitMQ for instance. You will be up and running in no time.
Theo's answer about not using Cassandra for queues is spot on.
Just wanted to add that we have been using Redis sorted sets for our queues and it has been working pretty well. Some of our queues have tens of millions of elements and are accessed hundreds of times per second.

Read from mongodb without lock

We're using MongoDB 2.2.0 at work. The DB contains about 51GB of data (at the moment) and I'd like to do some analytics on the user data that we've collected so far. Problem is, it's the live machine and we can't afford another slave at the moment. I know MongoDB has a read lock which may affect any writes that happen especially with complex queries. Is there a way to tell MongoDB to treat my (particular) query with the lowest priority?
In MongoDB reads and writes do affect each other. Read locks are shared, but read locks block write locks from being acquired and of course no other reads or writes are happening while a write lock is held. MongoDB operations yield periodically to keep other threads waiting for locks from starving. You can read more about the details of that here.
What does that mean for your use case? Because there is no way to tell MongoDB to access the data without a read lock, nor is there a way to prioritize the requests (at least not yet) whether the reads significantly affect the performance of your writes depends on how much "headroom" you have available while write activity is going on.
One suggestion I can make is when figuring out how to run analytics, rather than scanning the entire data set (i.e. doing an aggregation query over all historical data) try running smaller aggregation queries on short time slices. This will accomplish two things:
reads jobs will be shorter lived and therefore will finish quicker, this will give you a chance to assess what impact the queries have on your "live" performance.
you won't be pulling all old data into RAM at once - by spacing out these analytical queries over time you will minimize the impact it will have on current write performance.
Depending on what it is you can't afford about getting another server - you might consider getting a short lived AWS instance which may be not very powerful but would be available to run a long analytical query against a copy of your data set. Just be careful when making it a copy of your data - doing a full sync off of the production system will place a heavy load on it (more effective way would be to use a recent backup/file snapshot to resume from).
Such operations are best left for slaves of a replica set. For one thing, read locks can be shared to allow many reads at once, but write locks will block reads. And, while you can't prioritize queries, mongodb yields long running read/write queries. Their concurrency docs should help
If you can't afford another server, you can setup a slave on the same machine, provided you have some spare RAM/Disk headroom, and you use the slave lightly/occasionally. You must be careful though, your disk I/O will increase significantly.

Which mongo stats to use to throttle writes

I am writing logging information asynchronously to mongodb. Since this is an non-essential function, I am looking for a way to throttle these writes so it does not impact read/writes from other part of the application. Essentially, only write when certain stat is below acceptable level.
One stats I thought of using is "globalLock.ratio" from serverStatus. However, this does not seem to be a moving average and not a good way to measure current usage on the database.
What would be a good stats to use for what I am looking to do? Write lock % would be ideal, but how would I get moving average from serverStatus?
There are a number of things to note about your question:
1) If you want moving averages, then you'll need to keep track of them yourself in your client program. If you're running a multi-threaded program, you could dedicate one thread to polling MongoDB at regular (1 second? 5 second?) intervals, and calculating the moving average yourself. This is the way that MMS does it.
2) When you calculate this average, you need to figure out what a 'loaded database' means to you. There could be many things to check: do you care about write lock percentage? read percentage? I/O usage? Replication delay? Unfortunately, there is no single metric that will work for all use cases at all times: you'll have to figure out what you care about and measure that.
3) Another strategy that you could take to achieve this goal is to do the writes to the logging collection using write concern, a 'w' value of 'majority', and a reasonable timeout (say 10 seconds). Using this, you won't be able to write to your database faster than your replication. If you start getting timeouts, you know that you need to scale back. If you can't write fast enough to drain the queue, then you start dropping log entries at that time.