MongoDB watch single document [scalability] - mongodb

This is MongoDB's api:
db.foo.watch([{$match: {"bar.baz": "qux" }}])
Let's say that collection foo contains millions of documents. The arguments passed into watch indicate that for every single document that gets updated the system will filter the ones that $match the query (but it will be triggered behind the scenes with any document change).
The problem is that as my application scales, my listeners will also scale and my intuition is that I will end up having n^2 complexity with this approach.
I think that as I add more listeners, database performance will deteriorate due to changes to documents that are not part of the $match query. There are other ways to deal with this, (web sockets & rooms) but before prematurely optimizing the system, I would like to know if my intuition is correct.
Actual Question:
Can I attach a listener to a single document, such that watch's performance isn't affected by sibling documents?
When I do collection.watch([$matchQuery]), does the MongoDB driver listen to all documents and then filters out the relevant ones? (this is what I am trying to avoid)

The code collection.watch([$matchQuery]) actually means watch the change stream for that collection rather than the collection directly.
As far as I know, there is no way to add a listener to a single document. Since I do not know of any way, I will give you a couple tips on how to avoid scalability problems with the approach you have chosen. Your code appears to be using change streams. It should not cause problems unless you open too many change streams.
There are two ways to accomplish this task by watching the entire collection with a process outside of that won't lead to deterioration of the database performance.
If you use change streams, you can open only a single change stream with logic that checks for all the conditions you need to filter for over time. The mistake is that people often open many change streams for single document filtering tasks, and that is when people have problems.
The simpler way, since you mentioned Atlas, is to use Triggers. You can use something called a match expression in your Triggers configuration to prevent any operations on your collection unless the
match expression evaluates to true. As noted in the documentation, the trigger function will not execute unless a field status in this case is updated to "blocked", but many match expressions are available:
{
"updateDescription.updatedFields": {
"status": "blocked"
}
}
I hope this helps. If not, I can keep digging. I think with change streams or Triggers, you are ok if you want to write a bit of code. :)

Related

MongoDB aggregate for Dashboard

I want to show the data in MongoDB on the dashboard. I implemented it by applying the "Aggregate"
.
I am constantly receiving the "Query Targeting: Scanned Objects / Returned has gone about 1000" alert. How do I solve this alert? The method I thought of is as follows.
Remove the aggregation function from the dashboard: If we need the aggregation data, send a query at that time to obtain the data.
Separate aggregate functions and send queries from business logic: Divide data obtained at once through aggregate functions into multiple queries and then combine the data.
If there is a better way, I wonder if there is a common way.
I am constantly receiving the "Query Targeting: Scanned Objects / Returned has gone about 1000" alert. How do I solve this alert?
What, specifically, are you trying to solve here?
The Query Targeting metric (and associated alert) provides general information regarding the efficiency of the workload against the cluster. It can help with identifying potential problems there, most notably when relevant indexes are missing. Some more information about the metrics and actions that you can take for it are described here.
That said, the metric itself is not perfect. The fact that the targeting ratio is high enough to trigger an alert does not necessarily mean that there is a problem or that any particular action needs to be taken. Particularly notable here is that aggregation operations can cause misleading targeting ratios depending on what types of transformations the pipeline is applying. So the existence of the alert indicates there may be some improvements that could be pursued, but it does not guarantee that there are. You can certainly take a look at the workload using the strategies described in that documentation to determine if any actions like index creation are needed in your specific situation.
The two approaches that you specifically mention in the question could be considered but they kind of don't directly address the alert itself. Certainly if these are heavy aggregations that aren't needed for the application to function then there may be good reason to consider reducing their frequency. But if they are needed by the application and they are structured to be reasonably efficient, then I would not recommend trying to make any drastic adjustments just to avoid triggering the alert. Rather it may be the case that the default query targeting alert is too low for your particular use case and workload and you may consider raising it instead.

MongoDB sequence number based on count in one operation

I'm working on creating an immutable append only event log for MongoDB, in this I need a sequence number genereated and can base it off of the count of documents, since there will be no removals from the event log. However, I'm trying to avoid having to do two operations on MongoDB and would rather it happen in one "transaction" within the database itself.
If I were to do this from the Mongo shell, it would be something like below:
db['event-log'].insertOne({SequenceNumber: db['event-log'].count() +1 })
Is this doable in any way with the regular API?
Prior to v4, there was the possibility of doing eval - which would have made this much easier.
Update
The reason for my need of a sequence number is to be able to guarantee the order in which they were inserted when reading them back. Default behavior of Mongo is to retrieve them in the $natural order and one can explicitly define that on .find() as well (read more here). Although documentation is clear on not relying on it, it seems that as long as there are no modifications / removal of documents already there, it should be fine from what I can gather.
I realized also that I might get around this in another way as well, I'm going to introduce an Actor framework and I could make my committer a stateful actor with the sequence number in it if I need it.

MongoDB. Use cursor as value for $in in next query

Is there a way to use the cursor returned by the previous query as a value for $in in the next query? For example, something like this:
var users = db.user.find({state:1})
var offers = db.offer.find({user:{$in:users}})
I think this can reduce the traffic between mongodb and client in case the client doesn't need user information at all, just offers. Am i wrong?
Basically you want to do a join between two collections which Mongo doesn't support. You can reduce the amount of data being transferred from the server by limiting the fields returned from the first query to only the unique user information (i.e. the _id) that you need to get data from the offers collection.
If you really just want to make one query then you should store more information in the offers collection. For example, if you're trying to find offers for active users then you would store the active state of the user in the offers collection.
To work from your comment:
Yes, that's why I used tag 'join' in a question. The idea is that I
can make a first query more сomplex using a bunch of fields and
regexes without storing user data in other collections except
references. In these cases I always have to perform two consecutive
queries, but transfering of the results of the first query is not
necessary neither for me nor for the mongodb itself. I just want to
understand could it be done now, will it be possible to do so in the
future or it cannot be implemented for some technical reasons
As far as I understand it there is no immediate hurry to make this possible. Also the way it is coded atm will make this quite a big change to the way cursors work and are defined. A change big enough to possibly cause implementation breaks for other people. It is really a case of whether to set safe for inserts and updates for all future drivers. It is recognised that safe should be default but this will break implementation for other people who expect it the other way around.
It is rather inefficient if you don't require the results of the first query at all however since most networks are prepped with high traffic in mind and the traffic is cheap there hasn't been a demand to make it able to do chained queries server side in the cursor.
However subselects (which this basically is, it is selecting a set of rows based upon a sub selection of previous rows) have been on mongodb-user a couple of times and there might even be a JIRA for it somewhere, if not might be useful to make one.
As for doing it right now: there is no way.

It's not possible to lock a mongodb document. What if I need to?

I know that I can't lock a single mongodb document, in fact there is no way to lock a collection either.
However, I've got this scenario, where I think I need some way to prevent more than one thread (or process, it's not important) from modifying a document. Here's my scenario.
I have a collection that contains object of type A. I have some code that retrieve a document of type A, add an element in an array that is a property of the document (a.arr.add(new Thing()) and then save back the document to mongodb. This code is parallel, multiple threads in my applications can do theses operations and for now there is no way to prevent to threads from doing theses operations in parallel on the same document. This is bad because one of the threads could overwrite the works of the other.
I do use the repository pattern to abstract the access to the mongodb collection, so I only have CRUDs operations at my disposition.
Now that I think about it, maybe it's a limitation of the repository pattern and not a limitation of mongodb that is causing me troubles. Anyway, how can I make this code "thread safe"? I guess there's a well known solution to this problem, but being new to mongodb and the repository pattern, I don't immediately sees it.
Thanks
Hey the only way of which I think now is to add an status parameter and use the operation findAndModify(), which enables you to atomically modify a document. It's a bit slower, but should do the trick.
So let's say you add an status attribut and when you retrieve the document change the status from "IDLE" to "PROCESSING". Then you update the document and save it back to the collection updating the status to "IDLE" again.
Code example:
var doc = db.runCommand({
"findAndModify" : "COLLECTION_NAME",
"query" : {"_id": "ID_DOCUMENT", "status" : "IDLE"},
"update" : {"$set" : {"status" : "RUNNING"} }
}).value
Change the COLLECTION_NAME and ID_DOCUMENT to a proper value. By default findAndModify() returns the old value, which means the status value will be still IDLE on the client side. So when you are done with updating just save/update everything again.
The only think you need be be aware is that you can only modify one document at a time.
Hope it helps.
Stumbled into this question while working on mongodb upgrades. Unlike at the time this question was asked, now mongodb supports document level locking out of the box.
From: http://docs.mongodb.org/manual/faq/concurrency/
"How granular are locks in MongoDB?
Changed in version 3.0.
Beginning with version 3.0, MongoDB ships with the WiredTiger storage engine, which uses optimistic concurrency control for most read and write operations. WiredTiger uses only intent locks at the global, database and collection levels. When the storage engine detects conflicts between two operations, one will incur a write conflict causing MongoDB to transparently retry that operation."
Classic solution when you want to make something thread-safe is to use locks (mutexes).
This is also called pessimistic locking as opposed to optimistic locking described here.
There are scenarios when pessimistic locking is more efficient (more details here). It is also far easier to implement (major difficulty of optimistic locking is recovery from collision).
MongoDB does not provide mechanism for a lock. But this can be easily implemented at application level (i.e. in your code):
Acquire lock
Read document
Modify document
Write document
Release lock
The granularity of the lock can be different: global, collection-specific, record/document-specific. The more specific the lock the less its performance penalty.
"Doctor, it hurts when I do this"
"Then don't do that!"
Basically, what you're describing sounds like you've got a serial dependency there -- MongoDB or whatever, your algorithm has a point at which the operation has to be serialized. That will be an inherent bottleneck, and if you absolutely must do it, you'll have to arrange some kind of semaphore to protect it.
So, the place to look is at your algorithm. Can you eliminate that? Could you, for example, handle it with some kind of conflict resolution, like "get record into local' update; store record" so that after the store the new record would be the one gotten on that key?
Answering my own question because I found a solution while doing research on the Internet.
I think what I need to do is use an Optimistic Concurency Control.
It consist in adding a timestamp, a hash or another unique identifier (I'll used UUIDs) to every documents. The unique identifier must be modified each time the document is modified. before updating the document I'll do something like this (in pseudo-code) :
var oldUUID = doc.uuid;
doc.uuid = new UUID();
BeginTransaction();
if (GetDocUUIDFromDatabase(doc.id) == oldUUID)
{
SaveToDatabase(doc);
Commit();
}
else
{
// Document was modified in the DB since we read it. We can't save our changes.
RollBack();
throw new ConcurencyException();
}
Update:
With MongoDB 3.2.2 using WiredTiger Storage implementation as default engine, MongoDB use default locking at document level.It was introduced in version 3.0 but made default in version 3.2.2. Therefore MongoDB now has document level locking.
As of 4.0, MongoDB supports Transactions for replica sets. Support for sharded clusters will come in MongoDB 4.2. Using transactions, DB updates will be aborted if a conflicting write occurs, solving your issue.
Transactions are much more costly in terms of performance so don't use Transactions as an excuse for poor NoSQL schema design!
An alternative is to do in place update
for ex:
http://www.mongodb.org/display/DOCS/Updating#comment-41821928
db.users.update( { level: "Sourcerer" }, { '$push' : { 'inventory' : 'magic wand'} }, false, true );
which will push 'magic wand' into all "Sourcerer" user's inventory array. Update to each document/user is atomic.
If you have a system with > 1 servers then you'll need a distributive lock.
I prefer to use Hazelcast.
While saving you can get Hazelcast lock by entity id, fetch and update data, then release a lock.
As an example:
https://github.com/azee/template-api/blob/master/template-rest/src/main/java/com/mycompany/template/scheduler/SchedulerJob.java
Just use lock.lock() instead of lock.tryLock()
Here you can see how to configure Hazelcast in your spring context:
https://github.com/azee/template-api/blob/master/template-rest/src/main/resources/webContext.xml
Instead of writing the question in another question, I try to answer this one: I wonder if this WiredTiger Storage will handle the problem I pointed out here:
Limit inserts in mongodb
If the order of the elements in the array is not important for you then the $push operator should be safe enough to prevent threads from overwriting each others changes.
I had a similar problem where I had multiple instances of the same application which would pull data from the database (the order did not matter; all documents had to be updated - efficiently), work on it and write back the results. However, without any locking in place, all instances obviously pulled the same document(s) instead of intelligently distributing their workforce.
I tried to solve it by implementing a lock on application level, which would add an locked-field in the corresponding document when it was currently being edited, so that no other instance of my application would pick the same document and waste time on it by performing the same operation as the other instance(s).
However, when running dozens or more instances of my application, the timespan between reading the document (using find()) and setting the locked-field to true (using update()) where to long and the instances still pulled the same documents from the database, making my idea of speeding up the work using multiple instances pointless.
Here are 3 suggestions that might solve your problem depending on your situation:
Use findAndModify() since the read and write operations are atomic using that function. Theoretically, a document requested by one instance of your application should then appear as locked for the other instances. And when the document is unlocked and visible for other instances again, it is also modified.
If however, you need to do other stuff in between the read find() and write update() operations, you could you use transactions.
Alternatively, if that does not solve your problem, a bit of a cheese solution (which might suffice) is making the application pull documents in large batches and making each instance pick a random document from that batch and work on it. Obvisously this shady solution is based on the fact that coincidence will not punish your application's efficieny.
Sounds like you want to use MongoDB's atomic operators: http://www.mongodb.org/display/DOCS/Atomic+Operations

Is moving documents between collections a good way to represent state changes in MongoDB?

I have two collections, one (A) containing items to be processed (relatively small) and one (B) with those already processed (fairly large, with extra result fields).
Items are read from A, get processed and save()'d to B, then remove()'d from A.
The rationale is that indices can be different across these, and that the "incoming" collection can be kept very small and fast this way.
I've run into two issues with this:
if either remove() or save() time out or otherwise fail under load, I lose the item completely, or process it twice
if both fail, the side effects happen but there is no record of that
I can sidestep the double-failure case with findAndModify locks (not needed otherwise, we have a process-level lock) but then we have stale lock issues and partial failures can still happen. There's no way to atomically remove+save to different collections, as far as I can tell (maybe by design?)
Is there a Best Practice for this situation?
There's no way to atomically remove+save to different collections, as far as I can tell (maybe by design?)
Yes this is by design. MongoDB explicitly does not provides joins or transactions. Remove + Save is a form of transaction.
Is there a Best Practice for this situation?
You really have two low-complexity options here, both involve findAndModify.
Option #1: a single collection
Based on your description, you are basically building a queue with some extra features. If you leverage a single collection then you use findAndModify to update the status of each item as it is processing.
Unfortunately, that means you will lose this: ...that the "incoming" collection can be kept very small and fast this way.
Option #2: two collections
The other option is basically a two phase commit, leveraging findAndModify.
Take a look at the docs for this here.
Once an item is processed in A you set a field to flag it for deletion. You then copy that item over to B. Once copied to B you can then remove the item from A.
I've not tried this myself yet but the new book 50 Tips and Tricks for MongoDB Developers mentions a few times about using cron jobs (or services/scheduler) to clean up data like this. You could leave the documents in Collection A flagged for deletion and run daily job to clear them out, reducing the overall scope of the original transaction.
From what I've learned so far, I'd never leave the database in a state where I rely on the next database action succeeding unless it is the last action (journalling will resend the last db action upon recovery). For example, I have a three phase account registration process where I create a user in CollectionA and then add another related document to CollectionB. When I create the user I embed the details of the CollectionB document in CollectionA in case the second write fails. Later I will write a process that removes the embedded data from CollectionA if the document in CollectionB exists
Not having transactions does cause pain points like this, but I think in some cases there are new ways of thinking about it. In my case, time will tell as I progress with my app