MongoDB sequence number based on count in one operation - mongodb

I'm working on creating an immutable, append-only event log in MongoDB. For this I need a generated sequence number, and I can base it on the count of documents, since nothing will ever be removed from the event log. However, I'm trying to avoid doing two operations against MongoDB and would rather it happen in one "transaction" within the database itself.
If I were to do this from the Mongo shell, it would be something like below:
db['event-log'].insertOne({SequenceNumber: db['event-log'].count() +1 })
Is this doable in any way with the regular API?
Prior to v4, there was the possibility of doing eval - which would have made this much easier.
Update
The reason for my need of a sequence number is to be able to guarantee the order in which they were inserted when reading them back. Default behavior of Mongo is to retrieve them in the $natural order and one can explicitly define that on .find() as well (read more here). Although documentation is clear on not relying on it, it seems that as long as there are no modifications / removal of documents already there, it should be fine from what I can gather.
I also realized that I might get around this another way: I'm going to introduce an Actor framework, and I could make my committer a stateful actor that holds the sequence number if I need it.
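The stateful-committer idea could be sketched roughly as below. This is only an illustration under stated assumptions: there is exactly one committer instance, the class name EventCommitter and its stamp method are hypothetical, and the starting value would have to be loaded from the existing log at startup. The actual insert is left to the caller.

```javascript
// Hypothetical stateful committer: the single owner of the sequence counter.
// Assumption: only one instance of this object ever stamps events.
class EventCommitter {
  // lastSequenceNumber would be loaded once at startup, e.g. from the
  // highest SequenceNumber already stored; 0 means an empty log.
  constructor(lastSequenceNumber = 0) {
    this.lastSequenceNumber = lastSequenceNumber;
  }

  // Stamp an event with the next sequence number and return the document
  // that the caller should insert into the event log.
  stamp(event) {
    this.lastSequenceNumber += 1;
    return { ...event, SequenceNumber: this.lastSequenceNumber };
  }
}

const committer = new EventCommitter();
const docs = ["created", "renamed", "archived"].map((type) =>
  committer.stamp({ type })
);
console.log(docs.map((d) => d.SequenceNumber)); // the numbers 1, 2, 3 in order
```

Because the number is assigned in memory before the insert, a unique index on SequenceNumber would still be a sensible safety net in case a second committer is ever started by mistake.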

Related

MongoDB watch single document [scalability]

This is MongoDB's api:
db.foo.watch([{$match: {"bar.baz": "qux" }}])
Let's say that collection foo contains millions of documents. The arguments passed into watch indicate that for every single document that gets updated the system will filter the ones that $match the query (but it will be triggered behind the scenes with any document change).
The problem is that as my application scales, my listeners will also scale and my intuition is that I will end up having n^2 complexity with this approach.
I think that as I add more listeners, database performance will deteriorate due to changes to documents that are not part of the $match query. There are other ways to deal with this, (web sockets & rooms) but before prematurely optimizing the system, I would like to know if my intuition is correct.
Actual Question:
Can I attach a listener to a single document, such that watch's performance isn't affected by sibling documents?
When I do collection.watch([$matchQuery]), does the MongoDB driver listen to all documents and then filter out the relevant ones? (This is what I am trying to avoid.)
The code collection.watch([$matchQuery]) actually means watch the change stream for that collection rather than the collection directly.
As far as I know, there is no way to add a listener to a single document. Since I do not know of any way, I will give you a couple tips on how to avoid scalability problems with the approach you have chosen. Your code appears to be using change streams. It should not cause problems unless you open too many change streams.
There are two ways to accomplish this by watching the entire collection from an outside process, neither of which should degrade database performance.
If you use change streams, you can open only a single change stream with logic that checks for all the conditions you need to filter for over time. The mistake is that people often open many change streams for single document filtering tasks, and that is when people have problems.
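A minimal sketch of the "single change stream with your own filtering logic" idea might look like the following. The ChangeRouter class and its method names are hypothetical; the only assumption about MongoDB is that each change event carries documentKey._id and operationType, and that the one stream is opened elsewhere with collection.watch().

```javascript
// Hypothetical router: one change stream feeds many per-document listeners.
class ChangeRouter {
  constructor() {
    this.handlers = new Map(); // String(_id) -> Set of handler functions
  }

  // Register a listener for one document id; returns an unsubscribe function.
  subscribe(id, handler) {
    const key = String(id);
    if (!this.handlers.has(key)) this.handlers.set(key, new Set());
    this.handlers.get(key).add(handler);
    return () => {
      const set = this.handlers.get(key);
      if (!set) return;
      set.delete(handler);
      if (set.size === 0) this.handlers.delete(key);
    };
  }

  // Called once per change event. Events for documents nobody watches
  // cost a single Map lookup. Returns the number of handlers invoked.
  dispatch(change) {
    const set = this.handlers.get(String(change.documentKey._id));
    if (!set) return 0;
    for (const handler of set) handler(change);
    return set.size;
  }
}

// With the Node.js driver this would be wired up roughly as:
//   collection.watch().on("change", (change) => router.dispatch(change));
const router = new ChangeRouter();
const seen = [];
const unsubscribe = router.subscribe("a1", (change) => seen.push(change.operationType));
router.dispatch({ operationType: "update", documentKey: { _id: "a1" } });
router.dispatch({ operationType: "update", documentKey: { _id: "b2" } }); // no listener
unsubscribe();
router.dispatch({ operationType: "delete", documentKey: { _id: "a1" } }); // ignored now
```

The point of the design is that adding listeners grows an in-process Map rather than the number of change streams open against the database.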
The simpler way, since you mentioned Atlas, is to use Triggers. You can use something called a match expression in your Triggers configuration so that the trigger function does not run for an operation on your collection unless the match expression evaluates to true. As noted in the documentation, with the expression below the trigger function will not execute unless the field status is updated to "blocked", but many other match expressions are possible:
{
  "updateDescription.updatedFields": {
    "status": "blocked"
  }
}
I hope this helps. If not, I can keep digging. I think with change streams or Triggers, you are ok if you want to write a bit of code. :)

change collection data formats within Meteor

Looking for the best way to fix data formats in my Meteor app. When I started, I wasn't using anything like SimpleSchema or being as consistent as I should have been with Date formats.
So now I'd like to get everything back to proper Date objects.
I'm still new-ish to Mongo, and I was a little surprised to find- and please correct me if I'm wrong- that there's no way to update all records and modify an attribute using its current value. I've got timestamps that came from an API POST that might be Strings, epoch times from new Date().getTime(), some actual Dates, etc.
I plan to use moment(currentValue).toDate() to fix this. I'm using percolate:migrations for data changes 1) so that changes stay in my repo and 2) so data is consistent wherever the app is run. I've looked at this question and I assume I'll need to iterate over my collections. But snapshot() isn't available in Meteor.
Do I need to write and manually run a mongo script for this?
Generally I prefer to run migration scripts from the mongo shell, since it's easier to execute (compared to deploying the code that runs the migration) and it gives you access to the full mongo API. You can run load("path/to/script") in the mongo console if you want to predefine your script.
snapshot() ensures you won't modify the same document twice. From the MongoDB docs:
Append the snapshot() method to a cursor to toggle the “snapshot” mode. This ensures that the query will not return a document multiple times, even if intervening write operations result in a move of the document due to the growth in document size.
Running without snapshot() could result in passing a Date object (one that was just converted) to your update function. Since you are already planning to cover this case (you say you already have some Date objects in your db), it doesn't change much. Ergo, you can run this from Meteor without snapshot(), but you might as well use the shell to get used to it :)
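And you are correct that, in the MongoDB versions this was written for, there is no way to update a document based on its current value in a single update (pipeline-style updates that can do this only arrived later, in MongoDB 4.2). Looping through all documents and updating them one by one is rather slow, so if you have a huge collection you might want to schedule some downtime.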
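The loop-and-convert approach could be sketched as below. Note the swap: this uses the built-in Date constructor rather than moment(currentValue).toDate(), so the parsing rules differ from moment's. The helper names toDate and planDateFixes and the createdAt field are hypothetical. The sketch only builds the updates, so it can be tried without a database.

```javascript
// Normalise the mixed timestamp shapes described above into Date objects.
// Returns null for anything that cannot be parsed, so it can be fixed by hand.
function toDate(value) {
  if (value instanceof Date) return value; // already a Date, nothing to do
  let parsed;
  if (typeof value === "number") {
    parsed = new Date(value); // epoch milliseconds, e.g. from new Date().getTime()
  } else if (typeof value === "string") {
    // Digits only: treat as epoch milliseconds; otherwise let Date parse it.
    parsed = /^\d+$/.test(value) ? new Date(Number(value)) : new Date(value);
  } else {
    return null;
  }
  return Number.isNaN(parsed.getTime()) ? null : parsed;
}

// Build one { filter, update } pair per document whose field needs converting.
function planDateFixes(docs, field) {
  const plan = [];
  for (const doc of docs) {
    const current = doc[field];
    if (current instanceof Date) continue; // skip documents that are already correct
    const fixed = toDate(current);
    if (fixed !== null) {
      plan.push({ filter: { _id: doc._id }, update: { $set: { [field]: fixed } } });
    }
  }
  return plan;
}

const sample = [
  { _id: 1, createdAt: "2015-01-01T00:00:00Z" },
  { _id: 2, createdAt: 1420070400000 },
  { _id: 3, createdAt: new Date(0) },
  { _id: 4, createdAt: "not a date" },
];
const plan = planDateFixes(sample, "createdAt");
```

In a migration, each pair would then be passed to the collection's update(filter, update). Because documents that already hold a Date are skipped, re-running the migration, or seeing the same document twice without snapshot(), is harmless.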

Why mongodb only updates the first matching document in the collection?

Consider a collection student contains the following documents.
{name:"Nithin",age:23}
{name:"Nithin",age:25}
{name:"Nithin",age:28}
{name:"Nithin",age:12}
I want to update all the documents whose name is "Nithin" to age=60.
If we execute the following query, it only updates the first document.
db.student.update({name:"Nithin"},{$set:{age:60}})
To update all the documents I have to use the query
db.student.update({name:"Nithin"},{$set:{age:60}},false,true)
or
db.student.update({name:"Nithin"},{$set:{age:60}},{multi:true})
Why doesn't MongoDB update all the documents by default when executing db.student.update({name:"Nithin"},{$set:{age:60}})? What is the motivation for requiring a separate form to update all the documents? Does it improve performance?
Originally, in the very early days of MongoDB (pre 1.1), it was not possible to update multiple documents. This was a feature added around 1.1.3.
You can see it in the release notes, New Feature 268.
I'm guessing this was not enabled by default for backwards compatibility with previous versions.
This may not really be the reason, but I see the additional multi parameter as a safeguard against accidentally updating multiple records when one intends to update only a single document, much like accidentally running UPDATE ... SET in SQL without a WHERE clause.
Again, this is just an assumption and may not really be the case.
I suppose part of the reason might be to keep people coming from the SQL world from thinking of multi-document updates as isolated transactions.
In fact, during a long update MongoDB will periodically yield control to other queries which can potentially modify the same dataset.
So, by explicitly setting multi=true you are somewhat acknowledging this fact (well, not really, but I guess that's the spirit...)
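The difference between the default and multi behaviour can be pictured with a toy in-memory model. This is not MongoDB code: the update function below is a hypothetical stand-in that only mimics "stop after the first match unless multi is set", with Object.assign playing the role of $set.

```javascript
// Toy in-memory model of update(query, {$set: fields}, {multi}); not MongoDB code.
function update(docs, query, fields, options = {}) {
  let modified = 0;
  for (const doc of docs) {
    const matches = Object.keys(query).every((key) => doc[key] === query[key]);
    if (!matches) continue;
    Object.assign(doc, fields); // plays the role of $set
    modified += 1;
    if (!options.multi) break; // default: stop after the first matching document
  }
  return modified;
}

const makeStudents = () =>
  [23, 25, 28, 12].map((age) => ({ name: "Nithin", age }));

// Default: only the first matching document changes.
const single = makeStudents();
const singleCount = update(single, { name: "Nithin" }, { age: 60 });

// multi: every matching document changes.
const all = makeStudents();
const allCount = update(all, { name: "Nithin" }, { age: 60 }, { multi: true });

console.log(singleCount, single.map((s) => s.age));
console.log(allCount, all.map((s) => s.age));
```

Seen this way, the multi flag is an explicit opt-in to touching more than one document, which fits the "safeguard" reading given in the answers above.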

MongoDB. Use cursor as value for $in in next query

Is there a way to use the cursor returned by the previous query as a value for $in in the next query? For example, something like this:
var users = db.user.find({state:1})
var offers = db.offer.find({user:{$in:users}})
I think this can reduce the traffic between MongoDB and the client in case the client doesn't need the user information at all, just the offers. Am I wrong?
Basically you want to do a join between two collections which Mongo doesn't support. You can reduce the amount of data being transferred from the server by limiting the fields returned from the first query to only the unique user information (i.e. the _id) that you need to get data from the offers collection.
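The two-query approach with a trimmed first query could look roughly like this. It is a sketch that assumes the user field in the offer collection stores the user's _id; buildOfferFilter is a hypothetical helper, and the sample array stands in for the result of the first query.

```javascript
// Turn the result of the first query into the filter for the second one.
// `users` stands in for the documents returned by a projection-limited query
// such as db.user.find({state: 1}, {_id: 1}), so only ids cross the wire.
function buildOfferFilter(users) {
  return { user: { $in: users.map((u) => u._id) } };
}

const users = [{ _id: 11 }, { _id: 42 }];
const offerFilter = buildOfferFilter(users);

// In the shell the second query would then be: db.offer.find(offerFilter)
console.log(JSON.stringify(offerFilter));
```

The ids still travel to the client and back, which is the cost the question is asking about; the projection only keeps that round trip as small as possible.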
If you really just want to make one query then you should store more information in the offers collection. For example, if you're trying to find offers for active users then you would store the active state of the user in the offers collection.
To work from your comment:
Yes, that's why I used tag 'join' in a question. The idea is that I
can make a first query more complex using a bunch of fields and
regexes without storing user data in other collections except
references. In these cases I always have to perform two consecutive
queries, but transferring the results of the first query is not
necessary either for me or for MongoDB itself. I just want to
understand whether it could be done now, whether it will be possible in the
future, or whether it cannot be implemented for some technical reason.
As far as I understand it, there is no immediate hurry to make this possible. Also, the way it is coded at the moment would make this quite a big change to the way cursors work and are defined, big enough to possibly break existing implementations for other people. It is much like the question of whether to make safe the default for inserts and updates in all future drivers: it is recognised that safe should be the default, but changing it will break implementations for people who expect it the other way around.
It is rather inefficient if you don't require the results of the first query at all. However, since most networks are built with high traffic in mind and traffic is cheap, there hasn't been much demand for chained queries executed server side in the cursor.
However, subselects (which this basically is: selecting a set of rows based upon a sub-selection of previous rows) have come up on mongodb-user a couple of times, and there might even be a JIRA for it somewhere; if not, it might be useful to create one.
As for doing it right now: there is no way.

Is moving documents between collections a good way to represent state changes in MongoDB?

I have two collections, one (A) containing items to be processed (relatively small) and one (B) with those already processed (fairly large, with extra result fields).
Items are read from A, get processed and save()'d to B, then remove()'d from A.
The rationale is that indices can be different across these, and that the "incoming" collection can be kept very small and fast this way.
I've run into two issues with this:
if either remove() or save() time out or otherwise fail under load, I lose the item completely, or process it twice
if both fail, the side effects happen but there is no record of that
I can sidestep the double-failure case with findAndModify locks (not needed otherwise, we have a process-level lock) but then we have stale lock issues and partial failures can still happen. There's no way to atomically remove+save to different collections, as far as I can tell (maybe by design?)
Is there a Best Practice for this situation?
There's no way to atomically remove+save to different collections, as far as I can tell (maybe by design?)
Yes, this is by design. MongoDB explicitly does not provide joins or transactions. Remove + Save is a form of transaction.
Is there a Best Practice for this situation?
You really have two low-complexity options here, both involve findAndModify.
Option #1: a single collection
Based on your description, you are basically building a queue with some extra features. If you use a single collection, you can use findAndModify to update the status of each item as it is processed.
Unfortunately, that means you will lose this: ...that the "incoming" collection can be kept very small and fast this way.
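Option #1 could be pictured as below. This is an in-memory stand-in, not driver code: the status field and its values "pending", "processing" and "done" are hypothetical names, and claimNext only imitates what a single findAndModify call would do atomically on the server.

```javascript
// In-memory stand-in for an atomic claim such as:
//   db.items.findAndModify({
//     query:  { status: "pending" },
//     update: { $set: { status: "processing" } },
//     new: true
//   })
// Returns the claimed item, or null when nothing is pending.
function claimNext(items) {
  const item = items.find((i) => i.status === "pending");
  if (!item) return null;
  item.status = "processing";
  return item;
}

// Record the outcome on the same document instead of moving it elsewhere.
function finish(item, result) {
  item.status = "done";
  item.result = result;
}

const queue = [
  { _id: 1, status: "pending" },
  { _id: 2, status: "pending" },
];
const first = claimNext(queue);  // claims item 1
const second = claimNext(queue); // claims item 2, item 1 is no longer pending
const third = claimNext(queue);  // nothing left to claim
finish(first, "ok");
```

Since every state change is a write to one document, there is no window in which an item exists in neither place or in both.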
Option #2: two collections
The other option is basically a two phase commit, leveraging findAndModify.
Take a look at the docs for this here.
Once an item is processed in A you set a field to flag it for deletion. You then copy that item over to B. Once copied to B you can then remove the item from A.
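The flag, copy, remove sequence above can be sketched with two Maps standing in for collections A and B. All names here (flagProcessed, copyAndRemove, recover, the state and payload fields) are hypothetical. The sketch assumes the copy into B is keyed by _id, so repeating it after a failure overwrites rather than duplicates.

```javascript
// A and B are Maps keyed by _id, standing in for the two collections.

// Step 1: flag the item in A and store the result alongside it.
function flagProcessed(A, id, result) {
  const item = A.get(id);
  if (!item) return false;
  item.state = "processed";
  item.result = result;
  return true;
}

// Steps 2 and 3: copy the flagged item to B, then remove it from A.
// Safe to repeat: the copy is keyed by _id, so a second run just overwrites.
function copyAndRemove(A, B, id) {
  const item = A.get(id);
  if (!item || item.state !== "processed") return false;
  B.set(id, { _id: id, payload: item.payload, result: item.result });
  A.delete(id);
  return true;
}

// Recovery: anything still flagged in A was interrupted, so finish the move.
function recover(A, B) {
  let moved = 0;
  for (const id of [...A.keys()]) {
    if (copyAndRemove(A, B, id)) moved += 1;
  }
  return moved;
}

const A = new Map([
  [1, { _id: 1, payload: "x" }],
  [2, { _id: 2, payload: "y" }],
]);
const B = new Map();

flagProcessed(A, 1, "ok");
copyAndRemove(A, B, 1); // normal path

// Simulate a crash after the copy but before the remove for item 2.
flagProcessed(A, 2, "ok");
B.set(2, { _id: 2, payload: "y", result: "ok" });
const recovered = recover(A, B);
```

The recover step is what a periodic cleanup job would run: it finishes any move that was cut short, without creating a second copy in B.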
I've not tried this myself yet, but the new book 50 Tips and Tricks for MongoDB Developers mentions several times using cron jobs (or services/schedulers) to clean up data like this. You could leave the documents in Collection A flagged for deletion and run a daily job to clear them out, reducing the overall scope of the original transaction.
From what I've learned so far, I'd never leave the database in a state where I rely on the next database action succeeding unless it is the last action (journalling will resend the last db action upon recovery). For example, I have a three phase account registration process where I create a user in CollectionA and then add another related document to CollectionB. When I create the user, I embed the details of the CollectionB document in CollectionA in case the second write fails. Later I will write a process that removes the embedded data from CollectionA if the document in CollectionB exists.
Not having transactions does cause pain points like this, but I think in some cases there are new ways of thinking about it. In my case, time will tell as I progress with my app.