Mongodb: How to avoid locking on big collection updates

I have an events collection of 2,502,011 elements and would like to perform an update on all of them. Unfortunately I am facing a lot of MongoDB page faults due to the write lock.
Question: How can I avoid those faults in order to be sure that all my events are correctly updated?
Here is the information about my events collection:
> db.events.stats()
{
    "count" : 2502011,
    "size" : 2097762368,
    "avgObjSize" : 838.4305136947839,
    "storageSize" : 3219062784,
    "numExtents" : 21,
    "nindexes" : 6,
    "lastExtentSize" : 840650752,
    "paddingFactor" : 1.0000000000874294,
    "systemFlags" : 0,
    "userFlags" : 0,
    "totalIndexSize" : 1265898256,
    "indexSizes" : {
        "_id_" : 120350720,
        "destructured_created_at_1" : 387804032,
        "destructured_updated_at_1" : 419657728,
        "data.assigned_author_id_1" : 76053152,
        "emiting_class_1_data.assigned_author_id_1_data.user_id_1_data.id_1_event_type_1" : 185071936,
        "created_at_1" : 76960688
    }
}
Here is what an event looks like:
> db.events.findOne()
{
    "_id" : ObjectId("4fd5d4586107d93b47000065"),
    "created_at" : ISODate("2012-06-11T11:19:52Z"),
    "data" : {
        "project_id" : ObjectId("4fc3d2abc7cd1e0003000061"),
        "document_ids" : [
            "4fc3d2b45903ef000300007d",
            "4fc3d2b45903ef000300007e"
        ],
        "file_type" : "excel",
        "id" : ObjectId("4fd5d4586107d93b47000064")
    },
    "emiting_class" : "DocumentExport",
    "event_type" : "created",
    "updated_at" : ISODate("2013-07-31T08:52:48Z")
}
I would like to update each event to add 2 new fields based on the existing created_at and updated_at. Please correct me if I am wrong, but it seems you can't use the mongo update command when you need to access the current element's data along the way.
This is my update loop:
db.events.find().forEach(
    function (e) {
        var created_at = new Date(e.created_at);
        var updated_at = new Date(e.updated_at);
        e.destructured_created_at = [e.created_at]; // omitted the actual values
        e.destructured_updated_at = [e.updated_at]; // omitted the actual values
        db.events.save(e);
    }
)
When running the above command, I get a huge amount of page faults due to the write lock on the database.

I think you are confused here: it is not the write lock causing that, it is MongoDB querying for your update documents. The lock does not exist during a page fault (in fact it only exists when actually updating, or rather saving, a document to disk); it gives way to other operations.
The lock is more of a mutex in MongoDB.
Page faults on this amount of data are perfectly normal; since you obviously do not query this data often, I am unsure what you are expecting to see. I am definitely unsure what you mean by your question:
Question: How can I avoid those faults in order to be sure that all my events are correctly updated?
OK, the problem you may be seeing is that you are getting page thrashing on that machine, which in turn destroys your IO bandwidth and floods your working set with data that is not needed. Do you really need to add this field to ALL documents eagerly; can it not be added on demand by the application when that data is used again?
Another option is to do this in batches.
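For example, here is a minimal mongo shell sketch of a batched version, assuming you page through the collection by _id and throttle between batches (the batch size and sleep are arbitrary, and the destructured values are placeholders as in the question):
var batchSize = 1000;
var lastId = ObjectId("000000000000000000000000");
var more = true;

while (more) {
    // Fetch the next batch ordered by _id so each pass resumes where the previous one stopped.
    var batch = db.events.find({ _id: { $gt: lastId } })
                         .sort({ _id: 1 })
                         .limit(batchSize)
                         .toArray();
    more = batch.length > 0;

    batch.forEach(function (e) {
        e.destructured_created_at = [e.created_at]; // placeholder values, as in the question
        e.destructured_updated_at = [e.updated_at]; // placeholder values, as in the question
        db.events.save(e);
        lastId = e._id;
    });

    sleep(100); // give the working set and IO some room between batches
}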
One feature you could make use of here is priority queues that dictate that such an update is a background task that shouldn't affect the current workings of your mongod too much. I hear such a feature is due (can't find the JIRA :/).
Please correct me if I am wrong, but it seems you can't use the mongo update command when you need to access the current element's data along the way.
You are correct.

Related

MongoListener + Spring Detect updated fields in Document

I have a Spring Boot application + MongoDB and I need to audit every update made to specified fields of a collection (for data analysis purposes).
If I have a collection like:
{
    "_id" : ObjectId("12345678910"),
    "label_1" : ObjectId("someIdForLabel1"),
    "label_2" : ObjectId("someIdForLabel2"),
    "label_3" : ObjectId("someIdForLabel"),
    "name" : "my data",
    "description" : "some curious stuff",
    "updatedAt" : ISODate("2022-06-21T08:28:23.115Z")
}
I want to write an audit document whenever a label_* is updated. Something like
{
    "_id" : ObjectId("111213141516"),
    "modifiedDocument" : ObjectId("12345678910"),
    "modifiedLabel" : "label_1",
    "newValue" : ObjectId("someNewIdForLabel1"),
    "updatedBy" : ObjectId("userId"),
    "updatedAt" : ISODate("2022-06-21T08:31:20.315Z")
}
How can I achieve this with MongoListener? I already have two methods for AfterSave and AfterDelete, for other purposes, but they give me the whole new document.
I would rather avoid querying the DB again or using a findAndModify() in the first place.
I had a look at change streams too, but I have too many doubts when it comes to running more than one instance.
Thank you so much, any tip will be appreciated!
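For reference, here is a rough mongo shell sketch of the change-stream idea mentioned above (the collection and audit-collection names are hypothetical, it does not address the multi-instance concern, and a Spring application would use the driver's change-stream support instead):
var cursor = db.myCollection.watch([ { $match: { operationType: "update" } } ]);

while (cursor.hasNext()) {
    var change = cursor.next();
    var updated = change.updateDescription.updatedFields;

    // Write one audit document per modified label_* field.
    Object.keys(updated).forEach(function (field) {
        if (field.indexOf("label_") === 0) {
            db.auditLog.insertOne({
                modifiedDocument: change.documentKey._id,
                modifiedLabel: field,
                newValue: updated[field],
                updatedAt: new Date()
            });
        }
    });
}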

Mongodb tail subdocuments

I have a collection with users. Each user has comments. I want to track, for some specific users (according to their ids), whether there is a new comment.
Tailable cursors are, I guess, what I need, but my main problem is that I want to track subdocuments and not documents.
Sample of tracking documents in Python:
import time
from pymongo import Connection  # legacy pymongo client, as in the original sample

db = Connection().my_db
coll = db.my_collection
cursor = coll.find(tailable=True)
while cursor.alive:
    try:
        doc = cursor.next()
        print doc
    except StopIteration:
        time.sleep(1)
One solution is to poll at some interval and see whether the number of comments has changed. However, I do not find the polling solution very appealing. Is there any better way to track changes, probably with tailable cursors?
PS: I have a comment_id field (which is an ObjectID) in each comment.
Small update:
Since I have the comment_id ObjectId, I can store the biggest (= latest) one for each user, then poll at intervals and check whether it is still the latest one. It does not have to be a precisely real-time method; even a 10-minute delay is fine. However, I now have 70k users and 180k comments, so I worry about the scalability of this method.
This would be my solution; evaluate whether it fits your requirements.
I am assuming a data structure as follows:
db.user.find().pretty()
{
    "_id" : ObjectId("5335123d900f7849d5ea2530"),
    "user_id" : 200,
    "comments" : [
        {
            "comment_id" : 1,
            "comment" : "hi",
            "createDate" : ISODate("2012-01-01T00:00:00Z")
        },
        {
            "comment_id" : 2,
            "comment" : "bye",
            "createDate" : ISODate("2013-01-01T00:00:00Z")
        }
    ]
}
{
    "_id" : ObjectId("5335123e900f7849d5ea2531"),
    "user_id" : 201,
    "comments" : [
        {
            "comment_id" : 3,
            "comment" : "hi",
            "createDate" : ISODate("2012-01-01T00:00:00Z")
        },
        {
            "comment_id" : 4,
            "comment" : "bye",
            "createDate" : ISODate("2013-01-01T00:00:00Z")
        }
    ]
}
I added a createDate attribute to the comment documents. Add an index as follows:
db.user.ensureIndex({"user_id":1,"comments.createDate":-1})
You can search for the latest comments with this query:
db.user.find({"user_id":200,"comments.createDate":{$gt:ISODate('2012-12-31')}})
The time used for the "greater than" comparison would be the last checked time. Since you are using an index, the search will be fast. You can follow the same idea of checking for new comments at some interval.
You can also use a UTC timestamp instead of ISODate; that way you don't have to worry about the BSON data type.
Note that while creating the index on createDate, I have specified a descending index.
If, over a period of time, you end up with too many comments within a user document, I would suggest that you move the comments to a different collection, with user_id as one of the attributes in each comment document. That will give better performance in the long run; a sketch of that layout follows below.
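A rough mongo shell sketch of that layout, assuming a hypothetical comments collection and a stored last-checked time:
// One document per comment, flattened out of the user document.
db.comments.insert({
    user_id: 200,
    comment_id: 5,
    comment: "hello again",
    createDate: new Date()
});

// Same index idea as above, now on the flat collection.
db.comments.ensureIndex({ user_id: 1, createDate: -1 });

// Poll for comments newer than the last time you checked.
var lastChecked = ISODate("2013-06-01T00:00:00Z");
db.comments.find({ user_id: 200, createDate: { $gt: lastChecked } }).sort({ createDate: -1 });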

Strange performance test results with different write concerns in MongoDB + C#

I'm trying to test the performance of MongoDB before actually putting it to use. I'm trying to see how many documents I can update per second. I'm using the C# MongoDB driver v1.9 (Mono + Ubuntu) and MongoDB v2.4.6.
I believe one of the MongoDB parameters with the biggest effect on write performance is the write concern. As stated in the documentation, the most relaxed value for the write concern would be -1, then 0, and finally 1 is the slowest one.
After searching I found that I can set the write concern in C#, embedded in the connection string, like this:
var client = new MongoClient("mongodb://localhost/?w=-1");
Here are the results of me playing around with different values for w:
Fastest results are achieved when I set w to 1!
Setting w=0 is 28 times slower than w=1!
w=-1 leads to an exception being thrown with the error message "W value must be greater than or equal to zero"!
Does anyone have any explanation for these results? Am I doing anything wrong?
[UPDATE]
I think it is necessary to describe the test procedure; maybe there's something hidden within it. So here it goes:
I've got a database with 100M documents in a single collection. Each document is created like this using the mongo shell:
{ "counter" : 0, "last_update" : new Date() }
Here's the output of db.collection.stats():
{
    "ns" : "test.collection",
    "count" : 100000100,
    "size" : 6400006560,
    "avgObjSize" : 64.0000015999984,
    "storageSize" : 8683839472,
    "numExtents" : 27,
    "nindexes" : 2,
    "lastExtentSize" : 2146426864,
    "paddingFactor" : 1,
    "systemFlags" : 1,
    "userFlags" : 0,
    "totalIndexSize" : 5769582448,
    "indexSizes" : {
        "_id_" : 3251652432,
        "last_update_1" : 2517930016
    },
    "ok" : 1
}
Using Mono 3.2.1 on Ubuntu 12.04, I've written a C# project which connects to MongoDB and tries to update the documents like this:
FindAndModifyArgs args = new FindAndModifyArgs();
args.SortBy = SortBy.Ascending("last_update");
args.Update = Update<Entity>.Set(e => e.last_update, DateTime.Now);
args.Fields = Fields.Include(new string[] { "counter", "_id" });
var m = collection.FindAndModify(args);
Entity ent = m.GetModifiedDocumentAs<Entity>();
var query = Query<Entity>.EQ(e => e.Id, ent.Id);
var update = Update<Entity>.Set(e => e.counter, ent.counter+1);
collection.Update(query, update);
To summarize what this piece of code does: it selects the document with the oldest last_update, and while it sets last_update to the current date it also increments its counter (the update happens in two steps).
I ran this code 10k times for each of four different write concerns: w=-1, w=0, w=1 and w=1&j=true. While w=-1 throws an exception and gives no results, here are the results for the rest of them:
Since the figure is a little hard to read, here are the same results in numbers:
            w=-1    w=0               w=1               w=1&j=true
Average     N/A     244.0948611492    7064.5143923477   238.1846428156
STDEV       N/A     1.7787457992      511.892765742     21.0230097306
And the question is: does anyone have an explanation of why w=0 is much slower than w=1, and why w=-1 is not supported?
[UPDATE]
I've also tested RequestStart in my code like this:
using (server.RequestStart(database)) {
    FindAndModifyArgs args = new FindAndModifyArgs();
    args.SortBy = SortBy.Ascending("last_update");
    args.Update = Update<Entity>.Set(e => e.last_update, DateTime.Now);
    args.Fields = Fields.Include(new string[] { "counter", "_id" });
    var m = collection.FindAndModify(args);
    Entity ent = m.GetModifiedDocumentAs<Entity>();
    var query = Query<Entity>.EQ(e => e.Id, ent.Id);
    var update = Update<Entity>.Set(e => e.counter, ent.counter + 1);
    collection.Update(query, update);
}
It had no significant effect on any of the results whatsoever.
To address the w=-1 error, you may want to use the WriteConcern.Unacknowledged property directly.
Your test code might be running into consistency problems. Look into RequestStart/RequestDone to put the query onto the same connection. Per the documentation:
An example of when this might be necessary is when a series of Inserts are called in rapid succession with a WriteConcern of w=0, and you want to query that data in a consistent manner immediately thereafter (with a WriteConcern of w=0, the writes can queue up at the server and might not be immediately visible to other connections).
Aphyr wrote a good blog post on how various settings of MongoDB's write concern affect its approximation of consistency. It may help you troubleshoot this issue.
Finally, you can find your driver version in build.ps1.

event streaming via mongodb. Get last inserted events

I am consuming data from an existing database. This database stores system events. My service should check this database on a timer, see whether new events have been created, and then load and handle them. Something like a simple queue implementation.
The question is: how can I get the new docs each time I check the database? I can't use timestamps, because events go into the database from different sources and there is no ordering among them, so I need to rely on insertion order only.
There are a couple of options.
The first, and easiest if it matches your use case, is to use a capped collection. A capped collection is a collection with a pre-defined size that acts as a sort of ring buffer: once the collection is full, it starts overwriting the oldest documents. To iterate over the collection you simply create a "tailable" cursor. You will need some way of identifying the last document processed (even a simple "done" flag in the document could work, but it would have to exist when the document is inserted). If you truly can't modify the documents in any way, then you could save off the last processed document somewhere, use a coarse timestamp to approximate the start position, and look for that last document before processing more documents.
The only real issue with this solution is that you are limited in the number of documents you can keep in the collection and it won't grow over time. There are also limits on the write operations you can perform on the documents (they can't grow), but it does not sound like you are modifying the documents.
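A minimal mongo shell sketch of the capped-collection approach (the collection name and size are arbitrary, and the cursor options shown are the 2.x-era shell syntax):
// Create a capped (fixed-size) collection; it behaves like a ring buffer.
db.createCollection("system_events", { capped: true, size: 100 * 1024 * 1024 });

// A tailable cursor returns documents in insertion order and stays open,
// waiting for new documents to arrive.
var cursor = db.system_events.find()
                             .addOption(DBQuery.Option.tailable)
                             .addOption(DBQuery.Option.awaitData);

while (cursor.hasNext()) {
    var doc = cursor.next();
    printjson(doc); // hand the event off to your processing logic here
}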
The second option, which is more complex, is to use the oplog. For a standalone configuration you will still need to pass the --replSet option to create and use the oplog; you just won't add any other members to the replica set. In a sharded configuration you will need to track each replica set separately. The oplog contains a document for each insert, update and delete done to all collections/documents on the server. Each entry contains a timestamp, operation and id (at a minimum). Here are examples of each.
Insert
{ "ts" : { "t" : 1362958492000, "i" : 1 },
"h" : NumberLong("5915409566571821368"), "v" : 2,
"op" : "i",
"ns" : "test.test",
"o" : { "_id" : "513d189c8544eb2b5e000001" } }
Delete
{ ... "op" : "d", ..., "b" : true,
"o" : { "_id" : "513d189c8544eb2b5e000001" } }
Update
{ ... "op" : "u", ...,
"o2" : { "_id" : "513d189c8544eb2b5e000001" },
"o" : { "$set" : { "i" : 1 } } }
The timestamps are generated on the server and are guaranteed to be monotonically increasing, which allows you to quickly find the documents of interest.
This option is the most robust but requires some work on your part.
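A rough mongo shell sketch of reading the oplog for inserts into a single namespace, starting from the last timestamp you processed (the namespace and start value are placeholders):
// The oplog lives in the "local" database (oplog.rs when started with --replSet).
var local = db.getSiblingDB("local");

// Resume from the last timestamp you handled; Timestamp(0, 0) means "from the beginning".
var lastTs = Timestamp(0, 0);

var cursor = local.oplog.rs.find({
    ts: { $gt: lastTs },
    op: "i",          // inserts only
    ns: "test.test"   // placeholder namespace: <db>.<collection>
}).addOption(DBQuery.Option.oplogReplay);

while (cursor.hasNext()) {
    var entry = cursor.next();
    printjson(entry.o); // the inserted document
    lastTs = entry.ts;  // remember where to resume next time
}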
I wrote some demo code to create a "watcher" on a collection that is almost what you want. You can find that code on GitHub. Specifically look at the code in the com.allanbank.mongodb.demo.coordination package.
HTH, Rob
You can actually use timestamps if your _id is of type ObjectId:
prefix = Math.floor((new Date( 2013 , 03 , 11 )).getTime()/1000).toString(16)
db.foo.find( { _id : { $gt : new ObjectId( prefix + "0000000000000000" ) } } )
This way, it doesn't matter where the event came from or when it happened; all that matters is when the document insertion was recorded (higher than the value from the previous timer run).
Of course, it is schema-less, so you can always set a field such as isNew to true on insert and set it to false in conjunction with your query / cursor.
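Building on that, a small polling sketch that remembers the last _id it has seen (using the same db.foo collection as above):
var lastSeen = ObjectId("000000000000000000000000"); // start from the beginning

function pollNewEvents() {
    db.foo.find({ _id: { $gt: lastSeen } }).sort({ _id: 1 }).forEach(function (doc) {
        printjson(doc); // hand the document off to your handler
        lastSeen = doc._id;
    });
}

// Call pollNewEvents() from your timer; each run only sees documents
// inserted since the previous run.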

Clean up content after application bug

I'm new to MongoDB, and the point is that we just realized a bug in our application which results in multiple MongoDB entries instead of updating the (edited) document.
Now the application is online, we have realized the bug, and we are trying to manage the trouble that comes with it.
The situation is as follows: in MongoDB there are lots of documents containing this structure:
"_id" : ObjectId("4fd9ede5a6b9579f5b000003"),
"USER" : "my_username",
"matchID" : 18809,
"data1" : 2,
"data2" : 1,
"tippDate" : ISODate("2012-06-14T13:57:57Z"),
"data3" : 0
If the user changes the data in the application, the application inserts a new document instead of updating the existing one.
Like this:
{
    "_id" : ObjectId("4fd9ede5a6b9579f5b000003"),
    "USER" : "my_username",
    "matchID" : 18809,
    "data1" : 2,
    "data2" : 1,
    "tippDate" : ISODate("2012-06-14T13:57:57Z"),
    "data3" : 0
}
{
    "_id" : ObjectId("4fd9ede5a6b9579f5b000002"),
    "USER" : "my_username",
    "matchID" : 18809,
    "data1" : 4,
    "data2" : 2,
    "tippDate" : ISODate("2012-06-14T12:45:33Z"),
    "data3" : 0
}
Right now the bug on the application side is fixed, but we still have to clean up the database.
The goal is to keep only the newest record/document for each user.
One way is to handle this on the application side: load all the data for one user, order it by date, remove all of that user's data, and write the newest entry back to MongoDB.
But isn't it possible to do this in MongoDB, like a delete with joins in MySQL?
Thank you for any kind of help or hints!
Isn't it possible to do this in MongoDB, like a delete with joins in MySQL?
No. MongoDB does not support joins at all.
However, MongoDB does have sorting, so you can run a script that fetches each user's documents, sorts them by date, and then deletes the old ones.
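A rough mongo shell sketch of such a script, assuming the collection is called tipps (a placeholder name) and that "newest" means the highest tippDate per USER/matchID pair:
// For every USER/matchID pair, keep only the document with the newest tippDate.
db.tipps.distinct("USER").forEach(function (user) {
    db.tipps.distinct("matchID", { USER: user }).forEach(function (match) {
        var docs = db.tipps.find({ USER: user, matchID: match })
                           .sort({ tippDate: -1 })
                           .toArray();
        // docs[0] is the newest entry; remove everything after it.
        docs.slice(1).forEach(function (old) {
            db.tipps.remove({ _id: old._id });
        });
    });
});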
Also, please note that you can override the _id field; it does not have to be an ObjectId(). Based on your description, you have a unique user name, so why not simply use that as the _id?