Event streaming via MongoDB: get last inserted events

I am consuming data from an existing database that stores system events. My service checks this database on a timer, looks for newly created events, and then loads and handles them, something like a simple queue implementation.
The question is: how can I get the new documents on each check? I can't use timestamps, because events reach the database from different sources and there is no ordering among them. So I need to rely on insertion order only.

There are a couple of options.
The first, and easiest if it matches your use case, is to use a capped collection. A capped collection is a collection with a pre-defined size that acts as a sort of ring buffer: once the collection is full it starts overwriting the oldest documents. To iterate over the collection you simply create a "tailable" cursor. You will need some way of identifying the last document processed (even a simple "done" flag in the document could work, but it would have to exist when the document is inserted). If you truly can't modify the documents in any way, then you could save off the last processed document somewhere, use a coarse timestamp to approximate the start position, and look for that last document before processing new ones.
The only real issue with this solution is that the collection holds a limited number of documents and won't grow over time. There are also limits on the write operations you can perform on the documents (they can't grow), but it does not sound like you are modifying them.
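As a rough sketch (not from the original answer), assuming an events collection and a hypothetical processed flag that is written at insert time, the pieces fit together like this:
db.createCollection("events", { capped : true, size : 100 * 1024 * 1024 })  // ~100 MB ring buffer
// Tailable cursors are only valid on capped collections.
var cursor = db.events.find({ processed : false }).tailable()
while (cursor.hasNext()) {
    var doc = cursor.next()
    // ... handle the event ...
    // Flipping a pre-existing boolean does not grow the document, so this
    // update is legal on a capped collection.
    db.events.update({ _id : doc._id }, { $set : { processed : true } })
}
// When hasNext() returns false the cursor has caught up; sleep and re-check.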
The second option, which is more complex, is to use the oplog. For a standalone configuration you will still need to pass the --replSet option to create and use the oplog; you just never initiate the replica set. In a sharded configuration you will need to tail each replica set's oplog separately. The oplog contains a document for every insert, update, and delete performed against any collection or document on the server. Each entry contains a timestamp, an operation type, and an id (at a minimum). Here are examples of each.
Insert
{ "ts" : { "t" : 1362958492000, "i" : 1 },
  "h" : NumberLong("5915409566571821368"), "v" : 2,
  "op" : "i",
  "ns" : "test.test",
  "o" : { "_id" : "513d189c8544eb2b5e000001" } }
Delete
{ ... "op" : "d", ..., "b" : true,
  "o" : { "_id" : "513d189c8544eb2b5e000001" } }
Update
{ ... "op" : "u", ...,
  "o2" : { "_id" : "513d189c8544eb2b5e000001" },
  "o" : { "$set" : { "i" : 1 } } }
The timestamps are generated on the server and are guaranteed to be monotonically increasing, which allows you to quickly find the documents of interest.
This option is the most robust but requires some work on your part.
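As a hedged sketch of tailing the oplog on a replica set member (where it lives in local.oplog.rs; the namespace and the saved position below are illustrative):
var lastTs = Timestamp(1362958492, 1)          // ts of the last entry processed
var oplog = db.getSiblingDB("local").getCollection("oplog.rs")
var cursor = oplog.find({ ts : { $gt : lastTs }, ns : "test.test", op : "i" }).tailable()
while (cursor.hasNext()) {
    var entry = cursor.next()
    printjson(entry.o)                         // the inserted document
    lastTs = entry.ts                          // checkpoint for the next pass
}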
I wrote some demo code to create a "watcher" on a collection that is almost what you want. You can find that code on GitHub. Specifically look at the code in the com.allanbank.mongodb.demo.coordination package.
HTH, Rob

You can actually use timestamps if your _id is of type ObjectId:
prefix = Math.floor((new Date( 2013 , 03 , 11 )).getTime()/1000).toString(16)
db.foo.find( { _id : { $gt : new ObjectId( prefix + "0000000000000000" ) } } )
This way it doesn't matter where the event came from or when it happened; all that matters is when the document's insertion was recorded (later than the previous timer run). Of course, MongoDB is schemaless, so you could also set a field such as isNew to true on insert and flip it to false in conjunction with your query / cursor.
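A minimal polling sketch built on this idea; the checkpoints collection and its field names are assumptions, not part of the answer:
var cp = db.checkpoints.findOne({ _id : "events" })
var lastId = cp ? cp.lastId : ObjectId("000000000000000000000000")
db.events.find({ _id : { $gt : lastId } }).sort({ _id : 1 }).forEach(function (doc) {
    // ... handle the event ...
    lastId = doc._id
})
// Persist the position so the next timer run resumes where this one stopped.
db.checkpoints.update({ _id : "events" }, { $set : { lastId : lastId } }, { upsert : true })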

Related

How can I update only new data or changed values in MongoDB?

The update command updates the key with the provided JSON. I want to insert only objects that are not yet present in the DB and update values that have changed. How can I do that?
"data" : [
{
"_id" : "5bb6253d861d057857ec3ff0",
"name" : "C"
},
{
"_id" : "5bb625fc861d057857ec3ff1",
"name" : "B"
},
{
"_id" : "5bb625fe861d057857ec3ff2",
"name" : "A"
}
]
My data looks like this. So if the incoming JSON contains two new array objects, those two should be inserted alongside the existing three.
Update the object that is not present in the DB:
Use upsert: an upsert creates a new document when no document matches the query criteria. Alternatively, you can add null checks to your query, e.g. { user_id: null }. This will let you update the data where no record for the user is present in the DB.
Update changed values:
This can be implemented by maintaining a key such as last_updated_at. If the last_updated_at value does not match the previously stored one, the record can be treated as modified.
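A small shell sketch of the upsert suggestion, assuming the documents above live in a data collection:
db.data.update(
    { _id : "5bb6253d861d057857ec3ff0" },
    { $set : { name : "C", last_updated_at : new Date() } },
    { upsert : true }
)
// Inserts the document if no _id matches; otherwise it updates the fields in $set.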
You can implement Change Streams, introduced in MongoDB 3.6, through which you can receive real-time changes to your data. You can receive only the data that was changed by filtering on the "update" operation; likewise, you can see newly inserted data by filtering on the "insert" operation. Please see Change Streams.
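A minimal change-stream sketch in the shell (MongoDB 3.6+), assuming the same data collection and filtering for inserts only:
var stream = db.data.watch([ { $match : { operationType : "insert" } } ])
while (stream.hasNext()) {
    var change = stream.next()
    printjson(change.fullDocument)   // the newly inserted document
}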

Locking document in MongoDB while allowing queries to get non-locked records

In our product we have a core collection which can be accessed from a distributed set of workers.
I want to be able to get a document from the collection without any of the workers accidentally picking up the same document.
The best way I've come up with so far to prevent duplicate records being picked up is the following:
Having 2 separate collections with the following basic structure:
core: { _id: '{mongoGeneratedId}', locked: false, lockTimeout: 0}
lock: { _id: null, lockTimeout: 0}
(lockTimeout would have a TTL index)
A worker would run a query that looks something like this:
db.core.findOne({
    $or : [
        { locked : false },
        { lockTimeout : { $lt : currentTime } }   // currentTime computed by the worker
    ]
})
and would have a record returned to it.
To test whether the record has already been grabbed and locked by another worker, it would then try to insert a record into lock with a lockTimeout 5 minutes in the future and an _id equal to the _id of the document from the core collection.
If this insert fails, then we know another worker pipped us to the post and we should run the query again. If it succeeds, then we update core to set locked to true and copy the lockTimeout from the lock collection.
Apart from adding some slightly more complicated ordering to reduce the chance of two workers picking up the same record, I believe this should work.
However, it doesn't feel elegant and I feel like there should be a better way that doesn't require me to create a secondary collection just to keep track of locking.
Does such a thing exist? Kind regards!
Try using the findAndModify command. This command atomically updates a document and returns it (the pre-update version by default, optionally the post-update one). You can use the atomic update to lock the document as you grab it:
> db.queue.insert({ "x" : 1, "locked" : false })
> db.queue.findAndModify({
      "query" : { "locked" : false },
      "update" : { "$set" : { "locked" : true } }
  })
{ "_id" : ObjectId("53ea6f0ef9b63e0dd3ca1a1f"), "x" : 1, "locked" : false }
You can also remove the document atomically. Check out the link for all of the features that could help for your queue-like use case and to read more about the command's behavior.
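As a sketch of how the lock timeout from the question could be folded into this approach (lockedAt is an assumed field name, not part of the answer):
var fiveMinutesAgo = new Date(Date.now() - 5 * 60 * 1000)
var doc = db.queue.findAndModify({
    query : { $or : [ { locked : false }, { lockedAt : { $lt : fiveMinutesAgo } } ] },
    update : { $set : { locked : true, lockedAt : new Date() } },
    new : true
})
// null means nothing was available; otherwise this worker now holds the lock,
// and a crashed worker's lock simply expires after five minutes.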

MongoDB: tail subdocuments

I have a collection of users. Each user has comments. I want to track, for some specific users (according to their ids), whether there is a new comment.
Tailable cursors, I guess, are what I need, but my main problem is that I want to track subdocuments, not documents.
Sample of tracking documents in Python:
import time
from pymongo import Connection

db = Connection().my_db
coll = db.my_collection
cursor = coll.find(tailable=True)   # tailable cursors require a capped collection
while cursor.alive:
    try:
        doc = cursor.next()
        print doc
    except StopIteration:
        time.sleep(1)
One solution is to poll at some fixed interval and see whether the number of comments has changed. However, I don't find the interval solution very appealing. Is there a better way to track changes, probably with tailable cursors?
PS: I have a comment_id field (which is an ObjectID) in each comment.
Small update:
Since I have the comment_id ObjectId, I can store the biggest (= latest) one for each user, then poll at intervals and compare whether it is still the latest. It doesn't need to be a precisely real-time method; even a 10-minute delay is fine. However, I now have 70k users and 180k comments, and I worry about the scalability of this method.
This would be my solution; evaluate whether it fits your requirements.
I am assuming a data structure as follows:
db.user.find().pretty()
{
    "_id" : ObjectId("5335123d900f7849d5ea2530"),
    "user_id" : 200,
    "comments" : [
        {
            "comment_id" : 1,
            "comment" : "hi",
            "createDate" : ISODate("2012-01-01T00:00:00Z")
        },
        {
            "comment_id" : 2,
            "comment" : "bye",
            "createDate" : ISODate("2013-01-01T00:00:00Z")
        }
    ]
}
{
    "_id" : ObjectId("5335123e900f7849d5ea2531"),
    "user_id" : 201,
    "comments" : [
        {
            "comment_id" : 3,
            "comment" : "hi",
            "createDate" : ISODate("2012-01-01T00:00:00Z")
        },
        {
            "comment_id" : 4,
            "comment" : "bye",
            "createDate" : ISODate("2013-01-01T00:00:00Z")
        }
    ]
}
I added a createDate attribute to each comment. Add an index as follows:
db.user.ensureIndex({"user_id":1,"comments.createDate":-1})
You can search for the latest comments with this query:
db.user.find({"user_id":200,"comments.createDate":{$gt:ISODate('2012-12-31')}})
The time used for the "greater than" comparison would be the last time you checked. Since you are using an index, the search will be fast. You can follow the same idea, checking for new comments at some interval.
You can also use a UTC timestamp instead of ISODate; that way you don't have to worry about the BSON date type.
Note that while creating the index on createDate, I specified a descending index.
If you end up with too many comments within a user document over time, I would suggest moving comments to a separate collection, with user_id as one of the attributes of each comment document. That will give better performance in the long run.
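A sketch of that split, assuming a comments collection with a user_id attribute and polling by the highest _id seen so far:
db.comments.ensureIndex({ user_id : 1, _id : 1 })
var lastSeenId = ObjectId("5335123d900f7849d5ea2530")   // saved from the previous pass
db.comments.find({ user_id : 200, _id : { $gt : lastSeenId } }).sort({ _id : 1 })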

MongoDB: Doing $inc on multiple keys

I need help incrementing the values of all keys in participants without having to know the names of the keys inside it.
> db.conversations.findOne()
{
    "_id" : ObjectId("4faf74b238ba278704000000"),
    "participants" : {
        "4f81eab338ba27c011000001" : NumberLong(2),
        "4f78497938ba27bf11000002" : NumberLong(2)
    }
}
I've tried with something like
$mongodb->conversations->update(
    array('_id' => new \MongoId($objectId)),
    array('$inc' => array('participants' => 1))
);
to no avail...
You need to redesign your schema. It is never a good idea to have "random" key names: even though MongoDB is schemaless, you still need well-defined key names. You should change your schema to:
{
    "_id" : ObjectId("4faf74b238ba278704000000"),
    "participants" : [
        { _id: "4f81eab338ba27c011000001", count: NumberLong(2) },
        { _id: "4f78497938ba27bf11000002", count: NumberLong(2) }
    ]
}
Sadly, even with that, you can't update all embedded counts in one command. There is currently an open feature request for that: https://jira.mongodb.org/browse/SERVER-1243
In order to still update everything, you should:
query the document
update all the counts on the client side
store the document again
In order to prevent race conditions with that, have a look at "Compare and Swap" and following paragraphs.
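A hedged shell sketch of that read-modify-write with optimistic locking, using the array schema above; the version field is an assumption added to detect races:
var doc = db.conversations.findOne({ _id : ObjectId("4faf74b238ba278704000000") })
doc.participants.forEach(function (p) { p.count += 1 })
var res = db.conversations.update(
    { _id : doc._id, version : doc.version },
    { $set : { participants : doc.participants }, $inc : { version : 1 } }
)
// If res.nMatched is 0 (modern shells), another writer got there first:
// reload the document and retry.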
It is not possible to update all nested elements in one single operation in the current version of MongoDB, so I can advise iterating with forEach(), as sketched below.
Read the related topic: How to Update Multiple Array Elements in mongodb
I hope this feature will be implemented in a future version.
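A sketch of that per-element fallback, one positional update per participant, again using the array schema suggested above:
var convId = ObjectId("4faf74b238ba278704000000")
db.conversations.findOne({ _id : convId }).participants.forEach(function (p) {
    db.conversations.update(
        { _id : convId, "participants._id" : p._id },
        { $inc : { "participants.$.count" : 1 } }
    )
})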

Multiple update of embedded documents' properties

I have the following collection:
{
    "Milestones" : [
        {
            "ActualDate" : null,
            "Index" : 0,
            "Name" : "milestone1",
            "TargetDate" : ISODate("2011-12-13T22:00:00Z"),
            "_id" : ObjectId("4ee89ae7e60fc615c42e28d1")
        },
        {
            "ActualDate" : null,
            "Index" : 0,
            "Name" : "milestone2",
            "TargetDate" : ISODate("2011-12-13T22:00:00Z"),
            "_id" : ObjectId("4ee89ae7e60fc615c42e28d2")
        }
    ],
    "Name" : "a",
    "_id" : ObjectId("4ee89ae7e60fc615c42e28ce")
}
I want to update specific embedded documents: those in the document with a given _id, whose Milestones._id is in a given list, and whose ActualDate is null.
In .NET my code looks like this:
var query = Query.And(new[] {
    Query.EQ("_id", ObjectId.Parse(projectId)),
    Query.In("Milestones._id", new BsonArray(values.Select(ObjectId.Parse))),
    Query.EQ("Milestones.ActualDate", BsonNull.Value)
});
var update = Update.Set("Milestones.$.ActualDate", DateTime.Now.Date);
Coll.Update(query, update, UpdateFlags.Multi, SafeMode.True);
Or in the mongo shell:
db.Projects.update(
    { "_id" : ObjectId("4ee89ae7e60fc615c42e28ce"),
      "Milestones._id" : { "$in" : [ ObjectId("4ee89ae7e60fc615c42e28d1"),
                                     ObjectId("4ee89ae7e60fc615c42e28d2"),
                                     ObjectId("4ee8a648e60fc615c41d481e") ] },
      "Milestones.ActualDate" : null },
    { "$set" : { "Milestones.$.ActualDate" : ISODate("2011-12-13T22:00:00Z") } },
    false, true
)
But only the first item is being updated.
This is not possible at the moment. The multi flag on update means updating multiple root documents; the positional operator can match only one nested array item. There is a feature request for this in the MongoDB JIRA, which you can vote up and watch.
For now the only options are to load the document, update it as you wish, and save it back, or to issue a separate atomic update for each nested array id.
From the documentation at mongodb.org:
Currently the $ operator only applies to the first matched item in the query
As answered by Andrew Orsich, this is not possible for the moment, at least not as you wish. But loading the document, modifying the array, then saving it back will work. The risk is that some other process could modify the array in the meantime, so you would overwrite its changes. To avoid this, you can use optimistic locking, especially if the array is not modified every second.
load the document, including a new attribute: milestones_version
modify the array as needed
save back to mongodb, but now add a query constraint on the milestones_version, and increment it:
db.Projects.findAndModify({
    query: {
        _id: your_project_id,
        milestones_version: expected_milestones_version
    },
    update: {
        $set: {
            Milestones: modified_milestones
        },
        $inc: {
            milestones_version: 1
        }
    },
    new: 1
})
If another process modified the milestones array (and hence the milestones_version) before we did, then this command will do nothing and simply return null. We just need to reload the document and try again. If the array is not modified every second, then this will be very rare and will not have any impact on performance.
The main problem with this solution is that you have to edit every Project, one by one (no multi: true). You could still write a javascript function and have it run on the server though.
According to the JIRA page, "This new feature is available starting with the MongoDB 3.5.12 development version, and included in the MongoDB 3.6 production version":
https://jira.mongodb.org/browse/SERVER-1243
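With that 3.6 feature (arrayFilters), all matching milestones can finally be updated in one command; a sketch against the schema from the question:
db.Projects.update(
    { _id : ObjectId("4ee89ae7e60fc615c42e28ce") },
    { $set : { "Milestones.$[m].ActualDate" : ISODate("2011-12-13T22:00:00Z") } },
    { multi : true,
      arrayFilters : [ { "m._id" : { $in : [ ObjectId("4ee89ae7e60fc615c42e28d1"),
                                             ObjectId("4ee89ae7e60fc615c42e28d2") ] },
                         "m.ActualDate" : null } ] }
)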