Mongodb tail subdocuments - mongodb

I have a collection with users. Each user has comments. I want to track for some specific users (according to theirs ids) if there is a new comment.
Tailable cursor I guess are what I need but my main problem is that I want to track subdocuments and not documents.
Sample of tracking documents in python:
db = Connection().my_db
coll = db.my_collection
cursor = coll.find(tailable=True)
while cursor.alive:
try:
doc = cursor.next()
print doc
except StopIteration:
time.sleep(1)
One solution is to run intervals every x time and see if the number of the comments has changed. However I do not find the interval solution very appealing. Is there any better way to track changes? Probably with tailable cursors.
PS: I have a comment_id field (which is an ObjectID) in each comment.
Small update:
Since I have the commect_id bson, I can store the biggest (=latest) one in each user. Then run intervals compare the bson if it's still the latest one. I don't mind not to be a precisely real time method. Even 10 minutes of delay is fine. However now I have 70k users and 180k comments but I worry for the scalability of this method.

This would be my solution. Evaluate if it fits your requirement -
I am assuming a data structure as follows
db.user.find().pretty()
{
"_id" : ObjectId("5335123d900f7849d5ea2530"),
"user_id" : 200,
"comments" : [
{
"comment_id" : 1,
"comment" : "hi",
"createDate" : ISODate("2012-01-01T00:00:00Z")
},
{
"comment_id" : 2,
"comment" : "bye",
"createDate" : ISODate("2013-01-01T00:00:00Z")
}
]
}
{
"_id" : ObjectId("5335123e900f7849d5ea2531"),
"user_id" : 201,
"comments" : [
{
"comment_id" : 3,
"comment" : "hi",
"createDate" : ISODate("2012-01-01T00:00:00Z")
},
{
"comment_id" : 4,
"comment" : "bye",
"createDate" : ISODate("2013-01-01T00:00:00Z")
}
]
}
I added createDate attribute to the document. Add an index as follows -
db.user.ensureIndex({"user_id":1,"comments.createDate":-1})
You can search for latest comments with the query -
db.user.find({"user_id":200,"comments.createDate":{$gt:ISODate('2012-12-31')}})
The time used for "greater than" comparison would be last checked time. Since you are using index, the search will be faster. You can follow the same idea of checking in for new comments in some interval.
You can also use UTC time stamp, instead of ISODate. That way you don't have to worry about bson data type.
Note that while creating index on createDate, I have specified descending index.
If you will have too many comments within a user document, over a period of time, I would suggest that, you move comments to a different collection. Use user_id as one of the attributes in the comment document. That will give a better performance in the long run.

Related

Search full document in mongodb for a match

Is there a way to match a value with every array and sub document inside the document in mongodb collection and return the document
{
"_id" : "2000001956",
"trimline1" : "abc",
"trimline2" : "xyz",
"subtitle" : "www",
"image" : {
"large" : 0,
"small" : 0,
"tiled" : 0,
"cropped" : false
},
"Kytrr" : {
"count" : 0,
"assigned" : 0
}
}
for eg if in the above document I am searching for xyz or "ab" or "xy" or "z" or "0" this document should be returned.
I actually have to achieve this at the back end using C# driver but a mongo query would also help greatly.
Please advice.
Thanks
You could probably do this using '$where'
db.mycollection({$where:"JSON.stringify(this).indexOf('xyz')!=-1"})
I'm converting the whole record to a big string and then searching to see if your element is in the resulting string. Probably won't work if your xyz is in the fieldnames!
You can make it iterate through the fields to make a big string and then search it though.
This isn't the most elegant way and will involve a full tablescan. It will be faster if you look through the individual fields!
While Malcolm's answer above would work, when your collection gets large or you have high traffic, you'll see this fall over pretty quickly. This is because of 2 things. First, dropping down to javascript is a big deal and second, this will always be a full table scan because $where can't use an index.
MongoDB 2.6 introduced text indexing which is on by default (it was in beta in 2.4). With it, you can have a full text index on all the fields in the document. The documentation gives the following example where a text index is created for every field and names the index "TextIndex".
db.collection.ensureIndex(
{ "$**": "text" },
{ name: "TextIndex" }
)

Optimizing Compound Mongo GeoSpatial Index

I have a MongoDB $within that looks like this:
db.action.find( { $and : [
{ actionType : "PLAY" },
{
location : {
$within : {
$polygon : [ [ 0.0, 0.1 ], [ 0.0, 0.2 ] .. [ a.b, c.d ] ]
}
}
}
] } ).sort( { time : -1 } ).limit(50)
With regard to the action collection documents
There are 5 actionTypes
The action documents MAY or MAY NOT have a location with a ratio of approximately 70:30 for PLAY actions
Otherwise there is no location
The action documents will ALWAYS have time
The collection contains the following indexes
# I am interested recent actions
db.action.ensureIndex({"time": -1}
# I am interested in recent actions by a specific user
db.action.ensureIndex({"userId" : 1}, "time" -1}
# I am interested in recent actions that relate to a unique song id
db.action.ensureIndex({"songId" : 1}, "time" -1}
I am experimenting with the following two indexes
LocationOnly: db.action.ensureIndex({"location":"2d"})
LocationPlusTime: db.action.ensureIndex({"location":"2d"}, { "time": -1})
Identical queries with each index are explained below:
LocationOnly
{
"cursor":"BasicCursor",
"isMultiKey":false,
"n":50,
"nscannedObjects":91076,
"nscanned":91076,
"nscannedObjectsAllPlans":273229,
"nscannedAllPlans":273229,
"scanAndOrder":true,
"indexOnly":false,
"nYields":1,
"nChunkSkips":0,
"millis":1090,
"indexBounds":{},
"server":"xxxx"
}
LocationPlusTime
{
"cursor":"BasicCursor",
"isMultiKey":false,
"n":50,
"nscannedObjects":91224,
"nscanned":91224,
"nscannedObjectsAllPlans":273673,
"nscannedAllPlans":273673,
"scanAndOrder":true,
"indexOnly":false,
"nYields":44,
"nChunkSkips":0,
"millis":1156,
"indexBounds":{},
"server":"xxxxx"
}
Given
The geosearch will cover documents of ALL types
The geosearch will cover documents with NO Location and WITH Location in a ratio of roughly 60:40
My questions are
Can anybody explain why isMultiKey="false" on the second explain plan?
Can anybody explain why there are more yields on the 2nd explain plan?
My speculative thoughts are
The potential for NULL location is reducing the effectiveness of the
GeoSpatial index.
Compound Indexes of the GeoSpatial variety are not as powerful as standard compound indexes.
UPDATE
A sample document looks like this.
{ "_id" : "adba1154f1f3d4ddfafbff9bb3ae98f2a50e76ffc74a38bae1c44d251db315d25c99e7a1b4a8acb13d11bcd582b9843e335006a5be1d3ac8a502a0a205c0c527",
"_class" : "ie.soundwave.backstage.model.action.Action",
"time" : ISODate("2013-04-18T10:11:57Z"),
"actionType" : "PLAY",
"location" : { "lon" : -6.412839696767714, "lat" : 53.27401934563561 },
"song" : { "_id" : "82e08446c87d21b032ccaee93109d6be",
"title" : "Motion Sickness", "album" : "In Our Heads", "artist" : "Hot Chip"
},
"userId" : "51309ed6e4b0e1fb33d882eb", "createTime" : ISODate("2013-04-18T10:12:59.127Z")
}
UPDATE
The geo-query looks like this
https://www.google.com/maps/ms?msid=214949566612971430368.0004e267780661744eb95&msa=0&ll=-0.01133,-0.019226&spn=0.14471,0.264187
For various reasons approximately 250,000 documents exist in our DB at the point 0.0
I played with this for a number of days and got the result I was looking for.
Firstly, given that action types other than "PLAY" CAN NOT have a location the additional query parameter "actionType==PLAY" was unnecessary and removed. Straight away I flipped from "time-reverse-b-tree" cursor to "Geobrowse-polygon" and for my test search latency improved by an order of 10.
Next, I revisited the 2dsphere as suggested by Derick. Again another latency improvement by roughly 5. Overall a much better user experience for map searches was achieved.
I have one refinement remaining. Queries in areas where there are no plays for a number of days have generally increased in latency. This is due to the query looking back in time until it can find "some play". If necessary, I will add in a time range guard to limit the search space of these queries to a set number of days.
Thanks for the hints Derick.

event streaming via mongodb. Get last inserted events

I consuming data from existing database. this database store system events. My service should check this database by timer, check if some new events created, then upload it and handle. Something like simple queue implementation.
The question is - how can I get new docs each time, when I check database. I can't use timestamps, because events goes to database from different sources and there are no any order for events. So I just need to use inserting order only.
There are a couple of options.
The first, and easiest if it matches your use case, is to use a capped collection. The capped collection is a collection as a pre-defined size that acts as a sort of ring-buffer. Once then collection is full it starts overwriting the documents. For iterating over the collection you simply create a "tailable" cursor you will need some way of identifying the "last document processed (even a simple "done" flag in the document could work but it would have to exist when the document is inserted). If you truly can't modify the documents in any way then you could even save off the last processed document somewhere and use a course time stamp to (approximate the start position) and look for the last document before processing more documents.
The only real issue with this solution is that you will be limited in the number of documents you can write in the collection and it won't grow over time. There are limits on the write operations you can perform on the documents (they can't grow) but it does not sound like you are modifying the documents.
The second option, which is more complex, is to use the oplog. For a standalone configuration you will need to still pass the -replSet option to create and use the oplog. You will just not configure the oplog. In a sharded configuration you will need to track each "replica set" separately. The oplog contains a document for each insert, update, delete done to all collections/documents on the server. Each entry contains a timestamp, operation and id (at a minimum). Here are examples of each.
Insert
{ "ts" : { "t" : 1362958492000, "i" : 1 },
"h" : NumberLong("5915409566571821368"), "v" : 2,
"op" : "i",
"ns" : "test.test",
"o" : { "_id" : "513d189c8544eb2b5e000001" } }
Delete
{ ... "op" : "d", ..., "b" : true,
"o" : { "_id" : "513d189c8544eb2b5e000001" } }
Update
{ ... "op" : "u", ...,
"o2" : { "_id" : "513d189c8544eb2b5e000001" },
"o" : { "$set" : { "i" : 1 } } }
The timestamps are generated on the server and are guaranteed to be monotonically increasing. which allows you to quickly find the documents of interest.
This option is the most robust but requires some work on your part.
I wrote some demo code to create a "watcher" on a collection that is almost what you want. You can find that code on GitHub. Specifically look at the code in the com.allanbank.mongodb.demo.coordination package.
HTH, Rob
You can actually use timestamps if your _id is of type ObjectId:
prefix = Math.floor((new Date( 2013 , 03 , 11 )).getTime()/1000).toString(16)
db.foo.find( { _id : { $gt : new ObjectId( prefix + "0000000000000000" ) } } )
This way, it doesn't matter where the source of the event was or when it was,
it only matters when document insertion was recorded (higher than previous timer)
Of course, it is schema-less and you can always set a field such as isNew to true,
and set it to false in conjunction with your query / cursor

Get the latest record from mongodb collection

I want to know the most recent record in a collection. How to do that?
Note: I know the following command line queries works:
1. db.test.find().sort({"idate":-1}).limit(1).forEach(printjson);
2. db.test.find().skip(db.test.count()-1).forEach(printjson)
where idate has the timestamp added.
The problem is longer the collection is the time to get back the data and my 'test' collection is really really huge. I need a query with constant time response.
If there is any better mongodb command line query, do let me know.
This is a rehash of the previous answer but it's more likely to work on different mongodb versions.
db.collection.find().limit(1).sort({$natural:-1})
This will give you one last document for a collection
db.collectionName.findOne({}, {sort:{$natural:-1}})
$natural:-1 means order opposite of the one that records are inserted in.
Edit: For all the downvoters, above is a Mongoose syntax,
mongo CLI syntax is: db.collectionName.find({}).sort({$natural:-1}).limit(1)
Yet another way of getting the last item from a MongoDB Collection (don't mind about the examples):
> db.collection.find().sort({'_id':-1}).limit(1)
Normal Projection
> db.Sports.find()
{ "_id" : ObjectId("5bfb5f82dea65504b456ab12"), "Type" : "NFL", "Head" : "Patriots Won SuperBowl 2017", "Body" : "Again, the Pats won the Super Bowl." }
{ "_id" : ObjectId("5bfb6011dea65504b456ab13"), "Type" : "World Cup 2018", "Head" : "Brazil Qualified for Round of 16", "Body" : "The Brazilians are happy today, due to the qualification of the Brazilian Team for the Round of 16 for the World Cup 2018." }
{ "_id" : ObjectId("5bfb60b1dea65504b456ab14"), "Type" : "F1", "Head" : "Ferrari Lost Championship", "Body" : "By two positions, Ferrari loses the F1 Championship, leaving the Italians in tears." }
Sorted Projection ( _id: reverse order )
> db.Sports.find().sort({'_id':-1})
{ "_id" : ObjectId("5bfb60b1dea65504b456ab14"), "Type" : "F1", "Head" : "Ferrari Lost Championship", "Body" : "By two positions, Ferrari loses the F1 Championship, leaving the Italians in tears." }
{ "_id" : ObjectId("5bfb6011dea65504b456ab13"), "Type" : "World Cup 2018", "Head" : "Brazil Qualified for Round of 16", "Body" : "The Brazilians are happy today, due to the qualification of the Brazilian Team for the Round of 16 for the World Cup 2018." }
{ "_id" : ObjectId("5bfb5f82dea65504b456ab12"), "Type" : "NFL", "Head" : "Patriots Won SuperBowl 2018", "Body" : "Again, the Pats won the Super Bowl" }
sort({'_id':-1}), defines a projection in descending order of all documents, based on their _ids.
Sorted Projection ( _id: reverse order ): getting the latest (last) document from a collection.
> db.Sports.find().sort({'_id':-1}).limit(1)
{ "_id" : ObjectId("5bfb60b1dea65504b456ab14"), "Type" : "F1", "Head" : "Ferrari Lost Championship", "Body" : "By two positions, Ferrari loses the F1 Championship, leaving the Italians in tears." }
I need a query with constant time response
By default, the indexes in MongoDB are B-Trees. Searching a B-Tree is a O(logN) operation, so even find({_id:...}) will not provide constant time, O(1) responses.
That stated, you can also sort by the _id if you are using ObjectId for you IDs. See here for details. Of course, even that is only good to the last second.
You may to resort to "writing twice". Write once to the main collection and write again to a "last updated" collection. Without transactions this will not be perfect, but with only one item in the "last updated" collection it will always be fast.
php7.1 mongoDB:
$data = $collection->findOne([],['sort' => ['_id' => -1],'projection' => ['_id' => 1]]);
My Solution :
db.collection("name of collection").find({}, {limit: 1}).sort({$natural: -1})
If you are using auto-generated Mongo Object Ids in your document, it contains timestamp in it as first 4 bytes using which latest doc inserted into the collection could be found out. I understand this is an old question, but if someone is still ending up here looking for one more alternative.
db.collectionName.aggregate(
[{$group: {_id: null, latestDocId: { $max: "$_id"}}}, {$project: {_id: 0, latestDocId: 1}}])
Above query would give the _id for the latest doc inserted into the collection
This is how to get the last record from all MongoDB documents from the "foo" collection.(change foo,x,y.. etc.)
db.foo.aggregate([{$sort:{ x : 1, date : 1 } },{$group: { _id: "$x" ,y: {$last:"$y"},yz: {$last:"$yz"},date: { $last : "$date" }}} ],{ allowDiskUse:true })
you can add or remove from the group
help articles: https://docs.mongodb.com/manual/reference/operator/aggregation/group/#pipe._S_group
https://docs.mongodb.com/manual/reference/operator/aggregation/last/
Mongo CLI syntax:
db.collectionName.find({}).sort({$natural:-1}).limit(1)
Let Mongo create the ID, it is an auto-incremented hash
mymongo:
self._collection.find().sort("_id",-1).limit(1)

In MongoDB, how does on get the value in a field for an embedded document, but query based on a different value

I have a basic structure like this:
> db.users.findOne()
{
"_id" : ObjectId("4f384903cd087c6f720066d7"),
"current_sign_in_at" : ISODate("2012-02-12T23:19:31Z"),
"current_sign_in_ip" : "127.0.0.1",
"email" : "something#gmail.com",
"encrypted_password" : "$2a$10$fu9B3M/.Gmi8qe7pXtVCPu94mBVC.gn5DzmQXH.g5snHT4AJSZYCu",
"last_sign_in_at" : ISODate("2012-02-12T23:19:31Z"),
"last_sign_in_ip" : "127.0.0.1",
"name" : "Trip Jameson",
"sign_in_count" : 100,
"usertimes" : [
...thousands and thousands of records like this one....
{
"enddate" : 348268392.115282,
"idle" : 0,
"startdate" : 348268382.116728,
"title" : "My Awesome Title"
},
]
}
So I want to find only usertimes for a single user where the title was "My Awesome Title", and then I want to see what the value for "idle" was in that record(s)
So far all I can figure out is that I can find the entire user record with a search like:
> db.users.find({'usertimes.title':"My Awesome Title"})
This just returns the entire User record though, which is useless for my purposes. Am I misunderstanding something?
Return only partial embedded documents is currently not supported by MongoDB
The matching User record will always be returned (at least with the current MongoDB version).
see this question for similar reference
Filtering embedded documents in MongoDB
This is the correspondent Jira on MongoDB space
http://jira.mongodb.org/browse/SERVER-142
Use:
db.users.find({'usertimes.title': "My Awesome Title"}, {'idle': 1});
May I suggest you take a more detailed look at http://www.mongodb.org/display/DOCS/Querying, it'll explain things for you.