I have a MongoDB $within query that looks like this:
db.action.find( { $and : [
{ actionType : "PLAY" },
{
location : {
$within : {
$polygon : [ [ 0.0, 0.1 ], [ 0.0, 0.2 ] .. [ a.b, c.d ] ]
}
}
}
] } ).sort( { time : -1 } ).limit(50)
With regard to the action collection documents
There are 5 actionTypes
The action documents MAY or MAY NOT have a location with a ratio of approximately 70:30 for PLAY actions
Otherwise there is no location
The action documents will ALWAYS have time
The collection contains the following indexes
// I am interested in recent actions
db.action.ensureIndex({"time": -1})
// I am interested in recent actions by a specific user
db.action.ensureIndex({"userId": 1, "time": -1})
// I am interested in recent actions that relate to a unique song id
db.action.ensureIndex({"songId": 1, "time": -1})
I am experimenting with the following two indexes
LocationOnly: db.action.ensureIndex({"location":"2d"})
LocationPlusTime: db.action.ensureIndex({"location":"2d", "time": -1})
Identical queries with each index are explained below:
LocationOnly
{
"cursor":"BasicCursor",
"isMultiKey":false,
"n":50,
"nscannedObjects":91076,
"nscanned":91076,
"nscannedObjectsAllPlans":273229,
"nscannedAllPlans":273229,
"scanAndOrder":true,
"indexOnly":false,
"nYields":1,
"nChunkSkips":0,
"millis":1090,
"indexBounds":{},
"server":"xxxx"
}
LocationPlusTime
{
"cursor":"BasicCursor",
"isMultiKey":false,
"n":50,
"nscannedObjects":91224,
"nscanned":91224,
"nscannedObjectsAllPlans":273673,
"nscannedAllPlans":273673,
"scanAndOrder":true,
"indexOnly":false,
"nYields":44,
"nChunkSkips":0,
"millis":1156,
"indexBounds":{},
"server":"xxxxx"
}
Given
The geosearch will cover documents of ALL types
The geosearch will cover documents with NO Location and WITH Location in a ratio of roughly 60:40
My questions are
Can anybody explain why isMultiKey="false" on the second explain plan?
Can anybody explain why there are more yields on the 2nd explain plan?
My speculative thoughts are
The potential for NULL location is reducing the effectiveness of the GeoSpatial index.
Compound Indexes of the GeoSpatial variety are not as powerful as standard compound indexes.
UPDATE
A sample document looks like this.
{ "_id" : "adba1154f1f3d4ddfafbff9bb3ae98f2a50e76ffc74a38bae1c44d251db315d25c99e7a1b4a8acb13d11bcd582b9843e335006a5be1d3ac8a502a0a205c0c527",
"_class" : "ie.soundwave.backstage.model.action.Action",
"time" : ISODate("2013-04-18T10:11:57Z"),
"actionType" : "PLAY",
"location" : { "lon" : -6.412839696767714, "lat" : 53.27401934563561 },
"song" : { "_id" : "82e08446c87d21b032ccaee93109d6be",
"title" : "Motion Sickness", "album" : "In Our Heads", "artist" : "Hot Chip"
},
"userId" : "51309ed6e4b0e1fb33d882eb", "createTime" : ISODate("2013-04-18T10:12:59.127Z")
}
UPDATE
The geo-query looks like this
https://www.google.com/maps/ms?msid=214949566612971430368.0004e267780661744eb95&msa=0&ll=-0.01133,-0.019226&spn=0.14471,0.264187
For various reasons approximately 250,000 documents exist in our DB at the point 0.0
I played with this for a number of days and got the result I was looking for.
Firstly, given that action types other than "PLAY" cannot have a location, the additional query parameter "actionType==PLAY" was unnecessary, so I removed it. Straight away the query flipped from the "time-reverse-b-tree" cursor to "Geobrowse-polygon", and for my test search latency improved by a factor of 10.
Next, I revisited the 2dsphere index as suggested by Derick. Latency improved again, by a factor of roughly 5. Overall, a much better user experience for map searches was achieved.
I have one refinement remaining. Queries in areas where there have been no plays for a number of days have generally increased in latency. This is due to the query looking back in time until it can find some play. If necessary, I will add a time range guard to limit the search space of these queries to a set number of days.
Thanks for the hints Derick.
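For anyone wanting to try the time range guard, here is a minimal sketch in Python of what the filter document might look like. It only builds a plain dict and does not talk to a server. The function name build_map_query, the 7-day default and the example polygon are my own assumptions, and it assumes a 2dsphere index with a GeoJSON polygon rather than the original $within/$polygon form.

```python
from datetime import datetime, timedelta, timezone

def build_map_query(polygon, now, max_age_days=7):
    """Build a filter: inside `polygon` and no older than `max_age_days` (assumed default)."""
    ring = [list(p) for p in polygon]
    if ring[0] != ring[-1]:
        ring.append(list(ring[0]))  # a GeoJSON linear ring must end where it starts
    return {
        "location": {
            "$geoWithin": {
                "$geometry": {"type": "Polygon", "coordinates": [ring]}
            }
        },
        # The guard: never look further back than max_age_days.
        "time": {"$gte": now - timedelta(days=max_age_days)},
    }

# Hypothetical polygon given as (lon, lat) pairs.
polygon = [(-6.5, 53.2), (-6.1, 53.2), (-6.1, 53.4), (-6.5, 53.4)]
query = build_map_query(polygon, datetime(2013, 4, 18, tzinfo=timezone.utc))
```

With a driver, this dict would be passed to find(), followed by the same sort on time and limit of 50 as in the question. The guard trades completeness for bounded latency: an area with no plays inside the window simply returns nothing.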
I have created a collection and added just a name field and tried to apply the following index.
db.names.createIndex({"name":1})
Even after applying the index I see the below result.
db.names.find()
{ "_id" : ObjectId("57d14139eceab001a19f7e82"), "name" : "kkkk" }
{ "_id" : ObjectId("57d1413feceab001a19f7e83"), "name" : "aaaa" }
{ "_id" : ObjectId("57d14144eceab001a19f7e84"), "name" : "zzzz" }
{ "_id" : ObjectId("57d14148eceab001a19f7e85"), "name" : "dddd" }
{ "_id" : ObjectId("57d1414ceceab001a19f7e86"), "name" : "rrrrr" }
What am I missing here?
Khans...
The way you built your index is correct; however, building an ascending index on name won't make find() return the results in ascending order.
If you need results ordered by name, you have to use
db.names.find().sort({name:1})
What happens when you build an index is that, when you search for data, MongoDB can search the ordered index behind the scenes instead of scanning every document, which gives faster results.
Please note: if you just want to see output in sorted order, you don't even need an index.
You won't be able to tell whether an index has been successfully created by running a find() command (unless there is a considerable speed improvement).
Instead, use db.names.getIndexes() to see if the index has been created (if you're building the index in the background, it may take some time to appear in the index list).
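To illustrate why an index does not change the order of a plain find(), here is a toy model in plain Python. It is not MongoDB code and says nothing about MongoDB internals: the list stands in for the collection, and the sorted list of (key, position) pairs is an assumed, simplified stand-in for an index.

```python
import bisect

# Documents in insertion order, as in the question's find() output.
docs = [{"name": "kkkk"}, {"name": "aaaa"}, {"name": "zzzz"},
        {"name": "dddd"}, {"name": "rrrrr"}]

# Toy "index": sorted (key, position) pairs kept beside the collection.
index = sorted((d["name"], i) for i, d in enumerate(docs))

def find_all():
    """Plain scan: documents come back in storage order; the index is not consulted."""
    return [d["name"] for d in docs]

def find_by_name(name):
    """Lookup through the index with binary search instead of a full scan."""
    pos = bisect.bisect_left(index, (name, -1))
    if pos < len(index) and index[pos][0] == name:
        return docs[index[pos][1]]
    return None

def find_sorted():
    """Sorted read: walk the index in key order."""
    return [docs[i]["name"] for _, i in index]
```

Here find_all() still returns the names in insertion order, while find_sorted() returns them alphabetically. The index only helps when a lookup or an explicit sort asks for it.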
I'm running some queries to a mongodb 2.4.9 server that populate a datatable on a webpage. The user needs to be able to do a substring search across multiple fields, sort the data on various columns, and flip through the results in pages. I have to check multiple fields for matches since the user could be searching for anything related to the documents. There are about 300,000 documents in the collection so the database is relatively small.
I have indexes created for the created_by, requester, desc.name, metaprogram.id, program.id, and arr.programid fields. I've also created indexes [("created", 1), ("created_by", 1), ("requester", 1)] and [("created_by", 1), ("requester", 1)] at the suggestion of Dex.
It's also worth mentioning that documents might not have all of the fields that are being searched for here. Some documents might have a metaprogram.id but not the other ID fields for example.
An example of a query I might run is
{
"$query" : {
"$and" : [
{
"created_by" : {"$ne" : "automation"},
"requester" : {"$in" : ["Broadway", "Spec", "Falcon"] }
},
{
"$or" : [
{"requester" : /month/i },
{"created_by" : /month/i },
{"desc.name" : /month/i },
{"metaprogram.id" : {"$in" : [708, 2314, 709 ] } },
{"program.id" : {"$in" : [708, 2314, 709 ] } },
{"arr.programid" : {"$in" : [708, 2314, 709 ] } }
]
}
]
},
"$orderby" : {
"created" : 1
}
}
with differing orderby, limit, and skip values as well.
Queries on average take 500-1500ms to complete.
I've looked into how to make it faster, but haven't been able to come up with anything. Some of the text searching stuff looks handy but as far as I know each collection only supports at most one text index and it doesn't support pagination (skips). I'm sure that prefix searching instead of regex substring matches would be faster as well but I need substring matching.
Is there anything you can think of to improve the speed of a query like this?
It's quite hard to optimize a query when it's unpredictable.
Analyze how the system is being used and place indexes on the most popular fields.
Use .explain() to make sure the indexes are being used.
Also limit the results returned to a value of 50 or 100. The user doesn't need to see everything at once.
Try upgrading mongodb to see if there's a performance improvement.
Side note:
You might want to consider using ElasticSearch as a search engine instead of MongoDB. ElasticSearch would store the searchable fields and return the MongoDB ids for matched results. For this kind of text search, ElasticSearch can be an order of magnitude faster than MongoDB.
More info:
How to find queries not using indexes or slow in mongodb
Range query for MongoDB pagination
http://www.elasticsearch.org/overview/
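As a rough illustration of the range-based pagination idea linked above, here is a minimal Python sketch that builds the filter for the next page from the last document of the previous page, instead of using skip. The function name next_page_filter and the example values are hypothetical, and the sketch assumes results are sorted by created and then _id, both ascending.

```python
def next_page_filter(base_filter, last_created=None, last_id=None):
    """Return a filter for the page that follows the document (last_created, last_id)."""
    if last_created is None:
        return dict(base_filter)  # first page: no range condition needed
    after_last = {"$or": [
        {"created": {"$gt": last_created}},
        # Tie-break on _id so documents sharing a timestamp are not skipped.
        {"created": last_created, "_id": {"$gt": last_id}},
    ]}
    return {"$and": [base_filter, after_last]}

base = {"requester": {"$in": ["Broadway", "Spec", "Falcon"]}}
first_page = next_page_filter(base)
# Hypothetical values taken from the last row of the first page.
second_page = next_page_filter(base, last_created=1366279917, last_id=42)
```

This only works for one fixed sort order at a time; a datatable that sorts on arbitrary columns would need one such range condition per sortable column, and it cannot jump directly to page N.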
Let's assume I would like to track computers on my network storing information about mac-address with the device name, port name, vlan number and timestamp.
I could grab mac-address-table from all switches in regular intervals, parse it and dump that data into mongodb.
The problem is: how to STORE only the last 100 unique entries for each mac-address.
Capped collections are a no-go, because I would have to create a separate collection for each mac, which is a bad idea.
The number of switches and mac-addresses may change over time, and new data might be inserted at irregular intervals.
The other idea I have is to write a query which looks for the timestamp of the 100th oldest entry for each mac-address and removes all older entries, and run this query after each batch of inserts. It may work, but doesn't seem very efficient.
Do you have any better ideas?
Hmm... found something interesting:
How about storing each version in an array using the $push operator with the $slice modifier?
there are some examples in the docs:
http://docs.mongodb.org/manual/reference/operator/update/slice/
The other idea I have is to write a query which looks for the timestamp of the 100th oldest entry for each mac-address and removes all older entries, and run this query after each batch of inserts. It may work, but doesn't seem very efficient.
That sounds good to me. It might be cleaner to use a cyclic buffer for this:
{
mac : "AA:AA:AA:AA:AA",
entryPointer : 2, // pointer of the next entry to be written
lastEntries : [
{ "ip" : "127.0.0.1", "service" : "foo", ts : ISODate(...), ... },
{ "ip" : "127.0.0.1", "service" : "foo", ts : ISODate(...), ... },
{ "ip" : "255.255.255.255", "service" : "longestProbableServiceName",
ts : ISODate("0001-01-01"), ... },
...
{ "ip" : "255.255.255.255", "service" : "longestProbableServiceName",
ts : ISODate("0001-01-01"), ... }
]
}
An update would have to increase the pointer and overwrite the position in the array given by pointer % 100. It will be helpful to preallocate the memory in the array as demonstrated to avoid fragmentation and reallocation overhead.
As you pointed out, the modulo-update can be done using $slice and $push:
db.foo.update(
{ mac : "AA:AA:AA..." },
{
$push: {
lastEntries : {
$each: [ { "ip" : "012.002.003.012", ... } ],
$slice: -100
}
}
}
)
Pre-populating the array also comes with the advantage that the most recent entry is always at the last position.
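To make the effect of $push with $each and $slice concrete, here is a small plain-Python simulation of the array semantics: append the new items, then trim. It does not touch a database; the function name push_with_slice is hypothetical and the entries are reduced to a bare ts counter for brevity.

```python
def push_with_slice(array, each, slice_n):
    """Mimic $push with $each and $slice: append, then keep the last |slice_n|
    items (negative slice_n) or the first slice_n items (zero or positive)."""
    combined = list(array) + list(each)
    return combined[slice_n:] if slice_n < 0 else combined[:slice_n]

entries = []
for ts in range(105):
    # One new observation per update, capped at the 100 most recent.
    entries = push_with_slice(entries, [{"ts": ts}], -100)
```

After 105 updates the array holds entries 5 through 104: the five oldest were dropped, and the newest is at the last position.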
I have a collection of users. Each user has comments. For some specific users (identified by their ids), I want to track whether there is a new comment.
I guess tailable cursors are what I need, but my main problem is that I want to track subdocuments, not documents.
Sample of tracking documents in python:
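import time
from pymongo import MongoClient
from pymongo.cursor import CursorType

db = MongoClient().my_db
coll = db.my_collection  # tailable cursors only work on capped collections
cursor = coll.find(cursor_type=CursorType.TAILABLE)
while cursor.alive:
    try:
        doc = cursor.next()
        print(doc)
    except StopIteration:
        # No new documents yet; wait before polling the cursor again.
        time.sleep(1)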
One solution is to poll at a fixed interval and check whether the number of comments has changed. However, I do not find polling very appealing. Is there a better way to track changes, perhaps with tailable cursors?
PS: I have a comment_id field (which is an ObjectID) in each comment.
Small update:
Since I have the comment_id ObjectId, I can store the biggest (= latest) one for each user, then poll at intervals and compare it to check whether it is still the latest one. I don't mind if the method is not precisely real time; even a 10-minute delay is fine. However, I now have 70k users and 180k comments, and I worry about the scalability of this method.
This would be my solution. Evaluate if it fits your requirement -
I am assuming a data structure as follows
db.user.find().pretty()
{
"_id" : ObjectId("5335123d900f7849d5ea2530"),
"user_id" : 200,
"comments" : [
{
"comment_id" : 1,
"comment" : "hi",
"createDate" : ISODate("2012-01-01T00:00:00Z")
},
{
"comment_id" : 2,
"comment" : "bye",
"createDate" : ISODate("2013-01-01T00:00:00Z")
}
]
}
{
"_id" : ObjectId("5335123e900f7849d5ea2531"),
"user_id" : 201,
"comments" : [
{
"comment_id" : 3,
"comment" : "hi",
"createDate" : ISODate("2012-01-01T00:00:00Z")
},
{
"comment_id" : 4,
"comment" : "bye",
"createDate" : ISODate("2013-01-01T00:00:00Z")
}
]
}
I added a createDate attribute to each comment. Add an index as follows -
db.user.ensureIndex({"user_id":1,"comments.createDate":-1})
You can search for latest comments with the query -
db.user.find({"user_id":200,"comments.createDate":{$gt:ISODate('2012-12-31')}})
The time used for the "greater than" comparison would be the last checked time. Since the query uses an index, the search will be faster. You can follow the same idea of checking for new comments at some interval.
You can also use a UTC timestamp instead of ISODate. That way you don't have to worry about the BSON data type.
Note that while creating the index on createDate, I have specified a descending index.
If a user document will accumulate too many comments over time, I would suggest that you move the comments to a different collection, using user_id as one of the attributes in each comment document. That will give better performance in the long run.
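One thing to keep in mind with the find() above: it matches whole user documents, so every comment of a matching user comes back, old and new alike. A minimal Python sketch of picking out only the new subdocuments on the client side follows. The helper name new_comments is hypothetical, and the sample data mirrors the assumed structure above.

```python
from datetime import datetime

def new_comments(user_doc, last_checked):
    """Return only the comment subdocuments created after last_checked."""
    return [c for c in user_doc.get("comments", [])
            if c["createDate"] > last_checked]

user = {
    "user_id": 200,
    "comments": [
        {"comment_id": 1, "comment": "hi", "createDate": datetime(2012, 1, 1)},
        {"comment_id": 2, "comment": "bye", "createDate": datetime(2013, 1, 1)},
    ],
}
# last_checked would be the time of the previous polling run.
fresh = new_comments(user, datetime(2012, 12, 31))
```

Here fresh contains only the comment with comment_id 2. After each polling run you would store the current time as the new last checked time.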
I want to know the most recent record in a collection. How to do that?
Note: I know the following command line queries works:
1. db.test.find().sort({"idate":-1}).limit(1).forEach(printjson);
2. db.test.find().skip(db.test.count()-1).forEach(printjson)
where idate has the timestamp added.
The problem is that the longer the collection, the longer it takes to get the data back, and my 'test' collection is really huge. I need a query with constant time response.
If there is any better mongodb command line query, do let me know.
This is a rehash of the previous answer but it's more likely to work on different mongodb versions.
db.collection.find().limit(1).sort({$natural:-1})
This will give you the last document of a collection:
db.collectionName.findOne({}, {sort:{$natural:-1}})
$natural:-1 means the reverse of the order in which records were inserted.
Edit: For all the downvoters, above is a Mongoose syntax,
mongo CLI syntax is: db.collectionName.find({}).sort({$natural:-1}).limit(1)
Yet another way of getting the last item from a MongoDB Collection (don't mind about the examples):
> db.collection.find().sort({'_id':-1}).limit(1)
Unsorted result
> db.Sports.find()
{ "_id" : ObjectId("5bfb5f82dea65504b456ab12"), "Type" : "NFL", "Head" : "Patriots Won SuperBowl 2017", "Body" : "Again, the Pats won the Super Bowl." }
{ "_id" : ObjectId("5bfb6011dea65504b456ab13"), "Type" : "World Cup 2018", "Head" : "Brazil Qualified for Round of 16", "Body" : "The Brazilians are happy today, due to the qualification of the Brazilian Team for the Round of 16 for the World Cup 2018." }
{ "_id" : ObjectId("5bfb60b1dea65504b456ab14"), "Type" : "F1", "Head" : "Ferrari Lost Championship", "Body" : "By two positions, Ferrari loses the F1 Championship, leaving the Italians in tears." }
Sorted result (_id in reverse order)
> db.Sports.find().sort({'_id':-1})
{ "_id" : ObjectId("5bfb60b1dea65504b456ab14"), "Type" : "F1", "Head" : "Ferrari Lost Championship", "Body" : "By two positions, Ferrari loses the F1 Championship, leaving the Italians in tears." }
{ "_id" : ObjectId("5bfb6011dea65504b456ab13"), "Type" : "World Cup 2018", "Head" : "Brazil Qualified for Round of 16", "Body" : "The Brazilians are happy today, due to the qualification of the Brazilian Team for the Round of 16 for the World Cup 2018." }
{ "_id" : ObjectId("5bfb5f82dea65504b456ab12"), "Type" : "NFL", "Head" : "Patriots Won SuperBowl 2017", "Body" : "Again, the Pats won the Super Bowl." }
sort({'_id':-1}) returns all documents in descending order of their _ids.
Sorted result (_id in reverse order) with limit(1): getting the latest (last) document from a collection.
> db.Sports.find().sort({'_id':-1}).limit(1)
{ "_id" : ObjectId("5bfb60b1dea65504b456ab14"), "Type" : "F1", "Head" : "Ferrari Lost Championship", "Body" : "By two positions, Ferrari loses the F1 Championship, leaving the Italians in tears." }
I need a query with constant time response
By default, the indexes in MongoDB are B-Trees. Searching a B-Tree is an O(log N) operation, so even find({_id:...}) will not provide constant time, O(1) responses.
That stated, you can also sort by the _id if you are using ObjectId for your IDs. See here for details. Of course, even that is only good to the last second.
You may have to resort to "writing twice". Write once to the main collection and write again to a "last updated" collection. Without transactions this will not be perfect, but with only one item in the "last updated" collection it will always be fast.
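The "writing twice" idea can be sketched with an in-memory stand-in. This is plain Python, not driver code: the class name Store and its attributes are hypothetical, the list plays the role of the main collection and the dict plays the role of the one-document "last updated" collection.

```python
class Store:
    """Toy model of writing every document twice."""

    def __init__(self):
        self.main = []          # stands in for the main collection
        self.last_updated = {}  # stands in for the "last updated" collection

    def insert(self, doc):
        # Two separate writes; without a transaction a crash between them
        # could leave the "last updated" entry one document behind.
        self.main.append(doc)
        self.last_updated["latest"] = doc

    def latest(self):
        # Reads a single entry, independent of how large main has grown.
        return self.last_updated.get("latest")

store = Store()
for n in range(1000):
    store.insert({"n": n})
```

Reading store.latest() touches only the single stored entry, no matter how many documents main holds.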
php7.1 mongoDB:
$data = $collection->findOne([],['sort' => ['_id' => -1],'projection' => ['_id' => 1]]);
My Solution :
db.collection("name of collection").find({}, {limit: 1}).sort({$natural: -1})
If you are using auto-generated Mongo ObjectIds in your documents, each one contains a timestamp in its first 4 bytes, which can be used to find the latest document inserted into the collection. I understand this is an old question, but this may help anyone who ends up here looking for one more alternative.
db.collectionName.aggregate(
[{$group: {_id: null, latestDocId: { $max: "$_id"}}}, {$project: {_id: 0, latestDocId: 1}}])
Above query would give the _id for the latest doc inserted into the collection
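To show what the timestamp inside an ObjectId looks like, here is a short Python sketch that decodes it from the hex string using only the standard library. The helper name objectid_time is hypothetical; the three ids are the ones from the Sports example in another answer here.

```python
from datetime import datetime, timezone

def objectid_time(oid_hex):
    """Decode the creation time held in the first 4 bytes (8 hex digits) of an ObjectId."""
    seconds = int(oid_hex[:8], 16)  # seconds since the Unix epoch
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

ids = [
    "5bfb5f82dea65504b456ab12",
    "5bfb6011dea65504b456ab13",
    "5bfb60b1dea65504b456ab14",
]
# The id with the largest embedded timestamp is the most recently created one.
latest = max(ids, key=objectid_time)
```

Because the timestamp has one-second resolution, two documents created within the same second can only be ordered by the remaining bytes of the id, which is why sorting on the full _id is preferable to comparing the decoded times.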
This is how to get the last record for each value of x from the "foo" collection (change foo, x, y, etc. to match your data).
db.foo.aggregate([{$sort:{ x : 1, date : 1 } },{$group: { _id: "$x" ,y: {$last:"$y"},yz: {$last:"$yz"},date: { $last : "$date" }}} ],{ allowDiskUse:true })
You can add fields to, or remove them from, the $group stage.
help articles: https://docs.mongodb.com/manual/reference/operator/aggregation/group/#pipe._S_group
https://docs.mongodb.com/manual/reference/operator/aggregation/last/
Mongo CLI syntax:
db.collectionName.find({}).sort({$natural:-1}).limit(1)
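Let Mongo create the _id; the default ObjectId begins with a creation timestamp, so sorting on it in descending order returns the most recently created document first.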
mymongo:
self._collection.find().sort("_id",-1).limit(1)