MongoDB find without an index is fast for certain values of a field

I have a collection named "message" with 3.5 million documents. There are no indexes on the collection except the _id index.
3.3 million of the documents have the field "name" with the value "peter".
200k of the documents have the field "name" with the value "john".
If I query db.message.find({name: "peter"}) it takes around 1 millisecond.
If I query db.message.find({name: "john"}) it takes around 2 seconds.
My question is: why is one query really fast while the other is slow?
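A diagnostic sketch (not from the original post) to see what each of those queries is actually doing is to compare their explain() output:
// shows the plan chosen and how many documents each query had to scan
db.message.find({ "name" : "peter" }).explain()
db.message.find({ "name" : "john" }).explain()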

Related

MongoDB - Weird difference in _id Index size

I have two sharded collections on 12 shards, with the same number of documents. The shard key of Collection1 is compound (two fields are used), and its documents consist of 4 fields. The shard key of Collection2 is a single field, and its documents consist of 5 fields.
Via the db.collection.stats() command, I get the information about the indexes.
What seems strange to me is that for Collection1, the total size of the _id index is 1342MB.
Instead, the total size of the _id index for Collection2 is 2224MB. Is this difference reasonable? I was expecting that the total size would be more or less the same because of the same number of documents. Note that the sharding key for both collections does not include the _id field.
MongoDB (with the WiredTiger storage engine) uses prefix compression for indexes.
This means that if sequential values in the index begin with the same series of bytes, the bytes are stored for the first value, and subsequent values contain a tag indicating the length of the shared prefix.
Depending on the datatype of the _id value, the shared prefix could be quite long; ObjectId values, for example, begin with a 4-byte timestamp, so ids created close together share most of their leading bytes.
There may also be orphaned documents causing one node to have more entries in its _id index.
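To compare the two, a minimal check from the shell (the names Collection1/Collection2 are taken from the question) could be:
// total _id index size per collection, in bytes
db.Collection1.stats().indexSizes["_id_"]
db.Collection2.stats().indexSizes["_id_"]
// on a sharded cluster, stats() also includes a per-shard breakdown under "shards"
db.Collection1.stats().shards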

large field in mongodb document is slowing down aggregate query

I have a collection named "questions" with around 15 fields. There is an indexed field called "user". Another field is "response_api", which is a subdocument of around 60KB. And there are around 40000 documents in this collection.
When I run an aggregate query with only a $match stage on the user field, it is very slow and takes around 11 seconds to complete.
The query is:
db.questions.aggregate([{$match: {user: ObjectId("5c9a19abc89b2d09740ccd1d")} }])
But when I run this same query with a $project stage, it returns pretty fast, in less than 10 ms. The query is:
db.questions.aggregate([{$match: {user: ObjectId("5c9a19abc89b2d09740ccd1d")} }, {$project: {_id: 1, user: 1, subject: 1}}])
Note: This particular user has 5000 documents in the questions collection.
When I copied this collection without the "response_api" field and created an index on the user field, both of these queries on the copied collection were pretty fast and took less than 10 ms.
Can somebody explain what's going on here? Is it because of that large field?
According to this documentation provided by MongoDB, you should keep a check on the size of your indexes. Basically, if the index size is more than what your RAM can accommodate, then MongoDB will be reading them from disk and thus your queries will be a lot slower.
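A quick way to check that from the shell (using the collection from the question) is:
// total size of all indexes on the collection, in bytes
db.questions.totalIndexSize()
// per-index breakdown
db.questions.stats().indexSizes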

Efficient mongodb query to find the average time in a collection of 10K+ records?

Following is one record of a collection named outputs.
db.outputs.findOne()
{
"_id" : ObjectId("4e4131e8c7908d3eb5000002"),
"company" : "West Edmonton Mall",
"country" : "Canada",
"created_at" : ISODate("2011-08-09T13:11:04Z"),
"started_at" : ISODate("2011-08-09T11:11:04Z"),
"end_at" : ISODate("2011-08-09T13:09:04Z")
}
The above is just one document. There are around 10K docs and the number will keep increasing.
What I need is to find the average duration in hours (using started_at and end_at) for documents from the past week (based on created_at).
Right now, you're going to need to query the documents you need to average, likely selecting only the fields you need (started_at and end_at), and do the calculation in your app code.
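In the mongo shell, a minimal sketch of that app-side calculation (field names taken from the sample document above) could be:
// average duration in hours for documents created in the past week
var since = new Date(Date.now() - 7 * 24 * 3600 * 1000);
var totalHours = 0, count = 0;
db.outputs.find({ "created_at" : { "$gte" : since } }, { "started_at" : 1, "end_at" : 1 }).forEach(function (doc) {
    totalHours += (doc.end_at - doc.started_at) / 3600000;  // date difference is in ms
    count += 1;
});
print(count > 0 ? totalHours / count : 0);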
If you wait for the next major version of MongoDB, there will be a new aggregation framework that will allow you to build an aggregation pipeline for querying documents, selecting fields, and performing calculations on them, finally returning the calculated value(s). It's very cool.
https://www.mongodb.org/display/DOCS/Aggregation+Framework
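Once that framework ships, the calculation above might look roughly like this (a sketch based on the pipeline stages described at that link; operator names could still change before release):
db.outputs.aggregate([
    { "$match" : { "created_at" : { "$gte" : since } } },  // past week, as above
    { "$project" : { "hours" : { "$divide" : [ { "$subtract" : [ "$end_at", "$started_at" ] }, 3600000 ] } } },
    { "$group" : { "_id" : null, "avgHours" : { "$avg" : "$hours" } } }
])
// $subtract on two dates yields milliseconds; dividing by 3600000 converts to hours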
You can maintain the sum and count in a separate collection using the $inc operator, with a value of _id that represents a week. That way, you don't have to query all 10K records: you can just query the collection maintaining the sum & count, and divide the sum by the count to get the average.
I have explained this in detail in the following post:
http://samarthbhargava.wordpress.com/2012/02/01/real-time-analytics-with-mongodb/
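A minimal sketch of that pattern (the weekly_stats collection name and the week key format are made up for illustration): whenever an output document is saved, upsert the week's counters, then divide on read:
// on each new output document; 1.97 = that document's duration in hours
db.weekly_stats.update(
    { "_id" : "2011-32" },                            // example key: year-week
    { "$inc" : { "sumHours" : 1.97, "count" : 1 } },
    true                                              // upsert: create the week doc if missing
);
// reading the average for that week:
var s = db.weekly_stats.findOne({ "_id" : "2011-32" });
print(s.sumHours / s.count);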

MongoDB Query for records with non-existent field & indexing

We have a mongo database with around 1M documents, and we want to poll this database using a processed field to find documents which we haven't seen before. To do this we are setting a new field called _processed.
To query for documents which need to be processed, we query for documents which do not have this processed field:
db.stocktwits.find({ "_processed" : { "$exists" : false } })
However, this query takes around 30 seconds to complete each time, which is rather slow. There is an index on the _processed field:
db.stocktwits.ensureIndex({ "_processed" : -1 },{ "name" : "idx_processed" });
Adding this index does not change query performance. There are a few other indexes on the collection (namely the _id index & a unique index on a couple of fields in each document).
The _processed field is a long, perhaps this should be changed to a bool to make things quicker?
We have tried using a $where query (i.e. $where: this._processed == null) to do the same thing as $exists: false, and the performance is about the same (a few seconds slower, which makes sense)...
Any ideas on what would be causing the slow performance (or is it normal)? Does anyone have any suggestions on how to improve the query speed?
Cheers!
Upgrading to 2.0 is going to fix this for you.
From MongoDB.org:
Before v2.0, $exists is not able to use an index. Indexes on other fields are still used.
It's slow because checking for _processed -> not exists doesn't offer much selectivity. It's like having an index on "Gender": since there are only two possible options, male or female, if you have 1M rows and an index on Gender, it will still have to scan 50%, or 500K rows, to find all the males.
You need to make your index more selective.
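One pattern that does that (a sketch; the _pending field name is an invention for illustration, not from the original answer) is to flag only the unprocessed documents and put a sparse index on that flag, so the index contains nothing but the documents you still have to handle:
// set _pending : true on insert; the sparse index then only holds unprocessed docs
db.stocktwits.ensureIndex({ "_pending" : 1 }, { "sparse" : true });
db.stocktwits.find({ "_pending" : true });
// once a document is processed, drop the flag so it leaves the index
// (someId = the _id of the document that was just processed)
db.stocktwits.update({ "_id" : someId }, { "$unset" : { "_pending" : 1 } });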

Mongo group query does not use indexes and slows down queries

I am using MongoDB 1.8.1, in which I have a collection that contains more than 1.8 million records. In this collection all records are simple objects, meaning there are no nested objects or arrays,
like the following:
{ name : "xyz" , "id" : 123 ,"a" : "na" , "c" : "in" , "cmp" : "pq" , "ttl" : "sd"}
All records are like this.
On this collection, 5 queries fire at a time. 2 of them are simple queries: one contains $exists and the other uses an index properly.
Another 2 are group queries whose condition fields are indexed; one of them contains $exists.
The last one is a distinct query with a proper condition, which is also indexed.
The order in which the queries fire is: first the group queries, then one simple query, then the distinct query, and last the other simple query.
So data loads slowly.
If 2-3 such calls are made, it loads very slowly and sometimes gives a read-timeout error.
The collection has more than one index.
$exists queries do not use indexes (fixed from 1.9.1 onwards).
group commands use the JS context of MongoDB, which is exclusively locked while it is being used. This will affect the performance of concurrent group queries. A new aggregation framework is under development that should help with this (2.1 onwards). Monitor https://jira.mongodb.org/browse/SERVER-447 for progress. In my experience it is usually more performant to do "group"-like aggregation app-side, as in the sketch below.
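A minimal sketch of that app-side aggregation in the shell (the collection name is an assumption, and the counted field borrows names from the sample document above):
// count records per "cmp" value in app code, avoiding the group command's JS lock
var counts = {};
db.mycollection.find({ "c" : "in" }, { "cmp" : 1 }).forEach(function (doc) {
    counts[doc.cmp] = (counts[doc.cmp] || 0) + 1;
});
printjson(counts);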