Mongoid: why fetching count is slower than fetching documents - mongodb

I noticed some strange behavior. It might be Mongoid or MongoDB, I am not sure, but counting documents is slower than fetching them. Here are the queries I ran:
Institution.all.any_of(:portaled_at.ne => nil).any_of(portaled: true).order_by(:portaled_at.desc).count
# mongodb query and timing as per mongoid logs,
# times are consistent over multiple runs
# MONGODB (236ms) db['$cmd'].find({"count"=>"institutions", "query"=>{"$or"=>[{:portaled_at=>{"$ne"=>nil}}, {:portaled=>true}]}, "fields"=>nil}).limit(-1)
# MONGODB (245ms) db['$cmd'].find({"count"=>"institutions", "query"=>{"$or"=>[{:portaled_at=>{"$ne"=>nil}}, {:portaled=>true}]}, "fields"=>nil}).limit(-1)
Institution.all.any_of(:portaled_at.ne => nil).any_of(portaled: true).order_by(:portaled_at.desc).to_a
# mongodb query and timing as per mongoid logs
# times are not so consistent over multiple runs,
# but consistently much lower than count query
# MONGODB (9ms) db['institutions'].find({"$or"=>[{:portaled_at=>{"$ne"=>nil}}, {:portaled=>true}]}).sort([[:portaled_at, :desc]])
# MONGODB (18ms) db['institutions'].find({"$or"=>[{:portaled_at=>{"$ne"=>nil}}, {:portaled=>true}]}).sort([[:portaled_at, :desc]])
I believe MongoDB does not use indexes for $and and $or queries, but in case it matters, I have a sparse index on portaled_at in descending order. Out of around 200,000 documents, only around 50-60 have portaled_at set.
rails 3.2.12
mongoid 2.6.0
mongodb 2.2.3
This is against my common sense and if anybody can explain what is going on I would really appreciate it.

While the two run through different subsystems in MongoDB (one uses runCommand and the other the standard query engine), the specific issue in this case is very likely a known issue in the current version of MongoDB.
The quick summary is that counting without fetching is extremely slow because MongoDB does a lot of extra work that often isn't necessary. It has been fixed in the development branch, so the fix should be in 2.4 when it is released.

For some reason Mongo defaults to not counting records using only indexes. However, if you construct a query correctly, Mongo will count from the index. The trick is to only fetch the fields that are in the index, and you have to specify a query.
In Mongo Shell:
db.MyCollection.find({"_id":{$ne:''}},{"_id":1}).count()
You can check with the explain method:
db.MyCollection.find({"_id":{$ne:''}},{"_id":1}).explain()
Which will include "indexOnly" : true in the output.
And similarly the command can be executed via the Moped driver directly like so:
Mongoid::Sessions.default.command(:count => "MyCollection", :query=>{"_id"=>{"$ne"=>""}}, :fields => {:_id=>1})
Which, in my benchmarks (on my live data, YMMV) is about 100x faster than simply doing MyMongoidDocumentClass.count
Unfortunately, there doesn't seem to be a way to do this quickly through the Mongoid gem.
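As a minimal sketch of the command shown above, the count document can be built by a small helper so the same shape is reused across collections. The helper name index_only_count_command is hypothetical and not part of Mongoid or Moped. The :fields key simply mirrors the command quoted in this answer, and whether the server really counts from the index should be confirmed with explain on your own data.

```ruby
# Hypothetical helper (not part of Mongoid or Moped): builds the same count
# command document quoted above for any collection name.
def index_only_count_command(collection_name)
  {
    :count  => collection_name,
    # A query on an indexed field, as in the shell example above.
    :query  => { "_id" => { "$ne" => "" } },
    # Restrict the projection to fields that are in the index.
    :fields => { :_id => 1 }
  }
end

command = index_only_count_command("MyCollection")
# With a live Mongoid session (assumed, not run here) this hash would be
# passed as: Mongoid::Sessions.default.command(command)
```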

Related

MongoDB collection toArray() length is 20 less than collection.count()

I am using MongoDB version 3.6.3 on an Ubuntu operating system.
I have created a collection with 100 records.
To manipulate the data in the mongo shell, I assign a cursor like this:
cur = db.dummyData.find({}, {_id: 0})
Now cur.count() is 100 but cur.toArray().length is 80.
I am not sure why this is happening. I have tried this with several different collections, and the toArray() length is always 20 less than the actual count.
I would appreciate any help understanding this behavior.
MongoDB keeps a running count of documents for each collection, which is updated on every insert/delete operation. Some events, such as a hard shutdown, can leave this number in the metadata out of sync with the actual collection.
The cursor.count() function asks MongoDB for this number from the metadata without fetching any documents, so it is very fast. The cursor.itcount() function actually fetches the documents, so it runs slower but always returns an accurate count.
To correct the count in the collection's metadata, run db.collectionName.validate(true) on the collection in question from the mongo shell.

MongoDB - A very Simple query is taking high response time

We have an issue with MongoDB query performance. There is a collection in MongoDB with approximately 30 properties per document and close to 2000 documents in total. Below is an example with one collection, "deals"; we face a similar issue with almost all the collections in the database. Please note that I have created indexes, including compound indexes based on the query use cases, and none of the queries shows a COLLSCAN. I have tried this with MongoDB 3.6 as well as 4.x, and the result is the same.
db.getCollection("deals").find({isActive:true, branch: ObjectId("5af1de276fcd080007ed79fd")})

Meteor Mongo Never Returns Data

Using Meteor 1.3.2.4 and Mongo 3.2 (which doesn't seem like it should have major problems using it with Meteor), when running queries (really any queries) on collections larger than ~10,000 documents, they never return, or take many minutes to return.
I have indexes on most of these fields; this is not a no-index problem (I wish).
There is no evidence of any issues in the mongodb logs, just connection accepted. I have no mongo warnings or errors of any kind (fixed the mongo kernel warnings I had).
And the weirdest part about this is that when using the mongo cli, these queries run just fine, in a second or so.
One collection I'm running has ~500k docs and the other 15M.
What could be the issue? I read a few places that MongoDB 3.2 should work fine with Meteor, am I wrong?

Query with $in operator and large list of Ids

I have a pretty large number of document ids to iterate through (say 5k-10k). Starting with MongoDB 2.6, the $in operator no longer limits that number. Earlier versions had a combinatorial limit of 4 million.
That said, does it make sense at all to do something like that in mongodb or is it an anti-pattern with performance penalties and I should iterate manually in the application layer?
It's somewhat of an anti-pattern, but sometimes there's no other choice.
If you can change the schema and make that query redundant, then you should. If you can't, doing it yourself will surely be slower than letting MongoDB do it.
However, there is another limit you need to consider. Each document in MongoDB is limited to 16MB and each query is sent as a document so with enough items you may reach that limit and get an exception.
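To get a feel for how close a list of ids comes to that limit, here is a rough sketch that estimates the encoded size of the $in array and, if needed, splits the ids into batches. It assumes the ids are 12-byte ObjectId values and counts only the array itself, not the rest of the query message. Both helper names are hypothetical.

```ruby
# Assumption: ids are 12-byte ObjectId values. In BSON an array is stored as a
# document whose keys are the decimal indexes "0", "1", ... so each element
# costs 1 type byte + the index string + 1 NUL terminator + the value bytes.
def in_array_bytes(count, value_bytes = 12)
  elements = (0...count).reduce(0) do |sum, i|
    sum + 1 + i.to_s.length + 1 + value_bytes
  end
  elements + 5 # 4-byte length prefix + trailing NUL of the array document
end

# Hypothetical helper: splits ids into batches and builds one $in filter each.
def in_filters(ids, batch_size)
  ids.each_slice(batch_size).map { |batch| { "_id" => { "$in" => batch } } }
end

max_bson_bytes = 16 * 1024 * 1024
estimate = in_array_bytes(10_000)
puts "#{estimate} bytes of #{max_bson_bytes} allowed"
```

Under these assumptions, 10,000 ObjectIds take 178,895 bytes, a small fraction of 16MB, so batching with in_filters only becomes necessary for much larger lists or much larger id values.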

Mongodb model for Uniqueness

Scenario:
10,000,000 records/day
Records:
Visitor, day of visit, cluster (Where do we see it), metadata
What we want to know with this information:
Unique visitor on one or more clusters for a given range of dates.
Unique Visitors by day
Grouping metadata for a given range (Platform, browser, etc)
The model I settled on in order to query this information easily is:
{
VisitorId:1,
ClusterVisit: [
{clusterId:1, dates:[date1, date2]},
{clusterId:2, dates:[date1, date3]}
]
}
Index:
by VisitorId (to ensure Uniqueness)
by ClusterVisit.ClusterId-ClusterVisit.dates (for searching)
by IdUser-ClusterVisit.IdCluster (for updating)
I also have to split groups of clusters into different collections in order to access the data more efficiently.
Importing:
First we search for a combination of VisitorId - ClusterId and we addToSet the date.
Second:
If first doesn't match, we upsert:
$addToSet: {VisitorId:1,
ClusterVisit: [{clusterId:1, dates:[date1]}]
}
With the first and second import steps I cover the cases where the clusterId doesn't exist or the VisitorId doesn't exist.
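The two import steps above can be sketched as plain filter/update documents. This is an illustration under assumptions, not the exact statements used in the post: the helper names are hypothetical, the second step uses $push rather than the $addToSet written above, and nothing here handles concurrent writers.

```ruby
# Hypothetical helpers, not from the original post: each returns the
# [filter, update] pair for one step of the import described above.

# Step 1: the visitor already has an entry for this cluster, so add the date
# to it. The positional "$" refers to the array element matched by the filter.
def add_date_to_cluster(visitor_id, cluster_id, date)
  filter = { "VisitorId" => visitor_id, "ClusterVisit.clusterId" => cluster_id }
  update = { "$addToSet" => { "ClusterVisit.$.dates" => date } }
  [filter, update]
end

# Step 2: run only when step 1 matched nothing. Sent with upsert enabled, it
# creates the visitor if needed and appends a new cluster entry.
def add_cluster_entry(visitor_id, cluster_id, date)
  filter = { "VisitorId" => visitor_id }
  entry  = { "clusterId" => cluster_id, "dates" => [date] }
  update = { "$push" => { "ClusterVisit" => entry } }
  [filter, update]
end
```

Each pair would be handed to the driver's update call, with the upsert option set for the second step only.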
Problems:
Totally inefficient (nearly impossible) on update / insert / upsert when the collection grows, I guess because the document gets bigger each time a new date is added.
Difficult to maintain (mostly unsetting dates).
I have a collection with more than 50,000,000 records that I can't grow any more. It updates only ~100 records/sec.
I think the model I'm using is not the best for this volume of data. What do you think would be best to get more upserts/sec and query the information FAST, before I mess with sharding, which will take more time while I learn it and get confident with it?
I have a x1.large instance on AWS
RAID 10 with 10 disks
Arrays are expensive on large collections: mapreduce, aggregate...
Try .explain():
MongoDB 'count()' is very slow. How do we refine/work around with it?
Add explicit hints for index:
Simple MongoDB query very slow although index is set
A full heap?:
Insert performance of node-mongodb-native
The end of memory space for collection:
How to improve performance of update() and save() in MongoDB?
Special read clustering:
http://www.colinhowe.co.uk/2011/02/23/mongodb-performance-for-data-bigger-than-memor/
Global write lock?:
mongodb bad performance
Slow logs performance track:
Track MongoDB performance?
Rotate your logs:
Does logging output to an output file affect mongoDB performance?
Use profiler:
http://www.mongodb.org/display/DOCS/Database+Profiler
Move some collection caches to RAM:
MongoDB preload documents into RAM for better performance
Some ideas about collection allocation size:
MongoDB data schema performance
Use separate collections:
MongoDB performance with growing data structure
A single query can only use one index (better is a compound one):
Why is this mongodb query so slow?
A missing key?:
Slow MongoDB query: can you explain why?
Maybe shards:
MongoDB's performance on aggregation queries
Improving performance stackoverflow links:
https://stackoverflow.com/a/7635093/602018
A good starting point for learning more about sharding and replication:
https://education.10gen.com/courses