MongoDB collection toArray() length is 20 less than collection.count() - mongodb

I am using MongoDB version 3.6.3 on an Ubuntu operating system.
I have created a collection with 100 records.
To manipulate the data in the mongo shell, I assign a cursor like below:
cur = db.dummyData.find({}, {_id: 0})
Now cur.count() is 100 but cur.toArray().length is 80.
I am not sure why this is happening. I have tried with a bunch of different collections, and the toArray() length is always 20 less than the actual count.
I would appreciate any help understanding this behavior.

MongoDB keeps a running count of documents for each collection, which is updated on each insert/delete operation. Some events, such as a hard shutdown, can cause this number in the metadata to differ from the actual contents of the collection.
The cursor.count() function asks MongoDB for this number from the metadata without fetching any documents, so it is very fast. The cursor.itcount() function actually fetches the documents, so it runs slower, but it always returns an accurate count.
To correct the count in the collection's metadata, run db.collectionName.validate(true) on the collection in question from the mongo shell.
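The difference between a cached count and a counted scan can be illustrated with a toy model in plain JavaScript. This is only a sketch under stated assumptions: ToyCollection, fastCount, scanCount and the counterPersisted flag are hypothetical names invented for illustration, not MongoDB APIs, and the way the counter "loses" updates is a simplification of what a hard shutdown might do.

```javascript
// Toy model only: none of these names are MongoDB APIs.
class ToyCollection {
  constructor() {
    this.docs = [];
    this.cachedCount = 0; // stands in for the count kept in collection metadata
  }

  // counterPersisted = false models an insert whose counter update was lost,
  // which is the assumption this sketch makes about a hard shutdown.
  insert(doc, counterPersisted = true) {
    this.docs.push(doc);
    if (counterPersisted) this.cachedCount += 1;
  }

  // Like count(): reads the cached number without touching documents.
  fastCount() {
    return this.cachedCount;
  }

  // Like itcount(): walks every document, so it is slower but exact.
  scanCount() {
    let n = 0;
    for (const _doc of this.docs) n += 1;
    return n;
  }

  // Like validate(true) in the answer above: recompute and store the count.
  validate() {
    this.cachedCount = this.scanCount();
  }
}

const coll = new ToyCollection();
for (let i = 0; i < 100; i++) {
  coll.insert({ n: i }, i < 97); // the last 3 counter updates are "lost"
}

console.log(coll.fastCount(), coll.scanCount()); // the two numbers disagree
coll.validate();
console.log(coll.fastCount(), coll.scanCount()); // now they agree
```

In the real shell, the equivalent comparison would be cur.count() against cur.itcount() on a fresh cursor.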

Related

Difference between count() and find().count() in MongoDB

What is the difference between the two? I basically want to find all the documents in mycollection.
db.mycollection.count() vs
db.mycollection.find().count()?
They both return the same result. Is there any reason why somebody would choose count() over find().count()? This is in contrast to the fact that find() has a default limit applied (correct me if I'm wrong), after which you have to type "it" in order to see more in the shell.
db.collection.count() and cursor.count() are simply wrappers around the count command, so running db.collection.count() and cursor.count() with or without the same query argument will return the same result. However, the count result can be inaccurate in a sharded cluster.
MongoDB drivers compatible with the 4.0 features deprecate their
respective cursor and collection count() APIs in favor of new APIs for
countDocuments() and estimatedDocumentCount(). For the specific API
names for a given driver, see the driver documentation.
The db.collection.countDocuments() method internally uses an aggregation query to return the document count, while db.collection.estimatedDocumentCount() returns a document count based on metadata.
It is worth mentioning that the estimatedDocumentCount output can be inaccurate as mentioned in the documentation.
db.collection.count() without parameters counts all documents in a collection. db.collection.find() without parameters matches all documents in a collection, and appending count() counts them, so there is no difference.
This is confirmed explicitly in the db.collection.count() documentation:
To count the number of all documents in the orders collection, use the
following operation:
db.orders.count()
This operation is equivalent to the following:
db.orders.find().count()
As is mentioned in another answer by sheilak, the two are equivalent - except that db.collection.count() can be inaccurate for sharded clusters.
The latest documentation says:
count() is equivalent to the db.collection.find(query).count()
construct.
And then,
Sharded Clusters
On a sharded cluster, db.collection.count() can result in an
inaccurate count if orphaned documents exist or if a chunk migration
is in progress.
The documentation explains how to mitigate this bug (use an aggregate).
db.collection.count() is equivalent to the db.collection.find(query).count() construct.
Examples
Count all Documents in a Collection
db.orders.count()
This operation is equivalent to the following:
db.orders.find().count()
Count all Documents that Match a Query
Count the number of the documents in the orders collection with the field ord_dt greater than new Date('01/01/2012'):
db.orders.count( { ord_dt: { $gt: new Date('01/01/2012') } } )
The query is equivalent to the following:
db.orders.find( { ord_dt: { $gt: new Date('01/01/2012') } } ).count()
As per the documentation, db.collection.count() can be inaccurate in the following scenarios:
On a sharded cluster, db.collection.count() without a query predicate can result in an inaccurate count if orphaned documents exist or if a chunk migration is in progress.
After an unclean shutdown of a mongod using the WiredTiger storage engine, count statistics reported by count() may be inaccurate.
I believe if you are using some kind of pagination like:
find(query).limit().skip().count()
You will not get the same result as
count(query)
So in cases like this, if you want to get the total, I think you might have to use both.
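The distinction between "total matches" and "matches on this page" can be sketched in plain JavaScript over an in-memory array. The sample documents, matches, pageSize and pageIndex are made up for illustration. Note that whether a cursor count honors skip and limit depends on the API in use: in the mongo shell, cursor.count() ignores skip() and limit() unless it is called as count(true) (or via cursor.size()).

```javascript
// In-memory stand-in for a collection; the documents are made up.
const docs = Array.from({ length: 95 }, (_, i) => ({ n: i, even: i % 2 === 0 }));

const matches = (doc) => doc.even; // plays the role of `query`
const pageSize = 10;
const pageIndex = 4; // zero-based, so this is the fifth page

// Total number of matching documents: what count(query) reports.
const matched = docs.filter(matches);
const total = matched.length;

// Number of documents on the requested page: the count with skip/limit applied.
const page = matched.slice(pageIndex * pageSize, (pageIndex + 1) * pageSize);

// A pager usually needs both numbers.
const totalPages = Math.ceil(total / pageSize);

console.log({ total, onThisPage: page.length, totalPages });
```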

Incorrect Count returned by MongoDB (WiredTiger)

This sounds odd, and I hope I am doing something wrong, but my MongoDB collection is returning the Count off by one in my collection.
I have a collection with (I am sure) 359671 documents. However the count() command returns 359670 documents.
I am executing the count() command using the mongo shell:
rs0:PRIMARY> db.COLLECTION.count()
359670
This is incorrect.
It is not finding each and every document in my collection.
If I provide the following query to count, I get the correct result:
rs0:PRIMARY> db.COLLECTION.count({_id: {$exists: true}})
359671
I believe this is a bug in WiredTiger. As far as I am aware each document has the same definition, an _id field of an integer ranging from 0 to 359670, and a BinData field. I did not have this problem with the older storage engine (or Mongo 2, either could have caused the issue).
Is this something I have done wrong? I do not want to use the {_id: {$exists: true}} query as that takes 100x longer to complete.
According to this issue, this behaviour can occur if mongodb experiences a hard crash and is not shut down gracefully. If not issuing any query, mongodb probably just falls back to the collected statistics.
According to the article, calling db.COLLECTION.validate(true) should reset the counters.
As now stated in the documentation, db.collection.count() without a query parameter returns results based on the collection's metadata:
This may result in an approximate count. In particular:
On a sharded cluster, the resulting count will not correctly filter out orphaned documents.
After an unclean shutdown, the count may be incorrect.
When using a query parameter, as you did in the second query ({_id: {$exists: true}}), then it forces count to not use the collection's metadata, but to scan the collection instead.
Starting in MongoDB 4.0.3, count() is considered deprecated and the following alternatives are recommended instead:
Exact count of documents:
db.collection.countDocuments({})
which under the hood actually performs the following "expensive", but accurate aggregation (expensive since the whole collection is scanned to count records):
db.collection.aggregate([{ $group: { _id: null, n: { $sum: 1 } } }])
Approximate count of documents:
db.collection.estimatedDocumentCount()
which performs exactly what db.collection.count() does/did (it's actually a wrapper around count), which uses the collection’s metadata.
This is thus almost instantaneous, but may lead to an approximate result in the particular cases mentioned above.
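To make the cost of the exact count concrete, here is a plain-JavaScript imitation of what a { $group: { _id: null, n: { $sum: 1 } } } stage computes. It is a sketch, not driver or server code: groupCount and countDocumentsModel are hypothetical helper names, and the sample array stands in for a collection.

```javascript
// Imitates { $group: { _id: null, n: { $sum: 1 } } } over an array.
// Every input document is visited once, which is why the real
// aggregation is accurate but proportional to collection size.
function groupCount(documents) {
  let n = 0;
  for (const _doc of documents) n += 1; // $sum: 1 adds one per document
  // With no input documents a $group stage emits no output documents.
  return n === 0 ? [] : [{ _id: null, n }];
}

// Hypothetical wrapper: turn the aggregation result into a plain number.
function countDocumentsModel(documents) {
  const result = groupCount(documents);
  return result.length > 0 ? result[0].n : 0;
}

const sample = [{ _id: 1 }, { _id: 2 }, { _id: 3 }];
console.log(groupCount(sample));
console.log(countDocumentsModel([]));
```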

Mongos count items not real?

I have a strange behavior of count() function in a mongos instance.
More than one hour ago I updated about 8,000 items in the posts collection because I needed to convert the tags objects to arrays.
Now, when I query mongos with:
mongos> db.posts.find({blog: 'blog1', tags: {$type: 3}}).count()
4139
mongos> db.posts.findOne({blog: 'blog1', tags: {$type: 3}})
null
Why does count() show 4139 items while findOne returns null, even though the replica sets are synchronized?
EDIT:
There are 4 RS (all synchronized).
I also did the same count query on all PRIMARIES and the result is always 0.
Only if I count on mongos the result is 4139!
count() takes its value from the count field in the metadata, and in a sharded environment it can show a wrong value (there is a bug). It may count chunks which are currently being moved by the balancer. I assume that you have more than one shard.
I would not rely on count() in an environment with shards; use a simple M/R script instead (try to check the number with M/R, by the way) until the above-mentioned bug is fixed (2.5?). You can also take a look at my question regarding count - db.collection.count() returns a lot more documents for sharded collection in MongoDB
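Why a chunk that is being moved can be counted twice is easier to see in a toy model. The following plain-JavaScript sketch assumes a simplified cluster where a document being migrated briefly exists on both the source and the destination shard; chunkOwner, shards, naiveCount and ownedCount are hypothetical names, not MongoDB internals.

```javascript
// Toy model: chunk B is being migrated from shard1 to shard2.
// Until the migration commits, shard1 still owns both chunks.
const chunkOwner = { A: "shard1", B: "shard1" };

const shards = {
  shard1: [
    { _id: 1, chunk: "A" },
    { _id: 2, chunk: "A" },
    { _id: 3, chunk: "B" },
    { _id: 4, chunk: "B" },
  ],
  // shard2 has already received a copy of document 3.
  shard2: [{ _id: 3, chunk: "B" }],
};

// Summing per-shard totals counts the in-flight copy twice.
function naiveCount() {
  return Object.values(shards).reduce((sum, docs) => sum + docs.length, 0);
}

// Counting only documents whose chunk the shard owns filters the copy out.
function ownedCount() {
  let n = 0;
  for (const [shardName, docs] of Object.entries(shards)) {
    for (const doc of docs) {
      if (chunkOwner[doc.chunk] === shardName) n += 1;
    }
  }
  return n;
}

console.log(naiveCount(), ownedCount());
```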
If count() and limit() are acting weird, maybe your best shot is trying to repair the database. Go into the Mongo Shell and enter the following command:
db.repairDatabase()
For further explanations you can check the MongoDB docs.

Mongoid: why fetching count is slower than fetching documents

I noticed a strange behavior. It might be Mongoid or MongoDB, I am not sure, but counting documents is slower than fetching the documents. Here are the queries I fired:
Institution.all.any_of(:portaled_at.ne => nil).any_of(portaled: true).order_by(:portaled_at.desc).count
# mongodb query and timing as per mongoid logs,
# times are consistent over multiple runs
# MONGODB (236ms) db['$cmd'].find({"count"=>"institutions", "query"=>{"$or"=>[{:portaled_at=>{"$ne"=>nil}}, {:portaled=>true}]}, "fields"=>nil}).limit(-1)
# MONGODB (245ms) db['$cmd'].find({"count"=>"institutions", "query"=>{"$or"=>[{:portaled_at=>{"$ne"=>nil}}, {:portaled=>true}]}, "fields"=>nil}).limit(-1)
Institution.all.any_of(:portaled_at.ne => nil).any_of(portaled: true).order_by(:portaled_at.desc).to_a
# mongodb query and timing as per mongoid logs
# times are not so consistent over multiple runs,
# but consistently much lower than count query
# MONGODB (9ms) db['institutions'].find({"$or"=>[{:portaled_at=>{"$ne"=>nil}}, {:portaled=>true}]}).sort([[:portaled_at, :desc]])
# MONGODB (18ms) db['institutions'].find({"$or"=>[{:portaled_at=>{"$ne"=>nil}}, {:portaled=>true}]}).sort([[:portaled_at, :desc]])
I believe indexes are not used by mongodb for $and and $or queries, but just so if it matters, I have a sparse index on portaled_at in descending order. Out of around 200,000 documents only around 50-60 have portaled_at set.
rails 3.2.12
mongoid 2.6.0
mongodb 2.2.3
This is against my common sense and if anybody can explain what is going on I would really appreciate it.
While the two run through different subsystems in MongoDB (one uses runCommand and the other the standard query engine), the specific issue in this case is very likely a known issue in the current version of MongoDB.
The quick summary is that counting without fetching is extremely slow because MongoDB does a lot of extra work that often isn't necessary. It has been fixed in the development branch, so it should be in 2.4 when it is released.
For some reason Mongo defaults to not counting records using only indexes. However, if you construct a query correctly, Mongo will count from the index. The trick is to only fetch the fields that are in the index, and you have to specify a query.
In Mongo Shell:
db.MyCollection.find({"_id":{$ne:''}},{"_id":1}).count()
You can check with the explain method:
db.MyCollection.find({"_id":{$ne:''}},{"_id":1}).explain()
Which will include "indexOnly" : true in the output.
And similarly the command can be executed via the Moped driver directly like so:
Mongoid::Sessions.default.command(:count => "MyCollection", :query=>{"_id"=>{"$ne"=>""}}, :fields => {:_id=>1})
Which, in my benchmarks (on my live data, YMMV) is about 100x faster than simply doing MyMongoidDocumentClass.count
Unfortunately, there doesn't seem to be a way to do this quickly through the Mongoid gem.

general questions about using mongodb

I'm thinking about trying MongoDB to use for storing our stats but have some general questions about whether I'm understanding it correctly before I actually start learning it.
I understand the concept of using documents, what I'm not too clear about is how much data can be stored inside each document. The following diagram explains the layout I'm thinking of:
Website (document)
- some keys/values about the particular document
- statistics (tree)
- millions of rows where each record is inserted from a pageview (key/value array containing data such as timestamp, ip, browser, etc)
What got me excited about mongodb was the grouping functions such as:
http://www.mongodb.org/display/DOCS/Aggregation
db.test.group(
{ cond: {"invoked_at.d": {$gte: "2009-11", $lt: "2009-12"}}
, key: {http_action: true}
, initial: {count: 0, total_time:0}
, reduce: function(doc, out){ out.count++; out.total_time+=doc.response_time }
, finalize: function(out){ out.avg_time = out.total_time / out.count }
} );
But my main concern is how hard that command would be on the server if there are, say, tens of millions of records across dozens of documents on a 512MB-1GB RAM server on Rackspace. Would it still run at low load?
Is there any limit to the number of documents MongoDB can have (separate databases)? Also, is there any limit to the number of records in the tree I explained above? Also, does the query I showed above run instantly, or is it some sort of map/reduce query? I am not sure whether I can execute it on page load in our control panel to get those stats instantly.
Thanks!
Every document has a size limit of 4MB (which in text is A LOT).
It's recommended to run MongoDB in replication mode or to use sharding, as you will otherwise have problems with single-server durability. Single-server durability is not guaranteed because MongoDB only fsyncs to disk every 60 seconds, so if your server goes down between two fsyncs, the data that was inserted or updated in that time will be lost.
There is no limit of documents other than your disk space in mongodb.
You should try to import a dataset that matches your data (or generate some test data) to MongoDB and analyse how fast your query executes. Remember to set indexes on those fields that you use heavily in your queries. Your above query should work pretty well even with a lot of data.
In order to analyze the speed of your query use the database profiler MongoDB comes with. On the mongo shell do:
db.setProfilingLevel(2); // to set the profiling level
[your query]
db.system.profile.find(); // to see the results
Remember to turn off profiling once you're finished (log will get pretty huge otherwise).
Regarding your database layout, I suggest changing the "schema" (yeah yeah, schemaless..) to:
website (collection):
- some keys/values about the particular document
statistics (collection)
- millions of rows where each record is inserted from a pageview (key/value array containing data such as timestamp, ip, browser, etc)
+ DBRef to website
See Database References
Documents in MongoDB are limited to a size of 4MB. Let's say a single page view results in 32 bytes being stored. Then you'll be able to store about 130,000 page views in a single document.
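The arithmetic behind that estimate, as a small JavaScript sketch. It uses the 4MB limit and the 32 bytes per page view quoted above; the target of 5,000,000 page views is a made-up figure for illustration, and field-name and array overhead inside the document are ignored.

```javascript
const MAX_DOC_BYTES = 4 * 1024 * 1024; // the 4MB document limit quoted above
const BYTES_PER_VIEW = 32;             // assumed size of one stored page view

// How many page views fit in one document, ignoring per-field overhead.
const viewsPerDoc = Math.floor(MAX_DOC_BYTES / BYTES_PER_VIEW);

// Hypothetical workload: how many full documents would be needed
// if all page views were embedded instead of stored separately.
const viewsWanted = 5000000;
const docsNeeded = Math.ceil(viewsWanted / viewsPerDoc);

console.log(viewsPerDoc); // 131072, i.e. "about 130,000"
console.log(docsNeeded);  // 39
```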
Basically the amount of page views a page can generate is infinite, and you indicated that you expect millions of them, so I suggest you store the log entries as separate documents. Each log entry should contain the _id of the parent document.
The number of documents in a database is limited to 2GB of total space on 32-bit systems. 64-bit systems don't have this limitation.
The group() function is a map-reduce query under the hood. The documentation recommends you use a map-reduce query instead of group(), because it has some limitations with large datasets and sharded environments.
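To see what the cond, key, initial, reduce and finalize options from the question actually compute, here is a plain-JavaScript imitation that runs over an in-memory array. It is a sketch only: groupInMemory is a hypothetical helper, not the shell's group() implementation, and the requests array is invented sample data.

```javascript
// Imitation of the group() options from the question, applied to an array.
function groupInMemory(documents, { cond, keyField, initial, reduce, finalize }) {
  const buckets = new Map();
  for (const doc of documents) {
    if (!cond(doc)) continue; // cond: filter documents first
    const key = doc[keyField]; // key: field to group by
    if (!buckets.has(key)) {
      buckets.set(key, { [keyField]: key, ...initial }); // fresh accumulator
    }
    reduce(doc, buckets.get(key)); // reduce: fold the document into it
  }
  const out = [...buckets.values()];
  out.forEach(finalize); // finalize: post-process each group once
  return out;
}

// Made-up sample data shaped like the documents in the question.
const requests = [
  { http_action: "GET /", invoked_at: { d: "2009-11-03" }, response_time: 100 },
  { http_action: "GET /", invoked_at: { d: "2009-11-15" }, response_time: 300 },
  { http_action: "POST /login", invoked_at: { d: "2009-11-20" }, response_time: 50 },
  { http_action: "GET /", invoked_at: { d: "2009-12-01" }, response_time: 900 },
];

const stats = groupInMemory(requests, {
  cond: (doc) => doc.invoked_at.d >= "2009-11" && doc.invoked_at.d < "2009-12",
  keyField: "http_action",
  initial: { count: 0, total_time: 0 },
  reduce: (doc, out) => { out.count++; out.total_time += doc.response_time; },
  finalize: (out) => { out.avg_time = out.total_time / out.count; },
});

console.log(stats);
```

Every matching document passes through reduce once, so the work grows with the number of records in the date range rather than being constant.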