MongoDB find->insert and count() give different results

I am trying to filter a data collection by running a query and storing the results in a smaller collection. However, the number of records found using count() and the number of documents in the new collection are very different (count() is much higher). Am I doing something wrong?
This returns about 110 million.
db.getCollection('ex').count({
    'data.points': {$exists: true},
    'data.points.points': {$exists: false},
}, {
    'data.id': 1,
    'data.author.id': 1
})
Then I execute this.
db.getCollection('ex').find({
    'data.points': {$exists: true},
    'data.points.points': {$exists: false},
}, {
    'data.id': 1,
    'data.author.id': 1
})
.forEach(function (doc) {
    db.s_uid_wid.insert(doc)
})
But the new collection ends up with only about 5 million records. They should be exactly the same. What is going on?
db.getCollection('s_uid_wid').count({})
Edit
Previously I was running this in the Robomongo GUI, which gave the impression that everything was fine. Then I tried it in the mongo shell and got this:
2016-02-04T00:39:21.735+0800 Error: getMore: cursor didn't exist on server, possible restart or timeout? at src/mongo/shell/query.js:116

The following fixes the problem. It takes about one day to complete the insertion.
db.getCollection('ex').find({
    'data.points': {$exists: true},
    'data.points.points': {$exists: false},
}, {
    'data.id': 1,
    'data.author.id': 1
}).addOption(DBQuery.Option.noTimeout)
.forEach(function (doc) {
    db.s_uid_wid.insert(doc)
})
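An alternative that avoids keeping a long-lived cursor open on the client is to let the server do the copy itself with an aggregation $out stage (a sketch under the same filter and projection; note that $out replaces the target collection, so it should run against an empty s_uid_wid):
db.getCollection('ex').aggregate([
    // same filter as the find()/count() above
    {$match: {
        'data.points': {$exists: true},
        'data.points.points': {$exists: false}
    }},
    // same projection
    {$project: {'data.id': 1, 'data.author.id': 1}},
    // write the result server-side
    {$out: 's_uid_wid'}
], {allowDiskUse: true})
Because the write happens entirely on the server, no documents travel through the shell and there is no getMore cursor left around to time out.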

Related

Difference between the updateOne() and findOneAndUpdate() methods in MongoDB

What is the difference between the updateOne() and findOneAndUpdate() methods in MongoDB?
I can't seem to understand their differences. I would appreciate a demonstrative example using updateOne() and findOneAndUpdate().
Insert a document in an otherwise empty collection using the mongo-shell to start:
db.users.insertOne({name: "Jack", age: 11})
UpdateOne
db.users.updateOne({name: "Jack"}, {$set: {name: "Joe"}})
This operation returns an UpdateResult.
{ acknowledged: true,
insertedId: null,
matchedCount: 1,
modifiedCount: 1,
upsertedCount: 0 }
FindOneAndUpdate
db.users.findOneAndUpdate({name: "Joe"}, {$set: {name: "Jill"}})
This operation returns the document that was updated.
{ _id: ObjectId("62ecf94510fc668e92f3cecf"),
name: 'Joe',
age: 11 }
findOneAndUpdate() is preferred when you need to update a document and fetch it at the same time.
If you need the updated document returned instead of the original one, you can do it in one of these ways:
db.users.findOneAndUpdate(
{name: "Joe"},
{$set: {name: "Jill"}},
{returnDocument: "after"}
)
returnDocument: "before" --> returns the original document (default).
returnDocument: "after" --> returns the updated document.
Or
db.users.findOneAndUpdate(
{name: "Joe"},
{$set: {name: "Jill"}},
{returnNewDocument: true}
)
returnNewDocument: false --> returns the original document (default).
returnNewDocument: true --> returns the updated document.
Note: If both options are set (returnDocument and returnNewDocument), returnDocument takes precedence.
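findOneAndUpdate() also accepts an upsert option, which combines nicely with returnDocument: "after" when you want the newly inserted document back. A small sketch (the name "Ann" is just an example that does not yet exist in the collection):
db.users.findOneAndUpdate(
    {name: "Ann"},
    {$set: {age: 20}},
    {upsert: true, returnDocument: "after"}
)
// returns the upserted document, e.g. { _id: ObjectId("..."), name: 'Ann', age: 20 }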

Can you have a collection that's randomly distributed in mongodb?

I have a collection that's essentially just a collection of unique IDs, and I want them stored in random order so I can just findOne() quickly instead of using $sample, since that's faster than an aggregation.
I ran the following aggregation to sort it randomly:
db.my_coll.aggregate([{"$sample": {"size": 1200000}}, {"$out": {db: "db", coll: "my_coll"}}], {allowDiskUse: true})
it seems to work?
db.my_coll.find():
{ _id: 581848, schema_version: 1 },
{ _id: 1184557, schema_version: 1 },
{ _id: 213688, schema_version: 1 },
....
Is this allowed? I thought _id is a default index, and it should always be sorted by the index. I'm only ever removing elements from this collection, so it's fine if they don't get inserted randomly, but I don't know if this is just a hack that might at some point behave differently.
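For reference, the difference between natural order and index order can be checked directly in the shell (a minimal sketch; find() without a sort gives no guaranteed order, typically storage order, while an explicit sort uses the _id index):
db.my_coll.find().limit(3)                  // no sort: order not guaranteed, typically storage order
db.my_coll.find().sort({ _id: 1 }).limit(3) // explicit sort: _id index order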

Aggregation pipeline "latest for all distinct id" is very slow, need to create proper indexes?

Consider the following aggregation pipeline, which returns the newest entry for each distinct "internal_id":
db.locations.aggregate([{$sort: {timestamp: -1}}, {$group: {_id: "$internal_id", doc: {$first: "$$ROOT"}}}])
This call takes up to 10 seconds, which is not acceptable. The collection is not so huge:
db.locations.count()
1513671
So I guess there's something wrong with the indexes. However, I tried creating many indexes and none of them brought an improvement; currently I've kept the two that were supposed to be enough IMHO: {timestamp: -1, internal_id: 1} and {internal_id: 1, timestamp: -1}.
MongoDB is NOT sharded, and runs as a 3-host replica set on version 3.6.14.
MongoDB log show the following:
2020-05-30T12:21:18.598+0200 I COMMAND [conn12652918] command mydb.locations appName: "MongoDB Shell" command: aggregate { aggregate: "locations", pipeline: [ { $sort: { timestamp: -1.0 } }, { $group: { _id: "$internal_id", doc: { $first: "$$ROOT" } } } ], cursor: {}, lsid: { id: UUID("70fea740-9665-4068-a2b5-b7b0f10dcde9") }, $clusterTime: { clusterTime: Timestamp(1590834060, 34), signature: { hash: BinData(0, 9DFB6DBCEE52CFA3A5832DC209519A8E9D6F1204), keyId: 6783976096153993217 } }, $db: "mydb" } planSummary: IXSCAN { timestamp: -1, ms_id: 1 } cursorid:8337712045451536023 keysExamined:1513708 docsExamined:1513708 numYields:11838 nreturned:101 reslen:36699 locks:{ Global: { acquireCount: { r: 24560 } }, Database: { acquireCount: { r: 12280 } }, Collection: { acquireCount: { r: 12280 } } } protocol:op_msg 7677ms
Mongo aggregations are theoretically descriptive (in that you describe what you want to have happen, and the query optimizer figures out an efficient way of doing that calculation), but in practice many aggregations end up being procedural & not optimized. If you take a look at the procedural aggregation instructions:
{$sort: {timestamp: -1}}: sort all documents by the timestamp.
{$group: {_id: "$internal_id", doc: {$first: "$$ROOT"}}}: go through these timestamp-sorted documents and group them by id. Because everything is sorted by timestamp at this point (rather than by id), this ends up being a decent amount of work.
You can see that this is what mongo is actually doing by taking a look at that log line's query plan: planSummary IXSCAN { timestamp: -1, ms_id: 1 }.
You want to force Mongo to come up with a better query plan, one that uses the
{internal_id: 1, timestamp: -1} index. Giving it a hint to use this index might work; it depends on how well it's able to calculate the query plan.
If providing that hint doesn't work, one alternative would be to break this query into 2 parts, each of which uses an appropriate index.
Find the maximum timestamp for each internal_id. db.my_collection.aggregate([{$group: {_id: "$internal_id", timestamp: {$max: "$timestamp"}}}]). This should use the {internal_id: 1, timestamp: -1} index.
Use those results to find the documents that you actually care about: db.my_collection.find({$or: [{internal_id, timestamp}, {other_internal_id, other_timestamp}, ....]}) (if there are duplicate timestamps for the same internal_id you may need to dedupe).
If you wanted to combine these 2 parts into 1, you can use a self-join on the original collection with a $lookup.
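A sketch of that combined form, using the field and collection names from the question (note that on MongoDB 3.6 the $expr match inside $lookup cannot use the {internal_id: 1, timestamp: -1} index very effectively, so the two-step approach may still be faster):
db.locations.aggregate([
    // step 1: latest timestamp per internal_id
    {$group: {_id: "$internal_id", timestamp: {$max: "$timestamp"}}},
    // step 2: self-join back to the collection to fetch the full document
    {$lookup: {
        from: "locations",
        let: {id: "$_id", ts: "$timestamp"},
        pipeline: [
            {$match: {$expr: {$and: [
                {$eq: ["$internal_id", "$$id"]},
                {$eq: ["$timestamp", "$$ts"]}
            ]}}},
            {$limit: 1}
        ],
        as: "doc"
    }},
    {$unwind: "$doc"}
], {allowDiskUse: true})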
So finally I've been able to do all the testing. Here is every version I wrote, thanks to willis' answer, and the results:
Original aggregate query
mongo_query = [
    {"$match": group_filter},
    {"$sort": {"timestamp": -1}},
    {"$group": {"_id": "$internal_id", "doc": {"$first": "$$ROOT"}}},
]
res = mongo.db[self.factory.config.mongo_collection].aggregate(mongo_query)
res = await res.to_list(None)
9.61 seconds
Give MongoDB a hint to use the proper index (filter on internal_id first)
from bson.son import SON
cursor = mongo.db[self.factory.config.mongo_collection].aggregate(mongo_query, hint=SON([("internal_id", 1), ("timestamp", -1)]))
res = await cursor.to_list(None)
Not working: MongoDB replies with an exception, saying the sort consumes too much memory.
Split the aggregation, to first find the latest timestamp for each internal_id
cursor = mongo.db[self.factory.config.mongo_collection].aggregate([{"$group": {"_id": "$internal_id", "timestamp": {"$max": "$timestamp"}}}])
res = await cursor.to_list(None)
or_query = []
for entry in res:
    or_query.append({"internal_id": entry["_id"], "timestamp": entry["timestamp"]})
cursor = mongo.db[self.factory.config.mongo_collection].find({"$or": or_query})
fixed_res = await cursor.to_list(None)
1.88 seconds, a lot better but still not that fast
Parallel coroutines (and the winner is....)
In the meantime, since I already have the list of internal_id values and I'm using asynchronous Python, I went for parallel coroutines, fetching the latest entry for each internal_id separately:
import asyncio
from typing import Awaitable, Dict, List

fixed_res: List[Dict] = []

async def get_one_result(db_filter: Dict) -> None:
    """ Coroutine getting one result for each known internal ID """
    cursor = mongo.db[self.factory.config.mongo_collection].find(db_filter).sort("timestamp", -1).limit(1)
    res = await cursor.to_list(1)
    if res:
        fixed_res.append(res[0])

coros: List[Awaitable] = []
for internal_id in self.list_of_internal_ids:
    coro = get_one_result({"internal_id": internal_id})
    coros.append(coro)
await asyncio.gather(*coros)
0.5s, way better than others
If you don't have a list of internal_id
There's an alternative I did not implement, but I confirmed the call is very fast: use the low-level distinct command against the {internal_id: 1} index to retrieve the list of distinct IDs, then use the parallel calls.
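For completeness, that distinct call would look something like this (a sketch; with an index on {internal_id: 1} the server can typically answer it from the index alone):
// shell helper
db.locations.distinct("internal_id")
// or the low-level command form
db.runCommand({distinct: "locations", key: "internal_id"}).values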

Auto calculating fields in mongodb

Let's say there are documents in MongoDB that look something like this:
{
    "lastDate" : ISODate("2013-01-14T16:38:16.163Z"),
    "items": [
        {"date": ISODate("2013-01-10T16:38:16.163Z")},
        {"date": ISODate("2013-01-11T16:38:16.163Z")},
        {"date": ISODate("2013-01-12T16:38:16.163Z")},
        {"date": ISODate("2013-01-13T16:38:16.163Z")},
        {"date": ISODate("2013-01-14T16:38:16.163Z")}
    ]
}
Or even like this:
{
    "allAre" : false,
    "items": [
        {"is": true},
        {"is": true},
        {"is": true},
        {"is": false},
        {"is": true}
    ]
}
The top-level fields "lastDate" and "allAre" should be recalculated every time the data in the array changes. "lastDate" should be the biggest "date" of all the items. "allAre" should be true only if all the items have "is" set to true.
How should I build my queries to achieve such a behavior with MongoDB?
Is it considered to be a good practice to precalculate some values on insert, instead of calculating them during the get request?
MongoDB cannot do what you are asking for in a single query.
But you can do it as a two-step update.
First of all, push the new value into the array:
db.Test3.findOneAndUpdate(
    {_id: ObjectId("58047d0cd63cf401292fe0ad")},
    {$push: {"items": {"date": ISODate("2013-01-27T16:38:16.163+0000")}}},
    {returnNewDocument: true},
    function (err, result) {
    }
);
then update "lastDate" only if is less then the last Pushed.
db.Test3.findOneAndUpdate(
    {_id: ObjectId("58047d0cd63cf401292fe0ad"), "lastDate": {$lt: ISODate("2013-01-27T16:38:16.163+0000")}},
    {$set: {"lastDate": ISODate("2013-01-27T16:38:16.163+0000")}},
    {returnNewDocument: true},
    function (err, result) {
    }
);
The "lastDate" condition in the filter is needed to avoid a race condition. This way you can be sure that "lastDate" always holds the highest date pushed.
For the second problem you can follow a similar strategy: update to {"allAre": false} only if the document matches {"_id": yourID, "items.is": false}. Basically, set it to false only if some array element has "is" equal to false. If no document matches, nothing is updated.
// push a new child with "is": false
db.Test4.findOneAndUpdate(
    {_id: ObjectId("5804813ed63cf401292fe0b0")},
    {$push: {"items": {"is": false}}},
    {returnNewDocument: true},
    function (err, result) {
    }
);
// update allAre to false if some child is false
db.Test4.findOneAndUpdate(
    {_id: ObjectId("5804813ed63cf401292fe0b0"), "items.is": false},
    {$set: {"allAre": false}},
    {returnNewDocument: true},
    function (err, result) {
    }
);
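On MongoDB 4.2 or newer (which postdates this answer) both fields can also be recomputed in a single update by passing an aggregation pipeline as the update document. A sketch, reusing the example _ids from above:
// recompute lastDate as the maximum date in the array
db.Test3.updateOne(
    {_id: ObjectId("58047d0cd63cf401292fe0ad")},
    [{$set: {lastDate: {$max: "$items.date"}}}]
)
// recompute allAre as "every element has is: true"
db.Test4.updateOne(
    {_id: ObjectId("5804813ed63cf401292fe0b0")},
    [{$set: {allAre: {$allElementsTrue: ["$items.is"]}}}]
)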

Mongo efficient querying log collection by level range

I have a capped collection for storing server logs:
var schema = new mongoose.Schema({
    level: { type: Number, required: true },
    ...
}, { capped: 64 * 1024 * 1024, versionKey: false });
I'm having trouble figuring out how to query logs by level range efficiently. Here's a sample query I want to run:
db.getCollection('logs').find({
    level: { $gte: 2, $lte: 6 }
}).sort({ _id: -1 }).limit(500)
Indexing on { _id: 1, level: 1 } doesn't make any sense, as _id is unique and each document has only a single level, so in the worst case the whole collection will be scanned.
If I index on { level: 1, _id: -1 }, in the worst case Mongo pulls all logs for levels 2, 3, 4, 5, and 6, joins them, and sorts them manually, so performance is horrible. Sometimes it also decides to use the { _id: 1 } index, which is terrible too.
It could just walk these five index ranges (one per level) at once and get the result while checking at most 504 documents. Or it could pull only the first 500 results from each level, so it would sort at most 2500 documents. But it won't; Mongo is just plain stupid when it comes to range queries.
The fastest solution I can think of is implementing the last mentioned method on the client, so running 5 queries and then merging them manually:
db.getCollection('logs').find({ level: 2 }).sort({ _id: -1 }).limit(500)
db.getCollection('logs').find({ level: 3 }).sort({ _id: -1 }).limit(500)
...
Merging can be done in O(n) on the client; there are only 7 log levels, so at most 7 queries will be executed and at most 3500 documents pulled from the database.
Is there a better way?
Since you have only 7 levels, it may be worth considering a { level: 1, _id: -1 } index with an $or query:
db.logs.find({$or: [
    {level: 2},
    {level: 3},
    {level: 4},
    {level: 5},
    {level: 6}
]}).sort({_id: -1}).limit(500)
Since these are equality conditions, the query should make use of the index, but I have never tried it on capped collections.
I would give it a try and run explain() to confirm it works, then probably enable the profiler and run a few other queries.
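A quick way to check this in the shell (a sketch, assuming the collection and index names from the question): create the index, then look at the winning plan to see whether the per-level index scans are merged (a SORT_MERGE stage) instead of triggering an in-memory sort.
// create the suggested index
db.getCollection('logs').createIndex({ level: 1, _id: -1 })
// inspect the plan for the $or query
db.getCollection('logs').find({ $or: [
    { level: 2 }, { level: 3 }, { level: 4 }, { level: 5 }, { level: 6 }
] }).sort({ _id: -1 }).limit(500).explain("executionStats")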