I have the following document:
user {
  _id: objId(..),
  name: "John Doe",
  transactions: [
    {
      _id: 1,
      amount: 10.00,
      item_id: 123,
      condition: "SUCCESS"
    },
    {
      _id: 2,
      amount: 5.00,
      item_id: 124,
      condition: "FAILED"
    }
    ..
  ]
  ..
}
I tried placing a partial index for failed transactions using:
db.user.createIndex(
{ "transactions.condition": 1 },
{ partialFilterExpression: {"transactions.condition": "FAILED"} }
)
But whenever I do a query or $match through the aggregate pipeline with the following:
{$match: {"transactions": {$elemMatch: {"condition": "FAILED"}}}}
I always get a full document scan (COLLSCAN) with explain(). I am guessing the filter needs to strictly follow the expression transactions.condition: "FAILED", but I thought {"transactions": {$elemMatch: {"condition": "FAILED"}}} was identical to transactions.condition: "FAILED" when there is only one expression. What am I missing here?
Yes, try this instead:
db.user.explain().aggregate({$match: {"transactions.condition": "FAILED"}})
The reason is that, to use a partial index, the query predicate must contain the partialFilterExpression (or a recognizable subset of it) in a form the planner can match. The $elemMatch form is not recognized as equivalent to the plain "transactions.condition": "FAILED" predicate, so the partial index is not eligible and MongoDB falls back to a COLLSCAN.
You'll get an even faster query if you can reduce it to:
db.user.explain().aggregate([
  { $match: { "transactions.condition": "FAILED" } },
  { $project: { _id: 0, "transactions.condition": 1 } }
])
but that may not be the case here.
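If you actually need the $elemMatch form of the query, an alternative (heavier, since it also indexes the SUCCESS entries) is a regular, non-partial multikey index on the field; a minimal sketch:

db.user.createIndex({ "transactions.condition": 1 })

A plain index like this is eligible for both the dot-notation predicate and the $elemMatch predicate, so explain() should show an IXSCAN for either form.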
Related
Considering the following aggregation pipeline code to return the newest entry for all distinct "internal_id":
db.locations.aggregate({$sort: {timestamp: -1}}, {$group: {_id: "$internal_id", doc: {$first: "$$ROOT"}}})
This call takes up to 10 seconds, which is not acceptable. The collection is not so huge:
db.locations.count()
1513671
So I guess there's something wrong with the indexes; however, I tried creating many indexes and none of them made an improvement. Currently I have kept the two that were supposed to be enough, imho: {timestamp: -1, internal_id: 1} and {internal_id: 1, timestamp: -1}.
MongoDB is NOT sharded; it is a 3-host replica set running version 3.6.14.
The MongoDB log shows the following:
2020-05-30T12:21:18.598+0200 I COMMAND [conn12652918] command mydb.locations appName: "MongoDB Shell" command: aggregate { aggregate: "locations", pipeline: [ { $sort: { timestamp: -1.0 } }, { $group: { _id: "$internal_id", doc: { $first: "$$ROOT" } } } ], cursor: {}, lsid: { id: UUID("70fea740-9665-4068-a2b5-b7b0f10dcde9") }, $clusterTime: { clusterTime: Timestamp(1590834060, 34), signature: { hash: BinData(0, 9DFB6DBCEE52CFA3A5832DC209519A8E9D6F1204), keyId: 6783976096153993217 } }, $db: "mydb" } planSummary: IXSCAN { timestamp: -1, ms_id: 1 } cursorid:8337712045451536023 keysExamined:1513708 docsExamined:1513708 numYields:11838 nreturned:101 reslen:36699 locks:{ Global: { acquireCount: { r: 24560 } }, Database: { acquireCount: { r: 12280 } }, Collection: { acquireCount: { r: 12280 } } } protocol:op_msg 7677ms
Mongo aggregations are theoretically descriptive (in that you describe what you want to have happen, and the query optimizer figures out an efficient way of doing that calculation), but in practice many aggregations end up being procedural & not optimized. If you take a look at the procedural aggregation instructions:
{$sort: {timestamp: -1}}: sort all documents by the timestamp.
{$group: {_id: "$internal_id", doc: {$first: "$$ROOT"}}}: go through these timestamp-sorted documents and then group them by the id. Because everything is sorted by timestamp at this point (rather than id), it'll end up being a decent amount of work.
You can see that this is what mongo is actually doing by taking a look at that log line's query plan: planSummary IXSCAN { timestamp: -1, ms_id: 1 }.
You want to force mongo to come up with a better query plan than that, one that uses the {internal_id: 1, timestamp: -1} index. Giving it a hint to use this index might work -- it depends on how well it's able to calculate the query plan.
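In the shell, passing that hint to the aggregation would look roughly like this (hint support for aggregate needs MongoDB 3.6+, and it assumes the {internal_id: 1, timestamp: -1} index already exists):

db.locations.aggregate(
  [
    { $sort: { timestamp: -1 } },
    { $group: { _id: "$internal_id", doc: { $first: "$$ROOT" } } }
  ],
  { hint: { internal_id: 1, timestamp: -1 } }
)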
If providing that hint doesn't work, one alternative would be to break this query into 2 parts that each use an appropriate index.
Find the maximum timestamp for each internal_id. db.my_collection.aggregate([{$group: {_id: "$internal_id", timestamp: {$max: "$timestamp"}}}]). This should use the {internal_id: 1, timestamp: -1} index.
Use those results to find the documents that you actually care about: db.my_collection.find({$or: [{internal_id, timestamp}, {other_internal_id, other_timestamp}, ....]}) (if there are duplicate timestamps for the same internal_id you may need to dedupe).
If you wanted to combine these 2 parts into 1, you can use a self-join on the original collection with a $lookup.
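A rough sketch of that combined form, grouping for the max timestamp and then $lookup-ing back into the same collection ($lookup with let/pipeline needs MongoDB 3.6+; whether the inner $match can use an index depends on the server version, so it is worth checking with explain()):

db.locations.aggregate([
  { $group: { _id: "$internal_id", timestamp: { $max: "$timestamp" } } },
  { $lookup: {
      from: "locations",
      let: { iid: "$_id", ts: "$timestamp" },
      pipeline: [
        { $match: { $expr: { $and: [
            { $eq: ["$internal_id", "$$iid"] },
            { $eq: ["$timestamp", "$$ts"] }
        ] } } },
        { $limit: 1 }
      ],
      as: "doc"
  } },
  { $unwind: "$doc" }
])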
So finally I've been able to do all the testing. Here are all the versions I wrote, thanks to willis' answer, and the results:
Original aggregate query
mongo_query = [
{"$match": group_filter},
{"$sort": {"timestamp": -1}},
{"$group": {"_id": "$internal_id", "doc": {"$first": "$$ROOT"}}},
]
res = mongo.db[self.factory.config.mongo_collection].aggregate(mongo_query)
res = await res.to_list(None)
9.61 seconds
Give MongoDB a hint to use the proper index (filter on internal_id first)
from bson.son import SON
cursor = mongo.db[self.factory.config.mongo_collection].aggregate(mongo_query, hint=SON([("internal_id", 1), ("timestamp", -1)]))
res = await cursor.to_list(None)
Not working: MongoDB replies with an exception saying the sort consumes too much memory.
Split the aggregation, to first find the latest timestamp for each internal_id
cursor = mongo.db[self.factory.config.mongo_collection].aggregate([{"$group": {"_id": "$internal_id", "timestamp": {"$max": "$timestamp"}}}])
res = await cursor.to_list(None)
or_query = []
for entry in res:
    or_query.append({"internal_id": entry["_id"], "timestamp": entry["timestamp"]})
cursor = mongo.db[self.factory.config.mongo_collection].find({"$or": or_query})
fixed_res = await cursor.to_list(None)
1.88 seconds, a lot better but still not that fast
Parallel coroutines (and the winner is....)
In the meanwhile, as I already have the list of internal_id values, and I'm using asynchronous Python, I went for parallel coroutines, getting the latest entry for a single internal_id at a time:
# needs: import asyncio; from typing import Awaitable, Dict, List
fixed_res: List[Dict] = []

async def get_one_result(db_filter: Dict) -> None:
    """ Coroutine getting one result for each known internal ID """
    cursor = mongo.db[self.factory.config.mongo_collection].find(db_filter).sort("timestamp", -1).limit(1)
    res = await cursor.to_list(1)
    if res:
        fixed_res.append(res[0])

coros: List[Awaitable] = []
for internal_id in self.list_of_internal_ids:
    coro = get_one_result({"internal_id": internal_id})
    coros.append(coro)

await asyncio.gather(*coros)
0.5s, way better than others
If you don't have a list of internal_id
There's an alternative I did not implement, but I confirmed the call is very fast: use the low-level distinct command against the {internal_id: 1} index to retrieve the list of distinct internal_id values, then use parallel calls.
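For reference, a minimal shell sketch of that approach (distinct should be answered from the {internal_id: 1} index, and each per-id query from the {internal_id: 1, timestamp: -1} index):

// list the distinct internal_id values
db.locations.distinct("internal_id")

// then, for each id (issued in parallel from the application):
db.locations.find({ internal_id: "<some_id>" }).sort({ timestamp: -1 }).limit(1)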
I'm using MongoDB version 4.2.0. I have a collection with the following indexes:
{uuid: 1},
{unique: true, name: "uuid_idx"}
and
{field1: 1, field2: 1, _id: 1},
{unique: true, name: "compound_idx"}
When executing this query
aggregate([
{"$match": {"uuid": <uuid_value>}}
])
the planner correctly selects uuid_idx.
When adding this sort clause
aggregate([
{"$match": {"uuid": <uuid_value>}},
{"$sort": {"field1": 1, "field2": 1, "_id": 1}}
])
the planner selects compound_idx, which makes the query slower.
I would expect the sort clause to not make a difference in this context. Why does Mongo not use the uuid_idx index in both cases?
EDIT:
A little clarification, I understand there are workarounds to use the correct index, but I'm looking for an explanation of why this does not happen automatically (if possible with links to the official documentation). Thanks!
Why is this happening?:
Let's understand how Mongo chooses which index to use, as explained here.
If a query can be satisfied by multiple indexes (satisfied is used loosely, as Mongo actually chooses all possibly relevant indexes) defined in the collection, MongoDB will trial all the applicable plans in parallel. The first plan to return 101 results is selected by the query planner.
Meaning that for that particular query, that index actually wins the trial.
What can we do?:
We can use hint: it basically forces Mongo to use a specific index. However, this is not recommended, because if your data or queries change, Mongo will not adapt its plan choice to those changes.
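You can see this trial for yourself by asking explain for all candidate plans; something along these lines (the collection name is a placeholder, and the verbosity option assumes a recent server such as the 4.2 from the question):

db.collection.explain("allPlansExecution").aggregate([
  { $match: { uuid: <uuid_value> } },
  { $sort: { field1: 1, field2: 1, _id: 1 } }
])

The output then shows how far the uuid_idx plan got during the trial compared with the winning compound_idx plan.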
The query:
aggregate(
  [
    { $match: { uuid: "some_value" } },
    { $sort: { field1: 1, field2: 1, _id: 1 } }
  ]
)
doesn't use the index "uuid_idx".
There are a couple of options you can work with for using indexes on both the match and sort operations:
(1) Define a new compound index: { uuid: 1, field1: 1, field2: 1, _id: 1 }
Both the match and match+sort queries will use this index (for both the match and sort operations).
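A sketch of creating it (the index name here is made up):

db.collection.createIndex(
  { uuid: 1, field1: 1, field2: 1, _id: 1 },
  { name: "uuid_match_sort_idx" }
)

The equality on uuid forms the index prefix and the remaining keys match the sort, so the sort can be answered from the index rather than in memory.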
(2) Use the hint on the uuid index (using existing indexes)
Both the match and match+sort queries will use this index for the match; the sort is then performed in memory on the (presumably few) matching documents.
aggregate(
  [
    { $match: { uuid: "some_value" } },
    { $sort: { field1: 1, field2: 1, _id: 1 } }
  ],
  { hint: "uuid_idx" }
)
If you can use find instead of aggregate, it will use the right index. So this is still a problem in the aggregation pipeline.
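For reference, the find form of the same query would be something like this (the collection name is a placeholder):

db.collection.find({ uuid: "some_value" }).sort({ field1: 1, field2: 1, _id: 1 })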
I am new to Mongo. Posting this question because I am not sure how to search for this on Google.
I have book documents like the one below
{
  bookId: 1,
  title: 'some title',
  publicationDate: DD-MM-YYYY,
  editions: [{
    editionId: 1
  }, {
    editionId: 2
  }]
}
and another one like this
{
  bookId: 2,
  title: 'some title 2',
  publicationDate: DD-MM-YYYY,
  editions: [{
    editionId: 1
  }, {
    editionId: 1
  }]
}
I want to write a query db.books.find({}) which would return only those books where editions.editionId has been duplicated for a book.
So in this example, for bookId: 2 there are two editions with editionId: 1.
Any suggestions?
You can use the aggregation framework; specifically, you can use the $group operator to group the records together by book and edition id and count how many times they occur: if the count is greater than 1, then you've found a duplication.
Here is an example:
db.books.aggregate([
{$unwind: "$editions"},
{$group: {"_id": {"_id": "$_id", "editionId": "$editions.editionId"}, "count": {$sum: 1}}},
{$match: {"count" : {"$gt": 1}}}
])
Note that this does not return the entire book records, but it does return their identifiers; you can then use these in a subsequent query to fetch the entire records, or do some de-duplication for example.
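For instance, a rough shell sketch of that follow-up fetch (dupIds is just an illustrative variable name):

// collect the _id values of books that have a duplicated editionId,
// then fetch the complete book documents in a second query
var dupIds = db.books.aggregate([
  {$unwind: "$editions"},
  {$group: {"_id": {"_id": "$_id", "editionId": "$editions.editionId"}, "count": {$sum: 1}}},
  {$match: {"count": {"$gt": 1}}}
]).toArray().map(function (d) { return d._id._id; });

db.books.find({_id: {$in: dupIds}})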
Is it possible to rename the fields returned in a find query? I would like to use something like $rename; however, I wouldn't like to change the documents I'm accessing. I want just to retrieve them differently, something that works like SELECT COORDINATES AS COORDS in SQL.
What I do now:
db.tweets.findOne({}, {'level1.level2.coordinates': 1, _id:0})
{'level1': {'level2': {'coordinates': [10, 20]}}}
What I would like to be returned is:
{'coords': [10, 20]}
So basically using .aggregate() instead of .find():
db.tweets.aggregate([
{ "$project": {
"_id": 0,
"coords": "$level1.level2.coordinates"
}}
])
And that gives you the result that you want.
In MongoDB 2.6 and above, .aggregate() returns a "cursor" just like .find() does.
See $project and other aggregation framework operators for more details.
For most cases you should simply rename the fields when processing the cursor returned from .find(). In JavaScript, as an example, you can use .map() to do this.
From the shell:
db.tweets.find({},{'level1.level2.coordinates': 1, _id:0}).map( doc => {
doc.coords = doc['level1']['level2'].coordinates;
delete doc['level1'];
return doc;
})
Or more inline:
db.tweets.find({},{'level1.level2.coordinates': 1, _id:0}).map( doc =>
({ coords: doc['level1']['level2'].coordinates })
)
This avoids any additional overhead on the server, and should be used in cases where that additional server-side processing would outweigh the gain from the actual reduction in size of the data retrieved. In this case (and most), the gain would be minimal, and it is therefore better to re-process the cursor result to restructure it.
As mentioned by @Neil Lunn, this can be achieved with an aggregation pipeline.
And starting Mongo 4.2, the $replaceWith aggregation operator can be used to replace a document by a sub-document:
// { level1: { level2: { coordinates: [10, 20] }, b: 4 }, a: 3 }
db.collection.aggregate(
{ $replaceWith: { coords: "$level1.level2.coordinates" } }
)
// { "coords" : [ 10, 20 ] }
Since you mention findOne, you can also limit the number of resulting documents to 1 as such:
db.collection.aggregate([
{ $replaceWith: { coords: "$level1.level2.coordinates" } },
{ $limit: 1 }
])
Prior to Mongo 4.2 and starting Mongo 3.4, $replaceRoot can be used in place of $replaceWith:
db.collection.aggregate(
{ $replaceRoot: { newRoot: { coords: "$level1.level2.coordinates" } } }
)
As we know, in general the $project stage takes field names with a value of 1 or 0 (true or false) to include or exclude them from the output. We can also specify a field path expression as the value instead of true or false, which effectively renames the field. Below is the syntax:
db.test_collection.aggregate([
{$group: {
_id: '$field_to_group',
totalCount: {$sum: 1}
}},
{$project: {
_id: false,
renamed_field: '$_id', // here assigning a value instead of 0 or 1 / true or false effectively renames the field.
totalCount: true
}}
])
Stages (>= 4.2)
$addFields : {"New": "$Old"}
$unset : "Old"
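Put together as a pipeline, the rename would look something like this (Old and New are just illustrative field names):

db.test_collection.aggregate([
  {$addFields: {"New": "$Old"}},   // copy the value under the new name
  {$unset: "Old"}                  // then drop the old field (4.2+)
])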
I have the following data in MongoDB (simplified for what is necessary to my question).
{
_id: 0,
actions: [
{
type: "insert",
data: "abc, quite possibly very very large"
}
]
}
{
_id: 1,
actions: [
{
type: "update",
data: "def"
},{
type: "delete",
data: "ghi"
}
]
}
What I would like is to find the first action type for each document, e.g.
{_id:0, first_action_type:"insert"}
{_id:1, first_action_type:"update"}
(It's fine if the data structured differently, but I need those values present, somehow.)
EDIT: I've tried db.collection.find({}, {'actions.action_type':1}), but obviously that returns all elements of the actions array.
NoSQL is quite new to me. Before, I would have stored all this in two tables in a relational database and done something like SELECT id, (SELECT type FROM action WHERE document_id = d.id ORDER BY seq LIMIT 1) action_type FROM document d.
You can use the $slice operator in projection. (But for what you are doing, I am not sure that the order of the array remains the same when you update it; just something to keep in mind.)
db.collection.find({},{'actions':{$slice:1},'actions.type':1})
You can also use the Aggregation Pipeline introduced in version 2.2:
db.collection.aggregate([
{ $unwind: '$actions' },
{ $group: { _id: "$_id", first_action_type: { $first: "$actions.type" } } }
])
Using the $arrayElemAt operator is actually the most elegant way, although the syntax may be unintuitive:
db.collection.aggregate([
  { $project: { first_action_type: { $arrayElemAt: ["$actions.type", 0] } } }
])