creating covered index for aggregation framework - mongodb

I have a problem creating an index for my query and can't find a similar solution on the web, so maybe some of you can help me.
To simplify the problem, let's say we have Phones with some attributes:
{
"type":"Samsung",
"model":"S3",
"attributes":[{
"value":"100",
"name":"BatteryLife"
},{
"value":"200$",
"name":"Price"
}]
}
With index: {"type":1, "attributes.value":1}
We have millions of phones for every type, and I want to find phones of a given type that have given attributes. My query looks like:
db.Phone.aggregate([
{ "$match" : { "type" : "Samsung"}} ,
{ "$match" : { "attributes" : { "$all" : [
{ "value" : "100", "name" : "BatteryLife" } ,
{ "value" : "200$", "name" : "Price"}
]}
}
}
])
And it works!
The problem is that this query is highly inefficient, because it uses only the first part of my index, "type" (and I have millions of phones of every type), and doesn't use the "attributes.value" part (type + attributes.value is almost unique, so it would reduce complexity significantly).
#Edit
Thanks to Neil Lunn, I know it's because the index is used only in my first $match, so I have to change my query.
#Edit2
I think I found a solution:
db.Phone.aggregate([
{$match: {
$and: [
{type: "Samsung"},
{attributes: {
$all: [
{ "value":"100", "name" : "BatteryLife" },
{ "value":"200$", "name" : "Price" }
]
}}
]}
}])
Together with db.Phone.ensureIndex({type:1, attributes:1}), this seems to work. I think we can close now. Thanks for the tip about $match.

To get the most out of the index, you need a $match early enough in the pipeline that uses all the fields in the index. Also avoid the $and operator: it's unnecessary here, and in the current (2.4) version it can prevent an index from being fully utilized (luckily fixed for the upcoming 2.6).
However, the query is not quite correct, as you need to use $elemMatch to make sure the same array element satisfies both the name and value conditions.
Your query should be:
db.Phone.aggregate([
{$match: { type: "Samsung",
attributes: { $all: [
{$elemMatch: {"value":"100", "name" : "BatteryLife" }},
{$elemMatch: {"value":"200$", "name" : "Price" }}
] }
}
}]);
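To illustrate why $elemMatch matters, here is a minimal sketch in plain JavaScript that imitates the two matching rules. It is not MongoDB code; the helper names matchesDotted and matchesElem are made up for this illustration, and the sample document is the one from the question.

```javascript
// Sample document from the question.
const phone = {
  type: "Samsung",
  attributes: [
    { value: "100", name: "BatteryLife" },
    { value: "200$", name: "Price" },
  ],
};

// Dotted-path style ("attributes.value", "attributes.name"):
// each condition may be satisfied by a different array element.
function matchesDotted(doc, conditions) {
  return Object.entries(conditions).every(([field, wanted]) =>
    doc.attributes.some((attr) => attr[field] === wanted)
  );
}

// $elemMatch style: one single element must satisfy every condition.
function matchesElem(doc, conditions) {
  return doc.attributes.some((attr) =>
    Object.entries(conditions).every(([field, wanted]) => attr[field] === wanted)
  );
}

// The value comes from one element and the name from another.
const mixed = { value: "100", name: "Price" };
console.log(matchesDotted(phone, mixed)); // true
console.log(matchesElem(phone, mixed));   // false
```

The mixed condition is a false positive under the dotted-path rule, which is the case $elemMatch rules out.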
Now, it's not going to be a covered query, since attributes.value and attributes.name are fields of embedded documents, not to mention that name is not in the index.
For best performance the index should be {"type":1, "attributes.value":1, "attributes.name":1}. The query still won't be covered, but the index will be much more selective than it is now.
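A sketch of how the suggested index and query could be tried together. It assumes a newer shell where createIndex (the successor of ensureIndex) and db.collection.explain() are available, which is not the case on 2.4. The variable names indexSpec and pipeline are only for this example, and the shell calls are guarded so the snippet also loads outside the shell.

```javascript
// Index suggested above: type first, then both embedded attribute fields.
const indexSpec = { type: 1, "attributes.value": 1, "attributes.name": 1 };

// Single $match that touches every indexed field.
const pipeline = [
  {
    $match: {
      type: "Samsung",
      attributes: {
        $all: [
          { $elemMatch: { value: "100", name: "BatteryLife" } },
          { $elemMatch: { value: "200$", name: "Price" } },
        ],
      },
    },
  },
];

// Only runs inside a mongo shell session, where `db` is defined.
if (typeof db !== "undefined") {
  db.Phone.createIndex(indexSpec);
  // Print the plan to see which index the $match stage picks.
  printjson(db.Phone.explain().aggregate(pipeline));
}
```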

Related

find() return the latest value only on MongoDB

I have this collection in MongoDB that contains the following entries. I'm using Robo3T to run the query.
{
"_id" : ObjectId("xxx1"),
"Evaluation Date" : "2021-09-09",
"Results" : [
{
"Name" : "ABCD",
"Version" : "3.2.x"
}
]
}
{
"_id" : ObjectId("xxx2"),
"Evaluation Date" : "2022-09-09",
"Results" : [
{
"Name" : "ABxD",
"Version" : "5.2.x"
}
]
}
The collection contains multiple entries in a similar format. Now, I need to extract the latest value for "Version".
Expected output:
5.2.x
Measures I've taken so far:
(1) I tried findOne(), and while I was able to extract the value of "Version" with db.getCollection('TestCollectionName').findOne().Results[0].Version
...only the oldest entry was returned:
3.2.x
(2) Using find().sort().limit() as below returns the entire document for the latest entry, not just the value I wanted: db.getCollection('TestCollectionName').find({}).sort({"Results.Version":-1}).limit(1)
Results below:
"_id" : ObjectId("xxx2"),
"Evaluation Date" : "2022-09-09",
"Results" : [
{
"Name" : "ABxD",
"Version" : "5.2.x"
}
]
(3) I've tried to use sort() and limit() alongside findOne(), but I've read that findOne may be deprecated and is not compatible with sort, so this results in an error.
(4) Finally, if I try to use sort and limit on find like this: db.getCollection('LD_exit_Evaluation_Result_MFC525').find({"Results.New"}).sort({_id:-1}).limit(1) I get an unexpected token error.
What would be a good approach for this?
Did I simply misplace or remove a bracket, or do I need to reorder the syntax?
Thanks in advance.
I'm not sure if I understood correctly, but maybe this is what you are looking for:
db.collection.aggregate([
{
"$project": {
lastResult: {
"$last": "$Results"
},
},
},
{
"$project": {
version: "$lastResult.Version",
_id: 0
}
}
])
It uses aggregate with some operators: the first $project computes a new field called lastResult holding the last element of each Results array, using the $last operator. The second $project just cleans up the output. If you need the _id reference, remove _id: 0 or change its value to 1.
You can check how it works here: https://mongoplayground.net/p/jwqulFtCh6b
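If only the newest document's value is wanted, one possible variation is to sort and limit before projecting. The sketch below is an assumption layered on the answer, not part of it: it assumes the "Evaluation Date" strings are in ISO format (so string order equals date order) and a server where $last works as an array operator (MongoDB 4.4 or later). The plain JavaScript part imitates the same steps on the two sample documents.

```javascript
// The two sample documents from the question (without _id).
const docs = [
  { "Evaluation Date": "2021-09-09", Results: [{ Name: "ABCD", Version: "3.2.x" }] },
  { "Evaluation Date": "2022-09-09", Results: [{ Name: "ABxD", Version: "5.2.x" }] },
];

// Plain JavaScript imitation: sort by date descending, take the first
// document, then read the Version of its last Results element.
const newest = [...docs].sort((a, b) =>
  b["Evaluation Date"].localeCompare(a["Evaluation Date"])
)[0];
const latestVersion = newest.Results[newest.Results.length - 1].Version;
console.log(latestVersion); // 5.2.x

// The same idea as an aggregation pipeline.
const latestPipeline = [
  { $sort: { "Evaluation Date": -1 } },
  { $limit: 1 },
  { $project: { _id: 0, version: { $last: "$Results.Version" } } },
];

// Only runs inside a mongo shell session, where `db` is defined.
if (typeof db !== "undefined") {
  printjson(db.getCollection("TestCollectionName").aggregate(latestPipeline).toArray());
}
```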
Hope I helped

Mongo index for query

I have a collection with millions of records. I am trying to implement an autocomplete on a field called term that I broke down into an array of words called words. My query is very slow because I am missing something with regards to the index. Can someone please help?
I have the following query:
db.vx.find({
semantic: "product",
concept: true,
active: true,
$and: [ { words: { $regex: "^doxycycl.*" } } ]
}).sort({ length: 1 }).limit(100).explain()
The explain output says that no index was used even though I have the following index:
{
"v" : 1,
"key" : {
"words" : 1,
"active" : 1,
"concept" : 1,
"semantic" : 1
},
"name" : "words_1_active_1_concept_1_semantic_1",
"ns" : "mydatabase.vx"
}
You can check whether the compound index is actually used by running explain in the mongo shell:
db.vx.find({YOURQUERY}).explain('executionStats')
and check the stage field of queryPlanner.winningPlan, including any nested inputStage:
COLLSCAN means no index is used and the whole collection is scanned.
IXSCAN means an index is used for this query.
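Since the top-level stage is often something like SORT or FETCH with the scan nested underneath, it can help to walk the whole plan. A minimal sketch, assuming the explain format that nests child plans under inputStage or inputStages; the function name collectStages and the sample plan are made up for this illustration.

```javascript
// Collect every stage name in a winningPlan tree, outermost first.
function collectStages(plan, found = []) {
  if (!plan || typeof plan !== "object") return found;
  if (plan.stage) found.push(plan.stage);
  if (plan.inputStage) collectStages(plan.inputStage, found);
  (plan.inputStages || []).forEach((child) => collectStages(child, found));
  return found;
}

// Hypothetical, hand-written plan fragment (not real explain output).
const sampleWinningPlan = {
  stage: "SORT",
  inputStage: {
    stage: "FETCH",
    inputStage: {
      stage: "IXSCAN",
      indexName: "words_1_active_1_concept_1_semantic_1",
    },
  },
};

const stages = collectStages(sampleWinningPlan);
console.log(stages);                     // [ 'SORT', 'FETCH', 'IXSCAN' ]
console.log(stages.includes("IXSCAN"));  // true
```

In the shell, the argument would be the queryPlanner.winningPlan field of the explain result.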
You can also check whether text search fits your needs, since it is often much faster than the $regex operator.
https://comsysto.com/blog-post/mongodb-full-text-search-vs-regular-expressions

MongoDB querying to with changing values for key

I'm trying to get back into MongoDB and I've come across something that I can't figure out.
I have this data structure:
> db.ratings.find().pretty()
{
"_id" : ObjectId("55881e43424cbb1817137b33"),
"e_id" : ObjectId("5565e106cd7a763b2732ad7c"),
"type" : "like",
"time" : 1434984003156,
"u_id" : ObjectId("55817c072e48b4b60cf366a7")
}
{
"_id" : ObjectId("55893be1e6a796c0198e65d3"),
"e_id" : ObjectId("5565e106cd7a763b2732ad7c"),
"type" : "dislike",
"time" : 1435057121808,
"u_id" : ObjectId("55817c072e48b4b60cf366a7")
}
{
"_id" : ObjectId("55893c21e6a796c0198e65d4"),
"e_id" : ObjectId("5565e106cd7a763b2732ad7c"),
"type" : "null",
"time" : 1435057185089,
"u_id" : ObjectId("55817c072e48b4b60cf366a7")
}
What I want to do is count the documents that have either a like or a dislike, leaving the "null" out of the count, so I should get a count of 2. I tried to go about it like this, setting the query to both values:
db.ratings.find({e_id: ObjectId("5565e106cd7a763b2732ad7c")}, {type: "like", type: "dislike"})
But this just prints out all three documents. Is there any reason?
If it's glaringly obvious, I'm sorry; I'm pulling out my hair at the moment.
Use the db.collection.count() method, which returns the count of documents that would match a find() query:
db.ratings.count({
"e_id": ObjectId("5565e106cd7a763b2732ad7c"),
type: {
"$in": ["like", "dislike"]
}
})
The db.collection.count() method is equivalent to the db.collection.find(query).count() construct. Your query selection criteria above can be interpreted as:
Get me the count of all documents whose e_id field is ObjectId("5565e106cd7a763b2732ad7c") AND whose type field is either "like" or "dislike", as expressed by the $in operator, which selects documents where the value of a field equals any value in the specified array.
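The selection logic can be imitated in plain JavaScript to check the expected result. This is only a sketch: the e_id values are kept as strings instead of ObjectId, and the guarded shell call assumes a newer shell that offers countDocuments as the non-deprecated counterpart of count.

```javascript
const targetId = "5565e106cd7a763b2732ad7c";

// The three sample documents, reduced to the fields the query looks at.
const ratings = [
  { e_id: targetId, type: "like" },
  { e_id: targetId, type: "dislike" },
  { e_id: targetId, type: "null" },
];

// $in: the field value must equal any one of the listed values.
const wantedTypes = ["like", "dislike"];
const matched = ratings.filter(
  (r) => r.e_id === targetId && wantedTypes.includes(r.type)
);
console.log(matched.length); // 2

// Only runs inside a mongo shell session, where `db` and ObjectId exist.
if (typeof db !== "undefined") {
  print(db.ratings.countDocuments({
    e_id: ObjectId(targetId),
    type: { $in: wantedTypes },
  }));
}
```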
db.ratings.find({e_id: ObjectId("5565e106cd7a763b2732ad7c")},
{type: "like", type: "dislike"})
But this just prints out all three documents. Is there any reason? If it's glaringly obvious, I'm sorry; I'm pulling out my hair at the moment.
The second argument here is the projection used by the find method. It specifies fields that should be included, regardless of their value. Normally, you specify a value of 1 or true to include the field. Evidently, MongoDB accepts other values as true.
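There is also a plain JavaScript effect at play in that second argument: an object literal cannot hold the same key twice, so the later value silently replaces the earlier one. A small sketch that can be run in Node or in the shell:

```javascript
// Both entries use the key "type", so only the last one survives.
const projection = { type: "like", type: "dislike" };

console.log(Object.keys(projection)); // [ 'type' ]
console.log(projection.type);         // dislike
```

So the call never carried two conditions in the first place; it passed a single type entry as the projection.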
If you only need to count documents, you should issue a count command:
> db.runCommand({count: 'collection',
query: { "e_id" : ObjectId("5565e106cd7a763b2732ad7c"),
type: { $in: ["like", "dislike"]}}
})
{ "n" : 2, "ok" : 1 }
Please note the Mongo Shell provides the count helper for that:
> db.collection.find({ "e_id" : ObjectId("5565e106cd7a763b2732ad7c"),
type: { $in: ["like", "dislike"]}}).count()
2
That being said, to quote the documentation, using the count command "can result in an inaccurate count if orphaned documents exist or if a chunk migration is in progress." To avoid that, you might prefer using the aggregation framework:
> db.collection.aggregate([
{ $match: { "e_id" : ObjectId("5565e106cd7a763b2732ad7c"),
type: { $in: ["like", "dislike"]}}},
{ $group: { _id: null, n: { $sum: 1 }}}
])
{ "_id" : null, "n" : 2 }
This query should solve your problem
db.ratings.find({$or : [{"type": "like"}, {"type": "dislike"}]}).count()

Mongodb : count array values with mapreduce / aggregation

I have documents with the following structure :
{
"name" : "John",
"items" : [
{"key1" : "value1"},
{"key1" : "value1"}
]
}
I have built a simple function to count the total number of "items":
var count = 0;
db.collection.find({},{items:1}).limit(10000).forEach(
function (doc) {
if(doc.items){
count += doc.items.length;
}
}
)
print(count);
But after ~1 million items, my function breaks, Mongo exits. I've looked at the new aggregation framework as well as mapreduce functions, and I'm not sure which would be the best to use for a simple count like this.
Suggestions welcome! Thanks.
This becomes very easy with the aggregation framework: http://docs.mongodb.org/manual/core/aggregation-pipeline/
db.collection.aggregate([
{ $unwind : "$items" },
{ $group : {_id:null, items_count : {$sum:1} }}
])
To return the count of items for each document, group by _id instead:
{ $group : {_id:"$_id", items_count : {$sum:1} }}
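On servers that have the $size operator (MongoDB 2.6 or later), the $unwind step can probably be skipped by summing array lengths directly. The sketch below is an assumption layered on the answer: the second sample document ("Jane", without items) is made up to show the missing-field case, and $ifNull is used because $size rejects a missing field.

```javascript
// First document from the question, plus a hypothetical one without items.
const docs = [
  { name: "John", items: [{ key1: "value1" }, { key1: "value1" }] },
  { name: "Jane" },
];

// Plain JavaScript imitation of summing the array lengths.
const total = docs.reduce(
  (sum, doc) => sum + (Array.isArray(doc.items) ? doc.items.length : 0),
  0
);
console.log(total); // 2

// The same idea as a single $group stage.
const sizePipeline = [
  {
    $group: {
      _id: null,
      items_count: { $sum: { $size: { $ifNull: ["$items", []] } } },
    },
  },
];

// Only runs inside a mongo shell session, where `db` is defined.
if (typeof db !== "undefined") {
  printjson(db.collection.aggregate(sizePipeline).toArray());
}
```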
You can store the length of doc.items as a field of the document. This method stores redundant data on disk, but it is a fast and easy way to deal with large collections.
{
"name" : "John",
"itemsLength" : 2,
"items" : [
{"key1" : "value1"},
{"key1" : "value1"}
]
}
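One way to keep such a counter in step with the array is to change both in the same update. A minimal sketch, assuming a shell that provides updateOne; the variable names and the new item are hypothetical.

```javascript
// Hypothetical item to append.
const newItem = { key1: "value1" };

// Which document to change, and how: push the item and bump the counter together.
const filter = { name: "John" };
const update = {
  $push: { items: newItem },
  $inc: { itemsLength: 1 },
};

// Only runs inside a mongo shell session, where `db` is defined.
if (typeof db !== "undefined") {
  db.collection.updateOne(filter, update);
}
```

Because both operators are applied in a single update of one document, the stored length should not drift from the array as long as every write goes through updates of this shape.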
Another option may be mapreduce, but I think that without sharding mapreduce would be slow.

group in aggregate framework stopped working properly

I hate this kind of question, but maybe you can point me to the obvious. I'm using Mongo 2.2.2.
I have a collection (in a replica set) with 6M documents, which has a string field called username on which I have an index. The index was non-unique, but recently I made it unique. Suddenly the following query gives me false alarms that I have duplicates.
db.users.aggregate([
{ $group : {_id : "$username", total : { $sum : 1 } } },
{ $match : { total : { $gte : 2 } } },
{ $sort : {total : -1} }
]);
which returns
{
"result" : [
{
"_id" : "davidbeges",
"total" : 2
},
{
"_id" : "jesusantonio",
"total" : 2
},
{
"_id" : "elesitasweet",
"total" : 2
},
{
"_id" : "theschoolofbmx",
"total" : 2
},
{
"_id" : "longflight",
"total" : 2
},
{
"_id" : "thenotoriouscma",
"total" : 2
}
],
"ok" : 1
}
I tested this query on a sample collection with a few documents and it works as expected.
Someone from 10gen responded in their JIRA:
Are there any updates on this collection? If so, I'd try adding {$sort: {username:1}} to the front of the pipeline. That will ensure that you only see each username once if it is unique.
If there are updates going on, it is possible that aggregation would see a document twice if it moves due to growth. Another possibility is that a document was deleted after being seen by the aggregation and a new one was inserted with the same username.
So sorting by username before grouping helped.
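For reference, the pipeline with the suggested leading $sort might look like the sketch below. The variable name dedupPipeline is only for this example, and the guarded call assumes a newer shell where aggregate returns a cursor (on 2.2 it returns a result document instead).

```javascript
const dedupPipeline = [
  // Leading sort on the indexed field, as suggested in the JIRA reply.
  { $sort: { username: 1 } },
  { $group: { _id: "$username", total: { $sum: 1 } } },
  { $match: { total: { $gte: 2 } } },
  { $sort: { total: -1 } },
];

// Only runs inside a mongo shell session, where `db` is defined.
if (typeof db !== "undefined") {
  printjson(db.users.aggregate(dedupPipeline).toArray());
}
```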
I think the answer may lie in the fact that your $group is not using an index; it's just doing a scan over the entire collection. These operators can currently use an index in the aggregation framework:
$match $sort $limit $skip
And they will work if placed before:
$project $unwind $group
However, $group by itself will not use an index. When you do your find() test I am betting you are using the index, possibly as a covered index (you can verify by looking at an explain() for that query), rather than scanning the collection. Basically my theory is that your index has no dupes, but your collection does.
Edit: This likely happens because a document is updated/moved during the aggregation operation and hence is seen twice, not because of dupes in the collection as originally thought.
If you add an operator earlier in the pipeline that can use the index but not alter the results fed into $group, then you can avoid the issue.