MongoDB fast tags query

I have a very large collection (more than 800k documents) and I need to implement a query for auto-complete functionality (matching on word beginnings only) based on tags. My documents look like this:
{
  "_id": "theid",
  "somefield": "some value",
  "tags": [
    {
      "name": "abc tag1",
      "vote": 5
    },
    {
      "name": "hij tag2",
      "vote": 22
    },
    {
      "name": "abc tag3",
      "vote": 5
    },
    {
      "name": "hij tag4",
      "vote": 77
    }
  ]
}
For example, if my query is for all tags that start with "ab" in documents whose "somefield" is "some value", the result would be "abc tag1", "abc tag3" (names only).
I care about the speed of the queries much more than the speed of inserts and updates.
I assume that the aggregation framework is the right way to go here, but what would be the best pipeline and indexes for very fast querying?
The documents are not 'tag' documents; they represent client objects and contain many more data fields that I left out for simplicity. Each client has several tags and another field (I changed its name so it won't be confused with the tags array). I need to get a set, without duplicates, of all tags that a group of clients have.

Your document structure doesn't make sense as posted; I'm assuming tags is an array and not an object. Try queries like this:
db.tags.find({ "somefield" : "some value", "tags.name" : /^abc/ })
with an index on { "somefield" : 1, "tags.name" : 1 }. MongoDB optimizes left-anchored, case-sensitive regex queries into range queries, which can be fulfilled efficiently using an index (see the $regex docs).
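To illustrate why a left-anchored regex can be served by an index, here is a minimal sketch in plain JavaScript of the range that a prefix match corresponds to. The helper name prefixToRange is made up for this example; the real conversion happens inside the query planner. The sketch assumes plain ASCII input and ignores the edge case where the last character cannot be incremented.

```javascript
// Hypothetical helper: the half-open range [prefix, upper) that contains
// exactly the strings starting with `prefix` under code-unit ordering.
function prefixToRange(prefix) {
  if (prefix.length === 0) throw new Error("prefix must not be empty");
  const last = prefix.charCodeAt(prefix.length - 1);
  // Assumption: the last code unit is not 0xFFFF, so it can be incremented.
  const upper = prefix.slice(0, -1) + String.fromCharCode(last + 1);
  return { $gte: prefix, $lt: upper };
}

const range = prefixToRange("ab"); // { $gte: "ab", $lt: "ac" }
const names = ["abc tag1", "hij tag2", "abc tag3", "hij tag4"];
// Same result as filtering with /^ab/, but expressed as a range check.
const matches = names.filter((n) => n >= range.$gte && n < range.$lt);
console.log(matches); // [ 'abc tag1', 'abc tag3' ]
```

An index on "tags.name" stores the names in sorted order, so a range like this can be answered by scanning one contiguous slice of the index.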
You can get just the tags from this document structure using an aggregation pipeline:
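db.tags.aggregate([
  // Select the matching documents; this stage can use the index
  { "$match" : { "somefield" : "some value", "tags.name" : /^abc/ } },
  // Emit one document per tag
  { "$unwind" : "$tags" },
  // Keep only the tags that match the prefix
  { "$match" : { "tags.name" : /^abc/ } },
  { "$project" : { "_id" : 0, "tag_name" : "$tags.name" } }
])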
The index only helps the first $match stage, so the pipeline needs the same index as the find query.
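The question also asks for a set of tag names without duplicates, which the pipeline does not produce by itself. One possible extension, sketched here under assumptions, is to group on the tag name at the end. The function name buildTagPipeline and the escaping of user input are additions for illustration and not part of the original answer.

```javascript
// Hypothetical helper that builds the pipeline for a field value and a prefix.
function buildTagPipeline(somefieldValue, prefix) {
  // Escape regex metacharacters so user input is matched literally.
  const escaped = prefix.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const startsWith = new RegExp("^" + escaped);
  return [
    { $match: { somefield: somefieldValue, "tags.name": startsWith } },
    { $unwind: "$tags" },
    { $match: { "tags.name": startsWith } },
    // Grouping on the name collapses duplicates across documents.
    { $group: { _id: "$tags.name" } },
    { $project: { _id: 0, tag_name: "$_id" } },
  ];
}

const pipeline = buildTagPipeline("some value", "ab");
// In the mongo shell: db.tags.aggregate(pipeline)
```

The $group stage runs on the already filtered tags, so it should only add work proportional to the number of matching tags.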

How to extract grouped results from array in $group stage and return as separate fields?

I'm running an aggregation query, and the $group stage is as follows
$group:
{
  _id:
  {
    year_month: { $dateToString: { "date": "$updated_at", "format": "%Y-%m" } }
    ,client_name: "$clients_docs.client_name"
    ,client_label: "$clients_docs.client_label"
    ,client_code: "$clients_docs.client_code"
    ,client_country: "$clients_docs.client_country"
    ,base_curr: "$clients_docs.client_base_currency"
    ,inv_curr: "$clients_docs.client_invoice_currency"
    ,dest_curr: "$store.destination_currency"
  }
  ,total_vol: { $sum: "$USD_Value" }
  ,total_tran: { $sum: 1 }
}
It returns the correct results, and returns all the grouped results in the _id:{} array.
I now want to extract all those fields from the array and return them not within the array so I can more easily export the output to a spreadsheet.
I tried using this stage:
{
  $project:
  {
    year_month: 1
    ,client_name: 1
    ,client_label: 1
    ,client_code: 1
    ,client_country: 1
    ,base_curr: 1
    ,inv_curr: 1
    ,dest_curr: 1
    ,total_vol: 1
    ,total_tran : 1
  }
},
But that returned the same results as the $group stage:
{
  "_id" : {
    "year_month" : "2022-01",
    "client_name" : "client A",
    "client_label" : "client A",
    "client_code" : NumberInt(0000),
    "client_country" : "TH",
    "base_curr" : "USD",
    "inv_curr" : "USD",
    "dest_curr" : "HKD"
  },
  "total_vol" : 100000,
  "total_tran" : 100.0
}
I want the "year_month" through "dest_curr" fields at the same level as the "total_vol" and "total_tran", so that when the data is exported they all appear as separate columns (now it's all captured as one "_id" column, and a "total_vol" and "total_tran" column). What's the best way to do this?
From a terminology perspective, you currently have an embedded document (or nested fields) rather than an array.
The straightforward way to do this is to simply enumerate each field in the $project stage, e.g.:
"year_month": "$_id.year_month",
There are fancier ways to do this, but as you only have a handful of fields this should suffice.
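Spelled out for all of the grouped fields, such a stage could look like the sketch below. The field names come from the $group stage in the question; the variable name flattenStage is only for illustration.

```javascript
// A $project stage that lifts every grouped field out of the embedded _id.
const flattenStage = {
  $project: {
    _id: 0, // drop the embedded _id document from the output
    year_month: "$_id.year_month",
    client_name: "$_id.client_name",
    client_label: "$_id.client_label",
    client_code: "$_id.client_code",
    client_country: "$_id.client_country",
    base_curr: "$_id.base_curr",
    inv_curr: "$_id.inv_curr",
    dest_curr: "$_id.dest_curr",
    total_vol: 1,
    total_tran: 1,
  },
};
// Placed directly after the $group stage in the pipeline.
```

The original attempt with `year_month: 1` returned nothing new because, after $group, those fields only exist under `_id`.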
Edit
An alternative ("fancier") approach is to leverage the $replaceWith stage using the $mergeObjects operator inside of it. Then you can $unset the previous _id field afterwards. It would look like this:
db.collection.aggregate([
  {
    "$replaceWith": {
      "$mergeObjects": [
        "$$ROOT",
        "$_id"
      ]
    }
  },
  {
    $unset: "_id"
  }
])
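To see what shape these two stages produce, the following plain JavaScript sketch mimics them on a shortened version of the sample output from the question. It only illustrates the resulting document and is not how the server executes the pipeline; the function name flattenId is hypothetical. Note that $replaceWith and the $unset stage require MongoDB 4.2 or newer.

```javascript
// Mimics { $replaceWith: { $mergeObjects: ["$$ROOT", "$_id"] } } followed by { $unset: "_id" }.
function flattenId(doc) {
  // Merge the fields of the embedded _id document into the root document.
  const merged = { ...doc, ...doc._id };
  // Remove the now redundant embedded _id.
  delete merged._id;
  return merged;
}

const grouped = {
  _id: { year_month: "2022-01", client_name: "client A", dest_curr: "HKD" },
  total_vol: 100000,
  total_tran: 100,
};
const flat = flattenId(grouped);
console.log(flat); // all fields now sit at the top level
```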

Mongo pull multiple elements inside an array of object

I'm trying to pull one or multiple objects from an array and I noticed something odd.
Let's take the following document.
{
  "_id" : UUID("f7e80c8e-6b4a-4741-95a3-2567cccf9e5f"),
  "createdAt" : ISODate("2021-07-19T17:07:28.499Z"),
  "description" : null,
  "externalLinks" : [
    {
      "id" : "ZV8xMjM0NQ==",
      "type" : "event"
    },
    {
      "id" : "cF8xMjM0NQ==",
      "type" : "planning"
    }
  ],
  "updatedAt" : ISODate("2021-07-19T17:07:28.499Z")
}
I wrote a basic query to pull one element of externalLinks which looks like
db.getCollection('Collection').update(
  {
    _id: {
      $in: [UUID("f7e80c8e-6b4a-4741-95a3-2567cccf9e5f")]
    }
  }, {
    $pull: {
      externalLinks: {
        "type": "planning",
        "id": "cF8xMjM0NQ=="
      }
    }
  })
And it works fine. But it gets trickier when I want to pull multiple elements from externalLinks, for which I'm using the $in operator.
The strange behaviour is here:
db.getCollection('Collection').update(
  {
    _id: {
      $in: [UUID("f7e80c8e-6b4a-4741-95a3-2567cccf9e5f")]
    }
  }, {
    $pull: {
      externalLinks: {
        $in: [{
          "type": "planning",
          "id": "cF8xMjM0NQ=="
        }]
      }
    }
  })
This query doesn't work. The workaround is to swap the two fields so that they appear in the same order as in the stored externalLinks elements, like this:
$in: [{
  "id": "cF8xMjM0NQ==",
  "type": "planning"
}]
I tried multiple things like $elemMatch and $positioning, but it should be possible to pull multiple externalLinks.
I also tried the $and operator without success.
I could easily iterate over the externalLinks and update them one by one, but that would be too easy, and it bothers me to settle for that solution.
Any help would be appreciated, thank you!
Document fields are ordered, and MongoDB compares documents based on the order of their fields, so which field you put first matters.
Starting in MongoDB 4.2 we can also do pipeline updates. They can sometimes be longer, but they are much more powerful and feel more like programming (less declarative pattern matching).
This doesn't mean that you need a pipeline update in your case, but consider this approach as well.
Query
pipeline update
filter the array and keep only the members for which the condition does not hold
db.collection.update(
  {_id: {$in: ["f7e80c8e-6b4a-4741-95a3-2567cccf9e5f"]}},
  [{"$set":
    {"externalLinks":
      {"$filter":
        {"input": "$externalLinks",
         "cond":
          {"$not":
            [{"$and":
              [{"$eq": ["$$this.id", "ZV8xMjM0NQ=="]},
               {"$eq": ["$$this.type", "event"]}]}]}}}}}])
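The same idea extends to removing several links at once. The sketch below builds such a pipeline from a list of { id, type } pairs; because each field is compared individually with $eq, the order of the fields no longer matters. The helper name buildRemoveLinksPipeline is an assumption made for this example.

```javascript
// Hypothetical helper: returns an update pipeline that drops the given links.
function buildRemoveLinksPipeline(links) {
  // One $and clause per link that should be removed.
  const isTarget = links.map((link) => ({
    $and: [
      { $eq: ["$$this.id", link.id] },
      { $eq: ["$$this.type", link.type] },
    ],
  }));
  return [
    {
      $set: {
        externalLinks: {
          $filter: {
            input: "$externalLinks",
            // Keep an element only if it matches none of the targets.
            cond: { $not: [{ $or: isTarget }] },
          },
        },
      },
    },
  ];
}

const removePipeline = buildRemoveLinksPipeline([
  { type: "planning", id: "cF8xMjM0NQ==" },
  { type: "event", id: "ZV8xMjM0NQ==" },
]);
// In the shell (MongoDB 4.2+): db.collection.update(filter, removePipeline)
```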

Counting entries of subdocument in MongoDB documents

I have a document structure like so
{
"_id" : "3:/content/somepath/test.txt",
"_revisions" : {
"r152f47f1daf-0-2" : "c",
"r152f48413c1-0-2" : "c",
"r152f4851bf7-0-1" : "c"
}
}
My task is to find all documents that meet the following conditions:
The "_id" needs to start with "5:"
The number of revisions needs to be strictly greater than 3
The first part is easy; I have solved it with
db.nodes.find( {'_id': /^5:/} )
But I am struggling with the second part, where I am supposed to use $gt.
Since I am new to MongoDB, I first looked at $size, but _revisions is not an array, it is a subdocument, right?
I also looked at $unwind and then counting the results, but that does not make sense either, since my result needs to be the documents that match the above two conditions.
Any pointers are highly appreciated.
You can use the $where operator:
db.nodes.find(function() {
  return (/^5:/.test(this._id) && Object.keys(this._revisions).length > 3 );
})
The problem with this, as mentioned in the documentation, is that:
$where evaluates JavaScript and cannot take advantage of indexes. Therefore, query performance improves when you express your query using the standard MongoDB operators (e.g., $gt, $in).
You should definitely consider changing the _revisions field to an array of sub-documents like this:
{
  "_id" : "3:/content/somepath/test.txt",
  "_revisions" : [
    {
      "rev": "r152f47f1daf-0-2",
      "value": "c"
    },
    {
      "rev": "r152f48413c1-0-2",
      "value": "c"
    },
    {
      "rev": "r152f4851bf7-0-1",
      "value": "c"
    }
  ]
}
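Then use the $exists operator to check that the element at index 3 (the fourth element) exists, which means the array has more than 3 entries:
db.nodes.find({ "_id": /^5:/, "_revisions.3": { "$exists": true } } )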
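If changing the schema is not an option, one alternative that avoids JavaScript evaluation is to count the keys with aggregation operators inside $expr, which also lets you use $gt as the question requires. This is a sketch that assumes MongoDB 3.6 or newer (for $expr and $objectToArray); the variable names are only for illustration.

```javascript
// Filter for documents whose _id starts with "5:" and that have more than 3 revisions.
const filter = {
  _id: /^5:/,
  $expr: {
    $gt: [
      // Turn the sub-document into an array of key/value pairs and count them.
      // $ifNull guards against documents that have no _revisions field.
      { $size: { $objectToArray: { $ifNull: ["$_revisions", {}] } } },
      3,
    ],
  },
};
// In the mongo shell: db.nodes.find(filter)

// Plain JavaScript equivalent of the size check, for illustration only.
const sample = { _id: "5:/a", _revisions: { r1: "c", r2: "c", r3: "c", r4: "c" } };
const hasMoreThanThree = Object.keys(sample._revisions).length > 3;
console.log(hasMoreThanThree); // true
```

The size comparison itself still has to be evaluated per candidate document, so the prefix condition on _id is what narrows the scan.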

Struggling to get ordered results from the last retrieved article, given array of elements to search in

I have a collection of objects with a structure like this:
{
  "_id" : ObjectId("5233a700bc7b9f31580a9de0"),
  "id" : "3df7ce4cc2586c37607a8266093617da",
  "published_at" : ISODate("2013-09-13T23:59:59Z"),
  ...
  "topic_id" : [
    284,
    9741
  ],
  ...
  "date" : NumberLong("1379116800055")
}
I'm trying to use the following query:
db.collection.find({"topic_id": { $in: [ 9723, 9953, 9558, 9982, 9833, 301, ... 9356, 9990, 9497, 9724] }, "date": { $gte: 1378944001000, $lte: 1378954799000 }, "_id": { $gt: ObjectId('523104ddbc7b9f023700193c') }}).sort({ "_id": 1 }).limit(1000)
The above query uses the topic_id, date index, but then it does not keep the order of the returned results.
Forcing it to use hint({_id:1}) makes the results ordered, but nscanned is 1 million documents even though limit(1000) is specified.
What am I missing?

Nested documents and _id indexes in mongodb

I have a collection with nested documents in it. Each document also has an _id field.
Here's an example of a document's structure:
{
  "_id": ObjectId("top_level_doc"),
  "title": "Cadernos",
  "parent": "4fd55bbc5d1709793b000008",
  "criterias": {
    "0": {
      "_id": ObjectId("a_nested_doc"),
      "value": "caderno",
      "operator": "contains",
      "field": "design0"
    }
  }
}
I want to be able to find the nested document just by searching for its _id.
With this query
{
  "criterias._id" : ObjectId("a_nested_doc")
}
it returns the parent document (I just want the one that's nested).
Ideally I would do this
{
  "_id" : ObjectId("a_nested_doc")
}
and it would return the document with that id (whether it's nested or not).
P.S. I edited the "_id" values for the sake of simplicity just for this example.
You may have to live with selecting on criterias._id (short of writing a wrapper around the query, at least), but you can return just the nested part by retrieving a subset of the fields.
http://www.mongodb.org/display/DOCS/Retrieving+a+Subset+of+Fields
// The simplest case converted to your use case (note that the dotted key must be quoted)
db.collection.find( { "criterias._id" : ObjectId("a_nested_doc") }, { criterias : 1 } );
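If criterias is stored as an array of sub-documents (an assumption; the sample in the question shows it keyed by "0"), the positional $ projection operator can limit the result to the first matching element instead of the whole criterias field. The id value below is a placeholder, and the variable names are only for illustration.

```javascript
// Placeholder for the nested document's id; in the shell this would be an ObjectId.
const nestedId = "a_nested_doc";

// Match parents that contain a criteria with this _id.
const query = { "criterias._id": nestedId };

// Return only the first array element that satisfied the query condition.
// The parent's own _id is still returned unless it is excluded explicitly.
const projection = { "criterias.$": 1 };

// In the mongo shell: db.collection.find(query, projection)
```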