I've written a $redact operation to filter my documents:
db.test.aggregate([
{ $redact: {
$cond: {
if: { "$ifNull" : ["$_acl.READ", false] },
then: { $cond: {
if: { $anyElementTrue: {
$map: {
input: "$_acl.READ",
as: "myfield",
in: { $setIsSubset: [ "$$myfield", ["user1"] ] }
}
}},
then: "$$DESCEND",
else: "$$PRUNE"
}},
else: "$$DESCEND",
}
}}
])
This will remove all (sub)documents where _acl.READ doesn't contain user1, but it will keep all (sub)documents where _acl.READ is not set.
After the aggregation I can't tell whether some information was removed or whether it simply wasn't part of the document.
I'd like to remove the sensitive information but keep some hint that access was denied, e.g.:
{
id: ...,
subDoc1: {
foo: "bar",
_acl: {
READ: [ ["user1"] ]
}
},
subDoc2: {
_error: "ACCESS DENIED"
}
}
I just can't figure out how to modify the document while using $redact.
Thank you!
The $redact pipeline stage is quite unique in the aggregation framework as it is not only capable of recursively descending into the nested structure of a document but also in that it can traverse across all of the keys at any level. It does however still require a concept of "depth" in that a key must either contain a sub-document object or an array which itself is composed of sub-documents.
But what it cannot do is "replace" or "swap out" content. The only actions allowed here are fairly fixed, or more specifically, from the documentation:
The argument can be any valid expression as long as it resolves to $$DESCEND, $$PRUNE, or $$KEEP system variables. For more information on expressions, see Expressions.
The possibly misleading statement there is "The argument can be any valid expression", which is in fact true, but it must however return exactly the same content as what would be resolved to be present in one of those system variables anyhow.
So in order to return some sort of "Access Denied" response in place of the "redacted" content, you have to process the document differently. You also need to consider the limitations of other operators, which simply do not work in a "recursive" manner or one that requires "traversal", as mentioned earlier.
Keeping with the example from the documentation:
{
"_id": 1,
"title": "123 Department Report",
"tags": [ "G", "STLW" ],
"year": 2014,
"subsections": [
{
"subtitle": "Section 1: Overview",
"tags": [ "SI", "G" ],
"content": "Section 1: This is the content of section 1."
},
{
"subtitle": "Section 2: Analysis",
"tags": [ "STLW" ],
"content": "Section 2: This is the content of section 2."
},
{
"subtitle": "Section 3: Budgeting",
"tags": [ "TK" ],
"content": {
"text": "Section 3: This is the content of section3.",
"tags": [ "HCS" ]
}
}
]
}
If we want to process this so that content matching the "role tags" [ "STLW", "G" ] is kept and everything else is "replaced", you would do something like this instead:
var userAccess = [ "STLW", "G" ];
db.sample.aggregate([
{ "$project": {
"title": 1,
"tags": 1,
"year": 1,
"subsections": { "$map": {
"input": "$subsections",
"as": "el",
"in": { "$cond": [
{ "$gt": [
{ "$size": { "$setIntersection": [ "$$el.tags", userAccess ] }},
0
]},
"$$el",
{
"subtitle": "$$el.subtitle",
"label": { "$literal": "Access Denied" }
}
]}
}}
}}
])
That's going to produce a result like this:
{
"_id": 1,
"title": "123 Department Report",
"tags": [ "G", "STLW" ],
"year": 2014,
"subsections": [
{
"subtitle": "Section 1: Overview",
"tags": [ "SI", "G" ],
"content": "Section 1: This is the content of section 1."
},
{
"subtitle": "Section 2: Analysis",
"tags": [ "STLW" ],
"content": "Section 2: This is the content of section 2."
},
{
"subtitle" : "Section 3: Budgeting",
"label" : "Access Denied"
}
]
}
Basically, we are instead using the $map operator to process the array of items and apply a condition to each element. In this case the $cond operator first checks whether the "tags" field has a non-empty $setIntersection result with the userAccess variable we defined earlier.
Where that condition is deemed true, the element is returned un-altered. Otherwise, in the false case, rather than removing the element (not impossible with $map, but that takes another step), and since $map returns the same number of elements as it received in its "input", you just replace the returned content with something else: in this case an object with a single key and a $literal value of "Access Denied".
So keep in mind what you cannot do, being:
You cannot actually traverse document keys. Any processing needs to refer explicitly to the keys in question.
The content therefore cannot be in any form other than an array, as MongoDB cannot traverse across keys. You would otherwise need to evaluate conditionally at each key path.
Filtering the "top-level" document is right out, unless you really want to add an additional stage at the end that does this:
{ "$project": {
"doc": { "$cond": [
{ "$gt": [
{ "$size": { "$setIntersection": [ "$tags", userAccess ] }},
0
]},
"$$ROOT",
{
"title": "$title",
"label": { "$literal": "Access Denied" }
}
]}
}}
With all said and done, there really is not a lot of purpose in any of this unless you actually intend to "aggregate" something at the end of the day. Making the server do exactly the same filtering of document content that you can do in client code is usually not the best use of expensive CPU cycles.
Even in the basic examples given, it makes a lot more sense to just do this in client code, unless you are getting a major benefit from removing entries that do not meet your conditions before they are transferred over the network. In your case there is no such benefit, so it is better to do this in client code instead.
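As a rough sketch of that client-side approach (plain JavaScript, mirroring the $map/$cond logic from the pipeline above; names are illustrative):

```javascript
// Replicates the pipeline's per-element condition in client code:
// keep an element if its tags intersect the user's roles, otherwise
// swap it for an "Access Denied" stub.
const userAccess = ["STLW", "G"];

function redactSubsections(doc) {
  return {
    ...doc,
    subsections: doc.subsections.map(el =>
      el.tags.some(tag => userAccess.includes(tag))
        ? el                                                 // keep un-altered
        : { subtitle: el.subtitle, label: "Access Denied" }  // replace content
    )
  };
}
```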
Related
I have the query:
db.changes.find(
{
$or: [
{ _id: ObjectId("60b1e8dc9d0359001bb80441") },
{ _oid: ObjectId("60b1e8dc9d0359001bb80441") },
],
},
{
_id: 1,
}
);
which returns almost instantly.
But the moment I add a sort, the query doesn't return. The query just runs. The longest I could tolerate the query running was over 30 Min, so I'm not entirely sure if it does eventually return.
db.changes
.find(
{
$or: [
{ _id: ObjectId("60b1e8dc9d0359001bb80441") },
{ _oid: ObjectId("60b1e8dc9d0359001bb80441") },
],
},
{
_id: 1,
}
)
.sort({ _id: -1 });
I have the following indexes:
[
{
"_oid" : 1
},
{
"_id" : 1
}
]
and this is what db.currentOp() returns:
{
"host": "xxxx:27017",
"desc": "conn387",
"connectionId": 387,
"client": "xxxx:55802",
"appName": "MongoDB Shell",
"clientMetadata": {
"application": {
"name": "MongoDB Shell"
},
"driver": {
"name": "MongoDB Internal Client",
"version": "4.0.5-18-g7e327a9017"
},
"os": {
"type": "Linux",
"name": "Ubuntu",
"architecture": "x86_64",
"version": "20.04"
}
},
"active": true,
"currentOpTime": "2021-09-24T15:26:54.286+0200",
"opid": 71111,
"secs_running": NumberLong(23),
"microsecs_running": NumberLong(23860504),
"op": "query",
"ns": "myDB.changes",
"command": {
"find": "changes",
"filter": {
"$or": [
{
"_id": ObjectId("60b1e8dc9d0359001bb80441")
},
{
"_oid": ObjectId("60b1e8dc9d0359001bb80441")
}
]
},
"sort": {
"_id": -1.0
},
"projection": {
"_id": 1.0
},
"lsid": {
"id": UUID("38c4c09b-d740-4e44-a5a5-b17e0e04f776")
},
"$readPreference": {
"mode": "secondaryPreferred"
},
"$db": "myDB"
},
"numYields": 1346,
"locks": {
"Global": "r",
"Database": "r",
"Collection": "r"
},
"waitingForLock": false,
"lockStats": {
"Global": {
"acquireCount": {
"r": NumberLong(2694)
}
},
"Database": {
"acquireCount": {
"r": NumberLong(1347)
}
},
"Collection": {
"acquireCount": {
"r": NumberLong(1347)
}
}
}
}
This wasn't always a problem, it's only recently started. I've also rebuilt the indexes, and nothing seems to work. I've tried using .explain(), and that also doesn't return.
Any suggestions would be welcome. For my situation, it's going to be much easier to make changes to the DB than it is to change the query.
This is happening due to the way Mongo chooses what's called a "winning plan"; I recommend you read more on this in my other answer, which explains the behavior. It will be interesting to see whether the Mongo team considers this specific behavior a feature or a bug.
Basically the $or operator has some special qualities, as specified:
When evaluating the clauses in the $or expression, MongoDB either performs a collection scan or, if all the clauses are supported by indexes, MongoDB performs index scans. That is, for MongoDB to use indexes to evaluate an $or expression, all the clauses in the $or expression must be supported by indexes. Otherwise, MongoDB will perform a collection scan.
It seems that the addition of the sort disrupts the use of this quality, meaning you're suddenly running a collection scan.
What I recommend is using the aggregation pipeline instead of the query language; I personally find it has more stable behavior, and it might work there. If not, maybe just do the sorting in code.
The server can use a separate index for each branch of the $or, but in order to avoid doing an in-memory sort the indexes used would have to find the documents in the sort order so a merge-sort can be used instead.
For this query, an index on {_id: 1} would find documents matching the first branch and return them in the proper order. For the second branch, an index on {_oid: 1, _id: 1} would do the same.
If you have both of those indexes, the server should be able to find the matching documents quickly, and return them without needing to perform an explicit sort.
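As a sketch in the mongo shell (assuming the collection name from the question; the {_id: 1} index exists by default, so only the compound index needs creating):

```javascript
// Hypothetical mongosh fragment -- requires a live connection, shown for illustration.
// Supports the second $or branch and lets the server merge-sort on _id.
db.changes.createIndex({ _oid: 1, _id: 1 })
```

After creating it, running .explain("executionStats") on the sorted query should no longer show an in-memory SORT stage.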
Let's say I have a collection like so:
{
"id": "2902-48239-42389-83294",
"data": {
"location": [
{
"country": "Italy",
"city": "Rome"
}
],
"time": [
{
"timestamp": "1626298659",
"data":"2020-12-24 09:42:30"
}
],
"details": [
{
"timestamp": "1626298659",
"data": {
"url": "https://example.com",
"name": "John Doe",
"email": "john@doe.com"
}
},
{
"timestamp": "1626298652",
"data": {
"url": "https://www.myexample.com",
"name": "John Doe",
"email": "doe@john.com"
}
},
{
"timestamp": "1626298652",
"data": {
"url": "http://example.com/sub/directory",
"name": "John Doe",
"email": "doe@johnson.com"
}
}
]
}
}
Now the main focus is on the array of subdocuments ("data.details"): I want to get output only for relevant matches, e.g.:
db.info.find({"data.details.data.url": "example.com"})
How can I match every "data.details.data.url" that contains "example.com" without also matching "myexample.com"? When I use $regex I get too many results: if I query for "example.com" it also returns "myexample.com".
Even when I do get partial results (with $match), it's very slow. I tried these aggregation stages:
{ $unwind: "$data.details" },
{
$match: {
"data.details.data.url": /.*example.com.*/,
},
},
{
$project: {
id: 1,
"data.details.data.url": 1,
"data.details.data.email": 1,
},
},
I really don't understand the pattern: with $match, sometimes Mongo recognizes prefixes like "https://" or "https://www." and sometimes it does not.
More info:
My collection is dozens of GB. I created two indexes:
A compound index:
"data.details.data.url": 1,
"data.details.data.email": 1
A text index:
"data.details.data.url": "text",
"data.details.data.email": "text"
It did improve the query performance, but not enough, and I still have this issue with $match vs $regex. Thanks for any help!
Your mistake is in the regex. It matches all of those URLs because the substring example.com occurs in every one of them: for example, https://www.myexample.com contains example.com as a substring.
To avoid this you have to use another regex, for example one that only matches when the domain starts right after a protocol or "www." prefix.
For example:
(http[s]?:\/\/|www\.)YOUR_SEARCH
will check that what you are searching for comes directly after an "http://", "https://", or "www." prefix.
https://regex101.com/r/M4OLw1/1
Here is the full query:
[
{
'$unwind': {
'path': '$data.details'
}
}, {
'$match': {
'data.details.data.url': /(http[s]?:\/\/|www\.)example\.com/
}
}
]
Note: you must escape special characters in the regex. A dot matches any character, and an unescaped slash will close your regex, causing an error.
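To sanity-check that anchoring in plain JavaScript (illustrative; the same literal is what the $match stage above uses):

```javascript
// The domain must appear right after "http://", "https://", or "www.".
const pattern = /(http[s]?:\/\/|www\.)example\.com/;

pattern.test("https://example.com");              // → true
pattern.test("http://example.com/sub/directory"); // → true
pattern.test("https://www.myexample.com");        // → false
```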
I have the following document:
"content": [
{
"_id": "5dbef12ae3976a2775851bfb",
"name": "Item AAA",
"itemImages": [
{
"_id": "5dbef12ae3976a2775851bfd",
"imagePath": "https://via.placeholder.com/300",
"imageTitle": "Test 300"
},
{
"_id": "5dbef12ae3976a2775851bfc",
"imagePath": "https://via.placeholder.com/250",
"imageTitle": "Test 250"
}
]
}
and I am wondering if there is a way to return only the data in the array, but with the "name" and the document's main "_id" included, so that the result set will be:
"itemImages": [
{
"_id": "5dbef12ae3976a2775851bfb",
"name": "Item AAA",
"imagePath": "https://via.placeholder.com/300",
"imageTitle": "Test 300"
},
{
"_id": "5dbef12ae3976a2775851bfb",
"name": "Item AAA",
"imagePath": "https://via.placeholder.com/250",
"imageTitle": "Test 250"
}
]
I've tried using MongoDB's find and aggregate functions, but neither helped in retrieving the above results. Thanks for your help!
You should be able to get what you want with aggregation.
You'll need to:
$unwind the array so there is a separate document for each element
use $addFields to add the _id and name to each element
$group to reassemble the array
$project to get just the field you want
This might look something like:
db.collection.aggregate([
{$unwind:"$itemImages"},
{$addFields: {
"itemImages._id":"$_id",
"itemImages.name":"$name"
}},
{$group:{_id:"$_id", itemImages:{$push:"$itemImages"}}},
{$project:{itemImages:1,_id:0}}
])
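For comparison, here is what that reshaping amounts to in plain client-side JavaScript (a sketch over the sample document, not a replacement for the server-side pipeline):

```javascript
// Equivalent of $unwind + $addFields + $group + $project for one document:
// copy each image, overwriting its _id with the document's main _id and
// attaching the document's name.
function flattenItemImages(doc) {
  return {
    itemImages: doc.itemImages.map(img => ({
      ...img,
      _id: doc._id,
      name: doc.name
    }))
  };
}
```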
My documents' structure look like this:
{
"element": "A",
"date": "2014-01-01",
"valid_until": "2014-02-01"
},
{
"element": "A",
"date": "2014-02-01",
"valid_until": "9999-12-31"
}
The date "9999-12-31" is here to say: "it has not yet expired". There is always a range like this, so for a given element "A" the [date, valid_until) ranges never overlap. I can therefore count how many elements I have with pseudo-code like this: COUNT elements WHERE date < date_to_count AND valid_until >= date_to_count
Where "date_to_count" is the date at which I want to count the values. As I want to calculate this at several points in time, I could either use a date histogram or a date range aggregation. However, the date range aggregation seems to work with only one field. Ideally, I'd like to be able to do this:
"aggs": {
"foo": {
"date_range": {
"fields": ["date", "valid_until"],
"ranges": [
{"from": "2014-01-01", "to": "2014-02-01"},
{"from": "2014-02-01", "to": "2014-03-01"},
{"from": "2014-03-01", "to": "2014-04-01"}
]
}
}
}
Where the "date" will be used for "from", and the "valid_until" would be used for "to".
I've tried several other ideas with script, but can't find an efficient way to do it this way :/.
I think I could also workaround this if, in a script, I could have access to the current from/to values, but once again, I've tried things like "ctx.to", "context.to", but those variables are undefined.
Thanks!
Since both the date_range and date_histogram aggregations work on a single field, I do not think you can achieve your goal with an aggregation. But if you don't have too many date ranges that you need to query for, you could call the count API with a query for each date range. That would look something like this:
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{ "range": { "date": { "gte": "2014-01-01" }}},
{ "range": { "valid_until": { "lt": "2014-02-01" }}}
]
}
}
}
}
I was facing the same problem, and wanted to address this by using one single query. Here is the solution that works for me in Elasticsearch 5.2
"aggs": {
"range1": {
"date_range": {
"field": "date",
"ranges": [
{"from": "2014-01-01", "to": "2014-02-01"},
{"from": "2014-02-01", "to": "2014-03-01"},
{"from": "2014-03-01", "to": "2014-04-01"}
]
}
},
"range2": {
"date_range": {
"field": "valid_until",
"ranges": [
{"from": "2014-01-01", "to": "2014-02-01"},
{"from": "2014-02-01", "to": "2014-03-01"},
{"from": "2014-03-01", "to": "2014-04-01"}
]
}
}
}
Given the following MongoDB example collection ("schools"), how do you remove student "111" from all clubs?
[
{
"name": "P.S. 321",
"structure": {
"principal": "Fibber McGee",
"vicePrincipal": "Molly McGee",
"clubs": [
{
"name": "Chess",
"students": [
ObjectId("111"),
ObjectId("222"),
ObjectId("333")
]
},
{
"name": "Cricket",
"students": [
ObjectId("111"),
ObjectId("444")
]
}
]
}
},
...
]
I'm hoping there's some way other than using cursors to loop over every school, then every club, then every student ID in the club...
MongoDB doesn't have great support for arrays within arrays (within arrays ...). The simplest solution I see is to read the whole document into your app, modify it there, and then save it. This way, of course, the operation is not atomic, but for your app that might be OK.
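A minimal sketch of that read-modify-write approach in client-side JavaScript (the find/save calls in the comment are placeholders for your driver's API; the in-memory update is the part shown):

```javascript
// Remove one student id from every club of a school document, in memory.
// `studentId` is a hypothetical placeholder for the ObjectId to remove.
function removeStudentFromClubs(school, studentId) {
  for (const club of school.structure.clubs) {
    club.students = club.students.filter(id => String(id) !== String(studentId));
  }
  return school;
}

// With a driver, the (non-atomic) loop would look roughly like:
//   db.schools.find().forEach(doc => {
//     removeStudentFromClubs(doc, studentId);
//     db.schools.replaceOne({ _id: doc._id }, doc);
//   });
```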