I have the following query to run against my MongoDB collection order_error, which holds over 60 million documents. My main concern is the $in operator in the query. I have tried several indexes, but none of them improved performance much. The query is as follows:
db.getCollection("order_error").find({
    "$and": [
        { "type": "order" },
        {
            "Origin.SN": {
                "$in": ["4095", "4100", "4509", "4599", "4510"]
            }
        }
    ]
}).sort({"timestamp.milliseconds" : 1}).skip(1).limit(100).explain("executionStats")
Note that I allow sorting on timestamp.milliseconds in both directions (ASC and DESC). I have limited the entries in the $in list here; usually there are more. What kind of index would improve performance? I have already tried creating the following indices:
type_1_Origin.SN_1_timestamp.milliseconds_-1
type_1_timestamp.milliseconds_-1_Origin.SN
Is there any better way for index creation?
Hi, I'm trying to write a conditional query on a nested document array.
I've been reading the documentation for days and couldn't figure out how to make this work.
The DB looks like this:
[
    {
        "id": 1,
        "team": "team1",
        "players": [
            { "name": "Mario", "substitutes": ["Luigi", "Yoshi"] },
            { "name": "Wario", "substitutes": [] }
        ]
    },
    {
        "id": 2,
        "team": "team2",
        "players": [
            { "name": "Bowser", "substitutes": ["Toad", "Mario"] },
            { "name": "Wario", "substitutes": [] }
        ]
    }
]
My English is limited, so this is hard to put into words, but what I'm trying to do is
find teams that include all queried players.
Some objects in the players array have substitutes.
For each object in the players array, if a queried player is not the main player ("players.name"), I want the query to check whether that player is one of the substitutes ("players.substitutes").
Team.find({players:{$in:[ 'Mario', 'Wario' ]}}) (mongoose query)
This gives me an array with 'team1'.
But what I want to get is both teams, because 'Mario' is one of the substitutes for 'Bowser' (team2).
I haven't managed to write a working query. I have been trying to avoid $where, since the official MongoDB docs say:
AGGREGATION ALTERNATIVES PREFERRED
Starting in MongoDB 3.6, the $expr operator allows the use of
aggregation expressions within the query language. And, starting in
MongoDB 4.4, the $function and $accumulator allows users to define
custom aggregation expressions in JavaScript if the provided pipeline
operators cannot fulfill your application’s needs.
Given the available aggregation operators:
The use of $expr with aggregation operators that do not use JavaScript
(i.e. non-$function and non-$accumulator operators) is faster than
$where because it does not execute JavaScript and should be preferred
if possible. However, if you must create custom expressions, $function
is preferred over $where.
But if it can be written easily with the $where operator, that's totally fine.
Any suggestions or ideas that get me further would be highly appreciated.
Firstly, your query is incorrect, and it is not obvious what exactly your filter criteria are. So I am giving a few suggestions:
If you want all documents where a player's name is in your list (this returns both documents):
db.Team.find({"players.name":{$in:[ 'Mario', 'Wario' ]}}).pretty()
If you want all documents that have any of the given names in a substitutes array (this returns only one document, because none of team1's substitutes is Mario or Wario):
db.Team.find({"players.substitutes":{$in:[ 'Mario', 'Wario' ]}}).pretty()
If the names could be present in either name or substitutes:
db.Team.find({ $or: [{"players.substitutes":{$in:[ 'Mario', 'Wario' ]}}, {"players.name":{$in:[ 'Mario', 'Wario' ]}}] }).pretty()
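Note that the $or/$in query above matches a team if any one of the names is found. If the goal is the one stated in the question (every queried name must appear in the team, as a main player or as a substitute), the rule can be sketched in plain Python against the sample data. This is only an illustration of the logic: build_filter and covers_all are made-up helper names, and the filter document it builds (one $or clause per name, combined with $and) is an assumption about how the rule could be expressed, not something tested against a live server.

```python
teams = [
    {"id": 1, "team": "team1", "players": [
        {"name": "Mario", "substitutes": ["Luigi", "Yoshi"]},
        {"name": "Wario", "substitutes": []},
    ]},
    {"id": 2, "team": "team2", "players": [
        {"name": "Bowser", "substitutes": ["Toad", "Mario"]},
        {"name": "Wario", "substitutes": []},
    ]},
]


def build_filter(names):
    """Hypothetical helper: one $or clause per name, all combined with $and."""
    return {"$and": [
        {"$or": [{"players.name": n}, {"players.substitutes": n}]}
        for n in names
    ]}


def covers_all(team, names):
    """Pure-Python version of the same rule, for checking the logic locally."""
    available = set()
    for player in team["players"]:
        available.add(player["name"])
        available.update(player["substitutes"])
    return set(names) <= available


matching = [t["team"] for t in teams if covers_all(t, ["Mario", "Wario"])]
print(matching)  # ['team1', 'team2']
```

Both teams match here because team2 has Wario as a main player and Mario as one of Bowser's substitutes.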
My DB has many documents with mostly random field order as displayed in Mongo Compass. The first field is always _id but the rest of the fields could be in any order. This makes scanning records by eye very difficult.
I have read that this reordering due to upserts no longer happens with Mongo 4.2 and I have upgraded - but the problem remains.
Is there a way to reorder my fields so that each document in a collection has the same field order, say _id first and then a-z?
You can use $replaceWith to do this.
https://mongoplayground.net/p/VBzpabZuJpy
db.YOURCOLLECTION.updateMany({}, [
    {$replaceWith: {
        $mergeObjects: [
            {
                "fieldA": "$fieldA",
                "fieldB": "$fieldB",
                "fieldC": "$fieldC",
                "fieldD": "$fieldD",
                "fieldE": "$fieldE"
            },
            "$$ROOT"
        ]
    }}
])
You can try reading each document in a language that preserves hash key order, reordering fields as you see fit, then writing each document back.
Since bson implements maps as lists of ordered key-value pairs, I expect all regular tools to preserve the order of keys that currently exists for each individual document.
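As a rough illustration of the client-side approach, here is a minimal Python sketch that builds a copy of a document with _id first and the other keys in a-z order. It relies on Python dicts keeping insertion order (Python 3.7+). reorder_fields is a made-up helper name, the sample document is invented, and writing the result back (for example with PyMongo's replace_one) is left out.

```python
def reorder_fields(doc):
    """Return a copy of doc with _id first and the remaining keys sorted a-z."""
    ordered = {}
    if "_id" in doc:
        ordered["_id"] = doc["_id"]
    for key in sorted(k for k in doc if k != "_id"):
        ordered[key] = doc[key]
    return ordered


doc = {"_id": 7, "fieldC": 3, "fieldA": 1, "fieldB": 2}  # invented sample
print(list(reorder_fields(doc)))  # ['_id', 'fieldA', 'fieldB', 'fieldC']
```

This only reorders top-level keys; nested sub-documents would need the same treatment applied recursively.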
My question might be simple, but here's some more context.
I have a MySQL DB, and I used an ETL tool to populate a MongoDB database from it. However, I couldn't manage to create a proper ObjectId reference (I can only get a string containing the ObjectId).
So far I've had one idea (maybe crazy, but it could work).
I got this field populated like this in one document:
"field1" : "ObjectId('5d48845c456145ee9d1ccffde')",
What I want to achieve in MongoDB is to remove the first and last characters (stripping the double quotes) to get:
"field1" : ObjectId('5d48845c456145ee9d1ccffde'),
(Note that MongoDB seems to convert the single quotes to double quotes automatically after the change, so my reference becomes correct.)
The problem is that I can't find anything like an update script for MongoDB to achieve this.
Is there any way to do this?
Using NodeJS could work; however, querying the document in this state doesn't return field1 (probably because it is considered incorrect)...
If it's a one-time update, you can use the following query:
db.COLLECTION.aggregate([
    {
        $addFields: {
            "field1": {
                $toObjectId: {
                    $substrBytes: ["$field1", 10, 24]
                }
            }
        }
    },
    {
        $out: "COLLECTION"
    }
])
In the aggregation, 'field1' is cast to ObjectId. The $out stage then replaces the old data in the collection with the aggregated result.
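The offsets 10 and 24 passed to $substrBytes come from the shape of the string: the prefix ObjectId(' is 10 characters long, and an ObjectId has 24 hexadecimal characters. A small Python sketch of the same slicing, with a stricter regex check added on top, might look like this. The helper name and the sample id are made up for illustration.

```python
import re


def extract_object_id_hex(value):
    """Return the 24 hex characters inside "ObjectId('...')", or None."""
    match = re.fullmatch(r"ObjectId\('([0-9a-fA-F]{24})'\)", value)
    return match.group(1) if match else None


sample = "ObjectId('5d48845c456145ee9d1ccffd')"  # made-up 24-character id

# Same slice as $substrBytes: ["$field1", 10, 24]
sliced = sample[10:10 + 24]

print(extract_object_id_hex(sample))  # 5d48845c456145ee9d1ccffd
print(sliced == extract_object_id_hex(sample))  # True
```

Unlike the plain slice, the regex version returns None for strings that do not have exactly this shape, which may help spot malformed values before converting them.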
I have a very large collection (~7M items) in MongoDB, primarily consisting of documents with three fields.
I'd like to be able to iterate over all the unique values for one of the fields, in an expedient manner.
Currently, I'm querying for just that field, and then processing the returned results by iterating on the cursor for uniqueness. This works, but it's rather slow, and I suspect there must be a better way.
I know mongo has the db.collection.distinct() function, but this is limited by the maximum BSON size (16 MB), which my dataset exceeds.
Is there any way to iterate over something similar to the db.collection.distinct(), but using a cursor or some other method, so the record-size limit isn't as much of an issue?
I think maybe something like the map/reduce functionality would possibly be suited for this kind of thing, but I don't really understand the map-reduce paradigm in the first place, so I have no idea what I'm doing. The project I'm working on is partially to learn about working with different database tools, so I'm rather inexperienced.
I'm using PyMongo if it's relevant (I don't think it is). This should be mostly dependent on MongoDB alone.
Example:
For this dataset:
{"basePath" : "foo", "internalPath" : "Neque", "itemhash": "49f4c6804be2523e2a5e74b1ffbf7e05"}
{"basePath" : "foo", "internalPath" : "porro", "itemhash": "ffc8fd5ef8a4515a0b743d5f52b444bf"}
{"basePath" : "bar", "internalPath" : "quisquam", "itemhash": "cf34a8047defea9a51b4a75e9c28f9e7"}
{"basePath" : "baz", "internalPath" : "est", "itemhash": "c07bc6f51234205efcdeedb7153fdb04"}
{"basePath" : "foo", "internalPath" : "qui", "itemhash": "5aa8cfe2f0fe08ee8b796e70662bfb42"}
What I'd like to do is iterate over just the basePath field. For the above dataset, this means I'd iterate over foo, bar, and baz just once each.
I'm not sure if it's relevant, but the DB I have is structured so that while each field is not unique, the aggregate of all three is unique (this is enforced with an index).
The query and filter operation I'm currently using (note: I'm restricting the query to a subset of the items to reduce processing time):
self.log.info("Running path query")
itemCursor = self.dbInt.coll.find({"basePath": pathRE}, fields={'_id': False, 'internalPath': False, 'itemhash': False}, exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
for item in itemCursor:
    # print item
    items.add(item["basePath"])
self.log.info("total unique items = %s", len(items))
Running the same query with self.dbInt.coll.distinct("basePath") results in OperationFailure: command SON([('distinct', u'deduper_collection'), ('key', 'basePath')]) failed: exception: distinct too big, 16mb cap
Ok, here is the solution I wound up using. I'd add it as an answer, but I don't want to detract from the actual answers that got me here.
reStr = "^%s" % fqPathBase
pathRE = re.compile(reStr)
self.log.info("Running path query")
pipeline = [
    { "$match" :
        {
            "basePath" : pathRE
        }
    },
    # Group the keys
    {"$group":
        {
            "_id": "$basePath"
        }
    },
    # Output to a collection "tmp_unique_coll"
    {"$out": "tmp_unique_coll"}
]
itemCursor = self.dbInt.coll.aggregate(pipeline, allowDiskUse=True)
itemCursor = self.dbInt.db.tmp_unique_coll.find(exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
retItems = 0
for item in itemCursor:
    retItems += 1
    items.add(item["_id"])
self.log.info("Received items = %d", retItems)
self.log.info("total unique items = %s", len(items))
Overall, this is about 2X faster than my previous solution in terms of wall-clock time. On a query that returns 834273 items, with 11467 uniques:
Original method (retrieve, stuff into a Python set to enforce uniqueness):
real 0m22.538s
user 0m17.136s
sys 0m0.324s
Aggregate pipeline method :
real 0m9.881s
user 0m0.548s
sys 0m0.096s
So while the overall execution time is only ~2X better, the aggregation pipeline is massively more performant in terms of actual CPU time.
Update:
I revisited this project recently, and rewrote the DB layer to use a SQL database, and everything was much easier. A complex processing pipeline is now a simple SELECT DISTINCT(colName) WHERE xxx operation.
Realistically, MongoDB and NoSQL databases in general are very much the wrong database type for what I'm trying to do here.
From the discussion points so far, I'm going to take a stab at this. I'm also noting that, as of writing, the 2.6 release of MongoDB should be just around the corner, good weather permitting, so I am going to make some references to it.
Oh, and an FYI that didn't come up in chat: .distinct() is an entirely different animal that pre-dates the methods used in the responses here, and as such is subject to many limitations.
This solution is ultimately one for 2.6 and up, or any current dev release above 2.5.3.
The alternative for now is to use mapReduce, because its only restriction is the output size.
Without going into the inner workings of distinct, I'm going to go on the presumption that aggregate is doing this more efficiently [and even more so in upcoming release].
db.collection.aggregate([
    // Group the key and increment the count per match
    {$group: { _id: "$basePath", count: {$sum: 1} }},
    // Hey you can even sort it without breaking things
    {$sort: { count: 1 }},
    // Output to a collection "output"
    {$out: "output"}
])
So we are using the $out pipeline stage to get the final result, which is over 16MB, into a collection of its own. There you can do what you want with it.
As 2.6 is "just around the corner" there is one more tweak that can be added.
Use allowDiskUse from the runCommand form, where each stage can use disk and not be subject to memory restrictions.
The main point here is that this is nearly live for production, and the performance will be better than the same operation in mapReduce. So go ahead and play. Install 2.5.5 for your own use now.
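To make the $group and $sort stages above concrete, here is a plain-Python sketch of what they compute, run on the sample rows from the question. It does not talk to MongoDB at all. It only mirrors the grouping logic, and the tie-break on name is my own addition so the printed order is stable.

```python
from collections import Counter

rows = [
    {"basePath": "foo", "internalPath": "Neque"},
    {"basePath": "foo", "internalPath": "porro"},
    {"basePath": "bar", "internalPath": "quisquam"},
    {"basePath": "baz", "internalPath": "est"},
    {"basePath": "foo", "internalPath": "qui"},
]

# $group: one bucket per basePath, with {$sum: 1} counting its rows
counts = Counter(row["basePath"] for row in rows)

# $sort: ascending by count (ties broken by name here to keep output stable)
grouped = sorted(
    ({"_id": key, "count": n} for key, n in counts.items()),
    key=lambda d: (d["count"], d["_id"]),
)
print(grouped)
```

Each resulting entry corresponds to one document that $out would write to the output collection, so iterating over that collection visits every distinct basePath exactly once.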
A MapReduce in the current version of Mongo would avoid the problem of the results exceeding 16MB.
map = function() {
    if(this['basePath']) {
        emit(this['basePath'], 1);
    }
    // if basePath always exists you can just call the emit:
    // emit(this.basePath, 1);
};
reduce = function(key, values) {
    return Array.sum(values);
};
For each document the basePath is emitted with a single value representing the count of that value. The reduce simply creates the sum of all the values. The resulting collection would have all unique values for basePath along with the total number of occurrences.
Also, you'll need to store the results to avoid an error, using the out option, which specifies a destination collection.
db.yourCollectionName.mapReduce(
    map,
    reduce,
    { out: "distinctMR" }
)
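If the map/reduce paradigm itself is the unclear part, the following plain-Python sketch mimics the two functions above on a handful of invented documents. It is only a model of the idea (emit key/value pairs, then reduce the values per key). It is not how the server actually executes mapReduce, and the function names are made up.

```python
from collections import defaultdict


def map_doc(doc):
    """Mirror of the map function: emit (basePath, 1) when the field is present."""
    if doc.get("basePath"):
        yield doc["basePath"], 1


def reduce_values(key, values):
    """Mirror of the reduce function: sum the emitted counts for one key."""
    return sum(values)


def map_reduce(docs):
    """Collect emitted values per key, then reduce each key's list."""
    emitted = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            emitted[key].append(value)
    return {key: reduce_values(key, values) for key, values in emitted.items()}


docs = [{"basePath": p} for p in ["foo", "foo", "bar", "baz", "foo"]]
docs.append({"other": 1})  # no basePath, so nothing is emitted for it
result = map_reduce(docs)
print(result)  # {'foo': 3, 'bar': 1, 'baz': 1}
```

The keys of the result are the distinct basePath values, and the values are their occurrence counts, matching the description of the output collection above.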
@Neil Lunn's answer could be simplified:
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}])
$project filters fields for you. In particular, '_id': 0 filters out the _id field.
Result still too large? Batch it with $skip and $limit (skip first, then limit, so that each batch has X documents):
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}, {'$skip': Y}, {'$limit': X}])
I think the most scalable solution is to perform a query for each unique value. The queries must be executed one after the other, and each query will give you the "next" unique value based on the previous query result. The idea is that the query will return you one single document, that will contain the unique value that you are looking for. If you use the proper projection, mongo will just use the index loaded into memory without having to read from disk.
You can define this strategy using $gt operator in mongo, but you must take into account values like null or empty strings, and potentially discard them using the $ne or $nin operator. You can also extend this strategy using multiple keys, using operators like $gte for one key and $gt for the other.
This strategy should give you the distinct values of a string field in alphabetical order, or distinct numerical values in ascending order.
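A minimal sketch of this "next value" walk, using a sorted Python list as a stand-in for the index so it can run without a server. Each loop iteration plays the role of one query of the form "smallest value strictly greater than the last one seen". With PyMongo I would expect each step to be something like find_one({"basePath": {"$gt": last}}, {"basePath": 1, "_id": 0}, sort=[("basePath", 1)]), but treat that as an assumption rather than tested code.

```python
import bisect


def distinct_by_walking(sorted_values):
    """Yield each distinct value by repeatedly jumping to the first value > last."""
    position = 0
    while position < len(sorted_values):
        current = sorted_values[position]
        yield current
        # Stand-in for the next query: first entry strictly greater than current
        position = bisect.bisect_right(sorted_values, current, position)


index = sorted(["foo", "foo", "bar", "baz", "foo"])  # stand-in for the index
print(list(distinct_by_walking(index)))  # ['bar', 'baz', 'foo']
```

The number of steps equals the number of distinct values rather than the number of documents, which is where the scalability of this approach would come from.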