I have two very simple mongo documents:
{_id:1, v:1}
{_id:2, v:1}
Now, based on an array of _id values, I need to increase the field v as many times as each _id appears. For example, [1, 2, 1] should produce
{_id:1, v:3} // increased 2 times
{_id:2, v:2} // increased 1 time
Of course, a simple update eliminates the duplicates in $in:
db.r.update({_id:{$in:[1,2,1]}}, {$inc:{v:1}}, {multi:true})
Is there a way to do it without a for-loop? Thank you in advance.
No, there isn't a way to do this in a single update statement.
The reason why the $in operator "removes the duplicate" is a simple matter of the fact that the 1 was already matched, so there is no point in matching it again. So you can't make the document "match twice", as it were.
Also there is no current way to batch update operations. But that feature is coming.
You could look at your "batch" and decide to group together occurrences of the same document to be updated, and then issue your increment for the appropriate number of units. However, just like looping over the array items, the operation would be programmatic, albeit a little more efficient.
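For reference, with the kind of bulk write support the drivers gained later, those per-document increments can at least be sent in a single round trip. A rough PyMongo sketch under that assumption, reusing the collection r and the array from the question:

from collections import Counter
from pymongo import MongoClient, UpdateOne

coll = MongoClient().test.r          # assuming the collection from the question
ids = [1, 2, 1]

# one UpdateOne per distinct _id, incrementing v by its number of occurrences
ops = [UpdateOne({"_id": i}, {"$inc": {"v": n}}) for i, n in Counter(ids).items()]
coll.bulk_write(ops, ordered=False)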
That isn't possible directly. You'll have to do that in your client, where you can at least try to minimize the number of batch updates required.
First, find the counts. This depends on your programming language, but what you want is something like [1, 2, 1] => [ { 1 : 2 }, { 2 : 1 } ] (these are the counts for the respective ids, i.e. id 1 appears twice, etc.). Something like LINQ or underscore.js is helpful here.
Next, since you can't perform different updates in a single operation, group them by their count, and update all objects whose count must be incremented by a common fixed value in one batch:
Pseudocode:
var groups = data.groupBy(p => p.Value);
foreach(var group in groups)
    db.update({ "_id" : { $in : group.values.asArray } },
              // increase by the number of times those ids were present
              { $inc : { v : group.key } },
              { multi : true })
That is better than individual updates only if there are many documents that must be increased by the same value.
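For what it's worth, here is a rough Python/PyMongo translation of that pseudocode, assuming the collection r from the question; the key point is one update_many per distinct count:

from collections import Counter
from pymongo import MongoClient

coll = MongoClient().test.r
ids = [1, 2, 1]

by_count = {}                              # group ids by how often they appear
for _id, n in Counter(ids).items():
    by_count.setdefault(n, []).append(_id)

for n, id_group in by_count.items():
    # one update per distinct count: increment all of these ids by n
    coll.update_many({"_id": {"$in": id_group}}, {"$inc": {"v": n}})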
When looking for pagination techniques on the internet, one usually finds two ways:
Offset-based pagination:
Collection.find(
    { where_this: "equals that" },
    { skip: 15, limit: 5 }
)
Cursor-based pagination:
Collection.find(
    { where_this: "equals that", _id: { $gt: cursor } },
    { sort: { _id: 1 } }
)
But is there a way to have cursor-based pagination without sorting the collection according to that cursor? Like telling Mongo: "Alright, I want the next 5 items after that _id, no matter which order the _ids are in, just give me 5 items after you see that _id." Something along those lines:
Collection.find(
    { where_this: "equals that", _id: { $must_come_after: cursor } },
    { sort: { other_field: 1 } }
)
It is not always possible to use the field you're sorting by as the cursor. First of all, these fields can be of different types, and you might allow your app's users to sort tables however they please; with a strongly-typed API framework like GraphQL, this would be a mess to handle. Secondly, you could have two or more equal values for that field following each other in the sorted collection. If your pages split in the middle, asking for the next page will either give you duplicates or ignore items.
Is there a way to do that? How is it usually done to allow custom sorting fields without offset-based pagination? Thanks.
When we talk about "paginating a result set", the implicit assumption is that the result set stays the same throughout the process. In order for the result set to stay the same, it must normally be sorted. Without specifying an order, the database is free to return the documents in any order, and this order may change from one retrieval to the next. Reordering of documents that a user is paginating through creates a poor user experience.
you could have two or more equal values for that field following each other in the sorted collection. If your pages split in the middle, asking for the next page will either give you duplicates or ignore items.
The database can return the documents which compare equal in any order, and this order can change between adjacent queries. This is why when sorting on a field that has a low cardinality, it is a good idea to add another field with a high cardinality to the sort expression to ensure the documents are returned in a stable order.
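As a sketch of that idea in PyMongo, with a hypothetical low-cardinality score field the user sorts on, plus _id as the tie-breaker; last_score and last_id would come from the final item of the previous page:

page = collection.find({
    "where_this": "equals that",
    "$or": [
        {"score": {"$gt": last_score}},
        {"score": last_score, "_id": {"$gt": last_id}},
    ],
}).sort([("score", 1), ("_id", 1)]).limit(5)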
But is there a way to have cursor-based pagination without sorting the collection according to that cursor?
You can encode the offset in a cursor identifier and use skip/limit.
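A minimal sketch of that approach in PyMongo, treating the cursor as an opaque token that merely wraps a skip offset (the field names come from the question's examples; encode_cursor/fetch_page are illustrative names):

import base64
import json

def encode_cursor(offset):
    # opaque token handed to the client; it just wraps an offset
    return base64.urlsafe_b64encode(json.dumps({"offset": offset}).encode()).decode()

def decode_cursor(cursor):
    return json.loads(base64.urlsafe_b64decode(cursor))["offset"]

def fetch_page(collection, cursor=None, page_size=5):
    offset = 0 if cursor is None else decode_cursor(cursor)
    docs = list(collection.find({"where_this": "equals that"})
                          .sort([("other_field", 1)])
                          .skip(offset)
                          .limit(page_size))
    return docs, encode_cursor(offset + len(docs))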
I have a very large collection (~7M items) in MongoDB, primarily consisting of documents with three fields.
I'd like to be able to iterate over all the unique values for one of the fields, in an expedient manner.
Currently, I'm querying for just that field, and then processing the returned results by iterating on the cursor for uniqueness. This works, but it's rather slow, and I suspect there must be a better way.
I know mongo has the db.collection.distinct() function, but this is limited by the maximum BSON size (16 MB), which my dataset exceeds.
Is there any way to iterate over something similar to the db.collection.distinct(), but using a cursor or some other method, so the record-size limit isn't as much of an issue?
I think maybe something like the map/reduce functionality would possibly be suited for this kind of thing, but I don't really understand the map-reduce paradigm in the first place, so I have no idea what I'm doing. The project I'm working on is partially to learn about working with different database tools, so I'm rather inexperienced.
I'm using PyMongo if it's relevant (I don't think it is). This should be mostly dependent on MongoDB alone.
Example:
For this dataset:
{"basePath" : "foo", "internalPath" : "Neque", "itemhash": "49f4c6804be2523e2a5e74b1ffbf7e05"}
{"basePath" : "foo", "internalPath" : "porro", "itemhash": "ffc8fd5ef8a4515a0b743d5f52b444bf"}
{"basePath" : "bar", "internalPath" : "quisquam", "itemhash": "cf34a8047defea9a51b4a75e9c28f9e7"}
{"basePath" : "baz", "internalPath" : "est", "itemhash": "c07bc6f51234205efcdeedb7153fdb04"}
{"basePath" : "foo", "internalPath" : "qui", "itemhash": "5aa8cfe2f0fe08ee8b796e70662bfb42"}
What I'd like to do is iterate over just the basePath field. For the above dataset, this means I'd iterate over foo, bar, and baz just once each.
I'm not sure if it's relevant, but the DB I have is structured so that while each field is not unique, the aggregate of all three is unique (this is enforced with an index).
The query and filter operation I'm currently using (note: I'm restricting the query to a subset of the items to reduce processing time):
self.log.info("Running path query")
itemCursor = self.dbInt.coll.find({"basePath": pathRE}, fields={'_id': False, 'internalPath': False, 'itemhash': False}, exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
for item in itemCursor:
    # print item
    items.add(item["basePath"])
self.log.info("total unique items = %s", len(items))
Running the same query with self.dbInt.coll.distinct("basePath") results in OperationFailure: command SON([('distinct', u'deduper_collection'), ('key', 'basePath')]) failed: exception: distinct too big, 16mb cap
Ok, here is the solution I wound up using. I'd add it as an answer, but I don't want to detract from the actual answers that got me here.
reStr = "^%s" % fqPathBase
pathRE = re.compile(reStr)
self.log.info("Running path query")
pipeline = [
    { "$match":
        {
            "basePath": pathRE
        }
    },
    # Group the keys
    { "$group":
        {
            "_id": "$basePath"
        }
    },
    # Output to a collection "tmp_unique_coll"
    { "$out": "tmp_unique_coll" }
]
itemCursor = self.dbInt.coll.aggregate(pipeline, allowDiskUse=True)
itemCursor = self.dbInt.db.tmp_unique_coll.find(exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
retItems = 0
for item in itemCursor:
    retItems += 1
    items.add(item["_id"])
self.log.info("Received items = %d", retItems)
self.log.info("total unique items = %s", len(items))
General performance compared to my previous solution is about 2X in terms of wall-clock time. On a query that returns 834273 items, with 11467 uniques:
Original method (retrieve, stuff into a python set to enforce uniqueness):
real 0m22.538s
user 0m17.136s
sys 0m0.324s
Aggregate pipeline method:
real 0m9.881s
user 0m0.548s
sys 0m0.096s
So while the overall execution time is only ~2X better, the aggregation pipeline is massively more performant in terms of actual CPU time.
Update:
I revisited this project recently, and rewrote the DB layer to use a SQL database, and everything was much easier. A complex processing pipeline is now a simple SELECT DISTINCT(colName) WHERE xxx operation.
Realistically, MongoDB and NoSQL databases in general are very much the wrong type of database for what I'm trying to do here.
From the discussion points so far I'm going to take a stab at this. And I'm also noting that as of writing, the 2.6 release for MongoDB should be just around the corner, good weather permitting, so I am going to make some references there.
Oh, and the FYI that didn't come up in chat: .distinct() is an entirely different animal that pre-dates the methods used in the responses here, and as such it is subject to many limitations.
And this is finally a solution for 2.6 and up, or any current dev release above 2.5.3.
The alternative for now is to use mapReduce, because its only restriction is the output size.
Without going into the inner workings of distinct, I'm going to go on the presumption that aggregate is doing this more efficiently [and even more so in the upcoming release].
db.collection.aggregate([
    // Group the key and increment the count per match
    { $group: { _id: "$basePath", count: { $sum: 1 } } },
    // Hey you can even sort it without breaking things
    { $sort: { count: 1 } },
    // Output to a collection "output"
    { $out: "output" }
])
So we are using the $out pipeline stage to get the final result, which is over 16MB, into a collection of its own. There you can do what you want with it.
As 2.6 is "just around the corner" there is one more tweak that can be added.
Use allowDiskUse from the runCommand form, where each stage can use disk and not be subject to memory restrictions.
The main point here is that this is nearly live for production, and the performance will be better than the same operation in mapReduce. So go ahead and play. Install 2.5.5 for your own use now.
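For anyone doing this from PyMongo (used elsewhere in this thread), a sketch of the command form with allowDiskUse; here db is assumed to be a PyMongo Database handle, "collection" is just the placeholder name from the shell example, and on current servers the cursor document is required:

pipeline = [
    {"$group": {"_id": "$basePath", "count": {"$sum": 1}}},
    {"$sort": {"count": 1}},
    {"$out": "output"},
]

# command form, roughly equivalent to the runCommand shell call
result = db.command("aggregate", "collection",
                    pipeline=pipeline, allowDiskUse=True, cursor={})

# or simply let the driver helper pass the option through:
# db.collection.aggregate(pipeline, allowDiskUse=True)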
A MapReduce, in the current version of Mongo, would avoid the problem of the results exceeding 16MB.
map = function() {
    if (this['basePath']) {
        emit(this['basePath'], 1);
    }
    // if basePath always exists you can just call the emit directly:
    // emit(this.basePath, 1);
};
reduce = function(key, values) {
    return Array.sum(values);
};
For each document the basePath is emitted with a single value representing the count of that value. The reduce simply creates the sum of all the values. The resulting collection would have all unique values for basePath along with the total number of occurrences.
And, since you'll need to store the results to prevent an error, use the out option, which specifies a destination collection.
db.yourCollectionName.mapReduce(
    map,
    reduce,
    { out: "distinctMR" }
)
@Neil Lunn's answer could be simplified:
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}])
$project filters fields for you. In particular, '_id': 0 filters out the _id field.
Result still too large? Batch it with $limit and $skip:
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}, {'$skip': Y}, {'$limit': X}])
I think the most scalable solution is to perform a query for each unique value. The queries must be executed one after the other, and each query will give you the "next" unique value based on the previous query result. The idea is that the query will return you one single document, that will contain the unique value that you are looking for. If you use the proper projection, mongo will just use the index loaded into memory without having to read from disk.
You can define this strategy using $gt operator in mongo, but you must take into account values like null or empty strings, and potentially discard them using the $ne or $nin operator. You can also extend this strategy using multiple keys, using operators like $gte for one key and $gt for the other.
This strategy should give you the distinct values of a string field in alphabetical order, or distinct numerical values sorted ascendingly.
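A sketch of that strategy in PyMongo, assuming an index on basePath; each find_one returns the next distinct value strictly greater than the previous one (the null/empty handling mentioned above is left out):

distinct_values = []
last = None
while True:
    query = {} if last is None else {"basePath": {"$gt": last}}
    doc = collection.find_one(
        query,
        projection={"basePath": 1, "_id": 0},   # keep it covered by the index
        sort=[("basePath", 1)],
    )
    if doc is None:
        break
    last = doc["basePath"]
    distinct_values.append(last)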
I am looking for a good way to implement a sort key, that is completely user definable. E.g. The user is presented with a list and may sort the elements by dragging them around. This order should be persisted.
One commonly used way is to just create an ascending integer type sort field within each element:
{
    "_id": "xxx1",
    "sort": 2
},
{
    "_id": "xxx2",
    "sort": 3
},
{
    "_id": "xxx3",
    "sort": 1
}
While this will surely work, it might not be ideal: in case the user moves an element from the very bottom to the very top, all the indexes in between need to be updated. We are not talking about embedded documents here, so this will cause a lot of individual documents to be updated. This might be optimised by creating initial sort values with gaps in between (e.g. 100, 200, 300, 400). However, this creates the need for additional logic and re-sorting in case the space between two elements is exhausted.
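Concretely, the gap idea might look roughly like this (a hypothetical PyMongo sketch): a moved element gets the midpoint of its new neighbours' sort values, and the whole list is only renumbered when the gap runs out.

def move_between(coll, item_id, prev_sort, next_sort):
    # midpoint keeps every other document untouched
    coll.update_one({"_id": item_id},
                    {"$set": {"sort": (prev_sort + next_sort) / 2.0}})

def renumber(coll, gap=100):
    # fallback once midpoints get too close together: rewrite with fresh gaps
    for i, doc in enumerate(coll.find().sort("sort", 1)):
        coll.update_one({"_id": doc["_id"]}, {"$set": {"sort": (i + 1) * gap}})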
Another approach comes to mind: Have the parent document contain a sorted array, which defines the order of the children.
{
    "_id": "parent01",
    "children": ["xxx3", "xxx1", "xxx2"]
}
This approach would certainly make it easier to change the order, but it has its own caveats: the parent documents must always keep track of a valid list of their children. As adding children will update multiple documents, this still might not be ideal. And there needs to be complex validation of the input received from the client, as the length of this list and the elements it contains may never be changed by the client.
Is there a better way to implement such a use case?
Hard to say which option is better without knowing:
How often the sort order is usually updated
Which queries you're going to run against the documents, and how often
How many documents can be sorted at a time
I'm sure you're going to do many more queries than updates, so personally I would go with the first option. It's easy to implement and it's simple, which means it's going to be robust. I understand your concerns about updating multiple documents, but the updates will be done in place; no document shifting will occur, since you don't actually change the document's size. Just create a simple test: generate 1k documents, then update each of them in a loop like this
db.test.update({ '_id': arrIds[i] }, { $set: { 'sort' : i } })
You will see it will be a pretty instant operation.
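Roughly the same test in PyMongo, if you want to try it (the collection name test and the numbers are just placeholders):

import time
from pymongo import MongoClient

coll = MongoClient().test.test
coll.insert_many([{"_id": i, "sort": i} for i in range(1000)])

start = time.time()
for i in range(1000):
    # reassign every sort value, as if the whole list had been reordered
    coll.update_one({"_id": i}, {"$set": {"sort": 1000 - i}})
print("updated 1000 documents in %.3f s" % (time.time() - start))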
I like the second option as well; from a programming perspective it looks more elegant. But when it comes to practice, you usually don't care much whether your update takes 10 milliseconds instead of 5 if you don't do it often, and I'm sure you don't; most applications are query-oriented.
EDIT:
When you update multiple documents, even if it's an almost instant operation, you may run into an inconsistency issue where some documents are updated and some are not. In my case it wasn't really an issue, in fact. Let's consider an example; assume there's a list:
{ "_id" : 1, "sort" : 1 },{ "_id" : 2, "sort" : 4 },{ "_id" : 3, "sort" : 2 },{ "_id" : 4, "sort" : 3 }
so the ordered ids should look like 1,3,4,2 according to the sort fields. Let's say we have a failure while moving id=2 to the top. The failure occurs after only two documents have been updated, so we end up with the following state, as we only managed to update ids 2 and 1:
{ "_id" : 1, "sort" : 2 },{ "_id" : 2, "sort" : 1 },{ "_id" : 3, "sort" : 2 },{ "_id" : 4, "sort" : 3 }
The data is in an inconsistent state, but we can still display the list so the problem can be fixed; the id order will be 2,1,3,4 if we just order by the sort field. Why is it not a problem in my case? Because when a failure occurs, the user is redirected to an error page or given an error message; it is obvious to him that something went wrong and he should try again, so he just goes back to the page and fixes the order, which is only partially valid for him.
Just to sum it up: taking into account that this is a really rare case, plus the other benefits of the approach, I would go with it. Otherwise you will have to place everything in one document, both the elements and the array with their indexes. That might be a much bigger issue, especially when it comes to querying.
Hope it helps!
I have browsed through various examples but have failed to find what I am looking for. What I want is to search for a specific document by _id and skip-and-return items of its array multiple times with one query? Or some alternative which is fast enough for my case.
The following query would skip the first comment and return the second:
db.posts.find( { "_id" : 1 }, { comments: { $slice: [ 1, 1 ] } } )
That is: skip item 0, return item 1, and leave the rest out of the result.
But what if there were, say, 10000 comments and I wanted to use the same pattern, but return the array values like this:
skip 0, return 1, skip 2, return 3, skip 4, return 5
So that would return a document whose comments array has a size of 5000, because half of them are skipped. Is this possible? I used a large number like 10000 because I fear that using multiple queries for this would not be performant (example shown here: multiple queries to accomplish something similar). Thanks!
I went through several resources and concluded that this is currently impossible to do with one query. Instead, I agreed that there are only two options to overcome this problem:
1.) Make a loop of some sort and run several slice queries while increasing the position of the slice, similar to the resource I linked:
var skip = NUMBER_OF_ITEMS * (PAGE_NUMBER - 1)
db.companies.find({}, {$slice:[skip, NUMBER_OF_ITEMS]})
However, depending on the type of data, I would not want to run 5000 individual queries to get only half of the array contents, so I decided to use option 2, which seems relatively fast and efficient to me.
2.) Make a single query by _id for the document you want, and before returning the results to the client or some other part of your code, skip the unwanted array items with a for-loop, then return the results. I did this on the Java side, since I talk to mongo via Morphia. I also ran explain() on the query and saw that returning a single document with an array of 10000 items while specifying _id criteria is so fast that speed wasn't really an issue; I bet the slice skip would only be slower.
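For illustration, the same idea in PyMongo instead of Java/Morphia (a sketch; posts and the skip/return pattern come from the question, and db is an already connected database handle):

doc = db.posts.find_one({"_id": 1}, {"comments": 1})
# keep items 1, 3, 5, ... — i.e. skip 0, return 1, skip 2, return 3, ...
every_other = doc["comments"][1::2]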
I have some scripts which update mongoDB records which look like this:
{ "_id" : "c12345", "arr" : [
{
"i" : 270099850,
"a" : 772,
},
{
"i" : 286855630,
"a" : 622,
}
] }
The scripts append elements to the "arr" array of the object using "$pushAll", which works fine and is very fast.
My requirement:
1. Keep modifying these objects, but process them once the size of arr exceeds 1000. When arr exceeds 1000, I choose some important records, discard some less important ones and some old ones, and reduce the size of arr to 500.
Current implementation:
1. Script A takes some data from somewhere, finds the object in another collection using the "_id" field, and appends that data to the "arr" array.
When the same script finds the element, it checks the size of "arr": if it is less than 1000, it does a normal append to arr; otherwise it processes the PHP object retrieved through the find, modifies it, and updates the mongo record using "$set".
Current bottlenecks:
1. I want the updating script to run very fast. Upserts are fast; however, the find-and-modify operations are slower for each record.
Ideas in mind:
1. Instead of processing EXCEEDED items within the scripts, set a bool flag in the object, and process it using a separate Data Cleaner script. (But this also requires me to FIND the object before doing the UPSERT.)
2. Always maintain a COUNT variable in the object, which stores the current length of "arr", and use it in the Data Cleaner script, which cleans all objects fetched through a mongodb query with "count" > 1000. (As mongodb currently only allows the $size operator to match an exact value, not ranges, I need to maintain my own COUNT counter.)
Any other clean and efficient ideas you can suggest ?
Thanks .
In version 2.3.2 of mongo, a new feature was added: there is now a $slice modifier for $push that can be used to keep an array at a fixed size.
E.g.:
t.update( {_id:7}, { $push: { x: { $each: [ {a:{b:3}} ], $slice:-2, $sort: {'a.b':1} } } } )
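Applied to the documents in the question, from PyMongo, it might look roughly like this (a sketch; the new element is hypothetical, and note that a positive $slice, which keeps the first N after the sort, needs a newer server than 2.4, where only negative values were allowed):

new_elements = [{"i": 270099851, "a": 900}]      # hypothetical new data
coll.update_one(
    {"_id": "c12345"},
    {"$push": {"arr": {
        "$each": new_elements,
        "$sort": {"a": -1},      # most "important" elements first
        "$slice": 500,           # keep only the first 500 of them
    }}},
)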
There's no easy way to do this, however, this is a good idea:
Instead of processing EXCEEDED items within the scripts, set a bool flag in the object, and process it using a separate Data Cleaner script.
Running a separate script definitely makes sense for this.
MongoDB does not have a method for "fixed-length" arrays, and it definitely does not have a method for doing something like this:
choose some important records, discard some less important ones, and discard some old ones
The only exception I would make is the "bool" flag. You probably want just a straight counter. If you can index on this counter then it should be fast to find those arrays that are "too big".
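A sketch of that counter idea in PyMongo; prune() stands in for whatever hypothetical selection logic decides which 500 elements to keep:

# on every append, also bump an indexed counter
coll.update_one(
    {"_id": "c12345"},
    {"$push": {"arr": {"$each": new_elements}},
     "$inc": {"count": len(new_elements)}},
)

# Data Cleaner pass: only documents whose counter says they are too big
for doc in coll.find({"count": {"$gt": 1000}}):
    keep = prune(doc["arr"])                     # hypothetical selection logic
    coll.update_one({"_id": doc["_id"]},
                    {"$set": {"arr": keep, "count": len(keep)}})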