MongoDB findAndModify on an array

I have an array of objects:
[{
_id: 1,
data: 'one'
},{
_id: 2,
data: 'two'
}]
I am receiving a new array every so often. Is there a way to push all the data back into Mongo (without duplicates) in bulk?
I.e. I know that I can loop over each element and do a findAndModify (with upsert: true for new records coming in), but I can't do an insert with the array each time because the ids will collide.

At least in the shell, if you try to insert the whole array in one step, it cycles through each element of the array and works, so the instruction:
db.coll.insert([{ _id: 1, data: 'one' },{ _id: 2, data: 'two' }])
This works and inserts two different records.
The _id check also works: if you try it again you will receive a duplicate key error, as expected.
There is a downside, though: Mongo really does cycle through every single record, so if you try something like:
db.col.insert([{ _id: 1, data: 'one again' },{ _id: 5, data: 'five' }])
It won't get you past the duplicate: Mongo stops at the first record (the colliding _id: 1) and the second document is never processed.
There are a couple of other tricks, for example inserting the whole array into a collection under a single field named "data" and processing it from there, which is faster; however, you are always limited to 16 MB per document, so if your bulk data is too large none of these methods will work.
If you use mongoimport you can pass the --jsonArray parameter, but you are still limited to 16 MB.
There is no other way to do it if you need larger chunks of data.
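That said, if the real goal is the bulk upsert the question asks about (rather than a raw insert), the driver's bulk API can send one upsert per element in a single batch. A minimal PyMongo sketch, assuming PyMongo 3+ and placeholder connection/collection names:

from pymongo import MongoClient, ReplaceOne

coll = MongoClient('localhost:27017').db.coll  # placeholder connection/collection

incoming = [{"_id": 1, "data": "one again"}, {"_id": 5, "data": "five"}]

# One upsert per element, sent as a single batch; ordered=False means a
# duplicate-key failure on one operation does not stop the rest.
requests = [ReplaceOne({"_id": doc["_id"]}, doc, upsert=True) for doc in incoming]
result = coll.bulk_write(requests, ordered=False)
print(result.upserted_count, result.modified_count)

Note that the 16 MB limit discussed above applies per document; the driver splits large batches automatically.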

Related

Insert field only if not null

I have a Mongodb document which I want to update with some information that I receive from a form. The original document in Mongodb has the following data:
{"id":1, "name": "James", "surname1": "Adams"} [id is univoque]
The information obtained from the form is the following (it's structure is always the same):
id = 1
name = "James"
surname1 = "Adams"
surname2 = ""
synced = 1
In order to update this document from the collection foo, I am doing the following operation:
mongoClient = MongoClient('localhost:27017').db
mongoClient["foo"].update({"id" : id}, {"$set" : {"name" : name, "surname1" : surname1, "surname2" : surname2, "synced" : synced}})
This will create an empty field for surname2. And here goes my question: what is the most Pythonic way to insert into MongoDB only those fields which are not empty? Can this be done with MongoDB logic alone, or can it only be performed with some Python checks of the input data?
The trick is to pop() the items from the record before loading. Try this as an example:
from pymongo import MongoClient

mongoClient = MongoClient('localhost:27017').db

# Setup some test data
for i in range(3):
    mongoClient["foo"].insert({"id": i, "name": "James", "surname1": "Adams"})

# Simulate your incoming record
record = {"id": 1, "name": "James", "surname1": "Adams", "surname2": "", "synced": 1}

# Remove any empty items
for k, v in list(record.items()):
    if v == '' or v is None:
        record.pop(k)

mongoClient["foo"].update({"id": record['id']}, {"$set": record})

for item in mongoClient["foo"].find({}, {"_id": 0}):
    print(item)
Result:
{'id': 0, 'name': 'James', 'surname1': 'Adams'}
{'id': 1, 'name': 'James', 'surname1': 'Adams', 'synced': 1}
{'id': 2, 'name': 'James', 'surname1': 'Adams'}
Wholesale document updates (where the entire database document is replaced) are expensive. If you have a way to detect the changes to a document within your application, you can provide the database with targeted updates.

This question was originally about how to update documents in the database so that null or empty fields are removed, to reduce the overall size of the document. Presumably this is to increase performance and reduce disk space consumption. Both admirable objectives. But here's the thing: not only do we want well-structured documents on disk, we also must consider how we manage this data. Performing wholesale updates and replacing the entire document certainly makes development of the client easier, but it places a burden on the database engine. MongoDB supports documents up to 16MB, so the larger the document, the greater the impact of a wholesale document replacement.

Obviously targeted updates are preferable, but how do we do that? How do we know which fields differ between the incoming document and what is in the database? This is more of a philosophical question, really. We could query the database and compare with the incoming document, but then again that is an impact on the database. But consider this: for any document we want to update, we must have originally sourced it from the database, right? Otherwise it's a new document (an insert, not an update). If we did source it from the database, then we have the ability to track what changes we made to the document. If that is true, then our incoming data could be nothing more than a set of changes, rather than the whole document itself. Again, this is philosophy, not science.

So consider the scenario where the incoming data is a set of changes, not the actual document. We could construct instructions to unset fields where they have been removed, and to update targeted fields where they are modified but remain. The original question was about how to create a 'pythonic' solution to strip blank fields, but that question is really just the surface of a more in-depth problem of tracking change.
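To make that last point concrete, here is a minimal sketch of what a targeted update could look like if the application tracked its own changes. The changed/removed structures and their contents are hypothetical (not from the original question); the mongoClient["foo"] handle follows the code above.

# Hypothetical change-tracking output: fields modified and fields removed
# since the document was last read from the database.
changed = {"synced": 1}
removed = ["surname2"]

update = {}
if changed:
    update["$set"] = changed
if removed:
    # Unset only the fields that were actually dropped, instead of writing
    # empty strings or replacing the whole document.
    update["$unset"] = {field: "" for field in removed}

if update:
    mongoClient["foo"].update({"id": 1}, update)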

MongoDB indexing on variable query

I have a collection of user generated posts. They contain the following fields
_id: String
groupId: String // id of the group this was posted in
authorId: String
tagIds: [String]
latestActivity: Date // updated whenever someone comments on this post
createdAt: Date
numberOfVotes: Number
...some more...
My queries always look something like this...
Posts.find({
groupId: {$in: [...]},
authorId: 'xyz', // only SOMETIMES included
tagIds: {$in: [...]}, // only SOMETIMES included
}, {
sort: {latestActivity/createdAt/numberOfVotes: +1/-1, _id: -1}
});
So I'm always querying on the groupId, but only sometimes adding authorId or tagIds. I'm also switching out the field on which this is sorted. What would my best indexing strategy look like?
From what I've read so far here on SO, I would probably create multiple compound indices and have them always start with {groupId: 1, _id: -1} - because they are included in every query, they are good prefix candidates.
Now, I'm guessing that creating a new index for every possible combination wouldn't be a good idea memory wise. Therefore, should I just keep it like that and only index groupId and _id?
Thanks.
You are going in the right direction. With compound indexes, you want the most selective fields on the left and the ranges on the right. {groupId: 1, _id: -1} satisfies this.
It's also important to remember that compound indexes are used when the keys appear in the query from left to right. So one compound index can cover many common scenarios. If, for example, your index was {groupId: 1, authorId: 1, tagIds: 1} and your query was Posts.find({groupId: {$in: [...]}, authorId: 'xyz'}), that index would get used (even though tagIds was absent). Also, Posts.find({groupId: {$in: [...]}, tagIds: {$in: [...]}}) would use this index (the first and third fields of the index are used, so if Mongo doesn't find a more specific index, this one would be used). However, Posts.find({authorId: 'xyz', tagIds: {$in: [...]}}) would not use the index, because the first field in the index is missing.
Given all of that, I would suggest starting with {groupId: 1, authorId: 1, tagIds: 1, _id: -1}. groupId is the only non-optional field in your queries, so it goes on the left, before the optional ones. It looks like authorId is more selective than tagIds, so it should go on the left after groupId. You're sorting by _id, so that should go on the right. Be sure to Analyze Query Performance on the different ways you query the data, and make sure they are all choosing this index (otherwise you'll need to make more tweaks, or possibly a second compound index). You could then make other indexes and force the queries to use them to do some A/B testing on performance.
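A minimal PyMongo sketch of creating the suggested index (the connection details and the posts collection name are placeholders; the question itself looks like Meteor/JS, but the index definition is the same idea):

from pymongo import MongoClient, ASCENDING, DESCENDING

db = MongoClient('localhost:27017').mydb  # placeholder connection/database names

# Required equality field first, then the optional equality fields,
# then the sort/tiebreaker field.
db.posts.create_index([
    ("groupId", ASCENDING),
    ("authorId", ASCENDING),
    ("tagIds", ASCENDING),
    ("_id", DESCENDING),
])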

Iterating over distinct items in one field in MongoDB

I have a very large collection (~7M items) in MongoDB, primarily consisting of documents with three fields.
I'd like to be able to iterate over all the unique values for one of the fields, in an expedient manner.
Currently, I'm querying for just that field, and then processing the returned results by iterating on the cursor for uniqueness. This works, but it's rather slow, and I suspect there must be a better way.
I know mongo has the db.collection.distinct() function, but this is limited by the maximum BSON size (16 MB), which my dataset exceeds.
Is there any way to iterate over something similar to the db.collection.distinct(), but using a cursor or some other method, so the record-size limit isn't as much of an issue?
I think maybe something like the map/reduce functionality would possibly be suited for this kind of thing, but I don't really understand the map-reduce paradigm in the first place, so I have no idea what I'm doing. The project I'm working on is partially to learn about working with different database tools, so I'm rather inexperienced.
I'm using PyMongo if it's relevant (I don't think it is). This should be mostly dependent on MongoDB alone.
Example:
For this dataset:
{"basePath" : "foo", "internalPath" : "Neque", "itemhash": "49f4c6804be2523e2a5e74b1ffbf7e05"}
{"basePath" : "foo", "internalPath" : "porro", "itemhash": "ffc8fd5ef8a4515a0b743d5f52b444bf"}
{"basePath" : "bar", "internalPath" : "quisquam", "itemhash": "cf34a8047defea9a51b4a75e9c28f9e7"}
{"basePath" : "baz", "internalPath" : "est", "itemhash": "c07bc6f51234205efcdeedb7153fdb04"}
{"basePath" : "foo", "internalPath" : "qui", "itemhash": "5aa8cfe2f0fe08ee8b796e70662bfb42"}
What I'd like to do is iterate over just the basePath field. For the above dataset, this means I'd iterate over foo, bar, and baz just once each.
I'm not sure if it's relevant, but the DB I have is structured so that while each field is not unique, the aggregate of all three is unique (this is enforced with an index).
The query and filter operation I'm currently using (note: I'm restricting the query to a subset of the items to reduce processing time):
self.log.info("Running path query")
itemCursor = self.dbInt.coll.find({"basePath": pathRE}, fields={'_id': False, 'internalPath': False, 'itemhash': False}, exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
for item in itemCursor:
    # print item
    items.add(item["basePath"])
self.log.info("total unique items = %s", len(items))
Running the same query with self.dbInt.coll.distinct("basePath") results in OperationFailure: command SON([('distinct', u'deduper_collection'), ('key', 'basePath')]) failed: exception: distinct too big, 16mb cap
Ok, here is the solution I wound up using. I'd add it as an answer, but I don't want to detract from the actual answers that got me here.
reStr = "^%s" % fqPathBase
pathRE = re.compile(reStr)

self.log.info("Running path query")
pipeline = [
    # Match the relevant subset of documents
    {"$match": {"basePath": pathRE}},
    # Group the keys
    {"$group": {"_id": "$basePath"}},
    # Output to a collection "tmp_unique_coll"
    {"$out": "tmp_unique_coll"}
]
itemCursor = self.dbInt.coll.aggregate(pipeline, allowDiskUse=True)
itemCursor = self.dbInt.db.tmp_unique_coll.find(exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
retItems = 0
for item in itemCursor:
    retItems += 1
    items.add(item["_id"])
self.log.info("Received items = %d", retItems)
self.log.info("total unique items = %s", len(items))
General performance compared to my previous solution is about 2X in terms of wall-clock time. On a query that returns 834273 items, with 11467 uniques:
Original method (retrieve, stuff into a Python set to enforce uniqueness):
real 0m22.538s
user 0m17.136s
sys 0m0.324s
Aggregate pipeline method :
real 0m9.881s
user 0m0.548s
sys 0m0.096s
So while the overall execution time is only ~2X better, the aggregation pipeline is massively more performant in terms of actual CPU time.
Update:
I revisited this project recently, and rewrote the DB layer to use a SQL database, and everything was much easier. A complex processing pipeline is now a simple SELECT DISTINCT(colName) WHERE xxx operation.
Realistically, MongoDB and NoSQL databases in general are very much the wrong type of database for what I'm trying to do here.
From the discussion points so far I'm going to take a stab at this. And I'm also noting that as of writing, the 2.6 release for MongoDB should be just around the corner, good weather permitting, so I am going to make some references there.
Oh and the FYI that didn't come up in chat, .distinct() is an entirely different animal that pre-dates the methods used in the responses here, and as such is subject to many limitations.
And this solution is, in the end, a solution for 2.6 and up, or any current dev release over 2.5.3.
The alternative for now is to use mapReduce, because its only restriction is the output size.
Without going into the inner workings of distinct, I'm going to go on the presumption that aggregate is doing this more efficiently [and even more so in the upcoming release].
db.collection.aggregate([
    // Group the key and increment the count per match
    {$group: { _id: "$basePath", count: {$sum: 1} }},

    // Hey you can even sort it without breaking things
    {$sort: { count: 1 }},

    // Output to a collection "output"
    {$out: "output"}
])
So we are using the $out pipeline stage to get the final result, which is over 16MB, into a collection of its own. There you can do what you want with it.
As 2.6 is "just around the corner" there is one more tweak that can be added.
Use allowDiskUse from the runCommand form, where each stage can use disk and not be subject to memory restrictions.
The main point here is that this is nearly live for production, and the performance will be better than the same operation in mapReduce. So go ahead and play. Install 2.5.5 for your own use now.
A mapReduce, in the current version of Mongo, would avoid the problem of the results exceeding 16MB.
map = function() {
    if (this['basePath']) {
        emit(this['basePath'], 1);
    }
    // if basePath always exists you can just call the emit:
    // emit(this.basePath);
};

reduce = function(key, values) {
    return Array.sum(values);
};
For each document the basePath is emitted with a single value representing the count of that value. The reduce simply creates the sum of all the values. The resulting collection would have all unique values for basePath along with the total number of occurrences.
And, as you'll need to store the results to prevent an error, use the out option, which specifies a destination collection:
db.yourCollectionName.mapReduce(
    map,
    reduce,
    { out: "distinctMR" }
)
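For reference, since the question mentions PyMongo: the same mapReduce can be driven from Python. A sketch, assuming a PyMongo version that still exposes Collection.map_reduce (it was removed in PyMongo 4) and placeholder connection/collection names:

from bson.code import Code
from pymongo import MongoClient

coll = MongoClient('localhost:27017').db.yourCollectionName  # placeholders

mapper = Code("function () { if (this.basePath) { emit(this.basePath, 1); } }")
reducer = Code("function (key, values) { return Array.sum(values); }")

# Writes the results to the "distinctMR" collection and returns a handle to it.
out_coll = coll.map_reduce(mapper, reducer, "distinctMR")
for doc in out_coll.find():
    print(doc["_id"], doc["value"])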
@Neil Lunn's answer could be simplified:
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}])
$project filters fields for you. In particular, '_id': 0 filters out the _id field.
Result still too large? Batch it with $limit and $skip:
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}, {'$limit': X}, {'$skip': Y}])
I think the most scalable solution is to perform a query for each unique value. The queries must be executed one after the other, and each query will give you the "next" unique value based on the previous query result. The idea is that the query will return you one single document, that will contain the unique value that you are looking for. If you use the proper projection, mongo will just use the index loaded into memory without having to read from disk.
You can define this strategy using the $gt operator in Mongo, but you must take into account values like null or empty strings, and potentially discard them using the $ne or $nin operator. You can also extend this strategy using multiple keys, using operators like $gte for one key and $gt for the other.
This strategy should give you the distinct values of a string field in alphabetical order, or distinct numerical values sorted ascendingly.
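A minimal PyMongo sketch of that walk, assuming an index on basePath and ignoring the null/empty-string handling mentioned above (connection and collection names are placeholders):

from pymongo import ASCENDING, MongoClient

coll = MongoClient('localhost:27017').db.coll  # placeholders

last = None
while True:
    # Ask for the smallest value strictly greater than the last one seen;
    # with an index on basePath this is one index-driven lookup per value.
    query = {} if last is None else {"basePath": {"$gt": last}}
    doc = coll.find_one(query, {"basePath": 1, "_id": 0}, sort=[("basePath", ASCENDING)])
    if doc is None:
        break
    last = doc["basePath"]
    print(last)

The number of queries equals the number of distinct values, so this scales with the cardinality of the field rather than the size of the collection.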

Mongodb update limited number of documents

I have a collection with 100 million documents. I want to safely update a number of the documents (by safely I mean update a document only if it hasn't already been updated). Is there an efficient way to do it in Mongo?
I was planning to use the $isolated operator with a limit clause but it appears mongo doesn't support limiting on updates.
This seems simple but I'm stuck. Any help would be appreciated.
Per Sammaye, it doesn't look like there is a "proper" way to do this.
My workaround was to create a sequence as outlined on the mongo site and simply add a 'seq' field to every record in my collection. Now I have a unique field which is reliably sortable to update on.
Reliably sortable is important here. I was going to just sort on the auto-generated _id but I quickly realized that natural order is NOT the same as ascending order for ObjectId's (from this page it looks like the string value takes precedence over the object value which matches the behavior I observed in testing). Also, it is entirely possible for a record to be relocated on disk which makes the natural order unreliable for sorting.
So now I can query for the record with the smallest 'seq' which has NOT already been updated to get an inclusive starting point. Next I query for records with 'seq' greater than my starting point and skip (it is important to skip since the 'seq' may be sparse if you remove documents, etc...) the number of records I want to update. Put a limit of 1 on that query and you've got a non-inclusive endpoint. Now I can issue an update with a query of 'updated' = 0, 'seq' >= my starting point and < my endpoint. Assuming no other thread has beat me to the punch the update should give me what I want.
Here are the steps again:
create an auto-increment sequence using findAndModify
add a field to your collection which uses the auto-increment sequence
query to find a suitable starting point: db.xx.find({ updated: 0 }).sort({ seq: 1 }).limit(1)
query to find a suitable endpoint: db.xx.find({ seq: { $gt: startSeq }}).sort({ seq: 1 }).skip(updateCount).limit(1)
update the collection using the starting and ending points: db.xx.update({ updated: 0, seq: { $gte: startSeq, $lt: endSeq }, $isolated: 1 }, { $set: { updated: 1 } }, { multi: true })
Pretty painful but it gets the job done.
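A minimal PyMongo sketch of the first two steps, assuming a counters collection for the sequence (the db.xx name mirrors the snippets above; everything else is a placeholder):

from pymongo import MongoClient, ReturnDocument

db = MongoClient('localhost:27017').mydb  # placeholder connection/database names

def next_seq(name):
    # findAndModify-style atomic counter increment.
    counter = db.counters.find_one_and_update(
        {"_id": name},
        {"$inc": {"seq": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    return counter["seq"]

# Tag every record that doesn't yet have a sequence number.
for record in db.xx.find({"seq": {"$exists": False}}, {"_id": 1}):
    db.xx.update_one({"_id": record["_id"]}, {"$set": {"seq": next_seq("xx")}})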

Retrieving a sequential documents based on _id

I've got a scenario where documents are indexed in elastic search, and I need to retrieve the matched document in mongo along with the preceding and following documents as sorted by a timestamp. The idea being to retrieve the context of the document along with the original document.
I am able to do this successfully now if I use a sequential _id. As an example, using the following data:
[
{_id: 1, value: 'Example One' },
{_id: 2, value: 'Example Two' },
{_id: 3, value: 'Example Three' },
{_id: 4, value: 'Example Four' },
{_id: 5, value: 'Example Five' },
{_id: 6, value: 'Example Six' },
...
]
If I search for 'Four' in ES, I get back the document _id of 4; since it's sequential, I can create a Mongo query to pull a range between id - 2 and id + 2, in this case 2 - 6. This works well, as long as I never delete documents. When I delete a document, I have to re-index the entire series to eliminate the gap. I'm looking for a way of achieving the same results while also being able to delete documents without having to update all of the remaining documents.
I'm open to using other technologies to achieve this, I am not necessarily tied to mongodb.
I can get the desired results using something like the following:
collection.find( {_id: { $gte: matchedId } } ).limit(3);
collection.find( {_id: { $lt: matchedId } } ).sort({$natural: -1}).limit(2);
Not quite as nice as using an explicit range, but no need to recalculate anything on document deletion.
Yes, I am aware of the limitations of natural order, and it is not a problem for my particular use case.
This problem has nothing to do with MongoDB in particular and would be no different with another database (e.g. an RDBMS). You will have to query for document ids smaller/larger than the current id and take the first two matches on each side. Yes, this means that you need to perform multiple queries. The only other option is to implement a linked list on top of MongoDB, where each document stores pointers to its right and left neighbour nodes. And yes, in case of a deletion you need to adjust the pointers (basic data-structure algorithms). The downside is that you will need multiple operations in order to perform the changes, and since MongoDB is not transactional you may run into inconsistent previous/next pointers... that's why MongoDB completely falls down here.
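A minimal PyMongo sketch of that neighbour lookup, using the timestamp the question mentions as the sort key rather than a sequential _id (the connection/collection names and the timestamp field name are placeholders):

from pymongo import ASCENDING, DESCENDING, MongoClient

coll = MongoClient('localhost:27017').db.docs  # placeholders

def with_context(matched_id, before=2, after=2):
    doc = coll.find_one({"_id": matched_id})
    # Two range queries around the matched document's sort key.
    prev_docs = list(coll.find({"timestamp": {"$lt": doc["timestamp"]}})
                         .sort("timestamp", DESCENDING).limit(before))
    next_docs = list(coll.find({"timestamp": {"$gt": doc["timestamp"]}})
                         .sort("timestamp", ASCENDING).limit(after))
    return list(reversed(prev_docs)) + [doc] + next_docs

Ties on the timestamp would need a secondary sort key, and deleting a document requires no re-indexing at all, which was the original requirement.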