I want to call a custom Python function on some existing attribute of every document in the entire collection and store the result as a new key-value pair in that (same) document. May I know if there's any way to do that (since each call is independent of the others)?
I noticed cursor.forEach, but can't it be done efficiently using just Python?
A simple example would be to split the string in text and store the number of words as a new attribute.
def split_count(text):
    # Some complex preprocessing...
    return len(text.split())

# Need something like this...
db.collection.update_many({}, {'$set': {'split': split_count('$text')}}, upsert=True)
But it seems like setting a new attribute in a document based on the value of another attribute in the same document is not possible this way yet. This post is old but the issues seem to be still open.
I found a way to call any custom python function on a collection using parallel_scan in PyMongo.
import threading

# 'db' is assumed to be an existing pymongo Database handle.

def process_text(cursor):
    for row in cursor.batch_size(200):
        # Any complex preprocessing here...
        split_text = row['text'].split()
        db.collection.update_one({'_id': row['_id']},
                                 {'$set': {'split_text': split_text,
                                           'num_words': len(split_text)}},
                                 upsert=True)

def preprocess(num_threads=4):
    # Get up to a maximum of 'num_threads' cursors.
    cursors = db.collection.parallel_scan(num_threads)
    threads = [threading.Thread(target=process_text, args=(cursor,))
               for cursor in cursors]

    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
This is not really faster than cursor.forEach (but not that slow either), but it lets me execute arbitrarily complex Python code and save the results from within Python itself.
Also, if I have an array of ints in one of the attributes, cursor.forEach converts them to floats, which I don't want, so I preferred this way.
But I would be glad to know if there are any better ways than this :)
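For what it's worth, on MongoDB 4.2+ an update can also be given as an aggregation pipeline, so update_many can derive one field from another entirely on the server when the preprocessing can be expressed with aggregation operators. A minimal sketch of that idea for the word-count example (it assumes text is always a string; anything that genuinely needs arbitrary Python still has to round-trip through the client as above):
# Requires MongoDB 4.2+ (pipeline-style updates); only a sketch.
db.collection.update_many(
    {'text': {'$type': 'string'}},   # skip documents without a usable text field
    [{'$set': {
        'split_text': {'$split': ['$text', ' ']},
        'num_words': {'$size': {'$split': ['$text', ' ']}},
    }}]
)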
It is quite unlikely that it will ever be efficient to do this kind of thing in Python, because every document would have to make a round trip and go through the Python function on the client machine.
In your example code, you are passing the result of a function call to a MongoDB update query, which won't work: split_count('$text') is evaluated on the client before the query is ever sent, and you can't run Python code inside MongoDB queries on the db server.
As the answer to your linked question suggests, this type of action has to be performed in the mongo shell, e.g.:
db.collection.find().snapshot().forEach(
    function (elem) {
        var splitLength = elem.text.split(" ").length;
        db.collection.update(
            { _id: elem._id },
            { $set: { split: splitLength } }
        );
    }
);
I am completely mystified (and supremely frustrated). How do I create this call using the mongoc library?
I have the following doc structure in the collection:
{
    _id: myOID,
    subscriptions: {
        newProducts: true,
        newBlogPosts: true,
        pressReleases: true
    }
}
I want to remove one of the subscriptions, for example, the user no longer wants to receive press releases from me.
This works in the mongo shell; now I need to do it in C code:
updateOne({_id: myOID}, [{'$unset': 'subscriptions.pressReleases'}], {})
Note how the update parameter in the Mongo shell is an anonymous array. I need to do that for the bson passed in as the update parameter in the mongoc_collection_update_one() API call.
The C code for updateOne is:
mongo_status = mongoc_collection_update_one (mongo_collection,
                                             mongo_query,
                                             mongo_update,
                                             NULL,  /* No opts to pass in */
                                             NULL,  /* No reply wanted */
                                             &mongo_error);
Also note that in the aggregate() API, this is done with
{"pipeline" : [{'$unset': 'elists.lunch' }] }
Neither the updateOne() shell function nor the mongoc_collection_update_one() API call accepts that; they want just the array.
How do I create the bson to use as the second parameter for mongoc_collection_update_one() API call?
Joe's answer works and I am able to accomplish what I need to do.
The $unset update operator takes an object, just like $set.
updateOne({_id: myOID}, {'$unset': {'subscriptions.pressReleases': true}})
or, equivalently,
updateOne({_id: myOID}, {'$unset': {'subscriptions.pressReleases': {'$exists': true}}})
since the value given to $unset is ignored, the field is removed no matter what its current value is.
Doing it this way does not require an anonymous array (which I still don't know how to create).
I have a ton of documents (around 10 million) and I need to change the type of one of their fields. The usual forEach function (just looping through every value) seems to take forever and is clearly not viable in the timeframe I have (it basically took all night for one out of four updates).
I've heard that bulk writes may be able to do it, but I'm getting mixed messages. One answer on this site, for example, says there's no built-in function for it (you would have to use some workaround), while others say it can be done with updates in Python, using pymongo.
I was wondering if there is a quicker way to mass-change field types (string -> double, string -> int) using Python? I can also work from the console, but I find even fewer solutions there.
Thanks
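For the pymongo route, the usual pattern is to batch the conversions into bulk_write calls so that each round trip carries many updates. This is only a rough sketch; the database, collection, and field names (price) and the string-to-double conversion are placeholder assumptions:
from pymongo import MongoClient, UpdateOne

client = MongoClient()
coll = client.mydb.your_collection          # placeholder names

ops = []
# Project only the field being converted to keep the cursor light.
for doc in coll.find({'price': {'$type': 'string'}}, {'price': 1}):
    ops.append(UpdateOne({'_id': doc['_id']},
                         {'$set': {'price': float(doc['price'])}}))
    if len(ops) == 1000:                    # flush in batches of 1000
        coll.bulk_write(ops, ordered=False)
        ops = []
if ops:
    coll.bulk_write(ops, ordered=False)
This still pulls each value to the client; a server-side conversion like the aggregation below avoids the round trips entirely, so it is usually the faster option at this scale.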
You can try using an aggregation query in the mongo shell.
Something like:
db.your_collection.aggregate([
    {
        $addFields: {
            field1: {
                $convert: {
                    input: "$field1",
                    to: "double"   // or "int", "long", ...
                }
            }
        }
    },
    { $out: "your_collection" }
])
More info here https://docs.mongodb.com/manual/reference/operator/aggregation/convert/
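If you'd rather drive the same conversion from Python, the pipeline can be sent through pymongo essentially unchanged; a sketch with the same placeholder collection and field names:
# Runs entirely on the server. Note that $out atomically replaces
# your_collection with the aggregation output, so be sure that is intended.
db.your_collection.aggregate([
    {'$addFields': {
        'field1': {'$convert': {'input': '$field1', 'to': 'double',
                                'onError': None}}   # optional: don't abort on bad values
    }},
    {'$out': 'your_collection'}
])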
I'm using mongo's findOneAndReplace() with upsert = true and returnNewDocument = true
basically as a way to avoid inserting duplicates. But I want to get the _id of the newly inserted document (or the old existing document) to be passed to a background processing task.
BUT I also want to log whether the document was added as new or whether a replacement took place.
I can't see any way to use findOneAndReplace() with these parameters and answer that question.
The only thing I can think of is to find and then insert in two different requests, which seems a bit counter-productive.
P.S. I'm actually using pymongo's find_one_and_replace(), but it seems identical to the JS mongo function.
EDIT: edited for clarification.
Is it not possible to use the replace_one function? In Java I am able to use replaceOne, which returns an UpdateResult; that has methods for finding out whether the document was updated or not. I see replace_one in pymongo and it should behave the same. See the PyMongo docs and look for replace_one.
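For reference, a sketch of how that check could look in pymongo; collection, the filter, and new_document are placeholders:
# Sketch only.
result = collection.replace_one({'user_key': 'value'}, new_document, upsert=True)

if result.upserted_id is not None:
    # Nothing matched, so a new document was inserted; its _id comes back directly.
    doc_id, was_new = result.upserted_id, True
else:
    # An existing document matched and was replaced (result.matched_count == 1),
    # but its _id has to be looked up separately (or use find_one_and_replace).
    doc_id = collection.find_one({'user_key': 'value'}, {'_id': 1})['_id']
    was_new = False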
The way I'm going to implement it for now (in python):
import pymongo

def find_one_and_replace_log(collection, find_query,
                             document_data,
                             log={}):
    ''' Behaves like find_one_and_replace(upsert=True,
        return_document=pymongo.ReturnDocument.AFTER)
        but also records in log whether the document was new.
    '''
    is_new = False
    document = collection.find_one(find_query)
    if not document:
        # Document didn't exist: log as NEW.
        is_new = True

    new_or_replaced_document = collection.find_one_and_replace(
        find_query,
        document_data,
        upsert=True,
        return_document=pymongo.ReturnDocument.AFTER
    )

    log['new_document'] = is_new
    return new_or_replaced_document
I know this has been asked before, but I have yet to find a solution that works efficiently. I am working with the MongoDB C# driver, though this is more of a general question about MongoDB operations.
I have a document structure that looks something like this:
field1: value1
field2: value2
...
users: [ {...user 1 subdocument...}, {...user 2 subdocument...}, ... ]
Some facts:
Each user subdocument includes further sub-arrays & subdocuments (so they're fairly complex).
The average users array only contains about 5 elements, but in the worst case can surpass 100.
Several thousand update operations on multiple users may be conducted per day in this system, each on one document at a time. Larger arrays will receive more frequent updates due to their data size.
I am trying to figure out how to do this efficiently. From what I've heard, you cannot directly set several array elements to new values all at once, so I had to try something else.
I tried using the $pullAll / $addToSet + $each operators to remove the old array contents and replace them with a modified version. I am aware that $pullAll can also remove only the elements that I need, but I would like to preserve the order of elements.
The C# code:
try
{
    WriteConcernResult wcr = collection.Update(query,
        Update.Combine(Update.PullAll("users"),
                       Update.AddToSetEach("users", newUsers.ToArray())));
}
catch (WriteConcernException wce)
{
    return wce.Message;
}
In this case newUsers is a List<BsonValue> converted to an array. However, I am getting the following exception message:
Cannot update 'users' and 'users' at the same time
By the looks of it, I can't have two update statements in use on the same field in the same write operation.
I also tried Update.Set("users", newUsers.ToArray()), but apparently the Set statement doesn't work with arrays, just basic values:
Argument 2: cannot convert from 'MongoDB.Bson.BsonValue[]' to 'MongoDB.Bson.BsonValue'
So then I tried converting that array to a BsonDocument:
Update.Set("users", newUsers.ToArray().ToBsonDocument());
And got this:
An Array value cannot be written to the root level of a BSON document.
I could try replacing the whole document, but that seems like overkill and definitely not very efficient.
So the only thing I can think of now is to run two separate write operations: one to remove the unwanted old users and another to replace them with their newer versions:
WriteConcernResult wcr = collection.Update(query, Update.PullAll("users"));
WriteConcernResult wcr = collection.Update(query, Update.AddToSetEach("users", newUsers.ToArray()));
Is this my best option? Or is there another, better way of doing this?
Your code should work with a minor change:
Update.Set("users", new BsonArray(newUsers));
BsonArray is a BsonValue, whereas an array of documents is not, and we don't implicitly convert arrays like we do other primitive values.
This extension method solved my problem:
public static class MongoExtension
{
    public static BsonArray ToBsonArray(this IEnumerable list)
    {
        var array = new BsonArray();
        foreach (var item in list)
            array.Add((BsonValue) item);
        return array;
    }
}
My application has the following stack:
Sinatra on Ruby -> MongoMapper -> MongoDB
The application puts several entries in the database. In order to crosslink to other pages, I've added some sort of syntax. e.g.:
Coffee is a black, caffeinated liquid made from beans. {Tea} is made from leaves. Both drinks are sometimes enjoyed with {milk}
In this example {Tea} will link to another DB entry about tea.
I'm trying to query my MongoDB for all 'linked terms'. Usually in Ruby I would do something like this: /{([a-zA-Z0-9]+)}/, where the () will return the matched string. In Mongo, however, I get the whole record.
How can I get mongo to return me only the matched parts of the record I'm looking for. So for the example above it would return:
["Tea", "milk"]
I'm trying to avoid pulling the entire record into Ruby and processing it there.
I don't know if I understood correctly, but you could try something like this:
db.yourColl.aggregate([
    {
        $match: { "yourKey": { $regex: '[a-zA-Z0-9]', "$options": "i" } }
    },
    {
        $group: {
            _id: null,
            tot: { $push: "$yourKey" }
        }
    }
])
If you don't want duplicates in tot, use $addToSet.
The way I solved this problem was with the string aggregation operators: $indexOfCP to find the starting and ending code-point indices, and $substrCP to extract the string I wanted. Since you can have multiple of these {} terms, you need one projection to identify the CP indices in one shot and another projection to extract the words you need. Hope this helps.
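On MongoDB 4.2+ the same extraction is simpler with $regexFindAll, which exposes each match's capture groups directly. A sketch in pymongo (the field name text is an assumption), returning only the captured terms:
# Requires MongoDB 4.2+. $regexFindAll returns {match, idx, captures} per hit,
# so the $map keeps just the word captured between the braces.
pipeline = [
    {'$project': {
        'linked_terms': {
            '$map': {
                'input': {'$regexFindAll': {'input': '$text',
                                            'regex': r'\{([a-zA-Z0-9]+)\}'}},
                'as': 'm',
                'in': {'$arrayElemAt': ['$$m.captures', 0]},
            }
        }
    }}
]
for doc in db.yourColl.aggregate(pipeline):
    print(doc['linked_terms'])   # e.g. ['Tea', 'milk']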