MongoDB: Bulk changing all field types in python

I have a large number of documents (around 10 million) and I need to change the type of some of their fields. The usual forEach approach (looping through every document) takes far too long for the timeframe I have: one of the four updates I need took all night.
I've heard that bulk writes may be able to do it, but I'm getting mixed messages. One confusing answer on this site says there is no built-in function for it (you would need a workaround); others say it can be done with updates in Python, using pymongo.
Is there a quicker way to mass-change field types (string -> double, string -> int) using Python? I can also work from the console, but I find even fewer solutions there.
Thanks

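You can try an aggregation query in the mongo shell. $convert (available since MongoDB 4.0) changes the type on the server, and $out writes the result back over the collection.
Something like this, with to set to the target type ("double" or "int" in your case):
db.your_collection.aggregate([
  {
    $addFields: {
      field1: {
        $convert: {
          input: "$field1",
          to: "double"
        }
      }
    }
  },
  { $out: "your_collection" }
])
More info here: https://docs.mongodb.com/manual/reference/operator/aggregation/convert/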
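Since the question asks for Python, here is a minimal sketch of the same server-side conversion driven from pymongo. It assumes MongoDB 4.2+ and PyMongo 3.9+, where update_many accepts an aggregation pipeline as the update. The function name convert_string_field and the field name "price" are made up for illustration; collection is expected to be a pymongo Collection.

```python
def convert_string_field(collection, field, to_type="double"):
    """Convert string values of `field` to `to_type` on the server.

    Only documents whose value is currently a string are touched.
    Values that cannot be parsed are left unchanged via onError.
    """
    result = collection.update_many(
        {field: {"$type": "string"}},
        [
            {
                "$set": {
                    field: {
                        "$convert": {
                            "input": f"${field}",
                            "to": to_type,
                            "onError": f"${field}",
                        }
                    }
                }
            }
        ],
    )
    return result.modified_count


# Usage (assumes a running server and an existing collection):
# from pymongo import MongoClient
# coll = MongoClient()["your_db"]["your_collection"]
# print(convert_string_field(coll, "price", "double"))
```

Because the conversion runs inside the server, no document has to travel to the client, which is usually what makes the forEach approach slow.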

Related

In this MongoDB code I have to find out whether the drink that was ordered is available in the restaurant's beverages. Can I solve this without let? Can I use $drink directly in $match?

db.orders.aggregate([
  {
    $lookup: {
      from: 'restaurants',
      localField: 'restaurant_name',
      foreignField: 'name',
      let: { drink: '$drink' },
      pipeline: [{ $match: { $expr: { $in: ['$$drink', '$beverages'] } } }],
      as: 'matches'
    }
  }
])

Totally agree with what @rickhg12hs suggested, both directly and in general. A more complete description of what you've tried, what was unexpected, and what you are trying to achieve helps us provide better responses.
To your broad question: yes, there are alternative ways to structure the aggregation to achieve similar results. I am assuming that this playground demonstrates your current approach correctly. If so, one other way to do this is to move the let/pipeline logic into a subsequent $addFields stage that uses $filter on the matches array that $lookup generates:
db.orders.aggregate([
  {
    $lookup: {
      from: "restaurants",
      localField: "restaurant_name",
      foreignField: "name",
      as: "matches"
    }
  },
  {
    $addFields: {
      matches: {
        $filter: {
          input: "$matches",
          cond: {
            $in: [
              "$drink",
              "$$this.beverages"
            ]
          }
        }
      }
    }
  }
])
The above is demonstrated with this playground example that generates the same output as the earlier one.
But this raises the question: why are you trying to solve this without let? Just because you can do something doesn't mean that you should. Use cases like this seem to be exactly what the let/pipeline syntax was added for, and it is more likely to take advantage of indexes effectively than the alternative offered above.
So are you asking out of curiosity if there are other ways to do this, or is there a specific problem you're trying to address? For example, are you trying to improve performance, get different results, or something else? That reasoning would really be why you might consider changing approaches, otherwise the sample pipeline you've provided seems perfectly reasonable to me.
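To make the semantics of the $lookup plus $filter alternative concrete, here is a plain-Python emulation of the two stages. It does not talk to MongoDB; the function name and the sample documents are made up for illustration and only mirror the field names used in the pipeline.

```python
def lookup_then_filter(orders, restaurants):
    """Emulate $lookup on restaurant_name/name followed by $filter on beverages."""
    results = []
    for order in orders:
        # $lookup: every restaurant whose name equals the order's restaurant_name
        matches = [r for r in restaurants if r["name"] == order["restaurant_name"]]
        # $filter: keep only the matches whose beverages array contains the drink
        matches = [r for r in matches if order["drink"] in r["beverages"]]
        results.append({**order, "matches": matches})
    return results


# Hypothetical sample data, not taken from the question.
orders = [
    {"restaurant_name": "Cafe A", "drink": "tea"},
    {"restaurant_name": "Cafe A", "drink": "cola"},
]
restaurants = [{"name": "Cafe A", "beverages": ["tea", "coffee"]}]
result = lookup_then_filter(orders, restaurants)
```

The first order keeps its restaurant in matches, while the second ends up with an empty matches array, which is the same shape the let/pipeline version produces.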

MongoDB aggregation crashes when cursor or explain is used

Since version 3.6 MongoDB requires the use of cursor or explain in aggregate queries. It's a breaking change so I have to modify my earlier code.
But when I add cursor or explain to my query, the request simply enters an endless loop and MongoDB never responds. It doesn't even seem to time out.
This simple aggregation just hangs the code and never responds:
db.collection('users').aggregate([{ $match: { username: 'admin' } }],
  { cursor: {} },
  (err, docs) => {
    console.log('Aggregation completed.');
  });
I can replace { cursor: {} } with { explain: true } and the result is the same. It works perfectly under older MongoDB versions without this one parameter.
Without cursor or explain I get this error message:
The 'cursor' option is required, except for aggregate with the explain argument
I'm not the only one who ran into this:
https://github.com/nosqlclient/nosqlclient/issues/419

OK, this was a little tricky, but finally it works. Looks like there are some major breaking changes in MongoDB's Node.js driver which nobody bothered to tell me.
1. The Node.js MongoDB driver has to be upgraded. My current version is 3.0.7.
2. The way the driver connects has changed, breaking any old code. The connection now returns a client object, not merely the db, so it has to be handled differently. There is a SO answer explaining it perfectly:
db.collection is not a function when using MongoClient v3.0
3. Aggregations now return an AggregationCursor object, not an array of data. Instead of a callback now you have to iterate through it.
var cursor = collection.aggregate([ ... ],
  { cursor: { batchSize: 1 } });
cursor.each((err, docs) => {
  ...
});
So it seems you have to rewrite ALL your db operations after upgrading to MongoDB 3.6. Yay, thanks for all the extra work, MongoDB Team! Guess that's where I'm done with your product.

Find all where parameter is within an array - Waterline

In pseudo code, it'd be as so
Find all businesses where the outcodes array contains NG1
I'm having a hard time finding something that works, and Waterline throws its Invalid usage error at everything I try.
Business.find({
  or: { outcodes: { contains: 'NG1 4RQ' } }
})
For reference, my business model contains outcodes as an array:
outcodes: { type: 'array' },
Is anyone able to advise how I can achieve this? I'm stumped. I'm currently using SailsJS with the Waterline ORM.

The or is not working because it needs to be an array. With only one criterion you don't need or at all, but here's an example using or and searching an array for a partial string.
Business.find({
  or: [ { outcodes: { contains: 'NG1' } } ]
}).exec(function(err, businesses){...});

call custom python function on every document in a collection Mongo DB

I want to call a custom python function on some existing attribute of every document in the entire collection and store the result as a new key-value pair in that (same) document. May I know if there's any way to do that (since each call is independent of others) ?
I noticed cursor.forEach, but can't it be done efficiently using just Python?
A simple example would be to split the string in text and store the number of words as a new attribute.
def split_count(text):
    # some complex preprocessing...
    return len(text.split())

# Need something like this...
db.collection.update_many({}, {'$set': {"split": split_count('$text')}}, upsert=True)
But it seems like setting a new attribute in a document based on the value of another attribute in the same document is not possible this way yet. This post is old but the issues seem to be still open.

I found a way to call any custom Python function on a collection using parallel_scan in PyMongo.
import threading

def process_text(cursor):
    for row in cursor.batch_size(200):
        # Any complex preprocessing here...
        split_text = row['text'].split()
        db.collection.update_one({'_id': row['_id']},
                                 {'$set': {'split_text': split_text,
                                           'num_words': len(split_text)}},
                                 upsert=True)

def preprocess(num_threads=4):
    # Get up to max 'num_threads' cursors.
    cursors = db.collection.parallel_scan(num_threads)
    threads = [threading.Thread(target=process_text, args=(cursor,)) for cursor in cursors]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
This is not really faster than cursor.forEach (but not that slow either), but it helps me execute any arbitrarily complex python code and save the results from within Python itself.
Also if I have an array of ints in one of the attributes, doing cursor.forEach converts them to floats which I don't want. So I preferred this way.
But I would be glad to know if there're any better ways than this :)
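One alternative worth trying is to keep the Python function on the client but send the writes in batches with bulk_write, so there is one round trip per batch rather than one per document. Note that parallel_scan was removed in PyMongo 4.0 (and the underlying parallelCollectionScan command in MongoDB 4.2), so a plain find cursor is used instead. This is a minimal sketch: the helper names update_batches and apply_in_bulk, the num_words field, and the batch size are assumptions, and every document is assumed to have a text field.

```python
from itertools import islice


def split_count(text):
    # Stand-in for arbitrarily complex preprocessing.
    return len(text.split())


def update_batches(docs, batch_size=1000):
    """Yield lists of (filter, update) pairs, at most batch_size per list."""
    pairs = (
        ({"_id": doc["_id"]}, {"$set": {"num_words": split_count(doc["text"])}})
        for doc in docs
    )
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        yield batch


def apply_in_bulk(collection, batch_size=1000):
    # Imported here so the helpers above can be exercised without PyMongo installed.
    from pymongo import UpdateOne

    cursor = collection.find({}, {"text": 1})  # fetch only _id and text
    for batch in update_batches(cursor, batch_size):
        # One round trip per batch instead of one per document.
        collection.bulk_write([UpdateOne(f, u) for f, u in batch], ordered=False)
```

ordered=False lets the server continue past an individual failed update instead of stopping the whole batch; whether that is desirable depends on the data.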

It is quite unlikely that it will ever be efficient to do this kind of thing in python. This is because the document would have to make a round trip and go through the python function on the client machine.
In your example code, you are passing the result of a function to a mongodb update query, which won't work. You can't run any python code inside mongodb queries on the db server.
As the answer to your linked question suggests, this type of action has to be performed in the mongo shell, e.g.:
db.collection.find().snapshot().forEach(
  function (elem) {
    splitLength = elem.text.split(" ").length
    db.collection.update(
      {
        _id: elem._id
      },
      {
        $set: {
          split: splitLength
        }
      }
    );
  }
);

Mongo find by regex: return only matching string

My application has the following stack:
Sinatra on Ruby -> MongoMapper -> MongoDB
The application puts several entries in the database. In order to crosslink to other pages, I've added some sort of syntax. e.g.:
Coffee is a black, caffeinated liquid made from beans. {Tea} is made from leaves. Both drinks are sometimes enjoyed with {milk}
In this example {Tea} will link to another DB entry about tea.
I'm trying to query my mongoDB about all 'linked terms'. Usually in ruby I would do something like this: /{([a-zA-Z0-9])+}/ where the () will return a matched string. In mongo however I get the whole record.
How can I get mongo to return only the matched parts of the record I'm looking for? For the example above it would return:
["Tea", "milk"]
I'm trying to avoid pulling the entire record into Ruby and processing it there.

I don't know if I understand.
db.yourColl.aggregate([
  {
    $match: { "yourKey": { $regex: '[a-zA-Z0-9]', "$options": "i" } }
  },
  {
    $group: {
      _id: null,
      tot: { $push: "$yourKey" }
    }
  }
])
If you don't want duplicates in tot, use $addToSet instead of $push.

The way I solved this problem was with the string aggregation operators: $indexOfCP to find the starting and ending index of each {...} marker, and $substrCP to extract the string I wanted. Since you can have several of these {} markers, you need one projection that identifies all the CP indices in one shot and another projection that extracts the words you need. Hope this helps.
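On MongoDB 4.2 or later, a different technique than the $indexOfCP/$substrCP approach described above is available: $regexFindAll returns every match together with its capture groups, so the terms can be extracted in a single projection. The sketch below builds such a pipeline for pymongo. The function name and the field name "body" are assumptions, since the question does not name the field, and the local re check only illustrates what the pattern captures on the example sentence.

```python
import re

# One or more alphanumerics wrapped in literal braces; the group is the term.
PATTERN = r"\{([a-zA-Z0-9]+)\}"


def linked_terms_pipeline(field):
    """Build a pipeline that projects the captured {term} names of `field`."""
    return [
        {
            "$project": {
                "_id": 0,
                "terms": {
                    "$map": {
                        "input": {
                            "$regexFindAll": {"input": f"${field}", "regex": PATTERN}
                        },
                        "as": "m",
                        # captures[0] is the text inside the braces
                        "in": {"$arrayElemAt": ["$$m.captures", 0]},
                    }
                },
            }
        }
    ]


pipeline = linked_terms_pipeline("body")  # "body" is a hypothetical field name

# Local illustration of the same pattern with Python's re module.
sample = (
    "Coffee is a black, caffeinated liquid made from beans. {Tea} is made "
    "from leaves. Both drinks are sometimes enjoyed with {milk}"
)
local_terms = re.findall(PATTERN, sample)
```

With pymongo this would be run as db.yourColl.aggregate(pipeline), so only the extracted terms leave the server rather than the whole record.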
