How to evaluate simple expressions - mongodb

When developing complex aggregations, I want the ability to test out simpler expressions as a sanity check. So I'm wondering if mongo shell has the ability to evaluate simple expressions.
For example, I want to do simple things like:
> { $hour: ISODate("2016-01-01T12:30:00Z") }
ISODate("2016-01-01T12:30:00Z")
In the example above it seems the shell isn't evaluating and returning the hour component as desired.
Is it possible to do what I want here?

If you're willing to use something other than Mongo Shell, NoSQLBooster will evaluate partial query operations. Just highlight the relevant section and click Run.
This is particularly useful for constructing pipelines with multiple stages. You can evaluate your pipeline one stage a time to verify which documents are passed to the next stage.

Not an ideal solution, but you can something like this:
var exprr = { "$hour": ISODate("2016-01-01T12:30:00Z") };
var tempCollection = "tempCollection";
db.getCollection(tempCollection).insert({});
db.getCollection(tempCollection).aggregate([
{"$project" : {"_id" : 0, "result" : exprr}}
]);
db.getCollection(tempCollection).drop();
Or you can wrap last part in a function and save it, for resue. The idea is we will make a temporary collection, insert a blank document on it and evaluate an expression the aggregation way. The downside is you can only evaluate those expressions which are supported in project aggregation operation.

Related

MongoDB: Bulk changing all field types in python

I have a ton of documents (around 10 million) and I need to change their field type. The usual forEach function (just looping through every value) seems to take forever and is clearly not viable in the timeframe I have (it basically took all night for one out of four updates)
I've heard that bulkwrites may be able to do it but I'm getting mixed messages. I saw a confusing answer on this site, for example, says that there's no written function to do it (you would have to do some workaround), others say that it can be done with updates in Python, using pymongo.
I was wondering if there was a quicker way to mass changes of field type (string->double, string -> int) using python? I can also work from the console but I find even less solutions there.
Thanks
You can try using aggregation query in the mongo shell
Something like
db.your_collection.aggregate([
{
$addFields: {
field1: {
$convert: {
input: "$field1",
to: "string"
}
}
}
},
{ $out: "your_collection" }
])
More info here https://docs.mongodb.com/manual/reference/operator/aggregation/convert/

MongoDB - find if a field value is contained in a given string

Is it possible to query documents where a specific field is contained in a given string?
for example if I have these documents:
{a: 'test blabla'},
{a: 'test not'}
I would like to find all documents that field a is fully included in the string "test blabla test", so only the first document would be returned.
I know I can do it with aggregation using $indexOfCP and it is also possible with $where and mapReduce. I was wandering if it's possible to do it in find query using the standard MongoDB operators (e.g., $gt, $in).
thanks.
I can think of 2 ways you could do this:
Option 1
Using $where:
db.someCol.find( { $where: function() { return "test blabla test".indexOf(this.a) > -1; } }
Explained: Find all documents whose value of field "a" is found WITHIN some given string.
This approach is universal, as you can run any code you like, but less recommended from a performance perspective. For instance, it cannot take advantage of indexes. Read full $where considerations here: https://docs.mongodb.com/manual/reference/operator/query/where/#considerations
Option 2
Using regex matching trickery, ONLY under certain circumstances; below is an example that only works with matching that the field value is found as a starting substring of the given string:
db.someCol.find( { a : /^(t(e(s(t( (b(l(a(b(l(a( (t(e(s(t)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?$/ } )
Explained: Break up the components of your "should-be-contained-within" string and match against all sub-possibilities of that with regex.
For your case, this option is pretty much insane, but it's worth noting as there may be specific cases (such as limited namespace matching), where you would not have to break up each letter, but some very finite set of predetermined parts. And in that case, this option could still make use of indexes, and not suffer the $where performance pentalties (as long as the complexity of the regex doesn't outweigh that benefit :)
You can use regex to search .
db.company.findOne({"companyname" : {$regex : ".*hello.*"}});
If you are using Mongo v3.6+, you can use $expr.
As you mentioned $indexOfCP can be used to get index, here it will be
{
"$expr": {
{$ne : [{$indexOfCP: ["test blabla test", "$a"]}, -1]}
}
}
The field name should be prefixed with a dollar sign ($), as $expr allows filters from aggregation pipeline.

call custom python function on every document in a collection Mongo DB

I want to call a custom python function on some existing attribute of every document in the entire collection and store the result as a new key-value pair in that (same) document. May I know if there's any way to do that (since each call is independent of others) ?
I noticed cursor.forEach but can't it be done just using python efficiently ?
A simple example would be to split the string in text and store the no. of words as a new attribute.
def split_count(text):
# some complex preprocessing...
return len(text.split())
# Need something like this...
db.collection.update_many({}, {'$set': {"split": split_count('$text') }}, upsert=True)
But it seems like setting a new attribute in a document based on the value of another attribute in the same document is not possible this way yet. This post is old but the issues seem to be still open.
I found a way to call any custom python function on a collection using parallel_scan in PyMongo.
def process_text(cursor):
for row in cursor.batch_size(200):
# Any complex preprocessing here...
split_text = row['text'].split()
db.collection.update_one({'_id': row['_id']},
{'$set': {'split_text': split_text,
'num_words': len(split_text) }},
upsert=True)
def preprocess(num_threads=4):
# Get up to max 'num_threads' cursors.
cursors = db.collection.parallel_scan(num_threads)
threads = [threading.Thread(target=process_text, args=(cursor,)) for cursor in cursors]
for thread in threads:
thread.start()
for thread in threads:
thread.join()
This is not really faster than cursor.forEach (but not that slow either), but it helps me execute any arbitrarily complex python code and save the results from within Python itself.
Also if I have an array of ints in one of the attributes, doing cursor.forEach converts them to floats which I don't want. So I preferred this way.
But I would be glad to know if there're any better ways than this :)
It is quite unlikely that it will ever be efficient to do this kind of thing in python. This is because the document would have to make a round trip and go through the python function on the client machine.
In your example code, you are passing the result of a function to a mongodb update query, which won't work. You can't run any python code inside mongodb queries on the db server.
As the answer to you linked question suggests, this type of action has to be performed in the mongo shell. e.g:
db.collection.find().snapshot().forEach(
function (elem) {
splitLength = elem.text.split(" ").length
db.collection.update(
{
_id: elem._id
},
{
$set: {
split: splitLength
}
}
);
}
);

Mongo find by regex: return only matching string

My application has the following stack:
Sinatra on Ruby -> MongoMapper -> MongoDB
The application puts several entries in the database. In order to crosslink to other pages, I've added some sort of syntax. e.g.:
Coffee is a black, caffeinated liquid made from beans. {Tea} is made from leaves. Both drinks are sometimes enjoyed with {milk}
In this example {Tea} will link to another DB entry about tea.
I'm trying to query my mongoDB about all 'linked terms'. Usually in ruby I would do something like this: /{([a-zA-Z0-9])+}/ where the () will return a matched string. In mongo however I get the whole record.
How can I get mongo to return me only the matched parts of the record I'm looking for. So for the example above it would return:
["Tea", "milk"]
I'm trying to avoid pulling the entire record into Ruby and processing them there
I don't know if I understand.
db.yourColl.aggregate([
{
$match:{"yourKey":{$regex:'[a-zA-Z0-9]', "$options" : "i"}}
},
{
$group:{
_id:null,
tot:{$push:"$yourKey"}
}
}])
If you don't want to have duplicate in totuse $addToSet
The way I solved this problem is using the string aggregation commands to extract the StartingIndexCP, ending indexCP and substrCP commands to extract the string I wanted. Since you could have multiple of these {} you need to have a projection to identify these CP indices in one shot and have another projection to extract the words you need. Hope this helps.

Mongodb - query with '$or' gives no results

There is extremely weird thing happening on my database. I have a query:
db.Tag.find({"word":"foo"})
this thing matches one object. it's nice.
Now, there's second query
db.Tag.find({$or: [{"word":"foo"}]})
and the second one does not give any results.
There's some kind of magic I obviously don't understand :( What is wrong in second query?
in theory, $or requires two or more parameters, so I can fake it with:
db.Tag.find({$or: [{"word":"foo"},{"word":"foo"}]})
but still, no results.
Your second query is perfectly fine, and it should work. Though the docs says, that $or performs logical operation on array of two or more expression, but it would work for single expression also.
Here's a sample that you can see, and try out, to get it to work: -
> db.col.insert({"foo": "Rohit"})
> db.col.insert({"foo": "Aman", "bar": "Rohit"})
>
> db.col.find({"foo": "Rohit"})
{ "_id" : ObjectId("50ed6bb1a401d9b4576417f7"), "foo" : "Rohit" }
> db.col.find({$or: [{"foo": "Rohit"}]})
{ "_id" : ObjectId("50ed6bb1a401d9b4576417f7"), "foo" : "Rohit" }
So, as you can see, both your query when used for my collection works fine. So, there is certainly something wrong somewhere else. Are you sure you have data in your collection?
Okaay, server admin installed mongodb from debian repo. Debian repo had 1.4.4 version of mongodb, aandd looks like $or is simply not yet supported out there :P