Performing regex queries with PyMongo - mongodb

I am trying to perform a regex query using PyMongo against a MongoDB server. The document structure is as follows
{
"files": [
"File 1",
"File 2",
"File 3",
"File 4"
],
"rootFolder": "/Location/Of/Files"
}
I want to get all the files that match the pattern *File. I tried doing this as such
db.collectionName.find({'files':'/^File/'})
Yet I get nothing back. Am I missing something, because according to the MongoDB docs this should be possible? If I perform the query in the Mongo console it works fine, does this mean the API doesn't support it or am I just using it incorrectly?

If you want to include regular expression options (such as ignore case), try this:
import re
regx = re.compile("^foo", re.IGNORECASE)
db.users.find_one({"files": regx})

Turns out regex searches are done a little differently in pymongo but is just as easy.
Regex is done as follows :
db.collectionname.find({'files':{'$regex':'^File'}})
This will match all documents that have a files property that has a item within that starts with File

To avoid the double compilation you can use the bson regex wrapper that comes with PyMongo:
>>> regx = bson.regex.Regex('^foo')
>>> db.users.find_one({"files": regx})
Regex just stores the string without trying to compile it, so find_one can then detect the argument as a 'Regex' type and form the appropriate Mongo query.
I feel this way is slightly more Pythonic than the other top answer, e.g.:
>>> db.collectionname.find({'files':{'$regex':'^File'}})
It's worth reading up on the bson Regex documentation if you plan to use regex queries because there are some caveats.

The solution of re doesn't use the index at all.
You should use commands like:
db.collectionname.find({'files':{'$regex':'^File'}})
( I cannot comment below their replies, so I reply here )

Related

MongoDB - find if a field value is contained in a given string

Is it possible to query documents where a specific field is contained in a given string?
for example if I have these documents:
{a: 'test blabla'},
{a: 'test not'}
I would like to find all documents that field a is fully included in the string "test blabla test", so only the first document would be returned.
I know I can do it with aggregation using $indexOfCP and it is also possible with $where and mapReduce. I was wandering if it's possible to do it in find query using the standard MongoDB operators (e.g., $gt, $in).
thanks.
I can think of 2 ways you could do this:
Option 1
Using $where:
db.someCol.find( { $where: function() { return "test blabla test".indexOf(this.a) > -1; } }
Explained: Find all documents whose value of field "a" is found WITHIN some given string.
This approach is universal, as you can run any code you like, but less recommended from a performance perspective. For instance, it cannot take advantage of indexes. Read full $where considerations here: https://docs.mongodb.com/manual/reference/operator/query/where/#considerations
Option 2
Using regex matching trickery, ONLY under certain circumstances; below is an example that only works with matching that the field value is found as a starting substring of the given string:
db.someCol.find( { a : /^(t(e(s(t( (b(l(a(b(l(a( (t(e(s(t)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?$/ } )
Explained: Break up the components of your "should-be-contained-within" string and match against all sub-possibilities of that with regex.
For your case, this option is pretty much insane, but it's worth noting as there may be specific cases (such as limited namespace matching), where you would not have to break up each letter, but some very finite set of predetermined parts. And in that case, this option could still make use of indexes, and not suffer the $where performance pentalties (as long as the complexity of the regex doesn't outweigh that benefit :)
You can use regex to search .
db.company.findOne({"companyname" : {$regex : ".*hello.*"}});
If you are using Mongo v3.6+, you can use $expr.
As you mentioned $indexOfCP can be used to get index, here it will be
{
"$expr": {
{$ne : [{$indexOfCP: ["test blabla test", "$a"]}, -1]}
}
}
The field name should be prefixed with a dollar sign ($), as $expr allows filters from aggregation pipeline.

Mongo query with regex fails when backslash\newline is there in a field

Hi I have a field in a user collection called "Address".User saving their address from a textarea in my application. mongodb convert it to new line like following.
{
"_id": ObjectId("56a9ba0ffbe0856d4f8b456d"),
"address": "HOUSE NO. 3157,\r\nSECTOR 50-D",
"pincode": "",
},
{
"_id": ObjectId("56a9ba0ffbe0856d4f8b456d"),
"address": "HOUSE NO. 3257,\r\nSECTOR 50-C",
"pincode": "",
}
So now When I am running a search query on the basis of "address".Like following:
guardianAdd = $dm->getRepository('EduStudentBundle:GuardianAddress')->findBy(array(
'address' => new \MongoRegex('/.*' .$data['address'] . '.*/i'),
'isDelete' => false
));
echo count($guardianAdd);die;
it does not give any result. My Searchi key word is : "HOUSE NO.3157 SECTOR 50-D".
However if I am searching using like: HOUSE NO. 3157 its giving correct result.
Please advice how to fix this.Thanks in advance
First of all, trailing .* are redundant. regexps /.*aaa.*/ and /aaa/ are identical and match the same pattern.
Second, you probably need to use multiline modifier /pattern/im
Finally, it is not quite clear what you want to fix. The best think you can do is to provide some basic explanation of regex syntax in the search form, so users can search properly, e.g. HOUSE NO.*3157.*SECTOR 50-D to get best results.
You can make some bold assumptions and build the pattern with something like
$pattern = implode('\W+',preg_split('/\W+/', $data['address']))
which will give you a regexp HOUSE\W+NO\W+3157\W+SECTOR\W+50\W+D for different kind of HOUSE NO.3157 SECTOR 50-D requests, but it will cut all the regex flexibility available with bare input, and eventually will result with unexpected response anyway. You can follow this slippery slope and end up with your own query DSL to compile to regex, but I doubt it can be any better or more convenient than pure regex. It will be more error prone for sure.
Asking right question to get right answers is true not only on SO, but also in your application. Unfortunately there is no general solution to search for something that people have in mind, but fail to ask. I believe that in your particular case best code is no code.

Mongo find by regex: return only matching string

My application has the following stack:
Sinatra on Ruby -> MongoMapper -> MongoDB
The application puts several entries in the database. In order to crosslink to other pages, I've added some sort of syntax. e.g.:
Coffee is a black, caffeinated liquid made from beans. {Tea} is made from leaves. Both drinks are sometimes enjoyed with {milk}
In this example {Tea} will link to another DB entry about tea.
I'm trying to query my mongoDB about all 'linked terms'. Usually in ruby I would do something like this: /{([a-zA-Z0-9])+}/ where the () will return a matched string. In mongo however I get the whole record.
How can I get mongo to return me only the matched parts of the record I'm looking for. So for the example above it would return:
["Tea", "milk"]
I'm trying to avoid pulling the entire record into Ruby and processing them there
I don't know if I understand.
db.yourColl.aggregate([
{
$match:{"yourKey":{$regex:'[a-zA-Z0-9]', "$options" : "i"}}
},
{
$group:{
_id:null,
tot:{$push:"$yourKey"}
}
}])
If you don't want to have duplicate in totuse $addToSet
The way I solved this problem is using the string aggregation commands to extract the StartingIndexCP, ending indexCP and substrCP commands to extract the string I wanted. Since you could have multiple of these {} you need to have a projection to identify these CP indices in one shot and have another projection to extract the words you need. Hope this helps.

MongoDB full text search with haskell driver

Is there a possibility to use the full text search of mongoDB with the haskell driver?
I found the 'runCommand' in the haskell API, but it expects a Document as parameter. That's fine for all the other commands that mongodb can run, but the syntax for a text command is:
db.collection.runCommand( "text", {search : "something"})
So I don't know how I'll get the "text" as first parameter in front of the Document.
Thanks
The text-command can be written in another structure:
{ text: your_collection
, search: your_text
, filter: your_filter
, limit: your_limit
, project: your_projection
}
I had my suspicion, since all "runCommand"-action have the same structure. So I tried to apply that structure to the text-command - but without success. Then I remembered that aggregate has another structure too and tried that, but that didn't work either. Finally, I found the answer in a google group entry of the Java driver.

whoosh doesn't search for short words like "C#"

i am using whoosh to index over 200,000 books. but i have encountered some problems with it.
the whoosh query parser returns NullQuery for words like "C#", "C++" with meta-characters in them and also for some other short words. this words are used in the title and body of some documents so i am not using keyword type for them. i guess the problem is in the analysis or query-parsing phase of searching or indexing but i can't touch my data blindly. can anyone help me to correct this issue. Tnx.
i fixed the problem by creating a StandardAnalyzer with a regex pattern that meets my requirements,here is the regex pattern:
'\w+[#+.\w]*'
this will make tokenizing of fields to be done successfully, and also the searching goes well.
but when i use queries like "some query++*" or "some##*" the parsed query will be a single Every query, just the '*'. also i found that this is not related to my analyzer and this is the Whoosh's default behavior. so here is my new question: is this behavior correct or it is a bug??
note: removing the WildcardPlugin from the query-parser solves this problem but i also need the WildcardPlugin.
now i am using the following code:
from whoosh.util import rcompile
#for matching words like: '.NET', 'C++' and 'C#'
word_pattern = rcompile('(\.|[\w]+)(\.?\w+|#|\+\+)*')
#i don't need words shorter that two characters so i don't change the minsize default
analyzer = analysis.StandardAnalyzer(expression=word_pattern)
... now in my schema:
...
title = fields.TEXT(analyzer=analyzer),
...
this will solve my first problem, yes. but the main problem is in searching. i don't want to let users to search using the Every query or *. but when i parse queries like C++* i end up an Every(*) query. i know that there is some problem but i can't figure out what it is.
I had the same issue and found out that StandardAnalyzer() uses minsize=2 by default. So in your schema, you have to tell it otherwise.
schema = whoosh.fields.Schema(
name = whoosh.fields.TEXT(stored=True, analyzer=whoosh.analysis.StandardAnalyzer(minsize=1)),
# ...
)