mongo search sub string - mongodb

Hello all, I am working on an application using MongoDB and I need to find a substring within a collection.
I have a collection named Query, as shown below.
> db.Query.find();
>{ "_id" : ObjectId("54c9ec8ead38d420d87743b0"), "QueryID" : 1, "QueryString" : "List my games", "QueryFrequency" : 9, "QueryResultset" : 3 }
>...
Now I want to search for a substring in QueryString,
e.g. "games" in "QueryString" : "List my games".
For this I created a text index on QueryString, and running the following command does return some results.
> db.Query.runCommand("text" , {search : "games"});
"text" is the name given to my index.
The problem is that I only get results when the word I am searching for is longer than 3 characters.
In my example I get results when I search for the word "List" or "games",
but searching for "my", or any other word with fewer than 4 characters, gives no result.
Is there any way to solve this, or am I missing some setting?

You can use a regex for this. Your query would look like this:
db.collection.find({"query_string":{$regex: your_regex }})

For a case-insensitive search you could use this:
db.users.find({ "name" : /my/i })
where the i flag stands for case-insensitive.
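The same idea can be sketched in Python. This is a minimal sketch, assuming the search term should be matched literally. `substring_filter` is a hypothetical helper (not part of pymongo) that only builds the filter document, and the second sample document is made up for illustration. The local `re` check mirrors what the regex would match.

```python
import re

def substring_filter(field, term, ignore_case=True):
    """Build a filter matching documents whose `field` contains `term`."""
    # re.escape keeps characters such as '.' or '*' in the term literal.
    query = {"$regex": re.escape(term)}
    if ignore_case:
        query["$options"] = "i"
    return {field: query}

# Sample documents modelled on the Query collection (the second is hypothetical).
docs = [
    {"QueryID": 1, "QueryString": "List my games"},
    {"QueryID": 2, "QueryString": "Show all scores"},
]

flt = substring_filter("QueryString", "my")
# With pymongo this could be passed to find(), e.g. db.Query.find(flt)

# Local check of the same pattern against the sample documents.
pattern = re.compile(flt["QueryString"]["$regex"], re.IGNORECASE)
matches = [d["QueryID"] for d in docs if pattern.search(d["QueryString"])]
print(matches)
```

Note that a regex that is not anchored to the start of the string generally cannot make efficient use of an index, so on a large collection this is likely to scan many documents.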

Related

Confusion with the order of logical operators in Mongo

I am new to MongoDB and, while practising, I came across a weird problem. The documents look like this:
{
"_id" : ObjectId("5c8eccc1caa187d17ca6ed29"),
"city" : "CLEVELAND",
"zip" : "35049",
"loc" : {
"y" : 33.992106,
"x" : 86.559355
},
"pop" : 2369,
"state" : "AL"
} ...
I want to find the number of cities, that have a population of more than 5000 but less than 1000000.
Both these queries, this:
db.zips.find({"$nor":[{"pop":{"$lt":5000}},{"pop":{"$gt":"1000000"}}]}).count()
and this:
db.zips.find({"$nor":[{"pop":{"$gt":1000000}},{"pop":{"$lt":"5000"}}]}).count()
give different results.
The first one gives 11193 and the second one gives 29470. Since I come from a MySQL background, I see no difference between the two queries. As far as I can tell, they are the same and should both return the number of zip codes with a population of more than 5000 and less than 1000000. Please help me understand.
Thanks in advance.
$gt and $lt should be used to compare values of the same data type.
Your first query quotes "1000000" and your second query quotes "5000", so the two queries are not the same: each one compares pop against a number in one clause and against a string in the other.
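The effect can be simulated locally. This is a rough sketch, assuming that a comparison between a number and a string simply never matches (MongoDB's comparison operators only match values of the same type as the query value). The helper names and the population values are made up for illustration.

```python
from numbers import Number

def same_type(a, b):
    """Rough stand-in for type bracketing: numbers with numbers, strings with strings."""
    if isinstance(a, Number) and isinstance(b, Number):
        return True
    return isinstance(a, str) and isinstance(b, str)

def lt(value, bound):
    # A mismatched type never matches, instead of raising an error.
    return same_type(value, bound) and value < bound

def gt(value, bound):
    return same_type(value, bound) and value > bound

def nor(*results):
    return not any(results)

pops = [2369, 5000, 40000, 2000000]  # made-up population values

# First query: the quoted "1000000" clause never matches a numeric pop.
first = [p for p in pops if nor(lt(p, 5000), gt(p, "1000000"))]
# Second query: the quoted "5000" clause never matches a numeric pop.
second = [p for p in pops if nor(gt(p, 1000000), lt(p, "5000"))]
# Intended query: both bounds are numbers.
intended = [p for p in pops if nor(lt(p, 5000), gt(p, 1000000))]

print(first, second, intended)
```

Under this assumption the first query only excludes the small populations and the second only excludes the very large ones, which is consistent with the two different counts.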

Mongo text search does not seem to use the initial condition to reduce the number of documents to search

I have a collection with text search enabled and an index on the FName field. If I perform a search like this:
db.items.find({ "FName" : /^customer\\rainfall\\geometry\ types\\customer\ 1\ manhole\ monitors\\region\ 10\\area\ 100\\catchment\ 1000\\/ }).count()
Then this is very fast and returns a value of 300 as there are 300 matching documents.
If, however, I run this query:
db.items.find([{ "FName" : /^customer\\rainfall\\geometry\ types\\customer\ 1\ manhole\ monitors\\region\ 10\\area\ 100\\catchment\ 1000\\/ }, "$text" : { "$search" : "\"\\average rainfall\"" }]).count()
it can take up to 20 seconds to complete. If I run the initial query, get all the documents and carry out the search client side, it is much faster: a second or two.
How can this be faster given that the second query is actually being run on the database? It almost looks like the FName filter is not being used first but I cannot prove this.
Any ideas?
Thanks
Ian

Pymongo find document whose field is a substring of a given string

Let's say we have a collection with the following documents:
{_id : 1, str : 'hello'}
{_id : 2, str : 'hello world'}
{_id : 3, str : 'world'}
And I would like to find documents whose str field is a substring of hello world!. Is there a way to do this in pymongo?
I know the opposite - getting documents whose field contains a string can be done using $regex, but what I want is getting documents whose field is contained by a string.
You can use text indexes for this, which support text search queries on string content. Text indexes can include any field whose value is a string or an array of string elements.
Here's a minimal example using pymongo:
import pymongo

# Get database connection
conn = pymongo.MongoClient('mongodb://localhost:27017/')
coll = conn.get_database('test').get_collection('test')
# Create text index on the str field
coll.create_index([('str', pymongo.TEXT)])
# Text search: matches documents containing any of the search words
print(list(coll.find({'$text': {'$search': 'hello world'}})))
With your example documents, this will result in something like:
[{'_id': 3.0, 'str': 'world'},
{'_id': 2.0, 'str': 'hello world'},
{'_id': 1.0, 'str': 'hello'}]
For more information, please see:
Text Indexes
$text operator
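Text search matches on whole words rather than raw substrings, so a stricter "field is contained in the given string" check may be wanted. The following is a hedged sketch of that check, not the approach from the answer above. `field_in_string_filter` is a hypothetical helper that builds a filter using the `$expr` and `$indexOfCP` operators, and it assumes the field is always present and holds a string. The local list comprehension applies the same rule to the example documents, plus one made-up document.

```python
def field_in_string_filter(field, text):
    """Build a filter for documents whose `field` value occurs inside `text`."""
    # $indexOfCP returns -1 when the second argument does not occur in the first.
    return {"$expr": {"$gte": [{"$indexOfCP": [text, "$" + field]}, 0]}}

docs = [
    {"_id": 1, "str": "hello"},
    {"_id": 2, "str": "hello world"},
    {"_id": 3, "str": "world"},
    {"_id": 4, "str": "worlds"},  # hypothetical document that should not match
]

text = "hello world!"
flt = field_in_string_filter("str", text)
# With pymongo this could be passed to find(), e.g. coll.find(flt)

# The same containment rule applied locally.
contained = [d["_id"] for d in docs if d["str"] in text]
print(contained)
```

A filter of this kind evaluates an expression for every document, so it is unlikely to benefit from an index on the field.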

Continuing a Query (paginating) on a compound index

I have a (hopefully quick) question about MongoDB queries on compound indexes.
Say I have a data set (for example, comments) which I want to sort descending by score, and then date:
{ "score" : 10, "date" : ISODate("2014-02-24T00:00:00.000Z"), ...}
{ "score" : 10, "date" : ISODate("2014-02-18T00:00:00.000Z"), ...}
{ "score" : 10, "date" : ISODate("2014-02-12T00:00:00.000Z"), ...}
{ "score" : 9, "date" : ISODate("2014-02-22T00:00:00.000Z"), ...}
{ "score" : 9, "date" : ISODate("2014-02-16T00:00:00.000Z"), ...}
...
My understanding thus far is that I can make a compound index to support this query, which looks like {"score":-1,"date":-1}. (For clarity's sake, I am not using a date in the index, but an ObjectID for unique, roughly time-based order)
Now, say I want to support paging through the comments. The first page is easy enough, I can just stick a .limit(n) option on the end of the cursor. What I'm struggling with is continuing the search.
I have been referring to MongoDB: The Definitive Guide by Kristina Chodorow. In this book, Kristina mentions that using skip() on large datasets is not very performant, and recommends using range queries on parameters from the last seen result (eg. the last seen date).
What I would like to do is perform a range query that acts on two fields, but treats the second field as secondary to the first (just like the index is sorted.) Since my compound index is already sorted in exactly the order I want, it seems like there should be some way to jump into the search by pointing at a specific element in the index and traversing it in the sort order. However, from my (admittedly rudimentary) understanding of queries in MongoDB this doesn't seem possible.
As far as I can see, I have three options:
1. Using skip() anyway
2. Using either an $or query or two distinct queries: {$or : [{"score" : lastScore, "date" : { $lt : lastDate}}, {"score" : {$lt : lastScore}}]}
3. Using the $max special query option
Number 3 seems like the closest to ideal for me, but the reference text notes that 'you should generally use "$lt" instead of "$max"'.
To summarize, I have a few questions:
Is there some way to perform the operation I described, that I may have missed? (Jumping into an index and traversing it in the sort order)
If not, of the three options I described (or any I have overlooked), which would (very generally speaking) give the most consistent performance under the compound index?
Why is $lt preferred over $max in most cases?
Thanks in advance for your help!
Another option is to store score and date in a sub-document and then index the sub-document. For example:
{
"a" : { "score" : 9,
"date" : ISODate("2014-02-22T00:00:00Z") },
...
}
db.foo.ensureIndex( { a : 1 } )
db.foo.find( { a : { $lt : { score : lastScore,
date: lastDate } } } ).sort( { a : -1 } )
With this approach you need to ensure that the fields in the BSON sub-document are always stored in the same order, otherwise the query won't match what you expect since index key comparison is binary comparison of the entire BSON sub-document.
I would go with $max to specify the upper bound, in conjunction with $hint to make sure that the database uses the index you want. The reason that $lt is generally preferred over $max is that $max selects the index using the specified index bounds. This means:
the index chosen may not necessarily be the best choice.
if multiple indexes exist on same fields with different sort orders, the selection of the index may be ambiguous.
The above points are covered in further detail here.
One last point: max is equivalent to $lte, not $lt, so using this approach for pagination you'll need to skip over the first returned document to avoid outputting the same document twice.
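For comparison, the $or range query mentioned in the question can be sketched in Python. This is a minimal sketch, assuming the sort is descending on score and then on date. `next_page_filter` and `page_after` are hypothetical helpers: the first builds the filter document, the second simulates the same paging rule on an in-memory list using the sample values from the question.

```python
from datetime import datetime

def next_page_filter(last_score, last_date):
    """Filter for documents that sort after (last_score, last_date), descending."""
    return {"$or": [
        {"score": last_score, "date": {"$lt": last_date}},  # same score, older date
        {"score": {"$lt": last_score}},                     # any lower score
    ]}

comments = [
    {"score": 10, "date": datetime(2014, 2, 24)},
    {"score": 10, "date": datetime(2014, 2, 18)},
    {"score": 10, "date": datetime(2014, 2, 12)},
    {"score": 9, "date": datetime(2014, 2, 22)},
    {"score": 9, "date": datetime(2014, 2, 16)},
]

def page_after(items, last, limit):
    """Local simulation: tuple comparison is equivalent to the $or filter."""
    ordered = sorted(items, key=lambda c: (c["score"], c["date"]), reverse=True)
    if last is None:
        return ordered[:limit]
    key = (last["score"], last["date"])
    return [c for c in ordered if (c["score"], c["date"]) < key][:limit]

page1 = page_after(comments, None, 2)
page2 = page_after(comments, page1[-1], 2)
page3 = page_after(comments, page2[-1], 2)
print([(c["score"], c["date"].day) for c in page2])
```

With pymongo the filter would typically be combined with the matching sort and limit, for example `coll.find(next_page_filter(s, d)).sort([("score", -1), ("date", -1)]).limit(n)`. The sketch assumes the (score, date) pair is unique per document, which is why the question suggests an ObjectId as the second field.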

How to improve query performance with operators like $nin, $in for Mongodb

I have a reasonably large dataset of over 3 million documents that have tags similar to StackOverflow that uses tags for each question. The schema that I use for storing the tags is as follows:
{"id": 12345, "tags":["tag1", "tag2", "tag3"]}, {"id": 12346, "tags":["tag2", "tag3"]}
I have a multi-key index created on tags field. When I am performing queries using $in or $nin operators to find the intersection, union of the tags, the performance is around 7 seconds on a server class machine. Is there anything that I can do to improve the speed of query search?
EDIT 1:
Here is the explain plan as requested. What I observed is that the queries returned much faster (< 50ms) after I restarted my server and ran only the MongoDB server. I suspect the indexes were not cached in memory, although I had ample unused RAM available and my index (800MB) could easily fit in memory.
db.tagsCollection.find( { "tags" : { $in : ['tag1', 'tag2'], $nin : ['tag4', 'tag5', 'tag6', 'tag7'] } } ).explain();
{
"cursor" : "BtreeCursor tags_1 multi",
"nscanned" : 6145193,
"nscannedObjects" : 6145192,
"n" : 969386,
"millis" : 19640,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : true,
"indexOnly" : false,
"indexBounds" : {
"tags" : [
[
"tag1",
"tag1"
],
[
"tag2",
"tag2"
]
]
}
}
Note
This is what I thought of as an optimization (though you might need to test it).
Instead of storing the tags themselves, store a small key which identifies all the tags a particular document has.
Say for post #125 the tags are: PHP, MongoDb, database.
a) Clean the tags (convert them all to lower case, etc.)
and then sort them alphabetically.
The tags will now be: database,mongodb,php
b) Have a separate collection which stores an integer-to-tag mapping:
{ "_id" : 1 , "t" : "mongodb" }
{ "_id" : 2 , "t" : "php" } and so on, storing all the possible tags for your website.
c) To store a document, create the tag key using the tag-to-number map from the previous collection.
So the current database,mongodb,php will become something like 1-12-2.
d) Store your document like:
{ "id" : 12345 , "tags" : [1,12,2] }
QUERYING :
Using integers instead of strings on an indexed field would reduce the index size to a great extent, and should also make querying faster compared to a string index.
I'm not sure about the amount of performance gain, but it is still worth a try to compare against your current implementation.
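Steps a) to d) can be sketched in Python. This is an in-memory sketch only: `TagRegistry` is a hypothetical stand-in for the separate mapping collection, and it assigns ids in order of first use, so the numbers differ from the ones in the example above.

```python
def normalize(tags):
    """Step a): trim, lower-case and sort the tags alphabetically."""
    return sorted(t.strip().lower() for t in tags)

class TagRegistry:
    """Step b): stand-in for the collection mapping integers to tags."""

    def __init__(self):
        self._ids = {}

    def id_for(self, tag):
        # Assign the next free integer the first time a tag is seen.
        if tag not in self._ids:
            self._ids[tag] = len(self._ids) + 1
        return self._ids[tag]

    def encode(self, tags):
        """Step c): translate normalized tags into their integer ids."""
        return [self.id_for(t) for t in normalize(tags)]

registry = TagRegistry()
# Step d): the document stores integers instead of strings.
doc = {"id": 12345, "tags": registry.encode(["PHP", "MongoDb ", "database"])}
other = {"id": 12346, "tags": registry.encode(["mongodb", "php"])}
print(doc, other)
```

Queries would then translate the tag names to ids first and run $in or $nin against the integer array.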
Check the size of your multi-key tags index using db.col.stats(). If it doesn't fit in RAM then you might be disk-bound and incurring some disk I/O cost. If the index fits entirely in memory then I'm not sure what else you can do, apart from throwing more hardware at it, unless you can optimise the queries themselves.
Do you need to search through all the data, or can you query a subset that's filtered by another indexed field? Or can you eliminate the $nin queries, which will tend to be slower because they have to iterate over every tag, whereas $in only has to iterate until it finds a match?
If you want performance to be super fast and don't have space constraints, I would suggest having a separate collection of tags, each with an array of video ids, and an index on the tag name.
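The separate tag collection is essentially an inverted index. A minimal in-memory sketch of how it would answer a combined $in / $nin query follows. The function names are hypothetical, and the third document is made up to show an exclusion.

```python
from collections import defaultdict

def build_tag_index(docs):
    """Map each tag to the set of document ids that carry it."""
    index = defaultdict(set)
    for doc in docs:
        for tag in doc["tags"]:
            index[tag].add(doc["id"])
    return index

def ids_with_any_without(index, include, exclude):
    """Ids having at least one tag in `include` and no tag in `exclude`."""
    # index.get avoids creating empty entries for unknown tags.
    included = set().union(*(index.get(t, set()) for t in include))
    excluded = set().union(*(index.get(t, set()) for t in exclude))
    return included - excluded

docs = [
    {"id": 12345, "tags": ["tag1", "tag2", "tag3"]},
    {"id": 12346, "tags": ["tag2", "tag3"]},
    {"id": 12347, "tags": ["tag1", "tag4"]},  # hypothetical document
]

index = build_tag_index(docs)
result = ids_with_any_without(index, ["tag1", "tag2"], ["tag4"])
print(sorted(result))
```

Each lookup touches only the tags named in the query, at the cost of keeping the per-tag id arrays up to date on every write.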
Here is another suggestion, but I have not had a chance to test it.
{
  tags: {
    items: [ 'a', 'b', 'c' ],
    mixed: {
      a: 1, // hash value for a tag
      b: 2, // hash value for b tag
      c: 3  // hash value for c tag
    }
  }
}
and search query is
db.demo.find({ 'tags.mixed.a':1, 'tags.mixed.b':2 })
If possible, create a compound index on the tags.mixed fields.