MongoDB text index search by multiple words is too slow

Problem Description
MongoDB version: 3.4.4
Documents in the MongoDB collection were created from the XML files (not GridFS) and look like this one:
{
...
"СвНаимЮЛ" : {
"#attributes" : {
"НаимЮЛПолн" : "ОБЩЕСТВО С ОГРАНИЧЕННОЙ ОТВЕТСТВЕННОСТЬЮ \"КОНСАЛТИНГОВАЯ КОМПАНИЯ \"ГОТЛИБ ЛИМИТИД\"",
...
},
...
}
...
}
Language is Russian. Collection has about 10,000,000 documents and a text index on the field "СвНаимЮЛ.#attributes.НаимЮЛПолн".
Search by one word is very fast:
db.records.find({
$text: {
$search: "ГОТЛИБ"
}
})
But searching for several words with logical AND is so slow that I can't even wait for it to finish to get explain('executionStats') results.
For example, the following query, which finds all documents containing both "ГОТЛИБ" AND "ЛИМИТИД", is very slow:
db.records.find({
$text: {
$search: "\"ГОТЛИБ\" \"ЛИМИТИД\""
}
})
Searching by phrase is also slow. For example, find all documents which contain the phrase "ГОТЛИБ ЛИМИТИД":
db.records.find({
$text: {
$search: "\"ГОТЛИБ ЛИМИТИД\""
}
})
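The quoted form used in the AND query above can be generated from a list of words. The following is only a sketch: allWordsSearch is a hypothetical helper, not a MongoDB function, and it assumes that each quoted word is treated by $text as a required phrase, as the queries above rely on.

```javascript
// Hypothetical helper (not part of MongoDB): wrap each word in double quotes
// so that $text treats every word as a required phrase (logical AND).
function allWordsSearch(words) {
  return words
    .map((w) => w.replace(/"/g, "").trim()) // drop embedded quotes; they would end the phrase early
    .filter((w) => w.length > 0)            // skip empty entries
    .map((w) => `"${w}"`)
    .join(" ");
}

const search = allWordsSearch(["ГОТЛИБ", "ЛИМИТИД"]);
console.log(search);

// In the mongo shell the string would then be passed as:
//   db.records.find({ $text: { $search: search } })
```

This only builds the search string; it does not change how fast the server evaluates it.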
getIndexes() output:
[
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "egrul.records"
},
...
{
"v" : 2,
"key" : {
"_fts" : "text",
"_ftsx" : 1
},
"name" : "СвНаимЮЛ.#attributes.НаимЮЛПолн_text",
"ns" : "egrul.records",
"default_language" : "russian",
"weights" : {
"СвНаимЮЛ.#attributes.НаимЮЛПолн" : 1
},
"language_override" : "language",
"textIndexVersion" : 3
}
]
Question
Can I somehow increase search-by-several-words (with logical AND) or search-by-phrase speed?
Edited
Just found that search by multiple words with logical OR is also slow:
db.records.find({
$text: {
$search: "ГОТЛИБ ЛИМИТИД"
}
})

It looks like the problem is not that searching by multiple words is slow, but that searching is slow whenever the search term appears in many documents.
E.g. the word "МИЦУБИСИ" appears in only 24 (out of 10,000,000) documents, so the query
db.records.find({
$text: {
$search: "МИЦУБИСИ"
}
}).count()
is very fast.
But the word "СЕРВИС" appears in 160,000 documents and the query
db.records.find({
$text: {
$search: "СЕРВИС"
}
}).count()
is very slow (takes about 40 minutes).
Query
db.records.find({
$text: {
$search: "\"МИЦУБИСИ\" \"СЕРВИС\""
}
}).count()
is also slow because (I suppose) MongoDB looks up the terms "МИЦУБИСИ" (fast) and "СЕРВИС" (slow) and then computes an intersection or something similar.
Now I want to find a way to limit the number of results, something like "find 10 documents and stop", because limit() doesn't work with text queries.
Or maybe upgrade my server hardware.
Or look at Elasticsearch.

Related

Is searching by _id in mongoDB more efficient?

In my use case, I want to search a document by a given unique string in MongoDB. However, I want my queries to be fast and searching by _id will add some overhead. I want to know if there are any benefits in MongoDB to search a document by _id over any other unique value?
To my knowledge, ObjectIds are similar to any other unique value in a document [point made for the case of searching only].
As for the overhead, you can assume I am caching the string to objectID and the cache is very small and in memory [Almost negligible], though the DB is large.
Analyzing your query performance
I advise you to use .explain() provided by mongoDB to analyze your query performance.
Let's say we are trying to execute this query
db.inventory.find( { quantity: { $gte: 100, $lte: 200 } } )
This would be the result of the query execution
{ "_id" : 2, "item" : "f2", "type" : "food", "quantity" : 100 }
{ "_id" : 3, "item" : "p1", "type" : "paper", "quantity" : 200 }
{ "_id" : 4, "item" : "p2", "type" : "paper", "quantity" : 150 }
If we call .explain() this way
db.inventory.find(
{ quantity: { $gte: 100, $lte: 200 } }
).explain("executionStats")
It will return the following result:
{
"queryPlanner" : {
"plannerVersion" : 1,
...
"winningPlan" : {
"stage" : "COLLSCAN",
...
}
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 3,
"executionTimeMillis" : 0,
"totalKeysExamined" : 0,
"totalDocsExamined" : 10,
"executionStages" : {
"stage" : "COLLSCAN",
...
},
...
},
...
}
More details about this can be found here
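As a rough illustration of how to read such output, the sketch below walks the winning plan and reports whether it contains a collection scan. summarizeExplain and planStages are hypothetical helpers, not part of MongoDB; they only assume the field names visible in the explain output above (winningPlan, stage, inputStage, nReturned, totalDocsExamined).

```javascript
// Collect the stage names of a plan by following inputStage / inputStages.
function planStages(plan) {
  if (!plan || typeof plan !== "object") return [];
  const children = [];
  if (plan.inputStage) children.push(plan.inputStage);
  if (Array.isArray(plan.inputStages)) children.push(...plan.inputStages);
  return [plan.stage, ...children.flatMap(planStages)];
}

// Hypothetical summary helper built on the fields shown in the output above.
function summarizeExplain(explain) {
  const stages = planStages(explain.queryPlanner.winningPlan);
  const stats = explain.executionStats;
  return {
    stages,
    usesCollectionScan: stages.includes("COLLSCAN"),
    // How many documents were read for each document returned.
    docsExaminedPerReturned:
      stats.nReturned > 0 ? stats.totalDocsExamined / stats.nReturned : null,
  };
}

// Trimmed copy of the explain output above.
const sample = {
  queryPlanner: { winningPlan: { stage: "COLLSCAN" } },
  executionStats: { nReturned: 3, totalKeysExamined: 0, totalDocsExamined: 10 },
};
const summary = summarizeExplain(sample);
console.log(summary);
```

A COLLSCAN stage together with a high ratio of examined to returned documents is usually a sign that the query is not supported by an index.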
How efficient is search by _id and indexes
To answer your question: a query that can use an index is generally much more efficient than one that cannot. Indexes are special data structures that store a small portion of the collection's data set in an easy-to-traverse form. Since _id is indexed by default in MongoDB, lookups by _id are efficient.
Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement.
So, YES, using indexes like _id is better!
You can also create your own indexes by using createIndex()
db.collection.createIndex( <key and index type specification>, <options> )
Optimize your MongoDB query
In case you want to optimize your query, there are multiple ways to do that.
Creating custom indexes to support your queries
Limit the Number of Query Results to Reduce Network Demand
db.posts.find().sort( { timestamp : -1 } ).limit(10)
Use Projections to Return Only Necessary Data
db.posts.find( {}, { timestamp : 1 , title : 1 , author : 1 , abstract : 1} ).sort( { timestamp : -1 } )
Use $hint to Select a Particular Index
db.users.find().hint( { age: 1 } )
Short answer, yes _id is the primary key and it's indexed. Of course it's fast.
But you can use an index on the other fields too and get more efficient queries.

MongoDB 3.4.5: Hyphen-Minus does not work with $text searches

I use MongoDB version 3.4.5 and I tried to exclude a term with - (minus).
For some reason it does not work.
These are my tries:
db.Product.find()
{ "_id" : ObjectId("59cbfcd01889a9fd89a3565c"), "name" : "Produkt Neu", ...
{ "_id" : ObjectId("59cc7d941889a4f4c2f43b14"), "name" : "Produkt2", ...
db.Product.find( { $text: { $search: 'Produkt -Neu' } } );
db.Product.find( { $text: { $search: "Produkt -Neu" } } );
db.Product.find( { $text: { $search: "Produkt2" } } );
{ "_id" : ObjectId("59cc7d941889a4f4c2f43b14"), "name" : "Produkt2", ...
db.Product.dropIndexes()
db.Product.createIndex({ name: "text" })
{
"nIndexesWas" : 2,
"msg" : "non-_id indexes dropped for collection",
"ok" : 1
}
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}
db.Product.find( { $text: { $search: "Produkt -Neu" } } );
db.Product.find( { $text: { $search: "Produkt Neu" } } );
{ "_id" : ObjectId("59cbfcd01889a9fd89a3565c"), "name" : "Produkt Neu", ...
Does anyone know what I have to do to get it to work with - (minus)?
I created a collection: Product with the following documents ...
{
"_id" : ObjectId("59d0ada3c26584cd8b79fc51"),
"name" : "Produkt Neu"
}
{
"_id" : ObjectId("59d0adafc26584cd8b79fc54"),
"name" : "Produkt2"
}
... and I declared a text index on this collection as follows:
db.Product.createIndex({ name: "text" })
I ran the following queries which faithfully reproduce the situation described in your question:
// returns one document since there is one document
// which has the text indexed value: "Produkt Neu"
db.Product.find( { $text: { $search: "Produkt Neu" } } );
// returns no documents: the only document matching "Produkt" also contains
// the negated term "Neu", and "Produkt" does not match "Produkt2"
db.Product.find( { $text: { $search: "Produkt -Neu" } } )
You are, I think, expecting this query ...
db.Product.find( { $text: { $search: "Produkt -Neu" } } )
... to return the second document on the grounds that excluding Neu should allow a match on the document having name=Produkt2 but this is not how MongoDB $text searches work. MongoDB $text searches do not support partial matching so the search term Produkt -Neu (which evaluates as Produkt) will not match Produkt2. To verify this, I ran the following query:
db.Product.find( { $text: { $search: "Produkt2 -Neu" } } )
This query returns the second document (i.e. the one with name=Produkt2) which proves that the hyphen-minus (-) successfully negated the term: Neu.
On a side note, MongoDB text indexes do support language stemming. To verify this behaviour I added the following document...
{
"_id" : ObjectId("59d0b2b4c26584cd8b79fd7c"),
"name" : "Produkts"
}
...and then ran this query ...
db.Product.find( { $text: { $search: "Produkt -Neu" } } );
This query returns the document with name=Produkts because Produkt is a stem of Produkts.
In summary, a $text search will find matches where each search term either (a) matches a whole word in the text index or (b) is a recognised stem of a whole word in the text index. Note: there are also phrase matches but those are not relevant to the examples in your question. Use of the hyphen-minus serves to change the search terms but it does not change how the search term is evaluated.
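The whole-word behaviour can be mimicked with a deliberately simplified model. This is not MongoDB's implementation: textMatch and tokenize are hypothetical names, and the model ignores stemming, stop words and phrases. It only shows why "Produkt -Neu" cannot match "Produkt2".

```javascript
// Simplified model of $text matching: whole words, case-insensitive.
// NOT MongoDB's implementation: no stemming, no stop words, no phrases.
function tokenize(text) {
  return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

function textMatch(value, search) {
  const words = new Set(tokenize(value));
  const terms = search.split(/\s+/).filter(Boolean);
  const negated = terms
    .filter((t) => t.startsWith("-"))
    .map((t) => t.slice(1).toLowerCase());
  const wanted = terms
    .filter((t) => !t.startsWith("-"))
    .map((t) => t.toLowerCase());
  if (negated.some((t) => words.has(t))) return false; // a negated term excludes the document
  return wanted.some((t) => words.has(t));             // the remaining terms are ORed
}

const names = ["Produkt Neu", "Produkt2"];
console.log(names.filter((n) => textMatch(n, "Produkt -Neu")));  // []
console.log(names.filter((n) => textMatch(n, "Produkt2 -Neu"))); // [ 'Produkt2' ]
```

In this model, as in the queries above, the negation works in both cases; the first search is empty only because "Produkt" is not a whole word of "Produkt2".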
More details in the docs and there is an open issue with MongoDB relating to supporting partial matching on text indexes.
If you really need to support partial matching then you'll probably want to discard the text index and use the $regex operator instead. It's worth noting, though, that index usage with the $regex operator is probably not what you expect. The brief summary is this: if your search pattern is anchored to the start of the string (e.g. /^Produk/ rather than /rodukt/) then MongoDB can use an index efficiently, but otherwise it cannot.
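A start-anchored pattern can be built from user input along these lines. This is a sketch only: prefixFilter and escapeRegex are hypothetical helpers, and the query assumes an ordinary (non-text) index on the name field.

```javascript
// Escape regex metacharacters so that user input is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a case-sensitive, start-anchored filter for a field.
function prefixFilter(field, prefix) {
  return { [field]: { $regex: "^" + escapeRegex(prefix) } };
}

const filter = prefixFilter("name", "Produk");
console.log(JSON.stringify(filter)); // {"name":{"$regex":"^Produk"}}

// In the mongo shell, assuming db.Product.createIndex({ name: 1 }) was run:
//   db.Product.find(filter)
```

The filter is kept case-sensitive on purpose; adding the i option generally prevents the index from being used as effectively.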

Mongo DB, document count mismatch for a collection

I have a collection User in mongo. When I do a count on this collection I got 13204951 documents
> db.User.count()
13204951
But when I tried to find the count of non-stale documents like this I got a count of 13208778
> db.User.find({"_id": {$exists: true, $ne: null}}).count()
13208778
> db.User.find({"UserId": {$exists: true, $ne: null}}).count()
13208778
I even tried to get the count of this collection using MongoEngine
user_list = set(User.objects().values_list('UserId'))
len(user_list)
13208778
Here are the indexes of this User collection
>db.User.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "user_db.User"
},
{
"v" : 1,
"unique" : true,
"key" : {
"UserId" : 1
},
"name" : "UserId_1",
"ns" : "user_db.User",
"sparse" : false,
"background" : true
}
]
Any pointers on how to debug the mismatch in counts from different queries?
Refer to this document:
On a sharded cluster, db.collection.count() can result in an inaccurate count if orphaned documents exist or if a chunk migration is in progress.
Also, refer to this question.
If you are not using a sharded cluster, you can refer to this question.
The basic idea is that db.{collection}.count() without a query may take shortcuts to return a count quickly, so it may not be accurate; a count() with a query should be accurate.

MongoDB text search does not return anything

One of my collections no longer returns anything for some search values. Here is a console dump to illustrate the problem:
meteor:PRIMARY> db['test'].insert({ sku: 'Barrière' });
WriteResult({ "nInserted" : 1 })
meteor:PRIMARY> db['test'].insert({ sku: 'Bannière' });
WriteResult({ "nInserted" : 1 })
meteor:PRIMARY> db['test'].createIndex({ sku: 'text' });
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}
meteor:PRIMARY> db['test'].find({ sku: /ba/i });
{ "_id" : ObjectId("57bbb447fc77800b1e63ba64"), "sku" : "Barrière" }
{ "_id" : ObjectId("57bbb455fc77800b1e63ba65"), "sku" : "Bannière" }
meteor:PRIMARY> db['test'].find({ $text: { $search: 'ba' } });
meteor:PRIMARY> db['test'].find({ $text: { $search: 'Ba' } });
meteor:PRIMARY>
The search returned nothing, even though I clearly added two documents that should match. What's going on? What option/config am I missing?
** Edit **
I tried this query
meteor:PRIMARY> db['test'].find({ $or: [ { $text: { $search: 'ba' } }, { sku: { $regex: 'ba', $options: 'i' } } ] });
Error: error: {
"waitedMS" : NumberLong(0),
"ok" : 0,
"errmsg" : "error processing query: ns=meteor.testTree: $or\n sku regex /ba/\n TEXT : query=ba, language=english, caseSensitive=0, diacriticSensitive=0, tag=NULL\nSort: {}\nProj: {}\n planner returned error: Failed to produce a solution for TEXT under OR - other non-TEXT clauses under OR have to be indexed as well.",
"code" : 2
}
But I'm not sure how I can make an index to search partial values (i.e. using $regex or other operator). Using a third party indexer seems overkill to me... Surely there is a way to perform a full-text search, as well as a pattern match at once?
Is my only solution to perform two queries and merge the results manually?
Try this:
db['test'].insert({ sku: 'ba ba' });
db['test'].find({ $text: { $search: 'ba' } });
Also refer to mongodb document:
If the search string is a space-delimited string, $text operator performs a logical OR search on each term and returns documents that contains any of the terms.
I think MongoDB's $text $search just splits the string on spaces and matches whole words. If you need to search for part of a word, you may need some other framework. You can also use $regex for this.
If the only requirement is to query words by prefix, you can use $regex; it can use an index if you are only querying by prefix. Otherwise it will scan the whole collection.
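If you do end up running a $text query and a $regex query separately, as the question suggests, the two result lists can be merged on the client. The following is a minimal sketch: mergeById is a hypothetical helper, and it assumes that String(_id) is unique per document.

```javascript
// Merge result arrays, keeping the first occurrence of each _id.
// Assumption: String(doc._id) is unique per document.
function mergeById(...resultSets) {
  const seen = new Set();
  const merged = [];
  for (const docs of resultSets) {
    for (const doc of docs) {
      const key = String(doc._id);
      if (!seen.has(key)) {
        seen.add(key);
        merged.push(doc);
      }
    }
  }
  return merged;
}

// Stand-ins for the results of the two queries (made-up _id values).
const textResults = [{ _id: "a1", sku: "ba ba" }];
const regexResults = [
  { _id: "a1", sku: "ba ba" },
  { _id: "b2", sku: "Barrière" },
];
const merged = mergeById(textResults, regexResults);
console.log(merged.map((d) => d.sku));

// In the mongo shell the inputs would come from something like:
//   db['test'].find({ $text: { $search: 'ba' } }).toArray()
//   db['test'].find({ sku: /^ba/i }).toArray()
```

Text matches are listed first here, so they keep priority over regex-only matches in the merged list.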

mongoDB prefix wildcard: fulltext-search ($text) find part with search-string

I have mongodb with a $text-Index and elements like this:
{
foo: "my super cool item"
}
{
foo: "your not so cool item"
}
If I search with
mycoll.find({ $text: { $search: "super"} })
I get the first item (correct).
But I also want to search with "uper" and get the first item. If I try:
mycoll.find({ $text: { $search: "uper"} })
I don't get any results.
My question:
Is there a way to use $text so that it finds results containing the search string as part of a word (e.g. like '%uper%' in MySQL)?
Attention: I am not asking for a regex-only search; I am asking for a regex search within a $text search!
It's not possible to do it with $text operator.
Text indexes are built from the terms contained in a string value or in an array of strings, and the search is based on those indexed terms.
You can group terms into a phrase, but you cannot match only part of a term.
Read the $text operator reference and the text indexes description.
The best solution is to use both a text index and a regex.
The index will provide excellent speed but won't match as many documents as a regex.
The regex provides a fallback in case the index doesn't return enough results.
db.mycoll.createIndex({ foo: 'text' });
db.mycoll.createIndex({ foo: 1 });
db.mycoll.find({
$or: [
{ $text: { $search: 'uper' } },
{ foo: { $regex: 'uper' } }
]
});
For even better performance (but slightly different results), use ^ inside the regex:
db.mycoll.find({
$or: [
{ $text: { $search: 'uper' } },
{ foo: { $regex: '^uper' } }
]
});
What you are trying to do in your second example is prefix wildcard search in your collection mycoll on field foo. This is not something the textsearch feature is designed for and it is not possible to do it with $text operator. This behaviour does not include wildcard prefix search on any given token in the indexed field. However you can alternatively perform regex search as others suggested. Here is my walkthrough:
>db.mycoll.find()
{ "_id" : ObjectId("53add9364dfbffa0471c6e8e"), "foo" : "my super cool item" }
{ "_id" : ObjectId("53add9674dfbffa0471c6e8f"), "foo" : "your not so cool item" }
> db.mycoll.find({ $text: { $search: "super"} })
{ "_id" : ObjectId("53add9364dfbffa0471c6e8e"), "foo" : "my super cool item" }
> db.mycoll.count({ $text: { $search: "uper"} })
0
The $text operator supports searching for a single word, for one or more words, or for a phrase. The kind of search you want is not supported.
The regex solution:
> db.mycoll.find({foo:/uper/})
{ "_id" : ObjectId("53add9364dfbffa0471c6e8e"), "foo" : "my super cool item" }
>
The answer to your final question: to do mysql style %super% in mongoDB you would most likely have to do:
db.mycoll.find( { foo : /.*super.*/ } );
It should work with /uper/.
See http://docs.mongodb.org/manual/reference/operator/query/regex/ for details.
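For readers coming from SQL, the mapping from a LIKE pattern to a regex can be sketched as follows. likeToRegex is a hypothetical helper, not part of MongoDB; it handles only the % and _ wildcards and ignores LIKE escape syntax.

```javascript
// Escape regex metacharacters so that literal characters stay literal.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: translate a SQL LIKE pattern into a RegExp.
// % matches any run of characters, _ matches exactly one character.
function likeToRegex(pattern) {
  let body = "";
  for (const ch of pattern) {
    if (ch === "%") body += ".*";
    else if (ch === "_") body += ".";
    else body += escapeRegex(ch);
  }
  return new RegExp("^" + body + "$", "s"); // "s": let . also match newlines
}

const re = likeToRegex("%uper%");
console.log(re.test("my super cool item"));    // true
console.log(re.test("your not so cool item")); // false
```

Because '%uper%' has a wildcard on both sides, the resulting pattern matches the same strings as the shorter /uper/.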
Edit:
As per request in the comments:
The solution wasn't necessarily meant to actually give what the OP requested, but what he needed to solve the problem.
Since $regex searches don't work with text indices, a simple regex search over an indexed field should give the expected result, though not using the requested means.
Actually, it is pretty easy to do this:
db.collection.insert( {foo: "my super cool item"} )
db.collection.insert( {foo: "your not so cool item"})
db.collection.ensureIndex({ foo: 1 })
db.collection.find({'foo': /uper/})
gives us the expected result:
{ "_id" : ObjectId("557f3ba4c1664dadf9fcfe47"), "foo" : "my super cool item" }
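An added explain shows that the index was used. Note that the index bounds span the whole index, so this is a full index scan with the regex applied as a filter rather than a targeted lookup: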
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.collection",
"indexFilterSet" : false,
"parsedQuery" : {
"foo" : /uper/
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"filter" : {
"foo" : /uper/
},
"keyPattern" : {
"foo" : 1
},
"indexName" : "foo_1",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"foo" : [
"[\"\", {})",
"[/uper/, /uper/]"
]
}
}
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
// skipped
},
"ok" : 1
}
To make a long story short: no, you cannot reuse a $text index, but you can still run the query reasonably efficiently. As described in "Implement auto-complete feature using MongoDB search", one could probably be even more efficient by using a map/reduce approach, eliminating redundancy and unnecessary stop words from the indices, at the cost of no longer being real time.
As francadaval said, a text index searches by terms, but if you combine a regex and a text index you should be good.
mycoll.find({$or: [
{
$text: {
$search: "super"
}
},
{
'column-name': {
$regex: 'uper',
$options: 'i'
}
}
]})
Also, make sure that the column has a normal index in addition to the text index.
With a regex alone you can match "super cool" but not "super item"; to cover both, combine $text and $regex on the search term in an $or query.
Make sure both the text index and the normal index exist for this to work.
You could achieve it like this:
db.mycoll.find( {foo: { $regex : /uper/i } })
Here 'i' is an option that makes the search case-insensitive.