I went through many links, such as "How to create full text search query in mongodb with spring-data?", but did not find the correct approach.
I have an Employee collection which holds 1000 documents. I want to support a case-insensitive search so that when I search for ra, I get names such as Ravi, Ram, rasika, etc.
I used the logic below, which works fine, but I would like to understand its performance. Is there a better solution than this?
Query query = new Query(Criteria.where("employeeName").regex("^"+employeeName, "i"));
You can create an index on the field that the regular expression filter is applied to. For example, consider these documents in a person collection:
{ "name" : "ravi" }
{ "name" : "ram" }
{ "name" : "John" }
{ "name" : "renu" }
{ "name" : "Raj" }
{ "name" : "peter" }
The following query (run from Mongo Shell) finds and fetches the four documents with the names starting with the letter "r" or "R":
db.person.find( { name: { $regex: "^r", $options: "i" } } )
But without an index on the name field, the query performs a collection scan. So, create an index on the field:
db.person.createIndex( { name: 1 } )
Now run the same query again and generate a query plan for it using explain(). The query plan shows an IXSCAN (an index scan) instead of a collection scan, which is generally the faster of the two.
Note that prefix searches (as in the query above, using ^) on indexed fields result in faster queries.
From the documentation:
For case sensitive regular expression queries, if an index exists for
the field, then MongoDB matches the regular expression against the
values in the index, which can be faster than a collection scan.
Further optimization can occur if the regular expression is a “prefix
expression”, which means that all potential matches start with the
same string. This allows MongoDB to construct a “range” from that
prefix and only match against those values from the index that fall
within that range.
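The range construction described in the quote can be pictured with a small plain-JavaScript sketch. It is not MongoDB's implementation: prefixRange and scanPrefix are hypothetical helpers, a sorted array stands in for the index, the prefix is assumed to be non-empty, and strings are compared the way JavaScript compares them (by UTF-16 code unit), which makes the match case sensitive.

```javascript
// Hypothetical helper: turn a case-sensitive prefix into a half-open range
// [lower, upper) by bumping the last character of the prefix by one.
// Assumes a non-empty prefix.
function prefixRange(prefix) {
  const last = prefix.charCodeAt(prefix.length - 1);
  const upper = prefix.slice(0, -1) + String.fromCharCode(last + 1);
  return { lower: prefix, upper };
}

// Stand-in for an index: the indexed values kept in sorted order.
const indexKeys = ["ravi", "ram", "John", "renu", "Raj", "peter"].sort();

// Keep only the keys inside the range. A real B-tree would seek straight to
// `lower` and stop at `upper`; filter() is used here only for brevity.
function scanPrefix(keys, prefix) {
  const { lower, upper } = prefixRange(prefix);
  return keys.filter((key) => key >= lower && key < upper);
}

console.log(prefixRange("ra"));
console.log(scanPrefix(indexKeys, "ra"));
```

Because the comparison is case sensitive, "Raj" falls outside the range for the prefix "r", which is one way to see why a case-insensitive pattern cannot be reduced to a single range like this.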
Although the documentation says the following, the query I ran did use the index, and the query plan generated with explain() showed an index scan:
Case insensitive regular expression queries generally cannot use
indexes effectively. The $regex implementation is not collation-aware
and is unable to utilize case-insensitive indexes.
Related
What is wrong with this query?
db.collection.find( { "name" : "/^test$/i", "group" : "/^Default$/i"} )
I am trying to find an object with name=test and group=default, ignoring case.
But I am not getting the result, although I know this document is in the database.
I wrote the query exactly as explained on the MongoDB website:
In MongoDB, you can also use regular expression objects (i.e. /pattern/) to specify regular expressions:
{ <field>: /pattern/<options> }
The query is essentially right; you just have a minor syntax error.
In JavaScript (which the mongo shell is based on), a regex is written as /xxx/ and not "/xxx/", the latter being a string.
So just change your query to this:
db.collection.find( { "name" : /^test$/i, "group" : /^Default$/i} )
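The difference between the two forms can be checked in plain JavaScript, without a database. This is only an illustration of the language behaviour; the variable names are made up for the example.

```javascript
const asRegex = /^test$/i;     // a RegExp object: a pattern to match against
const asString = "/^test$/i";  // an ordinary string that merely looks like one

// The regex matches "Test" regardless of case.
const regexMatches = asRegex.test("Test");

// The string is compared literally, so it only equals the exact
// nine characters /^test$/i and never the value "Test".
const stringMatches = asString === "Test";

console.log(asRegex instanceof RegExp, typeof asString);
console.log(regexMatches, stringMatches);
```

With the quoted form, the query looks for documents whose name is literally the text /^test$/i, which is why nothing is returned.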
I'm new to Mongo.
Entity:
{
"sender": {
"id": <unique key inside type>,
"type": <enum value>,
},
"recipient": {
"id": <unique key inside type>,
"type": <enum value>,
},
...
}
I need an efficient search for the query "find entities where the sender or the recipient is one of the users in a collection", with paging:
foreach member in memberIHaveAccessTo:
condition ||= member == recipient || member == sender
I have read a bit about Mongo indexes. My problem can probably be solved by storing an additional field "members", an array containing the sender and the recipient, and then creating an index on this array.
Is it possible to build such an index with Mongo?
Is Mongo a good choice for indexes like this?
Some thoughts about the issues raised in the question about querying and the application of indexes on the queried fields.
(i) The $or and two indexes:
I need to create effective search by query "find entities where sender
or recipient equal to user from collection...
Your query is going to be like this:
db.test.find( { $or: [ { "sender.id": "someid" }, { "recipient.id": "someid" } ] } )
With two individual indexes defined, one on "sender.id" and one on "recipient.id", the query with the $or operator will use both indexes.
From the docs ($or Clauses and Indexes):
When evaluating the clauses in the $or expression, MongoDB either
performs a collection scan or, if all the clauses are supported by
indexes, MongoDB performs index scans.
Running the query with an explain() and examining the query plan shows that indexes are used for both the conditions.
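To picture what "using both indexes" means, here is a minimal plain-JavaScript sketch, not MongoDB code. The sample documents, buildIndex and findBySenderOrRecipient are made up for illustration, and each Map is only a stand-in for a single-field index.

```javascript
// Sample documents (made up for illustration).
const entities = [
  { _id: 1, sender: { id: "u1" }, recipient: { id: "u2" } },
  { _id: 2, sender: { id: "u3" }, recipient: { id: "u1" } },
  { _id: 3, sender: { id: "u1" }, recipient: { id: "u1" } },
  { _id: 4, sender: { id: "u2" }, recipient: { id: "u3" } },
];

// Hypothetical stand-in for a single-field index: value -> list of _ids.
function buildIndex(collection, getKey) {
  const index = new Map();
  for (const doc of collection) {
    const key = getKey(doc);
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc._id);
  }
  return index;
}

const senderIndex = buildIndex(entities, (doc) => doc.sender.id);
const recipientIndex = buildIndex(entities, (doc) => doc.recipient.id);

// One lookup per $or clause, then a union that keeps each _id only once.
function findBySenderOrRecipient(userId) {
  const ids = new Set([
    ...(senderIndex.get(userId) || []),
    ...(recipientIndex.get(userId) || []),
  ]);
  return [...ids].sort((a, b) => a - b);
}

console.log(findBySenderOrRecipient("u1"));
```

Document 3 matches both clauses but appears once in the result, because the union is taken over a Set of _ids.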
(ii) Index on members array:
Probably my problem can be solved by storing an additional field
"members", which will be an array containing the sender and recipient,
and then creating an index on this array...
With the members array field, the query will be like this:
db.test.find( { members_array: "someid" } )
When an index is defined on the members_array field, the query will use the index; the generated query plan shows the index usage. Note that an index defined on an array field is referred to as a multikey index.
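As a rough mental model of a multikey index, the plain-JavaScript sketch below creates one entry per array element, each pointing back at the document. It is not MongoDB's implementation: the sample documents and buildMultikeyIndex are hypothetical, and a Map stands in for the index.

```javascript
// Sample documents (made up); members_array holds the sender and recipient ids.
const messages = [
  { _id: 1, members_array: ["u1", "u2"] },
  { _id: 2, members_array: ["u3", "u1"] },
  { _id: 3, members_array: ["u2", "u3"] },
];

// Hypothetical stand-in for a multikey index: array element -> list of _ids.
function buildMultikeyIndex(collection, field) {
  const index = new Map();
  for (const doc of collection) {
    // In this sketch, a document gets one entry per distinct element.
    for (const element of new Set(doc[field])) {
      if (!index.has(element)) index.set(element, []);
      index.get(element).push(doc._id);
    }
  }
  return index;
}

const membersIndex = buildMultikeyIndex(messages, "members_array");

// Roughly what find({ members_array: "u1" }) needs: a single lookup.
const idsForU1 = membersIndex.get("u1") || [];
console.log(idsForU1);
```

Compared with the $or approach, this needs one index and one lookup, at the cost of storing the extra array field.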
I am using mongodb 2.6, and I have the following query:
db.getCollection('Jobs').find(
{ $and: [ { RunID: { $regex: ".*_0" } },
{ $or: [ { JobType: "TypeX" },
{ JobType: "TypeY" },
{ JobType: "TypeZ" },
{ $and: [ { Info: { $regex: "Weekly.*" } }, { JobType: "YetAnotherType" } ] } ] } ] })
I have three different indexes: RunID, RunID + JobType, and RunID + JobType + Info. Mongo always uses the index containing only RunID, although the other indexes seem more likely to produce faster results. It sometimes even uses an index on RunID + StartTime, although StartTime is not among the queried fields. Any idea why it chooses that index?
Note1:
You can drop your first two indexes, RunID and RunID + JobType. The compound index RunID + JobType + Info is enough; it can also be used for queries on RunID alone or on RunID + JobType. From the documentation:
In addition to supporting queries that match on all the index fields,
compound indexes can support queries that match on the prefix of the
index fields.
When you drop those indexes, Mongo will choose the only remaining index.
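The prefix rule quoted above can be spelled out with a short plain-JavaScript sketch. indexPrefixes and isFullPrefixMatch are hypothetical helpers, not part of any MongoDB API, and the check is deliberately strict: it only answers whether the queried fields are exactly one of the index prefixes and says nothing about which plan the query planner would actually pick.

```javascript
// List the prefixes of a compound index, given its fields in index order.
function indexPrefixes(indexFields) {
  return indexFields.map((_, i) => indexFields.slice(0, i + 1));
}

// Hypothetical check: true when the queried fields are exactly one of the
// index prefixes (the order of fields in the query does not matter).
function isFullPrefixMatch(indexFields, queryFields) {
  const wanted = new Set(queryFields);
  return indexPrefixes(indexFields).some(
    (prefix) =>
      prefix.length === wanted.size && prefix.every((field) => wanted.has(field))
  );
}

const compound = ["RunID", "JobType", "Info"];

console.log(indexPrefixes(compound));
console.log(isFullPrefixMatch(compound, ["RunID"]));
console.log(isFullPrefixMatch(compound, ["JobType", "Info"]));
```

In this model, RunID alone and RunID + JobType are prefixes of the compound index, while JobType + Info is not, because it skips the leading RunID field.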
Note2:
You can always use hint, to tell mongo to use a specific index:
db.getCollection('Jobs').find().hint({RunID:1, JobType:1, Info:1})
Thanks to Sergiu's answer and Sammaye's comment, I think I found what I am looking for:
I got rid of the RunID index: since RunID is a prefix of many other indexes, MongoDB can use one of those when it needs only RunID.
Concerning $or, we have the following in the documentation:
When evaluating the clauses in the $or expression, MongoDB either
performs a collection scan or, if all the clauses are supported by
indexes, MongoDB performs index scans. That is, for MongoDB to use
indexes to evaluate an $or expression, all the clauses in the $or
expression must be supported by indexes. Otherwise, MongoDB will
perform a collection scan.
As I mentioned earlier, RunID is already indexed, so we need a new index for the other fields in the query: JobType and Info. JobType needs to be the index's prefix so that the index can also be used in queries that do not contain the Info field, so the second index I created is:
{ "JobType": 1.0, "Info": 1.0}
As a result, mongodb will use a complex plan in which different indexes will be used.
I've read the MongoDB documentation on getting the indexes within a collection, and have also searched SO and Google for my question. I want to get the actual indexed values.
Or maybe my understanding of how MongoDB indexes work is incorrect. If I've been indexing a field called text that contains paragraphs, am I right in thinking that what gets indexed is each word in the paragraph?
In either case, I want to retrieve the values that were indexed, which db.collection.getIndexes() doesn't seem to return.
Well yes and no, in summary.
Indexes work on the "values" of the fields they are supplied to index, and are much like a "card index" in that there is a point of reference to look at to find the location of something that matches that term.
What "you" seem to be asking about here is "text indexes". This is a special index format in MongoDB and other databases as well that looks at the "text" content of a field and breaks down every "word" in that content into a value in that "index".
Typically we do:
db.collection.createIndex({ "text": "text" })
Where the "field name" here is "text" as you asked, but more importantly the type of index here is "text".
This allows you to then insert data like this:
db.collection.insert({ "text": "The quick brown fox jumped over the lazy dog" })
And then search like this, using the $text operator:
db.collection.find({ "$text": { "$search": "brown fox" } })
This returns the documents that match the terms you gave in your query, and the matches can be "ranked" by how well they matched the "text" of your field in the index on your collection.
Note that a "text" index and its query do not operate on one specific field; the index itself can be built over multiple fields. The main constraint on the "index" is that there can "only be one" text index on any given collection, otherwise errors will occur.
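To make "breaks down every word into a value in the index" concrete, here is a deliberately simplified plain-JavaScript sketch of an inverted index. It is an assumption-laden toy, not how MongoDB builds text indexes: tokenize, buildTextIndex and search are hypothetical helpers, the sample documents are made up, there is no stemming or stop-word handling, and the "score" is just the number of query terms a document contains.

```javascript
// Split text into lowercase word tokens (ASCII letters and digits only).
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Hypothetical inverted index: term -> Set of _ids containing that term.
function buildTextIndex(collection, field) {
  const index = new Map();
  for (const doc of collection) {
    for (const term of tokenize(doc[field])) {
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(doc._id);
    }
  }
  return index;
}

// Match documents containing ANY query term; order by matched-term count.
function search(index, query) {
  const scores = new Map();
  for (const term of new Set(tokenize(query))) {
    for (const id of index.get(term) || []) {
      scores.set(id, (scores.get(id) || 0) + 1);
    }
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1] || a[0] - b[0])
    .map(([id]) => id);
}

const articles = [
  { _id: 1, text: "The quick brown fox jumped over the lazy dog" },
  { _id: 2, text: "A brown bear" },
  { _id: 3, text: "Nothing to see here" },
];
const textIndex = buildTextIndex(articles, "text");
console.log(search(textIndex, "brown fox"));
```

The values stored in such an index are the individual terms, not the original paragraphs, which is why getIndexes() only reports the index definition and not its contents.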
As per mongodb's docs:
"db.collection.getIndexes() returns an array of documents that hold index information for the collection. Index information includes the keys and options used to create the index. For information on the keys and index options, see db.collection.createIndex()."
You first have to create the index on the collection, using the createIndex() method:
db.records.createIndex( { userid: 1 } )
Queries on the userid field are supported by the index:
Example:
db.records.find( { userid: 2 } )
db.records.find( { userid: { $gt: 10 } } )
Indexes help you avoid scanning the whole collection. They are basically references or pointers to specific documents in your collection.
The docs explain it better:
http://docs.mongodb.org/manual/tutorial/create-an-index/
I'm considering bundling time-sequence data together in session documents. Inside each session, there would be an array of events. Each event would have a timestamp. I know that I can create a multikey index on the timestamp of those events, but I'm curious what mechanism MongoDB uses to prevent the same document from showing up twice in one query.
To clarify, imagine a collection of sessions with the following documents:
{
_id: 'A',
events: [
{time: '10:00'},
{time: '15:00'}
]
}
{
_id: 'B',
events: [
{time: '12:00'}
]
}
If I add a multikey index with db.sessions.ensureIndex({'events.time' : 1}), I would expect the b-tree of that index to look like this:
'10:00' => 'A'
'12:00' => 'B'
'15:00' => 'A'
If I query the collection with {'events.time': {$gte: '10:00'}}, MongoDB scans the b-tree and returns:
{ "_id" : "A", "events" : [ { "time" : "10:00" }, { "time" : "15:00" } ] }
{ "_id" : "B", "events" : [ { "time" : "12:00" } ] }
How does Mongo prevent document A from showing up a second time as the third result in the cursor? For small index scans, it could just keep track of which documents had already been seen, but what happens if the index is enormous? Is there ever a case where the same document would show up more than once in a single cursor?
My assumption is that it would not. Mongo could look at the document it is scanning and detect that it already would have matched earlier in the scan by inspecting earlier entries in the indexed array. However, I cannot find any mention of this behavior in the MongoDB documentation, and it is important to actually know what to expect.
(NOTE: I do know that it is possible for a document to show up in a single query more than once if the document is modified while the cursor is being scanned. That shouldn't pose a problem for queries on time-sequence data where timestamps are never edited. Even if a new event is added to a session during a scan, if Mongo uses something like the detection mechanism I mentioned above, it should be able to omit the moved document from query results.)
I cannot find any mention of this behavior in the MongoDB
documentation, and it is important to actually know what to expect.
Implementation internals are seldom mentioned in the documentation, and in any case, what you describe is the expected behavior.
There is code to deduplicate a result set and there are tests to make sure that it's working correctly. After all, a multi-key index isn't the primary use case for such functionality - if you have an $or clause in your query, the results must be de-duplicated as well.
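The general idea of de-duplicating during an index scan can be illustrated in plain JavaScript. This is a sketch of the concept only and makes no claim about how MongoDB implements it: indexEntries mirrors the b-tree layout from the question, and scanFrom is a hypothetical helper that remembers which document ids it has already returned.

```javascript
// Index entries in key order, as laid out in the question.
const indexEntries = [
  { key: "10:00", docId: "A" },
  { key: "12:00", docId: "B" },
  { key: "15:00", docId: "A" },
];

// Walk the entries at or above lowerBound, returning each document id once.
function scanFrom(entries, lowerBound) {
  const seen = new Set();
  const result = [];
  for (const { key, docId } of entries) {
    if (key < lowerBound) continue;  // outside the queried range
    if (seen.has(docId)) continue;   // already returned via an earlier entry
    seen.add(docId);
    result.push(docId);
  }
  return result;
}

console.log(scanFrom(indexEntries, "10:00"));
console.log(scanFrom(indexEntries, "11:00"));
```

Document A has two entries in range for the first scan but is returned once; in the second scan it is reached only through its 15:00 entry, so it comes after B.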