MongoDB $in implementation and complexity

Here is our document schema
{
name: String
}
Here is our query
{
name: {$in: ["Jack", "Tom"]}
}
I believe that even without an index on name, the query engine will turn the $in array into a hash set and check each record for membership as it scans the collection with a COLLSCAN, which is O(n) (n documents, m values in the $in array). It will never do a naive O(m*n) search, right?
I'm trying to find supporting documentation online but I've come up short. I've tried searching in the source code but I can't seem to find the exact section responsible for this either.
If an index exists, I believe the query will use it directly and be faster. If I'm not wrong, it will be O(m*log(n)): for every element of the $in array it gets the matching entries from the B-tree in log(n) time, then returns the union of them all. Although in big-O terms this looks slower than the O(n) hash set approach for large m, it's faster in practice because disk reads are much more expensive.
Is this line of thinking correct when there is an index?
And if there isn't an index, does it do the COLLSCAN with a naive search, or does it use a hash set to speed up the process?

When setting up the query, the $in expression sorts the non-regex elements in the setEqualities function:
if (!std::is_sorted(_originalEqualityVector.begin(),
_originalEqualityVector.end(),
_eltCmp.makeLessThan())) {
std::sort(
_originalEqualityVector.begin(), _originalEqualityVector.end(), _eltCmp.makeLessThan());
}
It then tests the element from each document using the contains function, which uses a binary search:
bool InMatchExpression::contains(const BSONElement& e) const {
return std::binary_search(_equalitySet.begin(), _equalitySet.end(), e, _eltCmp.makeLessThan());
}
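To make the cost concrete, here is a minimal JavaScript sketch of the same idea: sort the $in values once, then binary-search them for each scanned document. It is only an illustration, not the server code; the function names (prepareInList, contains, scan) and the plain string comparator are assumptions made for this example. With n documents and m values, the scan costs O(n log m) after an O(m log m) sort, so it is neither a hash set lookup nor the naive O(m*n) search.

```javascript
// Three-way comparison for strings (a stand-in for the server's element comparator).
function compare(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Sort the $in values once, when the query is set up: O(m log m).
function prepareInList(values) {
  return [...values].sort(compare);
}

// Binary search over the sorted values: O(log m) per tested element.
function contains(sortedValues, target) {
  let lo = 0;
  let hi = sortedValues.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1;
    const c = compare(sortedValues[mid], target);
    if (c === 0) return true;
    if (c < 0) lo = mid + 1;
    else hi = mid - 1;
  }
  return false;
}

// Collection scan: every document is tested once, so the total is O(n log m).
function scan(docs, field, inValues) {
  const sorted = prepareInList(inValues);
  return docs.filter((doc) => contains(sorted, doc[field]));
}

const people = [{ name: "Ann" }, { name: "Tom" }, { name: "Jack" }, { name: "Zed" }];
const matches = scan(people, "name", ["Tom", "Jack"]);
console.log(matches.map((doc) => doc.name)); // [ 'Tom', 'Jack' ]
```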

Related

MongoDB searching on array / indexing

I'm using the airbnb sample set and it has a field that looks like:
"amenities": ["TV", "Cable TV", "Wifi"....
So I'm trying to do a case-INsensitive, wildcard search (on one or more values passed in).
Only thing I've found that works is:
{ amenities: { $in: [ /wi/ ] }}
Is that the best way?
So I ran it in Compass on the dataset as imported (5600 docs); Explain says it took ~20ms on my machine and warned that there was no index. I then created an index on the amenities field, and the same search jumped to ~100ms. I created the index through the Compass UI, so I'm not sure why it's taking 5x as long with an index. Is there a better way to do this?

The way to run that query is:
{ amenities: /wi/i }
//better but not always useful
{ amenities: /wi/i }, { amenities:1, _id:0 }
A plain regex already matches against each element of the array, and case insensitivity has to be requested through the regex options (the i flag).
For multikey indexes the second query won't be a covered query. Otherwise, it would be blazing fast.
I've tested a similar search with and without an index, though. Execution time was reduced 10x (1500ms to 150ms, in a huge collection), measured with Mongo Hacker.
As you report, executionTimeMilliseconds is not that different, but in my test it was still smaller.
The reason you don't see a huge decrease in time is that the index stores each array entry separately. When it finds a match, it goes back to the collection to fetch the whole array field instead of answering from the index.
So indexes are probably not very useful for this kind of search on arrays.

When querying with an unanchored regex, the query executor will have to scan every index key to see if there is a match.
You might find a collated index to be helpful.
Create an index with the appropriate collation, like:
(strength 1 and 2 are case-insensitive)
db.collection.createIndex({amenities:1},{collation:{locale:"en",strength:1}})
Then query using the same collation:
db.collection.find({amenities:"wifi"}).collation({locale:"en",strength:1})
The search will be case insensitive, and it can efficiently use the index.

Sort results of search by regex in MongoDB

So, there's a web-server that has a number of methods which are used for autocompleting input fields on the client. Methods take a string and scan a specific property of mongodb collection using regexp.
Pretty common stuff, right? But here's a problem - these methods need to sort results based on how close the searched string is to the start of the result string. Like if I searched for countries and typed "ru", "Russia" should come before "Peru".
I don't see how I can sort results like this without performing multiple searches. Now I can only think of something like this
const limit = 20;
const resultsStartOfLine = db.countries.find({name: /^ru/i})
.limit(limit)
.toArray();
const resultsRest = db.countries.find({
name: /ru/i,
_id: {$nin: _.map(resultsStartOfLine, '_id')}
})
.limit(limit - resultsStartOfLine.length)
.toArray();
I know, that Mongo can't do this kind of sort by default, but maybe there's better way to do it?

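As I've learned, searching by regex is usually bad practice because it often can't use indexes efficiently (an unanchored or case-insensitive pattern has to be checked against every key or document) and as a result is pretty slow.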
So I created an index for full-text search and sort results by weights.
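For comparison, the ordering asked for in the question ("Russia" before "Peru" for "ru") can also be produced on the client once the matching names have been fetched. The sketch below is a different technique from the text index above and only an illustration: rankByMatchPosition and escapeRegExp are hypothetical helpers, and it assumes the candidate list is small enough to sort in memory.

```javascript
// Escape regex metacharacters so user input is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Keep names containing the term, ordered by how early the match starts;
// ties are broken alphabetically so the order is deterministic.
function rankByMatchPosition(names, term, limit) {
  const pattern = new RegExp(escapeRegExp(term), "i");
  return names
    .map((name) => ({ name, index: name.search(pattern) }))
    .filter((entry) => entry.index !== -1)
    .sort((a, b) => a.index - b.index || a.name.localeCompare(b.name))
    .slice(0, limit)
    .map((entry) => entry.name);
}

const countries = ["Peru", "Russia", "Belarus", "Rwanda", "Burundi"];
const ranked = rankByMatchPosition(countries, "ru", 20);
console.log(ranked); // [ 'Russia', 'Burundi', 'Peru', 'Belarus' ]
```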

MongoDB: Indexes, Sorting

After reading the official documentation on indexes, sorting and index intersection, I'm a little confused about how everything works together.
I'm having trouble making my query use the indexes I've created. I'm working on MongoDB 3.0.3, on a collection of ~4 million documents.
To simplify, let's say my document is composed of 6 fields:
{
a:<text>,
b:<boolean>,
c:<text>,
d:<boolean>,
e:<date>,
f:<date>
}
The query I want to achieve is the following :
db.mycoll.find({ a:"OK", b:true, c:"ProviderA", d:true, e:{ $gte:ISODate("2016-10-28T12:00:01Z"),$lt:ISODate("2016-10-28T12:00:02") } }).sort({f:1});
So intuitively I've created two indexes
db.mycoll.createIndex({a: 1, b: 1, c: 1, d:1, e:1 }, {background: true,name: "test1"})
db.mycoll.createIndex({f:1}, {background: true,name: "test2"})
But explain() tells me that the first index is not used at all.
I know there is some kind of limitation when ranges are in play in the filter (on the e field), but I can't find my way around it.
Also, instead of a single index on f, I tried a compound index on {e:1,f:1}, but it didn't change anything.
So what have I misunderstood?
Thanks for your support.
Update: I also found the following rule of thumb, written for MongoDB 2.6:
A good rule of thumb for queries with sort is to order the indexed fields in this order:
First, the field(s) on which you will query for exact values.
Second, the field(s) on which you will sort.
Finally, field(s) on which you will query for a range of values (e.g., $gt, $lt, $in)
An example of using this rule of thumb is in the section on “Sorting the results of a complex query on a range of values” below, including a link to further reading.
Does this also apply to 3.x versions?
Update 2: following the rule above, I created the following index
db.mycoll.createIndex({a: 1, b: 1, c: 1, d:1 , f:1, e:1}, {background: true,name: "test1"})
And for the same query :
db.mycoll.find({ a:"OK", b:true, c:"ProviderA", d:true, e:{ $gte:ISODate("2016-10-28T12:00:01Z"),$lt:ISODate("2016-10-28T12:00:02") } }).sort({f:1});
the index is indeed used. However, too many keys seem to be scanned; I may need to find a better order for the fields in the query/index.
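The rule of thumb quoted above (equality fields, then sort fields, then range fields) can be sketched as a small helper that assembles the index key pattern. buildIndexSpec and its argument shape are hypothetical, made up for illustration; only the resulting object is what would be passed to createIndex.

```javascript
// Build a compound index key pattern in the order: equality, sort, range.
// `equality` and `range` are arrays of field names; `sort` maps field -> direction.
function buildIndexSpec({ equality = [], sort = {}, range = [] }) {
  const spec = {};
  for (const field of equality) spec[field] = 1;
  for (const [field, direction] of Object.entries(sort)) spec[field] = direction;
  for (const field of range) {
    if (!(field in spec)) spec[field] = 1; // do not override a sort direction
  }
  return spec;
}

// The query from the question: equality on a, b, c, d; sort on f; range on e.
const spec = buildIndexSpec({
  equality: ["a", "b", "c", "d"],
  sort: { f: 1 },
  range: ["e"],
});
console.log(spec); // { a: 1, b: 1, c: 1, d: 1, f: 1, e: 1 }
```

The output matches the index from Update 2; whether that order scans few enough keys still has to be checked with explain().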

Mongo sometimes acts a bit strangely when it comes to index selection.
Mongo decides automatically which index to use. In my experience, the smaller an index is, the more likely it is to be used (especially single-field indexes). Maybe this is because a small index is more often already loaded in RAM? To choose an index, Mongo test-runs the candidate query plans and caches the winner. However, the result is sometimes unexpected.
Therefore, if you know which index to use, you can force a query to use it with hint(). You should try that.

The two indexes used for the query and the sort do not overlap, so MongoDB cannot use them for index intersection:
Index intersection does not apply when the sort() operation requires an index completely separate from the query predicate.

In MongoDB how do you query for records that contain ONLY certain fields and no others

In MongoDB,
To query for records that contain certain fields you can do:
collection.find({'field_name1': {'$exists': true}})
And that will return any record that has the 'field_name1' field set...
But how do you query mongo to find records that contains ONLY 'field_name1' (and no other fields)? I'd like to be able to do this for, say, a list of fields.

The sad answer, as you'll often find with MongoDB and other NoSQL databases is probably that it would be best to structure your data in a way that allows you to query it as simply as possible.
That said, there are ways of doing this, but as far as I know, it requires you execute JavaScript server side. This will be slow, and cannot possibly take advantage of indexes and other logical features of MongoDB, so use it only if it's absolutely necessary, if performance is at all important.
So, the easiest way to do this, is probably to create a function that returns the number of fields in an object, which we can use with the $where query syntax. This allows you to run arbitrary JavaScript queries against your data, and can be combined with normal syntax queries.
Sadly, my JavaScript-fu is a little weak, so I don't know how (or if) you can get at the count of members of an object in JS in a one-liner, so to do this, I would store a function server side.
From the mongo shell, execute the following:
db.system.js.save(
{
"_id" : "countFields",
// count the fields of x; local variables avoid leaking globals
"value" : function(x) { var i = 0; for (var p in x) { i++; } return i; }
}
)
With that, you have a saved JavaScript function, server side, called countFields that returns the number of elements in an object. Now, you need to execute your find-operation with the $where query:
db.collection.find({
'field_name1': {'$exists': true},
'$where' : 'countFields(this)==2'
})
This would give you only the documents that meet both the $exists condition, and the $where clause. Note that I'm comparing with 2 in the example, since the countFields function counts _id as a field.
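For reference, the field count can be written as a one-liner in JavaScript with Object.keys. The sketch below runs against plain in-memory objects rather than a server, and also covers the "list of fields" case from the question; hasOnlyFields and the sample documents are hypothetical names and data for illustration.

```javascript
// Number of fields in a document, including _id.
function countFields(doc) {
  return Object.keys(doc).length;
}

// True when the document has exactly the allowed fields (ignoring _id).
function hasOnlyFields(doc, allowed) {
  const keys = Object.keys(doc).filter((key) => key !== "_id");
  return keys.length === allowed.length && allowed.every((field) => keys.includes(field));
}

const sampleDocs = [
  { _id: 1, field_name1: "a" },
  { _id: 2, field_name1: "b", other: 1 },
  { _id: 3, other: 2 },
];

const onlyField1 = sampleDocs.filter((doc) => hasOnlyFields(doc, ["field_name1"]));
console.log(onlyField1.map((doc) => doc._id)); // [ 1 ]
console.log(countFields(sampleDocs[0])); // 2
```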

Slow pagination over tons of records in mongodb

I have over 300k records in one collection in Mongo.
When I run this very simple query:
db.myCollection.find().limit(5);
It takes only a few milliseconds.
But when I use skip in the query:
db.myCollection.find().skip(200000).limit(5)
It won't return anything... it runs for minutes and returns nothing.
How to make it better?

One approach to this problem, if you have large quantities of documents and you are displaying them in sorted order (I'm not sure how useful skip is if you're not) would be to use the key you're sorting on to select the next page of results.
So if you start with
db.myCollection.find().limit(100).sort({created_date: 1});
and then extract the created date of the last document returned by the cursor into a variable max_created_date_from_last_result, you can get the next page with the far more efficient (presuming you have an index on created_date) query
db.myCollection.find({created_date : { $gt : max_created_date_from_last_result } }).limit(100).sort({created_date: 1});
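The paging loop this describes can be sketched without a server. The example below is a minimal in-memory illustration, assuming distinct created_date values; fetchPage, fetchAllPages and the events array are hypothetical stand-ins for the find/sort/limit query and the collection.

```javascript
// Ten documents with increasing created_date, standing in for a collection.
const events = Array.from({ length: 10 }, (_, i) => ({
  _id: i + 1,
  created_date: new Date(Date.UTC(2020, 0, i + 1)),
}));

// Equivalent of find({created_date: {$gt: last}}).sort({created_date: 1}).limit(pageSize).
function fetchPage(collection, lastCreatedDate, pageSize) {
  return collection
    .filter((doc) => lastCreatedDate === null || doc.created_date > lastCreatedDate)
    .sort((a, b) => a.created_date - b.created_date)
    .slice(0, pageSize);
}

// Walk the collection page by page, carrying the last sort key forward.
function fetchAllPages(collection, pageSize) {
  const pages = [];
  let last = null;
  for (;;) {
    const page = fetchPage(collection, last, pageSize);
    if (page.length === 0) break;
    pages.push(page.map((doc) => doc._id));
    last = page[page.length - 1].created_date;
  }
  return pages;
}

const allPages = fetchAllPages(events, 4);
console.log(allPages); // [ [ 1, 2, 3, 4 ], [ 5, 6, 7, 8 ], [ 9, 10 ] ]
```

If several documents can share the same created_date, a $gt on that field alone would skip the ones that straddle a page boundary, so a unique tie-breaker such as _id would need to be part of the sort and the range condition.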

From MongoDB documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases skip will become slower and more cpu intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow you to easily jump to a specific page.
You have to ask yourself a question: how often do you need the 40000th page?

I found it performant to combine the two concepts together (both a skip+limit and a find+limit). The problem with skip+limit is poor performance when you have a lot of docs (especially larger docs). The problem with find+limit is you can't jump to an arbitrary page. I want to be able to paginate without doing it sequentially.
The steps I take are:
1. Create an index based on how you want to sort your docs, or just use the default _id index (which is what I used)
2. Know the starting value, page size and the page you want to jump to
3. Project + skip + limit the value you should start from
4. Find + limit the page's results
It looks roughly like this if I want to get page 5432 of 16 records (in javascript):
let page = 5432;
let page_size = 16;
let skip_size = page * page_size;
let retval = await db.collection(...).find().sort({ "_id": 1 }).project({ "_id": 1 }).skip(skip_size).limit(1).toArray();
let start_id = retval[0]._id;
retval = await db.collection(...).find({ "_id": { "$gte": new mongo.ObjectID(start_id) } }).sort({ "_id": 1 }).project(...).limit(page_size).toArray();
This works because a skip on a projected index is very fast even if you are skipping millions of records (which is what I'm doing). If you run explain("executionStats"), it still has a large number for totalDocsExamined but because of the projection on an index, it's extremely fast (essentially, the data blobs are never examined). Then with the value for the start of the page in hand, you can fetch the next page very quickly.

I combined two answers.
The problem is that when you use skip and limit without sort, pagination just follows the order in which the data was written to the collection, so the engine first has to build a temporary index. It is better to use the ready-made _id index :) You need to sort by _id. Then it is very quick, even with large collections, like:
db.myCollection.find().skip(4000000).limit(1).sort({ "_id": 1 });
In PHP it will be
$manager = new \MongoDB\Driver\Manager("mongodb://localhost:27017", []);
$options = [
'sort' => array('_id' => 1),
'limit' => $limit,
'skip' => $skip,
];
$where = [];
$query = new \MongoDB\Driver\Query($where, $options );
$get = $manager->executeQuery("namedb.namecollection", $query);

I'm going to suggest a more radical approach. Combine skip/limit (as an edge case really) with sort range based buckets and base the pages not on a fixed number of documents, but a range of time (or whatever your sort is). So you have top-level pages that are each range of time and you have sub-pages within that range of time if you need to skip/limit, but I suspect the buckets can be made small enough to not need skip/limit at all. By using the sort index this avoids the cursor traversing the entire inventory to reach the final page.

My collection has around 1.3M documents (not that big) and is properly indexed, but it still takes a big performance hit from this issue.
After reading the other answers, the way forward is clear: the paginated collection must be sorted by a counting integer, similar to an auto-increment value in SQL, instead of a time-based value.
The problem is with skip; there is no way around it. If you use skip, you are bound to hit the issue as your collection grows.
Using a counting integer with an index allows you to jump using the index instead of skip. This won't work with time-based value because you can't calculate where to jump based on time, so skipping is the only option in the latter case.
On the other hand,
by assigning a counting number for each document, the write performance would take a hit; because all documents must be inserted sequentially. This is fine with my use case, but I know the solution is not for everyone.
The most upvoted answer doesn't seem applicable to my situation, but this one does. (I need to be able to seek forward by arbitrary page number, not just one at a time.)
Plus, it gets harder if you have to handle deletes, but it is still possible because MongoDB supports $inc with a negative value for batch updating. Luckily I don't have to deal with deletion in the app I am maintaining.
Just write this down as a note to my future self. It is probably too much hassle to fix this issue with the current application I am dealing with, but next time, I'll build a better one if I were to encounter a similar situation.
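The jump-by-counter idea can be sketched as follows. The field name seq, the helper names and the in-memory array are assumptions for illustration; the sketch also assumes the counter starts at 1 and has no gaps, which is exactly what deletes would break.

```javascript
// Range of counter values covered by a 1-based page number.
function pageBounds(page, pageSize) {
  const first = (page - 1) * pageSize + 1;
  return { $gte: first, $lte: first + pageSize - 1 };
}

// In-memory stand-in for something like find({seq: bounds}).sort({seq: 1}).
function findPage(collection, page, pageSize) {
  const bounds = pageBounds(page, pageSize);
  return collection
    .filter((doc) => doc.seq >= bounds.$gte && doc.seq <= bounds.$lte)
    .sort((a, b) => a.seq - b.seq);
}

// Fifty documents numbered 1..50 by a gap-free counter.
const numbered = Array.from({ length: 50 }, (_, i) => ({ seq: i + 1 }));

const thirdPage = findPage(numbered, 3, 16).map((doc) => doc.seq);
console.log(thirdPage[0], thirdPage[thirdPage.length - 1]); // 33 48
console.log(pageBounds(5432, 16)); // { '$gte': 86897, '$lte': 86912 }
```

With an index on the counter field, such a range condition lets the server seek straight to the page instead of walking past every skipped document.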

If your documents use MongoDB's default _id, which is an ObjectId, use it instead. This is probably the most viable option for most projects anyway.
As stated from the official mongo docs:
The skip() method requires the server to scan from the beginning of
the input results set before beginning to return results. As the
offset increases, skip() will become slower.
Range queries can use indexes to avoid scanning unwanted documents,
typically yielding better performance as the offset grows compared to
using skip() for pagination.
Descending order (example):
function printStudents(startValue, nPerPage) {
let endValue = null;
db.students.find( { _id: { $lt: startValue } } )
.sort( { _id: -1 } )
.limit( nPerPage )
.forEach( student => {
print( student.name );
endValue = student._id;
} );
return endValue;
}

If you know the ID of the element after which the page should start:
db.myCollection.find({_id: {$gt: id}}).limit(5)
This is a neat little solution that works like a charm.

For faster pagination, don't use the skip() function. Use limit() and find(), querying on the last id of the previous page.
Here is an example where I'm querying over tons of documents using spring boot:
Long totalElements = mongockTemplate.count(new Query(),"product");
int page =0;
Long pageSize = 20L;
String lastId = "5f71a7fe1b961449094a30aa"; // this is the last id of the previous page
for(int i=0; i<(totalElements/pageSize); i++) {
page +=1;
Aggregation aggregation = Aggregation.newAggregation(
Aggregation.match(Criteria.where("_id").gt(new ObjectId(lastId))),
Aggregation.sort(Sort.Direction.ASC,"_id"),
new CustomAggregationOperation(queryOffersByProduct),
Aggregation.limit((long)pageSize)
);
List<ProductGroupedOfferDTO> productGroupedOfferDTOS = mongockTemplate.aggregate(aggregation,"product",ProductGroupedOfferDTO.class).getMappedResults();
lastId = productGroupedOfferDTOS.get(productGroupedOfferDTOS.size()-1).getId();
}
