Indexing on only part of a field in MongoDB

Indexing on only part of a field in MongoDB - mongodb

Is there a way to create an index on only part of a field in MongoDB, for example on the first 10 characters? I couldn't find it documented (or asked about on here).
The MySQL equivalent would be CREATE INDEX part_of_name ON customer (name(10));.
Reason: I have a collection with a single field that varies in length from a few characters up to over 1000 characters, average 50 characters. As there are a hundred million or so documents it's going to be hard to fit the full index in memory (testing with 8% of the data the index is already 400MB, according to stats). Indexing just the first part of the field would reduce the index size by about 75%. In most cases the search term is quite short, it's not a full-text search.
A work-around would be to add a second field of 10 (lowercased) characters for each item, index that, then add logic to filter the results if the search term is over ten characters (and that extra field is probably needed anyway for case-insensitive searches, unless anybody has a better way). Seems like an ugly way to do it though.
[added later]
I tried adding the second field, containing the first 12 characters from the main field, lowercased. It wasn't a big success.
Previously, the average object size was 50 bytes, but I forgot that includes the _id and other overheads, so my main field length (there was only one) averaged nearer to 30 bytes than 50. Then, the second field index contains the _id and other overheads.
Net result (for my 8% sample) is the index on the main field is 415MB and on the 12 byte field is 330MB - only a 20% saving in space, not worthwhile. I could duplicate the entire field (to work around the case insensitive search problem) but realistically it looks like I should reconsider whether MongoDB is the right tool for the job (or just buy more memory and use twice as much disk space).
[added even later]
This is a typical document, with the source field, and the short lowercased field:
{ "_id" : ObjectId("505d0e89f56588f20f000041"), "q" : "Continental Airlines", "f" : "continental " }
Indexes:
db.test.ensureIndex({q:1});
db.test.ensureIndex({f:1});
The 'f" index, working on a shorter field, is 80% of the size of the "q" index. I didn't mean to imply I included the _id in the index, just that it needs to use that somewhere to show where the index will point to, so it's an overhead that probably helps explain why a shorter key makes so little difference.
Access to the index will be essentially random, no part of it is more likely to be accessed than any other. Total index size for the full file will likely be 5GB, so it's not extreme for that one index. Adding some other fields for other search cases, and their associated indexes, and copies of data for lower case, does start to add up, and make paging and swapping more likely (it's an 8GB server) which I why I started looking into a more concise index.

MongoDB has no way to create an index on a portion of a field's value. Your best approach is to create the second field, as you've suggested.
Since you'll need the second field for efficient case-insensitive searching anyway, there's really no reason to not create it.
The indexes don't store the '_id' field of the document, they store a DiscLoc structure, which is a much lower-level structure: see here for details
http://www.10gen.com/presentations/MongoNYC-2012/storage-engine-internals
Also, note that the "ugly" is really an artifact of "relational thinking". (As a long-time SQL user myself, I often find that the hardest part about learning MongoDB is un-learning my relational thinking.) In a document-oriented database, denormalizing and duplicating data are actually Best Practices.

Related

Does a Mongo full collection scan read every single word in a collection?

Let's say that you don't have something indexed for some legitimate reason (like maybe you maxed out the 64 allowable indexes) and you are searching for values within only certain fields.
To go extreme, let's say each object has an authorName field, bookTitles field, and bookFullText field (where the content of all their novels was collected.)
If there was no index and you looked for a list of authorNames, would it have to read through all the content of all the fields in the entire collection, or would it read just the authorName fields and the names but not content of the other fields?

Fields in a document are ordered. The server stores documents as lists of key-value pairs. Therefore, I would expect that, if the server is doing a collection scan and field comparison, that the server will:
Skip over all of the fields preceding the field in question, one field at a time (which requires the server to perform string comparisons over each field name), and
Skip over the fields after the field in question in a particular document (jump to next document in collection).
The above applies to comparisons. What about reads from disk?
The basic database design I am familiar with separates logical records (documents in case of MongoDB, table rows in a RDBMS) from physical pages. For performance reasons the database generally will not read documents from disk, but will read pages. As such, it seems unlikely to me that the database will skip over some of the fields when it maps documents to pages. I expect that when any field of a document is needed, the entire document will be read from disk.
Further supporting this hypothesis is MongoDB's 16 MB document limit. This is rather low, and I suspect is set such that the server can read documents into memory completely without worrying that they might be very large. Postgres for example distinguishes VARCHAR from TEXT types in where the data is stored - VARCHAR data is stored inline in the table row and TEXT data is stored separately, presumably to avoid this exact issue of having to read it from disk if any column value is needed.
I am not a MongoDB server engineer though so the above could be wrong.

BSON Documents are kept in the common case (wiredTiger snappy compressed) in 32KB blocks in 64MB(default size ) chunks on storage , in case your document compressed size is 48KB , two blocks 32KB each must be loaded in memory , to be uncompressed and searched for your non indexed field which is expensive operation , moreover if you search multiple documents usually they are not written in sequential blocks increasing the demands for IOPS to your backend storage , this is why it is best to do some initial analysis and index the fields you will search mostly and create indexes , indexes(B-tree) are very effective since they are kept most of the time in memory compressed ( prefix compression) and are very fast for field search.
There is text indexes in mongodb that are enough for some simple text searches or you can use regex expressions.
If you will do full text search most of the time you better have search engine like elasticsearch which support inverse indexes in front of the database since the inverse indexes have your full text results already calculated and can give you the results times faster than similar operation using standard B-tree indexes.
If you use ATLAS ( the mongodb cloud service ) there is already lucene engine(inverse index) integrated that can do the fulltext search for you.
I hope my answer throw some light on the subject ... :)

Text equality operator performance in Postgresql

How does this query work in terms of string comparison performance (assume there is a standard B-tree index on last_name ?
select * from employee where last_name = 'Wolfeschlegelsteinhausenbergerdorff';
So as it walks the B-Tree, I am assuming it it doesn't do a linear search on each character in the last_name field. EG, it doesn't start to check that the fist letter starts with a W... Assuming it doesn't do a linear comparison, what does it do?
I ask because I am considering to write my own duplicate prevention mechanism, but I want the performance to be sound. I was originally thinking of hashing each string (into some primitive datatype, probably a Long) that is coming in through an API, and storing the hash codes in a set/cache (each entry expires after 5 minutes). Any collisions would/could prompt a true duplicate check, where the already processed strings are stored in postgresql. But I'm thinking, would it be better to just simply query postgresql, in stead of maintaining my own memory based Set of Hashes that fluhses old entries after 5-10 minutes. I would probably use redis for scalability since multiple nodes will be reading different streams. Is my set of memory cached hash codes going to be faster than just querying indexed postgres String columns (full text matching not text searching) ?

When strings are compared for equality, the function texteq is called.
If you look up the function in src/backend/utils/adt/varlena.c, you will find that the comparison is made using the C library function memcmp. I doubt that you can get faster than that.
When you look up the value in a B-tree index, it will be compared to the values stored in each index page from the root page to the leaf page, that are at most 5 or 6 pages.
Frankly, I doubt that you can manage to be faster than that, but I wish you luck trying.

Choosing the right database index type

I have a very simple Mongo database for a personal nodejs project. It's basically just records of registered users.
My most important field is an alpha-numeric string (let's call it user_id and assume it can't be only numeric) of about 15 to 20 characters.
Now the most important operation is checking if the user exists at or all not. I do this by querying db.collection.find("user_id": "testuser-123")
if no record returns, I save the user along with some other not so important data like first name, last and signup date.
Now I obviously want to make user_id an index.
I read the Indexing Tutorials on the official MongoDB Manual.
First I tried setting a text index because I thought that would fit the alpha-numeric field. I also tried setting language:none. But it turned out that my query returned in ~12ms instead of 6ms without indexing.
Then I tried just setting an ordered index like {user_id: 1}, but I haven't seen any difference (is it only working for numeric values?).
Can anyone recommend me the best type of index for this case or quickest query to check if the user exists? Or maybe is MongoDB not the best match for this?

Some random thoughts first:
A text index is used to help full text search. Given your description this is not what is needed here, as, if I understand it well, you need to use an exact match of the whole field.
Without any index, MongoDB will use a linear search. Using big O notation, this is an O(n) operation. With an (ordered) index, the search is performed in O(log(n)). That means that an index will dramatically speed up queries when you will have many documents. But you will not necessary see any improvement if you have a small number of documents. In that case, O(n) can even be worst than O(log(n)). Some database management systems don't even bother using the index if the optimizer estimate that it will not provide enough benefits. I don't know if MongoDB does that, though.
Given your use case, I think the proper index is an unique index. This is an ordered index that would prevent insertion of two identical documents.
In your application, do not test before insert. In real application, this could lead to race condition when you have concurrent inserts. If you use an unique index, just try to insert -- and be prepared to gracefully handle an error caused by a duplicate key.

Any way to make Mongo filtering dynamic records faster?

Scenario
I have Mongo collection Items that have dynamic item objects in it. Currently I have over 3 million records. I'm using C# with MongoSharp but I don't think it has anything to do with my problem.
Here is an example Item (it has a lot more fields than just 3):
{
_id : "1234567890",
Code : 888596937,
RefNumber : "GHTZKL",
...
}
AFAIK there is no point in using TextSearch since it's not really words, just some codes so it won't give me anything beneficial. I also cannot index them all since then I will have to index every single field.
Problem
Right now when I filter data it takes about 1-3 seconds (on ssd). Is there any way I can make it filter my items faster or it is as fast as I can get?

You don't mention what field you want to search on, but it sounds like you want to search on any arbitrary attribute. This is a common design and borders on an antipattern for MongoDB. The only way to avoid the collection scan you're getting now is to index the fields you want to search on, but indexing every field when you don't know what the fields will be ahead of time isn't possible. The solution to this to name only the common fields (and index on them), then group the other fields into name/value pairs in an array in the the document. You can then index that array to get your fast searches.
A word of caution on NVP arrays: if you array gets very large (hundreds of attributes), your index size will blow up spectacularly. It's best to try to keep the array size fairly small.
For more information on this design pattern, see Asya's great writeup.

How complete should MongoDB indexes be?

For example, I have documents with only three fields: user, date, status. Since I select by user and sort by date, I have those two fields as an index. That is the proper thing to do. However, since each date only has one status, I am essentially indexing everything. Is it okay to not index all fields in a query? Where do you draw the line?
What makes this question more difficult is the complete opposite approach to indexes between read-heavy and write-heavy collections. If yours is somewhere in between, how do you determine the proper approach when it comes to indexes?

Is it okay to not index all fields in a query?
Yes, but you'll want to avoid this for frequently used queries. Anything not indexed will imply a "table scan". This means accessing each possible document individually, which will be slow.
Where do you draw the line?
Also note, that if you sort by an un-indexed field, MongoDB will "yell at you" if you're trying to sort too much data. So you have to have some awareness of how much data is "outside of" the index.
If yours is somewhere in between, how do you determine the proper approach when it comes to indexes?
Monitoring, instrumenting, experimenting and experience.
There is no hard and fast rule here, it's all going to be about trade-offs. CPU vs. RAM vs. Disk IO vs. Responsiveness, etc.

The perfect situation is to store everything in a single index. By everything I mean all fields you query on, you sort by and you retrieve. This will ensure that you'll get maximum performance (if index fits in ram)
This situation is not always possible, so you'll have to make choices.
Here are 3 tips to reduce at maximum the index size:
Does each of your query have a lot of results or only a few ? => A few : you do not have to index all the fields you retrieve (only the query and sort fields because few results mean few disk access).
Does your query results are often the same (i.e your working set is small) ? => don't index the field you retrieve because results are cached by mongodb.
Do you have a query field more selective than another ? => index the more selective field only.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse