Data model built on Mongo: store IDs as one massive string or array of strings? Is Mongo faster at using regular expressions or looking inside arrays?

We could use help on structuring our Mongo database. We need to store country IDs then run queries to return documents containing matching countries. Assume the IDs are strings 6-10 chars long.
Two options:
1) Store the country IDs as one massive string separated by some delimiter
(e.g., /). Ex: "IDIDID1/IDIDID2/IDIDID3/IDIDID4/IDIDID5".
2) Store the IDs in an array.
Ex: ["IDIDID1", "IDIDID2", "IDIDID3", "IDIDID4", "IDIDID5"].
We want to optimize for queries like "Find all documents containing country ID IDIDID3."
For option 1, we plan to use a RegEx to query documents (e.g., /IDIDID3/).
For option 2, we will use the standard $in operator.
Which option yields better read performance?
Does using the string approach yield better performance because you can index strings (as opposed to the limitation of only one array indexable by Mongo)?
We're using MongoMapper.

From the MongoDB manual:
$regex can only use an index efficiently when the regular expression
has an anchor for the beginning (i.e. ^) of a string and is a case-sensitive match.
Additionally, while /^a/, /^a.*/, and /^a.*$/ match equivalent strings,
they have different performance characteristics.
All of these expressions use an index if an appropriate index exists;
however, /^a.*/, and /^a.*$/ are slower. /^a/ can stop scanning after matching the prefix.
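So using an array and a multikey index makes more sense in terms of performance: an unanchored pattern such as /IDIDID3/ cannot use an index efficiently, whereas an equality match on an indexed array field can.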
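To make the difference concrete, here is a minimal pure-Python sketch (not MongoDB itself) that imitates the two options. A dictionary stands in for a multikey index, and a regex over the joined string stands in for option 1. The sample documents, the field name countries, and the helper names are all hypothetical. Note that the unanchored pattern also matches IDIDID30 when searching for IDIDID3, so the string approach has a correctness risk on top of the performance cost.

```python
import re
from collections import defaultdict

# Hypothetical sample documents; "countries" is an assumed field name.
docs = [
    {"_id": 1, "countries": ["IDIDID1", "IDIDID3"]},
    {"_id": 2, "countries": ["IDIDID2", "IDIDID30"]},
    {"_id": 3, "countries": ["IDIDID3", "IDIDID5"]},
]

def build_multikey_index(documents):
    """Map each array element to the documents containing it (one entry per element)."""
    index = defaultdict(set)
    for doc in documents:
        for country_id in doc["countries"]:
            index[country_id].add(doc["_id"])
    return index

def find_by_index(index, country_id):
    """Option 2: a direct lookup; only the matching entries are touched."""
    return sorted(index.get(country_id, ()))

def find_by_regex_scan(documents, country_id):
    """Option 1: an unanchored regex has to be tested against every document."""
    pattern = re.compile(re.escape(country_id))
    return sorted(
        doc["_id"] for doc in documents
        if pattern.search("/".join(doc["countries"]))
    )

index = build_multikey_index(docs)
by_index = find_by_index(index, "IDIDID3")
by_regex = find_by_regex_scan(docs, "IDIDID3")
```

Here the index lookup returns only documents 1 and 3, while the regex scan also returns document 2 because IDIDID30 contains IDIDID3 as a substring.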

Related

MongoDB multiple type of index on same field

Can I have multiple type of index on same field? Will it affect performance?
Example :
db.users.createIndex({"username":"text"})
db.users.createIndex({"username":1})

Yes, you can have different types of indexes on a single field, e.g. text, 2dsphere, or hashed.
You cannot, however, create the same index twice with only options such as sparse or unique changed.
Every write operation then has to update the relevant entry in each of those indexes, so every extra index adds write overhead.

The two index options are very different.
When you create a regular index on a string field it indexes the entire value in the string. Mostly useful for single word strings (like a username for logins) where you can match exactly.
A text index on the other hand will tokenize and stem the content of the field. So it will break the string into individual words or tokens, and will further reduce them to their stems so that variants of the same word will match ("talk" matching "talks", "talked" and "talking" for example, as "talk" is a stem of all three). Mostly useful for true text (sentences, paragraphs, etc).
Text Search
Text search supports the search of string content in documents of a collection. MongoDB provides the $text operator to perform text search in queries and in aggregation pipelines.
The text search process:
tokenizes and stems the search term(s) during both the index creation and the text command execution.
assigns a score to each document that contains the search term in the indexed fields. The score determines the relevance of a document to a given search query.
The $text operator can search for words and phrases. The query matches on the complete stemmed words. For example, if a document field contains the word blueberry, a search on the term blue will not match the document. However, a search on either blueberry or blueberries will match.
$regex searches can be used with regular indexes on string fields, to provide some pattern matching and wildcard search. Not a terribly effective user of indexes but it will use indexes where it can:
If an index exists for the field, then MongoDB matches the regular expression against the values in the index, which can be faster than a collection scan. Further optimization can occur if the regular expression is a “prefix expression”, which means that all potential matches start with the same string. This allows MongoDB to construct a “range” from that prefix and only match against those values from the index that fall within that range.
http://docs.mongodb.org/manual/core/index-text/
http://docs.mongodb.org/manual/reference/operator/query/regex/

In Algolia, how do you construct records to allow for alphabetical sorting of query results?

As far as I know, you can only sort on numeric fields in Algolia, so how do you efficiently set up your records to allow for results to be returned alphabetically based on a specific string field?
For example, let's say each record in an index has a field called "title" that contains an arbitrary string value. How would you create a sibling field called "title_sort" containing a number that allows the results to be sorted so that the records come out in alphabetical order by "title"? Is there a particularly well-accepted algorithm for creating such a number from the string in "title"?

If you have a static dataset, then you can just sort your data and store each record's position as its numeric sort field. This works as long as you re-sort the data every time you update your indices.
I'm also thinking that if you can deal with a partial sorting, meaning that you can accept orc < orb but you need or < os, then you could derive a number from the first few characters of the string by treating them as base64 digits and use that as your sort field. You can then sort to as many characters as you have precision for. It's only a partial sorting, but it might be acceptable for your use case. You just need to arrange your base64 -> base10 mapping to accommodate the sorting.
Additionally, if you don't care about the difference between capital and lowercase letters, then you can do base26 -> base10. The more I think about this the more limited it is, but it might work for your use case.
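A rough sketch of the case-insensitive variant, under some assumptions of mine rather than anything Algolia prescribes: only the letters a-z are considered, the function name sort_key and the default width of 8 characters are made up for illustration, and the encoding uses base 27 instead of base 26 so that a digit of 0 can mean "no character here" (otherwise "or" and "ora" would collide).

```python
def sort_key(title, width=8):
    """Encode the first `width` letters of `title` as a base-27 integer.

    Digit 0 means "string already ended"; a..z map to 1..26, so shorter
    strings sort before longer strings sharing the same prefix.
    Non-letters are dropped and case is ignored (an illustrative choice).
    """
    letters = [c for c in title.lower() if "a" <= c <= "z"][:width]
    key = 0
    for position in range(width):
        if position < len(letters):
            digit = ord(letters[position]) - ord("a") + 1
        else:
            digit = 0
        key = key * 27 + digit
    return key

titles = ["os", "Orc", "or", "orb"]
ordered = sorted(titles, key=sort_key)
```

With width=8 the largest possible key stays below 27**8, which still fits in a double-precision float without loss. Titles that agree on their first 8 letters get the same key, which is the partial-sorting trade-off described above.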

Distinguish array from single value in a document

I have two types of documents in a MongoDB collection:
one where key sessions has a simple value:
{"sessions": NumberLong("10000000000001")}
one where key sessions has an array of values.
{"sessions": [NumberLong("10000000000001")]}
Is there any way to retrieve all documents from the second category, i.e. only documents whose value is an array and not a simple value?

You can use this kind of query for that:
db.collectionName.find( { $where : "Array.isArray(this.sessions)" } );
but you would be better off converting all the records to one type to keep things consistent.

The query can be as simple as this:
db.c.find({sessions:{$gte:[]}});
Explanation:
You only want documents whose sessions value is an array. $gte returns false when its two operands have different data types (Double, Integer32 and Integer64 are treated as the same data type), so giving it an empty array as the operand retrieves exactly the documents you need.
$gt, $lt and $lte behave the same way in standard queries (note that the operators with the same names in aggregation pipeline expressions behave differently). I verified this in practice on MongoDB 2.4.8 and 2.6.4.

MongoDB - Difference between index on text field and text index?

For a MongoDB field that contains strings (for example, state or province names), what (if any) difference is there between creating an index on a string-type field :
db.collection.ensureIndex( { field: 1 } )
and creating a text index on that field:
db.collection.ensureIndex( { field: "text" } )
Where, in both cases, field is of string type.
I'm looking for a way to do a case-insensitive search on a text field which would contain a single word (maybe more). Being new to Mongo, I'm having trouble distinguishing between using the above two index methods, and even something like a $regex search.

The two index options are very different.
When you create a regular index on a string field it indexes the
entire value in the string. Mostly useful for single word strings
(like a username for logins) where you can match exactly.
A text index on the other hand will tokenize and stem the content of
the field. So it will break the string into individual words or
tokens, and will further reduce them to their stems so that variants
of the same word will match ("talk" matching "talks", "talked" and
"talking" for example, as "talk" is a stem of all three). Mostly
useful for true text (sentences, paragraphs, etc).
Text Search
Text search supports the search of string content in documents of a
collection. MongoDB provides the $text operator to perform text search
in queries and in aggregation pipelines.
The text search process:
tokenizes and stems the search term(s) during both the index creation and the text command execution.
assigns a score to each document that contains the search term in the indexed fields. The score determines the relevance of a document to a given search query.
The $text operator can search for words and phrases. The query matches
on the complete stemmed words. For example, if a document field
contains the word blueberry, a search on the term blue will not match
the document. However, a search on either blueberry or blueberries
will match.
$regex searches can be used with regular indexes on string fields, to
provide some pattern matching and wildcard search. Not a terribly
effective user of indexes but it will use indexes where it can:
If an index exists for the field, then MongoDB matches the regular
expression against the values in the index, which can be faster than a
collection scan. Further optimization can occur if the regular
expression is a “prefix expression”, which means that all potential
matches start with the same string. This allows MongoDB to construct a
“range” from that prefix and only match against those values from the
index that fall within that range.
http://docs.mongodb.org/manual/core/index-text/
http://docs.mongodb.org/manual/reference/operator/query/regex/

Text indexes allow you to search for words inside texts. You can do the same using a regex on a non-text-indexed text field, but it would be much slower.
Prior to MongoDB 2.6, text search operations had to be made with their own command, which was a big drawback because you couldn't combine it with other filters, nor treat the result as a common cursor. As of now, text search is just another operator for the typical find method, and that's super nice.
So why are a text index and its subsequent searches faster than a regex on a non-indexed text field? It's because text indexes work as a dictionary, a clever one that's capable of discarding words on a per-language basis (defaults to English). When you run a text search query, you run it against the dictionary, saving yourself the time that would otherwise be spent iterating over the whole collection.
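The dictionary idea can be sketched in a few lines of plain Python. This is only a toy model of the concept, not how MongoDB implements text indexes: the stop-word list is a tiny made-up sample, and stem() is a deliberately crude suffix stripper standing in for a real language-aware stemmer. All names here are hypothetical.

```python
from collections import defaultdict

# Tiny illustrative stop-word list; a real one is per-language and much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "on", "of", "to"}

def stem(word):
    """Crude stand-in for a real stemmer: strip a few common English suffixes."""
    for suffix, replacement in (("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stop words, stem the rest."""
    words = "".join(c.lower() if c.isalnum() else " " for c in text).split()
    return [stem(w) for w in words if w not in STOP_WORDS]

def build_text_index(documents):
    """The 'dictionary': each stemmed token maps to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def text_search(index, term):
    """Look up the stemmed search terms directly; no document text is scanned."""
    hits = set()
    for token in tokenize(term):
        hits |= index.get(token, set())
    return hits

documents = {
    1: "She talked about the blueberry pie",
    2: "They are talking",
    3: "Blueberries and cream",
}
text_index = build_text_index(documents)
```

Searching this toy index for "talks" finds documents 1 and 2, "blueberry" finds 1 and 3, and "blue" finds nothing, because lookups match whole stemmed words rather than substrings.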
Keep in mind that the text index will grow along with your collection, and it can use a lot of space. I learnt this the hard way when using capped collections. There's no way to cap text indexes.
A regular index on a text field, such as
db.collection.ensureIndex( { field: 1 } )
will be useful only if you search for the whole text. It's used, for example, to look up alphanumeric hashes. It doesn't make sense to apply this kind of index when storing text paragraphs, phrases, etc.

Checking position of an entry in an index MongoDB

I have a query using pymongo that is outputting some values based on the following:
cursor = db.collect.find({"index_field": {"$regex": r"\s"}})
for document in cursor:
    print(document["_id"])
Now this query has been running for a long time (over 500 million documents) as I expected. I was wondering though if there is a way to check where the query is in its execution by perhaps finding out where the last printed "_id" is in the indexed field. Like is the last printed _id halfway through the btree index? Is it near the end?
I want to know this just to see if I should cancel the query and reoptimize and/or let it finish, but I have no way of knowing where the _id exists in the query.
Also, if anyone has a way to optimize my whitespace query, that would be helpful too. Based on the docs, it seems that if I had used ignorecase it would have been faster, although that doesn't make sense for whitespace checking.
Thanks so much,
J

Query optimization
Your query cannot be optimized as written, because it's an inefficient $regex search that looks for whitespace (\s) anywhere in the field. What you can do is anchor the $regex so it searches for a prefix of \s, e.g.
db.collect.find({"index_field": {"$regex": '^\\s'}})
Check out the notes in the link
Indexing problem
$regex can only use an index efficiently when the regular
expression has an anchor for the beginning (i.e. ^) of a string and is
a case-sensitive match. Additionally, while /^a/, /^a.*/, and
/^a.*$/ match equivalent strings, they have different performance
characteristics. All of these expressions use an index if an
appropriate index exists; however, /^a.*/, and /^a.*$/ are slower.
/^a/ can stop scanning after matching the prefix.
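Why an anchored prefix helps can be illustrated with a sorted list standing in for the B-tree index. This is a simplified pure-Python sketch, not MongoDB's actual implementation; the function names and sample keys are hypothetical, and the upper bound is computed by bumping the last character of the prefix, which assumes a non-empty prefix whose last character is not the maximum code point.

```python
import bisect

def prefix_range_scan(sorted_keys, prefix):
    """Anchored case (like /^prefix/): binary-search the bounds, read only that slice."""
    if not prefix:
        return list(sorted_keys)
    lower = bisect.bisect_left(sorted_keys, prefix)
    # Smallest string greater than every string starting with `prefix`.
    upper_bound = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    upper = bisect.bisect_left(sorted_keys, upper_bound)
    return sorted_keys[lower:upper]

def unanchored_scan(sorted_keys, needle):
    """Unanchored case (like /needle/): every key must be examined."""
    return [key for key in sorted_keys if needle in key]

keys = sorted(["alpha", "beta", "alps", "gamma", "al", "b alpha"])
anchored = prefix_range_scan(keys, "al")
unanchored = unanchored_scan(keys, "al")
```

The anchored version touches only the contiguous run of keys starting with "al", while the unanchored version walks the whole list and also picks up "b alpha", where the match sits in the middle of the value.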
DB op's info
Use db.currentOp() to get info on all of your running ops.
