Creating optimized MongoDB indices for prefix + ignore-case query

I am creating a MongoDB collection which will contain tens of trillions of records. The shape of the documents will be like so:
{
"_id": ObjectId("AbCdhijk"),
"val": "hello world"
},
{
"_id": ObjectId("aBCDlmnop"),
"val": "goodbye world"
}
I have two query requirements:
query all values where id begins with a prefix string
query all values where id begins with a prefix string, ignore case
For example: querying for AbC should return one document (the one with val=hello world), whereas querying for abc with case ignored should return both documents. The queries should be as fast as possible, ideally with logarithmic performance per query rather than a scan of the whole collection.
A soft requirement would be supporting an endswith query as well.
What would be the ideal indices to add and queries to use?
Changing the shape of documents to accomplish this and potentially having multiple documents per value even spread between multiple collections is acceptable.
For example I was considering making two inserts per value: one with the original ID and one with the ID transformed to all lowercase to aid with the ignore case lookup.
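To make the lowercase-copy idea concrete, here is a minimal sketch in plain JavaScript. It assumes the IDs are plain ASCII strings and uses two hypothetical extra fields, idLower and idReversed; a prefix match then becomes a range filter, which an ordinary ascending index on that field can serve without scanning the whole collection. The in-memory applyRange helper only stands in for the database so the filters can be checked.

```javascript
// Sketch only. `idLower` and `idReversed` are hypothetical field names,
// and IDs are assumed to be plain ASCII strings.

// Smallest string that sorts after every string starting with `prefix`.
function prefixUpperBound(prefix) {
  const last = prefix.charCodeAt(prefix.length - 1);
  return prefix.slice(0, -1) + String.fromCharCode(last + 1);
}

// Range filter equivalent to "field starts with prefix".
function prefixRange(field, prefix) {
  if (prefix.length === 0) throw new Error("prefix must be non-empty");
  return { [field]: { $gte: prefix, $lt: prefixUpperBound(prefix) } };
}

const reverse = (s) => [...s].reverse().join("");

// One stored document per value, with derived lookup fields.
function toStoredDoc(id, val) {
  return { _id: id, idLower: id.toLowerCase(), idReversed: reverse(id), val };
}

// In-memory stand-in for the index range scan, used to check the filters.
function applyRange(docs, filter) {
  const [field, { $gte, $lt }] = Object.entries(filter)[0];
  return docs.filter((d) => d[field] >= $gte && d[field] < $lt);
}

const docs = [
  toStoredDoc("AbCdhijk", "hello world"),
  toStoredDoc("aBCDlmnop", "goodbye world"),
];

const caseSensitive = prefixRange("_id", "AbC");                // starts with "AbC"
const ignoreCase = prefixRange("idLower", "abc".toLowerCase()); // starts with "abc", any case
const endsWith = prefixRange("idReversed", reverse("jk"));      // ends with "jk"
```

With string IDs, the case-sensitive filter can use the default _id index; the other two would each need their own index, for example db.collection.createIndex({idLower: 1}). Keeping the derived fields on the same document avoids the second insert per value, at the cost of two extra indexes.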

Related

What is the correct way to Index in MongoDB when big combination of fields exist

I have a search panel that includes multiple filter options.
I'm working with Mongo and have created a compound index on 3-4 properties in a specific order.
But when I run different combinations of searches, I see a different execution plan (explain()) each time. Sometimes it is a collection scan (bad), and sometimes it fits the index (IXSCAN).
The selective fields that should be handled by Mongo indexes are: brand, Types, Status, Warehouse, Carries, Search (only by id).
My question is:
Do I have to create every combination of the fields in every order, which could mean 10-20 compound indexes? Or 1-3 big compound indexes, which again would not solve the ordering problem?
What is the best strategy for dealing with a large variety of field combinations?
I use queries of the same structure with different combinations of fields.
// Example Query.
// fields could be different every time according to user select (and order) !!
db.getCollection("orders").find({
'$and': [
{
'status': {
'$in': [
'XXX',
'YYY'
]
}
},
{
'searchId': {
'$in': [
'3859447'
]
}
},
{
'origin.brand': {
'$in': [
'aaaa',
'bbbb',
'cccc',
'ddd',
'eee',
'bundle'
]
}
},
{
'$or': [
{
'origin.carries': 'YYY'
},
{
'origin.carries': 'ZZZ'
},
{
'origin.carries': 'WWWW'
}
]
}
]
}).sort({"timestamp":1})
// My compound index is:
{status: 1, searchId: -1, "origin.brand": 1, "origin.carries": 1, timestamp: 1}
But that covers only one combination; there could be plenty, like:
a. {status: 1}
b. {status: 1, searchId: -1}
c. {status: 1, searchId: -1, "origin.brand": 1}
d. {status: 1, searchId: -1, "origin.brand": 1, "origin.carries": 1}
...
Additionally, what will happen to write/read performance? I expect write performance to decrease in exchange for faster reads.
The query patterns are:
1.find(...) with '$and'/'$or' + sort
2.Aggregation with Match/sort
thanks
Generally, indexes are only useful if they are over a selective field. This means the number of documents that have a particular value is small relative to the overall number of documents.
What "small" means varies with the data set and the query. A 1% selectivity is pretty safe when deciding whether an index makes sense. If a particular value exists in, say, 10% of documents, performing a table scan may be more efficient than using an index over the respective field.
With that in mind, some of your fields will be selective and some will not be. For example, I suspect filtering by "OK" will not be very selective. You can eliminate non-selective fields from indexing considerations - if someone wants all orders which are "OK" with no other conditions they'll end up doing a table scan. If someone wants orders which are "OK" and have other conditions, whatever index is applicable to other conditions will be used.
Now that you are left with selective (or at least somewhat selective) fields, consider what queries are both popular and selective. For example, perhaps brand+type would be such a combination. You could add compound indexes that match popular queries which you expect to be selective.
Now, what happens if someone filters by brand only? This could be selective or not depending on the data. If you already have a compound index on brand+type, you'd leave it up to the database to determine whether a brand only query is more efficient to fulfill via the brand+type index or via a collection scan.
Continue in this manner with other popular queries and fields.
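As a rough way to check selectivity before deciding on an index, one could sample documents and measure what fraction carries each value. The sketch below is plain JavaScript over an in-memory sample; the helper names and the generated sample data are made up for illustration, and the 1% default is just the rule of thumb mentioned above.

```javascript
// Fraction of sampled documents carrying each value of `field`.
function valueFrequencies(docs, field) {
  const counts = new Map();
  for (const doc of docs) {
    counts.set(doc[field], (counts.get(doc[field]) || 0) + 1);
  }
  const freq = {};
  for (const [value, n] of counts) freq[value] = n / docs.length;
  return freq;
}

// Rule of thumb from the text: about 1% or less is safely selective.
function isSelective(fraction, threshold = 0.01) {
  return fraction <= threshold;
}

// Hypothetical sample: 1000 orders, 90% "OK", unique searchId values.
const sample = Array.from({ length: 1000 }, (_, i) => ({
  status: i % 10 === 0 ? "HOLD" : "OK",
  searchId: String(i),
}));

const statusFreq = valueFrequencies(sample, "status");     // OK: 0.9, HOLD: 0.1
const searchIdFreq = valueFrequencies(sample, "searchId"); // 0.001 for each value
```

Here neither status value passes the threshold, while every searchId does, which matches the intuition that status is a poor leading index field and searchId a good one.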
So you have subdocuments, ranged queries, and sorting by 1 field only.
It can eliminate most of the possible permutations. Assuming there are no other surprises.
D. SM already covered selectivity - you should really listen to what the man says and at least upvote.
The other thing to consider is the order of the fields in the compound index:
fields that have direct match like $eq
fields you sort on
fields with ranged queries: $in, $lt, $or etc
These are common rules for all b-trees. Now things that are specific to mongo:
A compound index can have no more than 1 multikey index - the index by a field in subdocuments like "origin.brand". Again I assume origins are embedded docs, so the document's shape is like this:
{
_id: ...,
status: ...,
timestamp: ....,
origin: [
{brand: ..., carries: ...},
{brand: ..., carries: ...},
{brand: ..., carries: ...}
]
}
For your query the best index would be
{
searchId: 1,
timestamp: 1,
status: 1, /** only if it is selective enough **/
"origin.carries" : 1 /** or brand, depending on data **/
}
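The ordering rule above (exact matches first, then sort fields, then ranged fields) can be sketched as a small helper that assembles an index specification. The function name buildIndexSpec is hypothetical, and treating searchId as an exact match is an assumption based on the example query, where its $in list holds a single value.

```javascript
// Assemble a compound index spec: equality fields, then sort fields,
// then range fields. Key order in the object is the index field order.
function buildIndexSpec({ equality = [], sort = {}, range = [] }) {
  const spec = {};
  for (const field of equality) spec[field] = 1;
  for (const [field, direction] of Object.entries(sort)) spec[field] = direction;
  for (const field of range) {
    if (!(field in spec)) spec[field] = 1; // do not repeat a field
  }
  return spec;
}

const spec = buildIndexSpec({
  equality: ["searchId"],             // assumed exact match
  sort: { timestamp: 1 },             // matches .sort({"timestamp": 1})
  range: ["status", "origin.carries"] // $in / $or style conditions
});
// spec has the same field order as the index suggested above
```

The resulting object could be passed to db.orders.createIndex(spec); whether status earns its place still depends on how selective it is in your data.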
Regarding the number of indexes - it depends on data size. Ensure all indexes fit into RAM, otherwise it will be really slow.
Last but not least - indexing is not a one-off job but a lifestyle. Data changes over time, and so do queries. If you care about performance and have finite resources you should keep an eye on the database. Check slow queries to add new indexes, and collect stats from users' queries to remove unused indexes and free up some room. Basically, apply common sense.
I noticed this one-year-old topic because I am more or less struggling with a similar issue: users can request queries with an unpredictable set of fields, which makes it nearly impossible to decide (or change) how indexes should be defined.
Even worse: the user should indicate some value (or range) for the fields that make up the sharding-key, otherwise we cannot help MongoDB to limit its search to only a few shards (or chunks, for that matter).
When the user needs the liberty to search on other fields that are not necessarily the ones which make up the sharding-key, then we're stuck with a full-database search. Our database is some tens of TB in size...
Indexes should fit in RAM? This can only be achieved with small databases, meaning some hundreds of GB at most. How about my 37 TB database? Its indexes won't fit in RAM.
So I am trying out a POC inspired by the UNIX filesystem structures where we have inodes pointing to data blocks:
we have a cluster with 108 shards, each contains 100 chunks
at insert time, we take some fields of which we know they yield a good cardinality of the data, and we compute the sharding-key with those fields; the document goes into the main collection (call it "Main_col") on that computed shard, so with a certain chunk-number (equals our computed sharding-key value)
from the original document, we take a few 'crucial' fields (the list of such fields can evolve as your needs change) and store a small extra document in another collection (call these "Crucial_col_A", "Crucial_col_B", etc., one for each such field): that document contains the value of this crucial field, plus an array with the chunk-number where the original full document has been stored in the 'big' collection "Main_col"; consider this a 'pointer' to the chunk in collection "Main_col" where the full document exists. These "Crucial_col_X" collections are sharded based on the value of the 'crucial' field.
when we insert another document that has the same value for some 'crucial' field "A", then that array of chunk-numbers in "Crucial_col_A" will be updated (with 'merge') to contain the different or same chunk number of this next full document from "Main_col"
a user can now define queries with criteria for at least one of those 'crucial' fields, plus (optionally) any other criteria on other fields in the documents; the first criterion for the crucial field (say field "B") will run very quickly (because sharded on the value of "B") and return the small document from "Crucial_col_B", in which we have the array of chunk-numbers in "Main_col" where any document exists that has field "B" equal to the given criterion. Then we run a second set of parallel queries, one for each shardkey-value=chunk-number (or one per shard, to be decided) that we find in the array from before. We combine the results of those parallel subqueries, and then apply further filtering if the user gave additional criteria.
Thus this involves 2 query-steps: first in the "Crucial_col_X" collection to obtain the array with chunk-numbers where the full documents exist, and then the second query on those specific chunks in "Main_col".
The first query is done with a precise value for the 'crucial' field, so the exact shard/chunk is known, thus this query goes very fast.
The second (set of) queries are done with precise values for the sharding-keys (= the chunk numbers), so these are expected to go also very fast.
This way of working would eliminate the burden of defining many index combinations.
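To illustrate the two-step lookup, here is a toy in-memory model in plain JavaScript. It is only a sketch of the idea, not the actual POC: Maps stand in for the sharded collections, the chunk count is tiny, the chunkFor hash and the customer field are invented for the example, and the per-chunk queries run sequentially instead of in parallel.

```javascript
const NUM_CHUNKS = 8; // the real cluster has far more; kept small here

// Stand-in for the computed sharding-key (hypothetical hash on `customer`).
function chunkFor(doc) {
  let h = 0;
  for (const ch of String(doc.customer)) {
    h = (h * 31 + ch.charCodeAt(0)) % NUM_CHUNKS;
  }
  return h;
}

const mainCol = new Map();     // "Main_col": chunk-number -> full documents
const crucialColB = new Map(); // "Crucial_col_B": value of B -> Set of chunk-numbers

function insert(doc) {
  const chunk = chunkFor(doc);
  if (!mainCol.has(chunk)) mainCol.set(chunk, []);
  mainCol.get(chunk).push(doc);
  // Merge the chunk-number into the pointer document for this value of B.
  if (!crucialColB.has(doc.B)) crucialColB.set(doc.B, new Set());
  crucialColB.get(doc.B).add(chunk);
}

function findByB(value, extraFilter = () => true) {
  // Step 1: one targeted lookup in the pointer collection.
  const chunks = crucialColB.get(value) || new Set();
  // Step 2: visit only the listed chunks, then apply the remaining criteria.
  const results = [];
  for (const chunk of chunks) {
    for (const doc of mainCol.get(chunk)) {
      if (doc.B === value && extraFilter(doc)) results.push(doc);
    }
  }
  return results;
}

insert({ customer: "alice", B: "red", qty: 1 });
insert({ customer: "bob", B: "blue", qty: 5 });
insert({ customer: "carol", B: "red", qty: 7 });

const reds = findByB("red");
const bigReds = findByB("red", (d) => d.qty > 3);
```

The benefit depends on how few chunks each crucial value maps to; a value spread over most chunks degrades back toward a broadcast query.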

Best shard key (or optimised query) for range query on sub-document array

Below is a simplified version of a document in my database:
{
_id : 1,
main_data : 100,
sub_docs: [
{
_id : a,
data : 100
},
{
_id: b,
data : 200
},
{
_id: c,
data: 150
}
]
}
So imagine I have lots of these documents with varied data values (say 0 - 1000).
Currently my query is something like:
db.myDb.find(
{ sub_docs: { $elemMatch: { data: { $gte: 110, $lt: 160 } } } }
)
Is there any shard key I could use to help this query? As currently it is querying all shards.
If not is there a better way to structure my query?
Jackson,
You are thinking about this problem the right way. The problem with broadcast queries in MongoDB is that they can't scale.
Any MongoDB query that does not filter on the shard key will be broadcast to all shards. Also, range queries are likely to either cause broadcasts or, at the very least, cause your queries to be sent to multiple shards.
So here are some things to think about:
Query Frequency -- Is the range query your most frequent query? What is the expected workload?
Range Logic -- Is there any intrinsic logic to how you are going to apply the ranges? Let's say 0-200 is small and 200-400 is medium. You could potentially add another field to your document and shard on it.
Additional shard key candidates -- Sometimes there are other fields that can be included in all or most of your queries and that would provide good distribution. By combining that filtering with your range queries you could restrict your query to one or a few shards.
Break array -- You could potentially have multiple documents instead of an array. In this scenario, you would have multiple docs, one for each element of the array, and the main data would be duplicated across multiple documents. A range query on this item would still be a problem, but it could involve several shards rather than necessarily all of them (it depends on your data demographics and query patterns).
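The "Range Logic" and "Break array" ideas can be combined in a small sketch. Everything named here is hypothetical: the data_bucket field, the bucket width of 200 taken from the small/medium example, and the helper functions. It also assumes data holds integers.

```javascript
// Bucket number for a value: 0-199 -> 0, 200-399 -> 1, and so on.
function bucketFor(value, width = 200) {
  return Math.floor(value / width);
}

// "Break array": one document per array element, with main data duplicated
// and a precomputed bucket that could serve as (part of) a shard key.
function breakArray(doc) {
  return doc.sub_docs.map((sub) => ({
    parent_id: doc._id,
    main_data: doc.main_data,
    sub_id: sub._id,
    data: sub.data,
    data_bucket: bucketFor(sub.data),
  }));
}

// Buckets touched by the range [gte, lt), assuming integer data.
function bucketsForRange(gte, lt, width = 200) {
  const buckets = [];
  for (let b = bucketFor(gte, width); b <= bucketFor(lt - 1, width); b++) {
    buckets.push(b);
  }
  return buckets;
}

// Filter that names the buckets so the router can target fewer shards.
function rangeQuery(gte, lt) {
  return { data_bucket: { $in: bucketsForRange(gte, lt) }, data: { $gte: gte, $lt: lt } };
}

const flat = breakArray({
  _id: 1,
  main_data: 100,
  sub_docs: [
    { _id: "a", data: 100 },
    { _id: "b", data: 200 },
    { _id: "c", data: 150 },
  ],
});
const query = rangeQuery(110, 160); // only bucket 0 is involved
```

A bucket field with only a handful of distinct values would distribute poorly on its own, so in practice it would likely need to be combined with another field.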
It boils down to the nature of your data and queries. The sample document that you provided is very anonymized so it is harder to know what would be good shard key candidates in your domain.
One last piece of advice is to be careful on your insert/update query patterns if you plan to update your document frequently to add more entries to the array. Growing documents present scaling problems for MongoDB. See this article on this topic.

How to query all documents, filter for a specific field and return the value for each document in Elasticsearch?

I'm currently running an Elasticsearch instance which is synchronizing from a MongoDB via river. The MongoDB contains entries like this:
{field1: "value1", field2: "value2", cars: ["BMW", "Ford", "Porsche"]}
Not every entry in Mongo has a cars field.
Now I want to create an Elasticsearch query which searches over every document and returns just the cars field from every single document indexed in Elasticsearch.
Is it even possible? Elasticsearch must touch every single document to return the cars field. Maybe querying with Mongo is just easier and as fast as Elasticsearch. What do you think?
The following query POSTed to hostname:9200/_search should get you started:
{
"filter": {
"exists": {
"field": "cars"
}
},
"fields": ["cars"]
}
The filter clause limits the results to documents with a cars field.
The fields clause says to only return the cars field. If you wanted the entire document returned, you would leave this section out.
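The request above uses the top-level filter and fields keys, which as far as I know belong to older Elasticsearch releases. On more recent versions the same idea is usually expressed with a bool filter and _source filtering; the sketch below builds that request body as a plain JavaScript object (the function name is made up, and you should check the syntax against the version you run).

```javascript
// Request body for POST /_search on a recent Elasticsearch version:
// keep only documents that have a `cars` field, and return only that field.
function carsOnlyRequest() {
  return {
    query: {
      bool: {
        filter: { exists: { field: "cars" } },
      },
    },
    _source: ["cars"],
  };
}

const body = JSON.stringify(carsOnlyRequest());
```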
References:
https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#_response_filtering
Make elasticsearch only return certain fields?
Elasticsearch (from my understanding) is not intended to be a single-source-of-truth (SSoT) database. It is very good at text searching and analytics aggregations, but it isn't necessarily intended to be your primary database.
However, your use case isn't necessarily slow in Elasticsearch. It sounds like you just want to filter for your cars field, which you can do as documented here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-fields.html
Lastly, I would actually venture that Elasticsearch is faster than Mongo in this case (assuming that the cars field is NOT indexed in Mongo and is indexed in Elasticsearch, which are their respective defaults), since you probably want to filter out the documents in which the cars field is not set.
tl;dr Elasticsearch isn't intended for your particular use case, but it probably is faster than Mongo, assuming you filter out documents where the cars field is 'missing'.

how to build index in mongodb in this situation

I have a mongodb database, which has following fields:
{"word":"ipad", "date":20140113, "docid": 324, "score": 98}
which is a reverse index for a lot of docs (about 120 million).
There are two kinds of queries in my system.
One of them is:
db.index.find({"word":"ipad", "date":20140113}).sort({"score":-1})
This query fetches the word "ipad" on date 20140113 and sorts all the docs by score.
The other query is:
db.index.find({"word":"ipad", "date":20140113, "docid":324})
To speed up these two kinds of query, what indexes should I build?
Should I build two indexes like this?
db.index.ensureIndex({"word":1, "date":1, "docid":1}, {"unique":true})
db.index.ensureIndex({"word":1, "date":1, "score":1})
But I think building the two indexes would use too much hard disk space.
So do you have any good ideas?
You are sorting by score descending (.sort({"score":-1})), which means that your index should also be descending on the score-field so it can support the sorting:
db.index.ensureIndex({"word":1, "date":1, "score":-1});
The other index looks good to speed up that query, but you still might want to confirm that by running the query in the mongo shell followed with .explain().
Indexes are always a tradeoff of space and write-performance for read-performance. When you can't afford the space, you can't have the index and have to deal with it. But usually the write-performance is the larger concern, because drive space is usually cheap.
But maybe you could save one of the three indexes you have. "Wait, three indexes?" Yes, keep in mind that every collection must have a unique index on the _id field, which is created implicitly when the collection is initialized.
But the _id field doesn't have to be an auto-generated ObjectId. It can be anything you want. When you have another index with a uniqueness constraint and you have no use for the _id field, you can move that unique constraint to the _id field to save an index. Your documents would then look like this:
{ _id: {
"word":"ipad",
"date":20140113,
"docid": 324
},
"score": 98
}
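A small sketch of how the compound _id could be built and queried, in plain JavaScript with hypothetical helper names. One detail worth keeping in mind: MongoDB compares embedded documents field by field in order, so the _id should always be constructed with the same field order, which a single helper makes easy.

```javascript
// Always build the compound _id through one helper so the field order
// (word, date, docid) never varies between inserts and lookups.
function makeId(word, date, docid) {
  return { word, date, docid };
}

function toDoc(word, date, docid, score) {
  return { _id: makeId(word, date, docid), score };
}

// Filter for the second query type: exact lookup of one entry via _id.
function exactLookup(word, date, docid) {
  return { _id: makeId(word, date, docid) };
}

const doc = toDoc("ipad", 20140113, 324, 98);
const filter = exactLookup("ipad", 20140113, 324);
```

The filter would be passed to db.index.find(filter). The first query type (word + date, sorted by score) would still rely on its own index, now defined over the fields inside _id.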

In MongoDB, when to use a simple subdocument, when an array with 2-field elements?

Background
I am storing table rows as MongoDB documents, with each column having a name. Let's say the table has these columns of interest: Identifier, Person, Date, Count. The MongoDB document also has some extra fields separate from the table data, such as a timestamp. Columns are not fixed (which is why I use a schema-free database to store them in the first place).
There will be a need to do various complex, but so far unspecified, queries. I am not very concerned about performance, though query performance may conceivably become a bottleneck. Once inserted, documents will not be modified (a new document with the same Identifier will be created instead), and insertions are not very frequent (let's say 1000 new MongoDB documents per day). So the amount of data will steadily grow over time.
Example
The straightforward approach is having a collection of MongoDB documents like:
{
_id: XXXX,
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
data: {
Identifier: "AB002",
Person: "John002",
Date: ISODate("2013-11-16T21:26:17Z"),
Count: 1
}
}
Now I have seen an alternative approach (for example in accepted answer of this question), using array with two fields per object:
{
_id: XXXX,
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
data: [
{ field: "Identifier", value: "AB002" },
{ field: "Person", value: "John001" },
{ field: "Date", value: ISODate("2013-11-16T21:26:17Z") },
{ field: "Count", value: 1 }
]
}
Questions
Does the 2nd approach make any sense at all?
If yes, then how to choose which to use? Especially, are there some specific kinds of queries which are easy/cheap with one approach and hard/costly with the other? Any "rules of thumb" on which way to go, or pro-con lists for both? Real-life cases of one approach being inconvenient would be especially valuable.
In your specific example the first version is a lot more appropriate and simple. You have to think in terms of how you would query your document.
It is a lot simpler to query your database like this: db.collection.find({"data.Identifier": "AB002"})
Although I'm not 100% sure why you even need the inner document. Why can't you structure your document like:
{
_id: "AB002",
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
Person: "John002",
Date: ISODate("2013-11-16T21:26:17Z"),
Count: 1
}
Pros of first example:
Simple to query
Enforces unique keys, but your data won't have two columns with the same name anyway
I would assume MongoDB would generate better query plans because the structure is a lot simpler (haven't tested)
Pros of second example:
Allows multiple entries with the same key/field, but I don't feel that is useful in your case
A single index on the array can be used for all of its entries regardless of their field name
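To compare the two shapes side by side, the sketch below converts between them and builds the corresponding query filter for each. The helper names are made up for illustration; the filters use only dot notation and $elemMatch.

```javascript
// Subdocument shape -> array-of-{field, value} shape.
function toArrayShape(data) {
  return Object.entries(data).map(([field, value]) => ({ field, value }));
}

// Array-of-{field, value} shape -> subdocument shape.
function toSubdocShape(entries) {
  return Object.fromEntries(entries.map(({ field, value }) => [field, value]));
}

// First approach: plain dot notation, one index per queried column.
function subdocQuery(field, value) {
  return { ["data." + field]: value };
}

// Second approach: $elemMatch ties the name and the value to the same element.
// A single compound index on {"data.field": 1, "data.value": 1} would cover
// every column, whatever it is called.
function arrayQuery(field, value) {
  return { data: { $elemMatch: { field, value } } };
}

const row = { Identifier: "AB002", Person: "John002", Count: 1 };
const asArray = toArrayShape(row);
const roundTrip = toSubdocShape(asArray);
const q1 = subdocQuery("Identifier", "AB002");
const q2 = arrayQuery("Identifier", "AB002");
```

The second filter is noticeably more verbose, which is the practical cost of the array shape; what it buys is that one index serves columns that are not known in advance.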
I don't think that the situation in the other example here and yours are the same. In the other example, they're creating a list of items with one of two answers, which would be more appropriately in an array, and the goal is to return a list of subdocuments that match the criteria. In your example, you're really just describing an object since they all hold different types of information, and you won't need to retrieve searchable bits of the subdocuments.