We are developing a C++ application development tool that uses MongoDB as the underlying database. Say the user has developed a patient collection with fields _id (OID) and patient#, with a unique ascending index on patient#. Say there are 10 patients with patient#s 1, 5, 7, 13, 14, 20, 21, 22, 23, 25. Patient# 20 is currently displayed, fetched with limit(1). The user presses PageDown and patient# 21 is displayed -- easy with $gt on patient# = 20.
When the user presses PageUp, patient# 14 should be displayed. My (hopefully wrong) solution is to create a parallel descending index on patient# and use $lt. That implies every collection a user creates requires both indexes on the primary key fields to get bidirectional movement, and the same would apply to secondary indexes such as name.
Additionally, when the user presses F5 they are prompted "# of records to move"; if they enter 3, patient# 23 should be displayed, and if they enter -3, patient# 7 should be displayed. My idea: first use a covered query to return the following 3 patient#s from the index, then fetch the 3rd document. This isn't at all ideal when a less simplified application has hundreds of thousands of documents and the user wants to traverse by tens of thousands of records. And, again, to achieve the backward movement, I believe I would need that second descending index.
Finally, I need a way to have the user navigate to the last index entry (patient# 25). Going to the beginning is trivial. Again, a second index?
My question: Is there a way to traverse the ascending index using (say) an iterator or pointer to the current index element and then use iterator/pointer arithmetic to achieve what I want? I.e., +1 would get me the "next" index element, from which I could fetch the "next" document; -1 the "previous"; +3 the third following; -3 the third previous; the index size the last. Or is there another solution without so much overhead (multiple indexes, large covered queries)?
The way to achieve what you want is to have an index on the relevant fields and then do simple queries to get you the records you need when you need them.
It's important not to over-think the problem. Rather than trying to break down how the query optimizer would traverse the index and somehow "reduce" the work it does, just use the queries you need to get the job done.
That means in your case querying for the records you need when you need them, and when the user wants to jump to a particular record, querying for that record. If someone is looking at record 27 and wants to go to the previous one, you can query for the largest record smaller than 27 via $lt with a descending sort and a limit(1) qualifier on your find. MongoDB can traverse a single-field index in either direction, so one ascending index serves both sort orders and no parallel descending index is needed.
I would encourage you to revisit your choice to have essentially two primary keys: instead of a separate patient# field with its own unique index, you can store the patient number in the _id field and reuse the unique index on _id that MongoDB requires in every collection.
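To make the query shapes concrete, here is a small in-memory sketch over the question's sample patient numbers (the field name `patient` is a placeholder, not from the source); each function's docstring shows the MongoDB query it mimics:

```python
import bisect

# The sample unique ascending index on patient# from the question.
index = [1, 5, 7, 13, 14, 20, 21, 22, 23, 25]

def next_after(cur, n=1):
    """Mimics find({patient: {$gt: cur}}).sort({patient: 1}).skip(n-1).limit(1)."""
    i = bisect.bisect_right(index, cur) + (n - 1)
    return index[i] if i < len(index) else None

def prev_before(cur, n=1):
    """Mimics find({patient: {$lt: cur}}).sort({patient: -1}).skip(n-1).limit(1).
    MongoDB walks the same ascending index backwards for the descending sort,
    so no second index is needed."""
    i = bisect.bisect_left(index, cur) - n
    return index[i] if i >= 0 else None

def last():
    """Mimics find().sort({patient: -1}).limit(1)."""
    return index[-1]

print(next_after(20), prev_before(20), next_after(20, 3), prev_before(20, 3), last())
# Matches the question's examples: 21 14 23 7 25
```

Note that for very large jumps, skip(n-1) still walks n index entries; that is inherent to the data structure, but it stays an index-only walk and never touches the documents being skipped.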
A more concrete description of what I am trying to do can be found here:
How to get the previous mongoDB document from a compound index
The answer is: it can't be done in the current version. The second Jira ticket below is a possible future fix.
See: SERVER-9540
and
SERVER-9547
Related
I'm working on my app and I just ran into a dilemma regarding the best way to handle indexes in Firestore.
I have a query that searches for publications in a specific community that contain at least one of the tags and fall within a geohash range. The index for that query looks like this:
community Ascending tag Ascending location.geohash Ascending
Now if my user doesn't need to filter by tag, I run the query without the arrayContains(tag) clause, which prompts me to create another index:
community Ascending location.geohash Ascending
My question is: is it better to create that second index, or to just use the first one and specify all possible tags in arrayContains when the user wants no tag filter?
Neither is inherently better; it's a typical space-versus-time tradeoff.
Adding the extra tags to the query adds some overhead there, but it saves you the (storage) cost of the additional index. So you're trading a small amount of runtime performance for a small amount of space/cost savings.
One thing to check is whether the query with tags can actually run on just the second index, as Firestore may be able to do a zigzag merge join. In that case you could keep only the second, smaller index and avoid adding the extra clauses on the no-filter query, at the price of a (similarly small) performance difference on the query where you do specify one or more tags.
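The "specify all possible tags" option rests on an equivalence that can be sketched in memory (the documents, tag values, and field names below are invented for illustration):

```python
# Hypothetical documents shaped like the question's publications.
docs = [
    {"community": "a", "tags": ["food"], "geohash": "u0x"},
    {"community": "a", "tags": ["sport"], "geohash": "u0y"},
    {"community": "b", "tags": ["food"], "geohash": "u0x"},
]

ALL_TAGS = ["food", "sport"]  # assumed closed, known set of tags

def query(community, tags=None):
    """Mimics .where('community', '==', c) plus an optional
    tags clause (array-contains-any semantics)."""
    out = [d for d in docs if d["community"] == community]
    if tags is not None:
        out = [d for d in out if any(t in d["tags"] for t in tags)]
    return out

# Passing every possible tag behaves like omitting the tag clause,
# which is why the first composite index could serve both cases.
assert query("a", ALL_TAGS) == query("a")
```

One caveat: Firestore caps the number of values allowed in an `array-contains-any` / `in` clause, so this workaround only holds while the full tag set stays under that limit.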
The chat app schema that I have is something like below.
1. conversations {participants: [user_1, user_2], conversation_id}
2. messages {sender: user_1, conversation_id, timestamp}
I want to map this relationship using existing _id:ObjectId which is already indexed.
But if I want to get all conversations of user_1, I first have to search which conversations that user is involved in, get those conversations' _ids, and then search for the messages in messages using those conversation _ids.
So my questions are -
Does the length of the indexed field (here _id) matter while searching?
Should I create another, shorter indexed field?
Also if there is any better alternative schema please suggest.
I would suggest maintaining the data as sub-documents instead of an array. The advantage is that you can build a separate index on just the conversation_id field, which is what you query to find the user's involvement.
When you maintain it as an array, you cannot index the conversation_id field separately; instead you have to build a multikey index, which indexes all the elements of the array (including the sender and timestamp fields) that you are never going to query, and which also increases the index size.
Answering your questions:
Does the length of the indexed field (here _id) matter while searching? - Not really.
Should I create another, shorter indexed field? - Create sub-documents and index conversation_id.
Also, if there is any better alternative schema, please suggest. - Maintain the array fields as sub-documents.
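The two shapes can be sketched side by side (collection and field names here are illustrative, not from the source):

```python
# Array shape from the question: participants live inside the conversation
# document, so finding a user's conversations needs a multikey index over
# the whole participants array.
conversation = {"_id": "conv1", "participants": ["user_1", "user_2"]}

# Suggested sub-document shape: one membership document per
# (user, conversation) pair, so a plain single-field index on either
# field answers the lookup directly.
memberships = [
    {"user": "user_1", "conversation_id": "conv1"},
    {"user": "user_2", "conversation_id": "conv1"},
    {"user": "user_1", "conversation_id": "conv2"},
]

def conversations_of(user):
    """Mimics db.memberships.find({user: user}) backed by an index on 'user'."""
    return [m["conversation_id"] for m in memberships if m["user"] == user]
```

With this shape the first lookup ("which conversations is user_1 in?") hits one small index instead of a multikey index over whole arrays.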
I got a query like this that gets called 90% of the times:
db.xyz.find({ "ws.wz.eId" : 665 , "ws.ce1.id" : 665 })
and another one like this that gets called 10% of the times:
db.xyz.find({ "ws.wz.eId" : 111 , "ws.ce2.id" : 111 })
You can see that the two id fields in each query are compared against the same value.
Now I'm wondering if I should just create a single index on "ws.wz.eId", or create two compound indexes: one for {"ws.wz.eId", "ws.ce1.id"} and another for {"ws.wz.eId", "ws.ce2.id"}.
It seems to me that the single index is the best choice; however I might be wrong; so I would like to know if there is value in creating the compound index, or any other type.
As muratgu already pointed out, the best way to reason about performance is to stop reasoning and start measuring instead.
However, since measurements can be quite tricky, here's some theory:
You might want to consider one compound index {"ws.wz.eId", "ws.ce1.id"} because that can be used for the 90% case and, for the ten percent case, is equivalent to just having an index on ws.wz.eId.
When you do this, the first query can be matched entirely through the index; the second query will first find all candidates with matching ws.wz.eId (fast, index present) and then scan-and-match those candidates to filter out documents that don't match the ws.ce2.id criterion. Whether that is expensive depends on the number of documents with the same ws.wz.eId that must be scanned, so it depends very much on your data.
An important factor is the selectivity of the key. For example, if there are a million documents with the same ws.wz.eId and only one of them has the ws.ce2.id you're looking for, you might need the second compound index, or want to reverse the query.
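The prefix behaviour this answer relies on can be sketched in memory (the data values are invented): a compound index on {"ws.wz.eId": 1, "ws.ce1.id": 1} is conceptually a sorted list of (eId, ce1_id) pairs, and a query on eId alone is just a range scan over that prefix:

```python
import bisect

# Simulated compound index {"ws.wz.eId": 1, "ws.ce1.id": 1}:
# sorted (eId, ce1_id, doc) entries; the values are made up.
index = sorted([
    (111, 9, "d3"),
    (665, 1, "d1"),
    (665, 2, "d2"),
    (665, 2, "d4"),
])

def find(eid, ce1_id=None):
    """Range-scan the eId prefix; optionally filter on ce1_id inside it.
    Querying on eid alone uses the same index, which is why the compound
    index also covers the 10% case."""
    lo = bisect.bisect_left(index, (eid,))       # first entry with this eId
    hi = bisect.bisect_left(index, (eid + 1,))   # first entry past this eId
    hits = index[lo:hi]
    if ce1_id is not None:
        hits = [h for h in hits if h[1] == ce1_id]
    return [h[2] for h in hits]
```

Within the eId range, filtering on ce1_id is the scan-and-match step described above; only its cost depends on how many entries share the same eId.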
I have a collection of documents (named posts) that each contain a field named category.
Each category is part of a categories collection. There are a fixed number of them (say 15).
How do I fetch the last 10 tldrs from each category?
Another solution would be to set a "flag" in each post that marks it as part of the result, like:
topTen: true
Defining a sparse index on that flag would give the fastest query -- at the price, of course, of maintaining that flag:
- set the flag at insertion time (impact: one more index to update)
- if it is tolerable that for a certain period the query returns 11 posts instead of 10, trigger a background process that deletes (unsets) the 11th flag for that category
- if that is not tolerable, find and unset the 11th flag at insert time
- if the category of an existing post is altered, make sure the flags get set correctly (for both the old and the new category)
- if a post that has the flag set gets removed, find and set the flag for the new 10th post
- maybe you'd want to provide a periodically run process that makes sure the flags are all set as they should be
For more information on sparse indexes, see http://docs.mongodb.org/manual/core/indexes/#index-type-sparse
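The insert-time maintenance above can be sketched in memory (the collection and field names are illustrative, and only the insert path is modelled; category changes and deletes would need the extra steps listed):

```python
from collections import defaultdict

# Newest-first list of posts per category; 'topTen' plays the sparse-index flag.
posts_by_category = defaultdict(list)

def insert_post(post):
    bucket = posts_by_category[post["category"]]
    post["topTen"] = True                 # set the flag at insertion time
    bucket.insert(0, post)
    if len(bucket) > 10:
        bucket[10].pop("topTen", None)    # unset the 11th flag at insert time

def top_ten(category):
    """Mimics a find on the sparse topTen index for one category."""
    return [p for p in posts_by_category[category] if p.get("topTen")]

for i in range(12):
    insert_post({"category": "news", "title": f"post {i}"})
print(len(top_ten("news")))  # 10
```

Because the index is sparse, posts whose flag has been unset simply drop out of the index, keeping it small and the query cheap.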
Probably it will be better to just first get the list of all categories and then, for each of them, get their 10 latest posts with separate queries.
Basic question: will MongoDB's find command always return documents in the order they were added to the collection? If not, how can documents be selected in the right order?
Sort? But what if docs were added simultaneously and, say, the created date is the same, but there was still an order?
Well, yes and ... not exactly.
Documents are sorted by natural order by default, which is initially the order in which the documents are stored on disk -- which is indeed the order in which the documents were added to the collection.
This order, however, is not deterministic: documents may be moved on disk when they grow after update operations and no longer fit into their current space. In that case the initial (insert) order may change.
The way to guarantee insert-order sorting is to sort by {_id: 1}, as long as _id is of type ObjectId. This will return your documents sorted in ascending insertion order.
Write operations do not take place simultaneously: write locks are taken at the database level (v2.4 and later). The first four bytes of an ObjectId are the insertion timestamp, and the last three bytes are an incrementing counter used to distinguish (and order) ObjectId instances created within the same second.
The _id field is indexed by default.
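A pure-Python sketch of that layout shows why sorting by _id reproduces insertion order within a single process (`fake_object_id` is an illustrative stand-in, not a real driver API; the 4-byte timestamp / 5-byte per-process random / 3-byte counter split follows the ObjectId specification):

```python
import os
import struct

RANDOM5 = os.urandom(5)   # fixed once per process, per the ObjectId spec
_counter = 0

def fake_object_id(timestamp):
    """Build a 12-byte ObjectId-like hex string: 4-byte big-endian
    timestamp + 5 process-random bytes + 3-byte incrementing counter."""
    global _counter
    _counter += 1
    raw = struct.pack(">I", timestamp) + RANDOM5 + _counter.to_bytes(3, "big")
    return raw.hex()

# Two ids in the same second, then one a second later: lexicographic
# (i.e. index) order still matches generation order, because the random
# bytes are constant per process and the counter keeps increasing.
ids = [fake_object_id(1000), fake_object_id(1000), fake_object_id(1001)]
assert ids == sorted(ids)
```

Across different processes the middle bytes differ, so ids generated in the same second by different writers only sort by timestamp granularity, not by exact write order.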