The most performant way to query against three collections, sort them and limit?

I am in need of querying against 3 different collections.
Then I need to sort the results (each collection on a different field and order, but always a DateTime value), and finally limit the number of results (10 in my case).
Currently I'm just doing three separate queries, limiting each to 10, then manually sorting the combined results by their DateTime values, and finally limiting to 10 myself.
Is there a better way to do this?

Since MongoDB is not a relational database where you can join multiple tables within one query: no. I'm not even sure that kind of sort (giving each field equal order precedence) is possible in a relational DBMS either.
What you're doing already sounds really good. You could possibly improve the sorting of these 3 result sets by aborting the iteration over one or more collections early when no further element can make the overall top 10. You could modify your queries accordingly, so that the other two collections only return documents whose date is lower than the last (10th) date from the first queried collection. But maybe you did this already...
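A minimal sketch of that bounding idea, assuming an ascending date sort and hypothetical collection and field names (postsA/postsB/postsC, each sorted on its own DateTime field):

// Fetch the first collection's top 10; its 10th date bounds the other queries.
var first10 = db.postsA.find().sort({ createdAt: 1 }).limit(10).toArray();
var candidates = first10;
if (first10.length === 10) {
    var bound = first10[9].createdAt;
    // Only documents dated before the current 10th result can still make
    // the overall top 10, so filter the other collections by that bound.
    candidates = candidates.concat(
        db.postsB.find({ publishedAt: { $lt: bound } }).sort({ publishedAt: 1 }).limit(10).toArray(),
        db.postsC.find({ modifiedAt: { $lt: bound } }).sort({ modifiedAt: 1 }).limit(10).toArray());
} else {
    // Fewer than 10 in the first collection: no useful bound, query normally.
    candidates = candidates.concat(
        db.postsB.find().sort({ publishedAt: 1 }).limit(10).toArray(),
        db.postsC.find().sort({ modifiedAt: 1 }).limit(10).toArray());
}
// Merge client-side: sort the (at most 30) candidates by their respective
// DateTime fields and keep the first 10.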
While we're talking about performance, you may want to add indexes on the datetime fields used by your queries, so that each sort can walk a presorted index instead of sorting in memory.
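For example (same assumed field names as in the sketch above):

// One index per collection on its sort field, so each .sort() is index-backed.
db.postsA.createIndex({ createdAt: 1 });
db.postsB.createIndex({ publishedAt: 1 });
db.postsC.createIndex({ modifiedAt: 1 });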

Related

Query and sort on multiple collections based on time

Our application is planning to use MongoDB for reports.
Our reports are time-based (where the raw data is different events).
We were thinking of creating a separate collection for each day, so we will not need to query a whole large collection when we need to query, aggregate and sort events for a specific day only.
One question is whether this design makes sense.
Another question is what will happen if we need to aggregate and sort events over more than one collection - for one week, for example.
Does MongoDB support this? If it does - how should it be done so it is efficient in terms of performance?
We were thinking of creating a separate collection for each day, so we will not need to query a whole large collection when we need to query, aggregate and sort events for a specific day only.
One question is whether this design makes sense.
With proper indexes, MongoDB should not have problems with a very big collection.
You could read more here: MongoDB indexes
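A minimal sketch of the single-collection alternative, assuming an events collection with a hypothetical ts timestamp field - one index serves day and week ranges alike:

db.events.createIndex({ ts: 1 });
// One day:
db.events.find({ ts: { $gte: ISODate("2023-01-02"), $lt: ISODate("2023-01-03") } }).sort({ ts: 1 });
// One week - same index, just a wider range:
db.events.find({ ts: { $gte: ISODate("2023-01-02"), $lt: ISODate("2023-01-09") } }).sort({ ts: 1 });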
Another question is what will happen if we need to aggregate and sort events over more than one collection - for one week, for example.
Does MongoDB support this? If it does - how should it be done so it is efficient in terms of performance?
If you want to go that way, you could use aggregation pipelines and $facet to run multiple queries at once. However, this could become a little tricky because you have to generate the collection names from your query parameters. In fact, I think this could be slower and more error-prone, so I don't recommend this approach.
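For reference, a minimal $facet sketch (collection and field names assumed): the sub-pipelines all read from the same collection's input, which is one more reason the per-day-collection design gets awkward.

db.events.aggregate([
  { $match: { ts: { $gte: ISODate("2023-01-02"), $lt: ISODate("2023-01-09") } } },
  { $facet: {
      // Facet 1: per-day event counts for the week.
      countsByDay: [
        { $group: { _id: { $dateToString: { format: "%Y-%m-%d", date: "$ts" } }, count: { $sum: 1 } } },
        { $sort: { _id: 1 } }
      ],
      // Facet 2: the 10 most recent events in the same range.
      latest: [ { $sort: { ts: -1 } }, { $limit: 10 } ]
  } }
]);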

Mongo DB update query performance

I would like to understand which of the below queries would be faster when doing updates in MongoDB. I want to update a few thousand records in one stretch.
Accumulating the object ids of those records and firing the update using $in, or using a bulk update?
Using one or two fields in the collection which are common to those few thousand records - akin to a "where" clause in SQL - and firing an update using those fields. These fields might or might not be indexed.
I know the query will be much smaller in the 2nd case, as every single "_id" (oid) is not accumulated. Does accumulating _ids and using those to update documents offer any practical performance advantages?
Does accumulating _ids and using those to update documents offer any practical performance advantages?
Yes, because MongoDB will certainly use the _id index (idhack).
In the second method - as you observed - you can't tell whether or not an index will be used for a certain field.
So the answer will be: it depends.
If your collection has millions of documents or more, and/or the number of search fields is quite large, you should prefer the first method - especially if the id list is not small and/or the id values are adjacent.
If your collection is pretty small and you can tolerate a full scan, you may prefer the second approach.
In any case, you should verify both methods using explain().
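A minimal sketch of that comparison (collection and field names assumed; explained writes are not actually executed):

// Gather the _ids first, as in method 1.
var ids = db.orders.find({ status: "pending" }, { _id: 1 }).toArray().map(function (d) { return d._id; });
// Method 1: update by accumulated _ids - look for an IXSCAN on _id in the plan.
db.orders.explain("executionStats").update(
    { _id: { $in: ids } }, { $set: { status: "done" } }, { multi: true });
// Method 2: update by the common fields directly - a COLLSCAN here means
// no index supports the filter.
db.orders.explain("executionStats").update(
    { status: "pending" }, { $set: { status: "done" } }, { multi: true });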

All vs All comparisons on MongoDB

We are planning to use MongoDB for a general purpose system and it seems well suited to the particular data and use cases we have.
However we have one use case where we will need to compare every document (of which there could be 10s of millions) with every other document. The 'distance measure' could be pre computed offline by another system but we are concerned about the online performance of MongoDB when we want to query - eg when we want to see the top 10 closest documents in the entire collection to a list of specific documents ...
Is this likely to be slow? Also can this be done across documents (eg query for the top10 closest documents in one collection to a document in another collection)...
Thanks in advance,
FK

How complete should MongoDB indexes be?

For example, I have documents with only three fields: user, date, status. Since I select by user and sort by date, I have those two fields as an index. That is the proper thing to do. However, since each date only has one status, I am essentially indexing everything. Is it okay to not index all fields in a query? Where do you draw the line?
What makes this question more difficult is the complete opposite approach to indexes between read-heavy and write-heavy collections. If yours is somewhere in between, how do you determine the proper approach when it comes to indexes?
Is it okay to not index all fields in a query?
Yes, but you'll want to avoid this for frequently used queries. Anything not indexed implies a "table scan" (a full collection scan in MongoDB terms). This means accessing each possible document individually, which will be slow.
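A minimal sketch for the example from the question (collection name assumed): indexing the query and sort fields lets this query avoid a collection scan.

db.items.createIndex({ user: 1, date: 1 });
db.items.find({ user: "alice" }).sort({ date: 1 }).explain("executionStats"); // expect IXSCAN, not COLLSCAN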
Where do you draw the line?
Also note that if you sort by an un-indexed field, MongoDB will "yell at you" (abort the query) if you're trying to sort too much data in memory. So you have to have some awareness of how much data is "outside of" the index.
If yours is somewhere in between, how do you determine the proper approach when it comes to indexes?
Monitoring, instrumenting, experimenting and experience.
There is no hard and fast rule here, it's all going to be about trade-offs. CPU vs. RAM vs. Disk IO vs. Responsiveness, etc.
The perfect situation is to store everything in a single index. By "everything" I mean all the fields you query on, sort by and retrieve. This will ensure that you get maximum performance (if the index fits in RAM).
This situation is not always possible, so you'll have to make choices.
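A minimal sketch of that ideal, using the user/date/status document shape from the question (collection name assumed): with everything in one index and _id excluded from the projection, the query is fully covered and never touches the documents themselves.

db.items.createIndex({ user: 1, date: 1, status: 1 });
db.items.find({ user: "alice" }, { _id: 0, user: 1, date: 1, status: 1 }).sort({ date: 1 });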
Here are 3 tips to keep the index size to a minimum:
Does each of your queries return a lot of results or only a few? => A few: you do not have to index all the fields you retrieve (only the query and sort fields, because few results mean few disk accesses).
Are your query results often the same (i.e. your working set is small)? => don't index the fields you retrieve, because the results are cached by MongoDB.
Is one query field more selective than another? => index the more selective field only (see the sketch below).
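A minimal sketch of that last tip, with assumed field names: "email" is nearly unique while "status" takes only a handful of values, so an index on the selective field alone narrows the scan to a few documents and MongoDB filters the rest.

db.users.createIndex({ email: 1 });
db.users.find({ email: "a@example.com", status: "active" });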

Mongo DB scaling question ( does indexes affect the 'distinct' performance )?

I'm using Mongo to store, day by day, all the "ticks" of a set of about 40 equities. These ticks contain the trade info (a document containing price and volume) and book info (a more complex document containing sell-buy proposals). The order of magnitude is about (5K trades + 20K books) x 40 equities per day. Documents are indexed by symbol (the equity name), date of insert and time of day. After a week of collection, one of my queries does not scale anymore: looking for distinct dates takes too long. So I decided to keep a special document just to record that there is a "collection" for a certain day - is this a correct approach? Furthermore, is it correct to store the ticks as separate little documents, or would it be better to collect them as an array on the equity document?
Thanks all!
BTW this question is a consequence of this one: Using mongodb for store intraday equity data
Addition:
even if I explicitly say (at the console)
db.books.ensureIndex({dateTag:1})
db.books.distinct("dateTag")
it replies slowly. So maybe a better question is: does an index affect distinct() performance?
Addition
After upgrading to 1.8.2 the behavior is the same.
does an index affect distinct() performance?
It does indeed; however, there's no "explain plan" for distinct(), so this can only be confirmed via the docs / code.
Documents are indexed by symbol (the equity name), date of insert and time of day
I'm not 100% clear on how many indexes you have or what type of memory footprint you have here. Just having an index does not necessarily mean that it's going to be really fast. If that index is not in memory, then you end up going to disk and slowing down your query.
If you're seeing slow performance on this query despite the index I would check two things:
Disk activity (during the query)
Data size relative to memory
However, it may just be easier to keep a list of "days stored". That distinct query is probably going to get worse even with an index, so it's never going to be as fast as a document simply listing the days.
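A minimal sketch of that "days stored" idea (collection and field names assumed): maintain one small document whose days array is updated as ticks are inserted, then read it instead of running distinct().

// On each insert (or once per day), record the day - $addToSet keeps it unique.
db.meta.update({ _id: "tickDays" }, { $addToSet: { days: "2011-06-01" } }, { upsert: true });
// Listing the stored days is then a single-document fetch:
db.meta.findOne({ _id: "tickDays" }).days;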
I don't think your "collection for a certain day" approach would work out, because you would run into MongoDB's limit of 24,000 namespaces per database. Storing the ticks in an array property of a document could make it harder to execute certain types of queries (it really depends on what types of reports you need to run on the ticks).
Are you sure that you have indexes in place for the properties you use in your problematic query? As a last resort you could try sharding, but I doubt that is necessary at this point.
http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Distinct
clearly states that distinct() can use indexes starting with MongoDB 1.7.3.
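For what it's worth, newer MongoDB versions do expose a plan for distinct(); a minimal sketch, assuming the dateTag index above:

db.books.createIndex({ dateTag: 1 });
db.books.explain("executionStats").distinct("dateTag"); // expect an index-only DISTINCT_SCAN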