We would like to evaluate the effectiveness of our indexes in a MongoDB-based REST service setup. The idea is to populate a collection with a synthetic dataset (e.g. 10,000,000 documents), then run a load injector process performing random REST operations (each one involving a query at the MongoDB layer) to evaluate which indexes are being used and to gather statistical information about them (e.g. the per-index hit rate).
We have considered using the explain() command or indexStats. However, explain() has two problems: 1) it only allows evaluating the effectiveness of a single query, and 2) it is difficult to use in a “black box” environment in which our load injector process interacts with the REST service on top of MongoDB but not with MongoDB itself. Regarding indexStats, as far as I understand, it shows information about the index structure “on disk” but not about index usage.
So, what is the best way of doing this kind of test? Any procedure description or URL with information about the topic is highly welcome.
You should read about performance profiling.
You can turn profiling on with:
db.setProfilingLevel(2);
Or, if you don't want too much noise, you can check only queries that took more than e.g. 20 ms:
db.setProfilingLevel(1,20);
You can query system.profile to get information about slow queries, e.g. to find all operations slower than 30 ms:
db.system.profile.find( { millis : { $gt : 30 } } ).pretty()
You can then manually profile each slow query with explain().
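For example, a rough sketch of that workflow in the shell might look like this (collection and field names are illustrative, and the exact layout of profiler documents varies by server version):
// List the predicates of slow read operations recorded by the profiler.
db.system.profile.find({ millis: { $gt: 30 }, op: "query" }).forEach(function (op) {
    printjson(op.query);   // the slow query's predicate (field name varies by version)
});
// Then re-run an interesting predicate with explain() to see which plan/index is chosen.
db.myCollection.find({ status: "active" }).explain();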
For real-time monitoring you can use mongotop and mongostat. You should also consider setting up MMS (MongoDB's free monitoring service), where you can check btree hits/misses and compare them with other relevant statistics.
Edit
You can get the relevant indexCounters data by using the serverStatus command:
db.serverStatus()
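For example, assuming a server version that still reports the indexCounters section (it was removed in later releases), you could pull out just that part of the output:
// Index access counters (accesses, hits, misses) from serverStatus;
// only present on older MongoDB versions / MMAPv1.
var status = db.serverStatus();
printjson(status.indexCounters);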
I'm implementing a logger using MongoDB and I'm quite new to the concept.
The logger is supposed to log each request and its response.
I'm trying to decide between using MongoDB's TTL index and simply deleting old documents with an overnight query.
I think the first method might add some overhead, since it uses a background thread and probably rebuilds the index after each deletion, but it frees space as soon as the documents expire, which might be beneficial.
The second approach, on the other hand, does not have this kind of overhead, but it only frees up space at the end of each day.
It seems to me that the second approach suits my case better: my server is unlikely to be right on the edge of running out of disk space, whereas reducing overhead on the server is always worthwhile.
I'm wondering whether there are aspects of the subject that I'm missing, and I'm also not sure about the intended applications of MongoDB's TTL indexes.
Just my opinion:
It seems best to store logs in monthly, daily, or hourly collections, depending on your application's write load, and at the end of each period to just drop() the oldest collections with a custom script. From experience, TTL indexes do not work well when there is heavy write load on your collection, since they add additional write load based on the expiration time.
For example, imagine you insert log events at 06:00 at 100k/sec and your TTL index lifetime is set to 3 hours; this means that 3 hours later, at 09:00, those 100k/sec deletes will hit your collection and also be written to the oplog. The solution in such cases is to add more shards, but that becomes kind of expensive; it is far easier to just drop the expired collection.
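As a rough illustration of the two options (collection names, field names, and dates are just examples):
// Option 1: TTL index - documents are removed roughly 3 hours after their createdAt value.
db.logs.createIndex({ createdAt: 1 }, { expireAfterSeconds: 3 * 60 * 60 })
// Option 2: time-bucketed collections - write to a per-day collection and drop
// the oldest bucket from a scheduled script; dropping a collection is far
// cheaper than millions of individual TTL deletes.
db.getCollection("logs_2023_11_07").drop()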
Moreover, depending on your project size, for bigger collections you can additionally shard and pre-split the collections to speed up searches, using a compound index on a hashed datetime field (every log contains a timestamp) together with another field that you often search on; this will give you scalable search across multiple distributed shards.
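For example, something along these lines (database, collection, and field names are illustrative, and compound hashed shard keys require a reasonably recent MongoDB version):
// Distribute the log collection across shards on a compound key that
// combines a frequently-searched field with a hashed timestamp.
sh.enableSharding("logsdb")
sh.shardCollection("logsdb.logs", { appId: 1, ts: "hashed" })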
Also note that MongoDB is a general-purpose document database, and its full-text search is largely limited to expensive regex expressions, so if you need fast raw full-text search over your logs, an inverted-index search engine such as Elasticsearch on top of your MongoDB backend may be a good solution to cover this functionality.
It is mentioned that the cost of a query in Google Cloud Datastore, in terms of time, scales with the number of results; that is, the time it takes to run any query is proportional only to the number of matching results, not to the size of the dataset.
Can anyone explain how this is done in Cloud Datastore, or in NoSQL document databases in general?
I know it is possible to implement a distributed system and run queries in parallel, but it is mentioned that Datastore uses indexing to accomplish this. How would the indexing work in this case?
Queries in Cloud Datastore must use an index. There are no queries that scan the entire database.
As for how indexes work in general: the indexes in Cloud Datastore are all ordered indexes, and for each indexed property there is a write to a separate index table, which is then used to answer a query. You can find details at https://cloud.google.com/datastore/docs/concepts/indexes .
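As a small illustration (using the Node.js client, with made-up kind and property names), a query that filters on one property and orders by another is answered by reading a contiguous slice of a composite index, so its cost scales with the number of results returned rather than with the size of the kind:
const { Datastore } = require('@google-cloud/datastore');
const datastore = new Datastore();

async function recentErrors() {
  // Requires a composite index on (severity, timestamp); Datastore scans only
  // the matching range of that index instead of the whole kind.
  const query = datastore
    .createQuery('LogEntry')
    .filter('severity', '=', 'ERROR')
    .order('timestamp', { descending: true })
    .limit(100);

  const [entries] = await datastore.runQuery(query);
  return entries;
}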
I have already read the official documentation to get a basic idea of getPlanCache() and hint().
getPlanCache()
Displays the cached query plans for the specified query shape.
The query optimizer only caches the plans for those query shapes that can have more than one viable plan.
Official Documentation: https://docs.mongodb.com/manual/reference/method/PlanCache.getPlansByQuery/
hint()
The $hint operator forces the query optimizer to use a specific index to fulfill the query. Specify the index either by the index name or by document.
Official Documentation: https://docs.mongodb.com/manual/reference/operator/meta/hint/
My question
If I can make sure that the plans for queries on a specific collection are cached, I don't need to use hint() to ensure optimized performance. Is that correct?
I have already read the official documentation to get a basic idea of getPlanCache() and hint().
To be clear: these are troubleshooting aids for investigating query performance. The MongoDB query planner chooses the most efficient plan available based on a measure of "work" involved in executing a given query shape. If there is only a single viable plan, there is no need to cache the plan selection. If there are multiple query plans available for the same query shape, the query planner will periodically evaluate performance and update the cached plan selection if appropriate.
The query plan cache methods allow you to inspect and clear information in the plan cache. Generally you would only want to clear the plan cache while investigating issues in a development/staging environment, as doing so could have a noticeable effect on a busy deployment.
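For illustration, these are the kinds of shell calls involved (collection name and query are examples; method availability depends on your server version):
var planCache = db.orders.getPlanCache();
planCache.listQueryShapes();                 // query shapes that currently have cached plans
planCache.getPlansByQuery({ status: "A" });  // cached plans for one query shape
planCache.clear();                           // drop all cached plans for the collection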
If I can make sure that the plans for queries on a specific collection are cached, I don't need to use hint() to ensure optimized performance. Is that correct?
In general you should avoid using hint (outside of testing query plans) as this bypasses the query planner and forces use of the hinted index even if there might be a more efficient index available.
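When you do want to test an index choice, a typical comparison might look like this (index and field names are illustrative):
// Let the planner pick a plan and inspect how much work it did.
db.orders.find({ status: "A" }).explain("executionStats")
// Force a specific index and compare the executionStats of the two runs.
db.orders.find({ status: "A" }).hint({ status: 1, created: -1 }).explain("executionStats")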
If a specific query is not performing as expected, explain() output is the best starting point for insight into the query planning process. If you're not sure how to optimise a specific query, I'd suggest posting a question on DBA StackExchange including the output of explain(true) (verbose explain) and your MongoDB server version.
For a helpful presentation, see: Reading the .explain() Output - Charlie Swanson (June 2017).
I am developing a web application with MongoDB. I have noticed that when executing the same query repeatedly, the processing speed improves and finally converges at a certain point. For example, when querying all documents in the collection, the time per query goes from 100,000 ms -> 20,000 ms -> 9,000 ms -> ... -> 500 ms.
I am wondering what the reason behind this speed-up is, and how I can estimate the convergence point.
There are many possible reasons; I can give some points.
First of all, MongoDB is able to choose the best index for your query. To do so, MongoDB uses query plans, but this operation takes time:
If there are no matching entries, the query planner generates
candidate plans for evaluation over a trial period. The query planner
chooses a winning plan, creates a cache entry containing the winning
plan, and uses it to generate the result documents.
https://docs.mongodb.com/manual/core/query-plans/
Indexes should be loaded into memory in order to speed up performance, and this also takes time. Take a look at the touch command:
https://docs.mongodb.com/manual/reference/command/touch/
The touch command loads data from the data storage layer into memory.
touch can load the data (i.e. documents), the indexes, or both documents and indexes.
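For example (collection name is illustrative; note that touch has been removed in recent MongoDB releases):
// Load both the documents and the indexes of a collection into memory.
db.runCommand({ touch: "records", data: true, index: true })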
Another possible reason is that the indexes do not fit into memory. To find out whether this is your case, you can check with totalIndexSize:
https://docs.mongodb.com/manual/reference/method/db.collection.totalIndexSize/#db.collection.totalIndexSize
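For example (collection name is illustrative):
// Total size of all indexes on the collection, in bytes; roughly compare this
// with the memory available to mongod (e.g. db.serverStatus().mem).
db.records.totalIndexSize()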
I would focus more on the query-planner side, since MongoDB does not always make the best decision for you.
In my opinion, this whole topic should be evaluated carefully to avoid performance degradation.
Good Luck!
Scenario:
10,000,000 records/day
Records:
Visitor, day of visit, cluster (where we see it), metadata
What we want to know with this information:
Unique visitors on one or more clusters for a given range of dates.
Unique visitors by day.
Grouping by metadata for a given range (platform, browser, etc.).
The model I'm sticking with in order to easily query this information is:
{
    VisitorId: 1,
    ClusterVisit: [
        {clusterId: 1, dates: [date1, date2]},
        {clusterId: 2, dates: [date1, date3]}
    ]
}
Index:
by VisitorId (to ensure Uniqueness)
by ClusterVisit.ClusterId-ClusterVisit.dates (for searching)
by IdUser-ClusterVisit.IdCluster (for updating)
I also have to split groups of clusters into different collections in order to access the data more efficiently.
Importing:
First, we search for a combination of VisitorId and ClusterId, and we $addToSet the date.
Second:
If the first doesn't match, we upsert:
$addToSet: {
    VisitorId: 1,
    ClusterVisit: [{clusterId: 1, dates: [date1]}]
}
With the first and second steps, I cover the cases where the clusterId doesn't exist or the VisitorId doesn't exist (both steps are sketched below).
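A rough sketch of those two steps in the shell (the collection name and dates are placeholders; field names follow the model above):
// Step 1: the visitor/cluster pair already exists - add the date to its dates array.
db.visits.update(
    { VisitorId: 1, "ClusterVisit.clusterId": 1 },
    { $addToSet: { "ClusterVisit.$.dates": ISODate("2014-01-01") } }
)
// Step 2: nothing matched - upsert the visitor with a new cluster entry.
db.visits.update(
    { VisitorId: 1 },
    { $addToSet: { ClusterVisit: { clusterId: 1, dates: [ISODate("2014-01-01")] } } },
    { upsert: true }
)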
Problems:
Updates/inserts/upserts become totally inefficient (nearly impossible) as the collection grows, I guess because the document size keeps getting bigger when adding new dates.
Difficult to maintain (mostly unsetting dates).
I have a collection with more than 50,000,000 documents that I can't grow any more; it updates only ~100 records/sec.
I think the model I'm using is not the best for this amount of information. What do you think would be best to get more upserts/sec and query the information fast, before I get into sharding, which is going to take more time while I learn and get confident with it?
I have an x1.large instance on AWS
RAID 10 with 10 disks
Arrays are expensive on large collections: mapreduce, aggregate...
Try .explain():
MongoDB 'count()' is very slow. How do we refine/work around with it?
Add explicit hints for index:
Simple MongoDB query very slow although index is set
A full heap?:
Insert performance of node-mongodb-native
Running out of memory space for the collection:
How to improve performance of update() and save() in MongoDB?
Special read clustering:
http://www.colinhowe.co.uk/2011/02/23/mongodb-performance-for-data-bigger-than-memor/
Global write lock?:
mongodb bad performance
Slow logs performance track:
Track MongoDB performance?
Rotate your logs:
Does logging output to an output file affect mongoDB performance?
Use profiler:
http://www.mongodb.org/display/DOCS/Database+Profiler
Move some collection caches to RAM:
MongoDB preload documents into RAM for better performance
Some ideas about collection allocation size:
MongoDB data schema performance
Use separate collections:
MongoDB performance with growing data structure
A single query can only use one index (so a compound index is better):
Why is this mongodb query so slow?
A missing key?:
Slow MongoDB query: can you explain why?
Maybe shards:
MongoDB's performance on aggregation queries
Stack Overflow links on improving performance:
https://stackoverflow.com/a/7635093/602018
A good starting point for further education on sharding and replicas is:
https://education.10gen.com/courses