My mongo query runs for more than 5000 milliseconds - mongodb

I have MySQL as my primary database and MongoDB as a secondary database.
I ran a query on production and it takes more than 5 seconds.
Here is the output when I run the query with explain():
{
"cursor" : "BtreeCursor host_1_type_1 multi",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 313566,
"nscannedObjectsAllPlans" : 313553,
"nscannedAllPlans" : 627118,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 14,
"nChunkSkips" : 0,
"millis" : 6555,
"indexBounds" : {
"host" : [
[
"",
{
}
],
[
/shannisideup/i,
/shannisideup/i
]
],
"type" : [
[
"ambassador-profile",
"ambassador-profile"
]
]
},
"server" : "mongoserver:27017"
}
I've added these indexes:
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "db.visitor",
"name" : "_id_"
},
{
"v" : 1,
"key" : {
"host" : 1,
"type" : 1
},
"ns" : "db.visitor",
"name" : "host_1_type_1"
}
]
But I still don't know why this query is so slow:
db.visitor.find({ host:/shannisideup/i, type:"ambassador-profile" }).limit(1)
FYI, my application and my MongoDB server run on different servers in the AWS cloud.
MongoDB runs on an EC2 m3.medium instance; I've raised the open-file limit as the MongoDB website suggests. MongoDB stores its data on a separate 100GB disk mounted via /dev/sdf.
I run MongoDB version 2.4.5.
When MongoDB is running, the CPU load is almost always 100%.
My MMS stats for opcounters are:
command: 19.11
query: 13.79
update: 0.03
delete: 0.00001
getmore: 0
insert: 0.03
My highest page faults value is 0.02.
What can I do to optimize my MongoDB so this query runs in less than 1 second?

I'll walk through a few things that should help, although I want to make clear up front that a case-insensitive regex search (host:/shannisideup/i) cannot use an index, so there are limits to what you can do with this data model and search. You can see that it's scanning a large number of objects ("nscanned" : 313566) just to return 1 document.
Things to do:
1) Upgrade your instance to a faster CPU - the 100% CPU load during this case-insensitive search is a clear indicator that your database is at times CPU bound. A faster CPU will help. More memory (which typically comes with Amazon EC2 instances when you move to a faster CPU) won't hurt either. Same with a faster disk (SSD).
2) Lowercase your host field before storing it in MongoDB - if you store all your hosts in lower case and then lowercase your search string prior to searching, you can drop the case-insensitive flag from the regex. If you can combine that with a prefix (left-anchored) regex, the query will be able to use the index, which will help. A sketch of this approach follows this list.
3) Consider using a dedicated text search product (such as Solr or Elasticsearch) for the host query. MongoDB is limited in its text search capabilities, particularly with regard to wildcard searches, and something like Elasticsearch or Solr may provide better performance.
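As a rough sketch of option 2 (the host_lower field and the sample values below are hypothetical, not taken from your schema), the idea is to store a lowercased copy of the host and query it with a case-sensitive, left-anchored regex so the compound index can be used:

// Hypothetical lowercased copy of the host field, maintained by the application at write time
db.visitor.ensureIndex({ host_lower: 1, type: 1 })
db.visitor.insert({
    host: "ShanniSideUp.example.com",        // original value, kept for display
    host_lower: "shannisideup.example.com",  // lowercased copy used for searching
    type: "ambassador-profile"
})

// Case-sensitive, anchored regex: this can walk the index instead of scanning the collection
db.visitor.find({ host_lower: /^shannisideup/, type: "ambassador-profile" }).limit(1)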
A few links that you may find useful:
http://docs.mongodb.org/manual/reference/operator/query/regex/
http://selectnull.leongkui.me/2014/02/02/mongodb-elasticsearch-perfect-nosql-duo/

Related

Insert operation became very slow for MongoDB

The client is pymongo.
The program has been running for one week. Inserting data was indeed very fast at first: about 10 million records per 30 minutes.
But today I found that the insert operation has become very, very slow.
There are about 120 million records in the goods collection now.
> db.goods.count()
123535156
And the indexes for the goods collection are as follows:
db.goods.getIndexes();
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "shop.goods",
"name" : "_id_"
},
{
"v" : 1,
"key" : {
"item_id" : 1,
"updated_at" : -1
},
"unique" : true,
"ns" : "shop.goods",
"name" : "item_id_1_updated_at_-1"
},
{
"v" : 1,
"key" : {
"updated_at" : 1
},
"ns" : "shop.goods",
"name" : "updated_at_1"
},
{
"v" : 1,
"key" : {
"item_id" : 1
},
"ns" : "shop.goods",
"name" : "item_id_1"
}
]
And there is enough RAM and CPU.
Someone told me it is because there are too many records, but didn't tell me how to solve the problem. I was a bit disappointed with MongoDB.
More data will need to be stored in the future (about 50 million new records per day). Is there any solution?
I hit the same situation on another server (less data this time, about 40 million records in total); the current insert speed is about 5 records per second.
> db.products.stats()
{
"ns" : "c2c.products",
"count" : 42389635,
"size" : 554721283200,
"avgObjSize" : 13086.248164203349,
"storageSize" : 560415723712,
"numExtents" : 283,
"nindexes" : 3,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1.0000000000132128,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 4257185968,
"indexSizes" : {
"_id_" : 1375325840,
"product_id_1" : 1687460992,
"created_at_1" : 1194399136
},
"ok" : 1
}
I don't know if it is your problem, but keep in mind that MongoDB has to update every index on each insert. So if you have many indexes and many documents, performance can be lower than expected.
Maybe you can speed up insert operations by using sharding. You don't mention it in your question, so I guess you are not using it.
Anyway, could you provide us with more information? You can use db.goods.stats(), db.serverStatus(), or any of these other methods to gather information about the performance of your database.
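To make that concrete, here is a minimal sketch of the kind of diagnostics meant above (all of these are standard shell helpers; which fields matter most depends on your storage engine):

db.goods.stats()              // collection size, average object size, index sizes, padding factor
db.serverStatus().opcounters  // insert/query/update rates since startup
db.serverStatus().globalLock  // lock contention (relevant on MMAPv1)
db.currentOp()                // operations currently in flight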
Another possible problem is IO. Depending on your scenario, Mongo might be busy trying to grow or allocate storage files for the given namespace (i.e. the DB) for the subsequent insert statements. If your test pattern has been add records / delete records / add records / delete records, you were likely reusing existing allocated space. If your app is now running longer than before, you might be in the situation I described.
Hope this sheds some light on your situation.
I had a very similar problem.
First you need to determine which is your bottleneck (CPU, memory, or disk IO). I use several Unix tools (such as top, iotop, etc.) to detect the bottleneck. In my case I found insertion speed was limited by IO, because mongod often showed 99% IO usage. (Note: my original db used the MMAPv1 storage engine.)
My workaround was to change the storage engine to WiredTiger, either by running mongodump on the original db and then mongorestore into the WiredTiger format, or by starting a new mongod with the WiredTiger engine and then resyncing from the other replica set members. My insertion speed went back to normal after doing that.
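For reference, a rough outline of the dump/restore path just described; the host names, ports, and paths are placeholders, and WiredTiger requires MongoDB 3.0 or later:

mongodump --host localhost:27017 --db shop --out /backup/shop-dump
mongod --storageEngine wiredTiger --dbpath /data/wt --port 27018   # fresh data directory for the new engine
mongorestore --host localhost:27018 --db shop /backup/shop-dump/shop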
However, I am still not sure why mongod with MMAPv1 suddenly saturated IO after the data reached a certain size.

Slow range query on a multikey index

I have a MongoDB collection named post with 35 million objects. The collection has two secondary indexes defined as follows.
> db.post.getIndexKeys()
[
{
"_id" : 1
},
{
"namespace" : 1,
"domain" : 1,
"post_id" : 1
},
{
"namespace" : 1,
"post_time" : 1,
"tags" : 1 // this is an array field
}
]
I expect the following query, which simply filters by namespace and post_time, to run in a reasonable time without scanning all objects.
>db.post.find({post_time: {"$gte" : ISODate("2013-04-09T00:00:00Z"), "$lt" : ISODate("2013-04-09T01:00:00Z")}, namespace: "my_namespace"}).count()
7408
However, it takes MongoDB at least ten minutes to retrieve the result and, curiously, it manages to scan 70 million objects to do the job according to the explain function.
> db.post.find({post_time: {"$gte" : ISODate("2013-04-09T00:00:00Z"), "$lt" : ISODate("2013-04-09T01:00:00Z")}, namespace: "my_namespace"}).explain()
{
"cursor" : "BtreeCursor namespace_1_post_time_1_tags_1",
"isMultiKey" : true,
"n" : 7408,
"nscannedObjects" : 69999186,
"nscanned" : 69999186,
"nscannedObjectsAllPlans" : 69999186,
"nscannedAllPlans" : 69999186,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 378967,
"nChunkSkips" : 0,
"millis" : 290048,
"indexBounds" : {
"namespace" : [
[
"my_namespace",
"my_namespace"
]
],
"post_time" : [
[
ISODate("2013-04-09T00:00:00Z"),
ISODate("292278995-01--2147483647T07:12:56.808Z")
]
],
"tags" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "localhost:27017"
}
The difference between the number of objects and the number of index entries scanned must be caused by the lengths of the tags arrays (which are all equal to 2). Still, I don't understand why the post_time filter does not make use of the index.
Can you tell me what I might be missing?
(I am working on a decent machine with 24 cores and 96 GB RAM. I am using MongoDB 2.2.3.)
Found my answer in this question: Order of $lt and $gt in MongoDB range query
My index is a multikey index (on tags) and I am running a range query (on post_time). Apparently, MongoDB cannot use both sides of the range as a filter in this case, so it just picks the $gte clause, which comes first. As my lower limit happens to be the lowest post_time value, MongoDB starts scanning all the objects.
Unfortunately, this is not the whole story. Trying to solve the problem, I created non-multikey indexes too but MongoDB insisted on using the bad one. That made me think that the problem was elsewhere. Finally, I had to drop the multikey index and create one without the tags field. Everything is fine now.
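For anyone hitting the same thing, a sketch of that fix in shell terms (the index names are whatever your shell generated):

// Drop the multikey index and build one without the array field
db.post.dropIndex("namespace_1_post_time_1_tags_1")
db.post.ensureIndex({ namespace: 1, post_time: 1 })

// Re-run explain() to confirm that both ends of the post_time range now appear in indexBounds
db.post.find({
    namespace: "my_namespace",
    post_time: { "$gte": ISODate("2013-04-09T00:00:00Z"), "$lt": ISODate("2013-04-09T01:00:00Z") }
}).explain()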

MongoDB OR condition indexing

I have an $or query which I'm currently using for a semi-large update. Essentially my collection is split into two data sets:
one main repository and one subset of the main repository. This is just to allow quicker searching on a small subset of the data.
However, I'm finding that the query I use to pull things into the subset is timing out, and when looking at the explain output it appears that two queries are actually happening.
PRIMARY> var date = new Date(2012,05,01);
PRIMARY> db.col.find(
{"$or":[
{"date":{"$gt":date}},
{"keywords":{"$in":["Help","Support"]}}
]}).explain();
This produces:
{
"clauses" : [
{
"cursor" : "BtreeCursor ldate_-1",
"nscanned" : 1493872,
"nscannedObjects" : 1493872,
"n" : 1493872,
"millis" : 1035194,
"nYields" : 3396,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"ldate" : [
[
ISODate("292278995-01--2147483647T07:12:56.808Z"),
ISODate("2012-06-01T07:00:00Z")
]
]
}
},
{
"cursor" : "BtreeCursor keywords_1 multi",
"nscanned" : 88526,
"nscannedObjects" : 88526,
"n" : 2515,
"millis" : 1071902,
"nYields" : 56,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"keywords" : [
[
"Help",
"Help"
],
[
"Support",
"Support"
]
]
}
}
],
"nscanned" : 1582398,
"nscannedObjects" : 1582398,
"n" : 1496387,
"millis" : 1071902
}
Is there something I can index better to make this faster? It just seems way too slow...
Thanks ahead of time!
An $or query will evaluate each clause separately and combine the results to remove duplicates, so if you want to optimize the query you should first try to explain() each clause individually.
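For example, splitting your query into its two clauses (same collection and date variable as above):

db.col.find({ "date": { "$gt": date } }).explain()
db.col.find({ "keywords": { "$in": ["Help", "Support"] } }).explain()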
It looks like part of the problem is that you are retrieving a large number of documents while actively writing to that collection, as evidenced by the high nYields (3396). It would be worth reviewing mongostat output while the query is running to consider other factors such as page faulting, lock %, and read/write queues.
If you want to make this query faster for a large number of documents and very active collection updates, two best practice approaches to consider are:
1) Pre-aggregation
Essentially this means updating aggregate stats as documents are inserted/updated so that you can run fast real-time queries. The MongoDB manual describes this use case in more detail: Pre-Aggregated Reports.
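A minimal sketch of that pattern, assuming a hypothetical daily_stats collection and illustrative counter names: every write to the main collection also bumps counters with an upserted $inc, and reads hit the small pre-aggregated document instead of running the $or query.

db.daily_stats.update(
    { _id: "2012-06-01" },                                 // one document per day; the key format is up to you
    { "$inc": { total: 1, keyword_help_or_support: 1 } },  // counters maintained at write time
    { upsert: true }
)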
2) Incremental Map/Reduce
An incremental Map/Reduce approach can be used to calculate aggregate stats in successive batches (for example, from an hourly or daily cron job). With this approach you perform a Map/Reduce using the reduce output option to save results to a new collection, and include a query filter that only selects documents that have been created/updated since the last time this Map/Reduce job was run.
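A sketch of that incremental pattern, with illustrative field names and an assumed lastRunDate value saved from the previous run:

var mapFn = function () {
    (this.keywords || []).forEach(function (k) { emit(k, 1); });
};
var reduceFn = function (key, values) { return Array.sum(values); };

db.col.mapReduce(mapFn, reduceFn, {
    out: { reduce: "keyword_counts" },       // merge new counts into the existing output collection
    query: { date: { "$gt": lastRunDate } }  // only documents added/updated since the last run
})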
I think you should create a compound index on both date and keywords. Refer to the post below for more specifics based on your use case:
how to structure a compound index in mongodb

Is there a way to force mongodb to store certain index in ram?

I have a collection with a relatively big index (but smaller than the available RAM). Looking at the performance of find() on this collection and at the amount of free RAM reported by htop, it seems that Mongo is not keeping the full index in RAM. Is there a way to force Mongo to keep this particular index in RAM?
Example query:
> db.barrels.find({"tags":{"$all": ["avi"]}}).explain()
{
"cursor" : "BtreeCursor tags_1",
"nscanned" : 300393,
"nscannedObjects" : 300393,
"n" : 300393,
"millis" : 55299,
"indexBounds" : {
"tags" : [
[
"avi",
"avi"
]
]
}
}
Not all objects are tagged with the "avi" tag:
> db.barrels.find().explain()
{
"cursor" : "BasicCursor",
"nscanned" : 823299,
"nscannedObjects" : 823299,
"n" : 823299,
"millis" : 46270,
"indexBounds" : {
}
}
Without "$all":
db.barrels.find({"tags": ["avi"]}).explain()
{
"cursor" : "BtreeCursor tags_1 multi",
"nscanned" : 300393,
"nscannedObjects" : 300393,
"n" : 0,
"millis" : 43440,
"indexBounds" : {
"tags" : [
[
"avi",
"avi"
],
[
[
"avi"
],
[
"avi"
]
]
]
}
}
Also, this happens when I search for two or more tags (it scans every item as if there were no index):
> db.barrels.find({"tags":{"$all": ["avi","mp3"]}}).explain()
{
"cursor" : "BtreeCursor tags_1",
"nscanned" : 300393,
"nscannedObjects" : 300393,
"n" : 6427,
"millis" : 53774,
"indexBounds" : {
"tags" : [
[
"avi",
"avi"
]
]
}
}
No. MongoDB allows the system to manage what is stored in RAM.
With that said, you should be able to keep the index in RAM by running queries against the indexes (check out query hinting) periodically to keep them from getting stale.
Useful References:
Checking Server Memory Usage
Indexing Advice and FAQ
Additionally, Kristina Chodorow provides this excellent answer regarding the relationship between MongoDB Indexes and RAM
UPDATE:
After the update providing the .explain() output, I see the following:
The query is hitting the index.
nscanned is the number of items (docs or index entries) examined.
nscannedObjects is the number of docs scanned
n is the number of docs that match the specified criteria
Your query scanned 300393 index entries and all 300393 of them matched, i.e. n equals nscanned.
I may be reading this wrong, but what that tells me is that every index entry the scan touched was a valid result: roughly 300393 of your 823299 documents carry the "avi" tag. The other thing this means is that this index is not very selective for this query; indexes provide the most value when they work to narrow the resultant field as much as possible.
From MongoDB's "Indexing Advice and FAQ" page:
Understanding explain's output. There are three main fields to look for when examining the explain command's output:
cursor: the value for cursor can be either BasicCursor or BtreeCursor. The second of these indicates that the given query is using an index.
nscanned: the number of documents scanned.
n: the number of documents returned by the query. You want the value of n to be close to the value of nscanned. What you want to avoid is doing a collection scan, that is, where every document in the collection is accessed. This is the case when nscanned is equal to the number of documents in the collection.
millis: the number of milliseconds required to complete the query. This value is useful for comparing indexing strategies, indexed vs. non-indexed queries, etc.
Is there a way to force mongo to store this particular index in the ram?
Sure, you can walk the index with an index-only query. That will force MongoDB to load every block of the index. But it has to be "index-only", otherwise you will also load all of the associated documents.
The only benefit this will provide is to make some potential future queries faster if those parts of the index are required.
However, if there are parts of the index that are not being accessed by the queries already running, why change this?
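For completeness, a rough sketch of "walking" the index described above, in shell terms: project only the indexed field and exclude _id so that the query can, in principle, be satisfied from the index alone, and hint the index you want to warm. Note that a multikey index (like tags_1 here, since tags is an array) cannot cover a query, so check indexOnly in the output to see whether documents are being pulled in as well.

db.barrels.find({}, { _id: 0, tags: 1 }).hint({ tags: 1 }).explain()
// If "indexOnly" is false, the associated documents are also being loaded,
// which defeats the purpose of warming only the index.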

What can cause the same query on identical collections which are indexed the same way to return different results?

My teammates and I are using MongoDB 1.8.2. Two out of three environments work fine when running the following geospatially indexed query:
db.runCommand( {geoNear: "places", near: [-46.65069190000003, -23.5633661],
maxDistance: 0.0006278449223041908, spherical: true,
distanceMultiplier: 6371.0 });
against the following collection of 4032 documents:
{
"ns" : "places",
"count" : 4032,
"size" : 1645724,
"avgObjSize" : 408.1656746031746,
"storageSize" : 2785280,
"numExtents" : 4,
"nindexes" : 2,
"lastExtentSize" : 2097152,
"paddingFactor" : 1,
"flags" : 1,
"totalIndexSize" : 344064,
"indexSizes" : {
"_id_" : 180224,
"location_2d" : 163840
},
"ok" : 1
}
Running it on two distinct MongoDB instances (one OS X Lion, the other an Ubuntu 11.04 server), the result set contains 100 records with the following execution stats:
"stats" : {
"time" : 0,
"btreelocs" : 522,
"nscanned" : 522,
"objectsLoaded" : 146,
"avgDistance" : 0.4824636947838318,
"maxDistance" : 0.00012637762666867466
},
(OK so far: the query is using the index, as you can see from the number of btree node visits.)
BUT in one of the environments (another OS X Lion) the results are drastically different (3 results instead of the 100 from the other machines) with exactly the same dataset and indexes:
"stats" : {
"time" : 0,
"btreelocs" : 45,
"nscanned" : 50,
"objectsLoaded" : 6,
"avgDistance" : 0.865580980499049,
"maxDistance" : 0.0001845858750423995
},
It's noticeable that the query is running differently on this mongod instance. What I'd like to know is which factors can make this happen.
What I have tried so far:
The MongoDB server and client versions are the same (including the git hash).
The supposedly weird database has been wiped out, restored from a BSON dump, and the indexes recreated.
Version info:
db version v1.8.2, pdfile version 4.5
Tue Aug 23 23:33:22 git version: 433bbaa14aaba6860da15bd4de8edf600f56501b
So I'm actually wondering about the data integrity here. The "bad" data set basically did about one tenth of the work of the good data set. Almost like it just decided to stop part-way through and not tell you what was going on.
MongoDB has a validate command that can double-check the integrity of a collection. Would you be able to run that and see if anything comes up?
Link to command here.
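In shell terms, the check suggested above would be along these lines:

db.places.validate()
// on newer versions, for a more thorough (and slower) check:
db.places.validate(true)
db.runCommand({ validate: "places", full: true })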
If the number returned is exactly 100, then I suspect your result sets differ because there are more than 100 points within the specified distance, and the DB is then free to pick just the first 100 points that meet the criteria.
Is there a way to increase the result size to something like 2000, or to reduce the distance to something smaller?
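A hedged sketch of the first suggestion: the geoNear command takes a num option (the default limit is 100), so raising it shows whether the two environments are simply returning different members of a larger candidate set:

db.runCommand({ geoNear: "places", near: [-46.65069190000003, -23.5633661],
                num: 2000,
                maxDistance: 0.0006278449223041908, spherical: true,
                distanceMultiplier: 6371.0 })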