My system config: OS X, 8 GB RAM, 2.5 GHz i5.
Both database tables have 1 million rows with the same data. I am executing the same aggregate query against both databases.
db.temp.aggregate([
{ "$match": { ITEMTYPE: 'like' } },
{ "$group" : {_id :{ cust_id2: "$ActorID", cust_id: "$ITEMTYPE"}, numberofActorID : {"$sum" : 1}}},
{ "$sort": { numberofActorID: -1 } },
{ "$limit" : 5 }
]);
I created a covering index:
db.temp.ensureIndex( { "ITEMTYPE": 1, "ActorID": 1 } );
and the selectivity of "like" is 80%.
The timing results are:
             without index    with index
SQL          958              644
MongoDB      3043             4243
I did not tune any MongoDB system parameters (and I am not using sharding).
Please suggest why MongoDB is slower and how I can improve this.
{
"stages" : [
{
"$cursor" : {
"query" : {
"ITEMTYPE" : "like"
},
"fields" : {
"ActorID" : 1,
"ITEMTYPE" : 1,
"_id" : 0
},
"plan" : {
"cursor" : "BtreeCursor ",
"isMultiKey" : false,
"scanAndOrder" : false,
"indexBounds" : {
"ITEMTYPE" : [
[
"like",
"like"
]
],
"ActorID" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"allPlans" : [
{
"cursor" : "BtreeCursor ",
"isMultiKey" : false,
"scanAndOrder" : false,
"indexBounds" : {
"ITEMTYPE" : [
[
"like",
"like"
]
],
"ActorID" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
}
]
}
}
},
{
"$group" : {
"_id" : {
"cust_id2" : "$ActorID",
"cust_id" : "$ITEMTYPE"
},
"numberofActorID" : {
"$sum" : {
"$const" : 1
}
}
}
},
{
"$sort" : {
"sortKey" : {
"numberofActorID" : -1
},
"limit" : NumberLong(5)
}
}
],
"ok" : 1
}
Structure of the JSON documents:
{ "_id" : ObjectId("5492ba51ff16cd9391a2c02d"), "POSTDBID" : 231041, "ITEMID" : 231041, "ITEMTYPE" : "post", "ITEMCREATIONDATE" : ISODate("2009-02-28T20:37:02Z"), "POSVal" : 0.327282, "NEGVal" : 0.315738, "NEUVal" : 0.356981, "LabelSentiment" : "Neutral", "ActorID" : NumberLong(1179444542), "QuarterLabel" : "2009-1\r", "rowid" : 2 }
Note: Some of the things I mention are simplified for the sake of this answer. However, to the best of my knowledge, they can be applied as described.
Misconceptions
First of all: aggregations can't utilize covered queries:
Even when the pipeline uses an index, aggregation still requires access to the actual documents; i.e. indexes cannot fully cover an aggregation pipeline.
(see the Aggregation documentation for details.)
Second: Aggregations are not meant to be used as real time queries
The aggregation pipeline provides an alternative to map-reduce and may be the preferred solution for aggregation tasks where the complexity of map-reduce may be unwarranted.
You would not want to use map/reduce for real time processing, would you? ;) While sometimes aggregations can be so fast that they can be used as real time queries, it is not the intended purpose. Aggregations are meant for precalculation of statistics, if you will.
Improvements on the aggregation
You might want to use a $project stage right after the $match to reduce the data passed into the $group stage to only the fields that are processed there:
{ $project: { 'ActorID':1, 'ITEMTYPE':1 } }
This might improve the processing.
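Putting that together with the pipeline from the question, a sketch would look like this (the projection simply mirrors the fields the $group stage consumes):
db.temp.aggregate([
    { "$match": { ITEMTYPE: "like" } },
    // keep only the two fields the $group stage actually uses
    { "$project": { _id: 0, ActorID: 1, ITEMTYPE: 1 } },
    { "$group": { _id: { cust_id2: "$ActorID", cust_id: "$ITEMTYPE" }, numberofActorID: { "$sum": 1 } } },
    { "$sort": { numberofActorID: -1 } },
    { "$limit": 5 }
]);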
Hardware impact
As for your description, I assume you use some sort of MacBook. OS X and the programs you have running require quite some RAM. MongoDB, on the other hand, tries to keep as much of its indexes and the so-called working set (the most recently accessed documents, to keep it simple) in RAM. It is designed that way; it is supposed to run on one or more dedicated instances. You might want to use MMS to check whether you have a high number of page faults, which I would expect. MySQL is much more conservative and less dependent on free RAM, although it will be outperformed by MongoDB when a certain amount of resources is available (conceptually, because the two DBMS are very hard to compare fairly), simply because MySQL is not optimized for situations where a lot of RAM is available. We have not even touched on resource competition between the various processes here, which is a known performance killer for MongoDB, too.
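If you do not use MMS, you can get a rough picture directly from the shell; take this as a sketch, since the exact fields exposed by serverStatus() vary with platform and storage engine:
// page faults since the server started (MMAPv1-era servers; extra_info
// may not be fully populated on every platform)
db.serverStatus().extra_info.page_faults
// resident vs. mapped memory hints at whether the working set fits in RAM
db.serverStatus().mem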
Second, in case you have a spinning disk: MongoDB has, for various reasons, sub-par read performance on spinning disks, the main problem being seek latency. The disks in MacBooks usually run at 5400 rpm, which further increases seek latency, worsening the problem and making it a real pain in the neck for aggregations, which, as shown, access a lot of documents. The way the MongoDB storage engine works, two documents that follow each other in an index may well be saved at two entirely different locations, even in different data files. (This is because MongoDB is heavily write-optimized, so documents are written at the first position providing enough space for the document and its padding.) So depending on the number of documents in your collection, you can have a lot of disk seeks.
MySQL, on the other hand, is rather read-optimized.
Data modelling
You did not show us your data model, but sometimes small changes in the model have a huge impact on performance. I'd suggest doing a peer review of the data model.
Conclusion
You are comparing two DBMS that are designed and optimized for diametrically opposed use cases, in an environment that is pretty much the opposite of the one MongoDB was designed for, in a use case it wasn't optimized for, and you expect real-time results from a tool that isn't made for that. Those might be the reasons why MongoDB is outperformed by MySQL here. Side note: you didn't show us the corresponding (My)SQL query.
Related
A sample document looks like this:
{
"_id" : ObjectId("62317ae9d007af22f984c0b5"),
"productCategoryName" : "Product category 1",
"productCategoryDescription" : "Description about product category 1",
"productCategoryIcon" : "abcd.svg",
"status" : true,
"productCategoryUnits" : [
{
"unitId" : ObjectId("61fa5c1273a4aae8d89e13c9"),
"unitName" : "kilogram",
"unitSymbol" : "kg",
"_id" : ObjectId("622715a33c8239255df084e4")
}
],
"productCategorySizes" : [
{
"unitId" : ObjectId("61fa5c1273a4aae8d89e13c9"),
"unitName" : "kilogram",
"unitSize" : 10,
"unitSymbol" : "kg",
"_id" : ObjectId("622715a33c8239255df084e3")
}
],
"attributes" : [
{
"attributeId" : ObjectId("62136ed38a35a8b4e195ccf4"),
"attributeName" : "Country of Origin",
"attributeOptions" : [],
"isRequired" : true,
"_id" : ObjectId("622715ba3c8239255df084f8")
}
]
}
This collection is indexed on "_id". Without the sub-documents the execution time is reduced, but all document fields are required.
db.getCollection('product_categories').find({})
The collection contains 30,000 records and this query takes more than 30 seconds to execute. How can I solve this issue? Can anybody suggest a better solution? Thanks.
Indexing, including compound indexes, lets MongoDB serve a query from the index instead of scanning every document each time you query. 30,000 documents is nothing for MongoDB; it can handle millions in a second. If these fields are populated in the process, that is another heavy operation added to the query.
Check whether your schema is efficiently structured and whether you are throttling your connection to the server. Another thing to consider is projecting only the fields that you require, for example using the aggregation pipeline.
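For example, a pipeline that returns only a few top-level fields instead of the whole documents might look like this (the filter and field list are illustrative; pick whatever your view actually needs):
db.getCollection('product_categories').aggregate([
    { $match: { status: true } },
    { $project: { productCategoryName: 1, productCategoryIcon: 1, status: 1 } }
])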
Although the question is not very clear, you can follow this article for some best practices.
I have a single standalone mongo installation on a Linux machine.
The database contains a collection with 181 million documents. This collection is by far the largest collection in the database (approx 90%)
The size of the collection is currently 3.5 TB.
I'm running Mongo version 4.0.10 (Wired Tiger)
The collection has two indexes:
One on _id
One on two fields, which is used when deleting documents (see the snippet below).
When benchmarking bulk deletion on this collection, we used the following snippet:
db.getCollection('Image').deleteMany(
{$and: [
{"CameraId" : 1},
{"SequenceNumber" : { $lt: 153000000 }}]})
To see the state of the deletion operation I ran a simple test of deleting 1000 documents while looking at the operation using currentOp(). It shows the following.
"command" : {
"q" : {
"$and" : [
{
"CameraId" : 1.0
},
{
"SequenceNumber" : {
"$lt" : 153040000.0
}
}
]
},
"limit" : 0
},
"planSummary" : "IXSCAN { CameraId: 1, SequenceNumber: 1 }",
"numYields" : 876,
"locks" : {
"Global" : "w",
"Database" : "w",
"Collection" : "w"
},
"waitingForLock" : false,
"lockStats" : {
"Global" : {
"acquireCount" : {
"r" : NumberLong(877),
"w" : NumberLong(877)
}
},
"Database" : {
"acquireCount" : {
"w" : NumberLong(877)
}
},
"Collection" : {
"acquireCount" : {
"w" : NumberLong(877)
}
}
}
It seems to be using the correct index, but the number and type of locks worry me. As I interpret this, it acquires one global lock for each deleted document from a single collection.
Using this approach, it has taken over a week to delete 40 million documents. This cannot be the expected performance.
I realise that other designs exist, such as bundling documents into larger chunks and storing them using GridFS, but the current design is what it is, and I want to make sure that what I see is expected before changing my design, restructuring the data, or even considering clustering, etc.
Any suggestions on how to increase the performance of bulk deletions, or is this expected?
I have a MongoDB collection with a lot of indexes.
Would it bring any benefits to delete indexes that are barely used?
Is there any way or tool which can tell me (in numbers) how often an index is used?
EDIT: I'm using version 2.6.4
EDIT2: I'm now using version 3.0.3
Right, so this is how I would do it.
First you need a list of all your indexes for a certain collection (this will be done collection by collection). Let's say we are monitoring the user collection to see which indexes are useless.
So I run db.user.getIndexes(), and this results in parsable JSON output (you can run this via command() from the client side as well, to integrate it with a script).
So you now have a list of your indexes. It is merely a case of understanding which queries use which indexes. If an index is not hit at all, you know it is useless.
Now you need to run every query with explain(); from that output you can judge which index is used and match it to an index returned by getIndexes().
So here is a sample output:
> db.user.find({religion:1}).explain()
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "meetapp.user",
"indexFilterSet" : false,
"parsedQuery" : {
"religion" : {
"$eq" : 1
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"religion" : NumberLong(1)
},
"indexName" : "religion_1",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"religion" : [
"[1.0, 1.0]"
]
}
}
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "ip-172-30-0-35",
"port" : 27017,
"version" : "3.0.0",
"gitVersion" : "a841fd6394365954886924a35076691b4d149168"
},
"ok" : 1
}
There is a set of rules that the queryPlanner field will use; you will need to discover them and write code around them, but this first one is simple enough.
As you can see, the winning plan (in winningPlan) is a single IXSCAN (index scan) (remember there could be multiple stages; this is something you will need to code around), and the key pattern for the index used is:
"keyPattern" : {
"religion" : NumberLong(1)
},
Great, now we can match that against the key output of getIndexes():
{
"v" : 1,
"key" : {
"religion" : NumberLong(1)
},
"name" : "religion_1",
"ns" : "meetapp.user"
},
which tells us that the religion index is not useless and is in fact used.
Unfortunately this is the best way I can see. It used to be that MongoDB had an index stat for number of times the index was hit but it seems that data has been removed.
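For what it's worth, MongoDB 3.2 and later added per-index usage counters back via the $indexStats aggregation stage; it does not exist on your 3.0.3 server, but after an upgrade it would look like this:
// returns one document per index, with accesses.ops counting the operations
// that used the index since the counter was last reset
db.user.aggregate([ { $indexStats: {} } ])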
So you would just rinse and repeat this process for every collection you have until you have removed the indexes that are useless.
One other way of doing this, of course, is to remove all indexes and then re-add indexes as you test your queries. Though that might be bad if you do need to do this in production.
On a side note: the best way to fix this problem is to not have it at all.
I make this easier for me by using an indexing function within my active record. Once every so often I run (from PHP) something of the sort: ./yii index/rebuild, which essentially goes through my active record models, detects which indexes I no longer use and have removed from my app, and removes them in turn. It will, of course, create new indexes.
I have a collection which is going to hold machine data as well as mobile data. The data is captured per channel and is maintained at a single level with no embedded objects. The structure is as follows:
{
"Id": ObjectId("544e4b0ae4b039d388a2ae3a"),
"DeviceTypeId":"DeviceType1",
"DeviceTypeParentId":"Parent1",
"DeviceId":"D1",
"ChannelName": "Login",
"Timestamp": ISODate("2013-07-23T19:44:09Z"),
"Country": "India",
"Region": "Maharashtra",
"City": "Nasik",
"Latitude": 13.22,
"Longitude": 56.32,
//and more 10 - 15 fields
}
Most of the queries are aggregation queries, used for an analytics dashboard and real-time analysis. The $match pipeline stage is as follows:
{$match:{"DeviceTypeId":{"$in":["DeviceType1"]},"Timestamp":{"$gte":ISODate("2013-07-23T00:00:00Z"),"$lt":ISODate("2013-08-23T00:00:00Z")}}}
or
{$match:{"DeviceTypeParentId":{"$in":["Parent1"]},"Timestamp":{"$gte":ISODate("2013-07-23T00:00:00Z"),"$lt":ISODate("2013-08-23T00:00:00Z")}}}
and many of my DAL-layer find and findOne queries mostly filter on DeviceType or DeviceTypeParentId.
The collection is huge and growing. I have used compound indexes to support these queries; the indexes are as follows:
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "DB.channel_data"
},
{
"v" : 1,
"key" : {
"DeviceType" : 1,
"Timestamp" : 1
},
"name" : "DeviceType_1_Timestamp_1",
"ns" : "DB.channel_data"
},
{
"v" : 1,
"key" : {
"DeviceTypeParentId" : 1,
"Timestamp" : 1
},
"name" : "DeviceTypeParentId_1_Timestamp_1",
"ns" : "DB.channel_data"
}
]
Now we are going to add support for match criteria on DeviceId, and following the same strategy as for DeviceType and DeviceTypeParentId does not feel good: with my current approach I am creating many indexes that are almost all the same, and huge.
So is there any good way to do the indexing? I have read a bit about index intersection, but I am not sure how it would be helpful.
If I am following any wrong approach, please point it out, as this is my first project and the first time I am using MongoDB.
Those indexes all look appropriate for your queries, including the new one you're proposing. Three separate indexes supporting your three kinds of queries are the overall best option in terms of fast queries. You could put indexes on each field and let the planner use index intersection, but it won't be as good as the compound indexes. The indexes are not the same since they support different queries.
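For the new DeviceId criterion that would mean a third compound index with the same shape as the existing two, roughly (a sketch; use ensureIndex on older shells):
// supports { DeviceId: ..., Timestamp: { $gte: ..., $lt: ... } } queries,
// mirroring the existing DeviceType and DeviceTypeParentId indexes
db.channel_data.createIndex({ "DeviceId": 1, "Timestamp": 1 })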
I think the real question is: is the (apparently) large memory footprint of the indexes actually a problem at this point? Do you have a lot of page faults because indexes and data are being paged in and out from disk?
I am doing many upserts to a collection that also receives a lot of find queries. My upserts use the unacknowledged write concern. Many of these upserts appear in the mongo log with runtimes above 800 ms and more than 20 yields. The number of inprog operations on the server seems stable around 20, with peaks around 40.
The collection contains ~15 million documents.
Do these long query times indicate that the mongo server cannot keep up with the incoming data, or is it just postponing the unacknowledged queries in a controlled manner?
The documents in the collection look like this:
{
"_id" : ObjectId("53c65f9f995bce51e4d84ecb"),
"items" : [
"53216cf7e4b04d3fa854a4d0",
"53218be4e4b0a79ba7fee19a"
],
"score" : 1,
"other" : [
"b09b2c99-e4f3-48a2-990d-4b2090cc9666",
"b09b2c99-e4f3-48a2-990d-4b2090cc9666"
]
}
I have the following indexes
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "dbname.stuff"
},
{
"v" : 1,
"key" : {
"items" : 1,
"score" : -1
},
"name" : "items_1_score_-1",
"ns" : "dbname.stuff",
"background" : true
}
]
The slow upserts look like this in the log
update dbname.stuff query: { items: [ "52ea4da1e4b035b15423f8f5", "53c7cf43e4b007135ca60114" ] } update: { $inc: { score: 6 }, $setOnInsert: { others: [ "64a7e6b1-2a0a-4374-ac9c-fbf2de7cbb48", "b9e07cda-14c8-45e4-95cc-f0f4c5bc410c" ] } } nscanned:0 nscannedObjects:0 nMatched:1 nModified:0 fastmodinsert:1 upsert:1 keyUpdates:0 numYields:16 locks(micros) w:46899 1752ms
Acknowledgement on write, or "write concern", does not affect overall query performance, only the time the client may spend waiting for the acknowledgement. So if you are more or less in "fire and forget" mode, your client is not held up, but the write operations themselves can still take a while.
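For illustration, the only difference is what the client waits for; here is a sketch using the values from the slow upsert in the question (the collection name is taken from the index namespace above):
// Unacknowledged ("fire and forget"): the client returns immediately.
// With { w: 1 } the client would block until the server has applied the
// write, but the server does the same amount of work either way.
db.stuff.update(
    { items: [ "52ea4da1e4b035b15423f8f5", "53c7cf43e4b007135ca60114" ] },
    { $inc: { score: 6 } },
    { upsert: true, writeConcern: { w: 0 } }
);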
In this case, it seems your working set is actually quite large. It is also worth considering that the working set is "server wide" and not just constrained to one collection or even one database. The general situation here is that you do not have enough RAM for what you are trying to load, and you are running into paging. See the "yields" counter.
Upserts need to "find" the matching document via the index, so even if there is no match you still need to "scan" and find out whether the item exists. This means loading the index into memory. As such you have a few choices:
Remodel to make these writes "insert only", and aggregate the "counter" type values in background processes, basically not in real time (see the sketch at the end of this answer).
Add more RAM within your means.
Time to shard so you can have several "shards" in your cluster that have a capable amount of RAM to deal with the working set sizes.
Nothing is easy here, and depending on what your application actually requires, each of these offers a different level of solution. If you are prepared to live without "write acknowledgements" in general, then the rest of your application may also need to live with "eventual consistency" of those writes actually being available to be read.
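A minimal sketch of the first option, assuming a hypothetical score_events collection and a periodic background job (the names are illustrative, not from the question):
// 1) On the write path: append one small document per event, insert only.
db.score_events.insert({
    items: [ "52ea4da1e4b035b15423f8f5", "53c7cf43e4b007135ca60114" ],
    score: 6,
    createdAt: new Date()
});
// 2) In a background job: roll the events up into the main collection
//    periodically instead of upserting on every write, then remove (or
//    mark) the processed events.
db.score_events.aggregate([
    { $group: { _id: "$items", total: { $sum: "$score" } } }
]).forEach(function (doc) {
    db.stuff.update(
        { items: doc._id },
        { $inc: { score: doc.total } },
        { upsert: true }
    );
});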