I have deployed MongoDB 64-bit 2.x on an AWS m1.large instance.
I am trying to find the best performance Mongo can give us on AWS, in light of http://www.snailinaturtleneck.com/blog/tag/mongodb/ (and "mongodb read/write performance" and "mongo hosting in the cloud").
I created one database with one collection, user, and inserted 100,000 records/JSON objects (each JSON object is 4 KB in size), using a random number as a suffix to "user-". I also created an index on the user id.
Further, I set the DB profiler to log slow queries taking 20 ms or more, and ran a Java program with 10 threads. Each thread generates a user id with a random number and looks it up in the user collection in an infinite loop. Under this load I observed query/read latency of up to 60 ms.
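A minimal shell sketch of the setup described above (the field name uid and the exact commands are my assumptions, not taken from the original program):
db.setProfilingLevel(1, 20)          // log operations slower than 20 ms
db.user.ensureIndex({ uid: 1 })      // assumed name of the user-id field being looked up
db.user.find({ uid: "user-123456" }) // the lookup each Java thread runs in a loop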
I also observed that when I run fewer threads, say 3 or 4 (a query load of about 5K finds per second on the user collection), I see no latency, or latency of less than 2 ms.
I fail to understand why increasing the load of user lookups on the collection causes latency. I believe MongoDB can handle far more concurrent reads than what I am attempting, and that this should not impact performance as such.
One possibility I can think of is that Mongo has performance issues when a large number of queries hits a single collection, as in our case; I expect 10K to 20K queries per second on a single collection.
We would appreciate your thoughts/suggestions.
Some information is missing - what is your disk configuration? EBS may contribute to the latency if everything is persisted to disk.
Amazon has released a white paper with best practices on how to install Mongo on EC2: MongoDB on AWS. Here's its description:
This whitepaper provides an overview of general best practices that apply to all major NoSQL systems, highlights one of the popular NoSQL systems - MongoDB - and discusses how best to run it on the AWS cloud. It further examines different MongoDB configurations so you can optimize it for performance, durability, and security.
We have the following environment:
3 servers + 3 replica set servers
Each server has 3 shards
Our main collection has around 40,000,000 documents that average around 6 KB each
Our shard key is hashed(_id) - _id being a pure BsonID
We peak daily at around 25,000 IOPS and bottom out at around 10,000
We want to run an ETL that loads all of the documents in the main collection, does some in-memory calculation (in our application tier), and then dumps the results into an external DB.
We took a very poor and naive approach and simply loaded documents without a query, using limit, skip and batchSize - which was a complete failure (it severely hurt our service level, even though we set the readPreference to secondary):
db.Collection.find().skip(i * 5000).limit(5000).readPref("secondary")
where i is the current iteration, run on multiple threads to speed up the process.
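For comparison, a range-based variation that is often suggested instead of deep skips might look roughly like the following shell sketch (only a sketch: it assumes iterating in _id order is acceptable and it ignores the multi-threading):
var lastId = null;
var batch;
do {
    var filter = (lastId === null) ? {} : { _id: { $gt: lastId } };
    batch = db.Collection.find(filter).sort({ _id: 1 }).limit(5000).readPref("secondary").toArray();
    if (batch.length > 0) {
        lastId = batch[batch.length - 1]._id;
        // hand the batch to the in-memory calculation / external load step here
    }
} while (batch.length > 0);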
I was wondering what the best approach would be to load all of our documents without hurting the performance of our database.
The data can be a bit stale (a few seconds of delay from the actual data on the primary is fine).
I've posted this question on Database Administrators but it doesn't seem to be attracting many answers, so I'm posting it here as well. Sorry if that's against the forum rules.
Thanks!
I have a feed system using fan-out on write. I keep a list of feed ids in a Redis sorted set and save the feed content in MongoDB, so every time I read 30 feeds I have to run 30 queries against MongoDB. Is there any way to improve this?
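One small illustration (just a sketch; the collection name feeds and the string ids are hypothetical) of batching those 30 lookups into a single query with $in:
// ids read back from the Redis sorted set, shown here as hypothetical strings
var feedIds = ["feed:1001", "feed:1002", "feed:1003"];
// one round trip instead of 30 individual queries
db.feeds.find({ _id: { $in: feedIds } })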
It depends on your database setup. MongoDB has extensive documentation on how to increase simultaneous reads and writes: MongoDB concurrency.
If you need many writes with low latency, start using sharding: Deployment Sharding.
If you need to increase the number of reads, deploy each shard as a replica set and route your read queries to the secondary nodes: Read Preferences.
Also, each query should be covered by an index (Better Indexing). You can check your query time by simply appending explain() to a find; it will show you the timing and other details:
db.collection.find({a:"abcd"}).explain()
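A minimal sketch of the indexing and read-preference suggestions above (the collection and field names are placeholders, and ensureIndex is the older name for createIndex):
db.collection.ensureIndex({ a: 1 })                    // index that supports the query
db.collection.find({ a: "abcd" }, { a: 1, _id: 0 })    // covered query: the projection returns only indexed fields
db.getMongo().setReadPref("secondary")                 // route reads from this shell connection to a secondary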
Make sure you have enough RAM so that your data set fits in RAM, or at least so that your indexes fit in RAM, because each fetch from disk is roughly 10 times slower than from RAM.
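For example, you can check from the shell how much space a collection's indexes take, which is what should fit in RAM:
db.collection.totalIndexSize()   // total size in bytes of all indexes on the collection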
Check your server status by running mongostat; it measures database performance and reports page faults, locks, query operations and many other details.
Also measure your hardware performance with a program like iostat and make sure the I/O wait is low, less than 1%.
A few good links on MongoDB deployment and performance tuning:
1. Production deployment of MongoDB
2. Performance tuning of MongoDB by 10gen
3. Using Redis in front of MongoDB to cache queries and result objects
4. Example of Redis and Mongo
I have a ~3GB mongo database with several dozen collections. Three of these collections handle ~300 queries per second, while the rest sustain a much lower volume. I expect the traffic to continue to grow quickly.
I'd like to set up a replica set to handle the high-traffic collections. It isn't necessary for this new instance to replicate the rest of the database. Is this possible?
This doesn't seem possible at the moment with MongoDB's built-in features; the only way to do it is to come up with your own manual replication algorithm or to use tools written by third parties.
The https://github.com/wordnik/wordnik-oss project might help you achieve this, according to the following post:
https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/Ap9V4ArGuFo
This post describes a workaround for filtering which documents get replicated:
Replicate only documents where {'public':true} in MongoDB
Or just replicate the data yourself manually, which might be worth trying.
Good luck.
No, that isn't possible at the moment. What you could do is move those collections into another, unreplicated database. But this will cause headaches once these collections see higher traffic too, because you would then need to move them back into your replicated database.
In general, though, replication isn't the way to go if you need to scale; it's intended more for DR/failover. Replica set secondaries can only (optionally) answer read queries, not write queries - something you should keep in mind. So if you have a high write load this may not cure your problem.
Once you allow your application to read from secondaries you have to live with eventual consistency, meaning your application is not guaranteed to always see the latest data. This is caused by the asynchronous replication to the secondaries.
You can of course address this by configuring your write concern so that a write must succeed on all replicas before it is considered written and your driver returns. But this may slow down your write operations significantly.
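A sketch of that in the shell, assuming a three-member replica set and a hypothetical users collection (this uses the newer option-style insert syntax):
// wait until the write has replicated to all 3 members before returning
db.users.insert({ name: "example" }, { writeConcern: { w: 3, wtimeout: 5000 } })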
So for scaling query execution capacity I would go with sharding. Sharding is configured on a per-collection level; all unsharded collections remain on a default shard.
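A sketch of sharding a single collection (database name, collection name and shard key here are placeholders):
sh.enableSharding("mydb")
// the shard key must match your query patterns so queries can be routed
sh.shardCollection("mydb.users", { userId: 1 })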
It's not possible, but if the data size is that small and these collections aren't updated, then the only overhead of having them replicated is the small amount of storage on the secondary. That is a relatively small price to pay compared with writing your own replication logic, especially since the collections won't grow in size.
Alternatively, archive the data: keep only the latest data set on the production server and archive the rest on the new server.
I currently run my website on a single server with MongoDB. On my server I have two components: (1) a crawler that runs hourly and appends data to my MongoDB instance, and (2) a website that reads from the crawler index and also writes to a user personalization DB. I am moving to Amazon EC2 for auto-scaling, so that I can increase the number of web servers as traffic increases. I don't need auto-scaling for my crawler. This poses a challenge for how I use MongoDB. I'm wondering what my best option is to optimize for:
Minimal changes to my code (the code is in perl)
Ability to seamlessly add/remove web-servers without worry about losing data in the DB
Low cost
In the short term, the DB will certainly be able to fit in memory across all machines, since it will be under 2 GB. The user personalization DB can't be rebuilt, so it's more important to protect it, while the index can easily be rebuilt. The current MongoDB crawl index has about 100k entries keyed on ~15 different columns. It is built for speed, as I am working on an online dating site (that is searchable in many ways).
I can think of a few options:
Use SimpleDB for the user personalization store and MongoDB for the index. Have the index replicated across all machines; however, I don't know much about MongoDB replication.
Move everything to SimpleDB
Move everything to DynamoDB
I don't know much about SimpleDB or DynamoDB. Based on articles it seems like DynamoDB would be a natural choice, but I'm not sure about good Perl support, whether I can have all my columns, indexes, etc. Does anyone have experience or advice?
You could host Mongo on a single EC2 server that each of the boxes in the web farm connects to. You can then easily spin up another web instance that uses the same DB box.
We currently have three Mongo servers, as we run a replica set, and when we get to the point where we need to scale Mongo horizontally we'll spin up some new instances and shard the larger collections.
I currently run my website on a single server with MongoDB.
First off, this is a big red flag. When running in production, it is always recommended to run a replica set with at least three full nodes.
Replication provides automatic redundancy and fail-over.
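For reference, a minimal sketch of initiating a three-member replica set from the shell (the set name and hostnames are placeholders):
rs.initiate({
    _id: "rs0",
    members: [
        { _id: 0, host: "mongo1.example.com:27017" },
        { _id: 1, host: "mongo2.example.com:27017" },
        { _id: 2, host: "mongo3.example.com:27017" }
    ]
})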
Ability to seamlessly add/remove web-servers without worry about losing data in the DB
MongoDB supports a concept called sharding. Sharding provides a way to scale horizontally by automatically partitioning data. The partitioning is done via a shard key.
If you plan to use sharding, please read that link very carefully and recognize the limitations. For MongoDB sharding you have to select the correct key that will allow queries to be evenly distributed across the shards.
The current MongoDB crawl index has about 100k entries that are keyed on ~15 different columns.
This is going to be a problem with sharding. Sharding can only scale queries that use the shard key. A query on the shard key can be routed directly to a single machine. A query on a secondary index goes to all machines.
You have 15 different indexes, so basically all of these queries will go to all shards. That will not "auto-scale" very well at all.
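To illustrate the difference (a sketch with a hypothetical collection profiles sharded on userId):
db.profiles.find({ userId: 12345 })    // contains the shard key: routed to a single shard
db.profiles.find({ city: "Boston" })   // secondary-index query: sent to every shard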
Beware that at the moment EC2 does not have 64-bit small instances, making replication potentially expensive. Because MongoDB memory-maps its files, a 32-bit OS is not advised.
I've had very bad experiences with SimpleDB and think it's fundamentally flawed, so I would avoid it.
There is a good white paper on how to set up MongoDB on Amazon EC2: http://d36cz9buwru1tt.cloudfront.net/AWS_NoSQL_MongoDB.pdf
I suspect setting up MongoDB on EC2 is the fastest solution, versus rewriting for or migrating to DynamoDB.
Best of luck!
I created a collection in MongoDB consisting of 11,446,615 documents.
Each document has the following form:
{
"_id" : ObjectId("4e03dec7c3c365f574820835"),
"httpReferer" : "http://www.somewebsite.pl/art.php?id=13321&b=1",
"words" : ["SEX", "DRUGS", "ROCKNROLL", "WHATEVER"],
"howMany" : 3
}
httpReferer: just a URL
words: words parsed from the URL above. The size of the list is between 15 and 90.
I am planning to use this database to obtain a list of webpages that have similar content.
I'll be querying this collection using the words field, so I created (or rather started creating) an index on this field:
db.my_coll.ensureIndex({words: 1})
Creating this collection takes a very long time. I tried two approaches (the tests below were done on my laptop):
1. Inserting and then indexing: inserting took 5.5 hours, mainly due to CPU-intensive preprocessing of the data; indexing took 30 hours.
2. Indexing before inserting: it would take a few days to insert all the data into the collection.
My main focus is to decrease the time it takes to generate the collection. I don't need replication (at least for now). Querying also doesn't have to be lightning-fast.
Now, time for a question:
I have only one machine with one disk where I can run my app. Does it make sense to run more than one instance of the database and split my data between them?
Yes, it does make sense to shard on a single server.
At this time, MongoDB still uses a global lock per mongod server. Running multiple servers frees each server from the others' locks.
If you run a multi-core machine with separate NUMA nodes, this can also increase performance.
If your load increases too much for your server, having sharded initially makes horizontal scaling easier in the future. You might as well do it now.
Machines vary. I suggest writing your own bulk-insertion benchmark program and spinning up various numbers of MongoDB shard servers. I have a 16-core RAIDed machine and I've found that 3-4 shards seem to be ideal for my write-heavy database. I'm finding that my two NUMA nodes are my bottleneck.
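A rough shell-based sketch of such a benchmark (the collection name, document shape and counts are placeholders; a real benchmark would use your own driver and data):
var start = new Date();
for (var b = 0; b < 100; b++) {
    var docs = [];
    for (var i = 0; i < 1000; i++) {
        docs.push({ n: b * 1000 + i, payload: new Array(101).join("x") });
    }
    db.benchtest.insert(docs);   // insert in batches of 1,000 documents
}
print("elapsed ms: " + (new Date() - start));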
Nowadays (2015), with MongoDB v3.0.x, there is collection-level locking with MMAP, which increases write throughput slightly (assuming you're writing to multiple collections), but if you use the WiredTiger engine there is document-level locking, which has much higher write throughput. This removes the need for sharding across a single machine. You can technically still increase the performance of mapReduce by sharding across a single machine, but in that case you'd be better off just using the aggregation framework, which can exploit multiple cores. If you rely heavily on map-reduce algorithms, it might make the most sense to use something like Hadoop.
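For example, a grouping that might otherwise be written as a mapReduce job (the collection and field names are hypothetical):
// count documents per category and sort by the count
db.events.aggregate([
    { $group: { _id: "$category", total: { $sum: 1 } } },
    { $sort: { total: -1 } }
])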
The only reason to shard MongoDB is to scale horizontally. So in the event that a single machine cannot provide enough disk space, memory, or (rarely) CPU power, sharding becomes beneficial. I think it's really seldom that someone has enough data that they need to shard, even at a large business, especially since WiredTiger added compression support that can reduce disk usage by over 80%. It's also infrequent that someone uses MongoDB to perform really CPU-heavy queries at large scale, because there are much better technologies for that. In most cases IO is the most important factor in performance; not many queries are CPU-intensive, unless you're running a lot of complex aggregations, and even geospatial data is indexed on insertion.
The most likely reason you'd need to shard is having a lot of indexes that consume a large amount of RAM; WiredTiger reduces this, but it's still the most common reason to shard. Sharding across a single machine, on the other hand, is likely just going to cause undesired overhead with very little or possibly no benefit.
This doesn't have to be a Mongo question; it's a general operating system question. There are three possible bottlenecks for your database:
network (i.e. you're on a gigabit line and using most of it at peak times, but your database isn't really loaded down)
CPU (your CPU is near 100% but disk and network are barely ticking over)
disk
In the case of network, rewrite your network protocol if possible; otherwise shard to other machines. In the case of CPU, if you're at 100% on a few cores but others are free, sharding on the same machine will improve performance. If the disk is fully utilized, add more disks and shard across them - way cheaper than adding more machines.
No, it does not make sense to shard on a single server.
There are a few exceptional cases, but they mostly come down to concurrency issues related to things like running map/reduce or JavaScript.
This is answered in the first paragraph of the Replica set tutorial
http://www.mongodb.org/display/DOCS/Replica+Set+Tutorial