I've been playing with MongoDB 3.0 rc7 and rc8 and discovered that mongostat no longer shows the lock rate column, whether I use the MMAPv1 or the WiredTiger engine. Similarly, in MMS the "lock %" chart is unavailable for 3.0 systems.
We've been using lock rate in our monitoring systems, and also as one of the measurements during performance tests (we run the same sets of heavy load tests via Gatling or Tsung to see whether recent optimizations in our usage of the DB have real impact, and to check that new features don't introduce regressions in this area).
Is there a way to find this value in MongoDB 3? For now we mainly want to run comparison tests on 2.6.7 and 3.0.0-rc8 to see what the difference is. While we of course get a nice set of data from the application performance standpoint, we'd also like to compare some DB stats, and lock rate was one of them. Or are we completely missing the point, and collection-level locks in v3 MMAPv1 or document-level locks in WiredTiger are now pointless to measure or compare? If so, how can we measure the DB's limit under heavy load (in versions up to 2.6.7 it was fairly easy: lock rate was usually the first thing to fire, and once it got above 70-80% we knew we had hit the upper limit), or test regressions/improvements in how we use the DB?
Many thanks
It's not pointless to compare some kind of lock statistics for mmapv1 and WiredTiger, but I think the situation right now is that it's unclear what you should be looking at in WiredTiger for comparison. The operation of the storage engine is very different from mmapv1. Presently, I think you'll want to look at other statistics like throughput, and you can expect more statistics and more guidance on using them in future versions of MongoDB with WiredTiger.
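One candidate for a lock-related number on 3.0 is the time operations spend waiting to acquire locks, which serverStatus reports in its `locks` section. The sketch below is only an illustration, not a replacement for the old lock % column: it assumes each resource in that section carries a `timeAcquiringMicros` document keyed by lock mode, and the two snapshots are made-up numbers rather than real server output.

```python
def lock_wait_ratio(before, after, elapsed_seconds):
    """Lock wait time between two snapshots, as a fraction of wall-clock time.

    `before` and `after` are the `locks` sub-documents of two serverStatus
    results (assumed shape: resource -> "timeAcquiringMicros" -> mode -> micros).
    The result can exceed 1.0 when several operations wait at the same time.
    """
    def total_wait_micros(locks):
        return sum(
            micros
            for resource in locks.values()
            for micros in resource.get("timeAcquiringMicros", {}).values()
        )

    waited = total_wait_micros(after) - total_wait_micros(before)
    return waited / (elapsed_seconds * 1_000_000)


# Made-up snapshots taken 10 seconds apart, for illustration only.
snapshot_a = {
    "Global": {"timeAcquiringMicros": {"r": 1_000_000, "w": 500_000}},
    "Collection": {"timeAcquiringMicros": {"W": 250_000}},
}
snapshot_b = {
    "Global": {"timeAcquiringMicros": {"r": 2_000_000, "w": 1_500_000}},
    "Collection": {"timeAcquiringMicros": {"W": 750_000}},
}
ratio = lock_wait_ratio(snapshot_a, snapshot_b, elapsed_seconds=10)
```

Sampling this at a fixed interval during the same load test on both engines gives a trend to compare next to throughput, with the caveat that the engines lock at different granularities.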
To give you an idea of the data:
The DB has collections/tables with over a hundred million documents/records, each containing more than 100 attributes/columns. The data size is expected to grow a hundredfold soon.
Operations on the data:
There are mainly the following types of operations on the data:
Validating the data and then importing the data into the DB, that happens multiple times daily
Aggregations on this imported data
Searches/ finds
Updates
Deletes
Tools/software used:
MongoDB for database: PSS architecture based replicaset, indexes (most of the queries are INDEX scans)
NodeJS using Koa.js
Problems:
However, the tool is very slow when it comes to aggregations, finds, etc.
What have I implemented for performance so far?:
DB Indexing
Caching
Pre-aggregations (using MongoDB aggregate to aggregate the data beforehand and store it in different collections during import, to avoid aggregations at runtime)
Increased RAM and CPU cores on the DB server
Separate server for NodeJS server and Front-end build
PM2 to manage NodeJS server application and for spawning clusters
However from my experience, even after implementing all the above, the application is not performant enough. I feel that the reason for this is that the data is pretty huge. I am not aware of how Big Data applications are managed to deliver high performance. Please advise.
Also, is the selection of technology not suitable or will changing the technology/tools help? If yes, what is advised under such scenarios?
I'm requesting your advice to help me improve the performance of the application.
It is not easy to give a correct answer because we do not really have that many details. What I would do is detailed monitoring of at least the following:
Machine Level:
monitor the overall CPU load (for all cores) and RAM usage on your DB machine
monitor disk IO on the disks where the data is stored
this should show whether the machine specs are a bottleneck
Database & DB Process Level (my first guess, that this is the critical part):
what is the overall size of your data at the moment? (I know it will increase drastically, but if it is already too slow now, this could be interesting information, especially in relation to the current RAM size and number of CPU cores)
monitor memory usage and CPU load for your MongoDB process...
did a look at the query plans (while doing aggregations) show you what improvements can be made?
have a look at the caching strategy. What strategy are you using?
this should give more detailed results on where to make improvements at the DB level: is it just hardware bottlenecks, or is it an aggregation problem?
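For the query-plan check mentioned above, a small helper can flag plans that scan the whole collection. This is a rough sketch that assumes the find-style explain layout of MongoDB 3.0 and later (`queryPlanner.winningPlan` with nested `inputStage` / `inputStages`); aggregation explain output is laid out differently, and the two sample documents are hand-written for illustration.

```python
def plan_stages(plan):
    """Return the stage names in an explain() plan tree, outermost first."""
    stages = [plan["stage"]]
    children = list(plan.get("inputStages", []))
    if "inputStage" in plan:
        children.append(plan["inputStage"])
    for child in children:
        stages.extend(plan_stages(child))
    return stages


def has_collection_scan(explain_output):
    """True if the winning plan reads every document instead of using an index."""
    winning = explain_output["queryPlanner"]["winningPlan"]
    return "COLLSCAN" in plan_stages(winning)


# Hand-written explain outputs for illustration; real ones carry more fields.
indexed = {"queryPlanner": {"winningPlan": {
    "stage": "FETCH",
    "inputStage": {"stage": "IXSCAN", "indexName": "customer_1"},
}}}
unindexed = {"queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}}}
```

Running a check like this over the slowest queries is a quick way to confirm that "most of the queries are index scans" actually holds for the ones that hurt.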
Node.JS APP Level:
node.js app: how much RAM and CPU usage does this one take ...?
if there are multiple instances of the node.js app, track this for all instances
does the data import also happen through the Node.js app? Does the load on the app increase drastically while importing data?
if you see a high load on this app, then there is a need to act here (increasing instances, or splitting it into separate apps, e.g. import as a separate app)
I have a small confusion regarding pymongo connections. I have a situation in which I have to insert a continuous stream of data into MongoDB, for which I am using pymongo. Once I create a connection to the DB, I need to continuously insert into different collections. Is it OK to switch between different collections very quickly to insert data (and this goes on for a very long time, maybe for hours)? I have tried it and it seems to work properly, but I still want to know: will inserting documents into different collections very quickly cause any problems?
It won't cause any problems as long as your active set fits in RAM; otherwise you will start to see page faults in mongostat.
The concept remains the same irrespective of your storage engine, WiredTiger or MMAPv1, though you will achieve better write performance and storage size with the WiredTiger engine.
Best way to figure out long term performance is to simulate your write load, and observe mongostat for that duration.
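Switching collections on one connection is just a matter of indexing the database object by name, so a long-running writer can look roughly like the sketch below. The per-collection batching is my own addition rather than something the answer requires, and it assumes pymongo 3's `insert_many`; `CollectionBuffer` and `batch_size` are hypothetical names.

```python
from collections import defaultdict


class CollectionBuffer:
    """Buffers documents per collection and writes them in batches."""

    def __init__(self, db, batch_size=1000):
        self.db = db  # a pymongo Database, or anything indexable by collection name
        self.batch_size = batch_size
        self.pending = defaultdict(list)

    def add(self, collection_name, document):
        docs = self.pending[collection_name]
        docs.append(document)
        if len(docs) >= self.batch_size:
            self.flush(collection_name)

    def flush(self, collection_name):
        docs = self.pending.pop(collection_name, [])
        if docs:
            # One round trip for the whole batch; ordered=False lets the server
            # continue past an individual failing document.
            self.db[collection_name].insert_many(docs, ordered=False)

    def flush_all(self):
        for name in list(self.pending):
            self.flush(name)
```

With a real connection this would be used as `buffer = CollectionBuffer(client["mydb"])`, then `buffer.add("events", doc)` for each incoming document and `buffer.flush_all()` on shutdown, while watching mongostat as suggested above.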
I'm looking to utilize MongoDB for session data storage, so we don't need sticky sessions in our load balanced environment.
As of 3.0, we can use different storage engines within MongoDB.
While MMapV1 and WiredTiger come out of the box, it's also possible to run other storage engines (RocksDB?).
What I would like to do is test out my website using MongoDB with the different storage engines behind it.
I currently have a JMeter script that will hit multiple pages on the site for many different users.
Between tests I can switch out the Mongo connection, to different Mongod instances on different storage engines.
All I can really take out of this is the average latency for the page loads in JMeter.
Are there better results I can get, possibly using different tools or techniques?
Or, for session data, which is heavily read/write, is there one storage engine that would be preferred over another?
I'm not sure if this question is too open-ended or not, but I thought I'd ask here to maybe get more direction about how to test this out.
An important advantage of WiredTiger over the default MMAP storage engine is that while MMAP locks the whole collection for a write, WiredTiger locks only the affected document(s). That means multiple users can change multiple documents at the same time. This is especially interesting in your case of session data, because you will likely have many website visitors at the same time, each one regularly updating their own session document. But when you want to test if this feature really provides a benefit in your use-case, you will have to build a more sophisticated test setup which simulates many simultaneous updates and requests from multiple users.
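A test that simulates many simultaneous session updates does not need to be elaborate. The sketch below is one possible shape: `update_session` is a hypothetical callable standing in for whatever writes one session document, and threads are used on the assumption that the calls are I/O-bound. Collecting every latency rather than only the average makes it possible to compare percentiles between engines.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def run_concurrent_updates(update_session, session_ids, updates_per_session, workers):
    """Call update_session(session_id) from many threads; return latencies in seconds."""
    def one_user(session_id):
        latencies = []
        for _ in range(updates_per_session):
            started = time.perf_counter()
            update_session(session_id)
            latencies.append(time.perf_counter() - started)
        return latencies

    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_user = list(pool.map(one_user, session_ids))
    return [latency for user in per_user for latency in user]


def percentile(sorted_values, fraction):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    index = max(0, math.ceil(fraction * len(sorted_values)) - 1)
    return sorted_values[index]
```

Running the same call against a mongod on each storage engine and comparing, say, `percentile(sorted(latencies), 0.99)` should show contention effects that an average page-load latency hides.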
Another interesting feature of WiredTiger is that it compresses both data and indexes, which greatly reduces filesize. But this feature does of course cost performance. So when you only want to compare performance, you should switch off compression to have a fair comparison. The relevant config keys are:
storage.wiredTiger.collectionConfig.blockCompressor = none
storage.wiredTiger.indexConfig.prefixCompression = false
Keep in mind that changes to these keys will only take effect on newly created collections and indexes.
Another factor which could skew your results is cache size. The MMAP engine always uses all the RAM it can get to cache data. But WiredTiger is far more conservative and only uses half of the available RAM, unless you set a different value in
storage.wiredTiger.engineConfig.cacheSizeGB
So when you want a fair comparison, you should set this to the RAM size of the machine it runs on, minus the RAM required by other processes running on the same machine. But this will of course only make a difference when your test uses more test data than fits into memory, so that the cache handling of both engines starts to matter.
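Put together, the benchmark settings described above amount to something like the following. It is only a sketch: the function name and the "at least 1 GB" guard are my own assumptions, and the dictionary uses the same dotted paths as the answer (in a YAML config file they become nested keys).

```python
def fair_comparison_settings(total_ram_gb, other_processes_gb):
    """WiredTiger settings for a benchmark against MMAPv1, keyed by config path."""
    cache_gb = total_ram_gb - other_processes_gb
    if cache_gb < 1:
        # Assumed lower bound for this sketch, not a documented server limit.
        raise ValueError("not enough RAM left for the WiredTiger cache")
    return {
        "storage.wiredTiger.collectionConfig.blockCompressor": "none",
        "storage.wiredTiger.indexConfig.prefixCompression": False,
        "storage.wiredTiger.engineConfig.cacheSizeGB": cache_gb,
    }


settings = fair_comparison_settings(total_ram_gb=16, other_processes_gb=4)
```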
This is a basic question, but a very important one, and I am not sure I really get the point.
In the official documentation we can read:
MongoDB keeps all of the most recently used data in RAM. If you have created indexes for your queries and your working data set fits in RAM, MongoDB serves all queries from memory.
The part I am not sure I understand is
If you have created indexes for your queries and your working data set fits in RAM
What does "indexes" mean here?
For example, if I update a model and then query it, it's now in RAM because I have just updated it, so it will come from memory, but this is not very clear in my mind.
How can we be sure whether the data we query will come from memory or not? I understand that MongoDB uses whatever memory is free at the moment to cache data, but could someone explain the overall behavior further?
In which case would it be better to use a variable in our node server to store data rather than trust the MongoDB cache system?
How would you advise using MongoDB for huge traffic in general?
Note: This was written back in 2013, when MongoDB was still quite young and didn't have the features it does today. While this answer still holds true for mmap, it does not for the other storage technologies MongoDB now implements, such as WiredTiger, or Percona.
A good place to start to understand exactly what an index is: http://docs.mongodb.org/manual/core/indexes/
After you have brushed up on that you will understand why they are so good. Skipping forward to some of the more intricate questions:
How can we be sure whether the data we query will come from memory or not?
One way is to look at the yields field in any query's explain() output. This will tell you how many times the reader yielded its lock because data was not in RAM.
Another, more in-depth way is to look at tools like mongostat. These will tell you about the page faults (when data needs to be paged into RAM from disk) happening on your mongod.
I understand that MongoDB uses whatever memory is free at the moment to cache data, but could someone explain the overall behavior further?
This is actually incorrect. It is easier to just say that MongoDB does this, but in reality it does not. It is in fact the OS and its own paging algorithms, usually LRU, that do this for MongoDB. MongoDB does cache index plans for a certain period of time, though, so that it doesn't have to constantly keep checking and testing indexes.
In which case would it be better to use a variable in our node server to store data rather than trust the MongoDB cache system?
Not sure how you expect that to work...I mean the two do quite different things and if you intend to read your data from MongoDB into your application on startup into that var then I definitely would not recommend it.
Besides, OS algorithms for memory management are extremely mature and fast, so it is OK.
How would you advise using MongoDB for huge traffic in general?
Hmm, this is such a huge question. Really I would recommend you Google a little on this subject, but as the documentation states, you need to ensure your working set fits into RAM, for one.
Here is a good starting point: What does it mean to fit "working set" into RAM for MongoDB?
MongoDB attempts to keep entire collections in memory: it memory-maps each collection page. For everything to be in memory, both the data pages, and the indices that reference them, must be kept in memory.
If MongoDB returns a record, you can rest assured that it is now in memory (whether it was before your query or not).
MongoDB doesn't keep a "cache" of records in the same way that, say, a web browser does. When you commit a change, both the memory and the disk are updated.
Mongo is great when matched to the appropriate use cases. It is very high performance if you have sufficient server memory to cache everything, and declines rapidly past that point. Many, many high-volume websites use MongoDB: it's a good thing that memory is so cheap, now.
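A back-of-the-envelope version of the "does everything fit in memory" check can be built from the `dataSize` and `indexSize` fields of the dbStats command (both in bytes by default). It is a rough sketch: the `headroom` fraction is an assumption of mine, and the real working set is usually smaller than the full data set, so a False result here is a warning sign rather than a verdict.

```python
def fits_in_ram(db_stats, ram_bytes, headroom=0.8):
    """Rough check: do all data and indexes fit in a share of physical RAM?

    `db_stats` is the result of the dbStats command.
    `headroom` is an assumed fraction of RAM that the database can actually use.
    """
    needed = db_stats["dataSize"] + db_stats["indexSize"]
    return needed <= ram_bytes * headroom


GIB = 2 ** 30
# Made-up sizes for illustration: 6 GiB of data plus 1 GiB of indexes.
example_stats = {"dataSize": 6 * GIB, "indexSize": 1 * GIB}
```

With pymongo the input would come from `db.command("dbStats")`.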
Which scenario makes more sense: hosting several EC2 instances with MongoDB installed, or rather using the Amazon SimpleDB web service?
With several EC2 instances running MongoDB, I have the problem of setting the instances up myself.
With SimpleDB, I have the problem of locking myself into Amazon's data structure, right?
What differences are there development-wise? Shouldn't I be able to just switch the DAO of my service layers, to either write to MongoDB or AWS SimpleDB?
SimpleDB has some scalability limitations. You can only scale by sharding, and you have to shard manually. It has higher latency than mongodb or cassandra, it has a throughput limit, and it is priced higher than other options.
If you need wider query options, you have a high read rate, and you don't have that much data, mongodb is better. But for durability, you need to use at least 2 mongodb server instances as master/slave; otherwise you can lose the last minute of your data. It's much faster than simpledb. Scalability is manual, although autosharding is implemented in version 1.6.
Cassandra has weak query options but is as durable as postgresql. It is as fast as mongo and faster on higher data size. Write operations are faster than read operations on cassandra. It can scale automatically by firing ec2 instances, but you have to modify config files a bit (if I remember correctly). If you have terabytes of data cassandra is your best bet. No need to shard your data, it was designed distributed from the 1st day. You can have any number of copies for all your data and if some servers are dead it will automatically return the results from live ones and distribute the dead server's data to others. It's highly fault tolerant. You can include any number of instances, it's much easier to scale than other options. It has strong .net and java client options. They have connection pooling, load balancing, marking of dead servers,...
Another option is hadoop for big data, but it's not as realtime as the others; you can use hadoop for data warehousing. Neither cassandra nor mongo has transactions, so if you need transactions postgresql is a better fit. Another option is Amazon RDS, but its performance is bad and its price is high. If you want to use databases or simpledb you may also need data caching (e.g. memcached).
For web apps, if your data is small I recommend mongo, if it is large cassandra is better. You don't need a caching layer with mongo or cassandra, they are already fast. I don't recommend simpledb, it also locks you to Amazon as you said.
If you are using c#, java or scala you can write an interface and implement it for mongo, mysql, cassandra or anything else for the data access layer. It's simpler in dynamic languages (e.g. ruby, python, php). You can write a provider for two of them if you want, and can even change the storage at runtime with only a configuration change; it's all possible. Development with mongo, cassandra and simpledb is easier than with a relational database, and they are schema-free; it also depends on the client library/connector you're using. The simplest one is mongo. There's only one index per table in cassandra, so you have to manage other indexes yourself, but with the 0.7 release of cassandra secondary indexes will be possible, as far as I know. You can also start with any of them and replace it in the future if you have to.
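The interface-plus-providers idea can be sketched in a few lines. Everything here is illustrative: `RecordStore`, `make_store` and the `"backend"` config key are hypothetical names, the in-memory provider stands in for a second real backend, and the Mongo provider assumes pymongo 3's `replace_one` and `find_one`.

```python
from abc import ABC, abstractmethod


class RecordStore(ABC):
    """The only storage interface the service layer is allowed to see."""

    @abstractmethod
    def save(self, key, data):
        """Create or overwrite the record stored under key."""

    @abstractmethod
    def load(self, key):
        """Return the record stored under key, or None."""


class InMemoryStore(RecordStore):
    def __init__(self):
        self._records = {}

    def save(self, key, data):
        self._records[key] = dict(data)

    def load(self, key):
        record = self._records.get(key)
        return dict(record) if record is not None else None


class MongoStore(RecordStore):
    def __init__(self, collection):
        self._collection = collection  # a pymongo Collection

    def save(self, key, data):
        self._collection.replace_one({"_id": key}, dict(data, _id=key), upsert=True)

    def load(self, key):
        record = self._collection.find_one({"_id": key})
        if record is not None:
            record.pop("_id")
        return record


def make_store(config, mongo_collection=None):
    """Pick the provider from configuration only."""
    backend = config["backend"]
    if backend == "memory":
        return InMemoryStore()
    if backend == "mongo":
        return MongoStore(mongo_collection)
    raise ValueError(f"unknown backend: {backend!r}")
```

This works well for key-based CRUD; queries, sorting and aggregation differ too much between the stores to hide behind an interface this small.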
I think you have both a question of time and speed.
MongoDB / Cassandra are going to be much faster, but you will have to invest $$$ to get them going. This means you'll need to set up and run server instances for all of them and figure out how they work.
On the other hand, you don't have to pay a "per transaction" cost directly; you just pay for the hardware, which is probably more efficient for larger services.
In the Cassandra / MongoDB fight here's what you'll find (based on testing I'm personally involved with over the last few days).
Cassandra:
Scaling / Redundancy is very core
Configuration can be very intense
To do reporting you need map-reduce, for that you need to run a hadoop layer. This was a pain to get configured and a bigger pain to get performant.
MongoDB:
Configuration is relatively easy (even for the new sharding, this week)
Redundancy is still "getting there"
Map-reduce is built-in and it's easy to get data out.
Honestly, given the configuration time required for our 10s of GBs of data, we went with MongoDB on our end. I can imagine using SimpleDB for "must get these running" cases. But configuring a node to run MongoDB is so ridiculously simple that it may be worth skipping the "SimpleDB" route.
In terms of DAO, there are tons of libraries already for Mongo. The Thrift framework for Cassandra is well supported. You can probably write some simple logic to abstract away connections. But it will be harder to abstract away things more complex than simple CRUD.
Now 5 years later it is not hard to set up Mongo on any OS. Documentation is easy to follow, so I do not see setting up Mongo as a problem. Other answers addressed the questions of scalability, so I will try to address the question from the point of view of a developer (what limitations each system has):
I will use S for SimpleDB and M for Mongo.
M is written in C++, S is written in Erlang (not the fastest language)
M is open source and can be installed anywhere; S is proprietary and can run only on Amazon AWS. You also have to pay for a whole bunch of stuff for S
S has a whole bunch of strange limitations. M's limitations are way more reasonable. The strangest limitations are:
maximum size of domain (table) is 10 GB
attribute value length (size of field) is 1024 bytes
maximum items in Select response - 2500
maximum response size for Select (the maximum amount of data S can return to you) - 1 MB
S supports only a few languages (java, php, python, ruby, .net), M supports way more
both support REST
S has a query syntax very similar to SQL (but way less powerful). With M you need to learn a new syntax which looks like JSON (although it is straightforward to learn the basics)
with M you have to learn how to architect your database. Because many people think that schemaless means you can throw any junk into the database and extract it with ease, they might be surprised that the "junk in, junk out" maxim applies. I assume the same holds for S, but cannot claim it with certainty.
neither allows case-insensitive search. In M you can use a regex to overcome this limitation (ugly, and without index support) without introducing an additional lowercase field or application logic.
in S sorting can be done only on one field
because of the 5-second time limit, count in S can behave strangely. If 5 seconds pass and the query has not finished, you end up with a partial number and a token which allows you to continue the query. Application logic is responsible for collecting all this data and summing it up.
everything is a UTF-8 string, which makes it a pain in the ass to work with non string values (like numbers, dates) in S. M type support is way richer.
neither has transactions or joins
M supports compression, which is really helpful for nosql stores, where the same field name is stored over and over again.
S supports just a single index; M has single, compound, multi-key, geospatial etc.
both support replication and sharding
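The partial-count behaviour described in the list above means the application needs a small loop around count queries on S. The sketch below shows only that loop: `run_count_query` is a hypothetical callable wrapping one request, not a function of any real SimpleDB client library.

```python
def full_count(run_count_query):
    """Sum partial counts until the service stops returning a continuation token.

    `run_count_query(token)` is a hypothetical callable that issues one request
    and returns `(partial_count, next_token)`; `next_token` is None when done.
    """
    total = 0
    token = None
    while True:
        partial, token = run_count_query(token)
        total += partial
        if token is None:
            return total
```

The first call passes `None` as the token, and every later call passes back whatever token the previous response carried.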
One of the most important things you should consider is that SimpleDB has a very rudimentary query language. Even basic things like group by, sum, average and distinct, as well as data manipulation, are not supported, so the functionality is not really much richer than Redis/Memcached. On the other hand, Mongo supports a rich query language.