I seem to be having real issues getting performance anywhere near that stated in the docs (~700 - 2000 tps with a VM of 2 vCPUs and 4 GB RAM). I have tried a local VM, a local machine and a few AWS VMs and I can't get anywhere close; the maximum I have achieved is 80 tps on an AWS VM.
I have tried changing -dbPoolSize and -reqPoolSize for Orion and raising ulimit to the values suggested by MongoDB, but nothing I change gets me anywhere close.
I have set indexes on _id.id, _id.type and _id.servicePath as suggested in the docs; the last of these gave me an increase from 40 tps to 80 tps.
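For reference, the index creation looks roughly like this in the mongo shell (assuming Orion's default orion database and its entities collection):

    // Indexes suggested in the Orion docs, created from the mongo shell.
    // Assumes the default "orion" database and its "entities" collection.
    use orion
    db.entities.ensureIndex({ "_id.id": 1 })
    db.entities.ensureIndex({ "_id.type": 1 })
    db.entities.ensureIndex({ "_id.servicePath": 1 })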
Are there any config options for Orion or Mongo that I should change from their defaults to get closer? Are there any other performance tips? The link in the docs to the test scripts doesn't work, so I haven't been able to see the examples.
I have created my own test scripts in Node.js and have tested updates and queries using a varying number of concurrent connections and one or two load injectors.
From the output of "top", the load is on Mongo: it almost maxes out the CPU, yet adding more cores to the VM doesn't change the numbers. The VMs have 7.5 GB or 15 GB of RAM, so Mongo should be able to keep all the data in memory for fast performance, shouldn't it?
I have used mongostat to confirm that the number of connections from Orion to Mongo changes with the -dbPoolSize option, but this doesn't yield any better performance.
Any help you can provide would be much appreciated.
I have tried CentOS 6.5 and 6.7 with Orion 0.25 and 0.26 and MongoDB 2.6, with ~500,000 entities.
My test scripts and data are on GitHub.
I have only tested without subscriptions so far; I have scripts ready to test with subscriptions, but I wanted to get a good baseline first.
My data is modelled around parking spaces in the UK: countries, their regions and their outcodes (the first part of the postcode). I am using servicePaths to split them down to the parking lots within an outcode.
Here is a gist with the requests and the mongo shell output.
Performance is a complex topic that depends on many factors (deployment setup of Orion and MongoDB, startup configuration of Orion and MongoDB, hardware profile of the systems hosting the processes, network communications, overprovisioning level in the case of virtualization, injected load, etc.), so there isn't a general answer to this kind of problem. However, I'll try to provide some hints and recommendations that I hope may help.
Regarding versions, Orion 0.26.0 (or 0.26.1) is recommended over 0.25.0. We have included a lot of performance-related improvements in Orion 0.26.0. Regarding MongoDB, we have also found that 3.0 can be much better than 2.6, especially in update-intensive scenarios.
Having said that, first of all you should locate the bottleneck. Useful tools to do this are top, mongostat and mongotop. The bottleneck could be Orion, MongoDB or the network connecting them. If the bottleneck is the CB, the performance tuning hints provided in this document may help. Slow-query information in MongoDB could also point to bottlenecks at Orion. If the bottleneck is MongoDB, given the large number of entities you have (500,000), maybe you should consider implementing sharding. If the bottleneck is the network, colocating Orion and MongoDB may help.
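For reference, enabling sharding would look roughly like the sketch below, run from a mongos (the shard key is purely illustrative; choosing a suitable one requires analysing your query patterns, and I haven't validated this particular key against your data):

    // Illustrative only: shard the Orion entities collection.
    // The shard key below is a placeholder, not a recommendation.
    sh.enableSharding("orion")
    sh.shardCollection("orion.entities", { "_id.id": 1 })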
Finally, some things you can also try in order to get more insight into the problem:
Run some tests outside AWS (i.e. virtual machines on local premises) to compare. I don't know much about the overprovisioning policy in AWS, but based on my previous experience with other cloud providers, VM overprovisioning (especially if it varies over time) can impact performance.
Analyze whether the performance is related to the number of entities, e.g. run tests with 500, 5,000, 50,000 and 500,000 entities and get the performance figures in each case.
Analyze whether the performance is related to the usage of servicePath, e.g. put all 500,000 entities in the default service path / (moving the current content of the servicePath somewhere else, e.g. an entity attribute or part of the entity ID string) and test. Currently Orion uses a regex to filter by servicePath, and that can be slow.
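As a hedged way of checking the last point, you can look at the query plan of a servicePath-filtered query in the mongo shell; the regex below is only an illustration of the kind of anchored pattern used, not Orion's exact query:

    // Does the { "_id.servicePath": 1 } index get used for an anchored regex?
    // The path below is a made-up example.
    db.entities.find(
      { "_id.servicePath": { $regex: "^/uk/london/" } }
    ).explain()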
To give you an idea of the data:
The DB has collections/tables with over a hundred million documents/records, each containing more than 100 attributes/columns. The data size is expected to grow a hundredfold soon.
Operations on the data:
There are mainly the following types of operations on the data:
Validating the data and then importing it into the DB; this happens multiple times daily
Aggregations on this imported data
Searches/finds
Updates
Deletes
Tools/software used:
MongoDB as the database: a PSS-architecture replica set, with indexes (most of the queries are index scans)
NodeJS using Koa.js
Problems:
However, the tool is very slow when it comes to aggregations, finds, etc.
What have I implemented for performance so far?
DB Indexing
Caching
Pre-aggregations (using MongoDB aggregate to roll up the data beforehand and store it in separate collections during import, to avoid aggregations at runtime; see the sketch after this list)
Increased RAM and CPU cores on the DB server
Separate server for NodeJS server and Front-end build
PM2 to manage the NodeJS server application and to spawn clusters
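To illustrate what I mean by pre-aggregation, here is a simplified sketch; collection and field names are placeholders for my real ones:

    // Roll up raw imported documents into a summary collection at import time,
    // so runtime reads hit the small pre-aggregated collection instead.
    // Collection and field names are placeholders.
    db.rawRecords.aggregate([
      { $match: { importBatch: "2021-06-01" } },   // only the batch just imported
      { $group: { _id: "$accountId",
                  total: { $sum: "$amount" },
                  count: { $sum: 1 } } },
      { $out: "dailyTotals" }   // $out replaces the target collection;
                                // $merge (MongoDB 4.2+) can update it incrementally
    ])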
However, in my experience, even after implementing all of the above, the application is not performant enough. I feel the reason is that the data is simply huge. I am not aware of how big-data applications are built to deliver high performance. Please advise.
Also, is the choice of technology unsuitable, and would changing the technology/tools help? If so, what is advised in such scenarios?
I'm requesting your advice on improving the performance of the application.
It's not easy to give a correct answer because we do not really have many details. What I would do is detailed monitoring, covering at least the following:
Machine Level:
monitor the overall CPU load (for all cores) and RAM usage on your DB machine
monitor disk IO on the disks where the data is stored
this should show whether the machine specs are a bottleneck
Database & DB Process Level (my first guess is that this is the critical part):
what is the overall size of your data at the moment? (I know it will increase drastically, but if it is already too slow now, this is interesting information, especially in relation to the current RAM size and number of CPU cores)
monitor memory usage and CPU load for your MongoDB process...
did looking at the query plans (while doing aggregations) guide you on what improvements can be made? (a minimal example follows this list)
have a look at the caching strategy. What strategy are you using?
this should give more detailed results on where to make improvements at the DB level. Is it just hardware bottlenecks, or is it an aggregation problem...
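For example, in the mongo shell (collection, filter and pipeline are placeholders):

    // Inspect the plan of a find query and of an aggregation.
    // Collection and field names are made up.
    db.orders.find({ status: "open" }).explain("executionStats")
    db.orders.aggregate(
      [ { $group: { _id: "$customerId", total: { $sum: "$amount" } } } ],
      { explain: true }
    )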
Node.js App Level:
node.js app: how much RAM and CPU does this one use?
if there are multiple instances of the node.js app, track this for all instances
does the data import also happen through the Node.js app? Does the load on the app increase drastically while importing data?
if you see a high load on this app, then there is a need to act here (increasing instances, splitting it into separate apps, e.g. running the import as a separate app)
Suppose I want to host my e-commerce website on GCP/AWS/DigitalOcean. I am using MongoDB as the database.
Which would be better?
Installing MongoDB on the remote machine (GCP/AWS/DigitalOcean)
or
Using MongoDB Atlas to deploy and manage the database
Can you list the pros and cons of both, and in which situations I should use each method?
This is a really tricky question to answer, as everyone will have a different opinion; I'll try to answer based on my own experience (AWS EC2 instances).
Cons of Atlas vs. EC2:
I would say the number one "con" of Atlas is the cost: it is far more expensive than running your own EC2 (at least at my scale, which is around 1 TB of data). This varies according to the amount of data, instance sizes, backup routine and more.
you have less "control", clearly you wont be able to access the actual server on which the instance is running on.
Pros:
you have less "control", matching argument 2 in the cons this can be seen as a pro, you don't have to maintain the server and all that comes with it.
do your own math on whether this is a pro or not.
I'm going to stop here, as I could keep listing endless minor differences that are, at the end of the day, mostly matters of opinion.
I would say though (again, based on my own experience) that Atlas is great; I would strongly consider using it, especially if the following conditions are met:
I don't have a person experienced with AWS EC2 instances on my team (otherwise you will spend a lot of time just getting started).
small scale: Atlas's cost is not that high at small scale, and it gives you a lot of power and saves you some headaches, allowing you to push your product forward.
A part of the application I am working on allows a user to build a query (visually) to export data from the system. Currently, I am building an Elasticsearch query based on the user input and writing a web service that collects all the results of that query (after applying some basic filters).
After doing some basic benchmarking on the production Elasticsearch cluster, I am starting to doubt whether Elasticsearch is the right tool for this. It takes about 22 minutes to export 1 million contacts (from an index of 11 million). There are 3 nodes in the cluster, each with 4 cores, 16 GB RAM (heap size 8 GB) and EBS storage (I know this is not the most efficient storage for ES).
Is Elasticsearch really not well suited for this kind of large-volume (and frequent) export? In my setup Elasticsearch follows Mongo (using the transporter plugin), so it's possible to get data out of Mongo as well. Would Mongo be a better option? I am a bit skeptical about using Mongo, considering its not-so-good memory management and the potential for polluting the working set when running exports.
Currently I am pulling data from Elasticsearch over HTTP REST using scroll (and the scan API). It's possible that I might get more throughput using Elasticsearch's Java node client or a plugin like https://github.com/jprante/elasticsearch-knapsack, although with the knapsack plugin I lose the ability to record the time (and other book-keeping) for each export.
Any suggestions would be highly appreciated.
You can experiment with the scroll and size parameters. It depends on the size of your documents, but you could try something like 30s for scroll and 1000 for size.
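A sketch of such a scroll loop in Node.js, assuming the legacy elasticsearch npm client (adapt it to whichever client you actually use):

    // Page through an export with scroll; index name and query are placeholders.
    const elasticsearch = require('elasticsearch');
    const client = new elasticsearch.Client({ host: 'localhost:9200' });

    async function exportAll(index, query) {
      let response = await client.search({
        index,
        scroll: '30s',   // keep the scroll context alive between batches
        size: 1000,      // documents per batch; tune against your EBS throughput
        body: { query },
      });

      const docs = [];
      while (response.hits.hits.length > 0) {
        docs.push(...response.hits.hits);
        response = await client.scroll({ scrollId: response._scroll_id, scroll: '30s' });
      }
      return docs;
    }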
You retrieve docs at a rate of ~760 docs/s, which is not great, but as you say, EBS is not fast. Did you try to calculate the current throughput and compare it to the EBS bandwidth?
A Java client could be a bit faster but I would guess the overhead is somewhere else in this case.
I am working on an application in which there is a pretty dramatic difference in usage patterns between "hot" data and other data. We have selected MongoDB as our data repository, and in most ways it seems to be a fantastic match for the kind of application we're building.
Here's the problem. There will be a central document repository, which must be searched and accessed fairly often: its size is about 2 GB now, and it will grow to 4 GB in the next couple of years. To increase performance, we will be placing that DB on a server-class mirrored SSD array, and given the total size of the data, we don't imagine that memory will become a problem.
The system will also be keeping record versions, audit trails, customer interactions, notification records, and the like, which will be referenced only rarely and could grow quite large. We would like to place this on more traditional spinning disks, as it would be accessed rarely (we're guessing a typical record might be accessed four or five times per year, and only to satisfy research and customer-service inquiries).
I haven't found any reference material that indicates whether MongoDB would allow us to place different databases on different disks (we're running mongod under Windows, but that doesn't have to be the case when we go into production).
Sorry about all the detail here, but these are the primary factors we have to think about as we plan for deployment. Given Mongo's proclivity to grab all available memory, and that it'll be running on a machine that maxes out at 24 GB of memory, we're trying to work out the best production configuration for our database(s).
So these seem to be our options:
Single instance of Mongo with multiple databases. This seems to have the advantage of simplicity, but I still haven't found a definitive answer on how to split databases across different physical drives on the machine.
Two instances of Mongo, one for the "hot" data, and the other for the archival stuff. I'm not sure how well Mongo will handle two instances of mongod contending for resources, but we were thinking that, since the 32-bit version of the server is limited to 2GB of memory, we could use that for the archival stuff without having it overwhelm the resources of the machine. For the "hot" data, we could then easily configure a 64-bit instance of the database engine to use an SSD array, and given the relatively small size of our data, the whole DB and indexes could be directly memory mapped without page faults.
Two instances of Mongo in two separate virtual machines. We could use VMware, or something similar, to create two Linux machines that could host Mongo separately. While it might increase the administrative burden a bit, this seems to me to provide the most fine-grained control of system resource usage, while still leaving the Windows Server host enough memory to run IIS and its own processes.
But all this is speculation, as none of us have ever done significant MongoDB deployments before, so we don't have a great experience base to draw upon.
My actual question is whether there are options to have two databases in the same mongod server instance utilize entirely separate drives. But any insight into the advantages and drawbacks of our three identified deployment options would be welcome as well.
That's actually a pretty easy thing to do when using Linux:
Activate the directoryPerDB config option
Create the databases you need.
Shut down the instance.
Copy over the data from the individual database directories to the different block devices (disks, RAID arrays, logical volumes, iSCSI targets and the like).
Mount the respective block devices at their corresponding positions below the dbpath directory (don't forget to add the corresponding lines to /etc/fstab!)
Restart mongod.
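For illustration, the relevant pieces might look like this (paths, device names and filesystem are placeholders):

    # /etc/mongod.conf (YAML format) -- one directory per database under dbPath
    storage:
      dbPath: /var/lib/mongodb
      directoryPerDB: true

    # /etc/fstab -- mount a dedicated block device over one database's directory
    # (the directory under dbPath is named after the database, e.g. "archive")
    /dev/vg_data/lv_archive  /var/lib/mongodb/archive  xfs  defaults,noatime  0 0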
Edit: As a side note, I would like to add that you should not use Windows as the OS for a production MongoDB. The available filesystems, NTFS and ReFS, perform poorly compared to ext4 or XFS (the latter being the suggested filesystem for production; see the MongoDB production notes for details). For this reason alone, I would suggest Linux. Another reason is the RAM used by rather unnecessary Windows subsystems, like the GUI.
Which scenario makes more sense: hosting several EC2 instances with MongoDB installed, or rather using the Amazon SimpleDB web service?
With several EC2 instances running MongoDB, I have the problem of setting the instances up myself.
When using SimpleDB, I have the problem of locking myself into Amazon's data structure, right?
What differences are there development-wise? Shouldn't I be able to just switch the DAO of my service layers to write to either MongoDB or AWS SimpleDB?
SimpleDB has some scalability limitations: scaling is manual (you can only scale by sharding), it has higher latency than MongoDB or Cassandra, it has a throughput limit, and it is priced higher than the other options.
If you need wider query options, have a high read rate and don't have that much data, MongoDB is better. But for durability you need at least 2 MongoDB server instances as master/slave; otherwise you can lose the last minute of your data. Scaling is manual. It's much faster than SimpleDB. Auto-sharding was implemented in version 1.6.
Cassandra has weak query options but is as durable as PostgreSQL. It is as fast as Mongo and faster at larger data sizes. Write operations are faster than read operations on Cassandra. It can scale automatically by firing up EC2 instances, but you have to modify the config files a bit (if I remember correctly). If you have terabytes of data, Cassandra is your best bet. There is no need to shard your data; it was designed to be distributed from day one. You can keep any number of copies of all your data, and if some servers are dead it will automatically return results from the live ones and redistribute the dead server's data to others. It's highly fault tolerant. You can add any number of instances; it's much easier to scale than the other options. It has strong .NET and Java client options, with connection pooling, load balancing, marking of dead servers, etc.
Another option is Hadoop for big data, but it's not as real-time as the others; you can use Hadoop for data warehousing. Neither Cassandra nor Mongo has transactions, so if you need transactions PostgreSQL is a better fit. Another option is Amazon RDS, but its performance is poor and its price is high. If you want to use relational databases or SimpleDB, you may also need data caching (e.g. memcached).
For web apps, if your data is small I recommend Mongo; if it is large, Cassandra is better. You don't need a caching layer with Mongo or Cassandra; they are already fast. I don't recommend SimpleDB; it also locks you into Amazon, as you said.
If you are using C#, Java or Scala, you can write an interface and implement it for Mongo, MySQL, Cassandra or anything else as your data access layer. It's even simpler in dynamic languages (e.g. Ruby, Python, PHP). You can write providers for two of them if you want, and perhaps even switch storage at runtime with only a configuration change; all of this is possible. Development with Mongo, Cassandra and SimpleDB is easier than with a relational database, and they are schema-free; it also depends on the client library/connector you're using. The simplest one is Mongo. There is only one index per table in Cassandra, so you have to manage other indexes yourself, but as far as I know the 0.7 release of Cassandra will make secondary indexes possible. You can also start with any of them and replace it in the future if you have to.
I think you have a question of both time and speed.
MongoDB / Cassandra are going to be much faster, but you will have to invest $$$ to get them going. This means you'll need to run / set up server instances for all of them and figure out how they work.
On the other hand, you don't have to pay a "per transaction" cost directly; you just pay for the hardware, which is probably more efficient for larger services.
In the Cassandra / MongoDB fight, here's what you'll find (based on testing I've personally been involved with over the last few days).
Cassandra:
Scaling / redundancy is at the very core of the design
Configuration can be very intense
To do reporting you need map-reduce, and for that you need to run a Hadoop layer. This was a pain to get configured and a bigger pain to make performant.
MongoDB:
Configuration is relatively easy (even for the new sharding, this week)
Redundancy is still "getting there"
Map-reduce is built-in and it's easy to get data out.
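For example, a toy map-reduce in the mongo shell (collection and field names are made up):

    // Count orders per customer with MongoDB's built-in mapReduce.
    db.orders.mapReduce(
      function () { emit(this.customerId, 1); },              // map
      function (key, values) { return Array.sum(values); },   // reduce
      { out: "orders_per_customer" }
    )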
Honestly, given the configuration time required for our tens of GBs of data, we went with MongoDB on our end. I can imagine using SimpleDB for "must get this running" cases. But configuring a node to run MongoDB is so ridiculously simple that it may be worth skipping the "SimpleDB" route.
In terms of DAO, there are tons of libraries already for Mongo. The Thrift framework for Cassandra is well supported. You can probably write some simple logic to abstract away connections. But it will be harder to abstract away things more complex than simple CRUD.
Now, 5 years later, it is not hard to set up Mongo on any OS. The documentation is easy to follow, so I do not see setting up Mongo as a problem. Other answers have addressed the question of scalability, so I will try to address the question from a developer's point of view (what limitations each system has):
I will use S for SimpleDB and M for Mongo.
M is written in C++, S is written in Erlang (not the fastest language)
M is open source and can be installed anywhere; S is proprietary and runs only on Amazon AWS. You also have to pay for a whole bunch of stuff with S
S has a whole bunch of strange limitations; M's limitations are way more reasonable. The strangest ones are:
maximum size of domain (table) is 10 GB
attribute value length (size of field) is 1024 bytes
maximum items in Select response - 2500
maximum response size for Select (the maximum amount of data S can return to you) - 1 MB
S supports only a few languages (Java, PHP, Python, Ruby, .NET); M supports way more
both support REST
S has a query syntax very similar to SQL (but way less powerful). With M you need to learn a new syntax that looks like JSON (though it is straightforward to learn the basics)
with M you have to learn how to architect your database. Many people think that schemaless means you can throw any junk into the database and extract it with ease; they might be surprised that the "junk in, junk out" maxim still applies. I assume the same is true of S, but cannot claim it with certainty.
neither allows case-insensitive search. In M you can use a regex to somewhat (ugly, and it won't use an index) overcome this limitation without introducing an additional lowercase field or application logic.
in S sorting can be done only on one field
because of the 5-second time limit, count in S can behave strangely: if 5 seconds pass and the query has not finished, you end up with a partial count and a token that allows you to continue the query. The application logic is responsible for collecting all this data and summing it up.
in S everything is a UTF-8 string, which makes it a pain in the ass to work with non-string values (like numbers or dates). M's type support is way richer.
neither has transactions or joins
M supports compression, which is really helpful for NoSQL stores, where the same field names are stored over and over again.
S supports just a single index; M has single, compound, multikey, geospatial indexes, etc.
both support replication and sharding
One of the most important things you should consider is that SimpleDB has a very rudimentary query language. Even basic things like group by, sum, average and distinct, as well as data manipulation, are not supported, so the functionality is not really much richer than Redis/Memcached. On the other hand, Mongo supports a rich query language.
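As a small illustration (made-up collection and field names), this is the kind of server-side query M supports and S does not:

    // Group-by with sum and average, plus distinct -- all done server-side in M.
    db.sales.aggregate([
      { $group: { _id: "$region",
                  total: { $sum: "$amount" },
                  avgAmount: { $avg: "$amount" } } }
    ])
    db.sales.distinct("region")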