MongoDB on my own machine or MongoDB Atlas?

Suppose I want to host my e-commerce website on GCP/AWS/DigitalOcean, using MongoDB as the database.
Which would be better:
Installing MongoDB on the remote machine (GCP/AWS/DigitalOcean)
or
Using MongoDB Atlas to deploy and manage the database?
Can you list the pros and cons of both, and in which situations I should use each approach?

This is a really tricky question to answer, as everyone will have a different opinion; I'll try to answer based on my own experience (AWS EC2 instances).
Cons of Atlas vs. EC2:
I would say the number one con of Atlas is the cost: it is far more expensive than running your own EC2 instances (at least at my scale, which is around 1 TB of data). This varies with the amount of data, instance sizes, backup routine and more.
You have less control; clearly you won't be able to access the actual server the instance is running on.
Pros:
You have less control; mirroring point 2 in the cons, this can also be seen as a pro: you don't have to maintain the server and everything that comes with it.
Do your own math on whether this is a pro or not.
I'm going to stop here, as I could keep listing endless minor differences that are, at the end of the day, mostly a matter of opinion.
I would say though (again, based on my own experience) that Atlas is great, and I would strongly consider using it, especially if the following conditions are met:
You don't have anyone experienced with AWS EC2 instances on your team (without one, you will spend a lot of time just getting started).
You are operating at small scale; Atlas's cost is not that high at small scale, and it gives you a lot of power and saves you some headaches, letting you push your product forward.

Related

FIWARE Orion/MongoDB Performance on AWS

I seem to be having real issues trying to get performance anywhere near that stated in the docs (~700 - 2000 tps with a VM of 2 vCPUs and 4 GB RAM). I have tried on a local VM, a local machine and a few AWS VMs and I can't get anywhere close. The maximum I have achieved is 80 tps on an AWS VM.
I have tried changing the -dbPoolSize and the -reqPoolSize for Orion and playing with ulimit to set it to the values suggested by MongoDB, but nothing I change seems to get me anywhere close.
I have set indexes on the _id.id, _id.type and _id.servicePath as suggested in the docs - the latter of which gave me an increase from 40 tps to 80 tps.
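For reference, a minimal sketch of how those indexes might be created from the mongo shell; the orion database and entities collection names are Orion's defaults, so verify them against your own deployment:

```js
// Hedged sketch: create the indexes mentioned above (MongoDB 2.6-era shell syntax).
// "orion" / "entities" are Orion's defaults; adjust to your own deployment.
var orion = db.getSiblingDB("orion");
orion.entities.ensureIndex({ "_id.id": 1 });
orion.entities.ensureIndex({ "_id.type": 1 });
orion.entities.ensureIndex({ "_id.servicePath": 1 });
orion.entities.getIndexes();   // confirm the indexes are in place
```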
Are there any config options for Orion or Mongo that I should be setting away from the default which will get me any closer? Are there any other tips for performance? The link in the docs to the test scripts doesn't work so I haven't been able to see the examples.
I have created my own test scripts using Node.js and I have tested update and queries using a variable amount of concurrent connections and between 1 and 2 load injectors.
From the output of "top", the load is on Mongo, as it almost maxes out the CPU, but adding more cores to the VM doesn't change the stats. The VM has 7.5 GB or 15 GB of RAM, so Mongo should be able to keep all the data in memory for blazing fast performance, shouldn't it?
I have used mongostat to see that the connections from orion to mongo change with the -dbPoolSize option, but this doesn't yield any better performance.
Any help you can provide would be much appreciated.
I have tried using CentOS 6.5 and 6.7 with Orion 0.25 and 0.26 and MongoDB 2.6, with ~500,000 entities.
My test scripts and data are on GitHub
I have only tested without subscriptions so far, but I have scripts ready to test with subscriptions - but I wanted to get a good baseline before adding subscriptions.
My data is modeled around parking spaces in the UK: the UK countries, their regions and their outcodes (the first part of the postcode). ServicePaths are used to split them down to the individual parking lot within an outcode.
Here is a gist with the requests and mongo shell output
Performance is a complex topic which depends on many factors (deployment setup of Orion and MongoDB, startup configuration of Orion and MongoDB, hardware profile of the systems hosting the processes, network communications, overprovisioning level in the case of virtualization, injected load, etc.), so there isn't any general answer to this kind of problem. However, I'll try to provide some hints and recommendations that I hope may help.
Regarding versions, Orion 0.26.0 (or 0.26.1) is recommended over 0.25.0. We have included a lot of performance-related improvements in Orion 0.26.0. Regarding MongoDB, we have also found that 3.0 can be much better than 2.6, especially in update-intensive scenarios.
Having said that, first of all you should locate the bottleneck. Useful tools for this are top, mongostat and mongotop. It could be either Orion, MongoDB or the network connecting them. If the bottleneck is the CB (Context Broker), the performance tuning hints provided in this document may help. Slow-query information in MongoDB could also point to bottlenecks at Orion. If the bottleneck is MongoDB, taking into account the large number of entities you have (500,000), you should maybe consider implementing sharding. If the bottleneck is the network, colocating Orion and MongoDB may help.
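As an illustration of the slow-queries point, a hedged sketch of turning on MongoDB's profiler from the mongo shell; the 100 ms threshold and the database name are arbitrary examples, not recommendations:

```js
// Enable the database profiler to capture slow operations, run the load test,
// then inspect the slowest captured operations.
var orion = db.getSiblingDB("orion");   // adjust to your deployment's database name
orion.setProfilingLevel(1, 100);        // level 1 = log operations slower than 100 ms
// ...generate load against Orion, then:
orion.getCollection("system.profile").find().sort({ millis: -1 }).limit(5).pretty();
```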
Finally, some things you can also try in order to get more insight into the problem:
Run some tests outside AWS (i.e. virtual machines on local premises) to compare. I don't know too much about the overprovisioning policy in AWS, but based on my previous experience with other cloud providers, VM overprovisioning (especially if it varies over time) can impact performance.
Analyze whether the performance is related to the number of entities. E.g. run tests with 500, 5,000, 50,000 and 500,000 entities and get the performance figure in each case.
Analyze whether the performance is related to the usage of servicePath, e.g. put all 500,000 entities in the default service path / (moving the current content of the servicePath to another place, e.g. an entity attribute or part of the entity ID string) and test. Currently Orion uses a regex to filter by servicePath and that could be slow.

ElasticSearch vs. MongoDB for Caching User Data

Up to this point, I have been using MongoDB (Node.js + Mongoose) to save posts which belong to a user, so that I can later retrieve them to display in a stream (just like Facebook, Twitter, etc.)
It recently became necessary to allow the user to deeply search his stream; MongoDB's search was insufficient, so I implemented ElasticSearch on my servers (Amazon EC2 m1.large instances running CentOS, FWIW).
My question: I'm now in a position that I'm duplicating the data between MongoDB (where the user's stream is cached) and ElasticSearch (where it is searched).
Is there any disadvantage to moving my cache ENTIRELY into ElasticSearch, getting rid of MongoDB altogether? It seems a waste to double the storage, and there's no other place that I'm accessing this data (it is only used when presenting/searching the stream of posts).
Specifically, I want to make sure I'm not overlooking anything re: performance. I like the idea of reducing MongoDB as a bottleneck, yet I worry about the memory overhead of ElasticSearch. MongoDB runs on its own server in my cloud setup, whereas ElasticSearch is running on the same instances as node.js. This means I would have MORE ElasticSearch servers (the node.js servers are in an auto-scaling array), but they each are not DEDICATED servers (unlike MongoDB).
The only big obstacle to using ES as a "primary datasource" is that there isn't a good backup mechanism right now. The ES team is working on it and expects it to be out by the end of the year, but in the meantime you'll have to implement your own backup scripts.
As far as performance, it's really hard to say because almost every situation is unique. ES benefits from memory - so more is always better. In particular, sorts/filters/facets/geo all like to eat memory. If you aren't doing much in the way of faceting, for example, you may be fine with less memory.
ES doesn't need to run on a dedicated node...but it will happily use as many resources as you give it.
Another option is to use just the Elasticsearch indexes. You can choose not to store the data in a readable format, so you search in ES and then retrieve the documents from MongoDB for your user as needed.
The question below covers exactly that:
Storing only selected fields and not storing _all in pyes/elasticsearch
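To make that concrete, here is a rough sketch (not the poster's code) of the "search in ES, hydrate from MongoDB" pattern. The index/collection names, the searched field, disabling _source in the ES mapping, and reusing the Mongo _id as the ES document id are all assumptions for illustration; it needs Node 18+ for the global fetch:

```js
// Hypothetical flow: Elasticsearch holds only the searchable index (mapping created with
// "_source": { "enabled": false }), while MongoDB remains the store for the actual documents.
const { MongoClient, ObjectId } = require('mongodb');

async function searchStream(queryText) {
  // 1) Full-text search in Elasticsearch; only ids come back because _source is disabled.
  const esRes = await fetch('http://localhost:9200/posts/_search', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query: { match: { body: queryText } }, size: 50 })
  });
  const ids = (await esRes.json()).hits.hits.map(h => h._id);

  // 2) Hydrate the matching posts from MongoDB.
  const client = await MongoClient.connect('mongodb://localhost:27017');
  try {
    return await client.db('app').collection('posts')
      .find({ _id: { $in: ids.map(id => new ObjectId(id)) } })
      .toArray();
  } finally {
    await client.close();
  }
}
```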

Extremely high QPS - DynamoDB vs MongoDB vs other noSQL?

We are building a system that will need to serve loads of small requests from day one. By "loads" I mean ~5,000 queries per second. For each query we need to retrieve ~20 records from noSQL database. There will be two batch reads - 3-4 records at first and then 16-17 reads instantly after that (based on the result of first read). That would be ~100,000 objects to read per second.
Until now we were thinking about using DynamoDB for this as it's really easy to start with.
Storage is not something I would be worried about as the objects will be really tiny.
What I am worried about is the cost of reads. DynamoDB costs $0.0113 per hour per 100 eventually consistent (which is fine for us) reads per second. That is $11.30 per hour for us, provided that all objects are up to 1KB in size. And that would be $5424 per month based on 16 hours/day average usage.
So... $5424 per month.
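For what it's worth, the arithmetic behind that figure, using the rates quoted above (current DynamoDB pricing differs):

```js
// Reproducing the cost estimate from the question with the rates it quotes.
const readsPerSecond = 100000;           // ~20 objects per query * ~5,000 queries/sec
const pricePer100ReadsPerHour = 0.0113;  // eventually consistent reads, objects up to 1 KB
const hoursPerDay = 16;
const daysPerMonth = 30;

const perHour = (readsPerSecond / 100) * pricePer100ReadsPerHour;  // 11.30
const perMonth = perHour * hoursPerDay * daysPerMonth;             // 5424
console.log(`$${perHour.toFixed(2)}/hour, $${perMonth.toFixed(0)}/month`);
```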
I would consider other options but I am worried about maintenance issues, costs etc. I have never worked with such setups before so your advice would be really valuable.
What would be the most cost-effective (yet still hassle-free) solution for such read/write intensive application?
From your description above, I'm assuming that your 5,000 queries per second are entirely read operations. This is essentially what we'd call a data warehouse use case. What are your availability requirements? Does it have to be hosted on AWS and friends, or can you buy your own hardware to run in-house? What does your data look like? What does the logic which consumes this data look like?
You might get the sense that there really isn't enough information here to answer the question definitively, but I can at least offer some advice.
First, if your data is relatively small and your queries are simple, save yourself some hassle and make sure you're querying from RAM instead of disk. Any modern RDBMS with support for in-memory caching/tablespaces will do the trick. Postgres and MySQL both have features for this. In the case of Postgres make sure you've tuned the memory parameters appropriately as the out-of-the-box configuration is designed to run on pretty meager hardware. If you must use a NoSQL option, depending on the structure of your data Redis is probably a good choice (it's also primarily in-memory). However in order to say which flavor of NoSQL might be the best fit we'd need to know more about the structure of the data that you're querying, and what queries you're running.
If the queries boil down to SELECT * FROM table WHERE primary_key = {CONSTANT} - don't bother messing with NoSQL - just use an RDBMS and learn how to tune the dang thing. This is doubly true if you can run it on your own hardware. If the connection count is high, use read slaves to balance the load.
Long-after-the-fact Edit (5/7/2013):
Something I should've mentioned before: EC2 is a really really crappy place to measure performance of self-managed database nodes. Unless you're paying out the nose, your I/O perf will be terrible. Your choices are to either pay big money for provisioned IOPS, RAID together a bunch of EBS volumes, or rely on ephemeral storage whilst syncing a WAL off to S3 or similar. All of these options are expensive and difficult to maintain. All of these options have varying degrees of performance.
I discovered this for a recent project, so I switched to Rackspace. The performance increased tremendously there, but I noticed that I was paying a lot for CPU and RAM resources when really I just need fast I/O. Now I host with Digital Ocean. All of DO's storage is SSD. Their CPU performance is kind of crappy in comparison to other offerings, but I'm incredibly I/O bound so I Just Don't Care. After dropping Postgres' random_page_cost to 2, I'm humming along quite nicely.
Moral of the story: profile, tune, repeat. Ask yourself what-if questions and constantly validate your assumptions.
Another long-after-the-fact-edit (11/23/2013): As an example of what I'm describing here, check out the following article for an example of using MySQL 5.7 with the InnoDB memcached plugin to achieve 1M QPS: http://dimitrik.free.fr/blog/archives/11-01-2013_11-30-2013.html#2013-11-22
By "loads" I mean ~5,000 queries per second.
Ah, that's not so much; even a SQL database can handle that. So you are already well within the limits of what most modern DBs can handle. However, they can only handle this with the right:
Indexes
Queries
Server Hardware
Splitting of large data (you might require a large number of shards, each holding relatively little data; it depends, which is why I said "might")
That would be ~100,000 objects to read per second.
Now that's more of a high-load scenario. Must you read these in such a fragmented manner? If so then (as I said) you may need to look into spreading the load across replicated shards.
Storage is not something I would be worried about as the objects will be really tiny.
Mongo is aggressive with disk allocation, so even with small objects it will still pre-allocate a lot of space; this is something to bear in mind.
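If you want to see how much pre-allocation is happening on your own deployment, db.stats() shows the gap between the data size and the allocated storage; a quick check from the mongo shell (the database name is a placeholder, and the fileSize field comes from the MMAPv1-era storage engine):

```js
// Quick check of allocated space versus actual data size.
var stats = db.getSiblingDB("mydb").stats();   // "mydb" is a placeholder database name
print("dataSize:    " + stats.dataSize);       // bytes of actual data
print("storageSize: " + stats.storageSize);    // bytes allocated for data (pre-allocation shows up here)
print("fileSize:    " + stats.fileSize);       // total size of the data files on disk
```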
So... $5424 per month.
Oh yea the billing thrills of Amazon :\.
I would consider other options but I am worried about maintenance issues, costs etc. I have never worked with such setups before so your advice would be really valuable.
Now you hit the snag of it all. You can set up your own cluster, but then you might end up paying that much in money and time (or way more) for the servers, people, admins and your own maintenance time. This is one reason why DynamoDB really shines here for large setups: it takes the load, pain and stress of server management (trust me, it is really painful; if you're a dev you might as well change your job title to server admin from now on) off of the company.
To set this up yourself you would need:
A considerable number of EC2 instances (dependent upon data and index size, but I would say close to maybe 30?)
A server admin (maybe 2, maybe freelance?)
Both of which could set you back hundreds of thousands of pounds a year, so I would personally go for the managed approach if it fits your needs and budget. When your needs grow beyond what the managed Amazon DB can give you, then move to your own infrastructure.
Edit
I should add that this cost-effectiveness estimate was made with quite a few unknowns, for example:
I am unsure of the amount of data you have
I am unsure of writes
Both of these lead me to assume a scenario of:
Massive writes (about as much as you're reading)
Massive data (lots)
Here is what I recommend in sequence.
Identify your use case and choose the correct db. We test MySQL and MongoDb regularly for all kinds of workloads (OLTP, Analytics, etc). In all cases we have tested with, MySQL outperforms MongoDb and is cheaper ($/TPS) compared to MongoDb. MongoDb has other advantages but that is another story ... since we are talking about performance here.
Try to cache your queries in RAM (by provisioning adequate RAM).
If you are bottlenecked on RAM, then you can try an SSD caching solution which takes advantage of ephemeral SSD. This works if your workload is cache friendly. You can save loads of money, as ephemeral SSD is typically not charged for by the cloud provider.
Try PIOPS/RAID or a combination to create adequate IOPS for your application.

MongoDb vs CouchDb: write speeds for geographically remote clients

I would like all of my users to be able to read and write to the datastore very quickly. It seems like MongoDb has blazing reads, but the writes seem like they could be very very slow if the one master db needs to be located very far away from the client.
CouchDB seems to have slow reads, but what about writes in the case where the client is very far away from the master?
With couchdb, we can have multiple masters, meaning we can always have a write node close to the client. Could couchdb actually be faster for writes than mongodb in the case when our user base is spread very far out geographically?
I would love to use mongoDb due to its blazing fast speed, but some of my users very far away from the only master will have a horrible experience.
For worldwide types of systems, wouldn't CouchDB be better? Isn't MongoDB completely ruled out in the case where you have users all around the world?
MongoDb, if you're listening, why don't you do some simple multi-master setups, where conflict resolution can be part of the update semantic?
This seems to be the only thing standing between MongoDB and complete domination of the NoSQL market share. Everything else is very impressive.
Disclosure: I am a MongoDB fan and user; I have zero experience with CouchDB.
I have a heavy-duty app that is very read/write intensive. I'd say reads outnumber writes by a factor of around 30:1. The way Mongo is designed, reads are always going to be much faster than writes; the trick (in my experience) is to make your writes so efficient that you can dedicate a higher percentage of your system resources to them.
When building a product on top of Mongo, the key thing to remember is the _id field. This field is automatically generated and added to all of your JSON objects, and it will look something like 47cc67093475061e3d95369d. When you design your queries (finds), try to query on this field wherever possible, as it contains the machine location (and I think also the disk location??? - I should check this) where the object lives, so a find or update using this field will really speed things up. Consider this in the design of your system.
Example:
Two of the collections in my database are "users" and "posts". A user can create multiple posts. These two collections have to reference each other a lot in the implementation of my app.
In each post object I store the _id of the parent user.
In each user object I store an array of all the posts the user has authored.
Now on each user page I can generate a list of all the authored posts without a resource-intensive query, just a direct lookup by _id. The bigger the Mongo cluster, the bigger the difference this is going to make.
If you're at all familiar with Oracle's physical rowids you may recognize this concept, only in Mongo it is much more powerful.
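A minimal mongo shell sketch of that pattern (the collection and field names are my own, not the poster's):

```js
// Hypothetical users/posts layout following the pattern described above,
// run in the mongo shell against the currently selected database.
var userId = ObjectId();
db.users.insert({ _id: userId, name: "alice", posts: [] });

var postId = ObjectId();
db.posts.insert({ _id: postId, author: userId, body: "hello" });  // child stores the parent's _id
db.users.update({ _id: userId }, { $push: { posts: postId } });   // parent stores the child's _id

// Rendering a user page is then a couple of direct _id lookups, no scanning:
var user  = db.users.findOne({ _id: userId });
var posts = db.posts.find({ _id: { $in: user.posts } }).toArray();
```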
I was scared last year when we decided to finally ditch MySQL for mongo but I can tell you the following about my experience:
- Data porting is always horrible but it went as well as I could have imagined.
- Mongo is probably the best documented NoSQL DB out there and the Open Source community is fantastic.
- When they say fast and scalable they're not kidding; it flies.
- Schema design is very easy and much more natural and orderly than key/value type db's in my opinion.
- The whole system seems designed for minimal user complexity, adding nodes etc is a breeze.
Ok, seriously I swear mongo didn't pay me to write this (I wish) but apologies for the love fest.
Whatever your choice, best of luck.
Here is a detailed article that 10gen has created that gives examples of when you should choose MongoDB or CouchDB, with reasons as well.
http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB
Edit
The above link was removed, but can be viewed here: http://web.archive.org/web/20120614072025/http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB
Your question, as it stands, is full of speculation and guessing.
...why can't we opt out of consistency for certain writes, so long as we're sure that the person that wrote the data will be able to read it consistently, whereas others will observe eventual consistency
What if those writes affect other writes? What if those writes would prevent other people from doing things? It's hard to tell the possible side effects, since you didn't give us any specifics.
My main suggestion to you is that you do some testing. Unless you've tested it, speculation about bottlenecks is a complete waste of time. You don't need to test it via remote machines; set up some local DBs and add some artificial lag, then run your tests.
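One cheap way to fake that lag without any extra infrastructure is to wrap your data-access calls in an artificial delay; a toy sketch, not a substitute for real network conditions:

```js
// Toy latency wrapper: simulate a far-away database by delaying every call.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

function withLag(fn, lagMs) {
  return async (...args) => {
    await sleep(lagMs);          // simulated network latency
    return fn(...args);
  };
}

// Usage: wrap whatever write function you are benchmarking, e.g.
// const slowWrite = withLag(realWrite, 150);   // pretend the master is ~150 ms away
```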
This way you can test the different options you've got and see where MongoDB is better or where CouchDB excels. Then you can either take one of them and live with its downsides, or you can try to tweak your database model itself and do more tests.
Nobody here will be able to give you a general solution to your specific problem (well, unless you give us all your code and pay us to work on it :P). Databases aren't easy, especially if you need to scale them under certain requirements.
For worldwide types of systems, wouldn't CouchDB be better? Isn't MongoDB completely ruled out in the case where you have users all around the world?
MongoDB supports sharding. So you don't need a single master. In fact, it looks like you have a ready shard key (region).
MongoDB also supports replica sets along with sharding. So if you need to run in multiple data centers (DCs) you put a master and one of the replicas in the same DC. In fact, they also suggest adding a 3rd node to a separate DC as a hot backup failover.
You will have to drill into the more detailed configuration of MongoDB, but you can definitely control where data is stored, and you can prioritize which replicas in a DC are "next in line" for promotion to master.
At this point however, you're well into the details of MongoDB and you'll need to dig around and "play" quite a bit. However, you'll need lots of "play time" for any solution that's really going to handle masters across data centers.
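To give a flavour of what the "region as shard key" idea looks like in practice, here is a hedged mongo shell sketch; the database, collection, shard and tag names are placeholders, and the helpers assume a tag-aware (zone) sharding setup:

```js
// Hypothetical example: shard a users collection on region and pin ranges to data centers.
sh.enableSharding("app")
sh.shardCollection("app.users", { region: 1 })

// Tag-aware sharding: keep each region's chunks on shards living in the nearby DC.
sh.addShardTag("shard0000", "EU")
sh.addShardTag("shard0001", "US")
sh.addTagRange("app.users", { region: "eu" }, { region: "eu\uffff" }, "EU")
sh.addTagRange("app.users", { region: "us" }, { region: "us\uffff" }, "US")
```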

MongoDB on EC2 server or AWS SimpleDB?

What scenario makes more sense - host several EC2 instances with MongoDB installed, or much rather use the Amazon SimpleDB webservice?
With several EC2 instances running MongoDB, I have the problem of setting the instances up myself.
With SimpleDB, I have the problem of locking myself into Amazon's data structure, right?
What differences are there development-wise? Shouldn't I be able to just switch the DAO of my service layers, to either write to MongoDB or AWS SimpleDB?
SimpleDB has some scalability limitations. You can only scale by sharding, and that sharding is manual; it has higher latency than MongoDB or Cassandra, it has a throughput limit, and it is priced higher than the other options.
If you need richer query options, have a high read rate and don't have that much data, MongoDB is better. But for durability, you need to use at least two MongoDB server instances as master/slave; otherwise you can lose the last minute of your data. Scalability is manual. It's much faster than SimpleDB. Auto-sharding is implemented as of version 1.6.
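On the durability point, a minimal sketch of initiating a small replica set from the mongo shell (hostnames are placeholders; in current MongoDB you would use a replica set rather than the old master/slave mode):

```js
// Hypothetical three-member replica set; hostnames are placeholders.
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "db1.example.com:27017" },
    { _id: 1, host: "db2.example.com:27017" },
    { _id: 2, host: "db3.example.com:27017" }
  ]
})
rs.status()   // verify that a primary has been elected
```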
Cassandra has weak query options but is as durable as PostgreSQL. It is as fast as Mongo and faster at larger data sizes. Write operations are faster than read operations in Cassandra. It can scale automatically by firing up EC2 instances, but you have to modify the config files a bit (if I remember correctly). If you have terabytes of data, Cassandra is your best bet; there is no need to shard your data, as it was designed to be distributed from day one. You can have any number of copies of all your data, and if some servers are dead it will automatically return results from the live ones and redistribute the dead servers' data to the others. It's highly fault tolerant. You can include any number of instances, and it's much easier to scale than the other options. It has strong .NET and Java client options, with connection pooling, load balancing, marking of dead servers, etc.
Another option is Hadoop for big data, but it's not as real-time as the others; you can use Hadoop for data warehousing. Neither Cassandra nor Mongo has transactions, so if you need transactions PostgreSQL is a better fit. Another option is Amazon RDS, but its performance is poor and its price is high. If you want to use databases or SimpleDB you may also need data caching (e.g. memcached).
For web apps, if your data is small I recommend Mongo; if it is large, Cassandra is better. You don't need a caching layer with Mongo or Cassandra, they are already fast. I don't recommend SimpleDB; it also locks you into Amazon, as you said.
If you are using C#, Java or Scala, you can write an interface and implement it for Mongo, MySQL, Cassandra or anything else as your data access layer. It's even simpler in dynamic languages (e.g. Ruby, Python, PHP). You can write a provider for two of them if you want, and switch the storage, maybe even at runtime, with only a configuration change; all of this is possible. Development with Mongo, Cassandra and SimpleDB is easier than with a relational database, and they are schema-free; it also depends on the client library/connector you're using. The simplest one is Mongo. There's only one index per table in Cassandra, so you have to manage other indexes yourself, but with the 0.7 release of Cassandra secondary indexes will be possible, as far as I know. You can also start with any of them and replace it later if you have to.
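A rough JavaScript sketch of that kind of data-access abstraction (the method names and the two backends are invented for illustration, not taken from any particular framework):

```js
// Hypothetical data-access layer with interchangeable backends.
class MongoPostStore {
  constructor(collection) { this.collection = collection; }         // a mongodb driver collection
  async save(post)   { await this.collection.insertOne(post); }
  async findById(id) { return this.collection.findOne({ _id: id }); }
}

class InMemoryPostStore {            // e.g. for tests; a SimpleDB or Cassandra version would expose the same methods
  constructor()      { this.items = new Map(); }
  async save(post)   { this.items.set(String(post._id), post); }
  async findById(id) { return this.items.get(String(id)) || null; }
}

// Application code depends only on the save/findById shape,
// so the backend can be swapped through configuration.
async function createPost(store, post) {
  await store.save(post);
  return store.findById(post._id);
}
```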
I think you have both a question of time and speed.
MongoDB / Cassandra are going to be much faster, but you will have to invest $$$ to get them going. This means you'll need to run / setup server instances for all them and figure out how they work.
On the other hand, you don't have to pay a "per transaction" cost directly; you just pay for the hardware, which is probably more efficient for larger services.
In the Cassandra / MongoDB fight here's what you'll find (based on testing I'm personally involved with over the last few days).
Cassandra:
Scaling / Redundancy is very core
Configuration can be very intense
To do reporting you need map-reduce, for that you need to run a hadoop layer. This was a pain to get configured and a bigger pain to get performant.
MongoDB:
Configuration is relatively easy (even for the new sharding, this week)
Redundancy is still "getting there"
Map-reduce is built-in and it's easy to get data out.
Honestly, given the configuration time required for our 10s of GBs of data, we went with MongoDB on our end. I can imagine using SimpleDB for "must get these running" cases. But configuring a node to run MongoDB is so ridiculously simple that it may be worth skipping the "SimpleDB" route.
In terms of DAO, there are tons of libraries already for Mongo. The Thrift framework for Cassandra is well supported. You can probably write some simple logic to abstract away connections. But it will be harder to abstract away things more complex than simple CRUD.
Now 5 years later it is not hard to set up Mongo on any OS. Documentation is easy to follow, so I do not see setting up Mongo as a problem. Other answers addressed the questions of scalability, so I will try to address the question from the point of view of a developer (what limitations each system has):
I will use S for SimpleDB and M for Mongo.
M is written in C++, S is written in Erlang (not the fastest language)
M is open source and can be installed everywhere; S is proprietary and can run only on Amazon AWS. You also have to pay for a whole bunch of stuff with S
S has a whole bunch of strange limitations; M's limitations are way more reasonable. The strangest limitations are:
maximum size of domain (table) is 10 GB
attribute value length (size of field) is 1024 bytes
maximum items in Select response - 2500
maximum response size for Select (the maximum amount of data S can return you) - 1Mb
S supports only a few languages (java, php, python, ruby, .net), M supports way more
both support REST
S has a query syntax very similar to SQL (but way less powerful). With M you need to learn a new syntax which looks like JSON (though it is straightforward to learn the basics)
with M you have to learn how to architect your database. Because many people think that schemaless means you can throw any junk into the database and extract it with ease, they might be surprised that the "junk in, junk out" maxim still applies. I assume the same is true for S, but I cannot claim it with certainty.
neither allows case-insensitive search. In M you can use a regex to somewhat (ugly/no index) overcome this limitation without introducing an additional lowercase field or application logic; see the example after this list.
in S sorting can be done only on one field
because of the 5-second time limit, count in S can behave strangely: if 5 seconds pass and the query has not finished, you end up with a partial number and a token which allows you to continue the query. Application logic is responsible for collecting all this data and summing it up.
in S everything is a UTF-8 string, which makes it a pain in the ass to work with non-string values (like numbers or dates). M's type support is way richer.
both do not have transactions and joins
M supports compression, which is really helpful for NoSQL stores, where the same field name is stored over and over again.
S supports just a single index; M has single, compound, multi-key, geospatial indexes, etc.
both support replication and sharding
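As referenced in the case-insensitive point above, a small mongo shell illustration (the collection and field names are made up):

```js
// Case-insensitive match via a regex; a case-insensitive regex generally cannot
// use an index efficiently, which is the "ugly" part mentioned above.
db.people.find({ name: { $regex: "^smith$", $options: "i" } })

// The usual workaround is an extra lowercased field that can be indexed:
// db.people.ensureIndex({ name_lower: 1 })
// db.people.find({ name_lower: "smith" })
```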
One of the most important things you should consider is that SimpleDB has a very rudimentary query language. Even basic things like group by, sum, average and distinct, as well as data manipulation, are not supported, so the functionality is not really much richer than Redis/Memcached. Mongo, on the other hand, supports a rich query language.
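For example, the kind of group-by/sum query that SimpleDB cannot express is short work for Mongo's aggregation framework (the collection and field names are made up):

```js
// Hypothetical orders collection: total, average and count of orders per customer.
db.orders.aggregate([
  { $group: {
      _id: "$customerId",
      total:   { $sum: "$amount" },
      average: { $avg: "$amount" },
      orders:  { $sum: 1 }
  }},
  { $sort: { total: -1 } }
])

// Distinct values are similarly direct:
// db.orders.distinct("customerId")
```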