MongoDB - Store Indexes on SSD & Data Collections on Magnetic?

Is it possible to store index collections on separate high-performance storage (i.e. flash/SSD) while keeping data collections on lower-cost conventional storage? The performance issues I am seeing using MongoDB appear to be related to index maintenance operations, and I am having to partition my data across database buckets on a single instance in order to avoid drastic dips in write performance - a solution that will only scale for so long. Therefore I would like to use SSD for indexes, but it doesn't make sense to pay for high-performance storage where it's not warranted (data collections).
The only discussion I have found on this subject is somewhat dated:
https://jira.mongodb.org/browse/SERVER-965

You cannot do what you are asking with MongoDB.
The workaround is to split your collections across multiple databases. MongoDB cannot do joins, so there is not really a trade-off at the data level, but your application will have to manage multiple connections to multiple databases.
Put your high-performance collections on SSD-backed servers with larger RAM allocations.
Put your lower-performance collections on slower spindle-disk or AWS EBS servers.
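To make the workaround concrete, here is a minimal sketch from the mongo shell, assuming two separate mongod deployments ("ssdhost" and "hddhost" are placeholder hostnames, and the database and collection names are illustrative):

// SSD-backed deployment for the index-heavy, latency-sensitive collections
hot = connect("ssdhost:27017/hotdb")
// spindle/EBS-backed deployment for the bulk data collections
cold = connect("hddhost:27017/colddb")
hot.events.insert({ts: new Date(), user: 1})
cold.archive.insert({ts: new Date(), payload: "..."})

The application carries both connections and routes each operation to the appropriate deployment.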

Related

MongoDB Atlas performance / collection and index limits

I am building a multi-tenant app where each tenant has its own database with its own collections. All databases are stored in the same M10 cluster.
For now, a tenant represents around 56 collections and 208 indexes.
I have seen there is a recommended maximum for an M10 cluster of 5000 collections and indexes (https://www.mongodb.com/docs/atlas/reference/atlas-limits/).
So if my understanding is correct, an M10 cluster supports at most 18 tenants (5000/(56+208) = 18.93).
The documentation says "The performance of a cluster might degrade if it serves a large number of collections and indexes." Has anyone tried to exceed this limit? How big are these decreases in performance?
Apart from the hard limit on the number of collections and indexes you can have, the performance impact of a large number of collections and indexes comes from the adverse effect of keeping too many data handles open. Not only that, but maintenance also becomes a nightmare.
On top of this, a large number of indexes will adversely impact write operations in those collections, and the indexes will either continuously occupy space in memory or, if memory is insufficient, be continuously evicted from and loaded back into memory. To learn more about the internal cache, see the official documentation here.
In conclusion, having more than 3500 indexes (for 18 tenants) and ~1000 collections will have a serious adverse impact on the performance of your overall cluster. You can monitor this via the Cache Activity metric, among others. Since separate databases on the same cluster are only a logical separation anyway, you are advised to instead implement multi-tenancy via a tenant_id field in shared collections, rather than having different collections for different tenants.
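As an illustration of that last suggestion, here is a minimal sketch of field-based multi-tenancy (the collection and field names are hypothetical): a single shared collection, with every index prefixed by tenant_id so each tenant's queries stay selective.

// one shared collection; tenant_id leads every index and every query
db.orders.createIndex({tenant_id: 1, created_at: -1})
db.orders.insert({tenant_id: "acme", status: "open", created_at: new Date()})
db.orders.find({tenant_id: "acme", status: "open"})

This keeps the total collection and index count constant as tenants are added, instead of growing by ~264 objects per tenant.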

Is there any performance impact to have multiple Mongo databases?

We are currently working on an application using Mongo, and we are trying to evaluate the benefits and constraints of the different architecture choices related to spreading data across multiple databases/collections versus using a single shared one.
Is there any performance penalty between one single database with a lot of collections and many databases with fewer collections per database?
From what I understand it should not have any impact, because sharding is done on a per-collection basis, but I would like some confirmation.
Regards
By performance, I guess you mean read/write speed. Using multiple databases with fewer collections can increase your read/write speed, since each database can handle more read/write operations on the collections associated with it.
However, spreading data across databases this way can add complexity to your project. Depending on how your codebase is structured, it might complicate your application logic; things like backups and other admin database operations won't be straightforward, and cross-collection ad-hoc queries for collections that live in different databases would be next to impossible (see the sketch below).
If the goal of the architecture design is to ensure high read/write speed, you can still go with a single DB that is auto-scaled at the deployment level. I don't know much about it, but I think replication is a MongoDB feature that can help you achieve such scaling, and if you are in for database-as-a-service, you should check out MongoDB Atlas, where auto-scaling works out of the box.
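To illustrate the cross-database limitation mentioned above: $lookup can join two collections, but only when they live in the same database. A hedged sketch with illustrative collection and field names:

// works: both collections are in the current database
db.orders.aggregate([
  { $lookup: { from: "customers", localField: "customer_id",
               foreignField: "_id", as: "customer" } }
])
// there is no equivalent when "orders" and "customers" sit in two different
// databases; the application has to run two queries and join the results in code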

Choosing a Database to store Report Json

I am trying to figure out which DB to use for a project with the following requirements,
Requirements:
scalability should be high, availability should be high
Data format is JSON documents, several MBs in size
Query capabilities are my least concern; it's more of a key-value use case
High performance/ low latency
I considered MongoDB, Cassandra, Redis, Postgres (jsonb), a few other document-oriented DBs, and embedded databases (a small footprint will be a plus).
Please help me find out which DB will be the best choice.
I won't need document/row-wise comparison queries at all; at most, the requirement will be picking a subset of fields from the document. What I am looking for is a lightweight DB with a small footprint, low latency, and high scalability. Very limited query capabilities are acceptable. Should I be choosing embedded DBs? What are the points to consider here?
Thanks for the help!
If you use documents (JSON), use a document database, especially if the documents differ in structure.
PostgreSQL does not scale horizontally. Have a look at CockroachDB if you like.
Cassandra can do key-value at scale, as can Redis, but neither is really a document database.
I would suggest MongoDB or CouchDB - which one would be a good match depends on your needs. MongoDB is consistent and partition tolerant while CouchDB is partition tolerant and available.
If you can live with some limits for querying and like high availability try out CouchDB.
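For what it's worth, the "subset pick from the document" requirement maps directly onto projection if MongoDB is chosen; a minimal sketch with hypothetical collection and field names:

// fetch only two fields of a multi-MB report document by key
db.reports.find({_id: "report-123"}, {summary: 1, "metrics.latency": 1})

Only the listed fields cross the wire, rather than the whole multi-MB JSON body.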

MongoDB: Billions of documents in a collection

I need to load 6.6 billion bigrams into a collection but I can't find any information on the best way to do this.
Loading that many documents onto a single primary key index would take forever, but as far as I'm aware, Mongo doesn't support the equivalent of partitioning?
Would sharding help? Should I try and split the data set over many collections and build that logic into my application?
It's hard to say what the optimal bulk insert size is -- this partly depends on the size of the objects you're inserting and other immeasurable factors. You could try a few batch sizes and see what gives you the best performance. As an alternative, some people like using mongoimport, which is pretty fast, but your import data needs to be JSON or CSV. There's obviously mongorestore, if the data is in BSON format.
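For reference, a typical mongoimport invocation looks like this (the database, collection, and file names are placeholders):

mongoimport --db ngrams --collection bigrams --file bigrams.json

By default each line of bigrams.json holds one JSON document.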
Mongo can easily handle billions of documents and can have billions of documents in one collection, but remember that the maximum document size is 16 MB. There are many folk with billions of documents in MongoDB, and there are lots of discussions about it on the MongoDB Google User Group. Here's a document on using a large number of collections that you may like to read, if you change your mind and want to have multiple collections instead. The more collections you have, the more indexes you will have also, which probably isn't what you want.
Here's a presentation from Craigslist on inserting billions of documents into MongoDB and the guy's blogpost.
It does look like sharding would be a good solution for you but typically sharding is used for scaling across multiple servers and a lot of folk do it because they want to scale their writes or they are unable to keep their working set (data and indexes) in RAM. It is perfectly reasonable to start off with a single server and then move to a shard or replica-set as your data grows or you need extra redundancy and resilience.
However, some users run multiple mongods to get around the locking limits of a single mongod with lots of writes. It's obvious but still worth saying that a multi-mongod setup is more complex to manage than a single server. If your IO or CPU isn't maxed out here, your working set is smaller than RAM, and your data is easy to keep balanced (pretty randomly distributed), you should see improvement (with sharding on a single server). As an FYI, there is potential for memory and IO contention. With 2.2 having improved concurrency with db locking, I suspect that there will be much less of a reason for such a deployment.
You need to plan your move to sharding properly, i.e. think carefully about choosing your shard key. If you go this way then it's best to pre-split and turn off the balancer: it would be counter-productive to be moving data around to keep things balanced, which means you will need to decide up front how to split it (see the sketch after the links below). Additionally, it is sometimes important to design your documents with the idea that some field will be useful for sharding on, or as a primary key.
Here are some good links:
Choosing a Shard Key
Blog post on shard keys
Overview presentation on sharding
Presentation on Sharding Best Practices
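A hedged sketch of the pre-split-and-disable-balancer approach from the mongo shell (the namespace, shard key, and split points are illustrative, not recommendations):

sh.enableSharding("ngrams")
sh.shardCollection("ngrams.bigrams", { word: 1 })
sh.stopBalancer()
// repeat for each chunk boundary you decided on up front
sh.splitAt("ngrams.bigrams", { word: "m" })
sh.splitAt("ngrams.bigrams", { word: "t" })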
You can absolutely shard data in MongoDB (which partitions across N servers on the shard key). In fact, that's one of its core strengths. There is no need to do that in your application.
For most use cases, I would strongly recommend doing that for 6.6 billion documents. In my experience, MongoDB performs better with a number of mid-range servers rather than one large one.

MongoDB: Sharding on single machine. Does it make sense?

I created a collection in MongoDB consisting of 11,446,615 documents.
Each document has the following form:
{
    "_id" : ObjectId("4e03dec7c3c365f574820835"),
    "httpReferer" : "http://www.somewebsite.pl/art.php?id=13321&b=1",
    "words" : ["SEX", "DRUGS", "ROCKNROLL", "WHATEVER"],
    "howMany" : 3
}
httpReferer: just a URL
words: words parsed from the URL above; the size of the list is between 15 and 90.
I am planning to use this database to obtain a list of webpages which have similar content.
I'll be querying this collection using the words field, so I created (or rather started creating) an index on this field:
db.my_coll.ensureIndex({words: 1})
Creating this collection takes a very long time. I tried two approaches (tests below were done on my laptop):
Inserting and then indexing: inserting took 5.5 hours, mainly due to CPU-intensive preprocessing of data; indexing took 30 hours.
Indexing before inserting: it would take a few days to insert all the data into the collection.
My main focus is to decrease the time of generating the collection. I don't need replication (at least for now). Querying also doesn't have to be lightning-fast.
Now, time for a question:
I have only one machine with one disk where I can run my app. Does it make sense to run more than one instance of the database and split my data between them?
Yes, it does make sense to shard on a single server.
At this time, MongoDB still uses a global lock per mongod server, so running multiple servers releases each one from the others' locks. If you run a multi-core machine with separate NUMA nodes, this can also increase performance.
If your load increases too much for your server, initial sharding makes for easier horizontal scaling in the future. You might as well do it now.
Machines vary. I suggest writing your own bulk insertion benchmark program and spinning up various numbers of MongoDB server shards. I have a 16-core RAIDed machine and I've found that 3-4 shards seem to be ideal for my heavy-write database. I'm finding that my two NUMA nodes are my bottleneck.
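A minimal sketch of such a benchmark in shell JavaScript (the collection name, document shape, and counts are arbitrary placeholders):

var payload = new Array(101).join("x");  // ~100-byte filler string
var start = new Date();
for (var i = 0; i < 100000; i++) {
  db.bench.insert({ n: i, s: payload });
}
print("elapsed ms: " + (new Date() - start));

Run it against 1, 2, 3, ... shards and compare the elapsed times.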
In the modern day (2015), with MongoDB v3.0.x, there is collection-level locking with MMAPv1, which increases write throughput slightly (assuming you're writing to multiple collections), but if you use the WiredTiger engine there is document-level locking, which has a much higher write throughput. This removes the need for sharding across a single machine. You can technically still increase the performance of mapReduce by sharding across a single machine, but in that case you'd be better off just using the aggregation framework, which can exploit multiple cores. If you heavily rely on map-reduce algorithms it might make most sense to just use something like Hadoop.
The only reason for sharding MongoDB is to scale horizontally. So in the event that a single machine cannot house enough disk space, memory, or CPU power (rare), sharding becomes beneficial. I think it's really seldom that someone has enough data that they need to shard, even at a large business, especially since WiredTiger added compression support that can reduce disk usage by over 80%. It's also infrequent that someone uses MongoDB to perform really CPU-heavy queries at a large scale, because there are much better technologies for that. In most cases IO is the most important factor in performance; not many queries are CPU-intensive, unless you're running a lot of complex aggregations, and even geo-spatial data is indexed upon insertion.
The most likely reason you'd need to shard is having a lot of indexes that consume a large amount of RAM; WiredTiger reduces this, but it's still the most common reason to shard. Whereas sharding across a single machine is likely just going to cause undesired overhead, with very little or possibly no benefit.
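For context on the compression mentioned above, a hedged mongod.conf fragment enabling WiredTiger block compression (snappy is the default; zlib trades more CPU for a better ratio):

storage:
  wiredTiger:
    collectionConfig:
      blockCompressor: zlib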
This doesn't have to be a mongo question, it's a general operating system question. There are three possible bottlenecks for your database use.
network (i.e. you're on a gigabit line, you're using most of it at peak times, but your database isn't really loaded down)
CPU (your CPU is near 100% but disk and network are barely ticking over)
disk
In the case of network, rewrite your network protocol if possible, otherwise shard to other machines. In the case of CPU, if you're 100% on a few cores but others are free, sharding on the same machine will improve performance. If disk is fully utilized add more disks and shard across them -- way cheaper than adding more machines.
No, it does not make sense to shard on a single server.
There are a few exceptional cases but they mostly come down to concurrency issues related to things like running map/reduce or javascript.
This is answered in the first paragraph of the Replica set tutorial
http://www.mongodb.org/display/DOCS/Replica+Set+Tutorial