MongoDB sharding/partitioning 100+ million rows in one table on one machine - mongodb

Does MongoDB support something akin to ordinary database partitioning on a single machine? I’m looking for very fast queries (writes are less important) on a single large table on one machine.
I tried sharding data using only one machine, in the hopes that it would behave similarly to ordinary database partitioning, but the performance was subpar. I also did a quick custom-code partitioning implementation, which was much faster. But I would prefer to use built-in MongoDB functionality if possible.
What are the best practices for MongoDB for this scenario?

Related

Is there any performance impact to have multiple Mongo databases?

We are currently working on an application using Mongo, and we are trying to evaluate the benefits and constraints of the different architecture choices related to spreading data across multiple databases/collections or using a single shared one.
Is there any performance penalty between a single database with a lot of collections and many databases with fewer collections per database?
From what I understand it should not have any impact, because sharding is done on a per-collection basis, but I would like some confirmation.
Regards
By performance, I guess you mean read/write speed. Using multiple databases with fewer collections would definitely increase your read/write speed, since each database can handle more read/write operations on the collections associated with it.
However, spreading data across databases this way can add extra complexity to your project. Depending on how your codebase is structured, it may complicate your application logic; backups and other administrative operations won't be as straightforward; and ad-hoc queries across collections that live in different databases will be next to impossible.
If the goal of the architecture is to ensure high read/write speed, you can still go with a single DB that is auto-scaled at the deployment level. I don't know much about it, but I believe replication is the MongoDB feature that can help you achieve such auto-scaling, and if you are open to database-as-a-service, you should check out MongoDB Atlas, where auto-scaling is available out of the box.
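To make the complexity point concrete, here is a minimal mongosh sketch (the database and collection names are invented) of what application code ends up doing when related data lives in separate databases:

```javascript
// Invented names: user profiles and orders live in separate databases.
const usersDb  = db.getSiblingDB("app_users");
const ordersDb = db.getSiblingDB("app_orders");

// No single query can join across databases, so the application has to
// stitch the data together itself, one round trip per database:
const user   = usersDb.profiles.findOne({ email: "alice@example.com" });
const orders = ordersDb.orders.find({ userId: user._id }).toArray();

// With everything in one database, a $lookup stage could do this server-side:
// db.profiles.aggregate([{ $lookup: { from: "orders", localField: "_id",
//                                     foreignField: "userId", as: "orders" } }]);
```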

Cassandra Database Performance and Default GUID

I've read in the Cassandra docs that it's recommended to stay with the randomly generated GUIDs for IDs to prevent hotspots, instead of implementing my own IDs for each document. From what I know, this is much slower (see this presentation). How can Cassandra help me achieve very high WRITE performance while still following this guideline?
PlayOrm uses its own generator that keeps the ids small for Cassandra. The important thing is that your generator be random, that is all, so it provides a good distribution of keys. We are doing 10,000 writes/second with PlayOrm and Cassandra ourselves, and that is while the data is being indexed so we can query it using PlayOrm Scalable SQL.
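The same idea, a random and well-distributed key, is what MongoDB's hashed shard keys rely on as well. A minimal, purely illustrative mongosh sketch (not PlayOrm's generator; the collection name is invented):

```javascript
// A version-4 UUID is effectively random, so consecutive inserts land all over
// the key space instead of piling onto one "hot" range the way a timestamp
// or auto-incrementing id would.
const randomId     = UUID();                 // random, spreads writes evenly
const sequentialId = new Date().getTime();   // monotonic, concentrates writes

db.events.insertOne({ _id: randomId, payload: "..." });
```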

MongoDB: BIllions of documents in a collection

I need to load 6.6 billion bigrams into a collection but I can't find any information on the best way to do this.
Loading that many documents onto a single primary-key index would take forever, but as far as I'm aware Mongo doesn't support the equivalent of partitioning?
Would sharding help? Should I try and split the data set over many collections and build that logic into my application?
It's hard to say what the optimal bulk insert is -- this partly depends on the size of the objects you're inserting and other immeasurable factors. You could try a few batch sizes and see what gives you the best performance. As an alternative, some people like using mongoimport, which is pretty fast, but your import data needs to be JSON or CSV. There's obviously mongorestore, if the data is in BSON format.
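For example, a rough mongosh sketch of batched inserts, where the batch size is exactly the kind of knob you would tune experimentally (the collection name, document shape, and numbers are placeholders):

```javascript
// Placeholders throughout: adapt the collection, document shape and batch size.
const batchSize = 10000;   // try a few values and measure
let batch = [];

for (let i = 0; i < 1000000; i++) {
  batch.push({ w1: "word" + i, w2: "word" + (i + 1), count: 1 });
  if (batch.length === batchSize) {
    db.bigrams.insertMany(batch, { ordered: false });  // unordered is usually faster
    batch = [];
  }
}
if (batch.length > 0) {
  db.bigrams.insertMany(batch, { ordered: false });    // flush the final partial batch
}
```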
Mongo can easily handle billions of documents and can have billions of documents in one collection, but remember that the maximum document size is 16 MB. There are many folks with billions of documents in MongoDB, and there are lots of discussions about it on the MongoDB Google User Group. Here's a document on using a large number of collections that you may like to read, if you change your mind and want to have multiple collections instead. The more collections you have, the more indexes you will have as well, which probably isn't what you want.
Here's a presentation from Craigslist on inserting billions of documents into MongoDB and the guy's blogpost.
It does look like sharding would be a good solution for you, but typically sharding is used for scaling across multiple servers, and a lot of folks do it because they want to scale their writes or they are unable to keep their working set (data and indexes) in RAM. It is perfectly reasonable to start off with a single server and then move to a sharded cluster or replica set as your data grows or you need extra redundancy and resilience.
However, other users use multiple mongods to get around the locking limits of a single mongod with lots of writes. It's obvious but still worth saying that a multi-mongod setup is more complex to manage than a single server. If your IO or CPU isn't maxed out, your working set is smaller than RAM, and your data is easy to keep balanced (pretty randomly distributed), you should see improvement with sharding on a single server. As an FYI, there is potential for memory and IO contention. With 2.2 having improved concurrency with database-level locking, I suspect there will be much less reason for such a deployment.
You need to plan your move to sharding properly, i.e. think carefully about choosing your shard key. If you go this way then it's best to pre-split and turn off the balancer. It will be counter-productive to be moving data around to keep things balanced, which means you need to decide up front how to split it. Additionally, it is sometimes important to design your documents with the idea that some field will be useful for sharding on, or as a primary key.
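A minimal mongosh sketch of that pre-split setup (the namespace, shard key, and split points are placeholders you would choose for your own data):

```javascript
// Placeholders throughout: adapt the namespace, shard key and split points.
sh.stopBalancer();                                      // turn off the balancer first
sh.enableSharding("mydb");                              // enable sharding on the database
sh.shardCollection("mydb.bigrams", { w1: 1, w2: 1 });   // range-based shard key

// Pre-split so the chunk boundaries exist up front instead of being
// discovered (and migrated) while the bulk load is running.
sh.splitAt("mydb.bigrams", { w1: "g", w2: MinKey });
sh.splitAt("mydb.bigrams", { w1: "n", w2: MinKey });
sh.splitAt("mydb.bigrams", { w1: "t", w2: MinKey });
```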
Here's some good links -
Choosing a Shard Key
Blog post on shard keys
Overview presentation on sharding
Presentation on Sharding Best Practices
You can absolutely shard data in MongoDB (which partitions the collection across N servers on the shard key). In fact, that's one of its core strengths. There is no need to do that in your application.
For most use cases, I would strongly recommend doing that for 6.6 billion documents. In my experience, MongoDB performs better with a number of mid-range servers than with one large one.

Scaling MongoDB on EC2 or should I just switch to DynamoDB?

I currently run my website on a single server with MongoDB. On my server I have two components: (1) a crawler that runs hourly and appends data to my MongoDB instance, and (2) a website that reads from the crawler index and also writes to a user personalization DB. I am moving to Amazon EC2 for auto-scaling, so that the web server tier can scale and I can increase the number of servers as web traffic increases. I don't need auto-scaling for my crawler. This poses a challenge for how I use MongoDB. I'm wondering what my best option is to optimize for:
Minimal changes to my code (the code is in perl)
Ability to seamlessly add/remove web-servers without worry about losing data in the DB
Low cost
In the short term, the DB will certainly be able to fit in memory across all machines, since it will be under 2 GB. The user personalization DB can't be rebuilt, so it's more important to protect it, while the index can easily be rebuilt. The current MongoDB crawl index has about 100k entries that are keyed on ~15 different columns. This is built for speed, as I am working on an online dating site (that is searchable in many ways).
I can think of a few options
Use SimpleDB for the user personalization store and MongoDB for the index. Have the index replicate across all machines; however, I don't know much about MongoDB replication.
Move everything to SimpleDB
Move everything to DynamoDB
I don't know too much about SimpleDB and/or DynamoDB. Based on articles, it seems like DynamoDB would be a natural choice, but I'm not sure about good Perl support, whether I can have all the columns, indexes, etc. Does anyone have experience or any advice?
You could host Mongo on a single server on EC2 that each of the boxes in the web farm connects to. You can then easily spin up another web instance that uses the same DB box.
We currently have three Mongo servers as we run a replica set and when we get to the point where we need to scale horizontally with Mongo we'll spin up some new instances and shard the larger collections.
I currently run my website on a single server with MongoDB.
First off, this is a big red flag. When running in production, it is always recommended to run a replica set with at least three full nodes.
Replication provides automatic redundancy and fail-over.
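A minimal sketch of initiating such a three-node replica set from mongosh (the hostnames are placeholders, and each mongod is assumed to have been started with --replSet rs0):

```javascript
// Hostnames are placeholders; run this once against one of the members.
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo-1.example.com:27017" },
    { _id: 1, host: "mongo-2.example.com:27017" },
    { _id: 2, host: "mongo-3.example.com:27017" }
  ]
});
rs.status();   // confirm that a primary has been elected
```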
Ability to seamlessly add/remove web-servers without worry about losing data in the DB
MongoDB supports a concept called sharding. Sharding provides a way to scale horizontally by automatically partitioning data. The partitioning is done via a shard key.
If you plan to use sharding, please read that link very carefully and recognize the limitations. For MongoDB sharding you have to select the correct key that will allow queries to be evenly distributed across the shards.
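For example, a hashed shard key is one common way to get an even distribution (a sketch only; the namespace and field name are placeholders):

```javascript
// Sketch only: shard the user personalization data on a hashed user id so
// that writes and lookups by that id spread evenly across the shards.
sh.enableSharding("dating");
sh.shardCollection("dating.personalization", { userId: "hashed" });
```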
The current MongoDB crawl index has about 100k entries that are keyed on ~15 different columns.
This is going to be a problem with sharding. Sharding can only scale queries that use the shard key. A query on the shard key can be routed directly to a single machine. A query on a secondary index goes to all machines.
You have 15 different indexes, so basically all of these queries will go to all shards. That will not "auto-scale" very well at all.
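To illustrate the difference (reusing the placeholder names from the sketch above; the field names are invented):

```javascript
// Assuming the collection is sharded on { userId: "hashed" } (placeholder):

// Targeted: the shard key is in the filter, so mongos routes this to one shard.
db.personalization.find({ userId: 12345 });

// Scatter-gather: only secondary-index fields, so mongos must ask every shard
// and merge the results.
db.personalization.find({ city: "Boston", ageRange: "25-34" });
```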
Beware that at the moment EC2 does not have 64-bit small instances, making replication potentially expensive. Because MongoDB memory-maps its files, a 32-bit OS is not advised.
I've had very bad experiences with SimpleDB and think it's fundamentally flawed, so I would avoid it.
There is a good white paper on how to set up MongoDB on Amazon EC2: http://d36cz9buwru1tt.cloudfront.net/AWS_NoSQL_MongoDB.pdf
I suspect setting up MongoDB on EC2 is the fastest solution versus rewriting-for/migrating-to DynamoDB.
Best of luck!

What operations are cheap/expensive in mongodb?

I'm reading up on MongoDB, and trying to get a sense of where it's best used. One question that I don't see a clear answer to is which operations are cheap or expensive, and under what conditions.
Can you help clarify?
Thanks.
It is often claimed that MongoDB has insanely fast writes. While they are indeed not slow, this is quite an overstatement. Write throughput in MongoDB is limited by a global write lock. Yes, you heard me right: there can be only ONE* write operation happening on the server at any given moment.
Also, I suggest you take advantage of the schemaless nature of MongoDB and store your data denormalized. Often it is possible to do just one disk seek to fetch all the required data (because it is all in the same document). Fewer disk seeks means faster queries.
If the data sits in RAM, no disk seeks are required at all; data is served straight from memory. So make sure you have enough RAM.
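For example, a denormalized (embedded) document like the one below can be fetched in a single read, whereas a normalized layout would need one query per related collection (the document shape is invented for illustration):

```javascript
// Invented example: a post with its author and comments embedded in one document.
db.posts.insertOne({
  title: "Schema design in MongoDB",
  author: { name: "Alice", email: "alice@example.com" },
  comments: [
    { user: "Bob",   text: "Nice post", at: new Date() },
    { user: "Carol", text: "Thanks!",   at: new Date() }
  ]
});

// One round trip (and, if the document is on disk, roughly one seek) returns
// the post, its author and its comments together.
db.posts.findOne({ title: "Schema design in MongoDB" });
```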
Map/Reduce, group, $where queries are slow.
It is not fast to keep writing to one big document (using $push, for example). The document will outgrow its disk boundaries and will have to be copied to another place, which involves more disk operations.
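For example (an invented collection and fields), a pattern like this keeps growing one document, and on the older MMAP-style storage it will eventually be rewritten elsewhere on disk:

```javascript
// Invented example: appending every reading to a single ever-growing document.
// Each $push makes the document larger; once it outgrows the space allocated
// for it (on the old MMAPv1 storage engine), it has to be moved on disk.
db.metrics.updateOne(
  { _id: "sensor-42" },
  { $push: { readings: { at: new Date(), value: Math.random() } } },
  { upsert: true }
);
```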
And I agree with @AurelienB, some basic principles are universal across all databases.
Update
* Since 2011, several major versions of MongoDB have been released, improving the situation with locking (from server-wide to database-level to collection-level). A new storage engine, WiredTiger, was introduced, which has document-level locking. All in all, writes should be significantly faster now, in 2018.
From my practice, one thing that should be mentioned is that MongoDB is not a very good fit for reporting, because reports usually need data from different collections ('joins'), and MongoDB does not provide a good way to aggregate data across multiple collections (nor is it supposed to). For some reports, map/reduce or incremental map/reduce can work well, but those are rare situations.
For reporting, some people suggest migrating the data into a relational database, which has plenty of tooling for reporting.
This is not very different from other database systems.
Queries on indexed data are fast. Queries that have to scan a lot of data are... slow.
Due to denormalization, if there is no index to maintain, writing to the database will be fast; that's why logging is the classic use case.
Conversely, reading data that is on disk (not in RAM) without an index can be very slow when you have billions of documents.
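For instance (with invented field names), adding an index is what turns that slow scan into a fast lookup:

```javascript
// Invented fields: without this index the query below scans the whole
// collection; with it, MongoDB walks the index and touches far fewer documents.
db.logs.createIndex({ level: 1, timestamp: -1 });

db.logs.find({ level: "error" }).sort({ timestamp: -1 }).limit(100);

// explain() shows whether the plan used the index (IXSCAN) or not (COLLSCAN).
db.logs.find({ level: "error" }).sort({ timestamp: -1 }).explain("executionStats");
```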