HBase, MongoDB, Cassandra - overhead on a small cluster with small data

OK, these systems are scalable with respect to the number of nodes and large amounts of data.
But what about the overhead involved if I use these systems on a small cluster (5-10 nodes) and on a small amount of data, processing/storing on the scale of a couple of gigabytes? Or on even smaller data, like hundreds of MB?
Are there better database systems to use for my cluster and my amount of data?

A scalable solution usually pays a penalty in order to scale over large data. That penalty is paltry compared to the large data you get to process, but if you do not envisage processing terabytes of data, you could do with a more responsive system that does not pay it.
Use an SQLite database for smaller data. Frankly, it depends on what other requirements/constraints you have.
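For data in the hundreds-of-MB range, an embedded store can be this simple (a minimal sketch using Python's built-in sqlite3 module; the table and column names are made up for illustration):

```python
import sqlite3

# SQLite keeps the whole database in a single local file:
# no cluster, no server process, no coordination overhead.
conn = sqlite3.connect("small_data.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT)"
)
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [("first",), ("second",), ("third",)],
)
conn.commit()

# Full SQL is available, with ACID transactions, on a single node.
for row in conn.execute("SELECT id, payload FROM events"):
    print(row)
conn.close()
```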

You can probably just use a single-node MySQL server for this kind of data, with the advantages of full SQL capabilities, full ACID compliance, mature tools, etc.

Related

Terabyte-scale database in Greenplum

I am currently using Greenplum on a small amount of data, around 1 GB, to test it.
As Greenplum is said to be "petabyte-scale", I was wondering whether a data volume of one or ten terabytes is worth moving to this MPP architecture instead of a normal PostgreSQL database.
All my network interfaces, on both the master and the segment hosts, are 10 Mb/s.
Best-practice guides don't cover these considerations. My concern is that a rather "little database" will perform poorly because of the network overhead.
Have you already implemented a database at this scale?
The workloads for PostgreSQL and Greenplum are different. PostgreSQL is great for OLTP, queries with index lookups, referential integrity, etc. You typically know the query patterns in an OLTP database too. It can certainly take on some data warehouse or analytical needs but it scales by buying a bigger machine with more RAM and more cores with faster disks.
Greenplum, on the other hand, is designed for data warehousing and analytics. You design the database without knowing how the users will query the data. This means sequential reads, no indexes, full table scans, etc. It can do some OLTP work, but it isn't designed for it. You scale Greenplum by adding more nodes to your cluster, which gives you more CPU, RAM, and disk throughput.
What is your use case? That is the biggest determinant in picking Greenplum vs PostgreSQL.
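The design difference shows up right in the DDL. A minimal sketch, assuming a hypothetical sales table and connecting with psycopg2 (Greenplum speaks the PostgreSQL wire protocol); the DISTRIBUTED BY clause is the Greenplum-specific part that decides how rows spread across segments:

```python
import psycopg2

# Connection details are placeholders.
conn = psycopg2.connect(host="gp-master", dbname="analytics",
                        user="gpadmin", password="secret")
cur = conn.cursor()

# In plain PostgreSQL this would be an ordinary table, likely with
# indexes tuned for known OLTP query patterns. In Greenplum the key
# decision is the distribution column: rows are hashed on customer_id
# and spread across all segment hosts, so full-table scans run in
# parallel on every node.
cur.execute("""
    CREATE TABLE sales (
        sale_id     bigint,
        customer_id bigint,
        amount      numeric,
        sold_at     timestamp
    ) DISTRIBUTED BY (customer_id)
""")
conn.commit()
cur.close()
conn.close()
```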

How many requests can mongodb handle before sharding is necessary?

Does anybody know (from personal experience or official documentation) how many concurrent requests a single MongoDB server can handle before sharding is advised?
If your working set exceeds the RAM you can afford for a single server, or your disk I/O requirements exceed what you can provide on a single server, or (less likely) your CPU requirements exceed what you can get on one server, then you'll need to shard. All these depend tremendously on your specific workload. See http://docs.mongodb.org/manual/faq/storage/#what-is-the-working-set
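As a rough first check, you can compare data and index sizes against available RAM from the driver (a sketch using PyMongo; the database name is a placeholder, and the figure is only an upper bound on the true working set):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
stats = client["mydb"].command("dbstats")

# dataSize + indexSize is an upper bound on the working set; the real
# working set is the subset of documents and index pages your queries
# actually touch, which is workload-dependent.
total = stats["dataSize"] + stats["indexSize"]
print(f"data + indexes: {total / 1024**3:.2f} GiB")
```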
One factor is hardware, although for that you have replica sets: they reduce the load on the primary by answering read-only queries with replicated data. Another option would be memcaching very frequent and repetitive queries, which would be even faster.
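Routing reads to replica-set secondaries is a driver-side setting (a sketch with PyMongo; hostnames, replica-set name, and collection are placeholders):

```python
from pymongo import MongoClient

# readPreference=secondaryPreferred sends reads to a secondary when one
# is available, spreading read-only traffic across the replica set while
# writes still go to the primary.
client = MongoClient(
    "mongodb://node1,node2,node3/?replicaSet=rs0"
    "&readPreference=secondaryPreferred"
)
doc = client["mydb"]["sessions"].find_one({"user_id": 42})
```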
A factor in whether sharding is necessary is the size and variation of the data. When you need to access a wide range of varying data, which would render a server's cache ineffective by spreading accesses across that whole range, then you would consider using sharding. Off-loading work is merely a side effect of this.

What constitutes a "large amount of write activity" for MongoDB?

I am currently working on an online ordering application using MongoDB as the backend. Looking into sharding, the Mongo docs say that you should consider sharding if
"your system has a large amount of write activity, a single MongoDB instance cannot write data fast enough to meet demand, and all other approaches have not reduced contention."
So my question is: what constitutes a large amount of write activity? Are we talking thousands of writes per second? Hundreds?
I know that sharding introduces a level of infrastructure complexity that I'd rather not get into if I don't have to.
Thanks!
R
The "large amount of write activity" is not defined in terms of a specific number .. but rather when your common usage pattern exceeds the resources of your server hardware. For example, when average I/O flush time or iowait indicates that I/O has become a significant limiting factor.
You do have other options to consider before sharding (a sketch for checking the first symptom follows this list):
if your working set is larger than RAM and you see significant page faults, upgrade your RAM
if your disk I/O isn't keeping up, consider upgrading to faster disks, RAID, or SSDs
review and adjust your readahead settings
look into optimizing slow or inefficient queries
review your indexes and remove unnecessary ones
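A sketch of how you might watch for the RAM symptom above from PyMongo (the extra_info.page_faults counter is reported on Linux; field availability varies by platform and server version):

```python
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

def page_faults():
    # serverStatus reports a cumulative, server-wide page-fault counter.
    status = client.admin.command("serverStatus")
    return status.get("extra_info", {}).get("page_faults", 0)

# Sample the counter twice; a steadily climbing rate suggests the
# working set no longer fits in RAM.
before = page_faults()
time.sleep(10)
after = page_faults()
print(f"page faults in the last 10s: {after - before}")
```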

Redis versus Cassandra (Bigtable data model)

Suppose I need to do the following operation intensively:
put(key, value)
where value is a map of <column name, column value>.
I haven't known NoSQL for long; what I know is that both a Cassandra insert (which conforms to the API defined in the Bigtable paper) and the Redis HSET command could do that. But what are the pros and cons of each approach? Is there any performance or scalability difference?
EDIT:
My requirement is something like an IM server: I need to store session data, and I want all of it in memory so that low latency is easily achieved. A session lasts for at most 2 hours. There is no consistency requirement to consider yet, and disk is only for fail-over; loss of data is not terrible. All I need is low latency, and the more operations per second, the better.
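For concreteness, here is how the same put(key, map) looks in each system (a sketch using the redis-py and cassandra-driver Python clients; the key names, keyspace, and table are made up, and the 7200-second expiry matches the 2-hour session lifetime above):

```python
import redis
from cassandra.cluster import Cluster

# Redis: one in-memory hash per session key, with a TTL on the key.
r = redis.Redis(host="localhost", port=6379)
r.hset("session:42", mapping={"user": "alice", "status": "online"})
r.expire("session:42", 7200)  # sessions last at most 2 hours

# Cassandra: one row per session key; the TTL is applied per insert.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("im")
session.execute(
    "INSERT INTO sessions (key, user, status) VALUES (%s, %s, %s) "
    "USING TTL 7200",
    ("session:42", "alice", "online"),
)
```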
Both Redis and Cassandra can be used as a key-value store. The difference is in speed, scale, and reliability.
Redis works best as a single server, with the entire data set residing in memory.
Cassandra can handle data sets that don't fit in memory, and data sets that don't fit on a single machine. As part of distributing over multiple machines, Cassandra is much more reliable: it can handle machine failures, rebuild machines, and add capacity to the cluster when needed.
Because Redis is entirely in memory, and reads/writes are served by a single machine (a single Cassandra write will typically talk to multiple machines), Redis will most likely be faster.
If your primary goal is speed, you don't need to store data reliably, and your data set fits in memory, then Redis would probably be the better solution.

MongoDB - Cluster of small/many or few/big nodes

Can it be said what is best in the general case (where the database size is really big): a MongoDB cluster consisting of a larger number of smaller blade servers, or a few really fat servers?
Given that the shard key has quite fine granularity, splitting should not be a problem.
If there is no "golden bullet", what are the pros and cons of either setup?
Best in what aspect? From a financial point of view, I'd go for lots of cheap hardware :)
MongoDB has been built to easily scale across nodes, so why not take advantage of this? The reason you'd want just one or a few beefy servers for a SQL server is to minimize the spread of relational data across physical nodes. But since MongoDB uses documents, most of your related data is stored in a single document. This means that it's all stored at the same physical location and you don't have to do costly lookups on other nodes to reconstruct the 'complete picture' of your data.
Another thing to keep in mind is that map-reduce jobs can only run in parallel in a sharded environment. So if you plan to do a lot of map-reducing, more shards/servers will result in better performance.
What if your database outgrows your beefy servers? Are you going to invest in another beefy server that handles that small amount of extra growth? Or what if one of them crashes? With smaller and cheaper servers, you can scale up (or down) more gradually if the need arises. Also, the impact of a server crash is much smaller, as it will affect only a small portion of your data.
To summarize: a large cluster of smaller servers isn't a silver bullet, as managing such a cluster has its own challenges, but it is significantly cheaper and possibly faster as well if you're doing map-reduce.
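Whichever hardware profile you choose, sharding itself is set up the same way (a sketch against a mongos router using PyMongo; the database, collection, and shard-key field are placeholders, and a hashed key is one way to get the fine-grained splitting mentioned in the question):

```python
from pymongo import MongoClient

# Connect to the mongos query router, not an individual shard.
client = MongoClient("mongodb://mongos-host:27017")

# Enable sharding for the database, then shard the collection on a
# hashed key so chunks spread evenly across however many nodes you run.
client.admin.command("enableSharding", "orders")
client.admin.command(
    "shardCollection", "orders.line_items", key={"order_id": "hashed"}
)
```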