What is a cluster in a RDBMS? - rdbms

Please explain what a cluster is in a RDBMS?

In SQL, a cluster can also refer to a specific physical ordering of rows.
For example, consider a database with two tables: INVOICES and INVOICE_ITEMS. If many INVOICE_ITEMs are inserted concurrently, chances are that items of the same invoice end up on multiple physical blocks of the underlying storage. When reading such an invoice, unneeded data will be read together with the interesting rows. Clustering INVOICE_ITEMS over the foreign key to INVOICES groups rows of items the same invoice together in the same block, thus reducing the amount of necessary read operations when accessing the invoice.
Read about clustered index on wikipedia.
In system administration, a "cluster" is a number of servers configured to provide the same service, but look like one server to the users.
This can be done for performance reasons (two servers can answer more requests than a single one) or redundancy (if one server crashes, the others still work).
Such configurations often need special software or setup to work. Some services, like serving static web content, can be clustered very easily. Others, like RDBMS, need complicated replication schemes to coordinate.
Read about computer clusters on wikipedia.
In statistics, a cluster is a "group of items so that objects from the same cluster are more similar to each other than objects from different clusters."
Read about Cluster analysis on wikipedia.

From here:
High-availability clusters (also known
as HA Clusters or Failover Clusters)
are computer clusters that are
implemented primarily for the purpose
of providing high availability of
services which the cluster provides.
They operate by having redundant
computers or nodes which are then used
to provide service when system
components fail. Normally, if a server
with a particular application crashes,
the application will be unavailable
until someone fixes the crashed
server. HA clustering remedies this
situation by detecting
hardware/software faults, and
immediately restarting the application
on another system without requiring
administrative intervention, a process
known as Failover

In database context it can have two completely different meanings:
may either mean data clustering or index clustering, which is grouping of similar rows. This is useful for data mining, some databases (e.g. Oracle) also use it to optimize physical data organization;
or cluster as database running on many closely linked servers.

Clustering, in the context of databases.
It refers to the ability of several servers or instances to connect to a single database.
An instance is the collection of memory and processes that interacts with a database, which is the set of physical files that actually store data.

Related

How Mongo DB or any nosql DB (Hbase, Cassandra) is scalable and having advantage over traditional RDBMS?

I am still not able to relate in real-time how nosql is beneficial whereas we have indexes too in traditional RDBMS's. If someone can suggest columnar databases advantages in real application particularly in terms of using structure, semistructured or unstructured data.
Largely, it depends on what you want your datastore to do. If you want to be able to scale to meet storage or operational demands, a RDBMS can only take you so far.
It comes down how you can scale to meet demand. A RDBMS is really only capable of scaling vertically. That is, add more RAM, add more disk, etc. A distributed (NoSQL) database makes scaling easier by allowing you to add more machine instances. This is known as scaling horizontally.
Here's an example using Cassandra:
Let's say I have a 3 node cluster, and my keyspace (database) is also configured with a replication factor (RF) of 3. This means that each node is responsible for 100% of the data. I load my data, and it takes up 100GB of disk space (on each node). Now, while I might have 300GB of data total in my cluster, a single copy of my data is 100GB.
So my product team comes to me and says they need to double the amount of data they have. I know that I built their 3 node cluster with 200GB drives. If I did nothing, those drives would pretty much fill-up (and if they didn't they wouldn't leave room for much else).
Now it's up to me to scale the cluster to meet their space demands. I'll start by adding 3 new nodes to the cluster (for a total of 6), but I'll leave my RF at 3. This makes each node responsible for 50% of the data, or 50GB. When my product team loads more data to meet their "doubling" requirement, each node should climb back up to about 100GB. A single copy of the data is now 200GB. But with each node responsible for 50%, each 200GB drive still only has 100GB.
Example #2:
Let's say that the cluster above with 6 nodes is capable of supporting an operational load of 10,000 operations per second (ops). My product team comes to me again, saying that for the holiday season they project needing to support 20,000 ops. As the current cluster can only support half of that, it will choke under the intense throughput, and one or more nodes may crash.
As Cassandra scales linearly, the way to achieve this is to (again) double the size of the cluster. So I increase it from 6 nodes to 12 nodes, while still maintaining my RF of 3. After running some performance testing, they verify that it can indeed support 20,000 ops. As a single copy of my data is 200GB, the total data footprint remains 600GB. With 12 nodes, each node is now responsible for only 25% of the data, or 50GB.
So scalability is the advantage. But how about modeling the data? The main idea in distributed database modeling, is two-fold:
Build a table structure which is keyed to distribute well. We don't want uneven amounts of data on each node.
Build the key on the table so that it matches our query requirements.
One of the drawbacks of a NoSQL database, is that your query patterns become restricted. In an effort to cut down on network time, you want to ensure that your query can be served by a single node.
This usually means using natural keys, as those are more in-line with what you are asking of your data. Surrogate keys (alpha, numerical, or both) distribute well, but aren't really useful for querying. User "Bob Jones" might be id "3582346556230" in my system. But when I want to query Bob's data, I'll probably never want to ask for it by "3582346556230," because that doesn't mean anything to the application or the context in which the data is used.
Also, you want your data to have structure. Unstructured data is un-queryable data. Simple as that. If you want unstructured data to be queryable, you need to parse-out its identifying aspects to be used as keys. You don't want to "search" or run SELECT * FROM queries. Full table scans in NoSQL databases are even more resource consuming than their RDBMS counterparts, because they have to check each node, sort through replicas, and thus incurs extra network time.
NoSQL databases give you the ability to scale (for increases in data or demand). But it's important to note that their scalability can make some things (which a RDBMS might be good at), more difficult than you're used to.
The R in RDBMS, relational, is the biggest thing missing from Mongo. There's very little to no way to make the database understand how entries in different tables collections relate to each other. One of the big strengths of RDBMSs is the ability to define constraints which the database will enforce, most typically foreign key constraints which ensure that an id in one table refers to an existing id in another table.
One requirement for the database to be able to enforce such constraints is obviously that everything needs to go through one source of truth and there needs to be one central entity cross-checking the data; it cannot be decentralised since discrepancies between two different primary sources can lead to data inconsistencies.
In Mongo, each data blob is pretty much independent. It doesn't refer to other entries in any way enforced by the database. Mongo also has weak to no ACID guarantees, meaning there's little protection against race condition inserts or updates. In a word: Mongo makes little guarantees with regards to data consistency and mostly offloads these kinds of concerns to the application layer. That allows it to work more decentralised.
E.g. a good way to scale Mongo is to have many secondary servers which replicate a primary server for read-only access. There's no guarantee that the primary and secondaries will be in sync at any given time, it may take a couple of seconds for data written to the primary to trickle to the secondaries. But this allows you to have a virtually unlimited number of secondary read-only servers, which is great for scaling a database under heavy read load.
The way specifically Mongo handles its clusters also allows it to have a very high uptime, as the cluster will reorganise itself into primaries and secondaries automatically if a server goes down. This even allows for rolling maintenance without any client downtime.
Not having to enforce complex constraints or transaction consistency during writing also allows a more fire-and-forget style of writing to the database, which can be much faster. Again, at the cost of allowing inconsistent data. Which is why most writing pretty much means atomically updating a single document in a collection with no guarantees about other documents, which is something of a different paradigm than RDBMS transactional updates across many tables.
I would not recommend Mongo for storing things like a financial ledger, which heavily relies on transactional guarantees for consistency. However, things like Twitter are a perfect case for it: many independent snippets of data which must be read by a massive number of clients.

GridFS: what it gives us

I'm reading "Seven Databases in Seven Weeks". Could you please explain me the text below:
One downside of a distributed system can be the lack of a single
coherent filesystem. Say you operate a website where users can upload
images of themselves. If you run several web servers on several
different nodes, you must manually replicate the uploaded image to
each web server’s disk or create some alternative central system.
Mongo handles this scenario by its own distributed filesystem called
GridFS.
Why do you need replicate manually uploaded images? Does they mean some of the servers will have linux and some of them Windows?
Do all replicated data storages tend to implement own filesystem?
On the need for data distribution and its intricacies
Let us dissect the example in a bit more detail. Say you have a web application where people can upload images. You fire up your server, save the images to the local machine in /home/server/app/uploads, the users use the application. So far, so good.
Now, your application becomes the next big thing, you have tens of thousands of concurrent users and your single server simply can not handle that load any more. Luckily, aside from the fact that you store the images in the local file system, you implemented the application in a way that you could easily put up another instance and distribute the load between them. But now here comes the problem: the second instance of your application would not have access to the images stored on the first instance – bad thing.
There are various ways to overcome that. Let us take NFS as an example. Now your second instance can access the images, and even store new ones, but that puts all the images on one machine, which sooner or later will run out of disk space.
Scaling storage capacity can easily become a very expensive part of an application. And this is where GridFS comes to help. It uses the rather easy means of MongoDB to distribute data across many machines, a process which is called sharding. Basically, it works like this: Instead of accessing the local filesystem, you access GridFS (and the contained files within) via the MongoDB database driver.
As for the OS: Usually, I would avoid mixing different OSes within a deployment, if at all possible. Nowadays, there is little to no reason for the average project to do so. I assume you are referring to the "different nodes" part of that text. This only refers to the fact that you have multiple machines involved. But they perfectly can run the same OS.
Sharding vs. replication
Note The following is vastly simplified, because going into details would well exceed the scope of one or more books.
The excerpt you quoted mixes two concepts a bit and is not clear enough on how GridFS works.
Lets first make the two involved concepts a bit more clear.
Replication is roughly comparable to a RAID1: The data is stored on two or more machines, and each machine holds all data.
Sharding (also known as "data partitioning") is roughly comparable to a RAID0: Each machine only holds a subset of the data, albeit you can access the whole data set (files in this case) transparently and the distributed storage system takes care of finding the data you requested (and decides where to store the data when you save a file)
Now, MongoDB allows you to have a mixed form, roughly comparable to RAID10: The data is distributed ("partitioned" or "sharded") between two or more shards, but each shard may (and almost always should) consist of a replica set, which is an uneven number of MongoDB instances which all hold the same data. This mixed form is called a "sharded cluster with a replication factor of X", where X denotes the non-hidden members per replica set.
The advantage of a sharded cluster is that there is no single point of failure any more:
Depending on your replication factor, one or more replica set members can fail, and the cluster is still working
There are servers which hold the metadata (which part of the data is stored on which shard, for example). Those are called config servers. As of MongoDB version 3.0.x (iirc), they form a replica set themselves – not much of a problem if a node fails.
You access a sharded cluster via a the mongos sharded cluster query router of which you usually have one per instance of your application (and most often on the same server as your application instance). But: most drivers can be given multiple mongos instances to connect to. So if one of those mongos instances fails, the driver will happily use the next one you configured.
Another advantage is that in case you need to add additional storage or have more IOPS than your current system can handle, you can add another shard: MongoDB will take care of distributing the existing data between the old shards and the new shard automagically. The details on how this is done are covered in the introduction to Sharding in the MongoDB docs.
The third advantage – and the one that has the most impact, imho – is that you can distribute (and replicate) data on relatively cheap commodity hardware, whereas most other technologies offering the benefits of GridFS on a sharded cluster require you to have specialized and expensive hardware.
A disadvantage is of course that this setup only is feasible if you have a lot of data, since many machines are necessary to set up a sharded cluster:
At least 3 config servers
At least a single shard, which should consist of a replica set. The minimal setup would be two data bearing nodes plus an arbiter
But: in order to use GridFS in general, you do not even need a replica set ;).
To stay within our above example: Both instances of your application could well access the same MongoDB instance holding a GridFS.
Do all replicated data storages tend to implement own filesystem?
Replicated? Not necessarily. There is DRBD for example, which could be described as "RAID1 over ethernet".
Assuming we have the same mixup of concepts here as we had above: Distributed file systems by their very definition implement a file system.
In this case,IMHO, author was stating that each web server has own disk storage, not shared with others - having that - upload path could be /home/server/app/uploads and as it is part of server filesystem is not shared at all as a kind of security with service provider. To populate those we need to have a script/job which will sync data to other places behind the scenes.
This scenario could be a case to use GridFS with mongo.
How gridFS works:
GridFS divides the file into parts, or chunks 1, and stores each
chunk as a separate document. By default, GridFS uses a chunk size of
255 kB; that is, GridFS divides a file into chunks of 255 kB with the
exception of the last chunk. The last chunk is only as large as
necessary. Similarly, files that are no larger than the chunk size
only have a final chunk, using only as much space as needed plus some
additional metadata.
In reply to comment:
BSON is binary format, and mongo has special replication mechanism for replicating collection data (gridFS is a special set of 2 collections). It uses OpLog to send diffs toother servers in replica set. More here
Any comments welcome!

What did "share nothing" mode real mean?

We called Greenplum,Redshift MPP or share nothing.
but i really dont understand why?
if it mean during a mutil-level join query,one host are computing all the time,no hosts exchange data each other?,there is no shuffle?
and or other situation.
whats the crux means of "share nothing"?
Shared nothing means that no two servers share the same data (aside from mirrors for high availability). A simple example would be a two node cluster where the data is distributed by gender_code. Node1 would have all of the males and node2 would have all of the females.
In the real world, you have way more nodes than just two so you distribute the data by something like an ID column. This gives you an even distribution of data across all of the nodes.
As you can probably guess, the optimizer has to be pretty smart to reduce the amount of data movement needed to execute a query. It also needs to slice the query into multiple parts so that it can execute the multiple slices of the query at once. Greenplum has been around for over 10 years and has a mature optimizer which can execute a wide variety of queries pretty well.
"Shared Nothing" is a description of what resources are shared between the processes running in parallel. So you may have shared memory approaches running on a single host, shared storage between multiple hosts or self contained systems with their own processing, RAM and storage. A deployment based a a few of these self contained systems would be described as "shared nothing".
In a shared-nothing system each node will store a subset of the data. Query planners in these systems try to do as much work as possible on the same host the data is stored on and move or shuffle as little data as possible (on Greenplum systems these steps in the query plan are called motions).
We call MPP 'Shared Nothing' as a way to compare Greenplum to something with a 'Shared Everything' architecture like Oracle RAC which also has multiple servers in a cluster but they all connect back to the same set of datafiles.

Sitecore xDB (MongoDB) deployment in separate geographic locations

Sitecore uses MongoDB for tracking and analytics. If the production environment is split into several geographic locations, particularly in different continents, how should xDB be configured? If xDB can only have one writeable primary instance in any replica set, does this not force all front-end CD servers globally to write to the same node in one particular data centre? This doesn't seem ideal.
Voted to move this question to https://dba.stackexchange.com/
You are correct, eventually all of the data comes down to a "one" Mongo, but that replica set itself can be geographically distributed as well. You can also shard MongoDB if you wish to. See the MongoDB manual on geographic redundancy for information on this type of scaling. From the manual:
While replica sets provide basic protection against single-instance failure, replica sets whose members are all located in a single facility are susceptible to errors in that facility. Power outages, network interruptions, and natural disasters are all issues that can affect replica sets whose members are colocated. To protect against these classes of failures, deploy a replica set with one or more members in a geographically distinct facility or data center to provide redundancy.
Also note that with xDB in a geograpically distributed environment, you'll need to have a session state server for each CD cluster. This gathers all the user's information during the session, prior to the session completion flush to the collection database. The Sitecore guide on 'clustered environments' has a diagram and some information on this geographic configuration. From the guide:
Each cluster could contain two or more content delivery instances, each with its own session state server. You could also group clusters together in the same place or spread them across different geographical locations.
I have asked platform developers exactly this question on Sitecore User Group London a week ago, they responded that all the data for xDB is internally kept in UTC format.
We also have had problem of servers in different time zones previously, but that time it affected Event Queue (they did not work when in different time zones), so keeping all your servers in the same time zone with time synchronized would do the job.
Here is official guidance for that from Sitecore:
https://doc.sitecore.net/sitecore%20experience%20platform/utc/settings%20supporting%20utc%20implementation

Data Synchronization in a Distributed system

We have an REST-based application built on the Restlet framework which supports CRUD operations. It uses a local-file to store the data.
Now the requirement is to deploy this application on multiple VMs and any update operation in one VM needs to be propagated other application instances running on other VMs.
Our idea to solve this was to send multiple POST msgs (to all other applications) when a update operation happens in a given VM.
The assumption here is that each application has a list/URLs of all other applications.
Is there a better way to solve this?
Consistency is a deep topic, and a hard thing to get right. The trouble comes when two nearly-simultaneous changes occur to the same data: conflicting updates can arrive in one order on one server, and in another order on another. This is a problem, since the two servers no longer agree on what the data is, and it isn't clear who is "right".
The short-story: get your favorite RDBMS (for example, mysql is popular) and have your app servers connect to in what is called the three-tier model. Be sure to perform complex updates in transactions, which will provide an acceptable consistency model.
The long-story: The three-tier model serves well for small-to-medium scale web sites/services. You will eventually find that the single database becomes the bottleneck. For services whose read traffic is substantially larger than write traffic, a common optimization is to create a single-master, many-slave database replication arrangement, where all writes go to the single master (required for consistency with non-distributed transactions), but the more-common reads could go to any of the read slaves.
For services with evenly-mixed read/write traffic, you may be better served by dropped some of the conveniences (and accompanying restrictions) that formal SQL provides and instead use of one of the various "nosql" data stores that have recently emerged. Their relative merits and fitness for various problems is a deep topic in itself.
I can see 7 major options for now. You should find out more details and decide whether the facilities / trade-offs are appropriate for your purpose
Perform the CRUD operation on a common RDBMS. Simplest and most consistent
Perform the CRUD operations on a common RDBMS which runs as fast in-memory RDBMS. eg TimesTen from Oracle etc
Perform the CRUD on a distributed cache or your own home cooked distributed hash table which can guarantee synchronization eg Hazelcast/ehcache and others
Use a fast common state server like REDIS/memcached and perform your updates
in a synchronized manner on it and write out the successfull operations to a DB in a lazy manner if required.
Distribute your REST servers such that the CRUD operations on a single entity are only performed by a single master. Once this is done, the details about the changes can be communicated to everyone else using a reliable message bus or a distributed database (eg postgres) that runs underneath and syncs all of your updates fairly fast.
Target eventual consistency and use a distributed data store like Cassandra which lets you target the consistency you require
Use distributed consensus algorithms like Paxos or RAFT or an implementation of the same(recommended) like zookeeper or etcd respectively and take ownership of the item you want to change from each REST server before you perform the CRUD operation - might be a bit slow though and same stuff is what Cassandra might give you.