Are storage and compute decoupled in modern cloud data warehouses? - amazon-redshift

In Redshift, Snowflake, and Azure SQL DW, are storage and compute decoupled?
If they are decoupled, is there still any use for "External Tables", or are they gone?
When compute and storage were tightly coupled and we wanted to scale, we scaled both compute and storage. But under the hood, was it a virtual machine, and did we scale the compute and the VM's disks? Do you have any reading to recommend on this?
Massive thanks, I am confused now and it would be a blessing if someone could jump in to explain!

You have reason to be confused as there is a heavy layer of marketing being applied in a lot of places. Let's start with some facts:
All databases need local disk to operate. This disk can store permanent versions of the tables (classic locally stored tables) and is needed to store the local working set of data for the database to operate. Even in cases where no tables are permanently stored on local disk, the size of the local disks is critical, as this allows data fetched from remote storage to be worked upon and cached.
Remote storage of permanent tables comes in 2 "flavors" - defined external tables and transparent remote tables. While there are lots of differences in how these flavors work and how each different database optimizes them they all store the permanent version of the table on disks that are remote from the database compute system(s).
Remote permanent storage comes with pros and cons. "Decoupling" is the most often cited advantage for remote permanent storage. This just means that you cannot fill up the local disks with the storage of "cold" data as only "in use" data is stored on the local disks in this case. To be clear you can fill up (or brown out) the local disks even with remote permanent storage if the working set of data is too large. The downside of remote permanent storage is that the data is remote. Being across a network to some flexible storage solution means that getting to the data takes more time (with all the database systems having their own methods to hide this in as many cases as possible). This also means that the coherency control for the data is also across the network (in some aspect) and also comes with impacts.
External tables and transparent remote tables are both permanently stored remotely but there are differences. An external table isn't under the same coherency structure that a fully-owned table is under (whether local or remote). Transparent remote just implies that the database is working with the remote table "as if" it is locally owned.
VMs don't change the local disk situation. A portion of the box's local disk is allocated to each VM. The disks are still local; it's just that only a portion of the physical disks is addressable by any one VM.
So leaving fact and moving to opinion. While marketing will tell you why one type of database storage is better than the other in all cases this just isn't true. Each has advantages and disadvantages and which is best for you will depend on what your needs are. The database providers that offer only one data organization will tell you that this is the best option, and it is for some.
Local table storage will always be faster for those applications where speed of access to data is critical and caching doesn't work. However, this means that DBAs will need to do the work to ensure the on-disk data is optimized and fits in the available local storage (for the compute size needed). This is real work and takes time and energy. What you gain in moving remote is the reduction of this work, but it comes at the cost of some combination of database cost, hardware cost, and/or performance. Sometimes it is worth the tradeoff, sometimes not.

When it comes to the concept of separating (or de-coupling) Cloud Compute vs. Cloud Storage, the concepts can become a little confusing. In short, true decoupling generally requires object level storage vs. faster traditional block storage (traditionally on-premises and also called local storage). The main reason for this is that object storage is flat, without a hierarchy and therefore scales linearly with the amount of data you add. It therefore winds up also being cheaper as it is extremely distributed, redundant, and easily re-distributed and duplicated.
This is all important because, in order to decouple storage from compute in the cloud or in any large distributed computing paradigm, you need to shard (split) your data (storage) amongst your compute nodes. As your storage grows, object storage, which is flat, allows that to happen without any penalty in performance, while you can (practically) instantly "remaster" your compute nodes so that they evenly distribute the workload again as you scale your compute up or down, or to withstand network/node failures.

Related

Cloud SQL disk size is much larger than actual database

Cloud SQL reports that I've used ~4TB of SSD storage, but my database is only ~225 GB. What explains this discrepancy? Is there something I can delete to free up space? If I moved it to a different instance, would the required storage go down?
There are a couple of possible reasons why your Cloud SQL storage has increased:
-Did you enable point-in-time recovery? PITR uses write-ahead logs, and if you enabled this feature, that could be the reason for the increase.
-Have you used temporary tables that you have not deleted?
If none of the above applies to you, I highly recommend you to open a case with GCP support team so that they take a look at your Cloud SQL instance.
On the other hand, you can open a case to have the disk shrunk to a smaller size. That way you won't need to create a new instance and copy all the data to it, and since the shrinking is done at Google's end, the effort on your side is as low as possible.
Google can perform this task in a scheduled maintenance window, which lets you minimize the impact of the downtime. For this you need to know the new disk size and when you would like the operation performed.
Finally, if you prefer the migration method, you should export the DB, create the new instance, import the DB, and synchronize the old instance with the new one so that both hold all the data. Completing those four steps can take several hours.
You do not specify what kind of database. In my case, for a MySQL database, there were several hundred GB as binary logs (mysql flag).
You could check with:
SHOW BINARY LOGS;

How is MongoDB or any NoSQL DB (HBase, Cassandra) scalable, and what advantage does it have over a traditional RDBMS?

I am still not able to see, in real-world terms, how NoSQL is beneficial when we have indexes in traditional RDBMSs too. Could someone explain the advantages of columnar databases in real applications, particularly in terms of using structured, semi-structured, or unstructured data?
Largely, it depends on what you want your datastore to do. If you want to be able to scale to meet storage or operational demands, a RDBMS can only take you so far.
It comes down to how you can scale to meet demand. A RDBMS is really only capable of scaling vertically. That is, add more RAM, add more disk, etc. A distributed (NoSQL) database makes scaling easier by allowing you to add more machine instances. This is known as scaling horizontally.
Here's an example using Cassandra:
Let's say I have a 3 node cluster, and my keyspace (database) is also configured with a replication factor (RF) of 3. This means that each node is responsible for 100% of the data. I load my data, and it takes up 100GB of disk space (on each node). Now, while I might have 300GB of data total in my cluster, a single copy of my data is 100GB.
So my product team comes to me and says they need to double the amount of data they have. I know that I built their 3 node cluster with 200GB drives. If I did nothing, those drives would pretty much fill-up (and if they didn't they wouldn't leave room for much else).
Now it's up to me to scale the cluster to meet their space demands. I'll start by adding 3 new nodes to the cluster (for a total of 6), but I'll leave my RF at 3. This makes each node responsible for 50% of the data, or 50GB. When my product team loads more data to meet their "doubling" requirement, each node should climb back up to about 100GB. A single copy of the data is now 200GB. But with each node responsible for 50%, each 200GB drive still only has 100GB.
Example #2:
Let's say that the cluster above with 6 nodes is capable of supporting an operational load of 10,000 operations per second (ops). My product team comes to me again, saying that for the holiday season they project needing to support 20,000 ops. As the current cluster can only support half of that, it will choke under the intense throughput, and one or more nodes may crash.
As Cassandra scales linearly, the way to achieve this is to (again) double the size of the cluster. So I increase it from 6 nodes to 12 nodes, while still maintaining my RF of 3. After running some performance testing, they verify that it can indeed support 20,000 ops. As a single copy of my data is 200GB, the total data footprint remains 600GB. With 12 nodes, each node is now responsible for only 25% of the data, or 50GB.
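The arithmetic behind both examples can be checked with a few lines of Python. This is only a sketch: the helper name `per_node_gb` is made up for illustration, and it assumes the data distributes perfectly evenly across nodes, which a real cluster only approximates.

```python
def per_node_gb(single_copy_gb: float, replication_factor: int, nodes: int) -> float:
    """Approximate data per node, assuming perfectly even distribution (hypothetical helper)."""
    if nodes < replication_factor:
        raise ValueError("need at least as many nodes as replicas")
    # Total footprint is (single copy x RF); it is spread over all nodes.
    return single_copy_gb * replication_factor / nodes


# (label, size of a single copy in GB, replication factor, node count)
scenarios = [
    ("3 nodes, 100GB single copy", 100, 3, 3),
    ("6 nodes, 100GB single copy", 100, 3, 6),
    ("6 nodes, 200GB single copy", 200, 3, 6),
    ("12 nodes, 200GB single copy", 200, 3, 12),
]

for label, size_gb, rf, node_count in scenarios:
    print(f"{label}: about {per_node_gb(size_gb, rf, node_count):.0f}GB per node")
```

The total footprint (single copy times RF) only changes when the data grows; adding nodes just spreads that same footprint thinner.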
So scalability is the advantage. But how about modeling the data? The main idea in distributed database modeling, is two-fold:
Build a table structure which is keyed to distribute well. We don't want uneven amounts of data on each node.
Build the key on the table so that it matches our query requirements.
One of the drawbacks of a NoSQL database, is that your query patterns become restricted. In an effort to cut down on network time, you want to ensure that your query can be served by a single node.
This usually means using natural keys, as those are more in-line with what you are asking of your data. Surrogate keys (alpha, numerical, or both) distribute well, but aren't really useful for querying. User "Bob Jones" might be id "3582346556230" in my system. But when I want to query Bob's data, I'll probably never want to ask for it by "3582346556230," because that doesn't mean anything to the application or the context in which the data is used.
Also, you want your data to have structure. Unstructured data is un-queryable data. Simple as that. If you want unstructured data to be queryable, you need to parse out its identifying aspects to be used as keys. You don't want to "search" or run SELECT * FROM queries. Full table scans in NoSQL databases are even more resource consuming than their RDBMS counterparts, because they have to check each node and sort through replicas, and thus incur extra network time.
NoSQL databases give you the ability to scale (for increases in data or demand). But it's important to note that their scalability can make some things (which a RDBMS might be good at), more difficult than you're used to.
The R in RDBMS, relational, is the biggest thing missing from Mongo. There's very little to no way to make the database understand how entries in different collections (tables) relate to each other. One of the big strengths of RDBMSs is the ability to define constraints which the database will enforce, most typically foreign key constraints which ensure that an id in one table refers to an existing id in another table.
One requirement for the database to be able to enforce such constraints is obviously that everything needs to go through one source of truth and there needs to be one central entity cross-checking the data; it cannot be decentralised since discrepancies between two different primary sources can lead to data inconsistencies.
In Mongo, each data blob is pretty much independent. It doesn't refer to other entries in any way enforced by the database. Mongo also has weak to no ACID guarantees, meaning there's little protection against race condition inserts or updates. In a word: Mongo makes few guarantees with regard to data consistency and mostly offloads these kinds of concerns to the application layer. That allows it to work in a more decentralised way.
E.g. a good way to scale Mongo is to have many secondary servers which replicate a primary server for read-only access. There's no guarantee that the primary and secondaries will be in sync at any given time, it may take a couple of seconds for data written to the primary to trickle to the secondaries. But this allows you to have a virtually unlimited number of secondary read-only servers, which is great for scaling a database under heavy read load.
The way specifically Mongo handles its clusters also allows it to have a very high uptime, as the cluster will reorganise itself into primaries and secondaries automatically if a server goes down. This even allows for rolling maintenance without any client downtime.
Not having to enforce complex constraints or transaction consistency during writing also allows a more fire-and-forget style of writing to the database, which can be much faster. Again, at the cost of allowing inconsistent data. Which is why most writing pretty much means atomically updating a single document in a collection with no guarantees about other documents, which is something of a different paradigm than RDBMS transactional updates across many tables.
I would not recommend Mongo for storing things like a financial ledger, which heavily relies on transactional guarantees for consistency. However, things like Twitter are a perfect case for it: many independent snippets of data which must be read by a massive number of clients.

GridFS: what it gives us

I'm reading "Seven Databases in Seven Weeks". Could you please explain the text below to me:
One downside of a distributed system can be the lack of a single
coherent filesystem. Say you operate a website where users can upload
images of themselves. If you run several web servers on several
different nodes, you must manually replicate the uploaded image to
each web server’s disk or create some alternative central system.
Mongo handles this scenario by its own distributed filesystem called
GridFS.
Why do you need to manually replicate uploaded images? Do they mean that some of the servers will run Linux and some of them Windows?
Do all replicated data stores tend to implement their own filesystem?
On the need for data distribution and its intricacies
Let us dissect the example in a bit more detail. Say you have a web application where people can upload images. You fire up your server, save the images to the local machine in /home/server/app/uploads, the users use the application. So far, so good.
Now, your application becomes the next big thing, you have tens of thousands of concurrent users and your single server simply can not handle that load any more. Luckily, aside from the fact that you store the images in the local file system, you implemented the application in a way that you could easily put up another instance and distribute the load between them. But now here comes the problem: the second instance of your application would not have access to the images stored on the first instance – bad thing.
There are various ways to overcome that. Let us take NFS as an example. Now your second instance can access the images, and even store new ones, but that puts all the images on one machine, which sooner or later will run out of disk space.
Scaling storage capacity can easily become a very expensive part of an application. And this is where GridFS comes to help. It uses the rather easy means of MongoDB to distribute data across many machines, a process which is called sharding. Basically, it works like this: Instead of accessing the local filesystem, you access GridFS (and the contained files within) via the MongoDB database driver.
As for the OS: Usually, I would avoid mixing different OSes within a deployment, if at all possible. Nowadays, there is little to no reason for the average project to do so. I assume you are referring to the "different nodes" part of that text. This only refers to the fact that you have multiple machines involved. But they can perfectly well run the same OS.
Sharding vs. replication
Note: The following is vastly simplified, because going into details would well exceed the scope of one or more books.
The excerpt you quoted mixes two concepts a bit and is not clear enough on how GridFS works.
Let's first make the two concepts involved a bit clearer.
Replication is roughly comparable to a RAID1: The data is stored on two or more machines, and each machine holds all data.
Sharding (also known as "data partitioning") is roughly comparable to a RAID0: Each machine only holds a subset of the data, albeit you can access the whole data set (files in this case) transparently and the distributed storage system takes care of finding the data you requested (and decides where to store the data when you save a file)
Now, MongoDB allows you to have a mixed form, roughly comparable to RAID10: The data is distributed ("partitioned" or "sharded") between two or more shards, but each shard may (and almost always should) consist of a replica set, which is an odd number of MongoDB instances which all hold the same data. This mixed form is called a "sharded cluster with a replication factor of X", where X denotes the non-hidden members per replica set.
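To make the RAID10 analogy concrete, here is a toy Python sketch of how a file name could be mapped to one shard, with every member of that shard's replica set holding a copy. It is deliberately simplified and is not how MongoDB routes data internally (MongoDB assigns ranges of the shard key to shards rather than using a plain modulo); the names `shard_for`, `place` and the member names are made up for illustration.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Pick a shard index from a stable hash of the key (toy scheme, an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def place(key: str, shards: list) -> list:
    """Return the replica set members that would all hold a copy of `key`."""
    return shards[shard_for(key, len(shards))]


# Two shards (partitioning), each a replica set of three members (replication).
cluster = [
    ["shard0-a", "shard0-b", "shard0-c"],
    ["shard1-a", "shard1-b", "shard1-c"],
]

for filename in ["bob.jpg", "alice.png", "carol.gif"]:
    print(filename, "->", place(filename, cluster))
```

Each file lives on exactly one shard (the RAID0 part), but on every member of that shard's replica set (the RAID1 part).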
The advantage of a sharded cluster is that there is no single point of failure any more:
Depending on your replication factor, one or more replica set members can fail, and the cluster is still working
There are servers which hold the metadata (which part of the data is stored on which shard, for example). Those are called config servers. As of MongoDB version 3.2 (iirc), they form a replica set themselves – not much of a problem if a node fails.
You access a sharded cluster via the mongos sharded cluster query router, of which you usually have one per instance of your application (and most often on the same server as your application instance). But: most drivers can be given multiple mongos instances to connect to. So if one of those mongos instances fails, the driver will happily use the next one you configured.
Another advantage is that in case you need to add additional storage or need more IOPS than your current system can handle, you can add another shard: MongoDB will take care of distributing the existing data between the old shards and the new shard automagically. The details on how this is done are covered in the introduction to Sharding in the MongoDB docs.
The third advantage – and the one that has the most impact, imho – is that you can distribute (and replicate) data on relatively cheap commodity hardware, whereas most other technologies offering the benefits of GridFS on a sharded cluster require you to have specialized and expensive hardware.
A disadvantage is of course that this setup only is feasible if you have a lot of data, since many machines are necessary to set up a sharded cluster:
At least 3 config servers
At least a single shard, which should consist of a replica set. The minimal setup would be two data bearing nodes plus an arbiter
But: in order to use GridFS in general, you do not even need a replica set ;).
To stay within our above example: Both instances of your application could well access the same MongoDB instance holding a GridFS.
Do all replicated data storages tend to implement own filesystem?
Replicated? Not necessarily. There is DRBD for example, which could be described as "RAID1 over ethernet".
Assuming we have the same mixup of concepts here as we had above: Distributed file systems by their very definition implement a file system.
In this case, IMHO, the author was stating that each web server has its own disk storage, not shared with the others. The upload path could be /home/server/app/uploads, and as it is part of that server's filesystem it is not shared at all (as a kind of security with the service provider). To populate the other servers we would need a script/job which syncs the data to the other places behind the scenes.
This scenario could be a case to use GridFS with mongo.
How gridFS works:
GridFS divides the file into parts, or chunks, and stores each
chunk as a separate document. By default, GridFS uses a chunk size of
255 kB; that is, GridFS divides a file into chunks of 255 kB with the
exception of the last chunk. The last chunk is only as large as
necessary. Similarly, files that are no larger than the chunk size
only have a final chunk, using only as much space as needed plus some
additional metadata.
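The chunking described in that quote can be sketched in plain Python. This is only an illustration of the splitting rule, not the driver's implementation: the function name `split_into_chunks` is made up, and the `n` and `data` keys are loosely modeled on the chunk documents GridFS stores.

```python
CHUNK_SIZE = 255 * 1024  # 255 kB, the default chunk size quoted above


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Split a file's bytes into numbered chunks; only the last may be shorter."""
    return [
        {"n": index, "data": data[offset:offset + chunk_size]}
        for index, offset in enumerate(range(0, len(data), chunk_size))
    ]


# A 600 kB file needs two full chunks plus a smaller final chunk.
big_file = b"x" * (600 * 1024)
for chunk in split_into_chunks(big_file):
    print(chunk["n"], len(chunk["data"]))

# A file smaller than the chunk size ends up as a single, final chunk.
small_file = b"tiny"
print(len(split_into_chunks(small_file)))
```

In a real deployment you would not do this by hand; the MongoDB driver's GridFS API does the splitting and reassembly for you.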
In reply to comment:
BSON is a binary format, and Mongo has a special replication mechanism for replicating collection data (GridFS is a special set of 2 collections). It uses the oplog to send diffs to other servers in the replica set.
Any comments welcome!

How does the storage backend influence Datomic?

How should I pick the backend storage service for Datomic?
Is it a matter of preference to select, say, DynamoDB instead of Postgres, or does each option have different tradeoffs? If so, what are they?
Storage Services Requirements
Datomic's storage services should generally meet 3 requirements:
Implement key-value store semantics: efficient read/write access using indexed keys’ values
Support consistent reads. e.g. read your own writes. Ideally, no-contention/lock-free reads.
Support conditional puts. e.g. optimistic locking + snapshot isolation.
Datomic uses storage services to store blocks of sorted, compressed datoms, similar to the way traditional database systems use file systems, and the requirements above are pretty much the API between the underlying storage service and Datomic. So the choice of storage service depends on how well it supports those three requirements.
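To illustrate what "conditional put" means in the requirements above, here is a minimal in-memory sketch in Python. It is not Datomic's storage API; the class `InMemoryStore` and its method `put_if` are hypothetical, and a real storage service would perform the compare and the write as one atomic operation on the server side.

```python
class InMemoryStore:
    """Toy key-value store with a conditional put (hypothetical, for illustration)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Consistent read: always returns the latest written value.
        return self._data.get(key)

    def put_if(self, key, expected, new_value) -> bool:
        """Write new_value only if the current value still equals `expected`."""
        if self._data.get(key) != expected:
            return False  # somebody else changed it first; caller must re-read and retry
        self._data[key] = new_value
        return True


store = InMemoryStore()
print(store.put_if("root", None, "v1"))   # first writer wins
print(store.put_if("root", None, "v2"))   # stale expectation, rejected
print(store.put_if("root", "v1", "v2"))   # up-to-date expectation, accepted
print(store.get("root"))
```

This is the optimistic-locking pattern: nobody holds a lock while reading, and a writer only succeeds if what it read is still current.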
Write Scalability
Datomic doesn't usually put a lot of write pressure on the underlying storage service since there's only one component writing to it, the Transactor. Also, Datomic uses a background indexing job to integrate novelty into storage once enough of it has been accumulated (by default ~32MB but can be configured) which further reduces the constant write load. The only thing Datomic immediately writes is the transaction log.
Read Scalability
Datomic uses multiple layers of caching, i.e. memcached and the peers' cache, so in ideal circumstances, i.e. when the working set fits in memory, the system won't put a lot of read pressure on storage either.
System Load
If your system doesn't require huge write scalability and your application data tends to fit in memory, then the choice of a particular storage service is irrelevant except, of course, for their operational capabilities (backups, admin tools, etc.) which have nothing to do with Datomic.
If, on the other hand, your system does require huge write scalability or you have a great number of peers, each of them working with more data than can fit in their memory (forcing a lot of data segments to be brought from storage), you'll require a storage system that can scale horizontally, e.g. DynamoDB. As mentioned in one of the comments, if you need arbitrary write scalability, Datomic is not the right system for you anyway.

MongoDB: Can different databases be placed on separate drives?

I am working on an application in which there is a pretty dramatic difference in usage patterns between "hot" data and other data. We have selected MongoDB as our data repository, and in most ways it seems to be a fantastic match for the kind of application we're building.
Here's the problem. There will be a central document repository, which must be searched and accessed fairly often: its size is about 2 GB now, and will grow to 4 GB in the next couple of years. To increase performance, we will be placing that DB on a server-class mirrored SSD array, and given the total size of the data, we don't imagine that memory will become a problem.
The system will also be keeping record versions, an audit trail, customer interactions, notification records, and the like, which will be referenced only rarely and which could grow quite large in size. We would like to place this on more traditional spinning disks, as it would be accessed rarely (we're guessing that a typical record might be accessed four or five times per year, and will be needed only to satisfy research and customer service inquiries).
I haven't found any reference material that indicates whether MongoDB would allow us to place different databases on different disks (we're running mongod under Windows, but that doesn't have to be the case when we go into production).
Sorry about all the detail here, but these are primary factors we have to think about as we plan for deployment. Given Mongo's proclivity to grab all available memory, and that it'll be running on a machine that maxes out at 24GB memory, we're trying to work out the best production configuration for our database(s).
So here are what our options seem to be:
Single instance of Mongo with multiple databases. This seems to have the advantage of simplicity, but I still haven't found any definitive answer on how to split databases to different physical drives on the machine.
Two instances of Mongo, one for the "hot" data, and the other for the archival stuff. I'm not sure how well Mongo will handle two instances of mongod contending for resources, but we were thinking that, since the 32-bit version of the server is limited to 2GB of memory, we could use that for the archival stuff without having it overwhelm the resources of the machine. For the "hot" data, we could then easily configure a 64-bit instance of the database engine to use an SSD array, and given the relatively small size of our data, the whole DB and indexes could be directly memory mapped without page faults.
Two instances of Mongo in two separate virtual machines. We could use VMWare, or something similar, to create two Linux machines which could host Mongo separately. While it might up the administrative burden a bit, this seems to me to provide the most fine-grained control of system resource usage, while still leaving the Windows Server host enough memory to run IIS and its own processes.
But all this is speculation, as none of us have ever done significant MongoDB deployments before, so we don't have a great experience base to draw upon.
My actual question is whether there are options to have two databases in the same mongod server instance utilize entirely separate drives. But any insight into the advantages and drawbacks of our three identified deployment options would be welcome as well.
That's actually a pretty easy thing to do when using Linux:
Activate the directoryPerDB config option
Create the databases you need.
Shut down the instance.
Copy over the data from the individual database directories to the different block devices (disks, RAID arrays, Logical volumes, iSCSI targets and alike).
Mount the respective block devices at their corresponding positions beneath the dbpath directory (don't forget to add the corresponding lines to /etc/fstab!)
Restart mongod.
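As a rough illustration of the mount step, the following Python sketch prints the kind of /etc/fstab lines you would end up with. With directoryPerDB enabled, each database gets a subdirectory named after it beneath dbpath, so that subdirectory is the mount point. Everything concrete here is an assumption for the example: the dbpath, the database names, the device names, the filesystem type and the mount options must be adapted to your machine, and `fstab_lines` is just a made-up helper.

```python
from pathlib import PurePosixPath


def fstab_lines(dbpath: str, db_to_device: dict, fstype: str = "xfs") -> list:
    """Build one fstab line per database directory (hypothetical helper)."""
    lines = []
    for db_name, device in sorted(db_to_device.items()):
        # With directoryPerDB, the database's files live in <dbpath>/<db_name>.
        mount_point = PurePosixPath(dbpath) / db_name
        lines.append(f"{device} {mount_point} {fstype} defaults,noatime 0 0")
    return lines


# Assumed layout: "hot" on an SSD array, "archive" on spinning disks.
for line in fstab_lines(
    "/var/lib/mongodb",
    {"hot": "/dev/sdb1", "archive": "/dev/sdc1"},
):
    print(line)
```

Check the printed lines against your actual devices before putting anything into /etc/fstab.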
Edit: As a side note, I would like to add that you should not use Windows as the OS for a production MongoDB. The available filesystems, NTFS and ReFS, perform horribly when compared to ext4 or XFS (the latter being the suggested filesystem for production, see the MongoDB production notes for details). For this reason alone, I would suggest Linux. Another reason is the RAM used by rather unnecessary subsystems of Windows, like the GUI.