Use Azure Blob storage for a PostgreSQL database

Azure Blob storage has a maximum limit of 500 TB per storage account. This is sufficient, as I have (or will have) a lot of data.
I'd like to set up a server in Azure with Postgres installed, but I do not want to use the local SSD/HDD, because the size is limited... and the only way I can resize the disk (if I need to) is to shut down the VM and bang out some cryptic commands in PowerShell.
Is it possible to tell Postgres to use Azure blob storage?
If so, how can I do this?
If not, what options do I have? How can I make my HDD/SSD scale as I need more space, without any intervention on my part?

You cannot directly read from or write to blob storage via Postgres, as it expects a normal file system with normal file I/O operations. Supporting blob storage would require modifications to the Postgres storage layer.
The maximum storage footprint, from a normal OS file system perspective, would then be limited by the number of disks (which sit in page blobs) attached to your VM: a maximum of 2 disks per core, and a maximum of 4 TB per disk (the larger disk sizes are new as of May 2017). That gives you a range of 8 TB (single core) to 128 TB (32 cores). You'd also likely need to RAID those disks (unless Postgres supports a JBOD model).
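For completeness, here is a rough sketch of how that looks on a Linux VM, using the Azure CLI to attach data disks and LVM to stripe them into one volume (the resource group, VM, disk names, and paths below are placeholders, and the device names will vary, so check lsblk first):

# Attach two new 4 TB managed data disks to the VM
az vm disk attach --resource-group myRG --vm-name pg-vm --name pgdata1 --new --size-gb 4095
az vm disk attach --resource-group myRG --vm-name pg-vm --name pgdata2 --new --size-gb 4095

# Stripe them into a single logical volume and mount it where Postgres keeps its data
pvcreate /dev/sdc /dev/sdd
vgcreate pgvg /dev/sdc /dev/sdd
lvcreate -i 2 -I 64 -l 100%FREE -n pgdata pgvg
mkfs.ext4 /dev/pgvg/pgdata
mkdir -p /var/lib/postgresql/data
mount /dev/pgvg/pgdata /var/lib/postgresql/data

Growing later means attaching more disks, extending the volume group, and resizing the filesystem, so this still isn't the zero-intervention scaling you asked for.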
As suggested by @Gaurav, you can also use Postgres-as-a-Service. Currently, while in preview, that service has a limit of 1 TB per database. Larger sizes will eventually be available.


Are storage and compute decoupled in modern cloud data warehouses?

In Redshift, Snowflake, and Azure SQL DW, are storage and compute decoupled?
If they are decoupled, is there still any use for "External Tables", or are they gone?
When compute and storage were tightly coupled and we wanted to scale, we scaled both compute and storage. But under the hood, was it a virtual machine whose compute and disks we scaled? Do you have any reading you can point me to on this?
Massive thanks, I am confused now and it would be a blessing if someone could jump in to explain!
You have reason to be confused as there is a heavy layer of marketing being applied in a lot of places. Let's start with some facts:
All databases need local disk to operate. This disk stores the permanent versions of tables (classic locally stored tables) and holds the local working set of data the database needs to operate. Even in cases where no tables are permanently stored on local disk, the size of the local disks is critical, as it allows data fetched from remote storage to be worked on and cached.
Remote storage of permanent tables comes in two "flavors": defined external tables and transparent remote tables. While there are lots of differences in how these flavors work and how each database optimizes them, they all store the permanent version of the table on disks that are remote from the database compute system(s).
Remote permanent storage comes with pros and cons. "Decoupling" is the most often cited advantage of remote permanent storage: it just means that you cannot fill up the local disks with the storage of "cold" data, as only "in use" data is stored on the local disks in this case. To be clear, you can still fill up (or brown out) the local disks even with remote permanent storage if the working set of data is too large. The downside of remote permanent storage is that the data is remote. Going across a network to some flexible storage solution means that getting to the data takes more time (each database system has its own methods to hide this in as many cases as possible). It also means that coherency control for the data runs across the network (in some respect) and carries its own costs.
External tables and transparent remote tables are both permanently stored remotely, but there are differences. An external table isn't under the same coherency structure that a fully-owned table is under (whether local or remote). Transparent remote just implies that the database works with the remote table "as if" it were locally owned.
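To make the external-table flavor concrete, this is roughly what it looks like in Redshift Spectrum (the schema name, IAM role ARN, and S3 bucket below are made up); the data files stay in S3 and only the metadata is registered in the warehouse:

-- External schema backed by the Glue Data Catalog; the IAM role is a placeholder
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- The table is only a definition over files that remain in S3
CREATE EXTERNAL TABLE spectrum_schema.sales (
    sale_id   BIGINT,
    sale_date DATE,
    amount    DECIMAL(10,2)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales/';

A transparent remote table, by contrast, is created like any ordinary table; the engine (Snowflake, for example) decides where and how the data lands on remote storage.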
VMs don't change the local disk situation. A portion of the physical disk in the box is allocated to each VM. The disks are still local; it's just that only a portion of the physical disks is addressable by any one VM.
So, leaving fact and moving to opinion: while marketing will tell you why one type of database storage is better than the other in all cases, this just isn't true. Each has advantages and disadvantages, and which is best for you will depend on what your needs are. The database providers that offer only one data organization will tell you that theirs is the best option, and it is, for some.
Local table storage will always be faster for those applications where speed of access to data is critical and caching doesn't work. However, this means that DBAs will need to do the work to keep the on-disk data optimized and within the available local storage (for the compute size needed). This is real work and takes time and energy. What you gain by moving remote is a reduction of this work, but it comes at the cost of some combination of database cost, hardware cost, and/or performance. Sometimes the tradeoff is worth it, sometimes not.
When it comes to separating (or decoupling) cloud compute from cloud storage, things can become a little confusing. In short, true decoupling generally requires object-level storage rather than the faster, traditional block storage (traditionally on-premises and also called local storage). The main reason is that object storage is flat, without a hierarchy, and therefore scales linearly with the amount of data you add. It also winds up being cheaper, as it is highly distributed, redundant, and easily redistributed and duplicated.
This is all important because, in order to decouple storage from compute in the cloud or any large distributed computing paradigm, you need to shard (split) your data (storage) among your compute nodes. Because object storage is flat, your storage can grow linearly without any performance penalty, while you can (practically) instantly "remaster" your compute nodes so that they evenly distribute the workload again as you scale compute up or down, or to withstand network/node failures.

Cloud SQL disk size is much larger than actual database

Cloud SQL reports that I've used ~4 TB of SSD storage, but my database is only ~225 GB. What explains this discrepancy? Is there something I can delete to free up space? If I moved it to a different instance, would the required storage go down?
There are a couple of possible reasons why your Cloud SQL storage has increased:
- Did you enable point-in-time recovery? PITR uses write-ahead logs, and if you enabled this feature, that could be the reason for the increase (see the sketch after this list).
- Have you used temporary tables that you have not deleted?
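If you want to check the first point yourself, the backup configuration is visible through the gcloud CLI (the instance name is a placeholder, and the exact flags are worth double-checking against gcloud's help for your engine and CLI version):

# Look for pointInTimeRecoveryEnabled (PostgreSQL) or binaryLogEnabled (MySQL)
gcloud sql instances describe my-instance --format="yaml(settings.backupConfiguration)"

# Disabling PITR discards the logs that back it, so only do this if losing PITR is acceptable
gcloud sql instances patch my-instance --no-enable-point-in-time-recovery   # PostgreSQL
gcloud sql instances patch my-instance --no-enable-bin-log                  # MySQL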
If none of the above applies to you, I highly recommend opening a case with the GCP support team so that they can take a look at your Cloud SQL instance.
On the other hand, you can open a case to have the disk decreased to a smaller size, so it won't be necessary to create a new instance and copy all the data over; shrinking the disk is done on Google's end, which keeps the effort required from you as low as possible.
A maintenance window can be scheduled for Google to carry out this task, and you may want to schedule it so as to minimize the impact of the downtime. For this, it is necessary to know the new disk size and when you would like the operation performed.
Finally, if you prefer the migration method, you would export the DB, create the new instance, import the DB, and synchronize the old instance with the new one so both hold all the data; those four steps can take several hours to complete.
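A sketch of that export/import path with the gcloud CLI, assuming a Cloud Storage bucket the instance's service account can write to (all names here are placeholders):

# Dump the database from the old instance into a bucket
gcloud sql export sql old-instance gs://my-bucket/dump.sql.gz --database=mydb

# Create the new, smaller instance separately, then load the dump into it
gcloud sql import sql new-instance gs://my-bucket/dump.sql.gz --database=mydb

Keeping the two instances synchronized afterwards is the part that needs extra tooling (replication or a repeat export/import during a write freeze).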
You do not specify what kind of database. In my case, with a MySQL database, several hundred GB were taken up by binary logs (MySQL-specific).
You could check with:
SHOW BINARY LOGS;
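and, at least on a server where you manage the logs yourself, trim the old ones (the 7-day retention is just an example; don't purge logs that replicas or point-in-time recovery still need):

-- Remove binary logs older than a week
PURGE BINARY LOGS BEFORE NOW() - INTERVAL 7 DAY;

-- Cap how long the server keeps them going forward (MySQL 5.7-style variable;
-- newer releases use binlog_expire_logs_seconds instead)
SET GLOBAL expire_logs_days = 7;

On Cloud SQL the binary logs are tied to the point-in-time recovery / binary logging setting, so turning that off (or asking support) is the managed-service equivalent.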

What affects DB2 restored database size?

I have a database TESTDB with the following details:
Database size: 3.2 GB
Database capacity: 302 GB
One of its tablespaces has its high water mark (HWM) held up by an SMP extent, so I cannot lower it.
My backup size is around 3.2 GB (as backups contain only used pages).
If I restore this database backup image via a redirected restore, what will be the newly restored database's size?
Will it be around 3.2 GB or around 302 GB?
The short answer is that RESTORE DATABASE will produce a target database that occupies about as much disk space as the source database did when it was backed up.
On its own, the size of a DB2 backup image is not a reliable indicator of how big the target database will be. For one thing, DB2 provides the option to compress the data being backed up, which can make the backup image significantly smaller than the DB2 object data it contains.
As you correctly point out, the backup image only contains non-empty extents (blocks of contiguous pages), but the RESTORE DATABASE command will recreate each tablespace container to its original size (including empty pages) unless you specify different container locations and sizes via the REDIRECT parameter.
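For reference, a redirected restore that redefines the containers looks roughly like this from the DB2 command line (the timestamp, tablespace ID, path, and page count are placeholders; LIST TABLESPACE CONTAINERS shows the real values):

-- Start the restore in redirect mode; DB2 pauses so containers can be redefined
RESTORE DATABASE TESTDB FROM /backups TAKEN AT 20240101120000 INTO TESTDB2 REDIRECT;

-- Redefine the containers for tablespace ID 3 with a new size (sizes are in pages)
SET TABLESPACE CONTAINERS FOR 3 USING (FILE '/db2/testdb2/ts3_cont0' 819200);

-- Resume the restore with the new container layout
RESTORE DATABASE TESTDB CONTINUE;

For automatic-storage tablespaces you would point the restore at new storage paths (the ON clause) instead of redefining containers.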
The 302 GB of capacity you're seeing is from GET_DBSIZE_INFO and similar utilities, and is quite often larger than the total storage the database currently occupies. This is because DB2's capacity calculation includes not only unused pages in DMS tablespaces, but also any free space on volumes or drives that are used by an SMS tablespace (most DB2 LUW databases contain at least one SMS tablespace).

Setting up MongoDB to work with two data directories

I have a MongoDB machine with a 1 TB drive (on AWS, that is the limit for a single volume).
However, I need to store more than 1 TB of data on this MongoDB setup, though it is not heavy on reads/writes.
Is there a way to split the data directory across two mounts, i.e. two different directories (instead of using LVM)?
You can use the directoryperdb configuration option so that each database's files are stored in a separate subdirectory, and then use different mount points depending on the volume you want to use.
This option can be helpful in provisioning additional storage or different storage configurations depending on database usage (e.g. SSD or PIOPS for a database that is more I/O intensive, while using normal EBS storage for archival data).
Important caveats:
There is a highlighted note in the documentation for directoryperdb which shows how to migrate the files for an existing deployment.
If you separate your data files onto multiple volumes, this may change your Backup Strategy. In particular, if you are using filesystem/EC2 snapshots to get a consistent backup of a running mongod you can no longer do so if the data and/or journal files are on different volumes.
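A rough sketch of the layout, assuming two volumes that are already formatted and a database named archive that should live on the second one (device names, paths, and database names are hypothetical):

# Mount the first volume as the main dbpath, and the second volume at the
# subdirectory that directoryperdb will use for the "archive" database
mount /dev/xvdf /data/db
mkdir -p /data/db/archive
mount /dev/xvdg /data/db/archive

# With directoryperdb, each database's files live under <dbpath>/<database name>/
mongod --dbpath /data/db --directoryperdb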
I haven't solved anything like your problem myself. However, I've read about the sharding technique, which is defined as:
Sharding is a method for storing data across multiple machines.
MongoDB uses sharding to support deployments with very large data sets
and high throughput operations.
It sounds promising for your problem. Wanna try? Sharding MongoDB (a quick sketch of the commands follows below).
(sorry, not enough rep for commenting)
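If you do try sharding, the MongoDB side of it boils down to a couple of commands once a sharded cluster (config servers, mongos routers, and shard servers) is up; the database and collection names here are made up:

// From a mongo shell connected to a mongos router
sh.enableSharding("mydb")
sh.shardCollection("mydb.mycollection", { _id: "hashed" })

Standing up and operating the cluster itself is considerably more work than the directoryperdb approach above, so it's usually reserved for when a single machine genuinely can't hold or serve the data.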

mongodb single node configuration

I am going to configure MongoDB on a small number of cloud servers.
I am coming from MySQL, and I remember that if I needed to change settings like RAM, etc., I would have to modify the "my.cnf" file. This came in useful when resizing each cloud server.
Now, how can I check or modify how much RAM or disk space the database is going to take for each node?
Thank you in advance.
I don't think there are any built-in, broad-stroke limitation tools or flags in MongoDB per se, and that is most likely because this is something you should be doing at the operating system level.
Most modern multi-user operating systems have built-in ways to set per-user quotas on disk space, so you could probably set up a dedicated mongodb user and place the limits on it if you really wanted to. MongoDB works best when it has enough memory to hold the working set of data and indexes in memory, and it does a good job of managing that on its own.
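For example, on Linux with user quotas enabled on the data filesystem, something like this would cap a dedicated mongodb user (the 500 GB figure and the /data mount point are arbitrary):

# Requires the filesystem to be mounted with usrquota and the quota tools installed.
# Limits are in 1 KiB blocks: 524288000 KiB is roughly 500 GB (hard limit, no soft limit).
setquota -u mongodb 0 524288000 0 0 /data

# Report current usage against the limits
repquota /data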
However, if you want to get granular, you can take a look at the help output of mongod --help.
I see the following options that you could tweak:
--nssize arg (=16) .ns file size (in MB) for new databases
--quota limits each database to a certain number of files (8 default)
--quotaFiles arg number of files allowed per db, requires --quota
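Put together, an invocation using those flags might look like the line below; the numbers are arbitrary, and note that these are MMAPv1-era options that have been dropped from recent MongoDB releases, so treat this as a sketch for older versions only:

# Each database limited to 4 data files, with 32 MB namespace (.ns) files
mongod --dbpath /data/db --quota --quotaFiles 4 --nssize 32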