It is my understanding that Redshift is built for performance, but not for availability.
The documentation (https://aws.amazon.com/redshift/faqs/)
suggests that once any node is down, the whole cluster is down until the node is restored. In the case of an AZ failure, you are out of luck entirely.
This post suggests running a second, mirrored cluster:
https://aws.amazon.com/blogs/big-data/building-multi-az-or-multi-region-amazon-redshift-clusters/
However, it is not clear to me how you would replicate Looker's PDT tables to support instant failover via Route 53 to the standby cluster.
Just curious what people do to address the HA issue on Redshift?
From the Q&A regarding high availability in the case of an AZ disruption:
"If your Amazon Redshift data warehouse cluster's Availability Zone becomes unavailable, Amazon Redshift will automatically move your cluster to another AWS Availability Zone (AZ) without any data loss or application changes. To activate this, you must enable the relocation capability in your cluster configuration settings." https://aws.amazon.com/redshift/faqs/?nc1=h_ls
Redshift now supports multi-AZ deployments: https://aws.amazon.com/redshift/reliability
I have some experience with AWS RDS MySQL multi-AZ (HA). I'm looking at GCP Cloud SQL Postgres HA for a new project.
I'm trying to figure out how certain maintenance operations work, but can't figure it out from the Cloud SQL docs.
How much unavailability does a failover cause?
How much unavailability does a CPU/memory upgrade cause?
After a failover, is it important to eventually "failback" to the original primary instance? Or can I leave it running on the standby instance indefinitely? (The Cloud SQL HA failover diagram makes it seem like the two instances aren't totally symmetric.)
Just FYI, here are the answers for AWS RDS:
Failover: usually under 70 seconds of unavailability before my application is able to issue queries again.
This is for planned failovers. (For unplanned failovers, it may take a little longer for RDS to detect that the primary instance is unresponsive before it actually initiates the failover.)
A lot of the failover lag is likely due to DNS. Using the AWS RDS Proxy service may reduce that time (they claim by ~80%). The Cloud SQL HA failover diagram shows both instances sharing a virtual IP, which might mean no DNS lag?
CPU/memory upgrade: I think AWS can accomplish this with a single failover's worth of unavailability. It upgrades the standby instance (no unavailability), performs a failover, then upgrades the other instance.
On RDS, I think the two instances that are part of the HA setup are symmetric. So if you fail over to the standby, it's fine to leave it that way. There's no need (as far as RDS is concerned) to fail back to the original.
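If you want to reproduce the ~70 second number yourself, here is a rough sketch of forcing a planned failover on a Multi-AZ RDS instance with boto3 (the instance name is a placeholder; note this measures API-reported availability, not the DNS-related lag mentioned above):

```python
import time
import boto3

rds = boto3.client("rds", region_name="us-east-1")
instance_id = "my-multi-az-instance"  # placeholder

start = time.time()
# ForceFailover=True makes the reboot fail over to the standby instance.
rds.reboot_db_instance(DBInstanceIdentifier=instance_id, ForceFailover=True)

# Poll until RDS reports the instance as available again.
waiter = rds.get_waiter("db_instance_available")
waiter.wait(DBInstanceIdentifier=instance_id)
print(f"Available again after ~{time.time() - start:.0f}s")
```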
To answer your questions:
As you mentioned, the duration of the unavailability varies depending on whether it is a planned (manual) failover or an unplanned one. It's best to test by manually initiating a failover so you can see how long your instance takes to recover; usually it takes a minute or so. For unplanned failovers, the docs state that any existing connections to the primary instance and read replicas are closed, and it takes approximately 2-3 minutes for connections to be reestablished.
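For the planned case, a manual failover can be triggered with gcloud sql instances failover INSTANCE, or, as a hedged sketch, through the sqladmin API via google-api-python-client (project and instance names are placeholders):

```python
from googleapiclient import discovery

service = discovery.build("sqladmin", "v1beta4")
project, instance = "my-project", "my-ha-instance"  # placeholders

# The failover call requires the instance's current settingsVersion so
# that requests based on stale configuration are rejected.
settings = service.instances().get(project=project, instance=instance).execute()

service.instances().failover(
    project=project,
    instance=instance,
    body={
        "failoverContext": {
            "settingsVersion": settings["settings"]["settingsVersion"]
        }
    },
).execute()
```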
To address this question, you need to understand the requirements for your instance to allow failover:
The primary instance must be in a normal operating state (not stopped, undergoing maintenance, or performing a long-running Cloud SQL instance operation such as a backup, import or export operation).
That means failover is not available while you are upgrading your instance: changing hardware specs (CPU/memory) will incur downtime, so you should plan ahead when making these changes.
To understand the importance of failback, here's an excerpt from this link:
High availability solutions continuously replicate data to a remote site or cloud. In the event that a primary system goes down, the remote, secondary system can be spun up and users are rerouted. This process is commonly referred to as “failover,” and it reduces downtime to seconds or minutes.
However, failover isn’t a permanent state. Once primary servers are up and running, data and applications must be restored so normal operations can resume. This process is known as failback, and it is very important from a DR testing standpoint. Here’s why: Not all replication technology is created equally when it comes to failback. In some cases, failing back to production servers can be painfully slow.
UPDATE 1:
HA on Cloud SQL provisions the standby instance with the same specs as your primary; that's why you'll be billed double the price of a non-HA instance. Also, the importance of failback is not limited to any one cloud provider. It is simply good practice to make sure that all operations return to your primary instance instead of being left on a standby instance. In that sense, failback (on Cloud SQL specifically) is really necessary to make sure that everything is back to normal after an outage.
UPDATE 2:
If you don't fail back, what could happen is that when there's an outage in the zone where your standby instance is running (you can't control which zone your standby instance comes from), you won't be able to do a failover, as the operation will be blocked. (See the docs.)
Unfortunately, there's pretty much no way around it, as downtime is required whenever you change hardware. The procedure requires the instance to restart. Here's a link to see how long it would take.
Additional resources: https://severalnines.com/database-blog/achieving-mysql-failover-failback-google-cloud-platform-gcp
I have created a DocumentDB cluster in AWS with two instances running in it, but I need to know exactly how much storage will be used for storing the data, and also how AWS charges for one cluster.
When you provision an Amazon DocumentDB cluster, you don’t need to specify how much storage or I/Os you need for your cluster. Amazon DocumentDB uses a unique storage system that automatically scales from 10 GB up to 64 TB of data per cluster in 10 GB increments.
Storage is at the cluster level, which means all your instances share the storage. You can view how much storage you are using by monitoring the VolumeBytesUsed metric in the Monitoring tab of your Amazon DocumentDB console. Storage in DocumentDB is priced as low as $0.02 per GB/month (prices may vary across AWS regions). Details here - https://aws.amazon.com/documentdb/pricing/. To see how much you are paying for storage, you can also go to the AWS Billing console and view the details of your DocumentDB bill.
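If you'd rather script it than use the console, here is a sketch of pulling VolumeBytesUsed from CloudWatch with boto3 (the cluster identifier is a placeholder):

```python
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# VolumeBytesUsed is a cluster-level metric in the AWS/DocDB namespace.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/DocDB",
    MetricName="VolumeBytesUsed",
    Dimensions=[{"Name": "DBClusterIdentifier", "Value": "my-docdb-cluster"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Average'] / 1024**3:.1f} GiB")
```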
I want to enable storage autoscaling for the first time on an AWS RDS PostgreSQL instance.
Does anyone know, or have some documentation to clarify, whether this requires downtime? I can't find any articles or documentation explicitly about "enabling autoscaling for the first time".
Thanks in advance.
From Amazon RDS now supports Storage Auto Scaling:
RDS Storage Auto Scaling automatically scales storage capacity in response to growing database workloads, with zero downtime.
The message you highlight suggests that downtime would only be caused by other changes in the "pending modifications queue" (e.g. a requested change of instance type).
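For what it's worth, enabling it on an existing instance is just a matter of setting MaxAllocatedStorage (the autoscaling ceiling). A minimal boto3 sketch, with placeholder names and limits:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.modify_db_instance(
    DBInstanceIdentifier="my-postgres-instance",  # placeholder
    # Autoscaling ceiling in GiB; must exceed the current AllocatedStorage.
    MaxAllocatedStorage=500,
    # Setting MaxAllocatedStorage alone should not cause downtime.
    ApplyImmediately=True,
)
```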
I am looking at either setting up Aurora Postgresql or RDS Postgresql instance in AWS.
I would like the DB instance to be running in 2 different regions, and would like real-time replication to be set up. I would also like no downtime for rehydration/patching etc.
Based on what I have read and discussed with colleagues so far, I am under the impression that Aurora PostgreSQL is the option to choose, because RDS needs a few minutes of downtime for rehydration and Aurora supports real-time replication of the DB instance across different regions.
Is my understanding correct and are there any other factors that I should be aware of?
No RDS product supports "real-time" replication across regions. Cross-region replication is always asynchronous.
You can expect to see a higher level of lag time for any Read Replica that is in a different AWS Region than the source instance, due to the longer network channels between regional data centers.
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ReadRepl.html#USER_ReadRepl.XRgn
Additionally, cross-region replicas for Aurora/Postgres are not yet available.
Cross-region replicas are only available for Aurora/MySQL... but a cross-region replica is not for zero downtime or failover, anyway -- it's only for geo/latency-based read scale-out or disaster recovery, because once you promote the replica, the original master has to be abandoned, because replication is one-way.
If, when you said "region," you were actually referring to availability zones, then that is much more straightforward, since the backing store of Aurora instances is replicated across 3 availability zones within the region, and replication is synchronous. All replicas in a single region can be synchronous, even in different AZs, since they all share the same replicated storage.
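To make the asynchronous option concrete, here is a sketch of creating a cross-region read replica for a (non-Aurora) RDS instance with boto3; the ARN, regions, names, and instance class are placeholders:

```python
import boto3

# The client runs in the *destination* region; the source is given by ARN.
rds = boto3.client("rds", region_name="eu-west-1")

rds.create_db_instance_read_replica(
    DBInstanceIdentifier="mydb-replica",
    SourceDBInstanceIdentifier="arn:aws:rds:us-east-1:123456789012:db:mydb",
    SourceRegion="us-east-1",  # lets boto3 pre-sign the cross-region request
    DBInstanceClass="db.r5.large",
)
```

Once the replica is up, the ReplicaLag CloudWatch metric shows how far it drifts behind the source.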
Is there some kind of native Postgres tool they use, or is it a custom one? Are the replicas always in sync or do they drift apart from time to time?
With Multi-AZ RDS, replication is synchronous. And since AWS likes to be in full control of their software, it's most likely a customised replication mechanism (but I couldn't tell you for sure).
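While the replication engine itself is opaque, you can at least confirm that synchronous Multi-AZ is active and see where the standby lives. A small boto3 sketch, with a placeholder instance name:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

db = rds.describe_db_instances(
    DBInstanceIdentifier="my-multi-az-instance"  # placeholder
)["DBInstances"][0]

print("MultiAZ enabled:", db["MultiAZ"])
print("Primary AZ:", db["AvailabilityZone"])
print("Standby AZ:", db.get("SecondaryAvailabilityZone"))
```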