AWS Redshift Snapshot Restore - Benchmark - amazon-redshift

Any benchmark available on restoring X TB snapshot from a N node Redshift cluster to another N node Redshift cluster?
For e.g. a 10 TB snapshot from a 12 node Redshift cluster takes 1 day to restore in another 12 node Redshift cluster? is it more like 1 day for the above or 1 week?
We are trying to get a benchmark.
Thank you

When you restore from a snapshot, Amazon Redshift creates a new cluster and makes the new cluster available before all of the data is loaded, so you can begin querying the new cluster immediately. The cluster streams data on demand from the snapshot in response to active queries, then loads the remaining data in the background.
"Amazon Redshift Snapshots"
The new cluster (restore target) becomes available very quickly. You can start running all queries and Redshift will retrieve any missing data from the snapshot.
Total time taken to finish restoring all data from the snapshot depends on multiple factors:
Is the cluster being heavily used during the restore? A cluster with no usage will complete the restore somewhat more quickly.
What is the cluster size and node type? Larger nodes and larger clusters will have more network resources.
Does the cluster use Enhanced VPC routing? This can improve network bandwidth further. https://docs.aws.amazon.com/redshift/latest/mgmt/enhanced-vpc-enabling-cluster.html

Related

how to setup replication instance in on premises postgres for master database in AWS RDS postgres?

I have a requirement of checking whether the exact copy of master database from AWS RDS can be created in on premises or not..
I have already established the connectivity between on prem and aws. Also checked the data migration using pg dump. But i am not getting how to create the replica without using DMS. Due to some security purpose we are not supposed to use DMS. So is there any other way out to implement thi ?
Any help will be much appreciated
It appears that your goal is disaster recovery.
Amazon RDS offers a few options for this:
Amazon RDS Snapshots are a backup of the database, stored in a region. If your database is in an Availability Zone that fails, the snapshot can be restored as a new database in another AZ. All AZs are physically separate data centers, much like your own data center is physically separate from an AWS data center.
Snapshots can also be copied to other Regions, which would guarantee a separation distance between data centers.
Multi-AZ Amazon RDS Databases keep a second copy of the data in another AZ and can switch-over to the alternate AZ without losing any data. This is faster than restoring a snapshot, but costs twice as much since two separate database servers are deployed.
These options would be easier to manage than replicating your data to an on-premises system. A Multi-AZ will automatically start the secondary instance, so your app can continue operating with only a short delay and no data loss. This is much better than you could offer if you fail-over to an on-premises system.

CockroachDB snapshot backups in Kubernetes

I am trying to take snapshot backups with Velero in Kubernetes of a 12 node test CockroachDB cluster with Velero such that, if the cluster failed, we could rebuild the cluster and restore the cockroachdb from these snapshots.
We're using Velero to do that and the snapshot and restore seems to work, but on recovery, we seem to have issues with CockroachDB losing ranges.
Has anyone gotten snapshot backups to work with CockroachDB with a high scale database? (Given the size of the dataset, doing dumps or restores from dumps is not viable.)
Performing backups of the underlying disks while CockroachDB nodes are running is unlikely to work as expected.
The main reason is that even if a persistent disk snapshot is atomic, there is no way to ensure that all disks are captured at the exact same time (time being defined by CockroachDB's consistency mechanism). The restore would contain data with replicas across nodes at different commit indices, resulting in data loss or loss of quorum (show in the Admin UI as "unavailable" ranges).
You have a few options (in order or convenience):
CockroachDB BACKUP which has all nodes write data to external storage (S3, GCS, etc...). Before version 20.2, this is only available with an enterprise license.
SQL dump which is impractical for large datasets
stop all nodes, snapshot all disks, startup all nodes again. warning: this is something we have used to quickly load testing datasets but have not used it in production environments.

Redshift restoring from snapshot - what the cluster status really means about the cluster?

Everyday, my team shuts down the company redshift cluster, on evening, to save money.
We create a snapshot of that cluster to restore it on the next working day.
When we restore the cluster, it takes some time to move from creating status to available, restoring status, however, what exactly available, restoring means? What kind of queries can I execute on my cluster when it stills restoring from snapshot?
I'm asking this because we have some queries to execute right after the creation of the cluster, and we are executing them on the available, restoring status. Should I wait until the restoring finishes?
From Amazon Redshift Snapshots:
When you restore from a snapshot, Amazon Redshift creates a new cluster and makes the new cluster available before all of the data is loaded, so you can begin querying the new cluster immediately. The cluster streams data on demand from the snapshot in response to active queries, then loads the remaining data in the background.
This is really neat, because it means you can start querying the data before it is fully loaded and Redshift will prioritize loading that data to fulfil your query!
Therefore, it would seem that:
Creating means the cluster is being created
Restoring means that the snapshot is being restored
Available means the restore is complete and the cluster is running

Taking EBS snapshot for multiple mongo node EBS volumes in mongoDB cluster

I have journal and data both on the same volume for a mongoDB shard, so the consistency problem of taking snapshots only after locking using fsyncLock is not needed. An EBS snapshot would be consistent point in time for a single shard.
I would like to know what is the preferred way of taking backups in mongodb cluster. I have explored two options:
Approximate point in time consistent backup by taking the EBS snapshots around the same time. Advantage being, no write lock needs to be taken.
Stop writes on the system, then take snapshots. This would give point in time consistent backup.
Now, I'd like to know how is it actually done in production. I've read about replica set's secondary node being used, but not clear how it gives point in time consistent backup. Unless all the secondary nodes have a consistent point in time data, the EBS snapshot cannot be point in time. For example, what if for a secondary for NodeA, data is synced with primary, but some data for secondary for NodeB is not. Am I missing something here?
Also, can if ever happen that approach 1 leads to inconsistent MongoDB cluster (when restored), such that is crashes or stuff?
Consistent backups
The first steps in any sharded cluster backup procedure should be to:
Stop the balancer (including waiting for any migrations in progress to complete). Usually this is done with the sh.stopBalancer() shell helper.
Backup a config server (usually with the same method as your shard servers, so EBS or filesystem snapshot)
I would define a consistent backup of a sharded cluster as one where the sharded cluster metadata (i.e. the data stored on your config servers) corresponds with the backups for the individual shards, and each of the individual shards has been correctly backed up. Stopping the balancer ensures that no data migrations happen while your backup is underway.
Assuming your MongoDB data and journal files are on a single volume, you can take a consistent EBS snapshot or filesystem snapshot without stopping writes to the node you are backing up. Snapshots occur asynchronously. Once an initial snapshot has been created, successive snapshots are incremental (only needing to update blocks that have changed since the previous snapshot).
Point-in-time backup
With an active sharded cluster, you can only easily capture a true point-in-time backup of data that has been written by stopping all writes to the cluster and backing up the primaries for each shard. Otherwise, as you have surmised, there may be differing replication lag between shards if you backup from secondaries. It's more common to backup from secondaries as there is some I/O overhead while the snapshots are written.
If you aren't using replication for your shards (or prefer to backup from primaries) the replication lag caveat doesn't apply, but the timing will be still be approximate for an active system as the snapshots need to be started simultaneously across all shards.
Point-in-time restore
Assuming all of your shards are backed by replica sets it is possible to use an approximate point-in-time consistent backup to orchestrate a restore to a more specific point-in-time using the replica set oplog for each of the shards (plus a config server). This is essentially the approach taken by backup solutions such as MongoDB Cloud Manager (née MMS): see MongoDB Backup for Sharded Cluster. MongoDB Cloud Manager leverages backup agents on each shard for continuous backup using the replication oplog, and periodically creates full snapshots on a schedule. Point-in-time restores can be built by starting from a full data snapshot and then replaying the relevant oplogs up to a requested point-in-time.
What's the common production approach?
Downtime is generally not a desirable backup strategy for a production system, so the common approach is to take a consistent backup of a running sharded cluster at an approximate point-in-time using snapshots. Coordinating backup across a sharded cluster can be challenging, so backup tools/services are also worth considering. Backup services can also be more suitable if your deployment doesn't allow snapshots (for example, if your data and/or journal directories are spread across multiple volumes to maximise available IOPS).
Note: you should really, really consider using replication for your production deployment unless this is a non-essential cluster or downtime is acceptable. Replica sets help maximise uptime & availability for your deployment and some maintenance tasks (including backup) will be much more impactful without data redundancy.
Your backup will be divided into multiple phases:
Stop the balancer on the mongos with sh.stopBalancer()
You can backup now the config database of the config servers. Does not matter whether you do it using EBS snapshots or mongodump --oplog
Now the shards and you can decide which way:
Either: You backup every node with mongodump --oplog. You do not need to stop writes since you're snapshotting the oplog together with the database export. This backup allows a consistent restore. When restoring, you can use the --oplogReplay and the --oplogLimit options to specify a timestamp (assuming your oplog is sized appropriately and did not roll over during backup). You can perform a dump on all shards in parallel and by the restore is synchronized by the oplog.
Or you fsync and lock and create an EBS snapshot (described http://docs.mongodb.org/ecosystem/tutorial/backup-and-restore-mongodb-on-amazon-ec2/) for every shard. MongoDB 3.0 cannot guarantee when using WiredTiger that the data files do not change. The cost here is, that you're required to stop all reads and writes since you have to unmount the device.
Now start the balancer on the mongos with sh.startBalancer()
Since you do not use replica sets, you have no hassle with lagging secondaries/a write is not replicated throughout the cluster. My favorite option is using mongodump/mongorestore which give a lot of control over the restore.
Update:
In the end, you've to decide, what you want to pay to get certain benefits:
Snapshots: Pay with space, write lock and a certain level of consistency to get fast backups, fast restore times and, not impacting performance after backup
Dumping: Pay with time and ousting the working set during backup to get smaller backups for consistent and slower restores, no write locks

MongoDB on Amazon SSD-backed EC2

We have mongodb sharded cluster currently deployed on EC2 instances in Amazon. These shards are also replica sets. The instances used are using EBS with IOPS provisioned.
We have about 30 million documents in a collection. Our queries count the whole collection that matches the filters. We have indexes on almost all of the query-able fields. This results to the RAM reaching 100% usage. Our working set exceeds the size of the RAM. We think that the slow response of our queries are caused by EBS being slow so we are thinking of migrating to the new SSD-backed instances.
C3 is available
http://aws.typepad.com/aws/2013/11/a-generation-of-ec2-instances-for-compute-intensive-workloads.html
I2 is coming soon
http://aws.typepad.com/aws/2013/11/coming-soon-the-i2-instance-type-high-io-performance-via-ssd.html
Our only concern is that SSD is ephemeral, meaning the data will be gone once the instance stops, terminates, or fails. How can we address this? How do we automate backups. Is it a good idea to migrate to SSD to improve the performance of our queries? Do we still need to set-up a sharded cluster?
Working with the ephemeral disks is a risk but if you have your replication setup correctly it shouldn't be a huge concern. I'm assuming you've setup a three node replica set correct? Also you have three nodes for your config servers?
I can speak of this from experience as the company I'm at has been setup this way. To help mitigate risk I'm moving towards a backup strategy that involved a hidden replica. With this setup I can shutdown the hidden replica set and one of the config servers (first having stopped balancing) and take a complete copy of the data files (replica and config server) and have a valid backup. If AWS went down on my availability zone I'd still have a daily backup available on S3 to restore from.
Hope this helps.