Migrate radosgw data to a new pool - ceph

I have a large (2.2PB, ~6 billion files) EC pool used by radosgw. It is still under very heavy use by users for reading and writing. However, I want to start using a new Ceph pool and migrate all the data from the existing pool to the new one.
Unfortunately, taking downtime is not an option: time-sensitive data is actively stored and retrieved from this pool, and the downtime needed to migrate this much data would likely run to days.
Is there a way to migrate this pool without significant downtime? The closest discussion I found is in this thread, where they talk about setting up a new default pool, but it doesn't have enough detail for my skill level on how to implement the solution.
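For context, my reading of that thread is that the suggested approach is to add a new RGW placement target backed by the new pool, make it the default so newly written objects land there, and then rewrite the existing objects in the background. A rough sketch of what I think that looks like (pool, placement and EC profile names are placeholders, and I have not verified these steps against Octopus):
# Create and tag the new data pool (PG counts and EC profile are placeholders)
ceph osd pool create default.rgw.buckets.data-new 1024 1024 erasure my-ec-profile
ceph osd pool application enable default.rgw.buckets.data-new rgw
# Add a placement target that points at the new data pool
radosgw-admin zonegroup placement add --rgw-zonegroup default --placement-id new-placement
radosgw-admin zone placement add --rgw-zone default --placement-id new-placement \
    --data-pool default.rgw.buckets.data-new \
    --index-pool default.rgw.buckets.index \
    --data-extra-pool default.rgw.buckets.non-ec
# Make it the default placement for newly created buckets, then restart the radosgw daemons
radosgw-admin zonegroup placement default --rgw-zonegroup default --placement-id new-placement
As far as I understand, this only affects newly created buckets; existing objects would still have to be rewritten (for example by copying them onto themselves via S3) before the old pool can be retired, which is the part I am least sure about.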
Any instructions on how I could accomplish this migration would be appreciated.
OS Release: Ubuntu 20.04
Ceph Release: Octopus

Related

RDS Serverless - Could not verify and start postgres

For the last few days, I've been having a weird issue with my Serverless Postgres RDS.
After deploying new code to the backend service, the RDS instance becomes unavailable. The only evidence I could find is this:
[CloudWatch chart: Freeable Memory (MB)]
The only document I found is this one, which says AWS is working on fixing this issue.
Any help will be much appreciated.
As per the AWS Blog on RDS serverless best practices:
Aurora Serverless scales up when capacity constraints are seen in CPU or connections. However, finding a scaling point can take time (see the Scale-blocking operations section). If there is a sudden spike in requests, you can overwhelm the database. Aurora Serverless might not be able to find a scaling point and scale quickly enough due to a shortage of resources.
The error "Error restarting database: Unable to find shared memory value in the postgres.log file from pg_ctl getSharedMemory command" essentially points to a memory allocation issue.
The best way to handle it is to keep a buffer by running with a higher minimum capacity allocation when you expect load on the server.
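For example, on an Aurora Serverless v1 cluster the minimum capacity can be raised ahead of an expected load spike with the AWS CLI. This is only a sketch; the cluster identifier and ACU values are placeholders:
# Keep at least 4 ACUs warm and disable auto-pause so deployments never hit a cold, minimally sized cluster
aws rds modify-db-cluster \
    --db-cluster-identifier my-serverless-cluster \
    --scaling-configuration MinCapacity=4,MaxCapacity=16,AutoPause=false \
    --apply-immediately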

How to downsize an AWS RDS instance to free tier

I want to create a free tier clone of a production AWS RDS PostgreSQL instance. As per my understanding, the following are the different ways:
1. create a snapshot of the production DB and restore it on a t2.micro
2. create a read replica of the production DB using a t2.micro and then detach it as an independent database
3. create a free tier database and restore a database dump of the production DB
Option 3 is my last preference.
The problem is that while creating a read replica or restoring from a snapshot, AWS doesn't explicitly let you choose the free tier template. I just want to know whether restoring to a t2.micro without any advanced features like autoscaling, performance monitoring etc. is equivalent to free tier or not. I read here that the key thing with an AWS production DB is that AWS provisions a secondary database to fall back on in the event of a failure of the primary database or of the Availability Zone in which the database is running.
AWS Free Tier doesn't actually care about the kind of service you use. Per their website you just get 750 instance hours per month for a db.t2.micro.
You can use these in any service you see fit and the discount will be applied automatically for the first 12 months.
Looking at the pricing page for RDS Postgres I can see that these instances aren't listed anymore, which seems weird. The t2 instance family is fairly old, so they're probably trying to phase it out, but typically you can provision older instance types using the API directly if they're not available in the Console.
So what you want to do is create your db.t2.micro instance using one of the SDKs or the AWS CLI and restore from a snapshot. Alternatively, you can create a read replica from the CLI and set the instance class to db.t2.micro; later, promoting (detaching) it from the source instance should work.
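A sketch of both routes with the AWS CLI (identifiers are placeholders; confirm that db.t2.micro is still accepted in your region before relying on this):
# Route 1: restore the production snapshot onto a single-AZ db.t2.micro
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier my-free-tier-clone \
    --db-snapshot-identifier my-prod-snapshot \
    --db-instance-class db.t2.micro \
    --no-multi-az
# Route 2: create a read replica on a db.t2.micro, then promote it to a standalone instance
aws rds create-db-instance-read-replica \
    --db-instance-identifier my-free-tier-replica \
    --source-db-instance-identifier my-prod-db \
    --db-instance-class db.t2.micro
aws rds promote-read-replica --db-instance-identifier my-free-tier-replica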
The production ready stuff refers to the Multi-AZ deployment, which is good for production use, but for anything production related a t2.micro seems like a bad choice, so I'm going to assume you're not planning to do that.

MongoDB Atlas - Replica Set Has No Primary

I'm fairly new to MongoDB (Atlas - free tier), where I have created a project using it for storing my data. I had it set up and working fine for a couple of weeks, when suddenly I received an email with: An alert is open for your Atlas project: Replica set has no primary. I have no idea what this means and I don't believe I have done anything in the last couple of days/weeks that could warrant this alert. However, after checking my project, it seems that I can no longer connect to my cluster and access my data.
After checking on MongoDB Cloud, it seems that my cluster has stopped working and only the secondary shard (don't know if this is the right terminology) is running, while the other two seem to be down. Can anyone explain what this means, why it is happening or how to fix it? Thanks.
To troubleshoot issues like this, read the server logs and act based on the information therein.
For free and shared tiers in Atlas the logs are apparently not available. Therefore:
For a free tier cluster (M0), delete this cluster and create a new one. If you don't have a backup, you should still be able to take a dump via a direct connection to one of the operational secondary nodes, or by using the secondary read preference (see the sketch after this list).
For a shared tier cluster (M2/M5), use the official MongoDB support channels for assistance.
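A sketch of such a dump with mongodump; the host, credentials and database name are placeholders:
# Read from a secondary, since the replica set currently has no primary
# (alternatively, point --host directly at one of the reachable secondary nodes)
mongodump \
    --uri "mongodb+srv://user:password@cluster0.example.mongodb.net/mydb" \
    --readPreference secondary \
    --out ./atlas-dump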

Artifactory Upgrade from 4 to 6 - SHA256 re-indexing takes very long

I am upgrading Artifactory Pro from 4.12.2 to 6.5.2. On my test instance, with around 12k artifacts, the re-indexing of the database after the upgrade takes around 12 hours. I'm afraid that on my prod instance (around 800k artifacts) it will take close to a month.
- Has anyone seen this before? I did not find any articles that would indicate such a long time
- Is there a way to tune parameters to speed up the indexing?
- Is there a way to predict how much time my prod indexing will take? Is it based on the number/size/type of artifacts?
Specifications:
Artifactory in HA mode, installed on a Linux server. DB: MSSQL 2016. Filestore: NAS shared mount between the HA nodes. Upgrading from 4.12.2 to 6.5.2.
Firstly, the fact that you run Artifactory with HA suggests you have an Enterprise subscription. Please note that an Enterprise subscription allows you to contact the JFrog support team at support#jfrog.com.
Now, with regards to your scenario and questions: JFrog mentions in their wiki page on the SHA-256 migration that the process is a "resource intensive operation".
As you suspected, the process can indeed take weeks or months for big filestores, but it can be tuned using the system properties mentioned on the same wiki page.
If you do decide to tweak it, for example by adding more workers, I highly recommend monitoring your database closely; if the SHA-256 migration impacts database performance, your production traffic can be affected.
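For reference, these properties live in $ARTIFACTORY_HOME/etc/artifactory.system.properties on each node and require a restart to take effect. The names below are the ones documented for the SHA-256 migration, but double-check them against the wiki page for your exact version; the worker count is only an example:
# Enable the SHA-256 migration job and raise the number of worker threads
artifactory.sha2.migration.job.enabled=true
artifactory.sha2.migration.job.queue.workers=5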
As for statistics, there is no way to predict how long this process will take, as it depends on the number of workers you specify for the job, the resources the instance has, etc.
I hope this clarifies further.

AWS RDS with Postgres: Is the OOM killer configured?

We are running a load test against an application that hits a Postgres database.
During the test, we suddenly get an increase in error rate.
After analysing the platform and application behaviour, we notice that:
CPU of Postgres RDS is 100%
Freeable memory drops on this same server
And in the postgres logs, we see:
2018-08-21 08:19:48 UTC::#:[XXXXX]:LOG: server process (PID XXXX) was terminated by signal 9: Killed
After investigating and reading documentation, it appears one possibility is that the Linux OOM killer killed the process.
But since we're on RDS, we cannot access the system logs (/var/log/messages) to confirm.
So can somebody:
confirm that the OOM killer really runs on AWS RDS for Postgres
give us a way to check this?
give us a way to compute the maximum memory used by Postgres based on the number of connections?
I didn't find the answer here:
http://postgresql.freeideas.cz/server-process-was-terminated-by-signal-9-killed/
https://www.postgresql.org/message-id/CAOR%3Dd%3D25iOzXpZFY%3DSjL%3DWD0noBL2Fio9LwpvO2%3DSTnjTW%3DMqQ%40mail.gmail.com
https://www.postgresql.org/message-id/04e301d1fee9%24537ab200%24fa701600%24%40JetBrains.com
AWS maintains a page with best practices for their RDS service: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_BestPractices.html
In terms of memory allocation, this is their recommendation:
An Amazon RDS performance best practice is to allocate enough RAM so that your working set resides almost completely in memory. To tell if your working set is almost all in memory, check the ReadIOPS metric (using Amazon CloudWatch) while the DB instance is under load. The value of ReadIOPS should be small and stable. If scaling up the DB instance class—to a class with more RAM—results in a dramatic drop in ReadIOPS, your working set was not almost completely in memory. Continue to scale up until ReadIOPS no longer drops dramatically after a scaling operation, or ReadIOPS is reduced to a very small amount.
For information on monitoring a DB instance's metrics, see Viewing DB Instance Metrics.
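For example, a quick way to pull ReadIOPS for the load-test window from the CLI (the instance identifier and time range are placeholders):
# Average ReadIOPS in 5-minute buckets while the instance was under load
aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name ReadIOPS \
    --dimensions Name=DBInstanceIdentifier,Value=my-postgres-instance \
    --start-time 2018-08-21T08:00:00Z \
    --end-time 2018-08-21T09:00:00Z \
    --period 300 \
    --statistics Average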
Also, this is their recommendation for troubleshooting possible OS issues:
Amazon RDS provides metrics in real time for the operating system (OS) that your DB instance runs on. You can view the metrics for your DB instance using the console, or consume the Enhanced Monitoring JSON output from Amazon CloudWatch Logs in a monitoring system of your choice. For more information about Enhanced Monitoring, see Enhanced Monitoring.
There are a lot of good recommendations there, including query tuning.
Note that, as a last resort, you could switch to Aurora, which is compatible with PostgreSQL:
Aurora features a distributed, fault-tolerant, self-healing storage system that auto-scales up to 64TB per database instance. Aurora delivers high performance and availability with up to 15 low-latency read replicas, point-in-time recovery, continuous backup to Amazon S3, and replication across three Availability Zones.
EDIT: Talking specifically about your issue with PostgreSQL, check this Stack Exchange thread; they had a long connection with auto commit set to false.
We had a long connection with auto commit set to false:
connection.setAutoCommit(false)
During that time we were doing a lot of small queries and a few queries with a cursor:
statement.setFetchSize(SOME_FETCH_SIZE)
In JDBC you create a connection object, and from that connection you create statements. When you execute the statements you get a result set. Now, every one of these objects needs to be closed, but if you close the statement, the result set is closed, and if you close the connection, all the statements and their result sets are closed.
We were used to short-lived queries with connections of their own, so we never closed statements, assuming the connection would take care of things once it was closed.
The problem was now with this long transaction (~24 hours), which never closed the connection, so the statements were never closed either. Apparently, the statement object holds resources both on the server that runs the code and on the PostgreSQL database. My best guess as to what resources are left in the DB is those related to the cursor: the statements that used the cursor were never closed, so the result sets they returned were never closed either. This meant the database didn't free the relevant cursor resources, and since the cursor was over a huge table, it took a lot of RAM.
Hope it helps!
TLDR: If you need PostgreSQL on AWS and you need rock-solid stability, run PostgreSQL on EC2 (for now) and do some kernel tuning for overcommitting.
I'll try to be concise, but you're not the only one who has seen this and it is a known (internal to Amazon) issue with RDS and Aurora PostgreSQL.
OOM Killer on RDS/Aurora
The OOM killer does run on RDS and Aurora instances because they are backed by Linux VMs, and the OOM killer is an integral part of the kernel.
Root Cause
The root cause is that the default Linux kernel configuration assumes that you have virtual memory (a swap file or partition), but EC2 instances (and the VMs that back RDS and Aurora) do not have swap configured by default: there is a single partition and no swap file is defined. When Linux thinks it has virtual memory, it uses a strategy called "overcommitting", which means that it allows processes to request, and be granted, more memory than the amount of RAM the system actually has. Two tunable parameters govern this behavior:
vm.overcommit_memory - governs whether the kernel allows overcommitting (0=yes=default)
vm.overcommit_ratio - when overcommitting is disabled (mode 2), the percentage of physical RAM, on top of swap, that the kernel allows to be committed. If you have 8GB of RAM, 8GB of swap and vm.overcommit_ratio = 75, the kernel will grant up to 8 + 0.75 * 8 = 14GB of memory to processes.
We set up an EC2 instance (where we could tune these parameters) and the following settings completely stopped PostgreSQL backends from getting killed:
vm.overcommit_memory = 2
vm.overcommit_ratio = 75
vm.overcommit_memory = 2 tells Linux not to overcommit (work within the constraints of system memory), and vm.overcommit_ratio = 75 tells Linux not to grant requests for more than 75% of memory (only allow user processes to get up to 75% of memory).
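On a self-managed EC2 instance (not possible on RDS or Aurora, where you have no shell access) these settings can be applied roughly as follows:
# Apply immediately (lost on reboot)
sudo sysctl -w vm.overcommit_memory=2
sudo sysctl -w vm.overcommit_ratio=75
# Persist across reboots
printf 'vm.overcommit_memory = 2\nvm.overcommit_ratio = 75\n' | sudo tee /etc/sysctl.d/99-overcommit.conf
sudo sysctl --system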
We have an open case with AWS and they have committed to coming up with a long-term fix (using kernel tuning params or cgroups, etc) but we don't have an ETA yet. If you are having this problem, I encourage you to open a case with AWS and reference case #5881116231 so they are aware that you are impacted by this issue, too.
In short, if you need stability in the near term, use PostgreSQL on EC2. If you must use RDS or Aurora PostgreSQL, you will need to oversize your instance (at additional cost to you) and hope for the best as oversizing doesn't guarantee you won't still have the problem.