I recently started using Google Cloud SQL as the backend for an API. Today, while I was adding a new IP address to the access control list, the database went down. I restarted it after a while, and when it came back up several tables of data had been lost.
I have it set to use the more reliable method for writes and cannot see how this could happen.
How is this possible?
You may be using the wrong storage engine. You should use InnoDB rather than MyISAM. The Google Cloud SQL FAQ states:
Warning: Using the MyISAM storage engine could result in data loss in some situations, such as unclean shutdowns. For more information, see Corrupted MyISAM Tables.
https://developers.google.com/cloud-sql/faq#innodb
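To check which engine each table uses and prepare the conversion, something like the following may help. It is a minimal sketch, not an official procedure: the helper name `innodb_conversion_statements` and the sample rows are made up for illustration, and it only builds SQL strings, so you would run the listing query and the generated statements with whatever MySQL client you already use.

```python
# Query that lists every user table together with its storage engine.
# information_schema.TABLES and its ENGINE column are standard MySQL.
LIST_ENGINES_SQL = (
    "SELECT TABLE_SCHEMA, TABLE_NAME, ENGINE "
    "FROM information_schema.TABLES "
    "WHERE TABLE_SCHEMA NOT IN "
    "('mysql', 'information_schema', 'performance_schema', 'sys')"
)


def innodb_conversion_statements(tables):
    """Build one ALTER TABLE statement per MyISAM table.

    `tables` is an iterable of (schema, table, engine) tuples, i.e. the
    rows returned by LIST_ENGINES_SQL. (Hypothetical helper.)
    """
    statements = []
    for schema, table, engine in tables:
        # ENGINE can be NULL for views, so guard against None.
        if engine and engine.lower() == "myisam":
            statements.append(
                f"ALTER TABLE `{schema}`.`{table}` ENGINE=InnoDB;"
            )
    return statements


# Example rows (made up): one MyISAM table and one InnoDB table.
rows = [("api", "users", "MyISAM"), ("api", "orders", "InnoDB")]
for statement in innodb_conversion_statements(rows):
    print(statement)
```

Take a backup before running the generated statements, since converting a table rewrites it.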
Related
I would like to build a real-time change data capture (CDC) pipeline, preferably log-based, from Google Cloud Spanner to PubSub/Kafka for my downstream real-time applications. Is there a reliable and cost-effective way to do that? I would appreciate any advice or recommendations.
In addition, I noticed that Google's Cloud Data Fusion can replicate from MySQL/PostgreSQL to Cloud Spanner in real time, but I did not find a way to go from Cloud Spanner to PubSub/Kafka in real time.
I also found two other approaches, listed here for comments or suggestions:
Use Debezium, a log-based change data capture Kafka connector: https://cloud.google.com/architecture/capturing-change-logs-with-debezium#deploying_debezium_on_gke_on_google_cloud
Create a polling service (which may miss some data) that polls Cloud Spanner: https://cloud.google.com/architecture/deploying-event-sourced-systems-with-cloud-spanner
If you have any suggestion or comment on this, I will be really grateful.
There's an open-source implementation of a polling service for Cloud Spanner that can also automatically push changes to PubSub here: https://github.com/cloudspannerecosystem/spanner-change-watcher
It is however not log-based. It has some inherent limitations:
It can miss updates if the same record is updated twice within the polling interval. In that case, only the last value will be reported.
It only supports soft deletes.
You could have a look at the samples to see if it is something that might suit your needs at least to some degree: https://github.com/cloudspannerecosystem/spanner-change-watcher/tree/master/samples
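To see why polling has these limitations, here is a small in-memory simulation. It is only a sketch of the general technique (poll for rows whose commit timestamp is newer than the last one seen); the `Table` and `poll` names are made up for illustration and this is not the spanner-change-watcher API.

```python
from dataclasses import dataclass


@dataclass
class Row:
    value: str
    commit_ts: int  # stands in for a commit timestamp column


class Table:
    """Toy table that keeps only the latest version of each row."""

    def __init__(self):
        self.rows = {}

    def write(self, key, value, ts):
        # An update overwrites the previous version; no history is kept.
        self.rows[key] = Row(value, ts)


def poll(table, last_seen_ts):
    """Return rows changed since last_seen_ts and the new high-water mark."""
    changes = sorted(
        (
            (key, row.value, row.commit_ts)
            for key, row in table.rows.items()
            if row.commit_ts > last_seen_ts
        ),
        key=lambda change: change[2],
    )
    new_ts = changes[-1][2] if changes else last_seen_ts
    return changes, new_ts


table = Table()
# Two updates to the same row land inside one polling interval.
table.write("order-1", "CREATED", ts=1)
table.write("order-1", "PAID", ts=2)

changes, last_seen = poll(table, last_seen_ts=0)
print(changes)  # [('order-1', 'PAID', 2)]  -> the CREATED state was never seen

# A hard delete removes the row, so the next poll has nothing to report.
del table.rows["order-1"]
print(poll(table, last_seen))  # ([], 2)
```

The second part is why only soft deletes work: the row has to stay in the table, flagged as deleted and with a fresh timestamp, for a poller to notice it.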
Cloud Spanner has a new feature called Change Streams that would allow building a downstream pipeline from Spanner to PubSub/Kafka.
At this time, there is no pre-packaged Spanner to PubSub/Kafka connector.
Currently, you can read change streams either with the SpannerIO Apache Beam connector, which lets you build the pipeline with Dataflow, or by querying the API directly.
Disclaimer: I'm a Developer Advocate that works with the Cloud Spanner team.
I am new to Google Cloud and was wondering whether it is possible to run a PostgreSQL container on Cloud Run with PostgreSQL's data_directory pointed at Cloud Storage.
If it is possible, could you point me to some tutorials or guides on this topic? Also, what are the downsides of this approach?
Edit-0: Just to clarify what I am trying to achieve:
I am learning Google Cloud and want to write a simple application to go along with it. I have decided that the backend code will run as a container on Cloud Run and the persistent data (i.e. the database file) will reside on Cloud Storage. Because this is a small app for learning purposes, I am trying to use as few moving parts as possible on the backend (and ones that are always free). Both PostgreSQL and the backend code will reside in the same container, except for the actual data file, which will reside on Cloud Storage. Is this approach correct? Are there better approaches that achieve the same minimalism?
Edit-1: Okay, I got the answer! The Google documentation here mentions the following:
"Don't run a database over Cloud Storage FUSE!"
Buckets are not meant to store database files. Some of the limits are the following:
There is no limit to writes across multiple objects, which includes uploading, updating, and deleting objects. Buckets initially support roughly 1000 writes per second and then scale as needed.
There is no limit to reads of objects in a bucket, which includes reading object data, reading object metadata, and listing objects. Buckets initially support roughly 5000 object reads per second and then scale as needed.
One alternative that gives your PostgreSQL database a separate persistent disk is Google Compute Engine. You can follow the community tutorial “How to Set Up a New Persistent Disk for PostgreSQL Data”.
I have added logging for all DDL and DML queries on my Google Cloud PostgreSQL database instance.
The queries are logged, but I want to avoid query params or column values from appearing in the logs.
Is there a way to prevent this sensitive data from being logged?
Currently, the entire query is logged as a single string, with commands and values.
I looked for a setting that can be enabled through a database flag on Google Cloud, since I can't access the postgresql.conf file on Google Cloud Platform.
We set up our web services and database on AWS a while back, and the application is now in production. For various reasons, we need to shut down the old AWS account and move everything to a newly created one. The application and infrastructure are pretty straightforward to move; the data is trickier. The current database still receives a lot of data daily, so it is best to migrate the data after we turn off the old application and switch on the new platform.
Both the source and target RDS instances are Postgres. We have about 40 GB of data to transfer. There are three approaches I could think of, and they all have drawbacks.
1. Take a snapshot of the source RDS instance and restore it in the target account. The problem is that I don't need to transfer all the data; records after 10/01 are probably enough. Also, a snapshot works best when restored into a newly created, empty RDS instance. In our case, the new RDS instance will already be receiving data after the cutoff, and only then will the data be transferred from the old account to the new one; otherwise we would lose data.
2. Dump the tables from the old RDS instance and restore them into the new one. This has the same problems as #1. Also, if I dump the data to a local machine and then restore from there, network speed is the bottleneck.
3. Export table data to CSV files and import them into the new RDS instance. The advantage is that this method allows picking and choosing, and some data cleaning as well. But it takes forever to export a big fact table to a local CSV file. Another problem is that some tables have surrogate row IDs that are serial (auto-incrementing), so the row IDs in the exported CSV may conflict with existing data in the new RDS tables.
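For the row ID conflict in the CSV approach, one possible workaround is to shift the exported IDs by a fixed offset before importing. The following is only a sketch under assumptions: the function name `remap_ids`, the column names, and the offset value are made up, and any foreign key columns that reference the shifted IDs would need the same offset applied.

```python
import csv
import io


def remap_ids(csv_text, id_column, offset):
    """Return the CSV with `offset` added to every value in `id_column`.

    Hypothetical helper. Choose an offset large enough that the shifted
    IDs cannot collide with IDs already present in the target table.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        row[id_column] = str(int(row[id_column]) + offset)
        writer.writerow(row)
    return out.getvalue()


# Made-up export with a serial "id" column.
exported = "id,amount\n1,10\n2,20\n"
print(remap_ids(exported, "id", 1000))
# id,amount
# 1001,10
# 1002,20
```

After the import, the sequence behind the serial column in the target table would also have to be moved past the highest imported ID (for example with PostgreSQL's `setval`), otherwise later inserts can run into the shifted range.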
I wonder if there is a better way to do it. Maybe AWS has an ETL tool that does a direct point-to-point transfer without using a local computer as the middle point.
In 2022 the simplest way to achieve this is to use AWS Database Migration Service (AWS DMS).
Create a source endpoint for the original database and a target endpoint for the new database.
Next, create a migration task that uses those endpoints, with the "Full load, ongoing replication" settings.
More details here: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html
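If you prefer to script the task instead of using the console, the parameters could be assembled roughly like this. It is a sketch under assumptions: the helper name and every ARN below are placeholders, the `public` schema is a guess at your setup, and the replication instance and both endpoints must already exist. The resulting dict is meant to be passed to boto3's `create_replication_task`.

```python
import json


def build_replication_task_params(
    task_id, source_arn, target_arn, instance_arn, schema="public"
):
    """Build keyword arguments for the DMS create_replication_task call.

    Hypothetical helper; all ARNs are supplied by the caller.
    """
    # Selection rule that includes every table in the given schema.
    table_mappings = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-schema",
                "object-locator": {
                    "schema-name": schema,
                    "table-name": "%",
                },
                "rule-action": "include",
            }
        ]
    }
    return {
        "ReplicationTaskIdentifier": task_id,
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        # Full load followed by ongoing replication (change data capture).
        "MigrationType": "full-load-and-cdc",
        "TableMappings": json.dumps(table_mappings),
    }


# Placeholder ARNs, for illustration only.
params = build_replication_task_params(
    task_id="old-to-new-account",
    source_arn="arn:aws:dms:REGION:ACCOUNT:endpoint:SOURCE",
    target_arn="arn:aws:dms:REGION:ACCOUNT:endpoint:TARGET",
    instance_arn="arn:aws:dms:REGION:ACCOUNT:rep:INSTANCE",
)
print(params["MigrationType"])  # full-load-and-cdc

# With real ARNs and credentials, the task would be created with:
#   import boto3
#   boto3.client("dms").create_replication_task(**params)
```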
I recently moved RDS data from one account to another using Bucardo (https://bucardo.org/). Please refer to the following posts:
https://www.compose.com/articles/using-bucardo-5-3-to-migrate-a-live-postgresql-database/
https://bucardo.org/pipermail/bucardo-general/2017-February/002875.html
Although these do not cover migration between two RDS accounts specifically, they should help with the setup. You still need an intermediate host, such as an EC2 instance, on which to configure Bucardo and migrate the data between accounts. If you are looking for more information, I am happy to help.
In short, take a manual snapshot of the source DB and restore it in the other account (https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ShareSnapshot.html). Then, with Bucardo set up on the EC2 instance, start syncing the data using triggers; this updates the destination DB as new data comes in to the source DB.
I've been testing Google Cloud SQL with PostgreSQL, but random queries take ~3s instead of a few ms.
Troubleshooting I did:
The queries themselves aren't the problem; rerunning the same query works fine.
Indexes are properly set. The database is also very, very small, so it shouldn't do this even if there weren't any indexes.
The Kubernetes container connects to the database through SQL Proxy (I followed this: https://cloud.google.com/sql/docs/postgres/connect-kubernetes-engine). It is not the problem, though, as I tried connecting directly to the database and had the same issue.
I set net.ipv4.tcp_keepalive_time to 60 to make sure connections weren't being dropped.
I also have a pool of connections that are never disconnected, to make sure it wasn't that.
When I run queries directly through my local PostgreSQL client, I never have the problem.
I don't have this issue when developing locally either and connecting to my local database.
What I'm getting at is: I feel there's some weird connection/link issue between my Google Compute instances and my Google SQL instance that I can't seem to figure out.
Any idea?
Edit:
I also noticed these logs in my Cloud SQL instance every 30s:
ERROR: recovery is not in progress
HINT: Recovery control functions can only be executed during recovery.
STATEMENT: SELECT pg_is_xlog_replay_paused(), current_timestamp
That's an interesting problem you are facing. My knowledge of Kubernetes isn't that great, but I do have a general understanding, so let's see if I can provide some suggestions.
To start with, the documentation you linked to in your question mentions that the feature is still in beta, so I believe there may still be performance issues to be patched.
Secondly, from what I understand, Kubernetes is a great tool for handling stateless workloads, so queries that depend on state may be slow. This article (although not entirely related) explains some of the pitfalls of Kubernetes (not all the questions are relevant).
Thirdly, could you explain your use case a little bit? Do you really need to use Kubernetes, or would another tool, like a powerful Compute Engine instance or a Dataflow job, resolve the issue? Are you making your database queries through a programming language or an application call?
Thanks, and do let me know!