Is there an easy way to migrate a large (100G) Google cloud sql database to Google datastore?
The way that comes to mind is to write a python appengine script for each database and table and then put it into the datastore. That sounds tedious but maybe it has to be done?
Side note, the reason I'm leaving cloud sql is because I have jsp pages with multiple queries on them and they are incredibly slow even with a d32 sql instance. I hope that putting it in the datastore will be faster?
There seems to be a ton of questions about moving away from the datastore to cloud sql, but I couldn't find this one.
Thanks
Here are a few options:
Write an App Engine mapreduce [1] program that pulls data in appropriate chunks from Cloud SQL and write is to Datastore .
Spin up a VM on Google Compute Engine and write a program that fetches the data from Cloud SQL and write to Datastore using the Datastore external API [2].
Use the Datastore restore [3]. I'm not familiar with the format so I don't know how much work is to get produce something that the restore will accept.
[1] https://cloud.google.com/appengine/docs/python/dataprocessing/
[2] https://cloud.google.com/datastore/docs/apis/overview
[3] https://cloud.google.com/appengine/docs/adminconsole/datastoreadmin?csw=1#restoring_data
I wrote a couple scripts that do this running on compute engine.
The gcp datastore api
import googledatastore
Here is the code:
https://gist.github.com/nburn42/d8b488da1d2dc53df63f4c4a32b95def
And the dataflow api
from apache_beam.io.gcp.datastore.v1.datastoreio import WriteToDatastore
Here is the code:
https://gist.github.com/nburn42/2c2a06e383aa6b04f84ed31548f1cb09
I get a quota exceeded though after it hits 100,000 entities, and I have to wait another day to do another set.
Hopefully these are useful to someone with a smaller database than me.
( The quota problem is here
Move data from Google Cloud-SQL to Cloud Datastore )
Related
I have two sources:
A csv that will be uploaded to a cloud storage service, probably GCP Cloud Storage.
The output of a scrapping process done with Python.
When a user updates 1) (the cloud stored file) an event should be triggered to execute 2) (the scrapping process) and then some transformation should take place in order to merge these two sources into one in a JSON format. Finally, the content of this JSON file should be stored in a DB of easy access and low cost. The files the user will update are of max 5MB and the updates will take place once weekly.
From what I've read, I can use GCP Cloud Functions to accomplish this whole process or I can use Dataflow too. I've even considered using both. I've also thought of using MongoDB to store the JSON objects of the two sources final merge.
Why should I use Cloud Functions, Dataflow or both? What are your thoughts on the DB? I'm open to different approaches. Thanks.
Regarding de use of Cloud Functions and Dataflow. In your case I will go for Cloud Functions as you don't have a big volume of data. Dataflow is more complex, more expensive and you will have to use Apache Beam. If you are confortable with python and having into consideration your scenario I will choose Cloud Functions. Easy, convenient...
To trigger a Cloud Functions when Cloud Storage object is updated you will have to configure the triggers. Pretty easy.
https://cloud.google.com/functions/docs/calling/storage
Regarding the DB. MongoDB is a good option but if you wanth something quick an inexpensive consider DataStore
As a managed service it will make your life easy with a lot of native integrations. Also it has a very interesting free tier.
I would like to automatically stream data from an external PostgreSQL database into a Google Cloud Platform BigQuery database in my GCP account. So far, I have seen that one can query external databases (MySQL or PostgreSQL) with the EXTERNAL_QUERY() function, e.g.:
https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries
But for that to work, the database has to be in GCP Cloud SQL. I tried to see what options are there for streaming from the external PostgreSQL into a Cloud SQL PostgreSQL database, but I could only find information about replicating it in a one time copy, not streaming:
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external
The reason why I want this streaming into BigQuery is that I am using Google Data Studio to create reports from the external PostgreSQL, which works great, but GDS can only accept SQL query parameters if it comes from a Google BigQuery database. E.g. if we have a table with 1M entries, and we want a Google Data Studio parameter to be added by the user, this will turn into a:
SELECT * from table WHERE id=#parameter;
which means that the query will be faster, and won't hit the 100K records limit in Google Data Studio.
What's the best way of creating a connection between an external PostgreSQL (read-only access) and Google BigQuery so that when querying via BigQuery, one gets the same live results as querying the external PostgreSQL?
Perhaps you missed the options stated on the google cloud user guide?
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external#setup-replication
Notice in this section, it says:
"When you set up your replication settings, you can also decide whether the Cloud SQL replica should stay in-sync with the source database server after the initial import is complete. A replica that should stay in-sync is online. A replica that is only updated once, is offline."
I suspect online mode is what you are looking for.
What you are looking for will require some architecture design based on your needs and some coding. There isn't a feature to automatically sync your PostgreSQL database with BigQuery (apart from the EXTERNAL_QUERY() functionality that has some limitations - 1 connection per db - performance - total of connections - etc).
In case you are not looking for the data in real time, what you can do is with Airflow for instance, have a DAG to connect to all your DBs once per day (using KubernetesPodOperator for instance), extract the data (from past day) and loading it into BQ. A typical ETL process, but in this case more EL(T). You can run this process more often if you cannot wait one day for the previous day of data.
On the other hand, if streaming is what you are looking for, then I can think on a Dataflow Job. I guess you can connect using a JDBC connector.
In addition, depending on how you have your pipeline structure, it might be easier to implement (but harder to maintain) if at the same moment you write to your PostgreSQL DB, you also stream your data into BigQuery.
Not sure if you have tried this already, but instead of adding a parameter, if you add a dropdown filter based on a dimension, Data Studio will push that down to the underlying Postgres db in this form:
SELECT * from table WHERE id=$filter_value;
This should achieve the same results you want without going through BigQuery.
We have one Google Cloud SQL instance with 1 vCPU for production. I want to grab a copy of the data by exporting to a bucket. Is this safe to do? As in might it block other operations on the instance?
I think it's important to take into consideration the RDBMS that you are using, it's mentioned in here that PostgreSQL has issues when handling big blobs in an export, and at this other SO post there's an answer with the most votes with hints to have an smoother export, since it can lead to DBs getting unresponsive, which is a pretty well known fact.
In the case of MySQL, the product doc have some tips for this case in this article where it stated:
"If the server is running, it is necessary to perform appropriate locking so that the server does not change database contents during the backup"
And you can achive this by using mysqldump --lock-tables=false into your export command.
I need to migrate the tables from the BigQuery to the on-prem Postgres database.
How can I efficiently achieve that?
Some thoughts that are coming
I will use Google APIs to export the data from the tables
Store it locally
And finally, import to Postgres
But I am not sure if that can be done for a huge amount of data in TBs. Also, how can I automate this process? Can I use Jenkins for that?
Exporting the data from BigQuery, store it and importing it to PostgreSQL is a good approach. Here are other two alternatives that you can consider:
1) There's a PostgreSQL wrapper for BigQuery that allows to query directly from BigQuery. Depending on your case scenario this might be the easiest way to transfer the data; although, for TBs it might not be the best approach. This suggestion was made by #David in this SO question.
2) Using Dataflow. You can create a ETL process using Apache Beam to made the transfer. Take a look at this how-to for transferring data from BigQuery to CloudSQL. You would need to adapt it for local PostgreSQL, but the idea maintains.
Here's another SO answer that gives more context on this approach.
I have a cloud native app based on azure. The app uses azure table storage.
Due to a fantastic opportunity I have decided to also provide the app on-premises. So I have to replace the NoSql data provider... my question is: Which solution is more alike Azure Table Storage? Mongo? Raven? you name it!
What I intend is to migrate the code effortlessly, like migrating from SQL Azure to Sql Server 2012... no code change needed... but I know that theres no equivalent for table storage... so I intend to find the one that will reduce my TTM as much as possible...
MongoDB and Table Storage are not exactly swappable replacements for each other. One is key/value, the other is document. I compared the two in this answer.
There's no getting around the fact that Table Storage is Storage-as-a-Service and you only pay for quantity of data (plus a very small per-transaction cost), whereas to work with MongoDB, you'd either have to host it in your own VMs (which gives you plenty of storage room, but at the expense of VMs) or work with a hoster (such as MongoLab, which offers 500MB for free currently). Regardless, you'd have do do some code changes to work with MongoDB over Table Storage.
I'm not sure if there exists a key/value store equivalent to Table Storage that's locally-installable. No matter what you pick, you'll have modifications on your Azure-side solution if you swap out Table Storage.
Is it possible, for your on-premises solution, to provide a MongoDB backend that stays relatively simple? That is: Stick with a single index to substitute for rowkey, and then store your table entities as documents (avoiding sub-documents)? This would keep your data layout very similar. At that point, you could use things like Aggregation Framework for a bit of data processing, and not damage the overall layout style/schema of your data.
MongoDB would give you a consistent storage framework that you could use in-cloud and on-premises, and has good support for Windows Azure.