dbt Cloud connection setup with different hosts in Redshift

I am currently using a Redshift data warehouse with my dbt project. We have two different hosts, one for the development database and one for the production database. I want to trigger jobs during PRs to two different branches, where one job runs against our development database and the other against our production database (the two databases have different host addresses). How do I configure dbt Cloud to use two different hosts in the same project? With dbt Core I was able to configure this using profiles.yml and environment variables. Does dbt Cloud use profiles.yml by default, or is there another way I need to consider to achieve this?
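For reference, this is roughly what the dbt Core approach mentioned above looks like; the profile name, environment-variable names, and schema below are placeholders, with the host injected per environment so each CI job can point at a different cluster:

```yaml
# profiles.yml -- sketch of the dbt Core setup described in the question.
# DBT_HOST, DBT_USER, and DBT_PASSWORD are placeholder environment variables,
# set differently in the dev and prod CI jobs.
my_project:
  target: default
  outputs:
    default:
      type: redshift
      host: "{{ env_var('DBT_HOST') }}"        # dev or prod Redshift endpoint
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5439
      dbname: analytics
      schema: dbt
      threads: 4
```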

Related

MongoDB Atlas clone entire cluster with database, collections, rules, and functions to a second cluster (Dev/Prod)

I'm new-ish to MongoDB Atlas and I have my dev environment all set up and working with my collections, rules, and some functions in use by HTTPS Endpoints. I'd like to clone all of that into a second cluster on a serverless cluster instance for production.
Is there a best-practice way of doing this, and then updating it on an ongoing basis in both directions: copying functions/endpoints etc. from dev to prod, and copying collections from prod to dev?
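One common way to copy the data side of this (databases and collections) between clusters is mongodump/mongorestore; the connection strings below are placeholders, and the App Services pieces (rules, functions, HTTPS endpoints) are not part of a dump, so they would need to be deployed separately.

```sh
# Sketch: copy databases and collections from the dev cluster to the prod cluster.
# Both connection strings are placeholders. Rules, functions, and endpoints live
# in App Services configuration and are not included in a mongodump.
mongodump --uri="mongodb+srv://user:pass@dev-cluster.example.mongodb.net" --out=./dump
mongorestore --uri="mongodb+srv://user:pass@prod-cluster.example.mongodb.net" ./dump
```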

Can ankane/pgsync perform bidirectional synchronisation?

We have one cluster on Postgres 12 installed locally (CentOS) and another on RDS. We want to synchronize two tables, one in the on-premises cluster and one in the cloud (RDS), but we need it to work in both directions, that is, local -> RDS and RDS -> local.
The idea is for the applications to be able to write (insert/update/delete) to either cluster interchangeably.
Can ankane/pgsync be used to update tables bidirectionally, or does it only work one way, from a source to a destination?
Is this possible with pgsync (ankane), or is there another tool?
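For context, pgsync is configured with a single source/destination pair, so a one-way sync looks roughly like the sketch below (the connection URLs are placeholders); running it in both directions would mean two opposing configurations plus some way of resolving conflicting writes.

```yaml
# .pgsync.yml -- sketch with placeholder connection URLs.
# pgsync copies rows from "from" to "to" when you run, e.g., `pgsync my_table`.
from: postgres://app:secret@localhost:5432/app_db
to: postgres://app:secret@my-cluster.rds.amazonaws.com:5432/app_db
```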

Google Data Fusion: Loading multiple small tables daily

I want to load about 100 small tables (min 5 records, max 10,000 records) from SQL Server into Google BigQuery on a daily basis. We have created 100 Data Fusion pipelines, one pipeline per source table. When we start one pipeline it takes about 7 minutes to execute: it starts a Dataproc cluster, connects to SQL Server, and sinks the data into BigQuery. Running them sequentially would therefore take around 700 minutes. When we try to run the pipelines in parallel we are limited by the network range, roughly 256/3, since each pipeline starts 3 VMs (one master and two workers). We tried, but performance degrades when we start more than 10 pipelines in parallel.
Question: is this the right approach?
When multiple pipelines run at the same time, there are multiple Dataproc clusters running behind the scenes, with more VMs and more disk. There are plugins to help with multiple source tables; the right one to use here is the CDAP/Google plugin called Multiple Table Plugins, since it allows a single pipeline to read from several source tables.
In the Data Fusion Studio you can find it under Hub -> Plugins.
For the full list of available plugins, please see the official documentation.
Multiple Data Fusion pipelines can also share the same pre-provisioned Dataproc cluster. To do that, create a Remote Hadoop Provisioner compute profile for the Data Fusion instance (this feature is only available in the Enterprise edition); the documentation describes how to set up a compute profile for a Data Fusion instance, and a sketch of pre-provisioning the cluster is shown below.
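As a rough sketch of the pre-provisioning step (the cluster name, region, and machine types are placeholders; the Remote Hadoop Provisioner compute profile is then created in the Data Fusion instance and pointed at this cluster):

```sh
# Sketch: pre-provision a Dataproc cluster that multiple Data Fusion pipelines
# can share through a Remote Hadoop Provisioner compute profile.
# Cluster name, region, and machine types are placeholders.
gcloud dataproc clusters create shared-datafusion-cluster \
  --region=us-central1 \
  --num-workers=2 \
  --master-machine-type=n1-standard-4 \
  --worker-machine-type=n1-standard-4
```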

Running automated jobs in PostgreSQL

I have set up a PostgreSQL server and am using pgAdmin 4 for managing the databases/clusters. I have a bunch of SQL validation scripts (.sql) which I run on the databases every time some data is added to the database.
My current requirement is to automatically run these .sql scripts and generate some results/statistics every time new data is added to any of the tables in the database.
I have explored the use of pg_cron (https://www.citusdata.com/blog/2016/09/09/pgcron-run-periodic-jobs-in-postgres/) and pgAgent (https://www.pgadmin.org/docs/pgadmin4/dev/pgagent_jobs.html).
Before I proceed to integrate either of these tools into my application, I wanted to know whether it is advisable to use these utilities, or whether I should employ a full-fledged CI framework like Jenkins.
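For illustration, a minimal pg_cron sketch, assuming the extension is installed and listed in shared_preload_libraries; the job name, schedule, and run_validations() function are placeholders standing in for the existing .sql scripts. Note that pg_cron and pgAgent are schedule-based, so truly event-driven runs on every insert would instead need triggers or LISTEN/NOTIFY.

```sql
-- Sketch: run the validation logic nightly with pg_cron.
-- Prerequisites: pg_cron in shared_preload_libraries and CREATE EXTENSION pg_cron;
-- run_validations() is a placeholder function wrapping the existing .sql checks.
SELECT cron.schedule(
  'nightly-validation',          -- job name
  '0 3 * * *',                   -- every day at 03:00
  $$SELECT run_validations()$$   -- command to execute
);
```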

How to replicate MySQL database to Cloud SQL Database

I have read that you can replicate a Cloud SQL database to MySQL. Instead, I want to replicate from a MySQL database (that the business uses to keep inventory) to Cloud SQL so it can have up-to-date inventory levels for use on a web site.
Is it possible to replicate MySQL to Cloud SQL? If so, how do I configure that?
This is something that is not yet possible in Cloud SQL.
I'm using DBSync to do it, and it is working fine:
http://dbconvert.com/mysql.php
The Sync version does what you want. It works well with App Engine and Cloud SQL; you must authorize external connections first.
This is a rather old question, but it might be worth noting that this now seems possible by configuring an external master.
The high-level steps are:
Create a dump of the data from the master and upload the file to a storage bucket (see the sketch below).
Create a master instance in Cloud SQL.
Set up a replica of that instance using the external master's IP, username, and password, and provide the dump file location.
Set up additional replicas if needed.
Voilà!
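A rough sketch of the dump-and-upload step, with placeholder database, user, and bucket names (the external-master and replica configuration itself is done through the Cloud SQL console or API as described in the docs):

```sh
# Sketch: dump the external (on-premises) MySQL master and upload the dump to a
# Cloud Storage bucket so Cloud SQL can import it. All names are placeholders.
mysqldump --databases inventory \
  --single-transaction \
  --master-data=1 \
  --hex-blob \
  -u repl_user -p > inventory-dump.sql

gsutil cp inventory-dump.sql gs://my-replication-bucket/inventory-dump.sql
```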