Has anyone ever tried migrating from Snowflake to Redshift? Or are there things to consider if you have been using Snowflake?
I need to understand the implications, as I'm putting together an options paper.
Thanks
P
This is a very broad topic, and the considerations depend on usage, data volume, availability, SLAs, etc. For a migration to Redshift I recommend reviewing the AWS white papers and the options in AWS Database Migration Service.
This is a very helpful guide on how to migrate data from Snowflake to Redshift cost-effectively. Basically, you can orchestrate the migration using a Glue Workflow and automate the generation of the unload and COPY commands on both the Snowflake and Redshift sides (see the sketch after the link).
https://aws.amazon.com/blogs/big-data/migrate-from-snowflake-to-amazon-redshift-using-aws-glue-python-shell/
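A minimal sketch of the unload-and-copy pattern that the blog post automates with Glue Python Shell jobs, assuming a Snowflake external stage that writes to an S3 prefix and a matching target table already created in Redshift. All connection details, stage, bucket, role, and table names below are placeholders:

```python
# Sketch only: unload a Snowflake table to S3, then COPY it into Redshift.
# Assumes the stage my_s3_stage points at the same S3 prefix used in the COPY,
# and that the Redshift cluster's IAM role can read that bucket.
import snowflake.connector
import redshift_connector

TABLE = "orders"
S3_PREFIX = "s3://my-migration-bucket/unload/orders/"

# 1) Unload from Snowflake into S3 as Parquet
sf = snowflake.connector.connect(
    account="my_account", user="etl_user", password="...",
    warehouse="ETL_WH", database="SALES", schema="PUBLIC",
)
sf.cursor().execute(
    f"COPY INTO @my_s3_stage/unload/{TABLE}/ "
    f"FROM {TABLE} FILE_FORMAT = (TYPE = PARQUET) OVERWRITE = TRUE"
)
sf.close()

# 2) COPY from S3 into the (pre-created) Redshift table
rs = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="sales", user="etl_user", password="...",
)
cur = rs.cursor()
cur.execute(
    f"COPY {TABLE} FROM '{S3_PREFIX}' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role' "
    "FORMAT AS PARQUET"
)
rs.commit()
rs.close()
```

The blog post wraps the same two steps in Glue jobs and a Glue Workflow so they run per table on a schedule; the sketch just shows the core commands.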
Related
We are working on building an ETL pipeline, and I have come across these two plugins for Postgres to capture the writes and deletes on a table.
Can someone please explain the differences between wal2json and pgoutput? What are the pros/cons and the performance impact, if there is any?
Basically, I want to understand when to use each of these two plugins in Postgres.
Thanks in advance
They are for different purposes. pgoutput is for use with logical replication between PostgreSQL databases; wal2json is for getting a JSON representation of data modifications for use with third-party applications.
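If you want to see what wal2json actually emits, you can create a logical replication slot and peek at the change stream from SQL. A rough sketch (the table and slot names are made up, and the server must already have wal_level = logical and the wal2json plugin installed):

```python
# Rough sketch: create a wal2json slot, make some changes, and peek at the JSON output.
# Assumes a "customers" table already exists and the connecting role may create slots.
import psycopg2

conn = psycopg2.connect("dbname=etl_demo user=postgres")
conn.autocommit = True
cur = conn.cursor()

# Create a logical replication slot that decodes WAL through wal2json
cur.execute("SELECT pg_create_logical_replication_slot('demo_slot', 'wal2json');")

# Generate some WAL to decode
cur.execute("INSERT INTO customers (id, name) VALUES (42, 'Ada');")
cur.execute("DELETE FROM customers WHERE id = 42;")

# Peek at the decoded changes as JSON (peeking does not consume the slot)
cur.execute("SELECT data FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);")
for (payload,) in cur.fetchall():
    print(payload)

# Clean up so the slot does not retain WAL forever
cur.execute("SELECT pg_drop_replication_slot('demo_slot');")
conn.close()
```

pgoutput, by contrast, emits a binary protocol intended for a subscriber, so you would normally consume it through a logical replication publication/subscription rather than reading it back as text.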
I'm getting familiar with the Greenplum solution concepts, and trying to understand whether, and if so when, the organisation I work for should use this solution. Our conceptual idea is to set up a kind of central 'datastore' suitable for both OLTP and OLAP access.
My research: this article suggests Greenplum is more suitable for OLAP and PostgreSQL for OLTP. But I have also read about Greenplum improvements for OLTP processing. And in favour of PostgreSQL, there are also articles like this that suggest OLAP (e.g. a data warehouse implementation) can be done by means of PostgreSQL.
So my question is: how do we move forward, and what are the main criteria to decide? For example, given that we currently have just a few TBs (1-5), should we start with a PostgreSQL cluster (for OLTP+OLAP) and move to Greenplum when data volumes grow, or start straight away with Greenplum?
Maybe use Postgres if it can handle your use case. If you have too much data and need to finish reports and analytics faster, change to Greenplum.
I've seen lots of blogs and posts comparing AWS Athena and Redshift Spectrum. The unanimous consensus seems to be that if you don't already have a Redshift implementation, just go with Athena.
Are there any scenarios or thresholds where Redshift Spectrum would better support a reporting need, and force a switch from Athena to Redshift?
--Update--
I found the following in the Big Data Analytics Options on AWS white paper, under the Anti-Patterns section for Athena:
"Amazon Redshift is a better tool for Enterprise Reporting and Business Intelligence Workloads involving iceberg queries or cached data at the nodes."
Then is it fair to say that Athena is for data analytics as opposed to business intelligence?
https://www.stitchdata.com/blog/business-intelligence-vs-data-analytics/
So it comes down to storage. Storing large amounts of structured data only makes sense in a true data warehouse setup like Redshift; trying to fit the same volume of data into flat files like Parquet isn't appropriate.
I am assembling a Business Intelligence solution using the Pentaho software as the BI engine. Within this solution, I had to set up the requirements for a PostgreSQL database server.
The current situation is very simple, since no ETL process is being carried out yet for data extraction, so the PostgreSQL configuration has hardly changed and is practically still at its factory defaults.
I would like to know which Postgres configuration parameters have to be touched and modified to optimize it as a data warehouse. I have seen a lot of documentation, but it is not clear to me at all, since one document says certain values have to be modified, and another recommends completely different values.
I would just like to know whether there is clearer and more precise documentation on optimizing Postgres 9.6 for use as a Pentaho DW.
Thank you very much
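Most data warehouse tuning guides converge on the same handful of memory, planner, and checkpoint parameters; where they disagree is mostly on the exact values. As a rough, illustrative starting point, applied here via ALTER SYSTEM, the numbers below assume a dedicated machine with roughly 32 GB of RAM and are placeholders to adjust, not recommendations for any specific box:

```python
# Illustrative only: apply a warehouse-leaning baseline to a dedicated
# PostgreSQL 9.6 server. Adjust every value to your own hardware.
import psycopg2

SETTINGS = {
    "shared_buffers": "8GB",                 # ~25% of RAM; requires a restart
    "effective_cache_size": "24GB",          # planner hint, ~75% of RAM
    "work_mem": "256MB",                     # per sort/hash; assumes few concurrent BI queries
    "maintenance_work_mem": "2GB",           # faster bulk loads and index builds
    "checkpoint_completion_target": "0.9",   # spread checkpoint I/O
    "default_statistics_target": "500",      # better plans on large tables
    "random_page_cost": "1.1",               # only if the data sits on SSD
    "max_parallel_workers_per_gather": "4",  # parallel scans (new in 9.6)
}

conn = psycopg2.connect("dbname=postgres user=postgres")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
cur = conn.cursor()
for name, value in SETTINGS.items():
    cur.execute(f"ALTER SYSTEM SET {name} = %s;", (value,))
cur.execute("SELECT pg_reload_conf();")  # shared_buffers still needs a server restart
conn.close()
```

Whatever values you settle on, change one group of settings at a time and measure against your actual Pentaho queries rather than trusting any single guide.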
I need to migrate tables from BigQuery to an on-prem Postgres database.
How can I efficiently achieve that?
Some thoughts that come to mind:
I will use Google APIs to export the data from the tables
Store it locally
And finally, import to Postgres
But I am not sure whether that can be done for a huge amount of data (in the TBs). Also, how can I automate this process? Can I use Jenkins for that?
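Roughly what I have in mind, sketched with the standard Google Cloud client libraries and psycopg2 (the project, bucket, and table names are invented, and the target table is assumed to already exist in Postgres with matching columns):

```python
# Sketch only: export a BigQuery table to GCS as compressed CSV, download the
# shards, and stream each one into Postgres with COPY.
import gzip

import psycopg2
from google.cloud import bigquery, storage

PROJECT = "my-project"
DATASET = "analytics"
TABLE = "events"
BUCKET = "my-transfer-bucket"

# 1) Export the table to GCS (BigQuery shards large tables across many files)
bq = bigquery.Client(project=PROJECT)
extract_job = bq.extract_table(
    f"{PROJECT}.{DATASET}.{TABLE}",
    f"gs://{BUCKET}/export/{TABLE}-*.csv.gz",
    job_config=bigquery.ExtractJobConfig(compression="GZIP"),
)
extract_job.result()  # wait for the export to finish

# 2) Download each shard and 3) COPY it into Postgres
gcs = storage.Client(project=PROJECT)
pg = psycopg2.connect("dbname=warehouse user=etl host=localhost")
cur = pg.cursor()
for blob in gcs.list_blobs(BUCKET, prefix=f"export/{TABLE}-"):
    local_path = f"/tmp/{blob.name.split('/')[-1]}"
    blob.download_to_filename(local_path)
    with gzip.open(local_path, "rt") as f:
        cur.copy_expert(
            f"COPY {TABLE} FROM STDIN WITH (FORMAT csv, HEADER true)", f
        )
pg.commit()
pg.close()
```

Would something along these lines scale to TBs, and is Jenkins a reasonable way to schedule it?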
Exporting the data from BigQuery, storing it, and importing it into PostgreSQL is a good approach. Here are two other alternatives that you can consider:
1) There's a PostgreSQL wrapper for BigQuery that allows you to query BigQuery directly. Depending on your scenario this might be the easiest way to transfer the data, although for TBs it might not be the best approach. This suggestion was made by #David in this SO question.
2) Using Dataflow. You can create an ETL process using Apache Beam to make the transfer. Take a look at this how-to for transferring data from BigQuery to Cloud SQL. You would need to adapt it for a local PostgreSQL instance, but the idea holds (a skeleton is sketched after the links).
Here's another SO answer that gives more context on this approach.
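A skeletal version of the Beam idea, assuming the Dataflow workers can reach the on-prem database (e.g. over VPN or Interconnect); the project, table, columns, and connection string below are made up for illustration:

```python
# Skeleton only: read a BigQuery table with Beam and insert the rows into Postgres.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class WriteToPostgres(beam.DoFn):
    """Writes each bundle of rows through a single psycopg2 connection."""

    def __init__(self, dsn, table):
        self.dsn = dsn
        self.table = table

    def start_bundle(self):
        import psycopg2  # imported on the worker
        self.conn = psycopg2.connect(self.dsn)
        self.cur = self.conn.cursor()

    def process(self, row):
        # row is a dict keyed by the BigQuery column names
        self.cur.execute(
            f"INSERT INTO {self.table} (id, name, created_at) VALUES (%s, %s, %s)",
            (row["id"], row["name"], row["created_at"]),
        )

    def finish_bundle(self):
        self.conn.commit()
        self.cur.close()
        self.conn.close()


options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-transfer-bucket/tmp",
)
with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromBQ" >> beam.io.ReadFromBigQuery(table="my-project:analytics.events")
        | "WriteToPG" >> beam.ParDo(
            WriteToPostgres(dsn="host=10.0.0.5 dbname=warehouse user=etl", table="events")
        )
    )
```

Writing through a DoFn keeps one connection per bundle; for real volumes you would batch the inserts (or COPY from staged files) rather than inserting row by row.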