Move data from PostgreSQL to AWS S3 and analyze with RedShift Spectrum - postgresql

I have a large number of PostgreSQL tables with different schemas, and a massive amount of data inside them.
I'm unable to do data analytics right now because the data volume is quite large - a few TB - and PostgreSQL is not able to process queries in a reasonable amount of time.
I'm thinking about the following approach: I'll process all of my PostgreSQL tables with Apache Spark, load the DataFrames, and store them as Parquet files in AWS S3. Then I'll use Redshift Spectrum to query the information stored in these Parquet files.
First of all, I'd like to ask: will this solution work at all?
And second: will Redshift Spectrum be able to automatically create EXTERNAL tables from these Parquet files without an additional schema specification (even when the original PostgreSQL tables contain data types unsupported by AWS Redshift)?
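For reference, the Postgres-to-Parquet step of this pipeline could be sketched in PySpark roughly as below. All hostnames, credentials, table names, and the S3 bucket are placeholders, and the PostgreSQL JDBC driver is assumed to be on the Spark classpath:

```python
# Sketch of exporting one PostgreSQL table to Parquet on S3 with Spark.
# Every connection detail below is a placeholder, not a real endpoint.

def jdbc_url(host, port, database):
    """Build a PostgreSQL JDBC URL for Spark's JDBC reader."""
    return f"jdbc:postgresql://{host}:{port}/{database}"

def export_table(spark, table, bucket):
    """Read one table over JDBC and write it as Parquet to S3.

    `spark` is an existing SparkSession; requires the PostgreSQL
    JDBC driver on the Spark classpath.
    """
    df = (spark.read.format("jdbc")
          .option("url", jdbc_url("pg-host.example.com", 5432, "mydb"))
          .option("dbtable", table)
          .option("user", "etl_user")
          .option("password", "change-me")  # use a secrets manager in practice
          .option("fetchsize", 10000)  # stream rows instead of buffering all
          .load())
    # s3a:// is the Hadoop S3 connector scheme used by Spark
    df.write.mode("overwrite").parquet(f"s3a://{bucket}/parquet/{table}/")
```

For large tables, adding `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` options lets Spark read the table in parallel instead of through a single connection.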

Redshift Spectrum supports pretty much the same data types as Redshift itself.
Redshift Spectrum creates a cluster of compute nodes behind the scenes. The size of that cluster is based on the number of nodes in your actual Redshift cluster, so if you plan to create a 1-node Redshift cluster, Spectrum will run pretty slowly.
As you noted in the comments, you can use Athena to query the data, and it will be a better option in your case than Spectrum. But Athena has several limitations, like a 30-minute run time, memory consumption, etc. So if you plan to do complicated queries with several joins, it may simply not work.
Redshift Spectrum can't create external tables without a provided structure.
The best solution in your case would be to use Spark (on EMR, or Glue) to transform the data, Athena to query it, and if Athena can't handle a specific query, use SparkSQL on the same data. You can use Glue, but running jobs on EMR with Spot Instances will be more flexible and cheaper. EMR clusters come with EMRFS, which gives you the ability to use S3 almost transparently instead of HDFS.
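To illustrate the "provided structure" point: an external table over the Parquet files has to be declared with an explicit column list. The helper below just renders such a DDL statement; the schema, table, column, and bucket names are made-up examples:

```python
# Sketch: Spectrum external tables need an explicit column list; this
# helper renders the DDL. All names below are placeholders.

def external_table_ddl(schema, table, columns, s3_prefix):
    """Render a CREATE EXTERNAL TABLE statement for Parquet data on S3.

    `columns` is a list of (name, redshift_type) pairs that you must
    supply yourself -- Spectrum will not infer them.
    """
    cols = ",\n    ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n    {cols}\n)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION 's3://{s3_prefix}/';"
    )

ddl = external_table_ddl(
    "spectrum", "events",
    [("event_id", "bigint"), ("payload", "varchar(65535)")],
    "my-bucket/parquet/events",
)
print(ddl)
```

The statement runs against an external schema that has already been mapped to a Glue or Athena data catalog.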

AWS Glue might be interesting as an option for you. It is both a hosted version of Spark with some AWS-specific add-ons and a Data Crawler + Data Catalogue.
It can crawl unstructured data such as Parquet files and figure out the structure, which then allows you to export it to AWS Redshift in structured form if needed.
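The crawl step can be driven from code with boto3; a rough sketch follows, where the crawler name, IAM role, catalog database, and S3 path are all placeholders:

```python
# Sketch: creating and starting a Glue crawler over a Parquet prefix
# with boto3. All names below are made-up placeholders.

def crawler_config(name, role_arn, database, s3_path):
    """Build the argument dict for glue.create_crawler()."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

def run_crawler(cfg):
    """Create the crawler if needed and start it. Requires AWS credentials."""
    import boto3  # imported here so the sketch stays importable without AWS
    glue = boto3.client("glue")
    try:
        glue.create_crawler(**cfg)
    except glue.exceptions.AlreadyExistsException:
        pass  # crawler was created on a previous run
    glue.start_crawler(Name=cfg["Name"])

cfg = crawler_config(
    "pg-parquet-crawler",
    "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "analytics",
    "s3://my-bucket/parquet/",
)
```

Once the crawler has run, the inferred tables appear in the Glue Data Catalog and are queryable from Athena or Spectrum.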
See this blog post on how to connect it to a postgres database using JDBC to move data from Postgres to S3.

Related

Will queries run against S3 using Glue job be faster than running the queries in RedShift?

In our data warehouse we have three data organization layers: Landing, Distilled, and Curated. We take the data in Landing and put it into the Distilled zone. In the Distilled zone, we run some technical data transformations, including SCD type 2 transformations. In the Curated zone, we apply more business transformations.
There is a business requirement that distilled must have all data in S3 also.
For the transformations in Distilled, there are two options:
1. Keep the data in S3 and use a Glue job (serverless Spark) to run the transformations. Only for SCD type 2, use Redshift Spectrum to do the transformations in Distilled.
2. Load the data from S3 into Redshift and run all transformations using Redshift.
My take is that option #2 will be much faster, because it will be able to leverage Redshift's column-oriented storage architecture and its optimizer for better pruning.
I wanted to check if my understanding above is correct. I feel Redshift Spectrum will still be relatively slower than using Redshift for the transformations. Also, Spectrum can only insert data; it cannot do any updates.
Thanks
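For context on why updates matter here, the SCD type 2 step in option #2 typically amounts to a close-then-insert pair of statements run inside Redshift. The sketch below renders them as Python strings; all table and column names are assumptions, not from the question:

```python
# Sketch of the SCD type 2 close-and-insert pattern in Redshift SQL,
# rendered as Python strings. Table/column names are assumptions.

# Step 1: close out the current row for any key whose attributes changed.
CLOSE_CHANGED = """
UPDATE distilled.customer_dim d
SET is_current = FALSE, valid_to = s.load_date
FROM staging.customer s
WHERE d.customer_id = s.customer_id
  AND d.is_current
  AND d.attr_hash <> s.attr_hash;
"""

# Step 2: insert a fresh current row for new or changed keys.
INSERT_NEW_VERSIONS = """
INSERT INTO distilled.customer_dim
  (customer_id, attr_hash, valid_from, valid_to, is_current)
SELECT s.customer_id, s.attr_hash, s.load_date, NULL, TRUE
FROM staging.customer s
LEFT JOIN distilled.customer_dim d
  ON d.customer_id = s.customer_id AND d.is_current
WHERE d.customer_id IS NULL;
"""
```

The UPDATE in step 1 is exactly the operation Spectrum cannot perform against external tables, which is why the SCD type 2 piece pulls the data into Redshift proper.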

Datapipeline from Sagemaker to Redshift

I wanted to check with the community here whether anyone has explored a pipeline option from SageMaker to Redshift directly.
I want to load the predicted data from SageMaker into a table in Redshift. I was planning to do it via S3, but was wondering if there are better ways to do this.
I think your idea to stage data in S3, if acceptable in your specific use-case, is a good baseline design:
SageMaker smoothly connects to S3 (via Batch Transform or a Processing job)
Redshift COPY statements are best practice for efficient loading of data, and can be done from S3 ("COPY loads large amounts of data much more efficiently than using INSERT statements, and stores the data more effectively as well." - Redshift documentation)
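The staging design can be sketched as follows. The bucket, table, and IAM role names are placeholders, and the output format is assumed to be CSV as produced by a Batch Transform job:

```python
# Sketch: load SageMaker batch-transform output from S3 into Redshift
# with COPY. Bucket, table, and IAM role names are placeholders.

def copy_statement(table, s3_path, iam_role):
    """Render a Redshift COPY for CSV output written to S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS CSV;"
    )

sql = copy_statement(
    "analytics.predictions",
    "s3://my-bucket/batch-output/",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
# Execute `sql` with any Postgres-protocol client, e.g. psycopg2 or
# the redshift_connector package.
```

COPY reads every object under the given prefix in parallel, which is what makes it so much faster than row-by-row INSERTs.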

What's the best way to read/write from/to Redshift with Scala spark since spark-redshift lib is not supported publicly by Databricks

I have my Spark project in Scala and I want to use Redshift as my data warehouse. I found that the spark-redshift repo exists, but Databricks made it private a couple of years ago and no longer supports it publicly.
What's the best option right now for working with Amazon Redshift from Spark (Scala)?
This is a partial answer, as I have only used Spark->Redshift in a real-world use case and have never benchmarked Spark read performance from Redshift.
When it comes to writing from Spark to Redshift, by far the most performant way that I could find was to write parquet to S3 and then use Redshift Copy to load the data. Writing to Redshift through JDBC also works but it is several orders of magnitude slower than the former method. Other storage formats could be tried as well, but I would be surprised if any row-oriented format could beat Parquet as Redshift internally stores data in columnar format. Another columnar format that is supported by both Spark and Redshift is ORC.
I never came across a use case of reading large amounts of data from Redshift using Spark, as it feels more natural to load all the data into Redshift and use it for joins and aggregations. It is probably not cost-efficient to use Redshift just as bulk storage and use another engine for joins and aggregations. For reading small amounts of data, JDBC works fine. For large reads, my best guess is the UNLOAD command and S3.
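For the large-read path, the UNLOAD-then-read pattern could look roughly like this; the query, S3 prefix, and IAM role are placeholders:

```python
# Sketch: the UNLOAD-then-read pattern for large reads from Redshift.
# The query, paths, and role below are placeholders.

def unload_statement(query, s3_prefix, iam_role):
    """Render an UNLOAD that writes query results as Parquet to S3.

    Note: single quotes inside `query` would need escaping; this
    sketch assumes a query without them.
    """
    return (
        f"UNLOAD ('{query}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS PARQUET;"
    )

stmt = unload_statement(
    "SELECT * FROM analytics.events",
    "s3://my-bucket/unload/events/",
    "arn:aws:iam::123456789012:role/RedshiftUnloadRole",
)
# After the UNLOAD completes, Spark can read the result directly:
#   spark.read.parquet("s3a://my-bucket/unload/events/")
```

Unloading to Parquet keeps the data columnar end to end, so Spark reads only the columns a query touches.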

Is there a threshold or use case that would push an implementation from AWS Athena to Redshift Spectrum?

I've seen lots of blogs and posts comparing AWS Athena and Redshift Spectrum. The unanimous consensus seems to be that if you don't already have a Redshift implementation, just go with Athena.
Are there any scenarios or thresholds where Redshift Spectrum would better support a reporting need, and force a switch from Athena to Redshift?
--Update--
I found the following in the Big Data Analytics Options on AWS white paper under the Anti-Patterns section for Athena
Amazon Redshift is a better tool for Enterprise Reporting and Business Intelligence Workloads involving iceberg queries or cached data at the nodes.
Then is it fair to say that Athena is for data analytics as opposed to business intelligence?
https://www.stitchdata.com/blog/business-intelligence-vs-data-analytics/
So it comes down to storage. Storing large amounts of structured data only makes sense in a true data warehouse setup like Redshift.
Trying to fit the same volume of data into flat files like Parquet isn't appropriate.

Redshift insert bottleneck

I am trying to migrate a huge table from Postgres into Redshift.
The size of the table is about 5,697,213,832.
Tool: Pentaho Kettle, Table input (from Postgres) -> Table output (Redshift), connecting with the Redshift JDBC4 driver.
From observation I found that inserting into Redshift is the bottleneck: only about 500 rows/second.
Are there any ways to accelerate the insertion into Redshift in single-machine mode, like using JDBC parameters?
Have you considered using S3 as a mid-layer?
Dump your data to CSV files and apply gzip compression. Upload the files to S3 and then use the COPY command to load the data.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
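The compress-then-upload step can be sketched with the standard library alone; the file name and rows below are just examples:

```python
# Sketch: write rows as CSV and gzip-compress them in one pass, ready
# for upload to S3 and a Redshift COPY ... GZIP. Paths are examples.
import csv
import gzip
import os
import tempfile

def write_gzipped_csv(path, rows):
    """Write an iterable of row tuples to a gzip-compressed CSV file."""
    with gzip.open(path, "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(rows)

path = os.path.join(tempfile.gettempdir(), "batch_0001.csv.gz")
write_gzipped_csv(path, [
    (1, "alice"),
    (2, "bob"),
])
# Then upload the file to S3 and run something like:
#   COPY my_table FROM 's3://my-bucket/batch_0001.csv.gz'
#   IAM_ROLE '<role-arn>' CSV GZIP;
```

Splitting the data into multiple gzipped files (one per slice, or a multiple thereof) lets COPY load them in parallel across the cluster.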
The main reason for the Redshift performance bottleneck, as I see it, is that Redshift treats each and every hit to the cluster as a single query. It executes each query on its cluster and then proceeds to the next one. So when I send multiple rows across (in this case 10), each row of data is treated as a separate query. Redshift executes the queries one by one, and loading of the data is complete only once all of them have been executed. That means if you have 100 million rows, there would be 100 million queries running on your Redshift cluster, and performance goes down the drain!
Using the S3 File Output step in PDI will load your data to an S3 bucket; then apply the COPY command on the Redshift cluster to read that same data from S3 into Redshift. This will solve your performance problem.
You may also read the below blog links :
Loading data to AWS S3 using PDI
Reading Data from S3 to Redshift
Hope this helps :)
It's better to export the data to S3, then use the COPY command to import it into Redshift. This way the import process is fast, and you don't need to vacuum afterwards.
Export your data to an S3 bucket and use the COPY command in Redshift. COPY is the fastest way to insert data into Redshift.