Will queries run against S3 using a Glue job be faster than running the queries in Redshift?

In our data warehouse we have three data organization layers: Landing, Distilled, and Curated. We take the data in Landing and put it into the Distilled zone. In the Distilled zone, we run some technical data transformations, including SCD Type 2 transformations. In the Curated zone, we apply more business transformations.
There is a business requirement that Distilled must also have all of its data in S3.
For transformations in Distilled, there are two options:
1. Keep the data in S3 and use a Glue job (serverless Spark) to run the transformations; only for SCD Type 2, use Redshift Spectrum.
2. Load the data from S3 into Redshift and run all transformations in Redshift.
My take is that option 2 will be much faster, because it can leverage Redshift's column-oriented storage architecture as well as its optimizer for better pruning.
I wanted to check whether my understanding above is correct. I feel Redshift Spectrum will still be relatively slower than using Redshift itself for the transformations. Also, Spectrum can only insert data; it cannot do any updates.
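For context on what the SCD Type 2 step involves, here is a minimal, illustrative sketch of the usual "close the old row, insert the new row" merge as it might be issued against Redshift. All table and column names (dim_customer, stg_customer, customer_id, and the tracked columns) are hypothetical, and the helper only builds the SQL text:

```python
# Hypothetical SCD Type 2 merge: close the current dimension row when a
# tracked attribute changed, then insert the fresh staging rows.
def scd2_sql(dim: str, stg: str, key: str, tracked_cols) -> str:
    """Build the two statements of a basic SCD Type 2 merge for Redshift."""
    changed = " OR ".join(f"{dim}.{c} <> {stg}.{c}" for c in tracked_cols)
    close_old = (
        f"UPDATE {dim} SET is_current = FALSE, valid_to = GETDATE() "
        f"FROM {stg} WHERE {dim}.{key} = {stg}.{key} "
        f"AND {dim}.is_current AND ({changed});"
    )
    cols = ", ".join([key] + list(tracked_cols))
    # A production version would restrict this INSERT to new or changed rows.
    insert_new = (
        f"INSERT INTO {dim} ({cols}, valid_from, valid_to, is_current) "
        f"SELECT {cols}, GETDATE(), NULL, TRUE FROM {stg};"
    )
    return close_old + "\n" + insert_new
```

Because the pattern needs an UPDATE on the dimension table, it maps naturally onto Redshift proper rather than Spectrum, which cannot update external tables.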
Thanks

Related

Datapipeline from Sagemaker to Redshift

I wanted to check with the community here whether anyone has explored the pipeline option from SageMaker to Redshift directly.
I want to load the predicted data from SageMaker into a table in Redshift. I was planning to do it via S3, but was wondering if there are better ways to do this.
I think your idea to stage data in S3, if acceptable in your specific use-case, is a good baseline design:
SageMaker smoothly connects to S3 (via Batch Transform or Processing job)
Redshift COPY statements are best practice for efficient loading of data, and can be done from S3 ("COPY loads large amounts of data much more efficiently than using INSERT statements, and stores the data more effectively as well." - Redshift documentation)
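As a concrete sketch of that COPY step, the snippet below only assembles the statement; the table name, S3 prefix, and IAM role ARN are placeholders, not values from the question:

```python
# Placeholder table name, S3 prefix, and IAM role ARN; only builds the SQL text.
def copy_statement(table: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift COPY command for Parquet files staged in S3."""
    return (
        f"COPY {table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS PARQUET;"
    )

sql = copy_statement(
    "predictions",
    "s3://my-bucket/sagemaker-output/",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
# `sql` would then be executed over a Redshift connection,
# e.g. with the redshift_connector or psycopg2 packages.
```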

What's the best way to read/write from/to Redshift with Scala Spark, since the spark-redshift lib is not supported publicly by Databricks?

I have a Spark project in Scala and I want to use Redshift as my data warehouse. I found that the spark-redshift repo exists, but Databricks made it private a couple of years ago and no longer supports it publicly.
What's the best option right now for dealing with Amazon Redshift from Spark (Scala)?
This is a partial answer, as I have only used Spark->Redshift in a real-world use case and have never benchmarked Spark's read-from-Redshift performance.
When it comes to writing from Spark to Redshift, by far the most performant way that I could find was to write parquet to S3 and then use Redshift Copy to load the data. Writing to Redshift through JDBC also works but it is several orders of magnitude slower than the former method. Other storage formats could be tried as well, but I would be surprised if any row-oriented format could beat Parquet as Redshift internally stores data in columnar format. Another columnar format that is supported by both Spark and Redshift is ORC.
I never came across a use-case of reading large amounts of data from Redshift using Spark as it feels more natural to load all the data to Redshift and use it for joins and aggregations. It is probably not cost-efficient to use Redshift just as a bulk storage and use another engine for joins and aggregations. For reading small amounts of data, JDBC works fine. For large reads, my best guess is Unload command and S3.
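For that large-read path, the pattern would be an UNLOAD to S3 followed by a Spark read of the resulting Parquet files. Below is a minimal sketch that only builds the UNLOAD statement; the table name, S3 prefix, and IAM role ARN are placeholders, and in practice the statement would run over a Redshift connection:

```python
# Sketch of the "large read" path: UNLOAD the query result to S3 as Parquet,
# then read the files with Spark. All names below are placeholders.
def unload_statement(query: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift UNLOAD command that writes query results to S3 as Parquet."""
    escaped = query.replace("'", "''")  # UNLOAD takes the query as a quoted literal
    return (
        f"UNLOAD ('{escaped}') TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS PARQUET;"
    )

sql = unload_statement(
    "SELECT * FROM big_table WHERE event_date >= '2023-01-01'",
    "s3://my-bucket/unload/big_table_",
    "arn:aws:iam::123456789012:role/RedshiftUnloadRole",
)
# On the Spark side, the result would be read back with something like
# spark.read.parquet("s3://my-bucket/unload/").
```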

Is there a threshold or use case that would push an implementation from AWS Athena to Redshift Spectrum?

I've seen lots of blogs and posts comparing AWS Athena and Redshift Spectrum. The unanimous consensus seems to be that if you don't already have a Redshift implementation, just go with Athena.
Are there any scenarios or thresholds where Redshift Spectrum would better support a reporting need, and force a switch from Athena to Redshift?
--Update--
I found the following in the Big Data Analytics Options on AWS whitepaper, under the Anti-Patterns section for Athena:
Amazon Redshift is a better tool for Enterprise Reporting and Business Intelligence Workloads involving iceberg queries or cached data at the nodes.
Then is it fair to say that Athena is for data analytics as opposed to business intelligence?
https://www.stitchdata.com/blog/business-intelligence-vs-data-analytics/
So it comes down to storage. Storing large amounts of structured data only makes sense in a true data warehouse setup like Redshift.
Trying to fit the same volume of data into flat files like Parquet isn't appropriate.

Move data from PostgreSQL to AWS S3 and analyze with RedShift Spectrum

I have a large number of PostgreSQL tables with different schemas and a massive amount of data in them.
I'm unable to do data analytics right now because the data volume is quite large - a few TB - and PostgreSQL can't process queries in a reasonable amount of time.
I'm thinking about the following approach: process all of my PostgreSQL tables with Apache Spark, load the DataFrames, and store them as Parquet files in AWS S3. Then I'll use Redshift Spectrum to query the information stored in these Parquet files.
First of all, I'd like to ask: will this solution work at all?
And second: will Redshift Spectrum be able to automatically create EXTERNAL tables from these Parquet files without an additional schema specification (even when the original PostgreSQL tables contain data types unsupported by Redshift)?
Redshift Spectrum supports pretty much the same data types as Redshift itself.
Redshift Spectrum creates a cluster of compute nodes behind the scenes. The size of that cluster is based on the number of nodes in your actual Redshift cluster, so if you plan to create a 1-node Redshift cluster, Spectrum will run pretty slowly.
As you noted in the comments, you can use Athena to query the data, and in your case it will be a better option than Spectrum. But Athena has several limitations, like a 30-minute run time, memory consumption, etc. So if you plan to do complicated queries with several joins, it may simply not work.
Redshift Spectrum can't create external tables without provided structure.
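Since Spectrum can't infer the structure, the external table has to be declared explicitly. The sketch below just assembles such a DDL statement; the schema, table, column names, and S3 path are placeholders, and an external schema pointing at the Glue Data Catalog would have to exist already (via CREATE EXTERNAL SCHEMA):

```python
# Hypothetical schema/table/columns; only builds the DDL text.
def external_table_ddl(schema: str, table: str, columns, s3_prefix: str) -> str:
    """Build a Redshift Spectrum CREATE EXTERNAL TABLE statement for Parquet data."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} ({cols}) "
        f"STORED AS PARQUET LOCATION '{s3_prefix}';"
    )

sql = external_table_ddl(
    "spectrum_schema",
    "events",
    {"event_id": "BIGINT", "event_ts": "TIMESTAMP", "payload": "VARCHAR(4096)"},
    "s3://my-bucket/distilled/events/",
)
```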
The best solution in your case would be to use Spark (on EMR or Glue) to transform the data, Athena to query it, and, if Athena can't handle a specific query, SparkSQL on the same data. You can use Glue, but running jobs on EMR with Spot Instances will be more flexible and cheaper. EMR clusters come with EMRFS, which gives you the ability to use S3 almost transparently instead of HDFS.
AWS Glue might be an interesting option for you. It is both a hosted version of Spark with some AWS-specific add-ons and a data crawler plus data catalog.
It can crawl unstructured data such as Parquet files and figure out the structure, which then allows you to export it to Redshift in structured form if needed.
See this blog post on how to connect it to a postgres database using JDBC to move data from Postgres to S3.
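Registering such a crawler over the S3 prefix is a small API call; the sketch below only builds the argument dict for boto3's Glue create_crawler and makes no AWS calls, with every name and ARN being a placeholder:

```python
# Hypothetical names and ARN; builds only the argument dict, makes no AWS calls.
def crawler_request(name: str, role_arn: str, s3_prefix: str, database: str) -> dict:
    """Arguments for boto3's Glue client create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_prefix}]},
    }

req = crawler_request(
    "distilled-crawler",
    "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "s3://my-bucket/distilled/",
    "distilled_db",
)
# boto3.client("glue").create_crawler(**req) would register the crawler.
```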

How do I use Redshift Database for Transformation and Reporting?

I have 3 tables in my Redshift database, and data is coming into them from 3 different CSV files in S3 every few seconds. One table has ~3 billion records and the other 2 have ~100 million records each. For near-real-time reporting purposes, I have to merge these tables into 1 table. How do I achieve this in Redshift?
Near Real Time Data Loads in Amazon Redshift
I would say that the first step is to consider whether Redshift is the best platform for the workload you are considering. Redshift is not an optimal platform for streaming data.
Redshift's architecture is better suited for batch inserts than streaming inserts. "COMMIT"s are "costly" in Redshift.
You need to consider the performance impact of VACUUM and ANALYZE if those operations are going to compete for resources with streaming data.
It might still make sense to use Redshift for your project depending on the entire set of requirements and workload, but bear in mind that in order to use Redshift you are going to have to engineer around these constraints, probably changing your workload from "near-real-time" to a micro-batch architecture.
This blog post details all the recommendations for micro-batch loads in Redshift. Read the Micro-batch article here.
In order to summarize it:
Break input files --- split your load files into several smaller files that are a multiple of the number of slices
Column encoding --- have column encoding pre-defined in your DDL
COPY settings --- ensure COPY does not attempt to evaluate the best encoding for each load
Load in SORT key order --- if possible, your input files should have the same "natural order" as your sort key
Staging tables --- use multiple staging tables and load them in parallel
Multiple time-series tables --- this is a documented approach for dealing with time series in Redshift
ELT --- do transformations in-database using SQL to load into the main fact table
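Taken together, the staging-table and ELT recommendations amount to a per-batch transaction like the following sketch, which only builds the SQL text; the table names, S3 path, and role ARN are placeholders:

```python
# One micro-batch: COPY into a staging table, fold into the main fact table,
# and pay the commit cost once per batch. All names are placeholders.
def microbatch_sql(staging: str, target: str, s3_prefix: str, iam_role: str) -> str:
    """Build a single micro-batch load transaction for Redshift."""
    return "\n".join([
        "BEGIN;",
        f"TRUNCATE {staging};",
        f"COPY {staging} FROM '{s3_prefix}' IAM_ROLE '{iam_role}' FORMAT AS PARQUET;",
        # In-database transformation (ELT) into the main fact table:
        f"INSERT INTO {target} SELECT * FROM {staging};",
        "COMMIT;  -- one commit per batch, since commits are costly in Redshift",
    ])
```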
Of course all the recommendations for data loading in Redshift still apply. Look at this article here.
Last but not least, enable Workload Management to ensure the online queries can access the proper amount of resources. Here is an article on how to do it.