Is there a threshold or use case that would push an implementation from AWS Athena to Redshift Spectrum? - amazon-redshift

I've seen lots of blogs and posts comparing AWS Athena and Redshift Spectrum. The unanimous consensus seems to be that if you don't already have a Redshift implementation, just go with Athena.
Are there any scenarios or thresholds where Redshift Spectrum would better support a reporting need, and force a switch from Athena to Redshift?
--Update--
I found the following in the Big Data Analytics Options on AWS white paper under the Anti-Patterns section for Athena
Amazon Redshift is a better tool for Enterprise Reporting and Business Intelligence Workloads involving iceberg queries or cached data at the
nodes.
Then is it fair to say that Athena is for data analytics as opposed to business intelligence?
https://www.stitchdata.com/blog/business-intelligence-vs-data-analytics/

So it comes down to storage. The storage of large amounts of structured data only makes sense in a true data wharehouse setup like Redshift.
Trying to fit the same level of data into flat files like Parquet isn't appropriate.

Related

Will queries run against S3 using Glue job be faster than running the queries in RedShift?

In our data warehouse we have three data organization layers, Landing, Distilled and Curated. We take the data in landing and put it into the distilled zone.In the distilled zone, we run some technical data transformations including SCD type 2 transformations. In the curated zone, we apply more business transformations
There is a business requirement that distilled must have all data in S3 also.
For transformations in Distilled, there are two options
Have the data in S3 and use glue job(serverless spark) to run the transformations. Only for SCD type2, use RedShift spectrum to do the transformations in distilled.
Load the data from S3 to RedShift and run all transformations using RedShift
My take is that option#2 will be much faster because it will be able to leverage the column oriented data store architecture of RedShift and also the optimizer of redshift for better pruning.
I wanted to check if my understanding above is correct. I feel RedShift spectrum will still be relatively slower that using RedShift for the transformation. Also, Spectrum can only insert data, it cannot do any updates.
Thanks

Datapipeline from Sagemaker to Redshift

I wanted to check with the community here if they have explore the pipeline option from Sagemaker to Redshift directly.
I want to load the predicted data from Sagemaker to a table in Redshift. I was planning to do it via S3, but was wondering if there are better ways to do this.
I think your idea to stage data in S3, if acceptable in your specific use-case, is a good baseline design:
SageMaker smoothly connects to S3 (via Batch Transform or Processing job)
Redshift COPY statements are best practice for efficient loading of data, and can be done from S3 ("COPY loads large amounts of data much more efficiently than using INSERT statements, and stores the data more effectively as well." - Redshift documentation)

What's the best way to read/write from/to Redshift with Scala spark since spark-redshift lib is not supported publicly by Databricks

I have my Spark project in Scala I want to use Redshift as my DataWarehouse, I have found spark-redshift repo exists but Databricks made it private since a couple of years ago and doesn't support it publicly anymore.
What's the best option right now to deal with Amazon Redshift and Spark (Scala)
This is a partial answer as I have only been using Spark->Redshift in a real world use-case and never benchmarked Spark read from Redshift performance.
When it comes to writing from Spark to Redshift, by far the most performant way that I could find was to write parquet to S3 and then use Redshift Copy to load the data. Writing to Redshift through JDBC also works but it is several orders of magnitude slower than the former method. Other storage formats could be tried as well, but I would be surprised if any row-oriented format could beat Parquet as Redshift internally stores data in columnar format. Another columnar format that is supported by both Spark and Redshift is ORC.
I never came across a use-case of reading large amounts of data from Redshift using Spark as it feels more natural to load all the data to Redshift and use it for joins and aggregations. It is probably not cost-efficient to use Redshift just as a bulk storage and use another engine for joins and aggregations. For reading small amounts of data, JDBC works fine. For large reads, my best guess is Unload command and S3.

Anyone Migrated from Snowflake to Redshift

Anyone ever tried migrating from Snowflake to Redshift? Or things to consider if you have been using Snowflake.
Need to understand the implications as providing options paper.
Thanks
P
This is a very wide topic and consideration depends on the usage, volume of data, availability, SLA etc. For migration to Redshift I recommend to review white papers and options in AWS Database Migration Service.
This is a very helpful guide on how to migrate data from snowflake to redshift cost effectively. Basically you can orchestrate migration using Glue Workflow and automate generation of COPY commands both in SF and RS.
https://aws.amazon.com/blogs/big-data/migrate-from-snowflake-to-amazon-redshift-using-aws-glue-python-shell/

Move data from PostgreSQL to AWS S3 and analyze with RedShift Spectrum

I have a big amount of PostgreSQL tables with different schemas and the massive amount of data inside them.
I'm unable to do the data analytics right now because the data amount is quite large - a few TB of data and PostgreSQL is not able to process queries in a reasonable amount of time.
I'm thinking about the following approach - I'll process all of my PostgreSQL tables with Apache Spark, load the DataFrames and store them as the Parquet files in AWS S3. Then I'll use RedShift Spectrum in order to query the information stored inside of these PARQUET files.
First of all, I'd like to ask - will this solution work at all?
And the second - will RedShift Spectrum be able to automatically create EXTERNAL tables from these Parquet files without additional schema specification(even when the original PostgreSQL tables contain the unsupported data types by AWS RedShift)?
Redshift Spectrum pretty much supports same datatypes as Redshift itself.
Redshift Spectrum creates cluster of compute nodes behind the scenes. The size of cluster is based on number of actual Redshift Cluster nodes, so if you plan to create 1 node Redshift cluster, Spectrum will run pretty slow.
As you noted in comments, you can use Athena to query the data, and it will be better option in your case instead of Spectrum. But Athena has several limitations, like 30 min run time, memory consumption etc. So if you plan to do complicated queries with several joins, it can just not work.
Redshift Spectrum can't create external tables without provided structure.
Best solution in your case will be to use Spark (on EMR, or Glue) to transform the data, Athena to query it, and if Athena can't do specific query - use SparkSQL on same data. You can use Glue, but running jobs on EMR on Spot Instances will be more flexible and cheaper. EMR clusters comes with EMRFS, which gives you the ability to use S3 almost transparently instead of HDFS.
AWS Glue might be interesting as an option for you. It is both a hosted version of Spark, with some AWS specific addons and a Data Crawler + Data Catalogue.
It can crawl unstructured data such as Parquet files and figure out the structure. Which then allows you to export it to AWS RedShift in structured form if needed.
See this blog post on how to connect it to a postgres database using JDBC to move data from Postgres to S3.