Amazon Redshift Spectrum - lets you query data ONLY on S3.
Amazon Redshift Federated Query - lets you query data in other databases and ALSO on S3.
So, if we are querying S3, the query we execute is exactly the same in both cases:
SELECT * FROM my_schema.my_table;
where my_schema is an external schema in the Glue Data Catalog, pointing to data in S3.
Therefore, when querying S3, how does Redshift decide whether it must use the Spectrum or the Federated Query engine? Is the underlying engine the same for both when querying S3?
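As an illustration of where the two paths diverge: the SELECT is identical, but the external schema DDL is not, and that is what routes the query to one engine or the other. A minimal sketch (database names, endpoint, and ARNs below are hypothetical):

-- Spectrum: the external schema points at the Glue Data Catalog (data on S3)
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'my_glue_db'
REGION 'us-east-1'
IAM_ROLE 'arn:aws:iam::123456789012:role/my_redshift_role';

-- Federated Query: the external schema points at an external PostgreSQL database
CREATE EXTERNAL SCHEMA federated_schema
FROM POSTGRES
DATABASE 'my_pg_db' SCHEMA 'public'
URI 'my-instance.abc123.us-east-1.rds.amazonaws.com'
IAM_ROLE 'arn:aws:iam::123456789012:role/my_redshift_role'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:my-pg-secret';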
In our data warehouse we have three data organization layers: Landing, Distilled, and Curated. We take the data in Landing and put it into the Distilled zone. In the Distilled zone, we run some technical data transformations, including SCD type 2 transformations. In the Curated zone, we apply more business transformations.
There is a business requirement that Distilled must also have all the data in S3.
For the transformations in Distilled, there are two options:
1. Keep the data in S3 and use a Glue job (serverless Spark) to run the transformations. Only for SCD type 2, use Redshift Spectrum to do the transformations in Distilled.
2. Load the data from S3 into Redshift and run all the transformations using Redshift.
My take is that option #2 will be much faster, because it can leverage Redshift's column-oriented storage architecture and Redshift's optimizer for better pruning.
I wanted to check if my understanding above is correct. I feel Redshift Spectrum will still be relatively slower than using Redshift for the transformations. Also, Spectrum can only insert data; it cannot do any updates.
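For reference, a minimal SCD type 2 sketch in Redshift, assuming a hypothetical staging table and a precomputed attribute hash column:

BEGIN;

-- Close out the current version of rows whose attributes changed
UPDATE distilled.customer_dim
SET is_current = FALSE,
    valid_to = s.load_date
FROM distilled.customer_stage s
WHERE customer_dim.customer_id = s.customer_id
  AND customer_dim.is_current
  AND customer_dim.attr_hash <> s.attr_hash;

-- Insert the new versions (changed customers now have no current row)
INSERT INTO distilled.customer_dim
    (customer_id, attr_hash, valid_from, valid_to, is_current)
SELECT s.customer_id, s.attr_hash, s.load_date, NULL, TRUE
FROM distilled.customer_stage s
LEFT JOIN distilled.customer_dim d
    ON d.customer_id = s.customer_id AND d.is_current
WHERE d.customer_id IS NULL;

COMMIT;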
Thanks
I'm currently scraping data and dumping it into a Cloud SQL Postgres database. This data grows quickly and I need an efficient way to execute queries: the database grows by ~3 GB/day and I'm looking to keep data for at least 3 months. Therefore, I've connected my Cloud SQL instance to BigQuery. The following is an example of a query that I'm running on BigQuery, but I'm skeptical: I'm not sure whether the query is being executed in Postgres or in BigQuery.
SELECT * FROM EXTERNAL_QUERY("project.us-cloudsql-instance", "SELECT date_trunc('day', created_at) d, variable1, AVG(variable2) FROM my_table GROUP BY 1,2 ORDER BY d;");
It seems like the query is being executed in PostgreSQL though, not BigQuery. Is this true? If it is, is there a way for me to load data from PostgreSQL into BigQuery in real time and execute queries directly in BigQuery?
I think you are using federated queries. These queries are intended to let BigQuery collect data from a Cloud SQL instance:
BigQuery Cloud SQL federation enables BigQuery to query data residing in Cloud SQL in real-time, without copying or moving data. It supports both MySQL (2nd generation) and PostgreSQL instances in Cloud SQL.
The query is being executed in Cloud SQL, and this can lead to lower performance than if you ran it in BigQuery.
EXTERNAL_QUERY executes the query in Cloud SQL and returns the results as a temporary table; the result is a BigQuery table.
Now, the current ways to load data into BigQuery are: from Cloud Storage; from other Google services such as Google Ad Manager and Google Ads; from a readable data source; by inserting individual records using streaming inserts; with DML statements; and with a BigQuery I/O transform in a Dataflow pipeline.
This solution is well worth a look, as it's pretty similar to what you need:
The MySQL to GCS operator executes a SELECT query against a MySQL table. The SELECT pulls all data greater than (or equal to) the last high watermark. The high watermark is either the primary key of the table (if the table is append-only), or a modification timestamp column (if the table receives updates). Again, the SELECT statement also goes back a bit in time (or rows) to catch potentially dropped rows from the last query (due to the issues mentioned above).
With Airflow they manage to keep BigQuery synchronized with their MySQL database every 15 minutes.
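As a rough SQL sketch of that incremental pull (MySQL flavor to match the quoted example; column names are hypothetical, and the watermark value would be bound by the scheduler at run time):

-- :last_watermark is a placeholder bound by the scheduler (e.g. Airflow);
-- subtracting a small window re-captures rows the previous run may have missed
SELECT *
FROM my_table
WHERE updated_at >= :last_watermark - INTERVAL 15 MINUTE;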
Although technically it is possible to rewrite the query as the following (note that the outer query then runs in BigQuery, so it has to use BigQuery syntax: TIMESTAMP_TRUNC rather than Postgres's date_trunc, assuming created_at is a TIMESTAMP):

SELECT TIMESTAMP_TRUNC(created_at, DAY) d, variable1, AVG(variable2)
FROM EXTERNAL_QUERY("project.us-cloudsql-instance",
    "SELECT created_at, variable1, variable2 FROM my_table")
GROUP BY 1, 2
ORDER BY d;
It is not recommended, though. It's better to do as much of the aggregation and filtering as possible in Cloud SQL, to reduce the amount of data that has to be transferred from Cloud SQL to BigQuery.
I've seen lots of blogs and posts comparing AWS Athena and Redshift Spectrum. The unanimous consensus seems to be that if you don't already have a Redshift implementation, just go with Athena.
Are there any scenarios or thresholds where Redshift Spectrum would better support a reporting need, and force a switch from Athena to Redshift?
--Update--
I found the following in the Big Data Analytics Options on AWS whitepaper, under the Anti-Patterns section for Athena:
Amazon Redshift is a better tool for Enterprise Reporting and Business Intelligence Workloads involving iceberg queries or cached data at the nodes.
Is it fair to say, then, that Athena is for data analytics as opposed to business intelligence?
https://www.stitchdata.com/blog/business-intelligence-vs-data-analytics/
So it comes down to storage. Storing large amounts of structured data only makes sense in a true data warehouse setup like Redshift. Trying to fit the same volume of data into flat files like Parquet isn't appropriate.
I am trying to access an existing AWS Athena table from AWS Redshift.
I tried creating an external schema (pointing to the AWS Athena DB) in the AWS Redshift console. It creates the external schema successfully, but it doesn't display the tables from the Athena DB. Below is the code used.
CREATE EXTERNAL SCHEMA Ext_schema_1
FROM DATA CATALOG
DATABASE 'sample_poc'
REGION 'us-east-1'
IAM_ROLE 'arn:aws:iam::55276673986:role/sample_Redshift_Role';
A few observations:
1. Even if I specify a non-existent Athena DB name, it still creates the external schema in Redshift.
2. My Redshift role has full access to S3 & Athena.
AWS Glue Catalog contains databases, which contain tables. There are no schemas from the perspective of Athena or Glue Catalog.
In Redshift Spectrum, you create an EXTERNAL SCHEMA which is really a placeholder object, a pointer within Redshift to the Glue Catalog.
Even if I specify a non-existent Athena DB name, it still creates the external schema in Redshift.
The creation of the object is lazy, as you have discovered, which is useful if the IAM Role needs adjusting. Note that the example in the docs has an additional clause:
create external database if not exists
So your full statement would need to be the following if you wanted the database to be created as well:
CREATE EXTERNAL SCHEMA Ext_schema_1
FROM DATA CATALOG
DATABASE 'sample_poc'
REGION 'us-east-1'
IAM_ROLE 'arn:aws:iam::55276673986:role/sample_Redshift_Role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
it doesn't display the tables from the Athena DB
If you are creating an EXTERNAL SCHEMA pointing to a non-existent database, then there will be nothing to display. I assume your point 1 is unrelated to the real attempt you made to create the external schema, i.e. that you pointed it at an existing database with tables.
I have found that tables created using Redshift Spectrum DDL are immediately available to Athena via the Glue Catalog.
I have also tried specifying tables in the Glue Catalog directly, and alternatively using the Crawler; in both cases those tables are visible in Redshift.
What tool are you using to try to display the tables? Do you mean the tables don't appear in the metadata views, or that the contents of the tables don't display?
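For example, external tables registered through the Glue Catalog should show up in Redshift's metadata views; a quick check (using the schema name from your question, which Redshift lowercases by default):

SELECT schemaname, tablename
FROM svv_external_tables
WHERE schemaname = 'ext_schema_1';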
Redshift does appear to have some differences in the data types that are allowed, and the Hive DDL required in Athena can differ from the Redshift Spectrum DDL. Spectrum also has some nesting limitations.
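As a small illustration of those DDL differences, a hypothetical table declared on the Athena side uses Hive types such as string, where the equivalent Redshift Spectrum DDL would use varchar (the table and bucket names are made up):

-- Athena (Hive DDL); the same column in Spectrum DDL would be VARCHAR
CREATE EXTERNAL TABLE sample_poc.my_events (
    event_id bigint,
    payload  string
)
STORED AS PARQUET
LOCATION 's3://my-bucket/my_events/';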
My Redshift role has full access to S3 & Athena
Assuming you are using the Glue Catalog, and not the old Athena catalog, your role doesn't need any Athena access.
I have a large number of PostgreSQL tables with different schemas and a massive amount of data inside them.
I'm unable to do data analytics right now because the amount of data is quite large (a few TB) and PostgreSQL is not able to process queries in a reasonable amount of time.
I'm thinking about the following approach: I'll process all of my PostgreSQL tables with Apache Spark, load the DataFrames, and store them as Parquet files in AWS S3. Then I'll use Redshift Spectrum to query the information stored inside these Parquet files.
First of all, I'd like to ask: will this solution work at all?
And second: will Redshift Spectrum be able to automatically create EXTERNAL tables from these Parquet files without an additional schema specification (even when the original PostgreSQL tables contain data types that are unsupported by AWS Redshift)?
Redshift Spectrum supports pretty much the same data types as Redshift itself.
Redshift Spectrum creates a cluster of compute nodes behind the scenes. The size of that cluster is based on the number of nodes in your actual Redshift cluster, so if you plan to create a 1-node Redshift cluster, Spectrum will run pretty slowly.
As you noted in the comments, you can use Athena to query the data, and it will be a better option in your case than Spectrum. But Athena has several limitations, like the 30-minute run time, memory consumption, etc. So if you plan to run complicated queries with several joins, it may simply not work.
Redshift Spectrum can't create external tables without a provided structure.
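In other words, you have to declare the columns yourself. A minimal sketch of Spectrum DDL for Parquet data (the schema, table, column, and bucket names are hypothetical):

-- The external schema must already exist and point at the Glue Catalog
CREATE EXTERNAL TABLE ext_schema.events (
    event_id   BIGINT,
    created_at TIMESTAMP,
    payload    VARCHAR(65535)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';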
The best solution in your case would be to use Spark (on EMR or Glue) to transform the data, Athena to query it, and if Athena can't handle a specific query, use Spark SQL on the same data. You can use Glue, but running jobs on EMR with Spot Instances will be more flexible and cheaper. EMR clusters come with EMRFS, which gives you the ability to use S3 almost transparently instead of HDFS.
AWS Glue might be interesting as an option for you. It is both a hosted version of Spark with some AWS-specific add-ons, and a data crawler plus data catalog.
It can crawl data files such as Parquet and figure out their structure, which then lets you export the data to AWS Redshift in structured form if needed.
See this blog post on how to connect it to a Postgres database using JDBC to move data from Postgres to S3.