CREATE MODEL with Redshift: Correlation

How can I see the correlations with the target column of the trained models? Is there an option in the Redshift query editor or SageMaker?
Cheers

You can refer to the blog Build regression models with Amazon Redshift ML for instructions on using a sample stored procedure that will calculate correlations with the target variable in Redshift ML.
If you need just the stored procedure, it is also available on GitHub here - https://github.com/awslabs/amazon-redshift-utils/blob/master/src/StoredProcedures/sp_correlation.sql
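For reference, here is a hedged sketch of how that stored procedure might be invoked once it has been created from the GitHub script; the parameter names, their order, and the output table are assumptions, so check the script header for the actual signature:

```sql
-- Assumed invocation of sp_correlation from amazon-redshift-utils.
-- The parameters shown here (source schema, source table, target column, output temp table)
-- are an assumption; verify against the script before running.
CALL sp_correlation('public', 'training_data', 'target_label', 'tmp_corr_results');

-- Inspect the correlation of each numeric column with the target.
SELECT * FROM tmp_corr_results;
```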

Related

Will queries run against S3 using a Glue job be faster than running the queries in Redshift?

In our data warehouse we have three data organization layers: Landing, Distilled and Curated. We take the data in Landing and put it into the Distilled zone. In the Distilled zone, we run some technical data transformations, including SCD type 2 transformations. In the Curated zone, we apply more business transformations.
There is a business requirement that Distilled must also have all data in S3.
For transformations in Distilled, there are two options:
1. Keep the data in S3 and use a Glue job (serverless Spark) to run the transformations. Only for SCD type 2, use Redshift Spectrum to do the transformations in Distilled.
2. Load the data from S3 into Redshift and run all transformations in Redshift.
My take is that option #2 will be much faster because it can leverage Redshift's column-oriented storage architecture and its optimizer for better pruning.
I wanted to check if my understanding above is correct. I feel Redshift Spectrum will still be relatively slower than using Redshift for the transformations. Also, Spectrum can only insert data; it cannot do any updates (a sketch of the SCD type 2 step I have in mind follows below).
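To make that concrete, here is roughly what the SCD type 2 step would look like if run natively in Redshift (option #2); the table and column names are made up, and the UPDATE in step 1 is exactly the part Spectrum cannot do:

```sql
-- Minimal SCD type 2 sketch in Redshift (hypothetical table and column names).
-- Step 1: close out the current dimension rows whose attributes changed.
UPDATE dim_customer
SET is_current = FALSE,
    valid_to   = CURRENT_DATE
FROM stage_customer s
WHERE dim_customer.customer_id   = s.customer_id
  AND dim_customer.is_current    = TRUE
  AND dim_customer.customer_name <> s.customer_name;

-- Step 2: insert new versions for customers that changed (closed above) or are brand new.
INSERT INTO dim_customer (customer_id, customer_name, valid_from, valid_to, is_current)
SELECT s.customer_id, s.customer_name, CURRENT_DATE, NULL, TRUE
FROM stage_customer s
LEFT JOIN dim_customer d
  ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL;
```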
Thanks

Row-based database or column-based database

We are working on an audit system where auditors are given access to transactions processed in the last quarter. Auditors perform various analyses on the data to find invalid/erroneous transactions that have some exceptions.
Generally, these analyses require the data to be presented on charts to view the outliers, or sometimes duplicate detection is done based on multiple columns.
Sometimes the exception detection algorithms are pretty involved and require multiple processing steps using stored procedures.
Please note that the analysis rarely involves aggregation over huge numbers of rows.
Occasionally, they can change some data if they find it missing or incorrect.
We are evaluating row-based stores (SQL & NoSQL databases) and column stores (like data warehouse systems).
Is this a use case for a data warehouse or a row-based store, like NoSQL or some RDBMS?
In short, requirements are:
- Occasional updates
- Mostly read queries over the last 3 months of data
- Reading data may require several massaging steps, like creating a temp table in step 1, joining it with another table in step 2, deleting some rows, etc.
Thanks
For your task, it does not really matter how the data is stored. You need to think instead about how to create a solid dimensional model, how to populate it with data properly, and what reporting tools to use.
To give you an example, here are a couple of common setups I've used in my projects:
Microsoft stack setup:
- SQL Server for data storage
- SSIS for data ETL (or write your own stored procedures if you know what you are doing)
- Publish the dimensional model on the same SQL Server. If your data set is large (over a billion records), use SSAS Tabular instead
- Power Pivot or Power BI for interactive reporting, or SSRS for paginated reports.
Open-source setup:
- PostgreSQL for data storage
- Use stored procedures and/or Python to process data
- Publish the dimensional model to another PostgreSQL database. If your data is large, publish the dimensional model to Redshift or another columnar database
- Use Tableau or Power BI for interactive reporting, or build your own reporting interface.
I think a NoSQL database is the wrong choice here because the audit will require highly structured data.
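For illustration, here is a minimal sketch of the kind of dimensional model I mean, in PostgreSQL syntax with made-up table and column names; the point is that a clean star schema, not the storage engine, is what the audit queries will lean on:

```sql
-- Hypothetical star schema for the audit data (placeholder names throughout).
CREATE TABLE dim_account (
    account_key  BIGINT PRIMARY KEY,
    account_no   VARCHAR(30),
    account_type VARCHAR(30)
);

CREATE TABLE dim_date (
    date_key      INT PRIMARY KEY,      -- e.g. 20240131
    full_date     DATE,
    quarter_label VARCHAR(7)            -- e.g. '2024-Q1'
);

CREATE TABLE fact_transaction (
    transaction_id BIGINT PRIMARY KEY,
    account_key    BIGINT REFERENCES dim_account (account_key),
    date_key       INT    REFERENCES dim_date (date_key),
    amount         NUMERIC(18,2),
    exception_flag BOOLEAN DEFAULT FALSE  -- set by the exception-detection steps
);
```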

Move data from PostgreSQL to AWS S3 and analyze with RedShift Spectrum

I have a large number of PostgreSQL tables with different schemas and a massive amount of data inside them.
I'm unable to do data analytics right now because the data volume is quite large - a few TB - and PostgreSQL is not able to process queries in a reasonable amount of time.
I'm thinking about the following approach - I'll process all of my PostgreSQL tables with Apache Spark, load the DataFrames and store them as Parquet files in AWS S3. Then I'll use Redshift Spectrum to query the information stored inside these Parquet files.
First of all, I'd like to ask - will this solution work at all?
And second - will Redshift Spectrum be able to automatically create EXTERNAL tables from these Parquet files without an additional schema specification (even when the original PostgreSQL tables contain data types unsupported by AWS Redshift)?
Redshift Spectrum supports pretty much the same data types as Redshift itself.
Redshift Spectrum creates a cluster of compute nodes behind the scenes. The size of that cluster is based on the number of actual Redshift cluster nodes, so if you plan to create a 1-node Redshift cluster, Spectrum will run pretty slowly.
As you noted in the comments, you can use Athena to query the data, and it will be a better option in your case than Spectrum. But Athena has several limitations, like the 30-minute run time, memory consumption, etc. So if you plan to do complicated queries with several joins, it may simply not work.
Redshift Spectrum can't create external tables without a provided structure; you have to define them yourself (see the sketch below).
The best solution in your case will be to use Spark (on EMR or Glue) to transform the data and Athena to query it, and if Athena can't handle a specific query, use Spark SQL on the same data. You can use Glue, but running jobs on EMR on Spot Instances will be more flexible and cheaper. EMR clusters come with EMRFS, which gives you the ability to use S3 almost transparently instead of HDFS.
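To illustrate the point about external tables, here is a minimal sketch of defining one over the Parquet files yourself; the schema, database, table, S3 path, and IAM role below are placeholders:

```sql
-- Register an external schema backed by the Glue Data Catalog (placeholder names throughout).
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_demo
FROM DATA CATALOG
DATABASE 'spectrum_demo_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define the table structure explicitly over the Parquet files in S3.
CREATE EXTERNAL TABLE spectrum_demo.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_ts    TIMESTAMP,
    amount      DECIMAL(12,2)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/exported/orders/';

-- Query it like any other table.
SELECT COUNT(*) FROM spectrum_demo.orders;
```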
AWS Glue might be interesting as an option for you. It is both a hosted version of Spark, with some AWS-specific add-ons, and a data crawler + data catalogue.
It can crawl files such as Parquet and figure out the structure, which then allows you to export the data to AWS Redshift in structured form if needed.
See this blog post on how to connect it to a postgres database using JDBC to move data from Postgres to S3.

How can I run MDX queries against a PostgreSQL database?

If I have a PostgreSQL server running with my data already structured in facts and dimensions, how can I run MDX queries against it?
Let's suppose each row in the fact table is a sale, so the fact table has the following columns: id, product_id, country_id and amount.
And the dimension tables are very simple: product_id and product_name, and country_id and country_name.
How should I proceed to be able to run MDX queries against this data? I tried downloading Mondrian but I found it very hard to use.
Please keep in mind I am not a developer, so my technical skills are limited; I work at an investment fund and I want to be able to run more powerful analysis on our data sets. But I do have some basic knowledge of SQL and I can code a little bit in Ruby.
As you already have a DWH (data warehouse) in PostgreSQL which contains dimension tables and fact tables, you are now two steps away from building a simple analysis solution. The solution I recommend consists of:
DWH: PostgreSQL
OLAP server: Mondrian OLAP (OLAP schema workbench tool)
Analysis tool: Saiku Analysis application (you can preview Saiku demo here)
Steps:
Download the OLAP schema workbench tool. Using this tool you can easily create a Mondrian OLAP schema on top of the existing tables (dimensions, facts) of your DWH.
Once you have created the OLAP schema, download the Saiku Analysis application and configure it to use your OLAP schema and your DWH.
Run Saiku - you can run MDX queries on the DWH or do ad-hoc data analysis by dragging and dropping measures (amount, etc.) and dimensions (product name, country name).

Performance using T-SQL PIVOT vs SSIS PIVOT Transformation Component

I am in the process of building a Dimension from the EDW (source), wherein I need to pivot columns of the source to load the Dimension.
Currently, most of the pivoting I am doing is with T-SQL PIVOT, which is then used in my SSIS package to merge with the Dim table.
This pivoting can also be achieved with the SSIS Pivot Transformation component.
With regard to performance, which approach would be best?
Thanks
In theory, SQL Server pivot performance should be faster, or at least the same, but to be sure you would need to run some performance comparison tests.
But even if SSIS currently has the advantage, feel free to use SQL Server, as staying out of SSIS is a good thing.
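For reference, here is a minimal sketch of the T-SQL PIVOT side of the comparison, with hypothetical source and column names:

```sql
-- Rotate one row per (customer, quarter) into one row per customer (placeholder names).
SELECT customer_id, [Q1], [Q2], [Q3], [Q4]
FROM (
    SELECT customer_id, sales_quarter, amount
    FROM edw.fact_sales_staging
) AS src
PIVOT (
    SUM(amount) FOR sales_quarter IN ([Q1], [Q2], [Q3], [Q4])
) AS p;
```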