Data Vault in Redshift and ETL Strategy

I have a bunch of data in files stored in Amazon S3 and am planning to use it to build a Data Vault in Redshift. My first question is whether the right approach is to build the DV and the Data Marts all in Redshift, or whether I should treat S3 as my Data Lake and keep only the Data Marts in Redshift.
In my architecture I'm currently considering the former (i.e. S3 Data Lake + Redshift Vault and Marts). However, I don't know whether I can create ETL processes directly in Redshift to populate the Marts with data from the Vault, or whether I'll have to, for example, use Amazon EMR to process the raw data in S3, generate new files there, and finally load them into the Marts.
So, my second question is: what should the ETL strategy be? Thanks.

Apologies! I don't have the reputation to comment, which is why I'm writing in the answer section. I am in exactly the same boat as you. I'm trying to perform my ETL operations in Redshift, and as of now I have 3 billion rows and expect that to grow drastically. Right now I'm loading data into data marts in Redshift using DMLs that are called at regular intervals from AWS Lambda. In my opinion, it is very difficult to create a Data Vault in Redshift.
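As a rough illustration of the kind of DML such a Lambda-triggered refresh might run (all table and column names below are hypothetical, not from the original post), a delete-then-insert pattern against a mart table could look like this:

```sql
-- Hypothetical incremental refresh of a mart table from a staging table.
-- Assumes mart_sales and stg_sales share the business key sale_id.
BEGIN;

-- Remove mart rows that are about to be replaced by the new batch.
DELETE FROM mart_sales
USING stg_sales
WHERE mart_sales.sale_id = stg_sales.sale_id;

-- Append the new batch from staging.
INSERT INTO mart_sales (sale_id, customer_id, amount, load_ts)
SELECT sale_id, customer_id, amount, load_ts
FROM stg_sales;

COMMIT;
```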

I'm a bit late to the party and no doubt you've resolved this, but it might still be relevant, so I thought I'd share my opinion. One solution is to use S3 and Hive as a Persistent Staging Area (a Data Lake, if you will) to land the data from sources, and construct your DV entirely in Redshift. You will still need a staging area in Redshift in order to ingest the files from S3, making sure the hashes are calculated on the way into the Redshift staging tables (that's where EMR/Hive comes in). You could add the hashes directly in Redshift, but that could put Redshift under duress depending on volume. Push data from staging into the DV via plain old bulk INSERT and UPDATE statements, and then virtualise your marts in Redshift using views.
You could use any data pipeline tool to achieve this; Lambda could be a candidate, as could another workflow/pipeline tool.
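A minimal sketch of the staging-to-vault-to-views flow described above, assuming the hashes arrive pre-computed in the S3 files; the table names, bucket, and IAM role are placeholders for illustration:

```sql
-- Load the Redshift staging table from S3; hashes are assumed to have been
-- calculated upstream (e.g. in EMR/Hive) and included in the files.
COPY stg_customer (customer_hk, customer_bk, customer_name, load_dts, record_source)
FROM 's3://my-bucket/staging/customer/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS CSV
GZIP;

-- Plain bulk insert of new business keys into the hub.
INSERT INTO hub_customer (customer_hk, customer_bk, load_dts, record_source)
SELECT s.customer_hk, s.customer_bk, s.load_dts, s.record_source
FROM stg_customer s
LEFT JOIN hub_customer h ON h.customer_hk = s.customer_hk
WHERE h.customer_hk IS NULL;

-- Virtualise a mart over the vault with a view (sat_customer assumed to exist).
CREATE OR REPLACE VIEW mart_dim_customer AS
SELECT h.customer_bk, sat.customer_name
FROM hub_customer h
JOIN sat_customer sat ON sat.customer_hk = h.customer_hk;
```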

I strongly recommend you check out Matillion for Redshift: https://redshiftsupport.matillion.com/customer/en/portal/articles/2775397-building-a-data-vault
It's fantastic and affordable for Redshift ETL and has a Data Vault sample project.

I recommend reading this article and considering the design explained there in detail if AWS is your intended tech stack.
I don't want to copy-paste the article here, but it is really a recipe for how to implement a Data Vault, and I believe it addresses your requirements.

S3 is just a key-value store for files; you can't create a DV or DW there. So you can use Redshift or EMR to process the data into the relational format for your DV. Which one you choose is up to you; EMR has specific use cases, IMO.

Related

Data pipeline from SageMaker to Redshift

I wanted to check with the community here whether anyone has explored the option of a pipeline from SageMaker to Redshift directly.
I want to load the predicted data from SageMaker into a table in Redshift. I was planning to do it via S3, but was wondering if there are better ways to do this.
I think your idea to stage data in S3, if acceptable in your specific use-case, is a good baseline design:
SageMaker smoothly connects to S3 (via Batch Transform or Processing job)
Redshift COPY statements are best practice for efficient loading of data, and can be done from S3 ("COPY loads large amounts of data much more efficiently than using INSERT statements, and stores the data more effectively as well." - Redshift documentation)
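For example, if a Batch Transform job writes CSV output to S3, the Redshift side can be a single COPY; the bucket, IAM role, and table/column names below are placeholders:

```sql
-- Hypothetical load of SageMaker batch output from S3 into a Redshift table.
COPY predictions (record_id, predicted_label, score)
FROM 's3://my-bucket/sagemaker/batch-output/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS CSV;
```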

What's the best way to read/write from/to Redshift with Scala spark since spark-redshift lib is not supported publicly by Databricks

I have a Spark project in Scala and I want to use Redshift as my data warehouse. I found that the spark-redshift repo exists, but Databricks made it private a couple of years ago and no longer supports it publicly.
What's the best option right now for working with Amazon Redshift from Spark (Scala)?
This is a partial answer as I have only been using Spark->Redshift in a real world use-case and never benchmarked Spark read from Redshift performance.
When it comes to writing from Spark to Redshift, by far the most performant way that I could find was to write Parquet to S3 and then use Redshift COPY to load the data. Writing to Redshift through JDBC also works, but it is several orders of magnitude slower than the former method. Other storage formats could be tried as well, but I would be surprised if any row-oriented format could beat Parquet, since Redshift internally stores data in a columnar format. Another columnar format that is supported by both Spark and Redshift is ORC.
I never came across a use case of reading large amounts of data from Redshift using Spark, as it feels more natural to load all the data into Redshift and use it for joins and aggregations. It is probably not cost-efficient to use Redshift just as bulk storage and use another engine for joins and aggregations. For reading small amounts of data, JDBC works fine. For large reads, my best guess is the UNLOAD command and S3.
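To make both directions concrete: once Spark has written Parquet files to S3, Redshift can ingest them with COPY, and for large reads the data can be pushed back to S3 with UNLOAD and read by Spark from there. All object names, paths, and the IAM role below are assumptions:

```sql
-- Write path: load Parquet files produced by Spark into a Redshift table.
COPY fact_events
FROM 's3://my-bucket/spark-output/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET;

-- Read path: export a large result set to S3 as Parquet for Spark to read.
UNLOAD ('SELECT * FROM fact_events WHERE event_date >= ''2020-01-01''')
TO 's3://my-bucket/unload/events_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET;
```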

How do I efficiently migrate the BigQuery Tables to On-Prem Postgres?

I need to migrate the tables from the BigQuery to the on-prem Postgres database.
How can I efficiently achieve that?
Some thoughts that come to mind:
I will use Google APIs to export the data from the tables
Store it locally
And finally, import to Postgres
But I am not sure if that can be done for a huge amount of data in TBs. Also, how can I automate this process? Can I use Jenkins for that?
Exporting the data from BigQuery, storing it, and importing it into PostgreSQL is a good approach. Here are two other alternatives that you can consider:
1) There's a PostgreSQL wrapper for BigQuery that allows you to query BigQuery directly. Depending on your scenario this might be the easiest way to transfer the data, although for TBs it might not be the best approach. This suggestion was made by David in this SO question.
2) Use Dataflow. You can create an ETL process using Apache Beam to make the transfer. Take a look at this how-to for transferring data from BigQuery to Cloud SQL. You would need to adapt it for a local PostgreSQL instance, but the idea remains the same.
Here's another SO answer that gives more context on this approach.
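If you go the export-and-import route, the Postgres side of each chunk can be a plain COPY; the table, columns, and file path below are assumptions for illustration:

```sql
-- Hypothetical import of one exported CSV chunk into Postgres.
-- The target table must already exist with a matching schema; use psql's
-- \copy instead if the file lives on the client rather than the server.
COPY my_table (id, name, created_at)
FROM '/data/bigquery_export/my_table_000000000000.csv'
WITH (FORMAT csv, HEADER true);
```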

How do I use Redshift Database for Transformation and Reporting?

I have 3 tables in my Redshift database, and data is coming in from 3 different CSV files in S3 every few seconds. One table has ~3 billion records and the other two have ~100 million records each. For near-real-time reporting purposes, I have to merge these tables into 1 table. How do I achieve this in Redshift?
Near Real Time Data Loads in Amazon Redshift
I would say that the first step is to consider whether Redshift is the best platform for the workload you are considering. Redshift is not an optimal platform for streaming data.
Redshift's architecture is better suited for batch inserts than streaming inserts. "COMMIT"s are "costly" in Redshift.
You need to consider the performance impact of VACUUM and ANALYZE if those operations are going to compete for resources with streaming data.
It might still make sense to use Redshift on your project depending on the entire set of requirements and workload, but bear in mind that in order to use Redshift you are going to have to engineer around it, and probably change your workload from "near-real-time" to a micro-batch architecture.
This blog post details all the recommendations for micro-batch loads in Redshift. Read the micro-batch article here.
To summarize:
Break input files --- Break your load files into several smaller files that are a multiple of the number of slices.
Column encoding --- Have column encoding pre-defined in your DDL.
COPY settings --- Ensure COPY does not attempt to evaluate the best encoding for each load.
Load in SORT key order --- If possible, your input files should have the same "natural order" as your sort key.
Staging tables --- Use multiple staging tables and load them in parallel.
Multiple time series tables --- This is a documented approach for dealing with time series in Redshift.
ELT --- Do transformations in-database using SQL to load into the main fact table (see the sketch after this list).
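A rough sketch of one micro-batch cycle along those lines, with all object names, the S3 path, and the IAM role assumed for illustration:

```sql
-- Load a micro-batch into a staging table. COMPUPDATE OFF and STATUPDATE OFF
-- stop COPY from re-evaluating encodings and statistics on every small load.
COPY stg_batch
FROM 's3://my-bucket/incoming/batch-0001/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS CSV
GZIP
COMPUPDATE OFF
STATUPDATE OFF;

-- ELT step: transform in-database and append to the main fact table.
INSERT INTO fact_reporting (event_id, event_ts, user_id, amount)
SELECT event_id, event_ts, user_id, amount
FROM stg_batch
WHERE event_ts IS NOT NULL;

-- Clear the staging table for the next micro-batch
-- (note: TRUNCATE commits implicitly in Redshift).
TRUNCATE stg_batch;
```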
Of course all the recommendations for data loading in Redshift still apply. Look at this article here.
Last but not least, enable Workload Management to ensure the online queries can access the proper amount of resources. Here is an article on how to do it.

MongoDB to DynamoDB

I have a database currently in MongoDB running on an EC2 instance and would like to migrate the data to DynamoDB. Is this possible, and what is the most cost-effective way to achieve it?
When you ask for a "cost effective way" to migrate data, I assume you are looking for existing technologies that can ease your life. If so, you could do the following:
Export your MongoDB data to a text file, say in tsv format, using mongoexport.
Upload that file somewhere in S3.
Import this data, in S3, to DynamoDB using AWS Data Pipeline.
Of course, you should design & finalize your DynamoDB table schema before doing all this.
Whenever you are changing databases, you have to be very careful about the way you migrate data. Certain data formats maintain type consistency, while others do not.
Then there are data formats that simply cannot handle your schema. For example, CSV is great at handling data when it is one row per entry, but how do you render an embedded array in CSV? It really isn't possible. JSON is good at this, but JSON has its own problems.
The easiest example of this is JSON and DateTime. JSON does not have a specification for storing DateTime values; they can end up as ISO 8601 dates, or perhaps UNIX epoch timestamps, or really anything a developer can dream up. What about longs, doubles, and ints? JSON doesn't discriminate between them; it represents them all as generic numbers, which can cause loss of precision if they are not deserialized correctly.
This makes it very important that you choose the appropriate translation medium. That generally means you have to roll your own solution: load the drivers for both databases, read an entry from one, translate it, and write it to the other. This is the best way to be absolutely sure that errors are handled properly for your environment, that types are kept consistent, and that the code properly translates the schema from source to destination (if necessary).
What does this all mean for you? It means a lot of legwork. It is possible somebody has already rolled something broad enough for your case, but I have found in the past that it is best to do it yourself.
I know this post is old, but Amazon has since made this possible with AWS DMS; check this document:
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
Some relevant parts:
Using an Amazon DynamoDB Database as a Target for AWS Database Migration Service
You can use AWS DMS to migrate data to an Amazon DynamoDB table. Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. AWS DMS supports using a relational database or MongoDB as a source.