Handle Changes in multiple tables used in creation of dimension - amazon-redshift

I am working on a data warehousing project and need help with the following.
OLAP Table:
Product Dimension Table:
Product_id, category_id, category_name, brand_id, brand_name, manufacturer_id, manufacturer_name
OLTP Tables:
Each OLTP table contains create_ts and update_ts columns for tracking creation and updates.
Product_info: id, product_name, category_id, brand_id, manufacturer, create_ts, update_ts
Product_category_mapping: id, product_id, category_id, create_ts, update_ts
brand: id, name, create_ts, update_ts
manufacturer: id, name, create_ts, update_ts
I am looking to track changes in any of these tables so that they are reflected in the dimension table.
For Example:
Current OLAP Snapshot
Product_id, category_id, category_name, brand_id, brand_name, manufacturer_id, manufacturer_name
1, 33, Noodles, 45, Nestle, 455, nestle_pvt_ltd
Suppose the brand name changes from Nestle to Nestle-US. How will we track this if we are capturing changes based only on the product_info update_ts?
Should we consider changes in all four tables?
Please suggest.

If data changes in any table that is a source for your DW, then you need to include it in your extract logic.
For reference data like this, where a number of tables contribute to a single "target" table, an approach I often take is to create a view across those tables in your source DB. Include all the columns you need to take across to the DW, but only a single update_ts column, calculated with the SQL GREATEST function, passing in the update_ts columns from all the tables in the view. Then you only need to compare this single column to your "last extracted date" to determine whether there are any changes you may need to process.
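A minimal sketch of such a view, using the table names from the question (the join conditions are assumptions, and category_name is omitted because the question's schema shows no source table for it):

CREATE VIEW product_dim_source AS
SELECT
    p.id              AS product_id,
    pcm.category_id   AS category_id,
    b.id              AS brand_id,
    b.name            AS brand_name,
    m.id              AS manufacturer_id,
    m.name            AS manufacturer_name,
    -- single change-tracking column: the latest update across all contributing tables
    GREATEST(p.update_ts, pcm.update_ts, b.update_ts, m.update_ts) AS update_ts
FROM product_info p
JOIN product_category_mapping pcm ON pcm.product_id = p.id
JOIN brand        b ON b.id = p.brand_id
JOIN manufacturer m ON m.id = p.manufacturer;

The extract then only has to compare that one column, e.g. SELECT * FROM product_dim_source WHERE update_ts > :last_extract_ts.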

Related

Redshift CDC or delta load

Does anyone know the best way to load deltas / CDC, with or without using any tools?
I have a big table with billions of records and want to update or insert, like MERGE in SQL Server or Oracle, but in Amazon Redshift / S3.
Also, we have loads of columns, so we can't compare all the columns either.
e.g.
TableA: Col1, Col2, Col3, ...
Say it already contains records.
So when loading new records, I need to check whether each record already exists: if it exists and is unchanged, skip it; if it does not exist, insert it; and if it exists but has changed, update it.
I do have key id and date columns, but since the table has 200+ columns it is not easy to compare all of them, and it takes a lot of time.
Many thanks in advance
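For reference, a common way to get MERGE-like behaviour in Redshift is the staging-table pattern. This is only a sketch of the behaviour described above; it assumes the delta has already been COPYed from S3 into a staging table called tablea_staging and that id is the key (both names are placeholders):

BEGIN;

-- Remove target rows that also appear in the staging (delta) set, so they can
-- be re-inserted with their new values. This avoids comparing 200+ columns.
DELETE FROM tablea
USING tablea_staging s
WHERE tablea.id = s.id;

-- Insert every staged row: brand-new records plus the refreshed versions
-- of the records deleted above.
INSERT INTO tablea
SELECT * FROM tablea_staging;

COMMIT;

Newer Redshift releases also provide a native MERGE statement, which can express the same delete/insert logic in a single command.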

PostgreSQL: Table that points to other Tables?

Just getting started in PostgreSQL and wanted to ask some questions.
Suppose that I have a table of Vendors. Each Vendor has an attribute called Sales Record, which is time series data about their sales. For each Vendor, I want there to be one associated Sales Record table that holds the time series sales data for that specific vendor.
How might I want to code that?
You shouldn't have a table per vendor.
Rather, create one big table for all. The table contains a column like vendor_id that is a foreign key to vendors and identifies to which vendor a record belongs.
If you create an index on vendor_id, searching the big table for the data of a vendor will be efficient.
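A minimal sketch of that design (table and column names are illustrative):

-- One table holds all vendors, one table holds all sales records.
CREATE TABLE vendors (
    vendor_id bigserial PRIMARY KEY,
    name      text NOT NULL
);

CREATE TABLE sales_records (
    vendor_id bigint      NOT NULL REFERENCES vendors (vendor_id),
    sale_ts   timestamptz NOT NULL,
    amount    numeric     NOT NULL
);

-- Makes "all sales for one vendor" lookups efficient.
CREATE INDEX sales_records_vendor_id_idx ON sales_records (vendor_id);

A per-vendor query is then simply: SELECT * FROM sales_records WHERE vendor_id = 42 ORDER BY sale_ts;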

Use of Postgres table inheritance

Since Postgres also supports partitioned tables, what is the use of child tables (table inheritance)?
Suppose there is a table of users which has a column created_date. We can store data in 2 ways:
We create many child tables of this user table and distribute the data of users on the basis of created_date (say, one table for every date, like user_jan01_21).
We can create a partitioned table with the partitioning key created_date.
What then is the difference between these two solutions?
Basically, I want to know what problem table inheritance can solve that partitioning cannot.
Another doubt I have: if I follow solution 1, and I query the user table without the ONLY keyword, will it scan all the child tables?
For example:
SELECT * FROM users WHERE created_date = current_date - 10;
If the objective is partitioning, as in your example, then there is no advantage in using table inheritance. Declarative partitioning is far superior in ease of use, performance and available features.
Table inheritance has uses that are unrelated to partitioning. Features that partitioning doesn't offer are:
the child table can have additional columns
a table can inherit from more than one table
With table inheritance, if you select from the parent table, you will also get all results from the child tables, just as if you had used UNION ALL to combine the results.
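A small illustration of those points (the table names are made up):

-- A child table that inherits all columns of the parent and adds its own.
CREATE TABLE vehicles (
    id   bigint PRIMARY KEY,
    make text NOT NULL
);

CREATE TABLE electric_vehicles (
    battery_kwh numeric NOT NULL   -- extra column only the child has
) INHERITS (vehicles);

-- Selecting from the parent also returns the rows stored in the child,
-- as if the two tables had been combined with UNION ALL.
SELECT id, make FROM vehicles;

-- ONLY restricts the query to rows stored directly in the parent.
SELECT id, make FROM ONLY vehicles;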

Creating a temp table inside a PostgreSQL function creates conflicts between different function calls?

I want to use a temporary table (let's call it temp_tbl) created in a PostgreSQL function in order to SELECT into it just some rows and columns from a table (let's call it tbl).
One of the columns that both temp_tbl and tbl have is order_date of type DATE, and the function also takes a start_date DATE argument. I want to SELECT into temp_tbl just the rows from tbl that have an order_date later than the function's start_date.
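Since the original function body is not included in the question, here is a minimal sketch of what it might look like (the function name and the extra column are assumptions):

CREATE OR REPLACE FUNCTION load_recent_orders(start_date date)
RETURNS void
LANGUAGE plpgsql
AS $$
BEGIN
    -- The temporary table is visible only to the current session and is
    -- dropped automatically when the transaction commits.
    CREATE TEMP TABLE temp_tbl ON COMMIT DROP AS
    SELECT order_date, some_other_column   -- illustrative column list
    FROM tbl
    WHERE order_date > start_date;
END;
$$;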
My question is: if this function gets called concurrently 2 or more times in the same session, won't the calls use the same instance of the temporary table temp_tbl?
More specifically, when using psycopg2 in the backend of a webserver, different clients of the webserver might require calling our function at the same time. Will this generate a conflict over the temp_tbl temporary table declared inside the function?
EDIT: my actual context
I'm building (for education purposes) an online shop. I have 3 tables for 3 kinds of products that all use a common sequence for their ids. I have another table for orders that includes a column which is an array of product ids and a column which is an array of quantities (associated with the product ids of the ordered products).
I want to return a table of common product details (columns common to all 3 tables like id, name, price etc) and their associated number of sales.
My current method is to concatenate all the arrays of ids and quantities from all order entries, then create a temporary table out of the 2 arrays and sum the number of orders for each product id so I have a table with one entry for each ordered product.
Then, I create 3 temporary tables in order to join each product table with the temporary product orders figures table and SELECT only the columns that are common to all 3 tables.
Finally, I UNION the 3 temporary tables.
This is kind of complicated for me, so I think that maybe there were better design decisions I could have made?
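For reference, a condensed sketch of the array-aggregation step described above (the orders table and its array column names are guesses, not taken from actual code):

-- Expand the parallel id/quantity arrays of every order into rows,
-- then total the quantity per product id.
SELECT t.product_id,
       SUM(t.quantity) AS total_ordered
FROM orders o
CROSS JOIN LATERAL unnest(o.product_ids, o.quantities) AS t(product_id, quantity)
GROUP BY t.product_id;

Done this way, the per-product totals come straight from one query, without building an intermediate temporary table from the concatenated arrays.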

Postgres table partitioning with star schema

I have a schema with one table with the majority of data, customer, and three other tables with foreign key references to customer.entry_id which is a BIGSERIAL field. The three other tables are called location, devices and urls where we store various data related to a specific entry in the customer table.
I want to partition the customer table into monthly child tables, and have that part worked out; customer will stay as-is, each month will have a table customer_YYYY_MM that inherits from the master table with the right CHECK constraint and indexes will be created on each individual child table. Data will be moved to the correct child tables while the master table stays empty.
My question is about the other three tables, as I want to partition them as well. However, they have no date information (at all), only the reference to the primary key from the master table. How can I setup the constraints on these tables? Is it even meaningful or possible without date information?
My application logic knows where to insert all the data (it's fairly trivial), but I expect to be able to do simple SELECT queries without specifying which child tables to get it from. So this should work as you would expect from non-partitioned tables:
SELECT l.*
FROM customer c
JOIN location l USING (entry_id)
WHERE c.date_field > '2015-01-01'
I would partition them by the reference key. The foreign key is used in join conditions and is not usually subject to change so it fulfills the following important points:
Partition by the information that is used mostly in the WHERE clauses of the queries or other parts where partitioning can be used to filter out tables that don't need to be scanned. As one guide puts it:
The objective when defining partitions should be to allow as many queries as possible to fetch data from as few partitions as possible - ideally one.
Partition by information that is not going to change, so that rows don't constantly need to be moved from one subtable to another.
This all depends on the size of the tables too, of course. If the sizes stay small then there is no need to partition.
Read more about partitioning in the PostgreSQL documentation.
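A sketch of what that could look like for one of the detail tables, using the same inheritance-style partitioning the question already uses for customer (the entry_id range boundaries here are made up):

-- Child tables for ranges of customer entry_ids; the CHECK constraints let
-- constraint exclusion skip child tables whose range cannot match the query.
CREATE TABLE location_p1 (
    CHECK (entry_id >= 1 AND entry_id < 1000000)
) INHERITS (location);

CREATE TABLE location_p2 (
    CHECK (entry_id >= 1000000 AND entry_id < 2000000)
) INHERITS (location);

CREATE INDEX location_p1_entry_id_idx ON location_p1 (entry_id);
CREATE INDEX location_p2_entry_id_idx ON location_p2 (entry_id);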
Use views:
create view customer as
select * from customer_jan_15 union all
select * from customer_feb_15 union all
select * from customer_mar_15;
create view location as
select * from location_jan_15 union all
select * from location_feb_15 union all
select * from location_mar_15;