How to use the curated zone in Azure Data Lake Storage in relation to a data warehouse - azure-data-lake-gen2

I'm storing data in ADLS zones (Raw > Staging > Curated) which will be fed to a data warehouse. Think, for example, of customer data from a CRM application:
Raw\CRM\Customer\2022\03\05\raw_crm_customer_2022_03_05.csv containing entries:
1, John Smith, 100
2, Mario Castillo, 200
Raw\CRM\Customer\2022\03\06\raw_crm_customer_2022_03_06.csv containing entries:
2, Mario Castillo, 300 // record has been modified
3, Mary Tyler, 500 // new record on 6 March
So what should I store in the curated zone? What is the best naming convention?
Do I show an aggregation with the latest modified version of each record, or do I show historical records as well? For example, using aggregation on the last modified record:
1, John Smith, 100
2, Mario Castillo, 300
3, Mary Tyler, 500
And what is the naming convention? Is it Curated\CRM\Customer?
My plan is to load historical data into the data warehouse from the curated zone.
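(For illustration only, a minimal SQL sketch of the "latest version of each record" view described above; the table and column names are assumptions, not anything prescribed by ADLS:)

-- Sketch: keep only the most recent version of each customer record,
-- assuming all daily raw files are readable as one table raw_crm_customer
-- with a file_date column derived from the folder path.
SELECT customer_id, customer_name, amount
FROM (
    SELECT customer_id, customer_name, amount,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY file_date DESC) AS rn
    FROM raw_crm_customer
) t
WHERE rn = 1;

Keeping history instead would mean appending each version with its file_date rather than filtering down to rn = 1.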

Related

PostgreSQL Table size and partition consideration

I am working on a use case where the initial data load for a name table in a PostgreSQL DB will be around 650 million rows with an average row size of 0.6 KB, bringing the table size up to about 400 GB. After that there could be up to 20,000 inserts or updates on a daily basis.
I am new to PostgreSQL and want to check whether I should consider partitioning given the table size.
Updating some information from the comments section:
It is an OLTP application for identity resolution of business names. This is one specific table where all the business names are stored along with metadata such as Start Date and End Date, and any incoming name is matched against existing names to identify whether it is related to another business. This table is updated throughout the day using batch files from different data sources.
Also, we are not planning to expire or remove any data from this table.
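(Purely as an illustration of what PostgreSQL declarative partitioning looks like; the table and column names below are assumptions, and whether partitioning helps at all depends on the query and update patterns:)

-- Hypothetical hash-partitioned layout for the name table (PostgreSQL 11+).
CREATE TABLE business_name (
    name_id    bigint NOT NULL,
    name       text   NOT NULL,
    start_date date,
    end_date   date
) PARTITION BY HASH (name_id);

CREATE TABLE business_name_p0 PARTITION OF business_name
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- ...repeat for REMAINDER 1 through 3.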

How to create a "bucket" variable in SAS from ranges given in another table

I am trying to create a bucket variable in SAS that will split transactions into various buckets. However, depending on the retailer where the transactions occurred, the buckets have different lengths and end points. For example, Bucket 1 for Retailer 1 is from June 2017 to July 2018, while for Retailer 2 it is from January 2018 to November 2018. The retailers, bucket labels, and end points for the buckets are stored in an Excel file which I have imported successfully. The transactions are stored in a separate table with retailer information and a "date incurred" column. I am struggling to create a bucket variable in the transactions table. Does SAS allow for conditional logic when merging, like "if the transaction date is between these two dates assign this bucket value"? Is merging even the best way to add the bucket info to the transactions table?
Thank you so much for your help - this is my first ever Stack Overflow question, and I am teaching myself SAS for the first time. Please let me know what other information I can provide to make answering this question easier!
Add the condition to the SQL join, for example (in PROC SQL; the table and column names are placeholders):
proc sql;
  create table bucketed as
  select a.*, b.Bucket
  from transactions a
  left join buckets b
    on a.Retailer = b.Retailer
   and a.TransactionDate between b.BucketStart and b.BucketEnd;
quit;

How to Handle Rows that Change over Time in Druid

I'm wondering how we could handle data that changes over time in Druid. I realize that Druid is built for streaming data where we wouldn't expect a particular row to have data elements change. However, I'm working on a project where we want to stream transactional data from a logistics management system, but there's a calculation that happens in that system that can change for a particular transaction based on other transactions. What I mean:
- 9th of the month: I post transaction A with a date of today (the 9th), which results in the stock on hand coming to 0 units.
- 10th of the month: I post transaction B with a date of the 1st of the month, crediting my stock amount by 10 units. At this point (on the 10th of the month) the stock on hand for transaction A recalculates to 10 units. The same would be true for ALL transactions after the 1st of the month.
As I understand it, we would re-extract transaction A, resulting in transaction A2.
The stock on hand dimension is incredibly important to our metrics, specifically identifying when stockouts occur (when stock on hand = 0). In the above example, if I have two rows for transaction A, I would mistakenly identify a stockout from transaction A1, whereas transaction A2 is the source of truth.
Is there any ability to archive a row and replace it with an updated row, or do we need to add logic to our queries that finds the rows with the freshest timestamp per transaction id?
Thanks
I have two thoughts that I hope help you. The key documentation for this is "Updating Existing Data": http://druid.io/docs/latest/ingestion/update-existing-data.html, which gives you three options: Lookup Tables, Reindexing, and Delta Ingestion. The last one, Delta Ingestion, is only for adding new rows to old segments, so that's not very useful for you; let's go over the other two.
Reindexing: You can crunch all the numbers that change in your ETL process, identify the segments that would need to be reloaded, and simply have Druid re-index those segments. That will replace the stock-on-hand value for A in your example whenever you want, i.e. whenever you do the re-indexing.
Lookups: If you have stock values for multiple products, you can store the product id in the segment and have that be immutable, but lookup the stock-on-hand value in a lookup. So, you would store:
A, 2018-01-01, product-id: 123
And in your lookup, you'd have:
product-id: 123, stock-on-hand: 0
And later, you'd update the lookup and change that to 10. This would update any rows that reference product-id: 123.
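As a rough sketch of how that lookup could then be referenced at query time with Druid SQL's LOOKUP function (the datasource, column, and lookup names here are assumptions):

-- Hypothetical Druid SQL: the segment stores only the immutable product id;
-- the mutable stock-on-hand value is resolved from the lookup at query time.
SELECT
  transaction_id,
  __time,
  LOOKUP(product_id, 'stock_on_hand') AS stock_on_hand
FROM transactions
WHERE LOOKUP(product_id, 'stock_on_hand') = '0'   -- candidate stockouts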
I can't be sure, but you may be mixing up dimensions and metrics while you're doing this, and you may need to read over that terminology in OLAP descriptions like this: https://en.wikipedia.org/wiki/Online_analytical_processing
Good luck!

Create a filter from multiple data sources

I have multiple data sources that have city, state and country information.
Example -
Source 1:
ID    | City      | State     | Country
12345 | New York  | New York  | USA
12344 | Cebu City |           | PHL
12232 | Bengaluru | Karnataka | IND
Source 2:
ID    | City   | State | Country
12345 | Dallas | Texas | USA
12344 | London |       | UK
I would like to create a filter to show a drill-down option into country, state and city using both databases. I cannot combine the sources due to a few sourcing issues. I'm not sure if a set can be created, like a union of the columns from the data sources, to show a filter that will have values from all the data sources.
Like:
Country = USA, PHL, IND, UK, and then filter state and city, and so on. Can someone please advise how I can achieve this?
The city, state and country columns are dynamic in my data sources
A quote from Tableau's online help:
You can union your data to combine two or more tables by appending values (rows) from one table to another. To union your data in Tableau data source, the tables must come from the same connection.
If your data source supports union, the New Union option displays in the left pane of the data source page after you connect to your data.
It's pretty easy to do inside Tableau Desktop.
https://onlinehelp.tableau.com/current/pro/desktop/en-us/union.html

Pentaho Spoon transformation from Excel file

I have yearly data in my Excel file in the following format:
Country \ Years | 1980 | 1981 | ... | 2010
Abkhazia        | 234  | 334  | ... | 456
Afghanistan     | 466  | 789  | ... | 732
...
And I want to transform my data into 3 different tables and load them into a Postgres database.
The tables should look something like this:
First table - country:
id | name
1 | Abkhazia
2 | Afghanistan
Second table - dates:
id | date
1 | 1980
2 | 1981
And the third is a table where all the data is stored, keyed by country and date:
country_id | date_id | data
1          | 1       | 234
1          | 2       | 334
2          | 1       | 466
2          | 2       | 789
...        | ...     | ...
Any ideas how I could achieve my goal?
Assuming the source Excel structure is as described in the question (I custom-built a sample of it):
There are basically 3 parts to your question. I'll break down the transformation into parts for better understanding:
1. Loading Table - Country
This is pretty straightforward based on the data given in the Excel. Simply take an Excel Input step >> add an Add sequence step and name the sequence Country ID >> select only the Country Name and Country ID >> load into the country table using Table Output.
2. Loading Table - Year:
The idea here is to get the Year ID in row-wise format instead of the column layout given in the Excel source data. PDI version 5 and above provides a very useful step called Metadata Structure. This step allows you to get the structure of your table. In this case, we need the year columns pulled out, ignoring the country column.
Follow the steps below:
Read the Excel data >> get the Metadata structure of your source >> filter out the Country column (which is available in the row at position=1) >> add a sequence number and name it YearID >> finally load the Year table.
3. Loading the Final Table - Country and Year along with Data:
The way to bring all the column data values down to row level in PDI is to use the Row Normalizer step. Use this step to produce a normalized output. Now follow the steps below:
Read the Excel source data >> use the Row Normalizer step to normalize the rows based on the years >> do a Stream Lookup against the country and year tables above to fetch the CountryID and YearID respectively >> finally load the necessary columns into the table using Table Output.
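For intuition only, the Row Normalizer output is roughly what a plain SQL unpivot of the source would produce (the table and quoted column names are assumptions, not PDI output):

-- Conceptual SQL equivalent of the Row Normalizer step.
SELECT country, 1980 AS year, "1980" AS data FROM source_excel
UNION ALL
SELECT country, 1981 AS year, "1981" AS data FROM source_excel;
-- ...and so on, one SELECT per year column up to 2010.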
Hope it helps :)
I have placed the code in a GitHub repo along with the data file I used. It's here.
Also, I just realized that I have used different naming conventions from your question. Consider date_id as YearID, and instead of plain ids I have used countryid and yearid.
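For completeness, a rough PostgreSQL sketch of the three target tables from the question (column types and the fact table name are assumptions):

-- Target schema sketch, using the names from the question.
CREATE TABLE country (
    id   serial PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE dates (
    id   serial PRIMARY KEY,
    date integer NOT NULL  -- the year, e.g. 1980
);

CREATE TABLE country_data (
    country_id integer REFERENCES country (id),
    date_id    integer REFERENCES dates (id),
    data       integer
);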