I am new to data warehousing, and I want to ask: since we copy the foreign keys into the fact table, why do we still use dimension tables at all? Isn't all the data already present in the fact table? Can someone please guide me.
Short answer: a typical dimension has more attributes than just a key. Your fact table holds a foreign key to the dimension, and it is in the dimension that the additional information lives and where grouping becomes possible.
Recommended reading: "The Data Warehouse Toolkit" by Ralph Kimball
A fact table should only store 1) the business metric that it models (e.g. a sales order/transaction, or some other business transaction that you are measuring); 2) foreign keys to the related dimensions.
A dimension table should only store the context/qualitative data that is necessary to understand your business transactions (your facts).
Let's say, for example, that you are modelling sales in retail stores; a very simplified dimensional model for this would look something like:
Store Dimension: name, street address, city, county, etc
Product Dimension: name, brand, description, sku, etc
Date Dimension: year, month, day, etc
Sales Fact table: fkStore, fkProduct, fkDate, unitsSold, salesAmount
So, the fact table only holds the measures and the foreign keys, but business users need the information stored in the dimensions to be able to explore the facts. That's how you enable them to explore unitsSold or salesAmount for a specific product, a specific store/location, or a specific date.
The fact table by itself only provides quantitative data ("amount sold"), while the dimensions provide the context a business user needs to interpret that metric ("amount of product X sold in store Y in 2017").
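To make that concrete, here is a minimal SQL sketch of the kind of query a BI tool would run against this model. The surrogate key columns (storeKey, productKey, dateKey) and table names are assumptions for illustration; only the fact columns above come from the model as described.

SELECT d.year,
       s.city,
       p.brand,
       SUM(f.unitsSold)   AS total_units,
       SUM(f.salesAmount) AS total_sales
FROM   SalesFact f
JOIN   StoreDimension   s ON s.storeKey   = f.fkStore   -- store attributes: name, city, ...
JOIN   ProductDimension p ON p.productKey = f.fkProduct -- product attributes: name, brand, ...
JOIN   DateDimension    d ON d.dateKey    = f.fkDate    -- calendar attributes: year, month, ...
WHERE  d.year = 2017
GROUP  BY d.year, s.city, p.brand;

Without the dimension tables, the fact table would only give you keys and numbers; the city, brand and year you group by here live in the dimensions.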
The decision on what falls into dimension versus fact data is not clear cut in many cases. Typically, data that is reusable (i.e. meaningful in relation to other fact data) can be considered dimension data.
A lot of the time, fact data is the most changeable over time; fact tables contain the history of these records as they change,
like daily sales numbers, nightly end-of-day results, etc.
These are usually numeric, i.e. quantitative measures. Data warehouse analysis then consists of bucketing (summing/grouping) these numbers so they tell the story of a trend over time at varying levels of granularity.
Dimension data, by contrast, is more static in nature, like trade, customer, or product details.
I recommend reading:
https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/
I'm facing the challenge of generating 'balance' values from thousands of entries in a PG table.
The rows in the table have many different columns, each useful in calculating that row's contribution to the balance. Each row/entry belongs to some profile. I need to calculate the balance value for a given profile from all entries belonging to that profile, according to some set of rules. Complexity should be O(N), N being the number of entries that belong to the profile.
The different approaches I took:
Fetching the rows and calculating the balances on the backend. This doesn't scale well and degrades quickly with the number of entries that belong to the profile. While fetching the entries is initially fast, once a profile has over 10,000 entries it becomes prohibitively slow.
I figured that a lot of time was being spent on transport, and besides, we don't really need the rows, only the balances. Since we already do the work of finding the entries, we can calculate the balance there and save both the backend calculations and the transport of thousands of rows, which led to the second approach:
The second approach was creating a PG query that iterates over the rows and calculates the balances. This has proven more scalable when there are many entries per profile. However, probably due to the complexity of the query, this approach puts a lot of load on the database: it's enough to run 3-4 of these queries concurrently to max out the database CPU.
The third approach is to create a PL/pgSQL function that loops over the relevant entries and returns the resulting balance rows, hoping to reduce the impact on the database. This is the next thing I want to try.
Main question is - what would be the most efficient way to achieve this while being 'database friendly'?
Additionally:
Do these approaches seem sane to you?
Am I missing another obvious solution?
Is it unlikely that I'll improve on the performance of the query with a function looping over the same rows as the query, or is it worth trying?
I realize I haven't provided a lot of concrete data, but I figured that since this is probably a common problem, maybe the issue can be understood from a general description.
EDIT:
To be a bit more specific, I'm dealing with the following data:
CREATE TABLE entries (
profileid bigint NOT NULL,
programid bigint NOT NULL,
ledgerid text NOT NULL, -- this provides further granularity, on top of 'programid'
startdate timestamptz,
enddate timestamptz,
amount numeric NOT NULL
);
What I want to get is the balances for a certain profileid, separated by (programid, ledgerid).
The desired form is:
RETURNS TABLE (
programid bigint,
ledgerid text,
currentbalance numeric,
pendingbalance numeric,
expiredbalance numeric,
spentbalance numeric
)
The four balance values are produced by applying arithmetic to certain entries. For example, a negative amount only adds to spentbalance, the expired balance is generated from entries that have a positive amount and an enddate after now(), etc.
While I did manage to create a very large aggregate query with many calls to COALESCE(SUM(CASE WHEN ... amount), 0), I was wondering whether I'd gain anything from porting that logic into a PL/pgSQL function. However, when trying to implement this function, I realized I don't know how to iterate over the result of one query and return a result set with different columns and rows. Should I use a temp table for this? That seems like overkill, as this query is expected to execute tens of times every second...
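For what it's worth, PostgreSQL's FILTER clause expresses this kind of conditional aggregation a bit more readably than nested CASE WHEN. A minimal sketch follows; the filter conditions are placeholders invented for illustration, so substitute the real business rules described above:

SELECT programid,
       ledgerid,
       -- placeholder rules only: active, not-yet-started, ended, and negative entries
       COALESCE(SUM(amount)  FILTER (WHERE amount > 0 AND startdate <= now() AND enddate >= now()), 0) AS currentbalance,
       COALESCE(SUM(amount)  FILTER (WHERE amount > 0 AND startdate > now()), 0)                       AS pendingbalance,
       COALESCE(SUM(amount)  FILTER (WHERE amount > 0 AND enddate < now()), 0)                         AS expiredbalance,
       COALESCE(SUM(-amount) FILTER (WHERE amount < 0), 0)                                             AS spentbalance
FROM   entries
WHERE  profileid = $1            -- or the function's profileid parameter
GROUP  BY programid, ledgerid;

A LANGUAGE sql function declared with the RETURNS TABLE signature above can contain just this SELECT, so no explicit looping or temp table should be needed to get the desired output shape.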
I have a fairly simple data model which consists of a star schema of 2 Fact tables and 2 dimension tables:
Fact 1 - Revenue
Fact 2 - Purchases
Dimension 1 - Time
Dimension 2 - Product
These tables are at different levels of granularity - meaning a given date could have many rows across many products. A specific date and product may have revenue, but no purchases. Likewise it may have purchases but no revenue.
Each fact joins to both dimensions, which contain additional detail such as the product name, product category, etc.
What I would like to do is combine these two facts so that I can report revenue and purchases together (for example, by date, by product, or by date and product combined).
I can get very close with data blending; however, the issue I run into is that data blending only supports a pseudo inner join. If either of these data sources is specified as the primary, then dates without purchases/revenue cause the corresponding rows in the secondary source to fall off.
What is the best way to blend this data without causing records to fall off?
Create a union of your fact tables. There will be mismatched fields, but that is OK (a SQL sketch of what the union produces follows these steps).
Perform data blending on the unioned connection to bring in the additional dimensions (not shown in the example).
Build the view
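To picture what step 1 produces: the unioned source has one measure column per fact, and rows coming from the other fact simply carry NULL in it. Expressed as a SQL sketch (table and column names are assumptions; in Tableau you would build this with the Union feature in the data source pane rather than hand-written SQL):

SELECT fkDate, fkProduct, revenueAmount, CAST(NULL AS decimal(18,2)) AS purchaseAmount
FROM   revenue_fact
UNION ALL
SELECT fkDate, fkProduct, CAST(NULL AS decimal(18,2)) AS revenueAmount, purchaseAmount
FROM   purchases_fact;

Because every date/product combination from either fact survives the union, nothing falls off when you then blend in the dimensions.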
I'm wondering how we could handle data that changes over time in Druid. I realize that Druid is built for streaming data where we wouldn't expect a particular row to have data elements change. However, I'm working on a project where we want to stream transactional data from a logistics management system, but there's a calculation that happens in that system that can change for a particular transaction based on other transactions. What I mean:
- 9th of the month: I post transaction A with a date of today (the 9th), which results in the stock on hand coming to 0 units.
- 10th of the month: I post transaction B with a date of the 1st of the month, crediting my stock amount by 10 units. At this time (on the 10th of the month), the stock on hand for transaction A recalculates to 10 units. The same would be true for ALL transactions after the 1st of the month.
As I understand it, we would re-extract transaction A, resulting in transaction A2.
The stock on hand dimension is incredibly important to our metrics. Specifically, identifying when stockouts occur (when stock on hand = 0). In the above example, if I have two rows for transaction A, I would be mistakenly identifying a stockout with transaction A1, whereas transaction A2 is the source of truth.
Is there any ability to archive a row and replace it with an updated row, or do we need to add logic to our queries that finds the rows with the freshest timestamp per transaction id?
Thanks
I have two thoughts that I hope help you. The key documentation for this is "Updating Existing Data" (http://druid.io/docs/latest/ingestion/update-existing-data.html), which gives you three options: lookup tables, reindexing, and delta ingestion. The last one, delta ingestion, is only for adding new rows to old segments, so that's not very useful for you; let's go over the other two.
Reindexing: You can crunch all the numbers that change in your ETL process, identify the segments that would need to be reloaded, and simply have Druid re-index those segments. That will replace the stock-on-hand value for A in your example whenever you run the re-indexing.
Lookups: If you have stock values for multiple products, you can store the product id in the segment and have that be immutable, but look up the stock-on-hand value via a lookup. So, you would store:
A, 2018-01-01, product-id: 123
And in your lookup, you'd have:
product-id: 123, stock-on-hand: 0
And later, you'd update the lookup and change that to 10. This would update any rows that reference product-id: 123.
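As for the fallback you mention of keeping both A1 and A2 and filtering for the freshest row per transaction id at query time, the logic is essentially a latest-row-per-key filter. Sketched in generic SQL (transactions, transaction_id and ingested_at are assumed names, and whether a window function like this is expressible directly in your Druid version/query layer is something to check; treat it as a sketch of the logic, not Druid-specific syntax):

SELECT *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY transaction_id
                              ORDER BY ingested_at DESC) AS rn   -- freshest row per transaction first
    FROM   transactions t
) AS latest
WHERE rn = 1;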
I can't be sure but you may be mixing up dimensions and metrics while you're doing this, and you may need to read over that terminology in OLAP descriptions like this: https://en.wikipedia.org/wiki/Online_analytical_processing
Good luck!
What is the best way to store metrics data used in displaying graphs?
Currently I have a table analytics(domain::text, interval_in_days::int, grouping::text, metric::text, type::text, labels[], data[], summary::json)
domain is the overall category of the metrics, i.e. which part of the application they fall under; it could be sales, support, etc.
interval_in_days and grouping are 'view options' the end user can specify at the interface level to get a different view of the data points.
grouping can be date, day_of_week or time_of_day
interval_in_days can be 7, 30 or 90
labels is an array of the labels on the x-axis and data are the corresponding datapoints.
type is either data_series or summary. If data_series, the row represents the data used for drawing the graph, while a summary row has the summary::json field populated with an object like {total_number_of_X: 132, median_X: 320, ...}
metric is simply the metric the corresponding graph represents, so there's a separate graph for each value of metric
From this it follows that for each metric/graph I display, I have 9 rows (3 intervals * 3 groupings). For each domain I have a single row with type summary.
Every few hours I aggregate a lot of data across multiple tables into the analytics table, so I don't have to perform expensive queries ad hoc.
I feel this is not the optimal approach, so I'm really interested in seeing how other people accomplish the same task, or in any suggestions.
There is nothing wrong with storing 9 rows of raw data and later aggregating them to something more comfortable. It's a common approach and has performance benefits in some situations.
What I would really re-think in your design are the datatypes. From your description it seems you can transform all ::text fields into something like ::varchar(20). Then you can use STORAGE PLAIN on these columns and your table will become more efficient.
Also, consider adding foreign keys to document what can be stored in individual columns. For example, you stated grouping can be date, day_of_week or time_of_day, so you could have a groupings table that lists these options. But again, the foreign key would have to be covered by an index, so you may want to skip that for performance reasons.
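For concreteness, the two suggestions above would look roughly like this; the statements assume the table/column names from the question, and whether STORAGE PLAIN actually helps is worth benchmarking on your own data:

-- Constrain the free-text column and switch its storage strategy.
ALTER TABLE analytics ALTER COLUMN "grouping" TYPE varchar(20);
ALTER TABLE analytics ALTER COLUMN "grouping" SET STORAGE PLAIN;

-- A small lookup table documenting the allowed grouping values.
CREATE TABLE groupings (
    "grouping" varchar(20) PRIMARY KEY
);
INSERT INTO groupings VALUES ('date'), ('day_of_week'), ('time_of_day');

ALTER TABLE analytics
    ADD CONSTRAINT analytics_grouping_fkey
    FOREIGN KEY ("grouping") REFERENCES groupings ("grouping");

The same pattern would apply to domain, metric and type if you also want to constrain those.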
I'm trying to design a cube in SSAS 2008 for data whose base unit is Member-Month, meaning that for each member there is demographic data, certain other indicators that may change, and dollar amounts paid per month. I feel like I need to include MemberID and MonthKey in the same dimension, but this seems like the wrong approach in the case when I just want to see dollars by month. If so, would I put both a Month Key and the Member-Month Key in the fact table? Or use a surrogate key in the Member-Month dimension, but include the MemberID and MonthKey in it? It seems wrong to have Month in two different places (Member-Month and Date). Any help is appreciated!
If I understand your question correctly, you should create a member table, a month (or date) table, and a fact table that has FactKey, MemberKey, MonthKey, Amount columns. Then you can create Member and Month dimensions.
You should not add month data to the member dimension. The relationship between the month and member dimensions is already established by the fact table, which holds all the data required to relate the dimensions to each other.
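A minimal relational sketch of that layout (names are illustrative only); in SSAS you would then build the Member and Month dimensions from the two Dim tables and the measure group from the fact table:

CREATE TABLE DimMember (
    MemberKey int PRIMARY KEY,
    MemberID  varchar(20) NOT NULL      -- plus the demographic attributes
);

CREATE TABLE DimMonth (
    MonthKey     int PRIMARY KEY,       -- e.g. 201801
    CalendarYear int NOT NULL,
    MonthOfYear  int NOT NULL
);

CREATE TABLE FactMemberMonth (
    FactKey   bigint PRIMARY KEY,
    MemberKey int NOT NULL REFERENCES DimMember (MemberKey),
    MonthKey  int NOT NULL REFERENCES DimMonth (MonthKey),
    Amount    decimal(18,2) NOT NULL
);

The other month-dependent indicators you mention can live either as additional measures on the fact table or as attributes of further dimensions, but the month key itself only needs to exist in the Date dimension and the fact table.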
This is a very simple design problem and is easily implemented in SSAS.
Hope this helps.