How to accomplish star schema in Tableau with multiple facts (without records falling off) - tableau-api

I have a fairly simple data model which consists of a star schema of 2 Fact tables and 2 dimension tables:
Fact 1 - Revenue
Fact 2 - Purchases
Dimension 1 - Time
Dimension 2 - Product
These tables are at different levels of granularity - meaning a given date could have many rows across many products. A specific date and product may have revenue, but no purchases. Likewise it may have purchases but no revenue.
Each fact joins to both dimensions, which contain additional detail such as the product name, product category, etc.
What I would like to do is combine these two facts such that I can report revenue and purchases together (example, by date, by product, or by date and product combined):
I can get very close with data blending, however the issue I run into is that data blending only supports a pseudo 'inner join'. As you can see, if either of these data sources is specified as primary, then dates without purchases/revenue will cause rows in the secondary source to fall off.
What is the best way to blend this data without causing records to fall off?

Create a union of your fact tables (a sketch is shown after these steps). There will be mismatched fields, but that is ok.
Perform data blending on the connection to bring in additional dimensions (not shown in the example)
Build the view
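If you build the union with custom SQL rather than in the data pane, a minimal sketch could look like the following; the table and column names (revenue_fact, purchases_fact, date_key, product_key) are assumptions for illustration, and each source's missing measure is simply padded with NULL:
select date_key, product_key, revenue, null as purchases
from revenue_fact
UNION ALL
select date_key, product_key, null as revenue, purchases
from purchases_fact
Once the union exists, SUM(revenue) and SUM(purchases) can be reported side by side by date and product without either side dropping rows.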

Related

Segregate Products based on shipping <SQL>

I have 10 different products (A, B, C, ..., J) that have multiple purchase dates (by various customers) and delivery dates. I want to see which products have a date difference of less than 5 days. Of those, which products have a customer rating of more than 3? If both criteria are satisfied, I want to fetch the products that have the minimum date difference from the queue, along with the "Important_date". If a product has several rows with the same minimum date difference, I would like to select the most recent one for that product and mark its purchase date as the "Important_date".
The columns in the table are: Product, Purchasedate, deliverydate, date_difference, customer_rating.
I am trying to use case statements to solve the problem in PostgreSQL.
I am looking for an output which will give me all the columns of the table along with "Important_date."
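One way this is often approached in PostgreSQL, as a hedged sketch only (the table name product_shipping is an assumption, and the rating condition is read as "more than 3"), is DISTINCT ON with an ORDER BY that makes the minimum date difference win and breaks ties on the most recent purchase date:
select distinct on (product)
       product, purchasedate, deliverydate, date_difference, customer_rating,
       purchasedate as important_date
from product_shipping
where date_difference < 5
  and customer_rating > 3
order by product, date_difference, purchasedate desc;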

Dimension Table Usage when we have a loaded fact table

I am new to data warehousing and I want to ask: once all the foreign key data has been copied into the fact table, why do we still use dimension tables, since the data is present in the fact table? Can someone please guide me.
Short answer: a typical dimension has more attributes than just a key. Your fact table has a foreign key to a dimension where additional information is available and even grouping is possible.
Recommended reading: "The Data Warehouse Toolkit" by Ralph Kimball
A fact table should only store 1) the business metric that it models (e.g. a sales order/transaction, or some other business transaction that you are measuring); 2) foreign keys to the related dimensions.
A dimension table should only store the context/qualitative data that is necessary to understand your business transactions (your facts).
Let's say, for example, that you are modelling sales on retail stores; a very simplified dimensional model for this would be something like:
Store Dimension: name, street address, city, county, etc
Product Dimension: name, brand, description, sku, etc
Date Dimension: year, month, day, etc
Sales Fact table: fkStore, fkProduct, fkDate, unitsSold, salesAmount
So, the fact table only holds the metrics/measures and foreign keys, but business users need to use the information stored in dimensions to be able to explore the facts. That's how you enable them to explore unitsSold or salesAmount according to a specific product, or on a specific store/location, or on a specific date.
The fact table by itself only provides quantitative data ("amount sold") while the dimensions provide the context that a business user needs to interpret that metric ("amount of product X sold in store Y in 2017").
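As a hedged sketch (the surrogate key column names productKey, storeKey and dateKey are assumptions; the rest follows the example above), the question "amount of product X sold in store Y in 2017" becomes a join-and-filter query against the fact table:
select sum(f.salesAmount) as amountSold
from SalesFact f
join ProductDimension p on p.productKey = f.fkProduct
join StoreDimension s on s.storeKey = f.fkStore
join DateDimension d on d.dateKey = f.fkDate
where p.name = 'Product X'
  and s.name = 'Store Y'
  and d.year = 2017;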
The decision on what falls into dimension or fact data is not clear-cut in many cases. Typically, data that is reusable (i.e. meaningful in relation to other fact data) can be considered dimension data.
Fact data is often the most changeable over time; fact tables contain the history of these records changing over time, like daily sales numbers, nightly end-of-day results, etc.
These are often numeric, i.e. quantitative measures. Data warehouse analysis then consists of bucketing (summing/grouping) these numerics so they carry the narrative of a trend over time at varying levels of granularity (see the sketch below).
Dimension data, by contrast, is of a more 'static' nature, like Trade, Customer or Product details.
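For instance, a hedged sketch of bucketing the retail example from the previous answer at monthly granularity (the key column names are assumptions):
select d.year, d.month, sum(f.salesAmount) as monthlySales
from SalesFact f
join DateDimension d on d.dateKey = f.fkDate
group by d.year, d.month
order by d.year, d.month;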
I recommend reading:
https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/

How to Handle Rows that Change over Time in Druid

I'm wondering how we could handle data that changes over time in Druid. I realize that Druid is built for streaming data where we wouldn't expect a particular row to have data elements change. However, I'm working on a project where we want to stream transactional data from a logistics management system, but there's a calculation that happens in that system that can change for a particular transaction based on other transactions. What I mean:
- 9th of the month: I post transaction A with a date of today (the 9th) that results in the stock on hand coming to 0 units.
- 10th of the month: I post transaction B with a date of the 1st of the month, crediting my stock amount by 10 units. At this time (on the 10th of the month) the stock on hand for transaction A recalculates to 10 units. The same would be true for ALL transactions after the 1st of the month.
As I understand it, we would re-extract transaction A, resulting in transaction A2.
The stock on hand dimension is incredibly important to our metrics. Specifically, identifying when stockouts occur (when stock on hand = 0). In the above example, if I have two rows for transaction A, I would be mistakenly identifying a stockout with transaction A1, whereas transaction A2 is the source of truth.
Is there any ability to archive a row and replace it with an updated row, or do we need to add logic to our queries that finds the rows with the freshest timestamp per transaction id?
Thanks
I have two thoughts that I hope help you. The key documentation for this is "Updating Existing Data": http://druid.io/docs/latest/ingestion/update-existing-data.html, which gives you three options: Lookup Tables, Reindexing, and Delta Ingestion. The last one, Delta Ingestion, is only for adding new rows to old segments, so that's not very useful for you; let's go over the other two.
Reindexing: You can crunch all the numbers that change in your ETL process, identify the segments that would need to be reloaded, and simply have Druid re-index those segments. That will replace the stock-on-hand value for A in your example whenever you want, whenever you do the re-indexing.
Lookups: If you have stock values for multiple products, you can store the product id in the segment and have that be immutable, but look up the stock-on-hand value in a lookup. So, you would store:
A, 2018-01-01, product-id: 123
And in your lookup, you'd have:
product-id: 123, stock-on-hand: 0
And later, you'd update the lookup and change that to 10. This would update any rows that reference product-id: 123.
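If you query through Druid SQL, reading the current value through the lookup could look roughly like this (a hedged sketch: the datasource name transactions and the lookup name stock_on_hand are assumptions, and the LOOKUP() function must be available in your Druid version):
SELECT __time, "product-id",
       LOOKUP("product-id", 'stock_on_hand') AS stock_on_hand
FROM transactions
Updating the lookup then changes the value returned for every row with that product-id, without rewriting any segments.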
I can't be sure but you may be mixing up dimensions and metrics while you're doing this, and you may need to read over that terminology in OLAP descriptions like this: https://en.wikipedia.org/wiki/Online_analytical_processing
Good luck!

Best way to store metric data used for graphs

What is the best way to store metrics data used in displaying graphs?
Currently I have a table analytics(domain::text, interval_in_days::int, grouping::text, metric::text, type::text, labels[], data[], summary::json)
domain is the overall category of the metrics. Like what part of the application they're under. Could be sales or support etc.
the interval_in_days and grouping are 'view options' the end user can specify at the interface level to have a different view of the data points.
grouping can be date, day_of_week or time_of_day
interval_in_days can be 7, 30 or 90
labels is an array of the labels on the x-axis and data are the corresponding datapoints.
type is either data_series or summary. If data_series, the row represents the data used for drawing the graph, while a summary row has the summary::json field populated with an object like {total_number_of_X: 132, median_X: 320, ...etc}
metric is simply the metric the corresponding graph represents, so there's a separate graph for each value of metric
From this it follows that for each metric/graph I display, I have 9 rows (3 intervals * 3 groupings). For each domain I have a single row with type summary.
Every few hours I aggregate a lot of data across multiple tables into the analytics table. So I don't have to perform expensive queries adhoc.
I feel this is not the optimal approach, so I'm really interested in seeing how other people accomplish the same task, or any suggestions.
There is nothing wrong with storing 9 rows of raw data and later aggregating them to something more comfortable. It's a common approach and has performance benefits in some situations.
What I would really re-think in your design are the datatypes. From your description it seems you can transform all ::text fields into something like ::varchar(20). Then you can use STORAGE PLAIN on these columns and your table will become more efficient.
Also, consider adding foreign keys to describe what is stored in individual columns. For example, you stated grouping can be date, day_of_week or time_of_day, so you could have a groupings table that lists these options. But again, the foreign key would have to be covered by an index, so you may want to skip that for performance reasons.
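A hedged sketch of both suggestions in PostgreSQL (names follow the question; whether the storage change actually helps depends on your data):
-- narrow the text column and keep it stored inline, uncompressed
ALTER TABLE analytics ALTER COLUMN "grouping" TYPE varchar(20);
ALTER TABLE analytics ALTER COLUMN "grouping" SET STORAGE PLAIN;
-- list the allowed values and reference them with a foreign key
CREATE TABLE groupings ("grouping" varchar(20) PRIMARY KEY);
INSERT INTO groupings VALUES ('date'), ('day_of_week'), ('time_of_day');
ALTER TABLE analytics
  ADD CONSTRAINT analytics_grouping_fkey
  FOREIGN KEY ("grouping") REFERENCES groupings ("grouping");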

Volume of an Incident Queue at a Point in Time

I have an incident queue, consisting of a record number (string), an open time (datetime), and a close time (datetime). The records go back a year or so. What I am trying to get is a line graph displaying the queue volume as it was at 8PM each day. So if a ticket was opened before 8PM on that day, or at any time on a previous day, but not closed as of 8PM, it should be counted in the population.
I tried the below, but this won't work because it doesn't really take into account multiple days.
If DATEPART('hour',[CloseTimeActual])>18 AND DATEPART('minute',[CloseTimeActual])>=0 AND DATEPART('hour',[OpenTimeActual])<=18 THEN 1
ELSE 0
END
Has anyone dealt with this problem before? I am using Tableau 8.2, cannot use 9 yet due to company license so please only propose 8.2 solutions. Thanks in advance.
For tracking history of state changes, the easiest approach is to reshape your data so each row represents a change in an incident state. So there would be a row representing the creation of each incident, and a row representing each other state change, say assignment, resolution, cancellation etc. You probably want columns to represent an incident number, date of the state change and type of state change.
Then you can write a calculated field that returns +1, -1 or 0 to express how the state change affects the number of currently open incidents. Then you use a running total to see the total number open at a given time.
You may need to show missing date values or add padding if state changes are rare. For other analytical questions, structuring your data with one record per incident may be more convenient. To avoid duplication, you might want to use database views or custom SQL with UNION ALL clauses to allow both views of the same underlying database tables.
It's always a good idea to be able to fill in the blank for "Each record in my dataset represents exactly one _________"
Tableau 9 has some reshaping capability in the data connection pane, or you can preprocess the data or create a view in the database to reshape it. Alternatively, you can specify a Union in Tableau with some calculated fields (or similarly custom SQL with a UNION ALL clause). Here is a brief illustration:
select open_date as Date,
"OPEN" as Action,
1 as Queue_Change,
<other columns if desired>
from incidents
UNION ALL
select close_date as Date,
"CLOSE" as Action,
-1 as Queue_Change,
<other columns if desired>
from incidents
where close_date is not null
Now you can use a running sum for SUM(Queue_Change) to see the number of open incidents over time. If you have other columns like priority, department, type etc, you can filter and group as usual in Tableau. This data source can be in addition to your previous one. You don't have to have a single view of the data for every worksheet in your workbook. Sometimes you want a few different connections to the same data at different levels of detail or from different perspectives.
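For reference, the same running total can also be computed in the database with a window function; this is a hedged sketch reusing the union above (in Tableau itself, a RUNNING_SUM table calculation on SUM(Queue_Change) gives the equivalent result):
select d.Date,
       sum(sum(d.Queue_Change)) over (order by d.Date) as open_incidents
from (
    select open_date as Date, 1 as Queue_Change from incidents
    union all
    select close_date as Date, -1 as Queue_Change from incidents
    where close_date is not null
) d
group by d.Date
order by d.Date;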