Which data type for date-type attributes in a dimension table, including start & end dates?

I'm designing a data warehouse using dimensional modeling. I've read most of The Data Warehouse Toolkit by Kimball & Ross. My question is regarding the columns in a dimension table that hold dates. For example, here is a table for Users of the application:
CREATE TABLE user_dim (
user_key BIGINT, -- surrogate key
user_id BIGINT, -- natural key
user_name VARCHAR(100),
...
user_added_date DATE, -- type 0, date user added to the system
...
-- Type-2 SCD administrative columns
row_start_date DATE, -- first effective date for this row
row_end_date DATE, -- last effective date for this row, 9999-12-31 if current
row_current_flag VARCHAR(10) -- current or expired
);
The last three attributes are for implementing type 2 slowly-changing dimensions. See Kimball, pages 150-151.
Question 1: Is there a best practice for the data type of the row_start_date and row_end_date columns? The type could be DATE (as shown), STRING/VARCHAR/CHAR ("YYYY-MM-DD"), or even BIGINT (a foreign key to the Date Dimension). I don't think there would be much filtering on the row start/end dates, so a key to the Date Dimension is not required.
Question 2: Is there a best practice for the data type of dimension attributes such as "user_added_date"? I can see someone wanting reports on users added per fiscal quarter, so using a foreign key to the Date Dimension would be helpful. Any downsides to this, besides having to join from the User Dimension to the Date Dimension to display the attribute?
If it matters, I'm using Amazon Redshift.

Question 1: For the SCD from and to dates, I suggest you use TIMESTAMP. My preference is WITHOUT TIME ZONE, making sure all of your timestamps are UTC.
Question 2: I always set up a date dimension table with a logical key of the actual date. That way you can join any date (e.g. the start date of the user) to the date dimension to find, say, the fiscal month off the date dimension. But you can also see the date without joining to the date dimension, since it is plain to see (stored as a date).
With Redshift (or any columnar MPP DBMS) it is good practice to denormalise a little, e.g. use a star schema rather than a snowflake schema. This is because of the efficiencies that columnar storage brings, and it deals with the inefficient joins (because there are no indexes).
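A minimal sketch of that pattern, assuming a date_dim table keyed by the actual date (the table and column names here are illustrative, not from the question):
CREATE TABLE date_dim (
full_date DATE NOT NULL, -- logical key: the actual calendar date
fiscal_quarter VARCHAR(6),
fiscal_month VARCHAR(10)
-- ... other calendar attributes
);
-- any plain date attribute joins straight to it, no surrogate key needed
SELECT d.fiscal_quarter, COUNT(*) AS users_added
FROM user_dim u
JOIN date_dim d ON d.full_date = u.user_added_date
GROUP BY d.fiscal_quarter;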

For Question 1: row_start_date and row_end_date are not part of the incoming data. As you mentioned, they are created artificially for SCD Type 2 purposes, so they should not have a key to the Date dimension; the User dimension has no reason to reference the Date dimension for them. For the data type, a plain DATE (YYYY-MM-DD) should be fine.
For Question 2: If you have a requirement like this, I would suggest creating a derived fact table (often called an accumulating snapshot fact table) to keep derived measures like user_added_date.
For more info see https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/accumulating-snapshot-fact-table/
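A hedged sketch of what such a derived fact table could look like (names are illustrative; one row per user, with milestone date keys filled in as the milestones occur):
CREATE TABLE user_signup_fact (
user_key BIGINT, -- FK to user_dim
added_date_key BIGINT, -- FK to date_dim: date the user was added
activated_date_key BIGINT -- FK to date_dim: set when the milestone occurs
);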

Is there a way to pull just the year out of a VARCHAR datetime value?

I am working on a project, in Snowflake, that requires me to combine pest & weather data tables, but the two tables do not share a common column. My solution has been to create a view that extracts the year from the Pest Table dates, format ex.
CREATION_DATE: 03/26/2020 09:11:15 PM,
to match the YEAR column in the Weather tables, format ex.
DATEYEAR: 2021.
However, I have come to find that the dates in the pest report are VARCHAR as opposed to traditional date/datetime values. Is there a way to pull just the year out of the VARCHAR date value? Additional information: I cannot change the tables themselves; I will need to create a view that preserves all other columns and adds a new "DATEYEAR" column.
Yes, we can, and below is a working example:
create table test (dt string);
insert into test (dt) values ('01/04/2022');
select dt, date_part(year, dt::date) from test;
To make it easy, you can split the string into an array and take the third member of the array (index 2, since arrays are 0-based):
select strtok_to_array('03/26/2020', '/')[2]::int as MY_YEAR;
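For the actual requirement in the question (a view that preserves all columns and adds a DATEYEAR column), a sketch along these lines should work, assuming the pest table is named PEST_TABLE and its strings all match the format shown:
create or replace view pest_with_year as
select t.*,
year(try_to_timestamp(t.creation_date, 'MM/DD/YYYY HH12:MI:SS AM')) as dateyear
from pest_table t;
try_to_timestamp returns NULL instead of erroring on malformed rows, which is safer than ::date when the column is free-form VARCHAR.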

Index for TIMESTAMPTZ and function immutability

We have a structure similar to the following:
create table company
(
id bigint not null,
tz text not null
);
create table company_data
(
company_id bigint not null,
ts_tz timestamp with time zone not null
);
The tables are simplified.
Fiddle with sample data here: SQL Fiddle
Every company has a fixed TZ. So, when we need to extract some information from company_data we use a query similar to the following:
select
cd.company_id,
cd.ts_tz at time zone c.tz
from company_data cd
join company c on c.id = cd.company_id;
We also have a function to get the company tz:
create or replace function tz_company(f_company_id bigint) returns text
language plpgsql
as
$$
declare
f_tz text;
begin
select c.tz from company c where c.id = f_company_id into f_tz;
return f_tz;
end;
$$;
And another to transform a timestamp into a date, applying a tz:
create or replace function tz_date(timestamp with time zone, text) returns date
language plpgsql
immutable strict
as
$$
begin
return ($1 at time zone $2) :: date;
end;
$$;
The problem we are having now is that company_data (and other similar tables) is large and frequently used. The majority of the SELECTs on that table filter by a DATE.
For example:
select cd.company_id,
cd.ts_tz at time zone tz_company(cd.company_id)
from company_data cd
where tz_date(cd.ts_tz, tz_company(cd.company_id)) >= '2019-08-20'
and tz_date(cd.ts_tz, tz_company(cd.company_id)) <= '2019-08-22';
So, to speed up queries, we need to add an index on the company_data.ts_tz column. The only way we found to do this was the following:
create index idx_company_data_ts_tz on company_data
(((company_data.ts_tz at time zone tz_company(company_data.company_id))::date));
For this to work, we need to make the tz_company function immutable.
Some other problems (and ideas) emerged:
1 - The version of the query using the tz_date function does not use the index.
Does not use the index:
explain analyse
select cd.company_id,
cd.ts_tz at time zone tz_company(cd.company_id)
from company_data cd
where tz_date(cd.ts_tz, tz_company(cd.company_id)) >= '2019-08-20'
and tz_date(cd.ts_tz, tz_company(cd.company_id)) <= '2019-08-22';
Uses index:
explain analyse
select cd.company_id,
cd.ts_tz at time zone tz_company(cd.company_id)
from company_data cd
where (cd.ts_tz at time zone tz_company(cd.company_id))::date >= '2019-08-20'
and (cd.ts_tz at time zone tz_company(cd.company_id))::date <= '2019-08-22';
Why does that happen?
2 - We know that, in theory, tz_company should not be immutable, at most stable. But the company tz is information that should never change. Yes, it could happen, but it is improbable; in the past three years we have never changed the tz of any company. So, is it still a problem for tz_company to be immutable? If it is, how could we rewrite the index? Note that a single SELECT could bring information from more than one company and mix different timezones.
3 - Because of the complexity of dealing with indexes on a timestamptz column, we are considering adding another column to every table that has a ts_tz. This new column would be a date with the tz already applied. Is this a good approach?
Besides, we need to apply the tz before casting, because every client (company) filters only by dates, and these dates are locale-aware (tz-aware).
EDIT 1:
The queries used are only for demonstration, but a requirement is that the client sees the timestamps in the timezone where the event occurred; this is important. We deal with logistics operations in Brazil, and Brazil itself has four different timezones across the country.
A holding could own different companies and every company could be in a different timezone.
So a lot of queries deal with different companies in different timezones while applying some date filtering. Today our backend returns all data ready to display, with the timezone applied, and this would be difficult to change.
What we want to achieve is an easy and performant way of dealing with those timestamptz columns: filtering by (tz-aware) date and using indexes to speed up the queries.
1 - That's because tz_date is not marked as immutable. It is safe to mark it as immutable if Postgres would allow you to create an index directly on the same expression as in the body of the function -- it only allows that for an immutable expression. Some Postgres date-time manipulation functions and type casts are immutable, some aren't. BTW, I'm not sure what happens to such an index if the at time zone operator breaks its immutability contract when tzdata changes -- and that happens quite often on a Postgres or OS upgrade, depending on the settings.
2 - That's a very dangerous approach: the index becomes corrupted if you change the data, and you may lose data. If you absolutely need this pseudo-immutable function, I would strongly recommend adding a trigger that disallows deletes, truncates, and updates of company.tz. If you ever need to change the time zone data, drop the index first.
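A sketch of such a guard, assuming that blocking row-level updates and deletes is enough (TRUNCATE would additionally need a statement-level trigger):
create or replace function forbid_company_tz_change() returns trigger
language plpgsql as $$
begin
raise exception 'company.tz is treated as immutable; drop the dependent index first';
end;
$$;
create trigger company_tz_guard
before update of tz or delete on company
for each row execute procedure forbid_company_tz_change();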
3 - The key question is whether you ever query data across multiple companies.
a) If you do, it's only of numerological sense. 2019-09-13 events from Niue (UTC-11) and 2019-09-13 events from New Zealand (UTC+13) can never happen at the same time; the only common property of these events is that they happened on Friday the 13th. That's only notation -- it was never 2019-09-13 in both countries at the same moment. So please make sure your queries really make sense. In this unlikely case, denormalizing the date notation as a separate timestamp without time zone column would make sense, as you're filtering by the notation of time, not by the moment in time. I would recommend a trigger to populate it (see the sketch after point b below).
b) All your queries are single-company. In this case I would create a plain index on the columns only, with no expressions, create a helper function, and write queries like this:
create index on company_data(company_id, ts_tz);
create function midnight_at_company(p_date date, p_company_id bigint) returns timestamp with time zone strict as $$
select p_date::timestamp at time zone tz from company where id = p_company_id;
$$ language sql;
-- put your company id instead of $1
explain analyse
select cd.company_id,
cd.ts_tz at time zone tz_company(cd.company_id)
from company_data cd
where company_id = $1
and cd.ts_tz >= midnight_at_company('2019-08-20', $1)
and cd.ts_tz < midnight_at_company('2019-08-23', $1); --note exact `<`, not `<=`
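For option (a) above, a minimal sketch of the denormalized column and the trigger that populates it (the column name local_ts is illustrative):
alter table company_data add column local_ts timestamp;
create or replace function company_data_set_local_ts() returns trigger
language plpgsql as $$
begin
select new.ts_tz at time zone c.tz
into new.local_ts
from company c
where c.id = new.company_id;
return new;
end;
$$;
create trigger company_data_local_ts
before insert or update of ts_tz, company_id on company_data
for each row execute procedure company_data_set_local_ts();
A plain index on local_ts then supports the date-notation filters directly.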
I would standardize all the time zones into one, calling it database or server time. I understand that the companies are in different places, but that is not a good reason to have timezones all over your data. Using this method eliminates the need for a time zone reference table. When you pull the data from any one of these companies, write your code to take the server time zone into account so that it reads in your local time.
This will eliminate tons of potential confusion. It is a method used across the world; that is why timestamps in most APIs carry only one timezone.
In response to Edit:
Hi @Luiz,
Let me start with: there is no right or wrong answer; it's whatever you think works best. In my case, I am of the opinion that the front-end view and the data should be managed somewhat separately. On the data side, as per this topic, I would handle all date stamps using server time. The need to view the data one way or another is a front-end issue.
In the case of your requirement, I would either hard-code a JS switch like so:
switch("CampanyA") {
case "CompanyA":
return Timezone EST...
// code block
break;
case "CompanyB" :
// code block
break;
default:
// code block
}
or, if there are too many companies to hard-code, I would make a table with the Company ID, Company Name, and Time Zone Code. Do not link this table to your data tables. You should add the Company ID to the main table of events, which are stored in the server time zone.
Use the table with the company time zone codes to populate the lookup filter that will be used to run your query. When your script's event handler reacts to the drop-down menu, it will save the time zone code associated with that company and use that value when displaying times in accordance with your requirement. I would also force your code to load data asynchronously (1000 records or so every few milliseconds) instead of all at once; this will vastly increase performance, and the user will not be able to tell that their data is still loading.
These efforts will let you manipulate the time zone to meet the current and future requirements that might come up.
I think the current schema you are using for your application is not the best for such a problem.
You would have a lot of problems saving different timezones in the same table.
Use UTC, and only UTC, at the DB/schema level; you can also set that in the Postgres conf.
Depending on the application, you could send back UTC dates and convert them to the current local time in JavaScript/server-side code. If that's not possible, have one place where the user specifies their current UTC offset, and then, right before you display the date/time, convert it to their time.
This is going to make your life super simple, and you can achieve great performance at the query level, as you would now have a performant DB schema; the SQL functions you have make no sense, as you can achieve much better performance just by using indexing in the DB.
So, as per your specific requirements, I would keep the schema as you have it with some additions: I would index the id on the company table and store all company_data timestamps in UTC.
If company data is requested, we fetch the timezone (text) from the company table; using this, the backend code/JS can do the timezone-conversion magic.
We have a limited number of timezones; you can ideally have those set in a config to make the lookup easier and faster.

Crystal Reports - create calendar

I need to create an attendance list showing days in rows and employee names in columns. The list will always cover one full month, chosen in parameters.
How can I create a recordset of the days of the chosen month? I've done it in the command section but, due to ERP system limitations, it must be done another way.
Thank you,
Przemek
A good approach is to create a Calendar table (aka Date Dimension in data warehousing lingo). It makes it easy to show days without any attendance. If you don't need that aspect, you can simply create a formula that returns the day of month of the attendance date, and group on that formula. The Day() function gets you the day of month. For example,
Day ({Orders.Order Date})
If you search for 'creating a date dimension or calendar table' you'll find many helpful sources, such as this one: https://www.mssqltips.com/sqlservertip/4054/creating-a-date-dimension-or-calendar-table-in-sql-server/
For your case, I agree with the comments in that post about using a date instead of an integer as the primary key. An integer PK makes more sense for true data warehousing scenarios, as opposed to legacy databases.
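If the ERP database lets you run plain SQL at all, the recordset of days for the chosen month can also be generated on the fly; a sketch in SQL Server syntax (as in the linked tip), assuming @month_start holds the first day of the chosen month:
declare @month_start date = '2024-01-01'; -- hypothetical parameter value
with days as (
select @month_start as cal_date
union all
select dateadd(day, 1, cal_date)
from days
where cal_date < eomonth(@month_start)
)
select cal_date, day(cal_date) as day_of_month
from days;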

Postgres timestamp to date

I am building a map in CartoDB, which uses Postgres. I'm simply trying to display my dates as 10-16-2014, but I haven't been able to, because Postgres includes an unneeded timestamp in every date column.
Should I alter the column to remove the timestamp, or is it simply a matter of a (correct) SELECT query? I can SELECT records from a date range no problem with:
SELECT * FROM mytable
WHERE myTableDate >= '2014-01-01' AND myTableDate < '2014-12-31'
However, my dates appear in my CartoDB maps as: 2014-10-16T00:00:00Z and I'm just trying to get the popups on my maps to read: 10-16-2014.
Any help would be appreciated - Thank you!
You are confusing storage with display.
Store a timestamp or a date, depending on whether you need the time or not.
If you want formatted output, ask the database for formatted output with to_char, e.g.
SELECT col1, col2, to_char(col3, 'DD-MM-YY'), ... FROM ...;
See the PostgreSQL manual.
There is no way to set a user-specified date output format: dates are always output in ISO format. If PostgreSQL let you specify other formats without changing the SQL query text, it would really confuse client drivers and applications that expect the date format the protocol specifies and would get something entirely different.
You have two basic options:
1. Change the column from a timestamp to a date column.
2. Cast to date in your SQL query (i.e. mytimestamp::date works).
In general, if this is a presentation issue, I don't think that is a good reason to muck around with the database structure; it is better handled by client-side processing or casting in the SQL query. On the other hand, if the issue is a semantic one, you may want to revisit your database structure.
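Putting the two suggestions together for the exact format requested in the question (column names taken from the question; display_date is an illustrative alias):
SELECT *,
to_char(myTableDate, 'MM-DD-YYYY') AS display_date
FROM mytable
WHERE myTableDate >= '2014-01-01'
AND myTableDate < '2015-01-01'; -- covers all of 2014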

DB2: getting business dates between given dates

I have a table called "Publicholidays" in which dates are stored as VARCHAR.
My query should fetch all values from, say, table xxxx between the user-selected dates, excluding weekends (Sat, Sun) and public holidays. I am new to DB2, so can anyone suggest ideas please?
Note: in the DB, dates are stored as strings.
Mistake #1 - Storing dates as strings. Let's hope you have at least stored them as YYYY-MM-DD and not MM-DD-YYYY.
Mistake #2 - Instead of a "Publicholidays" table, you need a Calendar (aka Dates or date conversion) table. It should have a record for every day, along with a few flag columns: BUSINESS_DAY, WEEKEND, PUBLIC_HOLIDAY. Alternatively, you could have a single DAY_TYPE column with values for business day, weekend, and holiday. You'll also want a STRING_DATE column to make conversion between your string dates and true dates easier; a sketch follows below.
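A minimal sketch of such a calendar table in DB2 (names and the flag encoding are illustrative, and DATE_COL stands in for your string date column):
CREATE TABLE CALENDAR (
CAL_DATE DATE NOT NULL PRIMARY KEY,
STRING_DATE VARCHAR(10) NOT NULL, -- 'YYYY-MM-DD', joins to the legacy strings
BUSINESS_DAY CHAR(1) NOT NULL, -- 'Y' or 'N'
WEEKEND CHAR(1) NOT NULL,
PUBLIC_HOLIDAY CHAR(1) NOT NULL
);
SELECT x.*
FROM xxxx x
JOIN CALENDAR c ON c.STRING_DATE = x.DATE_COL
WHERE c.CAL_DATE BETWEEN ? AND ?
AND c.BUSINESS_DAY = 'Y';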
Google 'SQL calendar table' and you'll find lots of examples and discussions.
Lastly, strongly consider fixing your DB to store dates in a date column.