Date Table/Dimension Querying and Indexes - tsql

I'm creating a robust date table and want to know the best way to link to it. The Primary Key Clustered Index will be on the smart date integer key (per Kimball spec) with a name of DateID. Until now I have been running queries against it like so:
select Foo.orderdate -- a bunch of fields from Foo
,DTE.FiscalYearName
,DTE.FiscalPeriod
,DTE.FiscalYearPeriod
,DTE.FiscalYearWeekName
,DTE.FiscalWeekName
FROM SomeTable Foo
INNER JOIN
DateDatabase.dbo.MyDateTable DTE
ON DTE.date = CAST(FLOOR(CAST(Foo.orderdate AS FLOAT)) AS DATETIME)
Keep in mind that Date is a nonclustered index field with values such as:
2000-01-01 00:00:00.000
It just occurred to me that since I have a clustered integer index (DateID), perhaps I should be converting the datetime in my database field to match it and joining based upon that field.
What do you folks think?
Also, depending on your first answer: if I am typically pulling those fields from the date table, what kind of index can I use to optimize the retrieval of those fields? A covering index?

Even without changing the database structure, you'd get much better performance using a date range join like this:
select Foo.orderdate -- a bunch of fields from Foo
,DTE.FiscalYearName
,DTE.FiscalPeriod
,DTE.FiscalYearPeriod
,DTE.FiscalYearWeekName
,DTE.FiscalWeekName
FROM SomeTable Foo
INNER JOIN
DateDatabase.dbo.MyDateTable DTE
ON Foo.orderdate >= DTE.date AND Foo.orderdate < DATEADD(dd, 1, DTE.date)
However, if you can change it so that your Foo table includes a DateID field then, yes, you'd get the best performance by joining with that instead of any converted date value or date range.
If you change it to join on DateID, and DateID is the first column of the clustered index of MyDateTable, then it's already covering (a clustered index always includes all other fields).
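As a rough sketch, assuming you can add and backfill a DateID column on SomeTable (the new column itself and the YYYYMMDD key format are assumptions based on the Kimball-style smart key described in the question):
-- assumed new column on the fact table, backfilled from the existing datetime
ALTER TABLE SomeTable ADD DateID int NULL;
-- assumed backfill: build the YYYYMMDD smart key from the order date
UPDATE Foo
SET DateID = YEAR(Foo.orderdate) * 10000 + MONTH(Foo.orderdate) * 100 + DAY(Foo.orderdate)
FROM SomeTable Foo;
-- the join then seeks directly on the clustered index of MyDateTable
SELECT Foo.orderdate
      ,DTE.FiscalYearName
      ,DTE.FiscalPeriod
      ,DTE.FiscalYearPeriod
      ,DTE.FiscalYearWeekName
      ,DTE.FiscalWeekName
FROM SomeTable Foo
INNER JOIN DateDatabase.dbo.MyDateTable DTE
    ON DTE.DateID = Foo.DateID;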

Related

Index for self-join on timestamp range and user_id

I have a table in a PostgreSQL (10.2) database, something like this...
create table test (user_id text, event_time timestamp, ...);
I'd like to use this table in a self join, to match records to other records from the same user_id and an event_time within the next 5 minutes. Something like this...
select *
from test as a
inner join test as b
  on a.user_id = b.user_id
  and a.event_time < b.event_time
  and a.event_time > b.event_time - interval '5 minutes';
This works fine, but I'd ideally like to make it faster. I've gotten the join to use an index on user_id, but I'm wondering if it's possible to make an index to use both user_id AND the timestamp?
I've tried making a gist index on a tsrange from the event time to the event time plus 5 minutes, but Postgres seemed to just use the user_id index in that case. I tried making a multi-column index on the user_id and the tsrange, but that doesn't seem supported.
Finally, I tried making an index on just the timestamp.
None of that seemed to help.
However, the timestamps cover a long time period, and I'm only interested in a 5-minute window, which intuitively feels like something a good index should help with.
Can this be accomplished?
A multi-column index on the user_id text column and the event_time timestamp should work. A GiST index on the range would need to include the user_id as well, and it would be less versatile, since it would work only with the fixed interval of 5 minutes. I wouldn't use it unless what you actually want is to establish an exclusion constraint on the table.
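For example, a minimal sketch of that multi-column index on the test table from the question (the index name is just an example):
-- b-tree index that serves both the equality on user_id and the range condition on event_time
CREATE INDEX test_user_id_event_time_idx ON test (user_id, event_time);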

Benefit to adding an Index for an order by column?

We have a large table (2.8M rows) where we are finding a single row by our device_token column:
CREATE TABLE public.rpush_notifications (
id bigint NOT NULL,
device_token character varying,
data text,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
...
We are constantly doing the following query:
SELECT * FROM rpush_notifications WHERE device_token = '...' ORDER BY updated_at DESC LIMIT 1
I'd like to add an index for our device_token column, and I'm wondering if there is any benefit to creating an additional index for updated_at, or to creating a multicolumn index on both device_token and updated_at given that we are ordering by it, i.e.:
CREATE INDEX foo ON rpush_notifications(device_token, updated_at)
I have been unable to find an answer that would help me understand if there would be any performance benefit to adding updated_at to the index, given the query we are running above. Any help appreciated. We are running PostgreSQL 11.
There is a performance benefit if you combine both columns just like you did ((device_token, updated_at)): the database can easily find the entries with the specific device_token, and it does not need to do any sorting during the query.
Even better would be an index on (device_token, updated_at DESC), as it gives you the requested row as the first one for that device_token, so there is no need to scan through all of that token's entries to find the last one.
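As a sketch (the index name is just an example):
-- matches WHERE device_token = ... ORDER BY updated_at DESC LIMIT 1 without an extra sort
CREATE INDEX rpush_notifications_token_updated_at_idx
    ON rpush_notifications (device_token, updated_at DESC);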

PostgreSQL indexed columns choice

I have these two tables:
CREATE TABLE ref_dates(
date_id SERIAL PRIMARY KEY,
month int NOT NULL,
year int NOT NULL,
month_name CHAR(255)
);
CREATE TABLE util_kpi(
kpi_id SERIAL PRIMARY KEY,
kpi_description int NOT NULL,
kpi_value float,
date_id int NOT NULL,
dInsertion timestamp default CURRENT_TIMESTAMP,
CONSTRAINT fk_ref_kpi FOREIGN KEY (date_id) REFERENCES ref_dates(date_id)
);
Usually, the types of query I'd run are:
Selecting kpi_description and kpi_value for a specified month and year:
SELECT kpi_description, kpi_value FROM util_kpi u JOIN ref_dates r ON u.date_id = r.date_id WHERE month=X AND year=XXXX
Selecting kpi_description and kpi_value for a specified kpi_description, month and year:
SELECT kpi_description, kpi_value FROM util_kpi u JOIN ref_dates r ON u.date_id = r.date_id WHERE month=X AND year=XXXX AND kpi_description='XXXXXXXXXXX'
I thought about creating these indexes:
CREATE INDEX idx_ref_date_year_month ON ref_dates(year, month);
CREATE INDEX idx_util_kpi_date ON util_kpi(date_id);
First of all, I want to know if it's a good idea to create these indexes.
Second of all, and finally, I was wondering if it's a good idea to add kpi_description to the indexes on the util_kpi table.
Can you guys give me your opinion?
Regards
It's not possible to give an exact answer without looking at the data, so I can only offer an opinion.
A. ref_dates
This table looks very similar to a date dimension in ROLAP schemas.
So the first thing I would do is change date_id from SERIAL to:
a DATE datatype,
or even a "smart integer": an integer datatype in the form YYYYMMDD, e.g. 20210430. It may look strange, but it's not uncommon to see such identifiers in date dimensions.
The main point of using such a form is that date_id in fact tables becomes informative even without joining to the date dimension.
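For instance, assuming PostgreSQL, one way to derive such a YYYYMMDD key from an actual date would be:
-- 2021-04-30 becomes the smart integer key 20210430
SELECT to_char(DATE '2021-04-30', 'YYYYMMDD')::int AS date_id;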
B. util_kpi
I suppose that:
ref_dates is a date dimension, so it's roughly 365 * number of years rows. It could be populated once for 20-30 years into the future and still would not be really big.
util_kpi is a fact table, which is expected to be really big: millions of records or more.
For util_kpi I expected an id for a time dimension but did not find one, so presumably no hourly stats are planned yet.
I see util_kpi.dInsertion, which I suppose is planned to be used as the time dimension. I would consider extracting it into a time_id holding hours, minutes and seconds (if milliseconds are not needed).
C. Indexing
ref_dates: it does not matter much how you index ref_dates because it is a relatively small table. Maybe a unique index on date_id with the INCLUDE option for all other fields would be best. Don't create individual indexes for low-selectivity fields like year or month - they will not help much, but they will not hurt much either.
util_kpi: you need an index on date_id (as for any foreign key to other dimension tables that may appear in the future).
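For example, a minimal sketch of those two indexes (the index names are examples; INCLUDE requires PostgreSQL 11 or later, and the second index matches the one already proposed in the question):
-- small dimension table: unique index on the key, carrying the other columns along
CREATE UNIQUE INDEX idx_ref_dates_date_id
    ON ref_dates (date_id) INCLUDE (month, year, month_name);
-- big fact table: plain index on the foreign key to the date dimension
CREATE INDEX idx_util_kpi_date ON util_kpi (date_id);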
Those are my thoughts, based on what I've assumed above.

Get max timestamps efficiently for large table for a set of ids

I have a large PostgreSQL db table (actually lots of partition tables divided up by yearly quarters) that for simplicity's sake is defined something like
id bigint
ts (timestamp)
value (float)
For a particular set of ids what is an efficient way of finding the last timestamp in the table for each specified id ?
The table is indexed by (id, timestamp)
If I do something naive like
SELECT sensor_id, MAX(ts)
FROM sensor_values
WHERE ts >= (NOW() + INTERVAL '-100 days') :: TIMESTAMPTZ
GROUP BY 1;
Things are pretty slow.
Is there a way of perhaps narrowing down the times first by a binary search of one id?
(I can assume the timestamps are similar for a particular set of ids)
I am accessing the db through psycopg so the solution can be in code or SQL if I am missing something easy to speed this up.
The explain for the query can be seen here. https://explain.depesz.com/s/PVqg
Any ideas appreciated.
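One common way to attack this kind of query, sketched here on the assumption that the table and columns really are named sensor_values, sensor_id and ts as in the query above and that the (id, timestamp) index exists, is to look each id up separately so every lookup becomes a single backward index scan (the id list here is a placeholder):
-- for each requested id, fetch only its newest row via the (sensor_id, ts) index
SELECT v.sensor_id, v.ts AS max_ts
FROM unnest(ARRAY[1, 2, 3]::bigint[]) AS ids(sensor_id)
CROSS JOIN LATERAL (
    SELECT sv.sensor_id, sv.ts
    FROM sensor_values AS sv
    WHERE sv.sensor_id = ids.sensor_id
    ORDER BY sv.ts DESC
    LIMIT 1
) AS v;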

date_trunc on timestamp column returns nothing

I have a strange problem when retrieving records from the db after comparing a truncated field with date_trunc().
This query doesn't return any data:
select id from my_db_log
where date_trunc('day',creation_date) >= to_date('2014-03-05'::text,'yyyy-mm-dd');
But if I add the column creation_date with id then it returns data (i.e. select id, creation_date...).
I have another column last_update_date of the same type, and when I use that one it still behaves the same way.
select id from my_db_log
where date_trunc('day',last_update_date) >= to_date('2014-03-05'::text,'yyyy-mm-dd');
Similar to the previous one, it also returns records if I select id, last_update_date.
Now to dig further, I have added both creation_date and last_update_date in my where clause, and this time it demands to have both of them in my select clause to return records (i.e. select id, creation_date, last_update_date).
Has anyone ever encountered the same problem? The same thing works with my other tables which have these types of columns!
If it helps, here is my table schema:
id serial NOT NULL,
creation_date timestamp without time zone NOT NULL DEFAULT now(),
last_update_date timestamp without time zone NOT NULL DEFAULT now(),
CONSTRAINT db_log_pkey PRIMARY KEY (id),
I asked a different question earlier that didn't get any answer. This problem may be related to that one. If you are interested in that one, here is the link.
EDIT: EXPLAIN (FORMAT XML) with select * returns:
<explain xmlns="http://www.postgresql.org/2009/explain">
<Query>
<Plan>
<Node-Type>Result</Node-Type>
<Startup-Cost>0.00</Startup-Cost>
<Total-Cost>0.00</Total-Cost>
<Plan-Rows>1000</Plan-Rows>
<Plan-Width>658</Plan-Width>
<Plans>
<Plan>
<Node-Type>Result</Node-Type>
<Parent-Relationship>Outer</Parent-Relationship>
<Alias>my_db_log</Alias>
<Startup-Cost>0.00</Startup-Cost>
<Total-Cost>0.00</Total-Cost>
<Plan-Rows>1000</Plan-Rows>
<Plan-Width>658</Plan-Width>
<Nodes>datanode1</Nodes>
<Coordinator-quals>(date_trunc('day'::text, creation_date) >= to_date('2014-03-05'::text, 'yyyy-mm-dd'::text))</Coordinator-quals>
</Plan>
</Plans>
</Plan>
</Query>
</explain>
"Impossible" phenomenon
The number of rows returned is completely independent of the items in the SELECT clause. (But see Craig's comment about SRFs.) Something must be broken in your db.
Maybe a broken covering index? When you throw in the additional column, you force Postgres to visit the table itself. Try to re-index:
REINDEX TABLE my_db_log;
The manual on REINDEX. Or:
VACUUM FULL ANALYZE my_db_log;
Better query
Either way, use instead:
select id from my_db_log
where creation_date >= '2014-03-05'::date
Or:
select id from my_db_log
where creation_date >= '2014-03-05 00:00'::timestamp
'2014-03-05' is in ISO 8601 format. You can just cast this string literal to date. No need for to_date(), works with any locale. The date is coerced to timestamp [without time zone] automatically when compared to creation_date (being timestamp [without time zone]). More details about timestamps in Postgres here:
Ignoring timezones altogether in Rails and PostgreSQL
Also, you gain nothing by throwing in date_trunc() here. On the contrary, your query will be slower, and any plain index on the column cannot be used (potentially making this much slower).
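For completeness, a sketch of the kind of plain index the rewritten predicate can use (the index name is just an example):
-- usable by WHERE creation_date >= '2014-03-05'::date, but not by a predicate wrapped in date_trunc()
CREATE INDEX my_db_log_creation_date_idx ON my_db_log (creation_date);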