Benefit to adding an Index for an order by column? - postgresql

We have a large table (2.8M rows) in which we look up a single row by our device_token column:
CREATE TABLE public.rpush_notifications (
id bigint NOT NULL,
device_token character varying,
data text,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
...
We are constantly doing the following query:
SELECT * FROM rpush_notifications WHERE device_token = '...' ORDER BY updated_at DESC LIMIT 1
I'd like to add an index on our device_token column, and I'm wondering whether there is any benefit to creating an additional index on updated_at, or a multicolumn index on both device_token and updated_at, given that we order by updated_at, i.e.:
CREATE INDEX foo ON rpush_notifications(device_token, updated_at)
I have been unable to find an answer that would help me understand whether adding updated_at to the index gives any performance benefit for the query above. Any help appreciated. We are running PostgreSQL 11.

There is a performance benefit if you combine both columns just as you did, (device_token, updated_at): the database can find the entries with the specific device_token directly in the index, and because those entries are already ordered by updated_at it does not need to sort them during the query.
An index on (device_token, updated_at DESC) stores the entries in exactly the order the query asks for, so the requested row is the first entry for that device_token. Note that PostgreSQL can also scan a B-tree index backward, so the ascending index should serve this ORDER BY updated_at DESC LIMIT 1 query about as well; in both cases the newest row is found without a sort.
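As a minimal sketch of the index discussed above (the index name is made up for illustration; the table and columns come from the question), the statements could look like this. Running EXPLAIN afterwards is one way to check that the plan uses an index scan and has no separate Sort step:

```sql
-- Hypothetical index name; table and column names are taken from the question.
CREATE INDEX rpush_notifications_token_updated_idx
    ON rpush_notifications (device_token, updated_at DESC);

-- Inspect the plan of the query from the question. The '...' placeholder is
-- kept from the question; substitute a real token value before running.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM rpush_notifications
WHERE device_token = '...'
ORDER BY updated_at DESC
LIMIT 1;
```

On a table that is written to constantly, CREATE INDEX CONCURRENTLY builds the index without blocking inserts and updates, at the cost of a slower build.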

Related

Index for self-join on timestamp range and user_id

I have a table in a postgresql (10.2) database something like this...
create table test (user_id text, event_time timestamp, ...);
I'd like to use this table in a self join, to match records to other records from the same user_id and an event_time within the next 5 minutes. Something like this...
select
*
from
test as a
inner join
test as b
on
a.user_id = b.user_id
and a.event_time < b.event_time
and a.event_time > b.event_time - interval '5 minutes';
This works fine, but I'd ideally like to make it faster. I've gotten the join to use an index on user_id, but I'm wondering if it's possible to make an index to use both user_id AND the timestamp?
I've tried making a gist index on a tsrange from the event time to the event time plus 5 minutes, but Postgres seemed to just use the user_id index in that case. I tried making a multi-column index on the user_id and the tsrange, but that doesn't seem supported.
Finally, I tried making an index on just the timestamp.
None of that seemed to help.
However, the timestamps cover a long time period, and I'm only interested in a 5-minute window, which intuitively feels like something a good index should help with.
Can this be accomplished?
A multi-column index on the user_id text and the event_time timestamp should work. A gist index on the range would need to include the user id as well, and it would be less versatile since it would work only with the fixed interval of 5 minutes. I wouldn't use it unless what you actually want is to establish an exclusion constraint on the table.
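A sketch of that multi-column index, assuming the table is named test as in the query (the index name is hypothetical). The join is the one from the question with the time condition rearranged so that b.event_time is bounded by values computed from a, which is the form an index range scan on b can use. Whether the planner actually chooses that plan depends on the data and statistics:

```sql
-- Hypothetical index name; table "test" and its columns come from the question.
CREATE INDEX test_user_id_event_time_idx ON test (user_id, event_time);

-- Equivalent to the original join condition:
--   a.event_time < b.event_time
--   a.event_time > b.event_time - interval '5 minutes'
SELECT *
FROM test AS a
JOIN test AS b
  ON  b.user_id = a.user_id
  AND b.event_time > a.event_time
  AND b.event_time < a.event_time + interval '5 minutes';
```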

Does the index column order matter on row insert in Postgresql?

I have a table similar to this one:
create table request_journal
(
id bigint,
request_body text,
request_date timestamp,
user_id bigint
);
It is used for request logging purposes, so frequent inserts in it are expected (2k+ rps).
I want to create a composite index on the columns request_date and user_id to speed up select queries like this:
select *
from request_journal
where request_date between '2021-07-08 10:00:00' and '2021-07-08 16:00:00'
and user_id = 123
order by request_date desc;
I tested select queries with a (request_date desc, user_id) btree index and a (user_id, request_date desc) btree index. With request_date as the leading column, select queries run about 10% faster, but in general the performance of either index is acceptable.
So my question is: does the index column order affect insertion time? I have not spotted any difference using EXPLAIN/EXPLAIN ANALYZE on the insert query. Which index will be more efficient to build and maintain under "high load"?
It is hard to believe your tests were done on any vaguely realistic data size.
At the rate you indicate, a 6 hour range would include over 43 million records. If the user_ids are spread evenly over 1e6 different values, I get the index leading with user_id to be a thousand times faster for that query than the one leading with request_date.
But anyway, for loading new data, assuming the new data is all from recent times, then the one with request_date should be faster as the part of the index needing maintenance while loading will be more concentrated in part of the index, and so better cached. But this would depend on how much RAM you have, what your disk system is like, and how many distinct user_ids you are loading data for.
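The arithmetic behind those numbers can be checked directly, assuming the 2,000 inserts per second from the question and the 1e6 evenly spread user_ids from the answer. The two candidate indexes are listed as well, with hypothetical names:

```sql
-- Rows written in a 6 hour window at 2,000 inserts per second.
SELECT 2000 * 60 * 60 * 6 AS rows_in_six_hours;          -- 43200000

-- Rows per user in that window with 1e6 evenly spread user_ids
-- (integer division, so the result is truncated).
SELECT 43200000 / 1000000 AS rows_per_user_in_window;    -- 43

-- Leading with user_id: the query reads only that user's entries in the range.
CREATE INDEX request_journal_user_date_idx
    ON request_journal (user_id, request_date DESC);

-- Leading with request_date: the query walks every entry in the 6 hour range
-- and filters on user_id, but inserts of recent rows touch one end of the index.
CREATE INDEX request_journal_date_user_idx
    ON request_journal (request_date DESC, user_id);
```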

PostgreSQL Trim created_at timestamp column values function pg_catalog.btrim(timestamp without time zone) does not exist

In our PostgreSQL database we have a created_at column that holds a lot of data inserted as timestamps.
We are fetching data using
explain analyze select * from users where created_at > '2018-01-01 11:02:03'::timestamp
and it takes too much time.
Is it possible to trim() these values, like created_at > trim('2018-01-01 11:02:03')?
I also tried update users set created_at = trim(created_at), but it does not work.
How can I trim created_at as a timestamp, not a date? When I try to trim created_at I get the error: function pg_catalog.btrim(timestamp without time zone) does not exist
Please help. Thanks.
Why I want to trim the created_at timestamp column: I applied trim() to some other columns, after which the query used an index scan and the select became much faster.
For created_at I am not able to make it faster. I thought that after trim() it would do an index scan instead of a sequential scan.
trim() is a string function, which is why there is no btrim(timestamp without time zone). If you do a simple:
SELECT to_char(created_at, 'YYYY-MM-DD') FROM users;
you will get the date as a string.
SELECT to_char(created_at, 'YYYY-MM-DD')::date FROM users;
will return a date.
Now, if your problem is a performance problem, you should create an index on your created_at column:
CREATE INDEX ON users (created_at);
explain analyze select * from users where created_at > '2018-01-01';
This should be much faster.

date_trunc on timestamp column returns nothing

I have a strange problem when retrieving records from db after comparing a truncated field with date_trunc().
This query doesn't return any data:
select id from my_db_log
where date_trunc('day',creation_date) >= to_date('2014-03-05'::text,'yyyy-mm-dd');
But if I select the column creation_date along with id, it returns data (i.e. select id, creation_date...).
I have another column, last_update_date, of the same type, and when I use that one I get the same behavior:
select id from my_db_log
where date_trunc('day',last_update_date) >= to_date('2014-03-05'::text,'yyyy-mm-dd');
As with the previous query, it returns records if I select id, last_update_date.
To dig further, I added both creation_date and last_update_date to my where clause, and this time it requires both of them in my select clause to return records (i.e. select id, creation_date, last_update_date).
Has anyone ever encountered the same problem? The same thing works with my other tables that have this type of column!
If it helps, here is my table schema:
id serial NOT NULL,
creation_date timestamp without time zone NOT NULL DEFAULT now(),
last_update_date timestamp without time zone NOT NULL DEFAULT now(),
CONSTRAINT db_log_pkey PRIMARY KEY (id),
I asked a different question earlier that didn't get any answer. This problem may be related to that one. If you are interested in that one, here is the link.
EDIT: EXPLAIN (FORMAT XML) with select * returns:
<explain xmlns="http://www.postgresql.org/2009/explain">
<Query>
<Plan>
<Node-Type>Result</Node-Type>
<Startup-Cost>0.00</Startup-Cost>
<Total-Cost>0.00</Total-Cost>
<Plan-Rows>1000</Plan-Rows>
<Plan-Width>658</Plan-Width>
<Plans>
<Plan>
<Node-Type>Result</Node-Type>
<Parent-Relationship>Outer</Parent-Relationship>
<Alias>my_db_log</Alias>
<Startup-Cost>0.00</Startup-Cost>
<Total-Cost>0.00</Total-Cost>
<Plan-Rows>1000</Plan-Rows>
<Plan-Width>658</Plan-Width>
<Node/s>datanode1</Node/s>
<Coordinator-quals>(date_trunc('day'::text, creation_date) >= to_date('2014-03-05'::text, 'yyyy-mm-dd'::text))</Coordinator-quals>
</Plan>
</Plans>
</Plan>
</Query>
</explain>
"Impossible" phenomenon
The number of rows returned is completely independent of items in the SELECT clause. (But see @Craig's comment about SRFs.) Something must be broken in your db.
Maybe a broken covering index? When you throw in the additional column, you force Postgres to visit the table itself. Try to re-index:
REINDEX TABLE my_db_log;
The manual on REINDEX. Or:
VACUUM FULL ANALYZE my_db_log;
Better query
Either way, use instead:
select id from my_db_log
where creation_date >= '2014-03-05'::date
Or:
select id from my_db_log
where creation_date >= '2014-03-05 00:00'::timestamp
'2014-03-05' is in ISO 8601 format. You can just cast this string literal to date. No need for to_date(), works with any locale. The date is coerced to timestamp [without time zone] automatically when compared to creation_date (being timestamp [without time zone]). More details about timestamps in Postgres here:
Ignoring timezones altogether in Rails and PostgreSQL
Also, you gain nothing by throwing in date_trunc() here. On the contrary, your query will be slower, and a plain index on the column cannot be used (potentially making it much slower).
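To see the difference, one could compare the plans of the two forms. This is only a sketch: the index name is hypothetical, and on a small table the planner may pick a sequential scan for both queries regardless of the index.

```sql
-- Hypothetical index name; a plain B-tree index on the timestamp column.
CREATE INDEX my_db_log_creation_date_idx ON my_db_log (creation_date);

-- Predicate on the bare column: the planner can use the index above
-- as an index condition.
EXPLAIN
SELECT id FROM my_db_log
WHERE creation_date >= '2014-03-05'::date;

-- Predicate wrapped in date_trunc(): the expression no longer matches the
-- indexed column, so the plain index cannot serve as an index condition.
EXPLAIN
SELECT id FROM my_db_log
WHERE date_trunc('day', creation_date) >= '2014-03-05'::date;
```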

Date Table/Dimension Querying and Indexes

I'm creating a robust date table and want to know the best way to link to it. The clustered primary key index will be on the smart date integer key (per Kimball spec), named DateID. Until now I have been running queries against it like so:
select Foo.orderdate -- a bunch of fields from Foo
,DTE.FiscalYearName
,DTE.FiscalPeriod
,DTE.FiscalYearPeriod
,DTE.FiscalYearWeekName
,DTE.FiscalWeekName
FROM SomeTable Foo
INNER JOIN
DateDatabase.dbo.MyDateTable DTE
ON DTE.date = CAST(FLOOR(CAST(Foo.forderdate AS FLOAT)) AS DATETIME)
Keep in mind that Date is a field with a nonclustered index, holding values such as:
2000-01-01 00:00:00.000
It just occurred to me that since I have a clustered integer index (DateID), perhaps I should convert the datetime in my database field to match it and join on that field instead.
What do you folks think?
Also, depending on your first answer, if I am typically pulling those fields from the date table, what kind of index would optimize their retrieval? A covering index?
Even without changing the database structure, you'd get much better performance using a date range join like this:
select Foo.orderdate -- a bunch of fields from Foo
,DTE.FiscalYearName
,DTE.FiscalPeriod
,DTE.FiscalYearPeriod
,DTE.FiscalYearWeekName
,DTE.FiscalWeekName
FROM SomeTable Foo
INNER JOIN
DateDatabase.dbo.MyDateTable DTE
ON Foo.forderdate >= DTE.date AND Foo.forderdate < DATEADD(dd, 1, DTE.date)
However, if you can change it so that your Foo table includes a DateID field then, yes, you'd get the best performance by joining with that instead of any converted date value or date range.
If you change it to join on DateID and DateID is the first column of the clustered index of the MyDateTable then it's already covering (a clustered index always includes all other fields).
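One way to give the Foo table a DateID is a persisted computed column. The sketch below assumes that DateID is a YYYYMMDD integer (for example 20000101), which the question does not state explicitly, and the column name DateID on SomeTable is chosen here only for illustration:

```sql
-- Assumption: MyDateTable.DateID is a YYYYMMDD integer such as 20000101.
-- YEAR(), MONTH() and DAY() are deterministic, so the column can be PERSISTED.
ALTER TABLE SomeTable
    ADD DateID AS (YEAR(forderdate) * 10000
                 + MONTH(forderdate) * 100
                 + DAY(forderdate)) PERSISTED;
GO  -- batch separator (sqlcmd/SSMS), so the new column is visible to the query below

-- Join on the integer key instead of a converted date or a date range.
SELECT Foo.forderdate
      ,DTE.FiscalYearName
      ,DTE.FiscalPeriod
      ,DTE.FiscalYearPeriod
FROM SomeTable Foo
INNER JOIN DateDatabase.dbo.MyDateTable DTE
    ON DTE.DateID = Foo.DateID;
```

An index on SomeTable.DateID may help further when the query also filters on date attributes, though that depends on the rest of the workload.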