PostgreSQL using timestamp difference in partial index for upsert

I receive real-time data and store it in a Postgres table, comparing the oid and rcv_time columns of existing rows with the newly received values.
If the oid has already been inserted and its received time is more than two hours before now, a new row should be inserted; otherwise the existing row only needs to be updated, matched by oid.
So I want to create a partial index like the one below, which uses the timestamp difference as a conditional unique constraint:
CREATE UNIQUE INDEX oid_uqidx ON my_table (oid,rcv_time) where EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - rcv_time)) / 3600 < 2;
As a sample, my upsert query would be:
INSERT INTO my_table (oid, rcv_time, name)
VALUES ('730048b','2020-04-24 02:46:00','test')
ON CONFLICT ON CONSTRAINT oid_uqidx
DO UPDATE SET (rcv_time,name) = (EXCLUDED.rcv_time,EXCLUDED.name);
But when I try to create the index, the following error occurs:
ERROR: functions in index predicate must be marked IMMUTABLE
I also tried to work around this without a partial index, by creating a unique constraint on oid and putting the WHERE clause in the upsert query instead:
INSERT INTO my_table (oid, rcv_time, name)
VALUES ('730048b','2020-04-24 02:46:00','test')
ON CONFLICT(oid) where EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - rcv_time)) / 3600 < 2
DO UPDATE SET (rcv_time,name) = (EXCLUDED.rcv_time,EXCLUDED.name);
But this doesn't let me have multiple rows with the same oid; it always performs the update.
How can I approach the problem?
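The index creation fails because CURRENT_TIMESTAMP is not IMMUTABLE: its value changes from one statement to the next, so it cannot appear in an index predicate. One possible way to get the described behaviour without a partial index is to try the UPDATE first and INSERT only when no recent row was updated. This is a minimal sketch, assuming "recent" means an rcv_time within the last two hours of now() and that there is no unique constraint on oid. It is not safe against concurrent writers for the same oid unless extra locking is added.

```sql
-- Try to update a row for this oid that was received within the last two hours.
-- If no such row exists, the CTE returns nothing and the INSERT adds a new row.
WITH upd AS (
    UPDATE my_table
    SET rcv_time = '2020-04-24 02:46:00',
        name     = 'test'
    WHERE oid = '730048b'
      AND rcv_time > now() - interval '2 hours'
    RETURNING oid
)
INSERT INTO my_table (oid, rcv_time, name)
SELECT '730048b', '2020-04-24 02:46:00', 'test'
WHERE NOT EXISTS (SELECT 1 FROM upd);
```

If several recent rows exist for the same oid, this sketch updates all of them, so it relies on the process itself never creating two rows for one oid inside the two-hour window.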

Related

Benefit to adding an Index for an order by column?

We have a large table (2.8M rows) where we are finding a single row by our device_token column
CREATE TABLE public.rpush_notifications (
id bigint NOT NULL,
device_token character varying,
data text,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
...
We are constantly doing the following query:
SELECT * FROM rpush_notifications WHERE device_token = '...' ORDER BY updated_at DESC LIMIT 1
I'd like to add an index for our device_token column, and I'm wondering if there is any benefit to creating an additional index for updated_at, or to creating a multicolumn index on both device_token and updated_at, given that we are ordering by updated_at, i.e.:
CREATE INDEX foo ON rpush_notifications(device_token, updated_at)
I have been unable to find an answer that would help me understand whether there would be any performance benefit to adding updated_at to the index, given the query we are running above. Any help appreciated. We are running PostgreSQL 11.
There is a performance benefit if you combine both columns in one index, just as you did with (device_token, updated_at): the database can find the entries for the specific device_token directly and does not need to sort them during the query.
An index on (device_token, updated_at DESC) stores each device_token's entries newest first, so the requested row is the first one for that device_token. Note that PostgreSQL can also scan a B-tree index backward, so the difference from the ascending index should be small for this query.
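To check whether the planner actually uses such an index, one could create it and look at the plan. The following is a sketch only: the index name is hypothetical, and the '...' placeholder is kept from the question and must be replaced with a real token.

```sql
-- Hypothetical index name; the column list follows the suggestion above.
CREATE INDEX rpush_notifications_token_updated_idx
    ON rpush_notifications (device_token, updated_at DESC);

-- Inspect the plan of the query from the question.
EXPLAIN
SELECT *
FROM rpush_notifications
WHERE device_token = '...'
ORDER BY updated_at DESC
LIMIT 1;
```

If the index is used as intended, the plan should show an index scan on it under the Limit node, with no separate Sort step.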

Would it be possible to select random rows with a little preference for a specific column?

I would like to get a random selection of records from my table, but I wonder if it would be possible to give a better chance to items that are newly created. I also have pagination, which is why I'm using setseed.
Currently I'm only retrieving items randomly and it works quite well, but I need to give a certain "preference" to newly created items.
Here is what I'm doing for now:
SELECT SETSEED(0.16111981), RANDOM();
I don't know what to do, and I can't figure out a good solution that isn't an absolute performance disaster.
First, let me explain how to select random records from a table. In PostgreSQL, we can use the random() function in the ORDER BY clause. Example:
select * from test_table
order by random()
limit 1;
I am using limit 1 to select only one record. However, this method performs very badly on large tables (over 100 million rows), because every row must be given a random value and then sorted.
The second way is to pick rows by a randomly chosen id, provided the table has an id column. This performs very well.
Let's first write our own random function so that it is easy to use in our queries.
CREATE OR REPLACE FUNCTION random_between(low integer, high integer)
RETURNS integer
LANGUAGE plpgsql
STRICT
AS $function$
BEGIN
RETURN floor(random()* (high-low + 1) + low);
END;
$function$;
This function returns a random integer value in the range of our input argument values. Then we can write a query using our random function. Example:
select * from test_table
where id = (select random_between(min(id), max(id)) from test_table);
I tested this query on a table with 150 million rows and it gave the best performance, with a duration of 12 ms. Note that it can return no row if the chosen id does not exist (for example, after deletes). If you need many rows rather than one, you can write where id > instead of where id =.
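As an illustration of the multi-row variant, the sketch below picks a random starting id and returns the next rows in id order. The row count of 10 is an arbitrary assumption, and the rows returned are consecutive by id rather than independently random.

```sql
-- Pick a random starting point, then take the following 10 rows by id.
select *
from test_table
where id >= (select random_between(min(id), max(id)) from test_table)
order by id
limit 10;
```

Using >= with order by id also sidesteps gaps in the id sequence, since the first existing id at or after the random value is used. Fewer than 10 rows come back when the starting point lands near the end of the table.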
Now, for your preference for newer items: I don't know your detailed business logic or the conditions you want to apply to the randomization, so I can only write some sample queries to show the mechanism. PostgreSQL has no built-in function for randomizing data with preferences, so we must write this logic manually. I created a sample table for testing our queries.
CREATE TABLE test_table (
id serial4 NOT NULL,
is_created bool NULL,
action_date date NULL,
CONSTRAINT test_table_pkey PRIMARY KEY (id)
);
CREATE INDEX test_table_id_idx ON test_table USING btree (id);
For example, suppose I want to give more preference to rows whose action date is close to today. Sample query:
select
    id,
    is_created,
    action_date,
    extract(day from (now() - action_date)) as dif_days
from
    test_table
where
    id > (select random_between(min(id), max(id)) from test_table)
    and extract(day from (now() - action_date)) = random_between(0, 6)
limit 1;
In this query, the expression (extract(day from (now()-action_date))) as dif_days returns the number of days between action_date and today. The WHERE clause first keeps rows whose id is greater than a randomly chosen value. Then the condition (extract(day from (now()-action_date))) = random_between(0, 6) keeps only rows whose action_date is at most 6 days old (it may be 2 or 4 days ago, but no more than 6).
Many other rules can be written the same way (for example, giving more preference based on boolean fields such as closed or opened).
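Another way to express a soft preference, shown here only as a sketch and not as part of the approach above, is to scale the random sort key by the age of the row so that recent rows tend to sort first. It assumes the sample table test_table from above and a page size of 10. Because it sorts the whole table, it does not have the performance of the id-based lookup.

```sql
-- Recent rows get a small multiplier, so their sort keys tend to be smaller
-- and they appear earlier, while older rows can still be selected.
-- greatest(..., 0) guards against action dates in the future.
select id, is_created, action_date
from test_table
order by random() * (1 + greatest(current_date - action_date, 0))
limit 10;
```

Rows with a NULL action_date get a NULL sort key and therefore sort last under the default ascending order.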

PostgreSQL: Deleting records to keep the one with the latest timestamps

Let's say I have a table with a column of timestamps and a column of IDs (numeric). For each ID, I'm trying to delete all the rows except the one with the latest timestamp.
Here is the code I have so far:
DELETE FROM table_name t1
WHERE EXISTS (SELECT * FROM table_name t2
WHERE t2."ID" = t1."ID"
AND t2."LOCAL_DATETIME_DTE" > t1."LOCAL_DATETIME_DTE")
This code seems to work, but my question is: why is it a > sign and not a < sign in the timestamp comparison? Is this not selecting for deletion all the rows with a later timestamp than another row? I thought this code would keep only the rows with the earliest timestamps for each ID.
You're using the EXISTS operator to delete records for which a record can be found with a larger, thus >, timestamp. For the newest, there isn't a record with a higher timestamp, so the WHERE clause doesn't resolve to true and therefore the record is kept.
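One way to convince yourself of this before deleting anything is to run the same predicate as a SELECT. This is a sketch that reuses the table and column names from the question. It lists exactly the rows the DELETE would remove, which should be every row except the newest one per ID.

```sql
-- Preview of the rows the DELETE would remove: each returned row has
-- at least one other row with the same "ID" and a later timestamp.
SELECT t1.*
FROM table_name t1
WHERE EXISTS (SELECT 1
              FROM table_name t2
              WHERE t2."ID" = t1."ID"
                AND t2."LOCAL_DATETIME_DTE" > t1."LOCAL_DATETIME_DTE");
```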
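You can also compare row values (tuples) directly. The column names are quoted here because the question uses quoted, case-sensitive identifiers:
DELETE FROM table_name
WHERE ("ID", "LOCAL_DATETIME_DTE") NOT IN
(SELECT "ID", max("LOCAL_DATETIME_DTE") FROM table_name GROUP BY "ID");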

Primary key duplicate in a table-valued parameter in stored procedure

I am using the following code to insert data from a table-valued parameter (TVP) in my stored procedure. It works when the TVP contains one record, but when it contains more than one record it raises the following error:
Violation of Primary key constraint 'PK_ReceivedCash'. Cannot insert duplicate key in object 'Banking.ReceivedCash'. The statement has been terminated.
insert into banking.receivedcash(ReceivedCashID,Date,Time)
select (select isnull(Max(ReceivedCashID),0)+1 from Banking.ReceivedCash),t.Date,t.Time from #TVPCash as t
Your query is indeed flawed if there is more than one row in #TVPCash. The subquery that retrieves the maximum ReceivedCashID is evaluated against the table as it was before the insert, so it yields the same value for every row in #TVPCash, and each of those rows is inserted into Banking.ReceivedCash with the same key.
I strongly suggest finding alternatives rather than doing it this way. Multiple users might run this query and retrieve the same maximum. If you insist on keeping the query as it is, try running the following:
insert into banking.receivedcash(
ReceivedCashID,
Date,
Time
)
select
(select isnull(Max(ReceivedCashID),0) from Banking.ReceivedCash)+
ROW_NUMBER() OVER(ORDER BY t.Date,t.Time),
t.Date,
t.Time
from
#TVPCash as t
This uses ROW_NUMBER to count the row number in #TVPCash and adds this to the maximum ReceivedCashID of Banking.ReceivedCash.
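As one possible alternative to computing MAX()+1, a sequence lets the server hand out keys without two sessions reading the same maximum. This is a sketch assuming SQL Server 2012 or later. The sequence name Banking.ReceivedCashSeq is hypothetical, and START WITH must be set above the current maximum ReceivedCashID in your table.

```sql
-- Hypothetical sequence; adjust START WITH to exceed the existing maximum key.
CREATE SEQUENCE Banking.ReceivedCashSeq
    START WITH 1
    INCREMENT BY 1;

-- Each inserted row draws its own value from the sequence.
INSERT INTO Banking.ReceivedCash (ReceivedCashID, Date, Time)
SELECT NEXT VALUE FOR Banking.ReceivedCashSeq,
       t.Date,
       t.Time
FROM #TVPCash AS t;
```

Sequence values can have gaps (for example, after a rolled-back insert), which is usually acceptable for a surrogate key.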

busy table performance optimization

I have a postgresql table storing data from a table-like form.
id SERIAL,
item_id INTEGER ,
date BIGINT,
column_id INTEGER,
row_id INTEGER,
value TEXT,
some_flags INTEGER,
The issue is that we have 5000+ entries per day and the information needs to be kept for years.
So I end up with a huge table which is busy only for the top 1000-5000 rows,
with lots of SELECT, UPDATE, and DELETE queries, while the old content is rarely used (only in statistics) and is almost never changed.
The question is how I can boost the performance for the daily work (the top 5000 entries out of 50 million).
There are simple indexes on almost all columns, but nothing fancy.
Splitting the table is not possible for now; I'm looking more for index optimisation.
The advice in the comments from dezso and Jack is good. If you want the simplest option, this is how you implement the partial index:
create table t ("date" bigint, archive boolean default false);
insert into t ("date")
select generate_series(
extract(epoch from current_timestamp - interval '5 year')::bigint,
extract(epoch from current_timestamp)::bigint,
5)
;
create index the_date_partial_index on t ("date")
where not archive
;
To avoid having to add the index condition to every query, rename the table:
alter table t rename to t_table;
And create a view with the old name that includes the index condition:
create view t as
select *
from t_table
where not archive
;
explain
select *
from t
;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Index Scan using the_date_partial_index on t_table (cost=0.00..385514.41 rows=86559 width=9)
Then each day you archive older rows:
update t_table
set archive = true
where
"date" < extract(epoch from current_timestamp - interval '1 week')
and
not archive
;
The not archive condition is there to avoid updating millions of already archived rows.
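Day-to-day queries can then keep using the old table name. As a sketch, assuming the view t and the partial index defined above, a query for the last day's rows might look like this. The one-day window is an arbitrary example.

```sql
-- The view adds "where not archive", which matches the partial index
-- predicate, so the planner can consider the_date_partial_index here.
explain
select *
from t
where "date" >= extract(epoch from current_timestamp - interval '1 day')::bigint
;
```

Whether the planner actually chooses the index depends on table statistics, so it is worth checking the plan on real data.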