Postgres operations are slow - postgresql

My Postgres queries on the records table are slow.
A simple query like this one can take 15 seconds!
The result: 32k rows (out of 1.5 million).
SELECT COUNT(*)
FROM project.records
WHERE created_at > NOW() - INTERVAL '1 day'
I have an index on created_at (which is a timestamp).
What can I do about this? Is my table too big?

As suggested by @Andomar, I moved the large columns out into another table.
I made sure to do a VACUUM ANALYZE to really clean the table.
Now the query takes 400 ms.
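For reference, the cleanup step described above is a single command (the table name is taken from the original query). A fresh VACUUM also updates the visibility map, which is what allows the COUNT(*) above to be answered with an index-only scan on the created_at index:
-- refresh statistics and the visibility map on the table from the question
VACUUM ANALYZE project.records;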

Related

Move rows older than x days to an archive table or partition table in Postgres 11

I would like to speed up the queries on my big table that contains lots of old data.
I have a table named post that has a date column, created_at. The table has ~31 million rows, and ~30 million of them are older than 30 days.
Actually, I want this:
move data older than 30 days into the post_archive table or create a partition table.
when the value in column created_at becomes older than 30 days then that row should be moved to the post_archive table or partition table.
Any detailed and concrete solution for PostgreSQL 11.15?
My ideas:
Solution 1. create a cron script in whatever language (e.g. JavaScript) and run it every day to copy data from the post table into post_archive and then delete data from the post table
Solution 2. create a Postgres function that should copy the data from the post table into the partition table, and create a cron job that will call the function every day
Thanks
This splits your data into a post and a post_archive table. It's a common approach, and I've done it (with SQL Server).
Before you do anything else, make sure you have an index on your created_at column on your post table. Important.
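A minimal sketch of that index, in case it does not exist yet (the index name here is just an example):
CREATE INDEX IF NOT EXISTS post_created_at_idx ON post (created_at);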
Next, you need a common expression that means "thirty days ago". This is it:
(CURRENT_DATE - INTERVAL '30 DAY')::DATE
Next, back everything up. You knew that.
Then, here's your process to set up your two tables.
CREATE TABLE post_archive AS TABLE post; to populate your archive table.
Do these two steps to repopulate your post table with the most recent thirty days. It will take forever to DELETE all those rows, so we'll truncate the table and repopulate it. That's also good because it's like starting from scratch with a much smaller table, which is what you want. This takes a modest amount of downtime.
TRUNCATE TABLE post;
INSERT INTO post SELECT * FROM post_archive
WHERE created_at > (CURRENT_DATE - INTERVAL '30 DAY')::DATE;
DELETE FROM post_archive WHERE created_at > (CURRENT_DATE - INTERVAL '30 DAY')::DATE; to remove the most recent thirty days from your archive table.
Now, you have the two tables.
Your next step is the daily row-migration job. PostgreSQL lacks a built-in job scheduler like SQL Server's jobs or MySQL's EVENT, so your best bet is a cron job.
It's probably wise to do the migration daily if that fits with your business rules. Why? Many-row DELETEs and INSERTs cause big transactions, and that can make your RDBMS server thrash. Smaller numbers of rows are better.
The SQL you need is something like this:
INSERT INTO post_archive SELECT * FROM post
WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE;
DELETE FROM post
WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE;
You can package this up as a shell script. On UNIX-derived systems like Linux and FreeBSD the shell script file might look like this.
#!/bin/sh
psql postgres://username:password@hostname:5432/database << SQLSTATEMENTS
INSERT INTO post_archive SELECT * FROM post
WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE;
DELETE FROM post
WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE;
SQLSTATEMENTS
Then run the shell script from cron a few minutes after 3am each day.
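For example, a crontab entry along these lines (the script path and log file are assumptions):
# run the daily row migration at 03:22 (see the note about 3am below)
22 3 * * * /usr/local/bin/migrate_posts.sh >> /var/log/migrate_posts.log 2>&1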
Some notes:
3am? Why? In many places the daylight-saving-time switchover messes up the time between 02:00 and 03:00 twice a year. Choosing, say, 03:22 as the time to run the daily migration keeps you well away from that problem.
CURRENT_DATE gets you midnight of today. So, if you run the script more than once in any calendar day, no harm is done.
If you miss a day, the next day's migration will catch up.
You could package up the SQL as a stored procedure and put it into your RDBMS, then invoke it from your shell script. But then your migration procedure lives in two different places. You need the cronjob and shell script in any case in PostgreSQL.
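If you do go the stored-procedure route, a sketch could look like this (the procedure name is made up; PostgreSQL 11 supports CREATE PROCEDURE):
CREATE OR REPLACE PROCEDURE migrate_old_posts()
LANGUAGE sql
AS $$
INSERT INTO post_archive SELECT * FROM post
WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE;
DELETE FROM post
WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE;
$$;
The shell script then only needs to run CALL migrate_old_posts();.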
Will your application go off the rails if it sees identical rows in both post and post_archive while the migration is in progress? If so, you'll need to wrap your SQL statements in a transaction. That way other users of the database won't see the duplicate rows. Do this.
#!/bin/sh
psql postgres://username:password@hostname:5432/database << SQLSTATEMENTS
START TRANSACTION;
INSERT INTO post_archive SELECT * FROM post
WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE;
DELETE FROM post
WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE;
COMMIT;
SQLSTATEMENTS
Cronjobs are quite reliable on Linux and FreeBSD.

How to get the count of hourly inserts on a table in PostgreSQL

I need a setup where rows older than 60 days get removed from the table in PostgreSQL.
I have created a function and a trigger:
BEGIN
DELETE FROM table
WHERE updateDate < NOW() - INTERVAL '60 days';
RETURN NULL;
END;
$$;
But I believe if the insert frequency is high, this will have to scan the entire table quite often, which will cause high DB load.
I could run this function through a cron job or Lambda function every hour/day. I need to know the number of inserts per hour on that table to make that decision.
Is there a query or job that I can set up which will collect those details?
Just to count the number of records per hour, you could run this query:
SELECT CAST(updateDate AS date) AS day
, EXTRACT(HOUR FROM updateDate) AS hour
, COUNT(*)
FROM _your_table
WHERE updateDate BETWEEN ? AND ?
GROUP BY
1,2
ORDER BY
1,2;
We do about 40 million INSERTs a day on a single table that is partitioned by month. After 3 months we just drop the partition, which is way faster than a DELETE.
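A minimal sketch of that pattern, with made-up table and partition names (declarative range partitioning, available since PostgreSQL 10):
CREATE TABLE measurements (
id bigint NOT NULL,
updateDate timestamptz NOT NULL
) PARTITION BY RANGE (updateDate);
CREATE TABLE measurements_y2024m01 PARTITION OF measurements
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
-- three months later, retiring a whole month is one cheap statement
DROP TABLE measurements_y2024m01;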

Postgres query planner filter order affected by using now on sequential scan

I have a query where, with sequential scans enabled on the Postgres database and now() used in the WHERE clause, the query planner prefers a sequential scan of the table followed by a filter:
EXPLAIN ANALYZE
SELECT
action_id
FROM
events
WHERE
started_at IS NULL
AND deleted_at IS NULL
AND due_at < now()
AND due_at > now() - interval '14 days'
LIMIT 1
FOR UPDATE
SKIP LOCKED;
Example:
https://explain.depesz.com/s/xLlM
Query with enable_seqscan db parameter set to false:
https://explain.depesz.com/s/e8Fe
I am looking to help the query optimiser use the index.
I suspect that, because the rows matching the started_at IS NULL AND deleted_at IS NULL filters make up roughly 13% of the total table rows (and the due_at column is completely unique and uncorrelated), the query optimiser is pessimistic about finding a match quickly enough via the index, but in fact that's not the case.
EDIT: For the time being I have restructured the query like so:
SELECT
id,
previous_event_id,
due_at,
action_id,
subscription_url
FROM (
SELECT
id,
previous_event_id,
due_at,
action_id,
subscription_url
FROM events
WHERE
started_at IS NULL
AND deleted_at IS NULL
LIMIT 100
FOR UPDATE SKIP LOCKED
) events_to_pick_from
WHERE EXISTS (
SELECT 1
FROM events
WHERE
events_to_pick_from.due_at < now()
AND events_to_pick_from.due_at > now() - interval '14 days'
AND events.action_id = events_to_pick_from.action_id
)
LIMIT 1
FOR UPDATE SKIP LOCKED;
https://explain.depesz.com/s/fz2h
But I would be grateful for other suggestions.
Both queries have the same execution plan.
The difference is that the query with the constants happens to find a row that matches the condition quickly, after reading only 27 rows from the table.
The query using now() does not find a single matching row in the table (actual rows=0), but it has to scan all 7 million rows before it knows for sure.
An index on due_at should improve the performance considerably.
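For example (the index name is made up; the partial-index predicate is an optional extra that matches the fixed filters in the query, a plain index on due_at alone would also help):
CREATE INDEX events_due_at_pending_idx
ON events (due_at)
WHERE started_at IS NULL AND deleted_at IS NULL;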

Postgres 10: "spoofing" now() in an immutable function. A safe idea?

My app reports rainfall and streamflow information to whitewater boaters. Postgres is my data store for the gauge readings that come in at 15 minute intervals. Over time, these tables get pretty big, and the availability of range partitioning in Postgres 10 inspired me to leave my shared hosting service and build a server from scratch at Linode. My queries on these large tables became way faster after I partitioned the readings into 2 week chunks. Several months down the road, I checked out the query plan and was very surprised to see that using now() in a query caused PG to scan all of the indexes on my partitioned tables. What the heck?!?! Isn't the point of partitioning data to avoid situations like this?
Here's my setup. My partitioned table:
CREATE TABLE public.precip
(
gauge_id smallint,
inches numeric(8, 2),
reading_time timestamp with time zone
) PARTITION BY RANGE (reading_time)
I've created partitions for every two weeks, so I have about 50 partition tables so far. One of my partitions:
CREATE TABLE public.precip_y2017w48 PARTITION OF public.precip
FOR VALUES FROM ('2017-12-03 00:00:00-05') TO ('2017-12-17 00:00:00-05');
Each partition is then indexed on gauge_id and reading_time.
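Something along these lines per partition (the index name is just an example):
CREATE INDEX precip_y2017w48_gauge_time_idx
ON public.precip_y2017w48 (gauge_id, reading_time);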
I have a lot of queries like
WHERE gauge_id = xxx
AND precip.reading_time > (now() - '01:00:00'::interval)
AND precip.reading_time < now()
As I mentioned, postgres scans all of the indexes on reading_time for every 'child' table rather than only querying the child table(s) that hold timestamps in the query range. If I enter literal values (e.g., precip.reading_time > '2018-03-01 01:23:00') instead of now(), it only scans the indexes of the appropriate child table(s). I've done some reading and I understand that now() is volatile and that the planner won't know what the value will be when the query executes. I've also read that query planning is expensive, so postgres caches plans. I can understand why PG is programmed to do that. However, one counter argument I read was that a re-planned query is probably way less expensive than a query that ends up ignoring partitions. I agree - and that's probably the case in my situation.
As a workaround, I've created this function:
CREATE OR REPLACE FUNCTION public.hours_ago2(i integer)
RETURNS timestamp with time zone
LANGUAGE 'plpgsql'
COST 100
IMMUTABLE
AS $BODY$
DECLARE X timestamp with time zone;
BEGIN
X:= now() + cast(i || ' hours' as interval);
RETURN X;
END;
$BODY$;
Note the IMMUTABLE declaration. Now, when I issue queries like
select * from stream
where gauge_id = 2142 and reading_time > hours_ago2(-3)
and reading_time < hours_ago2(0)
PG only searches the partition table that stores data for that time frame. This is the goal I was shooting for when I set up the partitions in the first place. Booyah. But is this safe? Will the query planner ever cache the results of hours_ago2(-3) and use it over and over again for hours down the road? It's ok if it's cached for a few minutes. Again, my app reports rain and streamflow information; it doesn't deal with financial transactions or any other 'critical' types of data processing. I've tested simple statements like select hours_ago2(-3) and it returns new values every time. So it seems safe. But is it really?
That is not safe because at planning time you have no idea if the statement will be executed in the same transaction or not.
If you are in a situation where query plans are cached, this will return wrong results. Query plans are cached for named prepared statements and statements in PL/pgSQL functions, so you could end up with an out-of-date value for the duration of the database session.
For example:
CREATE TABLE times(id integer PRIMARY KEY, d timestamptz NOT NULL);
PREPARE x AS SELECT * FROM times WHERE d > hours_ago2(1);
The function is evaluated at planning time, and the result is a constant in the execution plan (for immutable functions that is fine).
EXPLAIN (COSTS off) EXECUTE x;
QUERY PLAN
---------------------------------------------------------------------------
Seq Scan on times
Filter: (d > '2018-03-12 14:25:17.380057+01'::timestamp with time zone)
(2 rows)
SELECT pg_sleep(100);
EXPLAIN (COSTS off) EXECUTE x;
QUERY PLAN
---------------------------------------------------------------------------
Seq Scan on times
Filter: (d > '2018-03-12 14:25:17.380057+01'::timestamp with time zone)
(2 rows)
The second query definitely does not return the result you want.
I think you should evaluate now() (or better, an equivalent function on the client side) first, perform your date arithmetic and supply the result as a parameter to the query. Inside of PL/pgSQL functions, use dynamic SQL.
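A minimal sketch of that approach as a shell script, in the style of the cron script earlier on this page (the GNU date syntax and the connection string are assumptions):
#!/bin/sh
# compute the time window on the client, then hand the values to psql as variables;
# psql interpolates them as literals, so the planner can prune partitions
FROM_TS=$(date -u -d '3 hours ago' '+%Y-%m-%d %H:%M:%S+00')
TO_TS=$(date -u '+%Y-%m-%d %H:%M:%S+00')
psql postgres://username:password@hostname:5432/database \
-v from_ts="'$FROM_TS'" -v to_ts="'$TO_TS'" << SQLSTATEMENTS
SELECT * FROM stream
WHERE gauge_id = 2142
AND reading_time > :from_ts
AND reading_time < :to_ts;
SQLSTATEMENTS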
Change the queries to use 'now'::timestamptz instead of now(). Also, interval math on timestamptz is not immutable.
Change your query to something like:
WHERE gauge_id = xxx
AND precip.reading_time > ((('now'::timestamptz AT TIME ZONE 'UTC') - '01:00:00'::interval) AT TIME ZONE 'UTC')
AND precip.reading_time < 'now'::timestamptz

Huge PostgreSQL table - Select, update very slow

I am using PostgreSQL 9.5. I have a table which is almost 20 GB. It has a primary key on the ID column, which is an auto-increment column, but I am running my queries on another column which is a timestamp. I am trying to select/update/delete on the basis of that timestamp column, but the queries are very slow. For example, a SELECT on this table that compares timestamp_column::date with (current_date - INTERVAL '10 DAY')::date takes more than 15 minutes or so.
Can you please suggest what kind of index I should add to this table (if needed) to make it perform faster?
Thanks
You can create an index on the expression from your clause:
CREATE INDEX ns_event_last_updated_idx ON ns_event (CAST(last_updated AT TIME ZONE 'UTC' AS DATE));
But keep in mind that you're using timestamp with time zone; casting this type to date can give you undesirable side effects.
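For instance, the same instant can land on different calendar dates depending on the session time zone, which is why the index above pins the conversion with AT TIME ZONE 'UTC':
SET TIME ZONE 'UTC';
SELECT '2024-01-02 03:30:00+00'::timestamptz::date; -- 2024-01-02
SET TIME ZONE 'America/New_York';
SELECT '2024-01-02 03:30:00+00'::timestamptz::date; -- 2024-01-01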
Also, remove the casting in your SQL:
select * from ns_event where Last_Updated < (current_date - INTERVAL '25 DAY');
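If you drop the cast like that, an ordinary index on the raw column is all the planner needs (the index name here is just an example):
CREATE INDEX ns_event_last_updated_plain_idx ON ns_event (last_updated);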