How to truncate a PostgreSQL table with conditions

I'm trying to truncate a PostgreSQL table with some conditions:
truncate all the data in the table and keep only the data of the last 6 months.
For that I have written this query:
select distinct datecalcul
from Table
where datecalcul > now() - INTERVAL '6 months'
order by datecalcul asc
How could I add the truncate clause?

TRUNCATE does not support a WHERE condition. You will have to use a DELETE statement.
delete from the_table
where ...
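For the question's table and column, a minimal sketch of that DELETE (keeping only the last 6 months, with the_table standing in for the real table name) could look like this:
-- keep only rows from the last 6 months, delete everything older
delete from the_table
where datecalcul <= now() - INTERVAL '6 months';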
If you want to get rid of old ("expired") rows efficiently based on a timestamp, you can think about partitioning. Then you can just drop the old partitions.
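If you do partition, expiring old data becomes a metadata operation instead of a row-by-row DELETE. A rough sketch, assuming the_table were range-partitioned by datecalcul into monthly partitions (the partition name below is hypothetical):
-- detach the expired month from the partitioned table, then drop it
ALTER TABLE the_table DETACH PARTITION the_table_2020_01;
DROP TABLE the_table_2020_01;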

Related

Can Redshift stored procs be used to make a date range UNION ALL query

Since Redshift does not natively support date partitioning, other than in Redshift Spectrum, all our tables are date partitioned:
my_table_name_YYYY_MM_DD
So every time we run queries, it usually looks like this:
select columns, i, want from
(select * from tbl1_date UNION ALL
select * from tbl2_date UNION ALL
select * from tbl3_date UNION ALL
select * from tbl4_date);
Where there's one UNION ALL per day.
Can stored procedures generate a date range so our business analysts stop losing their hair when I send them a Python or bash script to generate the date range?
Yes, you could create a stored procedure that generates dynamic SQL using only the needed tables. See my answer here for a template to start from: Issue with passing column name as a parameter to "PREPARE" in Redshift
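For illustration only, here is a rough sketch of such a procedure; the procedure name, the union_result temp table, and the loop are assumptions built around the my_table_name_YYYY_MM_DD naming from the question, not a tested implementation:
CREATE OR REPLACE PROCEDURE union_by_date_range(start_date DATE, end_date DATE)
AS $$
DECLARE
    d DATE := start_date;
    sql TEXT := '';
BEGIN
    -- build one "SELECT * FROM my_table_name_YYYY_MM_DD" per day in the range
    WHILE d <= end_date LOOP
        IF sql <> '' THEN
            sql := sql || ' UNION ALL ';
        END IF;
        sql := sql || 'SELECT * FROM my_table_name_' || to_char(d, 'YYYY_MM_DD');
        d := d + 1;
    END LOOP;
    -- materialize the result so analysts can query a single object
    EXECUTE 'DROP TABLE IF EXISTS union_result';
    EXECUTE 'CREATE TEMP TABLE union_result AS ' || sql;
END;
$$ LANGUAGE plpgsql;
-- usage: CALL union_by_date_range('2020-01-01', '2020-01-07');
--        SELECT columns, i, want FROM union_result;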
However, you should be aware that Redshift is able to achieve most of what you want automatically using a "Time Series Table" view. This is documented here:
Using Time Series Tables
Use Time-Series Tables
You define a view that is composed of a UNION ALL over a sequence of identical tables with a sort key defined on a commonly filtered date or timestamp column. When you query that view Redshift is able to eliminate the scans on any UNION'ed tables that would not contain relevant data.
For example:
CREATE OR REPLACE VIEW store_sales_vw
AS SELECT * FROM store_sales_1998
UNION ALL SELECT * FROM store_sales_1999
UNION ALL SELECT * FROM store_sales_2000
UNION ALL SELECT * FROM store_sales_2001
UNION ALL SELECT * FROM store_sales_2002
UNION ALL SELECT * FROM store_sales_2003
;
SELECT cd.cd_education_status
,COUNT(*) sales_count
,AVG(ss_quantity) avg_quantity
FROM store_sales_vw vw
JOIN customer_demographics cd
ON vw.ss_cdemo_sk = cd.cd_demo_sk
WHERE ss_sold_ts BETWEEN '1999-09-01' AND '2000-08-31'
GROUP BY cd.cd_education_status
In this example, Redshift will only use the store_sales_1999 and store_sales_2000 tables, skipping the other tables in the view. Note that the table skipping is not based on the name of the table. Redshift knows the MIN and MAX values of the sort key timestamp in each table.
If you pursue this approach, please be sure to keep the total number of tables in the UNION fairly low. I recommend (at most) daily tables for the last week [7], weekly tables for the last month [5], quarterly tables for the last year [4], and then yearly tables for older data.
You can use ALTER TABLE … APPEND to merge the daily tables into weekly tables and so on.
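A minimal sketch of that merge step (the weekly and daily table names below are made up for illustration):
-- move all rows from a daily table into the weekly table, emptying the daily table
ALTER TABLE store_sales_2020_week_01 APPEND FROM store_sales_2020_01_03;
-- the emptied daily table can then be dropped
DROP TABLE store_sales_2020_01_03;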

Delete from a table on the basis of indexed columns is taking forever

We have a table with three indexed columns, say:
column1 of type bigint
column2 of type timestamp without time zone
column3 of type timestamp without time zone
The table has more than 12 crore records and we are trying to delete all the records that are older than current date - 45 days using the query below:
delete from tableA
where column2 <= '2019-04-15 00:00:00.00'
OR column3 <= '2019-04-15 00:00:00.00';
This executes forever and never completes.
Is there any way we can improve the performance of this query?
I tried dropping the indexes, deleting the data, and recreating the indexes, but that did not work: I am not able to delete the data even after dropping the indexes.
I do not want to change the query; I want Postgres configured through some property so that it is able to delete the records.
See also Best way to delete millions of rows by ID for a good discussion of the issue.
12 crores == 120 million rows?
Deleting from a large indexed table is slow because the index is rebuilt many times during the process. If you can select the rows you want to keep and use them to create a new table, then drop the old one, the process is much faster. If you do this regularly, use table partitioning and detach a partition when required; it can then be dropped.
1) Check the logs; you are probably suffering from deadlocks.
2) Try creating a new table by selecting the data you need, then drop and rename. Use all the columns in your index in the query. DROP TABLE is much faster than DELETE ... FROM.
CREATE TABLE new_table AS (
SELECT * FROM old_table WHERE
column1 >= 1 AND column2 >= current_date - 45 AND column3 >= current_date - 45);
DROP TABLE old_table;
ALTER TABLE new_table RENAME TO old_table;
CREATE INDEX ...
3) Create a new table using partitions based on date, with a table for say 15, 30 or 45 days (if you regularly remove data that is 45 days old). See https://www.postgresql.org/docs/10/ddl-partitioning.html for details.
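A minimal sketch of option 3 with declarative partitioning (PostgreSQL 10+; the table and partition names are made up for illustration, reusing the column names from the question):
-- parent table, partitioned by the timestamp used for expiry
CREATE TABLE tableA_part (
    column1 bigint,
    column2 timestamp without time zone,
    column3 timestamp without time zone
) PARTITION BY RANGE (column2);
-- one partition per 15-day window (create new ones as time advances)
CREATE TABLE tableA_part_2019_04_a PARTITION OF tableA_part
    FOR VALUES FROM ('2019-04-01') TO ('2019-04-16');
CREATE TABLE tableA_part_2019_04_b PARTITION OF tableA_part
    FOR VALUES FROM ('2019-04-16') TO ('2019-05-01');
-- note: in PostgreSQL 10 indexes have to be created on each partition individually
-- expiring old data is then just a DROP of the oldest partition
DROP TABLE tableA_part_2019_04_a;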

Huge PostgreSQL table - Select, update very slow

I am using PostgreSQL 9.5. I have a table which is almost 20 GB. It has a primary key on the ID column, which is an auto-increment column; however, I am running my queries on another column, which is a timestamp. I am trying to select/update/delete on the basis of the timestamp column, but the queries are very slow. For example: a select on this table that filters timestamp_column::date against (current_date - INTERVAL '10 DAY')::date is taking more than 15 minutes or so.
Can you please help with what kind of index I should add to this table (if needed) to make it perform faster?
Thanks
You can create an index on your filter expression:
CREATE INDEX ns_event_last_updated_idx ON ns_event (CAST(last_updated AT TIME ZONE 'UTC' AS DATE));
But keep in mind that you're using timestamp with time zone; casting this type to date can have undesirable side effects.
Also, remove all casting in your SQL:
select * from ns_event where Last_Updated < (current_date - INTERVAL '25 DAY');
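Note that the expression index above is only used when the query filters on the same expression; a minimal sketch, reusing the ns_event names from the answer:
-- this predicate matches the indexed expression, so the index can be used
SELECT *
FROM ns_event
WHERE CAST(last_updated AT TIME ZONE 'UTC' AS DATE) < current_date - 25;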

PostgreSQL: Deleting records to keep the one with the latest timestamps

Let's say I have a table with a column of timestamps and a column of IDs (numeric). For each ID, I'm trying to delete all the rows except the one with the latest timestamp.
Here is the code I have so far:
DELETE FROM table_name t1
WHERE EXISTS (SELECT * FROM table_name t2
WHERE t2."ID" = t1."ID"
AND t2."LOCAL_DATETIME_DTE" > t1."LOCAL_DATETIME_DTE")
This code seems to work, but my question is: why is it a > sign and not a < sign in the timestamp comparison? Is this not selecting for deletion all the rows with a later timestamp than another row? I thought this code would keep only the rows with the earliest timestamps for each ID.
You're using the EXISTS operator to delete records for which a record can be found with a larger (thus >) timestamp. For the newest record there isn't a record with a higher timestamp, so the WHERE clause doesn't evaluate to true and therefore the record is kept.
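If you flipped the comparison to <, the logic would invert and only the earliest row per ID would be kept instead. A sketch using the question's names:
-- keeps the EARLIEST row per ID: a row is deleted whenever an earlier row exists for the same ID
DELETE FROM table_name t1
WHERE EXISTS (SELECT 1 FROM table_name t2
              WHERE t2."ID" = t1."ID"
                AND t2."LOCAL_DATETIME_DTE" < t1."LOCAL_DATETIME_DTE");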
You can use "record" pseudo-type to match tuples:
DELETE FROM table_name
WHERE (ID,LOCAL_DATETIME_DTE) not in
(SELECT ID,max(LOCAL_DATETIME_DTE) FROM table_name group by id);

busy table performance optimization

I have a PostgreSQL table that stores data from a table-like form.
id SERIAL,
item_id INTEGER ,
date BIGINT,
column_id INTEGER,
row_id INTEGER,
value TEXT,
some_flags INTEGER,
The issue is we have 5000+ entries per day and the information needs to be kept for years.
So I end up with a huge table which is busy for the top 1,000-5,000 rows,
with lots of SELECT, UPDATE and DELETE queries, but the old content is rarely used (only in statistics) and is almost never changed.
The question is how I can boost the performance for the daily work (the top 5,000 entries out of 50 million).
There are simple indexes on almost all columns, but nothing fancy.
Splitting the table is not possible for now; I'm looking more for index optimisation.
The advice in the comments from dezso and Jack is good. If you want the simplest option, this is how you implement the partial index:
create table t ("date" bigint, archive boolean default false);
insert into t ("date")
select generate_series(
extract(epoch from current_timestamp - interval '5 year')::bigint,
extract(epoch from current_timestamp)::bigint,
5)
;
create index the_date_partial_index on t ("date")
where not archive
;
To avoid having to change all queries to add the index condition, rename the table:
alter table t rename to t_table;
And create a view with the old name that includes the index condition:
create view t as
select *
from t_table
where not archive
;
explain
select *
from t
;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Index Scan using the_date_partial_index on t_table (cost=0.00..385514.41 rows=86559 width=9)
Then each day you archive older rows:
update t_table
set archive = true
where
"date" < extract(epoch from current_timestamp - interval '1 week')
and
not archive
;
The not archive condition is there to avoid updating millions of already archived rows.
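Statistics queries over the full history can still go straight to the renamed base table, bypassing the view and the partial index; for example:
-- full-history statistics query against the base table
select count(*), min("date"), max("date")
from t_table;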