Delete in Postgres citus columnar storage alternatives - postgresql

I am planning to use Citus to store system logs for up to n days, after which they should be deleted. The Citus columnar store looked like the perfect database for this until I read
this, where it is mentioned that no deletes can be performed on columnar tables.
So my question is: is there an alternate way of achieving deletes in the columnar store?

You can temporarily switch the table's access method to row mode to delete or update the table, and after the operation switch back to the columnar access method. An example is shown below:
-- create the table
CREATE TABLE logs (
id int not null,
log_date timestamp
);
-- set access method columnar
SELECT alter_table_set_access_method('logs', 'columnar');
-- fill the table with generated data which goes until 20 days before
INSERT INTO logs select i, now() - interval '1 hour' * i from generate_series(1,480) i;
-- to delete data older than 10 days, temporarily switch to the row access method to execute deletes or updates
SELECT alter_table_set_access_method('logs', 'heap');
DELETE FROM logs WHERE log_date < (now() - interval '10 days');
-- switch back to columnar access method
SELECT alter_table_set_access_method('logs', 'columnar');
A better alternative for log archiving: switching the access method creates a whole copy of the source table, so the bigger the table, the more resources are consumed. A better option, if you can divide your log table into partitions by day or month, is that you only need to change the access method for a single partition. Note that you have to set the access method for each partition separately; columnar currently does not support setting the access method of a partitioned table directly.
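A rough sketch of the partitioned variant (the partition names and date ranges below are just illustrative, not taken from the question):
-- parent table partitioned by month
CREATE TABLE logs_part (
id int NOT NULL,
log_date timestamp
) PARTITION BY RANGE (log_date);
CREATE TABLE logs_part_2021_11 PARTITION OF logs_part
FOR VALUES FROM ('2021-11-01') TO ('2021-12-01');
CREATE TABLE logs_part_2021_12 PARTITION OF logs_part
FOR VALUES FROM ('2021-12-01') TO ('2022-01-01');
-- set the access method per partition (not on the parent)
SELECT alter_table_set_access_method('logs_part_2021_11', 'columnar');
SELECT alter_table_set_access_method('logs_part_2021_12', 'columnar');
-- expiring old data is then just dropping the oldest partition
DROP TABLE logs_part_2021_11;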
Learn more:
Citus docs
Columnar demo
Archiving logs with columnar

Related

PostgreSQL query - show which rows are locked

I would like to query data from a table, and if a row is locked, show it as a different color. Is this possible using postgresql's locking for update?
e.g.
select
*,
(select from pg_x -- link row somehow )
from table
thank you
There is no good way to do that. The row locks are stored in the row (system column xmax), but this attribute serves other purposes too, and the flags that determine if it is indeed a row lock or perhaps a rolled back update are not exposed via SQL.
There are only unpleasant alternatives:
Use the pageinspect contrib module to examine those flags (a sketch follows this list). That would be a second scan of the table, and such a query doesn't respect MVCC visibility.
Run a second query:
SELECT * FROM atable
FOR UPDATE SKIP LOCKED;
That would lock all rows in the table and be very bad for concurrency.
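For completeness, a rough pageinspect sketch (assuming a small table atable whose first block we inspect; the mask 128 = 0x0080 is HEAP_XMAX_LOCK_ONLY from htup_details.h and is not a stable SQL-level API):
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT lp,
t_ctid,
t_xmax,
(t_infomask & 128) <> 0 AS xmax_is_lock_only  -- set when xmax is only a row lock
FROM heap_page_items(get_raw_page('atable', 0));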
Besides, that information would be pretty useless for the user. In a well-written application, row locks are only held for split seconds, so the information would be outdated by the time it reaches the user.

Postgres import data optimization

I have a Postgres question:
I have a database with a table that contains weekly data (3 years of history). An application displays the data. I have changed the default Postgres configuration to improve the data import.
Every day the table is refreshed only with the current week, so the script deletes the current week and re-imports it into the table.
It takes 40 minutes, but I think I could improve this.
If I truncate the entire table and import all the data, it takes 3 hours (7 GB).
Is there a better way than the delete/insert?
I could create another table with only the current week's data and use a UNION in the application:
select * from tb_data union all select * from tb_data_week
I think it would be faster because a truncate/insert on the weekly table is cheaper than a delete/insert on the big table.
But the UNION ALL might make the application slower.
Thanks a lot
You could partition the table by week.
To import the weekly data, insert into a new empty table. Once the import is finished, drop the old week partition and attach the new partition using alter table base_table attach partition ...
The manual has an example for this process: Partition Maintenance
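A minimal sketch of that workflow with declarative partitioning (the column names, partition names, and week boundaries are placeholders, not taken from the question):
-- parent table partitioned by week
CREATE TABLE tb_data (
week_start date NOT NULL,
payload text
) PARTITION BY RANGE (week_start);
-- one partition per week
CREATE TABLE tb_data_w05 PARTITION OF tb_data
FOR VALUES FROM ('2024-01-29') TO ('2024-02-05');
-- daily refresh: load the current week into a fresh, unattached table ...
CREATE TABLE tb_data_w06_new (LIKE tb_data);
-- ... COPY / INSERT the current week's rows into tb_data_w06_new here ...
-- ... then swap it in: drop the stale partition and attach the new one
DROP TABLE IF EXISTS tb_data_w06;
ALTER TABLE tb_data ATTACH PARTITION tb_data_w06_new
FOR VALUES FROM ('2024-02-05') TO ('2024-02-12');
ALTER TABLE tb_data_w06_new RENAME TO tb_data_w06;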

Unable to optimise Redshift query

I have built a system where data is loaded from S3 into Redshift every few minutes (from a Kinesis Firehose). I then grab data from that main table and split it into a table per customer.
The main table has a few hundred million rows.
Creating the subtable is done with a query like this:
create table {$table} as select * from {$source_table} where customer_id = '{$customer_id}' and time between '{$start}' and '{$end}'
I have keys defined as:
SORTKEY (customer_id, time)
DISTKEY customer_id
Everything I have read suggests this would be the optimal way to structure my tables/queries, but the performance is absolutely awful. Building the sub tables takes over a minute, even with only a few rows to select.
Am I missing something or do I just need to scale the cluster?
If you do not have a better key you may have to consider using DISTSTYLE EVEN, keeping the same sort key.
Ideally the distribution key should be a value that is used in joins and spreads your data evenly across the cluster. By using customer_id as the distribution key and then filtering on that key you're forcing all work to be done on just one slice.
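An untested sketch of what that DDL could look like (the table and column definitions are placeholders; only the DISTSTYLE and SORTKEY lines are the point here):
CREATE TABLE main_events (
customer_id varchar(64),
event_time timestamp,
payload varchar(max)
)
DISTSTYLE EVEN -- rows are spread round-robin across all slices
COMPOUND SORTKEY (customer_id, event_time); -- filters on customer/time still prune blocks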
To see this in action look in the system tables. First, find an example query:
SELECT *
FROM stl_query
WHERE userid > 1
ORDER BY starttime DESC
LIMIT 10;
Then, look at the bytes per slice for each step of your query in svl_query_report:
SELECT *
FROM svl_query_report
WHERE query = <your query id>
ORDER BY query,segment,step,slice;
For a very detailed guide on designing the best table structure have a look at our "Amazon Redshift Engineering’s Advanced Table Design Playbook"

Execution Plan on a View looking at Partitioned Tables

I currently have tables that are partitioned out by year & month for our sales transactions. For example, we have sales tables that would look something like this:
factdailysales_201501
factdailysales_201502
factdailysales_201503 etc ...
Generally, I've always performed dynamic SQL to capture a start date and end date, find out which partitions those cover, and then loop through each of those partitions ... but it's starting to become such a hassle, and I've learned that this is probably not the best way to do it in terms of maintenance, troubleshooting, and performance.
I decided to build a view that would UNION ALL my sales partitions together. However, I don't want selecting from the view to scan all of the partitions on execution, as that would defeat the whole purpose of partitioning the tables. Because of this, I added check constraints on date to each of my sales tables. This way, when I selected from the view, it would know which tables to access instead of scanning every table.
Here are the following examples below:
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= '2015-03-01'
This query has the execution plan of only pulling from the partitions that I need.
The problem I'm facing right now is that when my team writes stored procedures, they will more than likely write queries where a date variable is passed into the WHERE clause.
DECLARE @SD DATE = '2015-03-01'
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= @SD
However, when a variable is passed in, the execution plan now scans ALL of the partitions in the view, causing the query to take way longer than when I hard-coded the date.
I suppose I could do dynamic SQL again and insert the date string into the SELECT statement, but it would bring me back to the beginning of trying to get rid of dynamic SQL in the first place for this simple sales query.
So my question is, am I setting this up wrong? Am I on the right track? It seems that the view can't take in a variable for the check constraint and ends up scanning every table. Is there another approach anyone would recommend? Maybe my original solution of just looping through partitions via dynamic SQL is the best way to do it?
** EDIT **
http://sqlsunday.com/2014/08/31/partitioned-views/
This article is actually where I initially saw the idea! It seems when using that exact same solution, I'm still experiencing the same struggle!
Thanks!!
Okay, this might work. It's a table-valued function that only accesses tables according to your @Start and @End parameters, so it only touches the "partitions" it needs. I figured you could take this concept and write some dynamic SQL to create all the IF statements.
Now of course new tables are added every day, so how does that tie in? I think the best way is to alter the function every day, adding the next sales table. That way querying it stays simple, and you can use the same dynamic SQL you used to create the function to alter it, which should be relatively simple (a rough sketch of that maintenance step follows the usage examples below).
Note: I added default values that are the min and max of the data type DATE. That way you could query something like everything from 20140101 and onward or vice versa.
Your tables
SELECT CAST('20150101' AS DATE) datesVal INTO factDailySales_20150101;
SELECT CAST('20150102' AS DATE) datesVal INTO factDailySales_20150102;
SELECT CAST('20150103' AS DATE) datesVal INTO factDailySales_20150103;
The Function
CREATE FUNCTION ufn_factTotalSales (@Start DATE = '17530101', @End DATE = '99991231')
RETURNS @factTotalSales TABLE
(
datesVal DATE
)
AS
BEGIN
IF(CAST('20150101' AS DATE) BETWEEN @Start AND @End)
BEGIN
INSERT INTO @factTotalSales
SELECT datesVal
FROM factDailySales_20150101
END
IF(CAST('20150102' AS DATE) BETWEEN @Start AND @End)
BEGIN
INSERT INTO @factTotalSales
SELECT datesVal
FROM factDailySales_20150102
END
IF(CAST('20150103' AS DATE) BETWEEN @Start AND @End)
BEGIN
INSERT INTO @factTotalSales
SELECT datesVal
FROM factDailySales_20150103
END
RETURN;
END
GO
All tables
SELECT *
FROM ufn_factTotalSales(default,default)
All tables greater than or equal to 20150102
SELECT *
FROM ufn_factTotalSales('20150102',default)
All tables less than or equal to 20150102
SELECT *
FROM ufn_factTotalSales(default,'20150102')
All tables between specific range
SELECT *
FROM ufn_factTotalSales('20150101','20150102')
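For the daily maintenance step mentioned above, here is a rough, untested sketch of dynamic SQL that rebuilds the function from whatever factDailySales_* tables currently exist (it assumes the yyyymmdd suffix naming convention):
DECLARE @body NVARCHAR(MAX) = N'';
-- build one IF ... INSERT block per daily table
SELECT @body = @body + N'
IF (CAST(''' + RIGHT(t.name, 8) + N''' AS DATE) BETWEEN @Start AND @End)
INSERT INTO @factTotalSales
SELECT datesVal FROM ' + QUOTENAME(t.name) + N';'
FROM sys.tables AS t
WHERE t.name LIKE 'factDailySales[_]%'
ORDER BY t.name;
DECLARE @sql NVARCHAR(MAX) = N'
ALTER FUNCTION ufn_factTotalSales (@Start DATE = ''17530101'', @End DATE = ''99991231'')
RETURNS @factTotalSales TABLE (datesVal DATE)
AS
BEGIN' + @body + N'
RETURN;
END';
EXEC sp_executesql @sql;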
Is this the ideal solution? No. The ideal would be to combine all tables into one and have good indexes. I know you said that wouldn't work because of the way other code has been written. Hear me out. Now perhaps this is off the wall, but let's say you do combine the tables; obviously there are old scripts looking for specific daily sales tables. Maybe you could create views with the dailySales names that access factTotalSales. Or you could create synonyms for factTotalSales that would correspond to each factDailySales table.
Maybe you could look into that. It wouldn't be easy, but I think letting SQL Server optimize your queries the way it was designed is a better way of doing it instead of forcing it with dynamic SQL.
Just my two cents. Hope this helps. At the very least, I hope it gave you some ideas.
5 years later: option(recompile).
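A minimal sketch of what that looks like, reusing the Sales_Orig view and the @SD variable from the question:
DECLARE @SD DATE = '2015-03-01'
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= @SD
OPTION (RECOMPILE)  -- plan is compiled with the current value of @SD, so unused member tables are eliminated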
The planner needs to have access to the constants to eliminate the table entirely from the query plan. With a variable, without a forced recompile, a generic plan is used. (Related: parameter sniffing.)
While this means the query plan is larger, as it has to include all tables, it does not mean that all tables are actually scanned: look at the I/O stats, because table-scan elimination happens at execution time even if the scans still appear in the query plan.
The 'Number Of Executions' in the query plan will be 0 when the tables are not scanned: unfortunately, these branches are still reported as a non-zero percentage cost "Table Scan" node in the query plan & UI, which will appear high proportionally if the query is trivially fast. The displayed percentage cost of these extra "Table Scan" nodes approaches zero as the amount of data returned from the actually used base tables increases.
This same optimization/elimination occurs when the view is not a Partitioned View (e.g. the base tables are missing the partition column in the PK), yet the underlying tables have a suitable Check Constraint on the filtered column. It also occurs when the view selects a constant value to establish the partition that is not otherwise stored in the table. With a constant in the query or a recompiled plan, the tables will be eliminated entirely. With a variable, the tables will still not actually be scanned and are thus eliminated logically during query execution.
The use of a proper Partitioned View is only really beneficial to allow a direct Insert & Update, with the major caveat that it requires the partition column to be in each table's PK and disallows the use of an identity column (making a Partitioned View largely useless IMHO). SQL Server handles the optimizations very similarly for other quasi-Partitioned View cases.
(This is on SQL Server 2014; earlier versions might not have optimized the different patterns as efficiently.)

Add datetime constraint to a PostgreSQL multi-column partial index

I've got a PostgreSQL table called queries_query, which has many columns.
Two of these columns, created and user_sid, are frequently used together in SQL queries by my application to determine how many queries a given user has done over the past 30 days. It is very, very rare that I query these stats for any time older than the most recent 30 days.
Here is my question:
I've currently created my multi-column index on these two columns by running:
CREATE INDEX CONCURRENTLY some_index_name ON queries_query (user_sid, created)
But I'd like to further restrict the index to only care about those queries in which the created date is within the past 30 days. I've tried doing the following:
CREATE INDEX CONCURRENTLY some_index_name ON queries_query (user_sid, created)
WHERE created >= NOW() - '30 days'::INTERVAL
But this throws an exception stating that my function must be immutable.
I'd love to get this working so that I can optimize my index, and cut back on the resources Postgres needs to do these repeated queries.
You get an exception using now() because the function is not IMMUTABLE (obviously) and, quoting the manual:
All functions and operators used in an index definition must be "immutable" ...
I see two ways to utilize a (much more efficient) partial index:
1. Partial index with condition using constant date:
CREATE INDEX queries_recent_idx ON queries_query (user_sid, created)
WHERE created > '2013-01-07 00:00'::timestamp;
Assuming created is actually defined as timestamp. It wouldn't work to provide a timestamp constant for a timestamptz column (timestamp with time zone). The cast from timestamp to timestamptz (or vice versa) depends on the current time zone setting and is not immutable. Use a constant of matching data type. Understand the basics of timestamps with / without time zone:
Ignoring time zones altogether in Rails and PostgreSQL
Drop and recreate that index at hours with low traffic, maybe with a cron job on a daily or weekly basis (or whatever is good enough for you). Creating an index is pretty fast, especially a partial index that is comparatively small. This solution also doesn't need to add anything to the table.
Assuming no concurrent access to the table, automatic index recreation could be done with a function like this:
CREATE OR REPLACE FUNCTION f_index_recreate()
RETURNS void
LANGUAGE plpgsql AS
$func$
BEGIN
DROP INDEX IF EXISTS queries_recent_idx;
EXECUTE format('
CREATE INDEX queries_recent_idx
ON queries_query (user_sid, created)
WHERE created > %L::timestamp'
, LOCALTIMESTAMP - interval '30 days'); -- timestamp constant
-- , now() - interval '30 days'); -- alternative for timestamptz
END
$func$;
Call:
SELECT f_index_recreate();
now() (like you had) is the equivalent of CURRENT_TIMESTAMP and returns timestamptz. Cast to timestamp with now()::timestamp or use LOCALTIMESTAMP instead.
Select today's (since midnight) timestamps only
db<>fiddle here
Old sqlfiddle
If you have to deal with concurrent access to the table, use DROP INDEX CONCURRENTLY and CREATE INDEX CONCURRENTLY. But you can't wrap these commands into a function because, per documentation:
... a regular CREATE INDEX command can be performed within a
transaction block, but CREATE INDEX CONCURRENTLY cannot.
So, with two separate transactions:
CREATE INDEX CONCURRENTLY queries_recent_idx2 ON queries_query (user_sid, created)
WHERE created > '2013-01-07 00:00'::timestamp; -- your new condition
Then:
DROP INDEX CONCURRENTLY IF EXISTS queries_recent_idx;
Optionally, rename to old name:
ALTER INDEX queries_recent_idx2 RENAME TO queries_recent_idx;
2. Partial index with condition on "archived" tag
Add an archived tag to your table:
ALTER TABLE queries_query ADD COLUMN archived boolean NOT NULL DEFAULT FALSE;
UPDATE the column at intervals of your choosing to "retire" older rows and create an index like:
CREATE INDEX some_index_name ON queries_query (user_sid, created)
WHERE NOT archived;
Add a matching condition to your queries (even if it seems redundant) to allow them to use the index. Check with EXPLAIN ANALYZE whether the query planner catches on - it should be able to use the index for queries on a newer date. But it won't understand more complex conditions that don't match exactly.
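For example (a sketch; the exact stats query isn't shown in the question, so its shape is assumed):
SELECT count(*)
FROM queries_query
WHERE user_sid = 'some-user-sid'
AND created > now() - interval '30 days'
AND NOT archived; -- looks redundant, but lets the planner match the partial index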
You don't have to drop and recreate the index, but the UPDATE on the table may be more expensive than index recreation and the table gets slightly bigger.
I would go with the first option (index recreation). In fact, I am using this solution in several databases. The second incurs more costly updates.
Both solutions retain their usefulness over time, though performance slowly deteriorates as more outdated rows are included in the index.