Move a partition from one table to another table in PostgreSQL 10.11

I am new to PostgreSQL. I am working on a project where I have been asked to move all partitions older than 6 months to a legacy table so that queries on the main table will be faster. The partitioned table holds 10 years of data.
Let's assume myTable is the table with the current 6 months of data and myTable_legacy is going to hold all the data older than 6 months, up to 10 years. The table is partitioned by monthly range.
The questions I researched online but was unable to resolve are below.
I am currently testing before finalizing the steps; I used the link below as a reference for my lab testing before performing the actual migration.
How to migrate an existing Postgres Table to partitioned table as transparently as possible?
create table myTable(
forDate date not null,
key2 int not null,
value int not null
) partition by range (forDate);
create table myTable_legacy(
forDate date not null,
key2 int not null,
value int not null
) partition by range (forDate);
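For the lab test, the partition being moved would have been created and populated under myTable beforehand, presumably along these lines (bounds chosen to match the March 2000 sample rows shown further down):
create table mytable_200003 partition of mytable
for values from ('2000-03-01') to ('2000-04-01');
insert into mytable values ('2000-03-02', 1, 19), ('2000-03-30', 15, 8);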
1) Daily application queries will only touch the current 6 months of data. Is it necessary to move data older than 6 months to a separate table to get better query response times? I researched online but wasn't able to find any solid evidence either way.
2) If performance will be better, how do I move older partitions from myTable to myTable_legacy? Based on my research, PostgreSQL does not have an EXCHANGE PARTITION option.
Any help or guidance to proceed further with this requirement would be appreciated.
When I try to attach the partition to mytable_legacy, I get an error:
alter table mytable detach partition mytable_200003;
alter table mytable_legacy attach partition mytable_200003
for values from ('2003-03-01') to ('2003-03-30');
results in:
ERROR: partition constraint is violated by some row
SQL state: 23514
The contents of the partition:
select * from mytable_200003;
"2000-03-02" 1 19
"2000-03-30" 15 8

It's always better to keep the production table light. One of the practices I use is to store a timestamp and write a trigger function that inserts the row into the other table if its timestamp is older than 6 months (compared to now()).

Quote from the manual:
When creating a range partition, the lower bound specified with FROM is an inclusive bound, whereas the upper bound specified with TO is an exclusive bound.
(emphasis mine)
So the expression to ('2003-03-30') does not allow March 30th to be inserted into the partition.
Additionally, your data in mytable_200003 is for the year 2000, not for the year 2003 (which you used in your partition definition). To cover the complete month of March, simply use April 1st as the (exclusive) upper bound.
So you need to change the partition definition to cover March 2000, not March 2003.
alter table mytable_legacy
attach partition mytable_200003
for values from ('2000-03-01') to ('2000-04-01');
--                   ^ here (year 2000)   ^ here (April 1st, exclusive)
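To verify that the partition now belongs to mytable_legacy, you can check pg_inherits (a quick sanity check, not part of the original answer):
select inhparent::regclass as parent
from pg_inherits
where inhrelid = 'mytable_200003'::regclass;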


How to test if a Postgres partition has been populated or not

How can I (quickly) test if a postgres partition has any rows in it?
I have a partitioned Postgres table 'TABLE_A', partitioned by date range. The name of each individual partition indicates the date range, i.e. TABLE_A_20220101 (1st Jan this year), TABLE_A_20220102 (2nd Jan 2022), etc.
The table includes many years of data, so it includes several thousand individual partitions, each partition contains many millions of rows.
Is there a quick way of testing if a partition has any data in it? There are several solutions I've found, but they all involve count(*) and all take ages.
Please note - I'm NOT trying to accurately determine the row-count, just determine if each partition has any rows in it.
You can use an exists condition:
select exists (select * from partition_name limit 1)
That will return true if partition_name contains at least one row.
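If you need that check for every partition of TABLE_A at once, one option (a sketch using pg_inherits, not from the original answer) is a small DO block that runs the exists test against each child table:
do $$
declare
    part regclass;
    has_rows boolean;
begin
    for part in
        select inhrelid::regclass
        from pg_inherits
        where inhparent = 'table_a'::regclass  -- assumes the parent was created without quoted uppercase
    loop
        execute format('select exists (select 1 from %s)', part) into has_rows;
        raise notice '%: %', part, has_rows;
    end loop;
end $$;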

Predict partition number for Postgres hash partitioning

I'm writing an app which uses partitions in a Postgres DB. It will be shipped to customers and run on their servers, which implies that I have to be prepared for many different scenarios.
Let's start with a simple table schema:
CREATE TABLE dir (
id SERIAL,
volume_id BIGINT,
path TEXT
);
I want to partition that table by volume_id column.
What I would like to achieve:
a limited number of partitions (right now it's 500, but I will be tweaking this parameter later)
Do not create all partitions at once - add them only when they are needed
support volume ids up to 100K
[NICE TO HAVE] - being able, as a human, to calculate the partition number from volume_id
Solution that I have right now:
partition by LIST
each partition handles volume_id % 500 like this:
CREATE TABLE dir_part_1 PARTITION OF dir FOR VALUES IN (1, 501, 1001, 1501, ..., 9501);
This works great because I can create a partition when it's needed, and I know exactly which partition a given volume_id belongs to. But I have to declare the numbers manually, and I cannot support high volume_ids because the speed of insert statements decreases drastically (by more than 2x).
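As an aside, the value lists can be generated rather than typed by hand, e.g. (a sketch for bucket 1 under the modulus-500 scheme above; the generated statement can then be executed, for example with \gexec in psql):
select format(
    'CREATE TABLE dir_part_1 PARTITION OF dir FOR VALUES IN (%s);',
    string_agg(v::text, ', ' order by v)
)
from generate_series(1, 100000, 500) as v;  -- 1, 501, 1001, ..., 99501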
It looks like I could try HASH partitioning, but my biggest concern is that I would have to create all partitions at the very beginning, whereas I would like to be able to create them dynamically when they are needed, because planning time increases significantly (up to 5 seconds for 500 partitions). For example, I know that I will be adding rows with volume_id=5. How can I tell which partition I should create?
I was able to force Postgres to use a dummy hash function by adding a hash operator class for the partitioned table.
CREATE OR REPLACE FUNCTION partition_custom_bigint_hash(value BIGINT, seed BIGINT)
RETURNS BIGINT AS $$
-- this number is UINT64CONST(0x49a0f4dd15e5a8e3) from
-- https://github.com/postgres/postgres/blob/REL_13_STABLE/src/include/common/hashfn.h#L83
SELECT value - 5305509591434766563;
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
CREATE OPERATOR CLASS partition_custom_bigint_hash_op
FOR TYPE int8
USING hash AS
OPERATOR 1 =,
FUNCTION 2 partition_custom_bigint_hash(BIGINT, BIGINT);
Now you can declare the partitioned table like this, referencing the custom operator class in the partition key:
CREATE TABLE some_table (
id SERIAL,
partition_id BIGINT,
value TEXT
) PARTITION BY HASH (partition_id partition_custom_bigint_hash_op);
CREATE TABLE some_table_part_2 PARTITION OF some_table FOR VALUES WITH (modulus 3, remainder 2);
Now you can safely assume that all rows with partition_id % 3 = 2 will land in the some_table_part_2 partition. So if you are sure which values you will receive in the partition_id column, you can create only the required partitions.
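To double-check where a given partition_id actually ends up (assuming the matching partition already exists), you can inspect tableoid after an insert:
insert into some_table (partition_id, value) values (5, 'test');
select tableoid::regclass as partition, partition_id
from some_table
where partition_id = 5;
-- expected: some_table_part_2, since 5 % 3 = 2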
DISCLAIMER 1: Unfortunately this will not work correctly right now (Postgres 13.1) because of bug #16840
DISCLAIMER 2: There is no point in using this technique unless you are planning to create a large number of partitions (I would say 50 or more) and prolonged planning time is an issue.

Can I convert from Table to Stream in KSQL?

I am working with Kafka and KSQL. I would like to find the last row within 5 minutes for each DEV_NAME (ROWKEY). Therefore, I have created a stream and an aggregated table for a subsequent join.
With the KSQL below, I created the table that finds the last row within 5 minutes for each DEV_NAME:
CREATE TABLE TESTING_TABLE AS
SELECT ROWKEY AS DEV_NAME, max(ROWTIME) as LAST_TIME
FROM TESTING_STREAM WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY ROWKEY;
Then, I would like to join them together:
CREATE STREAM TESTING_S_2 AS
SELECT *
FROM TESTING_S S
INNER JOIN TESTING_T T
ON S.ROWKEY = T.ROWKEY
WHERE
S.ROWTIME = T.LAST_TIME;
However, it produced the following error:
Caused by: org.apache.kafka.streams.errors.StreamsException: A serializer (org.apache.kafka.streams.kstream.TimeWindowedSerializer) is not compatible to the actual key type (key type: org.apache.kafka.connect.data.Struct). Change the default Serdes in StreamConfig or provide correct Serdes via method parameters.
It seems the WINDOW TUMBLING clause changed my ROWKEY format,
e.g. DEV_NAME_11508 -> DEV_NAME_11508 : Window{start=157888092000 end=-}
Therefore, without setting the Serdes, could I convert the table to a stream and re-key it with PARTITION BY DEV_NAME?
As you've identified, the issue is that your table is a windowed table, meaning the key of the table is windowed, and you cannot look up into a windowed table with a non-windowed key.
Your table, as it stands, will generate one unique row per ROWKEY for each 5-minute window. Yet it seems like you don't care about anything but the most recent window. It may be that you don't need the windowing in the table, e.g.
CREATE TABLE TESTING_TABLE AS
SELECT
ROWKEY AS DEV_NAME,
max(ROWTIME) as LAST_TIME
FROM TESTING_STREAM
WHERE ROWTIME > (UNIX_TIMESTAMP() - 300000)
GROUP BY ROWKEY;
This will track the max timestamp per key, ignoring any timestamp that is more than 5 minutes old. (Of course, this check is only done at the time the event is received; the row isn't removed after 5 minutes.)
Also, this join:
CREATE STREAM TESTING_S_2 AS
SELECT *
FROM TESTING_S S
INNER JOIN TESTING_T T
ON S.ROWKEY = T.ROWKEY
WHERE
S.ROWTIME = T.LAST_TIME;
Almost certainly isn't doing what you think and wouldn't work in the way you want due to race conditions.
It's not clear what you're trying to achieve. Adding more information about your source data and required output may help people to provide you with a solution.

Large table advice for Postgres

I'm working through some numbers for a new Postgres build and wanted some advice on partitioning/sizing, as I have belatedly realised that I'm about to create a 40+ billion row table and then keep adding another 1.5 billion rows per year.
I'm a recent immigrant to Postgres from MSSQL and so still trying to work out what is possible/advisable...
This is the current table design:
security_id int NOT NULL, -- 5,000-10,000 securities
ratio_id smallint NOT NULL, -- ~100 ratios
period_id smallint NOT NULL, -- between 1 and 5 periods
rank_id smallint NOT NULL, -- between 1 and 5 different ways to rank
rankvalue smallint NOT NULL CHECK (rankvalue between 0 and 101),
validrangez tstzrange NOT NULL -- 30 years of dailyish data.
With the date range some records don't change for months, others change daily, and timezone matters which is why I'm using a range. There is a gist constraint to avoid overlaps.
Most of the queries will be looking at a particular date in the validrangez and then joining with other tables for everything at that date.
I am thinking of partitioning by the year of the upper(validrangez).
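A sketch of that layout (the table name here is made up, since the post doesn't give one; note that an expression partition key needs an extra set of parentheses):
CREATE TABLE security_ranks (
    security_id int NOT NULL,
    ratio_id smallint NOT NULL,
    period_id smallint NOT NULL,
    rank_id smallint NOT NULL,
    rankvalue smallint NOT NULL CHECK (rankvalue between 0 and 101),
    validrangez tstzrange NOT NULL
) PARTITION BY RANGE ((upper(validrangez)));

CREATE TABLE security_ranks_2022 PARTITION OF security_ranks
    FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
The gist exclusion constraint mentioned above may need to be created on each partition individually rather than on the partitioned parent.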
Question 1. Should I turn the period_id and rank_id fields into columns?
The upside seems to be that this would turn the 40 billion row table into a 3-4 billion row table, which seems more manageable, as each partition would only be 100-150m rows rather than a billion. Also, the ids and the range will be the same, so the indexes should be smaller.
The downside is that about 1/3 of the columns will be NULL / wouldn't have had rows in the original structure. Also, the joins will be less normalised. I'm unlikely to add more periods or ranks, but I can't rule it out.
Question 2. Should I instead try to create multiple tables?
It's a similar question to the above - basically, should I make writing queries harder (infrequently) in the interest of being able to do joins faster every day?
Question 3. How much benefit would I get from having rankvalue as a smallint rather than a numeric?
I would prefer to store it as a percentile (between 0 and 1) so that I don't have to keep dividing by 100 when I use it, but thought that across 40b records the memory savings would add up. Given rankvalue is not in any indexes, I suspect I have overthought this one...
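One quick way to sanity-check the per-value storage difference (not from the original post, just a probe) is pg_column_size:
SELECT pg_column_size(87::smallint) AS smallint_bytes,
       pg_column_size(0.87::numeric) AS numeric_bytes;
-- smallint is a fixed 2 bytes; numeric is variable-length with a per-value header
On-disk savings also depend on the alignment padding of the surrounding columns.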
Question 4. Anything else that I might have missed?
Thanks
Maybe creating year-wise views would help. Also check the CURSOR option.

Slow SQL Server 2008 R2 performance?

I'm using SQL Server 2008 R2 on my development machine (not a server box).
I have a table with 12.5 million records. It has 126 columns, half of which are int. Most columns in most rows are NULL. I've also tested with an EAV design which seems 3-4 times faster to return the same records (but that means pivoting data to make it presentable in a table).
I have a website that paginates the data. When the user tries to go to the last page of records (last 25 records), the resulting query is something like this:
select * from (
select
A.Id, part_id as PartObjectId,
Year_formatted 'year', Make_formatted 'Make',
Model_formatted 'Model',
row_number() over ( order by A.id ) as RowNum
FROM vehicles A
) as innerQuery where innerQuery.RowNum between 775176 and 775200
... but this takes nearly 3 minutes to run. That seems excessive? Is there a better way to structure this query? In the browser front-end I'm using jqGrid to display the data. The user can navigate to the next, previous, first, or last page. They can also filter and order data (example: show all records whose Make is "Bugatti").
vehicles.Id is int and is the primary key (clustered ASC). part_id is int, Make and Model are varchar(100) and typically only contain 20 - 30 characters.
Table vehicles is updated ~100 times per day in individual transactions, and 20 - 30 users use the webpage to view, search, and edit/add vehicles 8 hours/day. It gets read from and updated a lot.
Would it be wise to shard the vehicles table into multiple tables only containing say 3 million records each? Would that have much impact on performance?
I see lots of videos and websites talking about people having tables with 100+ million rows that are read from and updated often without issue.
Note that the performance issues I observe are on my own development computer. The database has a dedicated 16GB of RAM. I'm not using SSD or even SCSI for that matter. So I know hardware would help, but 3 minutes to retrieve the last 25 records seems a bit excessive no?
Though I'm running these tests on SQL Server 2008 R2, I could also use 2012 if there is much to be gained from doing so.
Yes, there is a better way, even on older releases of MSSQL, but it is involved. First, this process should be done in a stored procedure. The stored procedure should take, as 2 of its input parameters, the requested page (@page) and the page size (number of records per page - @pgSiz).
In the stored procedure,
Create a table variable and put into it a sorted list of the integer primary keys for all the records, with a rowNo column that is itself an indexed, integer primary key for the table variable:
Declare @PKs table
(rowNo integer identity(1,1) not null primary key,
vehicleId integer not null)
Insert @PKs (vehicleId)
Select vehicleId from Vehicles
Order By vehicleId -- [Here put the sort criteria you want pages sorted by]
                   -- [Try to only include columns that are in an index]
Then, based on which page (and the page size) the user requested (@page, @pgSiz), the stored proc selects the actual data for that page by joining to this table variable:
Select [The data columns you want]
From @PKs p join Vehicles v
on v.VehicleId = p.VehicleId
Where p.rowNo between @page*@pgSiz+1 and (@page+1)*@pgSiz
order by p.rowNo -- if you want the page of records sorted on the server
assuming @page is 0-based. Also, the stored proc will need some input argument validation to ensure that the @page and @pgSiz values are reasonable (do not take the code past the end of the records).
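Putting the pieces together, a minimal sketch of the whole procedure might look like this (the procedure name is hypothetical, column names are taken from the query in the question, and the input validation mentioned above is omitted):
CREATE PROCEDURE dbo.GetVehiclePage
    @page INT,   -- 0-based page number
    @pgSiz INT   -- records per page
AS
BEGIN
    SET NOCOUNT ON;

    -- Sorted list of primary keys, numbered in paging order
    DECLARE @PKs TABLE (
        rowNo INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        vehicleId INT NOT NULL
    );

    INSERT @PKs (vehicleId)
    SELECT Id
    FROM dbo.vehicles
    ORDER BY Id;    -- the paging sort criteria

    -- Return only the rows for the requested page
    SELECT v.Id, v.part_id, v.Year_formatted, v.Make_formatted, v.Model_formatted
    FROM @PKs p
    JOIN dbo.vehicles v ON v.Id = p.vehicleId
    WHERE p.rowNo BETWEEN @page * @pgSiz + 1 AND (@page + 1) * @pgSiz
    ORDER BY p.rowNo;
END;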