I'm working the numbers for a new Postgres build and wanted some advice on partitioning/sizing, as I have belatedly realised that I'm about to create a 40+ billion row table and then keep adding another 1.5 billion rows per year.
I'm a recent immigrant to Postgres from MSSQL and so still trying to work out what is possible/advisable...
This is the current table design:
security_id int NOT NULL, -- 5,000-10,000 securities
ratio_id smallint NOT NULL, -- ~100 ratios
period_id smallint NOT NULL, -- between 1 and 5 periods
rank_id smallint NOT NULL, -- between 1 and 5 different ways to rank
rankvalue smallint NOT NULL CHECK (rankvalue between 0 and 101),
validrangez tstzrange NOT NULL -- 30 years of dailyish data.
With the date range, some records don't change for months, others change daily, and timezone matters, which is why I'm using a range. There is a gist exclusion constraint to avoid overlaps.
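For reference, a sketch of what such a constraint might look like (the table name and the exact column list in the constraint are my assumptions, not the original DDL); btree_gist is needed for the equality operators on the integer columns:

-- Hypothetical sketch of the no-overlap constraint described above.
CREATE EXTENSION IF NOT EXISTS btree_gist;

ALTER TABLE rankings  -- assumed table name
    ADD CONSTRAINT rankings_no_overlap
    EXCLUDE USING gist (
        security_id WITH =,
        ratio_id    WITH =,
        period_id   WITH =,
        rank_id     WITH =,
        validrangez WITH &&
    );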
Most of the queries will be looking at a particular date in the validrangez and then joining with other tables for everything at that date.
I am thinking of partitioning by the year of the upper(validrangez).
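A minimal sketch of what that could look like with declarative partitioning, assuming a table name of rankings (the real name isn't given):

-- Sketch only: range-partition on the upper bound of the range, one partition per year.
CREATE TABLE rankings (
    security_id int       NOT NULL,
    ratio_id    smallint  NOT NULL,
    period_id   smallint  NOT NULL,
    rank_id     smallint  NOT NULL,
    rankvalue   smallint  NOT NULL CHECK (rankvalue BETWEEN 0 AND 101),
    validrangez tstzrange NOT NULL
) PARTITION BY RANGE (upper(validrangez));

CREATE TABLE rankings_2024 PARTITION OF rankings
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- Caveats: the GiST exclusion constraint cannot be declared on the partitioned
-- parent, so it has to be created on each partition; rows whose range has no
-- upper bound would need a DEFAULT partition.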
Question 1. Should I pivot the period_id and rank_id values out into their own columns?
The upside seems to be that this would turn the table from a 40 billion row table into a 3-4 billion row table which seems more manageable as each partition would only be 100-150m rows rather than a billion. Also the ids and the range will be the same and so the indexes should be smaller.
The downside is about 1/3rd of the columns will be NULLS / wouldn't have had rows in the original structure. Also the joins will be less normalised. I'm unlikely to add more periods or ranks, but I can't rule it out.
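If I understand the idea correctly, the pivoted version would look something like this sketch (table and column names are mine, purely illustrative): one row per security/ratio/range, with one smallint column per period/rank combination.

-- Illustrative sketch only: one value column per period/rank combination.
CREATE TABLE rankings_wide (
    security_id int       NOT NULL,
    ratio_id    smallint  NOT NULL,
    validrangez tstzrange NOT NULL,
    rank_p1_r1  smallint,        -- period 1, rank method 1
    rank_p1_r2  smallint,        -- period 1, rank method 2
    -- ... one smallint column per period/rank combination (up to 5 x 5 = 25),
    -- NULL where that combination has no value
    rank_p5_r5  smallint
);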
Question 2. Should I instead try to create multiple tables?
It's a similar question to the above - basically, should I make writing queries harder (infrequently) in the interest of being able to do joins faster every day?
Question 3. How much benefit would I get from having rankvalue as a smallint rather than a numeric?
I would prefer to store it as a percentile fraction (between 0 and 1) so that I don't have to keep dividing by 100 when I use it, but thought that across 40b records the storage savings would add up. Given rankvalue is not in any indexes I suspect I have overthought this one...
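For what it's worth, you can compare per-value storage directly; smallint is a fixed 2 bytes, while numeric is a variable-length type and larger:

-- Quick check of per-value storage (results for numeric vary slightly with the
-- value; smallint is always 2 bytes).
SELECT pg_column_size(42::smallint)  AS smallint_bytes,
       pg_column_size(0.42::numeric) AS numeric_bytes;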
Question 4. Anything else that I might have missed?
Thanks
Maybe creating year-wise views would help. Also check the CURSOR option.
PostgreSQL 14 for Windows on a medium sized machine. I'm using the default settings - literally as shipped. New to PostgreSQL from MS SQL Server.
A seemingly simple statement that runs in a minute in MS is taking hours in PostgreSQL - not sure why? I'm busy migrating over, i.e. it is the exact same data on the exact same hardware.
It's an update statement that joins a master table (roughly 1000 records) and fact table (roughly 8 million records). I've masked the tables and exact application here, but the structure is exactly reflective of the real data.
CREATE TABLE public.tmaster(
masterid SERIAL NOT NULL,
masterfield1 character varying,
PRIMARY KEY(masterid)
);
-- I've read that the primary key tag creates an index on that field automatically - correct?
CREATE TABLE public.tfact(
factid SERIAL NOT NULL,
masterid int not null,
fieldtoupdate character varying NULL,
PRIMARY KEY(factid),
CONSTRAINT fk_public_tfact_tmaster
FOREIGN KEY(masterid)
REFERENCES public.tmaster(masterid)
);
CREATE INDEX idx_public_fact_master on public.tfact(masterid);
The idea is to set public.tfact.fieldtoupdate = public.tmaster.masterfield1
I've tried the following ways (all taking over an hour to complete):
update public.tfact b
set fieldtoupdate = c.masterfield1
from public.tmaster c
where c.masterid = b.masterid;
update public.tfact b
set fieldtoupdate = c.masterfield1
from public.tfact bb
join public.tmaster c
on c.masterid = bb.masterid
where bb.factid = b.factid;
with t as (
select b.factid,
c.masterfield1 as fieldtoupdate
from public.tfact b
join public.tmaster c
on c.masterid = b.masterid
)
update public.tfact b
set fieldtoupdate = t.fieldtoupdate
from t
where t.factid = b.factid;
What am I missing? This should take no time at all, yet it takes over an hour?
Any help is appreciated...
If the table was tightly packed to start with, there will be no room to use the HOT (Heap-only-tuple) UPDATE short cut. Updating 8 million rows will mean inserting 8 million new rows and doing index maintenance on each one.
If your indexed columns on tfact are not clustered, this can involve large amounts of random IO, and if your RAM is small most of this may be uncached. With slow disk, I can see why this would take a long time, maybe even much longer than an hour.
If you will be doing this on a regular basis, you should change the table's fillfactor to keep it loosely packed.
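For example, a sketch of loosening the packing (70 is just an illustrative value; pick one based on your update pattern). The new setting only affects future writes, so an already tightly-packed table has to be rewritten to benefit:

-- Leave ~30% of each page free so future UPDATEs can use HOT.
ALTER TABLE public.tfact SET (fillfactor = 70);

-- Rewrites the table so existing pages honour the new fillfactor.
-- Caution: takes an ACCESS EXCLUSIVE lock for the duration.
VACUUM FULL public.tfact;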
Note that the default settings are generally suited for a small machine, or at least a machine where running the database is only a small part of its job. But the only setting likely to affect you here is work_mem, and even that is probably not really a problem for this task.
If you use psql, then the command \d+ tfact would show you what the fillfactor is set to if it is not the default. But note that this only applies to new tuples, not to existing ones. To see the fill on an existing table, you would want to check the freespacemap using the extension pg_freespacemap and see that every block has about half space available.
To see if an index is well clustered, you can check the correlation column of pg_stats on the table for the leading column ("attname") of the index.
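Concretely, something along these lines (assuming the index in question is the one on masterid):

-- Free space per block of tfact (requires the pg_freespacemap extension).
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
SELECT blkno, avail
FROM pg_freespace('public.tfact')
ORDER BY blkno
LIMIT 20;

-- Physical ordering of the indexed column: values near +/-1 mean well clustered.
SELECT attname, correlation
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'tfact' AND attname = 'masterid';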
I have these two tables :
CREATE TABLE ref_dates(
date_id SERIAL PRIMARY KEY,
month int NOT NULL,
year int NOT NULL,
month_name CHAR(255)
);
CREATE TABLE util_kpi(
kpi_id SERIAL PRIMARY KEY,
kpi_description varchar NOT NULL,
kpi_value float,
date_id int NOT NULL,
dInsertion timestamp default CURRENT_TIMESTAMP,
CONSTRAINT fk_ref_kpi FOREIGN KEY (date_id) REFERENCES ref_dates(date_id)
);
Usually, the type of request I'd run is:
Selecting kpi_description and kpi_value for a specified month and year:
SELECT kpi_description, kpi_value FROM util_kpi u JOIN ref_dates r ON u.date_id = r.date_id WHERE month=X AND year=XXXX
Selecting kpi_description and kpi_value for a specified kpi_description, month and year:
SELECT kpi_description, kpi_value FROM util_kpi u JOIN ref_dates r ON u.date_id = r.date_id WHERE month=X AND year=XXXX AND kpi_description='XXXXXXXXXXX'
I thought about creating these indexes:
CREATE INDEX idx_ref_date_year_month ON ref_dates(year, month);
CREATE INDEX idx_util_kpi_date ON util_kpi(date_id);
First of all, I want to know if it's a good idea to create these indexes.
Second, and finally, I was wondering if it's a good idea to add kpi_description to the index on the util_kpi table.
Can you give me your opinion?
Regards
It's not possible to give an exact answer without looking at the data, so I can only offer an opinion.
A. ref_dates
This table looks very similar to a date dimension in ROLAP schemas.
So the first thing I would do is change date_id from SERIAL to either:
a DATE datatype, or
a "smart integer": an integer in the form YYYYMMDD, e.g. 20210430. It may look strange, but it's not uncommon to see such identifiers in date dimensions.
The main point of using such a form is that date_id in fact tables becomes informative even without joining to the date dimension.
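For illustration, converting between a real date and such a YYYYMMDD integer is straightforward:

-- Date -> "smart integer" and back (illustrative only).
SELECT to_char(DATE '2021-04-30', 'YYYYMMDD')::int AS date_id;   -- 20210430
SELECT to_date(20210430::text, 'YYYYMMDD')          AS as_date;  -- 2021-04-30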
B. util_kpi
I suppose that:
ref_dates is a date dimension, so it has roughly 365 * number-of-years rows. It could be populated once for 20-30 years into the future and still would not be really big.
util_kpi is a fact table, which tends to become really big - millions of records and more.
For util_kpi I expected an id for a time dimension but did not find one, so no hourly stats are assumed yet.
I see util_kpi.dInsertion, which I suppose is planned to be used as the time dimension. I would consider extracting it into a time_id holding hours, minutes and seconds (if milliseconds are not needed).
C.Indexing
ref_dates: it does not matter much how you index ref_dates because it's a relatively small table. Maybe a unique index on date_id with an INCLUDE clause covering the other fields would be best. Don't create individual indexes on low-selectivity fields like year or month - they won't help much, though they won't hurt much either.
util_kpi: you need an index on date_id (as for any foreign key to dimension tables that may appear in the future).
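A rough sketch of those two suggestions (the INCLUDE clause needs PostgreSQL 11 or later; the covering index name is mine, and the util_kpi index is the same one you already proposed):

-- Small dimension table: covering unique index on the key.
CREATE UNIQUE INDEX idx_ref_dates_date_id
    ON ref_dates (date_id)
    INCLUDE (year, month, month_name);

-- Fact table: index the foreign key used by the joins.
CREATE INDEX idx_util_kpi_date ON util_kpi (date_id);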
Those are my thoughts, based on the assumptions above.
I am new to PostgreSQL. I am working on a project where I have been asked to move all partitions older than 6 months into a legacy table so that queries on the main table will be faster. I have a partitioned table with 10 years of data.
Let's assume myTable is the table holding the current 6 months of data and myTable_legacy is going to hold all the data older than 6 months, up to 10 years. The table is partitioned by monthly range.
My questions, which I researched online without being able to reach a conclusion, are below.
I am currently testing before finalizing the steps; I was using the link below as a reference for my lab testing before performing the actual migration.
How to migrate an existing Postgres Table to partitioned table as transparently as possible?
create table myTable(
forDate date not null,
key2 int not null,
value int not null
) partition by range (forDate);
create table myTable_legacy(
forDate date not null,
key2 int not null,
value int not null
) partition by range (forDate);
1) Daily application queries will only touch the current 6 months of data. Is it necessary to move data older than 6 months to a separate table to get better query response times? I researched online but wasn't able to find any solid evidence either way.
2) If performance is going to be better, how do I move older partitions from myTable to myTable_legacy? Based on my research, I can see that PostgreSQL does not have an exchange partition option.
Any help or guidance would help me proceed further with the requirement.
When I try to attach the partition to mytable_legacy, I get an error:
alter table mytable detach partition mytable_200003;
alter table mytable_legacy attach partition mytable_200003
for values from ('2003-03-01') to ('2003-03-30');
results in:
ERROR: partition constraint is violated by some row
SQL state: 23514
The contents of the partition:
select * from mytable_200003;
"2000-03-02" 1 19
"2000-03-30" 15 8
It's always better to keep the production table light. One of the practices I use is to keep a timestamp column and write a trigger function that inserts the row into the other table when its timestamp is older than 6 months (i.e. less than now() - interval '6 months').
Quote from the manual
When creating a range partition, the lower bound specified with FROM is an inclusive bound, whereas the upper bound specified with TO is an exclusive bound
(emphasis mine)
So the expression to ('2003-03-30') does not allow March 30th to be inserted into the partition.
Additionally, your data in mytable_200003 is for the year 2000, not 2003 (which you use in your partition definition). To cover the complete month of March, simply use April 1st as the (exclusive) upper bound.
So you need to change the partition definition to cover March 2000, not March 2003:
alter table mytable_legacy
attach partition mytable_200003
for values from ('2000-03-01') to ('2000-04-01');
-- note: year 2000, and April 1st as the exclusive upper bound
I am solving a performance issue on a system based on PostgreSQL 9.6. Intro:
A 12-year-old system, similar to a banking system, whose most-queried primary table is called transactions.
CREATE TABLE jrn.transactions (
ID BIGSERIAL,
type_id VARCHAR(200),
account_id INT NOT NULL,
date_issued DATE,
date_accounted DATE,
amount NUMERIC,
..
)
In the table transactions we store all transactions within a bank account. The field type_id determines the type of a transaction; it also serves as the C# EntityFramework discriminator column. Values are like:
card_payment, cash_withdrawl, cash_in, ...
14 types of transaction are known.
In general, there are 4 types of queries (no. 3 and no. 4 are by far the most frequent):
select single transaction like: SELECT * FROM jrn.transactions WHERE id = 3748734
select single transaction with JOIN to other transaction like: SELECT * FROM jrn.transactions AS m INNER JOIN jrn.transactions AS r ON m.refund_id = r.id WHERE m.id = 3748734
select 0-100, 100-200, .. transactions of given type like: SELECT * FROM jrn.transactions WHERE account_id = 43784 AND type_id = 'card_payment' LIMIT 100
several aggregate queries, like: SELECT SUM(amount), MIN(date_issued), MAX(date_issued) FROM jrn.transactions WHERE account_id = 3748734 AND date_issued >= '2017-01-01'
In the last few months we have seen unexpected row count growth; the table is now at 120M rows.
We are thinking of table partitioning, following the PostgreSQL docs: https://www.postgresql.org/docs/10/static/ddl-partitioning.html
Options:
partition table by type_id into 14 partitions
add a year column and partition the table by year (or year_month) into 12 (or 144) partitions (a rough sketch of both options follows below).
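For reference, with declarative partitioning (PostgreSQL 10+, so it would mean upgrading from 9.6) the two options would look roughly like the sketch below; the _by_type/_by_year table names are mine, purely illustrative:

-- Option 1: list-partition by transaction type (14 partitions, one per type_id).
CREATE TABLE jrn.transactions_by_type (
    LIKE jrn.transactions            -- copy the column definitions
) PARTITION BY LIST (type_id);

CREATE TABLE jrn.transactions_card_payment
    PARTITION OF jrn.transactions_by_type
    FOR VALUES IN ('card_payment');

-- Option 2: range-partition by issue date, one partition per year.
CREATE TABLE jrn.transactions_by_year (
    LIKE jrn.transactions
) PARTITION BY RANGE (date_issued);

CREATE TABLE jrn.transactions_2017
    PARTITION OF jrn.transactions_by_year
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');

-- Caveat: a primary key on the partitioned parent is not supported in PG 10,
-- and from PG 11 on it must include the partition key column.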
I am now restoring data into our test environment; I am going to test both options.
What do you consider the most appropriate partitioning rule for such a situation? Any other options?
Thanks for any feedback / advice etc.
Partitioning won't be very helpful with these queries, since they won't perform a sequential scan, unless you forgot an index.
The only good reason I see for partitioning would be if you want to delete old rows efficiently; then partitioning by date would be best.
Based on your queries, you should have these indexes (apart from the primary key index):
CREATE INDEX ON jrn.transactions (account_id, date_issued);
CREATE INDEX ON jrn.transactions (refund_id);
The following index might be a good idea if you can sacrifice some insert performance to make the third query as fast as possible (you might want to test):
CREATE INDEX ON jrn.transactions (account_id, type_id);
What you have here is almost a perfect case for column-based storage, as you would get with an SAP HANA database. However, as you explicitly asked for a Postgres answer and I doubt a HANA database is within the budget limit, we will have to stick with Postgres.
Your two queries no. 3 and 4 go in quite different directions, so there won't be "the single answer" to your problem - you will always have to balance somehow between these two use cases. Yet I would try to use two different techniques to approach each of them individually.
From my perspective, the biggest problem is query no. 4, which creates quite a high load on your Postgres server just because it is summing up values. Moreover, you are summing up values over and over again that most likely won't change often (or even at all), as you have said that UPDATEs hardly ever happen. I furthermore assume two more things:
transactions is INSERT-only, i.e. DELETE statements almost never happen (besides perhaps in cases of some exceptional administrative intervention).
The values of column date_issued when INSERTing typically are somewhere "close to today" - so you usually won't INSERT stuff way in the past.
Out of this, to prevent aggregating values over and over again unnecessarily, I would introduce yet another table: let's call it transactions_aggr, which is built up like this:
create table transactions_aggr (
account_id INT NOT NULL,
date_issued DATE,
sumamount NUMERIC,
primary key (account_id, date_issued)
)
which will give you a table of per-day preaggregated values.
To determine which values are already preaggregated, I would add another boolean-typed column to transactions, which indicates to me, which of the rows are contained in transactions_aggr and which are not (yet). The query no. 4 then would have to be changed in such a way that it reads only non-preaggregated rows from transactions, whilst the rest could come from transactions_aggr. To facilitate that you could define a view like this:
select account_id, date_issued, sum(amount) as sumamount from
(
select account_id, date_issued, sumamount as amount from transactions_aggr
union all
select account_id, date_issued, amount from transactions as t where t.aggregated = false
) as combined
group by account_id, date_issued
Needless to say, putting an index on transactions.aggregated (perhaps in conjunction with account_id) could greatly improve performance here.
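For example, a partial index over only the not-yet-aggregated rows would stay small and match that filter exactly (a sketch, assuming the aggregated flag described above):

-- Only index rows that still need aggregating; shrinks as rows get flagged.
CREATE INDEX idx_transactions_not_aggregated
    ON transactions (account_id, date_issued)
    WHERE aggregated = false;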
Updating transactions_aggr can be done using multiple approaches:
You could use this as a one-time activity and only pre-aggregate the current set of ~120m rows once. This would at least reduce the load on your machine doing aggregations significantly. However, over time you will run into the same problem again. Then you may just re-execute the entire procedure, simply dropping transactions_aggr as a whole and re-create it from scratch (all the original data still is there in transactions).
You have a nice period somewhere during the week/month/at night where little or no query traffic is coming in. Then you can open a transaction, read all transactions WHERE aggregated = false, and fold them into transactions_aggr (a sketch follows below). Keep in mind to then toggle aggregated to true in the same transaction. The tricky part, however, is that you must pay attention to what reading queries will "see" of this transaction: depending on your accuracy requirements during the timeframe of this "update job", you may have to consider raising the transaction isolation level (e.g. to REPEATABLE READ) to avoid inconsistent intermediate reads.
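A sketch of such a maintenance job, assuming the aggregated flag and the transactions_aggr table from above; running it under REPEATABLE READ keeps both statements working on the same set of rows, and concurrently inserted rows simply stay flagged for the next run:

BEGIN ISOLATION LEVEL REPEATABLE READ;

-- Fold all not-yet-aggregated rows into the per-day totals.
-- (Assumes date_issued is never NULL, since it is part of the aggregate key.)
INSERT INTO transactions_aggr (account_id, date_issued, sumamount)
SELECT account_id, date_issued, SUM(amount)
FROM transactions
WHERE aggregated = false
GROUP BY account_id, date_issued
ON CONFLICT (account_id, date_issued)
DO UPDATE SET sumamount = transactions_aggr.sumamount + EXCLUDED.sumamount;

-- Flag those rows so the view above stops re-reading them.
UPDATE transactions SET aggregated = true WHERE aggregated = false;

COMMIT;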
On the matter of your query no. 3, you could then try the approach of partitioning based on type_id. However, I find your query a little strange, as you are performing a LIMIT (and presumably OFFSET) without having specified an ordering - there is no ORDER BY clause in place (NB: you are not saying that you would be using database cursors). This may mean that the implicit order you currently get changes once you enable partitioning on the table, so be careful about side-effects this may cause in your program.
And one more thing: Before really doing the partition split, I would first check on the data distribution concerning type_id by issuing
select type_id, count(*) from transactions group by type_id
It may turn out that, for example, 90% of your data is card_payment, so that you would have a heavily uneven distribution amongst your partitions, and the biggest performance-hogging queries would still hit this single large partition.
Hope this helps a little - and good luck!
I'm currently working on a benchmark (which is part of my bachelor thesis) that compares SQL and NoSQL databases based on an abstract data model and abstract queries, to achieve a fair implementation on all systems.
I'm currently working on the implementation of a query that is specified as follows:
I have a table in Cassandra that is specified as follows:
CREATE TABLE allocated(
partition_key int,
financial_institution varchar,
primary_uuid uuid,
report_name varchar,
view_name varchar,
row_name varchar,
col_name varchar,
amount float,
PRIMARY KEY (partition_key, report_name, primary_uuid));
This table contains about 100,000,000 records (~300GB).
We now need to calculate the sum for the field "amount" for every possible combination of report_name, view_name, col_name and row_name.
In SQL this would be quite easy, just select sum (amount) and group it by the fields you want.
However, since Cassandra does not support these operations (which is perfectly fine), I need to achieve this in another way.
Currently I achieve this by doing a full-table walk, processing each record and storing the sum in a HashMap in Java for each combination.
The prepared statement I use is as follows:
SELECT
partition_key,
financial_institution,
report_name,
view_name,
col_name,
row_name,
amount
FROM allocated;
That partially works on machines with lots of RAM for both Cassandra and the Java app, but crashes on smaller machines.
Now I'm wondering whether it's possible to achieve this in a faster way?
I could imagine using partition_key, which also serves as the Cassandra partition key, and doing this for every partition (I have 5 of them).
I also thought of doing this multithreaded, by assigning every partition and report to a separate thread and running them in parallel. But I guess this would cause a lot of overhead on the application side.
Now to the actual question: Would you recommend another execution strategy to achieve this?
Maybe I still think too much in a SQL-like way.
Thank you for your support.
Here are two ideas that may help you.
1) You can efficiently scan rows in any table using the following approach. Consider a table with PRIMARY KEY (pk, sk, tk). Let's use a fetch size of 1000, but you can try other values.
First query (Q1):
select whatever_columns from allocated limit 1000;
Process these and then record the value of the three columns that form the primary key. Let's say these values are pk_val, sk_val, and tk_val. Here is your next query (Q2):
select whatever_columns from allocated where token(pk) = token(pk_val) and sk = sk_val and tk > tk_val limit 1000;
The above query will look for records with the same pk and sk, but for the next values of tk. Keep repeating as long as you keep getting 1000 records. When you get anything less, you drop the tk condition and use greater-than on sk. Here is the query (Q3):
select whatever_columns from allocated where token(pk) = token(pk_val) and sk > sk_val limit 1000;
Again, keep doing this as long as you get 1000 rows. Once you are done, you run the following query (Q4):
select whatever_columns from allocated where token(pk) > token(pk_val) limit 1000;
Now, you again use the pk_val, sk_val, tk_val from the last record, and run Q2 with these values, then Q3, then Q4.....
You are done when Q4 returns less than 1000.
2) I am assuming that report_name, view_name, col_name and row_name are not unique, and that's why you maintain a hashmap to keep track of the total amount whenever you see the same combination again. Here is something that may work better. Create a table in Cassandra where the key is a combination of these four values (maybe delimited). If there were three, you could simply have used a composite key for those three. You also need a column called amounts, which is a list. As you are scanning the allocated table (using the approach above), for each row you do the following:
update amounts_table set amounts = amounts + [whatever_amount] where my_primary_key = four_col_values_delimited;
Once you are done, you can scan this table and compute the sum of the list for each row you see and dump it wherever you want. Note that since there is only one key, you can scan using only token(primary_key) > token(last_value_of_primary_key).
Sorry if my description is confusing. Please let me know if this helps.