I am trying to redesign a PostgreSQL database to gain more performance. The database is for an ERP information system and holds a large amount of data (four years). Every year was in a separate database, which was a bad solution (building reports was a pain in the a??), so I consolidated all four databases into one... but... some tables are just too large! In order to gain some performance I decided to divide the data in these tables. I have two ways to do this.
First: dividing tables into "arch_table" and "working_table" and using views for reporting.
or
Second: using partitioning (say separate partition for every year).
So, my question is: which way is better? Partitioning or some archiving system?
PostgreSQL's inheritance-based partitioning is, effectively, a set of child tables behaving like a bunch of views, each using a check constraint to verify that only correct data is in each partition. A parent (master) table is created, and the partitions are created as tables that inherit from it:
CREATE TABLE measurement (
city_id int not null,
logdate date not null,
peaktemp int,
unitsales int
);
CREATE TABLE measurement_y2006m02 ( ) INHERITS (measurement);
CREATE TABLE measurement_y2006m03 ( ) INHERITS (measurement);
...
CREATE TABLE measurement_y2007m11 ( ) INHERITS (measurement);
CREATE TABLE measurement_y2007m12 ( ) INHERITS (measurement);
CREATE TABLE measurement_y2008m01 ( ) INHERITS (measurement);
Obviously, I've omitted a bit of code, but you can check out the documentation on PostgreSQL table partitioning. The most important part of partitioning is to make sure you build automatic scripts to create new partitions into the future as well as merge old partitions.
Operationally, when PostgreSQL plans a query such as SELECT * FROM measurement WHERE logdate BETWEEN '2006-02-13' AND '2006-02-22'; the optimizer goes "AH HA! I know what's up here, there's a partition. I'll just look at table measurement_y2006m02 and pull back the appropriate data." This only works if each partition carries a check constraint on logdate that the planner can compare with the WHERE clause.
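To make the pruning step concrete, here is a minimal sketch for one of the partitions created above. The constraint name and the month boundaries are my assumptions (the original post omits that part of the code), so adapt them to your own partitions:

```sql
-- Assumed boundaries: this child holds February 2006 only.
ALTER TABLE measurement_y2006m02
    ADD CONSTRAINT measurement_y2006m02_logdate_check
    CHECK (logdate >= DATE '2006-02-01' AND logdate < DATE '2006-03-01');

-- 'partition' is the default; it applies check-constraint exclusion
-- to inheritance children.
SET constraint_exclusion = partition;

-- Inspect the plan: children whose check constraint contradicts
-- the WHERE clause should not appear in it.
EXPLAIN
SELECT *
FROM measurement
WHERE logdate BETWEEN '2006-02-13' AND '2006-02-22';
```

Children without a check constraint cannot be excluded, so every partition needs its own constraint for this to pay off.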
As you age data out of the main partitions, you can either just drop the old tables or else merge them into an archive partition. Much of this work can be automated through scripting: all you really need to do is write the scripts once and test them. A side benefit is that older data tends not to change, so many partitions will require no index maintenance or vacuuming.
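A rough sketch of the two ways of ageing out a partition, using the table names from the example above (which partitions you retire, and when, is of course up to your retention rules):

```sql
-- Option 1: keep the data, but detach the child from the parent so that
-- queries on measurement no longer see it.
ALTER TABLE measurement_y2006m02 NO INHERIT measurement;

-- Option 2: discard the data entirely. Dropping a whole table is much
-- cheaper than DELETE-ing the same rows from one big table.
DROP TABLE measurement_y2006m03;
```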
Keep in mind that partitioning is largely a data management solution and may not provide the performance benefit that you're looking for. Tuning queries, applying indexes, and examining the PostgreSQL configuration (postgresql.conf, storage configuration, and OS configuration) may lead to far bigger performance gains than partitioning your data.
You should use partitioning with any of those ways. It's exactly what you need.
What is the difference between a BRIN index and a table partition in PostgreSQL? When should I use one instead of the other? It seems that they provide very similar benefits and also have similar use cases.
Example
Suppose we have the following table structure
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
store_id INT,
client_id INT,
created_at timestamp,
information jsonb
)
that has the following characteristics:
orders can only be inserted, deletions are not allowed and updates are very rare and they don't involve the created_at column
the created_at column contains the timestamp of the insertion of the row in the database thus the values in the column are strictly increasing
almost every query uses the created_at column in a condition and some of them may use the store_id and client_id columns
the most accessed rows are the most recent ones in terms of the created_at column
some queries may return a few records (example: analyzing a single record or the records created in a small time interval) while others may scan a vast amount of records (example: aggregate functions for a dashboard functionality)
I have chosen this example because it's very common and because both approaches could be used (in my opinion). In this case, which should I choose: a BRIN index on the whole table, a partitioned table with maybe a btree index, or just a simple btree index without partitioning? Does the table size influence the choice?
I have used both features (although I'll caveat that my experience with partitioning is from back when you had to use inheritance + constraints, before the introduction of CREATE TABLE ... PARTITION BY). You are correct that they seem similar-ish on a surface level, but they function by completely different mechanisms.
Table partitioning basically works as follows: replace all references to table with (select * from table_partition1 union all select * from table_partition2 /* repeat for all partitions */). The partitions will have a constraint on the partition columns, so that if those columns appear in a WHERE, the constraints can be applied up-front to prune which partitions are actually scanned. IOW, if table_partition1 has CHECK(client_id=1), and your WHERE has client_id=2, table_partition1 will be skipped since the table constraint automatically excludes all rows from this partition from passing that WHERE.
BRIN indexes, in contrast, divide the table into ranges of adjacent pages (block ranges), and then for each range record a min/max bound of the indexed column. This allows WHERE conditions to skip entire block ranges when we can see, say, that the maximum created_at in a particular range of rows is below a created_at>={some_value} clause in your WHERE.
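As a small illustration on the orders table from the question (the index name and the date literal are made up for the example):

```sql
-- One min/max summary is stored per range of 128 heap pages
-- (128 is the default for pages_per_range).
CREATE INDEX orders_created_at_brin
    ON orders USING brin (created_at)
    WITH (pages_per_range = 128);

-- Ranges whose stored maximum is below the bound can be skipped;
-- check the plan to see whether the index is actually used.
EXPLAIN
SELECT count(*)
FROM orders
WHERE created_at >= TIMESTAMP '2024-01-01 00:00:00';
```

A smaller pages_per_range gives finer-grained skipping at the cost of a larger index, so it is worth trying a couple of values when you benchmark.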
I can't tell you a definitive answer for your case as to which would work better. Well, that's not true, actually: the definitive answer is, "benchmark it for your own data" ;)
This is kind of fuzzy, but my general feeling is that BRIN is lightweight, and table partitioning is not. BRIN is something that can be added on to an existing table without much trouble, the indexes themselves are very small, and the impact on writes is not major (at least, not without inordinately many indices). Table partitioning, on the other hand, is a different way of representing the data on-disk; you are actually determining into which data files particular rows will be written. This requires a much more involved migration process when introducing it to an existing dataset.
However, the set of query optimizations available for table partitioning is much greater. Not only is there the constraint exclusion I described above, but you can also have indices (even BRIN ones!) on each individual partition. Of course, you can also have BRIN + other indices on a single-big-table, but I'm not sure that is particularly helpful IRL.
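For comparison, a sketch of what the partitioned variant with per-partition indexes could look like using declarative partitioning (PostgreSQL 10 or later). Table names, index names and year boundaries are assumptions for illustration, and I left out the primary key and id generation to keep it short:

```sql
CREATE TABLE orders_partitioned (
    id          bigint    NOT NULL,
    store_id    int,
    client_id   int,
    created_at  timestamp NOT NULL,
    information jsonb
) PARTITION BY RANGE (created_at);

-- The upper bound of each range is exclusive.
CREATE TABLE orders_2023 PARTITION OF orders_partitioned
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE orders_2024 PARTITION OF orders_partitioned
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- Old, rarely read partition: a tiny BRIN index may be enough.
CREATE INDEX orders_2023_created_at_brin
    ON orders_2023 USING brin (created_at);

-- Hot partition: btree indexes for selective lookups.
CREATE INDEX orders_2024_created_at_idx ON orders_2024 (created_at);
CREATE INDEX orders_2024_client_id_idx  ON orders_2024 (client_id);
```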
A few other thoughts: BRIN is good for monotonic data (timestamps, incrementing IDs, etc); the more correlated the on-disk ordering is to the indexed value, the more effective a BRIN index can be at pruning blocks to be scanned. Things like customer IDs, however, are unlikely to work well with BRIN; any given block of rows is likely to have at least one relatively low and relatively high ID. However, fields like that work quite well for partitioning: a partition-per-client, or partitioning on the modulus of a customer ID (which would more commonly be called sharding), is a good way of scaling horizontally, almost without bound.
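Two quick sketches for the points above, again on the orders example. The first checks how well the physical row order follows each column (values near 1 or -1 suit BRIN, values near 0 do not); the second shows modulus-style partitioning via hash partitioning, which needs PostgreSQL 11 or later. The table and partition names are hypothetical:

```sql
-- pg_stats is populated by ANALYZE.
ANALYZE orders;

SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'orders'
  AND attname IN ('created_at', 'client_id');

-- Spread rows over four partitions by hash of client_id.
CREATE TABLE orders_by_client (
    id         bigint    NOT NULL,
    client_id  int       NOT NULL,
    created_at timestamp NOT NULL
) PARTITION BY HASH (client_id);

CREATE TABLE orders_by_client_p0 PARTITION OF orders_by_client
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE orders_by_client_p1 PARTITION OF orders_by_client
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE orders_by_client_p2 PARTITION OF orders_by_client
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE orders_by_client_p3 PARTITION OF orders_by_client
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);
```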
Any update, even if it does not change the indexed column, will make a BRIN index pretty useless (unless it is a HOT update). Even without that, there are differences, for example:
partitioning allows you to get rid of lots of data efficiently, a BRIN index won't
a partitioned table allows one autovacuum worker per partition, which improves autovacuum performance
But if your only concern is to efficiently select all rows for a certain value of the index or partitioning key, both may offer about the same benefit.
Since Postgres also supports partitioned tables, what is the use of child tables?
Suppose there is a table of users which has a column created_date. We can store data in 2 ways:
We create many child tables of this user table and distribute the data of users on the basis of created_date (say, one table for every date, like user_jan01_21).
We can create a partitioned table with the partitioning key created_date
Then what is the difference between these solutions?
Basically, I want to know what problem table inheritance can solve that partitioning cannot.
Another question I have: if I follow solution 1 and query the user table without the ONLY keyword, will it scan all the child tables?
For example:
SELECT * FROM users WHERE created_date = current_date - 10;
If the objective is partitioning, as in your example, then there is no advantage in using table inheritance. Declarative partitioning is far superior in ease of use, performance and available features.
Table inheritance has uses that are unrelated to partitioning. Features that partitioning doesn't offer are:
the child table can have additional columns
a table can inherit from more than one table
With table inheritance, if you select from the parent table, you will also get all results from the child tables, just as if you had used UNION ALL to combine the results.
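A minimal sketch of that behaviour. The parent name users, the child name and the extra column are assumptions chosen to match the question:

```sql
CREATE TABLE users (
    id           int,
    created_date date
);

-- A child can add columns of its own, which a partition cannot.
CREATE TABLE users_jan01_21 (
    signup_source text   -- hypothetical extra column
) INHERITS (users);

-- Reads the parent and every child, like a UNION ALL over them
-- (only the parent's columns are returned).
SELECT * FROM users WHERE created_date = current_date - 10;

-- Reads the parent table alone.
SELECT * FROM ONLY users WHERE created_date = current_date - 10;
```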
Using PostgreSQL 12, I'd like to take advantage of partitioning to 1: aid query performance, and 2: allow removing historic data more easily to mitigate database growth.
Unfortunately, declarative partitioning requires the key to be part of the PKs. A temporal field as primary key doesn't work well for my model -- so I'm exploring using inheritance instead (as per the docs).
My question is whether using this approach will similarly isolate the amount of rows that my SELECT statement will be exposed to if an item in my WHERE statement limits the results to a single child table.
eg.
Books => BooksJan2020, BooksFeb2020, BooksMar2020.
SELECT * FROM Books WHERE created < '01 20 2020' and author LIKE 'John%';
In declarative partitioning, I would expect the 'LIKE' statement to only be exposed to rows within the January table. Can I expect the same with inheritance? When studying how to create inherited tables, I don't see a mechanism that would tell the planner which child table to pull from.
SteveJ
You can do that by creating the appropriate check constraints on the inheritance children and leaving constraint_exclusion at its default value, partition.
But I want to dissuade you from using anything but declarative partitioning in v12. Partitioning by inheritance hurts. Besides, you cannot get a true primary key on anything that does not contain the partitioning key that way: even though you have a primary key on all partitions, nothing can prevent you from inserting the same key in different partitions.
My advice is to go with a primary key on (id, created). True, that does not guarantee global uniqueness of id, but it goes a long way towards that goal. With values generated from a single sequence, the risk of duplicates is marginal.
The remaining downside of a composite primary key is that you have to include both columns in any table that has a foreign key constraint to the partitioned table, but I'd say that is the price you pay for the advantages of partitioning. Besides, with inheritance partitioning you couldn't have foreign keys pointing to the partitioned table at all.
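A sketch of what that could look like on v12. All table and column names other than id, created and author are made up for the example:

```sql
CREATE TABLE books (
    id      bigint NOT NULL,
    created date   NOT NULL,
    author  text,
    PRIMARY KEY (id, created)   -- must contain the partitioning key
) PARTITION BY RANGE (created);

CREATE TABLE books_jan2020 PARTITION OF books
    FOR VALUES FROM ('2020-01-01') TO ('2020-02-01');

-- Hypothetical referencing table: it has to carry both key columns.
CREATE TABLE book_reviews (
    review_id    bigint PRIMARY KEY,
    book_id      bigint NOT NULL,
    book_created date   NOT NULL,
    FOREIGN KEY (book_id, book_created) REFERENCES books (id, created)
);
```

Foreign keys that reference a partitioned table are available from PostgreSQL 12 on, which matches the version in the question.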
I know how partitioning in DB2 works, but I am unsure where exactly the partition values get stored. After writing a create partition query, for example:
CREATE TABLE orders(id INT, shipdate DATE, …)
PARTITION BY RANGE(shipdate)
(
STARTING '1/1/2006' ENDING '12/31/2006'
EVERY 3 MONTHS
)
After running the above query, we know that partitions are created on orders for every 3 months, and when we run a select query the query engine refers to these partitions. I am curious to know where these values actually get stored: in the same table, or does DB2 have a separate table where the partition values for every table are stored?
Thanks,
Table partitions in DB2 are stored in tablespaces.
For regular tables (if table partitioning is not used) table data is stored in a single tablespace (not considering LOBs).
For partitioned tables, multiple tablespaces can be used for the partitions.
This is achieved by the "IN" clause of the CREATE TABLE statement.
CREATE TABLE parttab
...
in TBSP1, TBSP2, TBSP3
In this example the first partition will be stored in TBSP1, the second in TBSP2, the third in TBSP3, the fourth in TBSP1 and so on.
Table partitions are named in DB2 - by default PART1 ..PARTn - and all these details can be looked up in the system catalog view SYSCAT.DATAPARTITIONS including the specified partition ranges.
See also http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0021353.html?cp=SSEPGG_10.5.0%2F2-12-8-27&lang=en
The column used as partitioning key can be looked up in syscat.datapartitionexpression.
There is also a long syntax for creating partitioned tables where partition names can be explicitly specified, as well as the tablespace where each partition will be stored.
For applications partitioned tables look like a single normal table.
Partitions can be detached from a partitioned table. In this case a partition is "disconnected" from the partitioned table and converted to a table without moving the data (or vice versa).
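To illustrate the long syntax and detaching, here is a sketch based on the orders example from the question. The partition names, the tablespace assignment and the target table name are assumptions, and the exact clauses should be checked against the CREATE TABLE and ALTER TABLE documentation for your DB2 version:

```sql
-- Long form: each partition is named and placed in a tablespace explicitly.
CREATE TABLE orders (id INT, shipdate DATE)
  PARTITION BY RANGE (shipdate)
  (
    PARTITION q1_2006 STARTING '1/1/2006' ENDING '3/31/2006' IN TBSP1,
    PARTITION q2_2006 STARTING '4/1/2006' ENDING '6/30/2006' IN TBSP2
  );

-- Turn one partition into a standalone table without moving its data.
ALTER TABLE orders DETACH PARTITION q1_2006 INTO orders_q1_2006_archive;
```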
best regards
Michael
After a bit of research I finally figured it out and want to share this information; I hope it may be useful to others.
How to see these key values? => For LUW (Linux/Unix/Windows) you can see the keys in the Table Object Editor or the Object Viewer Script tab. For z/OS there is an Object Viewer tab called "Limit Keys". I've opened issue TDB-885 to create an Object Viewer tab for LUW tables.
A simple query to check these values:
SELECT * FROM SYSCAT.DATAPARTITIONS
WHERE TABSCHEMA = ? AND TABNAME = ?
ORDER BY SEQNO
reference: http://www-01.ibm.com/support/knowledgecenter/SSEPGG_9.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0021353.html?lang=en
DB2 will create separate physical storage for each partition, so each partition can have its own tablespace. When you SELECT from this partitioned table, your SQL may go directly to a single partition or it may span many, depending on how your SQL is written. This may also allow your SQL to run in parallel, i.e. many tablespaces can be accessed concurrently to speed up the SELECT.
I am having a problem with log search speed and disk size.
The log is extremely big: it has about 220 million rows, takes 25 gigabytes on disk, and some selects take several minutes to return.
How does it work?
The log is saved in the database using SQL Anywhere, currently version 9, soon to be migrated to 11 (we tried 12, but due to driver and other problems we went back to 11).
The log consists of two tables (names changed to English so that people here can understand them):
LogTable
Id, DateTime, User, Url, Action and TableName.
Action is what the user did: insert/delete/update
TableName is which table in the database was affected.
LogTableFields
Id, LogTable_Id, FieldName, NewValue, OldValue.
LogTable_Id is foreign key from LogTable.
FieldName is the field of the table from DB.
It is important to note that NewValue and OldValue are of type varchar, because they record every kind of field from the other tables (datetime, int, etc.).
Why was it made this way?
Because we must record everything important. The system is made for an Institutional Department of Traffic (I don't know if that is how it is called in proper English, but you get the idea of what this is about), and sometimes they demand some kind of random report.
Until now, we have produced our reports simply by running some SQL selects. However, they take several minutes to complete, even when filtered by datetime. That isn't worth complaining about when reports are not requested often.
But they are demanding more and more reports, so it has become necessary to create a feature in the software that produces nice-looking reports. As we never know their needs in advance, we must go back to the log and dig up the data.
Some of the requested information exists only in the log (e.g. which user improperly gave someone access to a vehicle).
Some ideas suggested until now:
Idea 1:
I did some research and was told to work with NoSQL using CouchDB. But from the little I have read, I feel NoSQL isn't a solution for my problem. I can't argue why, since I have no experience with it.
Idea 2:
Separate the Log Tables physically from the Database or from the machine.
Idea 3: Create a mirror from every table with a version field to keep history.
I'd like a macro optimization or architecture change if needed.
This seems like a pretty standard audit table. I'm not sure you need to go to a NoSQL solution for this. 220 million rows will be comfortably handled by most RDBMSs.
It seems that the biggest problem is the table structure. Generally you flatten the table to improve logging speed and normalize it to improve reporting speed. As you can see these are conflicting.
If you were using something like MS SQL, you could build a single flat table for logging performance, then build a simple Analysis Services cube on top of it.
Another option would be to just optimize for reporting assuming you could maintain sufficient logging throughput. To do that, you may want to create a structure like this:
-- lookup tables: one row per distinct table/url, user and field name
create table LogTable (
LogTableID int identity(1,1) primary key,
TableName varchar(100),
Url varchar(200)
)
create table LogUser (
LogUserID int identity(1,1) primary key,
UserName varchar(100)
)
create table LogField (
LogFieldID int identity(1,1) primary key,
FieldName varchar(100)
)
-- fact table: one row per changed field, pointing at the lookup tables
create table LogData (
LogDataID bigint identity(1,1) primary key,
LogDate datetime,
LogTableID int references LogTable(LogTableID),
LogFieldID int references LogField(LogFieldID),
LogUserID int references LogUser(LogUserID),
Action char(1), -- U = update, I = insert, D = delete
OldValue varchar(100),
NewValue varchar(100)
)
This should still be fast enough to log data quickly, but provide enough performance for reporting. Index design is also important, generally done in order of increasing cardinality, so something like LogData(LogTableID, LogFieldID, LogDate). You can also get fancy with partitioning to allow for parallelized queries.
Adding proper indices is going to be the biggest improvement you can make. You don't mention having any indices, so I assume you don't have any. That would make it very slow.
For example, limiting your query to a particular range of DateTime doesn't help at all unless you have an index on DateTime. Without an index, the database still needs to touch nearly all 25GB of data to find the few rows that are in the right time range. But with an index, it could quickly identify the few rows that are in the time range you care about.
In general, you should always ask your database what plan it is using to execute a query that is taking too long. I'm not particularly familiar with Sql Anywhere, but I know it has a Plan Viewer that can do this. You want to identify big sequential scans and put indices on those fields instead.
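As a rough sketch against the table from the question: the index name and the date range are made up, and "DateTime" is quoted because it is also a type name. I believe SQL Anywhere offers a PLAN function that returns the optimizer's plan as text, but treat that call as an assumption and check the documentation for your version (the Plan Viewer shows the same information):

```sql
-- Index the column that almost every report filters on.
CREATE INDEX idx_logtable_datetime ON LogTable ("DateTime");

-- Ask the optimizer how it would run a typical date-range query.
-- Assumption: PLAN() is available in your SQL Anywhere version.
SELECT PLAN(
    'SELECT Id FROM LogTable
     WHERE "DateTime" >= ''2012-01-01'' AND "DateTime" < ''2012-02-01'''
);
```

If the plan still shows a sequential scan of LogTable after the index exists, the filter is probably not written in a form the index can serve.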
I doubt you would see a measurable improvement from breaking up the table and using integer foreign keys. To the extent that your queries touch many columns, you'll just end up joining all those tables back together anyway.