How do I cluster my PRIMARY KEY in Postgres?

I noticed that when we create a table in Postgres, it seems to automatically create a B-tree index on the PRIMARY KEY constraint. Looking at the properties of the constraint, it appears it is not clustered. How do I cluster it, and should I cluster it?

You have to use the CLUSTER command:
CLUSTER stone.unitloaddetail USING pk10;
Remember that this rewrites the table and takes an exclusive lock on it, blocking all other access for the duration.
Also, clustering is not maintained as the table data are modified, so you have to schedule regular CLUSTER runs if you want to keep the table clustered.
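Since PostgreSQL remembers the clustering index, the scheduled runs only need the short form (table and index names from the question above):
-- re-clusters using the remembered index (pk10):
CLUSTER stone.unitloaddetail;
A bare CLUSTER with no arguments reclusters every table in the database that has a clustering index set.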

Addressing the "should you" part, it depends on the likelihood of queries needing to access multiple rows having adjacent values of the clustering key.
For a table with a synthetic primary key, it probably makes more sense to cluster on a foreign key column.
Imagine that you have a table of products. Are you more likely to request multiple products having:
consecutive product_id?
the same location_id?
the same type_id?
the same manufacturer_id?
If improving the performance of the system for one of these particular cases would solve a problem for you, then that is the column you should consider clustering by.
If doing so would not solve a problem, then do not do it.
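For example, to cluster the hypothetical products table by its location_id foreign key (the index name here is made up; any existing B-tree index on the column works):
-- create the index on the foreign key if it does not exist yet:
CREATE INDEX products_location_id_idx ON products (location_id);
-- rewrite the table so rows with the same location_id sit together:
CLUSTER products USING products_location_id_idx;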

Related

Different select results when using multimaster via pglogical in PostgreSQL

There are two PostgreSQL 9.6 nodes subscribed to each other via pglogical. If node A inserts a row into the replicated table then node B sees it and vice versa.
However, when I update a row on one node, then subsequent SELECT queries on both nodes will keep returning different results - the current one and some of the previous ones.
Moreover, there are log entries about replication conflicts in the logs of both nodes.
Why does that happen and how do I fix that?
Update: setting pglogical.conflict_resolution to last_update_wins helps. Other conflict-resolution options might be worth considering too.
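For reference, a sketch of that setting; pglogical reads it from the server configuration, so on each node it could look like this (assuming the pglogical extension is already loaded):
# in postgresql.conf, followed by a configuration reload:
pglogical.conflict_resolution = 'last_update_wins'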
Multi-master replication is difficult.
There are conflicts that are bound to occur unless your application is aware of and specifically tailored to multi-master replication:
Rows inserted on different nodes with the same (automatically generated) primary key are bound to conflict.
If you modify the primary key of a row on one node while updating or deleting it on another, the databases will “drift apart”, leading to future conflicts.
You will have to fix your application so that it avoids problems like the above, and you will have to manually find and resolve all the conflicts that occurred so far.
Here is an example of the second case:
-- node one:
UPDATE person
SET id = 1234
WHERE id = 6543;
-- at the same time on node two
DELETE FROM person
WHERE id = 6543;
Both statements will be replicated to the other node, but will do nothing there, because neither node has a person with id 6543 any more. There will be no replication conflict right away, but node one now has a person that node two doesn't have. It is easy to see how this can lead to replication conflicts later (imagine you insert a row on node one that has a foreign key relationship to person 1234).
This is why it is in most cases a good idea to consider an architecture that does not include multi-master replication.

Build table of tables from other databases in Postgres - (Multiple-Server Parallel Query Execution?)

I am trying to find the best way to build a database relation. I need to create a table whose data is split across other tables in different databases. All the tables have exactly the same structure (same number of columns, same names and types).
In a single database, I would create a parent table with partitions. However, the volume of data is too big for a single database, which is why I am trying to split it. From the Postgres documentation, what I think I am trying to do is "Multiple-Server Parallel Query Execution".
At the moment, the only solution I can think of is to build an API over the database addresses and use it to pull data across the network into the main parent database when needed. I also found a Postgres extension called Citus that might do the job, but I don't know how to implement a unique key across multiple databases (or shards, as Citus calls them).
Is there any better way to do it?
Citus would most likely solve your problem. It lets you enforce unique keys across shards if the key is the distribution column, or if it is a composite key that contains the distribution column.
You can also use a distributed partitioned table in Citus: a table partitioned on one column (a timestamp?) and hash-distributed on another (like what you use in your existing approach). Query parallelization and data collection would then be handled by Citus for you.
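A minimal sketch of such a table, with made-up names (create_distributed_table() is Citus's function for hash distribution; note that the primary key must contain both the partition column and the distribution column):
-- partitioned by time, hash-distributed by tenant:
CREATE TABLE events (
    tenant_id  bigint      NOT NULL,
    event_id   bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    PRIMARY KEY (tenant_id, event_id, created_at)
) PARTITION BY RANGE (created_at);

SELECT create_distributed_table('events', 'tenant_id');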

(How) Is it possible to convert tables into foreign tables in Postgres?

We have a large table in our Postgres production database which we want to start "sharding" using foreign tables and inheritance.
The desired architecture will be to have 1 (empty) table that defines the schema and several foreign tables inheriting from the empty "parent" table. (possible with Postgres 9.5)
I found this well written article https://www.depesz.com/2015/04/02/waiting-for-9-5-allow-foreign-tables-to-participate-in-inheritance/ that explains everything on how to do it from scratch.
My question is how to reduce the needed migration of data to a minimum.
We have this 100+ GB table now, which should become our first "shard", and in the future we will regularly add new "shards". At some point, the older shards will be moved to another tablespace (on cheaper hardware, since they become less important).
My question now:
Is there a way to "ALTER" an existing table to be a foreign table instead?
There is no way to use ALTER TABLE to do this.
You basically have to do it manually. This is really no different from doing table partitioning: you create your partitions, you load the data, and you direct reads and writes to the partitions.
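A minimal sketch of that manual setup with postgres_fdw; all server, table, and column names below are illustrative:
-- on the "parent" database:
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1.example.com', dbname 'app');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
    OPTIONS (user 'app', password 'secret');

-- empty parent table that only defines the schema:
CREATE TABLE measurements (
    id          bigint,
    recorded_at timestamptz,
    value       double precision
);

-- foreign table inheriting from the parent (Postgres 9.5+):
CREATE FOREIGN TABLE measurements_shard1 ()
    INHERITS (measurements)
    SERVER shard1
    OPTIONS (table_name 'measurements');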
Now, in your case, there are a number of tools I would look at to make the sharding less painful. First, if you make sure your tables are split the way you want them, you can use a logical replication solution like Bucardo to replicate the writes while you are moving everything over.
There are some other approaches (parallelized readers and writers) that may save you some time at the expense of db load, but those are niche tools.
There is no native solution for shard management in standard PostgreSQL (and I don't know enough about Postgres-XL in this regard to say how well it can manage changing shard criteria). However, pretty much anything is possible with a little work and knowledge.

Restore PostgreSQL dump with new primary key values

I've got a problem with a PostgreSQL dump / restore. We have a production application running on PostgreSQL 8.4. I need to create some values in the database in the testing environment and then import just this chunk of data into the production environment. The data is generated by the application, and I need to use this approach because it needs testing before going into production.
Now that I described the environment, here is my problem:
In the testing database, I leave nothing but the data I need to move to the production database. The data is spread across multiple tables linked with foreign keys with multiple levels (like a tree). I then use pg_dump to export the desired tables into binary format.
When I try to import, the database correctly imports the root table entries with new primary key values, but does not import any of the data from the other tables. I believe the problem is that the foreign keys on the child tables no longer match the new primary key values.
Is there a way to achieve such an import which will update all the primary key values of all affected tables in the tree to correct serial (auto increment) values automatically and also update all foreign keys according to these new primary key values?
I have an idea of how to do this with the assistance of a programming language while connected to both databases, but that would be very hard for me to achieve, since I don't have direct access to the customer's production server.
Thanks in advance!
That seems to me like a complex migration issue. You can create PL/pgSQL migration scripts with inserts and use RETURNING to capture the new serial values, then use them as foreign keys for the other tables up the tree (a sketch follows below). I do not know the structure of your tree, but in some cases reading sequence values in advance into arrays may be required for complexity or performance reasons.
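A minimal sketch of that approach with hypothetical parent and child tables (wrapped in a function rather than a DO block, since the production server runs 8.4):
CREATE FUNCTION import_chunk() RETURNS void AS $$
DECLARE
    new_parent_id integer;
BEGIN
    INSERT INTO parent (name)
    VALUES ('imported root')
    RETURNING id INTO new_parent_id;  -- capture the freshly assigned serial

    INSERT INTO child (parent_id, payload)
    VALUES (new_parent_id, 'imported child');
END;
$$ LANGUAGE plpgsql;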
Another approach is to examine the production sequence values and estimate values that will not be used in the near future. Fabricate the test data in the test environment with serial values that will not collide with the production sequences, then load that data into the production database and adjust the production sequences so that the test values will never be handed out. This will leave a gap in your ID sequences, so you must check whether anything (like other processes) relies on the sequence values being continuous.
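For example, after loading, the production sequence could be jumped past the imported range (sequence name and value here are made up):
-- hand out values above the IDs used by the imported data:
SELECT setval('parent_id_seq', 2000000);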

How to stop clustering a table in PostgreSQL

I've clustered a table using the following:
CLUSTER foos USING idx_foos_on_bar;
Now every time I run CLUSTER it reclusters that table (and all other tables with clustering) appropriately.
Now I want to stop reordering that one table (but still reorder all the others with a single CLUSTER command).
I don't see anything in the documentation about how to uncluster. Is this possible? Or do I have to completely drop and recreate the table?
From the documentation (http://www.postgresql.org/docs/9.3/static/sql-cluster.html):
When a table is clustered, PostgreSQL remembers which index it was clustered by. The form CLUSTER table_name reclusters the table using the same index as before. You can also use the CLUSTER or SET WITHOUT CLUSTER forms of ALTER TABLE to set the index to be used for future cluster operations, or to clear any previous setting.
I think older versions didn't support the SET WITHOUT CLUSTER option.
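So, for the table from the question, clearing the remembered index looks like this:
-- clear the setting so a bare CLUSTER no longer reorders foos:
ALTER TABLE foos SET WITHOUT CLUSTER;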