How to write another query in IN function when partitioning - postgresql

I have 2 local docker postgresql-10.7 servers set up. On my hot instance, I have a huge table that I wanted to partition by date (I achieved that). The data from the partitioned table (Let's call it PART_TABLE) is stored on the other server, only PART_TABLE_2019 is stored on HOT instance. And here comes the problem. I don't know how to partition 2 other tables that have foreign keys from PART_TABLE, based on FK. PART_TABLE and TABLE2_PART are both stored on HOT instance.
I was thinking something like this:
create table TABLE2_PART_2019 partition of TABLE2_PART for values in (select uuid from PART_TABLE_2019);
But the query doesn't work and I don't know if this is a good idea (performance wise and logically).
Let me just mention that I can solve this with either function or script etc. but I would like to do this without scripting.

From doc at https://www.postgresql.org/docs/current/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE
"While primary keys are supported on partitioned tables, foreign keys
referencing partitioned tables are not supported. (Foreign key
references from a partitioned table to some other table are
supported.)"

With PostgreSQL v10, you can only define foreign keys on the individual partitions. But you could create foreign keys on each partition.
You could upgrade to PostgreSQL v11 which allows foreign keys to be defined on partitioned tables.
Can you explain what a HOT instance is and why it would makes this difficult?

Related

In PostgreSQL 12, Does creating partitioning via inheritance improve query performance if queries are contained with a child table?

Using PostgreSQL 12, I'd like to take advantage of partitioning to 1: Aid in query performance, 2: Allow removing historic data more easily to keep mitigate database growth.
Unfortunately, declarative partitioning requires the key to be part of the PKs. A temporal field as primary key doesn't work well for my model -- so I'm exploring using inheritance instead (as per the docs).
My question is whether using this approach will similarly isolate the amount of rows that my SELECT statement will be exposed to if an item in my WHERE statement limits the results to a single child table.
eg.
Books => BooksJan2020, BooksFeb2020, BooksMar2020.
SELECT * FROM Books WHERE created < '01 20 2020' and author LIKE 'John%';
In declarative partitioning, I would expect the 'LIKE' statement to only be exposed to rows within the January table. Can I expect the same with inheritance? When studying how to create inherited tables, I don't see a mechanism that would tell the planner which child table to pull from.
SteveJ
You can do that by creating the appropriate check constraints on the inheritance children and leaving constraint_exclusion at its default value on.
But I want to dissuade you from using anything but declarative partitioning in v12. Partitioning by inheritance hurts. Besides, you cannot get a true primary key on anything that does not contain the partitioning key that way: even though you have a primary key on all partitions, nothing can prevent you from inserting the same key in different partitions.
My advice is to go with a primary key on (id, created). True, that does not guarantee global uniqueness of id, but it goes a long way towards that goal. With values generated from a single sequence, the risk of duplicates is marginal.
The remaining down side of a composite primary key is that you have to include both columns into any table that has a foreign key constraint to the partitioned table, but I'd say that is the price you pay for the advantages of partitioning. Besides, with inheritance partitioning you couldn't have foreign keys pointing to the partitioned table at all.

Is it possible to merge two Postgres databases

We have two copies of a simple application that is based on SQLite. The application has 10 tables with a variety of relations between the tables. We would like to merge the databases to a single Postgres database with the same schema. We can use Talend to facilitate this, however the issue is that there would be duplicate keys (as both the source databases are independent). Is there a systematic method by which we can insert data into Postgres with the original key plus an offset resulting from loading the first database?
Step 1. Restore the first database.
Step 2. Change foreign keys of all tables by adding the option on update cascade.
For example, if the column table_b.a_id refers to the column table_a.id:
alter table table_b
drop constraint table_b_a_id_fkey,
add constraint table_b_a_id_fkey
foreign key (a_id) references table_a(id)
on update cascade;
Step 3. Update primary keys of the tables by adding the desired offset, e.g.:
update table_a
set id = 10000+ id;
Step 4. Restore the second database.
If you have the possibility to edit the script with database schema (or do the transfer manually with your own script), you can merge steps 1 and 2 and edit the script before the restore (adding the option on update cascade for foreign keys in tables declarations).

Redshift: Is using a foreign key necessary to take advantage of distribution keys?

In Amazon's guide, they mention specifying PRIMARY and FOREIGN KEYs for all of your tables, and then designating distribution keys where it makes sense, like on columns that often get used to join tables together. I understand that even with a single table query, the right DISTKEY specification would help in doing GROUP BY, but for JOINing two or more tables, do the DISTKEY columns have to be specified as FOREIGN KEYs as well? Or will Redshift co-locate rows from different tables to the same nodes based on the data-type (and maybe name) of columns used as the DISTKEY?
The reason I'm asking is because I'm not really using dimension tables in my application. I could create them simply to use as a foreign key reference to help with the distribution, but then the dimensions tables would have to be maintained.
Consider the following example where I have two tables that are frequently joined:
CREATE TABLE motorcycles
(
id INT,
hexcolor CHAR(6)
);
CREATE TABLE helmets
(
id INT,
hexcolor CHAR(6)
);
Now suppose in my application, we frequently join the motorcycles table to the helmets table on the hexcolor column. Then it would make sense to use DISTSTYLE KEY and use DISTKEY (hexcolor), right? However, you can't really say that the hexcolor column from the motorcycles table is a foreign key to the helmets table or vice-versa. I could create a dimension table that just had a list of all the possible hexcolor values, and then both the motorcycles and helmets tables could have a foreign key to this dimension table, but it would be a pain to have to maintain this dimension table (Amazon's guide also warns against specifying primary or foreign keys that are not properly maintained, because it will confuse the query planner).
So, with my motorcycles and helmets example, would a foreign key to a dimension table be necessary? Or will Redshift make an assumption that it should distribute the rows for both of these tables the same way based on the fact that the data type of the column used as the distribution key is the same?
As long as the columns have the same data type, you should expect Redshift to distribute the motorcycles and helmets tables in the same fashion.
There is no justification for a foreign-key in your case. The query planner will be able to take advantage of the fact that the tables are distributed by the same key.
But it's always good to read the execution plan and make sure that it says DS_DIST_NONE - which means that no data redistribution was needed.

Redshift - extracting constraints

How to get exported keys (database metadata).Even though redshift does not support foreign keys and primary keys I am able to see them in system tables.
The problem here is in the system table the multiple columns of a foreign key exist as an array in one column(though redshift doesn't support arrays). Is it possible to extract them in one query.
Use table_constraints table:
SELECT * FROM information_schema.table_constraints;

cassandra 2.0.9: best practices for write-heavy columns

I am a little confused by clustering in Cassandra. I have an application that is very write-heavy and update-heavy. With a traditional relational database, I'd partition data into two tables: one table for data that changes infrequently; and one table (with shorter rows) for the columns that change frequently:
For example:
create table user_def ( id int primary key, email list< varchar > ); # stable
create table user_var ( id int primary key, state int ); # changes all the time
But Cassandra seems to be optimized for accessing sparsely-populated columns, so I'm not sure there is any advantage in mimicking this approach for Cassandra schemas.
With Cassandra, is there any advantage in separating frequently-updated columns to a separate table/column-family (away from infrequently-updated columns) or should I combine all the columns together into one table/column-family? Do circumstances change if I have a compound primary key and clustering comes into play?
Cassandra treats primary keys like this:
The first key in the primary key (which can be a composite) is used to partition your data. This defines which node(s) your data is saved in (and replicated to). Other fields in the primary key is then used to sort entries within a partition. The whole partition is always going to be in one node (and replica nodes) in its entirety. Moreover, each entry within a node is sorted by the "other" fields in the primary key. [The first element of the primary key is called the partition key, while the other fields in the primary key are called clustering keys.]
Based on that, I'd say you might as well simply have a table with id, state and email. It looks like you're using skinny rows, and I don't think you'd gain anything (if any) of creating the separate tables.
I had approved ashic's answer until I came upon this:
http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
which states (for delete-heavy access):
...consider partitioning data with heavy churn rate into separate rows and deleting the entire rows when you no longer need them. Alternatively, partition it into separate tables and truncate them when they aren’t needed anymore...
This falls under the 'queue' anti-pattern for the product.