Does Postgres offer any way to optimize storage for a single-column table with a primary key on said column?
It seems all of the data is duplicated between the table and the primary key.
I have 2 local docker postgresql-10.7 servers set up. On my hot instance, I have a huge table that I wanted to partition by date (I achieved that). The data from the partitioned table (let's call it PART_TABLE) is stored on the other server; only PART_TABLE_2019 is stored on the HOT instance. And here comes the problem: I don't know how to partition 2 other tables that have foreign keys to PART_TABLE based on that FK. PART_TABLE and TABLE2_PART are both stored on the HOT instance.
I was thinking something like this:
create table TABLE2_PART_2019 partition of TABLE2_PART for values in (select uuid from PART_TABLE_2019);
But the query doesn't work, and I don't know whether it is even a good idea (performance-wise and logically).
Let me just mention that I could solve this with a function or a script, but I would like to do it without scripting.
From the docs at https://www.postgresql.org/docs/current/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE:
"While primary keys are supported on partitioned tables, foreign keys referencing partitioned tables are not supported. (Foreign key references from a partitioned table to some other table are supported.)"
With PostgreSQL v10, you cannot define a foreign key on a partitioned table itself, but you can create one on each individual partition. (Also note that the partition bounds in FOR VALUES IN must be literal values, so a subquery like the one above cannot be used there.)
You could upgrade to PostgreSQL v11 which allows foreign keys to be defined on partitioned tables.
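For v10, a minimal sketch of the per-partition approach, using the names from the question (the uuid column on TABLE2_PART_2019 is an assumption, and PART_TABLE_2019 must have a primary key or unique constraint on uuid):

ALTER TABLE table2_part_2019
    ADD CONSTRAINT table2_part_2019_uuid_fkey
    FOREIGN KEY (uuid)
    REFERENCES part_table_2019 (uuid);

This has to be repeated for each pair of matching partitions, which is exactly the kind of repetition v11 removes.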
Can you explain what a HOT instance is and why it makes this difficult?
I have a dataframe with an index that I want to store in a postgresql database. For this I use df.to_sql(table_name,engine,if_exists='replace', index=True,chunksize=10000)
The index column from the pandas dataframe is copied to the database, but it is not set as the primary key.
There are two solutions that require an additional step:
specify a schema via df.to_sql(schema=...) (see the docs)
set the primary key after the table is ingested, with a query like:
ALTER TABLE table_name ADD PRIMARY KEY (id_column_name)
Is there a way to set the primary key without specifying the schema or altering the table?
After calling to_sql:
import sqlalchemy
# create_engine needs a connection string; this one is a placeholder
engine = sqlalchemy.create_engine('postgresql://user:password@host/dbname')
engine.execute('ALTER TABLE schema.table ADD PRIMARY KEY (keycolumn);')
Unfortunately, pandas.to_sql doesn't set a primary key; with if_exists='replace' it even drops the primary key of an existing table. One must be aware of this when relying on primary keys.
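A sketch of an alternative that avoids the ALTER TABLE step (assuming SQLAlchemy 1.x as in the snippet above; the table name, column names, and connection string are placeholders): create the table with its primary key first, then let to_sql append into it.

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('postgresql://user:password@host/dbname')

# Create the table with the primary key up front...
engine.execute(
    'CREATE TABLE IF NOT EXISTS table_name '
    '(id_column_name INTEGER PRIMARY KEY, value TEXT);'
)

# ...then append rather than replace, so the key is preserved.
df = pd.DataFrame({'value': ['a', 'b']},
                  index=pd.Index([1, 2], name='id_column_name'))
df.to_sql('table_name', engine, if_exists='append', index=True)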
I have a Postgres table (let's call it Events) with a composite foreign key to another table (let's call it Logs). The Events table looks like this:
CREATE TABLE Events (
    ColPrimary UUID,
    ColA VARCHAR(50),
    ColB VARCHAR(50),
    ColC VARCHAR(50),
    PRIMARY KEY (ColPrimary),
    FOREIGN KEY (ColA, ColB, ColC) REFERENCES Logs(ColA, ColB, ColC)
);
In this case, I know that I can efficiently search for Events by the primary key, and join to Logs.
What I am interested in is if this foreign key creates an index on the Events table which can be useful even without joining. For example, would the following query benefit from the FK?
SELECT * FROM Events
WHERE ColA='foo' AND ColB='bar'
Note: I have run Postgres EXPLAIN for a very similar case and seen that the query results in a full table scan. I am not sure if this is because the FK is not helpful for this query, or because my data size is small and a scan is more efficient at my current scale.
PostgreSQL does not automatically create an index on the columns on which a foreign key is defined. If you need such an index, you will have to create it yourself.
It is usually a good idea to have such an index, so that modifications on the parent table that affect the referenced columns are efficient.
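For a query like the one above, you would create the index yourself; a sketch using the names from the question (the index name is arbitrary):

CREATE INDEX events_cola_colb_colc_idx ON Events (ColA, ColB, ColC);

Because ColA and ColB form a leading prefix of the indexed columns, the sample query that filters on just those two columns can also use this index.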
I have a table manual_errors_archive. I need to add a foreign key to it referencing the values table; the values table has 800,000 records, while the manual_errors_archive table has no records.
ALTER TABLE manual_errors_archive
ADD CONSTRAINT fk_manua_reference_value
FOREIGN KEY (value_id)
REFERENCES values;
The Postgres version I am using is 9.1.
The ALTER TABLE ran for more than an hour before I canceled the process. Any idea how to optimize this?
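One thing worth checking (an assumption, since the thread doesn't say): manual_errors_archive is empty, so the validation scan itself should be almost instant, which suggests the ALTER TABLE is waiting for a lock on one of the tables rather than doing work. A quick way to look for ungranted locks:

SELECT locktype, relation::regclass, mode, granted
FROM pg_locks
WHERE NOT granted;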
In Amazon's guide, they mention specifying PRIMARY and FOREIGN KEYs for all of your tables, and then designating distribution keys where it makes sense, like on columns that often get used to join tables together. I understand that even with a single table query, the right DISTKEY specification would help in doing GROUP BY, but for JOINing two or more tables, do the DISTKEY columns have to be specified as FOREIGN KEYs as well? Or will Redshift co-locate rows from different tables to the same nodes based on the data-type (and maybe name) of columns used as the DISTKEY?
The reason I'm asking is that I'm not really using dimension tables in my application. I could create them simply to use as a foreign key reference to help with the distribution, but then the dimension tables would have to be maintained.
Consider the following example where I have two tables that are frequently joined:
CREATE TABLE motorcycles
(
    id INT,
    hexcolor CHAR(6)
);

CREATE TABLE helmets
(
    id INT,
    hexcolor CHAR(6)
);
Now suppose that in my application we frequently join the motorcycles table to the helmets table on the hexcolor column. Then it would make sense to use DISTSTYLE KEY with DISTKEY (hexcolor), right? However, you can't really say that the hexcolor column of the motorcycles table is a foreign key to the helmets table, or vice versa. I could create a dimension table that just holds a list of all the possible hexcolor values, and then both the motorcycles and helmets tables could have a foreign key to it, but it would be a pain to maintain that dimension table (Amazon's guide also warns against specifying primary or foreign keys that are not properly maintained, because it will confuse the query planner).
So, with my motorcycles and helmets example, would a foreign key to a dimension table be necessary? Or will Redshift make an assumption that it should distribute the rows for both of these tables the same way based on the fact that the data type of the column used as the distribution key is the same?
As long as the columns have the same data type, you should expect Redshift to distribute the motorcycles and helmets tables in the same fashion.
There is no justification for a foreign key in your case. The query planner will be able to take advantage of the fact that the tables are distributed by the same key.
But it's always good to read the execution plan and make sure that it says DS_DIST_NONE, which means that no data redistribution was needed.
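A minimal sketch of the DDL this implies, reusing the table definitions from the question:

CREATE TABLE motorcycles
(
    id INT,
    hexcolor CHAR(6)
)
DISTSTYLE KEY
DISTKEY (hexcolor);

CREATE TABLE helmets
(
    id INT,
    hexcolor CHAR(6)
)
DISTSTYLE KEY
DISTKEY (hexcolor);

Running EXPLAIN on the join afterwards is the way to confirm the plan shows DS_DIST_NONE.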