Is it possible to use STDIN with CREATE FOREIGN TABLE? - postgresql

I was looking for a STDIN option, to use a FOREIGN TABLE in a similar way to COPY... and discovered a "bug" in the guide: there is no documentation about the options in the official sql-create-foreign-table guide. No link, nothing:
OPTIONS ( option 'value' [, ...] )
Options to be associated with the new foreign table or one of its columns. ...
So the lack of information turned this question into two:
Is it possible to use STDIN with a FOREIGN TABLE?
Where is the "OPTIONS" documentation?
Edit: adding an example
CREATE FOREIGN TABLE t1 (
aa text,
bb bigint
) SERVER files OPTIONS (
filename '/tmp/bigBigdata.csv',
format 'csv',
header 'true'
);
This is a classic, ugly PostgreSQL limitation on filesystem use, so I need a terminal solution... Imagine something with shell pipes, such as
psql -c "ALTER FOREIGN TABLE t1 ... STDIN; CREATE TABLE t2 AS SELECT trim(aa) as aa, bb+1 as bb FROM t WHERE bb>999" < /thePath/bigBigdata.csv
It is a kind of "no direct copy, only filtering a stream of data", creating a final table t2 from this filtered stream.

I think you are confused about foreign tables; I'll try to explain.
The data of a foreign table do not reside in PostgreSQL, but in an external data source (a file, a different database, etc.).
The foreign table is just a way to access these data from PostgreSQL as if they were a PostgreSQL table.
You can COPY to a foreign table FROM STDIN if the foreign data wrapper supports it, but that has nothing to do with CREATE FOREIGN TABLE. CREATE FOREIGN TABLE defines how PostgreSQL should locate the external data and what the format of the data is.
There is no documentation of the options in CREATE FOREIGN TABLE because they depend on the foreign data wrapper you are using.
Look at the documentation of the foreign data wrapper.
Your example makes it clear that what you need is not a foreign table, but a temporary table into which you can COPY the raw data that you later want to modify. You cannot use file_fdw for data that resides on the client machine.
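For illustration, a minimal sketch of that approach from a psql session, using the names from the question (the temporary table t1_raw is a hypothetical name):
-- load the raw CSV from the client machine into a temporary table,
-- then build the filtered table from it
CREATE TEMP TABLE t1_raw (aa text, bb bigint);
\copy t1_raw FROM '/thePath/bigBigdata.csv' WITH (FORMAT csv, HEADER true)
CREATE TABLE t2 AS
SELECT trim(aa) AS aa, bb + 1 AS bb
FROM t1_raw
WHERE bb > 999;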

Related

Postgres - CREATE FOREIGN TABLE based on other table

I'm looking for a way to simplify the CREATE FOREIGN TABLE command, so I don't need to specify all columns and types.
What I'm doing now (using file_fdw server):
CREATE FOREIGN TABLE temp_table_csv
(id integer, name text)
SERVER csv_log_server
OPTIONS ( filename 'path_to_file.csv', format 'csv');
What I would like to do:
CREATE FOREIGN TABLE temp_table_csv
(like example_table)
SERVER csv_log_server
OPTIONS ( filename 'path_to_file.csv', format 'csv');
Using LIKE or a similar command so Postgres can read the structure from the existing table.
But it says "LIKE is not supported".

Moving a table from a database to another - Only insert missing rows

I have two databases that are alike, one called datastore and the other called datarestore.
datarestore is a copy of datastore which was created from a backup image. The problem is that I accidentally deleted a little too much data from datastore.
Both databases are located on different AWS instances and I typically connect to them using pgAdmin III or Python to create scripts that handle the data.
I want to get the rows that I accidentally deleted from datastore, which still exist in datarestore, back into datastore. Does anyone have any idea how this can be achieved? Both databases contain close to 1,000,000,000 rows and are on version 9.6.
I have seen some backup/import/restore options within pgAdmin III; I just don't know how they work and whether they support my needs. I also thought about creating a Python script, but querying my database has become pretty slow, so this does not seem to be an option either.
-----------------------------------------------------
| id (serial - auto incrementing int) | - primary key
| did (varchar) |
| sensorid (int) |
| timestamp (bigint) |
| data (json) |
| db_timestamp (bigint) |
-----------------------------------------------------
If you preserved primary keys between those databases, then you could create foreign tables pointing from datarestore to datastore, check which keys are missing (using, for example, SELECT pk FROM old_table EXCEPT SELECT pk FROM new_table), and fetch those missing rows using the same foreign table you created. This should limit your first check for missing PKs to index-only scans (plus network transfer), and then it will be an index scan to fetch the missing data. If you are missing only a small part of the data, it shouldn't take long.
If you require a more detailed example, I'll update my answer.
EDIT:
Example of foreign table/server usage
These commands need to be executed on datarestore (or on datastore, if you choose to push data instead of pulling it).
If you don't have the foreign data wrapper "installed" yet:
CREATE EXTENSION postgres_fdw;
This will create a virtual server on your datarestore host. It is just some metadata pointing at the foreign server:
CREATE SERVER foreign_datastore FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'foreign_hostname', dbname 'foreign_database_name',
port '5432_or_whatever_you_have_on_datastore_host');
This will tell your datarestore host which user it should connect as when using the fdw on server foreign_datastore. It will be used only for your_local_role_name logged in on datarestore:
CREATE USER MAPPING FOR your_local_role_name SERVER foreign_datastore
OPTIONS (user 'foreign_username', password 'foreign_password');
You need to create a schema on datarestore. It is where the new foreign tables will be created.
CREATE SCHEMA schema_where_foreign_tables_will_be_created;
This will log in to the remote host and create foreign tables on datarestore, pointing to tables on datastore. ONLY tables will be created this way.
No data will be copied, just the structure of the tables.
IMPORT FOREIGN SCHEMA foreign_datastore_schema_name_goes_here
FROM SERVER foreign_datastore INTO schema_where_foreign_tables_will_be_created;
This will return the list of ids that are missing in your datarestore database for this table:
SELECT id FROM foreign_datastore_schema_name_goes_here.table_a
EXCEPT
SELECT id FROM datarestore_schema.table_a
You can either store them in a temp table (CREATE TABLE table_a_missing_pk AS [query from above here]) or use them right away:
INSERT INTO datarestore_schema.table_a (id, did, sensorid, timestamp, data, db_timestamp)
SELECT id, did, sensorid, timestamp, data, db_timestamp
FROM foreign_datastore_schema_name_goes_here.table_a
WHERE id = ANY((
SELECT array_agg(id)
FROM (
SELECT id FROM foreign_datastore_schema_name_goes_here.table_a
EXCEPT
SELECT id FROM datarestore_schema.table_a
) sub
)::int[])
From my tests, this should push down (meaning: send to the remote host) something like this:
Remote SQL: SELECT id, did, sensorid, timestamp, data, db_timestamp
FROM foreign_datastore_schema_name_goes_here.table_a WHERE ((id = ANY ($1::integer[])))
You can make sure it does by running EXPLAIN VERBOSE on your full query to see which plan it will execute. You should see Remote SQL in there, as sketched below.
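For example, a sketch of that check, wrapping the full INSERT from above:
-- look for the "Remote SQL" line in the plan output
EXPLAIN (VERBOSE)
INSERT INTO datarestore_schema.table_a (id, did, sensorid, timestamp, data, db_timestamp)
SELECT id, did, sensorid, timestamp, data, db_timestamp
FROM foreign_datastore_schema_name_goes_here.table_a
WHERE id = ANY((
    SELECT array_agg(id)
    FROM (
        SELECT id FROM foreign_datastore_schema_name_goes_here.table_a
        EXCEPT
        SELECT id FROM datarestore_schema.table_a
    ) sub
)::int[]);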
In case it does not work as expected, you can instead create a temp table as mentioned earlier and make sure that this temp table is on the datastore host.
An alternative approach would be to create a foreign server on datastore pointing to datarestore and push the data from your old database to the new one (you can insert into foreign tables). This way you won't have to worry about the list of ids not being pushed down to datastore, which would otherwise mean fetching all the data and filtering it afterwards (which would be extremely slow).
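A minimal sketch of that push variant, executed on the datastore host after setting up the server, user mapping, and schema there in the same way (foreign_datarestore_schema and datastore_schema are placeholder names mirroring the example above):
-- insert the missing rows directly into the foreign table,
-- which pushes them to the remote table on datarestore
INSERT INTO foreign_datarestore_schema.table_a (id, did, sensorid, timestamp, data, db_timestamp)
SELECT id, did, sensorid, timestamp, data, db_timestamp
FROM datastore_schema.table_a
WHERE id NOT IN (SELECT id FROM foreign_datarestore_schema.table_a);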

Is it possible to merge two Postgres databases

We have two copies of a simple application that is based on SQLite. The application has 10 tables with a variety of relations between the tables. We would like to merge the databases to a single Postgres database with the same schema. We can use Talend to facilitate this, however the issue is that there would be duplicate keys (as both the source databases are independent). Is there a systematic method by which we can insert data into Postgres with the original key plus an offset resulting from loading the first database?
Step 1. Restore the first database.
Step 2. Change foreign keys of all tables by adding the option on update cascade.
For example, if the column table_b.a_id refers to the column table_a.id:
alter table table_b
drop constraint table_b_a_id_fkey,
add constraint table_b_a_id_fkey
foreign key (a_id) references table_a(id)
on update cascade;
Step 3. Update primary keys of the tables by adding the desired offset, e.g.:
update table_a
set id = 10000+ id;
Step 4. Restore the second database.
If you have the possibility to edit the script with the database schema (or do the transfer manually with your own script), you can merge steps 1 and 2 and edit the script before the restore (adding the ON UPDATE CASCADE option to the foreign keys in the table declarations).
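For illustration, a minimal sketch of what the edited declaration in the schema script might look like, using the table and column names from the example above (the other column of table_b is hypothetical):
-- in the dumped schema, declare the foreign key with ON UPDATE CASCADE
-- so that step 3's primary-key update propagates automatically
CREATE TABLE table_b (
    id   serial PRIMARY KEY,  -- illustrative column
    a_id integer REFERENCES table_a(id) ON UPDATE CASCADE
);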

How to add a sort key to an existing table in AWS Redshift

In AWS Redshift, I want to add a sort key to a table that is already created. Is there any command which can add a column and use it as sort key?
UPDATE:
Amazon Redshift now enables users to add and change sort keys of existing Redshift tables without having to re-create the table. The new capability simplifies user experience in maintaining the optimal sort order in Redshift to achieve high performance as their query patterns evolve and do it without interrupting the access to the tables.
source: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-redshift-supports-changing-table-sort-keys-dynamically/
At the moment I think it's not possible (hopefully that will change in the future). In the past, when I ran into this kind of situation, I created a new table and copied the data from the old one into it.
from http://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html:
ADD [ COLUMN ] column_name
Adds a column with the specified name to the table. You can add only one column in each ALTER TABLE statement.
You cannot add a column that is the distribution key (DISTKEY) or a sort key (SORTKEY) of the table.
You cannot use an ALTER TABLE ADD COLUMN command to modify the following table and column attributes:
UNIQUE
PRIMARY KEY
REFERENCES (foreign key)
IDENTITY
The maximum column name length is 127 characters; longer names are truncated to 127 characters. The maximum number of columns you can define in a single table is 1,600.
As Yaniv Kessler mentioned, it's not possible to add or change distkey and sort key after creating a table, and you have to recreate a table and copy all data to the new table.
You can use the following SQL format to recreate a table with a new design.
ALTER TABLE test_table RENAME TO old_test_table;
CREATE TABLE new_test_table([new table columns]);
INSERT INTO new_test_table (SELECT * FROM old_test_table);
ALTER TABLE new_test_table RENAME TO test_table;
DROP TABLE old_test_table;
In my experience, this SQL is used not only for changing the distkey and sortkey, but also for setting the encoding (compression) type.
To add to Yaniv's answer, the ideal way to do this is probably using the CREATE TABLE AS command. You can specify the distkey and sortkey explicitly, i.e.:
CREATE TABLE test_table_with_dist
distkey(field)
sortkey(sortfield)
AS
select * from test_table
Additional examples:
http://docs.aws.amazon.com/redshift/latest/dg/r_CTAS_examples.html
EDIT
I've noticed that this method doesn't preserve encoding. Redshift only automatically encodes during a copy statement. If this is a persistent table you should redefine the table and specify the encoding.
create table test_table_with_dist(
  field1 varchar encode raw distkey,
  field2 timestamp encode delta sortkey);
insert into test_table_with_dist select * from test_table;
You can figure out which encoding to use by running analyze compression test_table;
AWS now allows you to add both sortkeys and distkeys without having to recreate tables:
To add a sortkey (or alter a sortkey):
ALTER TABLE data.engagements_bot_free_raw
ALTER SORTKEY (id)
To alter a distkey or add a distkey:
ALTER TABLE data.engagements_bot_free_raw
ALTER DISTKEY id
Interestingly, the parentheses are mandatory on SORTKEY, but not on DISTKEY.
You still cannot change the encoding of a table in place - that still requires the solutions where you must recreate tables.
I followed this approach for adding the sort columns to my table table_transactions; it's more or less the same approach, only with fewer commands.
alter table table_transactions rename to table_transactions_backup;
create table table_transactions compound sortkey(key1, key2, key3, key4) as select * from table_transactions_backup;
drop table table_transactions_backup;
Catching this query a bit late.
I find that using 1=1 is the best way to create and replicate data into another table in Redshift,
e.g.:
CREATE TABLE NEWTABLE AS SELECT * FROM OLDTABLE WHERE 1=1;
Then you can drop the OLDTABLE after verifying that the data has been copied.
(If you replace 1=1 with 1=2, it copies only the structure - which is good for creating staging tables.)
It is now possible to alter a sort key:
Amazon Redshift now supports changing table sort keys dynamically
Amazon Redshift now enables users to add and change sort keys of existing Redshift tables without having to re-create the table. The new capability simplifies user experience in maintaining the optimal sort order in Redshift to achieve high performance as their query patterns evolve and do it without interrupting the access to the tables.
Customers when creating Redshift tables can optionally specify one or more table columns as sort keys. The sort keys are used to maintain the sort order of the Redshift tables and allows the query engine to achieve high performance by reducing the amount of data to read from disk and to save on storage with better compression. Currently Redshift customers who desire to change the sort keys after the initial table creation will need to re-create the table with new sort key definitions.
With the new ALTER SORT KEY command, users can dynamically change the Redshift table sort keys as needed. Redshift will take care of adjusting data layout behind the scenes and table remains available for users to query. Users can modify sort keys for a given table as many times as needed and they can alter sort keys for multiple tables simultaneously.
For more information about ALTER SORTKEY, please refer to the documentation.
As for the documentation itself:
ALTER DISTKEY column_name or ALTER DISTSTYLE KEY DISTKEY column_name
A clause that changes the column used as the distribution key of a table. Consider the following:
VACUUM and ALTER DISTKEY cannot run concurrently on the same table.
If VACUUM is already running, then ALTER DISTKEY returns an error.
If ALTER DISTKEY is running, then background vacuum doesn't start on a table.
If ALTER DISTKEY is running, then foreground vacuum returns an error.
You can only run one ALTER DISTKEY command on a table at a time.
The ALTER DISTKEY command is not supported for tables with interleaved sort keys.
When specifying DISTSTYLE KEY, the data is distributed by the values in the DISTKEY column. For more information about DISTSTYLE, see CREATE TABLE.
ALTER [COMPOUND] SORTKEY ( column_name [,...] )
A clause that changes or adds the sort key used for a table. Consider the following:
You can define a maximum of 400 columns for a sort key per table.
You can only alter a compound sort key. You can't alter an interleaved sort key.
When data is loaded into a table, the data is loaded in the order of the sort key. When you alter the sort key, Amazon Redshift reorders the data. For more information about SORTKEY, see CREATE TABLE.
According to the updated documentation it is now possible to change a sort key type with:
ALTER [COMPOUND] SORTKEY ( column_name [,...] )
For reference (https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html):
"You can alter an interleaved sort key to a compound sort key or no sort key. However, you can't alter a compound sort key to an interleaved sort key."
ALTER TABLE table_name ALTER SORTKEY (sortKey1, sortKey2 ...etc)

Creating a "table of tables" in PostgreSQL or achieving similar functionality?

I'm just getting started with PostgreSQL, and I'm new to database design.
I'm writing software in which I have various plugins that update a database. Each plugin periodically updates its own designated table in the database. So a plugin named 'KeyboardPlugin' will update the 'KeyboardTable', and 'MousePlugin' will update the 'MouseTable'. I'd like for my database to store these 'plugin-table' relationships while enforcing referential integrity. So ideally, I'd like a configuration table with the following columns:
Plugin-Name (type 'text')
Table-Name (type ?)
My software will read from this configuration table to help the plugins determine which table to update. Originally, my idea was to have the second column (Table-Name) be of type 'text'. But then, if someone mistypes the table name, or an existing relationship becomes invalid because of someone deleting a table, we have problems. I'd like for the 'Table-Name' column to act as a reference to another table, while enforcing referential integrity.
What is the best way to do this in PostgreSQL? Feel free to suggest an entirely new way to setup my database, different from what I'm currently exploring. Also, if it helps you answer my question, I'm using the pgAdmin tool to setup my database.
I appreciate your help.
I would go with your original plan to store the name as text. Possibly enhanced by additionally storing the schema name:
addin text
,sch text
,tbl text
Tables have an OID in the system catalog (pg_catalog.pg_class). You can get those with a nifty special cast:
SELECT 'myschema.mytable'::regclass
But the OID can change over a dump/restore. So just store the names as text and verify the table is there at application time by casting it as demonstrated above.
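For illustration, a variant of that check that does not raise an error when the table is missing (to_regclass exists from PostgreSQL 9.4 onwards):
-- returns the table's OID if it exists, NULL otherwise
SELECT to_regclass('myschema.mytable');
-- e.g. a true/false existence check for a stored schema + table name pair
SELECT to_regclass(format('%I.%I', 'myschema', 'mytable')) IS NOT NULL AS table_exists;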
Of course, if you use the tables for multiple addins, it might pay to make a separate table
CREATE TABLE tbl (
  tbl_id serial PRIMARY KEY
 ,sch text
 ,name text
);
and reference it in ...
CREATE TABLE addin (
  addin_id serial PRIMARY KEY
 ,addin text
 ,tbl_id integer REFERENCES tbl(tbl_id) ON UPDATE CASCADE ON DELETE CASCADE
);
Or even make it an n:m relationship if addins have multiple tables. But be aware, as @OMG_Ponies commented, that a setup like this will require you to execute a lot of dynamic SQL because you don't know the identifiers beforehand.
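A minimal sketch of what that dynamic SQL might look like, assuming the tbl and addin tables above (the function name count_addin_rows is hypothetical):
-- count the rows in the table registered for a given addin;
-- format() with %I quotes the identifiers safely
CREATE FUNCTION count_addin_rows(p_addin text)
RETURNS bigint
LANGUAGE plpgsql AS
$$
DECLARE
    qry    text;
    result bigint;
BEGIN
    SELECT format('SELECT count(*) FROM %I.%I', t.sch, t.name)
    INTO   qry
    FROM   addin a
    JOIN   tbl t USING (tbl_id)
    WHERE  a.addin = p_addin;

    EXECUTE qry INTO result;  -- the identifiers are only known at run time
    RETURN result;
END
$$;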
I guess all plugins have a set of basic attributes, and then each plugin will have a set of plugin-specific attributes. If this is the case, you can use a single table together with the hstore data type (a standard extension that just needs to be installed).
Something like this:
CREATE TABLE plugins
(
plugin_name text not null primary key,
common_int_attribute integer not null,
common_text_attribute text not null,
plugin_attributes hstore
);
Then you can do something like this:
INSERT INTO plugins
(plugin_name, common_int_attribute, common_text_attribute, plugin_attributes)
VALUES
('plugin_1', 42, 'foobar', 'some_key => "the fish", other_key => 24'),
('plugin_2', 100, 'foobar', 'weird_key => 12345, more_info => "10.2.4"');
This creates two plugins named plugin_1 and plugin_2.
plugin_1 has the additional attributes "some_key" and "other_key", while plugin_2 stores the keys "weird_key" and "more_info".
You can index those hstore columns and query them very efficiently.
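For example, a GIN index on the hstore column supports the ?, ?|, ?& and @> operators used below (a sketch; the index name is arbitrary):
CREATE INDEX plugins_attributes_idx ON plugins USING gin (plugin_attributes);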
The following will select all plugins that have a key "weird_key" defined.
SELECT *
FROM plugins
WHERE plugin_attributes ? 'weird_key'
The following statement will select all plugins that have a key some_key with the value the fish:
SELECT *
FROM plugins
WHERE plugin_attributes @> 'some_key => "the fish"'
Much more convenient than using an EAV model in my opinion (and most probably a lot faster as well).
The only drawback is that you lose type-safety with this approach (but usually you'd lose that with the EAV concept as well).
You don't need an application catalog. Just add the application name to the keys of the table. This of course assumes that all the tables have the same structure. If not, use the application name for a table name, or, as others have suggested, as a schema name (which would also allow for multiple tables per application).
EDIT:
But the real issue is of course that you should first model your data, and then build the applications to manipulate it. The data should not serve the code; the code should serve the data.