Derived variable creation in Redshift - amazon-redshift

I have a table of around 3TB on a Redshift cluster. As part of a pre-processing step I need to create a few derived variables. The logic for them is very simple, e.g. a variable holding the difference of two other variables.
Currently I use an UPDATE command to create such variables. The problem with UPDATE is that it bloats the table and requires a VACUUM to free up the space. I am trying to find a way to create these derived variables without needing VACUUM. I tried creating a separate table with the derived variables and joining it on my primary key, but this is equally time-consuming, as creating the new table takes almost as long as the update.
Any other way I can achieve this which is more efficient?

Holding your new data in a separate table and joining it should be relatively quick, as long as you use DISTSTYLE KEY on both tables with the same key and you include the DISTKEY in the join between them.
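
A minimal sketch of that layout, assuming a hypothetical base table `events` with primary key `id` and numeric columns `a` and `b` (none of these names come from the question), and assuming `events` is already distributed on `id`:

```sql
-- Hypothetical names: events, events_derived, id, a, b.
-- Build the derived columns once with CREATE TABLE AS instead of UPDATE,
-- so the large base table is never rewritten and needs no VACUUM.
CREATE TABLE events_derived
DISTSTYLE KEY
DISTKEY (id)
SORTKEY (id)
AS
SELECT id,
       a - b AS a_minus_b
FROM events;

-- Joining on the shared DISTKEY lets each slice join its own rows
-- without redistributing data between nodes.
SELECT e.*, d.a_minus_b
FROM events e
JOIN events_derived d ON d.id = e.id;
```

Whether this beats the UPDATE in practice depends on the cluster and the existing distribution of the base table, so it is worth checking the query plan for redistribution steps.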

Can pg_dump set a table's sequence while also excluding its data?

I'm running pg_dump -F custom for database backups, with --exclude-table-data for a very large audit table. I'm then exporting that table data in a separate dump file. It isn't referentially integral with the main dump.
As part of my restore strategy, I'd like to be able to restore the main dump, bring my app online and continue using the database immediately, then bring the audit data back in behind it. The trouble is, as soon as new audit data comes in at sequence 1, the import of the audit data fails as soon as it tries to insert over the top of the new data.
Is it possible to include the setting of the sequence in the main dump without including the table data?
I have considered removing the primary key, but there are other tables I'd also like to do this with, and they definitely do need the PK.
I'm using PostgreSQL 13.
Instead of a sequence, which you could also build with a row number, use UUIDs and a timestamp, so you have unique values and the order of insertion doesn't matter. UUIDs are a bit slower than ints.
Another possibility is to save the last audit ID in another table and set the sequence anew from it after the restore, as in https://www.postgresql.org/docs/9.1/sql-altersequence.html
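
A rough sketch of the second suggestion, assuming a hypothetical `audit` table with a serial `id` column and a small helper table `restore_marker` that is part of the main dump (both names are illustrative, not from the question):

```sql
-- Hypothetical names: audit, audit.id, restore_marker.
-- Before taking the backups: record the highest audit id in a small
-- table whose data IS included in the main dump.
CREATE TABLE IF NOT EXISTS restore_marker (
    table_name text PRIMARY KEY,
    last_id    bigint NOT NULL
);

INSERT INTO restore_marker (table_name, last_id)
SELECT 'audit', coalesce(max(id), 0) FROM audit
ON CONFLICT (table_name) DO UPDATE SET last_id = EXCLUDED.last_id;

-- After restoring the main dump, before bringing the app online:
-- move the sequence past the recorded id so new rows cannot collide
-- with the audit data that is imported later.
SELECT setval(pg_get_serial_sequence('audit', 'id'), greatest(last_id, 1))
FROM restore_marker
WHERE table_name = 'audit';
```

This only works if the recorded id is at least as high as the highest id in the separately exported audit data, so the marker should be refreshed after that export, or some headroom added to it.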

Build table of tables from other databases in Postgres - (Multiple-Server Parallel Query Execution?)

I am trying to find the best solution to build a database relation. I need something to create a table that will contain data split across other tables from different databases. All the tables have exactly the same structure (same number of columns, same names and types).
In a single database, I would create a parent table with partitions. However, the volume of data is too big to do it in a single database, which is why I am trying to split it. From the Postgres documentation, I think what I am trying to do is "Multiple-Server Parallel Query Execution".
At the moment the only solution I can think of is to build an API of database addresses and use it to pull data across the network into the main parent database when needed. I also found an external Postgres extension called Citus that might do the job, but I don't know how to implement a unique key across multiple databases (or shards, as Citus calls them).
Is there any better way to do it?
Citus would most likely solve your problem. It lets you use a unique key across shards if the key is the distribution column, or if it is a composite key that contains the distribution column.
You can also use a distributed partitioned table in Citus. That is a table partitioned on one column (a timestamp?) and hash-distributed on another column (like the one you use in your existing approach). Query parallelization and data collection would be handled by Citus for you.
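
As an illustration only, a distributed partitioned table could look like the sketch below. The table and column names (`measurements`, `device_id`, `recorded_at`, `reading_id`) are made up; `create_distributed_table` is the Citus function for hash-distributing a table:

```sql
-- Hypothetical schema: measurements, device_id, recorded_at, reading_id.
CREATE TABLE measurements (
    device_id   bigint      NOT NULL,
    recorded_at timestamptz NOT NULL,
    reading_id  bigint      NOT NULL,
    value       double precision,
    -- The unique key contains the distribution column (device_id) and,
    -- because the table is partitioned, the partition column as well.
    PRIMARY KEY (device_id, recorded_at, reading_id)
) PARTITION BY RANGE (recorded_at);

-- One example partition; further partitions follow the same pattern.
CREATE TABLE measurements_2024_01 PARTITION OF measurements
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Hash-distribute the partitioned table across the worker nodes.
SELECT create_distributed_table('measurements', 'device_id');
```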

DB2 updated rows since last check

I want to periodically export data from db2 and load it in another database for analysis.
In order to do this, I would need to know which rows have been inserted/updated since the last time I've exported things from a given table.
A simple solution would probably be to add a timestamp to every table and use that as a reference, but I don't have such a TS at the moment, and I would like to avoid adding it if possible.
Is there any other solution for finding the rows which have been added/updated after a given time (or something else that would solve my issue)?
There is an easy option for a timestamp in Db2 (for LUW) called
ROW CHANGE TIMESTAMP
This is managed by Db2 and can be defined as HIDDEN, so existing SELECT * FROM queries will not retrieve the new column, which would otherwise cause extra cost.
Check out the Db2 CREATE TABLE documentation
This functionality was originally added for optimistic locking but can be used for such situations as well.
There is a similar concept for Db2 z/OS - you have to check that out as I have not tried this one.
Of course there are other ways to solve it, such as replication.
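
A sketch of the ROW CHANGE TIMESTAMP approach on Db2 for LUW, assuming a hypothetical table `orders` and a hypothetical column name `row_changed`; the exact clauses should be checked against the CREATE TABLE / ALTER TABLE documentation for your Db2 version:

```sql
-- Hypothetical table and column names: orders, row_changed.
-- Add a Db2-maintained change timestamp; IMPLICITLY HIDDEN keeps it
-- out of SELECT * results.
ALTER TABLE orders
    ADD COLUMN row_changed TIMESTAMP NOT NULL
    GENERATED ALWAYS FOR EACH ROW ON UPDATE AS ROW CHANGE TIMESTAMP
    IMPLICITLY HIDDEN;

-- Rows inserted or updated since the last export; the parameter marker
-- is the timestamp saved by the previous export run.
SELECT o.*
FROM orders o
WHERE ROW CHANGE TIMESTAMP FOR o > ?;
```

Note that this finds inserted and updated rows only; deleted rows leave nothing behind to select.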
That is not possible if you do not have a timestamp column. With a timestamp, you can know which are new or modified rows.
You can also use the TimeTravel feature, in order to get the new values, but that implies a timestamp column.
Another option is to put the tables in append mode and then fetch the rows after a given one. However, this option is not reliable after a reorg, and it affects performance and space utilisation.
One possible option is to use SQL replication, but that needs extra tables for staging.
Finally, another option is to read the logs with the db2ReadLog API, but that requires development work. Simply applying the archived logs to the new database is also possible; however, the database will then remain in roll-forward pending state.

Restore PostgreSQL dump with new primary key values

I've got a problem with a PostgreSQL dump / restore. We have a production application running with PostgreSQL 8.4. I need to create some values in the database in the testing environment and then import just this chunk of data into the production environment. The data is generated by the application and I need to use this approach because it needs testing before going into production.
Now that I described the environment, here is my problem:
In the testing database, I leave nothing but the data I need to move to the production database. The data is spread across multiple tables linked with foreign keys with multiple levels (like a tree). I then use pg_dump to export the desired tables into binary format.
When I try to import, the database correctly imports the root table entries with new primary key values, but does not import any of the data from the other tables. I believe the problem is that the foreign keys on the child tables no longer match the new primary keys.
Is there a way to achieve such an import which will update all the primary key values of all affected tables in the tree to correct serial (auto increment) values automatically and also update all foreign keys according to these new primary key values?
I have an idea how to do this with the help of a programming language while connected to both databases, but that would be very difficult for me to achieve since I don't have direct access to the customer's production server.
Thanks in advance!
That one seems to me like a complex migration issue. You can create PL/pgSQL migration scripts with inserts and use RETURNING to get the new serial values and use them as foreign keys for the other tables in the tree. I do not know the structure of your tree, but in some cases reading sequence values in advance into arrays may be required for complexity or performance reasons.
Another approach is to examine the production sequence values and estimate a range of values that will not be used in the near future. Fabricate the test data in the test environment so that its serial values will not collide with production sequence values. Then load that data into the prod database and adjust the sequences in the prod environment so that the test values will not be handed out again. This will leave a gap in your ID sequence, so you must check whether anything (like other processes) relies on the sequence values being continuous.
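
A minimal sketch of such a migration script, assuming a hypothetical two-level tree `parent(id serial, name)` and `child(id serial, parent_id, label)`; the real script would be generated from the test data. PostgreSQL 8.4 has no DO blocks, so the statements are wrapped in a throwaway function:

```sql
-- Hypothetical schema: parent(id serial, name),
-- child(id serial, parent_id, label).
-- On 8.4, PL/pgSQL may first need: CREATE LANGUAGE plpgsql;
CREATE FUNCTION import_test_chunk() RETURNS void AS $$
DECLARE
    new_parent_id integer;
BEGIN
    -- Insert the root row and capture the serial value that the
    -- production sequence assigns to it.
    INSERT INTO parent (name)
    VALUES ('generated in test')
    RETURNING id INTO new_parent_id;

    -- Child rows reference the captured key, not the test-side key.
    INSERT INTO child (parent_id, label)
    VALUES (new_parent_id, 'first child');
    INSERT INTO child (parent_id, label)
    VALUES (new_parent_id, 'second child');
END;
$$ LANGUAGE plpgsql;

SELECT import_test_chunk();
DROP FUNCTION import_test_chunk();
```

Because the script is a plain SQL file, it can be handed to whoever has access to the production server and run with psql.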
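
The sequence side of that second approach could be sketched as follows, assuming a hypothetical table `parent` with sequence `parent_id_seq`; the value 5000000 is an arbitrary placeholder for "well above anything production will reach soon":

```sql
-- Hypothetical names: parent, parent_id_seq; 5000000 is a placeholder.

-- Test environment, before the application generates the data:
-- start handing out ids from a range production is not using.
SELECT setval('parent_id_seq', 5000000);

-- Production, after loading the dump with its original key values:
-- move the sequence past the imported rows so they are never reissued.
SELECT setval('parent_id_seq', (SELECT max(id) FROM parent));
```

The same two statements would be repeated for every serial column in the tree.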

How to trigger creation/update of another row of record if one row is created/updated in postgresql

I receive records as CSV from an outside source. When I create or update an entry in PostgreSQL, I need to create a mirror entry that differs only in sign. This could be done at the program level, but I am curious to know whether it is possible using triggers.
The examples I can find all end with code like
FOR EACH ROW EXECUTE PROCEDURE foo()
and usually deal with checks, add additional info using NEW.additionalfield, or insert into another table. If I use a trigger this way to insert another row into the same table, it seems the trigger will be triggered again and the creation becomes recursive.
Any way to work this out?
When dealing with triggers, the rules of thumb are:
If it changes the current row, based on some business rules or other (e.g. adding extra info or processing calculated fields), it belongs in a BEFORE trigger.
If it has side effects on one or more rows in separate tables, it belongs in an AFTER trigger.
If it runs integrity checks on any table that no other built-in constraints (checks, unique keys, foreign keys, exclude, etc.) can take care of, it belongs in a CONSTRAINT [after] trigger.
If it has side effects on one or more other rows within the same table, you should probably revisit your schema, your code flow, or both.
Regarding that last point, there actually are workarounds in Postgres, such as trying to get a lock or checking xmin vs the transaction's xid, to avoid getting bogged down in recursive scenarios. PostgreSQL 9.2 additionally introduced pg_trigger_depth(). But I'd still advise against it.
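
For completeness, the pg_trigger_depth() workaround could be sketched like this, with the caveat above still applying. The table `entry` and its columns are hypothetical:

```sql
-- Hypothetical table: entry(id serial, account text, amount numeric).
CREATE FUNCTION insert_mirror_entry() RETURNS trigger AS $$
BEGIN
    -- Insert the mirror row with the sign flipped.
    INSERT INTO entry (account, amount)
    VALUES (NEW.account, -NEW.amount);
    RETURN NULL;  -- the return value of an AFTER trigger is ignored
END;
$$ LANGUAGE plpgsql;

-- The WHEN clause stops the recursion: the insert issued from inside
-- the trigger function runs at depth 1, so it does not fire again.
CREATE TRIGGER mirror_entry
    AFTER INSERT ON entry
    FOR EACH ROW
    WHEN (pg_trigger_depth() = 0)
    EXECUTE PROCEDURE insert_mirror_entry();
```

One side effect of this guard is that the mirror row is also skipped when the original insert itself comes from some other trigger.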
Note that a constraint trigger can be created as deferrable initially deferred. This will delay the constraint trigger until the very end of the transaction, rather than immediately after the statement.
Your question and nickname hint that you're wondering how to automatically balance a set of lines in a double-entry book-keeping application. Assuming so, do NOT create the balancing entry automatically. Instead, begin a transaction, enter each line separately, and have a (for each row, deferrable initially deferred) constraint trigger pick things up from there and reject the entire batch if anything is unbalanced. Proceeding that way will spare you a mountain of headaches when you want to balance more than two or three lines with each other.
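
A sketch of what such a deferred constraint trigger might look like, assuming a hypothetical `ledger_line` table in which the lines of one batch share a `batch_id` and must sum to zero:

```sql
-- Hypothetical schema: ledger_line(id, batch_id, amount).
CREATE TABLE ledger_line (
    id       bigserial PRIMARY KEY,
    batch_id bigint  NOT NULL,
    amount   numeric NOT NULL
);

CREATE FUNCTION check_batch_balanced() RETURNS trigger AS $$
BEGIN
    -- Runs at commit time, so it sees every line of the batch.
    IF (SELECT coalesce(sum(amount), 0)
        FROM ledger_line
        WHERE batch_id = NEW.batch_id) <> 0 THEN
        RAISE EXCEPTION 'batch % is not balanced', NEW.batch_id;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- Deferred until the end of the transaction; an exception there rolls
-- back the whole batch. Deletes and changes of batch_id are not
-- covered by this sketch.
CREATE CONSTRAINT TRIGGER ledger_line_balanced
    AFTER INSERT OR UPDATE ON ledger_line
    DEFERRABLE INITIALLY DEFERRED
    FOR EACH ROW EXECUTE PROCEDURE check_batch_balanced();
```

With this in place, a transaction that inserts +100 and -100 for the same batch commits, while one that inserts only the +100 line fails at COMMIT.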
Another reading might be that you want to create an audit trail. If so, create additional audit tables and use after triggers to populate them. There are multiple ways to create and manage these audit tables. Look into slowly changing dimensions. (Fwiw, type 6 with a start_end column of type tsrange or tstzrange works well for the audit tables if you're interested in a table's full history including its history of relationships with other audit tables.) Use the "live" tables for your application to keep things fast, and use the audit-tables when you need historical reporting.