"ON UPDATE" equivalent for Amazon Redshift - amazon-redshift

I want a create a table that has a column updated_date that is updated to SYSDATE every time any field in that row is updated. How should I do this in Redshift?

You should be creating table definition like below, that will make sure whenever you insert the record, it populates sysdate.
create table test(
id integer not null,
update_at timestamp DEFAULT SYSDATE);
Every time field update?
Remember, Redshift is DW solution, not a simple database, hence updates should be avoided or minimized.
UPDATE= DELETE + INSERT
Ideally instead of updating any record, you should be deleting and inserting it, so takes care of update_at population while updating which is eventually, DELETE+INSERT.
Also, most of use ETLs, you may using stg_sales table for populating you date, then also, above solution works, where you could do something like below.
DELETE from SALES where id in (select Id from stg_sales);
INSERT INTO SALES select id from stg_sales;
Hope this answers your question.

Redshift doesn't support UPSERTs, so you should load your data to a temporary/staging table first and check for IDs in the main tables, which also exist in the staging table (i.e. which need to be updated).
Delete those records, and INSERT the data from the staging table, which will have the new updated_date.
Also, don't forget to run VACUUM on your tables every once in a while, because your use case involves a lot of DELETEs and UPDATEs.
Refer this for additional info.

Related

Storing duplicate data as a column in Postgres?

In some database project, I have a users table which somehow has a computed value avg_service_rating. And there is another table called services with all the services associated to the user and the ratings for that service. Is there a computationally-lite way which I can maintain the avg_service_rating rating without updating it every time an INSERT is done on the services table? Perhaps like a generate column but with a function call instead? Any direct advice or link to resources will be greatly appreciated as well!
CREATE TABLE users (
username VARCHAR PRIMARY KEY,
avg_service_ratings NUMERIC -- is it possible to store some function call for this column?,
...
);
CREATE TABLE service (
username VARCHAR NOT NULL REFERENCE users (username);
service_date DATE NOT NULL,
rating INTEGER,
PRIMARY KEY (username, service_date),
);
If the values should be consistent, a generated column won't fit the bill, since it is only recomputed if the row itself is modified.
I see two solutions:
have a trigger on the services table that updates the users table whenever a rating is added or modified. That slows down data modifications, but not your queries.
Turn users into a view. The original users table would be renamed, and it loses the avg_service_rating column, which is computed on the fly by the view.
To make the illusion perfect, create an INSTEAD OF INSERT OR UPDATE OR DELETE trigger on the view that modifies the underlying table. Then your application does not need to be changed.
With this solution you pay a certain price both on SELECT and on data modifications, but the latter price will be lower, since you don't have to modify two tables (and users might receive fewer modifications than services). An added advantage is that you avoid data duplication.
A generated column would only be useful if the source data is in the same table row.
Otherwise your options are a view (where you could call a function or calculate the value via a subquery), or an AFTER UPDATE OR INSERT trigger on the service table, which updates users.avg_service_ratings. With a trigger, if you get a lot of updates on the service table you'd need to consider possible concurrency issues, but it would mean the figure doesn't need to be calculated every time a row in the users table is accessed.

How to backup whole table into a single field item?

I have few very small tables (a total of ~1000 rows) that I want to backup regularly into the same DB, to a single table. I know it sounds weird but hear me out.
Let's say that the tables I want to backup are named linux_commands, and windows_commands. These two tables have roughly: id (pkey), name, definition, config (jsonb), commands.
I want to back these up everyday into a table called commands_backup and I want this new table to have a date field, a field for windows_commands, and another one for linux_commands, so three columns in total. Each day, a script would run and write current date to date field, and then fetch whole linux_commands table and write it to related field in a single row, then do the same for windows_commands.
How would you setup something like this? Also, what is the best data type for storing whole data set in a single item?
In the target table, windows_commands and linux_commands should be type jsonb.
Then you can use:
INSERT INTO commands_backup VALUES (
current_date,
(SELECT jsonb_agg(to_jsonb(linux_commands)) FROM linux_commands),
(SELECT jsonb_agg(to_jsonb(windows_commands)) FROM windows_commands)
);

Wrong order showing in database data inserted using seeds.exs [duplicate]

Folks, I have the following table:
CREATE TABLE IF NOT EXISTS users(
userid CHAR(100) NOT NULL,
assetid text NOT NULL,
date timestamp NOT NULL,
PRIMARY KEY(userid, assetid)
);
After I run a few insert queries such as :
INSERT INTO users (userid, assetid, date) VALUES ( foo, bar, now() );
I would like to retrieve records in order they were stored in the database. However, i seem to be getting back records not in order.
How should I modify my retrieve statement?
SELECT * FROM users WHERE userid=foo;
I would like the result to be sorted in order things were stored :)
Thanks!
To expand on Mark's correct answer, PostgreSQL doesn't keep a row timestamp, and it doesn't keep rows in any particular order. Unless you define your tables with a column containing the timestamp they were inserted at, there is no way to select them in the order they were inserted.
PostgreSQL will often return them in the order they were inserted in anyway, but that's just because they happen to be in that order on the disk. This will change as you do updates on the table, or deletes then later inserts. Operations like vacuum full also change it. You should never, ever rely on the order without an explicit order by clause.
Also, if you want the insertion timestamp to differ for rows within a transaction you can use clock_timestamp instead of now(). Also, please use the SQL-standard current_timestamp instead of writing now().
Assuming your date column holds different timestamps for each item, by using the ORDER BY clause:
SELECT * FROM users WHERE userid=foo ORDER BY "date";
However, if you inserted a large number of records in a single transaction, the date column value will probably be the same for all of them - if so, there is no way to tell which was inserted first (from the information given).

Redshift query a daily-generated table

I am looking for a way to create a Redshift query that will retrieve data from a table that is generated daily. Tables in our cluster are of the form:
event_table_2016_06_14
event_table_2016_06_13
.. and so on.
I have tried writing a query that appends the current date to the table name, but this does not seem to work correctly (invalid operation):
SELECT * FROM concat('event_table_', to_char(getdate(),'YYYY_MM_DD'))
Any suggestions on how this can be performed are greatly appreciated!
I have tried writing a query that appends the current date to the
table name, but this does not seem to work correctly (invalid
operation):
Redshift does not support that. But you most likely won't need it.
Try the following (expanding on the answer from #ketan):
Create your main table with appropriate (for joins) DIST key, and COMPOUND or simple SORT KEY on timestamp column, and proper compression on columns.
Daily, create a temp table (use CREATE TABLE ... LIKE - this will preserve DIST/SORT keys), load it with daily data, VACUUM SORT.
Copy sorted temp table into main table using ALTER TABLE APPEND - this will copy the data sorted, and will reduce VACUUM on the main table. You may still need VACUUM SORT after that.
After that query your main table normally, probably giving it a range on timestamp. Redshift is optimised for these scenarios, and 99% of times you don't need to optimise table scans yourself - even on tables with billion of rows scans take milliseconds to few seconds. You may need to optimise elsewhere, but that's the second step.
To get insight in the performance of scans, use STL_QUERY system table to find your query ID, and then use STL_SCAN (or SVL_QUERY_SUMMARY) table to see how fast the scan was.
Your example is actually the main use case for ALTER TABLE APPEND.
I am assuming that you are creating a new table everyday.
What you can do is:
Create a view on top of event_table_* tables. Query your data using this view.
Whenever you create or drop a table, update the view.
If you want, you can avoid #2: Instead of creating a new table everyday, create empty tables for next 1-2 years. So, no need to update the view every day. However, do remember that there is an upper limit of 9,900 tables in Redshift.
Edit: If you always need to query today's table (instead of all tables, as I assumed originally), I don't think you can do that without updating your view.
However, you can modify your design to have just one table, with date as sort-key. So, whenever your table is queried with some date, all disk blocks that don't have that date will be skipped. That'll be as efficient as having time-series tables.

Insert data from staging table into multiple, related tables?

I'm working on an application that imports data from Access to SQL Server 2008. Currently, I'm using a stored procedure to import the data individually by record. I can't go with a bulk insert or anything like that because the data is inserted into two related tables...I have a bunch of fields that go into the Account table (first name, last name, etc.) and three fields that will each have a record in an Insurance table, linked back to the Account table by the auto-incrementing AccountID that's selected with SCOPE_IDENTITY in the stored procedure.
Performance isn't very good due to the number of round trips to the database from the application. For this and some other reasons I'm planning to instead use a staging table and import the data from there. Reading up on my options for approaching this, a cursor that executes the same insert stored procedure on the data in the staging table would make sense. However it appears that cursors are evil incarnate and should be avoided.
Is there any way to insert data into one table, retrieve the auto-generated IDs, then insert data for the same records into another table using the corresponding ID, in a set-based operation? Or is a cursor my only option here?
Look at the OUTPUT clause. You should be able to add it to your INSERT statement to do what you want.
BTW, if you need to output columns into the second table that weren't inserted into the first one, then use MERGE instead of INSERT (as suggested in the comment to the original question) as its OUTPUT clause supports referencing other columns from the source table(s). Otherwise, keeping it with an INSERT is more straightforward, and it does give you access to the inserted identity column.
I'm having experiment to worked out in inserting multiple record into related table using databinding. So, try this!
Hopefully this is very helpful. Follow this link How to insert record into related tables. for more information.