snowflake data transfer and formatting - snowflake-task

We are in the process of creating a data warehouse in Snowflake. Our developers have used Stitch to transfer all existing data to a database in Snowflake, and they update it daily.
I am now reshaping the table structure into something legible for use with Power BI and third parties, which hasn't been a problem. I have written all the necessary SQL for the tables and/or views I want, and I have inserted the existing data into this new structure.
However, my issue is: how do I then load any new data that comes along?
For example, the developers have a table called SN_BARCODE that holds the barcodes for all products. The fields in that table are:
APNS (VARCHAR)
, BARCODEID (NUMBER)
, DATECREATED (TIMESTAMP)
, DATEMODIFIED (TIMESTAMP)
, DATEUPDATED (TIMESTAMP)
, ID (NUMBER)
, PRODUCTID (NUMBER)
, VENUEID (NUMBER)
, _SDC_BATCHED_AT(TIMESTAMP)
, _SDC_RECEIVE_AT (TIMESTAMP)
, _SDC_SEQUENCE (TIMESTAMP)
, _SDC_TABLE_VERSION (TIMESTAMP)
to which I apply the following:
CREATE OR REPLACE Table pc_stitch_db.Dim_BarCodes
(Barcodes_Id int, Barcodes_ProductId, Barcodes_APN, PRIMARY KEY (Barcodes_Id)) AS
SELECT
"PC_STITCH_DB".sn_barcode.barcodeid AS Barcodes_Id,
"PC_STITCH_DB".sn_product.id AS Barcodes_ProductId,
"PC_STITCH_DB".sn_barcode.apns AS Barcodes_APN
FROM "PC_STITCH_DB".sn_Barcode
INNER JOIN "PC_STITCH_DB".sn_product ON (
"PC_STITCH_DB".sn_product.PRODUCTID = "PC_STITCH_DB".sn_barcode.productid
AND "PC_STITCH_DB".sn_product.venueid = "PC_STITCH_DB".sn_barcode.venueid
);
How can I update these tables daily after Stitch has loaded the data from the other source?

You have the dilemma of simplicity vs. efficiency.
You can easily create TASKs that recreate your tables on a schedule (e.g. daily) using statements like the CREATE OR REPLACE TABLE statement you provided above. This means your data is reloaded completely every day.
However, if you have huge and growing tables, that approach will eventually cause capacity problems. You then have to modify your solution to do incremental updates.
There are two main types of incremental updates:
appending data using INSERT
merging data using MERGE (in general more complex and more flexible)
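As a rough sketch of the MERGE variant, the task below upserts recently loaded barcodes into Dim_BarCodes once a day. The task name, the warehouse name (my_wh), the cron time, and the two-day look-back window on _SDC_BATCHED_AT are assumptions for illustration; adjust them to your account and to when Stitch finishes loading.

```sql
-- Hypothetical task and warehouse names; the schedule is an assumption.
CREATE OR REPLACE TASK pc_stitch_db.refresh_dim_barcodes
  WAREHOUSE = my_wh
  SCHEDULE = 'USING CRON 0 3 * * * UTC'
AS
MERGE INTO pc_stitch_db.Dim_BarCodes AS tgt
USING (
    SELECT b.barcodeid AS Barcodes_Id,
           p.id        AS Barcodes_ProductId,
           b.apns      AS Barcodes_APN
    FROM pc_stitch_db.sn_barcode b
    INNER JOIN pc_stitch_db.sn_product p
            ON p.productid = b.productid
           AND p.venueid   = b.venueid
    -- Assumption: only rows Stitch batched in the last two days need checking
    WHERE b._sdc_batched_at >= DATEADD(day, -2, CURRENT_TIMESTAMP())
) AS src
ON tgt.Barcodes_Id = src.Barcodes_Id
WHEN MATCHED THEN UPDATE SET
    tgt.Barcodes_ProductId = src.Barcodes_ProductId,
    tgt.Barcodes_APN       = src.Barcodes_APN
WHEN NOT MATCHED THEN INSERT (Barcodes_Id, Barcodes_ProductId, Barcodes_APN)
    VALUES (src.Barcodes_Id, src.Barcodes_ProductId, src.Barcodes_APN);

-- Tasks are created suspended and must be resumed before they run.
ALTER TASK pc_stitch_db.refresh_dim_barcodes RESUME;
```

Note that MERGE expects at most one source row per target row, so this only works if the join yields a single row per barcodeid. Changes that arrive only on sn_product are not picked up by this filter and would need their own condition.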

Related

Storing duplicate data as a column in Postgres?

In a database project, I have a users table with a computed value, avg_service_rating. There is another table called services with all the services associated with the user and the ratings for those services. Is there a computationally light way to maintain avg_service_rating without updating it every time an INSERT is done on the services table? Perhaps something like a generated column, but with a function call instead? Any direct advice or links to resources would be greatly appreciated as well!
CREATE TABLE users (
username VARCHAR PRIMARY KEY,
avg_service_ratings NUMERIC, -- is it possible to store some function call for this column?
...
);
CREATE TABLE service (
username VARCHAR NOT NULL REFERENCES users (username),
service_date DATE NOT NULL,
rating INTEGER,
PRIMARY KEY (username, service_date)
);
If the values should be consistent, a generated column won't fit the bill, since it is only recomputed if the row itself is modified.
I see two solutions:
Have a trigger on the services table that updates the users table whenever a rating is added or modified. That slows down data modifications, but not your queries.
Turn users into a view. The original users table would be renamed and would lose the avg_service_rating column, which the view computes on the fly instead.
To make the illusion perfect, create an INSTEAD OF INSERT OR UPDATE OR DELETE trigger on the view that modifies the underlying table. Then your application does not need to be changed.
With this solution you pay a certain price both on SELECT and on data modifications, but the latter price will be lower, since you don't have to modify two tables (and users might receive fewer modifications than services). An added advantage is that you avoid data duplication.
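A minimal sketch of the view approach, using the table definitions from the question. The name users_base for the renamed table is an assumption, and the INSTEAD OF trigger mentioned above is left out for brevity.

```sql
-- Hypothetical name for the renamed base table
ALTER TABLE users RENAME TO users_base;
ALTER TABLE users_base DROP COLUMN avg_service_ratings;

-- The view takes over the old name and computes the average on the fly
CREATE VIEW users AS
SELECT u.*,
       (SELECT avg(s.rating)
        FROM service s
        WHERE s.username = u.username) AS avg_service_ratings
FROM users_base u;
```

The primary key on service (username, service_date) already provides an index that the correlated subquery can use to find a user's rows.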
A generated column would only be useful if the source data is in the same table row.
Otherwise your options are a view (where you could call a function or calculate the value via a subquery), or an AFTER UPDATE OR INSERT trigger on the service table, which updates users.avg_service_ratings. With a trigger, if you get a lot of updates on the service table you'd need to consider possible concurrency issues, but it would mean the figure doesn't need to be calculated every time a row in the users table is accessed.
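For the trigger option, a rough sketch might look like this. The function and trigger names are made up for illustration. It only handles INSERT and UPDATE as described above, so deletes and changes to username would need extra handling.

```sql
CREATE FUNCTION refresh_avg_service_rating() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    -- Recompute the average for the user whose rating was added or changed
    UPDATE users
    SET avg_service_ratings = (SELECT avg(rating)
                               FROM service
                               WHERE username = NEW.username)
    WHERE username = NEW.username;
    RETURN NULL;  -- the return value is ignored for AFTER row triggers
END;
$$;

-- EXECUTE FUNCTION needs PostgreSQL 11+; older versions use EXECUTE PROCEDURE
CREATE TRIGGER service_rating_changed
AFTER INSERT OR UPDATE ON service
FOR EACH ROW EXECUTE FUNCTION refresh_avg_service_rating();
```

Because every change to service updates the matching users row, concurrent writes for the same user will queue behind each other on that row's lock.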

How to backup whole table into a single field item?

I have a few very small tables (a total of ~1000 rows) that I want to back up regularly into the same DB, into a single table. I know it sounds weird, but hear me out.
Let's say that the tables I want to back up are named linux_commands and windows_commands. These two tables have roughly these columns: id (pkey), name, definition, config (jsonb), commands.
I want to back these up every day into a table called commands_backup, and I want this new table to have a date field, a field for windows_commands, and another one for linux_commands, so three columns in total. Each day, a script would run and write the current date to the date field, then fetch the whole linux_commands table and write it to the related field in a single row, then do the same for windows_commands.
How would you set up something like this? Also, what is the best data type for storing a whole data set in a single item?
In the target table, windows_commands and linux_commands should be type jsonb.
Then you can use:
INSERT INTO commands_backup VALUES (
current_date,
(SELECT jsonb_agg(to_jsonb(linux_commands)) FROM linux_commands),
(SELECT jsonb_agg(to_jsonb(windows_commands)) FROM windows_commands)
);
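A minimal sketch of the target table and of reading a backup back out. The column name backup_date is an assumption (the question only asks for "a date field"). Listing the columns explicitly in the INSERT avoids depending on the column order of the target table.

```sql
CREATE TABLE commands_backup (
    backup_date      date PRIMARY KEY,  -- one row per day
    linux_commands   jsonb,
    windows_commands jsonb
);

INSERT INTO commands_backup (backup_date, linux_commands, windows_commands)
VALUES (
    current_date,
    (SELECT jsonb_agg(to_jsonb(l)) FROM linux_commands l),
    (SELECT jsonb_agg(to_jsonb(w)) FROM windows_commands w)
);

-- Expand one day's backup into rows shaped like linux_commands
SELECT r.*
FROM commands_backup b
CROSS JOIN LATERAL
     jsonb_populate_recordset(NULL::linux_commands, b.linux_commands) AS r
WHERE b.backup_date = current_date;
```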

"ON UPDATE" equivalent for Amazon Redshift

I want to create a table that has a column updated_date that is set to SYSDATE every time any field in that row is updated. How should I do this in Redshift?
You can define the table as below, which ensures that whenever you insert a record, the column is populated with SYSDATE.
create table test(
id integer not null,
update_at timestamp DEFAULT SYSDATE);
What about every time a field is updated?
Remember, Redshift is a data warehouse solution, not a simple database, so updates should be avoided or minimized.
UPDATE = DELETE + INSERT
Ideally, instead of updating a record, you should delete it and insert it again. That takes care of populating update_at, since an update is effectively DELETE + INSERT anyway.
Also, most of us use ETL tools. If you are using a staging table such as stg_sales to populate your data, the same solution works, and you could do something like the following.
DELETE from SALES where id in (select Id from stg_sales);
INSERT INTO SALES select id from stg_sales;
Hope this answers your question.
Redshift doesn't support UPSERTs, so you should load your data to a temporary/staging table first and check for IDs in the main tables, which also exist in the staging table (i.e. which need to be updated).
Delete those records, and INSERT the data from the staging table, which will have the new updated_date.
Also, don't forget to run VACUUM on your tables every once in a while, because your use case involves a lot of DELETEs and UPDATEs.
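The staging pattern described above could look roughly like this. It assumes sales has the same two columns as the test table above (id and update_at) and that stg_sales carries the incoming ids. Wrapping the two statements in a transaction is an addition here, so that other sessions never see the rows missing.

```sql
BEGIN;

-- Remove the rows that are about to be replaced
DELETE FROM sales
USING stg_sales
WHERE sales.id = stg_sales.id;

-- Re-insert them; update_at is set explicitly rather than via the DEFAULT
INSERT INTO sales (id, update_at)
SELECT id, SYSDATE
FROM stg_sales;

COMMIT;

-- VACUUM cannot run inside a transaction block, so it comes after COMMIT
VACUUM sales;
```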

Indexing for efficient querying and pagination of financial data in PostgreSQL

I'm working on an API that needs to return a list of financial transactions. These records are held in 6 different tables, but all have 3 common fields:
transaction_id int NOT NULL,
account_id bigint NOT NULL,
created timestamptz NOT NULL
Note: this might actually have been a good use of table inheritance in PostgreSQL, but it wasn't done like that.
The business requirement is to return all transactions for a given account_id in one list sorted by created in descending order (similar to an online banking page where your last transaction is at the top). Originally, they wanted to paginate in groups of 50 records, but I've got them to do it on date ranges (believing that I can do it more efficiently in the database than by using offsets and limits).
My intent is to create an index on each of these tables like this:
CREATE INDEX idx_table_1_account_created ON table_1(account_id, created desc);
ALTER TABLE table_1 CLUSTER ON idx_table_1_account_created;
Finally, I plan to create a view that unions all of the records from the 6 tables into one list. The records from the 6 tables will then obviously need to be re-sorted to produce a unified list in the correct order. The call will look like:
SELECT * FROM vw_all_transactions
WHERE account_id = 12345678901234
AND created >= '2014-01-01' AND created < '2014-02-01'
ORDER BY created desc;
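For illustration, the view described above might be defined roughly as follows. Only table_1 and the three common columns come from the description; table_2 and the source_table column are hypothetical, and the remaining tables would follow the same pattern.

```sql
CREATE VIEW vw_all_transactions AS
SELECT transaction_id, account_id, created, 'table_1'::text AS source_table
FROM table_1
UNION ALL
SELECT transaction_id, account_id, created, 'table_2'::text AS source_table
FROM table_2;
-- ...plus one UNION ALL branch for each of the other four tables
```

Checking the plan of the call above with EXPLAIN would show whether the account_id and created conditions reach the per-table indexes.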
My question is about the indexing and clustering scheme. Since the records are going to have to be re-sorted by the view anyway, is there any reason to specify the individual indexes as created desc? And does sorting this way carry any penalties when periodically calling CLUSTER?
I've done some googling and reading but can't really seem to find any information that answers how this clustering is going to work.
Using PostgreSQL 9.2 on Heroku.

How to maintain record history on table with one-to-many relationships?

I have a "services" table for detailing services that we provide. Among the data that needs recording are several small one-to-many relationships (all with a foreign key constraint to the service_id) such as:
service_owners -- user_ids responsible for delivery of service
service_tags -- e.g. IT, Records Management, Finance
customer_categories -- ENUM value
provider_categories -- ENUM value
software_used -- self-explanatory
The problem I have is that I want to keep a history of updates to a service, for which I'm using an update trigger on the table, that performs an insert into a history table matching the original columns. However, if a normalized approach to the above data is used, with separate tables and foreign keys for each one-to-many relationship, any update on these tables will not be recognised in the history of the service.
Does anyone have any suggestions? It seems like I need to store child keys in the service table to maintain the integrity of the service history. Is a delimited text field a valid approach here or, as I am using postgreSQL, perhaps arrays are also a valid option? These feel somewhat dirty though!
Thanks.
If your table is:
create table T (
ix int identity primary key,
val nvarchar(50)
)
And your history table is:
create table THistory (
ix int identity primary key,
val nvarchar(50),
updateType char(1), -- C=Create, U=Update or D=Delete
updateTime datetime,
updateUsername sysname
)
Then you just need to put an update trigger on all tables of interest. You can then find out what the state of any or all of the tables was at any point in history, to determine what the relationships were at that time.
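Since the question is about PostgreSQL, here is a rough PostgreSQL sketch of such a trigger for one child table. The service_tags columns (service_id, tag) and all of the history object names are assumptions for illustration; the same pattern would be repeated for each table of interest.

```sql
CREATE TABLE service_tags_history (
    service_id      integer,
    tag             text,
    update_type     char(1),                   -- C=Create, U=Update or D=Delete
    update_time     timestamptz DEFAULT now(),
    update_username text DEFAULT current_user
);

CREATE FUNCTION log_service_tags() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO service_tags_history (service_id, tag, update_type)
        VALUES (OLD.service_id, OLD.tag, 'D');
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO service_tags_history (service_id, tag, update_type)
        VALUES (NEW.service_id, NEW.tag, 'U');
    ELSE
        INSERT INTO service_tags_history (service_id, tag, update_type)
        VALUES (NEW.service_id, NEW.tag, 'C');
    END IF;
    RETURN NULL;  -- the return value is ignored for AFTER row triggers
END;
$$;

-- EXECUTE FUNCTION needs PostgreSQL 11+; older versions use EXECUTE PROCEDURE
CREATE TRIGGER service_tags_log
AFTER INSERT OR UPDATE OR DELETE ON service_tags
FOR EACH ROW EXECUTE FUNCTION log_service_tags();
```

With update_time recorded on every child history table, the tags (or owners, categories, and so on) that applied to a service at a given moment can be found by filtering each history table on service_id and time.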
I'd avoid using arrays in any database whenever possible.
I don't like updates for the exact reason you are describing here...you lose information as it's overwritten. My answer is quite simple...don't update. Not sure if you're at a point where this can be implemented...but if you can, I'd recommend using the main table itself to store the history (no need for a second set of history tables).
Add a column to your main header table called 'active'. This can be a character or a bit (0 is off and 1 is on). Then it's a bit of trigger magic...when an update is performed, you insert a row into the table identical to the record being overwritten with a status of '0' (or inactive) and then update the existing row (this process keeps the ID column on the active record the same; the newly inserted record is the inactive one with a new ID).
This way no data is ever lost (admittedly you are storing quite a few rows...) and the history can easily be viewed with a select where active = 0.
The pain here is if you are working on something already implemented...every existing query that hits this table will need to be updated to include a check for the active column. That makes this solution very easy to implement if you are designing a new system, but a pain if it's a long-standing application. Unfortunately, existing reports will include both off and on records (without throwing an error) until you can modify the WHERE clause.
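A rough PostgreSQL sketch of the "trigger magic" described above. The services table shape (id, name, active) is hypothetical, and a boolean is used for the active flag instead of a character or bit.

```sql
-- Hypothetical table for illustration
CREATE TABLE services (
    id     serial PRIMARY KEY,
    name   text NOT NULL,
    active boolean NOT NULL DEFAULT true
);

CREATE FUNCTION services_keep_old_version() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    -- Copy the values being overwritten into a new, inactive row
    INSERT INTO services (name, active) VALUES (OLD.name, false);
    RETURN NEW;  -- let the update of the active row proceed
END;
$$;

-- EXECUTE FUNCTION needs PostgreSQL 11+; older versions use EXECUTE PROCEDURE
CREATE TRIGGER services_versioning
BEFORE UPDATE ON services
FOR EACH ROW
WHEN (OLD.active)
EXECUTE FUNCTION services_keep_old_version();
```

Because the inactive copy gets a new id, something else (a shared business key, or an extra column pointing at the active row) is needed to tell which service an old version belongs to.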