Redshift: drop a table under a condition

I have an ETL process that loads data into a Redshift table (tbl1_tmp).
If data was loaded (count > 0), I want to drop another table (tbl1) and rename tbl1_tmp --> tbl1.
Can I write SQL that drops a table only when a certain condition is met (count > 0 in my case)?
Thanks

Amazon Redshift basically runs SQL. It is not an ETL tool itself.
You would need to write some code to check the contents of a table and then, if desired, issue a DROP TABLE command. There is no ability to "drop table in certain conditions".
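As a rough illustration of that external check, here is a minimal Python sketch. The function name swap_if_loaded is hypothetical, the cursor is assumed to be any DB-API cursor (for example one from psycopg2 connected to Redshift), and the in-memory SQLite database in the demo is only a stand-in to exercise the control flow, not a Redshift substitute.

```python
import sqlite3  # stand-in driver for the demo; a real job would use a Redshift connection


def swap_if_loaded(cur, staging="tbl1_tmp", target="tbl1"):
    """Replace `target` with `staging` only when `staging` holds rows.

    `cur` is any DB-API cursor. The table names are assumed to be trusted
    identifiers, since identifiers cannot be bound as query parameters.
    Returns True when the swap was issued, False otherwise.
    """
    cur.execute(f"SELECT COUNT(*) FROM {staging}")
    (row_count,) = cur.fetchone()
    if row_count == 0:
        return False  # nothing was loaded: leave the current table alone
    cur.execute(f"DROP TABLE IF EXISTS {target}")
    cur.execute(f"ALTER TABLE {staging} RENAME TO {target}")
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE tbl1 (id INTEGER)")
    cur.execute("CREATE TABLE tbl1_tmp (id INTEGER)")
    # The staging table is empty here, so no swap should be issued.
    print(swap_if_loaded(cur))
```

Committing (or rolling back) is left to the caller, so the drop and the rename can share one transaction with whatever else the ETL step does.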

Related

Auto generate script for CREATE TABLE including all indices, constraints, etc (not via SSMS)

I have a data anonymization process that takes a production copy of a database and turns it into an anonymized copy by UPDATE-ing some columns.
Some of the tables contain several million rows so instead of UPDATE-ing the columns, which is very log intensive, I went down the way of
SELECT
Id,
CAST('Redacted' AS NVARCHAR(255)) [ColumnRequiringAnonymization]
INTO MyTable_New
FROM MyTable
EXEC sp_rename MyTable, MyTable_old
EXEC sp_rename MyTable_new, MyTable
DROP TABLE MyTable_old
The problem with this approach is that the "new" table no longer has any of the keys, indices and other dependent objects. I have figured out the keys and indices using SPs to generate the DROP and CREATE scripts. The SPs are based on manually written SQL as can be seen e.g. in this answer.
The next problem is that we have a schemabound view on top of this table, which has indices and a full-text index on its own. The number of SPs to generate scripts is growing and I am sure there will be mistakes.
Is there a way to completely script a table/view by using SQL commands only? I.e., just like SSMS does when you click "Script table as - CREATE to", but within a stored procedure?
Right-click on the database, select Tasks; there is Generate Scripts there. Just follow prompts or Google for additional information.

Best practices for performing a table swap in Redshift

We're in the process of running a handful of hourly scripts on our Redshift cluster which build summary tables for data consumers. After assembling a staging table, the script then runs a transaction which deletes the existing table and replaces it with the staging table, as such:
BEGIN;
DROP TABLE IF EXISTS public.data_facts;
ALTER TABLE public.data_facts_stage RENAME TO data_facts;
COMMIT;
The problem with this operation is that long-running analysis queries will place an AccessShareLock on public.data_facts, preventing it from being dropped and thrashing our ETL cycle. I'm thinking a better solution would be one which renames the existing table, as such:
ALTER TABLE public.data_facts RENAME TO data_facts_old;
ALTER TABLE public.data_facts_stage RENAME TO data_facts;
DROP TABLE public.data_facts_old;
However, this approach presupposes that 1) public.data_facts exists, and 2) public.data_facts_old does not exist.
Do you know if there's a way to conduct this operation safely in SQL, without relying on application logic? (eg. something like ALTER TABLE IF EXISTS).
I haven't tried it but looking at the documentation of CREATE VIEW it seems that this can be done with late-binding views.
The main idea would be a view public.data_facts that users interact with. Behind the scenes, you can load new data and then swap the view to “point” to the new table.
Bootstrap
-- load data into public.data_facts_v0
CREATE VIEW public.data_facts AS
SELECT * from public.data_facts_v0 WITH NO SCHEMA BINDING;
Update
-- load data into public.data_facts_v1
CREATE OR REPLACE VIEW public.data_facts AS
SELECT * from public.data_facts_v1 WITH NO SCHEMA BINDING;
DROP TABLE public.data_facts_v0;
The WITH NO SCHEMA BINDING means the view will be late-binding. “A late-binding view doesn't check the underlying database objects, such as tables and other views, until the view is queried.” This means the update can even introduce a table with renamed columns or a completely new structure.
Notes:
It might be a good idea to wrap the swap operations into a transaction to make sure we don't drop the previous table if the VIEW swap failed.
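A minimal Python sketch of that transactional swap follows. It assumes a DB-API connection whose driver opens a transaction implicitly on the first statement (as psycopg2 does with autocommit off); the function name swap_view and its parameters are hypothetical, and the object names are assumed to be trusted identifiers.

```python
def swap_view(conn, view, new_table, old_table=None):
    """Repoint `view` at `new_table`, then drop `old_table`, in one transaction.

    `conn` is any DB-API connection. If either statement fails, the
    transaction is rolled back, so the old table is not dropped unless
    the view was repointed successfully.
    Returns the list of SQL statements that were executed.
    """
    statements = [
        f"CREATE OR REPLACE VIEW {view} AS "
        f"SELECT * FROM {new_table} WITH NO SCHEMA BINDING",
    ]
    if old_table is not None:
        statements.append(f"DROP TABLE {old_table}")

    cur = conn.cursor()
    try:
        for sql in statements:
            cur.execute(sql)
        conn.commit()
    except Exception:
        conn.rollback()  # leave the view and the old table as they were
        raise
    return statements
```

For the update step above, the call would look like swap_view(conn, "public.data_facts", "public.data_facts_v1", "public.data_facts_v0").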
You can add a new column, load_time timestamp encode runlength default getdate(), to your target table, and make your ETL do this:
INSERT INTO public.data_facts
SELECT * FROM public.data_facts_staging;
DELETE FROM public.data_facts
WHERE load_time<(select max(load_time) from public.data_facts);
DROP TABLE public.data_facts_staging;
note: public.data_facts_staging should have exactly the same structure as public.data_facts except that the last column of public.data_facts is load_time, so that on insert it will be populated with the current timestamp.
The only implication is that it requires extra disk space for the moment between inserting the new rows and deleting the old ones, and load_time always has to be the last column. Also, you have to vacuum the table every time you do this.
Another good thing about this is that if your ETL fails and staging table is empty or there is no staging table you won't lose your data. In the pure SQL scenario of swapping tables with DDL you're not protected from dropping the target table when staging table is missing. In the suggested scenario if no new rows are inserted the delete statement deletes nothing (there are no rows less than max load time), so worst case is just having the old version of data.
p.s. there is a command that instead of insert ... select ... just changes the pointer from staging to target table (alter table ... append from ...) but it requires the same type of lock as alter table I guess, so I don't suggest this

Delete rows from a table if table exists in Redshift otherwise ignore deletion

I am using Redshift. I want a query to delete selected rows from a redshift table if the table exists otherwise just ignore the statement.
Redshift's SQL dialect doesn't contain control-of-flow statements like IF.. THEN so you are not going to be able to do this in a single SQL statement.
Your application or process will need to first query the Redshift table metadata to determine if a table exists e.g.
select 1 from pg_tables where schemaname = 'myschema' and tablename = 'mytable';
If data is returned (i.e. the table exists) then the application or process will execute the delete statement, if no data is returned the application or process does nothing. Basically you need to handle the "if this then do this" logic externally to Redshift.
I recommend @Nathan's answer. I would use python/psycopg2 to set up this logic. The first query would check for the table's existence in pg_tables (e.g. SELECT count(1) FROM pg_tables WHERE tablename='foo'), and store the result in a variable. Then you'd check that variable to decide whether to kick off a second query (your delete).
But, maybe you don't want to do it in Python. You're just all about Redshift (it's pretty sweet). You could just run the DELETE query in Redshift. If the table is not present, the query fails and nothing happens. If the table is, you delete your data. There's no harm in generating an error here.
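The check-then-delete logic could be sketched in Python roughly as follows. It assumes a DB-API cursor using the %s parameter style (as psycopg2 does); the function name delete_if_table_exists is hypothetical, and both the identifiers and the WHERE clause are assumed to be trusted SQL, not user input.

```python
def delete_if_table_exists(cur, schema, table, where_clause):
    """Run the DELETE only when pg_tables lists schema.table.

    `cur` is any DB-API cursor with %s placeholders. `where_clause` is
    spliced into the statement as-is, so it must come from trusted code.
    Returns True if the DELETE was issued, False if the table is missing.
    """
    cur.execute(
        "SELECT 1 FROM pg_tables WHERE schemaname = %s AND tablename = %s",
        (schema, table),
    )
    if cur.fetchone() is None:
        return False  # table does not exist: skip the deletion
    cur.execute(f"DELETE FROM {schema}.{table} WHERE {where_clause}")
    return True
```

A call such as delete_if_table_exists(cur, "myschema", "mytable", "id < 100") would then either delete the matching rows or do nothing, without raising an error for a missing table.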

Clearing records in HBase table

We are creating a Disaster Recovery System for HBase tables. Because of the restrictions we are not able to use the fancy methods to maintain the replica of the table. We are using Export/Import statements to get the data into HDFS and using that to create tables in the DR Servers.
While importing the data into an HBase table, we use the truncate command to clear the table and then load the data fresh. But the truncate statement takes a long time to delete the rows. Are there any other effective statements to clear the entire table?
(truncate takes 33 min for ~2500000 records)
disable -> drop -> create table again, maybe ? I don't know if drop takes too long.

wrapping postgresql commands in a transaction: truncate vs delete or upsert/merge

I am using the commands below in PostgreSQL 9.1.3 to move data from a temp staging table to a table used by a webapp (GeoServer), all in the same database. I then drop the temp table.
TRUNCATE table_foo;
INSERT INTO table_foo
SELECT * FROM table_temp;
DROP TABLE table_temp;
I want to wrap this in a transaction to allow for concurrency. The data set is small (fewer than 2000 rows), and truncating is faster than deleting.
What is the best way to run these commands in a transaction?
Is creating a function advisable, or writing an UPSERT/MERGE etc. in a CTE?
Would it be better to DELETE all rows and then bulk INSERT from the temp table instead of using TRUNCATE?
In Postgres, which can be rolled back: TRUNCATE or DELETE?
The temp table is delivered daily via an ETL scripted in arcpy. How could I automate the truncate/delete/bulk insert parts within Postgres?
I am open to using PL/pgsql, PL/python (or the recommended py for postgres)
Currently I am manually executing the sql commands after the temp staging table is imported into my DB.
Both truncate and delete can be rolled back (which is clearly documented in the manual).
truncate - due to its nature - has some oddities regarding the visibility.
See the manual for details: http://www.postgresql.org/docs/current/static/sql-truncate.html (the warning at the bottom)
If your application can live with the fact that table_foo is "empty" during that process, truncate is probably better (again see the big red box in the manual for an explanation). If you don't want the application to notice, you need to use delete.
To run these statements in a transaction simply put them into one:
begin transaction;
delete from table_foo;
insert into ....
drop table table_temp;
commit;
Whether you do that in a function or not is up to you.
truncate/insert will be faster (than delete/insert) as that minimizes the amount of WAL generated.
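To automate the daily run from a script, the transaction above could be driven from Python along these lines. This is a sketch under assumptions: the function name refresh_from_staging is hypothetical, the connection is any DB-API connection that opens a transaction implicitly (psycopg2 does so by default), and the in-memory SQLite database in the demo is only a stand-in to show the rollback behaviour, not PostgreSQL itself.

```python
import sqlite3  # stand-in driver for the demo; PostgreSQL would use e.g. psycopg2


def refresh_from_staging(conn, target="table_foo", staging="table_temp"):
    """Replace the rows of `target` with those of `staging` atomically.

    All three statements run in one transaction: if any of them fails,
    the rollback leaves `target` with its previous rows. The table names
    are assumed to be trusted identifiers.
    """
    cur = conn.cursor()
    try:
        cur.execute(f"DELETE FROM {target}")
        cur.execute(f"INSERT INTO {target} SELECT * FROM {staging}")
        cur.execute(f"DROP TABLE {staging}")
        conn.commit()
    except Exception:
        conn.rollback()  # keep the old contents of the target table
        raise


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_foo (id INTEGER)")
    conn.execute("CREATE TABLE table_temp (id INTEGER)")
    conn.execute("INSERT INTO table_foo VALUES (1)")
    conn.execute("INSERT INTO table_temp VALUES (2)")
    conn.commit()
    refresh_from_staging(conn)
    print(conn.execute("SELECT id FROM table_foo").fetchall())
```

The sketch uses delete rather than truncate so that readers of table_foo never see it empty, at the cost of the extra WAL mentioned above.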