Redshift COPY command: append, replace, or upsert?

Suppose I run the Redshift COPY command for a table that already contains data. Does the command:
Append the data to the existing table?
Wipe the existing data clean and add the new data?
Upsert the data, i.e., UPDATE if a row with the same primary key is present in the table, or INSERT otherwise?

The COPY command always appends data to a table.

To effectively upsert in Redshift using the COPY command, you first need to load your data (via COPY) into a staging table, then run some SQL on Redshift to merge that data into the target table.
AWS has documented this approach here: https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html
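A minimal sketch of that staging-table merge, assuming a target table target_table, a staging table stage_table, and a key column id (all table, column, and role names here are placeholders):
BEGIN TRANSACTION;

-- load the incoming file into the staging table
COPY stage_table
FROM 's3://my-bucket/my-data'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
FORMAT AS CSV;

-- delete target rows that are about to be replaced
DELETE FROM target_table
USING stage_table
WHERE target_table.id = stage_table.id;

-- insert the full contents of the staging table
INSERT INTO target_table
SELECT * FROM stage_table;

END TRANSACTION;
Running the whole sequence in a single transaction keeps other sessions from seeing the table in a half-merged state.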

Related

How to dump data into a temporary table (without actually creating the temporary table) from an external table in a Hive script at run time

In SQL stored procedures, we have the option of creating a temporary table "#temp" whose structure matches that of the table it refers to. Here we don't explicitly create the "#temp" table or define its structure.
Is there a similar option in an HQL/Hive script to create a temp table at run time without actually defining the table structure, so that I can dump data into the temp table and use it? The code below shows an example of a #temp table in SQL.
SELECT name, age, gender
INTO #MaleStudents
FROM student
WHERE gender = 'Male'
Hive has the concept of temporary tables, which are local to a user's session. These tables behave just like any other table, and can be created using CTAS commands too. Hive automatically deletes all temporary tables at the end of the Hive session in which they are created.
Read more about them here: Hive Documentation and DWGEEK.
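For instance, a minimal CTAS sketch mirroring the #MaleStudents example above (table and column names are assumptions):
CREATE TEMPORARY TABLE male_students AS
SELECT name, age, gender
FROM student
WHERE gender = 'Male';
The resulting table exists only in the current session and is dropped automatically when the session ends.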
You can create a simple temporary table and perform any operation on it. Once you are done with your work and log out of your session, it will be deleted automatically.
The syntax for a temporary table is:
CREATE TEMPORARY TABLE TABLE_NAME_HERE (key string, value string)
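For example, a hypothetical usage sketch (the key/value contents are made up):
INSERT INTO TABLE_NAME_HERE VALUES ('some_key', 'some_value');
SELECT * FROM TABLE_NAME_HERE;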

How to upsert/delete DB2 source table data using PySpark/Spark SQL/DataFrames/Spark RDDs?

I'm trying to upsert/delete some of the values in a DB2 database source table, which is an existing table on DB2. Is this possible using PySpark/Spark SQL/DataFrames?
There is no direct way to update/delete rows in a relational database from a PySpark job, but there are workarounds.
(1) You can create an identical empty table (a secondary table) in the relational database, insert data into the secondary table using the PySpark job, and write a DML trigger that performs the desired DML operation on your primary table.
(2) You can create a DataFrame (e.g. a) in Spark that is a copy of your existing relational table, merge the existing table's DataFrame with the current DataFrame (e.g. b), and create a new DataFrame (e.g. c) that holds the latest changes, as sketched below. Then truncate the relational database table and reload it from the latest-changes DataFrame (c).
These are just workarounds, not an optimal solution for huge amounts of data.
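A minimal Spark SQL sketch of the merge step in workaround (2), assuming the existing table and the incoming changes are registered as temporary views named existing and changes, keyed on a column id (all view and column names are assumptions; the truncate-and-reload write-back still goes through the DataFrame JDBC writer):
CREATE OR REPLACE TEMPORARY VIEW merged AS
-- take every incoming row ...
SELECT * FROM changes
UNION ALL
-- ... plus the existing rows that were not changed
SELECT e.*
FROM existing e
LEFT ANTI JOIN changes c ON e.id = c.id;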

Append and Overwrite in Amazon Redshift

As Redshift is based on PostgreSQL, does it have an option to overwrite or append data in a table while copying from S3 to Redshift?
The only thing I found is the use of triggers, but they don't accept any arguments.
All I need is to write a script that takes a yes/no (or similar) argument indicating whether the data is already in the table.
When loading data from Amazon S3 into Amazon Redshift using the COPY command, data is appended to the target table.
Redshift does not have an "overwrite" option. If you wish to replace existing data with the data being loaded, you could:
Load the data into a temporary table
Delete rows in the main table that match the incoming data, e.g.:
DELETE FROM main-table WHERE id IN (SELECT id FROM temp-table)
Copy the rows from the temporary table to the main table, e.g.:
INSERT INTO main-table SELECT * FROM temp-table
See: Updating and Inserting New Data
Redshift doesn't allow you to create triggers or events like other SQL databases. The solution I found is to run an UPDATE (SQL query), though you can also use Python or another language, and schedule the script with a crontab task.
As of May 2019, Redshift supports stored procedures so you can package up a set of queries/statements like this:
CREATE OR REPLACE PROCEDURE public.copy_and_cleanse_data(overwrite bool)
AS $$
BEGIN
  IF overwrite THEN
    DELETE FROM myredshifttable;
  END IF;
  COPY myredshifttable
  FROM 's3://awssampledbuswest2/tickit/category_pipe.txt'
  IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
  REGION 'us-west-2';
  UPDATE myredshifttable SET myfield = REPLACE(myfield, 'foo', 'bar');
END;
$$ LANGUAGE plpgsql
SECURITY DEFINER;
Then use or schedule the following query:
CALL public.copy_and_cleanse_data(true)
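To load without clearing the table first, pass false instead:
CALL public.copy_and_cleanse_data(false)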

Clearing records in HBase table

We are creating a disaster recovery system for HBase tables. Because of restrictions, we are not able to use the fancier methods of maintaining a replica of the table. We are using Export/Import statements to get the data into HDFS and using that to create tables on the DR servers.
While importing the data into the HBase table, we use the truncate command to clear the table and load the data fresh. But the truncate statement takes a long time to delete the rows. Are there any other, more effective statements to clear the entire table?
(truncate takes 33 min for ~2500000 records)
disable -> drop -> create the table again, maybe? I don't know whether drop also takes too long.

How to UPDATE table from csv file?

How do I update a table from a CSV file in PostgreSQL (version 9.2.4)?
The COPY command is for inserts, but I need to update a table. How can I update a table from a CSV file without a temp table?
I don't want to copy from the CSV file to a temp table and then update the table from the temp table.
And is there no MERGE command like in Oracle?
The simple and fast way is with a temporary staging table, as detailed in this closely related answer:
How to update selected rows with values from a CSV file in Postgres?
If you don't "want" that for some unknown reason, there are more ways:
A foreign data wrapper with file_fdw.
You can run UPDATE commands directly against it; see the sketch after this list.
pg_read_file(). For special use cases.
Details in this related answer:
Read data from a text file inside a trigger
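A minimal sketch of the file_fdw route, assuming a target table named target with a key column id and a CSV file at /path/to/data.csv (all table, column, and server names plus the path are assumptions):
CREATE EXTENSION IF NOT EXISTS file_fdw;

CREATE SERVER csv_server FOREIGN DATA WRAPPER file_fdw;

-- expose the CSV file as a read-only foreign table
CREATE FOREIGN TABLE csv_input (id int, name text)
SERVER csv_server
OPTIONS (filename '/path/to/data.csv', format 'csv', header 'true');

-- update the target table directly from the file, no staging table needed
UPDATE target t
SET    name = c.name
FROM   csv_input c
WHERE  t.id = c.id;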
There is no MERGE command in Postgres, let alone one for COPY.
Discussion about whether and how to add it is ongoing. Check out the Postgres Wiki for details.