As Redshift is based on PostgreSQL, does it have an option to overwrite or append data in a table while copying from S3 to Redshift?
The only thing I found is the use of triggers, but they don't accept any arguments.
All I need is to write a script that takes a yes/no (or similar) argument to control what happens if the data is already in the table.
When loading data from Amazon S3 into Amazon Redshift using the COPY command, data is appended to the target table.
Redshift does not have an "overwrite" option. If you wish to replace existing data with the data being loaded, you could:
Load the data into a temporary table
Delete rows in the main table that match the incoming data, eg:
DELETE FROM main_table WHERE id IN (SELECT id FROM temp_table);
Copy the rows from the temporary table to the main table, eg:
INSERT INTO main_table SELECT * FROM temp_table;
See: Updating and Inserting New Data
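Putting those steps together, here is a minimal sketch of the full flow in one transaction (main_table, the id key, and the S3/IAM values are placeholders):
BEGIN;

-- Stage the incoming data in a temporary table with the same structure
CREATE TEMP TABLE temp_table (LIKE main_table);

COPY temp_table
FROM 's3://mybucket/data/'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>';

-- Remove the rows that are about to be replaced
DELETE FROM main_table WHERE id IN (SELECT id FROM temp_table);

-- Load the replacement rows
INSERT INTO main_table SELECT * FROM temp_table;

DROP TABLE temp_table;
COMMIT;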
Redshift doesn't allow you to create triggers or events like other SQL databases. The solution I found is to run the update as a SQL query (you can also use Python or another language) and schedule that script with a crontab task.
As of May 2019, Redshift supports stored procedures so you can package up a set of queries/statements like this:
CREATE OR REPLACE PROCEDURE public.copy_and_cleanse_data(overwrite bool)
AS $$
BEGIN
IF overwrite IS TRUE THEN DELETE FROM myredshifttable; END IF;
copy myredshifttable
from 's3://awssampledbuswest2/tickit/category_pipe.txt'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
region 'us-west-2';
UPDATE myredshifttable SET myfield = REPLACE(myfield, 'foo', 'bar');
END;
$$ LANGUAGE plpgsql
SECURITY DEFINER;
Then call or schedule the procedure, passing true to overwrite or false to append:
CALL public.copy_and_cleanse_data(true);
In SQL Server stored procedures, we have the option of creating a temporary table "#temp" whose structure matches that of another table it refers to. Here we don't explicitly create and define the structure of the "#temp" table.
Is there a similar option in HQL (Hive script) to create a temp table at run time without actually defining the table structure, so that I can dump data into the temp table and use it? The code below shows an example of a #temp table in SQL.
SELECT name, age, gender
INTO #MaleStudents
FROM student
WHERE gender = 'Male'
Hive has the concept of temporary tables, which are local to a user's session. These tables behave just like any other table, and can be created using CTAS commands too. Hive automatically deletes all temporary tables at the end of the Hive session in which they are created.
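For example, the SQL Server snippet above could be written in Hive as a CTAS (a sketch assuming the student table from the question):
-- Session-local temporary table created via CTAS; Hive drops it
-- automatically when the session ends
CREATE TEMPORARY TABLE male_students AS
SELECT name, age, gender
FROM student
WHERE gender = 'Male';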
Read more about them in the Hive Documentation and on DWGEEK.
You can create a simple temporary table and perform any operation on it.
Once you are done with your work and log out of your session, it will be deleted automatically.
The syntax for a temporary table is:
CREATE TEMPORARY TABLE TABLE_NAME_HERE (key string, value string)
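Once created, it can be used like any regular table for the rest of the session, e.g. (hypothetical values):
-- Works like a regular table until the session ends
INSERT INTO TABLE TABLE_NAME_HERE VALUES ('key1', 'value1');
SELECT * FROM TABLE_NAME_HERE;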
Suppose I run the Redshift COPY command for a table that already contains data. Does the command:
Append the data to the existing table?
Wipe clean the existing data and add the new data?
Upsert the data, i.e., UPDATE if a row with the same primary key is present in the table, or INSERT otherwise?
The COPY command always appends data to a table.
To effectively upsert in Redshift using the COPY command, you first need to load your data (via your COPY) into a staging table, then run some SQL on Redshift to process this data.
AWS has documented an approach here: https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html
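That approach looks roughly like the following sketch (target_table, its id key, and the S3/IAM values are placeholders; the DELETE/INSERT pair mirrors the documented staging-table merge):
BEGIN;

-- Stage the incoming rows
CREATE TEMP TABLE stage (LIKE target_table);

COPY stage
FROM 's3://mybucket/data/'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>';

-- Delete existing rows that have an incoming replacement, then insert
DELETE FROM target_table
USING stage
WHERE target_table.id = stage.id;

INSERT INTO target_table SELECT * FROM stage;

DROP TABLE stage;
COMMIT;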
I am using Redshift. I want a query that deletes selected rows from a Redshift table if the table exists, and otherwise just ignores the statement.
Redshift's SQL dialect doesn't contain control-of-flow statements like IF...THEN, so you are not going to be able to do this in a single SQL statement.
Your application or process will need to first query the Redshift table metadata to determine whether the table exists, e.g.:
select 1 from pg_tables where schemaname = 'myschema' and tablename = 'mytable';
If data is returned (i.e. the table exists) then the application or process will execute the delete statement, if no data is returned the application or process does nothing. Basically you need to handle the "if this then do this" logic externally to Redshift.
I recommend @Nathan's answer. I would use python/psycopg2 to set up this logic. The first query would check for the table's existence in pg_tables (e.g. SELECT count(1) FROM pg_tables WHERE tablename = 'foo') and store the result in a variable. Then you'd check that variable to decide whether to kick off a second query (your delete).
But maybe you don't want to do it in Python. You're just all about Redshift (it's pretty sweet). You could just run the DELETE query in Redshift. If the table is not present, the query fails and nothing happens. If the table exists, you delete your data. There's no harm in generating an error here.
I need a function (stored procedure) that checks whether a column exists in a table in another database, using PostgreSQL.
I have a SQL Server query as shown below.
Example:
create procedure sp_testing
as
if not exists ( select ssn from testdb..testtable) /*ssn is the column-name of testtable which exists in testdb database */
...
Q: Can I do the same in PostgreSQL?
Your question is not very clear, but if you want to know whether a column with a certain name exists in a table with a certain name in a remote PostgreSQL database, you should first set up a foreign data wrapper, which is a multi-stage process. Then, to test the existence of a certain column in a table, you need to formulate a query that conforms to the standards of the particular DBMS you are connecting to. Use the remote information_schema.columns view for optimal compatibility (specified here as remote_tables, which you must have defined with a prior CREATE FOREIGN TABLE command):
CREATE FUNCTION sp_testing() RETURNS void AS $$
BEGIN
PERFORM *
FROM remote_tables
WHERE table_name = 'testtable'
AND column_name = 'ssn';
IF NOT FOUND THEN
...
END IF;
END;
$$ LANGUAGE plpgsql;
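For reference, the remote_tables foreign table used above could be defined along these lines, a sketch assuming the postgres_fdw extension (server name, host, and credentials are placeholders):
-- Hypothetical postgres_fdw setup for the remote_tables foreign table
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER remote_db
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote-host', dbname 'testdb');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER remote_db
    OPTIONS (user 'remote_user', password 'secret');

-- Expose the remote information_schema.columns view locally
CREATE FOREIGN TABLE remote_tables (
    table_name  text,
    column_name text
) SERVER remote_db
  OPTIONS (schema_name 'information_schema', table_name 'columns');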
If you want to connect to another type of DBMS, you need to write a custom function in, for instance, C or Perl, and then call it from within a PostgreSQL function on your local machine. The test on the column is then best done inside that function, which should therefore take the connection parameters, table name, and column name as parameters and return a boolean with the result.
Before you start testing this, make sure that you read all the documentation on connecting to remote servers. Learning PL/pgSQL first would also be a nice gesture to demonstrate your own effort before asking for help.
I'm wondering if I can use a trigger on a table to "ignore" columns that are in a COPY statement from STDIN but which are not in the target table. Sorry if the wording/syntax of the question is off, but here is an explanation of what I'm trying to say. I'm new to triggers, so any advice is helpful.
I'm using the PostGIS Shapefile importer to copy shapefiles to the spatial tables in my PostgreSQL database.
This creates a COPY statement which contains all the fields in the shapefile, something like:
COPY "public"."stations" ("column1","column2","column3","column4", geom) FROM stdin;
column1 and column2 are in the file but not in the target table, so the COPY fails.
Is there a way to create a trigger to create something that would have the same result as:
COPY "public"."stations" ("column3","column4", geom) FROM stdin;
No, you cannot skip columns that are present in the input file. The COPY will error out before triggers are even invoked. And you cannot use rules either. I quote the manual:
COPY FROM will invoke any triggers and check constraints on the destination table. However, it will not invoke rules.
You can either edit the file or use a temporary staging table:
COPY to a temporary table with matching columns.
Use INSERT to write the desired columns to the final target table(s) - or the whole range of SQL DDL commands for more sophisticated matters.
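A minimal sketch of that staging approach, reusing the column names from the question (the text and geometry column types are assumptions):
-- Temporary staging table with all columns present in the file
CREATE TEMP TABLE stations_stage (
    column1 text,
    column2 text,
    column3 text,
    column4 text,
    geom    geometry
);

COPY stations_stage ("column1","column2","column3","column4", geom) FROM stdin;

-- Forward only the desired columns to the real target table
INSERT INTO public.stations (column3, column4, geom)
SELECT column3, column4, geom
FROM stations_stage;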