I want to store the folder name while copying data from an S3 bucket to a Redshift table - amazon-redshift

I am trying to load data from an S3 bucket into a Redshift table. The table has a column called source id, and I want to store the name of the folder where the source file lives in that column.
I have multiple folders in the S3 bucket, each containing one file, and I load all the files into the same table with the COPY command in Redshift. To identify which folder each row came from, I need to store the folder name along with the data in the Redshift table; I have a separate column in the table called source id.
Can anybody help me?

If you are using the Redshift COPY command, then you have no choice other than a process that imports each folder separately (e.g. into a temp table) and then sets your column manually to the name of the folder you just loaded. Repeat for each folder.
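A minimal sketch of that per-folder process might look like this (the target table, column names, S3 prefix, and IAM role below are all hypothetical placeholders):
-- staging table without the source id column (hypothetical columns)
create temp table staging (col1 varchar(100), col2 varchar(100));
-- load one folder's file; the prefix and IAM role are placeholders
copy staging
from 's3://my-bucket/folder1/'
iam_role 'arn:aws:iam::123456789012:role/my-redshift-role';
-- insert into the real table, tagging every row with the folder name
insert into target_table (col1, col2, source_id)
select col1, col2, 'folder1'
from staging;
drop table staging;
-- repeat with 'folder2', 'folder3', ... for the remaining folders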
Another option is to use Redshift Spectrum and create an external table that maps each of your folders as a partition.
First you create your base table like this:
create external table spectrum.sales_part(
    salesid integer,
    listid integer,
    sellerid integer,
    buyerid integer,
    eventid integer,
    dateid smallint,
    qtysold smallint,
    pricepaid decimal(8,2),
    commission decimal(8,2),
    saletime timestamp)
partitioned by (saledate date)
row format delimited
fields terminated by '|'
stored as textfile
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/'
table properties ('numRows'='172000');
Then you add partitions to it like this:
alter table spectrum.sales_part
add partition(saledate='2008-01-01')
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-01/';
alter table spectrum.sales_part
add partition(saledate='2008-02-01')
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/';
alter table spectrum.sales_part
add partition(saledate='2008-03-01')
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-03/';
Once you have that set up as an external table, you can use standard SQL against it. For example, you could run your queries directly against that table, or copy it into a permanent Redshift table using CTAS.
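A rough sketch of that CTAS step (the local table name is a hypothetical placeholder): the saledate partition column comes along with the data, so it can serve as your source identifier.
-- materialise the external table into a permanent Redshift table
create table local_sales as
select *
from spectrum.sales_part;
-- the partition column identifies which S3 folder each row came from
select saledate, count(*)
from local_sales
group by saledate;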
Here is a link to the documentation
https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html

Related

Load data with default values into Redshift from a parquet file

I need to load data with a default value column into Redshift, as outlined in the AWS docs.
Unfortunately the COPY command doesn't allow loading data with default values from a parquet file, so I need to find a different way to do that.
My table requires a column with the getdate function from Redshift:
LOAD_DT TIMESTAMP DEFAULT GETDATE()
If I use the COPY command and add the column names as arguments I get the error:
Column mapping option argument is not supported for PARQUET based COPY
What is a workaround for this?
Can you post a reference for Redshift not supporting default values for a Parquet COPY? I haven't heard of this restriction.
As for workarounds, I can think of two:
Copy the file into a temp table and then insert from this temp table into your table with the default value.
Define an external table that uses the parquet file as its source and insert from that table into the table with the default value.
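The first workaround might look roughly like this (a sketch: the staging table, target table, column names, S3 path, and IAM role are hypothetical, and your real table will have more columns):
-- stage the parquet data without the defaulted column
create temp table stage_parquet (id int, name varchar(100));
copy stage_parquet
from 's3://my-bucket/data/file.parquet'
iam_role 'arn:aws:iam::123456789012:role/my-redshift-role'
format as parquet;
-- LOAD_DT is omitted from the column list, so DEFAULT GETDATE() fills it in
insert into target_table (id, name)
select id, name
from stage_parquet;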

Use dynamic value as table name of a table storage in Azure Data Factory

I have an ADF pipeline that uses a copy data activity to copy data from blob storage to table storage. This pipeline runs on a trigger once every day. I have provided the table name in the table storage dataset as 'Table1'.
Instead of providing a hard-coded table name (Table1), is it possible to provide a dynamic value, so that the run ID of the pipeline run is used as the table name in table storage and the data is copied from blob storage into that table?
You could set a dynamic value as the table name.
For example, you can add a parameter to the table storage dataset and then set a pipeline parameter to specify the table name.
But we cannot use the run ID of the pipeline run as the table name in table storage and copy data from blob storage into that table.
Hope this helps.

How to dump data into a temporary table (without actually creating the temporary table) from an external table in a Hive script at run time

In SQL stored procedures, we have the option of creating a temporary table "#temp" whose structure is the same as that of the table it refers to. Here we don't explicitly create and define the structure of the "#temp" table.
Do we have a similar option in an HQL Hive script to create a temp table at run time without explicitly creating the table structure, so that I can dump data into the temp table and use it? The code below shows an example of a #temp table in SQL.
SELECT name, age, gender
INTO #MaleStudents
FROM student
WHERE gender = 'Male'
Hive has the concept of temporary tables, which are local to a user's session. These tables behave just like any other table, and can be created using CTAS commands too. Hive automatically deletes all temporary tables at the end of the Hive session in which they are created.
Read more about them in the Hive documentation.
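For instance, the SQL Server example from the question could be written in HiveQL roughly like this (a sketch, assuming the same student table exists in Hive):
CREATE TEMPORARY TABLE male_students AS
SELECT name, age, gender
FROM student
WHERE gender = 'Male';
The table male_students disappears automatically when the session ends.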
You can create a simple temporary table and perform any operation on it.
Once you are done with your work and log out of your session, it will be deleted automatically.
The syntax for a temporary table is:
CREATE TEMPORARY TABLE TABLE_NAME_HERE (key string, value string)

How to clone or copy records in the same table in PostgreSQL?

How can I clone or copy records within the same table in PostgreSQL by creating a temporary table?
I am trying to create clones of records from a table into the same table with a changed name (which is basically the composite key in that table).
You can do it all in one INSERT combined with a SELECT.
i.e. say you have the following table definition and data populated in it:
create table original
(
id serial,
name text,
location text
);
INSERT INTO original (name, location)
VALUES ('joe', 'London'),
('james', 'Munich');
And then you can INSERT doing the kind of switch you're talking about without using a TEMP TABLE, like this:
INSERT INTO original (name, location)
SELECT 'john', location
FROM original
WHERE name = 'joe';
This should also be faster (although for tiny data sets probably not hugely so in absolute time terms), since it's doing only one INSERT and SELECT as opposed to an extra SELECT and CREATE TABLE plus an UPDATE.
Did a bit of research and came up with this logic:
Create temp table
Copy records into it
Update the records in temp table
Copy it back to original table
CREATE TEMP TABLE temporary AS SELECT * FROM original WHERE name='joe';
UPDATE temporary SET name='john' WHERE name='joe';
INSERT INTO original SELECT * FROM temporary WHERE name='john';
Was wondering if there was any shorter way to do it.

PostgreSQL copy command generate primary key id

I have a CSV file with two columns: city and zipcode. I want to be able to copy this file into a PostgreSQL table using the COPY command and at the same time auto-generate the id value.
The table has the following columns: id, city, and zipcode.
My CSV file has only: city and zipcode.
The COPY command should do that all by itself if your table uses a serial column for the id:
If there are any columns in the table that are not in the column list, COPY FROM will insert the default values for those columns.
So you should be able to say:
copy table_name(city, zipcode) from ...
and the id will be generated as usual. If you don't have a serial column for id (or a manually attached sequence), then you could hook up a sequence by hand, do your COPY, and then detach the sequence.
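A minimal sketch (with hypothetical table and file names; the server must be able to read the path, or you can use psql's \copy instead):
create table places (
    id      serial primary key,
    city    text,
    zipcode text
);
-- id is omitted from the column list, so the serial default fills it in
copy places (city, zipcode) from '/tmp/cities.csv' with (format csv);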