Unloading from redshift to s3 with headers - amazon-redshift

I already know how to unload a file from Redshift into S3 as one file. I need to know how to unload with the column headers. Can anyone please help or give me a clue?
I don't want to have to do it manually in shell or Python.

As of cluster version 1.0.3945, Redshift supports unloading data to S3 with a header row in each file, e.g.:
UNLOAD('select column1, column2 from mytable;')
TO 's3://bucket/prefix/'
IAM_ROLE '<role arn>'
HEADER;
Note: you can't use the HEADER option in conjunction with FIXEDWIDTH.
https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html

If any of your columns are non-character, you need to explicitly cast them as char or varchar, because the UNION with the header row forces a cast.
Here is an example of the full statement that will create a file in S3 with the headers in the first row.
The output will be a single CSV file with quoted fields.
This example assumes numeric values in column_1. You will need to point the ORDER BY clause at a numeric column to ensure the header row ends up in row 1 of the S3 file.
******************************************
/* Redshift export to S3 CSV single file with headers - limit 6.2GB */
UNLOAD ('
SELECT \'column_1\',\'column_2\'
UNION
SELECT
CAST(column_1 AS varchar(255)) AS column_1,
CAST(column_2 AS varchar(255)) AS column_2
FROM source_table_for_export_to_s3
ORDER BY 1 DESC
;
')
TO 's3://bucket/path/file_name_for_table_export_in_s3_' credentials
'aws_access_key_id=<key_with_no_<>_brackets>;aws_secret_access_key=<secret_access_key_with_no_<>_brackets>'
PARALLEL OFF
ESCAPE
ADDQUOTES
DELIMITER ','
ALLOWOVERWRITE
GZIP
;
****************************************

There is no direct option provided by Redshift UNLOAD.
But we can tweak the query to generate a file whose first row contains the headers.
First we will use the PARALLEL OFF option so that it creates only one file.
"By default, UNLOAD writes data in parallel to multiple files, according to the number of slices in the cluster. The default option is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or more data files serially, sorted absolutely according to the ORDER BY clause, if one is used. The maximum size for a data file is 6.2 GB. So, for example, if you unload 13.4 GB of data, UNLOAD creates the following three files."
To get headers in the unloaded file, do as below.
Suppose you have a table like this:
create table mytable
(
name varchar(64) default NULL,
address varchar(512) default NULL
)
Then use a SELECT like the one below in your UNLOAD to add the headers as well:
( select 'name','address') union ( select name,address from mytable )
This will add the headers name and address as the first line in your output.
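Putting the two pieces together, a minimal sketch of the full statement might look like this (the bucket path and IAM role are placeholders, and quotes inside the query are escaped with backslashes as in the earlier examples):
UNLOAD ('
(SELECT \'name\',\'address\')
UNION
(SELECT name, address FROM mytable)
')
TO 's3://bucket/prefix/mytable_'
IAM_ROLE '<role arn>'
DELIMITER ','
ADDQUOTES
PARALLEL OFF;
Note that a plain UNION does not guarantee that the header row sorts first; the next answer shows how to force the ordering.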

Just to complement the answer: to ensure the header row comes first, you don't have to order by a specific column of data. You can enclose the UNIONed selects inside another select, add an ordinal column to them, and then in the outer select order by that column without including it in the list of selected columns.
UNLOAD ('
SELECT column_1, column_2 FROM (
SELECT 1 AS i, \'column_1\' AS column_1, \'column_2\' AS column_2
UNION ALL
SELECT 2 AS i, column_1::varchar(255), column_2::varchar(255)
FROM source_table_for_export_to_s3
) t ORDER BY i
')
TO 's3://bucket/path/file_name_for_table_export_in_s3_'
CREDENTIALS
'aws_access_key_id=...;aws_secret_access_key=...'
DELIMITER ','
PARALLEL OFF
ESCAPE
ADDQUOTES;

Redshift now supports unload with headers (September 19–October 10, 2018 release).
The syntax for unloading with headers is:
UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
authorization
HEADER

Unfortunately, the UNLOAD command doesn't natively support this feature (see other answers for how to do it with workarounds).
I've posted a feature request on the AWS forums, so hopefully it gets added someday.
Edit: The feature has now been implemented natively in Redshift! 🎉

Try it like this, unloading VENUE with a header:
unload ('select * from venue where venueseats > 75000')
to 's3://mybucket/unload/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
header
parallel off;
The following shows the contents of the output file with a header row:
venueid|venuename|venuecity|venuestate|venueseats
6|New York Giants Stadium|East Rutherford|NJ|80242
78|INVESCO Field|Denver|CO|76125
83|FedExField|Landover|MD|91704
79|Arrowhead Stadium|Kansas City|MO|79451

To make the process easier you can use a pre-built Docker image to extract and include the header row.
https://github.com/openbridge/ob_redshift_unload
It will also do a few other things, but it seemed to make sense to package this in an easy-to-use format.

To unload a table as CSV to S3 including the headers, you simply have to do it this way:
UNLOAD ('SELECT * FROM {schema}.{table}')
TO 's3://{s3_bucket}/{s3_key}/{table}/'
with credentials
'aws_access_key_id={access_key};aws_secret_access_key={secret_key}'
CSV HEADER ALLOWOVERWRITE PARALLEL OFF;

Related

copy columns of a csv file into postgresql table

I have a CSV file whose lines have 12, 11, 10, or 5 columns.
After creating a PostgreSQL table with 12 columns, I want to copy this CSV into the table.
I use this query:
COPY absence(champ1, champ2, num_agent, nom_prenom_agent, code_gestion, code_service, calendrier_agent, date_absence, code_absence, heure_absence, minute_absence, periode_absence)
FROM 'C:\temp\absence\absence.csv'
DELIMITER '\'
CSV
My CSV file contains 80,000 lines.
Example:
20\05\ 191\MARKEY CLAUDIE\GA0\51110\39H00\21/02/2020\1471\03\54\Matin
21\05\ 191\MARKEY CLAUDIE\GA0\51110\39H00\\8130\7H48\Formation avec repas\
30\05\ 191\MARKEY CLAUDIE\GA0\51430\39H00\\167H42\
22\9993\Temps de déplacement\98\37
When I execute the query, I get a message indicating that there is missing data for the lines with fewer than 12 fields.
Is there a trick?
COPY is extremely fast and efficient, but less flexible because of that. Specifically, it can't cope with files that have a different number of "columns" on each line.
You can either use a different import tool or, if you want to stick to built-in tools, copy the file into a staging table that has only a single column, then use Postgres string functions to split the lines into the columns:
create unlogged table absence_import
(
line text
);
\COPY absence_import(line) FROM 'C:\temp\absence\absence.csv' DELIMITER E'\b' CSV
E'\b' is the "backspace" character which can't really appear in a text file, so no column splitting is taking place.
Once you have imported the file, you can split the line using string_to_array() and then insert that into the real table:
insert into absence(champ1, champ2, num_agent, nom_prenom_agent, code_gestion, code_service, calendrier_agent, date_absence, code_absence, heure_absence, minute_absence, periode_absence)
select line[1], line[2], line[3], line[4], line[5], line[6], line[7], line[8], line[9], line[10], line[11], line[12]
from (
select string_to_array(line, '\') as line
from absence_import
) t;
If there are non-text columns, you might need to cast the values to the target data type explicitly, e.g. line[3]::int.
You can add additional expressions to deal with missing columns, e.g. something like coalesce(line[10], 'default value').
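As a quick illustration, using the short sample line from the question, a field that is missing from the line comes back as NULL and coalesce supplies the fallback:
select coalesce(line[10], 'default value')
from (
select string_to_array('22\9993\Temps de déplacement\98\37', '\') as line
) t;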

Why is the Redshift UNLOAD query not able to quote the column correctly?

I have the following row in an AWS Redshift warehouse table.
name
-------------------
tokenauthserver2018
I queried this via a simple SELECT query:
SELECT name
FROM tablename
When I try to unload it using an UNLOAD query from AWS Redshift, it finishes successfully but produces weird quoting:
"name"
"tokenauthserver2018\
Here is my query
UNLOAD ($TABLE_QUERY$
SELECT name
FROM tablename
$TABLE_QUERY$)
TO 's3://bucket/folder'
MANIFEST VERBOSE HEADER DELIMITER AS ','
NULL AS '' ESCAPE GZIP ADDQUOTES ALLOWOVERWRITE PARALLEL OFF;
I tried unloading without ADDQUOTES as well, but got the following data:
name
"tokenauthserver2018
This is the query for the above:
UNLOAD ($TABLE_QUERY$
SELECT name
FROM tablename
$TABLE_QUERY$)
TO 's3://bucket/folder'
MANIFEST VERBOSE HEADER CSV NULL AS '' GZIP ALLOWOVERWRITE PARALLEL OFF;
Amazon support was able to resolve this; I am posting the answer here for anyone interested.
This was due to the presence of the NULL character \0 in my data. As I don't have control over the source data, I used the TRANSLATE function to strip the \0 character:
SELECT
TRANSLATE("name", CHR(0), '') AS "name"
FROM <tablename>
Reference: https://docs.aws.amazon.com/redshift/latest/dg/r_TRANSLATE.html
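Plugged into the original UNLOAD (same bucket path and options as in the question), the cleaned-up query looks roughly like this:
UNLOAD ($TABLE_QUERY$
SELECT TRANSLATE("name", CHR(0), '') AS "name"
FROM tablename
$TABLE_QUERY$)
TO 's3://bucket/folder'
MANIFEST VERBOSE HEADER CSV NULL AS '' GZIP ALLOWOVERWRITE PARALLEL OFF;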

How can I remove extra characters from a column?

I have a table with Customer/Phone/City/State/Zip/etc..
Occasionally, I'll be importing the info from a .csv file, and sometimes the zipcode is formatted like this: xxxxx-xxxx and I only need it to be a general, 5 digit zip code.
How can I delete the last 5 characters without having to do it from Excel, cell by cell (which is what I'm doing now)?
Thanks
EDIT: This is what I used after Craig's suggestion and it worked. However, some of the zip entries are Canadian postal codes, and those are often formatted x1x-x2x. Running this deletes the last character in the field.
How can I remedy this?
You'll need to do one of these three things:
use an ETL tool to filter the data during insert;
COPY into a TEMPORARY or UNLOGGED table then do an INSERT INTO real_table SELECT ... that transforms the data with a suitable substring(...) call; or
Write a simple Perl/Python/whatever script that reads the csv, transforms it as desired, and inserts the results into PostgreSQL. I'd use Python with the csv module and psycopg2's copy_from.
Such an insert into ... select might look like:
INSERT INTO real_table(col1, col2, zip)
SELECT
col1,
col2,
substring(zip from 1 for 5)
FROM temp_table;
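To handle the Canadian postal codes mentioned in the edit, one option is to trim only values that match the US ZIP+4 pattern and leave everything else untouched. A minimal sketch, assuming PostgreSQL (as the rest of this answer does) and the same hypothetical table and column names as above:
INSERT INTO real_table(col1, col2, zip)
SELECT
col1,
col2,
CASE
WHEN zip ~ '^[0-9]{5}-[0-9]{4}$' THEN substring(zip from 1 for 5)
ELSE zip
END
FROM temp_table;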

exporting to csv from db2 with no delimiter

I need to export the contents of a DB2 table to a CSV file.
I read that nochardel would prevent having a separator between the fields, but that is not happening.
Suppose I have a table
MY_TABLE
-----------------------
Field_A varchar(10)
Field_B varchar(10)
Field_C varchar(10)
I am using this command
export to myfile.csv of del modified by nochardel select * from MY_TABLE
I get this written into myfile.csv:
data1 ,data2 ,data3
but I would like no ',' separator, like below:
data1 data2 data3
Is there a way to do that?
You're asking how to eliminate the comma (,) in a comma-separated values file? :-)
NOCHARDEL tells DB2 not to surround character-fields (CHAR and VARCHAR fields) with a character-field-delimiter (default is the double quote " character).
Anyway, when exporting from DB2 using the delimited format, you have to have some kind of column delimiter. There isn't a NOCOLDEL option for delimited files.
The EXPORT utility can't write fixed-length (positional) records - you would have to do this by either:
Writing a program yourself,
Using a separate utility (IBM sells the High Performance Unload utility)
Writing an SQL statement that concatenates the individual columns into a single string:
Here's an example for the last option:
export to file.del
of del
modified by nochardel
select
cast(col1 as char(20)) ||
cast(intcol as char(10)) ||
cast(deccol as char(30))
from your_table;
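Applied to the MY_TABLE from the question (all varchar(10) columns, names as declared in the question), a sketch might look something like:
export to myfile.csv of del modified by nochardel
select
cast(Field_A as char(10)) ||
cast(Field_B as char(10)) ||
cast(Field_C as char(10))
from MY_TABLE;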
This last option can be a pain since DB2 doesn't have an sprintf() function to help format strings nicely.
Yes, there is another way of doing this. I always do it like this:
Put the select statement into a file (input.sql):
select
cast(col1 as char(20)),
cast(col2 as char(10)),
cast(col3 as char(30))
from your_table;
Call db2 clp like this:
db2 -x -tf input.sql -r result.txt
This will work for you because you need to cast varchar to char. As Ian said, casting numbers or other data types to char might bring unexpected results.
PS: I think Ian is right about the difference between CSV and fixed-length format ;-)
Use "of asc" instead of "of del". Then you can specify the fixed column locations instead of delimiting.

SQL Server 2000 query that omits commas in resulting rows?

Wondering if there is a way to query a SQL Server database and somehow format columns to omit commas in the data if there are any.
The reason for asking is that I have 10,000+ records, and throughout the data the varchar columns contain values like 3,25% and others like 1%.
I'd prefer not to alter the data in the original table, so I'm asking whether a SELECT with other functions would do the trick.
I have thought about selecting all the data into a temp table and stripping the commas, but that is a lot of work every time I do the query.
Any info on whether this is possible would be appreciated.
Take a look at the REPLACE function:
SELECT REPLACE(YourColumn, ',', '')
FROM YourTable
Use SQL REPLACE:
REPLACE(YourField,',','')
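For instance, applied to one of the sample values from the question:
SELECT REPLACE('3,25%', ',', '') AS cleaned;
This returns 325%, with the comma stripped.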