Why doesn't ACCEPTINVCHARS work here? - amazon-redshift

I'm getting load errors when trying to load data into Redshift. My error is:
Missing newline: Unexpected character 0x24 found at location nnn
I'm using this command, which includes the ACCEPTINVCHARS option; the column in question is defined as VARCHAR(80):
copy <dest_tbl> from <S3 source>
CREDENTIALS <my_credentials> IGNOREHEADER 1 ENCODING UTF8
IGNOREBLANKLINES NULL AS '\\N'
EMPTYASNULL BLANKSASNULL gzip ACCEPTINVCHARS timeformat 'auto'
dateformat 'auto' MAXERROR 1 compupdate on;
The errors look like this in vi
An octal dump looks like this:
I'm not understanding why this is failing given the ACCEPTINVCHARS documentation at Amazon. Can anyone suggest a solution or a workaround? Put another way, what do I need to do to ensure that Redshift accepts this string in this field?

The octal dump shows they are null bytes (NUL), which the Redshift COPY command treats as a line terminator.
Use NULL AS '\0' instead of the default '\N'.
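A minimal sketch of the adjusted command, identical to the one in the question apart from the NULL AS clause (the table, S3 path, and credentials remain placeholders):
copy <dest_tbl> from <S3 source>
CREDENTIALS <my_credentials> IGNOREHEADER 1 ENCODING UTF8
IGNOREBLANKLINES NULL AS '\0'
EMPTYASNULL BLANKSASNULL gzip ACCEPTINVCHARS timeformat 'auto'
dateformat 'auto' MAXERROR 1 compupdate on;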

Related

PostgreSQL how to read csv file with decimal comma?

I'm trying to read a CSV file containing real numbers that use a comma as the decimal separator. I read this file with \copy in psql:
\copy table FROM 'filename.csv' DELIMITER ';' CSV HEADER;
psql does not recognize the comma as a decimal point.
psql:filename.sql:44: ERROR: invalid input syntax for type real: "9669,84"
CONTEXT: COPY filename, line 2, column col-3: "9669,84"
I did some googling but could not find any answer other than "change the decimal comma into a decimal point". I tried SET DECIMALSEPARATORCOMMA=ON; but that did not work. I also experimented with some encoding but I couldn't find whether encoding governs the decimal point (I got the impression it didn't).
Is there really no solution other than changing the input data?
COPY into a staging table where you insert the number into a varchar field. Then do something like this in psql:
--Temporarily change numeric formatting to one that uses ',' as
--decimal separator.
set lc_numeric = "de_DE.UTF-8";
--Below is just an example. In your case the select would be part of
--insert into the target table. Also the first part of to_number
--would be the field from your staging table.
select to_number('9669,84', '99999D999');
9669.84
You might need to change the format string to match all of your numbers. For more information on what is available, see the PostgreSQL data formatting documentation, Table 9.28, "Template Patterns for Numeric Formatting".
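A sketch of the full staging-table flow under this approach; for illustration it assumes the file has a single real-number column, and the staging_tbl / target_tbl names are hypothetical:
-- Hypothetical staging and target tables.
CREATE TABLE staging_tbl (col3 varchar);
CREATE TABLE target_tbl (col3 real);
-- Load the raw text, then convert using a locale whose decimal separator is ','.
\copy staging_tbl FROM 'filename.csv' DELIMITER ';' CSV HEADER
set lc_numeric = 'de_DE.UTF-8';
INSERT INTO target_tbl (col3)
SELECT to_number(col3, '99999D999') FROM staging_tbl;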

Postgresql - import from CSV null values wrapped in double quotes

So I am trying to import some data into postgresql using the COPY command.
Here is a sample of what the data looks like:
"UNIQ_ID","SP_grd1","SACN_grd1","BIOME_grd1","Meso_grd1","DM_grd1","VEG_grd1","lcov90_alb","WMA_grd1"
"G01_00000002","199058001.00000","1.00000","6.00000","24889.00000","2.00000","381.00000","33.00000","9.00000"
"G01_00000008","*********************","1.00000","*********************","24889.00000","2.00000","*********************","34.00000","*********************"
the issue that I am having is the double quotes that are wrapping the ********************* which are the null values.
I am using the following in order to create the data table and copy the data:
CREATE TABLE bravo.G01(UNIQ_ID character varying(18), SP_grd1 double precision ,SACN_grd1 numeric,BIOME_grd1 numeric,Meso_grd1 double precision,DM_grd1 numeric,VEG_grd1 numeric,lcov90_alb numeric,WMA_grd1 numeric);
COPY bravo.g01(UNIQ_ID,SP_grd1,SACN_grd1,BIOME_grd1,Meso_grd1,DM_grd1,VEG_grd1,lcov90_alb,WMA_grd1) FROM 'F:\GreenBook-Backup\LUdatacube_20171206\CSV_Data_bravo\G01.csv' DELIMITER ',' NULL AS '*********************' CSV HEADER;
The CREATE TABLE command works fine, but I encounter an error with the NULL AS clause. If I edit the text file and remove the double quotes then the import works fine.
I assume that as CSVs with double quotes and null values are very common there must be a work around here that I am missing. I certainly don't want to go and edit each of my CSVs so that it doesn't have double quotes!
You might want to try adding the FORCE_NULL ( column_name [, ...] ) option.
As the documentation states for FORCE_NULL:
Match the specified columns' values against the null string, even if it has been quoted, and if a match is found set the value to NULL. In the default case where the null string is empty, this converts a quoted empty string into NULL. This option is allowed only in COPY FROM, and only when using CSV format.
The option is available from Postgres 9.4: https://www.postgresql.org/docs/10/static/sql-copy.html
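A minimal sketch of such a COPY, reusing the question's table, path, and null string, with FORCE_NULL applied to the numeric columns (adjust the list to whichever columns actually contain the quoted null string):
COPY bravo.g01 (UNIQ_ID, SP_grd1, SACN_grd1, BIOME_grd1, Meso_grd1, DM_grd1, VEG_grd1, lcov90_alb, WMA_grd1)
FROM 'F:\GreenBook-Backup\LUdatacube_20171206\CSV_Data_bravo\G01.csv'
WITH (FORMAT csv, HEADER, DELIMITER ',', NULL '*********************',
      FORCE_NULL (SP_grd1, SACN_grd1, BIOME_grd1, Meso_grd1, DM_grd1, VEG_grd1, lcov90_alb, WMA_grd1));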
If you're on a unix-like platform, you could use sed to replace the null-strings with something postgresql will recognize automatically as null. On windows, powershell exposes similar functionality.
This approach is more general if you need to perform other types of clean up on the data before loading.
The regex pattern to match your null-string is "[\*]*"
cleaning the file with sed:
[unix]>sed 's/"[\*]*"//g' test.csv > test2.csv
cleaning the file with windows powershell:
[windows-powershell]>cat test.csv | %{$_ -replace '"[\*]*"', ""} > test2.csv
Loading into postgresql can then be shorter:
psql>\copy bravo.g01 FROM 'test2.csv' WITH CSV HEADER;

How to NULL out a 'no-break space' character in Redshift Copy Command?

My COPY command keeps receiving the following error:
Missing newline: Unexpected character 0x73 found at location 4194303
I ran it through the following function to check for non-ASCII characters:
def return_non_ascii_codes(input: str):
    for char in input:
        if ord(char) > 127:
            yield ord(char)
I found that a number of characters returned code 160. Looking this up in a Unicode chart, it looks like this is a no-break space character: http://www.fileformat.info/info/unicode/char/00a0/index.htm
I want to NULL these characters out in my COPY command, but am unsure of what the correct character sequence/format I should use.
The COPY command is as follows:
COPY xxx
FROM 's3://xxx/cleansed.csv'
WITH CREDENTIALS 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
-- GZIP
ESCAPE
FILLRECORD
TRIMBLANKS
TRUNCATECOLUMNS
DELIMITER '|'
BLANKSASNULL
REMOVEQUOTES
ACCEPTINVCHARS
TIMEFORMAT 'auto'
DATEFORMAT 'auto';
EDIT:
I used Python to find the characters, but Python does not do any of the actual processing in my pipeline. I do a COPY TO STDOUT command from our PostgreSQL databases, and then upload those files directly to S3 for copy to Redshift. So it needs to be handled in one of those two places.
Here are the two fields from the destination table:
id BIGINT,
quiz_data VARCHAR(65535)
UPDATE 1:
I ran the script through a function to cleanse all non-ASCII characters like so:
def return_ascii_chars(input: str):
    # Keep only plain ASCII characters (code points below 127).
    return (char for char in input if ord(char) < 127)

with open(file, 'r') as inf, open(outfile, 'w') as outf:
    for line in inf:
        # Show which non-ASCII codes were present, then write the cleansed line.
        print(list(return_non_ascii_codes(line)))
        outf.write(''.join(return_ascii_chars(line)))
and then tried to COPY to Redshift. Still getting the following:
Missing newline: Unexpected character 0x20 found at location 4194303
I've double-checked that the cleansed file doesn't have any non-ASCII character...
You can use the ACCEPTINVCHARS parameter in your COPY command. It's pretty easy and straightforward:
COPY table1 FROM 's3://my_bucket' CREDENTIALS '' ACCEPTINVCHARS
If I've made a bad assumption please comment and I'll refocus my answer.
Alternatively, strip the character before upload, e.g.:
yourvariable.replace(chr(160), "")
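For reference, ACCEPTINVCHARS also takes an optional replacement character, so invalid UTF-8 bytes (such as a lone 0xA0) are substituted rather than causing an error. A minimal sketch against the question's setup, with the bucket, credentials, and table name left as the same placeholders:
COPY xxx
FROM 's3://xxx/cleansed.csv'
WITH CREDENTIALS 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
DELIMITER '|'
ACCEPTINVCHARS AS '?'
TIMEFORMAT 'auto'
DATEFORMAT 'auto';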

Using ASCII 31 field separator character as Postgresql COPY delimiter

We are exporting data from Postgres 9.3 into a text file for ingestion by Spark.
We would like to use the ASCII 31 field separator character as a delimiter instead of \t so that we don't have to worry about escaping issues.
We can do so in a shell script like this:
#!/bin/bash
DELIMITER=$'\x1F'
echo "copy ( select * from table limit 1) to STDOUT WITH DELIMITER '${DELIMITER}'" | (psql ...) > /tmp/ascii31
But we're wondering, is it possible to specify a non-printable glyph as a delimiter in "pure" postgres?
edit: we attempted to use the postgres escaping convention per http://www.postgresql.org/docs/9.3/static/sql-syntax-lexical.html
warehouse=> copy ( select * from table limit 1) to STDOUT WITH DELIMITER '\x1f';
and received
ERROR: COPY delimiter must be a single one-byte character
Try prepending E before the sequence you're trying to use as a delimiter. For example, E'\x1f' instead of '\x1f'. Without the E, PostgreSQL reads '\x1f' as four separate characters rather than a hexadecimal escape sequence, hence the error message.
See the PostgreSQL manual on "String Constants with C-style Escapes" for more information.
From my testing, both of the following work:
echo "copy (select 1 a, 2 b) to stdout with delimiter u&'\\001f'"| psql;
echo "copy (select 1 a, 2 b) to stdout with delimiter e'\\x1f'"| psql;
I've extracted a small file from Actian Matrix (a close relative of Amazon Redshift; both are Postgres derivatives), using this notation for ASCII character code 30, "Record Separator".
unload ('SELECT btrim(class_cd) as class_cd, btrim(class_desc) as class_desc
FROM transport.stg.us_fmcsa_carrier_classes')
to '/tmp/us_fmcsa_carrier_classes_mk4.txt'
delimiter as '\036' leader;
This is an example of how this file looks in VI:
C^^Private Property
D^^Private Passenger Business
E^^Private Passenger Non-Business
I then moved this file over to a machine hosting PostgreSQL 9.5 via sftp, and used the following copy command, which seems to work well:
copy fmcsa.carrier_classes
from '/tmp/us_fmcsa_carrier_classes_mk4.txt'
delimiter u&'\001E';
Each derivative of Postgres, and Postgres itself, seems to prefer a slightly different notation. Too bad we don't have a single standard!

How do I stop the Postgres copy command from padding strings?

My field is defined as follows
"COLUMNNAME" character(9)
I import CSV files using the following command
copy "TABLE" from '/my/directory' DELIMITERS ',' CSV;
If I have a string such as 'ABCDEF', Postgres pads it out to 'ABCDEF   '. How can I stop it from doing this?
It is because you have char instead of varchar: character(n) values are blank-padded to the declared length. Change the type of your column to varchar and everything will be fine.
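A minimal sketch of that type change, reusing the question's table and column names (varchar(9) is just one reasonable target type):
ALTER TABLE "TABLE" ALTER COLUMN "COLUMNNAME" TYPE varchar(9);
When character values are converted to varchar, trailing spaces should be stripped from the existing rows as well.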