How to ignore errors, but not skip lines, in the Redshift COPY command?

I have the COPY statement below. With maxerror set, it skips the lines that fail. Is there any way to COPY data over to Redshift and force failing values into the column regardless of type? I don't want to lose information.
sql_prelim = """copy table1 from 's3://dwreplicatelanding/file.csv.gz'
access_key_id 'xxxx'
secret_access_key 'xxxx'
DELIMITER '\t' timeformat 'auto'
GZIP IGNOREHEADER 1
trimblanks
CSV
BLANKSASNULL
maxerror as 100000
"""
The error I want to get past is below, but ideally I want to get past all errors without dropping the rows:
1207- Invalid digit, Value 'N', Pos 0, Type: Decimal
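One common workaround (a sketch, not from the question: the staging table, its columns, and the decimal(18,2) target type are all assumptions) is to land every field as VARCHAR, which can never fail a type check, and cast afterwards so that bad tokens such as 'N' become NULL instead of load errors:

-- Hypothetical staging table: every column is plain text, so no row is rejected.
create table table1_staging (
    id     varchar(65535),
    amount varchar(65535)
);

copy table1_staging from 's3://dwreplicatelanding/file.csv.gz'
access_key_id 'xxxx'
secret_access_key 'xxxx'
DELIMITER '\t'
GZIP IGNOREHEADER 1
trimblanks
CSV
BLANKSASNULL;

-- Cast on the way into the real table; 'N' becomes NULL instead of
-- raising error 1207.
insert into table1
select id, nullif(amount, 'N')::decimal(18,2)
from table1_staging;

No rows are dropped this way, and any other bad token stays queryable as text in the staging table, where it can be handled with additional CASE/NULLIF logic before the cast.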

Related

Improve SQL*Loader performance for 120 million record upload from CSV

It is taking almost 10 hours to finish loading into the tables.
Here is my control file:
OPTIONS (
skip =1,
ERRORS=4000000,
READSIZE=5000000,
BINDSIZE=8000000,
direct=true
unrecoverable
)
load data
INFILE 'weeklydata1108.csv'
INSERT INTO TABLE t_location_data
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' TRAILING NULLCOLS
(f_start_ip,
f_end_ip,
f_country,
f_count_code ,
f_reg,
f_stat,
f_city,
f_p_code ,
f_area,
f_lat,
f_long,
f_anon_stat ,
f_pro_detect date "YYYY-MM-DD",
f_date "SYSDATE")
And the sqlldr command for running it is:
sqlldr username#\"\(DESCRIPTION=\(ADDRESS=\(HOST=**mydbip***\)\(PROTOCOL=TCP\)\(PORT=1521\)\)\(CONNECT_DATA=\(SID=Real\)\)\)\"/geolocation control='myload.ctl' log='insert.log' bad='insert.bad'
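One lever worth trying (a sketch, not from the post: the split file name is hypothetical, and skip=1 assumes each chunk keeps a header row) is a parallel direct-path load: split the CSV, run one sqlldr session per chunk with PARALLEL=TRUE, and note that UNRECOVERABLE belongs in front of LOAD DATA rather than inside OPTIONS:

OPTIONS (
skip=1,
ERRORS=4000000,
direct=true,
parallel=true
)
UNRECOVERABLE LOAD DATA
INFILE 'weeklydata1108_part1.csv'
APPEND INTO TABLE t_location_data
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' TRAILING NULLCOLS
(f_start_ip,
...
f_date "SYSDATE")

Parallel direct-path loads require APPEND rather than INSERT, and each concurrent session must get its own input file.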

Snowflake CSV copy failures

I'm trying to load CSV data into Snowflake using Snowflake's COPY. The CSV column separator is the pipe character (|). One of the columns in the input CSV has a | character in its data, which is escaped by a backslash. The data type of the target Snowflake column is VARCHAR(n). With the added escape character (\), the data exceeds the target column's defined size, which causes the COPY to fail.
Is there a way I can remove the escape character (\) from the data before it is loaded into the table?
copy into table_name from 's3path.csv.gz' file_format = (type = 'csv' field_optionally_enclosed_by = '"' escape_unenclosed_field = NONE empty_field_as_null = true escape = '\\' field_delimiter = '|' skip_header = 1 NULL_IF = ('\\N', '', '\N')) ON_ERROR = 'ABORT_STATEMENT' PURGE = FALSE
Sample data that causes the failure: "Data 50k | $200K "
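One way to strip the backslash during the load itself (a sketch: the stage name @my_stage and the two-column layout are assumptions, since COPY transformations read from a stage rather than a raw S3 path) is Snowflake's COPY-with-transformation syntax, which allows simple expressions such as REPLACE over the staged columns:

copy into table_name
from (
    -- drop the escaping backslash so the value fits back into VARCHAR(n)
    select replace(t.$1, '\\', ''), t.$2
    from @my_stage/s3path.csv.gz t
)
file_format = (type = 'csv' field_optionally_enclosed_by = '"'
               field_delimiter = '|' skip_header = 1);

Since '\\' in a Snowflake string literal is a single backslash, this REPLACE removes every backslash; if the data can contain legitimate backslashes, a more targeted pattern would be needed.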

Snowflake null values quoted in CSV break the PostgreSQL load

I am trying to move data from Snowflake to PostgreSQL, and to do so I first unload it to S3 in CSV format. Commas can appear in the text columns, so I use Snowflake's FIELD_OPTIONALLY_ENCLOSED_BY unloading option to quote the contents of the problematic cells. However, when this happens together with null values, I can't manage to produce a CSV that PostgreSQL will accept.
I created a simple table to illustrate the issue. Here it is:
CREATE OR REPLACE TABLE PUBLIC.TEST(
TEXT_FIELD VARCHAR,
NUMERIC_FIELD INT
);
INSERT INTO PUBLIC.TEST VALUES
('A', 1),
(NULL, 2),
('B', NULL),
(NULL, NULL),
('Hello, world', NULL)
;
COPY INTO @STAGE/test
FROM PUBLIC.TEST
FILE_FORMAT = (
COMPRESSION = NONE,
TYPE = CSV,
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ''
)
OVERWRITE = TRUE;
From that, Snowflake creates the following CSV:
"A",1
"",2
"B",""
"",""
"Hello, world",""
But after that, it is impossible for me to copy this CSV into a PostgreSQL table as-is,
even though the PostgreSQL documentation says this about the NULL option:
Specifies the string that represents a null value. The default is \N (backslash-N) in text format, and an unquoted empty string in CSV format.
Not setting the NULL option in PostgreSQL's COPY still results in a failed load. Indeed, it won't work since we also have to specify the quote character with QUOTE; here that is QUOTE '"'.
Therefore, during the PostgreSQL load:
FORMAT csv, HEADER false, QUOTE '"' gives:
DataError: invalid input syntax for integer: "" CONTEXT: COPY test, line 3, column numeric_field: ""
FORMAT csv, HEADER false, NULL '""', QUOTE '"' gives:
NotSupportedError: CSV quote character must not appear in the NULL specification
FYI, to test the load from S3 I use this command in PostgreSQL:
CREATE TABLE IF NOT EXISTS PUBLIC.TEST(
TEXT_FIELD VARCHAR,
NUMERIC_FIELD INT
);
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;
SELECT aws_s3.table_import_from_s3(
'PUBLIC.TEST',
'',
'(FORMAT csv, HEADER false, NULL ''""'', QUOTE ''"'')',
'bucket',
'test_0_0_0.csv',
'aws_region'
)
Thanks a lot for any ideas on how to make this work. I would love a solution that doesn't require modifying the CSV between Snowflake and PostgreSQL. I think the issue is more on the Snowflake side, as it doesn't really make sense to quote null values, but PostgreSQL is not helping either.
When you set the NULL_IF value to '', you are actually telling Snowflake to convert NULLs to a blank string, which then gets quoted. When you are copying out of Snowflake, the copy options are "backwards" in a sense, and NULL_IF acts more like an IFNULL.
This is the code that I'd use on the Snowflake side, which will result in an unquoted empty string in your CSV file:
FILE_FORMAT = (
COMPRESSION = NONE,
TYPE = CSV,
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ()
)
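With NULL_IF = () the NULLs unload as unquoted empty fields, which is exactly PostgreSQL's default NULL representation for CSV, so the import side then needs only the QUOTE option. A sketch of the matching import, reusing the bucket, file, and region placeholders from the question:

SELECT aws_s3.table_import_from_s3(
    'PUBLIC.TEST',
    '',
    '(FORMAT csv, HEADER false, QUOTE ''"'')',
    'bucket',
    'test_0_0_0.csv',
    'aws_region'
);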

How does SAS decide which file to select when using wildcards?

I have a SAS code that looks something like this:
DATA WORK.MY_IMPORT_&stamp;
INFILE "M:\YPATH\myfile_150*.csv"
delimiter = ';' MISSOVER DSD lrecl = 1000000 firstobs = 2 ignoredoseof;
[...]
RUN;
Now, at M:\YPATH I have several files named myfile_150.YYYYMMDD. The code works the way it is supposed to, always importing the latest file. I am wondering how SAS decides which file to choose, since the wildcard * can be replaced by anything. Does it sort the files in descending order and choose the first one?
On my system, SAS 9.4 TS1M4, SAS is reading ALL files that satisfy the wildcard.
I created 3 files (file_A.csv, file_B.csv, and file_C.csv). Each contains 1 record ('A', 'B', and 'C' respectively).
data test;
infile "c:\temp\file_*.csv"
delimiter = ';' MISSOVER DSD lrecl = 1000000 ignoredoseof;
format char $1.;
input char $;
run;
(Note I dropped the firstobs option from your code.)
The resulting TEST data set contains 3 observations, 'A', 'B', and 'C'.
This is the order of files returned when issuing
dir c:\temp\file_*.csv
SAS is using the default behavior of the OS and reading the files in that order.
25 data test;
26 infile "c:\temp\file_*.csv"
27 delimiter = ';' MISSOVER DSD lrecl = 1000000 ignoredoseof;
28 format char $1.;
29 input char $;
30 run;
NOTE: The infile "c:\temp\file_*.csv" is:
Filename=c:\temp\file_A.csv,
File List=c:\temp\file_*.csv,RECFM=V,
LRECL=1000000
NOTE: The infile "c:\temp\file_*.csv" is:
Filename=c:\temp\file_B.csv,
File List=c:\temp\file_*.csv,RECFM=V,
LRECL=1000000
NOTE: The infile "c:\temp\file_*.csv" is:
Filename=c:\temp\file_C.csv,
File List=c:\temp\file_*.csv,RECFM=V,
LRECL=1000000
NOTE: 1 record was read from the infile "c:\temp\file_*.csv".
The minimum record length was 1.
The maximum record length was 1.
NOTE: 1 record was read from the infile "c:\temp\file_*.csv".
The minimum record length was 1.
The maximum record length was 1.
NOTE: 1 record was read from the infile "c:\temp\file_*.csv".
The minimum record length was 1.
The maximum record length was 1.
NOTE: The data set WORK.TEST has 3 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.00 seconds

How to deal with missing values when importing CSV to Postgres?

I would like to import a CSV file which has multiple occurrences of missing values. I recoded them to NULL and tried to import the file as is. I suppose that the attributes which include the NULLs are treated as character values; however, transforming them to numeric is a bit complicated. Therefore I would like to import all of my table as:
\copy player_allstar FROM '/Users/Desktop/Rdaten/Data/player_allstar.csv' DELIMITER ';' CSV WITH NULL AS 'NULL' ';' HEADER
There must be a syntax error somewhere; I tried different combinations and always get:
ERROR: syntax error at or near "WITH NULL"
LINE 1: COPY player_allstar FROM STDIN DELIMITER ';' CSV WITH NULL ...
I also tried:
\copy player_allstar FROM '/Users/Desktop/Rdaten/Data/player_allstar.csv' WITH(FORMAT CSV, DELIMITER ';', NULL 'NULL', HEADER);
and get:
ERROR: invalid input syntax for integer: "NULL"
CONTEXT: COPY player_allstar, line 2, column dreb: "NULL"
I suppose it is caused by preprocessing with R. The table came with NAs, so I changed them with:
data[is.na(data)] <- "NULL"
I'm not aware of a different way of changing them to NULL. I think this produces strings. Is there a different way to preprocess and keep the NAs (as NULLs in Postgres, of course)?
Sample:
pts dreb oreb reb asts stl
11 NULL NULL 8 3 NULL
4 5 3 8 2 1
3 NULL NULL 1 1 NULL
The data type is integer.
Given /tmp/sample.csv:
pts;dreb;oreb;reb;asts;stl
11;NULL;NULL;8;3;NULL
4;5;3;8;2;1
3;NULL;NULL;1;1;NULL
then with a table like:
CREATE TABLE player_allstar (pts integer, dreb integer, oreb integer, reb integer, asts integer, stl integer);
it works for me:
\copy player_allstar FROM '/tmp/sample.csv' WITH (FORMAT CSV, DELIMITER ';', NULL 'NULL', HEADER);
Your syntax is fine; the problem seems to be in the formatting of your data. Using your syntax, I was able to load data with NULLs successfully:
mydb=# create table test(a int, b text);
CREATE TABLE
mydb=# \copy test from stdin WITH(FORMAT CSV, DELIMITER ';', NULL 'NULL', HEADER);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> col a header;col b header
>> 1;one
>> NULL;NULL
>> 3;NULL
>> NULL;four
>> \.
mydb=# select * from test;
a | b
---+------
1 | one
|
3 |
| four
(4 rows)
mydb=# select * from test where a is null;
a | b
---+------
|
| four
(2 rows)
In your case, you can substitute NULL 'NA' in the copy command if the original value is 'NA'.
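A minimal sketch of that substitution, assuming the file still carries the original NA markers rather than the string NULL:

\copy player_allstar FROM '/Users/Desktop/Rdaten/Data/player_allstar.csv' WITH (FORMAT CSV, DELIMITER ';', NULL 'NA', HEADER);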
You should also make sure that there are no spaces around your data values. For example, if your NULL is represented as NA in your data and fields are delimited with semicolons:
1;NA <-- good
1 ; NA <-- bad
1<tab>NA <-- bad
etc.