How to deal with missing values when importing CSV to Postgres? - postgresql

I would like to import a csv file which has multiple occurrences of missing values. I recoded them into NULL and tried to import the file as shown below. I suppose that the attributes which include the NULLs are read as character values, and transforming them to numeric afterwards is a bit complicated. Therefore I would like to import all of my table as:
\copy player_allstar FROM '/Users/Desktop/Rdaten/Data/player_allstar.csv' DELIMITER ';' CSV WITH NULL AS 'NULL' ';' HEADER
There must be a syntax error somewhere. I tried different combinations but always get:
ERROR: syntax error at or near "WITH NULL"
LINE 1: COPY player_allstar FROM STDIN DELIMITER ';' CSV WITH NULL ...
I also tried:
\copy player_allstar FROM '/Users/Desktop/Rdaten/Data/player_allstar.csv' WITH(FORMAT CSV, DELIMITER ';', NULL 'NULL', HEADER);
and get:
ERROR: invalid input syntax for integer: "NULL"
CONTEXT: COPY player_allstar, line 2, column dreb: "NULL"
I suppose it is caused by preprocessing with R. The table came with NAs, so I changed them with:
data[data==NA] <- "NULL"
I'm not aware of a different way of changing them to NULL. I think this turns the columns into strings. Is there a different way to preprocess and keep the NAs (as NULLs in Postgres, of course)?
Sample:
pts dreb oreb reb asts stl
11 NULL NULL 8 3 NULL
4 5 3 8 2 1
3 NULL NULL 1 1 NULL
The data type is integer.

Given /tmp/sample.csv:
pts;dreb;oreb;reb;asts;stl
11;NULL;NULL;8;3;NULL
4;5;3;8;2;1
3;NULL;NULL;1;1;NULL
then with a table like:
CREATE TABLE player_allstar (pts integer, dreb integer, oreb integer, reb integer, asts integer, stl integer);
it works for me:
\copy player_allstar FROM '/tmp/sample.csv' WITH (FORMAT CSV, DELIMITER ';', NULL 'NULL', HEADER);
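To verify the missing values arrived as real SQL NULLs rather than the string 'NULL', a quick check against one of the nullable columns from the sample (dreb) works:
SELECT count(*) FROM player_allstar WHERE dreb IS NULL;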

Your syntax is fine; the problem seems to be in the formatting of your data. Using your syntax I was able to load data with NULLs successfully:
mydb=# create table test(a int, b text);
CREATE TABLE
mydb=# \copy test from stdin WITH(FORMAT CSV, DELIMITER ';', NULL 'NULL', HEADER);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> col a header;col b header
>> 1;one
>> NULL;NULL
>> 3;NULL
>> NULL;four
>> \.
mydb=# select * from test;
a | b
---+------
1 | one
|
3 |
| four
(4 rows)
mydb=# select * from test where a is null;
a | b
---+------
|
| four
(2 rows)
In your case you can substitute NULL 'NA' in the copy command, if the original value is 'NA'.
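For example, reusing the path and delimiter from the question:
\copy player_allstar FROM '/Users/Desktop/Rdaten/Data/player_allstar.csv' WITH (FORMAT CSV, DELIMITER ';', NULL 'NA', HEADER);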
You should make sure that there are no spaces around your data values. For example, if your NULL is represented as NA in your data and fields are delimited with a semicolon:
1;NA <-- good
1 ; NA <-- bad
1<tab>NA <-- bad
etc.
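If the file does contain stray spaces around values, one workaround (a sketch, not part of the original answers) is to stage everything as text and clean it up on the way into the target table; the column names are taken from the question's sample:
-- stage the raw file as text, then trim/convert while inserting
CREATE TEMP TABLE staging (pts text, dreb text, oreb text, reb text, asts text, stl text);
\copy staging FROM '/tmp/sample.csv' WITH (FORMAT CSV, DELIMITER ';', HEADER)
INSERT INTO player_allstar (pts, dreb, oreb, reb, asts, stl)
SELECT nullif(trim(pts), 'NA')::int,
       nullif(trim(dreb), 'NA')::int,
       nullif(trim(oreb), 'NA')::int,
       nullif(trim(reb), 'NA')::int,
       nullif(trim(asts), 'NA')::int,
       nullif(trim(stl), 'NA')::int
FROM staging;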

Related

Snowflake null values quoted in CSV breaks PostgreSQL unload

I am trying to move data from Snowflake to PostgreSQL, and to do so I first unload it to S3 in CSV format. Commas can appear in text fields, so I use Snowflake's FIELD_OPTIONALLY_ENCLOSED_BY unloading option to quote the content of the problematic cells. However, when quoting is combined with null values, I can't manage to produce a CSV that is valid for PostgreSQL.
I created a simple table for you to understand the issue. Here it is:
CREATE OR REPLACE TABLE PUBLIC.TEST(
TEXT_FIELD VARCHAR,
NUMERIC_FIELD INT
);
INSERT INTO PUBLIC.TEST VALUES
('A', 1),
(NULL, 2),
('B', NULL),
(NULL, NULL),
('Hello, world', NULL)
;
COPY INTO @STAGE/test
FROM PUBLIC.TEST
FILE_FORMAT = (
COMPRESSION = NONE,
TYPE = CSV,
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ''
)
OVERWRITE = TRUE;
From that, Snowflake will create the following CSV:
"A",1
"",2
"B",""
"",""
"Hello, world",""
But after that, it is impossible for me to copy this CSV into a PostgreSQL table as it is.
Even though the PostgreSQL documentation says, next to the NULL option:
Specifies the string that represents a null value. The default is \N (backslash-N) in text format, and an unquoted empty string in CSV format.
Setting no COPY options at all in PostgreSQL results in a failed load; it won't work, since we also have to specify the quote character using QUOTE, here QUOTE '"'.
Therefore, during the PostgreSQL load, using:
FORMAT csv, HEADER false, QUOTE '"' will give:
DataError: invalid input syntax for integer: "" CONTEXT: COPY test, line 3, column numeric_field: ""
FORMAT csv, HEADER false, NULL '""', QUOTE '"' will give:
NotSupportedError: CSV quote character must not appear in the NULL specification
FYI, to test the load from S3 I use this command in PostgreSQL:
CREATE TABLE IF NOT EXISTS PUBLIC.TEST(
TEXT_FIELD VARCHAR,
NUMERIC_FIELD INT
);
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;
SELECT aws_s3.table_import_from_s3(
'PUBLIC.TEST',
'',
'(FORMAT csv, HEADER false, NULL ''""'', QUOTE ''"'')',
'bucket',
'test_0_0_0.csv',
'aws_region'
)
Thanks a lot for any ideas on how to make this work. I would love to find a solution that doesn't require modifying the CSV between Snowflake and Postgres. I think the issue is more on the Snowflake side, as it doesn't really make sense to quote null values, but PostgreSQL is not helping either.
When you set the NULL_IF value to '', you are actually telling Snowflake to convert NULLs to a BLANK, which then gets quoted. When you are copying out of Snowflake, the copy options are "backwards" in a sense, and NULL_IF acts more like an IFNULL.
This is the code that I'd use on the Snowflake side, which will result in an unquoted empty string in your CSV file:
FILE_FORMAT = (
COMPRESSION = NONE,
TYPE = CSV,
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ()
)
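With NULL_IF = (), the unload should then produce unquoted empty fields for the NULLs, roughly like this (illustrative, based on the sample table above):
"A",1
,2
"B",
,
"Hello, world",
And per the documentation quoted above, PostgreSQL's CSV mode already treats an unquoted empty string as NULL by default, so the import should now work with just (FORMAT csv, HEADER false, QUOTE '"') and no NULL option at all.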

Unable to replace dash with null during COPY operation from CSV

I have the following CSV data
"AG","Saint Philip","AG-08"
"AI","Anguilla","-"
"AL","Berat","AL-01"
I want to replace - with NULL
I use the following command
copy subdivision from '/tmp/IP2LOCATION-ISO3166-2.CSV' with delimiter as ',' NULL AS '-' csv;
The copy operation succeeds. However, the - in the 3rd column is copied literally instead of being replaced with NULL.
Do you have any idea what the mistake in my command is? My table is:
CREATE TABLE subdivision(
country_code TEXT NOT NULL,
name TEXT NOT NULL,
code TEXT
);
It comes down to the quoting. If you have this (note the dash is not quoted):
"AG","Saint Philip","AG-08"
"AI","Anguilla",-
"AL","Berat","AL-01"
Then the below works (using the newer COPY format):
copy subdivision
from '/home/postgres/csv_test.csv'
with (format csv, delimiter ',', NULL '-');
COPY 3
\pset null NULL
select * from subdivision ;
country_code | name | code
--------------+--------------+-------
AG | Saint Philip | AG-08
AI | Anguilla | NULL
AL | Berat | AL-01
If you maintain the original csv:
"AG","Saint Philip","AG-08"
"AI","Anguilla","-"
"AL","Berat","AL-01"
then you have to do this:
copy subdivision
from '/home/postgres/csv_test.csv'
with (format csv, delimiter ',', NULL '-', FORCE_NULL (code));
select * from subdivision ;
country_code | name | code
--------------+--------------+-------
AG | Saint Philip | AG-08
AI | Anguilla | NULL
AL | Berat | AL-01
where FORCE_NULL is:
https://www.postgresql.org/docs/current/sql-copy.html
FORCE_NULL
Match the specified columns' values against the null string, even if it has been quoted, and if a match is found set the value to NULL. In the default case where the null string is empty, this converts a quoted empty string into NULL. This option is allowed only in COPY FROM, and only when using CSV format.
So to convert quoted values you have to force the conversion by specifying the column(s).

DB2: Converting varchar to money

I have two VARCHAR(64) columns whose values are decimals in this case (say COLUMN1 and COLUMN2, both varchars, both decimal numbers representing money). I need to create a WHERE clause where I say this:
COLUMN1 < COLUMN2
I believe I have to convert these two varchar columns to a different data type to compare them like that, but I'm not sure how to go about it. I tried a straightforward CAST:
CAST(COLUMN1 AS DECIMAL(9,2)) < CAST(COLUMN2 AS DECIMAL(9,2))
But I should have known that would be too easy. Any help is appreciated. Thanks!
You can create a UDF like this to check which values can't be cast to DECIMAL
CREATE OR REPLACE FUNCTION IS_DECIMAL(i VARCHAR(64)) RETURNS INTEGER
CONTAINS SQL
--ALLOW PARALLEL -- can use this on Db2 11.5 or above
NO EXTERNAL ACTION
DETERMINISTIC
BEGIN
DECLARE NOT_VALID CONDITION FOR SQLSTATE '22018';
DECLARE EXIT HANDLER FOR NOT_VALID RETURN 0;
RETURN CASE WHEN CAST(i AS DECIMAL(31,8)) IS NOT NULL THEN 1 END;
END
For example
CREATE TABLE S ( C VARCHAR(32) );
INSERT INTO S VALUES ( ' 123.45 '),('-00.12'),('£546'),('12,456.88');
SELECT C FROM S WHERE IS_DECIMAL(c) = 0;
would return
C
---------
£546
12,456.88
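Once you know which rows fail, the original comparison can be guarded so the cast is only attempted on values the UDF accepts. A sketch (MYTABLE is hypothetical); the nested CASE relies on the WHEN condition being evaluated before the THEN expression:
SELECT *
FROM MYTABLE
WHERE CASE WHEN IS_DECIMAL(COLUMN1) = 1 AND IS_DECIMAL(COLUMN2) = 1
           THEN CASE WHEN CAST(COLUMN1 AS DECIMAL(9,2)) < CAST(COLUMN2 AS DECIMAL(9,2))
                     THEN 1 ELSE 0 END
           ELSE 0
      END = 1;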
It really is that easy...this works fine...
select cast('10.15' as decimal(9,2)) - 1
from sysibm.sysdummy1;
You've got something besides valid numeric characters in your data...
And it's something besides leading or trailing whitespace...
Try the following...
select *
from table
where translate(column1, ' ', '0123456789.') <> ' '
   or translate(column2, ' ', '0123456789.') <> ' '
That will show you the rows containing non-numeric characters...
If the above doesn't return anything, then you've probably got a string with double decimal points or something similar...
You could use a regex to find those.
There is a built-in ability to do this without UDFs.
The xmlcast function below does "safe" casting between (var)char and decfloat (you may use as double or as decimal(X, Y) instead, if you want). It returns NULL if it's impossible to cast.
You may use such an expression twice in the WHERE clause.
SELECT
S
, xmlcast(xmlquery('if ($v castable as xs:decimal) then xs:decimal($v) else ()' passing S as "v") as decfloat) D
FROM (VALUES ( ' 123.45 '),('-00.12'),('£546'),('12,456.88')) T (S);
|S        |D      |
|---------|-------|
| 123.45  |123.45 |
|-00.12   |-0.12  |
|£546     |       |
|12,456.88|       |
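Applied to the original question, the expression is simply used on both sides of the comparison; rows where either value can't be cast become NULL and drop out (MYTABLE is hypothetical):
SELECT *
FROM MYTABLE
WHERE xmlcast(xmlquery('if ($v castable as xs:decimal) then xs:decimal($v) else ()' passing COLUMN1 as "v") as decfloat)
    < xmlcast(xmlquery('if ($v castable as xs:decimal) then xs:decimal($v) else ()' passing COLUMN2 as "v") as decfloat);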

invalid input syntax for integer: "1" postgresql

PostgreSQL gives me this error when I try to cast a TEXT column to an integer.
select pro_id::integer from mmp_promocjas_tmp limit 1;
This column contains only digits, i.e. valid integers. How can "1" be an invalid integer?
select pro_id, length(pro_id) ,length(trim(pro_id)) from mmp_promocjas_tmp limit 1;
outputs:
1 | 2 | 2
The query select pro_id from mmp_promocjas_tmp where trim(pro_id) = '1' returns nothing.
I also tried to remove whitespace, without any result:
select pro_id from mmp_promocjas_tmp where regexp_replace(trim(pro_id), '\s*', '', 'g') = '1'
There are probably spurious invisible characters in the column.
To make them visible, try a query like this:
select pro_id, c,lpad(to_hex(ascii(c)),4,'0') from (
select pro_id,regexp_split_to_table(pro_id,'') as c
from (select pro_id from mmp_promocjas_tmp limit 10) as s
) as g;
This will show the ID and each character it contains, both as a character and as its hexadecimal code in the repertoire.
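Once the stray characters are identified, a cleanup cast along these lines (a sketch; it assumes only the ASCII digits should survive) makes the conversion work:
select regexp_replace(pro_id, '[^0-9]', '', 'g')::integer
from mmp_promocjas_tmp;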

How to import data into teradata tables from delimited file using BTEQ import?

I am trying to execute the following BTEQ script in a Linux environment, but I couldn't load the data properly into the Teradata DB server. Can someone please advise me on how to resolve the issue I am facing while loading.
BTEQ Command used :
.SET width 64000;
.SET session transaction btet;
.logmech ldap
.logon XXXXXXX/XXXXXXXX,********;
DATABASE corecm;
.PACK 1000
.IMPORT VARTEXT '~' FILE=/v/global/user/application_event_bus_evt
.REPEAT *
USING(APPLICATION_EVENT_ID CHAR(24),BUS_EVT_ID CHAR(24),BUS_EVT_VID BIGINT,BUS_EVT_RESTATE_IN SMALLINT)
insert into corecm.application_event_bus_evt (APPLICATION_EVENT_ID
, BUS_EVT_ID
, BUS_EVT_VID
, BUS_EVT_RESTATE_IN
)
values
( COALESCE(:APPLICATION_EVENT_ID,1)
, COALESCE(:BUS_EVT_ID,1)
, COALESCE(:BUS_EVT_VID,1)
, COALESCE(:BUS_EVT_RESTATE_IN,1)
) ;
.LOGOFF;
.EXIT;
SAMPLE INPUT FILE, DELIMITER "~" [ /v/global/user/application_event_bus_evt ]:
Ckn3gMxLEeOgIQBQVgErYA==~g+GDDtlaY3n7BdUrYshDFA==~1~1
CL1kEcxLEeOgIQBQVgErYA==~qoKoiuGDbClpcGt/z6RKGw==~1~1
oYIVcMxKEeOgIQBQVgErYA==~mfmQiwl7yAteevzJfilMvA==~1~1
5N7ME5bM4xGhM7exj3ykUw==~yFM2FZbM4xGhM7exj3ykUw==~1~0
JLBH4JfM4xGDH9s5+Ds/8w==~doZ/7pfM4xGDH9s5+Ds/8w==~1~0
fGvpoMxKEeOgIQBQVgErYA==~mQUQIK2mY6WIPcszfp5BTQ==~1~1
Table Definition :
CREATE MULTISET TABLE CORECM.APPLICATION_EVENT_BUS_EVT ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
APPLICATION_EVENT_ID CHAR(26) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
BUS_EVT_ID CHAR(26) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
BUS_EVT_VID BIGINT NOT NULL,
BUS_EVT_RESTATE_IN SMALLINT)
UNIQUE PRIMARY INDEX ( APPLICATION_EVENT_ID ,BUS_EVT_ID ,BUS_EVT_VID )
INDEX APPLICATION_EVENT_BUS_EVT_IDX1 ( APPLICATION_EVENT_ID )
INDEX APPLICATION_EVENT_BUS_EVT_IDX2 ( BUS_EVT_ID ,BUS_EVT_VID );
The result set in the DB server looks like this:
APPLICATION_EVENT_ID BUS_EVT_ID BUS_EVT_VID BUS_EVT_RESTATE_IN
1 Ckn3gMxLEeOgIQBQVgErYA == g+GDDtlaY3n7BdUrYshD 85,849,873,219,141,958 12,544
2 CL1kEcxLEeOgIQBQVgErYA == qoKoiuGDbClpcGt/z6RK 85,849,873,219,155,783 12,544
3 oYIVcMxKEeOgIQBQVgErYA == mfmQiwl7yAteevzJfilM 85,849,873,219,142,006 12,544
4 5N7ME5bM4xGhM7exj3ykUw == JAf0GpbM4xGhM7exj3yk 85,849,873,219,155,797 12,288
5 JLBH4JfM4xGDH9s5+Ds/8w == Du6T7pfM4xGDH9s5+Ds/ 85,849,873,219,155,768 12,288
6 fGvpoMxKEeOgIQBQVgErYA == mQUQIK2mY6WIPcszfp5B 85,849,873,219,146,068 12,544
If we look at the data, we can see two issues:
The data in the first two columns is 24 characters long (as per the input file), but it has been shifted by two characters into the next column.
Columns BUS_EVT_VID and BUS_EVT_RESTATE_IN contain wrong data, 85,849,873,219,141,958 and 12,544 instead of 1 and 1 respectively (this may be because the first two columns' data got shifted).
I tried the following options to resolve the issue, but without success:
Modified the table definition, i.e. changed the datatypes to CHAR(28), CHAR(24), CHAR(26).
Modified the table definition column datatypes to VARCHAR(24), VARCHAR(26).
Modified the BTEQ command, i.e. altered the datatypes in the line below:
USING(APPLICATION_EVENT_ID CHAR(24),BUS_EVT_ID CHAR(24),BUS_EVT_VID BIGINT,BUS_EVT_RESTATE_IN SMALLINT)
Thanks in advance.
When you define VARTEXT, all input columns must be defined as VARCHAR, but you used CHAR and integer types.
This should work, with the VARCHAR lengths based on the definition of your target table:
USING(
APPLICATION_EVENT_ID VARCHAR(26),
BUS_EVT_ID VARCHAR(26),
BUS_EVT_VID VARCHAR(19),
BUS_EVT_RESTATE_IN VARCHAR(6)
)
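Put together with the rest of the script, the import section would look roughly like this (a sketch; the COALESCE defaults are quoted here because all VARTEXT fields arrive as VARCHAR and are converted on insert):
.IMPORT VARTEXT '~' FILE=/v/global/user/application_event_bus_evt
.REPEAT *
USING(
APPLICATION_EVENT_ID VARCHAR(26),
BUS_EVT_ID VARCHAR(26),
BUS_EVT_VID VARCHAR(19),
BUS_EVT_RESTATE_IN VARCHAR(6)
)
INSERT INTO corecm.application_event_bus_evt (APPLICATION_EVENT_ID, BUS_EVT_ID, BUS_EVT_VID, BUS_EVT_RESTATE_IN)
VALUES (
COALESCE(:APPLICATION_EVENT_ID, '1'),
COALESCE(:BUS_EVT_ID, '1'),
COALESCE(:BUS_EVT_VID, '1'),
COALESCE(:BUS_EVT_RESTATE_IN, '1')
);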