How to import empty strings as null values from CSV file - using pgloader? - postgresql

I am using pgloader to import from a .csv file which has empty strings in double quotes. A sample line is
12334,0,"MAIL","CA","","Sanfransisco","TX","","",""
After a successful import, the fields that had double quotes ("") show up as two single quotes ('') in the Postgres database.
Is there a way to insert a null, or even an empty string, in place of the two single quotes ('')?
I am using these arguments:
WITH truncate,
fields optionally enclosed by '"',
fields escaped by double-quote,
fields terminated by ','
SET client_encoding to 'UTF-8',
work_mem to '12MB',
standard_conforming_strings to 'on'
I tried using 'empty-string-to-null' mentioned in the documentation, like this:
CAST column enumerate.fax using empty-string-to-null
But it gives me an error saying:
pgloader nph_opr_addr.test.load An unhandled error condition has been
signalled: At LOAD CSV
^ (Line 1, Column 0, Position 0) Could not parse subexpression ";"
when parsing

Use the field option:
null if blanks
Something like this:
...
having fields foo, bar, mynullcol null if blanks, baz
From the documentation:
null if
This option takes an argument which is either the keyword blanks or a double-quoted string.
When blanks is used and the field value that is read contains only space characters, then it's automatically converted to an SQL NULL value.
When a double-quoted string is used and that string is read as the field value, then the field value is automatically converted to an SQL NULL value.
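Putting the answer together with the question's WITH/SET clauses, a minimal load command file might look like this (the file path, connection string, and field names below are placeholders, not the asker's real schema):
LOAD CSV
     FROM 'nph_opr_addr.csv'
          HAVING FIELDS (id, code, channel, state, zip1, city, st, phone, fax null if blanks, email null if blanks)
     INTO postgresql:///mydb?tablename=opr_addr
     WITH truncate,
          fields optionally enclosed by '"',
          fields escaped by double-quote,
          fields terminated by ','
      SET client_encoding to 'UTF-8',
          work_mem to '12MB',
          standard_conforming_strings to 'on';
Every field marked null if blanks is loaded as SQL NULL whenever the value is empty or contains only spaces, even if it was quoted in the CSV.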

Related

Getting ERROR: invalid input syntax for type double precision: "" in PostgreSQL

I am trying to copy data from a CSV file to a Postgres table using the following command:
psql -c "\COPY team_cweo.bsa_mobile_pre_retention_asset FROM 'part-00199-8372009a-439d-49e0-9efc-141aead78131-c000.csv' CSV HEADER DELIMITER ','"
The CSV file is the output of Spark's DataFrameWriter. I realized that some fields contain null values, which are represented as "" in the CSV file. Because of this I am getting the following error:
ERROR: invalid input syntax for type double precision: ""
CONTEXT: COPY bsa_mobile_pre_retention_asset, line 3, column 6281410000207
How can I make PostgreSQL treat "" as a null value instead of an empty string? Or should I do something in the DataFrameWriter so that null values are represented as something else in the CSV file?
Yes, it would be good if you could choose a different representation for NULL values, ideally an unquoted empty string. At any rate, it cannot contain the escape character (by default "). You can then use the NULL option of COPY, for example NULL '(null)' (the default value is the empty string).
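For example, if the writer were changed to emit (null) for missing values, the load from the question could become (same table and file names as above):
psql -c "\COPY team_cweo.bsa_mobile_pre_retention_asset FROM 'part-00199-8372009a-439d-49e0-9efc-141aead78131-c000.csv' WITH (FORMAT csv, HEADER, DELIMITER ',', NULL '(null)')"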
If you cannot do that, you could define the column as type text and later convert it with
ALTER TABLE tab
ALTER col TYPE double precision USING CAST (nullif(col, '') AS double precision);
But that requires that the table gets rewritten, which can take a while.

Basic DELETE commands in PostgreSQL detecting value as column name [duplicate]

This question already has answers here:
delete "column does not exist"
(1 answer)
SQL domain ERROR: column does not exist, setting default
(3 answers)
Closed last year.
I'm trying to delete a row at PostgreSQL using pgAdmin4.
Here is my command:
DELETE FROM commissions_user
WHERE first_name = "Steven";
For some reason, the error states that
ERROR: column "Steven" does not exist
LINE 2: WHERE first_name = "Steven";
^
SQL state: 42703
Character: 50
It's weird, why is "Steven" detected as a column name, shouldn't the column name be first_name?
Use single quotes instead
DELETE FROM commissions_user
WHERE first_name = 'Steven';
Double quotes are used for table and column names, and single quotes are used for strings.
For example:
DELETE FROM "commissions_user"
WHERE "first_name" = 'Steven';
https://www.postgresql.org/docs/current/sql-syntax-lexical.html
Double quote:
A convention often used is to write key words in upper case and names
in lower case, e.g.:
UPDATE my_table SET a = 5;
There is a second kind of identifier: the delimited identifier or
quoted identifier. It is formed by enclosing an arbitrary sequence of characters in double-quotes ("). A delimited identifier
is always an identifier, never a key word. So "select" could be used
to refer to a column or table named “select”, whereas an unquoted
select would be taken as a key word and would therefore provoke a
parse error when used where a table or column name is expected. The
example can be written with quoted identifiers like this:
UPDATE "my_table" SET "a" = 5;
Single Quote:
https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
A string constant in SQL is an arbitrary sequence of characters
bounded by single quotes ('), for example 'This is a string'. To
include a single-quote character within a string constant, write two
adjacent single quotes, e.g., 'Dianne''s horse'. Note that this is not
the same as a double-quote character (").
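So, applied to the question, the value goes in single quotes, and a literal single quote inside the value is doubled (the name here is made up for illustration):
DELETE FROM commissions_user
WHERE first_name = 'O''Connor';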

When using psql to copy from a csv file, a "" for a null column generates an error instead of a null

I'm trying to use copy to copy a large csv file into a postgres table.
A certain integer column is primarily null. In the csv file, this column just has "".
Every column is quoted, which doesn't seem to be an issue for other columns.
I get this error when I try to copy it:
ERROR: invalid input syntax for integer: ""
I tried setting a NULL clause to '' and "" in my copy statement. '' does nothing, "" generates an error:
zero-length delimited identifier at or near """"
I tried using sed to change all "" to " ", but that still doesn't work even when I set the null clause to " ". I still get
ERROR: invalid input syntax for integer: " "
For now I am able to proceed by sed'ing the column to -1. I don't really care about this column much anyway. I'd be OK with just setting it to null, or ignoring it, but when I tried to take it out of the column definition section of the copy command, postgres yelled at me.
So my question comes down to this: how can I tell postgres to treat "" as a null value?
Thank you.
The typical way to indicate a missing value (null) in a .csv file is to just put nothing into that field. For instance, if you have three columns (A, B and C) and there is no value for B, the .csv file would contain "Col A value",,"Col C value". "" is a string value, not a numeric value, so there's no way for it to be considered one.
This is what the force_null option is for:
Match the specified columns' values against the null string, even if it has been quoted
So assuming the name of the int column is "y":
\copy foo from foo.csv with (format csv, force_null (y));
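Extending that into a self-contained sketch (foo and y are the answer's placeholder names, not the asker's schema):
CREATE TABLE foo (x text, y integer);
-- foo.csv contains:
--   "a","1"
--   "b",""
\copy foo from foo.csv with (format csv, force_null (y));
-- the second row now loads with y = NULL instead of raising
-- ERROR: invalid input syntax for integer: ""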

USQL Escape Quotes

I am new to Azure Data Lake Analytics. I am trying to load a csv whose string columns are double quoted, and there are quotes inside a column on some random rows.
For example
ID, BookName
1, "Life of Pi"
2, "Story about "Mr X""
When I try loading, it fails on second record and throwing an error message.
1. I wonder if there is a way to fix this in the csv file? Unfortunately we cannot extract a new one from the source, as these are log files.
2. Is it possible to let ADLA ignore the bad rows and proceed with the rest of the records?
Execution failed with error '1_SV1_Extract Error :
'{"diagnosticCode":195887146,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXTRACT_ROW_ERROR","message":"Error
occurred while extracting row after processing 9045 record(s) in the
vertex' input split. Column index: 9, column name:
'instancename'.","description":"","resolution":"","helpLink":"","details":"","internalDiagnostics":"","innerError":{"diagnosticCode":195887144,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXTRACT_EXTRACT_INVALID_CHARACTER_AFTER_QUOTED_FIELD","message":"Invalid
character following the ending quote character in a quoted
field.","description":"Invalid character is detected following the
ending quote character in a quoted field. A column delimiter, row
delimiter or EOF is expected.\nThis error can occur if double-quotes
within the field are not correctly escaped as two
double-quotes.","resolution":"Column should be fully surrounded with
double-quotes and double-quotes within the field escaped as two
double-quotes."
As per the error message, if you are importing a quoted csv which has quotes within some of the columns, then these need to be escaped as two double-quotes. In your particular example, your second row needs to be:
2, "Story about ""Mr X"""
So one option is to fix up the original file on output. If you are not able to do this, then you can import all the columns as one column, use RegEx to fix up the quotes, and output the file again, e.g.
// Import records as one row then use RegEx to clean columns
#input =
EXTRACT oneCol string
FROM "/input/input132.csv"
USING Extractors.Text( '|', quoting: false );
// Fix up the quotes using RegEx
#output =
SELECT Regex.Replace(oneCol, "([^,])\"([^,])", "$1\"\"$2") AS cleanCol
FROM #input;
OUTPUT #output
TO "/output/output.csv"
USING Outputters.Csv(quoting : false);
The file will now import successfully.

Postgresql - import from CSV null values wrapped in double quotes

So I am trying to import some data into postgresql using the COPY command.
Here is a sample of what the data looks like:
"UNIQ_ID","SP_grd1","SACN_grd1","BIOME_grd1","Meso_grd1","DM_grd1","VEG_grd1","lcov90_alb","WMA_grd1"
"G01_00000002","199058001.00000","1.00000","6.00000","24889.00000","2.00000","381.00000","33.00000","9.00000"
"G01_00000008","*********************","1.00000","*********************","24889.00000","2.00000","*********************","34.00000","*********************"
The issue that I am having is the double quotes wrapping the *********************, which are the null values.
I am using the following in order to create the data table and copy the data:
CREATE TABLE bravo.G01(UNIQ_ID character varying(18), SP_grd1 double precision ,SACN_grd1 numeric,BIOME_grd1 numeric,Meso_grd1 double precision,DM_grd1 numeric,VEG_grd1 numeric,lcov90_alb numeric,WMA_grd1 numeric);
COPY bravo.g01(UNIQ_ID,SP_grd1,SACN_grd1,BIOME_grd1,Meso_grd1,DM_grd1,VEG_grd1,lcov90_alb,WMA_grd1) FROM 'F:\GreenBook-Backup\LUdatacube_20171206\CSV_Data_bravo\G01.csv' DELIMITER ',' NULL AS '*********************' CSV HEADER;
The CREATE TABLE command works fine, but I encounter an error with the NULL AS statement. If I edit the text file and remove the double quotes then the import works fine.
I assume that as CSVs with double quotes and null values are very common there must be a work around here that I am missing. I certainly don't want to go and edit each of my CSVs so that it doesn't have double quotes!
You might want to try adding the FORCE_NULL ( column_name [, ...] ) option.
As the documentation states for FORCE_NULL:
Match the specified columns' values against the null string, even if it has been quoted, and if a match is found set the value to NULL. In the default case where the null string is empty, this converts a quoted empty string into NULL. This option is allowed only in COPY FROM, and only when using CSV format.
The option is available from Postgres 9.4: https://www.postgresql.org/docs/10/static/sql-copy.html
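Applied to the COPY from the question, a sketch using the newer parenthesized option syntax might look like this (same table, columns, and null string as above):
COPY bravo.g01 (UNIQ_ID, SP_grd1, SACN_grd1, BIOME_grd1, Meso_grd1, DM_grd1, VEG_grd1, lcov90_alb, WMA_grd1)
FROM 'F:\GreenBook-Backup\LUdatacube_20171206\CSV_Data_bravo\G01.csv'
WITH (FORMAT csv, HEADER, DELIMITER ',',
      NULL '*********************',
      FORCE_NULL (SP_grd1, SACN_grd1, BIOME_grd1, Meso_grd1, DM_grd1, VEG_grd1, lcov90_alb, WMA_grd1));
With FORCE_NULL on the numeric columns, the quoted "*********************" values match the NULL string and are stored as NULL, so the CSV does not need editing.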
If you're on a unix-like platform, you could use sed to replace the null-strings with something PostgreSQL will recognize automatically as null. On Windows, PowerShell exposes similar functionality.
This approach is more general if you need to perform other types of clean up on the data before loading.
The regex pattern to match your null-string is "[\*]*"
cleaning the file with sed:
[unix]>sed 's/"[\*]*"//g' test.csv > test2.csv
cleaning the file with windows powershell:
[windows-powershell]>cat test.csv | %{$_ -replace '"[\*]*"', ""} > test2.csv
Loading into PostgreSQL can then be shorter:
psql>\copy bravo.g01 FROM 'test2.csv' WITH CSV HEADER;