How to null missing columns when importing CSV - postgresql

I have a very large (200+ GB), incorrectly formatted CSV file. Some rows have fewer values than others, e.g.:
col1,col2,col3,col4
val2,val3,val5,val6
val2
val2,val3
val2,val4,val8,val9
Obviously, when I try to import this into Postgres it throws an error about missing data for some columns. I would like to avoid fixing the CSV file itself, as it is very large and that would likely take quite a bit of time. How do I get the importer to simply insert null values for the missing data instead of throwing an error?

You could use awk to edit the .csv file.
#!/bin/sh
cat - <<OMG > omg.csv
col1,col2,col3,col4
val2,val3,val5,val6
val2
val2,val3
val2,val4,val8,val9
OMG
awk -F, '{printf($0); for (i=NF;i<4;i++) {printf(",");} printf("\n"); }' < omg.csv # >out.csv
Result:
$ sh awk.sh
col1,col2,col3,col4
val2,val3,val5,val6
val2,,,
val2,val3,,
val2,val4,val8,val9
$
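Since the file is 200+ GB, you may not want to write a corrected copy to disk first. A minimal sketch of streaming the padded rows straight into COPY instead (big.csv, db and mytable are placeholders for your own file, database and table names):
$ awk -F, -v OFS=, '{ for (i = NF + 1; i <= 4; i++) $i = "" } 1' big.csv | \
    psql db -c "COPY mytable FROM STDIN WITH (FORMAT csv, HEADER)"
With FORMAT csv, the unquoted empty fields produced by the padding are read as NULL by default, which is exactly what the question asks for.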

Related

How to remove a dynamic string from a CSV file using sed?

I added a dummy column at the beginning of my data export to a CSV file so I could strip out control characters and some specific string values, as shown below, using a pipe '|' delimiter. The data comes from a Teradata FastExport using UTF-8.
'''
y^CDUMMYCOLUMN|
<86>^ADUMMYCOLUMN|
<87>^ADUMMYCOLUMN|
<94>^ADUMMYCOLUMN|
{^ADUMMYCOLUMN|
_^ADUMMYCOLUMN|
y^CDUMMYCOLUMN|
[^ADUMMYCOLUMN|
k^ADUMMYCOLUMN|
m^ADUMMYCOLUMN|
<82>^ADUMMYCOLUMN|
c^ADUMMYCOLUMN|
<8e>^ADUMMYCOLUMN|
<85>^ADUMMYCOLUMN|
'''
These characters appear completely at random, and not every row has them. I'm sure I'm missing something here. I'm using sed to get rid of the dummy column and the control characters.
'''$ sed -e 's/.*DUMMYCOLUMN|//;/^$/d' data.csv > data_output.csv'''
After running this command, the following stray values still remain:
'''
<86>
<87>
<85>
<94>
<8a>
<85>
<8e>
'''
I could write a sed command to remove the first three characters from each row, but these sequences do not appear in every row. On top of that, the file has 400 million rows.
Current output:
y^CDUMMYCOLUMN|COLUMN1|COLUMN2|COLUMN3
<86>^ADUMMYCOLUMN|6218915846|36596|12
<87>^ADUMMYCOLUMN|9822354765|35325|33
t^ADUMMYCOLUMN|6788793999|111|12
g^ADUMMYCOLUMN|6090724004|7017|12
_^ADUMMYCOLUMN|IC-21357688806502|111|12
<8e>^ADUMMYCOLUMN|9682027117|35335|33
v^ADUMMYCOLUMN|6406807681|121|12
h^ADUMMYCOLUMN|6346768510|121|12
V^ADUMMYCOLUMN|6130452510|7017|12
Desired output:
COLUMN1|COLUMN2|COLUMN3
6218915846|36596|12
9822354765|35325|33
6788793999|111|12
6090724004|7017|12
IC-21357688806502|111|12
9682027117|35335|33
6406807681|121|12
6346768510|121|12
6130452510|7017|12
Please help.
Thank you.
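One possible approach, assuming the leftover values such as <86> are single high bytes (0x80-0x9F) left over from the original encoding rather than real text: strip non-printable bytes in the C locale first, then remove everything up to and including the DUMMYCOLUMN| prefix. This is only a sketch against the sample above, not a tested answer for the full 400-million-row file:
$ LC_ALL=C sed -e 's/[^[:print:]]//g' -e 's/^[^|]*DUMMYCOLUMN|//' -e '/^$/d' data.csv > data_output.csv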

Is there any way to encode Multiple columns in a csv using base64 in Shell?

I have a requirement to replace multiple columns of a CSV file with their base64-encoded values. The encoding should be applied to certain columns only, and the first line must be left unchanged because it contains the header of the file. I have tried the following for one column, but because I told it to skip the first line of the file, the header does not appear in the output:
gawk 'BEGIN { FS="|"; OFS="|" } NR >=2 { cmd="echo "$4" | base64 -w 0";cmd | getline x;close(cmd); print $1,$2,$3,x}' awktest
Output:
12|A|B|Qw==
13|C|D|RQ==
36|Z|V|VQ==
Questions: it is not showing the header in the output. What should I do to produce the header in the output? Also, can I use a loop here to replace multiple columns?
input:
10|A|B|C|5|T|R
12|A|B|C|6|eee|ff
13|C|D|E|9|dr|xrdd
36|Z|V|U|7|xc|xd
Required output:
10|A|B|C|5|T|R
12|A|B|encodedvalue|6|encodedvalue|ff
13|C|D|encodedvalue|9|encodedvalue|xrdd
36|Z|V|encodedvalue|7|encodedvalue|xd
Is this possible? I have researched a lot but could not find a proper explanation. I am new to shell. Kindly help. Many thanks!
It looks like you can just sequence pattern-action rules. This may not be the best way of solving the header issue, but it's intuitive.
BEGIN { FS="|"; OFS="|" } NR ==1 {print} NR >=2 { cmd="echo "$4" | base64 -w 0";cmd | getline x;close(cmd); print $1,$2,$3,x}
As for using a loop to affect multiple columns: loops in bash are hard, but awk is technically its own language and it does have looping constructs of its own, so you don't need to leave awk at all. If there's only a reasonable number of fields that need modifying, you can either parameterize the existing command by the field index and pipe through however many instances of it, or handle all of the target columns in a single pass of awk, which is also the more performant option; see the sketch below.
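A minimal sketch of that single-pass approach, assuming the columns to encode are 4 and 6 (as in the required output) and that the field values contain no shell metacharacters:
gawk 'BEGIN { FS = OFS = "|"; split("4 6", cols, " ") }   # columns to base64-encode
NR == 1 { print; next }                                   # pass the header through untouched
{
    for (i in cols) {
        c = cols[i]
        cmd = "echo " $c " | base64 -w 0"                 # encode one field via the shell
        cmd | getline enc
        close(cmd)
        $c = enc
    }
    print
}' awktest
Spawning base64 once per field is slow on big files, but everything stays in one pass of awk.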

How to import a CSV file into Postgres with empty values?

I am trying to import a CSV file into Postgres that contains age values; however, there are also some empty values, since not all ages are known.
I would like to import the columns as real, since they contain ages with decimals like 98.45. The empty cells for people whose age is not known are apparently treated as strings, but I would still like to import the age values as numbers. So I was wondering how to import real values even when some cells in the CSV are empty and are therefore, according to Postgres, considered string values.
For the table creation I used the following, since I am dealing with decimal values:
Create table psychosocial.age (
respnr integer Primary key,
fage real,
gage real,
hage real);
After importing the CSV file, I get the following error:
ERROR: invalid input syntax for integer: "11455, , , "
CONTEXT: COPY age, line 2, column respnr: "11455, , , "
One problem is that you're trying to import white spaces into numeric fields. So, first you have to pre-process your csv file before importing it.
Below is an example of how you can solve it using awk. From your console execute the following command:
$ cat file.csv | awk '{sub(/^ +/,""); gsub(/, /,",")}1' | psql db -c "COPY psychosocial.age FROM STDIN WITH CSV HEADER"
In case you're wondering how to pipe commands, take a look at these answers. Here is a more detailed example of how to use COPY with STDIN.
You also have to take into account that having quotation marks around integer fields can be problematic, e.g.:
"11455, , , "
This will result in an error, since Postgres will parse "11455 as a single value and will try to store it in an integer field, which will obviously fail. Instead, format your CSV file like this:
11455, , ,
or even
11455,,,
You can achieve this also using awk from your console:
$ awk '{gsub(/\"/,"")};1' file.csv
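The two clean-up steps can also be combined into a single awk pass that feeds COPY directly; a sketch assuming the same file, schema and database names as above:
$ awk '{gsub(/"/,""); sub(/^ +/,""); gsub(/, /,",")}1' file.csv | \
    psql db -c "COPY psychosocial.age FROM STDIN WITH CSV HEADER"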

Postgresql - import from CSV null values wrapped in double quotes

So I am trying to import some data into postgresql using the COPY command.
Here is a sample of what the data looks like:
"UNIQ_ID","SP_grd1","SACN_grd1","BIOME_grd1","Meso_grd1","DM_grd1","VEG_grd1","lcov90_alb","WMA_grd1"
"G01_00000002","199058001.00000","1.00000","6.00000","24889.00000","2.00000","381.00000","33.00000","9.00000"
"G01_00000008","*********************","1.00000","*********************","24889.00000","2.00000","*********************","34.00000","*********************"
The issue I am having is the double quotes wrapping the ********************* entries, which are the null values.
I am using the following in order to create the data table and copy the data:
CREATE TABLE bravo.G01(UNIQ_ID character varying(18), SP_grd1 double precision ,SACN_grd1 numeric,BIOME_grd1 numeric,Meso_grd1 double precision,DM_grd1 numeric,VEG_grd1 numeric,lcov90_alb numeric,WMA_grd1 numeric);
COPY bravo.g01(UNIQ_ID,SP_grd1,SACN_grd1,BIOME_grd1,Meso_grd1,DM_grd1,VEG_grd1,lcov90_alb,WMA_grd1) FROM 'F:\GreenBook-Backup\LUdatacube_20171206\CSV_Data_bravo\G01.csv' DELIMITER ',' NULL AS '*********************' CSV HEADER ;
The CREATE TABLE command works fine, but I encounter an error with the NULL AS clause. If I edit the text file and remove the double quotes, then the import works fine.
I assume that as CSVs with double quotes and null values are very common there must be a work around here that I am missing. I certainly don't want to go and edit each of my CSVs so that it doesn't have double quotes!
You might want to try adding the FORCE_NULL ( column_name [, ...] ) option.
As the documentation states for FORCE_NULL:
Match the specified columns' values against the null string, even if it has been quoted, and if a match is found set the value to NULL. In the default case where the null string is empty, this converts a quoted empty string into NULL. This option is allowed only in COPY FROM, and only when using CSV format.
The option is available from Postgres 9.4: https://www.postgresql.org/docs/10/static/sql-copy.html
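A sketch of the original COPY command rewritten to use FORCE_NULL on the numeric columns (adjust the column list as needed):
COPY bravo.g01(UNIQ_ID,SP_grd1,SACN_grd1,BIOME_grd1,Meso_grd1,DM_grd1,VEG_grd1,lcov90_alb,WMA_grd1)
FROM 'F:\GreenBook-Backup\LUdatacube_20171206\CSV_Data_bravo\G01.csv'
WITH (FORMAT csv, HEADER, DELIMITER ',',
      NULL '*********************',
      FORCE_NULL (SP_grd1, SACN_grd1, BIOME_grd1, Meso_grd1, DM_grd1, VEG_grd1, lcov90_alb, WMA_grd1));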
If you're on a unix-like platform, you could use sed to replace the null-strings with something postgresql will recognize automatically as null. On windows, powershell exposes similar functionality.
This approach is more general if you need to perform other types of clean up on the data before loading.
The regex pattern to match your null-string is "[\*]*"
cleaning the file with sed:
[unix]>sed 's/"[\*]*"//g' test.csv > test2.csv
cleaning the file with windows powershell:
[windows-powershell]>cat test.csv | %{$_ -replace '"[\*]*"', ""} > test2.csv
Loading into PostgreSQL can then be shorter:
psql>\copy bravo.g01 FROM 'test2.csv' WITH CSV HEADER;

Can I get a CSV header but no row count in PostgreSQL?

When I do psql --no-align --field-separator ',', I get CSV output with a header containing field names and a trailer telling me how many rows were found. To pass that into an analysis program, I need the header but not the trailer. I can surely write a filter to pass the first N-1 lines of the psql output but I'd prefer to suppress the trailer. Is there an option I'm missing that will turn on the header with --tuples-only or turn off the trailer?
psql --no-align --field-separator ',' --pset footer=off will turn off the row-count footer.
I found
COPY (...query...) TO STDOUT WITH CSV HEADER;
at http://blogs.law.harvard.edu/dlarochelle/2011/12/11/outputing-to-csv-in-postgresql/.
It doesn't seem to work without the TO STDOUT but I can work with that.
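For example, a minimal sketch from the shell (db and mytable are placeholders for your own database and table):
$ psql db -c "\copy (SELECT * FROM mytable) TO STDOUT WITH CSV HEADER" > out.csv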