How to remove a dynamic string from a CSV file using sed? - sed

I added a dummy column at the beginning of my data export to a CSV file to get rid of control characters and some specific string values as mentioned below by using a pipe '|' delimiter. This data is coming from Teradata fast export using utf-8
'''
y^CDUMMYCOLUMN|
<86>^ADUMMYCOLUMN|
<87>^ADUMMYCOLUMN|
<94>^ADUMMYCOLUMN|
{^ADUMMYCOLUMN|
_^ADUMMYCOLUMN|
y^CDUMMYCOLUMN|
[^ADUMMYCOLUMN|
k^ADUMMYCOLUMN|
m^ADUMMYCOLUMN|
<82>^ADUMMYCOLUMN|
c^ADUMMYCOLUMN|
<8e>^ADUMMYCOLUMN|
<85>^ADUMMYCOLUMN|
'''
This is completely random and not every row has these special characters. I'm sure I'm missing something here. I'm using sed to get rid of dummycolumn and control characters.
'''$ sed -e 's/.*DUMMYCOLUMN|//;/^$/d' data.csv > data_output.csv'''
After running this statement, I'm still remaining these below random values.
'''
<86>
<87>
<85>
<94>
<8a>
<85>
<8e>
'''
I could have written a sed statement to remove first three letters from each row but this series is not appearing in every row. At the same time, row count is 400 Million.
Current output.
y^CDUMMYCOLUMN|COLUMN1|COLUMN2|COLUMN3
<86>^ADUMMYCOLUMN|6218915846|36596|12
<87>^ADUMMYCOLUMN|9822354765|35325|33
t^ADUMMYCOLUMN|6788793999|111|12
g^ADUMMYCOLUMN|6090724004|7017|12
_^ADUMMYCOLUMN|IC-21357688806502|111|12
<8e>^ADUMMYCOLUMN|9682027117|35335|33
v^ADUMMYCOLUMN|6406807681|121|12
h^ADUMMYCOLUMN|6346768510|121|12
V^ADUMMYCOLUMN|6130452510|7017|12
Desired Output
COLUMN1|COLUMN2|COLUMN3
6218915846|36596|12
9822354765|35325|33
6788793999|111|12
6090724004|7017|12
IC-21357688806502|111|12
9682027117|35335|33
6406807681|121|12
6346768510|121|12
6130452510|7017|12
Please help.
Thank you.

Related

how to return empty field in Spark

I am trying to check incomplete record and identify the bad record in Spark.
eg. sample test.txt file, it is in record format, columns separated by \t
L1C1 L1C2 L1C3 L1C4
L2C1 L2C2 L2C3
L3C1 L3C2 L3C3 L3C4
scala> sc.textFile("test.txt").filter(_.split("\t").length < 4).collect.foreach(println)
L2C1 L2C2 L2C3
The second line is printing as having less number of columns.
How should i parse without ignoring the empty column after in second line
It is the split string in scala removes trailing empty substrings.
The behavior is similar to Java, to let all the substrings checked we can call as
"L2C1 L2C2 L2C3 ".split("\t",-1)

Is there any way to encode Multiple columns in a csv using base64 in Shell?

I have a requirement to replace multiple columns of a csv file with its base64 encoding value which should be applied to some columns of the file but keep the first line unaffected as the first line contains the header of the file. I have tried out for 1 column as below but as I have given it to proceed after skipping the first line of the file it is not
gawk 'BEGIN { FS="|"; OFS="|" } NR >=2 { cmd="echo "$4" | base64 -w 0";cmd | getline x;close(cmd); print $1,$2,$3,x}' awktest
o/p:
12|A|B|Qw==
13|C|D|RQ==
36|Z|V|VQ==
Qs: It is not showing the header in the output. What should I do to make produce the header in the output? Also can I use any loop here to replace multiple columns?
input:
10|A|B|C|5|T|R
12|A|B|C|6|eee|ff
13|C|D|E|9|dr|xrdd
36|Z|V|U|7|xc|xd
Required output:
10|A|B|C|5|T|R
12|A|B|encodedvalue|6|encodedvalue|ff
13|C|D|encodedvalue|9|encodedvalue|xrdd
36|Z|V|encodedvalue|7|encodedvalue|xd
Is this possible? Have researched a lot but could not find a proper explanation. I am new to shell. Kindly help. Many thanks!!!!
It looks like you can just sequence conditionals. This may not be the best way of solving the header issue, but it's intuitive.
BEGIN { FS="|"; OFS="|" } NR ==1 {print} NR >=2 { cmd="echo "$4" | base64 -w 0";cmd | getline x;close(cmd); print $1,$2,$3,x}
As for using a loop to affect multiple columns... Loops in bash are hard. Awk is technically its own language, and may have a looping construct of it's own, IDK. But it's not clear you need a loop. If there's only a reasonable number of fields that need modifying, you can just parameterize the existing command (somehow) by the field index, and then pipe through however many instances of it. It won't be as performant as doing it all in a single pass of awk, but that's probably ok.

how import csv file into Postgres with empty values?

I am trying to import one csv file into Postgres which does contain age values, however there are also some empty values, since not all ages are known.
I would like to import the columns as real, since the columns contain ages with decimals like 98.45. The empty values for people when age is not known is apparently considered as strings, however I still would like to import the ages values as numbers. So I was wondering how to import real values, even when some cells in the csv are empty and thus are considered according to Postgres as string values.
for creation I used the following code, since I am dealing with decimal values.
Create table psychosocial.age (
respnr integer Primary key,
fage real,
gage real,
hage real);
after importing csv file, I get the following error
ERROR: invalid input syntax for integer: "11455, , , "
CONTEXT: COPY age, line 2, column respnr: "11455, , , "
One problem is that you're trying to import white spaces into numeric fields. So, first you have to pre-process your csv file before importing it.
Below is an example of how you can solve it using awk. From your console execute the following command:
$ cat file.csv | awk '{sub(/^ +/,""); gsub(/, /,",")}1' | psql db -c "COPY psychosocial.age FROM STDIN WITH CSV HEADER"
In case you're wondering how to pipe commands, take a look at these answers. Here a more detailed example on how to use COPY and the STDIN.
You also have to take into account that having quotation marks on integer fields can be problematic, e.g:
"11455, , , "
This will result in an error, since postgres will parse "11455 as a single value and will try to store it in an interger field, which will obviously fail. Instead, format your csv file to be like this:
11455, , ,
or even
11455,,,
You can achieve this also using awk from your console:
$ awk '{gsub(/\"/,"")};1' file.csv

USQL Escape Quotes

I am new to Azure data lake analytics, I am trying to load a csv which is double quoted for sting and there are quotes inside a column on some random rows.
For example
ID, BookName
1, "Life of Pi"
2, "Story about "Mr X""
When I try loading, it fails on second record and throwing an error message.
1, I wonder if there is a way to fix this in csv file, unfortunatly we cannot extract new from source as these are log files?
2, is it possible to let ADLA to ignore the bad rows and proceed with rest of the records?
Execution failed with error '1_SV1_Extract Error :
'{"diagnosticCode":195887146,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXTRACT_ROW_ERROR","message":"Error
occurred while extracting row after processing 9045 record(s) in the
vertex' input split. Column index: 9, column name:
'instancename'.","description":"","resolution":"","helpLink":"","details":"","internalDiagnostics":"","innerError":{"diagnosticCode":195887144,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXTRACT_EXTRACT_INVALID_CHARACTER_AFTER_QUOTED_FIELD","message":"Invalid
character following the ending quote character in a quoted
field.","description":"Invalid character is detected following the
ending quote character in a quoted field. A column delimiter, row
delimiter or EOF is expected.\nThis error can occur if double-quotes
within the field are not correctly escaped as two
double-quotes.","resolution":"Column should be fully surrounded with
double-quotes and double-quotes within the field escaped as two
double-quotes."
As per the error message, if you are importing a quoted csv, which has quotes within some of the columns, then these need to be escaped as two double-quotes. In your particular example, you second row needs to be:
..."Life after death and ""good death"" models - a qualitative study",...
So one option is to fix up the original file on output. If you are not able to do this, then you can import all the columns as one column, use RegEx to fix up the quotes and output the file again, eg
// Import records as one row then use RegEx to clean columns
#input =
EXTRACT oneCol string
FROM "/input/input132.csv"
USING Extractors.Text( '|', quoting: false );
// Fix up the quotes using RegEx
#output =
SELECT Regex.Replace(oneCol, "([^,])\"([^,])", "$1\"\"$2") AS cleanCol
FROM #input;
OUTPUT #output
TO "/output/output.csv"
USING Outputters.Csv(quoting : false);
The file will now import successfully. My results:

Vertica COPY + FLEX table

I want to load on a flex table a log in which each record is composed by some fields + a JSON, the format is the following:
"concorde-fe";"DETAILS.SHOWN";"1bcilyejs6d4w";"2017-01-31T00:00:04.801Z";"2017-01-31T00:00:04.714Z";"{"requestedFrom":"BUTTON","tripId":{"request":3003926837969,"mac":"v01162450701"}}"
and (after many tries) I'm using the COPY command with a CSV parser in this way:
COPY schema.flex_table from local 'C:\temp/test.log' parser fcsvparser(delimiter=';',header=false, trim=true, type='traditional')
in this way all is loaded correctly except the JSON, that is skipped and left empty.
Is there a way to load also the JSON as a string?
HINT: just for test puposes, I noticed that if in the JSON I put a '\' before every '"' in the log, the loading runs smoothly, but unfortunately I cannot modify the content of the log.
Not without modifying the file beforehand - or writing your own UDParser function.
It clearly is a strange format: CSV (well, semicolon delimited and with double quotes as string enclosers), until the children appear - which are stored with a leading double quote and a trailing double quote; doubly nested with curly braces - JSON type, ok. But you have double quotes (not doubled) within the JSON encoding - any parser would go astray on those.
You'll have to write a program (ideally in C) to remove the curly braces, to remove the column names in the JSON code and leave just a CSV line
So, from the line (the backslash at the end means an escaped newline, meaning that the three lines you see are actually one line, for readability)
"concorde-fe";"DETAILS.SHOWN";"1bcilyejs6d4w";"2017-01-31T00:00:04.801Z"; \
"2017-01-31T00:00:04.714Z"; \
"{"requestedFrom":"BUTTON","tripId":{"request":3003926837969,"mac":"v01162450701"}}"
you make (title line with column names, then data line)
col1;col2;col3;timestampz1;\
timestampz2;requestedfrom;tripid_request;tripid_mac
"concorde-fe";"DETAILS.SHOWN";"1bcilyejs6d4w";"2017-01-31T00:00:04.801Z"; \
"2017-01-31T00:00:04.714Z";"BUTTON";3003926837969;"v01162450701"
Finally, you'll be able to load it as a CSV file - and maybe you will have to then normalise everything again: tripIdseems to be a dependent structure ....
Good luck
Marco the Sane