I am cleaning up a CSV dataset. I only want to keep records in which all fields are complete and hold the right type of values. This is what I tried:
sed -r '{
/regex_pattern/!d
more commands follow...
}' $1
The program works fine and does what it is supposed to do. The problem is that it also removes the very first line (the header line), since that line does not match regex_pattern. I know there is a way to specify the range of lines a command applies to, for example:
sed '2,$ s/A/a/'
will do substitutions on data skipping the header line. Based on this logic I tried:
sed -r '{
2,$/regex_pattern/!d
more commands follow...
}' $1
so that the header line would be left untouched. However, this code does not run at all. So what would be the right command to do what I intend, and why?
As an example, imagine my csv file is fruits.csv and that my regex_pattern is [0-9]+,[0-9]+
apples,oranges
20,5
7,3
,4
a,b
12,22
When I call the .sh script that contains the sed commands, it should output:
apples,oranges
20,5
7,3
12,22
So, note that:
Header line was not deleted even though it does not match the regex_pattern.
Line number 4, i.e. ",4" was deleted as it does not match the regex_pattern.
Line number 5, i.e. "a,b" was deleted as it does not match the regex_pattern.
Any help is very much appreciated and I wish to thank you all in advance.
Kind regards.
You could write it like this, matching the whole line, starting at the second line:
sed -r '
2,${/^[0-9]+,[0-9]+$/!d}
' file
Output
apples,oranges
20,5
7,3
12,22
If you also want to allow single numbers or more than just 2 comma separated numbers:
sed -r '
2,${/^[0-9]+(,[0-9]+)*$/!d}
' file
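As for why the attempt in the question does not run: a sed command takes at most one address or one address range, so in 2,$/regex_pattern/!d sed reads 2,$ and then expects a command rather than a second address. Putting the second address inside { } is what makes it work. A minimal sketch using the sample data from the question (the temporary file and the variable names are only for illustration):

```shell
# Recreate the sample file from the question in a temporary location.
csv=$(mktemp)
printf '%s\n' 'apples,oranges' '20,5' '7,3' ',4' 'a,b' '12,22' > "$csv"

# The range 2,$ selects the block; the regex address lives inside it.
filtered=$(sed -E '2,${
  /^[0-9]+,[0-9]+$/!d
}' "$csv")

printf '%s\n' "$filtered"
rm -f "$csv"
```

Any further commands that should skip the header can go inside the same block.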
Using sed
$ sed '2,${/[0-9]\+,[0-9]\+/!d}' input_file
apples,oranges
20,5
7,3
12,22
Any one of these should work in gawk, mawk 1/2, or macOS nawk:
mawk 'NF-_^(NF==NR)' FS='^[0-9]+,[0-9]+$'
nawk '(NF!=NR)!=NF' FS='^[0-9]+,[0-9]+$'
gawk 'NF-(NF!~NR)' FS='^[0-9]+,[0-9]+$'
apples,oranges
20,5
7,3
12,22
More concise versions would be:
mawk -F'[0-9]+,[0-9]+' '(NF<NR)-NF' # using FS
gawk '/[0-9]+,[0-9]+/^+(NF<NR)' # not using FS
nawk '(NF<NR)<=/([0-9]+,?){2}/' # same approach, rev. order
mawk '(NF~NR)-/[0-9]+,[0-9]+/' # truly fringe but
# concise syntax
nawk '(NF~NR)!=/([0-9]+,?){2}/' # same approach, to
# circumvent nawk peculiarities
sed is a bad choice for working with CSVs since it doesn't have any inbuilt functionality for working with fields, nor literal strings, nor variables, doesn't use EREs by default (all of the answers you have so far will only work with GNU sed), etc. To do what you specifically want with any awk in any shell on every Unix box is simply:
$ awk 'NR==1 || /[0-9]+,[0-9]+/' file
apples,oranges
20,5
7,3
12,22
which says "if the current line number (stored in NR) is 1 or the regexp matches the current line contents then print the line". Anything else you want to do with your CSV will also be easier with awk than with sed.
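If "complete and of the right type" should be checked field by field rather than with one regex over the whole line, something along these lines might be a starting point. It is only a sketch: it assumes every column must be a non-negative integer and that no field contains a quoted comma, neither of which the question states outright.

```shell
# Recreate the sample file from the question in a temporary location.
csv=$(mktemp)
printf '%s\n' 'apples,oranges' '20,5' '7,3' ',4' 'a,b' '12,22' > "$csv"

clean=$(awk -F, '
  NR == 1 { ncols = NF; print; next }   # keep the header and remember its width
  NF != ncols { next }                  # wrong number of fields
  {
    for (i = 1; i <= NF; i++)
      if ($i !~ /^[0-9]+$/) next        # empty or non-numeric field
    print
  }
' "$csv")

printf '%s\n' "$clean"
rm -f "$csv"
```

Comparing NF against the header width also drops rows with missing or extra columns, which a single unanchored regex would let through.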
Meh, I would just preserve first line.
sed -r '
1{p;d}
/regex_pattern/!d
more commands follow...
' "$1"
or apply the commands to every line except the first:
1!{
/regex_pattern/!d
more commands follow...
}
This might work for you (GNU sed):
sed -E '1!{/^[0-9]+,[0-9]+$/!d}' file
If it is not the first line, delete any line that does not match one set of comma separated natural numbers.
Alternative:
sed -E '1b;/^[0-9]+,[0-9]+$/!d' file
Or:
sed -nE '1p;1b;/^[0-9]+,[0-9]+$/p' file
I have a CSV file with these columns:
"Weight","Impedance","Units","User","Timestamp","PhysiqueRating"
"58.75","5.33","kg","7","2020-7-11 19:29:29","5"
Of course, I can convert the date with the date command:
date -d '2020-7-11 19:29:29' +%s
Results:
1594488569
How can I replace this date in the CSV file from a bash script?
With GNU sed
sed -E '2,$ s/(("[^"]*",){4})("[^"]+")(.*)/echo \x27\1"\x27$(date -d \3 +%s)\x27"\4\x27/e'
2,$ to skip header from getting processed
(("[^"]*",){4}) first four columns
("[^"]+") fifth column
(.*) rest of the line
echo \x27\1"\x27 and \x27"\4\x27 preserve first four columns and rest of line after fifth column, along with adding double quotes to result of date conversion
$(date -d \3 +%s) calling shell command with fifth column value
Note that this command will fail if input can contain single quotes. That can be worked around by using s/\x27/\x27\\&\x27/g.
You can see the command that gets executed by using -n option and pe flags
sed -nE '2,$ s/(("[^"]*",){4})("[^"]+")(.*)/echo \x27\1"\x27$(date -d \3 +%s)\x27"\4\x27/pe'
will give
echo '"58.75","5.33","kg","7","'$(date -d "2020-7-11 19:29:29" +%s)'","5"'
For 58.25,5.89, kg, 7,2020 / 7/12 11:23:46, "5" format, try
sed -E '2,$ s/(([^,]*,){4})([^,]+)(.*)/echo \x27\1\x27$(date -d "\3" +%s)\x27\4\x27/e'
or (adapted from https://stackoverflow.com/a/62862416)
awk 'BEGIN{FS=OFS=","} NR>1{$5=mktime(gensub(/[:\/]/, " ", "g", $5))} 1'
Note: For the sed solution, if the input can come from outside source, you'll have to take care to avoid malicious intent as mentioned in the comments. One way is to match the fifth column using [0-9: -]+ or similar.
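If running file contents through the e flag is a concern, one alternative is a plain shell loop that validates the timestamp before handing it to date. This is only a sketch: convert_timestamps is a made-up helper name, it assumes GNU date (for -d), and it forces TZ=UTC so the result does not depend on the machine's timezone, which is why the number differs from the one in the question.

```shell
# Sample file from the question, written to a temporary location.
csv=$(mktemp)
printf '%s\n' \
  '"Weight","Impedance","Units","User","Timestamp","PhysiqueRating"' \
  '"58.75","5.33","kg","7","2020-7-11 19:29:29","5"' > "$csv"

# convert_timestamps is a hypothetical helper, not part of any tool.
convert_timestamps() {
  line_no=0
  while IFS= read -r line; do
    line_no=$((line_no + 1))
    if [ "$line_no" -eq 1 ]; then
      printf '%s\n' "$line"            # header passes through unchanged
      continue
    fi
    # Extract the fifth quoted field; nothing from the file is executed.
    stamp=$(printf '%s\n' "$line" | sed -E 's/^("[^"]*",){4}"([^"]*)".*/\2/')
    case $stamp in
      ''|*[!0-9:[:space:]-]*)
        printf '%s\n' "$line"          # unexpected characters: leave the row alone
        continue ;;
    esac
    # TZ=UTC is an assumption made for reproducibility.
    if epoch=$(TZ=UTC date -d "$stamp" +%s 2>/dev/null); then
      printf '%s\n' "$line" | sed -E 's/^(("[^"]*",){4})"[^"]*"/\1"'"$epoch"'"/'
    else
      printf '%s\n' "$line"            # date could not parse it: leave the row alone
    fi
  done
}

converted=$(convert_timestamps < "$csv")
printf '%s\n' "$converted"
rm -f "$csv"
```

With TZ=UTC the data row should come out as "58.75","5.33","kg","7","1594495769","5". It starts several processes per row, so it will be much slower than the awk versions on large files.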
Using GNU awk:
$ gawk '
BEGIN {
FS=OFS=","
}
{
n=split($5,a,/[-" :]/)
if(n==8)
$5="\"" mktime(sprintf("%s %s %s %s %s %s",a[2],a[3],a[4],a[5],a[6],a[7])) "\""
}1' file
Output:
"Weight","Impedance","Units","User","Timestamp","PhysiqueRating"
"58.75","5.33","kg","7","1594484969","5"
With GNU awk for gensub() and mktime():
$ awk 'BEGIN{FS=OFS="\""} NR>1{$10=mktime(gensub(/[-:]/," ","g",$10))} 1' file
"Weight","Impedance","Units","User","Timestamp","PhysiqueRating"
"58.75","5.33","kg","7","1594513769","5"
I want to find the first line of a file that is not commented out with a hash, then append a line of text just after that line.
I managed to get the number of the line:
sed -n '/^\s*#/!{=;q}' file # prints 2
and also to insert text (specifying the line manually):
sed '2 a extralinecontent' file
I can't get them working together as a one liner or in a batch.
I tried command substitution (with $(command) and also with backticks) but I get an error from bash:
sed '$(sed -n '/^\s*#/!{=;q}' file) a extralinecontent' file
-bash: !{=: event not found
and also tried many other combinations, but no luck.
I'm using gnu-sed (via brew) on macOS.
This might work for you (GNU sed):
sed -e '/^\s*#/b;a extra line content' -e ':a;n;ba' file
Bail out of any lines beginning with a comment at the beginning of the file, append an extra line following the first line that is not a comment and keep fetching/printing all the remaining lines of the file.
Here's a way to do it with GNU sed without reading the file twice
$ cat ip.txt
#comment
foo baz good
123 456 7889
$ sed -e '0,/^\s*[^#[:space:]]/ {// a XYZ' -e '}' ip.txt
#comment
foo baz good
XYZ
123 456 7889
GNU sed allows first address to be 0 if the other address is regex, that way this will work even if first line matches the condition
/^\s*[^#[:space:]]/ as sed doesn't support possessive quantifier, need to ensure that the first character being matched by the character class isn't either a # or a whitespace character
// is a handy shortcut to repeat the last regex
a XYZ your required line to be appended (note that your question mentions insert, so if you want that, use i instead of a)
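For completeness, the two-step approach from the question can also be made to work. Inside single quotes the shell does not run $(...), and the inner quotes end the outer string early, which leaves !{= unquoted for interactive bash to treat as a history expansion. A sketch that stores the line number in a variable first (n and the temporary file are illustrative names, and the one-line a form is GNU-specific):

```shell
# Sample input from the answer above, written to a temporary file.
ip=$(mktemp)
printf '%s\n' '#comment' 'foo baz good' '123 456 7889' > "$ip"

# Step 1: line number of the first line that is not a comment.
n=$(sed -n '/^[[:space:]]*#/!{=;q;}' "$ip")

# Step 2: double quotes let the shell expand $n before sed sees the script.
result=$(sed "${n}a XYZ" "$ip")

printf '%s\n' "$result"
rm -f "$ip"
```

This reads the file twice, and if every line is a comment n ends up empty, in which case the second command would append after every line, so a real script would probably want to check for that.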
I need to process every line of a curve CSV to remove the last column of only those lines which start with 10 commas. The sed command I used was:
$ cat curves.csv
(...)
,,,,,,,,,,2017/10/18,20630.000000
,,,,,,,,,,2017/11/15,20595.000000
,usdSN,:usdSN,,,,8005,$,,2017/08/07,Settlement Date
,,,,,,,,,,2017/12/20,20575.000000
,,,,,,,,,,2018/01/17,20555.000000
,,,,,,,,,,2018/02/21,20535.000000
(...)
,,,,,,,,,,2018/12/21,20290.000000
,usdZS,:usdZS,,,,8007,$,,2017/08/07,Settlement Date
,,,,,,,,,,2017/08/16,2848.500000
(...)
$ sed s/\(,,,,,,,,,,[0-9/]*\),[0-9.]*/\1/g curves.csv
However, it didn't work. It printed out all lines unchanged.
Please help.
Another approach with GNU sed:
sed -r '/^,{10}/{s/,[^,]*$//}' file
Output:
(...)
,,,,,,,,,,2017/10/18
,,,,,,,,,,2017/11/15
,usdSN,:usdSN,,,,8005,$,,2017/08/07,Settlement Date
,,,,,,,,,,2017/12/20
,,,,,,,,,,2018/01/17
,,,,,,,,,,2018/02/21
(...)
,,,,,,,,,,2018/12/21
,usdZS,:usdZS,,,,8007,$,,2017/08/07,Settlement Date
,,,,,,,,,,2017/08/16
(...)
The problem is the way you are running sed. You did:
sed s/\(,,,,,,,,,,[0-9/]*\),[0-9.]*/\1/g curves.csv
But because the arguments aren't quoted, the shell strips the backslashes before sed sees them, and what is actually run is:
sed s/(,,,,,,,,,,[0-9/]*),[0-9.]*/1/g curves.csv
This doesn't match anything because there are no literal parentheses in your file. How you should run it is:
sed 's/\(,,,,,,,,,,[0-9/]*\),[0-9.]*/\1/g' curves.csv
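One way to see what the shell actually hands to sed is to print the argument with printf. The sketch below uses a shortened, made-up expression (s/\(x\),y/\1/) rather than the one from the question, purely to keep the output easy to read:

```shell
# The same expression, once unquoted and once inside single quotes.
unquoted=$(printf '%s' s/\(x\),y/\1/)
quoted=$(printf '%s' 's/\(x\),y/\1/')
printf 'unquoted: %s\nquoted:   %s\n' "$unquoted" "$quoted"

# Effect on a toy input line: only the quoted form captures and keeps "x".
without_quotes=$(printf 'x,y\n' | sed s/\(x\),y/\1/)
with_quotes=$(printf 'x,y\n' | sed 's/\(x\),y/\1/')
printf 'without quotes: %s\nwith quotes:    %s\n' "$without_quotes" "$with_quotes"
```

The unquoted form reaches sed as s/(x),y/1/, where the parentheses are literal characters in a basic regular expression, so the toy line is printed unchanged.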
Can anyone explain how to use sed to delete all characters up to & including the 2nd comma on a line in a CSV file?
The beginning of a typical line might look like
1234567890,ABC/DEF, and the number of digits in the first column varies i.e. there might be 9 or 10 or 11 separate digits in random order, and the letters in the second column could also be random. This randomness and varying length makes it impossible to use any explicit pattern searching.
You could do it with sed like this
sed -e 's/^\([^,]*,\)\{2\}//'
I'm not 100% sure about the syntax, but I tried it and it seems to work. It deletes zero or more of anything-but-a-comma followed by a comma, matched twice in succession from the start of the line.
But even easier would be to use cut, like this
cut -d, -f3-
which will use comma as a delimiter, and print fields 3 and up.
EDIT:
Just for the record, both sed and cut can work with a file as a parameter, just append it at the end like so
cut -d, -f3- myfile.txt
or you can pipe the output of your program through them
./myprogram | cut -d, -f3-
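A quick way to check that both commands agree is to run them on one sample line. The line below is made up for illustration; only its first two columns follow the shape described in the question.

```shell
# Hypothetical sample line: two leading columns, then the part to keep.
line='1234567890,ABC/DEF,2009.01.02,10:30,1.3971'

with_sed=$(printf '%s\n' "$line" | sed -e 's/^\([^,]*,\)\{2\}//')
with_cut=$(printf '%s\n' "$line" | cut -d, -f3-)

printf 'sed: %s\ncut: %s\n' "$with_sed" "$with_cut"
```

Both should print 2009.01.02,10:30,1.3971 for this line.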
sed is not the "right" choice of tool (although it can be done). Since you have structured data, you can use the fields/delimiter approach instead of writing a complicated regex.
You can use cut
$ cut -f3- -d"," file
or gawk
$ gawk -F"," -v OFS="," '{$1=$2=""; sub(/^,,/,"")}1' file
$ gawk -F"," '{for(i=3;i<NF;i++) printf "%s,",$i; print $NF}' file
Thanks for all replies - with the help provided I have written the simple executable script below which does what I want.
#!/bin/bash
cut -d, -f3- ~/Documents/forex_convert/input.csv |
sed -e '1d' \
-e 's/-/,/g' \
-e 's/ /,/g' \
-e 's/:/,/g' \
-e 's/,D//g' > ~/Documents/forex_convert/converted_input
exit