sed script to delete all characters up to & including the 2nd comma on a line - sed

Can anyone explain how to use sed to delete all characters up to & including the 2nd comma on a line in a CSV file?
The beginning of a typical line might look like
1234567890,ABC/DEF, and the number of digits in the first column varies, i.e. there might be 9, 10 or 11 digits in random order, and the letters in the second column could also be random. This randomness and varying length make it impossible to search for any explicit pattern.

You could do it with sed like this
sed -e 's/^\([^,]*,\)\{2\}//'
I'm not 100% sure of the syntax, but I tried it and it seems to work. It deletes a run of zero or more non-comma characters followed by a comma, with that group matched twice in succession.
But even easier would be to use cut, like this
cut -d, -f3-
which will use comma as a delimiter, and print fields 3 and up.
EDIT:
Just for the record, both sed and cut can work with a file as a parameter, just append it at the end like so
cut -d, -f3- myfile.txt
or you can pipe the output of your program through them
./myprogram | cut -d, -f3-
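As a quick sanity check (not part of the original answer; the sample line is made up to match the question's format), both commands strip everything up to and including the second comma and leave the rest of the line alone:
$ echo '1234567890,ABC/DEF,2016-01-29,rest of line' | sed -e 's/^\([^,]*,\)\{2\}//'
2016-01-29,rest of line
$ echo '1234567890,ABC/DEF,2016-01-29,rest of line' | cut -d, -f3-
2016-01-29,rest of line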

sed is not the "right" tool for this (although it can be done). Since you have structured data, you can use a fields/delimiter approach instead of creating a complicated regex.
You can use cut:
$ cut -f3- -d"," file
or gawk
$ gawk -F"," '{$1=$2=""}1' file
$ gawk -F"," '{for(i=3;i<NF;i++) printf "%s,",$i; print $NF}' file
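One thing worth noting (my note, not part of the original answer): the first gawk variant rebuilds the record with the default output field separator, a space, so the commas between the remaining fields disappear and two leading separators are left behind; the loop variant keeps the commas:
$ echo '1234567890,ABC/DEF,x,y' | gawk -F"," '{$1=$2=""}1'
  x y
$ echo '1234567890,ABC/DEF,x,y' | gawk -F"," '{for(i=3;i<NF;i++) printf "%s,",$i; print $NF}'
x,y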

Thanks for all replies - with the help provided I have written the simple executable script below which does what I want.
#!/bin/bash
cut -d, -f3- ~/Documents/forex_convert/input.csv |
sed -e '1d' \
-e 's/-/,/g' \
-e 's/ /,/g' \
-e 's/:/,/g' \
-e 's/,D//g' > ~/Documents/forex_convert/converted_input
exit

Use sed to remove lines that do not match a pattern but keep header line

I am cleaning up a dataset (a csv dataset). I only want to consider records in which all fields are complete and have the right type of values. This is what I tried:
sed -r '{
/regex_pattern/!d
more commands follow...
}' $1
The program works just fine and does what it is supposed to do. The problem is that it also removes the very first line (the header line) since it does not match the specific regex_pattern. I know there is a way to specify the range in which the command should apply, so for example:
sed '2,$ s/A/a/'
will do substitutions on data skipping the header line. Based on this logic I tried:
sed -r '{
2,$/regex_pattern/!d
more commands follow...
}' $1
so that the header line will be untouched; however, this code does not run at all. So what (and why) would be the right command to do what I am intending?
As an example, imagine my csv file is fruits.csv and that my regex_pattern is [0-9]+,[0-9]+
apples,oranges
20,5
7,3
,4
a,b
12,22
When I call the .sh script that contains the sed commands it should output:
apples,oranges
20,5
7,3
12,22
So, note that:
Header line was not deleted even though it does not match the regex_pattern.
Line number 4, i.e. ",4" was deleted as it does not match the regex_pattern.
Line number 5, i.e. "a,b" was deleted as it does not match the regex_pattern.
Any help is very much appreciated and I wish to thank you all in advance.
Kind regards.
You could write it like this, matching the whole line, starting at the second line:
sed -r '
2,${/^[0-9]+,[0-9]+$/!d}
' file
Output
apples,oranges
20,5
7,3
12,22
If you also want to allow single numbers or more than just 2 comma separated numbers:
sed -r '
2,${/^[0-9]+(,[0-9]+)*$/!d}
' file
Using sed
$ sed '2,${/[0-9]\+,[0-9]\+/!d}' input_file
apples,oranges
20,5
7,3
12,22
Any one of these should work in gawk, mawk 1/2, or macOS nawk:
mawk 'NF-_^(NF==NR)' FS='^[0-9]+,[0-9]+$'
nawk '(NF!=NR)!=NF' FS='^[0-9]+,[0-9]+$'
gawk 'NF-(NF!~NR)' FS='^[0-9]+,[0-9]+$'
apples,oranges
20,5
7,3
12,22
More concise versions would be:
mawk -F'[0-9]+,[0-9]+' '(NF<NR)-NF' # using FS
gawk '/[0-9]+,[0-9]+/^+(NF<NR)' # not using FS
nawk '(NF<NR)<=/([0-9]+,?){2}/' # same approach, rev. order
mawk '(NF~NR)-/[0-9]+,[0-9]+/'    # truly fringe but concise syntax
nawk '(NF~NR)!=/([0-9]+,?){2}/'   # same approach, to circumvent nawk peculiarities
sed is a bad choice for working with CSVs since it has no inbuilt functionality for working with fields, literal strings, or variables, and it doesn't use EREs by default (all of the answers you have so far will only work with GNU sed), etc. To do what you specifically want with any awk in any shell on every Unix box is simply:
$ awk 'NR==1 || /[0-9]+,[0-9]+/' file
apples,oranges
20,5
7,3
12,22
which says "if the current line number (stored in NR) is 1 or the regexp matches the current line contents then print the line". Anything else you want to do with your CSV will also be easier with awk than with sed.
Meh, I would just preserve the first line.
sed -r '
1{p;d}
/regex_pattern/!d
more commands follow...
' "$1"
or run the commands on every line except the first:
1!{
/regex_pattern/!d
more commands follow...
}
This might work for you (GNU sed):
sed -E '1!{/^[0-9]+,[0-9]+$/!d}' file
If it is not the first line, delete any line that does not match one set of comma separated natural numbers.
Alternative:
sed -E '1b;/^[0-9]+,[0-9]+$/!d' file
Or:
sed -nE '1p;1b;/^[0-9]+,[0-9]+$/p' file
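All three variants behave the same on the sample data (my check, assuming the rows are saved as fruits.csv as in the question and GNU sed is used):
$ sed -E '1!{/^[0-9]+,[0-9]+$/!d}' fruits.csv
apples,oranges
20,5
7,3
12,22
The 1b and 1p;1b forms print the identical result.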

Parsing a line with sed using regular expression

Using sed I want to parse Heroku's log-runtime-metrics like this one:
2016-01-29T00:38:43.662697+00:00 heroku[worker.2]: source=worker.2 dyno=heroku.17664470.d3f28df1-e15f-3452-1234-5fd0e244d46f sample#memory_total=54.01MB sample#memory_rss=54.01MB sample#memory_cache=0.00MB sample#memory_swap=0.00MB sample#memory_pgpgin=17492pages sample#memory_pgpgout=3666pages
the desired output is:
worker.2: 54.01MB (54.01MB being the memory_total value)
I could not manage it, although I tried several alternatives, including:
sed -E 's/.+source=(.+) .+memory_total=(.+) .+/\1: \2/g'
What is wrong with my command? How can it be corrected?
The .+ after source= and memory_total= are both greedy, so they accept as much of the line as possible. Use [^ ] to mean "anything except a space" so that it knows where to stop.
sed -E 's/.+source=([^ ]+) .+memory_total=([^ ]+) .+/\1: \2/g'
Putting your content into https://regex101.com/ makes it really obvious what's going on.
I'd go for the old-fashioned, reliable, non-extended sed expressions and make sure that the patterns are not too greedy:
sed -e 's/.*source=\([^ ]*\) .*memory_total=\([^ ]*\) .*/\1: \2/'
The -e is not the opposite of -E, which is primarily a Mac OS X (BSD) sed option; the normal option for GNU sed is -r instead. The -e simply means that the next argument is an expression in the script.
This produces your desired output from the given line of data:
worker.2: 54.01MB
Bonus question: There are some odd lines within the stream, I can usually filter them out using a grep pipe like | grep memory_total. However if I try to use it along with the sed command, it does not work. No output is produced with this:
heroku logs -t -s heroku | grep memory_total | sed.......
Sometimes grep | sed is necessary, but it is often redundant (unless you are using a grep feature that isn't readily supported by sed, such as Perl regular expressions).
You should be able to use:
sed -n -e '/memory_total=/ s/.*source=\([^ ]*\) .*memory_total=\([^ ]*\) .*/\1: \2/p'
The -n means "don't print by default". The /memory_total=/ matches the lines you're after; the s/// content is the same as before. I removed the g suffix that was there previously; the regex would never match multiple times anyway. I added the p to print the line when the substitution occurs.
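Putting it together, and assuming the sample log line above has been saved in a file called heroku.log (the file name is just for illustration):
$ sed -n -e '/memory_total=/ s/.*source=\([^ ]*\) .*memory_total=\([^ ]*\) .*/\1: \2/p' heroku.log
worker.2: 54.01MB
Lines that don't contain memory_total= are silently dropped, which is what makes the separate grep unnecessary.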

Using variables in sed -f (where sed script is in a file rather than inline)

We have a process which can use a file containing sed commands to alter piped input.
I need to replace a placeholder in the input with a variable value, e.g. in a single -e type of command I can run;
$ echo "Today is XX" | sed -e "s/XX/$(date +%F)/"
Today is 2012-10-11
However I can only specify the sed aspects in a file (and then point the process at the file), E.g. a file called replacements.sed might contain;
s/XX/Thursday/
So obviously;
$ echo "Today is XX" | sed -f replacements.sed
Today is Thursday
If I want to use an environment variable or shell value, though, I can't find a way to make it expand, e.g. if replacements.sed contains;
s/XX/$(date +%F)/
Then;
$ echo "Today is XX" | sed -f replacements.sed
Today is $(date +%F)
Including double quotes in the text of the file just prints the double quotes.
Does anyone know a way to be able to use variables in a sed file?
This might work for you (GNU sed):
cat <<\! > replacements.sed
/XX/{s//'"$(date +%F)"'/;s/.*/echo '&'/e}
!
echo "Today is XX" | sed -f replacements.sed
If you don't have GNU sed, try:
cat <<\! > replacements.sed
/XX/{
s//'"$(date +%F)"'/
s/.*/echo '&'/
}
!
echo "Today is XX" | sed -f replacements.sed | sh
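Either way, the output should be today's date substituted for XX (my note; the first version needs GNU sed for the e flag), e.g.:
Today is 2012-10-11
where the actual date depends on when you run it.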
AFAIK, it's not possible. Your best bet will be :
INPUT FILE
aaa
bbb
ccc
SH SCRIPT
#!/bin/bash
STRING="${1//\//\\/}" # using parameter expansion to prevent / collisions
shift
sed "
s/aaa/$STRING/
" "$@"
COMMAND LINE
./sed.sh "fo/obar" <file path>
OUTPUT
fo/obar
bbb
ccc
As others have said, you can't use variables in a sed script, but you might be able to "fake" it using extra leading input that gets added to your hold buffer. For example:
[ghoti@pc ~/tmp]$ cat scr.sed
1{;h;d;};/^--$/g
[ghoti@pc ~/tmp]$ sed -f scr.sed <(date '+%Y-%m-%d'; printf 'foo\n--\nbar\n')
foo
2012-10-10
bar
[ghoti@pc ~/tmp]$
In this example, I'm using process substitution to get input into sed. The "important" data is generated by printf. You could cat a file instead, or run some other program. The "variable" is produced by the date command, and becomes the first line of input to the script.
The sed script takes the first line, puts it in sed's hold buffer, then deletes the line. Then for any subsequent line, if it matches a double dash (our "macro replacement"), it substitutes the contents of the hold buffer. And prints, because that's sed's default action.
Hold buffers (g, G, h, H and x commands) represent "advanced" sed programming. But once you understand how they work, they open up new dimensions of sed fu.
Note: This solution only helps you replace entire lines. Replacing substrings within lines may be possible using the hold buffer, but I can't imagine a way to do it.
(Another note: I'm doing this in FreeBSD, which uses a different sed from what you'll find in Linux. This may work in GNU sed, or it may not; I haven't tested.)
I am in agreement with sputnick. I don't believe that sed would be able to complete that task.
However, you could generate that file on the fly.
You could change the date to a fixed string, like
__DAYOFWEEK__.
Create a temp file, use sed to replace __DAYOFWEEK__ with $(date +%A).
Then parse your file with sed -f $TEMPFILE.
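A minimal sketch of that idea (the file names and template content are made up for illustration):
# replacements.sed.tmpl contains the single line:  s/XX/__DAYOFWEEK__/
TEMPFILE=$(mktemp)
sed "s/__DAYOFWEEK__/$(date +%A)/" replacements.sed.tmpl > "$TEMPFILE"   # generate the real sed script
echo "Today is XX" | sed -f "$TEMPFILE"                                  # e.g. "Today is Thursday"
rm -f "$TEMPFILE"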
sed is great, but it might be time to use something like perl that can generate the date on the fly.
To add a newline in the replacement expression using a sed script file, what finally worked for me was escaping a literal newline. For example, to append a newline after the string NewLineHere, this worked for me:
#! /usr/bin/sed -f
s/NewLineHere/NewLineHere\
/g
Not sure it matters but I am on Solaris unix, so not GNU sed for sure.

Using `sed` without piping multiple times

Example:
echo one two three | sed 's/ /\n/g' | sed 's/^/:/g'
output:
:one
:two
:three
Without piping:
echo one two three | sed 's/ /\n/g;s/^/:/g'
output:
:one
two
three
It seems like the result of the first substitution isn't split into new lines before the second one executes, but I really don't know much about sed.
How can I use first example without piping twice?
PS Pattern used in examples is informative
The other way to do it is with repeated -e options:
echo one two three | sed -e 's/ /\n:/g' -e 's/^/:/g'
This is easier to understand when you have many operations to do; you can align the separate operations on separate lines:
echo one two three |
sed -e 's/ /\n:/g' \
-e 's/^/:/g'
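Either form produces the desired output (my note; the \n in the replacement is a GNU sed extension):
:one
:two
:three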
For example, I have a script to generate outline documents from templates. One part of the script contains:
sed -e "s/[:]YEAR:/$(date +%Y)/g" \
-e "s/[:]TODAY:/$today/" \
-e "s/[:]BASE:/$BASE/g" \
-e "s/[:]base:/$base/g" \
-e "s/[:]FILE:/$FILE/g" \
-e "s/[:]file:/$file/g" \
$skeleton |
...
Although it could be done on one line, it would not promote readability.
The main problem here is that sed decides what constitutes a line (the pattern space it works on) before executing any commands. That is, if you have only one pattern space (one two three), it won't get reinterpreted as multiple lines after s/ /\n/g is executed. It would still be a single pattern space, just one that contains newlines inside it.
The simplest workaround to make sed reinterpret patterns along the newly inserted newlines is just running sed twice, as you did.
Another workaround would be adding the m modifier (multi-line mode, a GNU sed extension) to the s command, so that ^ also matches after the inserted newlines:
$ echo one two three | sed 's/ /\n/g;s/^/:/mg'
:one
:two
:three
You could put all that into one regular expression like this:
echo one two three | sed 's/\([^ ]\+\)\( \+\|$\)/:\1\n/g'
The first part \([^ ]\+\) selects your words (i.e. a string of characters that are not spaces). The second part \( \+\|$\) matches either one or more spaces or the end of the line (which is required for three, which has no space after it).
Then we just build the line using a back-reference to the word matched in part 1.
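For reference (my addition, assuming GNU sed, since \+ and a \n in the replacement are GNU extensions), this produces:
$ echo one two three | sed 's/\([^ ]\+\)\( \+\|$\)/:\1\n/g'
:one
:two
:three
plus a trailing blank line, because a newline is also appended after the final word.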
This might work for you:
echo one two three | sed 'y/ /\n/;s/^/:/mg'
:one
:two
:three

sed + remove "#" and empty lines with one sed command

How do I remove comment lines (such as # bla bla) and empty lines (lines without characters) from a file with one sed command?
THX
lidia
If you're worried about starting two sed processes in a pipeline for performance reasons, you probably shouldn't be, it's still very efficient. But based on your comment that you want to do in-place editing, you can still do that with distinct commands (sed commands rather than invocations of sed itself).
You can either use multiple -e arguments or separate commands with a semicolon, something like (just one of these, not both):
sed -i -e 's/#.*$//' -e '/^$/d' fileName
sed -i 's/#.*$//;/^$/d' fileName
The following transcript shows this in action:
pax> printf 'Line # with a comment\n\n# Line with only a comment\n' >file
pax> cat file
Line # with a comment

# Line with only a comment
pax> cp file filex ; sed -i 's/#.*$//;/^$/d' filex ; cat filex
Line
pax> cp file filex ; sed -i -e 's/#.*$//' -e '/^$/d' filex ; cat filex
Line
Note how the file is modified in place even with two -e options. You can see that both commands are executed on each line: the line that is only a comment first has its comment removed and is then deleted because it has become empty.
In addition, the original empty line is also removed.
@paxdiablo has a good answer but it can be improved.
(1) The '/^$/d' clause only matches 100% blank lines.
If you want to also match lines that are entirely whitespace (spaces, tabs etc.) use this instead:
'/^\s*$/d'
(2) The 's/#.*$//' clause removes the comment text but leaves the rest of the line in place, so a line that is nothing but a comment becomes an empty line rather than disappearing.
If you want to delete whole lines whose first non-blank character is # (i.e. comment-only lines, possibly indented) use this instead:
'/^\s*#.*$/d'
The above criteria may not be universal (e.g. within a HEREDOC block, or in a Python multi-line string the different approaches could be significant), but in many cases the conventional definition of "blank" lines include whitespace-only, and "comment" lines include whitespace-then-#.
(3) Lastly, on OSX at least, the @paxdiablo solution, in which the first clause turns comment lines into blank lines and the second clause strips blank lines (including what were originally comments), doesn't work. It seems to be more portable to make both clauses /d delete actions as I've done.
The revised command incorporating the above is:
sed -e '/^\s*#.*$/d' -e '/^\s*$/d' inputFile
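A quick check with GNU sed (\s is a GNU extension; the sample input is made up). Note that, unlike the substitution approach, this variant deletes whole comment lines but leaves a trailing comment on a data line alone:
$ printf 'Line # with a comment\n   \n\t# indented comment\n' | sed -e '/^\s*#.*$/d' -e '/^\s*$/d'
Line # with a comment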
This tiny jewel removes all # comments, no matter where they begin in a line (see caution below):
sed -e 's/\s*#.*$//'
Example:
text="
this is a # test
#this is a test
#this is a #test
this is # another #test
"
$ echo "$text" | sed -e 's/\s*#.*$//'
this is a
this is
Next this removes any resulting blank lines:
$ echo "$text" | sed -e 's/\s*#.*$//' | sed -e '/^\s*$/d'
Caution: Depending on the syntax and/or interpretation of the lines you're processing, this might not be an appropriate solution, as it just blindly removes the ends of lines, even if the '#' is part of your data or code. However, for use cases where you'll never use a hash except as an end-of-line comment, it works fine. So, as with all coding, context must be taken into consideration.
Alternative variant, using grep:
cat file.txt | grep -Ev '(#.*$)|(^$)'
You can use awk:
awk 'NF && !/^[ \t]*#/' file
The first example (paxdiablo's) is very good, except that it doesn't change the file, it just outputs the result. If you want to change it in place:
sudo sed -i 's/#.*$//;/^$/d' inputFile
On (one of) my linux boxes, sed understands extended regular expressions with the -r option, so:
sed -r '/(^\s*#)|(^\s*$)/d' squid.conf.installed
is very useful for showing all non-blank, non-comment lines.
The regex matches lines that start with zero or more spaces or tabs followed by a hash, or lines that consist only of zero or more spaces or tabs, and deletes those matching lines from the input.
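A quick illustration (my example, not from the original answer; assuming GNU sed for -r and \s):
$ printf '# comment\n\n   # indented comment\nhttp_port 3128\n' | sed -r '/(^\s*#)|(^\s*$)/d'
http_port 3128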