Understanding working of sed in bash - sed

#!/bin/bash
echo "the first application of sed"
sed -e 's/^\([0-9]\{3\}\)/(\1)/' s.txt
echo "the second application of sed"
sed -e 's/^\([0-9]\{3\}\)/(\1\+\1)/' s.txt
echo "see the original file"
cat s.txt
the first application of sed
(905)-123-3456
(905)-124-3456
(905)-125-3456
(905)-126-3456
(905)-127-3456
the second application of sed
(905+905)-123-3456
(905+905)-124-3456
(905+905)-125-3456
(905+905)-126-3456
(905+905)-127-3456
see the original file
905-123-3456
905-124-3456
905-125-3456
905-126-3456
905-127-3456
I'm just starting out in shell programming and for the last 2 hours I'm stuck with this code. I know the basic usage of sed but I cannot figure out what the line
sed -e 's/^\([0-9]\{3\}\)/(\1)/' s.txt
does. I know -e is expression, s is substitute. ^ indicates beginning of line but the part after that is confusing. Any ideas?

Ultimately, it is a manual-bashing exercise.
\( marks the start of a capture, up to the balanced \) — they can be nested, though these ones don't.
\{ marks the start of a repeat specification up to the following \} — they cannot be nested. In this case, you have \{3\} so this repeats the previous item, [0-9], three times.
The \1 in the replacement refers to the material captured by the first \( in the search pattern.
Hence:
s/^\([0-9]\{3\}\)/(\1)/
wraps the three digits at the start of the line in parentheses — as shown in your output. Because it is anchored, it happens just once. If a line doesn't start with three digits, nothing happens to that line as a result of this command.
The second example is only marginally different. It takes the sequence of three digits at the start of the line and replaces it with that sequence, a + mark, and the sequence again, all wrapped in parentheses — as shown in your output.
There are relatively few metacharacters in the replacement part of a s/// command; there are a lot of metacharacters in the search part. Further, there are different dialects in the search part — some variants of sed support 'extended regular expressions' instead of 'basic regular expressions' (which is what your example uses); others support Perl-like expressions (not quite the full PCRE — Perl Compatible Regular Expressions — as far as I know, but some notations from PCRE). For that, you need to read the manual for the sed you're using.
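As a small illustration of the dialect point, the sketch below runs the same substitution in basic and in extended syntax on one sample line from the question. It assumes a sed that accepts the -E flag for extended regular expressions (current GNU and BSD seds do); the variable names are only for illustration.

```shell
# Sample line taken from s.txt in the question.
line='905-123-3456'

# Basic regular expressions (the default): grouping and repetition need backslashes.
bre=$(printf '%s\n' "$line" | sed -e 's/^\([0-9]\{3\}\)/(\1+\1)/')

# Extended regular expressions (-E): the same operators are written without backslashes.
ere=$(printf '%s\n' "$line" | sed -E -e 's/^([0-9]{3})/(\1+\1)/')

echo "$bre"
echo "$ere"
```

Both commands should print (905+905)-123-3456; only the spelling of the search pattern differs, while the replacement part stays the same.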

Let's break this down:
sed -e 's/^\([0-9]\{3\}\)/(\1)/' s.txt
The nomenclature of sed's substitute is like this:
s/search/replace/options
In your case, the search part is ^\([0-9]\{3\}\). In the basic regular expressions that sed uses by default, parentheses and curly brackets only take on their special meaning when they are preceded by a \. If we drop the backslashes for readability, this is how it looks:
^([0-9]{3})
It means the line should start with a digit between 0 and 9, repeated 3 times. So basically, it's a 3-digit number (e.g. 123, 543, etc.).
The parentheses () group the 3-digit number, which can then be referred to as the first group.
The replace part is (\1). That means the group we captured in the search part is written back, wrapped in literal parentheses.
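To see the effect of the ^ anchor, here is a minimal sketch that feeds three made-up lines through the same command. Only the first line is from the question's s.txt; the other two are hypothetical inputs chosen so that they do not start with three digits.

```shell
# Only the first line starts with three digits; the other two are
# hypothetical inputs that the anchored pattern should leave alone.
out=$(printf '%s\n' '905-123-3456' 'x905-123-3456' '12-345' |
  sed -e 's/^\([0-9]\{3\}\)/(\1)/')

echo "$out"
```

The first line comes out as (905)-123-3456, while the other two are printed unchanged, because sed still prints lines on which the substitution did not match.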

Related

gnu sed remove portion of line after pattern match with special characters

The goal is to use sed to return only the url from each line of FF extension Mining Blocker which uses this format for its regex lines:
{"baseurl":"*://002.0x1f4b0.com/*", "suburl":"*://*/002.0x1f4b0.com/*"},
{"baseurl":"*://003.0x1f4b0.com/*", "suburl":"*://*/003.0x1f4b0.com/*"},
the result should be:
002.0x1f4b0.com
003.0x1f4b0.com
One way would be to keep everything after suburl":"*://*/ then remove each occurrence of /*"},
I found https://unix.stackexchange.com/questions/24140/return-only-the-portion-of-a-line-after-a-matching-pattern but the special characters are a problem.
this won't work:
sed -n -e s#^.*suburl":"*://*/##g hosts
Would someone please show me how to mark the 2 asterisks in the string so they are seen by regex as literal characters, not wildcards?
edit:
sed -n 's#.*://\*/\([^/]\+\)/.*#\1#p' hosts
doesn't work, unfortunately.
regarding character substitution, thanks for directing me to the references.
I reduced the searched-for string to //*/ and used ASCII character codes like this:
sed -n -e s#^.*\d047\d047\d042\d047##g hosts
Unfortunately, that didn't output any changes to the lines.
My assumptions are:
^.*something specifies everything up to and including the last occurrence of "something" in a line
sed -n -e s#search##g deletes (replace with nothing) "search" within a line
So, this line:
sed -n -e s#^.*\d047\d047\d042\d047##g hosts
Should output everything after //*/ in each line...except it doesn't.
What is incorrect with that line?
Regarding deleting everything including and after the first / AFTER that first operation, yes, that's wanted too.
This might work for you (GNU sed):
sed -n 's#.*://\*/\([^/]\+\)/.*#\1#p' file
Match greedily (the longest string that matches) all characters up to ://*/, followed by a group of characters (which will be referred to as \1) that do not match a /, followed by the rest of the line and replace it by the group \1.
N.B. the sed substitution delimiters are arbitrary; here # was chosen so as to make matching / easier. Also, the character * on the left-hand side of the substitution command may be interpreted as a metacharacter meaning zero or more of the previous character/group, so it is quoted as \* to stop it from taking on that meaning. Finally, the option -n turns off the usual printing of everything in the pattern space after all the sed commands have been executed. The p flag on the substitution command prints the pattern space following a successful substitution, so the output contains only the URLs, or nothing at all.
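A quick way to try this out is sketched below. It assumes GNU sed (for \+) and feeds the two sample lines from the question through a here-document instead of the hosts file; the third line is a made-up non-matching line added to show what -n and p do together.

```shell
# Two sample lines from the question plus one hypothetical line without a suburl.
urls=$(sed -n 's#.*://\*/\([^/]\+\)/.*#\1#p' <<'EOF'
{"baseurl":"*://002.0x1f4b0.com/*", "suburl":"*://*/002.0x1f4b0.com/*"},
{"baseurl":"*://003.0x1f4b0.com/*", "suburl":"*://*/003.0x1f4b0.com/*"},
this line has no suburl and is dropped
EOF
)

echo "$urls"
```

Only the two host names are printed. The non-matching line disappears because -n suppresses the default output and p only fires after a successful substitution.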

sed character to leave a match untouched

If I have
123456red100green
123456bee010yellow
123456usb110orange
123456sos011querty
123456let101bottle
and I want it to be
123456red111green
123456bee111yellow
123456usb111orange
123456sos111querty
123456let111bottle
notice: the first 6 characters don't change,
the following 6 change.
also these strings might be anywhere in a file (beginning, end, anywhere)
I want to specify sed to
1)find 123456
2)skip the next three characters
3)replace the next three with 111
The closest I've come to is:
sed '/s/123456....../123456...111/g'
I know dots mean anything but I don't know the equivalent on the other side. In short how to command sed to leave characters in a match untouched.
Sorry for having been unclear about what I want; please bear with me.
Matching 123456 followed by three characters that are not to be modified, and then replacing the next three characters with 111:
sed 's/\(123456...\).../\1111/g' file
The \( ... \) captures the part of the string that we don't want to modify. These are re-inserted with \1. The whole matching bit of the line is replaced by "the bit in the \( ... \) (i.e. \1) followed by 111".
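As a rough check, the command can be run on a few of the sample lines. In this sketch the third line has a hypothetical xx prefix added, to show that the match does not have to sit at the start of the line.

```shell
# Two lines from the question and one with a hypothetical "xx" prefix.
out=$(printf '%s\n' '123456red100green' '123456bee010yellow' 'xx123456usb110orange' |
  sed 's/\(123456...\).../\1111/g')

echo "$out"
```

Back-references in sed are a single digit, so \1111 is read as \1 followed by the literal text 111.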
If you want to change each and every zero (as in your examples), then just sed 's/0/1/g' would do. Or sed -e '/^123456/ s/0/1/g' to do the same on lines starting with 123456.
But to count characters, as you ask, use ( .. ) to capture the varying parts and \1 to replace them (using sed -E). So:
echo 123456abcdefgh | sed -Ee 's/^(123456...).../\1111/'
outputs 123456abc111gh. The \1 puts back the part matched by 123456..., and the 111 that follows it is three literal characters.
(Without -E, you'd need \( .. \) to group.)

Decode sed expression

I would like to understand the sed part of this code:
/usr/local/bin/pcsensor -l60 -n | sed -e "s/^.*\$/PUTVAL downloads\/exec-environmental\/temperature-cpu interval=30 N:\0/"
(the input) pcsensor produces:
2016/09/19 22:41:31 Temperature 90.50F 32.50C
The code produces (output):
PUTVAL downloads/exec-environmental/temperature-cpu interval=30 N:32.50
I am hoping that understanding the sed expression will help me to knock the last digit off (so the temp is only 1 decimal place).
Updated: My booboo (it was late):
the -n in the first part of the command outputs this:
32.50
Which works fine in an echo/printf
printf "32.50 %s\n"| sed -e "s/^.*\$/PUTVAL downloads\/exec-environmental\/temperature-cpu interval=30 N:\0/"
About
sed -e "s/^.*\$/PUTVAL downloads\/exec-environmental\/temperature-cpu interval=30 N:\0/"
This is 1 sed command, namely the s/.../.../ for "substitute". In simple terms, it does a single "search and replace" for every line that it gets to work on.
The "search" part is ^.*\$, the "replacement" part is PUTVAL downloads\/exec-environmental\/temperature-cpu interval=30 N:\0 (the final / only closes the command).
^.*\$ is a simple regular expression that here stands for "everything" or "the whole line"; inside double quotes, the shell hands \$ to sed as a plain $. So, the s command will replace the whole line with
PUTVAL downloads\/exec-environmental\/temperature-cpu interval=30 N:\0
As Benjamin W. pointed out the use of \0 is "weird". It apparently was meant as a so-called reference, so that the part we searched for is appended after the text "PUTVAL(...)val=30 N:".
I have several issues with the way this is presented, though.
\0 is not in the manpage of my Debian GNU Sed 4.2.2.
Quoting the sed command with " is not needed here and makes things unnecessarily complicated and error-prone. Single quotes should be used instead.
A \0 anywhere in a Shell and especially in Sed could very well stand for a null character which here raises even more red flags due to the " quoting.
Using sed just to prepend a text is "useless use of Sed".
Since you asked about sed, here is how I would write it:
sed -e 's/^.*$/PUTVAL downloads\/exec-environmental\/temperature-cpu interval=30 N:&/'
& stands for "what the search part found". In your case, the whole line.
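A minimal sketch of that command in action, using the 32.50 reading from the question as a fixed input in place of the real pcsensor call:

```shell
# 32.50 stands in for the output of pcsensor -l60 -n.
line=$(printf '32.50\n' |
  sed -e 's/^.*$/PUTVAL downloads\/exec-environmental\/temperature-cpu interval=30 N:&/')

echo "$line"
```

This should print the PUTVAL text followed by 32.50, with the escaped \/ in the replacement showing up as plain slashes.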
In order to cut off the last decimal, there are many ways to achieve this. A rather simple approach assumes that the input always has 2 decimals. Then we could prepend a command that replaces the last character (.$) with "nothing" (//):
sed -e 's/.$//;s/^[0-9][0-9]*\.[0-9]/PUTVAL downloads\/exec-environmental\/temperature-cpu interval=30 N:&/'
However, as I said, sed is overkill here. You could just use for instance printf:
text='PUTVAL downloads/exec-environmental/temperature-cpu interval=30 N:'
printf "%s%3.1f\n" "$text" "$(/usr/local/bin/pcsensor -l60 -n)"
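The two approaches are not quite interchangeable: the sed version cuts the last digit off, while printf rounds. The sketch below shows the difference with a hypothetical reading of 32.56 standing in for the pcsensor output; it assumes a locale that uses "." as the decimal separator.

```shell
text='PUTVAL downloads/exec-environmental/temperature-cpu interval=30 N:'
reading='32.56'   # hypothetical stand-in for the output of pcsensor -l60 -n

# sed variant: removes the last character, so the value is truncated.
truncated=$(printf '%s\n' "$reading" |
  sed -e 's/.$//;s/^[0-9][0-9]*\.[0-9]/PUTVAL downloads\/exec-environmental\/temperature-cpu interval=30 N:&/')

# printf variant: formats the number with one decimal, so the value is rounded.
rounded=$(printf '%s%3.1f\n' "$text" "$reading")

echo "$truncated"
echo "$rounded"
```

With this input the sed line ends in N:32.5 and the printf line ends in N:32.6, so which one to use depends on whether truncation or rounding is wanted.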

Extract CentOS mirror domain names using sed

I'm trying to extract a list of CentOS domain names only from http://mirrorlist.centos.org/?release=6.4&arch=x86_64&repo=os
Stripping the "http://" or "ftp://" prefix and everything from the first "/" onward should result in a list like:
yum.phx.singlehop.com
mirror.nyi.net
bay.uchicago.edu
centos.mirror.constant.com
mirror.teklinks.com
centos.mirror.netriplex.com
centos.someimage.com
mirror.sanctuaryhost.com
mirrors.cat.pdx.edu
mirrors.tummy.com
I searched stackoverflow for the sed method but I'm still having trouble.
I tried doing this with sed
curl "http://mirrorlist.centos.org/?release=6.4&arch=x86_64&repo=os" | sed '/:\/\//,/\//p'
but it doesn't look like it is doing anything. Can you give me some advice?
Here you go:
curl "http://mirrorlist.centos.org/?release=6.4&arch=x86_64&repo=os" | sed -e 's?.*://??' -e 's?/.*??'
Your sed was completely wrong:
/x/,/y/ is a range. It selects multiple lines, from a line matching /x/ until a line matching /y/
The p command prints the selected range
Since all lines match both the start and end pattern you used, you effectively selected all lines. And, since sed echoes the input by default, the p command results in duplicated lines (all lines printed twice).
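The duplication can be reproduced offline with a small sketch. The two URLs below reuse host names from the question, but their paths are made up, and printf stands in for the curl call.

```shell
# Two hypothetical mirror URLs; both contain "://" and "/".
input=$(printf '%s\n' 'http://mirror.nyi.net/centos/' 'ftp://mirrors.tummy.com/pub/')

# The original attempt: a /start/,/end/ range address plus the p command.
out=$(printf '%s\n' "$input" | sed '/:\/\//,/\//p')

echo "$out"
```

Each input line appears twice in the output: once from the p command and once from sed's default printing.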
In my fix:
I used s??? instead of s/// because this way I didn't need to escape all the / in the patterns, so it's a bit more readable this way
I used two expressions with the -e flag:
s?.*://?? matches everything up until :// and replaces it with nothing
s?/.*?? matches everything from / until the end and replaces it with nothing
The two expressions are executed in the given order
In modern versions of sed you can omit -e and separate the two expressions with ;. I stick to using -e because it's more portable.
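Here is a small offline sketch of the fix, in both the -e form and the ; form. The host names come from the question, the paths after them are invented for the example, and printf replaces the curl call.

```shell
# Hypothetical mirror list entries (host names from the question, paths made up).
input=$(printf '%s\n' 'http://mirror.nyi.net/centos/6.4/os/x86_64/' \
                      'ftp://mirrors.tummy.com/pub/centos/6.4/os/x86_64/')

# Two expressions given with -e, applied in order to every line.
domains=$(printf '%s\n' "$input" | sed -e 's?.*://??' -e 's?/.*??')

# The same two commands in one script, separated by a semicolon.
domains_semicolon=$(printf '%s\n' "$input" | sed 's?.*://??;s?/.*??')

echo "$domains"
```

Both variants should print mirror.nyi.net and mirrors.tummy.com, one per line.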

how to use sed/awk to remove words with multiple pattern count

I have a file of string records where one of the fields - delimited by "," - can contain one or more "-" inside it.
The goal is to delete the field value if it contains more than two "-".
I am trying to recover my past knowledge of sed/awk but can't make much headway.
==========
info,whitepaper,Data-Centers,yes-the-6-top-problems-in-your-data-center-lane
info,whitepaper,Data-Centers,the-evolution-center
info,whitepaper,Data-Centers,the-evolution-of-lan-technology-lanner
==========
expected outcome:
info,whitepaper,Data-Centers
info,whitepaper,Data-Centers,the-evolution-center
info,whitepaper,Data-Centers
thanks
Try
sed -r 's/(^|,)([^,-]+-){3,}[^,]+(,|$)/\3/g'
or if you're into slashes
sed 's/\(^\|,\)\([^,-]\+-\)\{3,\}[^,]\+\(,\|$\)/\3/g'
Explanation:
I'm using the most basic sed command: substitution. The syntax is: s/pattern/replacement/flags.
Here pattern is (^|,)([^,-]+-){3,}[^,]+(,|$), replacement is \3, flags is g.
The g flag means global replacement (all matching parts are replaced, not only the first in line).
In pattern:
brackets () create a group. Somewhat like in math. They also allow to refer to a group with a number later.
^ and $ mean beginning and end of the string.
| means "or", so (^|,) means "comma or beginning of the string".
square brackets [] mean a character class, ^ inside means negation. So [^,-] means "anything but comma or hyphen". Note that usually the hyphen has a special meaning in character classes: [a-z] means all lowercase letters. But here it's just a hyphen because it's not in the middle.
+ after an expression means "match it 1 or more times" (like * means match it 0 or more times).
{N} means "match it exactly N times". {N,M} is "from N to M times". {3,} means "three times or more". + is equivalent to {1,}.
So this is it. The replacement is just \3. This refers to the third group in (), in this case (,|$). This will be the only thing left after the substitution.
P.S. the -r option just changes which characters need to be escaped: without it, all of ()+{}| are treated as regular chars unless you escape them with \. Conversely, to match a literal ( with the -r option you'll need to escape it.
P.P.S. Here's a reference for sed. man sed is your friend as well.
Let me know if you have further questions.
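For anyone who wants to try it, the sketch below runs the -r version on the three sample records from the question, passed in through a here-document. It assumes GNU sed, as the -r option suggests.

```shell
# The three sample records from the question.
out=$(sed -r 's/(^|,)([^,-]+-){3,}[^,]+(,|$)/\3/g' <<'EOF'
info,whitepaper,Data-Centers,yes-the-6-top-problems-in-your-data-center-lane
info,whitepaper,Data-Centers,the-evolution-center
info,whitepaper,Data-Centers,the-evolution-of-lan-technology-lanner
EOF
)

echo "$out"
```

The first and third records lose their last field, since it contains three or more hyphens. The second record has only two hyphens in that field and passes through unchanged.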
You could try perl instead of sed or awk:
perl -F, -lane 'print join ",", grep { !/-.*-.*-/ } @F' < file.txt
This might work for you:
sed 's/,\{,1\}[^,-]*\(-[^,]*\)\{3,\}//g' file
sed 's/\(^\|,\)\([^,]*-\)\{3\}[^,]*\(,\|$\)//g'
This should work in more cases:
sed 's/,$/\n/g;s/\(^\|,\|\n\)\([^,\n]*-\)\{3\}[^,\n]*\(,\|\n\|$\)/\3/g;s/,$//;s/\n/,/g'