Finding it difficult to extract digits from string using sed

I am trying to extract the version information from a string using sed as follows:
echo "A10.1.1-Vers8" | sed -n "s/^A\([0-9]+\)\.\([0-9]\)\.[0-9]+-.*/\1/p"
I want to extract '10' after 'A', but the above expression doesn't give the expected output. Could someone please explain why this statement doesn't work?
I tried the above command and changed the options of sed, but nothing works. I think this is some syntax error.
echo "A10.1.1-Vers10" | sed -n "s/^X\([0-9]+\)\.\([0-9]\)\.[0-9]+-.*/\1/p"
Expected result is '10'
Actual result is empty output

$ echo "A10.1.1-Vers8" | sed -r 's/^A([[:digit:]]+)\.(.*)$/\1/g'
10
Search for a string starting with A (^A), followed by one or more digits (I am using the POSIX character class [[:digit:]]+), which are captured in a group (), followed by a literal dot \., followed by everything else (.*)$.
Finally, replace the whole thing with the captured group content \1.
In GNU sed, -r (--regexp-extended in the man page) enables extended regular expressions, so + and ( ) work without backslashes.
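As for why the command in the question prints nothing: without -r (or -E), sed uses basic regular expressions, where an unescaped + is a literal plus sign, so [0-9]+ never matches the digits in "A10.". A minimal sketch of two possible fixes, assuming the same input as the question; the function names are made up for illustration:

```shell
# BRE: \{1,\} is the POSIX spelling of "one or more" (GNU sed also accepts \+).
major_bre() {
  sed -n 's/^A\([0-9]\{1,\}\)\.[0-9]\{1,\}\.[0-9]\{1,\}-.*/\1/p'
}

# ERE: with -E, + and ( ) need no backslashes.
major_ere() {
  sed -nE 's/^A([0-9]+)\.[0-9]+\.[0-9]+-.*/\1/p'
}

echo "A10.1.1-Vers8" | major_bre
echo "A10.1.1-Vers8" | major_ere
```

Both calls should print 10 for this input.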

GNU grep is an alternative to sed:
$ echo "A10.1.1-Vers10" | grep -oP '(?<=^A)[0-9]+'
10
The -o option tells grep to print only the matched characters.
The -P option tells grep to match Perl regular expressions, which enables the (?<= lookbehind zero-length assertion.
The lookbehind assertion (?<=^A) ensures there is an A at the beginning of the line, but doesn't include it as part of the match for output.
If you need to match more of the version string, you can use a lookahead assertion:
$ echo "A10.1.1-Vers10" | grep -oP '(?<=^A)[0-9]+(?=\.[0-9]+\.[0-9]+-.*)'
10

Related

Extract version from string cross platform

I need to create a line in makefile which will extract the version from string, and will work cross-platform, ideally without dependencies.
This is what I had
echo "golangci-lint has version 1.42.0 built..." | grep -oP '\d+\.\d+\.\d'
result: 1.42.0
But it doesn't work on mac.
I am trying to do it with sed like this, but it doesn't work:
echo "golangci-lint has version 1.42.0 built ..." | sed -n 's/.*\(\d+\.\d+\.\d\).*/\1/p'
grep -ow '[0-9][0-9.]\+[0-9]'
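That uses only a basic regular expression, and options that BSD grep and GNU grep share. Strictly speaking, \+ is an extension to POSIX basic regular expressions; \{1,\} is the standard spelling.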
You can use
echo "golangci-lint has version 1.42.0 built ..." | sed -En 's/.*([0-9]+\.[0-9]+\.[0-9]+).*/\1/p'
Details:
-E - enables the POSIX ERE syntax
-n - suppresses the default line output
.*([0-9]+\.[0-9]+\.[0-9]+).* - any text, then Group 1 capturing one or more digits, ., one or more digits, ., one or more digits and the rest of the line
\1 - the replacement is just Group 1 value
p - only the substitution result is printed.
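One caveat with the leading .* is that it is greedy, so it leaves only a single digit for the first [0-9]+. The sample's 1.42.0 is unaffected, but a multi-digit first number would be truncated. A small sketch of the effect and one possible workaround, assuming no digit appears on the line before the version; the function names and the sample line are hypothetical:

```shell
# Greedy leading .* leaves only one digit for the first [0-9]+.
version_greedy() {
  sed -En 's/.*([0-9]+\.[0-9]+\.[0-9]+).*/\1/p'
}

# [^0-9]* cannot swallow digits, so the whole first number is captured.
# Assumption: no digit appears in the line before the version.
version_safe() {
  sed -En 's/[^0-9]*([0-9]+\.[0-9]+\.[0-9]+).*/\1/p'
}

echo "tool has version 11.42.0 built" | version_greedy   # prints 1.42.0
echo "tool has version 11.42.0 built" | version_safe     # prints 11.42.0
```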
With your shown samples, you could try the following awk program, which prints only the matched version out of the whole line.
echo "golangci-lint has version 1.42.0 built ..." |
awk '
{
match($0,/[0-9]+\.[0-9]+\.[0-9]+/)
print substr($0,RSTART,RLENGTH)
}
'
Explanation: echo prints the line and pipes it to awk as standard input. The match function looks for the regex in the line, and substr then prints the matched text (a line with no match prints an empty line).
Explanation of regex:
[0-9]+\.[0-9]+\.[0-9]+: matches 1 or more digits, followed by a dot, followed by 1 or more digits, followed by another dot, followed by 1 or more digits.
-P is an experimental feature in GNU grep and is not available in the BSD grep that ships with macOS. However, the default macOS grep can handle this easily with the -E switch, but you have to use [0-9] or [[:digit:]] in place of \d in your search pattern:
s="golangci-lint has version 1.42.0 built..."
grep -Eo '([0-9]+\.)+[0-9]+' <<< "$s"
# or else
grep -Eo '([[:digit:]]+\.)+[[:digit:]]+' <<< "$s"
1.42.0
As a side note, I have GNU grep installed on my Mac via the Homebrew package.
Suggesting the following:
echo "golangci-lint has version 1.42.0 built..." | grep -o '[0-9\.]\{4,\}'
Explanation
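[0-9\.] --- match a single digit or dot (.). Inside a bracket expression the backslash is not an escape in POSIX tools, so this also matches a literal backslash; [0-9.] is the safer spelling.
\{4,\} --- the matched character, 4 or more times.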
This awk is 100% POSIX:
awk 'match($0, /[0-9][0-9.]+[0-9]/) {print substr($0, RSTART, RLENGTH)}'
It will always print the first match and only (up to) one match per line. There can be zero or more dots in the number, but leading/trailing dots won't get printed.
grep -o is quite portable, but not every platform supported by Go has it, e.g. IBM AIX. Also note that if a line has multiple matches, it will print each match on a new line.
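The difference between the two behaviours can be seen with a line that contains two version-like strings. This is only an illustrative sketch; the sample line and the function names are made up:

```shell
line="built with 1.42.0, requires 1.17.5"

# grep -o prints every match, one per line.
all_versions() {
  printf '%s\n' "$1" | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+'
}

# awk's match() reports only the first match on the line.
first_version() {
  printf '%s\n' "$1" |
    awk 'match($0, /[0-9]+\.[0-9]+\.[0-9]+/) { print substr($0, RSTART, RLENGTH) }'
}

all_versions "$line"    # two lines of output
first_version "$line"   # one line of output
```

In a makefile, where exactly one value is usually wanted, the awk form (or piping grep -o through head -n 1) avoids surprises.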

Parsing a string with sed

I have a string like prefix-2020.80-suffix-1
Here are all of possible combinations of input string
"2020.80-suffix-1"
"2020.80-suffix"
"prefix-2020.80"
"prefix-2020.80-1"
I need to cut out 2020 and assign it to a variable, but cannot get my desired output.
Here is what I have so far:
set var=`echo "prefix-2020.80-suffix-1" | sed "s/[[:alnum:]]*-*\([0-9]*\).*/\1/"`
My regexp does not work for the other cases and I cannot figure out why! It's more complicated than Python's regexp syntax.
This should work for all your inputs:
sed 's/.*\(^\|-\)\([0-9]*\)\..*/\2/' test
Matches either the start of the line or everything up to a -, followed by [number]., and captures the number.
The problem with your original was that it didn't take into account the case where there is no prefix.
You can use this grep -oP:
echo "prefix-2020.80-suffix-1" | grep -oP '^([[:alnum:]]+-)?\K[0-9]+'
2020
Using sed (with extended regex):
echo "prefix-2020.80-suffix-1" |sed -r 's/^([^-]*-|)([0-9]+).*/\2/'
Using grep:
echo "prefix-2020.80-suffix-1" |grep -oP "^([^-]*-|)\K\d+"
2020
-P is for Perl regex.
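The answers above rely on GNU extensions (\| in a basic regex, -r, grep -P). A sketch that sticks to extended regular expressions via sed -E and runs over all four inputs from the question might look like this; extract_year is a hypothetical helper name, and var=$(...) is the POSIX sh form of the assignment:

```shell
# Optional "prefix-" group, then capture the digits before the first dot.
extract_year() {
  printf '%s\n' "$1" | sed -E 's/^([^-]*-)?([0-9]+)\..*/\2/'
}

for s in "2020.80-suffix-1" "2020.80-suffix" "prefix-2020.80" "prefix-2020.80-1"; do
  var=$(extract_year "$s")
  echo "$s -> $var"
done
```

Each input should yield 2020, because the prefix group is optional and the capture stops at the first dot.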

Extract pattern between a substring and first occurrence of numeric in a string

Following is the content of a file:
xxx_component1-1.0-2-2acd314.xc-linux-x86-64-Release-devel.r
xxx_component2-3.0-1-fg3sdhd.xc-linux-x86-64-Release-devel.r
xxx_component3-1.0-2-3gsjcgd.xc-linux-x86-64-Release-devel.r
xxx_component4-0.0-2-2acd314.xc-linux-x86-64-Release-devel.r
I want to extract the component names: component1, component2, etc.
This is what I tried:
for line in `sed -n -e '/^xxx-/p' $file`
do
comp=`echo $line | sed -e '/xxx-/,/[0-9]/p'`
echo "comp - $comp"
done
I also tried sed -e 's/.*xxx-\(.*\)[^0-9].*/\1/'
This is based on some info from the net. Please give me a sed command and, if possible, explain it stepwise.
Part 2: I also need to extract the version number from the string.
The version number starts with a digit and ends just before the . that is followed by xc-linux.
As you can see, to maintain uniqueness it has random alphanumeric characters (length 7) as part of the version number.
For example :
xxx_component1-1.0-2-2acd314.xc-linux-x86-64-Release-devel.r
In this string the version number is : 1.0-2-2acd314
There are quite a few ways to extract the data. The simplest form would be grep.
GNU grep:
You can grab the required data using GNU grep with PCRE option -P:
$ cat file
xxx_component1-1.0-2-2acd314.xc-linux-x86-64-Release-devel.r
xxx_component2-3.0-1-fg3sdhd.xc-linux-x86-64-Release-devel.r
xxx_component3-1.0-2-3gsjcgd.xc-linux-x86-64-Release-devel.r
xxx_component4-0.0-2-2acd314.xc-linux-x86-64-Release-devel.r
$ grep -oP '(?<=_)[^-]*' file
component1
component2
component3
component4
Here we use a positive lookbehind assertion to capture everything after the _ up to, but not including, the first -.
awk:
$ awk -F"[_-]" '{print $2}' file
component1
component2
component3
component4
Here we tell awk to use - and _ as delimiters and print the second column.
sed:
Having said that, you can also use sed to extract required data using group capture:
$ sed 's/.*_\([^-]*\)-.*/\1/' file
component1
component2
component3
component4
The regex matches any character zero or more times up to an _. From that point onwards, it captures everything up to the next - in a group. In the replacement part we just use the data captured in the group by calling it with a back reference, that is, \1.
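Part 2 of the question (the version number) can be handled with the same group-capture idea. A minimal sketch, assuming every line has the shape shown in the question, with the version sitting between the first - and the literal .xc-linux; extract_version is a made-up name:

```shell
# Drop everything through the first "-", then capture up to ".xc-linux".
extract_version() {
  sed 's/^[^-]*-\(.*\)\.xc-linux.*/\1/'
}

echo "xxx_component1-1.0-2-2acd314.xc-linux-x86-64-Release-devel.r" | extract_version
```

For the sample line this should print 1.0-2-2acd314.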

why can't match the content between # and end of line in sed?

$ echo "haha#nihao" | sed "s/#.+$/end/"
haha#nihao
I want to match the contents between the character # and the end of the line. Why can't I get it?
:%!sed "s/#.\+$/end/"
E194:No alternate file name to substitute for '#'
problem 1:
Why can't I use it with sed inside vim?
problem 2:
How do I look up error E194?
problem 1: why can't I use it with sed inside vim?
Because by default sed uses BRE, basic regular expressions:
/.+/ this matches any character followed by a "+"
/.\+/ this matches one or more occurrences of any character
You can tell sed to use extended regular expressions with the -r flag in GNU implementations and the -E flag in BSD implementations (recent GNU sed accepts -E as well):
$ echo "haha#nihao" | sed -r "s/#.+$/end/"
hahaend
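The BRE/ERE difference can be checked directly. The following sketch uses a made-up input ending in a literal + to show what .+ means in a basic regex, alongside two portable ways to say "one or more":

```shell
# BRE: ".+" means "any character, then a literal +", so this matches "#x+".
echo "haha#x+" | sed 's/#.+$/end/'

# BRE with the POSIX interval \{1,\} for "one or more".
echo "haha#nihao" | sed 's/#.\{1,\}$/end/'

# ERE via -E, where + is a quantifier.
echo "haha#nihao" | sed -E 's/#.+$/end/'
```

All three commands should print hahaend.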
problem 2: how do I look up error E194?
You get this error because # has a special meaning in vim when you run commands with !: # marks on the command line are replaced with the alternate file. It should work if you escape the #:
%!sed "s/\#.\+$/end/"
You can read about this error with the :help E194 command, and about the alternate file with :help alternate-file.
use sed -r
echo "haha#nihao" | sed -r "s/#.+$/end/"
hahaend
from man sed
-r, --regexp-extended
use extended regular expressions in the script.

How do I push `sed` matches to the shell call in the replacement pattern?

I need to replace several URLs in a text file with some content dependent on the URL itself. Let's say for simplicity it's the first line of the document at the URL.
What I'm trying is this:
sed "s/^URL=\(.*\)/TITLE=$(curl -s \1 | head -n 1)/" file.txt
This doesn't work, since \1 is not set. However, the shell is getting called. Can I somehow push the sed match variables to that subprocess?
The accepted answer is just plain wrong. Proof:
Make an executable script foo.sh:
#! /bin/bash
echo $* 1>&2
Now run it:
$ echo foo | sed -e "s/\\(foo\\)/$(./foo.sh \\1)/"
\1
$
The $(...) is expanded before sed is run.
So you are trying to call an external command from inside the replacement pattern of a sed substitution. I don't think it can be done; the $... inside a pattern just allows you to use an already existing (constant) shell variable.
I'd go with Perl; see the /e option of the search-replace operator (s/.../.../e).
UPDATE: I was wrong, sed plays nicely with the shell, and it allows you to do that. But then the backslash in \1 should be escaped. Try instead:
sed "s/^URL=\(.*\)/TITLE=$(curl -s \\1 | head -n 1)/" file.txt
Try this:
sed "s/^URL=\(.*\)/\1/" file.txt | while read url; do sed "s#URL=\($url\)#TITLE=$(curl -s $url | head -n 1)#" file.txt; done
If there are duplicate URLs in the original file, then there will be n^2 of them in the output. The # as a delimiter depends on the URLs not including that character.
Late reply, but making sure people don't get thrown off by the answers here: this can be done in GNU sed using the e flag of the s command. The following, for example, decrements a number at the beginning of a line:
echo "444 foo" | sed "s/\([0-9]*\)\(.*\)/expr \1 - 1 | tr -d '\n'; echo \"\2\";/e"
will produce:
443 foo
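Where GNU sed's e flag is not available, one alternative is to do the matching in the shell itself, so the captured text is an ordinary variable by the time the command runs. A rough sketch follows; fetch_title is a hypothetical stand-in for curl -s "$1" | head -n 1 so the example runs offline, and rewrite_urls is likewise a made-up name:

```shell
# Hypothetical stand-in for: curl -s "$1" | head -n 1
fetch_title() {
  printf 'title of %s\n' "$1"
}

# Read lines from stdin; rewrite URL=... lines, pass everything else through.
rewrite_urls() {
  while IFS= read -r line; do
    case $line in
      URL=*) printf 'TITLE=%s\n' "$(fetch_title "${line#URL=}")" ;;
      *)     printf '%s\n' "$line" ;;
    esac
  done
}

printf 'URL=http://example.com/a\nother line\n' | rewrite_urls
```

Each line is processed once, so duplicate URLs do not multiply in the output, and no sed delimiter has to be chosen around the URL's characters.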