Extract pattern between a substring and first occurrence of numeric in a string


Following is the content of a file:
xxx_component1-1.0-2-2acd314.xc-linux-x86-64-Release-devel.r
xxx_component2-3.0-1-fg3sdhd.xc-linux-x86-64-Release-devel.r
xxx_component3-1.0-2-3gsjcgd.xc-linux-x86-64-Release-devel.r
xxx_component4-0.0-2-2acd314.xc-linux-x86-64-Release-devel.r
I want to extract component names component1 component2 etc.
This is what I tried:
for line in `sed -n -e '/^xxx-/p' $file`
do
comp=`echo $line | sed -e '/xxx-/,/[0-9]/p'`
echo "comp - $comp"
done
I also tried sed -e 's/.*xxx-\(.*\)[^0-9].*/\1/'
This is based on some information I found online. Please give me a sed command and, if possible, explain it step by step.
Part 2. I also need to extract the version number from the string.
The version number starts with a digit and ends just before the . that is followed by xc-linux.
As you can see, to keep it unique, it includes random alphanumeric characters (length 7) as part of the version number.
For example :
xxx_component1-1.0-2-2acd314.xc-linux-x86-64-Release-devel.r
In this string the version number is : 1.0-2-2acd314

There are quite a few ways to extract the data. The simplest form would be grep.
GNU grep:
You can grab the required data using GNU grep with PCRE option -P:
$ cat file
xxx_component1-1.0-2-2acd314.xc-linux-x86-64-Release-devel.r
xxx_component2-3.0-1-fg3sdhd.xc-linux-x86-64-Release-devel.r
xxx_component3-1.0-2-3gsjcgd.xc-linux-x86-64-Release-devel.r
xxx_component4-0.0-2-2acd314.xc-linux-x86-64-Release-devel.r
$ grep -oP '(?<=_)[^-]*' file
component1
component2
component3
component4
Here we use a positive lookbehind assertion to capture everything after the _ up to, but not including, the next -.
awk:
$ awk -F"[_-]" '{print $2}' file
component1
component2
component3
component4
Here we tell awk to use - and _ as delimiters and print the second column.
sed:
Having said that, you can also use sed to extract required data using group capture:
$ sed 's/.*_\([^-]*\)-.*/\1/' file
component1
component2
component3
component4
The regex states that match any character zero or more times up to an _. From that point onwards, capture everything until a - in a group. In the replacement part we just use the data captured in the group by calling it using back reference, that is \1.
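Part 2 of the question (the version number) is not covered by the commands above. The following is a minimal sketch, assuming every line has the shape prefix_name-version.xc-linux... as in the sample file; the function name extract_name_version is made up for illustration.

```shell
# Hypothetical helper: reads lines on stdin and prints "name version".
# Assumes the name follows the first "_" and the version starts with a
# digit and ends just before ".xc-linux".
extract_name_version() {
  sed -n 's/^[^_]*_\([^-]*\)-\([0-9].*\)\.xc-linux.*/\1 \2/p'
}

printf '%s\n' 'xxx_component1-1.0-2-2acd314.xc-linux-x86-64-Release-devel.r' |
  extract_name_version
```

The first group stops at the first - after the underscore; the second group starts at a digit and runs up to the literal .xc-linux. Lines that do not fit this shape are skipped because of -n and the p flag.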

Related

Extract version from string cross platform

I need to create a line in makefile which will extract the version from string, and will work cross-platform, ideally without dependencies.
This is what I had
echo "golangci-lint has version 1.42.0 built..." | grep -oP '\d+\.\d+\.\d'
result: 1.42.0
But it doesn't work on Mac.
I tried to do it with sed like this, but it doesn't work:
echo "golangci-lint has version 1.42.0 built ..." | sed -n 's/.*\(\d+\.\d+\.\d\).*/\1/p'

grep -ow '[0-9][0-9.]\+[0-9]'
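That uses only a basic regular expression, and options that BSD grep and GNU grep share. Note that \+ is an extension rather than POSIX basic regular expression syntax; \{1,\} is the strictly portable spelling.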

You can use
echo "golangci-lint has version 1.42.0 built ..." | sed -En 's/.*([0-9]+\.[0-9]+\.[0-9]+).*/\1/p'
Details:
-E - enables the POSIX ERE syntax
-n - suppresses the default output of each line
.*([0-9]+\.[0-9]+\.[0-9]+).* - any text, then Group 1 capturing one or more digits, ., one or more digits, ., one or more digits and the rest of the line
\1 - the replacement is just Group 1 value
p - only the substitution result is printed.
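One caveat with a greedy leading .* is that it can swallow leading digits: for a version such as 11.42.0 the command above would print 1.42.0. A possible variant, sketched here under the assumption that the version is preceded by the start of the line or by a character that is neither a digit nor a dot (the function name extract_version is hypothetical):

```shell
# Hypothetical helper: prints the last X.Y.Z version found on each input line.
# Group 1 is either the start of the line or any text ending in a character
# that is not a digit or a dot, so the version in group 2 cannot be cut short.
extract_version() {
  sed -E -n 's/(^|.*[^0-9.])([0-9]+\.[0-9]+\.[0-9]+).*/\2/p'
}

printf '%s\n' 'golangci-lint has version 1.42.0 built ...' | extract_version
printf '%s\n' 'tool has version 11.42.0 built ...' | extract_version
```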

With the samples shown, you could try the following awk program, which prints only the matched version from the whole line.
echo "golangci-lint has version 1.42.0 built ..." |
awk '
{
match($0,/[0-9]+\.[0-9]+\.[0-9]+/)
print substr($0,RSTART,RLENGTH)
}
'
Explanation: echo prints the line and pipes it to awk as standard input, where the match function looks for the regex in the line. substr then prints the matched text (an empty line if nothing matched).
Explanation of regex:
[0-9]+\.[0-9]+\.[0-9]+: matches one or more digits, followed by a dot, followed by one or more digits, followed by another dot, followed by one or more digits.

-P (Perl-compatible regular expressions) is a GNU grep feature that is not available in the BSD grep shipped with macOS. However, the default grep on a Mac can handle this easily with the -E switch, as long as you use [0-9] or [[:digit:]] in place of \d in your search pattern:
s="golangci-lint has version 1.42.0 built..."
grep -Eo '([0-9]+\.)+[0-9]+' <<< "$s"
# or else
grep -Eo '([[:digit:]]+\.)+[[:digit:]]+' <<< "$s"
1.42.0
As a side note, I have GNU grep installed on my Mac via the Homebrew package.

Suggesting the following:
echo "golangci-lint has version 1.42.0 built..." | grep -o '[0-9\.]\{4,\}'
Explanation
[0-9\.] --- match a single digit or dot (.); inside a bracket expression the backslash is itself literal, so a backslash matches too
\{4,\} --- the matched character, 4 or more times.

This awk is 100% POSIX:
awk 'match($0, /[0-9][0-9.]+[0-9]/) {print substr($0, RSTART, RLENGTH)}'
It will always print the first match and only (up to) one match per line. There can be zero or more dots in the number, but leading/trailing dots won't get printed.
grep -o is quite portable, but not every platform supported by Go has it, e.g. IBM AIX. Also note that if a line has multiple matches, it will print each match on a new line.
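For use in a makefile it can help if the extraction also reports failure. The following is a rough sketch built on the same awk match approach; the function name first_version and the stricter X.Y.Z pattern are assumptions made for this example.

```shell
# Hypothetical helper: prints the first X.Y.Z version seen on stdin and
# exits with status 0; exits with status 1 if no line contains a version.
first_version() {
  awk 'match($0, /[0-9]+\.[0-9]+\.[0-9]+/) {
         print substr($0, RSTART, RLENGTH)
         found = 1
         exit
       }
       END { exit !found }'
}

printf '%s\n' 'golangci-lint has version 1.42.0 built ...' | first_version
```

Because the function returns a non-zero status when nothing matches, a recipe line calling it fails instead of silently continuing with an empty version.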

Finding it difficult to extract digits from string using sed

I am trying to extract the version information from a string using sed as follows
echo "A10.1.1-Vers8" | sed -n "s/^A\([0-9]+\)\.\([0-9]\)\.[0-9]+-.*/\1/p"
I want to extract '10' after 'A'. But the above expression doesn't give the expected information. Could someone please explain why this statement doesn't work?
I tried the above command and changed the options of sed, but nothing works. I think this is some syntax error.
echo "A10.1.1-Vers10" | sed -n "s/^X\([0-9]+\)\.\([0-9]\)\.[0-9]+-.*/\1/p"
Expected result is '10'
Actual result is nothing (no output)

$ echo "A10.1.1-Vers8" | sed -r 's/^A([[:digit:]]+)\.(.*)$/\1/g'
10
Search for a string starting with A (^A), followed by one or more digits (I am using the POSIX character class [[:digit:]] with +), which are captured in a group (), followed by a literal dot \., followed by everything else (.*)$.
Finally, replace the whole thing with the captured group content \1.
In GNU sed, -r (called --regexp-extended in the man page) enables extended regular expressions, where + and ( ) work without backslashes. The original command fails because in a basic regular expression an unescaped + matches a literal plus sign.
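To make the difference concrete, here is a small sketch comparing the command from the question with two corrected forms. The function names are made up for this example; \{1,\} is the POSIX basic-regex way to write "one or more", and -E is the extended-regex switch accepted by both GNU and BSD sed.

```shell
s='A10.1.1-Vers8'

# As in the question: in a basic regex the unescaped + is a literal plus
# sign, so the pattern never matches and nothing is printed.
original() { sed -n 's/^A\([0-9]+\)\.\([0-9]\)\.[0-9]+-.*/\1/p'; }

# Basic regex with POSIX interval expressions instead of +.
major_bre() { sed -n 's/^A\([0-9]\{1,\}\)\.[0-9]\{1,\}\.[0-9]\{1,\}-.*/\1/p'; }

# Extended regex: + and ( ) need no backslashes.
major_ere() { sed -E -n 's/^A([0-9]+)\.[0-9]+\.[0-9]+-.*/\1/p'; }

printf '%s\n' "$s" | original     # prints nothing
printf '%s\n' "$s" | major_bre    # prints 10
printf '%s\n' "$s" | major_ere    # prints 10
```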

GNU grep is an alternative to sed:
$ echo "A10.1.1-Vers10" | grep -oP '(?<=^A)[0-9]+'
10
The -o option tells grep to print only the matched characters.
The -P option tells grep to match Perl regular expressions, which enables the (?<= lookbehind zero-length assertion.
The lookbehind assertion (?<=^A) ensures there is an A at the beginning of the line, but doesn't include it as part of the match for output.
If you need to match more of the version string, you can use a lookahead assertion:
$ echo "A10.1.1-Vers10" | grep -oP '(?<=^A)[0-9]+(?=\.[0-9]+\.[0-9]+-.*)'
10

How to search in sed for any name matching

How to find structures matching a pattern
struct struct_name {
....
....
};
I'm using
sed -n -e '/struct{/,/}/p'
How can I search for any struct_name?

To extract all struct definitions (POSIX-compliant command):
sed -n '/struct [^ {]\{1,\} {/,/}/p' file
More robust with respect to whitespace variations (POSIX-compliant):
sed -n '/struct[[:blank:]]\{1,\}[^ {]\{1,\}[[:blank:]]*{/,/}/p' file
Alternative, using an extended regular expression (works with both GNU and BSD/macOS sed):
sed -E -n '/struct[[:blank:]]+[^ {]+[[:blank:]]*\{/,/\}/p' file
awk alternative (awk only uses extended regexes):
awk '/struct[[:blank:]]+[^ {]+[[:blank:]]*\{/,/\}/' file
The awk solution has the added advantage that a given struct definition will also be extracted correctly if it is all on a single line: awk looks for the end of a range on the same input line as the start of the range, whereas sed does not.
To extract a specific struct definition by name:
sed doesn't support variables, so your best bet is to splice in a shell variable that the shell expands up front.
name='struct_name' # define name to search for as shell var.
sed -n '/struct '"$name"' {/,/}/p' file # splice shell var. into sed script
Note that I've deliberately not used sed -n "/struct $name {/,/}/p" - a single, double-quoted string expanded by the shell as a whole - so as to make it clear which part of the sed script is expanded by the shell up front.
This works in this simple case, but is tricky business in general, because you must ensure that the expanded variable value contains no regex/sed metacharacters that break the command.
Here's an awk alternative that uses awk variables and literal substring matching to bypass the problem of potentially having to escape the variable value:
awk -v name='struct_name' 'index($0, "struct " name " {"),/}/' file
This solution has the added advantage that the struct definition will also be extracted correctly if it is all on a single line: awk looks for the end of a range on the same input line as the start of the range, whereas sed does not.
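If sed is still preferred, one way to reduce the metacharacter risk mentioned above is to escape the name before splicing it in. This is only a sketch: the helper names escape_bre and extract_struct are hypothetical, and it assumes a basic regular expression with / as the delimiter.

```shell
# Hypothetical helper: backslash-escapes characters that are special in a
# basic regex ( ] [ \ . * ^ $ ) plus the / delimiter used below.
escape_bre() {
  printf '%s\n' "$1" | sed 's/[][\.*^$/]/\\&/g'
}

# Hypothetical helper: $1 = struct name, $2 = file to search.
# Prints from the "struct NAME {" line through the next line containing "}".
extract_struct() {
  pattern=$(escape_bre "$1")
  sed -n "/struct $pattern {/,/}/p" "$2"
}
```

For example, extract_struct struct_name file prints only that definition, and a name containing a dot or an asterisk is matched literally rather than as a regex.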

This will search a text file for struct_name. You can use the -E switch to use an extended regular expression.
grep -no struct_name test.txt
The -n switch causes the line number to be included, the -o means only the matching element of the line will be displayed.

How to truncate the first digit of a number?

For example, my file has the following data:
$ cat sample.txt
19999119999,string1,dddddd
18888135790,string2,dddddd
15555555500,string3,dddddd
This is a sample data. How can we remove ONLY first digit from each row? My output should be:
$ cat output.txt
9999119999,string1,dddddd
8888135790,string2,dddddd
5555555500,string3,dddddd
Is there any way to parse each line character wise using grep or sed?
Or any other way to get the desired output?

You just need to print from the second character on:
$ cut -c2- file
9999119999,string1,dddddd
8888135790,string2,dddddd
5555555500,string3,dddddd
Or, using sed, remove the first char:
$ sed 's/^.//' file
9999119999,string1,dddddd
8888135790,string2,dddddd
5555555500,string3,dddddd

Try this:
$ sed -r 's/^[0-9](.*)/\1/' sample.txt
Output:
9999119999,string1,dddddd
8888135790,string2,dddddd
5555555500,string3,dddddd
^[0-9] - The first digit of each line
(.*) - The content of each line except the first digit
\1 - Denote the content of (.*)
Sorry for my bad English.

Grep can solve this with a look behind. For that you need -P option :
grep -Po '(?<=^\d)(.+)' file
or in shorthand :
grep -Po '^\d\K.+' file
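The (?<=^\d) lookbehind requires a digit at the start of the line without including it in the match; in the shorthand, ^\d matches that digit and \K then discards it from the reported match.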
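Where neither GNU grep nor sed is wanted, the same result can be sketched in plain POSIX shell with parameter expansion. The function name strip_first_digit is made up for illustration; lines that do not start with a digit are passed through unchanged.

```shell
# Hypothetical helper: removes one leading digit from each line on stdin.
# ${line#[0-9]} strips the shortest prefix matching the pattern [0-9],
# i.e. exactly one digit, and leaves other lines untouched.
strip_first_digit() {
  while IFS= read -r line || [ -n "$line" ]; do
    printf '%s\n' "${line#[0-9]}"
  done
}

printf '%s\n' '19999119999,string1,dddddd' | strip_first_digit
```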

sed to remove URLs from a file

I am trying to write a sed expression that can remove urls from a file
example
http://samgovephotography.blogspot.com/ updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)
Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N https://hollywoodmomblog.com/?p=2442 Thx to HMB Contributor #kdpartak :)
But I dont get it:
sed 's/[\w \W \s]*http[s]*:\/\/\([\w \W]\)\+[\w \W \s]*/ /g' posFile
FIXED!!!!!
handles almost all cases, even malformed URLs
sed 's/[\w \W \s]*http[s]*[a-zA-Z0-9 : \. \/ ; % " \W]*/ /g' positiveTweets | grep "http" | more

The following removes http:// or https:// and everything up until the next space:
sed -e 's!http\(s\)\{0,1\}://[^[:space:]]*!!g' posFile
updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)
Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N Thx to HMB Contributor #kdpartak :)
Edit:
I should have used:
sed -e 's!http[s]\?://\S*!!g' posFile
"[s]\?" is a far more readable way of writing "an optional s" compared to "\(s\)\{0,1\}"
"\S*" a more readable version of "any non-space characters" than "[^[:space:]]*"
I must have been using the sed that came installed with my Mac at the time I wrote this answer (brew install gnu-sed FTW).
There are better URL regular expressions out there (those that take into account schemes other than HTTP(S), for instance), but this will work for you, given the examples you give. Why complicate things?

The accepted answer provides the approach that I used to remove URLs, etc. from my files. However it left "blank" lines. Here is a solution.
sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' input_file
perl -i -pe 's/^'`echo "\012"`'${2,}//g' input_file
The GNU sed flags, expressions used are:
-i Edit in-place
-e [-e script] --expression=script : basically, add the commands in script
(expression) to the set of commands to be run while processing the input
^ Match start of line
$ Match end of line
? Match zero or one of preceding regular expression (written \? in GNU basic regular expressions)
{2,} Match 2 or more of preceding regular expression
\S* Zero or more non-space characters; alternative to: [^[:space:]]*
However,
sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g'
leaves nonprinting character(s), presumably \n (newlines). Standard sed-based approaches to remove "blank" lines, tabs and spaces, e.g.
sed -i 's/^[ \t]*//; s/[ \t]*$//'
do not work here, because they only trim whitespace within each line: sed reads input one line at a time, so unless you use a "branch label" to join lines you cannot substitute the newlines themselves.
The solution is to use the following perl expression:
perl -i -pe 's/^'`echo "\012"`'${2,}//g'
which uses a shell substitution,
'`echo "\012"`'
to replace an octal value
\012
(i.e., a newline, \n), that occurs 2 or more times,
{2,}
(otherwise we would unwrap all lines), with something else; here:
//
i.e., nothing.
[The second reference below provides a wonderful table of these values!]
The perl flags used are:
-p Places a printing loop around your command,
so that it acts on each line of standard input
-i Edit in-place
-e Allows you to provide the program as an argument,
rather than in a file
References:
perl flags: Perl flags -pe, -pi, -p, -w, -d, -i, -t?
ASCII control codes: https://www.cyberciti.biz/faq/unix-linux-sed-ascii-control-codes-nonprintable/
remove URLs: sed to remove URLs from a file
branch labels: How can I replace a newline (\n) using sed?
GNU sed manual: https://www.gnu.org/software/sed/manual/sed.html
quick regex guide: https://www.gnu.org/software/sed/manual/html_node/Regular-Expressions.html
Example:
$ cat url_test_input.txt
Some text ...
https://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file
https://www.google.ca/search?dcr=0&ei=QCsyWtbYF43YjwPpzKyQAQ&q=python+remove++citations&oq=python+remove++citations&gs_l=psy-ab.3...1806.1806.0.2004.1.1.0.0.0.0.61.61.1.1.0....0...1c.1.64.psy-ab..0.0.0....0.-cxpNc6youY
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
http://datasynce.org/2017/05/sentiment-analysis-on-python-through-textblob/
https://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
http://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
ftp://ftp.ncbi.nlm.nih.gov/
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/alignment_indices/20100804.alignment.index
Some more text.
$ sed -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' url_test_input.txt > a
$ cat a
Some text ...
Some more text.
$ perl -i -pe 's/^'`echo "\012"`'${2,}//g' a
Some text ...
Some more text.
$
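A sed-only alternative is to delete the lines that end up empty rather than trying to substitute newlines. This is a sketch under the assumption that every blank line may be dropped, including lines that were already blank in the input; the function name strip_urls is hypothetical, and it uses -E and POSIX character classes instead of the GNU-only \S.

```shell
# Hypothetical helper: removes http(s)/ftp URLs and www.* tokens from stdin,
# then deletes every line that is empty or contains only whitespace.
strip_urls() {
  sed -E \
    -e 's#(https?|ftp)://[^[:space:]]*##g' \
    -e 's#www\.[^[:space:]]*##g' \
    -e '/^[[:space:]]*$/d'
}

printf '%s\n' 'Some text ...' 'https://example.com/a?b=c' 'www.example.org/x' \
  'ftp://host.example/' 'Some more text.' | strip_urls
```

The d command operates on whole lines, so no branch label or second tool is needed for this particular cleanup.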
