sed to match first pattern among multiple matches - sed

So for a given text like
a[test] asdfasdf [sdfsdf]b
I want the first match of text which is inside the first square brackets (regex = [.*]), so in this case [test].
I tried the following command, but it didn't work:
echo "a[test] asdfasdf [sdfsdf]b" | sed -n -e 's/.*\(\[.*\]\).*/\1/p'
This returns [sdfsdf].
How do I get [test] instead?

.* is greedy: it selects the longest possible match, so the leading .* runs up to the last [ on the line. Use [^[]* and [^]]* instead.
sed -n -e 's/[^[]*\(\[[^]]*\]\).*/\1/p'
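To see the difference, here is a minimal sketch that runs both versions on the sample string from the question. The function names last_bracket and first_bracket are only illustrative.

```shell
# Greedy: the leading .* swallows as much as it can, so the group
# ends up holding the LAST bracketed text on the line.
last_bracket() {
    sed -n 's/.*\(\[.*\]\).*/\1/p'
}

# Negated classes: [^[]* stops at the first "[", and [^]]* stops at
# the first "]", so the group holds the FIRST bracketed text.
first_bracket() {
    sed -n 's/[^[]*\(\[[^]]*\]\).*/\1/p'
}

echo "a[test] asdfasdf [sdfsdf]b" | last_bracket    # prints [sdfsdf]
echo "a[test] asdfasdf [sdfsdf]b" | first_bracket   # prints [test]
```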

Related

Using SED to Remove Anything but a Pattern

I have a bunch of .pdf file names. For example:
901201_HKW_RNT_HW21_136_137_DE_442_Freigabe_DE_CLX.pdf
and I am trying to remove everything but the pattern XXX_XXX, where X is always a digit.
The result should be:
136_137
So far I have done the opposite: I managed to match (and remove) the pattern by using:
set NoSpacesString to do shell script "echo " & quoted form of insideName & " | sed 's/([0-9][0-9][0-9]_[0-9][0-9][0-9])//'"
My goal is to set NoSpacesString to 136_137.
A little bit of help, please.
Thank you!
P.S. The rest of the code is in AppleScript if this matters
Fixing sed command...
You can use
sed -n 's/.*\([0-9]\{3\}_[0-9]\{3\}\).*/\1/p'
Details
-n - suppresses the default line output
s/.*\([0-9]\{3\}_[0-9]\{3\}\).*/\1/ - finds the .*\([0-9]\{3\}_[0-9]\{3\}\).* pattern that matches
.* - any zero or more chars
\([0-9]\{3\}_[0-9]\{3\}\) - Group 1 (the \1 in the RHS refers to this group value): three digits, _, three digits
.* - any zero or more chars
p - prints the result of the substitution only.
The regex above is a POSIX BRE compliant pattern. The same can be written in POSIX ERE:
sed -En 's/.*([0-9]{3}_[0-9]{3}).*/\1/p'
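As a quick check, both spellings can be run against the file name from the question. This is only a sketch: the variable names are illustrative, and -E is assumed to be available (it is in GNU and BSD sed).

```shell
name='901201_HKW_RNT_HW21_136_137_DE_442_Freigabe_DE_CLX.pdf'

# POSIX BRE: groups and interval expressions need backslashes.
bre=$(printf '%s\n' "$name" | sed -n 's/.*\([0-9]\{3\}_[0-9]\{3\}\).*/\1/p')

# POSIX ERE (-E): the same pattern without the backslashes.
ere=$(printf '%s\n' "$name" | sed -En 's/.*([0-9]{3}_[0-9]{3}).*/\1/p')

echo "$bre"   # prints 136_137
echo "$ere"   # prints 136_137
```

Because the leading .* is greedy, a name containing several XXX_XXX runs would yield the last one.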
Final AppleScript code
set noSpacesString to do shell script "sed -En 's/.*([0-9]{3}_[0-9]{3}).*/\\1/p' <<<" & insideName's quoted form
This might work for you (GNU sed):
sed -E '/\n/{P;D};s/[0-9]{3}_[0-9]{3}/\n&\n/;D' file
This solution will print all occurrences of the pattern, each on a separate line.
The first command depends on what follows, so look at the second command first.
The second command (s) wraps the first occurrence of the desired pattern in newlines.
The D command then deletes up to and including the first newline and, because the pattern space is not empty, restarts the sed cycle without appending the next line.
Now the first command comes into play: the pattern space contains a newline, so P prints the text up to that newline (the match) and D deletes it along with its trailing newline.
Again, the sed cycle is restarted as if a fresh line had been read, but minus all characters up to and including the first match.
This flip-flop pattern of control is repeated until nothing is left and then repeated on subsequent lines until the end of the file.
Here is a copy of the debug log for a suitable one line input containing two representations of the desired pattern:
SED PROGRAM:
/\n/ {
P
D
}
s/[0-9]{3}_[0-9]{3}/
&
/
D
INPUT: 'file' line 1
PATTERN: aaa123_456bbb123_456ccc
COMMAND: /\n/ {
COMMAND: }
COMMAND: s/[0-9]{3}_[0-9]{3}/
&
/
MATCHED REGEX REGISTERS
regex[0] = 3-10 '123_456'
PATTERN: aaa\n123_456\nbbb123_456ccc
MATCHED REGEX REGISTERS
regex[0] = 0-3 'aaa'
PATTERN: \n123_456\nbbb123_456ccc
COMMAND: D
PATTERN: 123_456\nbbb123_456ccc
COMMAND: /\n/ {
COMMAND: P
123_456
COMMAND: D
PATTERN: bbb123_456ccc
COMMAND: /\n/ {
COMMAND: }
COMMAND: s/[0-9]{3}_[0-9]{3}/
&
/
MATCHED REGEX REGISTERS
regex[0] = 3-10 '123_456'
PATTERN: bbb\n123_456\nccc
MATCHED REGEX REGISTERS
regex[0] = 0-3 'bbb'
PATTERN: \n123_456\nccc
COMMAND: D
PATTERN: 123_456\nccc
COMMAND: /\n/ {
COMMAND: P
123_456
COMMAND: D
PATTERN: ccc
COMMAND: /\n/ {
COMMAND: }
COMMAND: s/[0-9]{3}_[0-9]{3}/
&
/
PATTERN: ccc
MATCHED REGEX REGISTERS
regex[0] = 0-3 'ccc'
PATTERN:
COMMAND: D
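The trace above can be reproduced without a file by feeding the same line on standard input. A small sketch, assuming GNU sed (the \n in the replacement is a GNU extension); the function name extract_all is made up for this example.

```shell
# Print every XXX_XXX occurrence on its own line (GNU sed only).
extract_all() {
    sed -E '/\n/{P;D};s/[0-9]{3}_[0-9]{3}/\n&\n/;D'
}

printf '%s\n' 'aaa123_456bbb123_456ccc' | extract_all
# prints:
# 123_456
# 123_456
```

Lines without any match produce no output, because the final D empties the pattern space without printing it.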

How to extract a specific character inside a parentheses using sed command?

I want to extract the atomic symbol inside parentheses using sed.
The data I have is in the form C(X12), and I only want the X symbol.
For example, this test command:
echo "C(Br12)" | sed 's/[0-9][0-9])$//g'
gives me C(Br.
You can use
sed -n 's/.*(\(.*\)[0-9]\{2\})$/\1/p'
See the online demo:
sed -n 's/.*(\(.*\)[0-9]\{2\})$/\1/p' <<< "c(Br12)"
# => Br
Details
-n - suppresses the default line output
.*(\(.*\)[0-9]\{2\})$ - a regex that matches
.* - any text
( - a ( char
\(.*\) - Capturing group 1: any text up to the last....
[0-9]\{2\} - two digits
)$ - a ) at the end of string
\1 - replaces with Group 1 value
p - prints the result of the substitution.
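A short sketch of the command above wrapped in a function (the name atom_symbol is illustrative), including an input that does not match:

```shell
# Print the text between "(" and the two digits before the final ")".
atom_symbol() {
    sed -n 's/.*(\(.*\)[0-9]\{2\})$/\1/p'
}

echo "C(Br12)" | atom_symbol   # prints Br
echo "C(X12)"  | atom_symbol   # prints X
echo "C(Br1)"  | atom_symbol   # prints nothing: only one digit, so no match
```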
For example:
echo "C(Br12)" | sed 's/C(\(.\).*/\1/'
C( - matches the literal text C(
\(.\) - matches any one character and "remembers" it in backreference \1
.* - matches (and so discards) everything after it
\1 - replaces the match with what was remembered: the first character. Note that for C(Br12) this prints B, not Br.
Research sed, regex and backreferences for more information.
Try using the following command
echo "C(BR12)" | cut -d "(" -f2 | cut -d ")" -f1 | sed 's/[0-9]*//g'
The two cut commands split the input and give you the string between the parentheses. That string is then passed to sed, which removes the digits from it.
Not a fully sed solution but this will get you the output.
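The same result can probably be had from a single sed call. The sketch below is not taken from the answers above: it assumes the symbol consists only of ASCII letters and is followed by any number of digits, and the function name symbol_only is illustrative.

```shell
# Capture the letters right after "(" and drop the digits before ")".
symbol_only() {
    sed -n 's/.*(\([A-Za-z]*\)[0-9]*).*/\1/p'
}

echo "C(BR12)"   | symbol_only   # prints BR
echo "C(H2)"     | symbol_only   # prints H
echo "C(Uuo118)" | symbol_only   # prints Uuo
```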

sed replace positional match of unknown string divided by user-defined separator

I want to rename the (known) third folder within an (unknown) file path given as a string, i.e. the folder at the third level when the separator is /.
I need a one-liner explicitly for sed, because I later want to use it for tar --transform=EXPRESSION.
string="/db/foo/db/bar/db/folder"
echo "$string" | sed 's,db,databases,'
sed should replace "db" only on the third level
expected result
/db/foo/databases/bar/db/folder
You could use a capturing group to capture /db/foo/ and then match db. Then use the first capturing group in the replacement using \1:
string="/db/foo/db/bar/db/folder"
echo -e "$string" | sed 's,^\(/[^/]*/[^/]*/\)db,\1databases,'
About the pattern
^ Start of string
\( Start capture group
/[^/]*/[^/]*/ Match the first 2 parts using a negated character class
\) Close capture group
db Match literally
That will give you
/db/foo/databases/bar/db/folder
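One caveat with the pattern above: it also rewrites a third-level folder that merely starts with db (for example dbx becomes databasesx). A hedged variant, assuming sed -E is available, additionally requires a / or the end of the string after db. The function name rename_third_level is illustrative.

```shell
rename_third_level() {
    # \1 = the first two path components, \2 = the "/" (or nothing) after db
    sed -E 's,^(/[^/]*/[^/]*/)db(/|$),\1databases\2,'
}

echo "/db/foo/db/bar/db/folder" | rename_third_level   # prints /db/foo/databases/bar/db/folder
echo "/db/foo/dbx/bar"          | rename_third_level   # prints /db/foo/dbx/bar (unchanged)
echo "/db/foo/db"               | rename_third_level   # prints /db/foo/databases
```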
If awk is also an option for this task:
$ awk 'BEGIN{FS=OFS="/"} $4=="db"{$4="databases"} 1' <<<'/db/foo/db/bar/db/folder'
/db/foo/databases/bar/db/folder
FS = OFS = "/" assigns / to both the input and output field separators,
$4 == "db" { $4 = "databases" } if the fourth field is db, make it databases (the path starts with /, so the first field is empty and the third folder is $4),
1 prints the record.
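If the level should not be hard-coded, the field number can be passed in as a variable. A minimal sketch; the function name rename_level and its argument order are assumptions made for this example.

```shell
rename_level() {
    # $1 = level (1-based), $2 = old name, $3 = new name; the path is read from stdin
    awk -v n="$1" -v from="$2" -v to="$3" '
        BEGIN { FS = OFS = "/" }
        # The path starts with "/", so field 1 is empty and level n is field n+1.
        $(n + 1) == from { $(n + 1) = to }
        1
    '
}

echo "/db/foo/db/bar/db/folder" | rename_level 3 db databases   # prints /db/foo/databases/bar/db/folder
echo "/db/foo/db/bar/db/folder" | rename_level 5 db databases   # prints /db/foo/db/bar/databases/folder
```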
Here is a pure bash way to get this done by setting IFS=/ without calling any external utility:
string="/db/foo/db/bar/db/folder"
string=$(IFS=/; read -a arr <<< "$string"; arr[3]='databases'; echo "${arr[*]}")
echo "$string"
/db/foo/databases/bar/db/folder

Conditional substitution of patterns in bash strings depending on the beginning of a string

I am new to bash, so excuse me if I do not use the right terms.
I need to substitute certain patterns of six characters in a set of files. The order in which the patterns are substituted depends on the beginning of each string of text.
This is an example of input:
chr1:123-123 5GGGTTAGGGTTAGGGTTAGGGTTAGGGTTA3
chr1:456-456 5TTAGGGTTAGGGTTAGGGTTAGGGTTAGGG3
chr1:789-789 5GGGCTAGGGTTAGGGTTAGGGTTA3
chr1:123-123 etc. is the name of the string; it is separated by a tab from the string I need to work with. The string I need to work with is delimited by the characters 5 and 3, but I can change them.
I want every pattern containing T, A, G in any one of these orders to be substituted with X: TTAGGG, TAGGGT, AGGGTT, GGGTTA, GGTTAG, GTTAGG.
Similarly, patterns containing CTAGGG, like row 3, in orders similar to the previous ones will be substituted with a different character.
The same is repeated, with some specific differences, for each of the 6 characters composing the pattern.
I started writing something like this:
#!/bin/bash
NORMAL=`echo "\033[m"`
RED=`echo "\033[31m"` #red
#read filename for the input file and create a copy and a folder for the output
read -p "Insert name for INPUT file: " INPUT
echo "Creating OUTPUT file " "${RED}"$INPUT"_sub.txt${NORMAL}"
mkdir -p ./"$INPUT"_OUTPUT
cp $INPUT.txt ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
echo
#start the first set of instructions
perfrep
#starting a second set of instructions to substitute pattern with one difference from TTAGGG
onemism
Instructions are
perfrep() {
sed -i -e 's/TTAGGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/TAGGGT/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/AGGGTT/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GGGTTA/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GGTTAG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GTTAGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
}
# starting a second set of instructions to substitute pattern with one difference from TTAGGG
onemism(){
sed -i -e 's/[GCA]TAGGG/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/G[GCA]TAGG/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GG[GCA]TAG/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GGG[GCA]TA/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/AGGG[GCA]T/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/TAGGG[GCA]/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
}
I will need to repeat also with T[GCA]AGGG, TT[TCG]GGG, TTA[ACT]GG, TTAG[ACT]G and TTAGG[ACT].
Using this procedure, I get these results for the inputs shown:
5GGGXXXXTTA3
5XXXXX3
5GGGLXXTTA3
From my point of view, for my job, the first and second strings are both made of X repeated five times; only the order of the characters is slightly different. On the other hand, the third one could be masked like this:
5LXXX3
How do I tell the script that if the string starts with 5GGGTTA instead of 5TTAGGG, it must start substituting with
sed -i -e 's/GGGTTA/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
instead of
sed -i -e 's/TTAGGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
?
I will need to repeat with all cases; for instance, if the string starts with GTTAGG I will need to start with
sed -i -e 's/GTTAGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
and so on, and add a couple of variations of my pattern.
I need to repeat the substitution with TTAGGG and the variations for all the rows of my input file.
Sorry for the very long question. Thank you all.
Adding information asked by Varun.
Patterns of 6 characters would be TTAGGG , [GCA]TAGGG , T[GCA]AGGG , TT[TCG]GGG , TTA[ACT]GG , TTAG[ACT]G , TTAGG[ACT].
Each one must be checked for a different frame, for instance for TTAGGG we have 6 frames TTAGGG , GTTAGG , GGTTAG, GGGTTA , AGGGTT , TAGGGT.
The same frames must be applied to the pattern containing a variable position.
I will have a total of 42 patterns to check, divided in 7 groups: one containing TTAGGG and derivative frames, 6 with the patterns with a variable position and their derivatives.
TTAGGG and derivatives are the most important and need to be checked first.
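The six frames of a pattern are just its rotations, so they do not have to be typed by hand. A small POSIX shell sketch; the function name rotations is made up for this example, and it only works on plain strings (a bracket expression such as [GCA] would be rotated character by character).

```shell
# Print every rotation of the string given as $1, one per line,
# moving the first character to the end each time.
rotations() {
    s=$1
    n=${#s}
    i=0
    while [ "$i" -lt "$n" ]; do
        printf '%s\n' "$s"
        rest=${s#?}                 # everything after the first character
        s="$rest${s%"$rest"}"       # rest + first character
        i=$((i + 1))
    done
}

rotations TTAGGG
# prints:
# TTAGGG
# TAGGGT
# AGGGTT
# GGGTTA
# GGTTAG
# GTTAGG
```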
#! /usr/bin/awk -f
# generate a "frame" by moving the first char to the end
function rotate(base){ return substr(base,2) substr(base,1,1) }
# Unfortunately awk arrays do not store regexps
# so I am generating the list of derivative strings to match
function generate_derivative(frame,arr, i,j,k,head,read,tail) {
arr[i]=frame;
for(j=1; j<=length(frame); j++) {
head=substr(frame,1,j-1);
read=substr(frame,j,1);
tail=substr(frame,j+1);
for( k=1; k<=3; k++) {
# use a global index to simplify
arr[++Z]= head substr(snp[read],k,1) tail
}
}
}
BEGIN{
FS = OFS = "\t";  # awk variable names are case-sensitive; lowercase fs is just an unused variable
# alternatives to a base
snp["A"]="TCG"; snp["T"]="ACG"; snp["G"]="ATC"; snp["C"]="ATG";
# the primary target
frame="TTAGGG";
Z=1; # warning GLOBAL
X[Z] = frame;
# primary derivatives
generate_derivative(frame, X);
xn = Z;
# secondary shifted targets and their derivatives
for(i=1; i<length(frame); i++){
frame = rotate(frame);
L[++Z] = frame;
generate_derivative(frame, L);
}
}
/^chr[0-9:-]*\t5[ACTG]*3$/ {
# because we care about the order of the primary matches
for (i=1; i<=xn; i++) {gsub(X[i],"X",$2)}
# since we don't care about the order of the secondary matches
for (hit in L) {gsub(L[hit],"L",$2)}
print
}
END{
# print the matches in the order they are generated
#for (i=1; i<=xn; i++) {print X[i]};
#print ""
#for (i=1+xn; i<=Z; i++) {print L[i]};
}
If (and only if) you can generate a static matching order you can live with, then something like the above awk script could work. But you say that the primary patterns should take precedence and also that a secondary rule would be better applied first in some cases; a fixed order cannot do both.
If you need more flexible matching, I would suggest looking at "recursive descent parsing with backtracking" or "parsing expression grammars".
But then you are not in a bash shell anymore.
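For the narrower question of choosing the frame from the start of each string, one possible sketch is shown below. It is a different technique from the sed substitutions in the question: it picks the frame closest to the first six characters and then reads the sequence in fixed six-character windows. It assumes tab-separated input, the 5 and 3 delimiters, and repeats without insertions or deletions. Any window with exactly one mismatch becomes L, regardless of the mismatch position. The function name mask_by_frame is hypothetical.

```shell
mask_by_frame() {
    awk '
    # Number of positions at which a and b differ (99 if the lengths differ).
    function dist(a, b,    i, d) {
        if (length(a) != length(b)) return 99
        d = 0
        for (i = 1; i <= length(a); i++)
            if (substr(a, i, 1) != substr(b, i, 1)) d++
        return d
    }
    BEGIN {
        FS = OFS = "\t"
        n = split("TTAGGG TAGGGT AGGGTT GGGTTA GGTTAG GTTAGG", frames, " ")
    }
    {
        seq = substr($2, 2, length($2) - 2)      # drop the 5 and 3 delimiters
        # Pick the frame closest to the first six characters (at most one mismatch).
        frame = ""; best = 2
        for (k = 1; k <= n; k++) {
            d = dist(substr(seq, 1, 6), frames[k])
            if (d < best) { best = d; frame = frames[k] }
        }
        if (frame == "") { print; next }         # no frame fits: leave the line alone
        out = ""
        for (p = 1; p <= length(seq); p += 6) {
            chunk = substr(seq, p, 6)
            d = dist(chunk, frame)
            if (d == 0)      out = out "X"       # perfect repeat
            else if (d == 1) out = out "L"       # one mismatch
            else             out = out chunk     # anything else is kept as-is
        }
        $2 = "5" out "3"
        print
    }'
}

printf 'chr1:123-123\t5GGGTTAGGGTTAGGGTTAGGGTTAGGGTTA3\n' | mask_by_frame   # second field becomes 5XXXXX3
printf 'chr1:789-789\t5GGGCTAGGGTTAGGGTTAGGGTTA3\n'       | mask_by_frame   # second field becomes 5LXXX3
```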

Search for a particular multiline pattern using awk and sed

I want to read from the file /etc/lvm/lvm.conf and check for the below pattern that could span across multiple lines.
tags {
hosttags = 1
}
There could be any amount of whitespace between tags and {, between { and hosttags, and so forth. Also, { could follow tags on the next line instead of being on the same line.
I'm planning to use awk and sed to do this.
While reading the file lvm.conf, it should skip empty lines and comments.
I'm doing that using:
data=$(awk < cat `cat /etc/lvm/lvm.conf`
/^#/ { next }
/^[[:space:]]*#/ { next }
/^[[:space:]]*$/ { next }
.
.
How can I use sed to find the pattern I described above?
Are you looking for something like this
sed -n '/{/,/}/p' input
i.e. print lines between tokens (inclusive)?
To delete lines containing # and empty lines or lines containing only whitespace, use
sed -n '/{/,/}/p' input | sed '/#/d' | sed '/^[[:space:]]*$/d'
where [[:space:]] matches a space or a tab.
update
If empty lines are just empty lines (no ws), the above can be shortened to
sed -e '/#/d' -e '/^$/d' input
update2
To check if the pattern tags {... is present in file, use
$ tr -d '\n' < input | grep -o 'tags\s*{[^}]*}'
tags { hosttags = 1# this is a comment}
The tr part above removes all newlines, i.e. turns the whole file into one single line (this works fine if the file isn't too large); grep then searches for the tags pattern and outputs all matches.
The return code from grep will be 0 if the pattern was found, 1 if not.
The return code is stored in the variable $?. Alternatively, pipe the above to wc -l to get the number of matches found.
update3
Regex for searching for tags { hosttags=1 } with any amount of whitespace anywhere (a plain 1 rather than 1*, which would also accept a missing 1):
'tags\s*{\s*hosttags\s*=\s*1[^}]*}'
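Putting the pieces together, a yes/no check could look like the sketch below. It is an assumption-laden variant, not the answer's exact command: comments are stripped first (so a # inside a quoted value would be cut too), newlines become spaces, and POSIX classes replace the GNU-only \s. The function name has_hosttags is made up.

```shell
# Succeeds (exit status 0) if the file named by $1 contains
# tags { hosttags = 1 } with arbitrary whitespace and line breaks.
has_hosttags() {
    sed 's/#.*//' "$1" |
        tr '\n' ' ' |
        grep -Eq '(^|[^[:alnum:]_])tags[[:space:]]*\{[[:space:]]*hosttags[[:space:]]*=[[:space:]]*1[[:space:]]*\}'
}

conf=$(mktemp)
printf 'tags\n{\n  # a comment }\n\n  hosttags   =  1   # enabled\n}\n' > "$conf"
if has_hosttags "$conf"; then echo "found"; else echo "not found"; fi   # prints found
rm -f "$conf"
```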
try this line:
awk '/^\s*#|^\s*$/{next}1' /etc/lvm/lvm.conf
One could try preprocessing the file first: remove comments and empty lines, and insert an empty line after each closing curly brace, so that the second awk can process the blocks easily in paragraph mode (RS=).
awk 'NF && $1!~/^#/{print; if(/}/) print x}' file | awk '/pattern/' RS=
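In the command above, /pattern/ is a placeholder. A hedged sketch with a concrete pattern filled in follows; the function name find_tags_block is illustrative, and the bracket expressions [{] and [}] are used only to keep the regex portable across awk implementations. Trailing comments on the hosttags line are not handled.

```shell
# Print the tags { hosttags = 1 } block of the file named by $1, if present.
find_tags_block() {
    # Stage 1: drop blank lines and comment lines, add a blank line after each "}".
    awk 'NF && $1 !~ /^#/ { print; if (/}/) print "" }' "$1" |
    # Stage 2: paragraph mode (RS empty), so each block is one record.
    awk -v RS= '/tags[ \t\n]*[{][ \t\n]*hosttags[ \t\n]*=[ \t\n]*1[ \t\n]*[}]/'
}

conf=$(mktemp)
printf '# comment\ndevices {\n    dir = "/dev"\n}\n\ntags\n{\n    hosttags = 1\n}\n' > "$conf"
find_tags_block "$conf"
# prints:
# tags
# {
#     hosttags = 1
# }
rm -f "$conf"
```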