How to extract a specific character inside parentheses using the sed command?

I want to extract an atomic symbol inside parentheses using sed.
The data I have is in the form C(X12), and I only want the X symbol.
For example, this test command:
echo "C(Br12)" | sed 's/[0-9][0-9])$//g'
gives me C(Br.

You can use
sed -n 's/.*(\(.*\)[0-9]\{2\})$/\1/p'
See the online demo:
sed -n 's/.*(\(.*\)[0-9]\{2\})$/\1/p' <<< "c(Br12)"
# => Br
Details
-n - suppresses the default line output
.*(\(.*\)[0-9]\{2\})$ - a regex that matches
.* - any text
( - a ( char
\(.*\) - Capturing group 1: any text up to the last....
[0-9]\{2\} - two digits
)$ - a ) at the end of string
\1 - replaces with Group 1 value
p - prints the result of the substitution.
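The pattern above only handles exactly two trailing digits. If the digit count varies, a more tolerant variant could look like the sketch below. It assumes the symbol itself never contains digits, and the function name extract_symbol is made up for illustration:

```shell
# Hypothetical helper: print the non-digit text that follows the last "("
# and precedes an optional run of digits plus a closing ")" at end of line.
extract_symbol() {
  printf '%s\n' "$1" | sed -n 's/.*(\([^0-9)]*\)[0-9]*)$/\1/p'
}

extract_symbol 'C(Br12)'   # prints Br
extract_symbol 'C(H1)'     # prints H
```

Using [^0-9)]* for the group instead of .* means the capture stops at the first digit, so no backtracking is needed to leave room for the digits.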

For example:
echo "C(Br12)" | sed 's/C(\(.\).*/\1/'
C( - matches the literal text C(
. - matches any single character
\(.\) - matches any one character and "remembers" it in backreference \1
.* - matches everything after it, so that it gets discarded
\1 - replaces the whole match with what was remembered: the first character.
Research sed, regex and backreferences for more information.

Try using the following command
echo "C(BR12)" | cut -d "(" -f2 | cut -d ")" -f1 | sed 's/[0-9]*//g'
The cut tool splits the input and gets you the string between the parentheses. That string is then passed to sed, which removes the digits from it.
Not a pure sed solution, but it will get you the output.
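For comparison, the same three steps can probably be folded into a single sed call. This is only a sketch, assuming each line holds one pair of parentheses; strip_to_symbol is a made-up name:

```shell
# Hypothetical helper mirroring the cut | cut | sed pipeline in one sed call:
# 1) drop everything up to "(", 2) drop ")" and what follows, 3) drop digits.
strip_to_symbol() {
  printf '%s\n' "$1" | sed 's/.*(//; s/).*//; s/[0-9]//g'
}

strip_to_symbol 'C(BR12)'   # prints BR
```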

Related

Sed expression to convert only certain lines starting with some example phrase

There's an example file with regular text. In some places in the document there is a mix of the following lines:
| ![](/img/2016/12/020.jakis-tam-text1.png#medium) | ![](/img/2016/12/021.jakis-tam-text2.png#medium) | ![](/img/2016/12/022.jakis-tam-text3.png#medium) |
| ![](/img/2016/12/020.jakis-tam-text1.png#medium) | ![](/img/2016/12/021.jakis-tam-text2.png#medium) |
There's the following sed expression to convert the lines to the required form:
sed 's#\([^[]*.\)\([^\.]*.\([^\.]*\)[^)]*.\)#\1\3\2#g'
How to apply this sed expression to only those lines that start with | ![]( ?
Your pattern is not quite right: . when unescaped matches any char. Also, note you do not need to escape . chars inside bracket expressions.
I suggest the following command:
sed '/^|[[:space:]]!\[](/s#\(|[[:space:]]!\[\)\(]([^().]*\.\([^|.]*\)\.[^()]*)\)#\1\3\2#g'
Here, the POSIX BRE pattern matches
/^|[[:space:]]!\[](/ - any line starting with |, whitespace, and then ![]( text
s#\(|[[:space:]]!\[\)\(]([^().]*\.\([^|.]*\)\.[^()]*)\)#\1\3\2#g -
\(|[[:space:]]!\[\) - Group 1 (\1): |, a whitespace, ![ text
\(]([^().]*\. - Group 2 start: ](, then any zero or more chars other than (, ) and ., then a . (note it is escaped)
\([^|.]*\) - Group 3 (\3): any zero or more chars other than | and .
\.[^()]*)\) - (still Group 2): . char, then zero or more chars other than ( and ) and then a ) char.
The replacement is the concatenation of Group 1, 3 and 2.
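To see what the address and the substitution do together, here is a small sketch that runs the suggested command on one table row and on one line that merely contains the same markup. The wrapper name add_alt_text is hypothetical:

```shell
# Hypothetical wrapper around the command suggested above; reads stdin.
add_alt_text() {
  sed '/^|[[:space:]]!\[](/s#\(|[[:space:]]!\[\)\(]([^().]*\.\([^|.]*\)\.[^()]*)\)#\1\3\2#g'
}

# A table row: the address matches, so the substitution runs.
printf '%s\n' '| ![](/img/2016/12/020.jakis-tam-text1.png#medium) |' | add_alt_text
# prints: | ![jakis-tam-text1](/img/2016/12/020.jakis-tam-text1.png#medium) |

# Same markup, but the line does not start with "| ![](", so it is left alone.
printf '%s\n' 'see | ![](/img/2016/12/020.jakis-tam-text1.png#medium) |' | add_alt_text
```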
Prefix your s command with the address /^| !\[](/ so that it only runs on lines starting with | ![]( .
An unescaped [ opens a bracket expression in a regex, so it must be escaped to match literally.

Extract substrings between strings

I have a file with text as follows:
###interest1 moreinterest1### sometext ###interest2###
not-interesting-line
sometext ###interest3###
sometext ###interest4### sometext othertext ###interest5### sometext ###interest6###
I want to extract all strings between ### .
My desired output would be something like this:
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
I have tried the following:
grep '###' file.txt | sed -e 's/.*###\(.*\)###.*/\1/g'
This almost works but only seems to grab the first instance per line, so the first line in my output only grabs
interest1 moreinterest1
rather than
interest1 moreinterest1
interest2
Here is a single awk command to achieve this that makes ### field separator and prints each even numbered field:
awk -F '###' '{for (i=2; i<NF; i+=2) print $i}' file
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
Here is an alternative grep + sed solution:
grep -oE '###[^#]*###' file | sed -E 's/^###|###$//g'
This assumes there are no # characters in between ### markers.
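If single # characters may appear between the markers, one way around that assumption is to search for the literal string ### instead of using a regex. A rough sketch in plain awk, with extract_between as a made-up name:

```shell
# Hypothetical helper: print every substring enclosed by a pair of "###"
# markers on the same line, using literal string search (index) only.
extract_between() {
  awk '{
    line = $0
    while ((start = index(line, "###")) > 0) {
      rest = substr(line, start + 3)        # text after the opening marker
      stop = index(rest, "###")             # position of the closing marker
      if (stop == 0) break                  # unpaired marker: give up on line
      print substr(rest, 1, stop - 1)
      line = substr(rest, stop + 3)         # continue after the closing marker
    }
  }' "$@"
}

printf '%s\n' '###a#b### x ###c###' | extract_between
# prints a#b then c
```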
With GNU awk for multi-char RS:
$ awk -v RS='###' '!(NR%2)' file
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
You can use pcregrep:
pcregrep -o1 '###(.*?)###' file
The regex - ###(.*?)### - matches ###, then captures into Group 1 any zero or more chars other than line break chars, as few as possible, and then matches ###.
The -o1 option outputs the Group 1 value only.
sed 't x
s/###/\
/;D; :x
s//\
/;t y
D;:y
P;D' file
This replaces the first "###" with a newline and uses D to delete everything up to it, then branches to P (print up to the newline) only if a second replacement of "###" succeeds.
This might work for you (GNU sed):
sed -n 's/###/\n/g;/[^\n]*\n/{s///;P;D}' file
Replace all occurrences of ###'s by newlines.
If a line contains a newline, remove any characters before and including the first newline, print the details up to and including the following newline, delete those details and repeat.

Using SED to Remove Anything but a Pattern

I have a bunch of .pdf file names. For example:
901201_HKW_RNT_HW21_136_137_DE_442_Freigabe_DE_CLX.pdf
and I am trying to remove everything but the pattern XXX_XXX, where X is always a digit.
The result should be:
136_137
So far I did the opposite: I managed to match the pattern using:
set NoSpacesString to do shell script "echo " & quoted form of insideName & " | sed 's/([0-9][0-9][0-9]_[0-9][0-9][0-9])//'"
My goal is to set NoSpacesString to 136_137.
Little bit of help please.
Thank you !
P.S. The rest of the code is in AppleScript if this matters
Fixing sed command...
You can use
sed -n 's/.*\([0-9]\{3\}_[0-9]\{3\}\).*/\1/p'
Details
-n - suppresses the default line output
s/.*\([0-9]\{3\}_[0-9]\{3\}\).*/\1/ - finds the .*\([0-9]\{3\}_[0-9]\{3\}\).* pattern that matches
.* - any zero or more chars
\([0-9]\{3\}_[0-9]\{3\}\) - Group 1 (the \1 in the RHS refers to this group value): three digits, _, three digits
.* - any zero or more chars
p - prints the result of the substitution only.
The regex above is a POSIX BRE compliant pattern. The same can be written in POSIX ERE:
sed -En 's/.*([0-9]{3}_[0-9]{3}).*/\1/p'
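One thing to watch: both patterns also match inside longer digit runs (for a1234_5678b they would print 234_567). If that matters, a stricter variant could require a non-digit, or the line boundary, on either side. A sketch only; extract_pair is a hypothetical name:

```shell
# Hypothetical helper: print the last NNN_NNN group that is not part of a
# longer run of digits on either side; prints nothing if there is none.
extract_pair() {
  printf '%s\n' "$1" |
    sed -En 's/(^|.*[^0-9])([0-9]{3}_[0-9]{3})([^0-9].*|$)/\2/p'
}

extract_pair '901201_HKW_RNT_HW21_136_137_DE_442_Freigabe_DE_CLX.pdf'   # prints 136_137
extract_pair 'a1234_5678b'                                              # prints nothing
```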
Final AppleScript code
set noSpacesString to do shell script "sed -En 's/.*([0-9]{3}_[0-9]{3}).*/\\1/p' <<<" & insideName's quoted form
This might work for you (GNU sed):
sed -E '/\n/{P;D};s/[0-9]{3}_[0-9]{3}/\n&\n/;D' file
This solution will print all occurrences of the pattern on a separate line.
The initial command depends on what follows.
The second command wraps the first occurrence of the desired pattern in newlines, one on either side.
The D command removes up to the first newline, but as the pattern space is not empty, restarts the sed cycle (without appending the next line).
Now the initial command comes into play. The front of the line is printed and then deleted along with its appended newline.
Again, the sed cycle is restarted as if the line had never been presented but minus any characters up to and including the first desired pattern.
This flip-flop pattern of control is repeated until nothing is left and then repeated on subsequent lines until the end of the file.
Here is a copy of the debug log for a suitable one line input containing two representations of the desired pattern:
SED PROGRAM:
/\n/ {
P
D
}
s/[0-9]{3}_[0-9]{3}/
&
/
D
INPUT: 'file' line 1
PATTERN: aaa123_456bbb123_456ccc
COMMAND: /\n/ {
COMMAND: }
COMMAND: s/[0-9]{3}_[0-9]{3}/
&
/
MATCHED REGEX REGISTERS
regex[0] = 3-10 '123_456'
PATTERN: aaa\n123_456\nbbb123_456ccc
MATCHED REGEX REGISTERS
regex[0] = 0-3 'aaa'
PATTERN: \n123_456\nbbb123_456ccc
COMMAND: D
PATTERN: 123_456\nbbb123_456ccc
COMMAND: /\n/ {
COMMAND: P
123_456
COMMAND: D
PATTERN: bbb123_456ccc
COMMAND: /\n/ {
COMMAND: }
COMMAND: s/[0-9]{3}_[0-9]{3}/
&
/
MATCHED REGEX REGISTERS
regex[0] = 3-10 '123_456'
PATTERN: bbb\n123_456\nccc
MATCHED REGEX REGISTERS
regex[0] = 0-3 'bbb'
PATTERN: \n123_456\nccc
COMMAND: D
PATTERN: 123_456\nccc
COMMAND: /\n/ {
COMMAND: P
123_456
COMMAND: D
PATTERN: ccc
COMMAND: /\n/ {
COMMAND: }
COMMAND: s/[0-9]{3}_[0-9]{3}/
&
/
PATTERN: ccc
MATCHED REGEX REGISTERS
regex[0] = 0-3 'ccc'
PATTERN:
COMMAND: D

sed to match first pattern among multiple matches

So for a given text like
a[test] asdfasdf [sdfsdf]b
I want the first match of text which is inside the first square brackets (regex = [.*]), so in this case [test].
I tried the following command it didn't work:
echo "a[test] asdfasdf [sdfsdf]b" | sed -n -e 's/.*\(\[.*\]\).*/\1/p'
This is returning [sdfsdf]
How do I get [test] instead ?
.* will select the longest match. Use [^[]* and [^]]* instead.
sed -n -e 's/[^[]*\(\[[^]]*\]\).*/\1/p'
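A quick side-by-side of the two patterns on the sample text, as a sketch (the variable names are only for illustration):

```shell
line='a[test] asdfasdf [sdfsdf]b'

# Greedy .* grabs as much as it can, so the last bracket pair wins.
last_bracket=$(printf '%s\n' "$line" | sed -n 's/.*\(\[.*\]\).*/\1/p')

# [^[]* cannot cross a "[", so the match is pinned to the first pair.
first_bracket=$(printf '%s\n' "$line" | sed -n 's/[^[]*\(\[[^]]*\]\).*/\1/p')

printf '%s\n' "$last_bracket" "$first_bracket"
# prints [sdfsdf] then [test]
```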

Divide each line into equal parts

I would be grateful if anyone could suggest a command (a sed or awk one-liner) to divide each line of a file into an equal number of parts. For example, divide each line into 4 parts.
Input:
ATGCATHLMNPHLNTPLML
Output:
ATGCA THLMN PHLNT PLML
This should work using GNU sed:
sed -r 's/(.{4})/\1 /g'
-r is needed to use extended regular expressions
.{4} matches a run of four characters, so the line is split into chunks of four characters rather than into four parts
\1 refers to the captured group, which is surrounded by parentheses ( ); the replacement adds a space after it
g makes sure that the replacement is done as many times as possible on each line
A test; this is the input and output in my terminal:
$ echo "ATGCATHLMNPHLNTPLML" | sed -r 's/(.{4})/\1 /g'
ATGC ATHL MNPH LNTP LML
I suspect awk is not the best tool for this, but:
gawk --posix '{ l = sprintf( "%d", 1 + (length()-1)/4);
gsub( ".{"l"}", "& " ) } 1' input-file
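If you have a POSIX-compliant awk you can omit --posix. Older versions of GNU awk need --posix (or --re-interval) to enable interval expressions such as {5}; since gawk 4.0 they are enabled by default. Because gawk seems to be the most commonly used implementation, I've given the solution in terms of gawk.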
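If interval expressions are a concern, the same idea can be written with substr alone. A minimal sketch, assuming the chunk width should be the line length divided by the number of parts, rounded up; split_parts is a made-up name:

```shell
# Hypothetical helper: split each input line into chunks of
# ceil(length / parts) characters, joined by single spaces.
split_parts() {
  awk -v parts="$1" '{
    len = length($0)
    width = int((len + parts - 1) / parts)   # ceiling of len / parts
    out = ""
    for (i = 1; i <= len; i += width)
      out = out (i > 1 ? " " : "") substr($0, i, width)
    print out
  }'
}

printf '%s\n' 'ATGCATHLMNPHLNTPLML' | split_parts 4
# prints ATGCA THLMN PHLNT PLML
```

Note that rounding the width up can yield fewer chunks than requested for some lengths (a 9-character line with 4 parts gives three chunks of 3).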
This might work for you (GNU sed):
sed 'h;s/./X/g;s/^\(.*\)\1\1\1/\1 \1 \1 \1/;G;s/\n/&&/;:a;/^\n/bb;/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta;s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta;:b;s/\n//g' file
Explanation:
h copy the pattern space (PS) to the hold space (HS)
s/./X/g replace every character in the PS with the same non-space character (in this case X)
s/^\(.*\)\1\1\1/\1 \1 \1 \1/ split the line into 4 parts (space separated)
G append a newline followed by the contents of the HS to the PS
s/\n/&&/ double the newline (to be later used as markers)
:a define the loop label a
/^\n/bb if we reach a newline we are done and branch to the label b
/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta; if the first character is a space add a space to the real line at this point and repeat
s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta any other character just bump along and repeat
:b;s/\n//g all done just remove the markers and print out the result
This works for any line length; however, if the length is not exactly divisible by 4, the last portion will contain the remainder as well.
perl
perl might be a better choice here:
export cols=4
perl -ne 'chomp; $fw = 1 + int length()/$ENV{cols}; while(/(.{1,$fw})/gm) { print $1 . " " } print "\n"'
This re-calculates field-width for every line.
coreutils
A GNU coreutils alternative, field-width is chosen based on the first line of infile:
cols=4
len=$(( $(head -n1 infile | wc -c) - 1 ))
fw=$(echo "scale=0; 1 + $len / $cols" | bc)
cut_arg=$(paste -d- <(seq 1 $fw $len) <(seq $fw $fw $len) | head -c-1 | tr '\n' ',')
Value of cut_arg is in the above case:
1-5,6-10,11-15,16-
Now cut the line into appropriate chunks:
cut --output-delimiter=' ' -c $cut_arg infile
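Another coreutils route that avoids building a cut range list is fold plus paste. This is a sketch for a single non-empty line of single-byte characters; split_line is a hypothetical name, and the width is the length divided by the number of parts, rounded up:

```shell
# Hypothetical helper: $1 is the line, $2 the number of parts.
# fold breaks the line every $width characters; paste -s joins the
# resulting lines back together with spaces.
split_line() {
  line=$1 parts=$2
  width=$(( (${#line} + parts - 1) / parts ))
  printf '%s\n' "$line" | fold -w "$width" | paste -s -d ' ' -
}

split_line 'ATGCATHLMNPHLNTPLML' 4   # prints ATGCA THLMN PHLNT PLML
```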