Sed command to add a character when pattern is matched? - sed

I have the contents of file like:
data {
x = "name",
y = "job"
z = "role"
}
I want to add = in front of the curly bracket when it is missing, otherwise ignore it
Output should be :
data = {
x = "name",
y = "job"
z = "role"
}
I tried this but doesn't work.
sed -r -ie 's|data \{ | data \= \{|g' abc.txt
Can anybody please tell what am I doing wrong?
Thanks

You can use
sed -r -i 's|(data)[[:space:]]*(\{)|\1 = \2|g' abc.txt
sed -i 's|data[[:space:]]*\{|data = {|g' abc.txt
Details:
-r (or -E) option enables POSIX ERE regex syntax
(data)[[:space:]]*(\{) - matches and captures data into Group 1, [[:space:]]* matches any zero or more whitespaces, (\{) captures into Group 2 a { char
\1 = \2 - replaces with Group 1 value, space, =, space and Group 2 value.
See an online demo.
NOTE: If you need to make sure the pattern matches the whole line add anchors, ^ and $:
's|^(data)[[:space:]]*(\{)$|\1 = \2|g'
In a GNU sed, you may use a shorter form for [[:spaces:]]*, \s*:
's|^(data)\s*(\{)$|\1 = \2|'
If there can be leading/trailing whitespace and you need to keep it use:
's|^(\s*data)\s*(\{\s*)$|\1 = \2|'

sed '/=/!s/data /&= /' abc.txt
On lines not having an equal sign, replace data+space with data+space+equal+space.

Related

bash SED command explanation with semicolon

What is this sed command doing? and is there any online utility that kind of explains sed a little bit, like regex?
sed -i '1s/$/|,a Type,b Type,c Type/;/./!b;1!s/$/|,,,/' textflile.txt
I think in the beginning it is adding csv a type, b type, c type at the end of the line but what does the rest of the command too
I don't know of any such utility, but let me explain using a text editor:
sed -i '1s/$/|,a Type,b Type,c Type/;/./!b;1!s/$/|,,,/' textflile.txt
^ ^ ^ ^ ^^ ^^ ^
| | | | || || |
modify | End Non-empty || || input
the | of lines || |Negation, file
file | line only || |i.e. lines 2,3,...
in | || |
place | || First
First line Negation, i.e.| line
empty lines only|
Branch to
script end,
i.e. skip the rest
In other words, it adds |,a type, b Type,c Type to the first line, doesn't change empty lines, and adds |,,, to all the remaining lines.
sed -i '1s/$/|,a Type,b Type,c Type/;/./!b;1!s/$/|,,,/' textflile.txt
can be written as
sed -i '
1 s/$/|,a Type,b Type,c Type/
/./! b
1! s/$/|,,,/
' textflile.txt
on line 1 only, add some text to the end of the line
if the line is empty ("matches 1 character, not"), goto next "cycle" (i.e., print current line and go to next line)
on every line except line 1, add "|,,," to the end of the line
So, it looks like you're adding some blank fields to a CSV file.
info sed contains the complete sed manual.
This doesn't answer your question but it's important for people to know and requires more space and formatting than a comment so: FYI to do what #choroba says that sed script does, i.e.
it adds |,a type, b Type,c Type to the first line,
doesn't change empty lines,
and adds |,,, to all the remaining lines.
is just this in awk:
awk '
NR==1 { print $0 "|,a type, b Type,c Type"; next }
!NF { print }
NF { print $0 "|,,," }
'
or if you're familiar with ternary expressions and want to remove the redundant code:
awk '{
sfx = "|," (NR==1 ? "a type, b Type,c Type" : ",,")
print $0 (NF ? sfx : "")
}'

Conditional substitution of patterns in bash strings depending on the beginning of a string

I am new in bash, so excuse me if do not use the right terms.
I need to substitute certain patterns of six characters in a set of files. The order by patterns are substituted depends on the beginning of each string of text.
This is an example of input:
chr1:123-123 5GGGTTAGGGTTAGGGTTAGGGTTAGGGTTA3
chr1:456-456 5TTAGGGTTAGGGTTAGGGTTAGGGTTAGGG3
chr1:789-789 5GGGCTAGGGTTAGGGTTAGGGTTA3
chr1:123-123 etc is the name of the string, they are separated from the string I need to work with by a tab. The string I need to work with is delimited by characters 5 and 3, but I can change them.
I want that all patterns containing T, A, G in anyone of these orders is substituted with X: TTAGGG, TAGGG, AGGGTT, GGGTTA, GGTTAG, GTTAGG.
Similarly, patterns containing CTAGGG, like row 3, in orders similar to the previous one will be substituted with a different character.
The game is repeated with some specific differences for all the 6 characters composing each pattern.
I started writing something like this:
#!/bin/bash
NORMAL=`echo "\033[m"`
RED=`echo "\033[31m"` #red
#read filename for the input file and create a copy and a folder for the output
read -p "Insert name for INPUT file: " INPUT
echo "Creating OUTPUT file " "${RED}"$INPUT"_sub.txt${NORMAL}"
mkdir -p ./"$INPUT"_OUTPUT
cp $INPUT.txt ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
echo
#start the first set of instructions
perfrep
#starting a second set of instructions to substitute pattern with one difference from TTAGGG
onemism
Instructions are
perfrep() {
sed -i -e 's/TTAGGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/TAGGGT/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/AGGGTT/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GGGTTA/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GGTTAG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GTTAGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
}
# starting a second set of instructions to substitute pattern with one difference from TTAGGG
onemism(){
sed -i -e 's/[GCA]TAGGG/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/G[GCA]TAGG/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GG[GCA]TAG/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/GGG[GCA]TA/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/AGGG[GCA]T/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
sed -i -e 's/TAGGG[GCA]/L/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
}
I will need to repeat also with T[GCA]AGGG, TT[TCG]GGG, TTA[ACT]GG, TTAG[ACT]G and TTAGG[ACT].
Using this procedure, I get for these results for the inputs shown
5GGGXXXXTTA3
5XXXXX3
5GGGLXXTTA3
In my point of view, for my job, the first and second string are both made by X repeated five times, and the order of characters is just slightly different. On the other hand, the third one could be masked like this:
5LXXX3
How do I tell the script that if the string starts with 5GGGTTA instead of 5TTAGGG must start to substitute with
sed -i -e 's/GGGTTA/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
instead of
sed -i -e 's/TTAGGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
?
I will need to repeat with all cases; for instance, if the string starts with GTTAGG I will need to start with
sed -i -e 's/GTTAGG/X/g' ./"$INPUT"_OUTPUT/"$INPUT"_sub.txt
and so on, and add a couple of variation of my pattern.
I need to repeat the substitution with TTAGGG and the variations for all the rows of my input file.
Sorry for the very long question. Thank you all.
Adding information asked by Varun.
Patterns of 6 characters would be TTAGGG , [GCA]TAGGG , T[GCA]AGGG , TT[TCG]GGG , TTA[ACT]GG , TTAG[ACT]G , TTAGG[ACT].
Each one must be checked for a different frame, for instance for TTAGGG we have 6 frames TTAGGG , GTTAGG , GGTTAG, GGGTTA , AGGGTT , TAGGGT.
The same frames must be applied to the pattern containing a variable position.
I will have a total of 42 patterns to check, divided in 7 groups: one containing TTAGGG and derivative frames, 6 with the patterns with a variable position and their derivatives.
TTAGGG and derivatives are the most important and need to be checked first.
#! /usr/bin/awk -f
# generate a "frame" by moving the first char to the end
function rotate(base){ return substr(base,2) substr(base,1,1) }
# Unfortunately awk arrays do not store regexps
# so I am generating the list of derivative strings to match
function generate_derivative(frame,arr, i,j,k,head,read,tail) {
arr[i]=frame;
for(j=1; j<=length(frame); j++) {
head=substr(frame,1,j-1);
read=substr(frame,j,1);
tail=substr(frame,j+1);
for( k=1; k<=3; k++) {
# use a global index to simplify
arr[++Z]= head substr(snp[read],k,1) tail
}
}
}
BEGIN{
fs="\t";
# alternatives to a base
snp["A"]="TCG"; snp["T"]="ACG"; snp["G"]="ATC"; snp["C"]="ATG";
# the primary target
frame="TTAGGG";
Z=1; # warning GLOBAL
X[Z] = frame;
# primary derivatives
generate_derivative(frame, X);
xn = Z;
# secondary shifted targets and their derivatives
for(i=1; i<length(frame); i++){
frame = rotate(frame);
L[++Z] = frame;
generate_derivative(frame, L);
}
}
/^chr[0-9:-]*\t5[ACTG]*3$/ {
# because we care about the order of the prinary matches
for (i=1; i<=xn; i++) {gsub(X[i],"X",$2)}
# since we don't care about the order of the secondary matches
for (hit in L) {gsub(L[hit],"L",$2)}
print
}
END{
# print the matches in the order they are generated
#for (i=1; i<=xn; i++) {print X[i]};
#print ""
#for (i=1+xn; i<=Z; i++) {print L[i]};
}
IFF you can generate a static matching order you can live with then
something like the above Awk script could work. but you say the primary patterns should take precedence and that a secondary rule would be better applied first in some cases. (no can do).
If you need a more flexible matching pattern I would suggest looking at "recursive decent parsing with backtracking" Or "parsing expression grammars".
But then you are not in a bash shell anymore.

replace a line that contains a string with special characters

i want to replace lines which contains a string that has some special characters.
i used \ and \ for escape special characters but nothing changes in file.
i use sed like this:
> sed -i '/pnconfig\[\'dbhost\'\] = \'localhost\'/c\This line is removed.' tco.php
i just want to find lines that contains :
$pnconfig['dbhost'] = 'localhost';
and replace that line with:
$pnconfig['dbhost'] = '1.1.1.1';
Wrap the sed in double quotes as
sed -i "s/\(pnconfig\['dbhost'\] = \)'localhost'/\1'1.1.1.1'/" filename
Test
$ echo "\$pnconfig['dbhost'] = 'localhost';" | sed "s/\(pnconfig\['dbhost'\] = \)'localhost'/\1'1.1.1.1'/"
$pnconfig['dbhost'] = '1.1.1.1';
Use as below:
sed -i.bak '/pnconfig\[\'dbhost\'\] = \'localhost\'/pnconfig\[\'dbhost\'\] = \'1.1.1.1\'/' tco.php
Rather than modifying the file for the first time, create back up and then search for your pattern and then replace it with the other as above in your file tco.php
You don't have to worry about backslashing single quotes by using double quotes for sed.
sed -i.bak "/pnconfig\['dbhost'\] = 'localhost'/s/localhost/1.1.1.1/g" File
Try this one.
sed "/$pnconfig\['dbhost']/s/localhost/1.1.1.1/"

divide each line in equal part

I would be happy if anyone can suggest me command (sed or AWK one line command) to divide each line of file in equal number of part. For example divide each line in 4 part.
Input:
ATGCATHLMNPHLNTPLML
Output:
ATGCA THLMN PHLNT PLML
This should work using GNU sed:
sed -r 's/(.{4})/\1 /g'
-r is needed to use extended regular expressions
.{4} captures every four characters
\1 refers to the captured group which is surrounded by the parenthesis ( ) and adds a space behind this group
g makes sure that the replacement is done as many times as possible on each line
A test; this is the input and output in my terminal:
$ echo "ATGCATHLMNPHLNTPLML" | sed -r 's/(.{4})/\1 /g'
ATGC ATHL MNPH LNTP LML
I suspect awk is not the best tool for this, but:
gawk --posix '{ l = sprintf( "%d", 1 + (length()-1)/4);
gsub( ".{"l"}", "& " ) } 1' input-file
If you have a posix compliant awk you can omit the --posix, but --posix is necessary for gnu awk and since that seems to be the most commonly used implementation I've given the solution in terms of gawk.
This might work for you (GNU sed):
sed 'h;s/./X/g;s/^\(.*\)\1\1\1/\1 \1 \1 \1/;G;s/\n/&&/;:a;/^\n/bb;/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta;s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta;:b;s/\n//g' file
Explanation:
h copy the pattern space (PS) to the hold space (HS)
s/./X/g replace every character in the HS with the same non-space character (in this case X)
s/^\(.*\)\1\1\1/\1 \1 \1 \1/ split the line into 4 parts (space separated)
G append a newline followed by the contents of the HS to the PS
s/\n/&&/ double the newline (to be later used as markers)
:a introduce a loop namespace
/^\n/bb if we reach a newline we are done and branch to the b namespace
/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta; if the first character is a space add a space to the real line at this point and repeat
s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta any other character just bump along and repeat
:b;s/\n//g all done just remove the markers and print out the result
This work for any length of line, however is the line is not exactly divisible by 4 the last portion will contain the remainder as well.
perl
perl might be a better choice here:
export cols=4
perl -ne 'chomp; $fw = 1 + int length()/$ENV{cols}; while(/(.{1,$fw})/gm) { print $1 . " " } print "\n"'
This re-calculates field-width for every line.
coreutils
A GNU coreutils alternative, field-width is chosen based on the first line of infile:
cols=4
len=$(( $(head -n1 infile | wc -c) - 1 ))
fw=$(echo "scale=0; 1 + $len / 4" | bc)
cut_arg=$(paste -d- <(seq 1 $fw 19) <(seq $fw $fw $len) | head -c-1 | tr '\n' ',')
Value of cut_arg is in the above case:
1-5,6-10,11-15,16-
Now cut the line into appropriate chunks:
cut --output-delimiter=' ' -c $cut_arg infile

Use sed/grep/awk to delete everything up until the first blank line

Can anyone help me figure out how to do this, it would be much appreciated.
example
block of //delete
non-important text //delete
important text //keep
more important text //keep
sed '1,/^$/d' file
or
awk '!$0{f=1;next}f{print}' file
Output
$ sed '1,/^$/d' <<< $'block of\nnon-important text\n\nimportant text\nmore important text'
important text
more important text
$ awk '!$0{f=1;next}f{print}' <<< $'block of\nnon-important text\n\nimportant text\nmore important text'
important text
more important text
If the blank line is empty, this'll do it:
sed '1,/^$/d' filename
with awk:
awk "BEGIN { x = 0 } /^$/ { x = 1} { if (x == 2) { print } ; if (x == 1) { x = 2 } }" filename
Another option with grep (working on lines)
grep -v PATTERN filename > newfilename
For example:
filename has the following lines:
this is not implicated but important text
this is not important text
this is important text he says
not important text he says
not this it is more important text
A filter of:
grep -v "not imp" filename > newfilename
would create newfilename with the following 3 lines:
this is not implicated but important text
this is important text he says
not this it is more important text
You would have to choose a PATTERN that would uniquely identify the lines you are trying to remove. If you use a PATTERN of "important text", it would match all of the lines while "not imp" only matches the lines that have the words "not imp" in them. Use egrep (or grep -E) for regexp filters if you want more flexibility in pattern matching.