Replace first 3 occurrences of a character in each line - sed

I have a tab-delimited file of genetic variants with an INFO column of many semicolon-delimited tags:
Chr Start End Ref Alt ExAC_ALL ExAC_AFR ExAC_AMR ExAC_EAS ExAC_FIN ExAC_NFE ExAC_OTH ExAC_SAS Otherinfo QUAL DP Chr Start Ref Alt QUAL FILTER INFO
1 15847952 15847952 G C . . . . . . . . . 241.9 76196 1 15847952 . G C 241.9 PASS AC=2;AF=0;AN=18332;BaseQRankSum=0.731;ClippingRankSum=-0.731;DP=76196;ExcessHet=3.1;FS=0;InbreedingCoeff=-0.0456;MLEAC=2;MLEAF=0;MQ=38.93;MQRankSum=0.515;NEGATIVE_TRAIN_SITE;QD=10.52;ReadPosRankSum=0.89;SOR=0.481;VQSLOD=-1.406 culprit=MQ
1 15847963 15847963 A C . . . . . . . . . 1607.1 126156 1 15847963 . A C 1607.1 PASS AC=2;AF=0;AN=22004;BaseQRankSum=0.851;ClippingRankSum=-0.419;DP=126156;ExcessHet=3.4904;FS=0;InbreedingCoeff=0.0299;MLEAC=2;MLEAF=0;MQ=59.29;MQRankSum=0.18;QD=1.55;ReadPosRankSum=0.067;SOR=0.651;VQSLOD=0.995 culprit=QD
1 15847964 15847966 GCC - . . . . . . . . . 1607.1 126156 1 15847963 . AGCC A 1607.1 PASS AC=63;AF=0.003;AN=22004;BaseQRankSum=0.851;ClippingRankSum=-0.419;DP=126156;ExcessHet=3.4904;FS=0;InbreedingCoeff=0.0299;MLEAC=55;MLEAF=0.002;MQ=59.29;MQRankSum=0.18;QD=1.55;ReadPosRankSum=0.067;SOR=0.651;VQSLOD=0.995 culprit=QD
1 15847978 15847978 C T . . . . . . . . . 648.41 234344 1 15847978 . C T 648.41 PASS AC=9;AF=0;AN=25894;BaseQRankSum=-0.572;ClippingRankSum=-0.404;DP=234344;ExcessHet=3.348;FS=2.639;InbreedingCoeff=-0.0098;MLEAC=6;MLEAF=0;MQ=58.71;MQRankSum=-0.456;NEGATIVE_TRAIN_SITE;QD=4.13;ReadPosRankSum=-0.456;SOR=0.452;VQSLOD=-1.238 culprit=QD
I want to split the first 3 semicolon-delimited terms in the INFO column:
AC=2;AF=0;AN=18332
So that they become:
AC=2 AF=0 AN=18332 BaseQRankSum=0.731;ClippingRankSum=-0.731;DP=76196;ExcessHet=3.1;FS=0;InbreedingCoeff=-0.0456;MLEAC=2;MLEAF=0;MQ=38.93;MQRankSum=0.515;NEGATIVE_TRAIN_SITE;QD=10.52;ReadPosRankSum=0.89;SOR=0.481;VQSLOD=-1.406 culprit=M
So far I've tried the following expression with sed:
sed -i .bk 's/\(A.=.*\);/\1 /g' allChr_ExAC38.hg38_multianno.txt
But this yields no changes.
Ideally I was looking for a way to tell sed to replace the first 3 occurences of a semicolon ; for a tab, but 's/;/ /g3' doesn't seem to mean that.

Use Perl instead of sed:
perl -i.bk -pe '$c = 0; s/;/\t/ while $c++ < 3' -- file.txt

You can try this awk
awk '{for(i=1;i<4;i++)sub(";","\t")}1' infile

The .* in your regex is greedy, and will match as much text as possible on the line, up to just before the last semicolon (but not beyond, because then the entire regex won't match at all).
You cannot mix /3 and /g; the latter means, replace all occurrences on every line, so it is directly at odds with the /3 which says to replace only a maximum of three occurrences on a line.
"No changes" seems wrong, though; if your regex matched at all, the last semicolon on matching lines will have been replaced.
Some regex engines support non-greedy matching, but sed isn't one of them. As long as there is a single delimiter character you can use to limit the greediness, using that is a much better solution anyway. In your case, simply replace . with [^;] to say "any character except (newline or) semicolon" instead of "any character (except newline)."
sed 's/\(A.=[^;]*\);/\1 /3' allChr_ExAC38.hg38_multianno.txt
(This will print to standard output for verification; put back the -i .bk once you see the result is correct.)
Based on your example data, perhaps consider replacing the remaining . in the expression with [A-Z] and [^;] with [^;=] or even [0-9]. The more specific you can make your regex, the better.

Could you please try following and let me know if this helps you.
awk '
FNR==1{
print;
next}
{
num=split($(NF-1),array,";");
for(i=4;i<=num;i++){
val=val?val ";"array[i]:array[i]};
$(NF-1)=array[1] OFS array[2] OFS array[3] OFS val;
val="";
$1=$1
}
1
' OFS="\t" Input_file

This might work for you (GNU sed):
sed -i.bak 's/;/\n/3;h;y/;/\t/;G;s/\n.*\n/\t/' file
Replace the third ; with a newline, make a copy of the line, replace all ;'s with \t's, append the copy and replace the end of the first line to the middle of the second line with a \t.
Since by definition a line is demarcated by a newline, lines cannot contain a newline unless they are introduced by a programmer.

If the number of occurrences is reasonable you can pipe sed multiple times i.e.
sed -E -e 's/[0-9]{4}/****/'| sed -E -e 's/[0-9]{4}/****/'| sed -E -e 's/[0-9]{4}/****/'
will mask first 3 4-digit groups of credit card number like so
Input:
1234 5678 9101 1234
Output:
**** **** **** 1234

Related

How to replace a block of code between two patterns with blank lines?

I am trying replace a block of code between two patterns with blank lines
Tried using below command
sed '/PATTERN-1/,/PATTERN-2/d' input.pl
But it only removes the lines between the patterns
PATTERN-1 : "=head"
PATTERN-2 : "=cut"
input.pl contains below text
=head
hello
hello world
world
morning
gud
=cut
Required output :
=head
=cut
Can anyone help me on this?
$ awk '/=cut/{f=0} {print (f ? "" : $0)} /=head/{f=1}' file
=head
=cut
To modify the given sed command, try
$ sed '/=head/,/=cut/{//! s/.*//}' ip.txt
=head
=cut
//! to match other than start/end ranges, might depend on sed implementation whether it dynamically matches both the ranges or statically only one of them. Works on GNU sed
s/.*// to clear these lines
awk '/=cut/{found=0}found{print "";next}/=head/{found=1}1' infile
# OR
# ^ to take care of line starts with regexp
awk '/^=cut/{found=0}found{print "";next}/^=head/{found=1}1' infile
Explanation:
awk '/=cut/{ # if line contains regexp
found=0 # set variable found = 0
}
found{ # if variable found is nonzero value
print ""; # print ""
next # go to next line
}
/=head/{ # if line contains regexp
found=1 # set variable found = 1
}1 # 1 at the end does default operation
# print current line/row/record
' infile
Test Results:
$ cat infile
=head
hello
hello world
world
morning
gud
=cut
$ awk '/=cut/{found=0}found{print "";next}/=head/{found=1}1' infile
=head
=cut
This might work for you (GNU sed):
sed '/=head/,/=cut/{//!z}' file
Zap the lines between =head and =cut.

Remove newline depending on the format of the next line

I have a special file with this kind of format :
title1
_1 texthere
title2
_2 texthere
I would like all newlines starting with "_" to be placed as a second column to the line before
I tried to do that using sed with this command :
sed 's/_\n/ /g' filename
but it is not giving me what I want to do (doing nothing basically)
Can anyone point me to the right way of doing it ?
Thanks
Try following solution:
In sed the loop is done creating a label (:a), and while not match last line ($!) append next one (N) and return to label a:
:a
$! {
N
b a
}
After this we have the whole file into memory, so do a global substitution for each _ preceded by a newline:
s/\n_/ _/g
p
All together is:
sed -ne ':a ; $! { N ; ba }; s/\n_/ _/g ; p' infile
That yields:
title1 _1 texthere
title2 _2 texthere
If your whole file is like your sample (pairs of lines), then the simplest answer is
paste - - < file
Otherwise
awk '
NR > 1 && /^_/ {printf "%s", OFS}
NR > 1 && !/^_/ {print ""}
{printf "%s", $0}
END {print ""}
' file
This might work for you (GNU sed):
sed ':a;N;s/\n_/ /;ta;P;D' file
This avoids slurping the file into memory.
or:
sed -e ':a' -e 'N' -e 's/\n_/ /' -e 'ta' -e 'P' -e 'D' file
A Perl approach:
perl -00pe 's/\n_/ /g' file
Here, the -00 causes perl to read the file in paragraph mode where a "line" is defined by two consecutive newlines. In your example, it will read the entire file into memory and therefore, a simple global substitution of \n_ with a space will work.
That is not very efficient for very large files though. If your data is too large to fit in memory, use this:
perl -ne 'chomp;
s/^_// ? print "$l " : print "$l\n" if $. > 1;
$l=$_;
END{print "$l\n"}' file
Here, the file is read line by line (-n) and the trailing newline removed from all lines (chomp). At the end of each iteration, the current line is saved as $l ($l=$_). At each line, if the substitution is successful and a _ was removed from the beginning of the line (s/^_//), then the previous line is printed with a space in place of a newline print "$l ". If the substitution failed, the previous line is printed with a newline. The END{} block just prints the final line of the file.

sed/awk/cut/grep - Best way to extract string

I have a results.txt file that is structured in this format:
Uncharted 3: Javithaxx l Rampant l Graveyard l Team Deathmatch HD (D1VpWBaxR8c)
Matt Darey feat. Kate Louise Smith - See The Sun (Toby Hedges Remix) (EQHdC_gGnA0)
The Matrix State (SXP06Oax70o)
Above & Beyond - Group Therapy Radio 014 (guest Lange) (2013-02-08) (8aOdRACuXiU)
I want to create a new file extracting the youtube URL ID specified in the last characters in each line line "8aOdRACuXiU"
I'm trying to build a URL like this in a new file:
http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
Note, I appended the &hd=1 to the string that I am trying to be replaced. I have tried using Linux reverse and cut but reverse or rev munges my data. The hard part here is that each line in my text file will have entries with parentheses and I only care about getting the data between the last set of parentheses. Each line has a variable length so that isn't helpful either. What about using grep and .$ for the end of the line?
In summary, I want to extract the youtube ID from results.txt and export it to a new file in the following format: http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
Using awk:
awk '{
v = substr( $NF, 2, length( $NF ) - 2 )
printf "%s%s%s\n", "http://www.youtube.com/watch?v=", v, "&hd=1"
}' infile
It yields:
http://www.youtube.com/watch?v=D1VpWBaxR8c&hd=1
http://www.youtube.com/watch?v=EQHdC_gGnA0&hd=1
http://www.youtube.com/watch?v=SXP06Oax70o&hd=1
http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
$ sed 's!.*(\(.*\))!http://www.youtube.com/watch?v=\1\&hd=1!' results.txt
http://www.youtube.com/watch?v=D1VpWBaxR8c&hd=1
http://www.youtube.com/watch?v=EQHdC_gGnA0&hd=1
http://www.youtube.com/watch?v=SXP06Oax70o&hd=1
http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
Here, .*(\(.*\)) looks for the last occurrence of a pair of parentheses, and captures the characters inside those parentheses. The captured group is then inserted into the URL using \1.
Using a perl one-liner :
perl -lne 'printf "http://www.youtube.com/watch?v=%s&hd=1\n", $& if /[^\(]+(?=\)$)/' file.txt
Or multi-line version :
perl -lne '
printf(
"http://www.youtube.com/watch?v=%s&hd=1\n",
$&
) if /[^\(]+(?=\)$)/
' file.txt

divide each line in equal part

I would be happy if anyone can suggest me command (sed or AWK one line command) to divide each line of file in equal number of part. For example divide each line in 4 part.
Input:
ATGCATHLMNPHLNTPLML
Output:
ATGCA THLMN PHLNT PLML
This should work using GNU sed:
sed -r 's/(.{4})/\1 /g'
-r is needed to use extended regular expressions
.{4} captures every four characters
\1 refers to the captured group which is surrounded by the parenthesis ( ) and adds a space behind this group
g makes sure that the replacement is done as many times as possible on each line
A test; this is the input and output in my terminal:
$ echo "ATGCATHLMNPHLNTPLML" | sed -r 's/(.{4})/\1 /g'
ATGC ATHL MNPH LNTP LML
I suspect awk is not the best tool for this, but:
gawk --posix '{ l = sprintf( "%d", 1 + (length()-1)/4);
gsub( ".{"l"}", "& " ) } 1' input-file
If you have a posix compliant awk you can omit the --posix, but --posix is necessary for gnu awk and since that seems to be the most commonly used implementation I've given the solution in terms of gawk.
This might work for you (GNU sed):
sed 'h;s/./X/g;s/^\(.*\)\1\1\1/\1 \1 \1 \1/;G;s/\n/&&/;:a;/^\n/bb;/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta;s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta;:b;s/\n//g' file
Explanation:
h copy the pattern space (PS) to the hold space (HS)
s/./X/g replace every character in the HS with the same non-space character (in this case X)
s/^\(.*\)\1\1\1/\1 \1 \1 \1/ split the line into 4 parts (space separated)
G append a newline followed by the contents of the HS to the PS
s/\n/&&/ double the newline (to be later used as markers)
:a introduce a loop namespace
/^\n/bb if we reach a newline we are done and branch to the b namespace
/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta; if the first character is a space add a space to the real line at this point and repeat
s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta any other character just bump along and repeat
:b;s/\n//g all done just remove the markers and print out the result
This work for any length of line, however is the line is not exactly divisible by 4 the last portion will contain the remainder as well.
perl
perl might be a better choice here:
export cols=4
perl -ne 'chomp; $fw = 1 + int length()/$ENV{cols}; while(/(.{1,$fw})/gm) { print $1 . " " } print "\n"'
This re-calculates field-width for every line.
coreutils
A GNU coreutils alternative, field-width is chosen based on the first line of infile:
cols=4
len=$(( $(head -n1 infile | wc -c) - 1 ))
fw=$(echo "scale=0; 1 + $len / 4" | bc)
cut_arg=$(paste -d- <(seq 1 $fw 19) <(seq $fw $fw $len) | head -c-1 | tr '\n' ',')
Value of cut_arg is in the above case:
1-5,6-10,11-15,16-
Now cut the line into appropriate chunks:
cut --output-delimiter=' ' -c $cut_arg infile

sed - comment a matching line and x lines after it

I need help with using sed to comment a matching lines and 4 lines which follows it.
in a text file.
my text file is like this:
[myprocess-a]
property1=1
property2=2
property3=3
property4=4
[anotherprocess-b]
property1=gffgg
property3=gjdl
property2=red
property4=djfjf
[myprocess-b]
property1=1
property4=4
property2=2
property3=3
I want to prefix # to all the lines having text '[myprocess' and 4 lines that follows it
expected output:
#[myprocess-a]
#property1=1
#property2=2
#property3=3
#property4=4
[anotherprocess-b]
property1=gffgg
property3=gjdl
property2=red
property4=djfjf
#[myprocess-b]
#property1=1
#property4=4
#property2=2
#property3=3
Greatly appreciate your help on this.
You can do this by applying a regular expression to a set of lines:
sed -e '/myprocess/,+4 s/^/#/'
This matches lines with 'myprocess' and the 4 lines after them. For those 4 lines it then inserts a '#' at the beginning of the line.
(I think this might be a GNU extension - it's not in any of the "sed one liner" cheatsheets I know)
sed '/\[myprocess/ { N;N;N;N; s/^/#/gm }' input_file
Using string concatenation and default action in awk.
http://www.gnu.org/software/gawk/manual/html_node/Concatenation.html
awk '/myprocess/{f=1} f>5{f=0} f{f++; $0="#" $0} 1' foo.txt
or if the block always ends with empty line
awk '/myprocess/{f=1} !NF{f=0} f{$0="#" $0} 1' foo.txt