sed: delete lines that match a pattern in a given field - sed

I have a tab-delimited file that looks like this:
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
53_344 2 . C G 999 . . GT:PL:DP:DPR
6_56775 67 . T A 999 . . GT:PL:DP:DPR
53_234 78 . CCG GAT 999 . . GT:PL:DP:DPR
45_569 5 . TCCG GTTA 999 . . GT:PL:DP:DPR
3_67687 2 . T G 999 . . GT:PL:DP:DPR
53_569 89 . T G 999 . . GT:PL:DP:DPR
I am trying to use sed to delete all the lines that contain more than one letter in the 4th field (in the case above, lines 7 and 8 from the top). I have tried the following regular expression, but there must be a glitch somewhere that I cannot find:
sed '5,${;/\([^.]*\t\)\{3\}\[A-Z][A-Z]\+\t/d;}' input.vcf>new.vcf
The syntax is as follows:
5,$ #start at line 5 until the end of the file ($)
([^.]*\t) #matching group is any single character followed by a zero or more characters followed by a tab.
{3} #previous block repeated 3 times (presumably for the 4th field)
[A-Z][A-Z]+\t #followed by any string of two letters or more followed by a tab.
Unfortunately, this doesn't work, but I know I am close to making it work. Any hints or help will make this a great teaching moment.
Thanks.

If awk is okay for you, you can use the command below:
awk '(FNR<5){print} (FNR>=5)&&length($4)<=1' input.vcf
The default delimiter is whitespace; to switch it to tab, put -F"\t" right after awk, for instance awk -F"\t" ....
(FNR<5){print}: FNR is the record (line) number within the current file; while it is less than 5, the whole line is printed.
(FNR>=5) && length($4)<=1 handles the remaining lines and keeps only those whose 4th field has one character or less.
Output:
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
53_344 2 . C G 999 . . GT:PL:DP:DPR
6_56775 67 . T A 999 . . GT:PL:DP:DPR
3_67687 2 . T G 999 . . GT:PL:DP:DPR
53_569 89 . T G 999 . . GT:PL:DP:DPR
You can redirect the output to an output file.
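As a hedged variation on the line-number test above: VCF header lines start with #, so they can be recognised by content instead of by counting the first four lines. A minimal sketch, assuming tab-separated input; filter_single_ref is a hypothetical helper name, not part of the original answer:

```shell
# filter_single_ref (hypothetical name): read a VCF-like file on stdin and keep
# header lines (starting with '#') plus records whose 4th tab-separated
# field is exactly one character long.
filter_single_ref() {
    awk -F'\t' '/^#/ || length($4) == 1'
}

# Small made-up sample: one header line, one single-letter REF, one longer REF.
printf '#CHROM\tPOS\tID\tREF\tALT\n53_344\t2\t.\tC\tG\n53_234\t78\t.\tCCG\tGAT\n' |
    filter_single_ref
```

On a real file this would be called as filter_single_ref < input.vcf > new.vcf.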

$ awk 'NR<5 || $4~/^.$/' file
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
53_344 2 . C G 999 . . GT:PL:DP:DPR
6_56775 67 . T A 999 . . GT:PL:DP:DPR
3_67687 2 . T G 999 . . GT:PL:DP:DPR
53_569 89 . T G 999 . . GT:PL:DP:DPR

Fixed your sed filter (took me a while, I almost went crazy over it):
5,${/^\([^\t]\+\t\)\{3\}[A-Z][A-Z]\+\t/d}
Your errors:
[^.]*: everything but a dot.
Thanks to Ed, now I know that. I thought the dot had to be escaped, but that does not apply inside brackets. Anyhow, [^.]* can also match a tab character, so a single group can span two or three fields instead of one and the count of three no longer lines up with the 4th field (regexes are greedy by default).
\[A-Z][A-Z]: bad backslash. The \[ makes the bracket literal, so this part looks for the literal text [A-Z] instead of an uppercase letter.
test:
$ sed '5,${/^\([^\t]\+\t\)\{3\}[A-Z][A-Z]\+\t/d}' foo.Txt
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
53_344 2 . C G 999 . . GT:PL:DP:DPR
6_56775 67 . T A 999 . . GT:PL:DP:DPR
3_67687 2 . T G 999 . . GT:PL:DP:DPR
53_569 89 . T G 999 . . GT:PL:DP:DPR
conclusion: to process delimited fields, awk is better :)
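One caveat on the fixed filter: \t and \+ are not part of POSIX basic regular expressions (GNU sed accepts them), so other sed implementations may not understand them. A sketch of the same filter with a literal tab and \{1,\} instead, assuming a POSIX shell; delete_multi_letter_ref is a hypothetical wrapper name:

```shell
# delete_multi_letter_ref (hypothetical name): from line 5 on, delete lines
# whose 4th tab-separated field consists of two or more capital letters.
delete_multi_letter_ref() {
    tab=$(printf '\t')   # a literal tab character
    # \{1,\} is the portable spelling of GNU's \+ (one or more).
    sed "5,\${/^\([^$tab]\{1,\}$tab\)\{3\}[A-Z][A-Z]\{1,\}$tab/d;}"
}

# Four dummy header lines followed by three made-up records.
printf 'h1\nh2\nh3\nh4\na\t1\t.\tC\tG\t9\nb\t2\t.\tCCG\tGAT\t9\nc\t3\t.\tT\tA\t9\n' |
    delete_multi_letter_ref
```

Only the record with CCG in the 4th field is dropped; the first four lines are never tested because of the 5,$ address.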

Related

Print specific lines that have two or more occurrences of a particular character

I have file with some text lines. I need to print lines 3-7 and 11 if it has two "b". I did
sed -n '/b\{2,\}/p' file but it printed lines where "b" occurs two times in a row
You can use
sed -n '3,7{/b[^b]*b/p};11{/b[^b]*b/p}' file
## that is equal to
sed -n '3,7{/b[^b]*b/p};11{//p}' file
Note that b[^b]*b matches b, then any zero or more chars other than b and then a b. The //p in the second part reuses the most recent pattern, i.e. it matches the same b[^b]*b regex.
Note you might also use the b.*b regex if you want, but bracket expressions tend to work faster.
See an online demo, tested with sed (GNU sed) 4.7:
s='11bb1
b222b
b n b
ww
ee
bb
rrr
fff
999
10
11 b nnnn bb
www12'
sed -ne '3,7{/b[^b]*b/p};11{/b[^b]*b/p}' <<< "$s"
Output:
b n b
bb
11 b nnnn bb
Only lines 3, 6 and 11 are returned.
Just use awk for simplicity, clarity, portability, maintainability, etc. Using any awk in any shell on every Unix box:
awk '( (3<=NR && NR<=7) || (NR==11) ) && ( gsub(/b/,"&") >= 2 )' file
Notice how if you need to change a range, add a range, add other line numbers, change how many bs there are, add other chars and/or strings to match, add some completely different condition, etc. it's all absolutely clear and trivial.
For example, want to print the line if there are exactly 13 or 27 bs instead of 2 or more?
awk '( (3<=NR && NR<=7) || (NR==11) ) && ( gsub(/b/,"&") ~ /^(13|27)$/ )' file
Want to print the line if the line number is between 23 and 59 but isn't 34?
awk '( 23<=NR && NR<=59 && NR!=34 ) && ( gsub(/b/,"&") >= 2 )' file
Try making similar changes to a sed script. I'm not saying you can't force it to happen, but it's not nearly as trivial, clear, portable, etc. as it is using awk.
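To make the gsub() trick concrete, here is a small sketch that runs the first awk command on the same twelve sample lines used in the sed demo. print_lines_with_two_b is a hypothetical wrapper name added only for illustration:

```shell
# print_lines_with_two_b (hypothetical name): print lines 3-7 and 11 of stdin
# that contain at least two "b" characters anywhere on the line.
print_lines_with_two_b() {
    # gsub(/b/, "&") replaces each b with itself and returns how many it found.
    awk '( (3<=NR && NR<=7) || (NR==11) ) && ( gsub(/b/,"&") >= 2 )'
}

# The same twelve sample lines as in the sed demo.
printf '%s\n' '11bb1' 'b222b' 'b n b' 'ww' 'ee' 'bb' 'rrr' 'fff' '999' '10' \
    '11 b nnnn bb' 'www12' | print_lines_with_two_b
```

This prints lines 3, 6 and 11, the same result as the sed version.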

xargs and sed to extract specific lines

I want to extract lines that have a particular pattern, in a certain column. For example, in my 'input.txt' file, I have many columns. I want to search the 25th column for 'foobar', and extract only those lines that have 'foobar' in the 25th column. I cannot do:
grep foobar input.txt
because other columns may also have 'foobar', and I don't want those lines. Also:
the 25th column will have 'foobar' as part of a string (i.e. it could be 'foobar ; muller' or 'max ; foobar ; john', or 'tom ; foobar35')
I would NOT want 'tom ; foobar35'
The word in column 25 must be an exact match for 'foobar', but the column also contains other words and ';' separators, so using awk $25=='foobar' is not an option.
In other words, if column 25 had the following lines:
foobar ; muller
max ; foobar ; john
tom ; foobar35
I would want only lines 1 & 2.
How do I use xargs and sed to extract these lines? I am stuck at:
cut -f25 input.txt | grep -nw foobar | xargs -I linenumbers sed ???
thanks!
Do not use xargs and sed, use the other tool common on so many machines and do this:
awk '{if($25=="foobar"){print NR" "$0}}' input.txt
print NR prints the line number of the current match so the first column of the output will be the line number.
print $0 prints the current line. Change it to print $25 if you only want the matching column. If you only want the output, use this:
awk '{if($25=="foobar"){print $0}}' input.txt
EDIT1 to match extended question:
Use what @shellter and @Jotne suggested but add string delimiters.
awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' '$25~/foobar/' input.txt
[^ ]* matches all characters that are not a space.
'[^']*' matches everything inside single quotes.
EDIT2 to exclude everything but foobar:
awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' "\$25~/[;' ]foobar[;' ]/" input.txt
[;' ] only allows ;, ' and a space directly before and after foobar.
Tested with this file:
1 "1 ; 1" 4
2 'kom foobar' 33
3 "ll;3" 3
4 '1; foobar' asd
7 '5 ;foobar' 2
7 '5;foobar' 0
2 'kom foobar35' 33
2 'kom ; foobar' 33
2 'foobar ; john' 33
2 'foobar;paul' 33
2 'foobar1;paul' 33
2 'foobarli;paul' 33
2 'afoobar;paul' 33
and this command awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' "\$2~/[;' ]foobar[;' ]/" input.txt
To get the lines whose 25th field is exactly foobar:
awk '$25=="foobar"' input.txt
$25 is the 25th field
== means equal to
"foobar" is the string to compare against
Since no action is specified, the complete line is printed, same as {print $0}
Or
awk '$25~/^foobar$/' input.txt
This might work for you (GNU sed):
sed -En 's/\S+/\n&\n/25;s/\n(.*foobar.*)\n/\1/p' file
Surround the 25th field by newlines and pattern match for foobar between newlines.
If you only want to match the word foobar use:
sed -En 's/\S+/\n&\n/25;s/\n(.*\<foobar\>.*)\n/\1/p' file
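Another way to read the requirement is "one of the ;-separated items in the column is exactly foobar". A minimal awk sketch under two assumptions the question does not confirm: the file is tab-delimited, and the items inside the column are separated by ; with optional spaces around them. match_word_in_column is a hypothetical helper name:

```shell
# match_word_in_column COL WORD (hypothetical helper): print tab-separated
# lines whose COL-th field, split on ';', contains WORD as a whole item.
match_word_in_column() {
    awk -F'\t' -v col="$1" -v word="$2" '
    {
        n = split($col, items, ";")
        for (i = 1; i <= n; i++) {
            item = items[i]
            gsub(/^[ ]+|[ ]+$/, "", item)   # trim spaces around the item
            if (item == word) { print; next }
        }
    }'
}

# Made-up three-column sample; the list lives in column 2 here.
printf '1\tfoobar ; muller\tx\n2\tmax ; foobar ; john\ty\n3\ttom ; foobar35\tz\n' |
    match_word_in_column 2 foobar
```

For the file in the question the call would be match_word_in_column 25 foobar < input.txt. Lines 1 and 2 of the sample are printed; foobar35 is rejected because the comparison is on the whole item.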

Replace first 3 occurrences of a character in each line

I have a tab-delimited file of genetic variants with an INFO column of many semicolon-delimited tags:
Chr Start End Ref Alt ExAC_ALL ExAC_AFR ExAC_AMR ExAC_EAS ExAC_FIN ExAC_NFE ExAC_OTH ExAC_SAS Otherinfo QUAL DP Chr Start Ref Alt QUAL FILTER INFO
1 15847952 15847952 G C . . . . . . . . . 241.9 76196 1 15847952 . G C 241.9 PASS AC=2;AF=0;AN=18332;BaseQRankSum=0.731;ClippingRankSum=-0.731;DP=76196;ExcessHet=3.1;FS=0;InbreedingCoeff=-0.0456;MLEAC=2;MLEAF=0;MQ=38.93;MQRankSum=0.515;NEGATIVE_TRAIN_SITE;QD=10.52;ReadPosRankSum=0.89;SOR=0.481;VQSLOD=-1.406 culprit=MQ
1 15847963 15847963 A C . . . . . . . . . 1607.1 126156 1 15847963 . A C 1607.1 PASS AC=2;AF=0;AN=22004;BaseQRankSum=0.851;ClippingRankSum=-0.419;DP=126156;ExcessHet=3.4904;FS=0;InbreedingCoeff=0.0299;MLEAC=2;MLEAF=0;MQ=59.29;MQRankSum=0.18;QD=1.55;ReadPosRankSum=0.067;SOR=0.651;VQSLOD=0.995 culprit=QD
1 15847964 15847966 GCC - . . . . . . . . . 1607.1 126156 1 15847963 . AGCC A 1607.1 PASS AC=63;AF=0.003;AN=22004;BaseQRankSum=0.851;ClippingRankSum=-0.419;DP=126156;ExcessHet=3.4904;FS=0;InbreedingCoeff=0.0299;MLEAC=55;MLEAF=0.002;MQ=59.29;MQRankSum=0.18;QD=1.55;ReadPosRankSum=0.067;SOR=0.651;VQSLOD=0.995 culprit=QD
1 15847978 15847978 C T . . . . . . . . . 648.41 234344 1 15847978 . C T 648.41 PASS AC=9;AF=0;AN=25894;BaseQRankSum=-0.572;ClippingRankSum=-0.404;DP=234344;ExcessHet=3.348;FS=2.639;InbreedingCoeff=-0.0098;MLEAC=6;MLEAF=0;MQ=58.71;MQRankSum=-0.456;NEGATIVE_TRAIN_SITE;QD=4.13;ReadPosRankSum=-0.456;SOR=0.452;VQSLOD=-1.238 culprit=QD
I want to split the first 3 semicolon-delimited terms in the INFO column:
AC=2;AF=0;AN=18332
So that they become:
AC=2 AF=0 AN=18332 BaseQRankSum=0.731;ClippingRankSum=-0.731;DP=76196;ExcessHet=3.1;FS=0;InbreedingCoeff=-0.0456;MLEAC=2;MLEAF=0;MQ=38.93;MQRankSum=0.515;NEGATIVE_TRAIN_SITE;QD=10.52;ReadPosRankSum=0.89;SOR=0.481;VQSLOD=-1.406 culprit=M
So far I've tried the following expression with sed:
sed -i .bk 's/\(A.=.*\);/\1 /g' allChr_ExAC38.hg38_multianno.txt
But this yields no changes.
Ideally I was looking for a way to tell sed to replace the first 3 occurrences of a semicolon ; with a tab, but 's/;/ /g3' doesn't seem to mean that.
Use Perl instead of sed:
perl -i.bk -pe '$c = 0; s/;/\t/ while $c++ < 3' -- file.txt
You can try this awk
awk '{for(i=1;i<4;i++)sub(";","\t")}1' infile
The .* in your regex is greedy, and will match as much text as possible on the line, up to just before the last semicolon (but not beyond, because then the entire regex won't match at all).
Combining a number with /g does not get you there, either. A numeric flag such as /3 replaces only the third match on each line, /g replaces every match, and the combination is left unspecified by POSIX (GNU sed reads it as "the third match and every one after it"). None of these means "the first three".
"No changes" seems wrong, though; if your regex matched at all, the last semicolon on matching lines will have been replaced.
Some regex engines support non-greedy matching, but sed isn't one of them. As long as there is a single delimiter character you can use to limit the greediness, using that is a much better solution anyway. In your case, simply replace . with [^;] to say "any character except (newline or) semicolon" instead of "any character (except newline)."
Since a plain s command replaces only the first match, repeat it three times to handle the first three terms:
sed 's/\(A.=[^;]*\);/\1 /; s/\(A.=[^;]*\);/\1 /; s/\(A.=[^;]*\);/\1 /' allChr_ExAC38.hg38_multianno.txt
(This will print to standard output for verification; put back the -i .bk once you see the result is correct.)
Based on your example data, perhaps consider replacing the remaining . in the expression with [A-Z] and [^;] with [^;=] or even [0-9]. The more specific you can make your regex, the better.
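The numeric flag is easy to misread, so here is a quick check of what it does. In sed, s/;/ /3 touches only the third match on a line, while repeating a plain substitution three times touches the first three. The sketch below uses a space instead of a tab so the difference is visible, and the function names third_only and first_three are made up for the illustration:

```shell
# s/;/ /3 changes only the third ';' on the line.
third_only() { sed 's/;/ /3'; }

# Repeating a plain s/;/ / three times changes the first three ';'.
first_three() { sed 's/;/ /; s/;/ /; s/;/ /'; }

printf 'AC=2;AF=0;AN=18332;DP=76196;FS=0\n' | third_only
# AC=2;AF=0;AN=18332 DP=76196;FS=0
printf 'AC=2;AF=0;AN=18332;DP=76196;FS=0\n' | first_three
# AC=2 AF=0 AN=18332 DP=76196;FS=0
```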
Could you please try following and let me know if this helps you.
awk '
FNR==1{
print;
next}
{
num=split($(NF-1),array,";");
for(i=4;i<=num;i++){
val=val?val ";"array[i]:array[i]};
$(NF-1)=array[1] OFS array[2] OFS array[3] OFS val;
val="";
$1=$1
}
1
' OFS="\t" Input_file
This might work for you (GNU sed):
sed -i.bak 's/;/\n/3;h;y/;/\t/;G;s/\n.*\n/\t/' file
Replace the third ; with a newline, make a copy of the line, replace all ;'s with \t's, append the copy and replace the end of the first line to the middle of the second line with a \t.
Since by definition a line is demarcated by a newline, lines cannot contain a newline unless they are introduced by a programmer.
If the number of occurrences is reasonable you can pipe sed multiple times i.e.
sed -E -e 's/[0-9]{4}/****/'| sed -E -e 's/[0-9]{4}/****/'| sed -E -e 's/[0-9]{4}/****/'
will mask the first three 4-digit groups of a credit card number, like so:
Input:
1234 5678 9101 1234
Output:
**** **** **** 1234

Data losing original format

I am relatively new to powershell and having a bit of a strange problem with a script. I have searched the forums and haven't been able to find anything that works.
The issue I am having is that when I convert the output of commands to and from base64 for transport via a custom protocol we use in our environment, it loses its formatting. Commands are executed on the remote systems by passing the command string to IEX and storing the output in a variable. I convert the output to base64 using the following commands:
$Bytes = [System.Text.Encoding]::Unicode.GetBytes($str1)
$EncodedCmd = [Convert]::ToBase64String($Bytes)
At the other end, when we receive the output, we convert it back using the command
[System.Text.Encoding]::Unicode.GetString([System.Convert]::FromBase64String($EncodedCmd))
The problem I am having is that although the output is correct the formatting of the output has been lost. For example if I run the ipconfig command
Windows IP Configuration Ethernet adapter Local Area Connection 2: Media State . . . . . . . . . . . : Media disconnected Connection-specific DNS Suffix . : Ethernet
adapter Local Area Connection 3: Connection-specific DNS Suffix . : Link-local IPv6 Address . . . . . : fe80::3cd8:3c7f:c78b:a78f%14 IPv4 Address. . . . . . . . . . .
: 192.168.10.64 Subnet Mask . . . . . . . . . . . : 255.255.255.0 Default Gateway . . . . . . . . . : 192.168.10.100 Ethernet adapter Local Area Connection: Connection-sp
ecific DNS Suffix . : IPv4 Address. . . . . . . . . . . : 172.10.15.201 Subnet Mask . . . . . . . . . . . : 255.255.255.0 Default Gateway . . . . . . . . . : 172.10.15
1.200 Tunnel adapter isatap.{42EDCBE-8172-5478-AD67E-8A28273E95}: Media State . . . . . . . . . . . : Media disconnected Connection-specific DNS Suffix . : Tunnel ada
pter isatap.{42EDCBE-8172-5478-AD67E-8A28273E95}: Media State . . . . . . . . . . . : Media disconnected Connection-specific DNS Suffix . : Tunnel adapter isatap.{42EDCBE-8172-5478-AD67E-8A28273E95}: Media State . . . . . . . . . . . : Media disconnected Connection-specific DNS Suffix . : Tunnel adapter Teredo Tunneling Pseudo-Inter
face: Media State . . . . . . . . . . . : Media disconnected Connection-specific DNS Suffix . :
The formatting is all over the place and hard to read. I have played around with it a bit, but I can't find a really good way of returning the command output in the correct format. I'd appreciate any ideas on how I can fix the formatting.
What happens here is that the $str1 variable is an array of strings. It doesn't contain newline characters; each line is a separate element of the array.
When the variable is converted to bytes for Base64 encoding, all the elements of the array are concatenated together. This can be seen easily enough:
$Bytes[43..60] | % { "$_ -> " + [char] $_}
0 ->
105 -> i
0 ->
111 -> o
0 ->
110 -> n
0 ->
32 ->
0 ->
32 ->
0 ->
32 ->
0 ->
69 -> E
0 ->
116 -> t
0 ->
104 -> h
Here the 0 bytes come from the two-byte Unicode (UTF-16) encoding. Pay attention to 32, which is the space character. So one sees that there are only spaces between the lines, and no line terminators, in the source string
Windows IP Configuration
Ethernet
As a solution, either add line feed characters or serialize the whole array as XML.
Adding line feed characters is done by joining the array elements with -join, using [Environment]::NewLine as the separator. Like so,
$Bytes = [System.Text.Encoding]::Unicode.GetBytes( $($str1 -join [environment]::newline))
$Bytes[46..67] | % { "$_ -> " + [char] $_}
105 -> i
0 ->
111 -> o
0 ->
110 -> n
0 ->
13 ->
0 ->
10 ->
0 ->
13 ->
0 ->
10 ->
0 ->
13 ->
0 ->
10 ->
0 ->
69 -> E
0 ->
116 -> t
0 ->
Here, the 13 and 10 are the CR and LF characters that Windows uses as a line break. After adding the line breaks, the resulting string looks like the source. Be aware that though it looks the same, it is not the same: the source is an array of strings, while the outcome is a single string containing line breaks.
If you must preserve the original, serialization is the way to go.

Pass "file name" from a text file to a command line where each line of a file is file name

I'm running the following code
git log --pretty=format: --numstat -- SOMEFILENAME |
perl -ane '$i += ($F[0]-$F[1]); END{print "changed: $i\n"}' \
>> random.txt
What this does is take a file named "SOMEFILENAME" and append its net number of changed lines (added minus removed, summed over all commits) to a text file called "random.txt".
I need to run this program on every file in the repository, and there are lots of them. What would be an easy way to do this?
If you want a total per file:
git log --pretty=format: --numstat |
perl -ane'
$c{$F[2]} += $F[0]-$F[1] if $F[2];
END { print "$_\t$c{$_}\n" for sort keys %c }
' >random.txt
If you want a single total:
git log --pretty=format: --numstat |
perl -ane'
$c += $F[0]-$F[1];
END { print "$c\n" }
' >random.txt
Their respective outputs are:
.gitignore 22
Build.PL 48
CHANGES.txt 0
Changes 25
LICENSE 132
LICENSE.txt 0
MANIFEST 18
MANIFEST.SKIP 9
README.txt 67
TODO.txt 1
lib/feature/qw_comments.pm 129
lib/feature/qw_comments.xs 250
t/00_load.t 13
t/01_basic.t 85
t/02_pragma.t 56
t/03_line_numbers.t 37
t/04_errors.t 177
t/05-unicode.t 39
t/devel-pod-coverage.t 26
t/pod.t 17
and
1151
Rather than use find, you can just let git give you all the files by using the name . (representing the current directory). With that, here's a version using awk that prints out stats per file:
git log --pretty=format: --numstat -- . |
awk '
NF == 3 {changed[$3] += $1 - $2}
END { for (name in changed) { printf("%s: %d changed\n", name, changed[name]); } }
'
And an even shorter one that prints a single overall changed line:
git log --pretty=format: --numstat -- . |
awk '
NF == 3 {changed += $1 - $2}
END { printf("%d changed\n", changed); }
'
(The NF == 3 is to account for the fact that git seems to print spurious blank lines in its output. I didn't try to figure out if there's a better git command.)
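Two details of the --numstat format can trip up the per-file version: the columns are tab-separated, so file names containing spaces break whitespace splitting, and binary files are reported with - in place of the line counts. A hedged sketch that handles both; sum_numstat is a hypothetical helper name, and the sample input is made up rather than taken from a real repository:

```shell
# sum_numstat (hypothetical name): read `git log --numstat` output on stdin and
# print "file<TAB>added-minus-removed" per file, sorted by file name.
sum_numstat() {
    awk -F'\t' '
    # Three tab-separated columns, and not a binary entry ("-" counts).
    NF == 3 && $1 != "-" { changed[$3] += $1 - $2 }
    END { for (name in changed) printf "%s\t%d\n", name, changed[name] }
    ' | sort
}

# Made-up numstat-shaped input, including a blank line and a binary file.
printf '3\t1\ta.txt\n\n-\t-\timg.png\n5\t0\tmy file.c\n2\t2\ta.txt\n' | sum_numstat

# In a repository the intended use would be:
#   git log --pretty=format: --numstat -- . | sum_numstat
```

The trailing sort is there only because awk's for (name in changed) visits keys in an unspecified order.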