Given the following in 8-bit 2s complement numbers:
11000011 = -61 (decimal)
00011111 = +31 (decimal)
I am required to obtain a boolean expression of a logic circuit whose output out goes high when its 8-bit input in (also in 2s complement representation) is in the following range:
-61 < in < 31
Number line for 8 bit numbers (2s complement):
10000000 (most negative) ..... 11000011 (-61) ..... 00000000 ..... 00011111 (31) ..... 01111111 (most positive)
Is there any way of solving this problem besides brute-forcing and comparing bit-by-bit?
Edit: The following statement is not allowed
out = ((in < 11000011 && in > 10000000) || (in > 00011111 && in < 01111111)) ? 1'b0 : 1'b1;
I'm not sure whether there is a faster way to do this, but what I did was list the numbers in 2's complement format and look for a pattern. The chunks of numbers below are sorted by bit pattern (from 00000000 to 11111111) so that the pattern is easier to see.
Let the MSB be A and the LSB be H. The equation is: out = A B C + A B D + A B E + A B F + A' B' C' D' + A' B' C' E' + A' B' C' F' + A' B' C' G' + A' B' C' H', which factors to A B (C + D + E + F) + A' B' C' (D' + E' + F' + G' + H').
A' B' C' D' (easiest to observe):
00000000 (<- min)
00000001
00000010
00000011
00000100
00000101
00000110
00000111
00001000
00001001
00001010
00001011
00001100
00001101
00001110
00001111
A' B' C' E' + A' B' C' F' + A' B' C' G' + A' B' C' H':
00010000
00010001
00010010
00010011
00010100
00010101
00010110
00010111
00011000
00011001
00011010
00011011
00011100
00011101
00011110
A B D + A B E + A B F:
11000100
11000101
11000110
11000111
11001000
11001001
11001010
11001011
11001100
11001101
11001110
11001111
11010000
11010001
11010010
11010011
11010100
11010101
11010110
11010111
11011000
11011001
11011010
11011011
11011100
11011101
11011110
11011111
A B C (easiest to observe):
11100000
11100001
11100010
11100011
11100100
11100101
11100110
11100111
11101000
11101001
11101010
11101011
11101100
11101101
11101110
11101111
11110000
11110001
11110010
11110011
11110100
11110101
11110110
11110111
11111000
11111001
11111010
11111011
11111100
11111101
11111110
11111111 (<-max)
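One way to double-check the expression is to evaluate it for all 256 input patterns and compare the result with the required range -61 < in < 31. The sketch below does that in Python; the function names (to_signed8, out_expr, mismatches) are made up for this illustration, and the code is only a software check, not a description of the circuit itself.

```python
def to_signed8(value: int) -> int:
    """Interpret an 8-bit pattern (0..255) as a 2's complement integer."""
    return value - 256 if value & 0x80 else value


def out_expr(value: int) -> bool:
    """Evaluate the sum-of-products expression; A is the MSB and H is the LSB."""
    a, b, c, d, e, f, g, h = (bool((value >> shift) & 1) for shift in range(7, -1, -1))
    # A B C + A B D + A B E + A B F
    negative_part = a and b and (c or d or e or f)
    # A'B'C'D' + A'B'C'E' + A'B'C'F' + A'B'C'G' + A'B'C'H'
    positive_part = (not a) and (not b) and (not c) and not (d and e and f and g and h)
    return negative_part or positive_part


def mismatches() -> list[int]:
    """Return every pattern where the expression disagrees with -61 < in < 31."""
    return [v for v in range(256) if out_expr(v) != (-61 < to_signed8(v) < 31)]


if __name__ == "__main__":
    # An empty list means the expression matches the range for every input.
    print(mismatches())
```

An empty result from mismatches() means the expression and the range check agree on every 8-bit input.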
I am using two different systems:
A: SUSE Linux Enterprise Server 11 (x86_64) VERSION = 11 PATCHLEVEL = 2 with GPL Ghostscript 8.62
B: SUSE Linux Enterprise Server 12 (x86_64) VERSION = 12 PATCHLEVEL = 5 with GPL Ghostscript 9.53.3
I convert the same postscript file jjtest3.ps to png using the same command:
/usr/bin/gs -I/wws/daten/var -dNOPAUSE -sDEVICE=png16m -dUseFastColor=true -r120x120 -sOutputFile=output.png jjtest3.ps -c quit;
An example of output.png from system A is in the file sles11_2.png (see the link below).
An example of output.png from system B is in the file sles12_5.png (see the link below).
The problem is that the PNG created on system B is lower in quality than the one from system A. Some thin lines are missing; see the examples.
The GS option -dGraphicsAlphaBits=4 yields a better result, but I don't want to use it, because the output device is an e-ink display that uses only 100% black, white and red. So the resulting PNG should have only 2 colors (red, white).
My question is: which Ghostscript options under GS version 9.53 would yield the same (or a better) result as under version 8.62? Or are there any other ideas?
I think a downgrade to the older GS version on SLES12 system is not really a good option for the future.
sles11_2.png, sles12_5.png
The content of the Postscript file jjtest3.ps is:
%!PS-Adobe-3.0
%%CreationDate: 20201229 16:26:53
%%Pages: 1
%%Orientation: Portrait
%%BoundingBox: 0 0 240 180
%%LanguageLevel: 2
%%EndComments
%%BeginProlog
/lib_logo_nr_00639 {
% BEGIN_LOGO 00639 ***** Logo ESL Jetzt in Aktion *****
/X 284 def /Y 170 def % Breite und Hoehe des gesamten Feldes
/weiss { 0.0 0.0 0.0 0.0 setcmykcolor } def
/rot { 0.0 1.0 1.0 0.0 setcmykcolor } def
/s {stroke} def /sl {setlinewidth } def
/c {curveto} def /l {lineto} def /m {moveto} def /f {fill} def
/HG_rot { 153.0 2.9 m 207.9 23.8 l 211.2 24.9 215.4 27.6 215.4 34.5 c
215.4 158.6 l 215.4 158.6 215.4 169.9 204.0 169.9 c 79.9 169.9 l
79.9 169.9 68.5 169.9 68.5 158.6 c 68.5 34.5 l 68.5 34.5 68.5 27.0 74.6 24.2 c
78.8 22.6 l 134.4 1.4 l 134.4 1.4 137.4 0.0 141.7 0.0 c
146.0 0.0 153.0 2.9 153.0 2.9 c 153.0 2.9 l } def
/Innen { 199.7 55.5 m 199.7 97.8 l 193.4 97.8 l 193.4 87.2 l
193.4 84.0 193.6 78.2 193.6 75.3 c 192.8 78.5 191.2 84.0 190.1 87.5 c
186.7 97.8 l 180.6 97.8 l 180.6 55.5 l 186.9 55.5 l 186.9 69.2 l
186.9 72.9 186.8 78.7 186.7 81.7 c 187.5 78.6 189.2 72.7 190.5 68.5 c
194.6 55.5 l 199.7 55.5 l 170.2 76.5 m 170.2 63.7 169.5 61.0 167.3 61.0 c
165.2 61.0 164.4 63.8 164.4 76.8 c 164.4 89.6 165.0 92.3 167.2 92.3 c
169.4 92.3 170.2 89.5 170.2 76.5 c 170.2 76.5 l 177.2 76.8 m
177.2 91.2 174.8 98.5 167.3 98.5 c 159.8 98.5 157.4 91.3 157.4 76.5 c
157.4 61.9 159.7 54.8 167.2 54.8 c 174.8 54.8 177.2 62.0 177.2 76.8 c
177.2 76.8 l 153.9 55.5 m 153.9 97.8 l 147.1 97.8 l 147.1 55.5 l 153.9 55.5 l
144.6 91.5 m 144.6 97.8 l 126.3 97.8 l 126.3 91.5 l 132.0 91.5 l 132.0 55.5 l
138.8 55.5 l 138.8 91.5 l 144.6 91.5 l 126.2 55.5 m 120.0 82.1 l 126.0 97.8 l
119.1 97.8 l 116.5 90.3 l 115.6 87.9 114.4 83.9 113.8 81.3 c
113.9 83.9 113.9 86.9 113.9 90.0 c 113.9 97.8 l 107.2 97.8 l 107.2 55.5 l
113.9 55.5 l 113.9 71.8 l 115.0 74.6 l 119.1 55.5 l 126.2 55.5 l 96.3 70.9 m
92.5 70.9 l 93.4 78.6 l 93.8 81.6 94.2 85.1 94.3 88.5 c
94.5 85.1 94.9 81.7 95.3 78.7 c 96.3 70.9 l 104.9 55.5 m 98.1 97.8 l
90.9 97.8 l 84.2 55.5 l 90.9 55.5 l 92.0 64.8 l 96.9 64.8 l 98.1 55.5 l
104.9 55.5 l 198.6 101.8 m 198.6 138.3 l 195.4 138.3 l 195.4 121.4 l
195.4 117.8 195.5 114.1 195.6 111.9 c 194.9 114.9 193.6 119.1 192.3 123.1 c
187.4 138.3 l 184.1 138.3 l 184.1 101.8 l 187.2 101.8 l 187.2 120.5 l
187.2 124.1 187.2 127.8 187.1 130.0 c 187.7 126.9 189.0 122.8 190.3 118.8 c
195.9 101.8 l 198.6 101.8 l 178.4 101.8 m 178.4 138.3 l 175.1 138.3 l
175.1 101.8 l 178.4 101.8 l 162.4 135.2 m 162.4 138.3 l 148.5 138.3 l
148.5 135.2 l 153.8 135.2 l 153.8 101.8 l 157.1 101.8 l 157.1 135.2 l
162.4 135.2 l 146.0 101.8 m 146.0 105.0 l 137.1 105.0 l 146.0 135.3 l
146.0 138.3 l 134.5 138.3 l 134.5 135.1 l 142.6 135.1 l 133.7 104.8 l
133.7 101.8 l 146.0 101.8 l 131.8 135.2 m 131.8 138.3 l 117.9 138.3 l
117.9 135.2 l 123.2 135.2 l 123.2 101.8 l 126.5 101.8 l 126.5 135.2 l
131.8 135.2 l 115.8 101.8 m 115.8 105.1 l 107.0 105.1 l 107.0 119.7 l
112.8 119.7 l 112.8 122.9 l 107.0 122.9 l 107.0 135.0 l 115.3 135.0 l
115.3 138.3 l 103.7 138.3 l 103.7 101.8 l 115.8 101.8 l 98.3 110.3 m
98.3 138.3 l 95.0 138.3 l 95.0 110.4 l 95.0 106.2 93.9 104.5 92.1 104.5 c
90.3 104.5 89.2 106.2 88.6 111.0 c 85.3 110.4 l
86.1 104.5 87.9 101.3 92.0 101.3 c 95.8 101.3 98.3 104.3 98.3 110.3 c
98.3 110.3 l 204.6 38.5 m 79.3 38.5 l 79.3 40.6 l 204.6 40.6 l 204.6 38.5 l
204.6 34.5 m 79.3 34.5 l 79.3 36.6 l 204.6 36.6 l 204.6 34.5 l 204.6 157.1 m
79.3 157.1 l 79.3 159.2 l 204.6 159.2 l 204.6 157.1 l 204.6 153.1 m 79.3 153.1 l
79.3 155.2 l 204.6 155.2 l 204.6 153.1 l 204.6 38.5 m 79.3 38.5 l 79.3 40.6 l
204.6 40.6 l 204.6 38.5 l } def
/Rahmen { 152.6 3.8 m 207.6 24.7 l 207.6 24.7 l
209.1 25.2 210.8 26.1 212.2 27.6 c 213.5 29.1 214.4 31.3 214.4 34.5 c
214.4 158.6 l 214.4 158.6 l 214.4 158.9 214.3 169.0 204.0 169.0 c
204.0 169.0 l 79.9 169.0 l 79.6 169.0 69.5 168.9 69.5 158.6 c
69.5 34.5 l 69.5 34.5 l 69.5 34.2 69.6 27.6 75.0 25.0 c 75.0 25.0 l 79.1 23.5 l
79.1 23.5 l 134.8 2.3 l 134.8 2.3 l 134.8 2.3 137.6 0.9 141.7 0.9 c
145.8 0.9 152.6 3.8 152.6 3.8 c 152.6 3.8 l 208.2 22.9 m 153.3 2.1 l 153.3 2.1 l
152.8 1.8 146.0 -0.9 141.7 -0.9 c 137.5 -0.9 134.4 0.4 134.0 0.5 c
78.5 21.7 l 78.4 21.7 l 74.2 23.3 l 74.2 23.3 l 67.6 26.4 67.6 34.4 67.6 34.4 c
67.6 34.5 l 67.6 158.6 l 67.6 170.9 79.8 170.9 79.9 170.9 c
79.9 170.9 l 204.0 170.9 l 204.0 170.9 l 216.3 170.9 216.3 158.7 216.3 158.6 c
216.3 158.6 l 216.3 34.5 l 216.3 30.8 215.2 28.2 213.6 26.4 c
212.0 24.5 210.0 23.5 208.2 22.9 c } def gsave 1 sl weiss Rahmen s rot HG_rot f
weiss Innen f grestore
% END_LOGO
} def
%%EndProlog
%%BeginSetup
%%BeginFeature: *InputSlot Tray1
<< /InputAttributes << /Priority [0] >> >> setpagedevice
%%EndFeature
%%BeginFeature: *PageSize E8
<< /Policies << /PageSize 2 >> /PageSize [240 180] >> setpagedevice
%%EndFeature
%%BeginFeature: *Trimbox Offset
%%<< /PDFXTrimBoxToMediaBoxOffset [0 0 0 0] >> setdistillerparams
%%EndFeature
%%EndSetup
%%Page: 1 1
newpath
0 0 moveto
0 180 lineto
240 180 lineto
240 0 lineto
closepath
gsave
1 0 0 setrgbcolor
fill
grestore
/g_kf_xabs 0 def /g_kf_yabs 90.72 def
g_kf_xabs g_kf_yabs moveto currentpoint translate .4836 .4836 scale 0 rotate
lib_logo_nr_00639
showpage
Right here I have an example of an input that I want to use to generate a specific output.
1.2840 -1.6830 1.4460 C
1.5660 -0.8240 0.2163 C
0.5584 0.2995 -0.0595 C
-0.8805 -0.1514 -0.2412 C
-1.6205 -0.3785 0.8741 O
-1.4770 0.3883 2.0816 C
-1.3875 -0.2971 -1.3503 O
-2.0561 1.7788 1.8987 C
1.8097 -1.5560 -0.8246 O
2.8979 -0.2777 0.3226 N
1.2555 -1.0711 2.3543 H
0.3266 -2.2122 1.3666 H
2.0525 -2.4514 1.6103 H
0.8193 0.8445 -0.9811 H
0.6122 1.0718 0.7144 H
-0.4400 0.4244 2.4204 H
-2.0425 -0.1400 2.8563 H
-3.0985 1.7209 1.5688 H
-1.5098 2.3359 1.1312 H
-2.0131 2.3425 2.8349 H
3.1017 0.3084 1.1257 H
3.4860 -1.1139 0.2817 H
3.0712 0.1760 -0.5761 H
The last column corresponds to atoms of a molecule and the other columns are the coordinates of these atoms.
The original file has many lines like the ones in this example, plus a lot of other content. Using some commands (with grep), I could catch these lines and isolate them in a single file.
But the problem is that the column with the atoms is to the right of the coordinates, and I need it to be to the left of the coordinates in my output. Is there any way to do this?
I was thinking of grep because I'm using it to extract the coordinates and atoms from my original file, but it has not worked so far.
Below is what I need, exactly:
C 1.2840 -1.6830 1.4460
C 1.5660 -0.8240 0.2163
C 0.5584 0.2995 -0.0595
C -0.8805 -0.1514 -0.2412
O -1.6205 -0.3785 0.8741
C -1.4770 0.3883 2.0816
O -1.3875 -0.2971 -1.3503
C -2.0561 1.7788 1.8987
O 1.8097 -1.5560 -0.8246
N 2.8979 -0.2777 0.3226
H 1.2555 -1.0711 2.3543
H 0.3266 -2.2122 1.3666
H 2.0525 -2.4514 1.6103
H 0.8193 0.8445 -0.9811
H 0.6122 1.0718 0.7144
H -0.4400 0.4244 2.4204
H -2.0425 -0.1400 2.8563
H -3.0985 1.7209 1.5688
H -1.5098 2.3359 1.1312
H -2.0131 2.3425 2.8349
H 3.1017 0.3084 1.1257
H 3.4860 -1.1139 0.2817
H 3.0712 0.1760 -0.5761
awk '{print $4 " " $1 " " $2 " " $3 }' /tmp/atom.txt
Will give you
C 1.2840 -1.6830 1.4460
C 1.5660 -0.8240 0.2163
C 0.5584 0.2995 -0.0595
C -0.8805 -0.1514 -0.2412
O -1.6205 -0.3785 0.8741
C -1.4770 0.3883 2.0816
O -1.3875 -0.2971 -1.3503
C -2.0561 1.7788 1.8987
O 1.8097 -1.5560 -0.8246
N 2.8979 -0.2777 0.3226
H 1.2555 -1.0711 2.3543
H 0.3266 -2.2122 1.3666
H 2.0525 -2.4514 1.6103
H 0.8193 0.8445 -0.9811
H 0.6122 1.0718 0.7144
H -0.4400 0.4244 2.4204
H -2.0425 -0.1400 2.8563
H -3.0985 1.7209 1.5688
H -1.5098 2.3359 1.1312
H -2.0131 2.3425 2.8349
H 3.1017 0.3084 1.1257
H 3.4860 -1.1139 0.2817
H 3.0712 0.1760 -0.5761
You can use awk with printf to get the aligned columns:
awk '{ printf "%s %8s %9s %9s\n", $4, $1, $2, $3}' test.txt
which gives:
C 1.2840 -1.6830 1.4460
C 1.5660 -0.8240 0.2163
C 0.5584 0.2995 -0.0595
C -0.8805 -0.1514 -0.2412
O -1.6205 -0.3785 0.8741
C -1.4770 0.3883 2.0816
O -1.3875 -0.2971 -1.3503
C -2.0561 1.7788 1.8987
O 1.8097 -1.5560 -0.8246
N 2.8979 -0.2777 0.3226
H 1.2555 -1.0711 2.3543
H 0.3266 -2.2122 1.3666
H 2.0525 -2.4514 1.6103
H 0.8193 0.8445 -0.9811
H 0.6122 1.0718 0.7144
H -0.4400 0.4244 2.4204
H -2.0425 -0.1400 2.8563
H -3.0985 1.7209 1.5688
H -1.5098 2.3359 1.1312
H -2.0131 2.3425 2.8349
H 3.1017 0.3084 1.1257
H 3.4860 -1.1139 0.2817
H 3.0712 0.1760 -0.5761
With sed you can take the last character and put it first:
sed "s/^\(.*\)\(.\)$/\2\1/g" /tmp/atom.txt
C 1.2840 -1.6830 1.4460
C 1.5660 -0.8240 0.2163
C 0.5584 0.2995 -0.0595
C -0.8805 -0.1514 -0.2412
O -1.6205 -0.3785 0.8741
C -1.4770 0.3883 2.0816
O -1.3875 -0.2971 -1.3503
C -2.0561 1.7788 1.8987
O 1.8097 -1.5560 -0.8246
N 2.8979 -0.2777 0.3226
H 1.2555 -1.0711 2.3543
H 0.3266 -2.2122 1.3666
H 2.0525 -2.4514 1.6103
H 0.8193 0.8445 -0.9811
H 0.6122 1.0718 0.7144
H -0.4400 0.4244 2.4204
H -2.0425 -0.1400 2.8563
H -3.0985 1.7209 1.5688
H -1.5098 2.3359 1.1312
H -2.0131 2.3425 2.8349
H 3.1017 0.3084 1.1257
H 3.4860 -1.1139 0.2817
H 3.0712 0.1760 -0.5761
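For comparison, the same column move can be sketched in Python. This is only an illustration of the idea behind the awk answers (split on whitespace, emit the last field first); the function name move_last_column_first is invented here and the sample lines are copied from the question.

```python
def move_last_column_first(line: str) -> str:
    """Return the line with its last whitespace-separated field moved to the front."""
    fields = line.split()
    if len(fields) < 2:
        # Nothing to reorder on empty or single-field lines.
        return line.rstrip("\n")
    return " ".join([fields[-1]] + fields[:-1])


if __name__ == "__main__":
    sample = [
        "1.2840 -1.6830 1.4460 C",
        "-1.6205 -0.3785 0.8741 O",
    ]
    for row in sample:
        print(move_last_column_first(row))
```

Splitting on whitespace, as awk does by default, makes the result independent of how many spaces separate the columns in the input.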
Using the matlab function fracfactgen I can generate the generators for a two-level fractional-factorial design. Let's say that I have Factor = 7 and I need the generators for a design plan of resolution 3.
generators = fracfactgen('a b c d e f g',[],3)
generators =
'a'
'b'
'c'
'abc'
'bc'
'ac'
'ab'
Now, I know that this is just one of the 16 possible alternatives for building a 2^(7-4) DoE plan, so how can I obtain all possible generator combinations?
Please Note: The other combinations are:
case 1
'a b c -ab -ac -bc -abc'
case 2
'a b c -ab -ac -bc abc'
case 3
'a b c -ab -ac bc -abc'
case 4
'a b c -ab -ac bc abc'
case 5
'a b c -ab ac -bc -abc'
case 6
'a b c -ab ac -bc abc'
case 7
'a b c -ab ac bc -abc'
case 8
'a b c -ab ac bc abc'
case 9
'a b c ab -ac -bc -abc'
case 10
'a b c ab -ac -bc abc'
case 11
'a b c ab -ac bc -abc'
case 12
'a b c ab -ac bc abc'
case 13
'a b c ab ac -bc -abc'
case 14
'a b c ab ac -bc abc'
case 15
'a b c ab ac bc -abc'
case 16
'a b c ab ac bc abc'
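The 16 cases above are every sign choice for the four interaction generators, so they can be enumerated mechanically. The sketch below builds the strings in Python, in the same order as the list; the function name generator_strings is invented for this example, and it only produces the text of each case. Passing those strings on to MATLAB is not shown.

```python
from itertools import product


def generator_strings(base=("a", "b", "c"), interactions=("ab", "ac", "bc", "abc")):
    """Build one generator string per sign combination of the interaction terms."""
    results = []
    # Each interaction is either negated ("-") or left as is (""),
    # giving 2**len(interactions) combinations.
    for signs in product(("-", ""), repeat=len(interactions)):
        terms = list(base) + [sign + term for sign, term in zip(signs, interactions)]
        results.append(" ".join(terms))
    return results


if __name__ == "__main__":
    for number, text in enumerate(generator_strings(), start=1):
        print(f"case {number}: '{text}'")
```

Because product varies the last position fastest, case 1 has all four interactions negated and case 16 has none negated, matching the list in the question.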
I have some text files and I need to remove the first character from the fourth column only if the column has four characters
file1 as follows
ATOM 5181 N AMET K 406 12.440 6.552 25.691 0.50 7.37 N
ATOM 5182 CA AMET K 406 13.685 5.798 25.578 0.50 5.87 C
ATOM 5183 C AMET K 406 14.045 5.179 26.909 0.50 5.07 C
ATOM 5184 O MET K 406 14.595 4.083 27.003 0.50 7.07 O
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5202 N AARG K 408 12.186 3.982 29.147 0.50 6.55 N
file2 as follows
ATOM 41 CA ATRP A 6 -18.975 -29.894 -7.425 0.50 19.50 C
ATOM 42 CA BTRP A 6 -18.979 -29.890 -7.428 0.50 19.16 C
ATOM 43 C HIS A 6 -18.091 -29.845 -8.669 1.00 19.84 C
ATOM 44 O HIS A 6 -17.015 -30.452 -8.696 1.00 20.10 O
ATOM 45 CB ASER A 9 -18.499 -28.879 -6.370 0.50 19.73 C
ATOM 46 CB BSER A 9 -18.565 -28.837 -6.367 0.50 19.13 C
ATOM 47 CG CHIS A 12 -19.421 -27.711 -6.216 0.50 21.30 C
Desired output
file1
ATOM 5181 N MET K 406 12.440 6.552 25.691 0.50 7.37 N
ATOM 5182 CA MET K 406 13.685 5.798 25.578 0.50 5.87 C
ATOM 5183 C MET K 406 14.045 5.179 26.909 0.50 5.07 C
ATOM 5184 O MET K 406 14.595 4.083 27.003 0.50 7.07 O
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5202 N ARG K 408 12.186 3.982 29.147 0.50 6.55 N
file2
ATOM 41 CA TRP A 6 -18.975 -29.894 -7.425 0.50 19.50 C
ATOM 42 CA TRP A 6 -18.979 -29.890 -7.428 0.50 19.16 C
ATOM 43 C HIS A 6 -18.091 -29.845 -8.669 1.00 19.84 C
ATOM 44 O HIS A 6 -17.015 -30.452 -8.696 1.00 20.10 O
ATOM 45 CB SER A 9 -18.499 -28.879 -6.370 0.50 19.73 C
ATOM 46 CB SER A 9 -18.565 -28.837 -6.367 0.50 19.13 C
ATOM 47 CG HIS A 12 -19.421 -27.711 -6.216 0.50 21.30 C
This might work for you (GNU sed):
sed -r 's/^((\S+\s+){3})\S(\S{3}\s)/\1 \3/' file
This replaces the first character of the fourth column with a space if that column has four non-space characters.
Use the length() function to find the length of the column and the substr() function to print the substring you need:
$ awk 'length($4)==4{$4=substr($4,2)}1' file | column -t
ATOM 5181 N MET K 406 12.440 6.552 25.691 0.50 7.37 N
ATOM 5182 CA MET K 406 13.685 5.798 25.578 0.50 5.87 C
ATOM 5183 C MET K 406 14.045 5.179 26.909 0.50 5.07 C
ATOM 5184 O MET K 406 14.595 4.083 27.003 0.50 7.07 O
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5202 N ARG K 408 12.186 3.982 29.147 0.50 6.55 N
Piping to column -t rebuilds a nice table format. To store the changes in a new file, use the redirection operator:
$ awk 'length($4)==4{$4=substr($4,2)}1' file | column -t > new_file
With sed you could do:
$ sed -r 's/^((\S+\s+){3})\S(\S{3}\s)/\1\3/' file
ATOM 5181 N MET K 406 12.440 6.552 25.691 0.50 7.37 N
ATOM 5182 CA MET K 406 13.685 5.798 25.578 0.50 5.87 C
ATOM 5183 C MET K 406 14.045 5.179 26.909 0.50 5.07 C
ATOM 5184 O MET K 406 14.595 4.083 27.003 0.50 7.07 O
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5202 N ARG K 408 12.186 3.982 29.147 0.50 6.55 N
To store the changes back to the original file you can use the -i option:
$ sed -ri 's/^((\S+\s+){3})\S(\S{3}\s)/\1\3/' file
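The same substitution can be sketched in Python with the re module. This mirrors the first sed variant (the dropped character is replaced by a space so that later columns keep their positions); the name strip_fourth_field_prefix is invented for this illustration, and the pattern assumes the line starts directly with the first field, as in the samples.

```python
import re

# Three leading fields with their separators, then a fourth field of exactly
# four non-space characters followed by whitespace.
_FOURTH_FIELD = re.compile(r"^((?:\S+\s+){3})\S(\S{3}\s)")


def strip_fourth_field_prefix(line: str) -> str:
    """Blank out the first character of the fourth field if it has four characters."""
    return _FOURTH_FIELD.sub(r"\1 \2", line, count=1)


if __name__ == "__main__":
    sample = [
        "ATOM 5181 N AMET K 406 12.440 6.552 25.691 0.50 7.37 N",
        "ATOM 5184 O MET K 406 14.595 4.083 27.003 0.50 7.07 O",
    ]
    for row in sample:
        print(strip_fourth_field_prefix(row))
```

Lines whose fourth field has three or five characters do not match the pattern and are returned unchanged.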
I am fairly new to programming and am trying to solve this problem. I have a file like this:
CHROM POS REF ALT 10_sample.bam 11_sample.bam 12_sample.bam 13_sample.bam 14_sample.bam 15_sample.bam 16_sample.bam
tg93 77 T C T T T T T
tg93 79 C - C C C - -
tg93 79 C G C C C C G C
tg93 80 G A G G G G A A G
tg93 81 A C A A A A C C C
tg93 86 C A C C A A A A C
tg93 105 A G A A A A A G A
tg93 108 A G A A A A G A A
tg93 114 T C T T T T T C T
tg93 131 A C A A A A A A A
tg93 136 G C C G C C G G G
tg93 150 CTCTC - CTCTC - CTCTC CTCTC
In this file, in the heading
CHROM - name
POS - position
REF - reference
ALT - alternate
10_sample.bam to 16_sample.bam - samples
Now I want to see how many times the letters in the REF and ALT columns occur among the samples. If either of them occurs fewer than two times, I need to delete that row.
For example:
In the first row, I have 'T' in REF and 'C' in ALT. In the 7 samples there are 5 T's, 2 blanks and no C, so I need to delete this row.
In the second row, REF is 'C' and ALT is '-'. In the seven samples we have 3 C's, 2 '-'s and 2 blanks, so we keep this row, as both C and - occur at least 2 times.
Blanks are always ignored while counting.
The final file after filtering is
#CHROM POS REF ALT 10_sample.bam 11_sample.bam 12_sample.bam 13_sample.bam 14_sample.bam 15_sample.bam 16_sample.bam
tg93 79 C - C C C - -
tg93 80 G A G G G G A A G
tg93 81 A C A A A A C C C
tg93 86 C A C C A A A A C
tg93 136 G C C G C C G G G
I am able to read the columns into arrays and display them in my code, but I am not sure how to write the loops that read the bases, count their occurrences and keep or drop the row. Can anyone tell me how I should proceed? It would also be helpful if you have any example code I can build upon.
#!/usr/bin/env perl
use strict;
use warnings;
print scalar(<>); # Read and output the header.
while (<>) { # Read a line.
chomp; # Remove the newline from the line.
my ($chrom, $pos, $ref, $alt, @samples) =
split /\t/; # Parse the remainder of the line.
my %counts; # Count the occurrences of sample values.
++$counts{$_} for @samples; # e.g. Might end up with $counts{"G"} = 3.
print "$_\n" # Print line if we want to keep it.
if ($counts{$ref} || 0) >= 2 # ("|| 0" avoids a spurious warning.)
&& ($counts{$alt} || 0) >= 2;
}
Output:
CHROM POS REF ALT 10_sample.bam 11_sample.bam 12_sample.bam 13_sample.bam 14_sample.bam 15_sample.bam 16_sample.bam
tg93 79 C - C C C - -
tg93 80 G A G G G G A A G
tg93 81 A C A A A A C C C
tg93 86 C A C C A A A A C
tg93 136 G C C G C C G G G
You included 108 in your desired output, but it only has one instance of ALT in the seven samples.
Usage:
perl script.pl file.in >file.out
Or in-place:
perl -i script.pl file
Here's an approach that does not assume tab separation between fields
use IO::All;
my $chrom = "tg93";
my @lines = io('file.txt')->slurp;
foreach(@lines) {
%letters = ();
# use regex with backreferences to extract data - this method does not depend on tab separated fields
if(/$chrom\s+\d+\s+([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])/) {
# initialize hash counts
$letters{$1} = 0;
$letters{$2} = 0;
# loop through the samples and increment the counter when matches are found
foreach($3, $4, $5, $6, $7, $8, $9) {
if ($_ eq $1) {
++$letters{$1};
}
if ($_ eq $2) {
++$letters{$2};
}
}
# if the counts for both REF and ALT are greater than or equal to 2, print the line
if($letters{$1} >= 2 && $letters{$2} >= 2) {
print $_;
}
}
}
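The counting rule from the question (keep a row only if both the REF and the ALT value occur at least twice among the samples, ignoring blanks) can also be sketched in Python. This is a minimal illustration, assuming tab-separated input with at least four columns per row, as in the first Perl answer; the names keep_row and filter_lines are invented for this example.

```python
from collections import Counter


def keep_row(ref: str, alt: str, samples: list[str], minimum: int = 2) -> bool:
    """Return True if both ref and alt occur at least `minimum` times in samples."""
    # Blank sample cells are ignored while counting.
    counts = Counter(cell for cell in samples if cell.strip())
    return counts[ref] >= minimum and counts[alt] >= minimum


def filter_lines(lines: list[str]) -> list[str]:
    """Keep the header line plus every data row that passes keep_row."""
    header, *rows = lines
    kept = [header]
    for row in rows:
        # Assumption: fields are tab-separated and each row has at least 4 columns.
        chrom, pos, ref, alt, *samples = row.rstrip("\n").split("\t")
        if keep_row(ref, alt, samples):
            kept.append(row)
    return kept


if __name__ == "__main__":
    demo = [
        "CHROM\tPOS\tREF\tALT\ts1\ts2\ts3\ts4\ts5\ts6\ts7\n",
        "tg93\t77\tT\tC\tT\tT\tT\tT\tT\t\t\n",
        "tg93\t79\tC\t-\tC\tC\tC\t-\t-\t\t\n",
    ]
    print("".join(filter_lines(demo)), end="")
```

Counter returns 0 for values that never occur, so no special handling is needed when REF or ALT is absent from the samples.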