Can GNU sed be used to identify a pattern that spans more than one row? In other words, how can you include a line break in the pattern you want sed to match?
For example, in the following dataset (which is much larger in actuality), I have an error that should have been removed when I searched for duplicates, but was not because the information is slightly different in two rows (which is irrelevant at this point).
In this case, I want to remove the error entirely from the original file. In other words, if two rs#### rows follow each other anywhere in my file, I would like to erase both copies and also the six lines that follow them. It would be nice to move them to a new file, but what is most critical is that they are removed from the original.
rs1038864 16 73762557 A G
1 1633 0.5835 -0.0004 0.0035
1 1643 0.8902 0.004436 0.004354
0 0 0 0 0
rs1019567 16 83343715 G T
rs1019567 16 83343715 G T
1 1641 0.4692 0.0009 0.0035
1 559 0.4612 -0.0025 0.0060
1 1643 0.5178 -0.002244 0.002745
1 1643 0.5178 -0.002244 0.002745
1 1909 0.493842692 0.0008 0.0027
1 1950 0.493842692 0.0008 0.0027
rs1038556 16 55132072 C T
1 6388 0.7773 0.0020 0.0044
1 6843 0.1161 0.001379 0.004275
1 1509 0.978660942 0.0041 0.0096
rs1019797 16 87788686 C G
rs1019797 16 87788686 C G
1 1639 0.717 0.0022 0.0038
1 5557 0.7193 0.0020 0.0064
1 1643 0.6691 -0.001044 0.002888
1 6843 0.6691 -0.001044 0.002888
1 1959 0.315280799 -0.0041 0.0032
1 1909 0.315280799 -0.0041 0.0032
rs1038887 16 62660698 A G
1 1688 0.4947 -0.0028 0.0035
0 0 0 0 0
1 1909 0.464393658 0.0007 0.0028
Something like,
sed -i '/^rs.*d
^rs.*/,+6d' test.data
or perhaps
sed -i '/^rs.*;^rs.*/,+6d' test.data
?
Any thoughts would be appreciated!
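If infile contains the listed input, something like this should do (GNU sed):
<infile sed -r 'N; /^(rs[^\n]*)\n\1$/ { N; N; N; N; N; N; d }; P; D'
If you want to save the deleted bits to deleted.txt use this:
<infile sed -r 'N; /^(rs[^\n]*)\n\1$/ { N; N; N; N; N; N; w deleted.txt
d }; P; D'
Note that the w command needs to be terminated by a newline.
Explanation
This loads a second line into the pattern space (N) and checks whether the two lines are identical rs lines (/^(rs[^\n]*)\n\1$/). If they are, six more lines are loaded into the pattern space and the whole block is deleted (d). Otherwise the first line is printed (P) and removed from the pattern space (D), and the cycle restarts with the remaining line. The pattern is anchored at both ends so that a line whose ending merely happens to match the start of the next line is not treated as a duplicate.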
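For readers who find the sed cycle hard to follow, the same logic can be sketched in Python. This is only an illustration under the assumptions of the question (the two rs lines are identical and exactly six lines follow them); the function name drop_duplicate_blocks and its extra parameter are made up for this example.

```python
def drop_duplicate_blocks(lines, extra=6):
    """Split lines into (kept, removed).

    A block is removed when an 'rs' line is immediately followed by an
    identical line; the block is those two lines plus the next `extra` lines.
    """
    kept, removed = [], []
    i = 0
    while i < len(lines):
        is_duplicate = (
            lines[i].startswith("rs")
            and i + 1 < len(lines)
            and lines[i + 1] == lines[i]
        )
        if is_duplicate:
            end = i + 2 + extra          # two marker lines + following data lines
            removed.extend(lines[i:end])
            i = end
        else:
            kept.append(lines[i])
            i += 1
    return kept, removed


sample = [
    "rs1038864 16 73762557 A G",
    "1 1633 0.5835 -0.0004 0.0035",
    "rs1019567 16 83343715 G T",
    "rs1019567 16 83343715 G T",
    "1 1641 0.4692 0.0009 0.0035",
    "1 559 0.4612 -0.0025 0.0060",
    "1 1643 0.5178 -0.002244 0.002745",
    "1 1643 0.5178 -0.002244 0.002745",
    "1 1909 0.493842692 0.0008 0.0027",
    "1 1950 0.493842692 0.0008 0.0027",
    "rs1038556 16 55132072 C T",
]
kept, removed = drop_duplicate_blocks(sample)
print("\n".join(kept))
```

The removed list plays the role of deleted.txt in the second sed command.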
I don't think sed is the right tool for the job (but I may be wrong; it depends in part on whether there are always exactly 6 lines to delete and maybe on whether the adjacent ID lines always have the same ID). You probably can do it with awk, but I'd reach for Perl:
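#!/usr/bin/env perl
use strict;
use warnings;
my $rejects = "reject.lines";
open my $fh, '>', $rejects or die "Failed to create $rejects: $!";
my $old = "";    # previous line, held back until we know it is not rejected
while (<>)
{
    if ($_ =~ /^rs\d+ /)
    {
        if ($old =~ /^rs\d+ /)
        {
            # Two adjacent marker lines: reject both, plus everything up to the next marker
            print $fh $old;
            print $fh $_;
            while (<>)
            {
                last if /^rs\d+ /;
                print $fh $_;
            }
            # At end of input $_ is undefined; stop instead of calling <> again
            $old = defined $_ ? $_ : "";
            last if $old eq "";
            next;
        }
    }
    print $old;
    $old = $_;
}
print $old if $old ne "";
close $fh;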
This will handle arbitrary numbers of lines after the adjacent marker lines, and doesn't depend on the two markers being identical.
Output
rs1038864 16 73762557 A G
1 1633 0.5835 -0.0004 0.0035
1 1643 0.8902 0.004436 0.004354
0 0 0 0 0
rs1038556 16 55132072 C T
1 6388 0.7773 0.0020 0.0044
1 6843 0.1161 0.001379 0.004275
1 1509 0.978660942 0.0041 0.0096
rs1038887 16 62660698 A G
1 1688 0.4947 -0.0028 0.0035
0 0 0 0 0
1 1909 0.464393658 0.0007 0.0028
Reject lines
rs1019567 16 83343715 G T
rs1019567 16 83343715 G T
1 1641 0.4692 0.0009 0.0035
1 559 0.4612 -0.0025 0.0060
1 1643 0.5178 -0.002244 0.002745
1 1643 0.5178 -0.002244 0.002745
1 1909 0.493842692 0.0008 0.0027
1 1950 0.493842692 0.0008 0.0027
rs1019797 16 87788686 C G
rs1019797 16 87788686 C G
1 1639 0.717 0.0022 0.0038
1 5557 0.7193 0.0020 0.0064
1 1643 0.6691 -0.001044 0.002888
1 6843 0.6691 -0.001044 0.002888
1 1959 0.315280799 -0.0041 0.0032
1 1909 0.315280799 -0.0041 0.0032
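The approach of the Perl script (reject two adjacent marker lines and everything up to the next marker, however many lines that is) can also be sketched in Python. This is an illustrative equivalent, not a translation the answer's author provided; the names MARKER and split_rejects are invented for the example.

```python
import re

# A marker line starts with "rs", digits, then a space (same test as the Perl script).
MARKER = re.compile(r"^rs\d+ ")


def split_rejects(lines):
    """Return (kept, rejects) for a list of lines.

    Two adjacent marker lines, and all non-marker lines after them,
    go to rejects; the markers do not need to be identical.
    """
    kept, rejects = [], []
    i, n = 0, len(lines)
    while i < n:
        if MARKER.match(lines[i]) and i + 1 < n and MARKER.match(lines[i + 1]):
            rejects.extend(lines[i:i + 2])
            i += 2
            # Swallow data lines until the next marker (or end of input).
            while i < n and not MARKER.match(lines[i]):
                rejects.append(lines[i])
                i += 1
        else:
            kept.append(lines[i])
            i += 1
    return kept, rejects
```

Because the inner loop stops at the next marker rather than counting six lines, a block with more or fewer data lines is still handled.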
The command I used to get a specific format prints all of its output on one line:
MASH P 0 3.64 NAMD P 0 3.79 AGHA P 0 4.50 SARG P 0 4.71 BENG P 0 5.47 BANR P 0 6.75 ABZA P 0 6.25 KALI P 0 6.91
I want the output to have 85 characters on each line. Could someone explain how I should use print to do this?
You can use a regular expression with a quantifier:
$_ = 'MASH P 0 3.64 NAMD P 0 3.79 AGHA P 0 4.50 SARG P 0 4.71 BENG P 0 5.47 BANR P 0 6.75 ABZA P 0 6.25 KALI P 0 6.91';
print $&, "\n" while /.{1,85}/g;
or, if it's a part of a larger program and you don't want to suffer the performance penalty, use ${^MATCH} instead of $&:
use Syntax::Construct qw{ /p };
print ${^MATCH}, "\n" while /.{1,85}/gp;
You can also use the four argument substr:
print substr($_, 0, 85, q()), "\n" while length $_;
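The idea behind all three Perl variants is the same: take the string in consecutive slices of at most 85 characters. A minimal Python sketch of that slicing, assuming the input contains no newlines (the function name wrap_fixed is made up for this example):

```python
def wrap_fixed(text, width=85):
    """Cut text into consecutive pieces of at most `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]


line = ("MASH P 0 3.64 NAMD P 0 3.79 AGHA P 0 4.50 SARG P 0 4.71 "
        "BENG P 0 5.47 BANR P 0 6.75 ABZA P 0 6.25 KALI P 0 6.91")
for piece in wrap_fixed(line):
    print(piece)
```

Like the regular expression, this cuts purely by character count, so a record can be split across two lines.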
Hi, I am trying to use tiff2pdf to convert some TIFF files captured by the fax application in Asterisk to PDF.
I have installed tiff2pdf by the following method:
sudo yum install ghostscript libtiff
When I execute the command to convert I get the following:
[root@cloud01 tmp]# tiff2pdf FAX-443439791001-2015-09-26_00-34-40.tiff
II*%PDF-1.1
%âãÏÓ
1 0 obj
<<
/Type /Catalog
/Pages 3 0 R
>>
endobj
2 0 obj
<<
/CreationDate (D:20150926003458)
/ModDate (D:20150926003458)
/Producer (libtiff / tiff2pdf - 20100615)
/Creator (Spandsp 20110122 075024)
/Subject (01279 850795)
>>
endobj
3 0 obj
<<
/Type /Pages
/Kids [ 4 0 R ]
/Count 1
>>
endobj
4 0 obj
<<
/Type /Page
/Parent 3 0 R
/MediaBox [0.0000 0.0000 609.8823 833.8776]
/Contents 5 0 R
/Resources <<
/XObject <<
/Im1 7 0 R >>
/ProcSet [ /ImageB ]
>>
>>
endobj
5 0 obj
<<
/Length 6 0 R
>>
stream
q 609.8823 0.0000 0.0000 833.8776 0.0000 0.0000 cm /Im1 Do Q
endstream
endobj
6 0 obj
62
endobj
7 0 obj
<<
/Length 8 0 R
/Type /XObject
/Subtype /Image
/Name /Im1
/Width 1728
/Height 1135
/BitsPerComponent 1
/ColorSpace /DeviceGray
/Filter /CCITTFaxDecode /DecodeParms << /K -1 /Columns 1728 /Rows 1135>>
>>
stream
[binary image stream data omitted]
endstream
endobj
8 0 obj
4041
endobj
xref
0 9
0000000000 65535 f
0000000016 00000 n
0000000068 00000 n
0000000253 00000 n
0000000317 00000 n
0000000493 00000 n
0000000611 00000 n
0000000629 00000 n
0000004913 00000 n
trailer
<<
/Size 9
/Root 1 0 R
/Info 2 0 R
/ID[<67458B6BC6237B3269983C6473483366><67458B6BC6237B3269983C6473483366>]
>>
startxref
4933
%%EOF
[root@cloud01 tmp]#
Am I missing a library or some other dependency?
Thanks
From the man page:
tiff2pdf opens a TIFF image and writes a PDF document to standard output.
So nothing is missing; the PDF is simply being printed to your terminal. Try redirecting the output to a file:
tiff2pdf FAX-443439791001-2015-09-26_00-34-40.tiff > FAX-443439791001-2015-09-26_00-34-40.pdf
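If the conversion is launched from a script rather than typed into a shell, the same redirection has to be done explicitly. The sketch below shows one way to send a command's standard output to a file from Python; run_to_file is a hypothetical helper written for this example, and the file names in the comment are placeholders.

```python
import subprocess


def run_to_file(cmd, out_path):
    """Run cmd (a list of arguments) and write its standard output to out_path."""
    with open(out_path, "wb") as out:
        # check=True raises CalledProcessError if the command exits with an error
        subprocess.run(cmd, stdout=out, check=True)


# Example call (requires tiff2pdf to be installed):
# run_to_file(["tiff2pdf", "input.tiff"], "output.pdf")
```

Opening the file in binary mode matters here because a PDF is not plain text.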
Suppose we are given a number that we would like to compare with all the numbers in one column of a matrix. For example:
value = 210;
A = [
0.0010 68
0.0011 277
0.0011 129
0.0012 87
0.0015 78
0.0016 248
0.0019 270
0.0019 133
0.0022 258
0.0025 264
0.0029 255
0.0030 81
0.0032 242
0.0033 27
0.0036 124];
Now, we want to compare value with every number in column two under some condition. If the condition holds for all the numbers in the second column, do some computations; otherwise do some other computations. As soon as it fails for one number, stop checking and continue with the rest of the code.
In the example:
if abs(value - A(:,2)) > 50 % should be true for all A(:,2)
do something
else
do something
How could one write it in code?
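The condition described here is an "all elements satisfy" test. As a language-neutral illustration, a minimal Python sketch follows; it uses the second-column values from the example, and the function name all_far_from is made up. Python's built-in all() stops at the first element that fails, which matches the requirement to exit as soon as the condition does not hold.

```python
def all_far_from(value, column, threshold=50):
    """True only if every entry of column differs from value by more than threshold."""
    return all(abs(value - x) > threshold for x in column)


value = 210
second_column = [68, 277, 129, 87, 78, 248, 270, 133, 258, 264, 255, 81, 242, 27, 124]

if all_far_from(value, second_column):
    print("condition holds for every row")
else:
    print("condition fails for at least one row")
```

For the example data the else branch runs, because 248 is only 38 away from 210.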
I have four files. File 1 (named input_22.txt) is an input file containing two space-delimited columns. The first column is an alphabetically sorted list of ligand codes (a three-character code for a particular ligand). The second column lists the PDB (Protein Data Bank) code that corresponds to each ligand code; this column is not sorted.
File 1 (input_22.txt):
803 1cqp
AMH 1b2i
ASC 1f9g
ETS 1cil
MIT 1dwc
TFP 1ctr
VDX 1db1
ZMR 1a4g
File 2 (named SD_2.txt) is an SDF (structure-data file) containing the fragments of each ligand. A ligand can have one or more fragments; for instance, ligand 803 has two. Each fragment starts with a line of four dollar signs ($$$$) followed by the ligand code (803 in this example) on the next line. The fifth line of each fragment (the third line after $$$$\n803) begins with the number of rows in the block that follows: 7 for the first fragment of 803 and 10 for the second. Columns 61-62 of that block hold the numbers that identify the atoms in the fragment. For the first fragment of 803 these numbers are 15, 16, 17, 19, 20, 21 and 22. These numbers need to be matched in file 3.
File 2 (SD_2.txt) looks like:
$$$$
803
SciTegic05101215222D
7 7 0 0 0 0 999 V2000
3.0215 -0.5775 0.0000 C 0 0 0 0 0 0 0 0 0 15 0 0
2.3070 -0.9900 0.0000 C 0 0 0 0 0 0 0 0 0 16 0 0
1.5926 -0.5775 0.0000 C 0 0 0 0 0 0 0 0 0 17 0 0
1.5926 0.2475 0.0000 C 0 0 0 0 0 0 0 0 0 19 0 0
2.3070 0.6600 0.0000 C 0 0 0 0 0 0 0 0 0 20 0 0
2.3070 1.4850 0.0000 O 0 0 0 0 0 0 0 0 0 21 0 0
3.0215 0.2475 0.0000 O 0 0 0 0 0 0 0 0 0 22 0 0
1 2 1 0
1 7 1 0
2 3 1 0
3 4 1 0
4 5 1 0
5 6 2 0
5 7 1 0
M END
> <Name>
803
> <Num_Rings>
1
> <Num_CSP3>
4
> <Fsp3>
0.8
> <Fstereo>
0
$$$$
803
SciTegic05101215222D
10 11 0 0 0 0 999 V2000
-1.7992 -1.7457 0.0000 C 0 0 0 0 0 0 0 0 0 1 0 0
-2.5137 -1.3332 0.0000 C 0 0 0 0 0 0 0 0 0 2 0 0
-2.5137 -0.5082 0.0000 C 0 0 0 0 0 0 0 0 0 3 0 0
-1.7992 -0.0957 0.0000 C 0 0 0 0 0 0 0 0 0 5 0 0
-1.0847 -0.5082 0.0000 C 0 0 0 0 0 0 0 0 0 6 0 0
-0.3702 -0.0957 0.0000 C 0 0 0 0 0 0 0 0 0 7 0 0
0.3442 -0.5082 0.0000 C 0 0 0 0 0 0 0 0 0 8 0 0
0.3442 -1.3332 0.0000 C 0 0 0 0 0 0 0 0 0 9 0 0
-0.3702 -1.7457 0.0000 C 0 0 0 0 0 0 0 0 0 11 0 0
-1.0847 -1.3332 0.0000 C 0 0 0 0 0 0 0 0 0 12 0 0
1 2 1 0
1 10 1 0
2 3 1 0
3 4 1 0
4 5 2 0
5 6 1 0
5 10 1 0
6 7 2 0
7 8 1 0
8 9 1 0
10 9 1 0
M END
> <Name>
803
> <Num_Rings>
2
> <Num_CSP3>
6
> <Fsp3>
0.6
> <Fstereo>
0.1
File 3 is a CIF (Crystallographic Information File). It can be obtained from the following link: File_3
This file is a collection of individual CIF files for several ligand molecules. Each part of the file starts with data_ligandcode; for our example that is data_803. Forty-six lines after the start of each part there is a block that gives structural information about the molecule. The number of rows in this block is not fixed, but the block ends with a hash sign (#). Two columns in this block are important: 53-56 and 62-63. Columns 62-63 contain numbers that can be matched against the numbers obtained from file 2. Columns 53-56 contain atom names such as C1 (carbon 1), which can be used to match against file 4.
File 4 is a grow.out file that contains information about the interaction of each ligand with its target protein. The file name is the PDB code given in file 1 for that ligand. For example, the PDB code for ligand 803 is 1cqp, so the grow.out file is named 1cqp. 1cqp
In this file, the important rows are those that contain the ligand code (for example 803) and an atom name obtained from columns 53-56 of file 3.
Task: I need a script that reads a ligand code from file 1, searches file 2 for $$$$\nligandcode, and obtains the numbers in columns 61-62 for each fragment. Next, the script should match these numbers against columns 62-63 of file 3 and pull out the atom names in columns 53-56 of the matching rows. The last step is to open file 4, named after the PDB code, and print the rows that contain the ligand code and the atom names obtained from file 3. The rows should be printed to an output file.
I am a biomedical research student without a computer science background, but I have to use Perl for some tasks. I wrote a script for the task described above, but it is not working properly and I cannot find the reason. The script I wrote is:
#!/usr/bin/perl
use strict;
use warnings;
use Text::Table;
use Carp qw(croak);
{
    my $a;
    my $b;
    my $input_file = "input_22.txt";
    my @lines = slurp($input_file);
    for my $line (@lines){
        my ($ligandcode, $pdbcode) = split(/\t/, $line);
        my $i=0;
        my $k=0;
        my @array;
        my @array1;
        open (FILE, '<', "SD_2.txt");
        while (<FILE>) {
            my $i=0;
            my $k=0;
            my @array;
            my @array1;
            if ( $_=~/\x24\x24\x24\x24/ . /\n$ligandcode/) {
                my $nextline1 = <FILE>;
                my $nextline2 = <FILE>;
                my $nextline3 = <FILE>;
                my $nextline4= <FILE>;
                my $totalatoms= substr( $nextline4, 1,2);
                print $totalatoms,"\n";
                while ($i<$totalatoms)
                {
                    my $nextlines= <FILE>;
                    my $sub= substr($nextlines, 61, 2);
                    print $sub;
                    $array[$i] = $sub;
                    open (FH, '<', "components.txt");
                    while (my $ship=<FH>) {
                        my $var="data_$ligandcode";
                        if ($ship=~/$var/)
                        {
                            while ($k<=44)
                            {
                                $k++;
                                my $nextline = <FH>;
                            }
                            my $j=0;
                            my $nextline3;
                            do
                            {
                                $nextline3=<FH>;
                                print $nextline3;
                                my $part= substr($nextline3, 62, 2);
                                my $part2= substr($nextline3, 53, 4);
                                $array1[$j] = $part;
                                if ($array1[$j] eq $array[$i])
                                {
                                    print $part2, "\n";
                                    open (GH, '<', "$pdbcode");
                                    open (OH, ">>out_grow.txt");
                                    while (my $grow = <GH>)
                                    {
                                        if ( $grow=~/$ligandcode/){
                                            print OH $grow if $grow=~/$part2/;
                                        }}
                                    close (GH);
                                    close (OH);
                                }
                                $j++;
                            } while $nextline3 !~/\x23/;
                        }
                    }
                    $i++;
                    close (FH);
                }
            }}
        close (FILE);
    }
}
##Slurps a file into a list
sub slurp {
    my ($file) = @_;
    my (@data, @data_chomped);
    open IN, "<", $file or croak "can't open $file\n";
    @data = <IN>;
    for my $line (@data){
        chomp($line);
        push (@data_chomped, $line);
    }
    close IN;
    return (@data_chomped);
}
I want a script that runs fast and can handle about 1000 fragments in total when file 1 lists 400 molecules. Kindly help me get this script working. I'll be grateful.
You need to break your code into manageable steps.
Create data-structures from the files
use Slurp;
my @input = map{
    [ split /\s+/, $_, 2 ]
} slurp $input_filename;
# etc
Process each element of input_22.txt, using those data structures.
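The two steps above (read each file once into a data structure, then work from those structures) can be sketched as follows. This is a Python illustration of the idea, not a replacement for the Perl script. It assumes that fields can be split on whitespace instead of using fixed columns 61-62, that the counts line is the one ending in V2000, and that the atom number is the 14th whitespace-separated field of each atom row, as in the sample shown in the question. The names read_pairs and read_fragments are invented for this example.

```python
from collections import defaultdict


def read_pairs(text):
    """Return [(ligand_code, pdb_code), ...] from the two-column input file text."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()  # any run of spaces or tabs separates the columns
        if len(fields) >= 2:
            pairs.append((fields[0], fields[1]))
    return pairs


def read_fragments(sd_text):
    """Map ligand code -> list of fragments; each fragment is a list of atom numbers."""
    fragments = defaultdict(list)
    lines = sd_text.splitlines()
    for i, line in enumerate(lines):
        if not line.rstrip().endswith("V2000"):
            continue  # not a counts line
        # Walk back to the "$$$$" separator; the ligand code is on the line after it.
        j = i
        while j >= 0 and lines[j].strip() != "$$$$":
            j -= 1
        if j < 0:
            continue
        code = lines[j + 1].strip()
        n_atoms = int(line.split()[0])  # first number on the counts line
        atom_rows = lines[i + 1:i + 1 + n_atoms]
        # Assumption: the atom number is the 14th whitespace-separated field.
        fragments[code].append([int(row.split()[13]) for row in atom_rows])
    return dict(fragments)
```

With both structures in memory, SD_2.txt is read once in total, and looking up the fragments of a ligand is a dictionary access rather than another pass over the file.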
I really think you should look into PerlMol. After all, half the reason to use Perl is CPAN.
Things you did well
Using 3-arg open
use strict;
use warnings;
Things you shouldn't have done
(Re)defined $a and $b
They are already defined for you.
Reimplemented slurp (poorly)
Read the same file in multiple times.
You opened SD_2.txt once for every line of input_22.txt.
Defined symbols outside of the scope where you use them.
$i, $k, @array and @array1 are defined twice, but only one of the definitions is being used.
Used open and close without some sort of error checking.
Either open ... or die; or use autodie;
You used bareword filehandles. IN, FILE etc
Instead use open my $FH, ...
Most of those aren't that big of a deal though, for a one-off program.
I have a text file with tab delimited data spread across 16 columns.
I want to delete every row in which the value in the 6th column is 1260, 1068 or 907.
9513 2010-06-15 17:00:00 94 0 69 12 0 0 0 0.0000 0 \N \N \N 2010-06-15 18:00:02 \N
9523 2010-06-15 18:00:00 94 0 69 12 0 0 0 0.0000 0 \N \N \N 2010-06-15 19:00:02 \N
9534 2010-06-15 19:00:00 94 0 69 12 0 0 0 0.0000 0 \N \N \N 2010-06-15 20:00:02 \N
9543 2010-06-15 20:00:00 94 0 69 12 0 0 0 0.0000 0 \N \N \N 2010-06-15 21:00:02 \N
9552 2010-06-15 21:00:00 94 0 69 12 0 0 0 0.0000 0 \N \N \N 2010-06-15 22:00:02 \N
9560 2010-06-15 22:00:00 94 0 69 12 0 0 0 0.0000 0 \N \N \N 2010-06-15 23:00:02 \N
9569 2010-06-15 23:00:00 94 0 69 12 0 0 0 0.0000 0 \N \N \N 2010-06-16 00:00:02 \N
9579 2010-06-16 00:00:00 94 0 69 12 0 0 0 0.0000 0 \N \N \N 2010-06-16 01:00:02 \N
9589 2010-06-16 01:00:00 94 0 69 12 0 0 0 0.0000 0 \N \N \N 2010-06-16 02:00:01 \N
9599 2010-06-16 02:00:00 94 0 69 12 0 0 0 0.0000 0 \N \N \N 2010-06-16 03:00:02 \N
95642733 2011-10-19 19:00:00 4341 0 1263 0 11 0 0 0.0000 0 \N \N \N 2011-10-19 20:05:06 \N
95642732 2011-10-19 19:00:00 4341 0 1260 0 24635 0 0 0.0000 0 \N \N \N 2011-10-19 20:05:06 \N
95642540 2011-10-19 19:00:00 4050 0 1068 103 113 2 0 0.0000 0 \N \N \N 2011-10-19 20:05:06 \N
95642539 2011-10-19 19:00:00 4050 0 907 19 0 0 0 0.0000 0 \N \N \N 2011-10-19 20:05:06 \N
Awk is the tool you want to use.
awk '$6==1260 || $6==1068 || $6==907 {next} {print}'
What does this do?
Awk runs a block of code on each line of your file. The code starts with an expression that must evaluate true (in this case the three possible values of the 6th field), followed by commands in curly braces. In this case, the command next tells it to proceed to the next input line without running any more commands.
If the three comparisons FAIL, and we don't run the next, then we print the line.
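The same skip-or-print logic can be written out in Python for comparison. This sketch assumes awk's default behaviour of splitting on any run of whitespace (so the date and the time count as two separate fields) and compares the field as text rather than as a number; drop_rows and its parameters are hypothetical names for this example.

```python
def drop_rows(lines, column=6, unwanted=("1260", "1068", "907")):
    """Keep every line whose column-th whitespace-separated field is not unwanted."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) >= column and fields[column - 1] in unwanted:
            continue  # same effect as awk's `next`
        kept.append(line)
    return kept


rows = [
    "95642733\t2011-10-19 19:00:00\t4341\t0\t1263\t0\t11",
    "95642732\t2011-10-19 19:00:00\t4341\t0\t1260\t0\t24635",
    "95642540\t2011-10-19 19:00:00\t4050\t0\t1068\t103\t113",
    "95642539\t2011-10-19 19:00:00\t4050\t0\t907\t19\t0",
]
for row in drop_rows(rows):
    print(row)
```

Only the first row survives, since 1263 is not one of the three listed values.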
What you want to use is awk. Awk is an amazingly powerful language on UNIX, and if you ever run into a complicated text-streaming problem, awk is your solution.
Try this script:
awk '{
    if ($6 != 1260 && $6 != 1068 && $6 != 907)
        print $0;
}' file.txt >> output_file.txt
This might work for you (GNU sed?):
sed '/^\(\S\+\s\+\)\{5\}\(1260\|1068\|907\)\s/d' file
or generally:
sed '/^\([^[:space:]]\{1,\}[[:space:]]\{1,\}\)\{5\}\(1260\|1068\|907\)[[:space:]]/d' file
awk '$6!=1260 && $6!=1068 && $6!=907' file