Matching a line and printing lines above - perl

Code :
#!/usr/bin/perl
my $file = $ARGV[0];
my $position = $ARGV[1]; # POSITION OF THE RESIDUE
open (FILE, $file);
while (<FILE>) {
my #f = split;
if (($f[0] == "ANNOT_RESID_NO") && ($f[1] == $position)){
push #line, $_;
}
}
print #line;
close(FILE);
INPUT :
ANNOT_TYPE[1] 0
ANNOT_TYPE_NAME[1] CATRES
ANNOT_NUMBER[1][1] 1
ANNOT_NAME[1][1] 3.1.3.16
ANNOT_DESC[1][1] Phosphoprotein phosphatase.
ANNOT_RESID_NO[1][1][1] 91
ANNOT_RESID_NAME[1][1][1] ASP
ANNOT_RESID_NUM[1][1][1] 95
ANNOT_RESID_NO[1][1][2] 92
ANNOT_RESID_NAME[1][1][2] ARG
ANNOT_NRESID[1][1] 6
ANNOT_NUMBER[1][2] 2
ANNOT_NAME[1][2] 3.1.3.53
ANNOT_DESC[1][2] [Myosin-light-chain] phosphatase.
ANNOT_RESID_NO[1][2][1] 91
ANNOT_RESID_NAME[1][2][1] ASP
ANNOT_RESID_NUM[1][2][1] 95
ANNOT_RESID_NO[1][2][2] 92
ANNOT_RESID_NAME[1][2][2] ARG
Question :
I am printing the line with has $position(for example 91) for the line starting with "ANNOT_RESID_NO". Along with this line, I also want to print, every time, in #line is the first line above this match containing "ANNOT_DESC". This "ANNOT_DESC" line is not necessarily always the line just above my matched line.

Try (complete code):
#!/usr/bin/perl
use strict;
use warnings;
my $file = $ARGV[0];
my $position = $ARGV[1];
open (FILE, $file) or die $!;
my $desc;
my #line;
while (<FILE>) {
my #f = split " ";
if ( $f[0] =~ /^ANNOT_DESC/ ) {
$desc = $_;
next;
}
if ( $f[0] =~ /^ANNOT_RESID_NO/ and $f[1] == $position ) {
push #line, $desc, $_;
}
}
output:
ANNOT_DESC[1][1] Phosphoprotein phosphatase.
ANNOT_RESID_NO[1][1][1] 91
ANNOT_DESC[1][2] [Myosin-light-chain] phosphatase.
ANNOT_RESID_NO[1][2][1] 91

With a data set that small you can push the lines from the file to an array(e.g. #file_data) , iterate the #file_data array and push the values you want into your #line array.

Related

Perl: How to print a random section (word definition) from a dictionary file

I want to print a random new word English in dictionary file in terminal Unix by Perl. I want to select and print a random line and 2 follow lines.
But my code doesn't complete this work.
Please help me to improve it.
An example of the output I wish:
#inspire: ....
ghk
lko...
Dictionary file:
#inspiration: mean....
abc def...
ghk lmn
...
#inspire: ....
ghk
lko...
#people: ...
...
The complete dictionary file is here anhviet109K.txt. It's about 14MB
My code:
use strict;
use warnings;
use File::Copy qw(copy move);
my $files = 'anhviet109K.txt';
my $fh;
my $linewanted = 16 + int( rand( 513796 - 16 ) );
# 513796: number of lines of file dic.txt
open( $fh, "<", $files ) or die "cannot open < $fh: $!";
my $del = " {2,}";
my $temp = 0;
my $count = 0;
while ( my $line = <$fh> ) {
if ( ( $line =~ "#" ) && ( $. > $linewanted ) ) {
$count = 4;
}
else {
next;
}
if ( $count > 0 ) {
print $line;
$count--;
}
else {
last;
}
}
close $fh;
Something like this, perhaps?
Your data has helped me to exclude the header entries in your dictionary file
This program finds the location of all of the entries (lines beginning with #) in the file, then chooses one at random and prints it
Tốt học tiếng Anh may mắn
use strict;
use warnings 'all';
use Fcntl ':seek';
use constant FILE => 'anhviet109K.txt';
open my $fh, '<', FILE or die qq{Unable to open "#{[FILE]}" for input: $!};
my #seek; # Locations of all the definitions
my $addr = tell $fh;
while ( <$fh> ) {
push #seek, $addr if /^\#(?!00-)/;
$addr = tell $fh;
}
my $choice = $seek[rand #seek];
seek $fh, $choice, SEEK_SET;
print scalar <$fh>;
while ( <$fh> ) {
last if /^\#/;
print;
}
output
#finesse /fi'nes/
* danh từ
- sự khéo léo, sự phân biệt tế nhị
- mưu mẹo, mánh khoé
* động từ
- dùng mưu đoạt (cái gì); dùng mưu đẩy (ai) làm gì; dùng mưu, dùng kế
=to finesse something away+ dùng mưu đoạt cái gì
A single pass approach:
use strict;
use warnings;
use autodie;
open my $fh, '<:utf8', 'anhviet109K.txt';
my $definition = '';
my $count;
my $select;
while (my $line = <$fh>) {
if ($line =~ /^#(?!00-)/) {
++$count;
$select = rand($count) < 1;
if ($select) {
$definition = $line;
}
}
elsif ($select) {
$definition .= $line;
}
}
# remove blank line that some entries have
$definition =~ s/^\s+\z//m;
binmode STDOUT, ':utf8';
print $definition;
This iterative random selection always selects the first item, has a 1/2 chance of replacing it with the second item, a 1/3 for the third, and so on.

perl, <Correct Input, grep

Please don't comment to say I already asked this, It's a logic question, I know it's mostly similar code but there are underlying syntax problems that I cannot decipher and have spent hours debugging this with no hope and I just really need this answered. And that other account was deleted so I did post this half an hour ago but can't view it. Please only comment if you want to help.
It should work everything is in data and it should be turning up results, i've had it working before so it must just be so syntax thing I'm not noticing. I can't get this work. I'm almost certain it's the grep statement.
#!/usr/bin/perl
use warnings;
use strict;
open ("data", "<text.txt") or die "Can't open"; #
my #data = <data>; #file looking into
close "data"; #
while(<>){
chomp;
my $temp = $_;
my ($name, $number, $expression) = split("\t", $temp);
my $pattern = "\t";
my #found = grep ( /(^$name$pattern\|$pattern$number$)/, #data );
if(defined($found[0])){
print $_;
my ($what, $start, $stop, $chr, $who) = split("\t", $found[0]);
print "\t", $chr, $start, $stop;
#found = ();
}
}
print "\n";
Input is of the format
A1B 1 68
A1C 299 0
A2B 547 0
A2L 877 30
A2M 2 7944
And this is the format of the data file
CLDN8 30214006 30216073 21 68
A1C 20808776 20811809 Y
UBE2Q2P5Y 25431156 25437315 Y
OR5M9 56462469 56463401 11 390162
I want to search for the instances of items in the first or second column of the input file in the data file which should match up with the first and 5th column(which may not exist) respectively
Expected output should be for this example
A1B 1 68 21 30214006 30216073
A1C 299 0 Y 20808776 20811809
But I'm getting nothing
I think what you're looking for is this, but it's really very hard to tell because you have described your problem so poorly
I've had to make a lot of assumptions, but at least the output matches what you say you're expecting
use strict;
use warnings 'all';
my $data_file = 'text.txt';
my #data;
{
open my $fh, '<', $data_file or die qq{Unable to open "$data_file" for input: $1};
while ( <$fh> ) {
next unless /\S/;
push #data, [ split ];
}
}
while ( <> ) {
next unless /\S/;
my ($name, $number, $expression) = split;
for my $item ( #data ) {
my ($what, $start, $stop, $chr, $who) = #$item;
if ( $what eq $name or defined $who and $who eq $expression ) {
print join("\t", $name, $number, $expression, $chr, $start, $stop), "\n";
}
}
}
output
A1B 1 68 21 30214006 30216073
A1C 299 0 Y 20808776 20811809

passing a paramter into a loop perl using a subroutine

i want to read a file1 that containfor every word a numeric reprsentation, for example:
clinton 279
capital 553
fond|fonds 1410
I read a second file, every time i find a number i replace it with the corresponding word. above an example of second file
279 695 696 152 - 574
553 95 74 96 - 444
1410 74 95 96 - 447
The problem in my code is that it execute the subroutine only one time. and it only show:
279 clinton
normally in this example it should show 3 words, when i add print $b; in the subrtoutine it show the different numbers.
#!/usr/bin/perl
use stricrt;
use warnings;
use autodie;
my #a,my $i,my $k;
my $j;
my $fich_in = "C:\\charm\\ats\\4.con";
my $fich_nom = "C:\\charm\\ats\\ats_dict.txt";
open(FICH1, "<$fich_in")|| die "Problème d'ouverture : $!";
open my $fh, '<', $fich_nom;
#here i put the file into a table
while (<FICH1>) {
my $ligne=$_;
chomp $ligne;
my #numb=split(/-/,$ligne);
my $b=$numb[0];
$k=$#uniq+1;
print "$b\n";
my_handle($fh,$b);
}
sub my_handle {
my ($handle,$b) = #_;
my $content = '';
#print "$b\n";
## or line wise
while (my $line = <$handle>){
my #liste2=split(/\s/,$line);
if($liste2[1]==$b){
$i=$liste2[0];
print "$b $i";}
}
return $i;
}
close $fh;
close(FIC1);
The common approach to similar problems is to hash the "dictionary" first, than iterate over the second file and search for replacements in the hash table:
#!/usr/bin/perl
use warnings;
use strict;
my $fich_in = '4.con';
my $fich_nom = 'ats_dict.txt';
open my $F1, '<', $fich_in or die "Problème d'ouverture $fich_in : $!";
open my $F2, '<', $fich_nom or die "Problème d'ouverture $fich_nom : $!";;
my %to_word;
while (<$F1>) {
my ($word, $code) = split;
$to_word{$code} = $word;
}
while (<$F2>) {
my ($number_string, $final_num) = split / - /;
my #words = split ' ', $number_string;
$words[0] = $to_word{ $words[0] } || $words[0];
print "#words - $final_num";
}

match columns on different lines and sum

I have a csv with about 160,000 lines, it looks like this:
chr1,160,161,3,0.333333333333333,+
chr1,161,162,4,0.5,-
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,18,0.5,+
chr2,511,512,6,0.333333333333333,-
I would like to pair lines where column 1 is the same, column 3 matches column 2 and where column 6 is a '+' while on the other line it is a '-'. If this is true I would like to sum column 4 and column 5.
My desired out put would be
chr1,160,161,7,0.833333333333333,+
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,24,0.833333333333333,-
the best solution I can think of is to duplicate the file and then match columns between the file and it's duplicate with perl:
#!/usr/bin/perl
use strict;
use warnings;
open my $firstfile, '<', $ARGV[0] or die "$!";
open my $secondfile, '<', $ARGV[1] or die "$!";
my ($chr_a, $chr_b,$start,$end,$begin,$finish, $sum_a, $sum_b, $total_a,
$total_b,$sign_a,$sign_b);
while (<$firstfile>) {
my #col = split /,/;
$chr_a = $col[0];
$start = $col[1];
$end = $col[2];
$sum_a = $col[3];
$total_a = $col[4];
$sign_a = $col[5];
seek($secondfile,0,0);
while (<$secondfile>) {
my #seccol = split /,/;
$chr_b = $seccol[0];
$begin = $seccol[1];
$finish = $seccol[2];
$sum_b = $seccol[3];
$total_b = $seccol[4];
$sign_b = $seccol[5];
print join ("\t", $col[0], $col[1], $col[2], $col[3]+=$seccol[3],
$col[4]+=$seccol[4], $col[5]),
"\n" if ($chr_a eq $chr_b and $end==$begin and $sign_a ne $sign_b);
}
}
And that works fine, but ideally I'd like to be able to do this within the file itself without having to duplicate it, because I have many files and so I would like to run a script over all of them that is less time-consuming.
Thanks.
In the absence of a response to my comment, this program will do as you ask with the data you provide.
use strict;
use warnings;
my #last;
while (<DATA>) {
s/\s+\z//;
my #line = split /,/;
if (#last
and $last[0] eq $line[0]
and $last[2] eq $line[1]
and $last[5] eq '+' and $line[5] eq '-') {
$last[3] += $line[3];
$last[4] += $line[4];
print join(',', #last), "\n";
#last = ()
}
else {
print join(',', #last), "\n" if #last;
#last = #line;
}
}
print join(',', #last), "\n" if #last;
__DATA__
chr1,160,161,3,0.333333333333333,+
chr1,161,162,4,0.5,-
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,18,0.5,+
chr2,511,512,6,0.333333333333333,-
output
chr1,160,161,7,0.833333333333333,+
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,24,0.833333333333333,+

Parsing out text from string

I have a tab-delimited file1:
20 50 80 110
520 590 700 770
410 440 20 50
300 340 410 440
read and put them into an array:
while(<INPUT>)
{
chomp;
push #inputarray, $_;
}
Now I'm looping through another file2:
20, 410, 700
80, 520
300
foreach number of each line in file2, I want to search the #inputarray for the number. If it exists, I want to grab the corresponding number that follows. For instance, for number 20, I want to grab the number 50. I assume that they are still separated by a tab in the string that exists as an array element in #inputarray.
while(my $line = <INPUT2>)
{
chomp $line;
my #linearray = split("\t", $line);
foreach my $start (#linearray)
{
if (grep ($start, #inputarray))
{
#want to grab the corresponding number
}
}
}
Once grep finds it, i don't know how to grab that array element to find the position of the number to extract the corresponding number using perhaps the substr function. How do i grab the array element that grep found?
A desired output would be:
line1:
20 50
410 440
700 770
line2:
80 110
520 590
line3:
300 340
IMHO, it would be best to store the numbers from file1 in a hash. Referring to the example clontent of file1 as you provided above you can have something like below
{
'20' => '50',
'80' => '110',
'520'=> '590',
'700'=> '770',
'410'=> '440',
'20' => '50',
'300'=> '340',
'410' => '440'
}
A sample piece of code will be like
my %inputarray;
while(<INPUT>)
{
my #numbers = split $_;
my $length = scalar $numbers;
# For $i = 0 to $i < $length;
# $inputarray{$numbers[$i]} = $numbers[$i+1];
# $i+=2;
}
An demonstration of the above loop
index: 0 1 2 3
numbers: 20 50 80 110
first iteration: $i=0
$inputarray{$numbers[0]} = $numbers[1];
$i = 2; #$i += 2;
second iteration: $i=2
$inputarray{$numbers[2]} = $numbers[3];
And then while parsing file2, you just need to treat the number as the key of %inputarray.
I believe this gets you close to what you want.
#!/usr/bin/perl -w
my %follows;
open my $file1, "<", $ARGV[0] or die "could not open $ARGV[0]: $!\n";
while (<$file1>)
{
chomp;
my $prev = undef;
foreach my $curr ( split /\s+/ )
{
$follows{$prev} = $curr if ($prev);
$prev = $curr;
}
}
close $file1;
open my $file2, "<", $ARGV[1] or die "could not open $ARGV[1]: $!\n";
my $lineno = 1;
while (<$file2>)
{
chomp;
print "line $lineno\n";
$lineno++;
foreach my $val ( split /,\s+/, $_ )
{
print $val, " ", ($follows{$val} // "no match"), "\n";
}
print "\n";
}
If you only want to consider numbers from file1 in pairs, as opposed to seeing which numbers follow what other numbers without taking pair boundaries into account, then you need to change the logic in the first while loop slightly.
#!/usr/bin/perl -w
my %follows;
open my $file1, "<", $ARGV[0] or die "could not open $ARGV[0]: $!\n";
while (<$file1>)
{
chomp;
my $line = $_;
while ( $line =~ s/(\S+)\s+(\S+)\s*// )
{
$follows{$1} = $2;
}
}
close $file1;
open my $file2, "<", $ARGV[1] or die "could not open $ARGV[1]: $!\n";
my $lineno = 1;
while (<$file2>)
{
chomp;
print "line $lineno\n";
$lineno++;
foreach my $val ( split /,\s+/, $_ )
{
print $val, " ", ($follows{$val} // "no match"), "\n";
}
print "\n";
}
If you want to read the input once but check for numbers a lot, you might be better off to split the input line into individual numbers. Then add each each number as key into a hash with the following number as value. That makes reading slow and takes more memory but the second part, where you want to check for following numbers will be a breeze thanks to exist and the nature of hashes.
If i understood your question correct, you could use just one big hash. That is of course assuming that every number is always followed by the same number.