How to identify nth lines of n files in while<> - perl

I have code which adds all vectors in all files. There can be any number of input files. For example, the first input file is:
0.55 0 0.3335 1.2
0.212 0 2.2025 1
and the second one is:
0.25 0 0.3333 1.0
0.1235 0 0.2454 1
What I get is the sum of all vectors, so as a result I get one vector:
1.13550 0 3.1147 4.2
But I'm trying to sum the first vector of the first file with the first vector of the second file, and so on. According to this example, the result should be 2 vectors.
For now I have this:
use strict;
use warnings;

if ($ARGV[0] ne "vector1.dat") {
    die("vector1.dat is necessary as first argument");
}

my @sum = 0;
my $dim = 0;

while (<>) {
    #Ignore blank lines, hashtags
    #and lines starting with $
    if ($_ =~ /#/ || $_ =~ /^$/ || $_ =~ /^\s$/) {
        next;
    }
    my @vectors = split(" ", $_);
    my $vector_length = @vectors;
    if ($dim eq 0) {
        $dim = $vector_length;
    }
    else {
        if ($dim ne $vector_length) {
            die("Vector dimensions do not match. : $!");
        }
    }
    for (my $i = 0; $i <= $#vectors; $i++) {
        $sum[$i] += $vectors[$i];
    }
}

$" = "\t\t";
print "\n --- \n @sum \n";
What I need is just to find out how to identify each file's nth line, and to sum the column values of those lines, keeping in mind that there can be n files.
I saw a filehandling question on here with a similar issue, but I didn't find my answer there.
Just looking for some suggestions and guidance; I got stuck on this.

Open each file yourself and use the $. variable to know which line you are on (or count the lines yourself). Here's the basic structure:
foreach my $file ( @files ) {
    open my $fh, '<', $file or die ...;
    while( <$fh> ) {
        chomp;
        $sum[ $. ] = ...; # $. is the line number
    }
}
If you don't like $., you can use its longer name. You have to turn on English (which comes with Perl):
use English;
## use English qw( -no_match_vars ); # for v5.16 and earlier
foreach my $file ( @files ) {
    open my $fh, '<', $file or die ...;
    while( <$fh> ) {
        chomp;
        $sum[ $INPUT_LINE_NUMBER ] = ...;
    }
}
Or, you can count yourself, which might be handy if the vectors in the files don't line up by strict line number (perhaps because of comments or some other formatting oddity):
foreach my $file ( @files ) {
    open my $fh, '<', $file or die ...;
    my $line = -1;
    while( <$fh> ) {
        $line++;
        chomp;
        $sum[ $line ] = ...;
    }
}
The harder way is the answer bart gives, which inspects eof at the end of every line to see if the magical ARGV handle is looking at a new file, and resets $. if it is. It's an interesting trick, but hardly anyone is going to understand what it's doing (or even notice it).
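For reference, a minimal runnable sketch of that trick (the next answer uses the same idea); it assumes one number per line of input just to have something to sum:
use strict;
use warnings;

my @sum;
while ( my $line = <> ) {
    chomp $line;
    $sum[ $. ] += $line;   # $. restarts at 1 for every file
} continue {
    close ARGV if eof;     # reset $. at the end of each file
}
print "line $_: $sum[$_]\n" for 1 .. $#sum;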
For the other part of the problem, I think you're doing the vector sum wrong, or at least using confusing variable names. A line is a vector, and the numbers in a line are its components. A two-dimensional array will work: the first index is the line number and the second is the component index:
while( <$fh> ) {
    chomp;
    ... skip unwanted lines
    my @components = split;
    ... various dimension checks
    foreach my $i ( 0 .. $#components ) {
        $sum[ $. ][ $i ] += $components[ $i ];
    }
}
The Data::Dumper module is handy for complex data structures. You can also see the perldsc (Perl Data Structures Cookbook) documentation. The $. variable is found in perlvar.
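For instance, a quick way to inspect what ended up in @sum while debugging (a sketch, assuming the @sum built above):
use Data::Dumper;
print Dumper( \@sum );   # dumps the nested line-by-component structure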

$. is the line number of the most recently read file handle. close(ARGV) if eof; can be used to reset the line number between files (as documented in eof). (Note: eof() is different from eof.) So you now have line numbers.
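To illustrate the difference between the two forms (following the patterns in the eof documentation):
while (<>) {
    print "-- end of the current file\n" if eof;    # true on each file's last line
    print "== end of all input\n"        if eof();  # true only on the very last line
}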
The second problem is that you are adding vector components ($vectors[$i]) to vectors ($sum[$i]). You need to add vector components to vector components. Start by using more appropriate variable names.
This is what we get:
my @sum_vectors;
while (<>) {
    s/#.*//;          # Remove comments.
    next if /^\s*$/;  # Ignore blank lines.
    my @vector = split;
    if ($sum_vectors[$.] && @{ $sum_vectors[$.] } != @vector) {
        die("$ARGV:$.: Vector dimensions do not match\n");
    }
    for my $i (0..$#vector) {
        $sum_vectors[$.][$i] += $vector[$i];
    }
} continue {
    close(ARGV) if eof;  # Reset line numbers for each file.
}
Two other errors fixed:
$! did not contain anything meaningful when you used it; it is only set by failed system calls such as open.
You ignored any line containing a comment, even when it also contained valid data; stripping the comment with s/#.*// keeps that data.

Try this:
#!/usr/bin/perl
use strict;
use warnings;

if ($ARGV[0] ne "vector1.dat") {
    die("vector1.dat is necessary as first argument");
}

my %sum;
my $dim = 0;
my $vector_length;
my $line_number;

while (<>) {
    #Ignore blank lines, hashtags
    #and lines starting with $
    if ($_ =~ /#/ || $_ =~ /^$/ || $_ =~ /^\s$/) {
        next;
    }
    my @vectors = split(" ", $_);
    $vector_length = @vectors;
    if ($dim eq 0) {
        $dim = $vector_length;
    }
    else {
        if ($dim ne $vector_length) {
            die("Vector dimensions do not match.");
        }
    }
    for (my $i = 0; $i <= $#vectors; $i++) {
        $sum{$.}{$i} += $vectors[$i];
    }
    $line_number = $.;
    $. = 0 if eof;   # reset the line counter between files
}

$" = "\t\t";
for (my $line = 1; $line <= $line_number; $line++) {
    print $line;
    for (my $vector = 0; $vector < $vector_length; $vector++) {
        print " " . $sum{$line}{$vector};
    }
    print "\n";
}
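Assuming this script is saved as sum_vectors.pl (a hypothetical name), it would be run with the data files as arguments:
perl sum_vectors.pl vector1.dat vector2.dat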

Related

How can I calculate the geometric center of a protein in Perl?

I have a PDB file which contains information about a specific protein. One piece of information it holds is the positions of the different atoms (xyz coordinates).
The file is the following: https://files.rcsb.org/view/6U9D.pdb . With this file I want to calculate the geometric center of the atoms. In theory I know what I need to do, but the script I wrote does not seem to work.
The first part of the program, before the for loop, is another part of the assignment, which requires me to read the sequence and convert it from the 3-letter nomenclature to the 1-letter one. The part that interests me is the for loop until the end. I tried to pattern match in order to isolate the xyz coordinates. Then I used a counter that I had set up in the beginning, the $k variable. When I check the output on Cygwin, the only output I get is the sequence and 0 0 0 instead of the sum of each dimension divided by $k. Any clue what has gone wrong?
$k=0;
open (IN, '6U9D.pdb.txt');
%amino_acid_conversion = (
    ALA=>'A',TYR=>'Y',MET=>'M',LEU=>'L',CYS=>'C',
    GLY=>'G',ARG=>'R',ASN=>'N',ASP=>'D',GLN=>'Q',
    GLU=>'E',HIS=>'H',TRP=>'W',LYS=>'K',PHE=>'F',
    PRO=>'P',SER=>'S',THR=>'T',ILE=>'I',VAL=>'V'
);
while (<IN>) {
    if ($_=~m/HEADER\s+(.*)/){
        print ">$1\n";
    }
    if ($_=~m/^SEQRES\s+\d+\s+\w+\s+\d+\s+(.*)/){
        $seq.=$1;
        $seq=~s/ //g;
    }
}
for ($i=0;$i<=length $seq; $i+=3) {
    print "$amino_acid_conversion{substr($seq,$i,3)}";
    if ($_=~m/^ATOM\s+\d+\s+\w+\s+\w+\s+\w+\s+\d+\s+(\S+)\s+(\S+)\s+(\S+)/) {
        $x+=$1; $y+=$2; $z+=$3; $k++;
    }
}
print "\n";
#print $k;
$xgk=($x/$k); $ygk=($y/$k); $zgk=($z/$k);
print "$xgk $ygk $zgk \n";
I do not know bioinformatics, but the immediate bug is that your ATOM match sits inside the for loop that walks $seq, where $_ still holds whatever line was read last, so the coordinates are never accumulated; the match belongs in the while loop that reads the file. It seems like you should do something like this:
use feature qw(say);
use strict;
use warnings;

my $fn = '6U9D.pdb';
open ( my $IN, '<', $fn ) or die "Could not open file '$fn': $!";

my $seq = '';
my $x = 0;
my $y = 0;
my $z = 0;
my $k = 0;
while (<$IN>) {
    if ($_ =~ m/HEADER\s+(.*)/) {
        say ">$1";
    }
    if ($_ =~ m/^SEQRES\s+\d+\s+\w+\s+\d+\s+(.*)/) {
        $seq .= $1;
    }
    if ($_ =~ m/^ATOM\s+\d+\s+\w+\s+\w+\s+\w+\s+\d+\s+(\S+)\s+(\S+)\s+(\S+)/) {
        $x+=$1; $y+=$2; $z+=$3; $k++;
    }
}
close $IN;
$seq =~ s/ //g;

my %amino_acid_conversion = (
    ALA=>'A',TYR=>'Y',MET=>'M',LEU=>'L',CYS=>'C',
    GLY=>'G',ARG=>'R',ASN=>'N',ASP=>'D',GLN=>'Q',
    GLU=>'E',HIS=>'H',TRP=>'W',LYS=>'K',PHE=>'F',
    PRO=>'P',SER=>'S',THR=>'T',ILE=>'I',VAL=>'V'
);
my %unknown_keys;
my $conversion = '';
say "Sequence length: ", length $seq;
for (my $i=0; $i < length $seq; $i += 3) {
    my $key = substr $seq, $i, 3;
    if (exists $amino_acid_conversion{$key}) {
        $conversion .= $amino_acid_conversion{$key};
    }
    else {
        $unknown_keys{$key}++;
    }
}
say "Conversion: $conversion";
say "Unknown keys: ", join ",", keys %unknown_keys;
say "Number of atoms: ", $k;
my $xgk = ($x/$k);
my $ygk = ($y/$k);
my $zgk = ($z/$k);
say "Geometric center: $xgk $ygk $zgk";
This gives me the following output:
[...]
Number of atoms: 76015
Geometric center: 290.744642162734 69.196842162731 136.395842938893

Compare two txt files

I'm new to Perl. I have two text files, and I need to check for matching strings between the two lists.
For example matching strings are:
line - file 1: fe/bla/blablabla/abcdefg
line - file 2: blablabla/abcdefg
There is a match!
In addition, the location (line number) of the matching strings is not the same in the two files.
I tried putting the lists in arrays and comparing them with a nested loop, but the running time of the program is huge (the lists contain thousands of lines), and I believe there is another way, less naïve and more productive.
This is the way I put the data in the array:
my $list1 = '/path/to/the/file';
open (FILE , '<' , $list1) or die ("Could not open the file");
while ( my $line = <FILE> ) {
    chomp ($line);
    $list_1[$i] = $line;
    $i = $i + 1;
}
close FILE;
close FILE;
I did the same for the other list as well. And this is my nested loop:
for ( $k = 0 ; $k < @list_1 ; $k = $k + 1 ) {
    for ( $i = 0 ; $i < @list_2 ; $i = $i + 1 ) {
        if (index($list_1[$k] , $list_2[$i]) != -1) {
            splice (@list_2 , $i , 1);
            last;
        }
    }
}
As long as file2 isn't enormous, the simplest way is to build a regular expression pattern from its contents and check each line in file1 against the pattern.
You don't say what output you want, so I have printed all lines in file1 that have a match in file2.
use strict;
use warnings;
use 5.010;
use autodie;

my ($list1, $list2) = qw( /path/to/list1 /path/to/list2 );

open my $fh, '<', $list2;
my $re = join '|', map { chomp; quotemeta; } <$fh>;
$re = qr/$re/;

open $fh, '<', $list1;   # note: read list1 here, not list2 again
while ( <$fh> ) {
    print if /$re/;
}
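If file2 does grow large, one alternative is a hash lookup instead of one big alternation. This sketch assumes, as in your example, that a file2 line only ever matches as a trailing run of /-separated components of a file1 line; the paths are placeholders:
use strict;
use warnings;

my %want;
open my $fh2, '<', '/path/to/list2' or die "list2: $!";
while ( my $line = <$fh2> ) {
    chomp $line;
    $want{$line} = 1;
}
close $fh2;

open my $fh1, '<', '/path/to/list1' or die "list1: $!";
while ( my $line = <$fh1> ) {
    chomp $line;
    my @parts = split m{/}, $line;
    # test every trailing sequence of path components against the hash
    for my $i ( 0 .. $#parts ) {
        if ( $want{ join '/', @parts[ $i .. $#parts ] } ) {
            print "$line\n";
            last;
        }
    }
}
close $fh1;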

Use perl to check if the next line is a duplicate

I'm trying to read data from a fairly big file. I need to read through the lines of the file and report on any duplicate records that begin with a G.
THIS IS THE DATA:
E123456789
G123456789
h12345
E1234567
E7899874
G123456798
G123465798
h1245
This is example data; there are about 6000 lines muddled in amongst this, but the important records begin with E, G or h.
Here is my code so far:
#!/usr/bin/perl
use strict;
use warnings;

my $infile = $ARGV[0];
my $found_E = 0;
my $sets = 0;

open my $ifh, '<', $infile;
while (<$ifh>) {
    if (/^E/) {
        $found_E = 1;
        next;
    }
    if ($found_E) {
        if (/^G/) {
            $sets += 1;
            $found_E = 0;
            next;
        }
        if (/^h/) {
            print "Error! No G Record at line $.\n";
            exit;
        }
    }
}
close($ifh);
printf "Found %d sets of Enrichment data with G Records \n", $sets;

my @lines;
my %duplicates;
open $ifh, '<', $infile;
while (<$ifh>) {
    @lines = split('', $_);
    if ($lines[0] eq 'G') {
        print if !defined $duplicates{$_};
        $duplicates{$_}++;
    }
}
close($ifh);
As you can see I'm checking that G occurs only after E records and before h records.
The second loop is intended to find duplicates, but right now it just prints all G records.
Also, if someone could advise how to report when there are no E records in the file, that would be appreciated.
Grouped Duplicate Checking
If you just want to check for duplicates which are grouped together, that's easy. You can just check if the current line is the same as the last line:
my $line;
while(<$ifh>) {
    next if (defined $line && $line eq $_);
    $line = $_;
    ...
All Duplicate Checking
If you want to check for all duplicate lines in the file, regardless of their positioning, you'll have to do something like this:
my %seen;
while (<$ifh>) {
    next if exists $seen{$_};
    $seen{$_} = 1;
    ...
This keeps a copy of every seen line in memory, which adds up on a large file, but it's the best option if you don't want to modify the source file.
my %seen_G;
LINE:
while(<$ifh>) {
    my $c = substr( $_, 0, 1 );
    if ( $found_E ) {
        die "Error! No G Record at line $." if $c eq 'h';
        print if ( $c eq 'G' and not $seen_G{ $_ }++ );
    }
    $found_E = ( $c eq 'E' );
}
It's not clear whether you want to skip lines that are duplicates of the previous line, or lines that are duplicates of any earlier line.
Skip lines that are duplicates of the previous line
Just fetch another line if the next line is the same as the last.
my $last;
while (<>) {
    next if /^G/ && defined($last) && $_ eq $last;
    $last = $_;
    ...
}
I'll leave it to you to determine when you actually want to look for duplicates, but I think you want to add a $found_G check to that if.
Skip lines that are duplicates of any previous line
Maintain a collection of the lines you've already seen. Using a hash will allow for quick insertion and lookup.
my %seen;
while (<>) {
    next if /^G/ && $seen{$_}++;
    ...
}

Reading the next line in the file and keeping counts separate

Another question for everyone. To reiterate, I am very new to Perl, and I apologize in advance for any silly mistakes.
I am trying to calculate the GC content of different lengths of DNA sequence. The file is in this format:
>gene 1
DNA sequence of specific gene
>gene 2
DNA sequence of specific gene
...etc...
This is a small piece of the file
>env
ATGCTTCTCATCTCAAACCCGCGCCACCTGGGGCACCCGATGAGTCCTGGGAA
I have established the counter and read each line of DNA sequence, but at the moment it does a running summation of the total across all lines. I want it to read each sequence, print the content after the sequence is read, then move on to the next one, keeping individual base counts for each line.
This is what I have so far.
#!/usr/bin/perl
#necessary code to open and read a new file and create a new one.
use strict;
my $infile = "Lab1_seq.fasta";
open INFILE, $infile or die "$infile: $!";
my $outfile = "Lab1_seq_output.txt";
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!";

#establishing the initial counts for each base
my $G = 0;
my $C = 0;
my $A = 0;
my $T = 0;

#initial loop created to read through each line
while ( my $line = <INFILE> ) {
    chomp $line;
    # reads file until the ">" character is encountered and prints the line
    if ($line =~ /^>/){
        print OUTFILE "Gene: $line\n";
    }
    # otherwise count the content of the next line.
    # my percent counts seem to be incorrect due to my Total length counts
    # skewing the following line. I am currently unsure how to fix that
    elsif ($line =~ /^[A-Z]/){
        my @array = split //, $line;
        my $array = (@array);
        # reset the counts of each variable
        $G = ();
        $C = ();
        $A = ();
        $T = ();
        foreach $array (@array){
            #if statements assess which base is present and keep a running total of the bases.
            if ($array eq 'G'){
                ++$G;
            }
            elsif ( $array eq 'C' ) {
                ++$C; }
            elsif ( $array eq 'A' ) {
                ++$A; }
            elsif ( $array eq 'T' ) {
                ++$T; }
        }
        # all is printed to the outfile
        print OUTFILE "G:$G\n";
        print OUTFILE "C:$C\n";
        print OUTFILE "A:$A\n";
        print OUTFILE "T:$T\n";
        print OUTFILE "Total length:_", ($A+=$C+=$G+=$T), "_base pairs\n";
        print OUTFILE "GC content is(percent):_", (($G+=$C)/($A+=$C+=$G+=$T)*100),"_%\n";
    }
}
#close the outfile and the infile
close OUTFILE;
close INFILE;
Again, I feel like I am on the right path; I am just missing some basic foundations. Any help would be greatly appreciated.
The final problem is in the final counts printed out. My percent values are wrong. I feel like the total is being calculated and then that new value is incorporated back into the total.
Several things:
1. Use a hash instead of declaring each element.
2. An assignment such as $G = (0); does work, but it is not the right way to assign a scalar: the parentheses build a list, and a scalar assignment from a list just takes its last item. The correct way is $G = 0.
my %seen;
$seen{ (/^>(\w+)/)[0] }++ for grep { /^>/ } <INFILE>;
foreach my $gene (keys %seen) {
    print "$gene: $seen{$gene}\n";
}
Just reset the counters when a new gene is found. Also, I'd use hashes for the counting:
use strict; use warnings;

my %counts;

while (<>) {
    if (/^>/) {
        # print counts for the prev gene if there are counts:
        print_counts(\%counts) if keys %counts;
        %counts = (); # reset the counts
        print $_;     # print the Fasta header
    } else {
        chomp;
        $counts{$_}++ for split //;
    }
}
print_counts(\%counts) if keys %counts; # print counts for last gene

sub print_counts {
    my ($counts) = @_;
    print "$_:=", ($counts->{$_} || 0), "\n" for qw/A C G T/;
}
Usage: $ perl count-bases.pl input.fasta.
Example output:
> gene 1
A:=3
C:=1
G:=5
T:=5
> gene 2
A:=1
C:=5
G:=0
T:=13
Style comments:
When opening a file, always use lexical filehandles (normal variables) and the three-argument form of open. I'd also recommend the autodie pragma for automatic error handling (available since perl v5.10.1).
use autodie;
open my $in, "<", $infile;
open my $out, ">", $outfile;
Note that I don't open files in my above script because I use the special ARGV filehandle for input, and print to STDOUT. The output can be redirected on the shell, like
$ perl count-bases.pl input.fasta >counts.txt
Declaring scalar variables with their values in parens, like my $G = (0), is weird but works fine. I think it is more confusing than helpful. → my $G = 0.
Your indentation is a bit weird. It is very unusual and visually confusing to put a closing brace on the same line as another statement, like
...
elsif ( $array eq 'C' ) {
++$C; }
I prefer cuddling elsif:
...
} elsif ($base eq 'C') {
    $C++;
}
This statement my $array = (@array); puts the length of the array into $array. What for? Tip: you can declare variables right inside foreach loops, like for my $base (@array) { ... }.

Generate random pairs from a list of numbers making sure that the generated random pairs are not already present

Given a set of genes and existing pairs of genes, I want to generate new pairs of genes that are not already present.
The genes file has the following format:
123
134
23455
3242
3423
...
...
The gene-pairs file has the following format:
12,345
134,23455
23455,343
3242,464452
3423,7655
...
...
But I still get a few common elements between known_interactions and new_pairs. I'm not sure where the error is.
For the arguments:
perl generate_random_pairs.pl entrez_genes_file known_interactions_file 250000
I got 15880 common elements. The number 250000 tells the program how many random pairs I want it to generate.
#!/usr/bin/perl
use strict;
use warnings;

if (@ARGV != 3) {
    die "Usage: generate_random_pairs.pl <entrez_genes> <known_interactions> <number_of_interactions>\n";
}
my ($e_file, $k_file, $interactions) = @ARGV;

open (IN, $e_file) or die "Error!! Cannot open $e_file\n";
open (IN2, $k_file) or die "Error!! Cannot open $k_file\n";

my @e_file = <IN>;  s/\s+\z// for @e_file;
my @k_file = <IN2>; s/\s+\z// for @k_file;

my (%known_interactions);
my %entrez_genes;
$entrez_genes{$_}++ foreach @e_file;

foreach my $line (@k_file) {
    my @array = split (/,/, $line);
    $known_interactions{$array[0]} = $array[1];
}

my $count = 0;
foreach my $key1 (keys %entrez_genes) {
    foreach my $key2 (keys %entrez_genes) {
        if ($key1 != $key2) {
            if (exists $known_interactions{$key1} && ($known_interactions{$key1} == $key2)) { next; }
            if (exists $known_interactions{$key2} && ($known_interactions{$key2} == $key1)) { next; }
            if ($key1 < $key2) { print "$key1,$key2\n"; $count++; }
            else { print "$key2,$key1\n"; $count++; }
        }
        if ($count == $interactions) {
            die "$count\n";
        }
    }
}
I can see nothing wrong with your code. I wonder if you have some whitespace in your data, either after the comma or at the end of the line? It would be safer to extract just the digit fields with, for instance,
my @e_file = map /\d+/g, <IN>;
Also, you would be better off keeping both elements of the pair as the hash key, so that you can just check for the existence of the element. And if you make sure the lower number always comes first, you don't need to do two lookups.
This example should work for you. It doesn't address the random-selection part of your requirement, but that wasn't in your own code and wasn't your immediate problem.
use strict;
use warnings;

@ARGV = qw/ entrez_genes.txt known_interactions.txt 9 /;

if (@ARGV != 3) {
    die "Usage: generate_random_pairs.pl <entrez_genes> <known_interactions> <number_of_interactions>\n";
}
my ($e_file, $k_file, $interactions) = @ARGV;

open my $fh, '<', $e_file or die "Error!! Cannot open $e_file: $!";
my @e_file = sort { $a <=> $b } map /\d+/g, <$fh>;

open $fh, '<', $k_file or die "Error!! Cannot open $k_file: $!";
my %known_interactions;
while (<$fh>) {
    my $pair = join ',', sort { $a <=> $b } /\d+/g;
    $known_interactions{$pair}++;
}
close $fh;

my $count = 0;
PAIR:
for my $i (0 .. $#e_file-1) {
    for my $j ($i+1 .. $#e_file) {
        my $pair = join ',', @e_file[$i, $j];
        unless ($known_interactions{$pair}) {
            print $pair, "\n";
            last PAIR if ++$count >= $interactions;
        }
    }
}
print "\nTotal of $count interactions\n";
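For the random-selection part that this answer deliberately skips, here is a hedged sketch; the sample data is made up, and List::Util is a core module:
use strict;
use warnings;
use List::Util qw(shuffle);

my @e_file = (123, 134, 3242, 3423, 23455);     # example genes
my %known_interactions = ( '134,23455' => 1 );  # example known pair
my $interactions = 3;                           # how many new pairs to draw

my %generated;
while ( keys %generated < $interactions ) {
    my ($g1, $g2) = (shuffle @e_file)[0, 1];            # two distinct genes
    my $pair = join ',', sort { $a <=> $b } $g1, $g2;   # lower number first
    next if $known_interactions{$pair} or $generated{$pair};
    $generated{$pair} = 1;
    print "$pair\n";
}
# Note: this loops forever if fewer than $interactions unseen pairs exist,
# so keep the requested count below the number of possible pairs.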
First of all, you are not chomping (removing newlines from) your file of known interactions. That means that given a file like:
1111,2222
you will build this hash:
$known_interactions{1111} = "2222\n";
That is probably why you are getting duplicate entries. My guess is (I can't be sure without your actual input files) that these loops should work:
map {
    chomp;
    $entrez_genes{$_}++;
} @e_file;
and
map {
    chomp;
    my @array = sort(split (/,/));
    $known_interactions{$array[0]} = $array[1];
} @k_file;
Also, as a general rule, I find my life is easier if I sort the interacting pair (the joys of bioinformatics :) ). That way I know that 111,222 and 222,111 will be treated the same way, and I can avoid the multiple if statements you have in your code.
Your next loop would then be (which IMHO is more readable):
my @genes = keys(%entrez_genes);
for (my $i = 0; $i <= $#genes; $i++) {
    for (my $k = $i; $k <= $#genes; $k++) {
        next if $genes[$i] == $genes[$k];
        my @pp = sort($genes[$i], $genes[$k]);
        next if exists $known_interactions{$pp[0]}
             && $known_interactions{$pp[0]} == $pp[1];
        print "$pp[0], $pp[1]\n";
        $count++;
        die "$count\n" if $count == $interactions;
    }
}