Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 8 years ago.
Improve this question
I am trying to print values of range from #arr3 which are same and lie outside the range from #arr4 (not included in range of arr4) but I am not getting the desired output. Please suggest me the modifications in the following code to get the output as 1,2,8,13 (without repeating the values if any)
File 1: result
1..5
5..10
10..15
File 2: annotation
3..7
9..12
14..17
Code:
#!/usr/bin/perl
open($inp1, "<result") or die "not found";
open($inp2, "<annotation") or die "not found";
my #arr3 = <$inp1>;
my #arr4 = <$inp2>;
foreach my $line1 (#arr4) {
foreach my $line2 (#arr3) {
my ($from1, $to1) = split(/\.\./, $line1);
my ($from2, $to2) = split(/\.\./, $line2);
for (my $i = $from1 ; $i <= $to1 ; $i++) {
for (my $j = $from2 ; $j <= $to2 ; $j++) {
$res = grep(/$i/, #result); #to avoid repetition
if ($i != $j && $res == 0) {
print "$i \n";
push(#result, $i);
}
}
}
}
}
Try this:
#!/usr/bin/perl
use strict;
open (my $inp1,"<result.txt") or die "not found";
open (my $inp2,"<annotation.txt") or die "not found";
my #result;
my #annotation;
foreach my $line2 (<$inp2>) {
my ($from2,$to2)=split(/\.\./,$line2);
#annotation = (#annotation, $from2..$to2);
}
print join(",",#annotation),"\n";
my %in_range = map {$_=> 1} #annotation;
foreach my $line1 (<$inp1>) {
my ($from1,$to1)=split(/\.\./,$line1);
#result = (#result, $from1..$to1);
}
print join(",",#result),"\n";
my %tmp_hash = map {$_=> 1} #result;
my #unique = sort {$a <=> $b} keys %tmp_hash;
print join(",",#unique),"\n";
my #out_of_range = grep {!$in_range{$_}} #unique;
print join(",",#out_of_range),"\n";
The print statements are temporary, of course, to help show what's going on when you run this. The basic idea is you use one hash to eliminate duplicate numbers in your "result", and another hash to indicate which ones are in the "annotations".
If you used pattern-matching rather than split then I think it would be a little easier to make this ignore extra lines of input that aren't ranges of numbers, in case you ever have input files with a few "extra" lines that you need to skip over.
If the contents of the files is under your control, you can make use of eval for parsing them. On the other hand, if there might be something else than what you specified, the following is dangerous to use.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
use Data::Dumper;
open my $inc, '<', 'result';
open my $exc, '<', 'annotation';
my (%include, %exclude, #result);
while (<$inc>) { $include{$_} = 1 for eval $_ }
while (<$exc>) { $exclude{$_} = 1 for eval $_ }
for (sort {$a <=> $b} keys %include) {
push #result, $_ unless $exclude{$_}
}
print Dumper \#result;
Returns:
$VAR1 = [ 1, 2, 8, 13 ];
The only major tool you need is a %seen style hash as represented in perlfaq4 - How can I remove duplicate elements from a list or array?
The following opens filehandles to string references, but obviously these could be replaced with the appropriate file names:
use strict;
use warnings;
use autodie;
my %seen;
open my $fh_fltr, '<', \ "3..7\n9..12\n14..17\n";
while (<$fh_fltr>) {
my ($from, $to) = split /\.{2}/;
$seen{$_}++ for $from .. $to;
}
my #result;
open my $fh_src, '<', \ "1..5\n5..10\n10..15\n";
while (<$fh_src>) {
my ($from, $to) = split /\.{2}/;
push #result, $_ for grep {!$seen{$_}++} $from .. $to;
}
print "#result\n";
Outputs:
1 2 8 13
Related
Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 months ago.
Improve this question
I have inherited an older script that uses hash references that I don't understand. It results in:
Can't use an undefined value as a HASH reference at
./make_quar_dbfile.pl line 65.
63 my $bucket = sprintf('%02x', $i);
64 my $file = sprintf('%s/%02x.db', $qdir, $i);
65 %{$hashes{$bucket}} ? 1 : next;
66 tie (my %hash, 'DB_File', $file, O_RDWR, 0600) || die "Can't open db file: $! \n ";
67 %hash = %{$hashes{$bucket}};
68 untie %hash;
The script reads through a number of gzipd emails to identify the sender/recip/subject/date etc., then writes that info to a DB_File hash.
This script used to work with older versions of Perl, but looks like it now is no longer compliant.
I'd really like to understand how this works, but I don't fully understand reference/dereference, why it's even necessary here, and the %{$var} notation. All of the references I've studied talk about hash references in terms of $hash_ref = \%author; not %hash_ref = %{$author}, for example.
Ideas on how to get this to work with hash references would be greatly appreciated.
#!/usr/bin/perl -w
use DB_File;
use File::Basename qw(basename);
use vars qw($verbose);
use strict;
use warnings;
sub DBG($);
$verbose = shift || 1;
my $qdir = '/var/spool/amavisd/qdb';
my $source_dir = '/var/spool/amavisd/quarantine';
my $uid = getpwnam('amavis');
my $gid = getgrnam('amavis');
my %hashes = ( );
my $me = basename($0);
my $version = '1.9';
my $steps = 100;
my $cnt = 0;
DBG("- Creating initial database files...");
for (my $i = 0; $i < 256; $i++) {
my $file = sprintf('%s/%02x.db', $qdir, $i);
unlink $file || DBG("Could not unlink $file to empty db: $! \n");
tie (my %hash, "DB_File", $file, O_CREAT, 0600) || die "Can't open db file: $! \n";
untie %hash;
chown($uid, $gid, $file) || die "Unable to set attributes on file: $! \n";
}
DBG("done\n");
opendir SOURCEDIR, $source_dir || die "Cannot open $source_dir: $! \n";
DBG("- Building hashes... ");
foreach my $f (sort readdir SOURCEDIR) {
next if ($f eq "." || $f eq "..");
if ($f =~ m/^(spam|virus)\-([^\-]+)\-([^\-]+)(\.gz)?/) {
my $type = $1;
my $key = $3;
my $bucket = substr($key, 0, 2);
my $d = $2;
my $subj = '';
my $to = '';
my $from = '';
my $size = '';
my $score = '0.0';
if (($cnt % $steps) == 0) { DBG(sprintf("\e[8D%-8d", $cnt)); } $cnt++;
if ($f =~ /\.gz$/ && open IN, "zcat $source_dir/$f |") {
while(<IN>) {
last if ($_ eq "\n");
$subj = $1 if (/^Subject:\s*(.*)$/);
$to = $1 if (/^To:\s*(.*)$/);
$from = $1 if (/^From:\s*(.*)$/);
$score = $1 if (/score=(\d{1,3}\.\d)/);
}
close IN;
$to =~ s/^.*\<(.*)\>.*$/$1/;
$from =~ s/^.*\<(.*)\>.*$/$1/;
$size = (stat("$source_dir/$f"))[7];
$hashes{$bucket}->{$f} = "$type\t$d\t$size\t$from\t$to\t$subj\t$score";
}
}
}
closedir SOURCEDIR;
DBG("...done\n\n- Populating database files...");
for (my $i = 0; $i < 256; $i++) {
my $bucket = sprintf('%02x', $i);
my $file = sprintf('%s/%02x.db', $qdir, $i);
%{$hashes{$bucket}} ? 1 : next;
tie (my %hash, 'DB_File', $file, O_RDWR, 0600) || die "Can't open db file: $! \n ";
%hash = %{$hashes{$bucket}};
untie %hash;
}
exit(0);
sub DBG($) { my $msg = shift; print $msg if ($verbose); }
What is $hash{$key}? A value associated with the (value of) $key, which must be a scalar. So we get the $value out of my %hash = ( $key => $value ).
That's a string, or a number. Or a filehandle. Or, an array reference, or a hash reference. (Or an object perhaps, normally a blessed hash reference.) They are all scalars, single-valued things, and as such are a legitimate value in a hash.
The syntax %{ ... } de-references a hash reference† so judged by %{ $hashes{$bucket} } that code expects there to be a hash reference. So the error says that there is actually nothing in %hashes for that value of a would-be key ($bucket), so it cannot "de-reference" it. There is either no key that is the value of $bucket at that point in the loop, or there is such a key but it has never been assigned anything.
So go debug it. Add printing statements through the loops so you can see what values are there and what they are, and which ones aren't even as they are assumed to be. Hard to tell what fails without running that program.
Then, the line %{$hashes{$bucket}} ? 1 : next; is a little silly. The condition of the ternary operator evaluates to a boolean, "true" (not undefined, not 0, not empty string '') or false. So it tests whether $hashes{$bucket} has a hashref with at least some keys, and if it does then it returns 1; so, the for loop continues. Otherwise it skips to the next iteration.
Well, then skip to next if there is not a (non-empty) hashref there:
next if not defined $hashes{$bucket} or not %{ $hashes{$bucket} };
Note how we first test whether there is such a key, and only then attempt to derefence it.
† Whatever expression may be inside the curlies must evaluate to a hash reference. (If it's else, like a number or a string, the code would still exit with an error but with a different one.)
So, in this code, the hash %hashes must have a key that is the value of $bucket at that point, and the value for that key must be a hash reference. Then, the ternary operator tests whether the hash obtained from that hash reference has any keys.
You need to understand references first, this is a kind of how-to :
#!/usr/bin/perl
use strict; use warnings;
use feature qw/say/;
use Data::Dumper;
my $var = {}; # I create a HASH ref explicitly
say "I created a HASH ref explicitly:";
say ref($var);
say "Now, let's add any type of content:";
say "Adding a ARRAY:";
push #{ $var->{arr} }, (0..5);
say Dumper $var;
say "Now, I add a new HASH";
$var->{new_hash} = {
foo => "value",
bar => "other"
};
say Dumper $var;
say 'To access the data in $var without Data::Dumper, we need to dereference what we want to retrieve';
say "to retrieve a HASH ref, we need to dereference with %:";
while (my ($key, $value) = each %{ $var->{new_hash} }) {
say "key=$key value=$value";
}
say "To retrieve the ARRAY ref:";
say join "\n", #{ $var->{arr} };
Output
I created a HASH ref explicitely:
HASH
Now, let's add any type of content:
Adding a ARRAY:
$VAR1 = {
'arr' => [
0,
1,
2,
3,
4,
5
]
};
Now, I add a new HASH
$VAR1 = {
'new_hash' => {
'foo' => 'value',
'bar' => 'other'
},
'arr' => [
0,
1,
2,
3,
4,
5
]
};
To access the data in $var without Data::Dumper, we need to dereference what we want to retrieve
to retrieve a HASH ref, we need to dereference with %:
key=foo value=value
key=bar value=other
To retrieve the ARRAY ref:
0
1
2
3
4
5
Now with your code, instead of
%{$hashes{$bucket}} ? 1 : next;
You should test the HASH ref first, because Perl say it's undefined, let's debug a bit:
use Data::Dumper;
print Dumper $hashes;
print "bucket=$bucket\n";
if (defined $hashes{$bucket}) {
print "Defined array\n";
}
else {
print "NOT defined array\n";
}
I have been stuck with some script!
Well i made this script in 2008 and now i am using with some modifications and i get error!
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
sub getSequences ($) {
my $file = $_[0];
open (INFILE, "<$file") || die "Error1 in opening in file: $file. $!\n";
my #lines = <INFILE>;
my $header; my %header2seq = ();
foreach my $line (#lines) {
chomp $line;
if ($line =~ /^(>.+)$/o) {
$header = $1;
}
else {$header2seq {$header} .= $line; }
}
#print %header2seq;
close (INFILE);
return (\%header2seq);
}
sub MakeSpList ($) {
my $sp_list = $_[0]; my %sp_names = ();
open (INFILE2, "<$sp_list") || die "Error2 in opening in file: $sp_list. $!\n";
my #sps = <INFILE2>;
foreach my $line (#sps) { chomp $line; $sp_names {$line} = 1; }
close (INFILE2);
#print Dumper (%sp_names);
return (\%sp_names);
}
sub CompareSpList2Sequences ($$$) {
my $ref_header2seq = $_[0] ; my $ref_sp_names = $_[1]; my $file = $_[2];
open (OUTFILE, ">$file.subdata") || die ("Error3 in opening out file: $file.subdata. $!\n");
foreach my $key (keys %$ref_header2seq) {
$key =~ m/^>([A-Z]+[0-9]+[A-Z+]).+$/o;
#print "$1\n";
my $header_sub = $1;
#print $header_sub, "\n";
#print $ref_sp_names, "\n";
if (exists $ref_sp_names -> {$header_sub}) {
my $seq = $ref_header2seq -> {$key};
print OUTFILE ">$key\n$seq\n";
}
}
close (OUTFILE);
return "42";
}
my $fasta_seqs = $ARGV[0]; my $sp_list = $ARGV[1];
my $ref_header2seq = getSequences ($fasta_seqs);
my $ref_sp_names = MakeSpList ($sp_list);
CompareSpList2Sequences ($ref_header2seq , $ref_sp_names, $fasta_seqs);
exit;
What i want to do is:
i have a fasta file with sequences:
YAL004W YAL004W SGDID:S000002136, Chr I from 140760-141407, Genome Release 64-2-1, Dubious ORF, "Dubious open reading frame; unlikely to encode a functional protein, based on available experimental and comparative sequence data; completely overlaps verified gene SSA1/YAL005C" ATGGGTGTCACCAGCGGTGGCCTTAACTTCAAAGATACCGTCTTCAATGGACAACAAAGAGACATCGAAAGTACCACCACCCAAGTCGAAAATCAAGACGTGTTCTTCCTTACCCTTCTTGTCCAAACCGTAAGCAATGGCAGCGGCGGTAGGTTCGTTAATAATACGCAAGACATTCAAACCAGCAATGGTACCAGCATCCTTGGTAGCTTGTCTTTGAGAATCGTTGAA
YAL005C SSA1 SGDID:S000000004, Chr I from 141431-139503, Genome Release 64-2-1, reverse complement, Verified ORF, "ATPase involved in protein folding and NLS-directed nuclear transport; member of HSP70 family; forms chaperone complex with Ydj1p; localized to nucleus, cytoplasm, and cell wall; 98% identical with paralog Ssa2p, but subtle differences between the two proteins provide functional specificity with respect to propagation of yeast [URE3] prions and vacuolar-mediated degradations of gluconeogenesis enzymes; general targeting factor of Hsp104p to prion fibrils" ATGTCAAAAGCTGTCGGTATTGATTTAGGTACAACATACTCGTGTGTTGCTCACTTTGCTAATGATCGTGTGGACATTATTGCCAACGATCAAGGTAACAGAACCACTCCATCTTTTGTCGCTTTCACTGACACTGAAAGATTGATTGGTGATGCTGCTAAGAATCAAGCTGCTATGAATCCTTCGAATACCGTTTTCGACGCTAAGCGTTTGATCGGTAGAAACTTCAAC
and i have another file with ID's:
YAL005C
YAL012W
I want to retrieve the sequences and the all header when match with ID's file.
but i get this error: don´t print anything!
Please can you help me?
Thanks in advance.
i already searched for other methods (and i can´t get the results either) but i really want to know about this error!
no bioperl please!
OK, so - line 45 is:
if (exists $ref_sp_names -> {$header_sub}) {
Your error is telling you that $header_sub is undefined. It's set by:
my $header_sub = $1;
Which follows:
$key =~ m/^(>[A-Z])\s.+$/o;
So - this means the regex isn't matching. I don't see any > in your sample data, so it can't match it. When the match fails, $1 is undefined, hence your error. What do you get out of your print $key statements?
I would also note - .+$ is most likely redundant. Likewise - the o flag - you probably don't want that either. http://perldoc.perl.org/perlre.html#Modifiers
have you tried using Bioperl? Here's some code to get you started.
#!/usr/bin/perl
use warnings;
use strict;
use Bio::SeqIO;
my $fasta = shift; #this will just push whatever in cli in.
my $seqio_obj = Bio::SeqIO->(-file => $fasta, -format => 'fasta');
while ( my $seq = $seqio_obj->next_seq){
print $seq->id . ' = ' . $seq->seq() . "\n";
#in here you can do your fasta handling with the seq obj
}
I have 3 or multiple files I need to merge, the data looks like this..
file 1
0334.45656
0334.45678
0335.67899
file 2
0334.89765
0335.12346
0335.56789
file 3
0334.12345
0335.45678
0335.98764
Expected output in file 4,
0334.89765
0334.89765
0334.89765
0334.12345
0335.67899
0335.12346
0335.56789
0335.45678
0335.98764
So far I have tried but data in 4rth file does not come in sorted order,
#!/usr/bin/perl
my %hash;
my $outFile = "outFile.txt";
foreach $file(#ARGV)
{
print "$file\n";
open (IN, "$file") || die "cannot open file $!";
open (OUT,">>$outFile") || die "cannot open file $!";
while ( <IN> )
{
chomp $_;
($timestamp,$data) = split (/\./,$_);
$hash{$timeStamp}{'data'}=$data;
if (defined $hash{$timeStamp})
{
print "$_\n";
print OUT"$_\n";
}
}
}
close (IN);
close (OUT);
I wouldn't normally suggest this, but unix utilties should be able to handle this just fine.
cat the 3 files together.
use sort to sort the merged file.
However, using perl, could just do the following:
#!/usr/bin/perl
use strict;
use warnings;
my #data;
push #data, $_ while (<>);
# Because the numbers are all equal length, alpha sort will work here
print for sort #data;
However, as we've discussed, it's possible that the files will be extremely large. Therefore it will be more efficient both in memory and speed if you're able to take advantage of the fact that all the files are already sorted.
The following solution therefore streams the files, pulling out the next one in order each loop of the while:
#!/usr/bin/perl
# Could name this catsort.pl
use strict;
use warnings;
use autodie;
# Initialize File handles
my #fhs = map {open my $fh, '<', $_; $fh} #ARGV;
# First Line of each file
my #data = map {scalar <$_>} #fhs;
# Loop while a next line exists
while (#data) {
# Pull out the next entry.
my $index = (sort {$data[$a] cmp $data[$b]} (0..$#data))[0];
print $data[$index];
# Fill In next Data at index.
if (! defined($data[$index] = readline $fhs[$index])) {
# End of that File
splice #fhs, $index, 1;
splice #data, $index, 1;
}
}
Using Miller's idea in a more reusable way,
use strict;
use warnings;
sub get_sort_iterator {
my #fhs = map {open my $fh, '<', $_ or die $!; $fh} #_;
my #d;
return sub {
for my $i (0 .. $#fhs) {
# skip to next file handle if it doesn't exists or we have value in $d[$i]
next if !$fhs[$i] or defined $d[$i];
# reading from $fhs[$i] file handle was success?
if ( defined($d[$i] = readline($fhs[$i])) ) { chomp($d[$i]) }
# file handle at EOF, not needed any more
else { undef $fhs[$i] }
}
# compare as numbers, return undef if no more data
my ($index) = sort {$d[$a] <=> $d[$b]} grep { defined $d[$_] } 0..$#d
or return;
# return value from $d[$index], and set it to undef
return delete $d[$index];
};
}
my $iter = get_sort_iterator(#ARGV);
while (defined(my $x = $iter->())) {
print "$x\n";
}
output
0334.12345
0334.45656
0334.45678
0334.89765
0335.12346
0335.45678
0335.56789
0335.67899
0335.98764
Suppose every input files are already in ascending order and have at least one line in them, this script could merge them in ascending order:
#!/usr/bin/perl
use warnings;
use strict;
use List::Util 'reduce';
sub min_index {
reduce { $_[$a] < $_[$b] ? $a : $b } 0 .. $#_;
}
my #fhs = map { open my $fh, '<', $_; $fh } #ARGV;
my #data = map { scalar <$_> } #fhs;
while (#data) {
my $idx = min_index(#data);
print "$data[$idx]";
if (! defined($data[$idx] = readline $fhs[$idx])) {
splice #data, $idx, 1;
splice #fhs, $idx, 1;
}
}
Note: this is basic the same as the second script offered by #Miller, but a bit clearer and more concise.
I suggest this solution, which uses a sorted array of hashes - each hash corresponding to an input file, and containing a file handle fh, the last line read line and the timestamp extracted from the line timestamp.
The hash at the end of the array always corresponds to the input that has the smallest value for the timestamp, so all that is necessary is to repeateedly pop the next value from the array, print its data, read the next line and (if it hasn't reached eof) insert it back into the array in sorted order.
This could produce an appreciable increase in speed over the repeated sorting of all the data for each output line that other answers use.
Note that the program expects the list of input files as parameters on the command line, and sends its merged output to STDOUT. It also assumes that the input files are already sorted.
use strict;
use warnings;
use autodie;
my #data;
for my $file (#ARGV) {
my $item;
open $item->{fh}, '<', $file;
insert_item($item, \#data);
}
while (#data) {
my $item = pop #data;
print $item->{line};
insert_item($item, \#data);
}
sub insert_item {
my ($item, $array) = #_;
return if eof $item->{fh};
$item->{line} = readline $item->{fh};
($item->{timestamp}) = $item->{line} =~ /^(\d+)/;
my $i = 0;
++$i while $i < #$array and $item->{timestamp} < $array->[$i]{timestamp};
splice #$array, $i, 0, $item;
}
output
0334.45656
0334.89765
0334.12345
0334.45678
0335.12346
0335.45678
0335.67899
0335.56789
0335.98764
#!/usr/bin/perl
use strict;
use Data::Dumper;
use warnings;
my #mdsum;
open (IN1,"$ARGV[0]") || die "counldn't open";
open (MYFILE, '>>md5sum-problem.txt');
open (IN2, "mdsumfile.txt");
my %knomexl=();
my %knomemdsum = ();
my #arrfile ;
my $tempkey ;
my $tempval ;
my #values ;
my $val;
my $i;
my #newarra;
my $testxl ;
my $testmdsum;
while(<IN1>){
next if /barcode/;
#arrfile = split('\t', $_);
$knomexl{$arrfile[0]} = $arrfile[2];
}
while(<IN2>){
chomp $_;
#newarra = split(/ {1,}/, $_);
$tempval = $newarra[0];
$tempkey = $newarra[1];
$tempkey=~ s/\t*$//g;
$tempval=~ s/\s*$//g;
$tempkey=~s/.tar.gz//g;
$knomemdsum{$tempkey} = $tempval;
}
#values = keys %knomexl;
foreach $i(#values){
$testxl = $knomexl{$values[$i]};
print $testxl."\n";
$testmdsum = $knomemdsum{$values[$i]};
print $testmdsum."\n";
if ( $testxl ne $testmdsum ) {
if ($testxl ne ""){
print MYFILE "Files hasving md5sum issue $i\n";
}
}
}
close (MYFILE);
I have two files one both having File name and Mdsum values and I need to check that which all file's md5sum values are not matching so I understand that in some case where Value and corresponding values will not be their and I want those cases only. Any work around on this code ? Please. This code is pretty simple but don't know why it's not working!! :( :(
#values = keys %knomexl;
foreach $i(#values){
#print Dumper $knomexl{$values[$i]};
$testxl = $knomexl{$i};
print $testxl."\n";
$testmdsum = $knomemdsum{$i};
print $testmdsum."\n";
$i is an element of #values because of the foreach, not an index, so you shouldn't use $values[$i].
Given a set of genes and existing pair of genes, I want to generate new pairs of genes which are not already existing.
The genes file has the following format :
123
134
23455
3242
3423
...
...
The genes pairs file has the following format :
12,345
134,23455
23455,343
3242,464452
3423,7655
...
...
But I still get few common elements between known_interactions and new_pairs. I'm not sure where the error is.
For the arguments,
perl generate_random_pairs.pl entrez_genes_file known_interactions_file 250000
I got a common elements of 15880. The number 250000 is to tell how many random pairs I want the program to generate.
#! usr/bin/perl
use strict;
use warnings;
if (#ARGV != 3) {
die "Usage: generate_random_pairs.pl <entrez_genes> <known_interactions> <number_of_interactions>\n";
}
my ($e_file, $k_file, $interactions) = #ARGV;
open (IN, $e_file) or die "Error!! Cannot open $e_file\n";
open (IN2, $k_file) or die "Error!! Cannot open $k_file\n";
my #e_file = <IN>; s/\s+\z// for #e_file;
my #k_file = <IN2>; s/\s+\z// for #k_file;
my (%known_interactions);
my %entrez_genes;
$entrez_genes{$_}++ foreach #e_file;
foreach my $line (#k_file) {
my #array = split (/,/, $line);
$known_interactions{$array[0]} = $array[1];
}
my $count = 0;
foreach my $key1 (keys %entrez_genes) {
foreach my $key2 (keys %entrez_genes) {
if ($key1 != $key2) {
if (exists $known_interactions{$key1} && ($known_interactions{$key1} == $key2)) {next;}
if (exists $known_interactions{$key2} && ($known_interactions{$key2} == $key1)) {next;}
if ($key1 < $key2) { print "$key1,$key2\n"; $count++; }
else { print "$key2,$key1\n"; $count++; }
}
if ($count == $interactions) {
die "$count\n";
}
}
}
I can see nothing wrong with your code. I wonder if you have some whitespace in your data - either after the comma or at the end of the line? It would be safer to extract just the digit fields with, for instance
my #e_file = map /\d+/g, <IN>;
Also, you would be better off keeping both elements of the pair as the hash key, so that you can just check the existence of the element. And if you make sure the lower number is always first you don't need to do two lookups.
This example should work for you. It doesn't address the random selection part of your requirement, but that wasn't in your own code and wasn't your immediate problem
use strict;
use warnings;
#ARGV = qw/ entrez_genes.txt known_interactions.txt 9 /;
if (#ARGV != 3) {
die "Usage: generate_random_pairs.pl <entrez_genes> <known_interactions> <number_of_interactions>\n";
}
my ($e_file, $k_file, $interactions) = #ARGV;
open my $fh, '<', $e_file or die "Error!! Cannot open $e_file: $!";
my #e_file = sort { $a <=> $b } map /\d+/g, <$fh>;
open $fh, '<', $k_file or die "Error!! Cannot open $k_file: $!";
my %known_interactions;
while (<$fh>) {
my $pair = join ',', sort { $a <=> $b } /\d+/g;
$known_interactions{$pair}++;
}
close $fh;
my $count = 0;
PAIR:
for my $i (0 .. $#e_file-1) {
for my $j ($i+1 .. $#e_file) {
my $pair = join ',', #e_file[$i, $j];
unless ($known_interactions{$pair}) {
print $pair, "\n";
last PAIR if ++$count >= $interactions;
}
}
}
print "\nTotal of $count interactions\n";
first of all, you are not chomping (removing newlines) from your file of known interactions. That means that given a file like:
1111,2222
you will build this hash:
$known_interactions{1111} = "2222\n";
That is probably why you are getting duplicate entries. My guess is (can't be sure without your actual input files) that these loops should work ok:
map{
chomp;
$entrez_genes{$_}++ ;
}#e_file;
and
map {
chomp;
my #array = sort(split (/,/));
$known_interactions{$array[0]} = $array[1];
}#k_file;
Also, as a general rule, I find my life is easier if I sort the interacting pair (the joys of bioinformatics :) ). That way I know that 111,222 and 222,111 will be treated in the same way and I can avoid multiple if statements like you have in your code.
Your next loop would then be (which IMHO is more readable):
my #genes=keys(%entrez_genes);
for (my $i=0; $i<=$#genes;$i++) {
for (my $k=$n; $k<=$#genes;$k++) {
next if $genes[$n] == $genes[$k];
my #pp=sort($genes[$n],$genes[$k]);
next unless exists $known_interactions{$pp[0]};
next if $known_interactions{$pp[0]} == $pp[1];
print "$pp[0], $pp[1]\n";
$count++;
die "$count\n" if $count == $interactions;
}
}