Analyzing a txt list in perl [closed] - perl

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 8 years ago.
Improve this question
I am trying to analyze a list of coordinates. The txt file is set up like this:
ID START END
A 10 20
B 15 17
C 20 40
How would I check this dataset to check if START and END is included within a user-specified region, e.g. START=10 END=15?
Any help greatly appreciated
// edit //
if(#AGRV != 4) {
print STDOUT "Searches genomic data for CNV within range. \n";
print STDOUT "CNV FILE FORMAT: <ID><CHR>BPS><BPE><AGE><etc...> \n";
print STDOUT "USAGE: [CNVLIST][CHR][BPS][BPE][OUTFILE] \n";
exit;
}
open(CNVLIST,"<$ARGV[0]");
open(OUTFILE,">$ARGV[3]");
$BPS = $ARGV[1];
$BPE = $ARGV[2];
#put CNV file in hash table
$line = <CNVFILE>;
while($line = <CNVFILE>) {
chomp $line;
($Cchr,$CS,$CE,$CID) = split(/\t/,$line);
}
I need to look through each line and find if the start/end lies within the user specified range.

it is unclear whether you can assume that the ID will never overlap with each other, but assuming it won't , you can use hash to store the lines that are within the range. If it's possible that the ID might overlap, I think you can push #{$result{id}}, [$start, $end]; but that'll make the data structure a bit more complicated.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $in_file = "input.txt";
# User-specified region
my $range_start = 10;
my $range_end = 15;
open my $fh, $in_file or die $!;
my %result;
while (<$fh>) {
my ($id, $start, $end) = split " ", $_;
next unless $start =~ /\d/;
# Swap if START is larger than END
($start, $end) = ($end, $start) if $start > $end;
$result{$id} = [$start, $end]
if $start >= $range_start and $end <= $range_end;
}
print Dumper(%result);

You can split() each line and check second and third field:
#!/usr/bin/env perl
use strict;
use warnings;
my ($start, $end) = (shift, shift);
die if $start > $end;
## Skip header
<>;
while ( <> ) {
chomp;
my #f = split;
if ( $f[1] <= $start && $f[2] >= $end ) {
printf qq|%s\n|, $_;
}
}
It accepts three arguments, the first one is the start region, the second one is the end region and the last one is the file to process. It prints to output all lines that pass the condition.
Run it like:
perl script.pl 10 15 infile
That yields:
A 10 20

Related

How do I read strings into a hash in Perl [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 6 years ago.
Improve this question
I have a file with a series of random A's, G's, C's and T's in them that look like this:
>Mary
ACGTACGTACGTAC
>Jane
CCCGGCCCCTA
>Arthur
AAAAAAAAAAT
I took those letters and concatinated them to end up with ACGTACGTACGTACCCCGGCCCCTAAAAAAAAAAT. I now have a series of positions within that concatenated sequence that are of interest to me, and I want to find the associated Names that match with those positions (coordinates). I'm using the Perl function length to calculate the legnth of each sequence, and then associate the culmulative length with the name in a hash.
So far I have:
#! /usr/bin/perl -w
use strict;
my $seq_input = $ARGV[0];
my $coord_input = $ARGV[1];
my %idSeq; #Stores sequence and associated ID's.
open (my $INPUT, "<$seq_input") or die "unable to open $seq_input";
open (my $COORD, "<$coord_input") or die "unable to open $fcoord_input";
while (<$INPUT>) {
if ($_ = /^[AGCT/) {
$idSeq{$_
my $id = ( /^[>]/)
#put information into a hash
#loop through hash looking for coordinates that are lower than the culmulative length
foreach $id
$totallength = $totallength + length($seq)
$lengthId{$totalLength} = $id
foreach $position
foreach $length
if ($length >= $position) { print; last }
close $fasta_input;
close $coord_input;
print "Done!\n";
So far I'm having trouble reading the file into a hash. Also would I need an array to print the hash?
Not completely clear on what you want; maybe this:
my $seq;
my %idSeq;
while ( my $line = <$INPUT> ) {
if ( my ($name) = $line =~ /^>(.*)/ ) {
$idSeq{$name} = length $seq || 0;
}
else {
chomp $line;
$seq .= $line;
}
}
which produces:
$seq = 'ACGTACGTACGTACCCCGGCCCCTAAAAAAAAAAAT';
%idSeq = (
'Mary' => 0,
'Jane' => 14,
'Arthur' => 25,
);

Difficulty in finding error in perl script

Its a kind of bioinformatics concept but programmatic problem. I've tried a lot and at last I came here. I've reads like following.
ATGGAAG
TGGAAGT
GGAAGTC
GAAGTCG
AAGTCGC
AGTCGCG
GTCGCGG
TCGCGGA
CGCGGAA
GCGGAAT
CGGAATC
Now what I want to do is, in a simplistic way,
take the last 6 residues of first read -> check if any other read is starting with those 6 residues, if yes add the last residue of that read to the first read -> again same with the 2nd read and so on.
Here is the code what I've tried so far.
#!/usr/bin/perl -w
use strict;
use warnings;
my $in = $ARGV[0];
open(IN, $in);
my #short_reads = <IN>;
my $first_read = $short_reads[0];
chomp $first_read;
my #all_end_res;
for(my $i=0; $i<=$#short_reads; $i++){
chomp $short_reads[$i];
my $end_of_kmers = substr($short_reads[$i], -6);
if($short_reads[$i+1] =~ /^$end_of_kmers/){
my $end_res = substr($short_reads[$i], -1);
push(#all_end_res, $end_res);
}
}
my $end_res2 = join('', #all_end_res);
print $first_read.$end_res2,"\n\n";
At the end I should get an output like ATGGAAGTCGCGGAATC but I'm getting ATGGAAGGTCGCGGAAT. The error must be in if, any help is greatly appreciated.
There are three huge problems in IT.
Naming of things.
Off by one error.
And you just hit the second one. The problem is in a way you think about this task. You think in way I have this one string and if next one overlap I will add this one character to result. But correct way to think in this case I have this one string and if it overlaps with previous string or what I read so far, I will add one character or characters which are next.
#!/usr/bin/env perl
use strict;
use warnings;
use constant LENGTH => 6;
my $read = <>;
chomp $read;
while (<>) {
chomp;
last unless length > LENGTH;
if ( substr( $read, -LENGTH() ) eq substr( $_, 0, LENGTH ) ) {
$read .= substr( $_, LENGTH );
}
else {last}
}
print $read, "\n";
I didn't get this ARGV[0] thing. It is useless and inflexible.
$ chmod +x code.pl
$ cat data
ATGGAAG
TGGAAGT
GGAAGTC
GAAGTCG
AAGTCGC
AGTCGCG
GTCGCGG
TCGCGGA
CGCGGAA
GCGGAAT
CGGAATC
$ ./code.pl data
ATGGAAGTCGCGGAATC
But you have not defined what should happen if data doesn't overlap. Should there be some recovery or error? You can be also more strict
last unless length == LENGTH + 1;
Edit:
If you like working with an array you should try avoid using for(;;). It is prone to errors. (BTW for (my $i = 0; $i < #a; $i++) is more idiomatic.)
my #short_reads = <>;
chomp #short_reads;
my #all_end_res;
for my $i (1 .. $#short_reads) {
my $prev_read = $short_reads[$i-1];
my $curr_read = $short_reads[$i+1];
my $end_of_kmers = substr($prev_read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
push #all_end_res, $1;
}
}
print $short_reads[0], join('', #all_end_res), "\n";
The performance and memory difference is negligible up to thousands of lines. Now you can ask why to accumulate characters into an array instead of accumulate it to string.
my #short_reads = <>;
chomp #short_reads;
my $read = $short_reads[0];
for my $i (1 .. $#short_reads) {
my $prev_read = $short_reads[$i-1];
my $curr_read = $short_reads[$i+1];
my $end_of_kmers = substr($prev_read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
$read .= $1;
}
}
print "$read\n";
Now the question is why to use $prev_read when you have $end_of_kmers inside of $read.
my #short_reads = <>;
chomp #short_reads;
my $read = $short_reads[0];
for my $i (1 .. $#short_reads) {
my $curr_read = $short_reads[$i+1];
my $end_of_kmers = substr($read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
$read .= $1;
}
}
print "$read\n";
Now you can ask why I need indexes at all. You just should remove the first line to work with the rest of array.
my #short_reads = <>;
chomp #short_reads;
my $read = shift #short_reads;
for my $curr_read (#short_reads) {
my $end_of_kmers = substr($read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
$read .= $1;
}
}
print "$read\n";
And with few more steps and tweaks you will end up with the code what I posted initially. I don't need an array at all because I look only to the current line and accumulator. The difference is in a way how you think about the problem. If you think in terms of arrays and indexes and looping or in terms of data flow, data processing and state/accumulator. With more experience, you don't have to do all those steps and make the final solution just due different approach to problem solving.
Edit2:
It is almost ten times faster using substr and eq then using regular expressions.
$ time ./code.pl data.out > data.test
real 0m0.480s
user 0m0.468s
sys 0m0.008s
$ time ./code2.pl data.out > data2.test
real 0m4.520s
user 0m4.516s
sys 0m0.000s
$ cmp data.test data2.test && echo OK
OK
$ wc -c data.out data.test
6717368 data.out
839678 data.test
with minor modification:
use warnings;
use strict;
open my $in, '<', $ARGV[0] or die $!;
chomp(my #short_reads = <$in>);
my $first_read = $short_reads[0];
my #all_end_res;
for(my $i=0; $i<=$#short_reads; $i++){
chomp $short_reads[$i];
my $end_of_kmers = substr($short_reads[$i], -6);
my ($next_read) = $short_reads[$i+1];
if( (defined $next_read) and ($next_read =~ /^\Q$end_of_kmers/)){
my $end_res = substr($next_read, -1);
push(#all_end_res, $end_res);
}
}
my $end_res2 = join('', #all_end_res);
print $first_read.$end_res2,"\n";
ATGGAAGTCGCGGAATC

How to match numbers that lie outside the range [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 8 years ago.
Improve this question
I am trying to print values of range from #arr3 which are same and lie outside the range from #arr4 (not included in range of arr4) but I am not getting the desired output. Please suggest me the modifications in the following code to get the output as 1,2,8,13 (without repeating the values if any)
File 1: result
1..5
5..10
10..15
File 2: annotation
3..7
9..12
14..17
Code:
#!/usr/bin/perl
open($inp1, "<result") or die "not found";
open($inp2, "<annotation") or die "not found";
my #arr3 = <$inp1>;
my #arr4 = <$inp2>;
foreach my $line1 (#arr4) {
foreach my $line2 (#arr3) {
my ($from1, $to1) = split(/\.\./, $line1);
my ($from2, $to2) = split(/\.\./, $line2);
for (my $i = $from1 ; $i <= $to1 ; $i++) {
for (my $j = $from2 ; $j <= $to2 ; $j++) {
$res = grep(/$i/, #result); #to avoid repetition
if ($i != $j && $res == 0) {
print "$i \n";
push(#result, $i);
}
}
}
}
}
Try this:
#!/usr/bin/perl
use strict;
open (my $inp1,"<result.txt") or die "not found";
open (my $inp2,"<annotation.txt") or die "not found";
my #result;
my #annotation;
foreach my $line2 (<$inp2>) {
my ($from2,$to2)=split(/\.\./,$line2);
#annotation = (#annotation, $from2..$to2);
}
print join(",",#annotation),"\n";
my %in_range = map {$_=> 1} #annotation;
foreach my $line1 (<$inp1>) {
my ($from1,$to1)=split(/\.\./,$line1);
#result = (#result, $from1..$to1);
}
print join(",",#result),"\n";
my %tmp_hash = map {$_=> 1} #result;
my #unique = sort {$a <=> $b} keys %tmp_hash;
print join(",",#unique),"\n";
my #out_of_range = grep {!$in_range{$_}} #unique;
print join(",",#out_of_range),"\n";
The print statements are temporary, of course, to help show what's going on when you run this. The basic idea is you use one hash to eliminate duplicate numbers in your "result", and another hash to indicate which ones are in the "annotations".
If you used pattern-matching rather than split then I think it would be a little easier to make this ignore extra lines of input that aren't ranges of numbers, in case you ever have input files with a few "extra" lines that you need to skip over.
If the contents of the files is under your control, you can make use of eval for parsing them. On the other hand, if there might be something else than what you specified, the following is dangerous to use.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
use Data::Dumper;
open my $inc, '<', 'result';
open my $exc, '<', 'annotation';
my (%include, %exclude, #result);
while (<$inc>) { $include{$_} = 1 for eval $_ }
while (<$exc>) { $exclude{$_} = 1 for eval $_ }
for (sort {$a <=> $b} keys %include) {
push #result, $_ unless $exclude{$_}
}
print Dumper \#result;
Returns:
$VAR1 = [ 1, 2, 8, 13 ];
The only major tool you need is a %seen style hash as represented in perlfaq4 - How can I remove duplicate elements from a list or array?
The following opens filehandles to string references, but obviously these could be replaced with the appropriate file names:
use strict;
use warnings;
use autodie;
my %seen;
open my $fh_fltr, '<', \ "3..7\n9..12\n14..17\n";
while (<$fh_fltr>) {
my ($from, $to) = split /\.{2}/;
$seen{$_}++ for $from .. $to;
}
my #result;
open my $fh_src, '<', \ "1..5\n5..10\n10..15\n";
while (<$fh_src>) {
my ($from, $to) = split /\.{2}/;
push #result, $_ for grep {!$seen{$_}++} $from .. $to;
}
print "#result\n";
Outputs:
1 2 8 13

script conversion from Perl to shell [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
Improve this question
following is the code in perl.
Can we write the same thing in shell scripts ??
If yes how ?
I have used associative arrays but unable to achieve what this is doing
open MYFILE, "<", "$ARGV[0]" or die "Can't open $ARGV[0] file \n";
############ to retieve the info and put them in associative arrray ##############
$line = <MYFILE>;
#line1 = split(/,/ , $line);
$length = #line1;
$count = 0;
while($count < $length)
{
$line1[$count] =~ s/^\"//;
$line1[$count] =~ s/\"$//;
$count++;
}
$line = <MYFILE>;
#line2 = split(/,/ , $line);
$length = #line2;
$count = 0;
while($count < $length)
{
$line2[$count] =~ s/^\"//;
$line2[$count] =~ s/\"$//;
$count++;
}
$count = 0;
while($count < $length)
{
$array{$line1[$count]}=$line2[$count];
$count++;
}
Of course you can translate that to a shell script: Just wrap the perl script in a here-doc, pass it to perl, and put #!/bin/sh at the top…
#!/bin/sh
perl - <<'END' $1
...
END
But more seriously, you might achieve enlightenment by rewriting the code in a different fashion. What you are doing is reading a line, splitting it at commata, and removing quotation marks at the beginning and end of each field:
sub get_fields {
map { s/^"//; s/"$//; $_ } split /,/, $_[0];
}
my #keys = get_fields scalar <>; # 1st line
my #vals = get_fields scalar <>; # 2nd line
my %hash;
#hash{ #line1 } = #line2;
Except for the slice operation at the end, you can now more easily rewrite the code because it uses data flow instead of structured programming as the predominant paradigm. Not to mention that my code is shorter by an order of magnitude (in base 3).
If you are writing code for production purposes, don't do this. It will break. I assume you are processing CSV. Stick with Perl, and use Text::CSV. Then:
use strict; use warnings; use autodie;
use Text::CSV;
my $csv = Text::CSV->new({ binary => 1 });
open my $fh, "<:utf8", $ARGV[0];
my $keys = $csv->getline($fh);
my $vals = $csv->getline($fh);
my %hash;
#hash{#$keys} = #$vals;
It isn't even much longer, but very unlikely to break (It doesn't split on commas inside quotes).

Using Perl hashes to handle tab-delimited files

I have two files:
file_1 has three columns (Marker(SNP), Chromosome, and position)
file_2 has three columns (Chromosome, peak_start, and peak_end).
All columns are numeric except for the SNP column.
The files are arranged as shown in the screenshots. file_1 has several hundred SNPs as rows while file_2 has 61 peaks. Each peak is marked by a peak_start and peak_end. There can be any of the 23 chromosomes in either file and file_2 has several peaks per chromosome.
I want to find if the position of the SNP in file_1 falls within the peak_start and peak_end in file_2 for each matching chromosome. If it does, I want to show which SNP falls in which peak (preferably write output to a tab-delimited file).
I would prefer to split the file, and use hashes where the chromosome is the key. I have found only a few questions remotely similar to this, but I could not understand well the suggested solutions.
Here is the example of my code. It is only meant to illustrate my question and so far doesn't do anything so think of it as "pseudocode".
#!usr/bin/perl
use strict;
use warnings;
my (%peaks, %X81_05);
my #array;
# Open file or die
unless (open (FIRST_SAMPLE, "X81_05.txt")) {
die "Could not open X81_05.txt";
}
# Split the tab-delimited file into respective fields
while (<FIRST_SAMPLE>) {
chomp $_;
next if (m/Chromosome/); # Skip the header
#array = split("\t", $_);
($chr1, $pos, $sample) = #array;
$X81_05{'$array[0]'} = (
'position' =>'$array[1]'
)
}
close (FIRST_SAMPLE);
# Open file using file handle
unless (open (PEAKS, "peaks.txt")) {
die "could not open peaks.txt";
}
my ($chr, $peak_start, $peak_end);
while (<PEAKS>) {
chomp $_;
next if (m/Chromosome/); # Skip header
($chr, $peak_start, $peak_end) = split(/\t/);
$peaks{$chr}{'peak_start'} = $peak_start;
$peaks{$chr}{'peak_end'} = $peak_end;
}
close (PEAKS);
for my $chr1 (keys %X81_05) {
my $val = $X81_05{$chr1}{'position'};
for my $chr (keys %peaks) {
my $min = $peaks{$chr}{'peak_start'};
my $max = $peaks{$chr}{'peak_end'};
if (($val > $min) and ($val < $max)) {
#print $val, " ", "lies between"," ", $min, " ", "and", " ", $max, "\n";
}
else {
#print $val, " ", "does not lie between"," ", $min, " ", "and", " ", $max, "\n";
}
}
}
More awesome code:
http://i.stack.imgur.com/fzwRQ.png
http://i.stack.imgur.com/2ryyI.png
A couple of program hints in Perl:
You can do this:
open (PEAKS, "peaks.txt")
or die "Couldn't open peaks.txt";
Instead of this:
unless (open (PEAKS, "peaks.txt")) {
die "could not open peaks.txt";
}
It's more standard Perl, and it's a bit easier to read.
Talking about Standard Perl, you should use the 3 argument open form, and use scalars for file handles:
open (my $peaks_fh, "<", "peaks.txt")
or die "Couldn't open peaks.txt";
This way, if your file's name just happens to start with a | or >, it will still work. Using scalars variables (variables that start with a $) makes it easier to pass file handles between functions.
Anyway, just to make sure I understand you correctly: You said "I would prefer ... use hashes where the chromosome is the key."
Now, I have 23 pairs of chromosomes, but each of those chromosomes might have thousands of SNPs on it. If you key by chromosome this way, you can only store a single SNP per chromosome. Is this what you want? I notice your data is showing all the same chromosome. That means you can't key by chromosome. I'm ignoring that for now, and using my own data.
I've also noticed a difference in what you said the files contained, and how your program uses them:
You said: "file 1 has 3 columns (SNP, Chromosome, and position)" , yet your code is:
($chr1, $pos, $sample) = #array;
Which I assume is Chromosome, Position, and SNP. Which way is the file arranged?
You've got to clarify exactly what you're asking for.
Anyway, here's the tested version that prints out in tab delimited format. This is in a bit more modern Perl format. Notice that I only have a single hash by chromosome (as you specified). I read the peaks.txt in first. If I find in my position file a chromosome that doesn't exist in my peaks.txt file, I simply ignore it. Otherwise, I'll add in the additional hashes for POSITION and SNP:
I do a final loop that prints everything out (tab delimitated) as you specified, but you didn't specify a format. Change it if you have to.
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie; #No need to check for file open failure
use constant {
PEAKS_FILE => "peak.txt",
POSITION_FILE => "X81_05.txt",
};
open ( my $peak_fh, "<", PEAKS_FILE );
my %chromosome_hash;
while ( my $line = <$peak_fh> ) {
chomp $line;
next if $line =~ /Chromosome/; #Skip Header
my ( $chromosome, $peak_start, $peak_end ) = split ( "\t", $line );
$chromosome_hash{$chromosome}->{PEAK_START} = $peak_start;
$chromosome_hash{$chromosome}->{PEAK_END} = $peak_end;
}
close $peak_fh;
open ( my $position_fh, "<", POSITION_FILE );
while ( my $line = <$position_fh> ) {
chomp $line;
my ( $chromosome, $position, $snp ) = split ( "\t", $line );
next unless exists $chromosome_hash{$chromosome};
if ( $position >= $chromosome_hash{$chromosome}->{PEAK_START}
and $position <= $chromosome_hash{$chromosome}->{PEAK_END} ) {
$chromosome_hash{$chromosome}->{SNP} = $snp;
$chromosome_hash{$chromosome}->{POSITION} = $position;
}
}
close $position_fh;
#
# Now Print
#
say join ("\t", qw(Chromosome, SNP, POSITION, PEAK-START, PEAK-END) );
foreach my $chromosome ( sort keys %chromosome_hash ) {
next unless exists $chromosome_hash{$chromosome}->{SNP};
say join ("\t",
$chromosome,
$chromosome_hash{$chromosome}->{SNP},
$chromosome_hash{$chromosome}->{POSITION},
$chromosome_hash{$chromosome}->{PEAK_START},
$chromosome_hash{$chromosome}->{PEAK_END},
);
}
A few things:
Leave spaces around parentheses on both sides. It makes it easier to read.
I use parentheses when others don't. The current style is not to use them unless you have to. I tend to use them for all functions that take more than a single argument. For example, I could have said open my $peak_fh, "<", PEAKS_FILE;, but I think parameters start to get lost when you have three parameters on a function.
Notice I use use autodie;. This causes the program to quit if it can't open a file. That's why I don't even have to test whether or not the file opened.
I would have preferred to use object oriented Perl to hide the structure of the hash of hashes. This prevents errors such as thinking that the start peek is stored in START_PEEK rather than PEAK_START. Perl won't detect these type of miskeyed errors. Therefore, I prefer to use objects whenever I am doing arrays of arrays or hashes of hashes.
You only need one for loop because you are expecting to find some of the SNPs in the second lot. Hence, loop through your %X81_05 hash and check if any matches one in %peak. Something like:
for my $chr1 (keys %X81_05)
{
if (defined $peaks{$chr1})
{
if ( $X81_05{$chr1}{'position'} > $peaks{$chr1}{'peak_start'}
&& $X81_05{$chr1}{'position'} < $peaks{$chr1}{'peak_end'})
{
print YOUROUTPUTFILEHANDLE $chr1 . "\t"
. $peaks{$chr1}{'peak_start'} . "\t"
. $peaks{$chr1}{'peak_end'};
}
else
{
print YOUROUTPUTFILEHANDLE $chr1
. "\tDoes not fall between "
. $peaks{$chr1}{'peak_start'} . " and "
. $peaks{$chr1}{'peak_end'};
}
}
}
Note: I Have not tested the code.
Looking at the screenshots that you have added, this is not going to work.
The points raised by #David are good; try to incorporate those in your programs. (I have borrowed most of the code from #David's post.)
One thing I didn't understand is that why load both peak values and position in hash, as loading one would suffice. As each chromosome has more than one record, use HoA. My solution is based on that. You might need to change the cols and their positions.
use strict;
use warnings;
our $Sep = "\t";
open (my $peak_fh, "<", "data/file2");
my %chromosome_hash;
while (my $line = <$peak_fh>) {
chomp $line;
next if $line =~ /Chromosome/; #Skip Header
my ($chromosome) = (split($Sep, $line))[0];
push #{$chromosome_hash{$chromosome}}, $line; # Store the line(s) indexed by chromo
}
close $peak_fh;
open (my $position_fh, "<", "data/file1");
while (my $line = <$position_fh>) {
chomp $line;
my ($chromosome, $snp, $position) = split ($Sep, $line);
next unless exists $chromosome_hash{$chromosome};
foreach my $peak_line (#{$chromosome_hash{$chromosome}}) {
my ($start,$end) = (split($Sep, $line))[1,2];
if ($position >= $start and $position <= $end) {
print "MATCH REQUIRED-DETAILS...$line-$peak_line\n";
}
else {
print "NO MATCH REQUIRED-DETAILS...$line-$peak_line\n";
}
}
}
close $position_fh;
I used #tuxuday and #David's code to solve this problem. Here is the final code that did what I wanted. I have not only learned a lot, but I have been able to solve my problem successfully! Kudos guys!
use strict;
use warnings;
use feature qw(say);
# Read in peaks and sample files from command line
my $usage = "Usage: $0 <peaks_file> <sample_file>";
my $peaks = shift #ARGV or die "$usage \n";
my $sample = shift #ARGV or die "$usage \n";
our $Sep = "\t";
open (my $peak_fh, "<", "$peaks");
my %chromosome_hash;
while (my $line = <$peak_fh>) {
chomp $line;
next if $line =~ /Chromosome/; #Skip Header
my ($chromosome) = (split($Sep, $line))[0];
push #{$chromosome_hash{$chromosome}}, $line; # Store the line(s) indexed by chromosome
}
close $peak_fh;
open (my $position_fh, "<", "$sample");
while (my $line = <$position_fh>) {
chomp $line;
next if $line =~ /Marker/; #Skip Header
my ($snp, $chromosome, $position) = split ($Sep, $line);
# Check if chromosome in peaks_file matches chromosome in sample_file
next unless exists $chromosome_hash{$chromosome};
foreach my $peak_line (#{$chromosome_hash{$chromosome}}) {
my ($start,$end,$peak_no) = (split( $Sep, $peak_line ))[1,2,3];
if ( $position >= $start and $position <= $end) {
# Print output
say join ("\t",
$snp,
$chromosome,
$position,
$start,
$end,
$peak_no,
);
}
else {
next; # Go to next chromosome
}
}
}
close $position_fh;