How do I read strings into a hash in Perl [closed] - perl

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 6 years ago.
Improve this question
I have a file with a series of random A's, G's, C's and T's in them that look like this:
>Mary
ACGTACGTACGTAC
>Jane
CCCGGCCCCTA
>Arthur
AAAAAAAAAAT
I took those letters and concatinated them to end up with ACGTACGTACGTACCCCGGCCCCTAAAAAAAAAAT. I now have a series of positions within that concatenated sequence that are of interest to me, and I want to find the associated Names that match with those positions (coordinates). I'm using the Perl function length to calculate the legnth of each sequence, and then associate the culmulative length with the name in a hash.
So far I have:
#! /usr/bin/perl -w
use strict;
my $seq_input = $ARGV[0];
my $coord_input = $ARGV[1];
my %idSeq; #Stores sequence and associated ID's.
open (my $INPUT, "<$seq_input") or die "unable to open $seq_input";
open (my $COORD, "<$coord_input") or die "unable to open $fcoord_input";
while (<$INPUT>) {
if ($_ = /^[AGCT/) {
$idSeq{$_
my $id = ( /^[>]/)
#put information into a hash
#loop through hash looking for coordinates that are lower than the culmulative length
foreach $id
$totallength = $totallength + length($seq)
$lengthId{$totalLength} = $id
foreach $position
foreach $length
if ($length >= $position) { print; last }
close $fasta_input;
close $coord_input;
print "Done!\n";
So far I'm having trouble reading the file into a hash. Also would I need an array to print the hash?

Not completely clear on what you want; maybe this:
my $seq;
my %idSeq;
while ( my $line = <$INPUT> ) {
if ( my ($name) = $line =~ /^>(.*)/ ) {
$idSeq{$name} = length $seq || 0;
}
else {
chomp $line;
$seq .= $line;
}
}
which produces:
$seq = 'ACGTACGTACGTACCCCGGCCCCTAAAAAAAAAAAT';
%idSeq = (
'Mary' => 0,
'Jane' => 14,
'Arthur' => 25,
);

Related

How to match numbers that lie outside the range [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 8 years ago.
Improve this question
I am trying to print values of range from #arr3 which are same and lie outside the range from #arr4 (not included in range of arr4) but I am not getting the desired output. Please suggest me the modifications in the following code to get the output as 1,2,8,13 (without repeating the values if any)
File 1: result
1..5
5..10
10..15
File 2: annotation
3..7
9..12
14..17
Code:
#!/usr/bin/perl
open($inp1, "<result") or die "not found";
open($inp2, "<annotation") or die "not found";
my #arr3 = <$inp1>;
my #arr4 = <$inp2>;
foreach my $line1 (#arr4) {
foreach my $line2 (#arr3) {
my ($from1, $to1) = split(/\.\./, $line1);
my ($from2, $to2) = split(/\.\./, $line2);
for (my $i = $from1 ; $i <= $to1 ; $i++) {
for (my $j = $from2 ; $j <= $to2 ; $j++) {
$res = grep(/$i/, #result); #to avoid repetition
if ($i != $j && $res == 0) {
print "$i \n";
push(#result, $i);
}
}
}
}
}
Try this:
#!/usr/bin/perl
use strict;
open (my $inp1,"<result.txt") or die "not found";
open (my $inp2,"<annotation.txt") or die "not found";
my #result;
my #annotation;
foreach my $line2 (<$inp2>) {
my ($from2,$to2)=split(/\.\./,$line2);
#annotation = (#annotation, $from2..$to2);
}
print join(",",#annotation),"\n";
my %in_range = map {$_=> 1} #annotation;
foreach my $line1 (<$inp1>) {
my ($from1,$to1)=split(/\.\./,$line1);
#result = (#result, $from1..$to1);
}
print join(",",#result),"\n";
my %tmp_hash = map {$_=> 1} #result;
my #unique = sort {$a <=> $b} keys %tmp_hash;
print join(",",#unique),"\n";
my #out_of_range = grep {!$in_range{$_}} #unique;
print join(",",#out_of_range),"\n";
The print statements are temporary, of course, to help show what's going on when you run this. The basic idea is you use one hash to eliminate duplicate numbers in your "result", and another hash to indicate which ones are in the "annotations".
If you used pattern-matching rather than split then I think it would be a little easier to make this ignore extra lines of input that aren't ranges of numbers, in case you ever have input files with a few "extra" lines that you need to skip over.
If the contents of the files is under your control, you can make use of eval for parsing them. On the other hand, if there might be something else than what you specified, the following is dangerous to use.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
use Data::Dumper;
open my $inc, '<', 'result';
open my $exc, '<', 'annotation';
my (%include, %exclude, #result);
while (<$inc>) { $include{$_} = 1 for eval $_ }
while (<$exc>) { $exclude{$_} = 1 for eval $_ }
for (sort {$a <=> $b} keys %include) {
push #result, $_ unless $exclude{$_}
}
print Dumper \#result;
Returns:
$VAR1 = [ 1, 2, 8, 13 ];
The only major tool you need is a %seen style hash as represented in perlfaq4 - How can I remove duplicate elements from a list or array?
The following opens filehandles to string references, but obviously these could be replaced with the appropriate file names:
use strict;
use warnings;
use autodie;
my %seen;
open my $fh_fltr, '<', \ "3..7\n9..12\n14..17\n";
while (<$fh_fltr>) {
my ($from, $to) = split /\.{2}/;
$seen{$_}++ for $from .. $to;
}
my #result;
open my $fh_src, '<', \ "1..5\n5..10\n10..15\n";
while (<$fh_src>) {
my ($from, $to) = split /\.{2}/;
push #result, $_ for grep {!$seen{$_}++} $from .. $to;
}
print "#result\n";
Outputs:
1 2 8 13

Analyzing a txt list in perl [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 8 years ago.
Improve this question
I am trying to analyze a list of coordinates. The txt file is set up like this:
ID START END
A 10 20
B 15 17
C 20 40
How would I check this dataset to check if START and END is included within a user-specified region, e.g. START=10 END=15?
Any help greatly appreciated
// edit //
if(#AGRV != 4) {
print STDOUT "Searches genomic data for CNV within range. \n";
print STDOUT "CNV FILE FORMAT: <ID><CHR>BPS><BPE><AGE><etc...> \n";
print STDOUT "USAGE: [CNVLIST][CHR][BPS][BPE][OUTFILE] \n";
exit;
}
open(CNVLIST,"<$ARGV[0]");
open(OUTFILE,">$ARGV[3]");
$BPS = $ARGV[1];
$BPE = $ARGV[2];
#put CNV file in hash table
$line = <CNVFILE>;
while($line = <CNVFILE>) {
chomp $line;
($Cchr,$CS,$CE,$CID) = split(/\t/,$line);
}
I need to look through each line and find if the start/end lies within the user specified range.
it is unclear whether you can assume that the ID will never overlap with each other, but assuming it won't , you can use hash to store the lines that are within the range. If it's possible that the ID might overlap, I think you can push #{$result{id}}, [$start, $end]; but that'll make the data structure a bit more complicated.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $in_file = "input.txt";
# User-specified region
my $range_start = 10;
my $range_end = 15;
open my $fh, $in_file or die $!;
my %result;
while (<$fh>) {
my ($id, $start, $end) = split " ", $_;
next unless $start =~ /\d/;
# Swap if START is larger than END
($start, $end) = ($end, $start) if $start > $end;
$result{$id} = [$start, $end]
if $start >= $range_start and $end <= $range_end;
}
print Dumper(%result);
You can split() each line and check second and third field:
#!/usr/bin/env perl
use strict;
use warnings;
my ($start, $end) = (shift, shift);
die if $start > $end;
## Skip header
<>;
while ( <> ) {
chomp;
my #f = split;
if ( $f[1] <= $start && $f[2] >= $end ) {
printf qq|%s\n|, $_;
}
}
It accepts three arguments, the first one is the start region, the second one is the end region and the last one is the file to process. It prints to output all lines that pass the condition.
Run it like:
perl script.pl 10 15 infile
That yields:
A 10 20

script conversion from Perl to shell [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
Improve this question
following is the code in perl.
Can we write the same thing in shell scripts ??
If yes how ?
I have used associative arrays but unable to achieve what this is doing
open MYFILE, "<", "$ARGV[0]" or die "Can't open $ARGV[0] file \n";
############ to retieve the info and put them in associative arrray ##############
$line = <MYFILE>;
#line1 = split(/,/ , $line);
$length = #line1;
$count = 0;
while($count < $length)
{
$line1[$count] =~ s/^\"//;
$line1[$count] =~ s/\"$//;
$count++;
}
$line = <MYFILE>;
#line2 = split(/,/ , $line);
$length = #line2;
$count = 0;
while($count < $length)
{
$line2[$count] =~ s/^\"//;
$line2[$count] =~ s/\"$//;
$count++;
}
$count = 0;
while($count < $length)
{
$array{$line1[$count]}=$line2[$count];
$count++;
}
Of course you can translate that to a shell script: Just wrap the perl script in a here-doc, pass it to perl, and put #!/bin/sh at the top…
#!/bin/sh
perl - <<'END' $1
...
END
But more seriously, you might achieve enlightenment by rewriting the code in a different fashion. What you are doing is reading a line, splitting it at commata, and removing quotation marks at the beginning and end of each field:
sub get_fields {
map { s/^"//; s/"$//; $_ } split /,/, $_[0];
}
my #keys = get_fields scalar <>; # 1st line
my #vals = get_fields scalar <>; # 2nd line
my %hash;
#hash{ #line1 } = #line2;
Except for the slice operation at the end, you can now more easily rewrite the code because it uses data flow instead of structured programming as the predominant paradigm. Not to mention that my code is shorter by an order of magnitude (in base 3).
If you are writing code for production purposes, don't do this. It will break. I assume you are processing CSV. Stick with Perl, and use Text::CSV. Then:
use strict; use warnings; use autodie;
use Text::CSV;
my $csv = Text::CSV->new({ binary => 1 });
open my $fh, "<:utf8", $ARGV[0];
my $keys = $csv->getline($fh);
my $vals = $csv->getline($fh);
my %hash;
#hash{#$keys} = #$vals;
It isn't even much longer, but very unlikely to break (It doesn't split on commas inside quotes).

Using Perl hashes to handle tab-delimited files

I have two files:
file_1 has three columns (Marker(SNP), Chromosome, and position)
file_2 has three columns (Chromosome, peak_start, and peak_end).
All columns are numeric except for the SNP column.
The files are arranged as shown in the screenshots. file_1 has several hundred SNPs as rows while file_2 has 61 peaks. Each peak is marked by a peak_start and peak_end. There can be any of the 23 chromosomes in either file and file_2 has several peaks per chromosome.
I want to find if the position of the SNP in file_1 falls within the peak_start and peak_end in file_2 for each matching chromosome. If it does, I want to show which SNP falls in which peak (preferably write output to a tab-delimited file).
I would prefer to split the file, and use hashes where the chromosome is the key. I have found only a few questions remotely similar to this, but I could not understand well the suggested solutions.
Here is the example of my code. It is only meant to illustrate my question and so far doesn't do anything so think of it as "pseudocode".
#!usr/bin/perl
use strict;
use warnings;
my (%peaks, %X81_05);
my #array;
# Open file or die
unless (open (FIRST_SAMPLE, "X81_05.txt")) {
die "Could not open X81_05.txt";
}
# Split the tab-delimited file into respective fields
while (<FIRST_SAMPLE>) {
chomp $_;
next if (m/Chromosome/); # Skip the header
#array = split("\t", $_);
($chr1, $pos, $sample) = #array;
$X81_05{'$array[0]'} = (
'position' =>'$array[1]'
)
}
close (FIRST_SAMPLE);
# Open file using file handle
unless (open (PEAKS, "peaks.txt")) {
die "could not open peaks.txt";
}
my ($chr, $peak_start, $peak_end);
while (<PEAKS>) {
chomp $_;
next if (m/Chromosome/); # Skip header
($chr, $peak_start, $peak_end) = split(/\t/);
$peaks{$chr}{'peak_start'} = $peak_start;
$peaks{$chr}{'peak_end'} = $peak_end;
}
close (PEAKS);
for my $chr1 (keys %X81_05) {
my $val = $X81_05{$chr1}{'position'};
for my $chr (keys %peaks) {
my $min = $peaks{$chr}{'peak_start'};
my $max = $peaks{$chr}{'peak_end'};
if (($val > $min) and ($val < $max)) {
#print $val, " ", "lies between"," ", $min, " ", "and", " ", $max, "\n";
}
else {
#print $val, " ", "does not lie between"," ", $min, " ", "and", " ", $max, "\n";
}
}
}
More awesome code:
http://i.stack.imgur.com/fzwRQ.png
http://i.stack.imgur.com/2ryyI.png
A couple of program hints in Perl:
You can do this:
open (PEAKS, "peaks.txt")
or die "Couldn't open peaks.txt";
Instead of this:
unless (open (PEAKS, "peaks.txt")) {
die "could not open peaks.txt";
}
It's more standard Perl, and it's a bit easier to read.
Talking about Standard Perl, you should use the 3 argument open form, and use scalars for file handles:
open (my $peaks_fh, "<", "peaks.txt")
or die "Couldn't open peaks.txt";
This way, if your file's name just happens to start with a | or >, it will still work. Using scalars variables (variables that start with a $) makes it easier to pass file handles between functions.
Anyway, just to make sure I understand you correctly: You said "I would prefer ... use hashes where the chromosome is the key."
Now, I have 23 pairs of chromosomes, but each of those chromosomes might have thousands of SNPs on it. If you key by chromosome this way, you can only store a single SNP per chromosome. Is this what you want? I notice your data is showing all the same chromosome. That means you can't key by chromosome. I'm ignoring that for now, and using my own data.
I've also noticed a difference in what you said the files contained, and how your program uses them:
You said: "file 1 has 3 columns (SNP, Chromosome, and position)" , yet your code is:
($chr1, $pos, $sample) = #array;
Which I assume is Chromosome, Position, and SNP. Which way is the file arranged?
You've got to clarify exactly what you're asking for.
Anyway, here's the tested version that prints out in tab delimited format. This is in a bit more modern Perl format. Notice that I only have a single hash by chromosome (as you specified). I read the peaks.txt in first. If I find in my position file a chromosome that doesn't exist in my peaks.txt file, I simply ignore it. Otherwise, I'll add in the additional hashes for POSITION and SNP:
I do a final loop that prints everything out (tab delimitated) as you specified, but you didn't specify a format. Change it if you have to.
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie; #No need to check for file open failure
use constant {
PEAKS_FILE => "peak.txt",
POSITION_FILE => "X81_05.txt",
};
open ( my $peak_fh, "<", PEAKS_FILE );
my %chromosome_hash;
while ( my $line = <$peak_fh> ) {
chomp $line;
next if $line =~ /Chromosome/; #Skip Header
my ( $chromosome, $peak_start, $peak_end ) = split ( "\t", $line );
$chromosome_hash{$chromosome}->{PEAK_START} = $peak_start;
$chromosome_hash{$chromosome}->{PEAK_END} = $peak_end;
}
close $peak_fh;
open ( my $position_fh, "<", POSITION_FILE );
while ( my $line = <$position_fh> ) {
chomp $line;
my ( $chromosome, $position, $snp ) = split ( "\t", $line );
next unless exists $chromosome_hash{$chromosome};
if ( $position >= $chromosome_hash{$chromosome}->{PEAK_START}
and $position <= $chromosome_hash{$chromosome}->{PEAK_END} ) {
$chromosome_hash{$chromosome}->{SNP} = $snp;
$chromosome_hash{$chromosome}->{POSITION} = $position;
}
}
close $position_fh;
#
# Now Print
#
say join ("\t", qw(Chromosome, SNP, POSITION, PEAK-START, PEAK-END) );
foreach my $chromosome ( sort keys %chromosome_hash ) {
next unless exists $chromosome_hash{$chromosome}->{SNP};
say join ("\t",
$chromosome,
$chromosome_hash{$chromosome}->{SNP},
$chromosome_hash{$chromosome}->{POSITION},
$chromosome_hash{$chromosome}->{PEAK_START},
$chromosome_hash{$chromosome}->{PEAK_END},
);
}
A few things:
Leave spaces around parentheses on both sides. It makes it easier to read.
I use parentheses when others don't. The current style is not to use them unless you have to. I tend to use them for all functions that take more than a single argument. For example, I could have said open my $peak_fh, "<", PEAKS_FILE;, but I think parameters start to get lost when you have three parameters on a function.
Notice I use use autodie;. This causes the program to quit if it can't open a file. That's why I don't even have to test whether or not the file opened.
I would have preferred to use object oriented Perl to hide the structure of the hash of hashes. This prevents errors such as thinking that the start peek is stored in START_PEEK rather than PEAK_START. Perl won't detect these type of miskeyed errors. Therefore, I prefer to use objects whenever I am doing arrays of arrays or hashes of hashes.
You only need one for loop because you are expecting to find some of the SNPs in the second lot. Hence, loop through your %X81_05 hash and check if any matches one in %peak. Something like:
for my $chr1 (keys %X81_05)
{
if (defined $peaks{$chr1})
{
if ( $X81_05{$chr1}{'position'} > $peaks{$chr1}{'peak_start'}
&& $X81_05{$chr1}{'position'} < $peaks{$chr1}{'peak_end'})
{
print YOUROUTPUTFILEHANDLE $chr1 . "\t"
. $peaks{$chr1}{'peak_start'} . "\t"
. $peaks{$chr1}{'peak_end'};
}
else
{
print YOUROUTPUTFILEHANDLE $chr1
. "\tDoes not fall between "
. $peaks{$chr1}{'peak_start'} . " and "
. $peaks{$chr1}{'peak_end'};
}
}
}
Note: I Have not tested the code.
Looking at the screenshots that you have added, this is not going to work.
The points raised by #David are good; try to incorporate those in your programs. (I have borrowed most of the code from #David's post.)
One thing I didn't understand is that why load both peak values and position in hash, as loading one would suffice. As each chromosome has more than one record, use HoA. My solution is based on that. You might need to change the cols and their positions.
use strict;
use warnings;
our $Sep = "\t";
open (my $peak_fh, "<", "data/file2");
my %chromosome_hash;
while (my $line = <$peak_fh>) {
chomp $line;
next if $line =~ /Chromosome/; #Skip Header
my ($chromosome) = (split($Sep, $line))[0];
push #{$chromosome_hash{$chromosome}}, $line; # Store the line(s) indexed by chromo
}
close $peak_fh;
open (my $position_fh, "<", "data/file1");
while (my $line = <$position_fh>) {
chomp $line;
my ($chromosome, $snp, $position) = split ($Sep, $line);
next unless exists $chromosome_hash{$chromosome};
foreach my $peak_line (#{$chromosome_hash{$chromosome}}) {
my ($start,$end) = (split($Sep, $line))[1,2];
if ($position >= $start and $position <= $end) {
print "MATCH REQUIRED-DETAILS...$line-$peak_line\n";
}
else {
print "NO MATCH REQUIRED-DETAILS...$line-$peak_line\n";
}
}
}
close $position_fh;
I used #tuxuday and #David's code to solve this problem. Here is the final code that did what I wanted. I have not only learned a lot, but I have been able to solve my problem successfully! Kudos guys!
use strict;
use warnings;
use feature qw(say);
# Read in peaks and sample files from command line
my $usage = "Usage: $0 <peaks_file> <sample_file>";
my $peaks = shift #ARGV or die "$usage \n";
my $sample = shift #ARGV or die "$usage \n";
our $Sep = "\t";
open (my $peak_fh, "<", "$peaks");
my %chromosome_hash;
while (my $line = <$peak_fh>) {
chomp $line;
next if $line =~ /Chromosome/; #Skip Header
my ($chromosome) = (split($Sep, $line))[0];
push #{$chromosome_hash{$chromosome}}, $line; # Store the line(s) indexed by chromosome
}
close $peak_fh;
open (my $position_fh, "<", "$sample");
while (my $line = <$position_fh>) {
chomp $line;
next if $line =~ /Marker/; #Skip Header
my ($snp, $chromosome, $position) = split ($Sep, $line);
# Check if chromosome in peaks_file matches chromosome in sample_file
next unless exists $chromosome_hash{$chromosome};
foreach my $peak_line (#{$chromosome_hash{$chromosome}}) {
my ($start,$end,$peak_no) = (split( $Sep, $peak_line ))[1,2,3];
if ( $position >= $start and $position <= $end) {
# Print output
say join ("\t",
$snp,
$chromosome,
$position,
$start,
$end,
$peak_no,
);
}
else {
next; # Go to next chromosome
}
}
}
close $position_fh;

Common elements from two Perl arrays [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Comparing Two Arrays Using Perl
I am trying to find elements that are common in both the files:
below is my code. Please tell me what mistake I am doing.
open IN, "New_CLDB.txt" or die "couldn't locate input file";
open IN1, "New_adherent.txt" or die "couldn't locate input file";
use Data::Dumper;
#array = ();
while (<IN>) {
$line = $_;
chomp $line;
$a[$i] = $line;
++$i;
}
while (<IN1>) {
$line1 = $_;
chomp $line1;
$b[$m] = $line1;
++$m;
}
for ( $k = 0; $k < $i; ++$k ) {
for ( $f = 0; $f < $m; ++$f ) {
if ( $a[$k] ne $b[$f] ) {
push( #array, $a[$k] );
}
}
}
print #array, "\n";
Please tell me what mistake I am doing.
From superficially eyeballing your code, here's a list:
not using the strict pragma
not having a precise spec of what you want to achieve
attempting to do too much at once
Take a step away from the code and think about it in plain English. What would you need to do?
read files - open, read, close
read file data into an array - how exactly?
use a function not to repeat yourself for file A and file B
compare arrays
Do each task in isolation, always using strict. Always. Only then compose the single steps to a larger script.
You could also take a look at this other SO question.
If there are no duplicates in the second set:
my %set1;
while (<$fh1>) {
chomp;
++$set1{$_};
}
while (<$fh2>) {
chomp;
print("$_ is common to both sets\n")
if $set1{$_};
}
If there are possibly duplicates in the second set:
my %set1;
while (<$fh1>) {
chomp;
++$set1{$_};
}
my %set2;
while (<$fh2>) {
chomp;
print("$_ is common to both sets\n")
if $set1{$_} && !$set2{$_}++;
}
There are several things one should improve:
Always use strict; and use warnings;
Use the three argument version of open
Use lexical filehandles
Use meaningful formatting/indention
Append an array with push #array, $value;
And for SO questions ... what exactly is your problem and what do you expect.