Perl read and write text file with strings - perl

Friends, I need help. Following is my input text file:
Andrew UK
Cindy China
Rupa India
Gordon Australia
Peter New Zealand
I want to convert the above into a hash and write it back into the file when the records exist in a directory. I have tried the following (it does not work).
#!/usr/perl/5.14.1/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %hash = ();
my $file = ".../input_and_output.txt";
my $people;
my $country;
open (my $fh, "<", $file) or die "Can't open the file $file: ";
my $line;
while (my $line =<$fh>) {
my ($people) = split("", $line);
$hash{$people} = 1;
}
foreach my $people (sort keys %hash) {
my @country = $people;
foreach my $c (@country) {
my $c_folder = `country/test1_testdata/17.26.6/$c/`;
if (-d $cad_root){
print "Exit\n";
} else {
print "NA\n";
}
}

This is the primary problem:
my ($people) = split("", $line);
You are splitting using an empty string, and you are assigning the return value to a single variable (which will just end up with the first character of each line).
Instead, you should split on ' ' (a single space character which is a special pattern):
As another special case, ... when the PATTERN is either omitted or a string composed of a single space character (such as ' ' or "\x20" , but not e.g. / /). In this case, any leading whitespace in EXPR is removed before splitting occurs, and the PATTERN is instead treated as if it were /\s+/; in particular, this means that any contiguous whitespace (not just a single space character) is used as a separator.
Limit the number of fields returned to ensure the integrity of country names with spaces:
#!/usr/bin/env perl
use strict;
use warnings;
my @people;
while (my $line = <DATA>) {
$line =~ /\S/ or next;
$line =~ s/\s+\z//;
push @people, [ split ' ', $line, 2 ];
}
use YAML::XS;
print Dump \@people;
__DATA__
Andrew UK
Cindy China
Rupa India
Gordon Australia
Peter New Zealand
The entries are added to an array so 1) The input order is preserved; and 2) Two people with the same name but from different countries do not result in one entry being lost.
If the order is not important, you could just use a hash keyed on country names with people's names in an array reference for each entry. For now, I am going to assume order matters (it would help us help you if you put more effort into formulating a clear question).
One option is to now go through the list of person-country pairs, and print all those pairs for which the directory country/test1_testdata/17.26.6/$c/ exists (incidentally, in your code you have
my $c_folder = `country/test1_testdata/17.26.6/$c/`;
That will try to execute a program called country/test1_testdata/17.26.6/$c/ and save its output in $c_folder if it produces any. The moral of the story: In programming, precision matters. Just because ` looks like ', that doesn't mean you can use one to mean the other.)
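To illustrate the difference (a hypothetical snippet, using one of the country names from the sample data):
# Backticks execute a shell command and capture its output:
my $listing = `ls country/test1_testdata/17.26.6/UK/`;    # runs ls and captures what it prints
# Single quotes just build a string, which is what -d expects:
my $c_folder = 'country/test1_testdata/17.26.6/UK/';
print( (-d $c_folder) ? "Exists\n" : "NA\n" );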
Given that your question is focused on hashes, I use an array of references to anonymous hashes to store the list of people-country pairs in the code below. I cache the result of the lookup to reduce the number of times you need to hit the disk.
#!/usr/bin/env perl
use strict;
use warnings;
@ARGV == 2 ? run( @ARGV )
: die_usage()
;
sub run {
my $people_data_file = shift;
my $country_files_location = shift;
open my $in, '<', $people_data_file
or die "Failed to open '$people_data_file': $!";
my @people;
my %countries;
while (my $line = <$in>) {
next unless $line =~ /\S/; # ignore lines consisting of blanks
$line =~ s/\s+\z//;# remove all trailing whitespace
my ($name, $country) = split ' ', $line, 2;
push @people, { name => $name, country => $country };
$countries{ $country } = undef;
}
# At this point, @people has a list of person-country pairs
# We are going to use %countries to reduce the number of
# times we need to check the existence of a given directory,
# assuming that the directory tree is stable while this program
# is running.
PEOPLE:
for my $person ( @people ) {
my $country = $person->{country};
if ($countries{ $country }) {
print join("\t", $person->{name}, $country), "\n";
}
elsif (-d "$country_files_location/$country/") {
$countries{ $country } = 1;
redo PEOPLE;
}
}
}
sub die_usage {
die "Need data file name and country files location\n";
}
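To run that script, you would pass the people data file and the base directory for the country folders on the command line, something like this (the script name is hypothetical; the paths are the ones from the question):
perl people_countries.pl input_and_output.txt country/test1_testdata/17.26.6
It then prints one tab-separated name/country pair per line for each country whose directory exists.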
Now, there are a bazillion variations on this, which is why it is important for you to formulate a clear and concise question, so people trying to help you can answer your specific questions instead of each coming up with his/her own solution to the problem as they see it. For example, one could also do this:
#!/usr/bin/env perl
use strict;
use warnings;
@ARGV == 2 ? run( @ARGV )
: die_usage()
;
sub run {
my $people_data_file = shift;
my $country_files_location = shift;
open my $in, '<', $people_data_file
or die "Failed to open '$people_data_file': $!";
my %countries;
while (my $line = <$in>) {
next unless $line =~ /\S/; # ignore lines consisting of blanks
$line =~ s/\s+\z//;# remove all trailing whitespace
my ($name, $country) = split ' ', $line, 2;
push @{ $countries{$country} }, $name;
}
for my $country (keys %countries) {
-d "$country_files_location/$country"
or delete $countries{ $country };
}
# At this point, %countries maps each country for which
# we have a data file to a list of people. We can then
# print those quite simply so long as we don't care about
# replicating the original order of lines from the original
# data file. People's names will still be sorted in order
# of appearance in the original data file for each country.
while (my ($country, $people) = each %countries) {
for my $person ( @$people ) {
print join("\t", $person, $country), "\n";
}
}
}
sub die_usage {
die "Need data file name and country files location\n";
}

If what you want is a counter of names in a hash, then I've got you covered, buddy!
I won't attempt the rest of the code, because you are checking a folder of records that I don't have access to, so I can't troubleshoot anything more than this. One of your problems is explained at the end. Here is my code:
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say'; # Really like using say instead of print because no need for newline.
my $file = 'input_file.txt';
my $fh; # A filehandle.
my %hash;
my $people;
my $country;
my $line;
unless(open($fh, '<', $file)){die "Could not open file $file because $!"}
while($line = <$fh>)
{
($people, $country) = split(/\s{2,}/, $line); # splitting on at least two spaces
say "$people \t $country"; # Just printing out the columns in the file or people and Country.
$hash{$people}++; # Just counting all the people in the hash.
# Seeing how many unique names there are, like is there more than one Cindy, etc ...?
}
say "\nNow I'm just sorting the hash of people by names.";
foreach(sort{$a cmp $b} keys %hash)
{
say "$_ => $hash{$_}"; # Based on your file. The counter is at 1 because nobody has the same names.
}
Here is the output. As you can see, I fixed the problem by splitting on at least two whitespace characters so the country names don't get cut off.
Andrew UK
Cindy China
Rupa India
Gordon Australia
Peter New Zealand
Andrew United States
Now I'm just sorting the hash of people by names.
Andrew => 2
Cindy => 1
Gordon => 1
Peter => 1
Rupa => 1
I added another Andrew to the file. This Andrew is from the United States
as you can see. I see one of your problems. Look at this:
my ($people) = split("", $line);
You are splitting on characters as there is no space between those quotes.
If you look at this change now, you are splitting on at least one space.
my ($people) = split(" ", $line);
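A quick way to see the difference between the two (a standalone snippet, not part of the script above):
my $line = "Peter New Zealand";
my @chars = split("", $line);                  # ('P', 'e', 't', 'e', 'r', ' ', 'N', ...)
my ($name, $rest) = split(" ", $line);         # ('Peter', 'New') - the country is still cut at the space
my ($name2, $country) = split(" ", $line, 2);  # ('Peter', 'New Zealand') - a limit of 2 keeps it intact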

Related

Sorting 5th column in descending order error message

The text file I am trying to sort:
MYNETAPP01-NY
700000123456
Filesystem total used avail capacity Mounted on
/vol/vfiler_PROD1_SF_NFS15K01/ 1638GB 735GB 903GB 45% /vol/vfiler_PROD1_SF_NFS15K01/
/vol/vfiler_PROD1_SF_NFS15K01/.snapshot 409GB 105GB 303GB 26% /vol/vfiler_PROD1_SF_NFS15K01/.snapshot
/vol/vfiler_PROD1_SF_isci_15K01/ 2048GB 1653GB 394GB 81% /vol/vfiler_PROD1_SF_isci_15K01/
snap reserve 0TB 0TB 0TB ---% /vol/vfiler_PROD1_SF_isci_15K01/..
I am trying to sort this text file by its 5th column (the capacity field) in descending order.
When I first started this there was a percentage symbol mixed in with the numbers. I solved this by substituting the value like so: s/%/ %/g for @data;. This made it easier to sort the numbers alone. Afterwards I will change it back to the way it was with s/ %/%/g.
After running the script, I received this error:
#ACI-CM-L-53:~$ ./netapp.pl
Can't use string ("/vol/vfiler_PROD1_SF_isci_15K01/"...) as an ARRAY ref while "strict refs" in use at ./netapp.pl line 20, line 24 (#1)
(F) You've told Perl to dereference a string, something which
use strict blocks to prevent it happening accidentally. See
"Symbolic references" in perlref. This can be triggered by an @ or $
in a double-quoted string immediately before interpolating a variable,
for example in "user @$twitter_id", which says to treat the contents
of $twitter_id as an array reference; use a \ to have a literal @
symbol followed by the contents of $twitter_id: "user \@$twitter_id".
Uncaught exception from user code:
Can't use string ("/vol/vfiler_PROD1_SF_isci_15K01/"...) as an ARRAY ref while "strict refs" in use at ./netapp.pl line 20, <$DATA> line 24.
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
open (my $DATA, "<raw_info.txt") or die "$!";
my $systemName = <$DATA>;
my $systemSN = <$DATA>;
my $header = <$DATA>;
my @data;
while ( <$DATA> ) {
@data = (<$DATA>);
}
s/%/ %/g for @data;
s/---/000/ for @data;
print @data;
my @sorted = sort { $b->[5] <=> $a->[5] } @data;
print @sorted;
close($DATA);
Here is an approach using Text::Table which will nicely align your output into neat columns.
#!/usr/bin/perl
use strict;
use warnings;
use Text::Table;
open my $DATA, '<', 'file1' or die $!;
<$DATA> for 1 .. 2; # throw away first two lines
chomp(my $hdr = <$DATA>); # header
my $tbl = Text::Table->new( split ' ', $hdr, 6 );
$tbl->load( map [split /\s{2,}/], sort by_percent <$DATA> );
print $tbl;
sub by_percent {
my $keya = $a =~ /(\d+)%/ ? $1 : '0';
my $keyb = $b =~ /(\d+)%/ ? $1 : '0';
$keyb <=> $keya
}
The output generated is:
Filesystem total used avail capacity Mounted on
/vol/vfiler_PROD1_SF_isci_15K01/ 2048GB 1653GB 394GB 81% /vol/vfiler_PROD1_SF_isci_15K01/
/vol/vfiler_PROD1_SF_NFS15K01/ 1638GB 735GB 903GB 45% /vol/vfiler_PROD1_SF_NFS15K01/
/vol/vfiler_PROD1_SF_NFS15K01/.snapshot 409GB 105GB 303GB 26% /vol/vfiler_PROD1_SF_NFS15K01/.snapshot
snap reserve 0TB 0TB 0TB ---% /vol/vfiler_PROD1_SF_isci_15K01/..
Update
To explain some of the advanced parts of the program.
my $tbl = Text::Table->new( split ' ', $hdr, 6 );
This creates the Text::Table object with the header split into 6 columns. Without the limit of 6 columns it would have created 7, because the last field, 'Mounted on', also contains a space and would have been incorrectly split into 2 columns.
$tbl->load( map [split /\s{2,}/], sort by_percent <$DATA> );
The statement above 'loads' the data into the table. The map applies a transformation to each line from <$DATA>. Each line is split into an anonymous array (created by [....]). The split is on 2 or more spaces, \s{2,}. If that wasn't specified, then the data 'snap reserve', which contains a single space, would have been incorrectly split.
I hope this makes what's going on clearer.
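To illustrate the effect of the limit (a small standalone check using the header from the sample data):
my $hdr = 'Filesystem total used avail capacity Mounted on';
my @six   = split ' ', $hdr, 6;   # 6 columns; 'Mounted on' stays together
my @seven = split ' ', $hdr;      # 7 columns; 'Mounted' and 'on' end up separated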
And a simpler example that doesn't align the columns like Text::Table, but leaves them in the form they originally were read might be:
open my $DATA, '<', 'file1' or die $!;
<$DATA> for 1 .. 2; # throw away first two lines
my $hdr = <$DATA>; # header
print $hdr;
print sort by_percent <$DATA>;
sub by_percent {
my $keya = $a =~ /(\d+)%/ ? $1 : '0';
my $keyb = $b =~ /(\d+)%/ ? $1 : '0';
$keyb <=> $keya
}
In addition to skipping the fourth line of the file, this line is wrong
my @sorted = sort { $b->[5] <=> $a->[5] } @data
But presumably you knew that as the error message says
at ./netapp.pl line 20
$a and $b are lines of text from the array @data, but you're treating them as array references. It looks like you need to extract the fifth "field" from both variables before you compare them, but no one can tell you how to do that without knowing exactly how the fields are separated.
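That said, if the capacity field is the only one containing a percent sign, one possibility (a sketch only, untested against the real file) is to extract that number inside the comparator instead of treating the lines as array references:
my @sorted = sort {
my ($pb) = $b =~ /(\d+)%/;     # capacity from line $b, e.g. 81
my ($pa) = $a =~ /(\d+)%/;     # capacity from line $a
($pb // 0) <=> ($pa // 0);     # lines such as 'snap reserve' with ---% sort last (// needs perl 5.10+)
} @data;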
Your code is quite far from what you want. Trying to change it as little as possible, this works:
#!/usr/bin/perl
use strict;
use warnings;
open (my $fh, "<", "raw_info.txt") or die "$!";
my $systemName = <$fh>;
my $systemSN = <$fh>;
my $header = <$fh>;
my @data;
while( my $d = <$fh> ) {
chomp $d;
my @fields = split '\s{2,}', $d;
if( scalar @fields > 4 ) {
$fields[4] = $fields[4] =~ /(\d+)/ ? $1 : 0;
push @data, [ @fields ];
}
}
foreach my $i ( @data ) {
print join("\t", @$i), "\n";
}
my @sorted = sort { $b->[4] <=> $a->[4] } @data;
foreach my $i ( @sorted ) {
$i->[4] .= '%';
print join("\t", @$i), "\n";
}
close($fh);
Let's make a few things clear:
If using the $ notation, it is customary to name file handle variables in lower case, such as $fd (for "file descriptor") or $fh (for "file handle").
You define but do not use the first three variables. If you don't apply chomp to them, the trailing newline will remain in them. I have not done it as they are not used.
You are reading into a list with a line in each element. But then you need a list ref inside each element to separate the fields.
The separation is done using split.
Empty lines are skipped by counting the number of fields.
I use something more compact to get rid of the % and transform the --- into a 0.
Lines are added to list @data using push and turning the list to add into a list ref with [ @list ].
A list of list refs needs two loops to get printed. One traverses the list (foreach), another (implicit in join) the columns.
Now you can sort the list and print it out in the same way. By the way, Perl lists (or arrays) start at index 0, so the 5th column is 4.
This is not the way I would have coded it, but I hope it is clear to you as it is close to your original code.

Parsing file based on column ID: perl

I have a tab-delimited file with repeated values in the first column. The single, but repeated, values in the first column correspond to multiple values in the second column. It looks something like this:
AAAAAAAAAA1 m081216|101|123
AAAAAAAAAA1 m081216|100|1987
AAAAAAAAAA1 m081216|927|463729
BBBBBBBBBB2 m081216|254|260489
BBBBBBBBBB2 m081216|475|1234
BBBBBBBBBB2 m081216|987|240
CCCCCCCCCC3 m081216|433|1000
CCCCCCCCCC3 m081216|902|366
CCCCCCCCCC3 m081216|724|193
For every type of sequence in the first column, I am trying to print to a file with just the sequences that correspond to it. The name of the file should include the repeated sequence in the first column and the number of sequences that correspond to it in the second column. In the above example I would therefore have 3 files of 3 sequences each. The first file would be named something like "AAAAAAAAAA1.3.txt" and look like the following when opened:
m081216|101|123
m081216|100|1987
m081216|927|463729
I have seen other similar questions, but they have been answered by using a hash. I don't think I can use a hash because I need to keep the relationship between the columns. Maybe there is a way to use a hash of hashes? I am not sure.
Here is my code so far.
use warnings;
use strict;
use List::MoreUtils 'true';
open(IN, "<", "/path/to/in_file") or die $!;
my @array;
my $queryID;
while(<IN>){
chomp;
my $OutputLine = $_;
processOutputLine($OutputLine);
}
sub processOutputLine {
my ($OutputLine) = @_;
my @Columns = split("\t", $OutputLine);
my ($queryID, $target) = @Columns;
push(@array, $target, "\n") unless grep{$queryID eq $_} @array;
my $delineator = "\n";
my $count = true { /$delineator/g } @array;
open(OUT, ">", "/path/to/out_$..$queryID.$count.txt") or die $!;
foreach(@array){
print OUT @array;
}
}
I would still recommend a hash. However, you store all sequences related to the same id in an anonymous array which is the value for that ID key. It's really two lines of code.
use warnings;
use strict;
use feature qw(say);
my $filename = 'rep_seqs.txt'; # input file name
open my $in_fh, '<', $filename or die "Can't open $filename: $!";
my %seqs;
foreach my $line (<$in_fh>) {
chomp $line;
my ($id, $seq) = split /\t/, $line;
push @{$seqs{$id}}, $seq;
}
close $in_fh;
my $out_fh;
for (sort keys %seqs) {
my $outfile = $_ . '_' . scalar @{$seqs{$_}} . '.txt';
open $out_fh, '>', $outfile or do {
warn "Can't open $outfile: $!";
next;
};
say $out_fh $_ for @{$seqs{$_}};
}
close $out_fh;
With your input I get the desired files, named AA..._count.txt, with their corresponding three lines each. If items separated by | should be split, you can do that while writing it out; for example, see the short sketch below.
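A sketch of that change (replacing the say line inside the loop above):
# instead of:  say $out_fh $_ for @{$seqs{$_}};
say $out_fh join("\t", split /\|/, $_) for @{$seqs{$_}};  # m081216|101|123 becomes tab-separated fields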
Comments
The anonymous array for a key $seqs{$id} is created once we push, if not there already
If there are issues with tabs (converted to spaces?), use ' '. See the comment.
A filehandle that is still open is closed implicitly when the same variable is re-opened (and when it goes out of scope), so there is no need to close it on every pass through the loop
The default pattern for split is ' ', also triggering specific behavior -- it matches "any contiguous whitespace", and also omits leading whitespace. (The pattern / / matches a single space, turning off this special behavior of ' '.) See a more precise description on the split page. Thus it is advisable to use ' ' when splitting on unspecified number of spaces, since in the case of split this is a bit idiomatic, is perhaps the most common use, and is its default. Thanks to Borodin for prompting this comment and update (the original post had the equivalent /\s+/).
Note that in this case, since ' ' is the default along with $_, we can shorten it a little
for (<$in_fh>) {
chomp;
my ($id, $seq) = split;
push @{$seqs{$id}}, $seq;
}

Count the number of items derived from split without putting into an array

I am looking to spare the use of an array for memory's sake, but still get the number of items derived from the split function for each pass of a while loop.
The ultimate goal is to filter the output files according to the number of their sequences, which could be deduced by the number of rows the file has, or the number of carets (>) that appear, or the number of line breaks, etc.
Below is my code:
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
open(INFILE, "<", "Clustered_Barcodes.txt") or die $!;
my %hash = (
"TTTATGC" => "TATAGCGCTTTATGCTAGCTAGC",
"TTTATGG" => "TAGCTAGCTTTATGGGCTAGCTA",
"TTTATCC" => "GCTAGCTATTTATCCGCTAGCTA",
"TTTATCG" => "AGTCATGCTTTATCGCGATCGAT",
"TTTATAA" => "TAGCTAGCTTTATAATAGCTAGC",
"TTTATAA" => "ATCGATCGTTTATAACGATCGAT",
"TTTATAT" => "TCGATCGATTTATATTAGCTAGC",
"TTTATAT" => "TAGCTAGCTTTATATGCTAGCTA",
"TTTATTA" => "GCTAGCTATTTATTATAGCTAGC",
"CTTGTAA" => "ATCGATCGCTTGTAACGATTAGC",
);
while(my $line = <INFILE>){
chomp $line;
open my $out, '>', "Clustered_Barcode_$..txt" or die $!;
foreach my $sequence (split /\t/, $line){
if (exists $hash{$sequence}){
print $out ">$sequence\n$hash{$sequence}\n";
}
}
}
The input file, "Clustered_Barcodes.txt" when opened, looks like the following:
TTTATGC TTTATGG TTTATCC TTTATCG
TTTATAA TTTATAA TTTATAT TTTATAT TTTATTA
CTTGTAA
There will be three output files from the code, "Clustered_Barcode_1.txt", "Clustered_Barcode_2.txt", and "Clustered_Barcode_3.txt". An example of what the output files would look like could be the 3rd and final file, which would look like the following:
>CTTGTAA
ATCGATCGCTTGTAACGATTAGC
I need some way to modify my code to identify the number of rows, carets, or sequences that appear in the file and work that into the title of the file. The new title for the above sequence could be something like "Clustered_Barcode_Number_3_1_Sequence.txt"
PS- I made the hash in the above code manually in an attempt to make things simpler. If you want to see the original code, here it is. The input file format is something like:
>TAGCTAGC
GCTAAGCGATGCTACGGCTATTAGCTAGCCGGTA
Here is the code for setting up the hash:
my $dir = ("~/Documents/Sequences");
open(INFILE, "<", "~/Documents/Clustered_Barcodes.txt") or die $!;
my %hash = ();
my @ArrayofFiles = glob "$dir/*"; #put all files from the specified directory into an array
#print join("\n", @ArrayofFiles), "\n"; #this is a diagnostic test print statement
foreach my $file (@ArrayofFiles){ #make hash of barcodes and sequences
open (my $sequence, $file) or die "can't open file: $!";
while (my $line = <$sequence>) {
if ($line !~/^>/){
my $seq = $line;
$seq =~ s/\R//g;
#print $seq;
$seq =~ m/(CATCAT|TACTAC)([TAGC]{16})([TAGC]+)([TAGC]{16})(CATCAT|TACTAC)/;
$hash{$2} = $3;
}
}
}
while(<INFILE>){
etc
You can use regex to get the count:
my $delimiter = "\t";
my $line = "zyz\tpqr\tabc\txyz";
my $count = () = $line =~ /$delimiter/g; # $count is now 3
print $count;
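Since the question asks for the count without keeping an array, it is also worth noting that split in scalar context simply returns the number of fields it found (on reasonably modern Perls), so an alternative is:
my $count = split /\t/, $line;  # number of tab-separated fields, no array kept
Note the difference: the regex match above counts separators (3 tabs), while scalar-context split counts fields (4).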
Your hash structure is not right for your problem, as you have multiple entries for the same IDs. For example, the TTTATAA hash key has 2 entries in your %hash.
To solve this, use hash of array to create the hash.
Change your hash creation code in
$hash{$2} = $3;
to
push(@{$hash{$2}}, $3);
Now change your code in the while loop
while(my $line = <INFILE>){
chomp $line;
open my $out, '>', "Clustered_Barcode_$..txt" or die $!;
my %id_list;
foreach my $sequence (split /\t/, $line){
$id_list{$sequence}=1;
}
foreach my $sequence(keys %id_list)
{
foreach my $val (@{$hash{$sequence}})
{
print $out ">$sequence\n$val\n";
}
}
}
I have assumed that:
The first digit in the output file name is the input file line number
The second digit in the output file name is the input file column number
That the input hash is a hash of arrays to cover the case of several sequences "matching" the one barcode as mentioned in the comments
When a barcode has a match in the hash, that the output file will lists all the sequences in the array, one per line.
The simplest way to do this that I can see is to build the output file using a temporary filename and then rename it when you have all the data. According to the Perl Cookbook, the easiest way to create temporary files is with the module File::Temp.
The key to this solution is to move through the list of barcodes that appear on a line by column index rather than the usual Perl way of simply iterating over the list itself. To get the actual barcodes, the column number $col is used to index back into @barcodes, which is created by splitting the line on whitespace. (Note that splitting on a single space is special-cased by Perl to emulate the behaviour of one of its predecessors, awk: leading whitespace is removed and the split is on whitespace, not a single space.)
This way we have the column number (indexed from 1) and the line number, which we can get from the Perl special variable $. (the current input line number). We can then use these to rename the file using the builtin rename().
use warnings;
use strict;
use diagnostics;
use feature qw(say);   # say is used below, so the feature must be enabled
use File::Temp qw(tempfile);
open(INFILE, "<", "Clustered_Barcodes.txt") or die $!;
my %hash = (
"TTTATGC" => [ "TATAGCGCTTTATGCTAGCTAGC" ],
"TTTATGG" => [ "TAGCTAGCTTTATGGGCTAGCTA" ],
"TTTATCC" => [ "GCTAGCTATTTATCCGCTAGCTA" ],
"TTTATCG" => [ "AGTCATGCTTTATCGCGATCGAT" ],
"TTTATAA" => [ "TAGCTAGCTTTATAATAGCTAGC", "ATCGATCGTTTATAACGATCGAT" ],
"TTTATAT" => [ "TCGATCGATTTATATTAGCTAGC", "TAGCTAGCTTTATATGCTAGCTA" ],
"TTTATTA" => [ "GCTAGCTATTTATTATAGCTAGC" ],
"CTTGTAA" => [ "ATCGATCGCTTGTAACGATTAGC" ]
);
my $cbn = "Clustered_Barcode_Number";
my $trailer = "Sequence.txt";
while (my $line = <INFILE>) {
chomp $line ;
my $line_num = $. ;
my @barcodes = split " ", $line ;
for my $col ( 1 .. @barcodes ) {
my $barcode = $barcodes[ $col - 1 ]; # arrays indexed from 0
# skip this one if its not in the hash
next unless exists $hash{$barcode} ;
my @sequences = @{ $hash{$barcode} } ;
# Have a hit - create temp file and output sequences
my ($out, $temp_filename) = tempfile();
say $out ">$barcode" ;
say $out $_ for (@sequences) ;
close $out ;
# Rename based on input line and column
my $new_name = join "_", $cbn, $line_num, $col, $trailer ;
rename ($temp_filename, $new_name) or
warn "Couldn't rename $temp_filename to $new_name: $!\n" ;
}
}
close INFILE
All of the barcodes in your sample input data have a match in the hash, so when I run this, I get 4 files for line 1, 5 for line 2 and 1 for line 3.
Clustered_Barcode_Number_1_1_Sequence.txt
Clustered_Barcode_Number_1_2_Sequence.txt
Clustered_Barcode_Number_1_3_Sequence.txt
Clustered_Barcode_Number_1_4_Sequence.txt
Clustered_Barcode_Number_2_1_Sequence.txt
Clustered_Barcode_Number_2_2_Sequence.txt
Clustered_Barcode_Number_2_3_Sequence.txt
Clustered_Barcode_Number_2_4_Sequence.txt
Clustered_Barcode_Number_2_5_Sequence.txt
Clustered_Barcode_Number_3_1_Sequence.txt
Clustered_Barcode_Number_1_2_Sequence.txt for example has:
>TTTATGG
TAGCTAGCTTTATGGGCTAGCTA
and Clustered_Barcode_Number_2_5_Sequence.txt has:
>TTTATTA
GCTAGCTATTTATTATAGCTAGC
Clustered_Barcode_Number_2_3_Sequence.txt - which matched a hash key with two sequences - had the following;
>TTTATAT
TCGATCGATTTATATTAGCTAGC
TAGCTAGCTTTATATGCTAGCTA
I was speculating here about what you wanted when a supplied barcode had two matches. Hope that helps.

Perl : Trying to dereference an array after sorting it

I am currently writing a Perl script where I have a reference to an array (students) of hash references. After adding the hash references to the array of students, I ask the user how to sort them. This is where it gets confusing. I do not know how to dereference the sorted array. Using Dumper I can get the sorted array, but in an unorganized output. How can I dereference the array of hash references after sorting?
#!bin/usr/perl
use strict;
use warnings;
use Data::Dumper;
use 5.010;
#reference to a var $r = \$var; Dereferencing $$r
#reference to an array $r = \@var ; Dereferencing @$r
#reference to a hash $r = \%var ; dereferencing %$r
my $filename = $ARGV[0];
my $students = [];
open ( INPUT_FILE , '<', "$filename" ) or die "Could not open to read \n ";
sub readLines{
while(my $currentLine = <INPUT_FILE>){
chomp($currentLine);
my @myLine = split(/\s+/,$currentLine);
my %temphash = (
name => "$myLine[0]",
age => "$myLine[1]",
GPA => "$myLine[2]",
MA => "$myLine[3]"
);
pushToStudents(\%temphash);
}
}
sub pushToStudents{
my $data = shift;
push $students ,$data;
}
sub printData{
my $COMMAND = shift;
if($COMMAND eq "sort up"){
my @sortup = sort{ $a->{name} cmp $b->{name} } @$students;
print Dumper @sortup;
}elsif($COMMAND eq "sort down"){
my @sortdown = sort{ $b->{name} cmp $a->{name} } @$students;
print Dumper @sortdown;
# find a way to dereference so as to make a more organized, user-friendly read.
}else{
print "\n quit";
}
}
readLines();
#Output in random, the ordering of each users data is random
printf"please choose display order : ";
my $response = <STDIN>;
chomp $response;
printData($response);
The problem here is that you're expecting Dumper to provide an organised output. It doesn't. It dumps a data structure to make debugging easier. The key problem is that hashes are explicitly unordered data structures - they're key-value mappings, and they don't produce any particular output order.
With reference to perldata:
Note that just because a hash is initialized in that order doesn't mean that it comes out in that order.
And specifically the keys function:
Hash entries are returned in an apparently random order. The actual random order is specific to a given hash; the exact same series of operations on two hashes may result in a different order for each hash.
There is a whole section in perlsec which explains this in more detail, but suffice it to say - hashes are in a random order, which means that whilst you're sorting your students by name, the key-value pairs within each student record aren't sorted.
I would suggest instead of:
my @sortdown = sort{ $b->{name} cmp $a->{name} } @$students;
print Dumper @sortdown;
You'd be better off with using a slice:
my @field_order = qw ( name age GPA MA );
foreach my $student ( sort { $b -> {name} cmp $a -> {name} } @$students ) {
print @{$student}{@field_order}, "\n";
}
Arrays (@field_order) are explicitly ordered, so you will always print your student fields in the same sequence. (Haven't fully tested for your example I'm afraid, because I don't have your source data, but this approach works with a sample data snippet).
If you do need to print the keys as well, then you may need a foreach loop instead:
foreach my $field ( @field_order ) {
print "$field => ", $student->{$field},"\n";
}
Or perhaps the more terse:
print "$_ => ", $student -> {$_},"\n" for @field_order;
I'm not sure I like that as much though, but that's perhaps a matter of taste.
The essence of your mistake is to assume that hashes will have a specific ordering. As @Sobrique explains, that assumption is wrong.
I assume you are trying to learn Perl, and therefore, some guidance on the basics will be useful:
#!bin/usr/perl
Your shebang line is wrong: On Windows, or if you run your script with perl script.pl, it will not matter, but you want to make sure the interpreter that is specified in that line uses an absolute path.
Also, you may not always want to use the perl interpreter that came with the system, in which case #!/usr/bin/env perl may be helpful for one-off scripts.
use strict;
use warnings;
use Data::Dumper;
use 5.010;
I tend to prefer version constraints before pragmata (except in the case of utf8). Data::Dumper is a debugging aid, not something you use for human readable reports.
my $filename = $ARGV[0];
You should check if you were indeed given an argument on the command line as in:
@ARGV or die "Need filename\n";
my $filename = $ARGV[0];
open ( INPUT_FILE , '<', "$filename" ) or die "Could not open to read \n ";
File handles such as INPUT_FILE are called bareword filehandles. These have package scope. Instead, use lexical filehandles whose scope you can restrict to the smallest appropriate block.
There is no need to interpolate $filename in the third argument to open.
Always include the name of the file and the error message when dying from an error in open. Surrounding the filename with ' ' helps you identify any otherwise hard to detect characters that might be causing the problem (e.g. a newline or a space).
open my $input_fh, '<', $filename
or die "Could not open '$filename' for reading: $!";
sub readLines{
This is reading into an array you defined in global scope. What if you want to use the same subroutine to read records from two different files into two separate arrays? readLines should receive a filename as an argument, and return an arrayref as its output (see below).
while(my $currentLine = <INPUT_FILE>){
chomp($currentLine);
In most cases, you want all trailing whitespace removed, not just the line terminator.
my #myLine = split(/\s+/,$currentLine);
split on /\s+/ is different than split ' '. In most cases, the latter is infinitely more useful. Read about the differences in perldoc -f split.
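To see the difference (a standalone illustration with made-up data):
my $record = "  Alice 21 3.9 CS";       # note the leading spaces
my @with_regex = split /\s+/, $record;  # ('', 'Alice', '21', '3.9', 'CS') - empty leading field
my @with_space = split ' ', $record;    # ('Alice', '21', '3.9', 'CS') - leading whitespace skipped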
my %temphash = (
name => "$myLine[0]",
age => "$myLine[1]",
GPA => "$myLine[2]",
MA => "$myLine[3]"
);
Again with the useless interpolation. There is no need to interpolate those values into fresh strings (except maybe in the case where they might be objects which overload stringification, but, in this case, you know they are just plain strings).
pushToStudents(\%temphash);
No need for the extra pushToStudents subroutine in this case, unless it is a stub for a method that will later be able to load the data into a database or something. Even in that case, it would be better to provide a callback to the function.
sub pushToStudents{
my $data = shift;
push $students ,$data;
}
You are pushing data to a global variable. A program where there can only ever be a single array of student records is not useful.
sub printData{
my $COMMAND = shift;
if($COMMAND eq "sort up"){
Don't do this. Every subroutine should have one clear purpose.
Here is a revised version of your program.
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;
use Carp qw( croak );
run(\@ARGV);
sub run {
my $argv = $_[0];
@$argv
or die "Need name of student records file\n";
open my $input_fh, '<', $argv->[0]
or croak "Cannot open '$argv->[0]' for reading: $!";
print_records(
read_student_records($input_fh),
prompt_sort_order(),
);
return;
}
sub read_student_records {
my $fh = shift;
my @records;
while (my $line = <$fh>) {
last unless $line =~ /\S/;
my @fields = split ' ', $line;
push @records, {
name => $fields[0],
age => $fields[1],
gpa => $fields[2],
ma => $fields[3],
};
}
return \@records;
}
sub print_records {
my $records = shift;
my $sorter = shift;
if ($sorter) {
$records = [ sort $sorter @$records ];
}
say "@{ $_ }{ qw( age name gpa ma )}" for @$records;
return;
}
sub prompt_sort_order {
my #sorters = (
[ "Input order", undef ],
[ "by name in ascending order", sub { $a->{name} cmp $b->{name} } ],
[ "by name in descending order", sub { $b->{name} cmp $a->{name} } ],
[ "by GPA in ascending order", sub { $a->{gpa} <=> $b->{gpa} } ],
[ "by GPA in descending order", sub { $b->{gpa} <=> $a->{gpa} } ],
);
while (1) {
print "Please choose the order in which you want to print the records\n";
print "[ $_ ] $sorters[$_ - 1][0]\n" for 1 .. @sorters;
printf "\n\t(%s)\n", join('/', 1 .. @sorters);
my ($response) = (<STDIN> =~ /\A \s*? ([1-9][0-9]*?) \s+ \z/x);
if (
$response and
($response >= 1) and
($response <= @sorters)
) {
return $sorters[ $response - 1][1];
}
}
# should not be reached
return;
}
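For reference, assuming the student records live in a whitespace-separated file such as students.txt (a hypothetical name, with name, age, GPA and MA columns), the revised program would be run as:
perl students.pl students.txt
It then shows the numbered menu from prompt_sort_order and prints the records as space-separated age, name, GPA and MA fields in the chosen order.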

Using Perl hashes to handle tab-delimited files

I have two files:
file_1 has three columns (Marker(SNP), Chromosome, and position)
file_2 has three columns (Chromosome, peak_start, and peak_end).
All columns are numeric except for the SNP column.
The files are arranged as shown in the screenshots. file_1 has several hundred SNPs as rows while file_2 has 61 peaks. Each peak is marked by a peak_start and peak_end. There can be any of the 23 chromosomes in either file and file_2 has several peaks per chromosome.
I want to find if the position of the SNP in file_1 falls within the peak_start and peak_end in file_2 for each matching chromosome. If it does, I want to show which SNP falls in which peak (preferably write output to a tab-delimited file).
I would prefer to split the file, and use hashes where the chromosome is the key. I have found only a few questions remotely similar to this, but I could not understand well the suggested solutions.
Here is the example of my code. It is only meant to illustrate my question and so far doesn't do anything so think of it as "pseudocode".
#!usr/bin/perl
use strict;
use warnings;
my (%peaks, %X81_05);
my @array;
# Open file or die
unless (open (FIRST_SAMPLE, "X81_05.txt")) {
die "Could not open X81_05.txt";
}
# Split the tab-delimited file into respective fields
while (<FIRST_SAMPLE>) {
chomp $_;
next if (m/Chromosome/); # Skip the header
@array = split("\t", $_);
($chr1, $pos, $sample) = @array;
$X81_05{'$array[0]'} = (
'position' =>'$array[1]'
)
}
close (FIRST_SAMPLE);
# Open file using file handle
unless (open (PEAKS, "peaks.txt")) {
die "could not open peaks.txt";
}
my ($chr, $peak_start, $peak_end);
while (<PEAKS>) {
chomp $_;
next if (m/Chromosome/); # Skip header
($chr, $peak_start, $peak_end) = split(/\t/);
$peaks{$chr}{'peak_start'} = $peak_start;
$peaks{$chr}{'peak_end'} = $peak_end;
}
close (PEAKS);
for my $chr1 (keys %X81_05) {
my $val = $X81_05{$chr1}{'position'};
for my $chr (keys %peaks) {
my $min = $peaks{$chr}{'peak_start'};
my $max = $peaks{$chr}{'peak_end'};
if (($val > $min) and ($val < $max)) {
#print $val, " ", "lies between"," ", $min, " ", "and", " ", $max, "\n";
}
else {
#print $val, " ", "does not lie between"," ", $min, " ", "and", " ", $max, "\n";
}
}
}
Screenshots from the original question:
http://i.stack.imgur.com/fzwRQ.png
http://i.stack.imgur.com/2ryyI.png
A couple of program hints in Perl:
You can do this:
open (PEAKS, "peaks.txt")
or die "Couldn't open peaks.txt";
Instead of this:
unless (open (PEAKS, "peaks.txt")) {
die "could not open peaks.txt";
}
It's more standard Perl, and it's a bit easier to read.
Talking about Standard Perl, you should use the 3 argument open form, and use scalars for file handles:
open (my $peaks_fh, "<", "peaks.txt")
or die "Couldn't open peaks.txt";
This way, if your file's name just happens to start with a | or >, it will still work. Using scalars variables (variables that start with a $) makes it easier to pass file handles between functions.
Anyway, just to make sure I understand you correctly: You said "I would prefer ... use hashes where the chromosome is the key."
Now, there are 23 pairs of chromosomes, but each of those chromosomes might have thousands of SNPs on it. If you key by chromosome this way, you can only store a single SNP per chromosome. Is this what you want? I notice your data shows all the same chromosome. That means you can't key by chromosome alone. I'm ignoring that for now, and using my own data.
I've also noticed a difference in what you said the files contained, and how your program uses them:
You said: "file 1 has 3 columns (SNP, Chromosome, and position)" , yet your code is:
($chr1, $pos, $sample) = @array;
Which I assume is Chromosome, Position, and SNP. Which way is the file arranged?
You've got to clarify exactly what you're asking for.
Anyway, here's the tested version that prints out in tab delimited format. This is in a bit more modern Perl format. Notice that I only have a single hash by chromosome (as you specified). I read the peaks.txt in first. If I find in my position file a chromosome that doesn't exist in my peaks.txt file, I simply ignore it. Otherwise, I'll add in the additional hashes for POSITION and SNP:
I do a final loop that prints everything out (tab delimitated) as you specified, but you didn't specify a format. Change it if you have to.
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie; #No need to check for file open failure
use constant {
PEAKS_FILE => "peak.txt",
POSITION_FILE => "X81_05.txt",
};
open ( my $peak_fh, "<", PEAKS_FILE );
my %chromosome_hash;
while ( my $line = <$peak_fh> ) {
chomp $line;
next if $line =~ /Chromosome/; #Skip Header
my ( $chromosome, $peak_start, $peak_end ) = split ( "\t", $line );
$chromosome_hash{$chromosome}->{PEAK_START} = $peak_start;
$chromosome_hash{$chromosome}->{PEAK_END} = $peak_end;
}
close $peak_fh;
open ( my $position_fh, "<", POSITION_FILE );
while ( my $line = <$position_fh> ) {
chomp $line;
my ( $chromosome, $position, $snp ) = split ( "\t", $line );
next unless exists $chromosome_hash{$chromosome};
if ( $position >= $chromosome_hash{$chromosome}->{PEAK_START}
and $position <= $chromosome_hash{$chromosome}->{PEAK_END} ) {
$chromosome_hash{$chromosome}->{SNP} = $snp;
$chromosome_hash{$chromosome}->{POSITION} = $position;
}
}
close $position_fh;
#
# Now Print
#
say join ("\t", qw(Chromosome SNP POSITION PEAK-START PEAK-END) );
foreach my $chromosome ( sort keys %chromosome_hash ) {
next unless exists $chromosome_hash{$chromosome}->{SNP};
say join ("\t",
$chromosome,
$chromosome_hash{$chromosome}->{SNP},
$chromosome_hash{$chromosome}->{POSITION},
$chromosome_hash{$chromosome}->{PEAK_START},
$chromosome_hash{$chromosome}->{PEAK_END},
);
}
A few things:
Leave spaces around parentheses on both sides. It makes it easier to read.
I use parentheses when others don't. The current style is not to use them unless you have to. I tend to use them for all functions that take more than a single argument. For example, I could have said open my $peak_fh, "<", PEAKS_FILE;, but I think parameters start to get lost when you have three parameters on a function.
Notice I use use autodie;. This causes the program to quit if it can't open a file. That's why I don't even have to test whether or not the file opened.
I would have preferred to use object oriented Perl to hide the structure of the hash of hashes. This prevents errors such as thinking that the start peek is stored in START_PEEK rather than PEAK_START. Perl won't detect these type of miskeyed errors. Therefore, I prefer to use objects whenever I am doing arrays of arrays or hashes of hashes.
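A minimal sketch of what that could look like (hypothetical package and method names, not code from the answer above):
package Peak;  # hypothetical class wrapping one peak record
use strict;
use warnings;
sub new {
my ($class, %args) = @_;
return bless { peak_start => $args{peak_start}, peak_end => $args{peak_end} }, $class;
}
sub peak_start { $_[0]->{peak_start} }
sub peak_end { $_[0]->{peak_end} }
sub contains { # does a position fall within this peak?
my ($self, $position) = @_;
return $position >= $self->peak_start && $position <= $self->peak_end;
}
1;
Calling a mistyped method such as $peak->start_peek dies with "Can't locate object method", whereas a mistyped hash key like $chromosome_hash{$chromosome}->{START_PEEK} just silently returns undef.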
You only need one for loop, because you are expecting to find some of the SNPs in the second lot. Hence, loop through your %X81_05 hash and check if any match one in %peaks. Something like:
for my $chr1 (keys %X81_05)
{
if (defined $peaks{$chr1})
{
if ( $X81_05{$chr1}{'position'} > $peaks{$chr1}{'peak_start'}
&& $X81_05{$chr1}{'position'} < $peaks{$chr1}{'peak_end'})
{
print YOUROUTPUTFILEHANDLE $chr1 . "\t"
. $peaks{$chr1}{'peak_start'} . "\t"
. $peaks{$chr1}{'peak_end'};
}
else
{
print YOUROUTPUTFILEHANDLE $chr1
. "\tDoes not fall between "
. $peaks{$chr1}{'peak_start'} . " and "
. $peaks{$chr1}{'peak_end'};
}
}
}
Note: I have not tested the code.
Looking at the screenshots that you have added, this is not going to work.
The points raised by @David are good; try to incorporate those in your programs. (I have borrowed most of the code from @David's post.)
One thing I didn't understand is why you load both the peak values and the position into a hash, as loading one would suffice. As each chromosome has more than one record, use a HoA (hash of arrays). My solution is based on that. You might need to change the cols and their positions.
use strict;
use warnings;
our $Sep = "\t";
open (my $peak_fh, "<", "data/file2");
my %chromosome_hash;
while (my $line = <$peak_fh>) {
chomp $line;
next if $line =~ /Chromosome/; #Skip Header
my ($chromosome) = (split($Sep, $line))[0];
push @{$chromosome_hash{$chromosome}}, $line; # Store the line(s) indexed by chromo
}
close $peak_fh;
open (my $position_fh, "<", "data/file1");
while (my $line = <$position_fh>) {
chomp $line;
my ($chromosome, $snp, $position) = split ($Sep, $line);
next unless exists $chromosome_hash{$chromosome};
foreach my $peak_line (@{$chromosome_hash{$chromosome}}) {
my ($start,$end) = (split($Sep, $peak_line))[1,2];
if ($position >= $start and $position <= $end) {
print "MATCH REQUIRED-DETAILS...$line-$peak_line\n";
}
else {
print "NO MATCH REQUIRED-DETAILS...$line-$peak_line\n";
}
}
}
close $position_fh;
I used @tuxuday and @David's code to solve this problem. Here is the final code that did what I wanted. I have not only learned a lot, but I have been able to solve my problem successfully! Kudos guys!
use strict;
use warnings;
use feature qw(say);
# Read in peaks and sample files from command line
my $usage = "Usage: $0 <peaks_file> <sample_file>";
my $peaks = shift @ARGV or die "$usage \n";
my $sample = shift @ARGV or die "$usage \n";
our $Sep = "\t";
open (my $peak_fh, "<", "$peaks");
my %chromosome_hash;
while (my $line = <$peak_fh>) {
chomp $line;
next if $line =~ /Chromosome/; #Skip Header
my ($chromosome) = (split($Sep, $line))[0];
push @{$chromosome_hash{$chromosome}}, $line; # Store the line(s) indexed by chromosome
}
close $peak_fh;
open (my $position_fh, "<", "$sample");
while (my $line = <$position_fh>) {
chomp $line;
next if $line =~ /Marker/; #Skip Header
my ($snp, $chromosome, $position) = split ($Sep, $line);
# Check if chromosome in peaks_file matches chromosome in sample_file
next unless exists $chromosome_hash{$chromosome};
foreach my $peak_line (@{$chromosome_hash{$chromosome}}) {
my ($start,$end,$peak_no) = (split( $Sep, $peak_line ))[1,2,3];
if ( $position >= $start and $position <= $end) {
# Print output
say join ("\t",
$snp,
$chromosome,
$position,
$start,
$end,
$peak_no,
);
}
else {
next; # Go to the next peak line
}
}
}
close $position_fh;
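For reference, with the command-line handling above, the final script would be run as (the script name is hypothetical; the file names are the ones used in the question):
perl snp_in_peaks.pl peaks.txt X81_05.txt > matches.txt
Each matching SNP is written as a tab-separated line of SNP, chromosome, position, peak start, peak end and peak number, so redirecting the output gives the tab-delimited file asked for in the question.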