I am looking to parse a tab delimited text file into a nested hash with a subroutine. Each file row will be keyed by a unique id from a uid column(s), with the header row as nested keys. Which column(s) is(are) to become the uid changes (as sometimes there isn't a unique column, so the uid has to be a combination of columns). My issue is with the $uid variable, which I pass as a non-interpolated string. When I try to use it inside the subroutine in an interpolated way, it will only give me the non-interpolated value:
use strict;
use warnings;
my $lofrow = tablehash($lof_file, '$row{gene}', "transcript", "ENST");
##sub to generate table hash from file w/ headers
##input values are file, uid, header starter, row starter, max column number
##returns hash reference (deref it)
sub tablehash {
    my ($file, $uid, $headstart, $rowstart, $colnum) = @_;
    if (!$colnum){ # takes care of an unknown number of columns
        $colnum = 0;
    }
    open(INA, $file) or die "failed to open $file, $!\n";
    my %table; # permanent hash table
    my %row; # hash of column values for each row
    my @names = (); # column headers
    my @values = (); # line/row values
    while (chomp(my $line = <INA>)){ # reading lines for lof info
        if ($line =~ /^$headstart/){
            @names = split(/\t/, $line, $colnum);
        } elsif ($line =~ /^$rowstart/){ # splitting lof info columns into variables
            @values = split(/\t/, $line, $colnum);
            @row{@names} = @values;
            print qq($uid\t$row{gene}\n); # problem: prints "$row{gene} ACB1"
            $table{"$uid"} = { %row }; # puts row hash into permanent hash, but with $row{gene} key)
        }
    }
    close INA;
    return \%table;
}
I am out of ideas. I could put $table{$row{$uid}} and simply pass "gene", but in a couple of instances I want to have a $uid of "$row{gene}|$row{rsid}" producing $table{ACB1|123456}
Interpolation is a feature of the Perl parser. When you write something like
"foo $bar baz"
, Perl compiles it into something like
'foo ' . $bar . ' baz'
It does not interpret data at runtime.
What you have is a string where one of the characters happens to be $ but that has no special effect.
There are at least two possible ways to do something like what you want. One of them is to use a function, not a string. (Which makes sense because interpolation really means concatenation at runtime, and the way to pass code around is to wrap it in a function.)
my $lofrow = tablehash($lof_file, sub { my ($row) = @_; $row->{gene} }, "transcript", "ENST");
sub tablehash {
    my ($file, $mkuid, $headstart, $rowstart, $colnum) = @_;
    ...
    my $uid = $mkuid->(\%row);
    $table{$uid} = { %row };
Here $mkuid isn't a string but a reference to a function that (given a hash reference) returns a uid string. tablehash calls it, passing a reference to %row to it. You can then later change it to e.g.
my $lofrow = tablehash($lof_file, sub { my ($row) = @_; "$row->{gene}|$row->{rsid}" }, "transcript", "ENST");
Another solution is to use what amounts to a template string:
my $lofrow = tablehash($lof_file, "gene|rsid", "transcript", "ENST");
sub tablehash {
    my ($file, $uid_template, $headstart, $rowstart, $colnum) = @_;
    ...
    (my $uid = $uid_template) =~ s/(\w+)/$row{$1}/g;
    $table{$uid} = { %row };
The s/// code goes through the template string and manually replaces every word by the corresponding value from %row.
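Both approaches can be tried in isolation. The following is a minimal sketch run against a single in-memory row instead of a file; the helper names uid_from_sub and uid_from_template and the sample values are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Sample row standing in for one parsed line of the file (made-up values).
my %row = (gene => 'ACB1', rsid => '123456', transcript => 'ENST0001');

# Approach 1: the caller passes a code reference that builds the uid.
sub uid_from_sub {
    my ($mkuid, $row) = @_;
    return $mkuid->($row);
}

# Approach 2: the caller passes a template; every word in it is replaced
# by the value of the column with that name.
# Assumes each word in the template is a column present in the row.
sub uid_from_template {
    my ($template, $row) = @_;
    (my $uid = $template) =~ s/(\w+)/$row->{$1}/g;
    return $uid;
}

my $uid_a = uid_from_sub(sub { my ($r) = @_; "$r->{gene}|$r->{rsid}" }, \%row);
my $uid_b = uid_from_template('gene|rsid', \%row);

print "$uid_a\n";   # ACB1|123456
print "$uid_b\n";   # ACB1|123456
```

The code reference is the more flexible of the two, since it can compute anything from the row; the template is shorter to write when the uid is only ever a combination of column values.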
Random notes:
Bonus points for using strict and warnings.
if (!$colnum) { $colnum = 0; } can be simplified to $colnum ||= 0;.
Use lexical variables instead of bareword filehandles. Barewords are effectively global variables (and syntactically awkward because they're not first-class citizens of the language).
Always use the 3-argument form of open to avoid unexpected interpretation of the second argument.
Include the name of your program in error messages (either explicitly with $0 or implicitly by omitting \n from die).
my @foo = (); my %bar = (); is redundant and can be simplified to my @foo; my %bar;. Arrays and hashes start out empty; overwriting them with an empty list is pointless.
chomp(my $line = <INA>) will throw a warning when you reach EOF (because you're trying to chomp a variable containing undef).
my %row; should probably be declared inside the loop. It looks like it's supposed to only contain values from the current line.
Suggestion:
open my $fh, '<', $file or die "$0: can't open $file: $!\n";
while (my $line = readline $fh) {
chomp $line;
...
}
I'm a novice in Perl and recently came across a Perl script which reads array data from __DATA__; (an inline data section at the bottom of the script) using @flatfiledata = <DATA>;
data looks as below inside the script at the bottom:
__DATA__;
Arrays
@cab_method::Concentration::Dilution
@cab04_cartypes::XXXX Dye::Gag XXXXXX::LuciXXXX::Firefly LuciXXXX::Renilla LucifXXXX
__END__;
I would like to keep this data inside a config file either csv or a tab delimited file
and then read this data into array.
below is the current subroutine which read this data from __DATA__ statement at the bottom.
# read data from __DATA__ section at end of program and put data into @flatfiledata array
@flatfiledata = <DATA>;
&ReadArraysAndHashes;
sub ReadArraysAndHashes {
    foreach $line (@flatfiledata) {
        chomp $line;
        # ok, some weird newline chars have ended up in data section, so...
        $line =~ s/\r|\n//g;
        # skip any lines that do not contain at least one '::'
        # (probably blank lines or comments)
        unless ( $line =~ /::/ ) { next }
        #split all the elements in the line into an array
        my @elements = split( /::/, $line );
        # The first element is the key;
        my $key = shift(@elements);
        # if the key starts with a '@', you have an array;
        # if it starts with a '%', it is a hash
        # either way, delete the symbol from the key
        $key =~ s/^(.)//;
        my $array_or_hash = $1;
        # create a hash of hashes
        if ( $array_or_hash eq '@' ) {
            $clrdata{array}{$key} = \@elements;
            @{$key} = @{ $clrdata{array}{$key} };
        }
        elsif ( $array_or_hash eq '%' ) {
            if ( $#elements % 2 != 1 ) {
                print "odd number of elements for $key\n";
            }
            my %hash = @elements;
            $clrdata{hash}{$key} = \%hash;
        }
    }
}
__DATA__;
Arrays
@cab_method::Concentration::Dilution
@cab04_cartypes::XXXX Dye::Gag XXXXXX::LuciXXXX::Firefly LuciXXXX::Renilla LucifXXXX
__END__;
Basic answer
DATA is just a filehandle, so it can easily be replaced with any other filehandle.
open my $config_fh, '<', 'some_config_file'
    or die "Can't open config file: $!\n";
@flatfiledata = <$config_fh>;
More tips
One thing that we've learned from many decades of computer programming is that it's a bad idea for a subroutine to access global variables. All data used by a subroutine should either be passed to the subroutine as parameters or created within the subroutine. Your ReadArraysAndHashes() subroutine uses the global @flatfiledata variable. You should either pass the data into the subroutine:
open my $config_fh, '<', 'some_config_file'
    or die "Can't open config file: $!\n";
@flatfiledata = <$config_fh>;
ReadArraysAndHashes(@flatfiledata);
# And in the subroutine
sub ReadArraysAndHashes {
    my @flatfiledata = @_;
    # Rest of the code is the same
    ...
}
Or you should read the data inside the subroutine:
ReadArraysAndHashes();
# And in the subroutine
sub ReadArraysAndHashes {
    # We don't need the @flatfiledata array
    # as we can read the filehandle directly
    open my $config_fh, '<', 'some_config_file'
        or die "Can't open config file: $!\n";
    foreach $line (<$config_fh>) {
        # Your existing code
        ...
    }
}
It also looks like your subroutine is writing to a global variable called %clrdata. That's also a bad idea. Consider returning that variable from your subroutine instead.
%clrdata = ReadArraysAndHashes();
# And in the subroutine
sub ReadArraysAndHashes {
# All of the code
...
return %clrdata;
}
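Putting these tips together, here is one possible sketch of a version that takes a filehandle as a parameter and returns its result instead of touching globals. The name read_arrays_and_hashes is hypothetical, the %units line in the sample data is made up to exercise the hash branch, and an in-memory filehandle stands in for the real config file. Unlike the original, it does not create a separate named array for each key.

```perl
use strict;
use warnings;

# Parse lines of the form "@name::a::b" (array) or "%name::k::v" (hash)
# from any filehandle and return the result as a hash reference.
sub read_arrays_and_hashes {
    my ($fh) = @_;
    my %clrdata;
    while (my $line = <$fh>) {
        $line =~ s/[\r\n]+//g;               # strip any line endings
        next unless $line =~ /::/;           # skip blank lines and comments
        my ($key, @elements) = split /::/, $line;
        next unless $key =~ s/^([\@%])//;    # leading symbol selects the type
        if ($1 eq '@') {
            $clrdata{array}{$key} = \@elements;
        }
        else {
            warn "odd number of elements for $key\n" if @elements % 2;
            $clrdata{hash}{$key} = { @elements };
        }
    }
    return \%clrdata;
}

# An in-memory filehandle stands in for the config file here.
my $config = "Arrays\n\@cab_method::Concentration::Dilution\n%units::conc::mg\n";
open my $config_fh, '<', \$config or die "Can't open in-memory file: $!\n";
my $clrdata = read_arrays_and_hashes($config_fh);
close $config_fh;

print join(',', @{ $clrdata->{array}{cab_method} }), "\n";   # Concentration,Dilution
```

Because the subroutine only sees a filehandle, the same code works for DATA, a real file, or a string.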
Extra reading
Two final tips to consider:
Take a look at restricting the scope of your variables using my.
Don't use ampersands to call subroutines.
I have written a script which collects marks of students and print the one who scored above 50.
Script is below:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @array = (
'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
');
print Dumper(\@array);
my $class = "3";
foreach my $each_value (@array) {
    print "EACH: $each_value\n";
    my ($name, $score ) = split (/,/, $each_value);
    if ($score lt 50) {
        next;
    } else {
        print "$name, \"GOOD SCORE\", $score, $class";
    }
}
Here I wanted to print data of STUDENT1, since his score is greater than 50.
So output should be:
STUDENT1, "GOOD SCORE", 90, 3
But it's printing output like this:
STUDENT1, "GOOD SCORE", 90
STUDENT2, 3
It seems the text between 90 and STUDENT2 is not being separated.
I know I am not splitting the data on the newline character, since @array holds a single element.
How can I split that element on newlines, so that inside the for loop I can split each line again on the comma (,) to get the values into $name and $score?
The @array actually comes in as an argument to this script, so I have to modify this script to parse the values correctly.
As you already know your "array" only has one "element" with a string with the actual records in it, so it essentially is more a scalar than an array.
And as you suspect, you can split this scalar just as you already did with the newline as a separator instead of a comma. You can then put a foreach around the result of split() to iterate over the records.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $records = 'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
';
my $class = "3";
foreach my $record (split("\n", $records)) {
my ($name, $score) = split(',', $record);
if ($score >= 50) {
print("$name, \"GOOD SCORE\", $score, $class\n");
}
}
As a small note, lt is a string comparison operator. The numeric comparisons use symbols, such as <.
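To make the difference concrete, here is a small sketch (the sample scores are made up) that runs both operators on the same values. lt compares character by character, so "100" sorts before "50", while < compares the numeric values.

```perl
use strict;
use warnings;

my (%by_string, %by_number);
for my $score (90, 100, 5) {
    $by_string{$score} = ($score lt 50) ? 'true' : 'false';   # string comparison
    $by_number{$score} = ($score <  50) ? 'true' : 'false';   # numeric comparison
    print "$score: lt 50 is $by_string{$score}, < 50 is $by_number{$score}\n";
}
# 90: lt 50 is false, < 50 is false
# 100: lt 50 is true, < 50 is false
# 5: lt 50 is true, < 50 is true
```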
Although you have an array, you only have a single string value in it:
my @array = (
'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
');
That's not a big deal. Dave Cross has already shown you how you can break that up into multiple values, but there's another way I like to handle multi-line strings. You can open a filehandle on a reference to the string, then read lines from the string as you would a file:
my $string = 'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
';
open my $string_fh, '<', \$string;
while( <$string_fh> ) {
chomp;
...
}
One of the things to consider while programming is how many times you are duplicating the data. If you have it in a big string and then split it into an array, you've now stored the data twice. That might be fine, and it's usually expedient. You can't always avoid it, but you should have some tools in your toolbox that let you avoid it.
And, here's a chance to use indented here docs:
use v5.26;
my $string = <<~"HERE";
STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
HERE
open my $string_fh, '<', \$string;
while( <$string_fh> ) {
chomp;
...
}
For your particular problem, I think you have a single string where the lines are separated by the '|' character. You don't show how you call this program or get the data, though.
You can choose any line ending you like by setting the value for the input record separator, $/. Set it to a pipe and this works:
use v5.10;
my $string = 'STUDENT1,90|STUDENT2,40|STUDENT3,30|STUDENT4,30';
{
local $/ = '|'; # input record separator
open my $string_fh, '<', \$string;
while( <$string_fh> ) {
chomp;
say "Got $_";
}
}
Now the structure of your program isn't too far away from taking the data from standard input or a file. That gives you a lot of flexibility.
The @array contains only one element. The for loop itself works correctly, so you can fix the problem without changing the loop body, just by replacing the array with this:
my @array = (
'STUDENT1,90',
'STUDENT2,40',
'STUDENT3,30',
'STUDENT4,30');
Otherwise you can iterate over the records by splitting the string on the newline character \n.
So let's say I have a file.txt whose syntax is like this:
"1;22;333;'4444';55555",
I now want my code to do the following:
open the file = already done
read line and save each Parameter separated by ; into a variable like ( $one = 1, $two = 22, $three = 333, $four = '4444', $five = 55555; )
this step would be writing the variables into a DB but thats done already
Loop until all lines of the file are done
So I actually need help with step 2; I think I am able to do the loop and the DB code. Do you have any ideas or tips on how I could do this? A beginner-friendly answer would be nice so I can learn from it.
foreach $file (@file){
$currentfile = "$currentdir\\$file";
open(my $reader, "<", $currentfile) or die "Failed to open file: $!\n";
?????
close $reader;
}
If you're just doing 'numbered fields' then you should be thinking 'array':
use Data::Dumper;
while ( <$reader> ) {
    chomp;
    my @row = split /;/;
    print Dumper \@row;
}
This will give you an array that you can access - e.g. $row[0] for the first element.
$VAR1 = [
'1',
'22',
'333',
'\'4444\'',
'55555'
];
If you know what the headers are 'named' and prefer to work on names you can do something similar with a hash:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @cols = qw ( id value fish name sprout );
while ( <DATA> ) {
    my %row;
    chomp;
    @row{@cols} = split /;/;
    print Dumper \%row;
}
__DATA__
1;22;333;'4444';55555
This gives instead:
$VAR1 = {
'fish' => '333',
'name' => '\'4444\'',
'id' => '1',
'value' => '22',
'sprout' => '55555'
};
Note - hashes are unordered, but their whole point is that you don't need to care about the 'order' - just print $row{name},"\n";
You need to read from the filehandle $reader, line by line. See the tutorial perlopentut and the full reference open.
Then you split each line by the separator ;, which returns a list that you assign to an array.
open my $reader, "<", $currentfile or die "Failed to open file: $!\n";
while (my $line = <$reader>) {
    chomp($line);
    my @params = split ';', $line;
    # do something with @params, it will be overwritten on next iteration
}
close $reader;
The diamond operator <> reads from a filehandle, <$fh>, returning a line at a time. See about it in perlop. When there are no more lines it returns undef and looping stops. You may assign the string that it returns to a variable which you declare (my $line), which then exists only within the body of the while loop. If you don't, but do while (<$fh>) instead, the line is assigned to the special variable $_, which is default for many things in Perl.
The chomp removes the linefeed (new line) from the end of the line.
Note that '4444' from your example is not a number and cannot be used as such.
Alternatively, you can take a reference to the array with parameters on each line, and put it in another array which thus will in the end contain all lines.
my @all_params;
while (my $line = <$reader>) {
    my @params = split ';', $line;
    push @all_params, \@params;
}
Now @all_params has elements which are references, each to an array with parameters for one line. For how to work with references see the tutorial perlreftut and the Cookbook on complex data structures, perldsc.
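As a brief illustration of working with such a structure, the sketch below builds it from two made-up lines held in an array (standing in for the file) and then reads values back out.

```perl
use strict;
use warnings;

# Sample lines standing in for the file contents (made-up data).
my @lines = ("1;22;333\n", "4;55;666\n");

my @all_params;
for my $line (@lines) {
    chomp(my $copy = $line);                  # work on a copy without the newline
    push @all_params, [ split /;/, $copy ];   # one array reference per line
}

# First field of the second line
print $all_params[1][0], "\n";   # 4

# Loop over every line and print its fields
for my $params (@all_params) {
    print join(', ', @$params), "\n";
}
```

$all_params[1][0] is shorthand for $all_params[1]->[0]; the arrow between adjacent subscripts is optional.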
The following is more complex but let me mention it since it's a bit of an idiom. You can do the above in one statement
my @all_params = map { [ split ';', $_ ] } <$reader>;
This uses map, which applies the code in { ... } to each element of the list that is submitted to it, returning a list. So it takes a list and returns the processed list. The [...] inside makes an anonymous array, equivalent to the reference we took of an array previously. The filehandle read <$reader> returns all lines of the file in one list when invoked in list context, which in this case is imposed by map (since it must receive a list).
An important one: always start your programs with
use warnings 'all';
use strict;
The order of these doesn't really matter. Mostly you'll see use strict; first.
Then your loop over filenames needs to be foreach my $file (@file) { ... } and you must declare all variables, so my $currentfile = ....
I am currently trying to implement a suffix tree using Perl, however, when I attempt to set the reference for the tree function, the reference location is not set, if I pass the address via a string then check the text in the string vs the location of the hash table and they are different. Any help is appreciated!
use strict;
use warnings;
use Data::Dumper;
my $count = 0;
my $str; # holds the complete string
my %root;
# takes in all lines of code
open(IN, '<:encoding(UTF-8)', $ARGV[0]) or die "Could not open file '$ARGV[0]' $!\n";
while (<IN>) {
chomp;
# concatinates with string
$str .= $_;
}
# closes input
close(IN);
#length of input string
my $l_size = length($str) - 1;
#recursively makes
sub tree {
#recursive root
my %treeRoot;
#checking incomming data
print "1 ".Dumper(\@_)."\n";
#checking incomming data
print "2 ".Dumper(\%root)."\n";
#attempts to set tree's refrence
\%treeRoot, $count = @_;
#checking incomming data
print "3 ".Dumper(\%root)."\n";
#checking incomming data
print "4 ".$count."\n";
#leaf for each node
my %leaf;
for (my $i = 0; $i < $l_size; $i++) {
#creates alphabet tree
$treeRoot { substr($str, $i, 1) } = %leaf;
}
#checking incomming data
print "5 ".Dumper(\%root)."\n";
while ($count > 0) {
#checking incomming data
print "loop 6 ".Dumper(\%root)."\n";
$count--;
#checking incomming data
print "loop 7 ".$count."\n";
#recursion not implamented yet
#tree(\$treeRoot{'a'}, $count);
}
}
tree(\%root, 2);
#print Dumper(\%root);
You need parentheses to disambiguate. This:
\%treeRoot, $count = @_;
means this:
\%treeRoot;
$count = @_;
This is because the assignment operator = has higher precedence than the comma operator ,. As a result, $count is assigned the number of elements in @_, and the reference is simply discarded. The warning that you got from running that code tells you this: Useless use of reference constructor in void context.
To pass the arguments correctly, you need parentheses:
(\%treeRoot, $count) = @_;
Unfortunately, this does not work, because you cannot assign to a reference this way. The following error tells you that: Can't modify reference constructor in list assignment.
So what you need is to pass the reference to a scalar:
my ($href, $count) = @_;
print $href->{'value'};
I think this method is a bit backwards, though. Passing variables by reference is likely to become a source of bugs. A more natural solution is to use the return value of the subroutine to assign values:
sub foo {
my %hash;
$hash{'value'} = ....
....
return \%hash;
}
my $hashref = foo();
print $hashref->{'value'};
Your question isn't actually how to pass a hash reference, but how to receive it, as the following will not work:
\%treeRoot, $count = @_;
Basically, you need to assign your reference to a scalar like so:
use strict;
use warnings;
sub example_sub {
    my ($hashref, $count) = @_;
    # Add two values to the hash:
    $hashref->{newkey} = 'val';
    $hashref->{newkey2} = 'val2';
}
my %root;
example_sub(\%root, 2);
use Data::Dump;
dd \%root;
Outputs:
{ newkey => "val", newkey2 => "val2" }
If you don't want to modify your original hash, you can assign the values to a new hash within the sub:
my %newhash = %$hashref;
For more info on working with references, check out: perlref - Perl references and nested data structures
I'm writing a Perl script that requires me to pull out a whole column from a file and manipulate it. For example take out column A and compare it to another column in another file
A B C
A B C
A B C
So far I have:
sub routine1
{
    ( $_ = <FILE> )
    {
        next if $. < 2; # to skip header of file
        my @array1 = split(/\t/, $_);
        my $file1 = $array1[@_];
        return $file1;
    }
}
I have most of it done. The only problem is that when I call to print the subroutine it only prints the first element in the array (i.e. it will only print one A).
I am sure that what you actually have is this
sub routine1
{
    while ( $_ = <FILE> )
    {
        next if $. < 2; # to skip header of file
        my @array1 = split(/\t/, $_);
        my $file1 = $array1[@_];
        return $file1;
    }
}
which does compile, and reads the file one line at a time in a loop.
There are two problems here. First of all, as soon as your loop has read the first line of the file (after the header) the return statement exits the subroutine, returning the only field it has read. That is why you get only a single value.
Secondly, you have indexed your @array1 with @_. What that does is take the number of elements in @_ (usually one) and use that to index @array1. You will therefore always get the second element of the array.
I'm not clear what you expect as a result, but you should write something like this. It accumulates all the values from the specified column into the array @retval, and passes the file handle into the subroutine instead of just using a global, which is poor programming practice.
use strict;
use warnings;
open my $fh, '<', 'myfile.txt' or die $!;
my @column2 = routine1($fh, 1);
print "@column2\n";
sub routine1 {
    my ($fh, $index) = @_;
    my @retval;
    while ($_ = <$fh>) {
        next if $. < 2; # to skip header of file
        my @fields = split /\t/;
        my $field = $fields[$index];
        push @retval, $field;
    }
    return @retval;
}
output
B B
Try replacing most of your sub with something like this:
my @aColumn = ();
while (<FILE>)
{
    chomp;
    my ($Acol, $Bcol, $Ccol) = split("\t");
    push(@aColumn, $Acol);
}
return @aColumn;
Jumping to the end, the following will pull out the first column in your file blah.txt and put it in an array for you to manipulate later:
use strict;
use warnings;
use autodie;
my $file = 'blah.txt';
open my $fh, '<', $file;
my @firstcol;
while (<$fh>) {
    chomp;
    my @cols = split;
    push @firstcol, $cols[0];
}
use Data::Dump;
dd \@firstcol;
What you have right now isn't actually looping on the contents of the file, so you aren't going to be building an array.
Here are a few items for you to consider when crafting a subroutine solution for obtaining an array of column values from a file:
Skip the file header before entering the while loop to avoid a line-number comparison for each file line.
split only the number of columns you need by using split's LIMIT. This can significantly speed up the process.
Optionally, initialize a local copy of Perl's @ARGV with the file name, and let Perl handle the file i/o.
Borodin's solution to create a subroutine that takes both the file name and the column number is excellent, so it's implemented below, too:
use strict;
use warnings;
my @colVals = getFileCol( 'File.txt', 0 );
print "@colVals\n";
sub getFileCol {
    local @ARGV = (shift);
    my ( $col, @arr ) = shift;
    <>; # skip file header
    while (<>) {
        my $val = ( split ' ', $_, $col + 2 )[$col] or next;
        push @arr, $val;
    }
    return @arr;
}
Output on your dataset:
A A
Hope this helps!