Perl Dynamically Generated Multidimensional Associative Array

This may be a simple oversight on my part (or something much more advanced than my skill set). I am trying to dynamically fill a 2d associative array by reading input from a file.
my @data;
while (<FILE>) {
chomp;
my $ID,$COUNT;
print "READ: " . $_ . "\n"; #Debug 1
($ID,$COUNT,undef,undef,undef) = split /\,/;
print "DATA: " . $ID . "," . $COUNT . "\n"; # Debug 2
$data{$ID}{"count"} = $COUNT;
#push @{$data{$ID}{"count"}}, $COUNT;
print $data{$ID}{"count"} . "\n"; # Debug 3
}
The first print (Debug 1) will print a line similar to des313,3,,,.
The second print (Debug 2) will print a line DATA: des313,3
The third print (Debug 3) will print a blank line.
The issue seems to be in the way I am trying to insert the data into the associative array. I have tried both the direct insert and the push method with no results. I have done this with PHP however I think I am overlooking this in Perl. I did look at the perldoc perldsc page in the section of HASHES of HASHES however I did not see it talk about dynamic generation of them. Any suggestions would be great!

Assigning to the hash the way you have should work fine. You are declaring your variables improperly. Your associative array is called a hash in Perl, and is prefixed with a % sigil, so you should write my %data before the while loop. Inside the loop, the my operator needs parens to apply to a list, so it should be my ($ID, $COUNT);.
This minimal example works properly:
use warnings; # place these lines at the top of all of your programs
use strict; # they will catch many errors for you
my %data; # hash variable
while (<DATA>) {
chomp;
my ($id, $count) = split /,/; # simplify the split
$data{$id}{count} = $count; # build your hash
}
print "got: $data{des313}{count}\n"; # prints "got: 3"
__DATA__
des313,3
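As a further sketch of dynamic generation, the same assignment style also covers the commented-out push attempt from the question: Perl creates the inner hash or array the first time it is used (autovivification). The sub name build_counts, the history key, and the input lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "id,count,..." lines into a hash of hashes.
sub build_counts {
    my @lines = @_;
    my %data;
    for my $line (@lines) {
        chomp $line;
        my ($id, $count) = split /,/, $line;
        next unless defined $count;              # skip lines without a count
        $data{$id}{count} = $count;              # inner hash is created on first use
        push @{ $data{$id}{history} }, $count;   # inner array is created on first use
    }
    return \%data;
}

my $data = build_counts("des313,3,,,\n", "des313,5,,,\n", "abc001,7,,,\n");
print "count:   $data->{des313}{count}\n";
print "history: @{ $data->{des313}{history} }\n";
```

With this input, the count key holds the last value seen for an ID, while history keeps every value in order.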

Related

How to remove array's newlines and add an element at the beginning of it in Perl?

First off, I have to apologize for editing my initial post. After I added my code, the question became fuzzy.
So, I have an array (@start_cod) containing lines separated by \n, as follows:
print @start_cod;
tatatattataattatatttat
cacacacaacaccacaac
aaaaaaaaaaaaaaa
I need to remove the newlines and add ">text" ONLY at the beginning of the array as follows:
>text
tatatattataattatatttatcacacacaacaccacaacaaaaaaaaaaaaaaa
I tried:
s/\s+\z// for @start_cod;
print ">text@start_cod";
I tried also with chomp
chomp @start_cod;
print ">text@start_cod";
and
my @start_cod = split("\n",$start_cod);
$start_cod = join("",@start_cod);
print ">text$start_cod";
but I get
aaaaaaaaaaaaaaaaaaa>textcacacacacaacaccacaac>textaattatatattataattatatttat
Any suggestions on how to handle this in Perl Programming?
Here is my code which works 100%.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my %alliloux =();
$/="\n>";
while (<>) {
s/>//g;
my ($onoma, @seq) = split (/\n/, $_);
my ($sp, $head) = split (/\./, $onoma);
push @{ $alliloux{$sp} }, join "\n", ">$onoma", @seq;
}
foreach my $sp (keys %alliloux) {
chomp $sp;
my ($head, $dna) = split(/\t/, $sp);
my @start_cod = substr($dna, 3);
say @start_cod;
Input file:
>name aaaaaaaaaaaaaaaaaa
>name2 acacacacacaacaccacaac
>namex aattatatattataattatatttat
output after Perl run
tatatattataattatatttat
cacacacaacaccacaac
aaaaaaaaaaaaaaa
Desired output:
>text
tatatattataattatatttatcacacacaacaccacaacaaaaaaaaaaaaaaa

If I understand your question correctly, this should do what you want:
use strict;
use warnings;
my @start_cod = (
'aaaaaaaaaaaaaaaaaa',
'acacacacacaacaccacaac',
'aattatatattataattatatttat',
);
print ">text\n", @start_cod, "\n";
The print first prints ">text" and a newline once, then you get the @start_cod items on a line, and the last "\n" makes sure you have a newline after the last element.
Output:
>text
aaaaaaaaaaaaaaaaaaacacacacacaacaccacaacaattatatattataattatatttat
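If the array elements still carry their trailing newlines, as in the question, they need a chomp before being joined. A minimal sketch, assuming a made-up helper named fasta_record and shortened sample sequences:

```perl
use strict;
use warnings;

# Hypothetical helper: one header line followed by all parts on a single line.
sub fasta_record {
    my ($header, @parts) = @_;
    chomp @parts;                                   # drop the newline on every element
    return ">$header\n" . join('', @parts) . "\n";  # join with no separator
}

print fasta_record('text', "tatatat\n", "cacaca\n", "aaaa\n");
```

Because @parts is a copy of the arguments, the caller's array is left untouched.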

You might want to see Read FASTA into Hash. It's the same problem and very close to the code I wrote before I read it. Also, there are modules on CPAN that can handle FASTA.
I think you want to combine the sequences that start with the same name, disregarding the numbers. The sequences shouldn't have interior whitespace. In your code, you are constantly adding whitespace. You even join on a newline. So, you go to the doctor and say "My arm hurts when I do this", and the doctor says "So don't do that". :)
When you run into this sort of problem, check the results of your operations at each step to see if you get what you expect. Here's a much simplified version of a program that I think does what you want. I've removed most of the data structure because it was complicating your process.
In short, read a line and remove the newline at the end. That's one source of your newlines. Then, extract the sequence and concatenate that to the previous sequence. When you join with newlines, you are adding newlines. So, don't do that:
use v5.14;
use warnings;
use Data::Dumper;
my %alliloux = ();
while (<DATA>) {
chomp; # get rid of that newline!
s/>//g;
# now split on whitespace, but only up to two parts.
# There's no array here.
my( $name, $seq ) = split /\s+/, $_, 2;
# remove the numbers at the end to get the prefix of the
# name.
my $prefix = $name =~ s/\d+\z//r;
# append the current sequence for this prefix to what we
# have already seen.
$alliloux{$prefix} .= $seq;
}
say Dumper( \%alliloux );
foreach my $base ( keys %alliloux ) {
say ">text $alliloux{$base}";
}
__DATA__
>name aaa
>name2 cccc
>name99 aattaatt
You don't need the intermediate array. You can build up your string as you go. You don't need to have all the parts before you do that.
Now, to figure out where you might be going wrong, do a little at once. Ensure that you've extracted the right thing. It's handy to put characters around the variables you interpolate so you can see whitespace at the beginning or end:
while (<DATA>) {
chomp; # get rid of that newline!
s/>//g;
my( $name, $seq ) = split /\s+/, $_, 2;
say "Name: <$name>";
say "Seq: <$seq>"
}
Then, add another step, and ensure that works:
while (<DATA>) {
chomp; # get rid of that newline!
s/>//g;
my( $name, $seq ) = split /\s+/, $_, 2;
say "Name: <$name>";
say "Seq: <$seq>";
my $prefix = $name =~ s/\d+\z//r;
say "Prefix: <$prefix>";
}
Repeat this process for each step. Then, when you come with a question, you've pinpointed the point where things diverge. Here's the same technique in your program:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
while (<DATA>) {
s/>//g;
my ($onoma, @seq) = split (/\n/, $_);
say "Onoma: <$onoma>";
}
__DATA__
>name aaa
>name2 cccc
>name99 aattaatt
The output shows that you never had anything in @seq. You are splitting on a newline, but unless you've changed the default line ending, you'll only get a newline at the end:
Onoma: <name aaa>
Onoma: <name2 cccc>
Onoma: <name99 aattaatt>
Now there's nothing in @seq, so a line like join "\n", ">$onoma", @seq; is really just join "\n", ">$onoma". You could have seen that with a little checking.
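The remark about changing the default line ending refers to the input record separator $/. A small sketch of its effect, assuming a made-up helper named records and an in-memory filehandle so that no real file is needed:

```perl
use strict;
use warnings;

# Hypothetical helper: read $text as a file, using $sep as the record separator.
sub records {
    my ($text, $sep) = @_;
    local $/ = $sep;                        # change the separator only inside this sub
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @records = <$fh>;
    chomp @records;                         # chomp removes a trailing copy of $/
    close $fh;
    return @records;
}

my $text = ">name\naaa\nccc\n>name2\nggg\n";
for my $rec (records($text, "\n>")) {
    my ($onoma, @seq) = split /\n/, $rec;   # now there are interior newlines to split on
    $onoma =~ s/^>//;
    print "Onoma: <$onoma> Seq: <@seq>\n";
}
```

With the separator set to "\n>", each record spans a header and all of its sequence lines, so the split on a newline has something to put in @seq.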

The description lacks clarity of the problem.
Looking at the desired output, the following code comes to mind. Please see if it does what you were looking for.
Even looking at your code it is not clear what you try to do -- some part of the code does not make much sense.
use strict;
use warnings;
use feature 'say';
my @start_cod;
while( <DATA> ) {
chomp;
next unless />\s?name.?\s+(.*)/;
push @start_cod, $1;
}
print ">text\n " . join('',@start_cod);
__DATA__
>name aaaaaaaaaaaaaaaaaa
>name2 acacacacacaacaccacaac
.
.
.
> namex aattatatattataattatatttat

Perl: reading file into a hash and splitting, retrieving information

I have a file which has data like this:
1 unknown state 3204563 3207049 . - . name "gosford"; school_name "gosford"; pupil_id "P15240"; transcript_id "NM_001011874.1"; tss_id "TSS13146";
I want to read it line by line into a hash, and then split it with regular expressions so that I can count the number of schools.
so far i have:
my$schools;
open (SCHOOLS, <"$schools) or die (Cannot open $schools");
while <SCHOOLS> {
chomp;
my ($val, $key) = split /(^\d)\s+\w+\s+\W+\s+\d+\s+\d+\s+\d+\.\s+\+\s+\.\s+.. and so on);
}
How do I get the values I've split into the hash, and then manipulate them so produce some basic statistics?

It's a bit unclear what you're after, but I will offer - you are doing things the hard way using a long regex to match the line. Also, for 'other things' it's quite hard to tell exactly what you have in mind. But grep is your friend, as it lets you specify search terms.
Something like this will do the trick. I've used a simplistic example for counting entries matching a particular criterion. Of course, given you've only given us one row, this is a bit of a guess:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my @entries;
my @keys = qw ( id thing state firstnum secondnum );
while ( <DATA> ) {
my %attributes = m/(\w+) "(\w+)"/g;
@attributes{@keys} = split;
push @entries, \%attributes;
}
print Dumper \@entries;
print "count of things: ", scalar @entries, "\n";
print "There are ", (scalar grep { $_ -> {state} eq "state" } @entries), " things with a state of 'state'\n";
__DATA__
1 unknown state 3204563 3207049 . - . name "gosford"; school_name "gosford"; pupil_id "P15240"; transcript_id "NM_001011874.1"; tss_id "TSS13146";
I'll also point out - it's much better form to use lexical filehandles with 3 arg open. E.g.
open ( my $schools, '<', 'schools.txt' ) or die $!;
while ( <$schools> ) {
#etc.
}
I'm using the special DATA filehandle (which reads the text after __DATA__) for illustrative purposes.
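For the stated goal of counting schools, a hash keyed on the school name is probably enough. The following is only a sketch: the sub name count_schools is made up, and every input line except the first is invented sample data in the same layout as the question's single row.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each school_name value occurs.
sub count_schools {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        # capture the quoted value after school_name; skip lines without one
        my ($school) = $line =~ /\bschool_name "([^"]+)"/ or next;
        $count{$school}++;
    }
    return \%count;
}

my $count = count_schools(
    '1 unknown state 3204563 3207049 . - . name "gosford"; school_name "gosford"; pupil_id "P15240";',
    '1 unknown state 3204600 3207100 . - . name "gosford"; school_name "gosford"; pupil_id "P15241";',
    '2 unknown state 4200000 4207000 . + . name "terrigal"; school_name "terrigal"; pupil_id "P20001";',
);
print "$_: $count->{$_}\n" for sort keys %$count;
print "distinct schools: ", scalar(keys %$count), "\n";
```

The number of keys gives the number of distinct schools, and each value gives the number of rows for that school.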

In a file/array, search for hash key, and replace it with the hash value, do this for all hash keys/values

I've searched around the site and surprisingly I can't seem to find something that will work for my particular problem. So I figured I'd post it and see how some of you more experienced programmers can address this problem.
I have a spreadsheet like text file (many lines with tab delimited columns), that I would like to search through for certain labels (ex scaffold1253.1_size81005.6.32799_7496) and replace them with more simplified labels (ex scaffold1253.1a). These labels are only in the first column of the text file. I've already written the script such that I have a hash with the old labels as keys corresponding to the new labels as their respective values. This hash has about 26000 lines. So essentially I'd like to take the hash keys 1 by 1, search for them in the text file, and replace them with their respective hash values.
I have a pretty good server available, so if it's too complicated to make it first-column specific to speed up the process, then that's OK.
This is what I have so far:
use warnings;
$gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open(FASTAFILE2, $gtf);
@gtfarray = <FASTAFILE2>;
#print @gtfarray;
my %hash;
while (<>)
{
chomp;
my ($key, $val) = split /\t/;
$hash{$key} .= exists $hash{$key} ? ",$val" : $val;
}
#print %hash;
while (my ($find, $replace) = each %hash) {
foreach (@gtfarray){
$_ =~ s/$find/$replace/g;
push @newgtf, $_;
}
}
print @newgtf;
This code doesn't seem to work as it doesn't complete. I'm pretty sure it's a problem with the foreach loop structure. Sorry I don't know of any other way to do this. Does anyone have a better way to run through this file and conduct the replacement?
Any input would be greatly appreciated!
Thanks,
Andrew
@DVK
Here is the full script with your mods that runs into syntax errors with your while loop, any idea why it's not accepting it? Thanks again!
use warnings;
$gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open(FASTAFILE2, $gtf);
my %hash;
while (<>){
chomp;
my ($key, $val) = split /\t/;
$hash{$key} .= exists $hash{$key} ? ",$val" : $val;
}
while $line (<FASTAFILE2>){
my @fields = split(/\t/, $line);
# If you only care about first column, don't need the foreach loop below;
# just do the loop insides on $fields[0]
foreach my $field (@fields) {
$field = $hash{$field} if exists $hash{$field};
print $outfile "$field\t"; # Small bug - will print trailing \t
}
print $outfile "\n"
}
__END__
Here is the syntax error:
perl gtf_mod2.pl <./Hc_genome/header_file.txt
syntax error at gtf_mod2.pl line 14, near "while $line "
syntax error at gtf_mod2.pl line 23, near "}"
Execution of gtf_mod2.pl aborted due to compilation errors.

You exhaust your file the first time through your loop using the initial $find and $replace key/value pair.
There are two potential solutions:
Open the file for reading during each iteration of your while loop (expensive)
Move the foreach loop to the outside of the while and iterate the hash each time (less expensive)
example:
REPLACE:
for my $line (@gtfarray) {
while(my ($find, $replace) = each %hash) {
if($line =~ s/$find/$replace/g) {
push @newgtf, $line;
next REPLACE; # skip to next iteration
}
}
# if there was no replacement, push the old line
push @newgtf, $line;
}

How big is the file that you are replacing the first column in?
If it's >50,000 lines, you are better off doing the reverse:
Iterate through hash file once, and store that hash in memory
Iterate through main file once, and for every line, for every column, find that value in the memorized hash, replace with hash value if found, and write.
In other words, remove the first @gtfarray = <FASTAFILE2>; and replace your last while loop with:
# $outfile must be a filehandle that is already open for writing
while (my $line = <FASTAFILE2>) {
chomp $line; # otherwise the last field keeps its newline
my @fields = split(/\t/, $line);
# If you only care about first column, don't need the foreach loop below;
# just do the loop insides on $fields[0]
foreach my $field (@fields) {
$field = $hash{$field} if exists $hash{$field};
print $outfile "$field\t"; # Small bug - will print trailing \t
}
print $outfile "\n";
}
NOTE: I'm making an assumption that the fields contain FULL contents of your hash keys (e.g. your data file would contain a field with "scaffold1253.1_size81005.6.32799_7496" but NOT a field with "XYZscaffold1253.1_size81005.6.32799_7496___IOU").
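Under that assumption, and if only the first column matters, the per-line work can be sketched as a single hash lookup. The sub name relabel_first_column and the sample line are made up; the label mapping is the one from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the first tab-separated field if it is a known label.
sub relabel_first_column {
    my ($line, $map) = @_;
    chomp $line;
    return "\n" unless length $line;                   # keep empty lines as they are
    my ($first, $rest) = split /\t/, $line, 2;         # limit 2: leave other columns alone
    $first = $map->{$first} if exists $map->{$first};  # exact match, no regex needed
    return defined $rest ? "$first\t$rest\n" : "$first\n";
}

my %map = ('scaffold1253.1_size81005.6.32799_7496' => 'scaffold1253.1a');
print relabel_first_column("scaffold1253.1_size81005.6.32799_7496\tAUGUSTUS\tgene\n", \%map);
```

Splitting with a limit of 2 avoids both the inner loop and the trailing tab, and costs one hash lookup per line regardless of how many labels the hash holds.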
If that assumption is wrong and you really DO need to run a regex because your scaffold strings may be contained in longer strings, there may still be a better solution aside from running O(N*M) regexes: if your scaffold strings are all of a certain well defined format (e.g. "scaffoldNNNNN.NNN_sizeNNNNN.NNN.NNNN_NNNN"), what you need to do then is:
For each line of data file, run a single regex finding that pattern, with the entire pattern inside a capture group parenthesis:
@matches = ($line =~ m/(scaffold\d+\.\d+_size\d+\.\d+\.\d+_\d+)/g );
Then, look up every value of @matches array in the hash. If found, run ONLY the matches as a s/// regex.
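The match-then-replace steps can also be folded into one substitution that uses the /e modifier to evaluate the replacement as code. This is a sketch rather than the answer's exact method; the sub name relabel and the sample lines are made up, and the pattern assumes the label format shown in the question.

```perl
use strict;
use warnings;

# Label format assumed from the question's example label.
my $LABEL = qr/scaffold\d+\.\d+_size\d+\.\d+\.\d+_\d+/;

# Hypothetical helper: replace every known label in a line, leave unknown ones alone.
sub relabel {
    my ($line, $map) = @_;
    # /e evaluates the replacement as Perl code, /g handles every label in the line
    $line =~ s/($LABEL)/exists $map->{$1} ? $map->{$1} : $1/ge;
    return $line;
}

my %map = ('scaffold1253.1_size81005.6.32799_7496' => 'scaffold1253.1a');
print relabel("XYZscaffold1253.1_size81005.6.32799_7496___IOU\tgene\n", \%map);
print relabel("scaffold9.1_size10.2.30_4\tgene\n", \%map);   # not in the hash: unchanged
```

Each line is scanned once, so the cost no longer grows with the number of keys in the hash.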

Looking at your previous post, wouldn't it be simpler to create the shortened 'id' while reading the file? Then you would have no need of the other file where you get your hash.
Here is the (untested) code below. (would need to direct the print statements to an output file on the command line or open a file for writing in your script).
#!/usr/bin/perl
use strict;
use warnings;
my $gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open my $FASTAFILE2, "<", $gtf or die "Unable to open '$gtf' for reading. $!";
my %seen;
while (<$FASTAFILE2>) {
chomp;
my ($id, $val) = split /\t/, $_, 2;
# copy $id to $prefix and
# remove everything after '.1' in $prefix
(my $prefix = $id) =~ s/\.1\K.*//;
if ($seen{$id}) {
++$seen{$id};
}
else {
$seen{$id} = 'a';
}
print "$prefix$seen{$id}\t$val\n";
}
close $FASTAFILE2 or die "Unable to close '$gtf' from reading. $!";
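The ++ on a value of 'a' relies on Perl's string increment, which steps 'a' to 'b' and wraps 'z' to 'aa'. A small sketch of that behaviour, assuming a made-up helper named next_label; unlike the code above, this variant keys the counter on the prefix rather than the full id.

```perl
use strict;
use warnings;

# Hypothetical helper: hand out prefix + 'a', prefix + 'b', ... for each prefix.
sub next_label {
    my ($seen, $prefix) = @_;
    if (defined $seen->{$prefix}) {
        $seen->{$prefix}++;          # string increment: 'a' -> 'b', 'z' -> 'aa'
    }
    else {
        $seen->{$prefix} = 'a';      # first time this prefix is seen
    }
    return $prefix . $seen->{$prefix};
}

my %seen;
print next_label(\%seen, 'scaffold1253.1'), "\n" for 1 .. 3;
```

The string increment only applies to values that have been used purely as strings and that consist of letters optionally followed by digits.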

Could it be a job for Tie::File? Assuming, that is, the data file could be operated on as an array.
use Tie::File;
my $file = "./Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf";
tie @lines, 'Tie::File', $file or die ;
for (@lines) {
s/Oldlabel/NewLabel/g; # Change this to fit
}
untie @lines ;
Tie::File does a bunch of tricks to keep the "in-place" changes to the file memory efficient.

Read from input and store comma separated values in Hash

I have a Perl question like this:
Write a Perl program that will read a series of last names and phone numbers from the given input. The names and numbers should be separated by a comma. Then print the names and numbers alphabetically according to last name. Use hashes.
Any idea how to solve this?

There's more than one way to do it :)
my %phonebook;
while(<>) {
chomp;
my ($name, $phone) = split /,/;
$phonebook{$name} = $phone;
}
print "$_ => $phonebook{$_}\n" for sort keys %phonebook;

Something like the following perhaps.
my %hash;
foreach(<>){ #reads your args from commandline or input-file
chomp; #remove the trailing newline
my @arr = split(/\,/); #split at comma, every line
$hash{$arr[0]} = $arr[1]; #assign to hash
}
#print hash here
foreach my $key (sort keys %hash ) #sort and iterate
{
print "Name: " . $key . " Number: " . $hash{$key} . "\n";
}

Tasks like this are the strength of perl's command line switches. See perldoc perlrun for more information!
Command line input
$ perl -naF',\s*' -lE'$d{$F[0]}=$F[1];END{say"$_: $d{$_}"for sort keys%d}'
Moe, 12345
Pi, 31416
Homer, 54321
Output
Homer: 54321
Moe: 12345
Pi: 31416

Assuming that we split on commas (you should use Text::CSV generally), we can actually create this hash with a simple application of the map function and the diamond operator (<>).
#!/usr/bin/env perl
use strict;
use warnings;
my %phonebook = map { chomp; split /,/ } <>;
use Data::Dumper;
print Dumper \%phonebook;
The last two lines are just to visualize the result, and the upper three should be in all scripts. The meat of the work is done all in the one line.
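One caveat with the map form is that a line without a comma contributes a single element, which shifts every later key and value. A more defensive sketch, assuming a made-up helper named parse_phonebook and sample input borrowed from the one-liner answer plus one invented bad line:

```perl
use strict;
use warnings;

# Hypothetical helper: build name => number pairs, ignoring malformed lines.
sub parse_phonebook {
    my %phonebook;
    for my $line (@_) {
        chomp(my $copy = $line);                         # work on a copy of the argument
        my ($name, $phone) = split /\s*,\s*/, $copy, 2;  # tolerate spaces around the comma
        next unless defined $phone && length $name;      # skip lines without a comma or name
        $phonebook{$name} = $phone;
    }
    return %phonebook;
}

my %phonebook = parse_phonebook("Moe, 12345\n", "Pi, 31416\n", "no comma here\n", "Homer, 54321\n");
print "$_: $phonebook{$_}\n" for sort keys %phonebook;
```

For real CSV input with quoting or embedded commas, Text::CSV remains the safer choice, as noted above.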

Perl- Extract each line from a txt file and store into different variables

I read in a txt file using a Perl script, but I'm wondering how to store each line from the txt file in a different variable using pattern matching. I can match a line using =~ /^>gi/, but it displays both lines from the txt file with >gi (i.e. lines 1 and 3). I also want to read the two separate DNA sequences into different variables. Consider my example below.
file.txt
>gi102939
GATCTATC
>gi123453
CATCGACA
the perl script:
#!/usr/local/bin/perl
open (MYFILE, 'file.txt');
@array = <MYFILE>;
($first, $second, $third, $fourth, $fifth) = @array;
chomp $first, $second, $third, $fourth, $fifth;
print "Contents:\n @array";
if (@array =~ /^>gi/)
{
print "$first";
}
close (MYFILE);

Assuming that >gi.. are unique in the input, populate a hash where each key is associated with a sequence:
#!/usr/bin/perl
use warnings;
use strict;
my %hash;
my $last;
while (<DATA>) {
chomp;
if (/^>gi/) {
$last = $_;
} else {
$hash{$last} = $_;
}
}
foreach my $k (keys %hash) {
print "$k => $hash{$k}\n";
}
__DATA__
>gi102939
GATCTATC
>gi123453
CATCGACA
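The loop above keeps one line per header. If a sequence may span several lines, appending with .= instead of assigning is one way to handle it. A sketch under that assumption, with a made-up helper named read_fasta and an invented second sequence line for the first record:

```perl
use strict;
use warnings;

# Hypothetical helper: map each ">id" header to its (possibly multi-line) sequence.
sub read_fasta {
    my %seq;
    my $id;
    for my $line (@_) {
        chomp(my $text = $line);
        if ($text =~ /^>(\S+)/) {
            $id = $1;                 # start a new record
            $seq{$id} = '';
        }
        elsif (defined $id) {
            $seq{$id} .= $text;       # append, so later lines do not overwrite earlier ones
        }
    }
    return \%seq;
}

my $seq = read_fasta(">gi102939\n", "GATCTATC\n", "GGTT\n", ">gi123453\n", "CATCGACA\n");
print "$_ => $seq->{$_}\n" for sort keys %$seq;
```

Lines that appear before the first header are ignored rather than stored under an undefined key.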

Please always use strict and use warnings at the top of your program, and declare your variables using my at their first point of use. This applies especially when you are asking for help, as doing so can frequently reveal simple problems that could otherwise be overlooked.
As it stands, your program will read the file into @array and print it out. The test if (@array =~ /^>gi/) { ... } will force scalar context on the array, and so compare the number of elements in the array (4 for the sample file) with the regex pattern and fail.
What exactly are you trying to achieve? Reading a file into an array already puts each line into a different scalar variable, the variables being the elements of the array.
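To test the elements rather than the element count, the pattern has to be applied to each element, for example with grep. A minimal sketch, assuming a made-up helper named header_lines and the sample data from the question:

```perl
use strict;
use warnings;

# Hypothetical helper: return only the lines that start with ">gi".
sub header_lines {
    my @lines = @_;
    chomp @lines;
    return grep { /^>gi/ } @lines;    # the pattern is tested against each element
}

my @array   = (">gi102939\n", "GATCTATC\n", ">gi123453\n", "CATCGACA\n");
my $count   = @array;                 # scalar context: the number of elements
my @headers = header_lines(@array);
print "elements: $count\n";
print "header:   $_\n" for @headers;
```

Individual lines are then available as $array[0], $array[1], and so on, without copying them into separately named variables.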

This one-liner reads the database and extracts one element:
perl < file.txt -e '@array=<>;chomp @array;%hash=@array;print $hash{">gi102939"}'
result:
GATCTATC