Print keys and values from hash in a proper format? - perl

Hello I am trying to print the values and keys of a hash with one key/value per row like this:
key:value
This is the code I am using to print my hash:
foreach (sort keys %hash) { print "$_:$hash{$_}\n"; }
And this is the output I get:
key
:value
Why is my script printing the value on a new row and what can I do to fix it?

The cursor is moving to the next line because your key contains a line feed. The solution is to remove the line feed from the key.
More specifically, you surely want to avoid creating a key with a line feed in the first place, so it should be removed from the key before you create the hash element.
You're presumably reading the key from a file handle. It's customary to use chomp (to remove any trailing line feed) or s/\s+/z// (to remove any trailing whitespace including line feeds).
my #keys;
while (<>) {
chomp; # Or: s/\s+\z//;
push #keys, $_;
}
my %hash; #hash{#keys} = #values;

Try this version of printing loop:
foreach (sort keys %hash) {
my $v = $hash{$_};
s/\s+$//;
print "$_:$v\n";
}
Keys in %hash definitely have some unwanted trailing characters, so it is better to filter them out when %hash is filled. For example instead of this:
#hash{#keys} = #vals;
Write this:
#hash{map { s/\s+$//; $_ } #keys} = #vals;
Or this:
chmop(#keys);
#hash{#keys} = #vals;
But chomp will not help with multiple characters.

Related

find duplicate filenames and append them to hash of arrays

Perl question: I have a colon separated file containing paths that I'm using. I just split using a regex, like this:
my %unique_import_hash;
while (my $line = <$log_fh>) {
my ($log_type, $log_import_filename, $log_object_filename)
= split /:/, line;
$log_type =~ s/^\s+|\s+$//g; # trim whitespace
$log_import_filename =~ s/^\s+|\s+$//g; # trim whitespace
$log_object_filename =~ s/^\s+|\s+$//g; # trim whitespace
}
The exact file format is:
type : source-filename : import-filename
What I want is an index file that contains the last pushed $log_object_filename for each unique key $log_import_filename, so, what I'm going to do in English/Perl pseudo-code is push the $log_object_filename onto an array indexed by the hash %unique_import_hash. Then, I want to iterate over the keys and pop the array referred by %unique_import_hash and store it in an array of scalars.
My specific question is: what is the syntax for appending to an array that is the value of a hash?
You can use push, but you have to dereference the array referenced by the hash value:
push #{ $hash{$key} }, $filename;
See perlref for details.
If you only care about the last value for each key, you're over-thinking the problem. No need to fool around with arrays when a simple assignment will overwrite the previous value:
while (my $line = <$log_fh>) {
# ...
$unique_import_hash{$log_import_filename} = $log_object_filename;
}
use strict;
use warnings;
my %unique_import_hash;
my $log_filename = "file.log";
open(my $log_fh, "<" . $log_filename);
while (my $line = <$log_fh>) {
$line =~ s/ *: */:/g;
(my $log_type, my $log_import_filename, my $log_object_filename) = split /:/, $line;
push (#{$unique_import_hash{$log_import_filename}}, $log_object_filename);
}
Seek the wisdom of the Perl monks.

Perl split function basics reading each word from an input file

I'm having trouble understanding why this code will not output anything:
#!/usr/bin/perl -w
use strict;
my %allwords = (); #Create an empty hash list.
my $running_total = 0;
while (<>) {
print "In the loop 1";
chomp;
print "Got here";
my #words = split(/\W+/,$_);
}
foreach my $val (my #words) {
print "$val\n";
}
And I run it from the terminal using the command:
perl wordfinder.pl < exampletext.txt
I would expect the code above to output each word from the input file, but it does not output anything other than "In the loop 1" and "Got here". I'm trying to separate the input file word by word, using the split parameter I specified.
Update 1: Here, I have declared the variables within their proper scope, which was my main issue. Now I am getting all of the words from the input file to output on the terminal:
my %allwords = (); #Create an empty hash list.
my $running_total = 0;
my #words = ();
my $val;
while (<>) {
print "Inputting words into an array! \n";
chomp;
#words = split(/\W+/,$_);
}
print("Words have been input successfully, performing analysis: \n");
foreach $val (#words) {
print "$val\n";
}
UPDATE 2: Progress has been made. Now, we put all words from any input files into a hash, and then print each unique key (i.e. each unique word found across all input files) from the hash.
#!/usr/bin/perl -w
use strict;
# Description: We want to take ALL text files from the command line input and calculate
# the frequencies of the words contained therein.
# Step 1: Loop over all words in all input files, and put each new unique word in a
# hash (check to see if contained in hash, if not, put the word in; if the word already
# exists in the hash, then increase its "total" by 1). Also, keep a running total of
# all words.
print("Welcome to word frequency finder. \n");
my $running_total = 0;
my %words;
my $val;
while (<>) {
chomp;
foreach my $str (split(/\W+/,$_)) {
$words{$str}++;
$running_total++;
}
}
print("Words have been input successfully, performing analysis: \n");
# Step 2: Loop over all entries in the hash and look for the word (key) with the
# maximum amount, and then remove this from the hash and put in a separate list.
# Do this until the size of the separate list is 10, since we want the top 10 words.
foreach $val (keys %words) {
print "$val\n";
}
Since you've already completed step 1, you're left with getting your top ten most common words. Rather than looping through the hash and finding the most frequent entry, let's let Perl do the work for us by sorting the hash by its values.
To sort the %words hash by its keys, we can use the expression sort keys %words; to sort a hash by its values, but be able to access its keys, we need a more complex expression:
sort { $words{$a} <=> $words{$a} } keys %words
Breaking it down, to sort numerically, we use the expression
sort { $a <=> $b } #array
(see [perl sort][1] for more on the special variables $a and $b used in sorting)
sort { $a <=> $b } keys %words
would sort on the hash keys, so to sort on the values, we do
sort { $words{$a} <=> $words{$b} } keys %words
Note that the output is still the keys of the hash %words.
We actually want to sort from high to low, so swap $a and $b over to reverse the sort direction:
sort { $words{$b} <=> $words{$a} } keys %words
Since we're compiling a top ten list, we only want the first ten from our hash. It's possible to do this by taking a slice of the hash, but the easiest way is just to use an accumulator to keep count of how many entries we have in the top ten:
my %top_ten;
my $i = 0;
for (sort { $words{$b} <=> $words{$a} } keys %words) {
# $_ is the current hash key
$top_ten{$_} = $words{$_};
$i++;
last if $i == 10;
}
And we're done!

How to get a subset of keys-value from a file with universe of key value pairs

I have a file with key value pairs separated by whitespace. The first column in the file is the key and the rest of the columns are the value. In other words, each key may have an array for a value.
I'm only interested in the values of certain keys in the file. I have an array with the keys I'm interested in. What's the best way in perl to create a hash with only the subset of key/value pairs that i'm interested in?
Here's what I have thus far:
foreach my $line (#{$file_arr_ref}) {
my $sub = substr( $line, 0, 1);
if(($sub ne "#") and ($sub ne "")){ #omit comments and blank lines
my #key_vals = split(/\s/, $line);
if $key_vals[0] eq "key_i'm_interested_in_1" or $key_vals[0] eq "key_i'm_interested_in_2" {
insert_into_hash();
}
}
}
Is there a more optimal way of doing this?
Create a hash from the array with keys you need.
my #keys_i_need = ('key_1', 'key_2', 'key_3');
my %keys_i_need = map {$_ => 1} #keys_i_need;
foreach my $line (#{$file_arr_ref}) {
my $sub = substr( $line, 0, 1);
if(($sub ne "#") and ($sub ne "")){ #omit comments and blank lines
my #key_vals = split(/\s/, $line);
insert_into_hash() if(exists $keys_i_need{$key_vals[0]});
}
}
Typically, when one is looking for the existance of something, the first data structure one should think of is a hash.
However, if the list of items is short, an array might be sufficient as well by using grep.
foreach my $line (#{$file_arr_ref}) {
next if $line =~ /^$/ || $line =~ /^#/; # Omit blank lines and comments
my #key_vals = split /\s/, $line;
next if ! grep {$key_vals[0] eq $_} qw(key_one key_two key_three);
insert_into_hash();
}
Also note, if you're going to be iterating on all the lines of your file, that it might be better to do it in the form while (<$fh>) instead of loading them all into an array first.

In a file/array, search for hash key, and replace it with the hash value, do this for all hash keys/values

I've searched around the site and surprisingly I can't seem to find something that will work for my particular problem. So I figured I'd post it and see how some of you more experienced programmers can address with problem.
I have a spreadsheet like text file (many lines with tab delimited columns), that I would like to search through for certain labels (ex scaffold1253.1_size81005.6.32799_7496) and replace them with more simplified labels (ex scaffold1253.1a). These labels are only in the first column of the text file. I've already written the script such that I have a hash with the old labels as keys corresponding to the new labels as their respective values. This hash has about 26000 lines. So essentially I'd like to take the hash keys 1 by 1, search for them in the text file, and replace them with their respective hash values.
I have a pretty good server availible so if its too complicated to make it first column specific to speed up the process then thats ok.
THis is what I have so far:
use warnings;
$gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open(FASTAFILE2, $gtf);
#gtfarray = <FASTAFILE2>;
#print #gtfarray;
my %hash;
while (<>)
{
chomp;
my ($key, $val) = split /\t/;
$hash{$key} .= exists $hash{$key} ? ",$val" : $val;
}
#print %hash;
while (my ($find, $replace) = each %hash) {
foreach (#gtfarray){
$_ =~ s/$find/$replace/g;
push #newgtf, $_;
}
}
print #newgtf;
This code doesn't seem to work as it doesn't complete. I'm pretty sure it's a problem with the foreach loop structure. Sorry I don't know of any other way to do this. Does anyone have a better way to run through this file and conduct the replacement?
Any input would be greatly appreciated!
Thanks,
Andrew
#DVK
Here is the full script with your mods that runs into syntax errors with your while loop, any idea why it's not accepting it? Thanks again!
use warnings;
$gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open(FASTAFILE2, $gtf);
my %hash;
while (<>){
chomp;
my ($key, $val) = split /\t/;
$hash{$key} .= exists $hash{$key} ? ",$val" : $val;
}
while $line (<FASTAFILE2>){
my #fields = split(/\t/, $line);
# If you only care about first column, don't need the foreach loop below;
# just do the loop insides on $fields[0]
foreach my $field (#fields) {
$field = $hash{$field} if exists $hash{$field};
print $outfile "$field\t"; # Small bug - will print training \t
}
print $outfile "\n"
}
__END__
Here is the syntax error:
perl gtf_mod2.pl <./Hc_genome/header_file.txt
syntax error at gtf_mod2.pl line 14, near "while $line "
syntax error at gtf_mod2.pl line 23, near "}"
Execution of gtf_mod2.pl aborted due to compilation errors.
You exhaust your file the first time through your loop using the initial $find and $replace key/value pair.
There are two potential solutions:
Open the file for reading during each iteration of your while loop (expensive)
Move the foreach loop to the outside of the while and iterate the hash each time (less expensive)
example:
REPLACE:
for my $line (#gtfarray) {
while(my ($find, $replace) = each %hash) {
if($line =~ s/$find/$replace/g) {
push #newgtf, $line;
next REPLACE; # skip to next iteration
}
}
# if there was no replacement, push the old line
push #newgtf, $line
}
How big is the file that you are replacing the first column in?
If it's >50,000 lines, you are better off doing the reverse:
Iterate through hash file once, and store that hash in memory
Iterate through main file once, and for every line, for every column, find that value in the memorized hash, replace with hash value if found, and write.
In other words, remove the first #gtfarray = <FASTAFILE2>; and replace your last while loop with:
while my $line (<FASTAFILE2>) {
my #fields = split(/\t/, $line);
# If you only care about first column, don't need the foreach loop below;
# just do the loop insides on $fields[0]
foreach my $field (#fields) {
$field = $hash{$field} if exists $hash{$field};
print $outfile "$field\t"; # Small bug - will print training \t
}
print $outfile "\n";
}
NOTE: I'm making an assumption that the fields contain FULL contents of your hash keys (e.g. your data file would contain a field with "scaffold1253.1_size81005.6.32799_7496" but NOT a field with "XYZscaffold1253.1_size81005.6.32799_7496___IOU").
If that assumption is wrong and you really DO need to run a regex because your scaffold strings may be contained in longer strings, there may still be a better solution aside from running O(N*M) regexes: if your scaffold strings are all of a certain well defined format (e.g. "scaffoldNNNNN.NNN_sizeNNNNN.NNN.NNNN_NNNN"), what you need to do then is:
For each line of data file, run a single regex finding that pattern, with the entire pattern inside a capture group parenthesis:
#matches = ($line =~ m/(scaffold\d+\.\d+_size\d+\.\d+\.\d+_\d+/g );
Then, look up every value of #matches array in the hash. If found, run ONLY the matches as a s/// regex.
Looking at your previous post, wouldn't it be more simple to create the shortened 'id' while reading the file. Then you would have no need of the other file where you get your hash?
Here is the (untested) code below. (would need to direct the print statements to an output file on the command line or open a file for writing in your script).
#!/usr/bin/perl
use strict;
use warnings;
my $gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open my $FASTAFILE2, "<", $gtf or die "Unable to open '$gtf' for reading. $!";
my %seen;
while (<$FASTAFILE2>) {
chomp;
my ($id, $val) = split /\t/, $_, 2;
# copy $id to $prefix and
# remove everything after '.1' in $prefix
(my $prefix = $id) =~ s/\.1\K.*//;
if ($seen{$id}) {
++$seen{$id};
}
else {
$seen{$id} = 'a';
}
print "$prefix$seen{$id}\t$val\n";
}
close $FASTAFILE2 or die "Unable to close '$gtf' from reading. $!";
Could it be a job for Tie::File? Assuming, that is, the data file could be operated on as an array.
use Tie::File;
my $file = "./Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf";
tie #lines, 'Tie::File', $file or die ;
for (#lines) {
s/Oldlabel/NewLable/g; # Change this to fit
}
untie #lines ;
Tie::File does a bunch of tricks to keep the "in place " changes to the file memory efficient.

find biggest hash value regarding each key in perl

I have a file with this structure
>test1
MATRTQARGA
>test2
MRIIEGKLQLQG
>test1
MATRTQARGAVVELLYAFESGNEEIKKIASSML
in the result I want
>test2
MRIIEGKLQLQG
>test1
MATRTQARGAVVELLYAFESGNEEIKKIASSML
I was thinking about a hash structure which keys are the lines with > and the next line after each >line would be the value then for each key I some how print the string with longest length , but since hash structures can not have duplicate keys I don't know what to do
You don't need duplicate keys, you just have to store the current longest value for each key, and replace it when you get a longer one:
my %longest;
my $curkey;
while (<>) {
chomp;
if (/^>/) {
$curkey = $_;
$curkey =~ s/^.//; # Remove '>' prefix;
next;
}
if (length($_) > length($longest{$curkey})) {
$longest{$curkey} = $_;
}
}
another, less intuitive way
#!/usr/bin/env perl
use strict;
use Data::Dumper;
local $/ = ">"; # local not really needed here, as its in the global scope..
my %unqs;
while(<DATA>) {
next if (m/^\s*>/);
my #arr = grep { not m/>|^\s*$/ } split(/\n/);
$unqs{$arr[0]} = $arr[1] if (length($arr[1]) > length($unqs{$arr[0]}));
}
print Dumper(\%unqs);
__DATA__
>test1
MATRTQARGA
>test2
MRIIEGKLQLQG
>test1
MATRTQARGAVVELLYAFESGNEEIKKIASSML
now you can use %unqs hash and print it to a file, you will end up with what you want.