Find the longest hash value for each key in Perl

I have a file with this structure
>test1
MATRTQARGA
>test2
MRIIEGKLQLQG
>test1
MATRTQARGAVVELLYAFESGNEEIKKIASSML
In the result I want:
>test2
MRIIEGKLQLQG
>test1
MATRTQARGAVVELLYAFESGNEEIKKIASSML
I was thinking about a hash whose keys are the lines starting with > and whose values are the lines that follow each of them; then, for each key, I would somehow print the longest string. But since a hash cannot have duplicate keys, I don't know what to do.

You don't need duplicate keys, you just have to store the current longest value for each key, and replace it when you get a longer one:
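use strict;
use warnings;
my %longest;
my $curkey;
while (<>) {
chomp;
if (/^>/) {
$curkey = $_;
$curkey =~ s/^.//; # Remove '>' prefix
next;
}
# Treat a key that has not been seen yet as having an empty value
if (length($_) > length($longest{$curkey} // '')) {
$longest{$curkey} = $_;
}
}
# Print the longest value recorded for each key, restoring the '>' prefix
print ">$_\n$longest{$_}\n" for sort keys %longest;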

Another, less intuitive way:
#!/usr/bin/env perl
use strict;
use Data::Dumper;
local $/ = ">"; # local is not really needed here, as it's in the global scope
my %unqs;
while(<DATA>) {
next if (m/^\s*>/);
my @arr = grep { not m/>|^\s*$/ } split(/\n/);
$unqs{$arr[0]} = $arr[1] if (length($arr[1]) > length($unqs{$arr[0]}));
}
print Dumper(\%unqs);
__DATA__
>test1
MATRTQARGA
>test2
MRIIEGKLQLQG
>test1
MATRTQARGAVVELLYAFESGNEEIKKIASSML
Now you can take the %unqs hash and print it to a file in whatever format you need.
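As a rough illustration of the record-separator idea above, here is a self-contained sketch that reads the sample data from an in-memory filehandle instead of DATA. The helper name longest_per_header is made up for this example, and the sketch assumes each header is followed by a single sequence line (multi-line sequences would need extra handling). It needs Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

# Sample input from the question, held in a string for the sake of the example.
my $fasta = ">test1\nMATRTQARGA\n>test2\nMRIIEGKLQLQG\n"
          . ">test1\nMATRTQARGAVVELLYAFESGNEEIKKIASSML\n";

# Hypothetical helper: returns a hash of header => longest sequence seen.
sub longest_per_header {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = '>';                 # each read now stops at the next '>'
    my %longest;
    while (my $record = <$fh>) {
        chomp $record;              # chomp strips the trailing $/, i.e. the '>'
        my ($header, $seq) = split /\n/, $record;
        next unless defined $seq;   # skips the empty record before the first '>'
        $longest{$header} = $seq
            if length($seq) > length($longest{$header} // '');
    }
    close $fh;
    return %longest;
}

my %longest = longest_per_header($fasta);
print ">$_\n$longest{$_}\n" for sort keys %longest;
```

Sorting the keys only makes the output order predictable; the question does not appear to require any particular order.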

Related

find duplicate filenames and append them to hash of arrays

Perl question: I have a colon separated file containing paths that I'm using. I just split using a regex, like this:
my %unique_import_hash;
while (my $line = <$log_fh>) {
my ($log_type, $log_import_filename, $log_object_filename)
= split /:/, $line;
$log_type =~ s/^\s+|\s+$//g; # trim whitespace
$log_import_filename =~ s/^\s+|\s+$//g; # trim whitespace
$log_object_filename =~ s/^\s+|\s+$//g; # trim whitespace
}
The exact file format is:
type : source-filename : import-filename
What I want is an index file that contains the last pushed $log_object_filename for each unique key $log_import_filename, so, what I'm going to do in English/Perl pseudo-code is push the $log_object_filename onto an array indexed by the hash %unique_import_hash. Then, I want to iterate over the keys and pop the array referred by %unique_import_hash and store it in an array of scalars.
My specific question is: what is the syntax for appending to an array that is the value of a hash?
You can use push, but you have to dereference the array referenced by the hash value:
push @{ $hash{$key} }, $filename;
See perlref for details.
If you only care about the last value for each key, you're over-thinking the problem. No need to fool around with arrays when a simple assignment will overwrite the previous value:
while (my $line = <$log_fh>) {
# ...
$unique_import_hash{$log_import_filename} = $log_object_filename;
}
use strict;
use warnings;
my %unique_import_hash;
my $log_filename = "file.log";
open(my $log_fh, "<", $log_filename) or die "Cannot open $log_filename: $!";
while (my $line = <$log_fh>) {
chomp $line; # drop the newline so it does not end up in the last field
$line =~ s/ *: */:/g;
my ($log_type, $log_import_filename, $log_object_filename) = split /:/, $line;
push @{ $unique_import_hash{$log_import_filename} }, $log_object_filename;
}
Seek the wisdom of the Perl monks.
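To make the two approaches above concrete, here is a minimal sketch on made-up log lines. The sample file names and the helper index_imports are hypothetical, and the field order follows the variable names in the question's split. It builds the hash of arrays with push, then takes element -1 of each array, which gives the same result as simply overwriting the value. The s///r form assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical sample lines: type : import filename : object filename.
my @lines = (
    "obj : a.c : a_v1.o\n",
    "obj : b.c : b_v1.o\n",
    "obj : a.c : a_v2.o\n",
);

# Hypothetical helper: builds one array of object filenames per import filename.
sub index_imports {
    my @log_lines = @_;
    my %by_import;
    for my $line (@log_lines) {
        chomp(my $copy = $line);
        # Split on ':' and trim whitespace around each field.
        my ($type, $import, $object) = map { s/^\s+|\s+$//gr } split /:/, $copy;
        push @{ $by_import{$import} }, $object;   # the array is autovivified
    }
    return %by_import;
}

my %by_import = index_imports(@lines);

# The last pushed value for each key is element -1 of its array.
my %last = map { $_ => $by_import{$_}[-1] } keys %by_import;

print "$_ => $last{$_}\n" for sort keys %last;
```

If only the last value is ever needed, the plain assignment shown earlier is simpler; the array version pays off when the earlier values matter too.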

Perl: Printing out the file where a word occurs

I am trying to write a small program that takes from command line file(s) and prints out the number of occurrence of a word from all files and in which file it occurs. The first part, finding the number of occurrence of a word, seems to work well.
However, I am struggling with the second part, namely, finding in which file (i.e. file name) the word occurs. I am thinking of using an array that stores the word but don’t know if this is the best way, or what is the best way.
This is the code I have so far and seems to work well for the part that counts the number of times a word occurs in given file(s):
use strict;
use warnings;
my %count;
while (<>) {
my $casefoldstr = lc $_;
foreach my $str ($casefoldstr =~ /\w+/g) {
$count{$str}++;
}
}
foreach my $str (sort keys %count) {
printf "$str $count{$str}:\n";
}
The filename is accessible through $ARGV.
You can use this to build a nested hash with the filename and word as keys:
use strict;
use warnings;
use List::Util 'sum';
my %count;
while (<>) {
$count{$_}{$ARGV}++ for map { lc } /\w+/g;
}
foreach my $word ( keys %count ) {
my @files = keys %{ $count{$word} }; # All files containing lc $word
print "Total word count for '$word': ", sum( @{ $count{$word} }{@files} ), "\n";
for my $file ( @files ) {
print "$count{$word}{$file} counts of '$word' detected in '$file'\n";
}
}
Using an array seems reasonable, if you don't visit any file more than once - then you can always just check the last value stored in the array. Otherwise, use a hash.
#!/usr/bin/perl
use warnings;
use strict;
my %count;
my %in_file;
while (<>) {
my $casefoldstr = lc;
for my $str ($casefoldstr =~ /\w+/g) {
++$count{$str};
push @{ $in_file{$str} }, $ARGV
unless ref $in_file{$str} && $in_file{$str}[-1] eq $ARGV;
}
}
foreach my $str (sort keys %count) {
print "$str $count{$str}: @{ $in_file{$str} }\n";
}
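The nested-hash idea can be tried without any files on disk. The sketch below is only an illustration: the %docs hash stands in for the files named on the command line, and count_words is a made-up helper that returns a structure of the form word => { file => occurrences }.

```perl
use strict;
use warnings;
use List::Util 'sum';

# Hypothetical stand-in for the input files: name => contents.
my %docs = (
    'a.txt' => "The cat sat.\nThe end.\n",
    'b.txt' => "A cat.\n",
);

# Hypothetical helper: returns a reference to $count{word}{file} = occurrences.
sub count_words {
    my (%texts) = @_;
    my %count;
    for my $file (keys %texts) {
        # Case-fold every word and count it under the file it came from.
        $count{ lc $_ }{$file}++ for $texts{$file} =~ /\w+/g;
    }
    return \%count;
}

my $count = count_words(%docs);
for my $word (sort keys %$count) {
    my @files = sort keys %{ $count->{$word} };
    my $total = sum @{ $count->{$word} }{@files};   # hash slice over all files
    print "$word $total: @files\n";
}
```

With the sample data, the line for "cat" reads "cat 2: a.txt b.txt". Because the file name is a hash key, visiting the same file twice just adds to its count, which is the case the array approach cannot handle.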

Hash key sorting using perl?

I need to sort hash keys in Perl, and I also need to allow duplicate keys. My plan was to check with exists whether a key is already present and, if it is, increment its last digit before storing it in the hash.
I tried the following code:
use strict;
use warnings;
use iPerl::Basic qw(_save_file _open_file);
my $xml = $ARGV[0];
my ($xmlcnt,$backcnt,$refcnt,$name,$year) = "";
my %sort = ();
if(($#ARGV != 0) or(not -f "$xml") or($xml!~ m{\.xml$}i)){
print_exit("\t\tSYSTAX ERROR: <EXE> <xml File>\n\n")
};
$xmlcnt=_open_file($xml);
$xmlcnt =~ s{<back(?: [^>]+)?>(?:(?!</?back[ >]).)*</back>}{
$backcnt = $&;
while($backcnt =~ m{<ref(?: [^>]+)?>(?:(?!<ref[ >]).)*</ref>}igs){
$refcnt = $&;
$name = $1 if($refcnt =~ m{<person-group(?: [^>]+)?>((?:(?!</?person-group[ >]).)*)</person-group>}is);
$year = $1 if($refcnt =~ m{<year>((?:(?!</?year[ >]).)*)</year>}is);
$name =~ s{</?(?:string-name|surname|given-names)>}{}ig;
my $count = 1;
my $keys="$name $year\E$count";
if(exists ($sort{$keys})){
$keys =~ s{(\d)$}{my $icr=$1;$icr++;qq($icr)}e;
#print"$keys\n";
$sort{$keys}="$refcnt";
}
else
{
$sort{$keys}="$refcnt";
}
print join("\n",keys %sort);
}
qq($backcnt)
}igse;
my @keys = sort {
$sort{$a} <=> $sort{$b}
# or
# "\L$a" cmp "\L$b"
} keys %sort;
# print join("\n",@keys);
sub print_exit {
my $msg = shift;
#print "\n$msg";
exit;
}
Please can anyone tell me what went wrong here?
input:
thieooieroh
apple
apple
highefhfe
bufghifeh
output:
apple
apple
bufghifeh
highefhfe
thieooieroh
Thanks in advance.
From a very brief look at your code, it appears that you want to store refcounts as the values in your hash, with the ability to have multiple counts for a single key. This is easily doable by using a hash of arrays (commonly abbreviated to HoA). Each key must, by definition, be unique, but the associated value can be a reference, allowing you to store multiple items under that key, or to build even more complex data structures.
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
my %hash;
while (my $line = <DATA>) {
chomp $line;
my ($key, $count) = split ',', $line;
push @{$hash{$key}}, $count;
}
for my $key (sort keys %hash) {
my $values = $hash{$key};
for (@$values) {
say "$key ($_)";
}
}
__DATA__
thieooieroh,1
apple,2
apple,3
highefhfe,4
bufghifeh,5
Output:
apple (2)
apple (3)
bufghifeh (5)
highefhfe (4)
thieooieroh (1)
If you're not actually concerned with storing multiple data items with each key, but only with the number of times each key appears, it's even simpler. Change the two loops in the above code to:
while (my $line = <DATA>) {
chomp $line;
$hash{$line}++;
}
for my $key (sort keys %hash) {
say $key for 1 .. $hash{$key};
}
and you get the output
apple
apple
bufghifeh
highefhfe
thieooieroh
As for the rest of your posted code, don't try to parse XML with regexes. Arbitrary XML cannot be parsed beyond a very crude first approximation by regular expressions because XML is not structurally "regular". There are many fine XML-parsing modules on CPAN which will parse your XML correctly for you, while also requiring far less effort from you than trying to write your own parser. Use one of them. Not regexes.

Merging two files based on first column and returns multiple values for each key

I am fairly new to Perl so hopefully this has a quick solution.
I have been trying to combine two files based on a key. The problem is there are multiple values instead of the one it is returning. Is there a way to loop through the hash to get the 1-10 more values it could be getting?
Example:
File Input 1:
12345|AA|BB|CC
23456|DD|EE|FF
File Input2:
12345|A|B|C
12345|D|E|F
12345|G|H|I
23456|J|K|L
23456|M|N|O
32342|P|Q|R
I included that last line because the second file has a lot of values I don't want, whereas I want all the values from file 1. The result I want is something like this:
WANTED OUTPUT:
12345|AA|BB|CC|A|B|C
12345|AA|BB|CC|D|E|F
12345|AA|BB|CC|G|H|I
23456|DD|EE|FF|J|K|L
23456|DD|EE|FF|M|N|O
Attached is the code I am currently using. It gives an output like so:
OUTPUT I AM GETTING:
12345|AA|BB|CC|A|B|C
23456|DD|EE|FF|J|K|L
My code so far:
#use strict;
#use warnings;
open file1, "<FILE1.txt";
open file2, "<FILE2.txt";
while(<file2>){
my($line) = $_;
chomp $line;
my($key, $value1, $value2, $value3) = $line =~ /(.+)\|(.+)\|(.+)\|(.+)/;
$value4 = "$value1|$value2|$value3";
$file2Hash{$key} = $value4;
}
while(<file1>){
my ($line) = $_;
chomp $line;
my($key, $value1, $value2, $value3) = $line =~/(.+)\|(.+)\|(.+)\|(.+)/;
if (exists $file2Hash{$key}) {
print $line."|".$file2Hash{$key}."\n";
}
else {
print $line."\n";
}
}
Thank you for any help you may provide,
Your overall idea is sound. However in file2, if you encounter a key you have already defined, you overwrite it with a new value. To work around that, we store an array(-ref) inside our hash.
So in your first loop, we do:
push @{$file2Hash{$key}}, $value4;
The @{...} is just array dereferencing syntax.
In your second loop, we do:
if (exists $file2Hash{$key}){
foreach my $second_value (@{$file2Hash{$key}}) {
print "$line|$second_value\n";
}
} else {
print $line."\n";
}
Beyond that, you might want to declare %file2Hash with my so you can reactivate strict.
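Put together, the two changes above can be sketched as a small self-contained script. The rows are copied from the question and kept in arrays so the example runs without FILE1.txt and FILE2.txt; merge_rows is a hypothetical helper name, not part of the original code.

```perl
use strict;
use warnings;

# Sample rows from the question; in practice they would be read from the two files.
my @file1 = ('12345|AA|BB|CC', '23456|DD|EE|FF');
my @file2 = ('12345|A|B|C', '12345|D|E|F', '12345|G|H|I',
             '23456|J|K|L', '23456|M|N|O', '32342|P|Q|R');

# Hypothetical helper: takes two array refs and returns the merged lines.
sub merge_rows {
    my ($rows1, $rows2) = @_;

    # First pass: collect every file2 value under its key (hash of arrays).
    my %file2Hash;
    for my $row (@$rows2) {
        my ($key, $rest) = split /\|/, $row, 2;
        push @{ $file2Hash{$key} }, $rest;
    }

    # Second pass: emit one output line per stored value for each file1 row.
    my @out;
    for my $row (@$rows1) {
        my ($key) = split /\|/, $row, 2;
        if (exists $file2Hash{$key}) {
            push @out, "$row|$_" for @{ $file2Hash{$key} };
        } else {
            push @out, $row;
        }
    }
    return @out;
}

print "$_\n" for merge_rows(\@file1, \@file2);
```

Keys that appear only in the second file, such as 32342, are never looked up and so do not appear in the output.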
Keys in a hash must be unique. If keys in file1 are unique, use file1 to create the hash. If keys are not unique in either file, you have to use a more complicated data structure: hash of arrays, i.e. store several values at each unique key.
I assume that each key in FILE1.txt is unique and that each unique key has at least one corresponding line in FILE2.txt.
Your approach is then quite close to what you need, you should just use FILE1.txt to create the hash from (as already mentioned here).
The following should work:
#!/usr/bin/perl
use strict;
use warnings;
my %file1hash;
open file1, "<", "FILE1.txt" or die "$!\n";
while (<file1>) {
my ($key, $rest) = split /\|/, $_, 2;
chomp $rest;
$file1hash{$key} = $rest;
}
close file1;
open file2, "<", "FILE2.txt" or die "$!\n";
while (<file2>) {
my ($key, $rest) = split /\|/, $_, 2;
if (exists $file1hash{$key}) {
chomp $rest;
printf "%s|%s|%s\n", $key, $file1hash{$key}, $rest;
}
}
close file2;
exit 0;

Perl- Extract each line from a txt file and store into different variables

I read in a txt file using a Perl script, but I'm wondering how to store each line from the file in a different variable using pattern matching. I can match a line using ~^>gi, but it displays both lines from the file that contain >gi (i.e. lines 1 and 3). I also want to read the two separate DNA sequences into different variables. Consider my example below.
file.txt
>gi102939
GATCTATC
>gi123453
CATCGACA
the perl script:
#!/usr/local/bin/perl
open (MYFILE, 'file.txt');
@array = <MYFILE>;
($first, $second, $third, $fourth, $fifth) = @array;
chomp $first, $second, $third, $fourth, $fifth;
print "Contents:\n @array";
if (@array =~ /^>gi/)
{
print "$first";
}
close (MYFILE);
Assuming that >gi.. are unique in the input, populate a hash where each key is associated with a sequence:
#!/usr/bin/perl
use warnings;
use strict;
my %hash;
my $last;
while (<DATA>) {
chomp;
if (/^>gi/) {
$last = $_;
} else {
$hash{$last} = $_;
}
}
foreach my $k (keys %hash) {
print "$k => $hash{$k}\n";
}
__DATA__
>gi102939
GATCTATC
>gi123453
CATCGACA
Please always use strict and use warnings at the top of your program, and declare your variables using my at their first point of use. This applies especially when you are asking for help, as doing so can frequently reveal simple problems that could otherwise be overlooked.
As it stands, your program will read the file into @array and print it out. The test if (@array =~ /^>gi/) { ... } will force scalar context on the array, and so compare the number of elements in the array (4 for the sample file) with the regex pattern and fail.
What exactly are you trying to achieve? Reading a file into an array already puts each line into a different scalar variable, the variables being the elements of the array.
This one-liner reads the database and extracts one element:
perl < file.txt -e '@array=<>;chomp @array;%hash=@array;print $hash{">gi102939"}'
result:
GATCTATC
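The one-liner relies on the fact that assigning a list to a hash consumes it in pairs. A minimal sketch of just that step, with the lines of file.txt written out as a literal list (so no file is needed), might look like this. It assumes every header line is followed by exactly one sequence line; otherwise the pairing goes out of step.

```perl
use strict;
use warnings;

# The lines of file.txt as they would look after reading and chomping.
my @array = ('>gi102939', 'GATCTATC', '>gi123453', 'CATCGACA');

# List-to-hash assignment pairs the elements up: key, value, key, value, ...
my %hash = @array;

print $hash{'>gi102939'}, "\n";   # GATCTATC
print $hash{'>gi123453'}, "\n";   # CATCGACA
```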