Perl: Iterating through large hash, runs out of memory - perl

I have been trying to find values that match between two columns (columns a and column b) of a large file and print the common values, plus the corresponding column d. I have been doing this by interating through hashes, however, because the file is so large, there is not enough memory to produce the output file. Is there any other way to do the same thing using less memory resources.
Any help is much appreciated.
The script I have written thus far is below:
#!usr/bin/perl
use warnings;
use strict;
open (FILE1, "<input.txt") || die "$!\n Couldn't open input.txt\n";
open (Output, ">output.txt")||die "Can't Open output.txt ";
my $hash1={};
my $hash2={};
while (<FILE1>) {
chomp (my $line=$_);
my ($a, $b, $c, $d) = split (/\t/, $line);
if ($a) {
$hash1->{$a}{info1} = "$d"; #original_ID-> YOB
}
if ($b) {
$hash2->{$b}{info2} = "$a"; #original_ID-> sire
}
foreach my $key (keys %$hash2) {
if (exists $hash1{$a}) {
$info1 = $hash1->{$a}->{info1};
print "$a\t$info1\n";
}
}
}
close FILE1;
close Output;
print "Done\n";
To clarify, the input file is a large pedigree file. An example is:
1 2 3 1977
2 4 5 1944
3 4 5 1950
4 5 6 1930
5 7 6 1928
An example of the output file is:
2 1944
4 1950
5 1928

Does the below work for you ?
#!/usr/local/bin/perl
use strict;
use warnings;
use DBM::Deep;
use List::MoreUtils qw(uniq);
my #seen;
my $db = DBM::Deep->new(
file => "foo.db",
autoflush => 1
);
while (<>) {
chomp;
my #fields = split /\s+/;
$$db{$fields[0]} = $fields[3];
push #seen, $fields[1];
}
for (uniq #seen) {
print $_ . " " . $$db{$_} . "\n" if exists $$db{$_};
}

Related

Join element in an array and separate with space

I want to join the first to 16th word and 17th to 31st, etc in an array with space to one line but do not know why the code does not work. Hope to get help here.Thanks
my #file = <FILE>;
for ( $i=0; $i<=$#file; $i+=16 ){
my $string = join ( " ", #file[$i..$i+15] );
print FILE1 "$string\n";
}
Below is part of my file.
1
2
3
...
What i wan to print is
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
17 18 19 20 21....
I wouldn't do it the way you've done it.
Instead I would:
open ( my $input, '<', "your_file_name" ) or die $!;
chomp ( my #file = <$input> );
print join (" ",splice (#file, 0, 15)),"\n" while #file;
Note - I've used a lexical file handle with a 3 argument open, because that's better style.
splice removes the first 16 elements from #file each iteration, and continues until #file is empty.
Your lines have newlines attached to them. Remove them with chomp. Then loop over the array, remove 16 items and print them.
my #file = <FILE>;
chomp #file;
while (#file) {
my #temp;
INNER: for ( 0 .. 15 ) {
push #temp, shift #file || last INNER; # not or
}
print join( q{ }, #temp ), "\n";
}
This is the long implementation of the splice solution Sobrique suggested in the comments. It's does the same thing, just way more verbose.
This is the old answer before the edit:
If you only want the first 16, this is way more effective.
my $string = join q{ }, map { <FILE>; chomp; $_ } 1 .. 16;
This reads 16 lines and chomp each of them, then joins.
You might also want to use lexical file handles $fh instead of the GLOB FILE.
open my $fh, '<', $path_to_file or die $!;
Suppose if you want to read it from file, don't store the whole file into an array. Instead loop through line by line. And check the line number with $. special variable.
use warnings;
use strict;
open my $fh,"<","input.txt";
my $line;
while (<$fh>)
{
chomp;
$line.= $_." ";
print "$line\n" and $line="" if($. % 16 == 0);
END{ print "$line\n";};
}
Or this also will work
use warnings;
use strict;
open my $wh,"<","input.txt";
my $line;
foreach (;;)
{
my $data = join " ",(map { my $m=<$wh> || ""; chomp($m); $m} (0..15));
last if ($data =~m/^\s+$/);
print $data,"\n";
}
Assuming that you have FILE and FILE1 descriptors open, try:
$.%16?s!\s+! !:1 and print FILE1 $_ while <FILE>;

Transposing an array of elements

I am trying to transpose an array.
I tried the following code...
#! /usr/bin/perl
use strict;
use warnings;
use autodie;
open my $fh, '<',"op.txt" || die "$!";
open my $wh , '>',"pwl.txt" || die "$!";
select ($wh);
while (my $line = <$fh>) {
my #rows = $line;
my #transposed;
for my $row (#rows) {
for my $column (0 .. $#{$row}) {
push(#{$transposed[$column]}, $row->[$column]);
}
}
for my $new_row (#transposed) {
for my $new_col (#{$new_row}) {
print $new_col, " ";
}
print "\n";
}
}
**********INPUT FILE******
1 2 3
4 5 6
7 8 9
********** EXPECTED OUTPUT FILE *******
1 4 7
2 5 8
3 6 9
******** GENERATED OUTPUT FILE *******
Currently couldn't able print anything. script shows the error
"can't use string ("1 4 7") as an array ref while "strict refs" in use
Reference:
used the following reference...
Transpose in perl
however in this reference example, array input lines are declared manually where as i am trying to process a array which is in a text file
could anybody help me where i did mistake?
Many Thanks
You did have to split input line as #dland suggested. But, there were a few other issues.
Here's the corrected code [please pardon the gratuitous style cleanup]:
#! /usr/bin/perl
use strict;
use warnings;
use autodie;
open my $fh, '<',"op.txt" || die "$!";
open my $wh , '>',"pwl.txt" || die "$!";
my #rows;
while (my $line = <$fh>) {
my #line = split(" ",$line);
push(#rows,\#line);
}
close($fh);
my #transposed;
for my $row (#rows) {
push(#transposed,[]);
}
my $rowidx = -1;
for my $rowptr (#rows) {
++$rowidx;
my $colidx = -1;
for my $rowval (#$rowptr) {
++$colidx;
###printf("R=%d C=%d\n",$rowidx,$colidx);
my $colptr = $transposed[$colidx];
$colptr->[$rowidx] = $rowval;
}
}
for my $new_row (#transposed) {
for my $new_col (#$new_row) {
print $wh $new_col, " ";
}
print $wh "\n";
}
close($wh);
Note: It's slightly harder to transpose a non-square matrix. The above code may need to be extended a bit for that.
You're trying to shove a scalar into an array:
my #row = $line;
I think what you really want is to split on spaces:
my #row = split / /, $line;
adding my #row =map [ split], $line to my initial code is helping to print the data to pwl.txt.
however its not printing side by side.. instead it is printing to new line.
I guess because of this Mr.Borodin didn't include matrix in side while loop!
1
2
3
4
5
6
7
8
9
Here's a simpler way of doing this. It assumes all the rows are of the same length, and that all the lines in the file contain data -- i.e. there are no blank lines
The name of the input file is expected as a parameter on the command line, and the output is sent to STDOUT so it can be redirected on the command line. For instance
perl transpose.pl op.txt > pwl.txt
use strict;
use warnings 'all';
my #matrix = map [ split ], <>;
print "#$_\n" for #matrix;
print "\n";
my #transpose;
for my $i ( 0 .. $#{ $matrix[0] } ) {
$transpose[$i] = [ map { $_->[$i] } #matrix ]
}
print "#$_\n" for #transpose;
print "\n";
output
1 2 3
4 5 6
7 8 9
1 4 7
2 5 8
3 6 9

Best way to keep track of previous and following line in perl

What is the best/right way, in perl, of keeping the information from the previous and/or following line. For example, with this code:
while (<IN>) {
print;
}
how can it be changed to not print the line only if the previous or the next line in the file match foo, but printing otherwise?
Could you give code examples. Thanks.
Updated: Simplified exposition.
Basically, you need to keep track of two extra lines if you want to print the current lines based on information contained in two other lines. Here is a simple script with everything hard-coded:
#!/usr/bin/env perl
use strict;
use warnings;
my $prev = undef;
my $candidate = scalar <DATA>;
while (defined $candidate) {
my $next = <DATA>;
unless (
(defined($prev) && ($prev =~ /foo/)) ||
(defined($next) && ($next =~ /foo/))
) {
print $candidate;
}
($prev, $candidate) = ($candidate, $next);
}
__DATA__
1
2
foo
3
4
5
foo
6
foo
7
8
9
foo
We can generalize this to a function that takes a filehandle and a test (as a subroutine reference):
#!/usr/bin/env perl
use strict; use warnings;
print_mid_if(\*DATA, sub{ return !(
(defined($_[0]) && ($_[0] =~ /foo/)) ||
(defined($_[1]) && ($_[1] =~ /foo/))
)} );
sub print_mid_if {
my $fh = shift;
my $test = shift;
my $prev = undef;
my $candidate = scalar <$fh>;
while (defined $candidate) {
my $next = <$fh>;
print $candidate if $test->($prev, $next);
($prev, $candidate) = ($candidate, $next);
}
}
__DATA__
1
2
foo
3
4
5
foo
6
foo
7
8
9
foo
You could read your line into an array, and then if you get something that signals you in some way, pop out the last few elements of the array. Once you've finished reading everything in, you could print it:
use strict;
use warnings;
use feature qw(say);
use autodie; #Won't catch attempt to read from an empty file
use constant FILE_NAME => "some_name.txt"
or die qq(Cannot open ) . FILE_NAME . qq(for reading: $!\n);
open my $fh, "<", FILE_NAME;
my #output;
LINE:
while ( my $line = <DATA> ) {
chomp $line;
if ( $line eq "foo" ) {
pop #output; #The line before foo
<DATA>; #The line after foo
next LINE; #Skip line foo. Don't push it into the array
}
push #output, $line;
}
From there, you can print out the array with the values you don't want printed already taken care of.
for my $line ( #output ) {
say $line;
}
The only problem is that this takes memory. If your file is extremely large, you could run out of memory.
One way to get around this is to use a buffer. You store your values in an array, and shift out the last value when you push another in the array. If the value read in is foo, you can reset the array. In this case, the buffer will contain at most one line:
#! /usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw(say);
my #buffer;
LINE:
while ( my $line = <DATA> ) {
chomp $line;
if ( $line eq "foo" ) {
#buffer = (); #Empty buffer of previous line
<DATA>; #Get rid of the next line
next LINE; #Foo doesn't get pushed into the buffer
}
push #buffer, $line;
if ( #buffer > 1 ) { #Buffer is "full"
say shift #buffer; #Print out previous line
}
}
#
# Empty out buffer
#
for my $line ( #buffer ) {
say $line;
}
__DATA__
2
3
4
5
6
7
8
9
10
11
12
13
1
2
foo
3
4
5
foo
6
7
8
9
foo
Note that it is very possible that I might attempt to read from an empty file when I skip the next line. This is okay. The <$fh> will return either an empty string or undef, but I can ignore that. I'll catch the error when I go back to the top of my loop.
I didn't see that you had any specific criteria for "best", so I'll give you a solution that may be "best" along a different axis than those presented so far. You could use Tie::File and treat the entire file as an array, then iterate the array using an index. The previous and next lines are just $index-1 and $index+1 respectively. You just have to worry a little about your indices going beyond the bounds of your array. Here's an example:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010; # just for "say"
use Tie::File;
tie my #array, 'Tie::File', "filename" or die;
for my $i (0..$#array) {
if ($i > 0 && $i < $#array) { # ensure $i-1 and $i+1 make sense
next if $array[$i-1] =~ /BEFORE/ &&
$array[$i+1] =~ /AFTER/;
}
say $array[$i];
}
If it's more convenient, you can specify a filehandle instead of a filename and Tie::File also has some parameters to control memory usage or change what it means to be a "line" if you want that. Check the docs for more info.
Anyway, that's another way to do what you want that might be conceptually simpler if you are familiar with arrays and like to think in terms of arrays.
I would read the file into an array, with each line being an array element, then you can do the comparisons. The only real design consideration is the size of the file being read into memory.

perl to loop and check across files

still having trouble with perl programming and I need to be pushed to make a script work out.
I have two files and I want to use the list file to "extract" rows from the data one. The problem is that the list file is formatted as follow:
X1 A B
X2 C D
X3 E F
And my data looks like this:
A X1 2 5
B X1 3 7
C X2 1 4
D X2 1 5
I need to obtain the element pairs from the list file by which select the row in the data file. At the same time I would like to write an output like this:
X1 A B 2 5 3 7
X2 C D 1 4 1 5
I'm trying writing a perl code, but I'm not able to produce something useful. I'm at this point:
open (LIST, "< $fils_list") || die "impossibile open the list";
#list = <LIST>;
close (LIST);
open (HAN, "< $data") || die "Impossible open data";
#r = <HAN>;
close (HAN);
for ($p=0; $p<=$#list; $p++){
chomp ($list[$p]);
($x, $id1, $id2) = split (/\t/, $list[$p]);
$pair_one = $id1."\t".$x;
$pair_two = $id2."\t".$x;
for ($i=0; $i<=$#r; $i++){
chomp ($r[$i]);
($a, $b, $value1, $value2) = split (/\t/, $r[$i]);
$bench = $a."\t".$b;
if (($pair_one eq $bench) || ($pair_two eq $bench)){
print "I don't know what does this script must print!\n";
}
}
}
I'm not able to rationalize about what to print.
Any kind of suggestion is very welcome!
A few general recommendations:
Indent your code to show the structure of your program.
Use meaningful variable names, not $a or $value1 (if I do so below, this is due to my lack of domain knowledge).
Use data structures that suit your program.
Don't do operations like parsing a line more that once.
In Perl, every program should use strict; use warnings;.
use autodie for automatic error handling.
Also, use the open function like open my $fh, "<", $filename as this is safer.
Remember what I said about data structures? In the second file, you have entries like
A X1 2 5
This looks like a secondary key, a primary key, and some data columns. Key-value relationships are best expressed through a hash table.
use strict; use warnings; use autodie;
use feature 'say'; # available since 5.010
open my $data_fh, "<", $data;
my %data;
while (<$data_fh>) {
chomp; # remove newlines
my ($id2, $id1, #data) = split /\t/;
$data{$id1}{$id2} = \#data;
}
Now %data is a nested hash which we can use for easy lookups:
open my $list_fh, "<", $fils_list;
LINE: while(<$list_fh>) {
chomp;
my ($id1, #id2s) = split /\t/;
my $data_id1 = $data{$id1};
defined $data_id1 or next LINE; # maybe there isn't anything here. Then skip
my #values = map #{ $data_id1->{$_} }, #id2s; # map the 2nd level ids to their values and flatten the list
# now print everything out:
say join "\t", $id1, #id2s, #values;
}
The map function is a bit like a foreach loop, and builds a list of values. We need the #{ ... } here because the data structure doesn't hold arrays, but references to arrays. The #{ ... } is a dereference operator.
This is how i would do it, mostly using Hashes resp. Hash- and Array-References (test1.txt and test2.txt contain the data you provided in your example):
use strict;
use warnings;
open(my $f1, '<','test1.txt') or die "Cannot open file1: $!\n";
open(my $f2, '<','test2.txt') or die "Cannot open file2: $!\n";
my #data1 = <$f1>;
my #data2 = <$f2>;
close($f1);
close($f2);
chomp #data1;
chomp #data2;
my %result;
foreach my $line1 (#data1) {
my #fields1 = split(' ',$line1);
$result{$fields1[0]}->{$fields1[1]} = [];
$result{$fields1[0]}->{$fields1[2]} = [];
}
foreach my $line2 (#data2){
my #fields2 = split(' ',$line2);
push #{$result{$fields2[1]}->{$fields2[0]}}, $fields2[2];
push #{$result{$fields2[1]}->{$fields2[0]}}, $fields2[3];
}
foreach my $res (sort keys %result){
foreach (sort keys %{$result{$res}}){
print $res . " " . $_ . " " . join (" ", sort #{$result{$res}->{$_}}) . "\n";
}
}

100 Most Used Strings in File

How can I find the top 100 most used strings (words) in a .txt file using Perl? So far I have the following:
use 5.012;
use warnings;
open(my $file, "<", "file.txt");
my %word_count;
while (my $line = <$file>) {
foreach my $word (split ' ', $line) {
$word_count{$word}++;
}
}
for my $word (sort keys %word_count) {
print "'$word': $word_count{$word}\n";
}
But this only counts each word, and organizes it in alphabetical order. I want the top 100 most frequently used words in the file, sorted by number of occurrences. Any ideas?
Related: Count number of times string repeated in files perl
From reading the fine perlfaq4(1) manpage, one learns how to sort hashes by value. So try this. It’s rather more idiomatically “perlian” than your approach.
#!/usr/bin/env perl
use v5.12;
use strict;
use warnings;
use warnings FATAL => "utf8";
use open qw(:utf8 :std);
my %seen;
while (<>) {
$seen{$_}++ for split /\W+/; # or just split;
}
my $count = 0;
for (sort {
$seen{$b} <=> $seen{$a}
||
lc($a) cmp lc($b) # XXX: should be v5.16's fc() instead
||
$a cmp $b
} keys %seen)
{
next unless /\w/;
printf "%-20s %5d\n", $_, $seen{$_};
last if ++$count > 100;
}
When run against itself, the first 10 lines of output are:
seen 6
use 5
_ 3
a 3
b 3
cmp 2
count 2
for 2
lc 2
my 2