I've created a hash that lists each unique word from a text file and the number of times each word appears. The resulting output from this hash indicates it has read the entire file correctly.
However, later efforts to parse text from the same file appear to only capture some of the contents.
For illustrative purposes, a second hash designed to capture each word from the infile as a hash value and that word's relative ordering in the document as the hash key captures only a small fraction of all the words in the document.
Any insights as to the cause of this phenomenon?
#!/usr/bin/perl
use strict;
use warnings;
my $path = "U:/Perl";
chdir($path) or die "Cant chdir to $path $!";
# Starting off computing a simple word count for each word in the 10-K.
my %hash;
my $word;
my %words;
while (<>) {
my #words = split;
# Here creating an index of each word in the 10-K and the number of times
# it appears. This hash works correctly.
foreach my $i (0 .. $#words) {
my $word = $words[$i];
++$hash{$word};
# Here trying to create a hash where each word from the infile is a value,
# and the word's order in the doc is the key.
$words{$i} = $word;
}
}
# The code below simply sorts each hash and prints it to an external file.
my #keys = sort { "\L$a" <=> "\L$b" } keys %words;
open(my $fh2, '>', 'report2.txt');
foreach (#keys) {
print $fh2 "$_ \t $words{$_}\n ";
}
close $fh2;
#keys = sort {
"\L$hash{$a}" <=> "\L$hash{$b}" or
"\L$a" cmp "\L$b"
} keys %hash;
open(my $fh3, '>', 'report3.txt');
foreach (#keys) {
print $fh3 "$_ \t $hash{$_}\n ";
}
close $fh3;
I presume you're talking about the %words hash? You are keying that hash by the position of each word on each line, so it will only ever have as many entries as the longest line in the file has words.
When processing this data
a b c d e f
g h i j k
l m n o
p q r
s t
u
your program will build %hash with 21 elements, with the keys a to u and all the values equal to 1 as they are all different.
But the hash %words will have six elements -- the number of words in the longest line -- and the values will be overwritten by the last word at that position in the line. So your hash will look like
{ 0 => 'u', 1 => 't', 2 => 'r', 3 => 'o', 4 => 'k', 5 => 'f' }
as the last word in the first position on the line is u, the last one in the second position is t, etc.
Surely you don't want a hash indexing the words across the whole file? That would be an array!
Update
Thank you for explaining your intention. I don't think putting all the words from the file into a hash one by one is going to help you to count all the unique six-word sequences.
In any case, using numeric keys from 1 to N, where N is the number of words in the file, is misusing a hash, and as I intimated above what you really want is an array that is meant to be indexed by integers.
I think you should keep an array that holds the current six-word sequence. If you add each word to the end of the array and drop them from the beginning then it will always hold the most recent six words from the file.
Something like this, perhaps
use strict;
use warnings;
my #sequence;
my %sequences;
while (<DATA>) {
for (split) {
push #sequence, $_;
if (#sequence >= 6) {
shift #sequence while #sequence > 6;
++$sequences{"#sequence"};
}
}
}
use Data::Dump;
dd \%sequences;
__DATA__
a b c d e f
g h i j k
l m n o
p q r
s t
u
output
{
"a b c d e f" => 1,
"b c d e f g" => 1,
"c d e f g h" => 1,
"d e f g h i" => 1,
"e f g h i j" => 1,
"f g h i j k" => 1,
"g h i j k l" => 1,
"h i j k l m" => 1,
"i j k l m n" => 1,
"j k l m n o" => 1,
"k l m n o p" => 1,
"l m n o p q" => 1,
"m n o p q r" => 1,
"n o p q r s" => 1,
"o p q r s t" => 1,
"p q r s t u" => 1,
}
Related
The code is getting messy somewhere in the loop! Please help me to solve it.
Details
Replicate and/or reassign most of the array elements in the multi-dimensional array, using reference elements.
File-1: List of array indices & the elements that needs to be maintained in the original array.
File-2: The original multi-dimensional array that needs to be rewritten with the above info. Except the elements from the above, the rest of all elements have to be reassigned.
File-3: Expected output (reassigned array elements)
Note: Addition to the array indices from file1, rest of all the indices will be replaced with the reference line. Reference line is usually present in the first line of the array. In the modified array, the reference line is not needed.
File-1:
ID1 2 E1,E4
ID2 5 E6,E7,E9
ID3 1 E3
File-2:
ID1.txt
Ref K L M N O P A B C D
E1 S H G U S K R E K K
E2 S L G N O P A B C D
E3 S L G N O P A B C D
E4 U L G G O P A B C D
E5 U L M G O P A J C D
E6 U L M G O P A J C D
E7 U L M G O P A J C D
E8 U L M G O P A J C D
E9 S L M N O P A J C D
E10 S L M N O P A J C D
.
.
.
File-3: Expected output
new_ID1.txt
E1 K L G N O P A B C D
E2 K L M N O P A B C D
E3 K L M N O P A B C D
E4 K L G N O P A B C D
E5 K L M N O P A B C D
E6 K L M N O P A B C D
E7 K L M N O P A B C D
E8 K L M N O P A B C D
E9 K L M N O P A B C D
E10 K L M N O P A B C D
.
.
.
In the expected output, (new_ID1.txt), second index of the array for "E1" and "E4" is maintained from the original array. Everything else is replaced by the reference line in "E2,E3,E5...".
Code
#!/usr/bin/perl
use strict;
use warnings;
my %HoHoA = ();
open(IN,"ids.txt");
my #ids = <IN>; chomp #ids; close IN;
open(IN2,"indices_and_values.txt");
while(my $l = <IN2>)
{
chomp $l;
my #tmp = split "\t", $l;
my $lid = $tmp[0];
my $pos = $tmp[1];
my #gps = #tmp[2..$#tmp];
foreach my $g (#gps)
{
push #{$HoHoA{$lid}{$g}}, $pos;
}
}
close IN2;
foreach my $outer (sort keys %HoHoA)
{
open(IN3,"$outer.txt");
my #rS = <IN3>; chomp #rS; close IN3;
my #orgArr = (); my #refArr = (); my #newArr = ();
foreach my $unk (#rS)
{
#orgArr = split "\t", $unk;
if($unk =~ /^Ref/)
{
#refArr = split "\t", $unk;
next;
}
foreach my $inner (sort keys %{$HoHoA{$outer}})
{
if($inner =~ /^$orgArr[0]/)
{
foreach my $ele (sort {$a <=> $b} #{$HoHoA{$outer}{$inner}})
{
$refArr[$ele] = $orgArr[$ele];
}
}
#else
#{
#}
}
print ">$orgArr[0]\t";
print join("\t",#refArr[1..$#refArr]);
print "\n";
}
#rS = ();
print "\n";
}
The shown code is well-meant but a bit too complicated; you may have lost your way in the maneuvers over the nested data structure. Here's another, simpler, approach.
Parse the information from the "reference" file (File-1) into a hash (E1 => [2, ...], ..). I put indices for data to be kept in an arrayref to allow for multiple indices for a row. Then go line by line, replacing data at these indices for rows that have a key, and print output as you go.
use warnings;
use strict;
use feature 'say';
my ($ref_file, $data_file) = #ARGV;
die "Usage: $0 ref-file data-file\n" if not $ref_file or not $data_file;
open my $fh, '<', $ref_file or die "Can't open $ref_file: $!";
my %rows;
while (<$fh>) {
my (undef, $idx, $row_id) = split;
for (split /,/, $row_id) {
push #{$rows{$_}}, $idx; # elem => [ indices ]
}
}
my $outfile = 'new_' . $data_file;
open $fh, '<', $data_file or die "Can't open $data_file: $!";
open my $fh_out, '>', $outfile or die "Can't open $outfile: $!";
my #ref = split ' ', <$fh>;
shift #ref; # toss the first field
while (<$fh>) {
my ($row_id, #data) = split;
if (exists $rows{$row_id}) { # this row needs attention
my #new_row = #ref;
foreach my $idx (#{$rows{$row_id}}) { # keep data at these indices
$new_row[$idx] = $data[$idx];
}
say $fh_out join "\t", $row_id, #new_row;
}
else { # use whole reference line
say $fh_out join "\t", $row_id, #ref;
}
}
The new file (shown with two spaces instead of the actual tabs, for readability)
E1 K L G N O P A B C D
E2 K L M N O P A B C D
E3 K L M N O P A B C D
E4 K L G N O P A B C D
E5 K L M N O P A B C D
E6 K L M N O P A B C D
E7 K L M N O P A B C D
E8 K L M N O P A B C D
E9 K L M N O P A B C D
E10 K L M N O P A B C D
Note that the given input file happens to have the same entries as the reference line to use in replacement at many indices of interest -- so we can't see those "changes" in the above output. (I tested by changing the input file so to be able to see.)
This is one way to do it, if I understood your problem statement correctly:
#!/usr/bin/perl
use strict;
use warnings;
my %keep_idx;
open FILE, "file-1" or die "Couldn't open file-1";
while(<FILE>) {
my (undef, $idx, $id_str) = split /\s+/;
my #ids = split /,/, $id_str;
foreach my $id (#ids) {
$keep_idx{$id}{$idx} = 1;
}
}
close FILE;
open FILE, "file-2" or die "Couldn't open file-2";
open OUTFILE, ">file-3" or die "Couldn't open file-3";
my (undef, #ref) = split /\s+/, <FILE>;
while(<FILE>) {
my ($id, #src) = split /\s+/;
my $line = "$id";
for (my $i = 0; $i <= $#src; $i++) {
my $e = $keep_idx{$id}{$i} ? $src[$i] : $ref[$i];
$line .= " $e";
}
print OUTFILE "$line\n";
}
close OUTFILE;
close FILE;
I have a file like below
1 B B C D B
2 K B D D K
1 B B C D B
2 K B D D K
1 B B C D B
2 K B D D K
I want the output to look like this
1 B C D
2 K B D
1 B C D
2 K B D
1 B C D
2 K B D
Sort commands don't work, so I tried writing a Perl program -
use strict;
use Data::Dumper;
my $file1 = <$ARGV[0]>;
open (IF2, "$file1") || die "Cannot open the file\n";
open (OUT, ">$file1.out") || die "Cannot open the out file\n";
my $k = 0;my $i=0;
my #line;
my $m;
my #line2;
while ( chomp($m = <IF2>) ) {
my $count = 0;
#line2 = split(/\t/,$m);#<stdin>;
my $l = length #line2;print $l;<stdin>;
for (my $x = 0; $x < $l;$x++) {
my $k = 0;
for (my $y = 0;$y < $l; $y) {
$i++;
#
if ($count == 0)
{
print OUT "\t$line2[$x]";
$count++;
}
if ($count != 0 && $x != $y)
{
if ($line2[$x] eq $line2[$y])
{
$k++;
}
}
}
if ($k == 0)
{
print OUT "\t$line2[$x]";
}
}
print OUT "\n";
}
print $i;
close IF2;
close OUT;
But it didn't work.
Can some one help?
Note The input and output examples in the question were edited: now the data is consistent with the title, asking to remove all duplicates. (See the original question for how it looked.) I am leaving the answer to the original question as well, until we hear from the poster.
I would like to make a general remark first.
Your posted code is an honest attempt to write a C-style program in Perl. I recommend to resist that and to learn how to use Perl instead. It does take a little investment of time and effort but it goes quick and pays off quick. As you can see below, it makes many things incomparably easier.
Even so, there are issues with the posted code but I can't write a code review now.
Edited question
We need to remove all duplicates from each line. One way: split the line by whitespace then remove duplicates from the list, which is a standard task with ready solutions.
use warnings;
use strict;
use List::MoreUtils qw(uniq);
my $file = '...';
my $fileout = '...';
open my $fh, '<', $filen or die "Can't open $file: $!";
open my $fh_out, '>', $fileout or die "Can't open $fileout: $!";
while (<$fh>)
{
my #unique = uniq split;
print $fh_out "#unique\n";
}
close $fh;
close $fh_out;
The line with spaces is restored by printing the quoted array, when spaces (or whatever is in $") are inserted between elements . The other option is to join the resulting list
my $res = join ' ', uniq split;
print $fh_out $res, "\n";
or just print $fh_out join(' ', uniq split), "\n";.
This uses uniq from List::MoreUtils module. Note from uniq that
The order of elements in the returned list is the same as in LIST.
Once List::MoreUtils got mentioned, take note of the core module List::Util as well.
A one-liner version
perl -MList::MoreUtils=uniq -lne'print join " ", uniq split' input > output
or
perl -MList::MoreUtils=uniq -lane'print join " ", uniq #F' input > output
See Command switches in perlrun
Original question (see it in the edit history)
I was answering the question with the belief that this was the input
1 B B C D B 2 K B D D K 1 B B C D B 2 K B D D K 1 B B C D B 2 K B D D K
and this was the desired output
1 B C D 2 K B D 1 B C D 2 K B D 1 B C D 2 K B D
By your desired output you want to remove only adjacent duplicates (not get "uniq" as the title says).
For that you can use the ability of regex to match repeated patterns, by using backreferences. First we need to strip all spaces, and we'll put them back in the end. For example
use warnings;
use strict;
my $file = '...';
my $fileout = '...';
open my $fh, '<', $filen or die "Can't open $file: $!";
open my $fh_out, '>', $fileout or die "Can't open $fileout: $!";
while (my $line = <$fh>)
{
$line =~ s/\s*//g; # remove spaces /
$line =~ s/(.)\1+/$1/g; # remove adjacent duplicates
$line =~ s/(.)/$1 /g; # restore space
print $fh_out $line;
}
close $fh;
close $fh_out;
The . matches any character, replace it with something more restrictive if needed (for example \w, for a "word" character). See perlretut. Note that we cannot restore space along with replacement (like s/(.)\1+/$1 /g) since non-repeating chars aren't matched and don't get the space back.
This can be done in more concise ways.
A one-liner version
perl -pe's/\s*//g; s/(.)\1+/$1/g; s/(.)/$1 /g' input > output
See Command switches in perlrun
I suggest something like this. It finds all occurrences of whitespace followed by non-whitespace, and checks whether the non-whitespace has been seen before on the current line. The matched substring is deleted if the non-whitespace has been seen before, otherwise it is left untouched
use strict;
use warnings 'all';
while ( <DATA> ) {
my %seen;
s/(\s+(\S+))/ $seen{$2}++ ? '' : $1 /eg;
print;
}
__DATA__
1 B B C D B
2 K B D D K
1 B B C D B
2 K B D D K
1 B B C D B
2 K B D D K
output
1 B C D
2 K B D
1 B C D
2 K B D
1 B C D
2 K B D
This may be done in a one-liner like this
perl -pe 'my %s; s/(\s+(\S+))/ $s{$2}++ ? "" : $1 /eg' myfile
I have a tab separated large file like this:
input.txt
a b c
s t e
a b c
f q y
r e x
to delete the repeated lines (rows) in this file, i use:
my %seen;
my #lines;
while (<>) {
my #cols = split /\s+/;
unless ($seen{$cols[0]}++) {
push #lines, $_;
}
}
print #lines;
the output here is:
a b c
s t e
f q y
r e x
Now if I want to delete those lines too that contain repeted values (means: that value once appear anywhere in upper rows/columns, here "e") and keep only the uppermost value containing line, please suggest what will be the most preffered approach keeping in mind that my input file is very large with many columns and rows.
model output that I want for the above input.txt would be:
a b c
s t e
f q y
Thank you
You also need to iterate over the #cols and examine every item instead of just the first one, $cols[0].
You need something like
unless ($seen{$cols[0]}++ || $seen{$cols[1]}++ || $seen{$cols[2]}++ ...) {
push #lines, $_;
}
Of course that would be bad style and impossible if you don't know the number of columns in advance.
I would do it with grep:
my %seen;
my #lines;
while (<DATA>) {
my #cols = split /\s+/;
unless ( grep { $seen{$_}++ } #cols ) {
push #lines, $_;
}
}
print #lines;
__DATA__
a b c
s t e
a b c
f q y
r e x
Output:
a b c
s t e
f q y
grep processes the code between the curlies { $seen{$_}++ } for each element in the list #cols and returns (in scalar context) the number of items that evaluated to true.
It's not the fastest approach because it always iterates over the whole array (even if the first evaluation would be sufficient for your particular test). But give it a try; perhaps it's fast enough for you.
As I wrote in my comments, split /\s+/ is very rarely correct
And the solution you have mishandles lines with duplicate fields
It's also more efficient to replace grep with any from the core List::Util module
I suggest that you store the fields of each line in a hash %cols, like this
use strict;
use warnings 'all';
use List::Util 'any';
my ( #lines, %seen );
while ( <DATA> ) {
my %cols = map { $_ => 1 } split;
push #lines, $_ unless any { $seen{$_}++ } keys %cols;
}
print for #lines;
__DATA__
a b c
p p p
p q r
s t e
a b c
f q y
r e x
output
a b c
p p p
s t e
Even this may not be what you want, as the line f q y is omitted because q has already been "seen" in the omitted line p q r. You will have to clarify the required behaviour in this situation
#t = qw(a b c d e + g h + j k m n l + h +);
#q = qw(a b c d e f g h i j k l m l j h h);
#s = qw(a b c d e f g h k j k l m l j h h);
foreach (0..$#q){
if($t[$_] eq ($q[$_] && $s[$_])){
print "$t[$_]";
}
print "$t[$_]-$t[$_]\n";
elsif($t[$_] eq '+' && $q[$_] eq $s[$_]){
print"$t[$_]";
}
else{
print "\n";
}
}
Expected Output:
abcde+gh [1-8]
jk [10-11]
l+h+ [14-17]
Here #t based on matching both #q and #s, and print the intervals also based on #t.
I am not able to get an intervals as mismatching. please give me a good solution
Your code has an syntax error you introduced with your 4th edit. You can't put any code outside an if's block and its elseif. If I understood it right you wanted to know when the arrays #q, #s and #t line up, where #t is allowed to have '+' as a wildcard.
Here is one solution. It uses a $start variable to check if we are inside an interval and stores the beginning. If we are at the end of an interval or the arrays. We print the interval lengths. There are probably nicer ways to format this. The best would be to introduce more complex ad-hoc objects. The code would be much easier if you were't interested in the indices of the beginning and end of the intervals.
For the test: I restructured it a bit. Furthermore if you already know that $q[$_] eq $s[$_] you won't have to check both $t[$_] eq $s[$_] and $t[$_] eq $q[$_]. You don't have to make that check at all if $t[$_] eq "+"
#!/usr/bin/env perl
use strict; # These aren't optional!
use warnings; # Always use them!
use 5.01; # for the // operator and say
my #t = qw(a b c d e + g h + j k m n l + h +);
my #q = qw(a b c d e f g h i j k l m l j h h);
my #s = qw(a b c d e f g h k j k l m l j h h);
my ($start);
sub print_interval{
my $end = shift;
printf((' 'x(8+$start-$end)). # inserting the whitespaces
"[%2d-%-2d]\n", $start, $end);
}
foreach (0..$#q){
my ($te, $qe, $se) = ($t[$_], $q[$_], $s[$_]); # just shorthands
if($qe eq $se && ($te eq "+" || $te eq $qe)){
$start //= $_; # if not set, set it to the current index
print $te;
}elsif (defined $start){
print_interval($_-1);
undef $start;
}
}
if (defined $start){
# if we are still in an interval at the end,
# we'll have to print that too.
print_interval($#q)
}
If you're uncomfortable with the definedness checks, you also can set $start to -1 and check 0 <= $start.
Here is a solution that uses intermediate objects and saves the results in an array, this makes for nicer formatting and the code is structured better:
# … strict, warnings, array declarations
my ($res,#results);
foreach (0..$#q){
my ($te, $qe, $se) = ($t[$_], $q[$_], $s[$_]);
if($qe eq $se && ($te eq "+" || $te eq $qe)){
$res = {start => $_, string => ''} unless defined $res;
$res->{string} .= $te;
}elsif (defined $res){
$res->{end} = $_-1;
push #results, $res;
undef $res;
}
}
if (defined $res){ # still in interval
$res->{end} = $#q;
push #results, $res;
}
printf "%-9s[%2d-%-2d]\n", #{$_}{qw|string start end|} for #results;
#!/usr/bin/perl
use strict;
use warnings;
my #t = qw(a b c d e + g h + j k m n l + h +);
my #q = qw(a b c d e f g h i j k l m l j h h);
my #s = qw(a b c d e f g h k j k l m l j h h);
my #current_interval = (); #will store the interval we are currently working on
my #intervals = (); #keeps track of all those intervals
for(0 .. $#t){
if($q[$_] eq $s[$_] and ($q[$_] eq $t[$_] or $t[$_] eq '+')){
push(#current_interval, $_);
}
else{
if(#current_interval){
push(#intervals, [$current_interval[0], $current_interval[$#current_interval]]);
#current_interval = ();
}
}
}
#when exiting the loop we dont want to lose our current interval!
if(#current_interval){
push(#intervals, [$current_interval[0], $current_interval[$#current_interval]]);}
#print intervals
for (#intervals){
my #c = #{$_};
print $c[0],"\t",$c[1],"\n";
}
I got the intervals for you.
Please note that I added "use strict; use warnings" - before adding this solution to your project.
Greetings Tim
I'm writing a piece of code that creates a HoAs and loops through for each key. The snippet below shows a basic example of the problem I'm having.
#!/usr/bin/perl
use strict;
use warnings;
my #array1 = qw (1 1 3 4 5); # Note that '1' appears twice
my #array2 = qw (a b c d e);
my #array3 = qw (6 7 8 9 10);
my #array4 = qw (f g h i j);
my %hash;
push #{$hash{$array1[$_]}}, [ $array2[$_], $array3[$_], $array4[$_] ] for 0 .. $#array1;
for my $key (sort keys %hash) {
my ($array2, $array3, $array4) = #{$hash{$key}[-1]};
print "[$key] $array2\t$array3\t$array4\n";
}
Output:
[1] b 7 g
[3] c 8 h
[4] d 9 i
[5] e 10 j
For the data I'm actually using (as opposed to this example) I have been using a key that I've just realised isn't unique, so, as above I end up overriding non-uniqe keys. I'm mainly using these values as keys in order to sort by them later.
My question is either:
A) I can perform the above task for each key unless (exists $hash{$array1}) in which case I can modify it
or
B) Is there a way to sort by those values, in which case I could use another, non-redundant key.
Thanks!
so, as above I end up overriding non-uniqe keys
You aren't. Let's print out the whole contents of that hash:
for my $key (sort { $a <=> $b } keys %hash) { # sort numerically!
for my $array (#{ $hash{$key} }) { # loop over all instead of $hash{$key}[-1] only
say "[$key] " . join "\t", #$array;
}
}
Output:
[1] a 6 f
[1] b 7 g
[3] c 8 h
[4] d 9 i
[5] e 10 j
You would be overriding the values if you were building the hash like
$hash{$array1[$_]} = [ $array2[$_], $array3[$_], $array4[$_] ] for 0 .. $#array1;
(And printing it as)
for my $key ( ... ) {
say "[$key] " . join "\t", #{ $hash{$key} };
}
That is, assigning instead of pushing.
If you want to keep the first value assigned to each key, use the //= operator (assign if undef).