perl better method for counting each array element occurance

perl better method for counting each array element occurance - perl

I have 2 arrays, one is the root array, which contains huge mount of elements. Another is the tested array, all of which elements is a subset of root array. I want to construct a new array, with size equal to root arrays, and its element value at a specific position represents the count of that element in the tested array.
Below codes works well when 2 array size is small. But the practical problem I meet is that, the root arrays has about 15000 elements, and there are about 14000 tested arrays.
I wonder to have better algorithm. Do you guys have some suggestions?
my #root=qw(1 2 3 4 5 6 7 8 10);
my #aa=qw(1 1 2 3);
my #count;
foreach my $eleroot(#root){
my $mathnum=0;
my ($i) = grep { $root[$_] ~~ $eleroot } 0 .. $#root;
foreach my $eleaa(#aa){
if ($eleroot==$eleaa){
$mathnum++;
}
}
$count[$i]=$mathnum;
}
print #count;

A better algorithm would be to use a hash to keep counts. For your sample arrays, it would look like this. (And will run considerably faster than your solution).
#!/usr/bin/perl
use strict;
use warnings;
my #root=qw(1 2 3 4 5 6 7 8 10);
my #aa=qw(1 1 2 3);
print join("\t", #root), "\n";
my %seen;
for my $data (#aa) {
$seen{$data}++;
}
print join("\t", map {$_ // '0'} #seen{#root}), "\n";
Output is:
1 2 3 4 5 6 7 8 10
2 1 1 0 0 0 0 0 0
#seen{#root} is a hash slice keyed by the #root array. If no item was found for any of the #root elements, the map supplies a zero.
The map {$_ // '0'} portion is saying pass to join a count if $_ has a count otherwise pass a zero.

Related

How to create a numeric ruler and evenly spaced numeric width markers

I am new to Perl. An exercise, where I am to create a numeric ruler from which, I size columns for data at 20 characters-width, is proving a little difficult to complete. So far, I have,
printf “%10d” x 5, (1..6);
#ruler = (1..10) x 7;
Print #ruler, “\n”;
It should look something like,
1 2 3 4
1234567890123456789012345678901234567890
What I get for the top row of numbers is an error, ‘Redundant argument in printf at <script.pl> line #; the bottom row produces numbers from 1 to 10, as it ought with the range operator, but I would like it to produce 1 to 9 with a zero on the end. I did think to start the range from 0, but I haven’t figured out how to remove the first index and only the first index.
I would be grateful for your guidance with both issues.

The warning is due to the fact that you pass 6 numbers to printf, but the format only requires 5.
To me,
1 2 3 4
1234567890123456789012345678901234567890
reads as
11, 12, 13, ..., 19, 10, 21, 22, 23, ...
Why does it start with 11? Why is 10 between 19 and 21?
The following makes more sense:
1 2 3 4
01234567890123456789012345678901234567890 0-based
and
1 2 3 4
1234567890123456789012345678901234567890 1-based
I'm not going to give the solution outright.
If you want the numbers 1 to 9 and 0, that would be 1..9, 0.
%10d will add padding on the left. %-10d will add padding on the right.
Nothing says you can't prefix the output with something that doesn't repeat, like a zero or a space.

Provided desired output starts count from 11 instead 1 -- it doesn't look right.
Perhaps OP intended to start count from 1 until some $max value with placing a digit representing tens above main counter.
Please study following code sample for compliance with your requirements.
use strict;
use warnings;
use feature 'say';
my $max = shift || 45;
rule($max);
sub rule {
my $max = shift;
my($a,$b);
$a .= ' ' x 9 . $_ for 1..$max/10;
$b .= $_ % 10 for 1..$max;
say $a . "\n" . $b;
}
Output
1 2 3 4
123456789012345678901234567890123456789012345
Original OP's code requires slight modification to achieve desired output
use strict;
use warnings;
use feature 'say';
my $max = shift || 45;
printf "%10d" x int($max/10) . "\n", (1..$max/10);
print $_ % 10 for 1..$max;
print "\n";

Using unix tools(without resorting to R), how to filter the rows based on some statistics of the grouping?

I'd like to filter rows of the following table such that it goes from this :
A 1 3 SOME_OTHER_INFO
A 1 4 SOME_OTHER_INFO2
A 2 5 SOME_OTHER_INFO3
B 1 1 SOME_OTHER_INFO4
B 2 3 SOME_OTHER_INFO4
B 2 0 SOME_OTHER_INFO4
to that:
A 1 3 SOME_OTHER_INFO
A 2 5 SOME_OTHER_INFO3
B 1 1 SOME_OTHER_INFO4
B 2 0 SOME_OTHER_INFO4
Filtering criteria is this:
1) based on the first 2 columns, group rows.
2) Then for each group, select the row where the third column is minimum within group.
3) Return.
Now it's easy to do something like this in R using the package such as plyr using commands like this:
ddply(data, .(first_col, second_col), function(x) {
min_idx = which.min(x$third_col);
return(x[min_idx])
})
But I want to know if there is a efficient & elegant way to do this using unix tools on a command line.
Lastly, I almost found the beautiful solution to this using datamash, which is a recently tool released in GNU, but with some glitches.
$ datamash -g 1,2 min 3 -f < file.txt | cut -f1-4
A 1 3 SOME_OTHER_INFO1
A 2 5 SOME_OTHER_INFO3
B 1 1 SOME_OTHER_INFO4
B 2 3 SOME_OTHER_INFO4 # <-- not the correct row I want to grab
The problem was when using "-f" flag, it grabs the first item from each group, not the row that min corresponds to. So if you look at the output above "B 2 3 SOME_OTHER_INFO4" was selected rather than "B 2 0 SOME_OTHER_INFO4".

Here are couple of more options using perl:
perl -MList::Util=min -lane'
$h{"#F[0,1]"}{$F[2]} = $_
}{
print $h{$_}{ min keys %{$h{$_}} } for sort keys %h
' file
A 1 3 SOME_OTHER_INFO
A 2 5 SOME_OTHER_INFO3
B 1 1 SOME_OTHER_INFO4
B 2 0 SOME_OTHER_INFO4
Create a hash of hash having inner key as the first two columns and outer key as the third column.
Using the core module min method grab the smallest outer key and print the value which is the entire line.
or without the core module:
perl -lane'
push #{ $h{"#F[0,1]"} }, [$F[2], $_]
}{
print $_->[1] for sort map {
(sort { $a->[0] <=> $b->[0] } #$_)[0]
} values %h
' file
A 1 3 SOME_OTHER_INFO
A 2 5 SOME_OTHER_INFO3
B 1 1 SOME_OTHER_INFO4
B 2 0 SOME_OTHER_INFO4
Create a hash of arrays using key as first two columns and the value as array of third column and the entire line.
Pull the hash entries by values and sort based on the first element of the array.
Using slice just grab the first smallest entry and print the second element which is the entire line.

Dunno what you call efficient or elegant, but this seems to be what you want:
sort -k1 -k2,3n file.txt | rev | uniq -f 2 | rev
If the double rev is considered inelegant (or the actual number of columns varies, in which case it won't work),
sort -k1 -k2,3n file.txt | perl -wane'print if $.==1 || $F[0] ne $last[0] || $F[1] != $last[1]; #last=#F'

Provided you can get the lines sorted in the right order, a simple Awk filter which prints only the first line in a group should work.
sort -k1 -k2n -k3n file.txt |
awk '!a[$1 $2]++'
The Awk script populates an array a with keys from the first two fields, and prints only when it sees a new key.

How about an elegant, efficient regex?
perl -pi'.old' -0777 -e 's/^([a-z]\t[0-9]\t)([0-9]\t\w+\s*)^(\g1[0-9]\t\w+\s*){1,}/$1$2/smgi' file.txt
Slurps in file.txt and replaces consecutive lines where the first two columns are identical with the first occurrence of the line. This version modifies file.txt in place but backs up the original file as file.txt.old.

how to pop the last three element of an array

I have a certain array from which I want to remove the last three element (constant number, because I want a list of subdirectory which are in a directory where there is 3 files).
Is there a better and nicer solution than this one ?
my #list = (`ls --group-directories-first`);
pop #list;
pop #list;
pop #list;

You can use splice:
splice #list, -3;
Where -3 denotes an offset of 3 from the end. This will remove elements from that offset and onward.
It's worth noting that parsing output from ls is a horrible idea. You can do the exact same thing with Perl code:
my #list = grep -d, glob "*";
The -d is a file check, which checks if the current argument is a directory.

splice is also the solution I'd advise.
However, since you're working with the end of an array and just wanting to destroy those trailing elements instead of saving them to another structure, you could also just edit the last array index:
$#list -= 3;
The above is reduces the size of the #list array by 3 elements. Here's another example:
my #a = (1..10);
$#a -= 4;
print "#a\n";
# Prints: 1 2 3 4 5 6

Sorting lines according to the numerical value of the first element on each line?

I have just started learning Perl, hence my question might seem very silly. I apologize in advance.
I have a list say #data which contains a list of lines read from the input. The lines contain numbers that are separated by (unknown number of) spaces.
Now, I would like to sort them and print them out, but not in the lexicographical order but according to the numerical value of the first number appearing on the line.
I know this must be something very simple but I cannot figure out how to do it?
Thanks in advance,

You can use a Schwartzian transform, capturing the first number in the row with a regex
use strict;
use warnings;
my #sorted = map $_->[0],
sort { $a->[1] <=> $b->[1] }
map { [ $_, /^(-?[\d.]+)/ ] } <DATA>;
print #sorted;
__DATA__
21 13 14
0 1 2
32 0 4
11 2 3
1 3 3
Output:
0 1 2
1 3 3
11 2 3
21 13 14
32 0 4
Reading the transform from behind, the <DATA> is the file handle we use, it will return a list of the lines in the file. The first map statement returns an array reference [ ... ], that contains the original line, plus the first number that is captured in the line. Alternatively, you can use the regex /^(\S+)/ here, to just capture whatever non-whitespace that comes first. The sort uses this captured number inside the array ref when comparing lines. And finally, the last map converts the array ref back to the original value, stored in $_->[0].
Be aware that this relies on the lines having a number at the start of the line. If that can be missing, or blank, this will have some unforeseen consequences.
Note that only using a simple numerical sort will also "work", because Perl will convert one of your lines to the correct number, assuming each line begins with a number followed by space. You will have some warnings about that, such as Argument "21 13 14\n" isn't numeric in sort. For example, if I replace my code above with
my #foo = sort { $a <=> $b } <DATA>;
I will get the output:
Argument "21 13 14\n" isn't numeric in sort at foo.pl line 6, <DATA> line 5.
Argument "0 1 2\n" isn't numeric in sort at foo.pl line 6, <DATA> line 5.
Argument "32 0 4\n" isn't numeric in sort at foo.pl line 6, <DATA> line 5.
Argument "11 2 3\n" isn't numeric in sort at foo.pl line 6, <DATA> line 5.
Argument "1 3 3\n" isn't numeric in sort at foo.pl line 6, <DATA> line 5.
0 1 2
1 3 3
11 2 3
21 13 14
32 0 4
But as you can see, it has sorted correctly. I would not advice this solution, but it is a nice demonstration in this context, I think.

You can use the sort function :
#sorted_data = sort(#data);

print the number which is present in documentA but absent in documentB

I've a problem in making a Perl program for matching the numbers in two documents. Let's say there are documents A and B.
So I want to have the numbers which is present in document A and absent in the document B.
Example 1:
DocA: 1 2 3 5 6 8 9 10 11 12 13
DocB: 1 2 3 6 7 8 9 10 11
output:
5 12 13
EDITED:
#a=qw( 1 2 3 5 6 8 9 10 11 12 13);
#b=qw( 1 2 3 4 5 6 7 8 9 10 11);
#new=();
#new1=();
for($i=0;$i<=$#a;$i++)
{
for($j=0;$j<=$#b;$j++)
{
if($a[$i] ne $b[$j])
{
push(#new,$b[$j]);
}
}
}

You could use the CPAN module Array::Utils. The following will do what you need:
use Array::Utils qw(:all);
my #a = qw( 1 2 3 5 6 8 9 10 11 12 13);
my #b = qw( 1 2 3 4 5 6 7 8 9 10 11);
my #diff = array_minus(#a, #b);`
By the way, the reason your program does not work is because you made a logic error. You are adding a value to #new EVERY TIME the value does not match. So, in the first iteration of the loop you compare the a value to the b value. Even though the value is equal to the first element of #b, it is not equal to the other ten elements of #b and hence all of these elements are added to #new. I have rewritten your loop. Note the more Perlish loops instead of the C-loop you used in your code.
my #a = qw( 1 2 3 5 6 8 9 10 11 12 13);
my #b = qw( 1 2 3 4 6 7 8 9 10 11);
my #new = ();
for my $a_value (#a) {
my $b_not_in_a = 1;
INNER: for my $b_value (#b)
{ if($a_value == $b_value) {
$b_not_in_a = 0;
last INNER; }
}
if ($b_not_in_a)
{
push(#new,$a_value);
}
}

Consider Algorithm::Diff ?

You could use a hash. Read DocA and initialize a hash using the read numbers as keys:
open(INPUT, "DocA");
while (<INPUT>)
{
chomp;
$myhash{$_} = 1;
}
Then read DocB and foreach number, check if already defined in your hash:
open(INPUT, "DocB");
while (<INPUT>)
{
chomp;
if (not defined $myhash{$_})
{
print "$_\n";
}
}
This code asumes you have a number per line. If your files are formatted differently, you will need to adapt it.
This code will work even if your numbers aren't ordered.

The lists in your example are sorted. I am assuming that they are, and that you're not allowed to use modules since it is homework. Also, since it is homework, I won't give the answer, but some hints in the right direction.
If you would do this by hand, and you are only allowed to look at the front of each row, how would you do it? If the head of A is a number smaller than B, what does that mean? If it is equal, what does that mean? If it is larger, what does that mean?
Now you know how you can handle one situation, from that you can create some kind of step to reduce the problem. Now define when you need to stop, and what the possible leftovers of the lists are at that point, and how you can get your answer from the values you collected in the step, and the remainder after you stop.
Some examples of extreme cases:
#a = qw();
#b = qw(1 2 3);
#a = qw (1 2 3);
#b = qw (4 5 6);
#a = qw(1 3 5);
#b = qw(2 4 6)
Good luck!

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

perl better method for counting each array element occurance - perl

Related

How to create a numeric ruler and evenly spaced numeric width markers

Using unix tools(without resorting to R), how to filter the rows based on some statistics of the grouping?

how to pop the last three element of an array

Sorting lines according to the numerical value of the first element on each line?

print the number which is present in documentA but absent in documentB

Categories

Resources