How to find the index of the smallest element in awk like this? [closed] - perl

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center.
Closed 10 years ago.
Input file
Cat|Dog|Dragon -40|1000|-20
K|B|L|D|E -9|1|-100|-8|9
Output file:
Dragon 20
B 1
The workflow is like this: In column2, find the index of the smallest absolute value, then fetch element in column1 using this index. Does anyone have ideas about this?

Using my incredible powers of perception, I detect a hint that this is not precisely an operational problem. Could it be Homework?
{
split($1, catdog, "|")
split($2, numbers, "|")
smallest = -1
for(i in numbers) {
a = numbers[i]
if(a < 0)
a = -a
if(smallest == -1 || a < smallest) {
smallest = a
j = i
}
}
printf("%-9s %2d\n", catdog[j], smallest)
}

The following awk command should work:
awk '
function abs(value)
{
return (value<0?-value:value)
}
{
len=split($2,arr,"|")
min=abs(arr[1])
minI=1
for(i=1;i<=len;i++){
if(abs(arr[i])<min){
min=abs(arr[i])
minI=i
}
}
split($1,arr2,"|")
print(arr2[minI],min)
}' file
Output:
Dragon 20
B 1

perl -lnwe '($k,$v) = map [split /\|/], split;
my %a;
#a{#$k} = map abs, #$v;
print "$_\t$a{$_}" for
(sort { $a{$a} <=> $a{$b} } keys %a)[0];
' input.txt
Output:
Dragon 20
B 1
Explanation:
The command line switches:
-l handle line endings, for convenience
-n read input from argument file name or stdin
The code:
The rightmost split splits each line on whitespace. We split those fields again on pipe | and put the result in an array ref [ ... ] so they fit inside a scalar variable ($k and $v). Then we declare a lexical hash %a to hold our data for each new input line. We need this declaration to avoid values from one line leaking over into the next line. We then assign via a hash slice the keys from $k to the absolute values in $v. This is the same principle as:
#foo{'a', 'b', 'c'} = (1, 2, 3); # %foo = ( a => 1, b => 2, c => 3);
Then we sort the hash on the values, take the first value with a subscript [0] and print out the corresponding key and value separated by a tab.

Related

Writing a custom base64 encoding function in perl

I'm trying to learn perl by writing a custom base64 encoding function, unfortunately I've had no success by now. What I've come to is the following, which doesn't work and I unfortunately don't have any clue about how to proceed.
sub base64($) {
# Split string into single bits
my $bitstring = unpack("B*", $_[0]);
# Pack bits in pieces of six bits at a time
my #splitsixs = unpack("(A6)*", $bitstring);
my #enc = ("A".."Z", "a".."z", "0".."9", "+", "/");
# For each piece of six bits, convert them to integer, and take the corresponding place in #enc.
my #s = map { $enc[pack("B6", $_)] } #splitsixs;
join "", #s;
}
Can someone explain to me what am I doing wrong in this conversion? (Please leave aside for now the fact that I'm not considering padding)
I finally made it! I was erroneously trying to indexing elements in $enc directly via packed bytes, while I should convert them first into integers.
You can see this in the lines below.
I copy the entire function, padding included, in the hope that it might be useful to others.
sub base64($) {
# Split string into single bits
my $bitstring = unpack("B*", $_[0]);
# Pack bits in pieces of six bits at a time
my #sixs = unpack("(A6)*", $bitstring);
# Compute the amount of zero padding necessary to obtain a 6-aligned bitstring
my $padding = ((6 - (length $sixs[-1]) % 6) % 6);
$sixs[-1] = join "", ($sixs[-1], "0" x $padding);
# Array of mapping from pieces to encodings
my #enc = ("A".."Z", "a".."z", "0".."9", "+", "/");
# Unpack bit strings into integers
#sixs = map { unpack("c", pack("b6", join "", reverse(split "", $_))) } #sixs;
# For each integer take the corresponding place in #enc.
my #s = map { $enc[$_] } #sixs;
# Concatenate string adding necessary padding
join "", (#s, "=" x ($padding / 2));
}

perl6 Permutation match

I am still trying to work on permutation match, and I wonder if anyone has better way to do it. I want to match all patterns in an array in any order, i.e., match permutations of items (string or other objects) in an array. E.g., if array is (1,2,3), then it is true if a string contains 1 and 2 and 3 in any order; i.e, true if a string contains permutation of (1,2,3).
What I have now is this:
my #x = < one eins uno yi two zwei dos er one one one two two two >;
my #z = < one eins uno yi two zwei dos er one one one two two two foo >;
my $y = "xxx one eins uno yi two zwei dos er xxx";
sub matchAllWords($aString, #anArray) {
my $arraySize = #anArray.elems;
if $arraySize == 0 { False; }
elsif $arraySize == 1 {
($aString.match(/:i "#anArray[0]" /)).Bool;
} else {
my $firstCheck = ($aString.match(/:i "#anArray[0]"/)).Bool;
if $firstCheck {
$firstCheck
and
(matchAllWords($aString, #anArray[1..*]));
} else {
return False;
}
}
}
say matchAllWords($y, #x);
# result is True, but it should NOT be True because $y should not
# match permutations of #x which contains multiple identical elements
# of "one" and "two"
say matchAllWords($y, #z); # False as expected;
The problems is that my function matches all unique words in the array, but is unable to differentiate permutations of duplicate words. I can add more and more codes to tell if a word has been matched, but more codes to accomplish a simple idea, "permutation match", is un-perl-ly. Any suggestions? Thanks
New answer
Based on everyone's comments, here's a restatement of the problem as I now understand it, followed by a new solution:
Test that Y, a string, contains all of the strings in Z, a Bag (multiset) of strings, with correct copy count / multiplicity.
my \Z = < one eins uno yi two zwei dos er two > .Bag ;
my \Y = "xxx one eins uno yi two zwei dos er two xxx" ;
sub string-matches-bag ($string, $bag) {
for $bag.kv -> $sub-string, $copy-count {
fail unless ($string ~~ m:g/ $sub-string /).elems == $copy-count
}
True
}
say string-matches-bag Y, Z
Old answer
say so $y.words.all eq #z.any
An explanation for this line of code is in the last part of this answer.
I found your question pretty confusing. But I'm hopeful this answer is either what you want or at least enough to move things in the right direction.
I found your data confusing. There are two 'xxx' words in your $y but none in either array. So that bit can't match. There's a 'foo' in your #z. Was that supposed to be 'xxx'? There's a 'one' in your $y but both arrays have at least two 'one's. Is that an issue?
I found your narrative confusing too.
For this answer I've assumed that #z has a xxx at the end, and that the key comment is:
a simple idea, "permutation match"
say so $y.words.all eq #z.any
so returns the boolean evaluation (True or False) of the expression on its right.
The expression on so's right uses Junctions. An English prose summary of it is 'all of the "words" in $y, taken one at a time, are string equal to at least one element of #z'.
Is this the simple solution you're asking for?

Find the top 5 closest rather than just the closest?

How would you find 5 numbers in a column of numbers that are the closest to a $VariableNumber?
For example, if the $VariableNumber = 30 then:
Example input file:
50
100
70
40
20
10
65
41
92
Example output:
20
40
41
10
50
there's an answer that someone’s posted elsewhere before that finds the closest number match in a specific column on a specific line to a given value, which goes as follows:
awk -v col_num="3" -v value="$Number" '
func abs(x) { return (x<0) ? -x : x }
{
distance = abs($col_num - value)
}
NR==1 || distance<shortest_distance {
shortest_distance = distance
nearest_value = $col_num
}
END {
print nearest_value
}
'
But I haven't been able to adapt it
I'd just sort the entries by distance to n and select the first ones like:
awk -v n=30 '
function abs(x) {return x < 0 ? -x : x}
{print abs($0 - n) "\t" $0}' < file |
sort -n |
head -n 5 |
cut -f 2-
As usual, Stéphane’s answer is very good;
simple and straightforward. 
But, if you really really want to do it entirely in awk,
and you have GNU awk (a.k.a. gawk), you can do this:
awk -v t="$VariableNumber" '
{
d = $1 - t
if (d < 0) d = -d
e = d "#" $1
if (NR <= 5) {
a[NR] = e
} else {
a[5+1] = e
asort(a, a, "#val_num_asc")
delete a[5+1]
}
}
END {
print "---"
if (NR <= 5) asort(a, a, "#val_num_asc")
for (i in a) { gsub(".*#", "", a[i]); print a[i]; }
}
'
For each input value,
this computes d as the absolute difference
between that value and the target value,
t (which is set on the command line to the value of $VariableNumber,
which, as per your example, might be 30). 
It then constructs an array entry, e,
consisting of the difference,
concatenated with a # and the original number. 
This array entry is then added to the array a. 
The first five input values are simply put into array elements 1 through 5.
After that, each number is appended to the array
by being put into element 6.  Then the array is sorted. 
Since the array entries start with the difference value,
numbers that are close to the target
(for which the difference value is low)
are sorted to the beginning of the array,
and numbers that are far from the target
are sorted to the end of the array. 
(Specify "#val_num_asc"
to sort the values as numbers rather than strings. 
Without this, differences of 10 and 20 will sort below 3 and 4.) 
Then the 6th element (the one that is farthest from the target) is deleted.
Finally (upon reaching the END of the data), we
Check whether the number of records is ≤ 5. 
If it is, sort the array,
because it is still in the order of the input data. 
(Arguably, this step is optional.)
For each element is the array,
strip off the difference and the #
by searching for the regular expression .*#
and substituting (gsub) nothing. 
Then print the original value.
Obviously, if you want to look at a column other the first one,
you can change all the occurrences of $1 in the script. 
(The script you show in your question
demonstrates how to allow the column number to be specified at run time.) 
And, if you want some number other than the closest five,
just change all appearances of 5. 
(I could have referred to a[6] in lines 9 and 11;
I wrote a[5+1] to facilitate simple-minded parameterization.)
Another, for all awks (tested with gawk, mawk, Debian's original-awk and Busybox awk):
$ awk -v v=30 -v n=5 ' # v is the central value,
function abs(d) { # n is the number of wanted values
return (d<0?-d:d) # d is distance, c array index, va value array
} # da distance array, max is greatest of smallest
((d=abs(v-$1))<max) && c==n { # if we find distance < max of top n smallest
da[maxc]=d # replace max in distance array
va[maxc]=$1 # and in value array
max=0 # set max to min to find new max distance
for(ct in da)
if(da[ct]>max) { # find the new max in the top n smallest
max=da[ct]
maxc=ct
}
if(max==0) # if max is 0, all are 0s so might as well exit
exit
next
}
c<n { # we need n first distances
da[++c]=d # fill distance array with them
va[c]=$1 # related values to value array
if(d>max) { # look for max
max=d
maxc=c
}
}
END { # in the end or exit
for(c in va) # get all values in value array
print va[c] # and output them
}' file
Output (in no particular order, array implementation related):
50
10
41
40
20
Execution time is linear, worst case is size of value array times record count so still linear (right? :).

Printing a 2500 x 2500 dimensional matrix using Perl

I am very new to Perl. Recently I wrote a code to calculate the coefficient of correlation between the atoms between two structures. This is a brief summary of my program.
for($i=1;$i<=2500;$i++)
{
for($j=1;$j<=2500;$j++)
{
calculate the correlation (Cij);
print $Cij;
}
}
This program prints all the correlations serially in a single column. But I need to print the correlations in the form of a matrix, something like..
Atom1 Atom2 Atom3 Atom4
Atom1 0.5 -0.1 0.6 0.8
Atom2 0.1 0.2 0.3 -0.5
Atom3 -0.8 0.9 1.0 0.0
Atom4 0.3 1.0 0.8 -0.8
I don't know, how it can be done. Please help me with a solution or suggest me how to do it !
Simple issue you're having. You need to print a NL after you finish printing a row. However, while i have your attention, I'll prattle on.
You should store your data in a matrix using references. This way, the way you store your data matches the concept of your data:
my #atoms; # Storing the data in here
my $i = 300;
my $j = 400;
my $value = ...; # Calculating what the value should be at column 300, row 400.
# Any one of these will work. Pick one:
my $atoms[$i][$j] = $value; # Looks just like a matrix!
my $atoms[$i]->[$j] = $value; # Reminds you this isn't really a matrix.
my ${$atoms[$1]}[$j] = $value; # Now this just looks ridiculous, but is technically correct.
My preference is the second way. It's just a light reminder that this isn't actually a matrix. Instead it's an array of my rows, and each row points to another array that holds the column data for that particular row. The syntax is still pretty clean although not quite as clean as the first way.
Now, let's get back to your problem:
my #atoms; # I'll store the calculated values here
....
my $atoms[$i]->[$j] = ... # calculated value for row $i column $j
....
# And not to print out my matrix
for my $i (0..$#atoms) {
for my $j (0..$#{ $atoms[$i] } ) {
printf "%4.2f ", $atoms[$i]->[$j]; # Notice no "\n".
}
print "\n"; # Print the NL once you finish a row
}
Notice I use for my $i (0..$#atoms). This syntax is cleaner than the C style three part for which is being discouraged. (Python doesn't have it, and I don't know it will be supported in Perl 6). This is very easy to understand: I'm incrementing through my array. I also use $#atom which is the length of my #atoms array -- or the number of rows in my Matrix. This way, as my matrix size changes, I don't have to edit my program.
The columns [$j] is a bit tricker. $atom[$i] is a reference to an array that contains my column data for row $i, and doesn't really represent a row of data directly. (This is why I like $atoms[$i]->[$j] instead of $atoms[$i][$j]. It gives me this subtle reminder.) To get the actual array that contains my column data for row $i, I need to dereference it. Thus, the actual column values are stored in row $i in the array array #{$atoms[$i]}.
To get the last entry in an array, you replace the # sigil with $#, so the last index in my
array is $#{ $atoms[$i] }.
Oh, another thing because this isn't a true matrix: Each row could have a different numbers of entries. You can't have that with a real matrix. This makes using an Array of Arrays in Perl a bit more powerful, and a bit more dangerous. If you need a consistent number of columns, you have to manually check for that. A true matrix would automatically create the required columns based upon the largest $j value.
Disclaimer: Pseudo Code, you might have to take care of special cases and especially the headers yourself.
for($i=1;$i<=2500;$i++)
{
print "\n"; # linebreak here.
for($j=1;$j<=2500;$j++)
{
calculate the correlation (Cij);
printf "\t%4f",$Cij; # print a tab followed by your float giving it 4
# spaces of room. But no linebreak here.
}
}
This is of course a very crude and quick and dirty solution. But if you save the output into a .csv file, most csv-able spreadsheet programs (OpenOfice) should easily be able to read it into a proper table. If the spreadsheet viewer of your choice can not understand tabs as delimeter, you could easily add ; or / or whatever it can use into the printf string.

Trouble using 'while' loop to evaluate multiple lines, Perl

Thank you in advance for indulging an amateur Perl question. I'm extracting some data from a large, unformatted text file, and am having trouble combining the use of a 'while' loop and regular expression matching over multiple lines.
First, a sample of the data:
01-034575 18/12/2007 258,750.00 11,559.00 36 -2 0 6 -3 2 -2 0 2 1 -1 3 0 5 15
-13 -44 -74 -104 -134 -165 -196 -226 -257 -287 -318 -349 -377 -408 -438
-469 -510 -541 -572 -602 -633 -663
Atraso Promedio ---> 0.94
The first sequence, XX-XXXXXX is a loan ID number. The date and the following two numbers aren't important. '36' is the number of payments. The following sequence of positive and negative numbers represent how late/early this client was for this loan at each of the 36 payment periods. The '0.94' following 'Atraso Promedio' is the bank's calculation for average delay. The problem is it's wrong, since they substitute all negative (i.e. early) payments in the series with zeros, effectively over-stating how risky a client is. I need to write a program that extracts ID and number of payments, and then dynamically calculates a multi-line average delay.
Here's what I have so far:
#Create an output file
open(OUT, ">out.csv");
print OUT "Loan_ID,Atraso_promedio,Atraso_alt,N_payments,\n";
open(MYINPUTFILE, "<DATA.txt");
while(<MYINPUTFILE>){
chomp($_);
if($ID_select != 1 && m/(\d{2}\-\d{6})/){$Loan_ID = $1, $ID_select = 1}
if($ID_select == 1 && m/\d{1,2},\d{1,3}\.00\s+\d{1,2},\d{1,3}\.00\s+(\d{1,2})/) {$N_payments = $1, $Payment_find = 1};
if($Payment_find == 1 && $ID_select == 1){
while(m/\s{2,}(\-?\d{1,3})/g){
$N++;
$SUM = $SUM + $1;
print OUT "$Loan_ID,$1\n"; #THIS SHOWS ME WHAT NUMBERS THE CODE IS GRABBING. ACTUAL OUTPUT WILL BE WRITTEN BELOW
print $Loan_ID,"\n";
}
if(m/---> *(\d*.\d*)/){$Atraso = $1, $Atraso_select = 1}
if($ID_select == 1 && $Payment_find == 1 && $Atraso_select == 1){
...
There's more, but the while loop is where the program is breaking down. The problem is with the pattern modifier, 'g,' which performs a global search of the string. This makes the program grab numbers that I don't want, such as the '1' in loan ID and the '36' for the number of payments. I need the while loop to start from wherever the previous line in the code left off, which should be right after it has identified the number of loans. I've tried every pattern modifier that I've been able to look up, and only 'g' keeps me out of an infinite loop. I need the while loop to go to the end of the line, then start on the next one without combing over the parts of the string already fed through the program.
Thoughts? Does this make sense? Would be immensely grateful for any help you can offer. This work is pro-bono, unpaid: just trying to help out some friends in a micro-lending institution conduct a risk analysis.
Cheers,
Aaron
The problem is probably easier using split, for instance something like this:
use strict;
use warnings;
open DATA, "<DATA.txt" or die "$!";
my #payments;
my $numberOfPayments;
my $loanNumber;
while(<DATA>)
{
if(/\b\d{2}-\d{6}\b/)
{
($loanNumber, undef, undef, undef, $numberOfPayments, #payments) = split;
}
elsif(/Atraso Promedio/)
{
my (undef, undef, undef, $atrasoPromedio) = split;
# Calculate average of payments and print results
}
else
{
push(#payments, split);
}
}
If the data's clean enough, I might approach it by using split instead of regular expressions. The first line is identifiable if field[0] matches the form of a loan number and field[1] matches the format of a date; then the payment dates are an array slice of field[5..-1]. Similarly testing the first field of each line tells you where you are in the data.
Peter van her Heijden's answer is a nice simplification for a solution.
To answer the OP's question about getting the regexp to continue where it left off, see Perl operators - regexp-quote-like operators, specifically the section "Matching in list context" and the "\G assertion" section just after that.
Essentially, you can use m//gc along with the \G assertion to use regexps match where previous matches left off.
The example in the "\G assertion" section about lex-like scanners would seem to apply to this question.