Loop recursive through two files in perl - perl

Hi guys I have an issue to solve,
I have 2 files.
File A
col1,col2, value_total_to_put
File A 201843,12345,30
File B
col1,col2,col3, value_inputted, missing_value, value_max
201843,12345,447,4,0,4
201843,12345,448,0,0,4
201843,12345,449,0,0,2
201843,12345,450,4,0,4
201843,12345,451,2,0,2
201843,12345,455,4,0,4
201843,12345,457,0,0,4
201843,12345,899,10,0,10
201843,12345,334,0,1,1
201843,12345,364,0,1,1
201843,12345,71,0,2,2
201843,12345,260,0,2,2
201843,12345,321,0,2,2
201843,12345,328,0,2,2
201843,12345,371,0,2,2
201843,12345,385,0,2,2
201843,12345,426,0,2,2
201843,12345,444,0,2,2
201843,12345,31,4,6,10
201843,12345,360,2,87,99
201843,12345,373,4,95,99
201843,12345,472,4,95,99
201843,12345,475,4,95,99
201843,12345,430,0,99,99
201843,12345,453,0,99,99
201843,12345,463,0,99,99
201843,12345,482,0,99,99
201843,12345,484,0,99,99
My keys are col1 and col2 from both files and I am doing this way below and my loop is wrong because when I reach the EOF from File B my loop is stopped.
What I want is match File A and B with $col1 and $col2 and while the value_total_to_put is > 0 withdraw 1 in each loop and in value_inputted from File B when value_inputted is less than value_max. For withdraw from File A missing_value might be > 0.
For the result I will print when value_inputted is equal to value_max in other words the last value until reach value_max or value_total_to_put is 0.
while ( <FA> ){
chomp;
my($col1,$col2, $value_total_to_put) = split ",";
push #A, [$col1,$col2, $value_total_to_put];
}
my #B;
while ( <FB> ){
chomp;
my($col1,$col2,$col3, $value_inputted, $missing_value, $value_max) = split ",";
push #B, [$col1,$col2,$col3, $value_inputted, $missing_value, $value_max];
}
foreach my $line (#A){
my $idxl = #$line[0].",".#$line[1];
my $value_total_to_put = #$line[2];
while ($value_total_to_put > 0){
foreach my $row ( #B ){
if ( $idxr eq $idxl ){
my $idxr = #$row[0].",".#$row[1];
my $value_inputted = #$row[3];
my $value_max = #$row[5];
my $missing_value = #$row[4];
if ( ($value_inputted eq 0) and ($missing_value eq 0)){
#do_nothing
} elsif($value_inputted == $value_max){
#do_nothing
print join(",", $idxr, #$row[2],"Value_inputted: ".$value_inputted, "Missing_value: ".$missing_value, "Value_max:".$value_max, "Total: ".$value_total_to_put)."\n";
}else{
$value_inputted++;
$missing_value--;
$value_total_to_put--;
}
}
}
last if $value_total_to_put > 0;
}
}
The third file will be this way:
201843,12345,447,4,0,4
201843,12345,450,4,0,4
201843,12345,451,2,0,2
201843,12345,455,4,0,4
201843,12345,899,10,0,10
201843,12345,334,1,0,1
201843,12345,364,1,0,1
201843,12345,71,2,0,2
201843,12345,260,2,0,2
201843,12345,321,2,0,2
201843,12345,328,2,0,2
201843,12345,371,2,0,2
201843,12345,385,2,0,2
201843,12345,426,2,0,2
201843,12345,444,2,0,2
201843,12345,31,10,0,10
201843,12345,360,3,86,99
201843,12345,373,5,94,99
201843,12345,472,5,94,99
201843,12345,475,5,94,99
201843,12345,430,1,98,99
201843,12345,453,1,98,99

As explained by #Dave Cross, your code is quite hard to read (and misses use strictand use warnings), and your explanation of what you are trying to achieve is quite unclear...
One thing that caught my eye however, is that you start a loop with this statement...
while ($value_total_to_put > 0){
... but at the very end of that same block you do :
last if $value_total_to_put > 0;
}
This will effectively cause Perl to exit the loop after the first iteration, no matter the value of the $value_total_to_put variable. This is probably not what you want. Hence, as far as I understand, you should start your investigation by removing that last statement.

Related

Binary search—Can't use string "1" as a symbol ref while strict refs is in use

I've been browsing over the already answered questions regarding this error message.
I am trying to solve a problem from the Rosalind web site that looks for some indexes using a binary search.
When my subroutine finds the number it seems to ignore it, and if I try to print the $found variable, it gives me the error
Can't use string "1" as a symbol ref while strict refs is in use
The code is this
sub binarysearch
{
my $numbertolook = shift;
my #intarray=#_;
my $lengthint = scalar #intarray;
my #sorted = sort {$a <=> $b} #intarray;
#print $numbertolook, " " , #sorted, "\n";
my $low=0;
my $high=$lengthint-1;
my $found =undef;
my $midpoint;
while ($low<$high)
{
$midpoint=int(($low+$high)/2);
#print $midpoint, " ",$low," ", $high, " ", #sorted, "\n";
if ($numbertolook<$sorted[$midpoint])
{
$high=$midpoint;
}
elsif ($numbertolook>$sorted[$midpoint])
{
$low=$midpoint;
}
elsif ($numbertolook==$sorted[$midpoint])
{
$found=1;
print $found "\n";
last;
}
if ($low==$high-1 and $low==$midpoint)
{
if ($numbertolook==$sorted[$high])
{
$found=1;
print $found "\n";
last;
}
$low=$high;
}
}
return $found;
}
You want
print $found, "\n";
Or
print $found . "\n";
With no operator between $found and the newline, it thinks $found is the filehandle to print a newline to, and is getting an error because it isn't a filehandle.
I'll try to help
First of all, as simple as it may seem, a binary search is quite difficult to code correctly. The main reason is that it's a hotbed of off-by-one errors, which are so prevalent that they have their own Wikipedia page
The issue is that an array containing, say, the values A to Z will have 26 elements with indices 0 to 25. I think FORTRAN bucks the trend, and Lua, but pretty much every other language has the first element of an array at index zero
A zero base works pretty well for everything until you start using divide and conquer algorithms. Merge Sort as well as Binary Search are such algorithms. Binary search goes
Is it in the first half?
If so then search the first half further
Else search the second half further
The hard part is when you have to decide when you've found the object, or when you need to give up looking. Splitting data in two nearly-halves is easy. Knowing when to stop is hard
It's highly efficient for sorted data, but the problem comes when implementing it that, if we do it properly, we have to deal with all sorts of weird index bases beyond zero or one.
Suppose I have an array
my #alpha = 'A' .. 'Q'
If I print scalar #alpha I will see 17, meaning the array has seventeen elements, indexed from 0 to 16
Now I'm looking for E in that array, so I do a binary search, so I want the "first half" and the "second half" of #alpha. If I add 0 to 16 and divide by 2 I get a neat "8", so the middle element is at index 8, which is H
But wait. There are 17 elements, which is an odd number, so if we say the first eight (A .. H) are left of the middle and the last eight (I .. Q) are right of the middle then surely the "middle" is I?
In truth this is all a deception, because a binary search will work however we partition the data. In this case binary means two parts, and although the search would be more efficient if those parts could be equal in size it's not necessary for the algorithm to work. So it can be the first third and the last two-thirds, or just the first element and the rest
That's why using int(($low+high)/2) is fine. It rounds down to the nearest integer so that with our 17-element array $mid is a usable 8 instead of 8.5
But your code still has to account for some unexpected things. In the case of our 17-element array we have calculated the middle index to be 8. So indexes 0 .. 7 are the "first half" while 8 .. 16 are the "second half", and the middle index is where the second half starts
But didn't we round the division down? So in the case of an odd number of elements, shouldn't our mid point be at the end of the first half, and not the start of the second? This is an arcane off-by-one error, but let's see if it still works with a simple even number of elements
#alpha = `A` .. `D`
The start and and indices are 0 and 3; the middle index is int((0+3)/2) == 1. So the first half is 0..1 and the second half is 2 .. 3. That works fine
But there's still a lot more. Say I have to search an array with two elements X and Y. That has two clear halves, and I'm looking for A, which is before the middle. So I now search the one-element list X for A. The minimum and maximum elements of the target array are both zero. The mid-point is int((0+0)/2) == 0. So what happens next?
It is similar but rather worse when we're searching for Z in the same list. The code has to be exactly right, otherwise we will be either searching off the end of the array or checking the last element again and again
Saving the worst for last, suppose
my #alpha = ( 'A', 'B, 'Y, 'Z' )
and I'm looking for M. That lest loose all sorts of optimisations that involve checks that may may the ordinary case much slower
Because of all of this it's by far the best solution to use a library or a language's built-in function to do all of this. In particular, Perl's hashes are usually all you need to check for specific strings and any associated data. The algorithm used is vastly better than a binary search for any non-trivial data sets
Wikipedia shows this algorithm for an iterative binary search
The binary search algorithm can also be expressed iteratively with two index limits that progressively narrow the search range.
int binary_search(int A[], int key, int imin, int imax)
{
// continue searching while [imin,imax] is not empty
while (imin <= imax)
{
// calculate the midpoint for roughly equal partition
int imid = midpoint(imin, imax);
if (A[imid] == key)
// key found at index imid
return imid;
// determine which subarray to search
else if (A[imid] < key)
// change min index to search upper subarray
imin = imid + 1;
else
// change max index to search lower subarray
imax = imid - 1;
}
// key was not found
return KEY_NOT_FOUND;
}
And here is a version of your code that is far from bug-free but does what you intended. You weren't so far off
use strict;
use warnings 'all';
print binarysearch( 76, 10 .. 99 ), "\n";
sub binarysearch {
my $numbertolook = shift;
my #intarray = #_;
my $lengthint = scalar #intarray;
my #sorted = sort { $a <=> $b } #intarray;
my $low = 0;
my $high = $lengthint - 1;
my $found = undef;
my $midpoint;
while ( $low < $high ) {
$midpoint = int( ( $low + $high ) / 2 );
#print $midpoint, " ",$low," ", $high, " ", #sorted, "\n";
if ( $numbertolook < $sorted[$midpoint] ) {
$high = $midpoint;
}
elsif ( $numbertolook > $sorted[$midpoint] ) {
$low = $midpoint;
}
elsif ( $numbertolook == $sorted[$midpoint] ) {
$found = 1;
print "FOUND\n";
return $midpoint;
}
if ( $low == $high - 1 and $low == $midpoint ) {
if ( $numbertolook == $sorted[$high] ) {
$found = 1;
print "FOUND\n";
return $midpoint;
}
return;
}
}
return $midpoint;
}
output
FOUND
66
If you call print with several parameters separated with a space print expects the first one to be a filehandle. This is interprented as print FILEHANDLE LIST from the documentation.
print $found "\n";
What you want to do is either to separate with ,, to call it as print LIST.
print $found, "\n";
or to concat as strings, which will also call it as print LIST, but with only one element in LIST.
print $found . "\n";

Progress line in perl

I would like to create a very simple progressbar for my script. So far I've got this, and it works. However, I cannot get it to be a percentage out of 100. My code is the following and it produces basically a dot for every 5 entries in #entries.
my $total_entries = #entries;
my $count = 0;
my $count_tens = $total_entries/0.2;
$count_tens = sprintf ('%d',$count_tens);
foreach (#entries){
# do some stuff #
for (1 .. $total_entries){
if ($count == $count_tens){
print ".";
$count = 0;
}
$count++;
}
}
I would like to have something that produces always a fixed amount of dots, regardless of the total number of entries in #entries.
Let's say we want 80 dots. Then:
my $number_of_dots = 80;
my #items = 0 .. 20; # or something
my $items_per_dot = #items / $number_of_dots;
STDOUT->autoflush(1); # print everything out immediately
for my $i (0 .. $#items) {
my $dots = $i / $items_per_dot;
print "\r", "." x $dots;
sleep 1; # do something
}
print "\n";
Note that we avoid rounding errors by calculating the number of dots per item anew on each iteration. The \r will move the cursor to the start of the line, so the existing dots will be overwritten each time. You can easily skip the printing if the $dots value doesn't change between iterations.
Rather than rewriting the wheel, you may want to use existing code that has already been written, tested and debugged.
http://metacpan.org/pod/Term::ProgressBar

How to write a Perl script to check x number of columns to print a value in another line for each returned value

I have data (IP addresses) in a CSV file that will be columns 9-13. If there aren't values in the other columns, then by default it should just print out what is in column 9. There is an output file that will print a set of values and the value of column 9 (and through 13 if a value exist) concatenated with a static value to create an alias value. My question is, how would you do this efficiently? I have this code that works:
my $alias0= "ComponentAliases=['ComputerSystem:$columns[9]'];\n";
my $alias1= "ComponentAliases=['ComputerSystem:$columns[9]','ComputerSystem:$columns[10]'];\n";
my $alias2= "ComponentAliases=['ComputerSystem:$columns[9]','ComputerSystem:$columns[10]','ComputerSystem:$columns[11]'];\n";
print BAROC "ComputerSystem;\n";
if(($columns[11] != '')&&($columns[10] != '')) { print BAROC $alias2 }
elsif(($columns[11] == '')&&($columns[10] != '')) { print BAROC $alias1 }
elsif(($columns[11] == '')&&($columns[10] == '')) { print BAROC $alias0 }
This works to do what I want it to do, but there is a chance the CSV file will have values in columns 9-13 or 9-11, etc. Easily I think statically writing this will be fine, but I would like to do it efficiently as well as understand and always apply best practices. I'm new to scripting Perl, but continually am drawn to it to solve problems at work. Suggestions?
This is the output, btw:
ComponentAliases=['ComputerSystem:10.1.0.225','ComputerSystem:10.200.252.77','ComputerSystem:10.100.252.77'];
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV_XS;
my $csv_in = Text::CSV_XS->new
or die Text::CSV_XS->error_diag;
my $csv_out = Text::CSV_XS->new({
always_quote => 1,
quote_char => q{'},
}) or die Text::CSV_XS->error_diag;
while (my $row = $csv_in->getline(\*DATA)) {
my #aliases = map "ComputerSystem:$_",
grep defined && length, #$row[9 .. 13];
if ($csv_out->combine(#aliases)) {
printf "ComponentAliases=[%s];\n", $csv_out->string;
}
}
__DATA__
0,1,2,3,4,5,6,7,8,10.1.0.225,10.200.252.77,,,,,,,
0,1,2,3,4,5,6,7,8,10.1.0.225,10.200.252.77,10.100.252.77,,,,,
Output:
C:\temp> gn
ComponentAliases=['ComputerSystem:10.1.0.225','ComputerSystem:10.200.252.77'];
ComponentAliases=['ComputerSystem:10.1.0.225','ComputerSystem:10.200.252.77','Co
mputerSystem:10.100.252.77'];
Efficiently now means maintainable. Trying to save a command or two isn't going to save you to much time. In fact, it might actually make the program more inefficient if the compiler can't figure out what you're doing.
What is important is readability. Get rid of the $alias stuff. It simply makes it harder to see what your code is doing, and you could end up with all sorts of side effects doing stuff like this.
The lack of white space also makes your code much more difficult to figure out too. Once I reformatted your code, I immediately spotted an error. You did this:
if ( ($columns[11] != '') && ($columns[10] != '') )
However, this is a string comparison. You need to do this:
if ( ( $columns[11] ne '' ) && ( $columns[10] ne '' ) ) {
Or, you could simplify it even further:
if ( not $column[10] and not $column[11] ) {
This makes it very clear what you're looking for, and will work whether the columns contain a numeric zero, a null string, or are undefined.
This code snippet is using your logic, but I take advantage of the fact that print doesn't automatically add a \n at the end of the string. I simply continue building upon the line:
if ( $columns[9] ) {
print BAROC "ComputerSystem;\n";
print BAROC "ComponentAliases=['ComputerSystem:$columns[9]'";
if ( $columns[10] ) {
print BAROC ",ComputerSystem:$columns[10]";
}
if ( $columns[11] ) {
print BAROC ",ComputerSystem:$columns[11]";
}
print BAROC "];\n";
}
You mentioned that you might need columns 9 to 13 if these columns had data in them. Why not use a loop?
if ( $#columns >= 9 ) { #There are at least nine columns
print BAROC "ComputerSystem;\n";
print BAROC "ComponentAliases=[ComputerSystem:$columns[9]";
for my $column ( (10..$#columns) ) {
last if not $column[$column];
print BAROC ",ComputerSystem:$columns[$columns];
}
print BAROC "];\n";
}
If given more time, I'm sure I could clean up the logic a bit more. But, this will work whether there are 9, 10, 11, or 43 columns with data.
One liner (not so elegant, but somehow I like it):
print "ComponentAliases=[".join(",",map {"'ComputerSystem:$_'"} grep {$_ ne ""} #columns[9-13])."]\n";
Or if you prefer the same code in a more understandable way:
print(
"ComponentAliases=[",
join(
",",
map(
"'ComputerSystem:$_'",
grep (
$_ ne "",
#columns[9-13]
)
)
),
"]\n"
);

Perl need the right grep operator to match value of variable

I want to see if I have repeated items in my array, there are over 16.000 so will automate it
There may be other ways but I started with this and, well, would like to finish it unless there is a straightforward command. What I am doing is shifting and pushing from one array into another and this way, check the destination array to see if it is "in array" (like there is such a command in PHP).
So, I got this sub routine and it works with literals, but it doesn't with variables. It is because of the 'eq' or whatever I should need. The 'sourcefile' will contain one or more of the words of the destination array.
// Here I just fetch my file
$listamails = <STDIN>;
# Remove the newlines filename
chomp $listamails;
# open the file, or exit
unless ( open(MAILS, $listamails) ) {
print "Cannot open file \"$listamails\"\n\n";
exit;
}
# Read the list of mails from the file, and store it
# into the array variable #sourcefile
#sourcefile = <MAILS>;
# Close the handle - we've read all the data into #sourcefile now.
close MAILS;
my #destination = ('hi', 'bye');
sub in_array
{
my ($destination,$search_for) = #_;
return grep {$search_for eq $_} #$destination;
}
for($i = 0; $i <=100; $i ++)
{
$elemento = shift #sourcefile;
if(in_array(\#destination, $elemento))
{
print "it is";
}
else
{
print "it aint there";
}
}
Well, if instead of including the $elemento in there I put a 'hi' it does work and also I have printed the value of $elemento which is also 'hi', but when I put the variable, it does not work, and that is because of the 'eq', but I don't know what else to put. If I put == it complains that 'hi' is not a numeric value.
When you want distinct values think hash.
my %seen;
#seen{ #array } = ();
if (keys %seen == #array) {
print "\#array has no duplicate values\n";
}
It's not clear what you want. If your first sentence is the only one that matters ("I want to see if I have repeated items in my array"), then you could use:
my %seen;
if (grep ++$seen{$_} >= 2, #array) {
say "Has duplicates";
}
You said you have a large array, so it might be faster to stop as soon as you find a duplicate.
my %seen;
for (#array) {
if (++$seen{$_} == 2) {
say "Has duplicates";
last;
}
}
By the way, when looking for duplicates in a large number of items, it's much faster to use a strategy based on sorting. After sorting the items, all duplicates will be right next to each other, so to tell if something is a duplicate, all you have to do is compare it with the previous one:
#sorted = sort #sourcefile;
for (my $i = 1; $i < #sorted; ++$i) { # Start at 1 because we'll check the previous one
print "$sorted[$i] is a duplicate!\n" if $sorted[$i] eq $sorted[$i - 1];
}
This will print multiple dupe messages if there are multiple dupes, but you can clean it up.
As eugene y said, hashes are definitely the way to go here. Here's a direct translation of the code you posted to a hash-based method (with a little more Perlishness added along the way):
my #destination = ('hi', 'bye');
my %in_array = map { $_ => 1 } #destination;
for my $i (0 .. 100) {
$elemento = shift #sourcefile;
if(exists $in_array{$elemento})
{
print "it is";
}
else
{
print "it aint there";
}
}
Also, if you mean to check all elements of #sourcefile (as opposed to testing the first 101 elements) against #destination, you should replace the for line with
while (#sourcefile) {
Also also, don't forget to chomp any values read from a file! Lines read from a file have a linebreak at the end of them (the \r\n or \n mentioned in comments on the initial question), which will cause both eq and hash lookups to report that otherwise-matching values are different. This is, most likely, the reason why your code is failing to work correctly in the first place and changing to use sort or hashes won't fix that. First chomp your input to make it work, then use sort or hashes to make it efficient.

How do I turn a table into a matrix?

If I got a table in a text file such like
A B 1
A C 2
A D 1
B A 3
C D 2
A E 1
E D 2
C B 2
. . .
. . .
. . .
And I got another symbol list in another text file. I want to transform this table into a Perl data structure like:
_ A D E . . .
A 0 1 1 . . .
D 1 0 2 . . .
E 1 2 0 . . .
. . . . . . .
But I only need some selected symbol, for example A, D and E are selected in the symbol text but B and C are not.
Use an array for the first one and a 2-dimentional hash for the second one. The first one should look roughly like:
$list[0] # row 1 - the value is "A B 1"
And the hash like:
$hash{A}{A} # the intersection of A and A - the value is 0
Figuring out how to implement a problem is about 75% of the mental battle for me. I'm not going to go into specifics about how to print the hash or the array, because that's easy and I'm also not entirely clear on how you want it printed or how much you want printed. But converting the array to the hash should look a bit like this:
foreach (#list) {
my ($letter1, $letter2, $value) = split(/ /);
$hash{$letter1}{$letter2} = $value;
}
At least, I think that's what you're looking for. If you really want you could use a regular expression, but that's probably overkill for just extracting 3 values out of a string.
EDIT: Of course, you could forgo the #list and just assemble the hash straight from the file. But that's your job to figure out, not mine.
you can try this with awk:
awk -f matrix.awk yourfile.txt > newfile.matrix.txt
where matrix.awk is :
BEGIN {
OFS="\t"
}
{
row[$1,$2]=$3
if (!($2 in f2)) { header=(header)?header OFS $2:$2;f2[$2]}
if (col1[c]!=$1)
col1[++c]=$1
}
END {
printf("%*s%s\n", length(col1[1])+2, " ",header)
ncol=split(header,colA,OFS)
for(i=1;i<=c;i++) {
printf("%s", col1[i])
for(j=1;j<=ncol;j++)
printf("%s%s%c", OFS, row[col1[i],colA[j]], (j==ncol)?ORS:"")
}
}
Another way to do this would be to make a two-dimensional array -
my #fArray = ();
## Set the 0,0th element to "_"
push #{$fArray[0]}, '_';
## Assuming that the first line is the range of characters to skip, e.g. BC
chomp(my $skipExpr = <>);
while(<>) {
my ($xVar, $yVar, $val) = split;
## Skip this line if expression matches
next if (/$skipExpr/);
## Check if these elements have already been added in your array
checkExists($xVar);
checkExists($yVar);
## Find their position
for my $i (1..$#fArray) {
$xPos = $i if ($fArray[0][$i] eq $xVar);
$yPos = $i if ($fArray[0][$i] eq $yVar);
}
## Set the value
$fArray[$xPos][$yPos] = $fArray[$yPos][$xPos] = $val;
}
## Print array
for my $i (0..$#fArray) {
for my $j (0..$#{$fArray[$i]}) {
print "$fArray[$i][$j]", " ";
}
print "\n";
}
sub checkExists {
## Checks if the corresponding array element exists,
## else creates and initialises it.
my $nElem = shift;
my $found;
$found = ($_ eq $nElem ? 1 : 0) for ( #{fArray[0]} );
if( $found == 0 ) {
## Create its corresponding column
push #{fArray[0]}, $nElem;
## and row entry.
push #fArray, [$nElem];
## Get its array index
my $newIndex = $#fArray;
## Initialise its corresponding column and rows with '_'
## this is done to enable easy output when printing the array
for my $i (1..$#fArray) {
$fArray[$newIndex][$i] = $fArray[$i][$newIndex] = '_';
}
## Set the intersection cell value to 0
$fArray[$newIndex][$newIndex] = 0;
}
}
I am not too proud regarding the way I have handled references but bear with a beginner here (please leave your suggestions/changes in comments). The above mentioned hash method by Chris sounds a lot easier (not to mention a lot less typing).
CPAN has many potentially useful suff. I use Data::Table for many purposes. Data::Pivot also looks promising, but I have never used it.