File row confrontation - perl

I have the following problem: from a file (file.dat) with the following formatted datas
1 2 3 4
2 1 3 4 5
3 1 2
4 1 2
5 2 6 7
6 5 8
7 5 8
8 6 7 9
9 8
I want to find:
if the first element of a row appears in the other rows and if the first element of the subsequent rows appear in the row taken in exam;
if it exists then I want to print "I have found the link x y";
if the "link" exists, then I want to count how many times the other elements in the row taken in exam appear in the row where the link is present and print "I have found z triangles".
For example in this case when the program compare the first row and the second row and find that "the link 1 2" exists and then write also "I have find 2 triangles" (because in each rows there are the numbers 3 and 4).
For this purpose I have tried to write the following program:
use strict;
use warnings;
use diagnostics;
use Data::Dumper;
############ DATA ABSORTION
my $file = 'file.dat';
open my $fh, "<", $file or die "Cannot open $file: $!";
############ COLLECT THE DATAS IN A VECTOR as vector[i][j]
my #vector;
while (<$fh>) {
push #vector, [ split ];
}
############ START THE RESEARCH OF THE LINKS AND TRIANGLES BY MEANS OF FOR LOOPS
my #link;
my $triangles;
for (my $i=0 ; $i < scalar #vector; $i++){
$triangles=0;
for(my $j=0; $j < scalar #vector; $j++){
for (my $k=$i+1; $k < scalar #vector; $k++){
for(my $l=0; $l < scalar #vector; $l++){
if($vector[$i][0]==$vector[$k][$l] && $vector[$i][$j]==$vector[$k][0] && $l != 0 && $j != 0) {
#link=($vector[$i][0],$vector[$k][0]);
print "I found the link #link\n";
if($vector[$i][$j]==$vector[$k][$l] && $l != 0 && $j != 0 && $i != $k){
$triangles++;
}
print "The number of triangles is $triangles\n\n";
}
}
}
}
}
The program print the right number of links but I found that if the number of rows is lower of the number of colums in the file, the program doesn't read the full row and this could be a problem for my link research. I think the problem is due at the scalar #vector upper limit in the for instrunctions (but I don't understand why).
The second problem is that it does't count the right number o triangles that I'am looking for... Any helps?

This program does what you require. In addition it prints the three corners of each triangle when one is found.
use strict;
use warnings;
use 5.010;
my $filename = 'file.dat';
open my $fh, '<', $filename or die qq{Cannot open "$filename": $!};
my %vector;
while (<$fh>) {
my #fields = split;
my $root = shift #fields;
$vector{$root} = { map { $_ => 1} #fields };
}
my #roots = sort { $a <=> $b } keys %vector;
for my $i (0 .. $#roots) {
my $aa = $roots[$i];
for my $j ($i + 1 .. $#roots) {
my $bb = $roots[$j];
next unless $vector{$aa}{$bb} and $vector{$bb}{$aa};
say "I found the link $aa $bb";
my $triangles = 0;
for my $cc ( keys %{$vector{$aa}} ) {
next if $cc == $aa or $cc == $bb;
if ($vector{$bb}{$cc}) {
say "Triangle $aa - $bb - $cc";
$triangles++;
}
}
say "I have found $triangles triangle". ($triangles == 1 ? '' : 's');
print "\n";
}
}
There are only two triangles in the data you show: 1-2-3 and 1-2-4. Following your algorithm results in this program counting triangles more than once, with the corners in different orders. To count each distinct triangle only once, change the line
next if $cc == $aa or $cc == $bb;
to
next if $cc <= $aa or $cc <= $bb;
output
I found the link 1 2
Triangle 1 - 2 - 4
Triangle 1 - 2 - 3
I have found 2 triangles
I found the link 1 3
Triangle 1 - 3 - 2
I have found 1 triangle
I found the link 1 4
Triangle 1 - 4 - 2
I have found 1 triangle
I found the link 2 3
Triangle 2 - 3 - 1
I have found 1 triangle
I found the link 2 4
Triangle 2 - 4 - 1
I have found 1 triangle
I found the link 2 5
I have found 0 triangles
I found the link 5 6
I have found 0 triangles
I found the link 5 7
I have found 0 triangles
I found the link 6 8
I have found 0 triangles
I found the link 7 8
I have found 0 triangles
I found the link 8 9
I have found 0 triangles

[ Only answers first question ]
$j and $l are suppose to iterate over the column indexes, but you count rows. The correct loops are:
for my $i (0 .. $#vector-1) {
for my $j (0 .. $#{ $vector[$i] }) {
for my $k ($i+1 .. $#vector) {
for my $l (0 .. $#{ $vector[$k] }) {

This question has two parts:
Establish if a link exists between two rows
Establish the total 'unique' numbers they share in common
Using an AoA is fine, but using a HoH makes life a little easier:
my %links;
while ( <$fh> ) {
chomp;
my ( $from, #to ) = split;
$links{$from}{$_}++ for #to;
}
You can then check to see if the link exists:
print "Link $from $to\n" if exists $links{$from} && exists $links{$from}{$to};
And finding common "triangles" should be easy as well:
use List::MoreUtils 'uniq';
sub get_triangles {
my ( $from, $to ) = #_;
for ( $from, $to ) { # Bail out if link doesn't exist
warn( "'$_' does not exist"), return unless exists $links{$_};
}
my #triangles = map { exists $links{$from} && exists $links{$to} }
uniq( values %{$links{$from}}, values %{$links{to}} );
return #triangles;
}

Related

how to deal with graph theory related problems using perl

I want to learn how to solve simple graph theory problems in perl without using any extra module.
I can explain a simple problem.
Input format:
Line 1- number of vertices of graph-N.
Next N lines- index of vertices with direct connection to vertex with index i. Index starts from 1.
index of starting point (space) index of end point, find longest route possible.
Example
4
2 3 4
1
1 4
1 3
2 4
Solution:
2 to 4 can be reached in following ways
- 2-1-4
- 2-1-3-4
so longest path is 2-1-3-4
I want to learn the basics of using perl for such problems. Any help would be highly appreciated. Give me a hint and i will try to code.
I'd use a hash of hashes to represent a graph. $graph{$v1}{$v2} exists if the edge v1-v2 is in the graph. You can represent directed graphs this way (as $graph{$v2}{$v1} doesn't have to exist). Also, if you want weighted edges, you can store the weight as the value.
To solve your example problem, I'd use something like the following:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
# Check that vertex can be added to the path.
sub already {
my ($vertex, #vertices) = #_;
for my $i (1 .. $#vertices) {
# last-v or v-last might already be present.
return 1 if ($vertices[ $i - 1 ] == $vertices[-1] and $vertices[$i] == $vertex)
or ($vertices[ $i - 1 ] == $vertex and $vertices[$i] == $vertices[-1])
}
return
}
sub path {
my ($graph, $start, $end, %known) = #_;
my $count = keys %known;
for my $path (keys %known) {
my #vertices = split '-', $path;
next if $vertices[-1] == $end;
for my $target (keys %{ $graph->{ $vertices[-1] } }) {
undef $known{"$path-$target"} unless already($target, #vertices);
}
}
if (keys %known > $count) {
return path($graph, $start, $end, %known)
} else {
return keys %known
}
}
my %graph;
my $size = <>;
for my $node (1 .. $size) {
my #targets = split ' ', <>;
undef $graph{$node}{$_} for #targets;
}
my ($start, $end) = split ' ', <>;
say "$start to $end can be reached in the following ways";
my #paths = grep /-$end$/,
path(\%graph, $start, $end, map {; "$start-$_" => undef }
keys %{ $graph{$start} });
say for #paths;
my $max = 0;
for my $i (1 .. $#paths) {
$max = $i if ($paths[$i] =~ tr/-//) > ($paths[$max] =~ tr/-//);
}
say "so longest path is $paths[$max]";

Finding The number of Divisors in a given number?

I have created a Perl program to calculate the amount of divisible numbers in numbers 3 to 10.
Example: the number 6 has 4 divisors 1, 2, 3 and 6.
This is how the program is suppose to work:
The program will calculated the number of divisors of 3 it will then print it to the report.txt file. Next, it will move on to calculate the number of divisors of 4 and print it to report.txt. The program will do this until it has calculated to the number 10 then it will close the program.
#!/usr/bin/perl
use warnings;
use strict;
my $num = 2; # The number that will be calculated
my $count = 1; # Counts the number of divisors
my $divisors; # The number of divisors
my $filename = 'report.txt';
open(my $fh, '>', $filename) or die "Could not open file '$filename' $!"; # open file "report.txt"
for (my $i=2; $i <= 10; $i++) {
while( $num % $i == 0) { # Checks if the number has a remainder.
$num++; # Adds 1 to $num so it will calculate the next number.
$count++; # counts the number of divisible numbers.
$num /= $i; # $num = $num / $i.
}
$divisors = $count; # The number of divisors are equal to $count.
print $fh "$divisors\n"; # The output will be repeated..
}
close $fh # Closes the file "report.txt"
I think the problem is that the for-loop keeps repeating this code:
print $fh "$divisors\n";
The output is:
2
2
2
2
2
2
2
2
2
but, I'm not sure exactly what I am missing.
Give your variables meaningful names. This helps in both making your code self-documenting, but also in that it helps you recognize when you're using a variable incorrectly. The variable name $i doesn't communicate anything, but $divisor says that you are testing if that number is a divisor.
As for why your code is looping, can't say. Here is a reformatted version of your code that does function though:
#!/usr/bin/perl
use warnings;
use strict;
use autodie;
for my $num (2..10) {
my $divisor_count = 0;
for my $divisor (1..$num) {
$divisor_count++ if $num % $divisor == 0;
}
print "$num - $divisor_count\n"
}
Output:
2 - 2
3 - 2
4 - 3
5 - 2
6 - 4
7 - 2
8 - 4
9 - 3
10 - 4

grep tab separated string in perl

I am trying to grep tab separated numbers (eg 1\t3) in an array something like
#data=
1 3
2 3
1 3
3 3
the idea behind the code is something like this
#!usr/bin/perl
use strict;
use warnings;
my #data = ( "1\t3", "2\t3", "1\t3", "3\t3", );
for (my $i=0;$i<4;$i++) {
for (my $j=0;$j<4_size;$j++) {
my $pattern= "$i\t$j";
my #count=grep(/$pattern/,#data);
undef $pattern;
print "$pattern\tcount\n";
}
}
hoping for output something like
1st and second column: pattern
3nd column : count of total matches
1 1
1 2
1 3 2
2 1
2 3 1
3 1
3 2
3 3 1
but the output is null for some reasons,
I am recently learnt and finding it very intriguing.
any suggestions?
The code below:
Does not crash if input contains unexpected characters (e.g., '(')
Only counts exact matches for the sequences of digits on either side of "\t".
Matches lines that might have been read from a file or __DATA__ section without using chomp using \R.
--
#!/usr/bin/env perl
use strict;
use warnings;
my #data = ( "1\t3", "2\t3", "(\t4", "1\t3", "3\t3", "11\t3" );
for my $i (1 .. 3) {
for my $j (1 .. 3) {
my $pattern = "$i\t$j";
my $count = grep /\A\Q$pattern\E\R?\z/, #data;
print join("\t", $pattern, $count ? $count : ''), "\n";
}
}
Output:
1 1
1 2
1 3 2
2 1
2 2
2 3 1
3 1
3 2
3 3 1
You almost got it. Here is a working version:
#!usr/bin/perl
use strict;
use warnings;
my #data = ( "1\t3", "2\t3", "1\t3", "3\t3", );
foreach my $i (1 .. 3) {
foreach my $j (1 .. 3) {
my $pattern = "$i\t$j";
my $count = grep(/$pattern/, #data);
print $pattern . ($count ? "\t$count\n" : "\n");
}
}

Try to remove specific columns using splice in Perl

I am a brand new Perl novice, looking for help with my first ever Perl script
I have some huge files 30-50GB files and they are constructed like this - millions of columns and thousands of rows:
A B C D E 1 2 3 4 5 6 7 8 9 10
A B C D E 1 2 3 4 5 6 7 8 9 10
A B C D E 1 2 3 4 5 6 7 8 9 10
A B C D E 1 2 3 4 5 6 7 8 9 10
A B C D E 1 2 3 4 5 6 7 8 9 10
A B C D E 1 2 3 4 5 6 7 8 9 10
A B C D E 1 2 3 4 5 6 7 8 9 10
I would like to delete column "A", and column "C", then ever third of the number columns, so the "3" column and the "6" column, then "9" column until the end of the file. Space delimited.
My attempt is like this:
#!/usr/local/bin/perl
use strict;
use warnings;
my #dataColumns;
my $dataColumnCount;
if(scalar(#ARGV) != 2){
print "\nNo files supplied, please supply file name\n";
exit;
}
my $Infile = $ARGV[0];
my $Outfile = $ARGV[1];
open(INFO,$Infile) || die "Could not open $Infile for reading";
open(OUT,">$Outfile") || die "Could not open $Outfile for writing";
while (<INFO>) {
chop;
#dataColumns = split(" ");
$dataColumnCount = #dataColumns + 1;
#Now remove the first element of the list
shift(#dataColumns);
#Now remove the third element (Note that it is now the second - after removal of the first)
splice(#dataColumns,1,1); # remove the third element (now the second)
#Now remove the 6th (originally the 8th) and every third one thereafter
#NB There are now $dataColumnCount-1 columns
for (my $i = 5; $i < $dataColumnCount-1; $i = $i + 3 ) {
splice($dataColumns; $i; 1);
}
#Now join the remaining elements of the list back into a single string
my $AmendedLine = join(" ",#dataColumns);
#Finally print out the line into your new file
print OUT "$AmendedLine/n";
}
But I am getting a few weird errors:
It is saying it doesn't like my $1 in the for loop, I have added a 'my' which seems to make the error go away but nobody else's for code seems to contain a 'my' here so I am not sure what is going on.
Global symbol "$i" requires explicit package name at Convertversion2.pl line 36.
Global symbol "$i" requires explicit package name at Convertversion2.pl line 36.
Global symbol "$i" requires explicit package name at Convertversion2.pl line 36.
Global symbol "$i" requires explicit package name at Convertversion2.pl line 36.
The other error is this:
syntax error at Convertversion2.pl line 37, near "#dataColumns;"
syntax error at Convertversion2.pl line 37, near "1)"
I am not sure how to correct this error, I think I am almost there, but not sure what exactly what the syntax error is, is am unsure how to fix it.
Thank you in advance.
After I blogged about this question, a commenter pointed out that it is possible to reduce run time by 45% for my test case. I paraphrased his code a little bit:
my #keep;
while (<>) {
my #data = split;
unless (#keep) {
#keep = (0, 1, 0, 1, 1);
for (my $i = 5; $i < #data; $i += 3) {
push #keep, 1, 1, 0;
}
}
my $i = 0;
print join(' ', grep $keep[$i++], #data), "\n";
}
This runs in almost half the time my original solution took:
$ time ./zz.pl input.data > /dev/null
real 0m21.861s
user 0m21.310s
sys 0m0.280s
Now, it is possible to gain another 45% performance by using Inline::C in a rather dirty way:
#!/usr/bin/env perl
use strict;
use warnings;
use Inline C => <<'END_C'
/*
This code 'works' only in a limited set of circumstances!
Don't expect anything good if you feed it anything other
than plain ASCII
*/
#include <ctype.h>
SV *
extract_fields(char *line, AV *wanted_fields)
{
int ch;
IV current_field = 0;
IV wanted_field = -1;
unsigned char *cursor = line;
unsigned char *field_begin = line;
unsigned char *save_field_begin;
STRLEN field_len = 0;
IV i_wanted = 0;
IV n_wanted = av_len(wanted_fields);
AV *ret = newAV();
while (i_wanted <= n_wanted) {
SV **p_wanted = av_fetch(wanted_fields, i_wanted, 0);
if (!(*p_wanted)) {
croak("av_fetch returned NULL pointer");
}
wanted_field = SvIV(*p_wanted);
while ((ch = *(cursor++))) {
if (!isspace(ch)) {
continue;
}
field_len = cursor - field_begin - 1;
save_field_begin = field_begin;
field_begin = cursor;
current_field += 1;
if (current_field != wanted_field) {
continue;
}
av_push(ret, newSVpvn(save_field_begin, field_len));
break;
}
i_wanted += 1;
}
return newRV_noinc((SV *) ret);
}
END_C
;
And, here is the Perl part. Note that we split only once to figure out the indices of fields to keep. Once we know those, we pass the line and the (1-based) indices to the C routine to slice and dice.
my #keep;
while (my $line = <>) {
unless (#keep) {
#keep = (2, 4, 5);
my #data = split ' ', $line;
push #keep, grep +(($_ - 5) % 3), 6 .. scalar(#data);
}
my $fields = extract_fields($line, \#keep);
print join(' ', #$fields), "\n";
}
$ time ./ww.pl input.data > /dev/null
real 0m11.539s
user 0m11.083s
sys 0m0.300s
input.data was generated using:
$ perl -E 'say join(" ", "A" .. "ZZZZ") for 1 .. 100' > input.data
and it is about 225MB in size.
The code you show doesn't produce those errors. You have no $1 in there at all, and if you meant $i then your use of that variable is fine. The only syntax error is in the line splice($dataColumns; $i; 1) which has semicolons instead of commas, and uses $dataColumns instead of #dataColumns.
Apart from that
It is good practice to declare variables as close as possible to their point of use, not at the top of the program.
Capital letters are generally used for constants like package names. You should use lower case, digits and underscore for variables.
Are you aware you are setting $dataColumnCount to one more than the number of elements in #dataColumns?
It is frowned on more recently to use global file handles - you should use lexical variables instead.
I suggest this refactoring of your program. It uses autodie to avoid having to check the success of the open calls. It builds a list of array indices that need deleting as soon as it can: once the number of fields in each line is known after the first record is read. Then it deletes them from the end backwards to avoid having to do arithmetic on the indices as preceding elements are removed.
#!/usr/local/bin/perl
use strict;
use warnings;
use autodie;
if (#ARGV != 2) {
die "\nNo files supplied, please supply file names\n";
}
my ($infile, $outfile) = #ARGV;
open my $info, '<', $infile;
open my $out, '>', $outfile;
my #remove;
while (<$info>) {
my #data = split;
unless (#remove) {
#remove = (0, 2);
for (my $i = 7; $i < #data; $i += 3) {
push #remove, $i;
}
}
splice #data, $_, 1 for reverse #remove;
print $out join(' ', #data), "\n";
}
While the other answers above work perfectly, and mine probably doesn't present any advantage, this is a different way of achieving the same while avoiding split:
#!/usr/local/bin/perl
use strict;
use warnings;
use feature 'say';
my $dir='D:\\';
open my $fh,"<", "$dir\\test.txt" or die;
while (<$fh>) {
chomp;
my #fields = split ' ';
print "$fields[0] $fields[2] ";
for (my $i=7; $i <= $#fields; $i += 3){
print "$fields[$i] ";
}
print "\n";
}
close $fh;
Please let me know if this is useless.

perl extract information from several files in folders into the specific files

I'm just trying to extract of daily data to analyze yearly something.
So, I made a code for searching folder and files.
After that I wanna extract curtain lines in several files that has same name in the middle.
When I done my work, I realize that there are only one day information left
daily data is grid format like this.
ncols 751
nrows 601
xllcorner 124.5
yllcorner 33.
cellsize 0.01
nodata_value -99
-99.0 -99.0 -99.0 -99.0 -99.0
I wanna get the result like this with my code.
1.txt (2011)
10 10 10 10 10 4 4 3 2
5 4 3 2 10 4 4 3 2
1 1 10 10 10 10 10 10
2.txt (2012)
3 4 2 10 10 4 4 3 2
5 4 3 2 10 4 4 3 2
1 1 10 10 10 10 10 10
use 5.010;
use warnings;
if( $#ARGV < 0 )
{ die "need folder.\n"; }
$dirName = shift(#ARGV);
local($i);
#rutine
&readSubDir($dirName);
sub readSubDir
{
if(!opendir(dirHandle, $_[0]))
{
print "$_[0] Failed opening.\n";
return 0;
}
local(#files) = readdir(dirHandle);
local($i);
local($newFile);
local(#dironly);
local(#fileonly);
for($i = 0; $i <= $#files; $i++)
{
$newFile = $_[0]."\\".$files[$i];
if(-d $newFile)
{
push(#dironly, $files[$i]);
}
elsif(-f $newFile)
{
push(#fileonly, $files[$i]);
}
else
{}
}
#files = #dironly;
push(#files, #fileonly);
closedir dirHandle;
my $cnt = 1;
my $b = 2011;
for($i =0; $i <= $#files; $i++){
$newFile = $_[0]."\\".$files[$i];
if(-f $newFile){
open(fileHandle, $newFile)|| die "Cannot open 개체 \n";
my($dir, $file, $ext) = ($newFile =~ m|^(.*\\)(.*)(\..*)$| );
if (substr($file,17,4) eq $b){
while(<fileHandle>){
if($. == 7){
my $filename = $cnt.'.txt';
open OUT, ">$filename" or die "Failed to create $filename";
print OUT $_;
}
}
close(fileHandle);
}
elsif (substr($file,17,4) eq $b+1){
$b++;
$cnt++;
while(<fileHandle>){
if($. == 7){
my $filename = $cnt.'.txt';
open OUT, ">$filename" or die "Failed to create $filename";
print OUT $_;
}
}
close(fileHandle);
}
}
close(OUT);
}
}
The question really isn't clear exactly what you're trying to accomplish, as the information in the daily file data example doesn't match any of the data in the output examples. However, from looking at your code I think I get what you're trying to do and I think your problem is occurring when you open the output file to store the lines you extracted. You're opening the file with >, which indicates you're opening the file for output, but it will also overwrite the file if it already exists. So your code is just overwriting the same file over and over and only the information in the last file will be saved. You need to open the file in append mode using >>. So your code should look similar to the following:
open OUT, ">>$filename" or die "Failed to create $filename";