Extract several rows with HTML::TableExtract - perl

I have written a script that extracts all the row data from HTML <TR> tags. My HTML page has 30 <TR> tags, and my code fetches a particular row based on a counter. For example, if I need the data in the 5th <tr>...</tr>, my condition is if (count == 5) {(go inside and get that data)}
But my problem is that I need the data from several selected rows, one at a time. Let's say I need the data for rows 5, 6, and 14.
Could you please help me sort it out?
$te = new HTML::TableExtract(count => 0 );
$te->parse($content);
# Examine all matching tables
foreach $ts ($te->table_states) {
#print "Table (", join(',', $ts->coords), "):\n";
$cnt = 1;
foreach $row($ts->rows) {
# print " ---- Printing Row $cnt ----\n";
$PrintLine= join("\t", @$row);
@RowData=split(/\t/,$PrintLine);
$PrintLine =~ s/\r//ig;
$PrintLine =~ s/\t//ig;
$cnt = $cnt + 1;
# if ($PrintLine =~ /Site ID/ig || $PrintLine =~ /Site name/ig){print " Intrest $PrintLine $cnt =====================\n"};
if ( $cnt == 14) {
$arraycnt = 1;
my $SiteID="";
my $SiteName="";
foreach (@RowData) {
# print " Array element $arraycnt\n";
chomp;
$_ =~ s/\r//ig;
$_ =~ s/[\xC3\xA1\xC3\xA0\xC3\xA2\xC3\xA3]//ig;
if ($arraycnt== 17 ) { $SiteID= $_;}
if ($arraycnt== 39 ) { $SiteName= $_;}
$arraycnt = $arraycnt + 1;
}
#$PrintLineFinal = $BridgeCase."\t".$PrintLine;
$PrintLineFinal = $BridgeCase."\t".$SiteID."\t".$SiteName;
#print "$PrintLineFinal\n";
print MYFILE2 "$PrintLineFinal\n";
last;
}
}
}

A few suggestions:
Always:
use strict;
use warnings;
This will force you to declare your variables with my. e.g.
foreach my $ts ($te->table_states) {
my $cnt = 1;
(warnings will let you know about most silly mistakes. strict prevents mistakes by requiring you to use better practices in certain cases).
In several places, you are using your own counter variables as you go through the array. You don't need to do this. Instead, just get the array element you want directly, e.g. $array[2] to get the third element (indexes start at zero).
Perl also allows array slices to get just certain elements you want. @array[4,5,13] gets the fifth, sixth, and fourteenth elements of the array. You can use this to process only the rows you want, instead of looping through all of them:
my @rows = $ts->rows;
foreach my $row (@rows[4,5,13]) #process only the 5th, 6th, and 14th rows.
{
...
}
Here is a shortcut version of the same thing, using an anonymous array:
foreach my $row (@{[$ts->rows]}[4,5,13])
Also, perhaps you want to define the rows you want elsewhere in your code:
my @wanted_rows = (4,5,13);
...
foreach my $row (@{[$ts->rows]}[@wanted_rows])
This code is quite confused:
$PrintLine= join("\t", @$row);
@RowData=split(/\t/,$PrintLine);
$PrintLine =~ s/\r//ig;
$PrintLine =~ s/\t//ig;
First you are joining an array with tab characters, then you are splitting the array you just joined to get the array back again. Then you remove all tab characters from the line anyway.
I suggest you get rid of all that code. Just use @$row whenever you need the array, instead of making a copy of it. If you need to print the array for debugging (which is all you seem to be doing with $PrintLine), you can print an array directly:
print @$row; #print an array, nothing between each element.
print "@$row"; #print an array with spaces between each element.
With all of these changes, your code would be something like this:
use strict;
use warnings;
use HTML::TableExtract;
my @wanted_rows = (4,5,13);
my $te = new HTML::TableExtract(count => 0);
$te->parse($content);
# Examine all matching tables
foreach my $ts ($te->table_states) {
    foreach my $row (@{[$ts->rows]}[@wanted_rows]) {
        next unless defined $row; #skip rows this table does not have.
        s/[\xC3\xA1\xC3\xA0\xC3\xA2\xC3\xA3\r\n]//ig for (@$row);
        my $SiteID = $$row[16] // ''; #set to empty strings if not defined.
        my $SiteName = $$row[38] // '';
        print MYFILE2 $BridgeCase."\t".$SiteID."\t".$SiteName."\n";
    }
}
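The slice idea can be tried without HTML::TableExtract at all. The following is a minimal sketch, assuming the rows are already in a plain array of array references; pick_rows and the sample data are made up for illustration and are not part of the module. It also drops indices that fall outside the table, which a bare slice would return as undef:

```perl
use strict;
use warnings;

# Hypothetical helper: return the rows at the given zero-based indices,
# skipping any index that falls outside the table.
sub pick_rows {
    my ($rows, @wanted) = @_;
    return grep { defined } @{$rows}[@wanted];
}

# Stand-in for what $ts->rows would return: one array ref per table row.
my @rows = map { [ "r${_}c0", "r${_}c1" ] } 0 .. 5;

# Indices 1 and 4 exist (2nd and 5th rows); index 13 is out of range.
for my $row (pick_rows(\@rows, 1, 4, 13)) {
    print join("\t", @$row), "\n";
}
```

Keeping the wanted indices in one list means the set of rows can change without touching the loop body.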

You could access the results like this:
foreach my $ts ($te->table_states) {
# the 14th row (indexes start at zero):
# my $row14 = $ts->rows->[13];
# the 17th column of the 14th row:
# my $col17 = $ts->rows->[13]->[16];
my $SiteName = $ts->rows->[13]->[38];
my $SiteID = $ts->rows->[13]->[16];
my $PrintLineFinal = $BridgeCase."\t".$SiteID."\t".$SiteName;
print MYFILE2 "$PrintLineFinal\n";
}

Related

Passing strings as array to subroutine and return count of specific char

I was trying to think of the right way to tackle this:
I would like to pass an array of, say, n elements as an argument to a subroutine. For each element, I want to match two characters, S and T, and print the count of these letters for that element. So far I have this, but I am stuck and my code runs into an infinite loop.
use strict;
use warnings;
sub main {
my @array = @_;
while (@array) {
my $s = ($_ = tr/S//);
my $t = ($_ = tr/T//);
print "ST are in total $s + $t\n";
}
}
my @bunchOfdata = ("QQQRRRRSCCTTTS", "ZZZSTTKQSST", "ZBQLDKSSSS");
main(@bunchOfdata);
I would like the output to be:
Element 1 Counts of ST = 5
Element 2 Counts of ST = 6
Element 3 Counts of ST = 4
Any clue how to solve this?
while (@array) will be an infinite loop since @array never gets smaller. It also doesn't read the items into the default variable $_. For this to work, use for (@array), which aliases $_ to each array item in turn until all have been processed.
The tr transliteration operator is the right tool for your task.
The code needed to get your results could be:
#!/usr/bin/perl
use strict;
use warnings;
my @data = ("QQQRRRRSCCTTTS", "ZZZSTTKQSST", "ZBQLDKSSSS");
my $i = 1;
for (@data) {
my $count = tr/ST//;
print "Element $i Counts of ST = $count\n";
$i++;
}
Also, note that my $count = tr/ST//; doesn't require binding the transliteration operator to $_. Perl assumes $_ when no binding is given. Your code had my $s = ($_ = tr/S//);, which uses = where =~ is needed: it assigns the count to $_ instead of binding tr to it. The bound form would be $s = ($_ =~ tr/S//);, but the shorter way I've shown is preferred.
You can combine the two sought letters as in my code. It's not necessary to count them separately.
I got the output you want.
Element 1 Counts of ST = 5
Element 2 Counts of ST = 6
Element 3 Counts of ST = 4
Also, you can't perform math operations in a quoted string like you had.
print "ST are in total $s + $t\n";
Instead, you would need to do:
print "ST are in total ", $s + $t, "\n";
where the operation is performed outside of the string.
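Since the question asks for a subroutine that takes the array and returns the counts, here is a minimal sketch along those lines. The name count_st is made up for illustration; it relies on tr/ST// returning the number of matching characters without modifying the string:

```perl
use strict;
use warnings;

# Hypothetical helper: for each string passed in, return how many
# characters from the set S and T it contains.
sub count_st {
    my @strings = @_;
    # With an empty replacement list, tr counts without changing $_.
    return map { tr/ST// } @strings;
}

my @data   = ("QQQRRRRSCCTTTS", "ZZZSTTKQSST", "ZBQLDKSSSS");
my @counts = count_st(@data);

for my $i (0 .. $#counts) {
    printf "Element %d Counts of ST = %d\n", $i + 1, $counts[$i];
}
```

Returning the counts instead of printing inside the subroutine leaves the caller free to format or sum them.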
Don't use while to traverse an array - your array gets no smaller, so the condition is always true and you get an infinite loop. You should use for (or foreach) instead.
for (@array) {
my $s = tr/S//; # No need for =~ as tr/// works on $_ by default
my $t = tr/T//;
print "ST are in total ", $s + $t, "\n";
}
Why tr///? You can also count with split:
sub main {
    my @array = @_;
    for (@array) {
        my $s = split(/S/, $_, -1) - 1;
        my $t = split(/T/, $_, -1) - 1;
        print "ST are in total ", $s + $t, "\n";
    }
}

Counting through a hash - PERL

I have a db of places people have ordered items from. I parsed the list to get the city and state so it prints like this - city, state (New York, NY) etc....
I use the variables $city and $state but I want to count how many times each city and state occur so it looks like this - city, state, count (Seattle, WA 8)
I have all of it working except the count .. I am using a hash but I can't figure out what is wrong with this hash:
if ($varc==3) {
$line =~ /(?:\>)(\w+.*)(?:\<)/;
$city = $1;
}
if ($vars==5) {
$line =~ /(?:\>)((\w+.*))(?:\<)/;
$state = $1;
# foreach $count (keys %counts){
# $counts = {$city, $state} {$count}++;
# print $counts;
# }
print "$city, $state\n";
}
foreach $count (keys %counts){
$counts = {$city, $state} {$count}++;
print $counts;
}
Instead of printing city and state you can build a "location" string with both items and use the following counting code:
# Declare this variable before starting to parse the locations.
my %counts = ();
# Inside of the loop that parses the city and state, let's assume
# that you've got $city and $state already...
my $location = "$city, $state";
$counts{$location} += 1;
} # end of the parsing loop
# When you've processed all locations then the counts will be correct.
foreach my $location (keys %counts) {
    print "OK: $location => $counts{$location}\n";
}
# OK: New York, NY => 5
# OK: Albuquerque, NM => 1
# OK: Los Angeles, CA => 2
This is going to be a mix of an answer and a code review. I will start with a warning though.
You are trying to parse what looks like XML with Regular Expressions. While this can be done, it should probably not be done. Use an existing parser instead.
How do I know? Stuff that is between angle brackets looks like the format is XML, unless you have a very weird CSV file.
# V V
$line =~ /(?:\>)(\w+.*)(?:\<)/;
Also note that you don't need to escape < and >, they have no special meaning in regex.
Now to your code.
First, make sure you always use strict and use warnings, so you are aware of stuff that goes wrong. I can tell you're not because the $count in your loop has no my.
What's $vars (with an s), and what's $varc (with a c). I am guessing that has to do with the state and the city. Is it the column number? In an XML file? Huh.
$line =~ /(?:\>)((\w+.*))(?:\<)/;
Why are there two capture groups, both capturing the same thing?
Anyway, you want to count how often each combination of state and city occurs.
foreach $count (keys %counts){
$counts = {$city, $state} {$count}++;
print $counts;
}
Have you run this code? Even without strict, it gives a syntax error. I'm not even sure what it's supposed to do, so I can't tell you how to fix it.
To implement counting, you need a hash. You got that part right. But you need to declare that hash variable outside of your file reading loop. Then you need to create a key for your city and state combination in the hash, and increment it every time that combination is seen.
my %counts; # declare outside the loop
while ( my $line = <$fh> ) {
chomp $line;
if ( $varc == 3 ) {
$line =~ /(?:\>)(\w+.*)(?:\<)/;
$city = $1;
}
if ( $vars == 5 ) {
$line =~ /(?:\>)((\w+.*))(?:\<)/;
$state = $1;
print "$city, $state\n";
$counts{"$city, $state"}++; # increment when seen
}
}
You have to parse the whole file before you can know how often each combination is in the file. So if you want to print those together, you will have to move the printing outside of the loop that reads the file, and iterate the %counts hash by keys at a later point.
my %counts; # declare outside the loop
while ( my $line = <$fh> ) {
chomp $line;
if ( $varc == 3 ) {
$line =~ /(?:\>)(\w+.*)(?:\<)/;
$city = $1;
}
if ( $vars == 5 ) {
$line =~ /(?:\>)((\w+.*))(?:\<)/;
$state = $1;
$counts{"$city, $state"}++; # increment when seen
}
}
# iterate again to print final counts
foreach my $item ( sort keys %counts ) {
print "$item $counts{$item}\n";
}
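The counting idiom itself can be tried in isolation. This is a minimal sketch, assuming the city and state have already been extracted into pairs; count_locations and the sample orders are invented for illustration and do not reflect the asker's actual input format:

```perl
use strict;
use warnings;

# Hypothetical helper: tally how often each "city, state" pair occurs.
# Takes a list of [city, state] pairs and returns a hash reference.
sub count_locations {
    my @pairs = @_;
    my %counts;
    $counts{"$_->[0], $_->[1]"}++ for @pairs;
    return \%counts;
}

# Made-up sample data standing in for the parsed order records.
my @orders = (
    [ 'Seattle',  'WA' ],
    [ 'New York', 'NY' ],
    [ 'Seattle',  'WA' ],
);

my $counts = count_locations(@orders);
print "$_ $counts->{$_}\n" for sort keys %$counts;
```

The hash key is the combined string, so the same city name in two different states is counted separately.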

how to count number of occurrences in hash values

I have the FIX file below and I want to find out how many orders were sent at the same time. I am using tag 52 as the sending time.
The file is:
8=FIX.4.2|9=115|35=A|52=20080624-12:43:38.021|10=186|
8=FIX.4.2|52=20080624-12:43:38.066|10=111|
8=FIX.4.2|9=105|35=1|22=BOO|52=20080624-12:43:39.066|10=028|
How can I count how many times each tag 52 value occurs?
So far I have written the code below, but it does not give me the frequency.
#!/usr/bin/perl
$f = '2.txt';
open (F,"<$f") or die "Can not open\n";
while (<F>)
{
chomp $_;
@data = split (/\|/,$_);
foreach $data (@data)
{
if ( $data == 52){
@data1 = split ( /=/,$data);
for my $j (@data1)
{
$hash{$j}++;
} for my $j (keys %hash)
{
print "$j: ", $hash{j}, "\n";
}
}
}
}
Here is your code corrected:
#!/usr/bin/perl
$f = '2.txt';
open (F,"<$f") or die "Can not open\n";
my %hash;
while (<F>) {
    chomp $_;
    @data = split (/\|/,$_);
    foreach $data (@data) {
        # keep only tag 52 and capture its value
        if ($data =~ /^52=(.*)/) {
            $hash{$1}++;
        }
    }
}
for my $j (keys %hash) {
    print "$j: ", $hash{$j}, "\n"; # $hash{$j}, not $hash{j}
}
Explanation:
if ( $data == 52) compares the whole field numerically against the value 52, not a substring of the field. Perl converts a string such as 52=20080624-12:43:38.021 to a number using its leading digits, so the test passes for tag 52 only by accident of numeric conversion (and warns under use warnings). I replaced it with a regexp comparison.
The same regexp gives an opportunity to catch the timestamp immediately, without a need to split the field once more. It is done by (.*) in the regexp and $1 in the following assignment.
It hardly makes sense to output the hash for every line of input data (your code outputs it within the foreach loop). I moved it down. But maybe outputting the current hash for every line is what you wanted; I do not know.
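As a variation, each message can be turned into a tag-to-value hash first, which makes it easy to count any tag rather than just 52. This is only a sketch under assumptions: count_tag is a made-up helper, and the messages are the three sample lines from the question held in an array instead of being read from 2.txt:

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each value of a given FIX tag
# occurs across a list of pipe-delimited messages.
sub count_tag {
    my ($tag, @lines) = @_;
    my %seen;
    for my $line (@lines) {
        # Turn "8=FIX.4.2|52=...|" into (8 => 'FIX.4.2', 52 => '...').
        my %field = map { split /=/, $_, 2 } grep { /=/ } split /\|/, $line;
        $seen{ $field{$tag} }++ if defined $field{$tag};
    }
    return \%seen;
}

my @messages = (
    '8=FIX.4.2|9=115|35=A|52=20080624-12:43:38.021|10=186|',
    '8=FIX.4.2|52=20080624-12:43:38.066|10=111|',
    '8=FIX.4.2|9=105|35=1|22=BOO|52=20080624-12:43:39.066|10=028|',
);

my $freq = count_tag(52, @messages);
print "$_: $freq->{$_}\n" for sort keys %$freq;
```

With the sample data every timestamp is distinct, so each count is 1; a count above 1 would indicate messages sharing the same sending time.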

Amend perl script so that words are matched on a word for word basis

I have been using this perl script (thanks to Jeff Schaller) to match 3 or more words in the title fields of two separate csv files.
Original question here:
https://unix.stackexchange.com/questions/283942/matching-3-or-more-words-from-fields-in-separate-csv-files?noredirect=1#comment494461_283942
I have also added some exception functionality following advice from meuh:
#!/bin/perl
my @csv2 = ();
open CSV2, "<csv2" or die;
@csv2=<CSV2>;
close CSV2;
my %csv2hash = ();
for (@csv2) {
chomp;
my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
$csv2hash{$_} = $title;
}
open CSV1, "<csv1" or die;
while (<CSV1>) {
chomp;
my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
my @titlewords = split /\s+/, $title; #/ get words
my @new; #add exception words which shouldn't be matched
foreach my $t (@titlewords){
push(@new, $t) if $t !~ /^(and|if|where)$/i;
}
@titlewords = @new;
my $desired = 3;
my $matched = 0;
foreach my $csv2 (keys %csv2hash) {
my $count = 0;
my $value = $csv2hash{$csv2};
foreach my $word (@titlewords) {
++$count if $value =~ /\b$word\b/i;
last if $count >= $desired;
}
if ($count >= $desired) {
print "$csv2\n";
++$matched;
}
}
print "$_\n" if $matched;
}
close CSV1;
During my testing, one issue I've found that I would like to tweak is that if csv2 contains a single common word such as the, and this word appears in csv1 three or more times, then three positive matches are found. To clarify:
If csv1 contains:
1216454,the important people feel the same way as the others, 15445454, 45445645
^ i.e. there are three instances of the in the above line
If csv2 contains:
14564564,the tallest man on earth,546456,47878787
^ i.e. there is one instance of the in this line
Then I would like only one word to be classed as matching, and there be no output (based on my desired number of matching words- 3 ) because there is only one instance of the matching word in one of the files.
However if:
csv1 contained:
1216454,the important people feel the same way as the others,15445454, 45445645
and csv2 contained:
15456456,the only way the man can sing the blues,444545,454545
Then, as there are three matching words in each (i.e. 3 instances of the word the in each title), I would like this to be classed as a matching title based on my desired number of matching words being 3 or more, thus generating the output:
1216454,the important people feel the same way as the others,15445454, 45445645
15456456,the only way the man can sing the blues,444545,454545
I would like to amend the script so that if there is one instance of a word in one csv, and multiple instances of the same word in the other csv, then that is classed as only one match. However, if there were, say, 3 instances of the word the in both files, then it should still be classed as three matches. Basically I would like matches to be on a word for word basis.
Everything about the script other than this is perfect so I would rather not go back to the drawing board completely as I am happy with everything other than this.
I hope I've explained it OK; if anyone needs any clarification, let me know.
If you just want to count unique matches, you can use a hash instead of a list to collect the words from csv1, just like you do for csv2, and then also count the occurrences of each word separately:
#!/usr/bin/env perl
my @csv2 = ();
open CSV2, "<csv2" or die;
@csv2=<CSV2>;
close CSV2;
my %csv2hash = ();
for (@csv2) {
chomp;
my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
$csv2hash{$_} = $title;
}
open CSV1, "<csv1" or die;
while (<CSV1>) {
chomp;
my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
my %words;
$words{$_}++ for split /\s+/, $title; #/ get words
## Collect unique words
my @titlewords = keys(%words);
my @new; #add exception words which shouldn't be matched
foreach my $t (@titlewords){
push(@new, $t) if $t !~ /^(and|if|where)$/i;
}
@titlewords = @new;
my $desired = 3;
my $matched = 0;
foreach my $csv2 (keys %csv2hash) {
my $count = 0;
my $value = $csv2hash{$csv2};
foreach my $word (@titlewords) {
my @matches = ( $value=~/\b$word\b/ig );
my $numIncsv2 = scalar(@matches);
@matches = ( $title=~/\b$word\b/ig );
my $numIncsv1 = scalar(@matches);
++$count if $value =~ /\b$word\b/i;
if ($count >= $desired || ($numIncsv1 >= $desired && $numIncsv2 >= $desired)) {
$count = $desired+1;
last;
}
}
if ($count >= $desired) {
print "$csv2\n";
++$matched;
}
}
print "$_\n" if $matched;
}
close CSV1;
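A different way to get word-for-word matching, not the approach of the script above, is to count each word in both titles and let every word contribute the smaller of its two counts. The following is a rough sketch under that assumption; word_counts and shared_words are hypothetical helpers, and the titles are the examples from the question:

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: count words in a title, ignoring case and
# skipping the exception words.
sub word_counts {
    my ($title) = @_;
    my %count;
    for my $word (split ' ', lc $title) {
        next if $word =~ /^(?:and|if|where)$/;
        $count{$word}++;
    }
    return \%count;
}

# Number of word-for-word matches between two titles: each shared word
# contributes the smaller of its two occurrence counts.
sub shared_words {
    my ($title1, $title2) = @_;
    my ($c1, $c2) = (word_counts($title1), word_counts($title2));
    my $total = 0;
    $total += min($c1->{$_}, $c2->{$_}) for grep { exists $c2->{$_} } keys %$c1;
    return $total;
}

my $csv1_title = 'the important people feel the same way as the others';
print shared_words($csv1_title, 'the tallest man on earth'), "\n";
print shared_words($csv1_title, 'the only way the man can sing the blues'), "\n";
```

For the two example pairs this gives 1 (one shared the) and 4 (three the plus one way), so only the second pair reaches a threshold of 3.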

how to replace a div[ ] element with another div[ ] element?

for ($i=0; $i<10; $i++)
{
my $v1 = $sel->get_text("//body[\@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/**div**/table/tbody/tr/td/div/div");
my $v2 = $sel->get_text("//body[\@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/**div**/table/tbody/tr/td[2]/div/div");
print ($v1 . $v2);
}
For every iteration, it has to find the 14th element starting from div[10] and replace it with the next-numbered div[ ] element (e.g. if the 14th element is div, replace it with div[2]; in the next iteration find the 14th element, i.e. div[2], and replace it with div[3], and so on).
I couldn't do it with pattern matching. Is there a way to find that particular element and replace it using a regex? How can I do it?
my $a = "//body[\@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/**div**/table/tbody/tr/td/div/div";
my @arr = split ('/' , $a);
print "@arr \n";
my $size1 = @arr;
print "$size1\n";
print $arr[16];
foreach my $a2 (@arr)
{
print "$a2 \n";
}
my $b = "//body[\@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/**div**/table/tbody/tr/td[2]/div/div";
The two variables mentioned in the question as v1 & v2 (edited here as $a and $b) both need this modification. I think I'm close to what you've described. Can you please help me further?
use 5.010;
my $xpath = q(//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div/table/tbody/tr/td/div/div);
for my $i (0..10) {
my @nodes = split qr'/', $xpath;
$nodes[16] .= "[$i]" unless 0 == $i;
say join '/', @nodes;
}
Results:
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[1]/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[2]/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[3]/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[4]/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[5]/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[6]/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[7]/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[8]/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[9]/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[10]/table/tbody/tr/td/div/div
All the elements are separated by /, right? So you can use the native split function to break the portion of the text following div[10] on /. Store the pieces in an array @arr. Note the length of that portion, say $len, and the index in the original string where it starts (just after div[10]/), say $orig_index. Then take the 14th element and do a regex match to see which format it is in:
if ($arr[13] =~ /^div\[(\d+)\]$/) {
    $arr[13] = 'div[' . ($1 + 1) . ']';
}
else {
    $arr[13] = 'div[2]';
}
Now that the 14th element is changed, join the array back together to get the new string for the portion following div[10], and put it back into the original string:
my $newstring = join '/', @arr;
substr($originalstring, $orig_index, $len, $newstring);
I think that will do.
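The same idea can be wrapped in a small function that works on the whole path. This is a sketch only: bump_step is a made-up name, the shortened XPath is invented for illustration, and splitting on / assumes that no predicate in the path contains a slash:

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $xpath in which the step at
# zero-based position $pos (after splitting on "/") has its numeric
# predicate incremented; a bare step such as "div" becomes "div[2]".
sub bump_step {
    my ($xpath, $pos) = @_;
    my @steps = split m{/}, $xpath, -1;
    if (defined $steps[$pos] && $steps[$pos] =~ /^(\w+)(?:\[(\d+)\])?$/) {
        my $next = defined $2 ? $2 + 1 : 2;
        $steps[$pos] = "$1\[$next]";
    }
    return join '/', @steps;
}

my $xpath = q(//body[@id='ext-gen3']/div[10]/div/table/tr/td);
for (1 .. 3) {
    $xpath = bump_step($xpath, 4);   # position 4 is the "div" before "table"
    print "$xpath\n";
}
```

Because the leading // produces two empty pieces when split, the first real step (body) sits at position 2, which is why the div before table is position 4 here.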