WWW::Mechanize and iteration - perl

i am trying to scrape info from http://www.soccerbase.com/tournaments/tournament.sd?comp_id=1 from lines 1184 to 1325, basically the up coming games for the next 7 days. i have the code working for a single instance, but i can't figure out how to iterate the code so that it will scrape all the games info until it hits the end of the 7 day's worth of games. Is there some sort of loop i can create that will scrape until i hit a certain tag or something? Here is my code so far, thanks in advance!
my $page = WWW::Mechanize->new;
$page->get('http://www.soccerbase.com/tournaments/tournament.sd?comp_id=1');
my $stream = HTML::TokeParser->new(\$page->{content});
my #fixture;
my $tag = $stream->get_tag("td");
while($tag->[1]{class} ne "dateTime"){
$tag = $stream->get_tag("td");
}
if ($tag->[1]{class} eq "dateTime") {
push(#fixture, $stream->get_trimmed_text("/a"));
}
$stream->get_tag("a");
$stream->get_tag("a");
push(#fixture, $stream->get_trimmed_text("/a"));
$stream->get_tag("a");
push(#fixture, $stream->get_trimmed_text("/a"));
foreach $element (#fixture){
print $element, "\t";
}
print "\n";

Try Web::Query for parsing HTML, it is a much nicer to use than TokeParser. It works declarative instead of imperative and you select elements with CSS expressions.
If there is a score v, add the row to the result set, else discard the row.
use Web::Query 'wq';
my $football_matches = wq($mech->content)
->find('tr.match')
->map(sub {
my (undef, $e) = #_;
return 'v' eq $e->find('td.score')->text
? [
$e->attr('id'),
map { $e->find("td.$_")->text }
(qw(tournament dateTime homeTeam score awayTeam prices))
]
: ();
});
use Data::Dumper; print Dumper $football_matches;
$VAR1 = [
['tn7gc635476', '', ' Mo 12Mar 2012 ', 'Arsenal', 'v', 'Newcastle', ' '],
['tn7gc649937', '', ' Tu 13Mar 2012 ', 'Liverpool', 'v', 'Everton', ' '],
['tn7gc635681', '', ' Sa 17Mar 2012 ', 'Fulham', 'v', 'Swansea', ' '],
['tn7gc635661', '', ' Sa 17Mar 2012 ', 'Wigan', 'v', 'West Brom', ' '],
['tn7gc635749', '', ' Su 18Mar 2012 ', 'Wolves', 'v', 'Man Utd', ' '],
['tn7gc635556', '', ' Su 18Mar 2012 ', 'Newcastle', 'v', 'Norwich', ' ']
];

Related

in a sub routine ... using return to get some elements of array

I am trying to get some elements of array to return in the subroutine .it gives me only the last one dd 4 13
here is my csv
aa ,1 ,no ,ed ,8
bb ,2 ,yes ,ed ,10
cc ,3 ,no ,ed ,12
dd ,4 ,yes ,ed ,13
here is my perl code
use strict;
use Getopt::Std;
my $input_file = $ARGV[0];
my #data = read_data_from_csv($input_file);
sub read_data_from_csv
{
my ($fh) = #_;
my #d = ();
open(FH, "<$fh") || die "Error: no open for $fh: $!";
while (<FH>) {
chomp;
my #list =split(/,/);
my ($aa) = $list[0];
my ($bb) = $list[1];
my ($cc) = $list[4];
push (#d , ($aa, $bb, $cc));
}
close (FH);
return #d
}
print "#data\n";
cat test.csv
aa ,1 ,no ,ed ,8
bb ,2 ,yes ,ed ,10
cc ,3 ,no ,ed ,12
dd ,4 ,yes ,ed ,13
/pkg/qct/software/perl/5.20.0/bin/perl test.pl test.csv
dd yes 13
Your code as posted works:
use strict;
use Getopt::Std;
my $input_file = $ARGV[0];
my #data = read_data_from_csv($input_file);
sub read_data_from_csv {
my ($fh) = #_;
my #d = ();
#open(FH, "<$fh") || die "Error: no open for $fh: $!";
while (<DATA>) {
chomp;
my #list = split(/,/);
my ($aa) = $list[0];
my ($bb) = $list[1];
my ($cc) = $list[4];
push( #d, ( $aa, $bb, $cc ) );
}
close(FH);
return #d;
}
print "#data\n";
__DATA__
aa ,1 ,no ,ed ,8
bb ,2 ,yes ,ed ,10
cc ,3 ,no ,ed ,12
dd ,4 ,yes ,ed ,13
Output:
aa 1 8 bb 2 10 cc 3 12 dd 4 13
I would suggest you need to check your input file - if on unix, try cat -v which might show you you've got bad line endings.
An easy fix if you do have this problem (or to test it) is include:
s/[\r\n]//g;
More generally though I think there's a few errors in your code - that 'push' for example, probably isn't doing what you're thinking, because you're compressing your CSV into a flat array.
I'd also suggest using Data::Dumper to test outputs, because it's clearer what's happening:
$VAR1 = [
' aa ',
'1 ',
'8 ',
' bb ',
'2 ',
'10',
' cc ',
'3 ',
'12',
'dd ',
'4 ',
'13'
];
As you can see, you've flattened your data, which I am assuming isn't what you want, based on what your write in push.
So you might want to consider using [] instead, because then you get:
$VAR1 = [
[
' aa ',
'1 ',
'8 '
],
[
' bb ',
'2 ',
'10'
],
[
' cc ',
'3 ',
'12'
],
[
'dd ',
'4 ',
'13'
]
];
You can also direct assign array slices, rather than having a 1-1:
push ( #d, [ #list[0,1,4] ] );
Also as a final point - it's bad style to use single letter variables, and you should also use warnings;.
Giving you:
use strict;
use Getopt::Std;
use warnings;
use Data::Dumper;
my $input_file = $ARGV[0];
my #data = read_data_from_csv($input_file);
sub read_data_from_csv {
my ($fh) = #_;
my #d = ();
##NB Commented out so I can use inline data
#open(FH, "<$fh") || die "Error: no open for $fh: $!";
while (<DATA>) {
chomp;
s/[\r\n]//g;
my #list = split(/,/);
push( #d, [ #list[0,1,4] ]);
}
##close(FH);
return #d;
}
print Dumper \#data;
__DATA__
aa ,1 ,no ,ed ,8
bb ,2 ,yes ,ed ,10
cc ,3 ,no ,ed ,12
dd ,4 ,yes ,ed ,13
You might want to consider that instead of opening the file specified in $ARGV[0] this is precisely what the 'magic' file handle <> does. It opens file specified on command line and iterates them, but it also allows you to 'pipe' into the process on STDIN:
Making your program:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
## flattened array
my #data_A = map { s/[\r\n]//g; ( split(/,/) ) [0,1,4] } <>;
## preserving structure
my #data_B = map { s/[\r\n]//g; [( split(/,/) ) [0,1,4]] } <>;
print Dumper \#data_B;
I appreciate your code is probably a reduction of the core problem, but I'm just wanting to illustrate simplification options.
You can use https://metacpan.org/pod/Text::CSV_XS module. Text::CSV_XS - comma-separated values manipulation routines. you can use functionality as per your requirement.
use strict;
use warnings;
use Text::CSV_XS;
use Data::Dumper qw(Dumper);
my #rows;
my #column_numbers = (0,1,4);
my $csv_file = "tst.csv";
# check https://metacpan.org/pod/Text::CSV_XS#binary
# check http://metacpan.org/pod/Text::CSV_XS#auto_diag
my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
open my $fh, "<:encoding(utf8)", $csv_file or die "$csv_file: $!";
# $csv->getline_all ($fh); will return a reference to a list of getline ($fh) results.
#The map function of Perl provides a simple way to transform a list of values to another list of values.
#rows = map { [ #{ $_ }[#column_numbers] ] } #{ $csv->getline_all($fh) };
print Dumper(\#rows);
close $fh;
Output
$VAR1 = [
[
'aa ',
'1 ',
'8 '
],
[
'bb ',
'2 ',
'10'
],
[
'cc ',
'3 ',
'12'
],
[
'dd ',
'4 ',
'13'
]
];
Note : Please check solution provided by #ikegami Unable to read multiple columns from a .csv using Text::CSV_XS in Perl
map: 1) https://perldoc.perl.org/functions/map.html
2) https://perlmaven.com/transforming-a-perl-array-using-map

how would I trim and split into perl hash

I have this string
$st="station_1:50, station_2:40, station_3:60";
how would I split this into a Perl hash table ?
I try
%hash = map{split /\:/, $_}(split /, /, $st);
it does correctly - but what if there is n-space between the , and station?
how would I make it so it strip out all the leading space?
If there may or may not be a space, split on /, ?/ instead of /, /. If there may be any number of spaces, use /, */.
The solution with your code (added \s* to the second split):
perl -we '
my $_ = "station_1:50, station_2:40, station_3:60";
my %hash = map {split /:/} split /,\s*/;
use Data::Dumper;
print Dumper \%hash
'
OUTPUT:
$VAR1 = {
'station_1' => '50',
'station_3' => '60',
'station_2' => '40'
};
Another working way using regex:
CODE
$ echo "station_1:50, station_2:40, station_3:6" |
perl -MData::Dumper -lne '
my %h;
$h{$1} = $2 while /\b(station_\d+):(\d+)/ig;
print Dumper \%h
'
SAMPLE OUTPUT
$VAR1 = {
'station_3' => '6',
'station_1' => '50',
'station_2' => '40'
};

Can someone help me with the column sorting code in perl

I am learning perl and wanted to sort a column in a file. I found this code online.
my #array = (
"kyy1 x753y420 31082010 07:01:11",
"exr1 x831y444 31082010 07:43:45",
"eef1 x717y532 31082010 07:30:17",
"bab3 x789y486 31082010 08:08:56",
"sam1 x1017y200 31082010 07:25:18",
"jmd2 x789y466 31082010 07:38:22",
"baa3cqc x720y440 31082010 07:26:37"
);
# Sort by first column - login name
my #sortedName = sort { (split ' ', $a)[0] cmp (split ' ', $b)[0] } #array;
# Sort by second column - SKU number
my #sortedSkno = sort { (split ' ', $a)[1] cmp (split ' ', $b)[1] } #array;
# Sort by third - date - and fourth - time - column combined!
my #sortedTime = sort { (split ' ', $a)[2].(split ' ', $a)[3] cmp (split ' ', $b)[2].(split ' ', $b)[3] } #array;
print "Array\n";
print join( "\n", #array )."\n\n";
print "Sort Name\n";
print join( "\n", #sortedName )."\n\n";
print "Sort Skno\n";
print join( "\n", #sortedSkno )."\n\n";
print "Sort Date\n";
print join( "\n", #sortedTime )."\n\n";
But i don't understand the following line completely
# Sort by first column - login name
my #sortedName = sort { (split ' ', $a)[0] cmp (split ' ', $b)[0] } #array;
What i don't get is that, sort function takes two arguments (split ' ', $a)[0] and (split ' ', $b)[0]. These represent the special variables $a and $b used by sort function. If that is true then how come split function have $a again as in (split ' ', $a)[0] !?
The following modification is understood much better. Its same as
my #sortedName=sort { (split ' ', $a)[0] cmp (split ' ', $b)[0] } #array
but much more readable. This is the modification
my #sortedName=sort{ #arr_a=split(' ',$a);
#arr_b=split(' ',$b);
#arr_a[0] cmp #arr_b[0];
}(#array);
Here the #array is read and two arguments or elements from the list are saved in $a and $b. These are then split using spaces and stored in #arr_a,#arr_b. Then we just compare the 0th element and sort.

Adding an "and" before the last element in a comma-interpolated array in Perl

I want to create a subroutine that adds commas to elements and adds an "and" before the last element, e.g., so that "12345" becomes "1, 2, 3, 4, and 5". I know how to add the commas, but the problem is the result I get is "1, 2, 3, 4, and 5," and I don't know how to get rid of the last comma.
sub commas {
my #with_commas;
foreach (#_) {
push (#with_commas, ($_, ", ")); #up to here it's fine
}
splice #with_commas, -2, 1, ("and ", $_[-1]);
#with_commas;
}
As you can probably tell, I'm trying to delete the last element in the new array (#with_commas), since it has the comma appended, and add in the last element in the old array (#_, passed to the sub routine from the main program, with no added comma).
When I run this, the result is, e.g., "1, 2, 3, 4, and 5," -- with the comma at the end. Where is that comma coming from? Only #with_commas was supposed to get the commas.
Any help is appreciated.
sub format_list {
return "" if !#_;
my $last = pop(#_);
return $last if !#_;
return join(', ', #_) . " and " . $last;
}
print format_list(#list), "\n";
This also handles lists with only one element, unlike most of the other answers.
You could use join and modify the last element to include an and:
my #list = 1 .. 5;
$list[-1] = "and $list[-1]" if $#list;
print join ', ', #list;
There is a CPAN module for this, Lingua::Conjunction. I use it myself, and recommend it over rolling your own solution. The usage syntax is very simple:
conjunction(#list);
#!/usr/bin/perl
use warnings;
use strict;
sub commas {
return "" if #_ == 0;
return $_[0] if #_ == 1;
my $last = pop #_;
my $rest = join (", ", #_);
return $rest.", and ".$last;
}
my #a = (1,2,3,4,5);
print commas(#a), "\n";
Add the commas then add the "and ":
use v5.10;
my $string = join ', ', 1 .. 5;
substr
$string,
rindex( $string, ', ' ) + 2,
0,
'and '
;
say $string;
So, work that in as the case when you have more than two elements:
use v5.10;
my #array = 1..5;
my $string = do {
if( #array == 1 ) {
#array[0];
}
elsif( #array == 2 ) {
join ' and ', #array
}
elsif( #array > 2 ) {
my $string = join ', ', #array;
my $commas = $string =~ tr/,//;
substr
$string,
rindex( $string, ', ' ) + 2,
0,
'and '
;
$string;
}
};
say $string;
Just in the spirit of TIMTOWTDI (though, frankly, #perreal's answer is better as far as readability):
sub commas {
my $last_index = $#_;
my #with_commas = map { (($_==$last_index) ? "and " : "") . $_[$_] }
0 .. $last_index;
print join("," #with_commas)
}
This is somewhat similar to Alan's answer (more convoluted/complicated), but the benefit compared to that is that it would work if you need to add "and " to any OTHER element than the last one; Alan's only works when you know the exact offset (e.g. last element)
Small hint
for( 1 .. 10 ) {
print ;
$_ == 10 ? print '' : ($_ != 9 ? print ', ' : print ' and ');
}

Reading sections from a file in Perl

I am trying to read values from an input file in Perl.
Input file looks like:
1-sampledata1 This is a sample test
and data for this continues
2-sampledata2 This is sample test 2
Data for this also is on second line
I want to read the above data so that data for 1-sampledata1 goes into #array1 and data for 2-sampledata2 goes in #array2 and so on.
I will have about 50 sections like this. like 50-sampledata50.
EDIT: The names wont always be X-sampledataX. I just did that for example. So names cant be in a loop. I think I'll have to type them manually
I so far have the following (which works). But I am looking for a more efficient way to do this..
foreach my $line(#body){
if ($line=~ /^1-sampledata1\s/){
$line=~ s/1-ENST0000//g;
$line=~ s/\s+//g;
push (#array1, $line);
#using splitarray because i want to store data as one character each
#for ex: i wana store 'This' as T H I S in different elements of array
#splitarray1= split ('',$line);
last if ($line=~ /2-sampledata2/);
}
}
foreach my $line(#body){
if ($line=~ /^2-sampledata2\s/){
$line=~ s/2-ENSBTAP0//g;
$line=~ s/\s+//g;
#splitarray2= split ('',$line);
last if ($line=~ /3-sampledata3/);
}
}
As you can see I have different arrays for each section and different for loops for each section. If I go with approach I have so far then I will end up with 50 for loops and 50 arrays.
Is there another better way to do this? In the end I do want to end up with 50 arrays but do not want to write 50 for loops. And since I will be looping through the 50 arrays later on in the program, maybe store them in an array? I am new to Perl so its kinda overwhelming ...
The first thing to notice is that you are trying to use variable names with integer suffixes: Don't. Use an array whenever you find your self wanting to do that. Second, you only need to read to go over the file contents once, not multiple times. Third, there is usually no good reason in Perl to treat a string as an array of characters.
Update: This version of the code uses existence of leading spaces to decide what to do. I am leaving the previous version up as well for reference.
#!/usr/bin/perl
use strict;
use warnings;
my #data;
while ( my $line = <DATA> ) {
chomp $line;
if ( $line =~ s/^ +/ / ) {
push #{ $data[-1] }, split //, $line;
}
else {
push #data, [ split //, $line ];
}
}
use Data::Dumper;
print Dumper \#data;
__DATA__
1-sampledata1 This is a sample test
and data for this continues
2-sampledata2 This is sample test 2
Data for this also is on second line
Previous version:
#!/usr/bin/perl
use strict;
use warnings;
my #data;
while ( my $line = <DATA> ) {
chomp $line;
$line =~ s/\s+/ /g;
if ( $line =~ /^[0-9]+-/ ) {
push #data, [ split //, $line ];
}
else {
push #{ $data[-1] }, split //, $line;
}
}
use Data::Dumper;
print Dumper \#data;
__DATA__
1-sampledata1 This is a sample test
and data for this continues
2-sampledata2 This is sample test 2
Data for this also is on second line
#! /usr/bin/env perl
use strict;
use warnings;
my %data;
{
my( $key, $rest );
while( my $line = <> ){
unless( ($rest) = $line =~ /^ \s+(.*)/x ){
($key, $rest) = $line =~ /^(.*?)\s+(.*)/;
}
push #{ $data{$key} }, $rest;
}
}
The code below is very similar to #Brad Gilbert's and #Sinan Unur's solutions:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my (%arrays, $label);
while (my $line = <DATA>)
{
($label, $line) = ($1, $2) if $line =~ /^(\S+)(.*)/; # new data block
$line =~ s/^\s+//; # strip whitespaces from the begining
# append data for corresponding label
push #{$arrays{$label}}, split('', $line) if defined $label;
}
print $arrays{'1-sampledata1'}[2], "\n"; # 'i'
print join '-', #{$arrays{'2-sampledata2'}}; # 'T-h-i-s- -i-s- -s-a-m-p-l
print Dumper \%arrays;
__DATA__
1-sampledata1 This is a sample test
and data for this continues
2-sampledata2 This is sample test 2
Data for this also is on second line
Output
i
T-h-i-s- -i-s- -s-a-m-p-l-e- -t-e-s-t- -2-D-a-t-a- -f-o-r- -t-h-i-s- -a-l-s-o- -i-s- -o-n- -s-e-c-o-n-d- -l-i-n-e-
$VAR1 = {
'2-sampledata2' => [
'T',
'h',
'i',
's',
' ',
'i',
's',
' ',
's',
'a',
'm',
'p',
'l',
'e',
' ',
't',
'e',
's',
't',
' ',
'2',
'D',
'a',
't',
'a',
' ',
'f',
'o',
'r',
' ',
't',
'h',
'i',
's',
' ',
'a',
'l',
's',
'o',
' ',
'i',
's',
' ',
'o',
'n',
' ',
's',
'e',
'c',
'o',
'n',
'd',
' ',
'l',
'i',
'n',
'e',
'
'
],
'1-sampledata1' => [
'T',
'h',
'i',
's',
' ',
'i',
's',
' ',
'a',
' ',
's',
'a',
'm',
'p',
'l',
'e',
' ',
't',
'e',
's',
't',
'a',
'n',
'd',
' ',
'd',
'a',
't',
'a',
' ',
'f',
'o',
'r',
' ',
't',
'h',
'i',
's',
' ',
'c',
'o',
'n',
't',
'i',
'n',
'u',
'e',
's',
'
'
]
};
You should, instead, use a hash map to arrays.
Use this regex pattern to get the index:
/^(\d+)-sampledata(\d+)/
And then, with my %arrays, do:
push($arrays{$index}), $line;
You can then access the arrays with $arrays{$index}.