Getting data from a table? - Perl

How do I display the data (Stock Name, Capitals, Close Price, Market Value) from a website in the terminal? I have this website:
http://www.tpex.org.tw/web/stock/aftertrading/daily_mktval/mkt.php?l=en-us
and I have created something:
my $url = 'http://www.tpex.org.tw/web/stock/aftertrading/daily_mktval/mkt.php?l=en-us';
use LWP::Simple;
my $content = get $url;
die "Couldn't get $url" unless defined $content;
But I don't really know how to use $content to print the data which I need.
I'll be grateful for any help :)

You need to take a look at the excellent HTML::TableExtract module.
Here's an example that uses the module to extract the data you require. I've used the URL for the printer-friendly version of the page for two reasons: the standard page uses JavaScript to build the table after it has been downloaded, so the table isn't available to LWP::Simple, which has no JavaScript support; and the printer-friendly page includes all the information on a single page, whereas the main page splits it up into many short sections.
This is a far more robust, clear, and flexible technique than using regex patterns to parse HTML, which is generally a terrible idea.
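To make the fragility concrete, here is a small sketch (the cell values are invented): a naive pattern extracts a cell from textbook-clean markup, then silently misses the same cell once an attribute and a line break appear, as they routinely do on real pages.

```perl
use strict;
use warnings;

# The same cell content, but the second version adds an attribute and a
# line break, as real pages routinely do.
my $plain  = '<td>594.00</td>';
my $styled = qq{<td class="price">\n594.00</td>};

my $naive_re = qr{<td>([^<]*)</td>};

my ($from_plain)  = $plain  =~ $naive_re;
my ($from_styled) = $styled =~ $naive_re;

print defined $from_plain  ? "plain:  $from_plain\n"  : "plain:  no match\n";
print defined $from_styled ? "styled: $from_styled\n" : "styled: no match\n";
```

An HTML parser handles both variants identically, because it works on the document structure rather than the byte sequence.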
use strict;
use warnings 'all';
use LWP::Simple;
use HTML::TableExtract;
use open qw/ :std :encoding(utf-8) /;
use constant URL => 'http://www.tpex.org.tw/web/stock/aftertrading/daily_mktval/mkt_print.php?l=en-us';
my $content = get URL or die "Couldn't get " . URL;
my $te = HTML::TableExtract->new( headers => [
qr/Stock\s+Name/,
qr/Capitals/,
qr/Close\s+Price/,
qr/Market\s+Value/,
] );
$te->parse($content);
for my $row ( $te->rows ) {
next unless $row->[0]; # Skip the final row with empty fields
$_ = qq{"$_"} for $row->[0]; # Enclose the Stock Name in quotes
tr/,//d for @{$row}[1,2,3]; # and remove commas from the numeric columns
print join(',', @$row), "\n";
}
output
"OBI Pharma, Inc.",171199584,594.00,101692
"Vanguard International Semiconductor Co.",1638982267,53.90,88341
"Hermes Microvision, Inc.",71000000,1155.00,82005
"TaiMed Biologics Inc.",247732750,238.00,58960
"Phison Electronics Corp.",197373993,271.00,53488
"FamilyMart.co.,Ltd",223220000,202.00,45090
"WIN SEMICONDUCTORS CORP.",596666262,65.30,38962
"PChome online Inc.",99854871,368.50,36796
"TUNG THIH ELECTRONIC CO.,LTD.",84488699,435.00,36752
"ST.SHINE OPTICAL CO.,LTD",50416516,694.00,34989
"POYA CO.,LTD",95277388,350.00,33347
"SIMPLO TECHNOLOGY CO.,LTD.",308284198,108.00,33294
"LandMark Optoelectronics Corporation",69909752,474.50,33172
"Ginko International Co., Ltd.",92697472,340.00,31517
"GIGASOLAR MATERIALS CORPORATION",60989036,506.00,30860
"TTY Biopharm Company Limited",248649959,114.00,28346
"CHIPBOND TECHNOLOGY CORPORATION",649261998,41.90,27204
"Globalwafers.Co.,Ltd.",369250000,69.10,25515
"eMemory Technology lnc.",75782242,321.00,24326
"Parade Technology, Ltd.",76111677,315.50,24013
"PharmaEngine, Inc.",102101000,235.00,23993
"JIH SUN FINANCIAL HOLDING CO., LTD",3396302860,6.86,23298
...

Simple pattern matching and a small trick are enough to do it.
In your task, $content contains the whole page text.
First, extract the table body from $content using .+ with the s flag. The s flag allows . to match newlines as well.
Second, split the extracted data on </tr>.
Third, iterate over the array with foreach, then do pattern matching again with capture groups to extract the data.
Here $l1 and $l2 store the rank and stock code, and the remaining data is stored in the @arc array.
my $url = 'http://www.tpex.org.tw/web/stock/aftertrading/daily_mktval/mkt_print.php?l=en-us&d=2016/06/04&s=0,asc,0';
use LWP::Simple;
my $content = get $url;
die "Couldn't get $url" unless defined $content;
my ($table_body) = $content =~m/<tbody>(.+)<\/tbody>/s;
my @ar = split("</tr>", $table_body);
foreach my $lines (@ar)
{
my ($l1, $l2, @arc) = $lines =~ m/>(.+?)<\/td>/g;
$, = "\t\t";
print @arc, "\n";
}

Related

Pull information from webpage using wget and Perl Script

The purpose of this Perl script is to pull information from a webpage using wget and Perl. In this case, this is the website I am trying to extract information from: https://nocable.org/tv-listings/bm95-pomona-ca-91768
I only want the title to display. For example,
2.1 - The Ellen DeGeneres Show
4.1 - NBC News
7.1 - Hell's Kitchen
And that's it.
This is what I have so far, but I'm not sure if I'm doing it right or wrong. Can anyone help or give some advice on how to display just channels 2.1, 4.1, and 7.1 with the title only, specifically with wget and a Perl script? Any response is appreciated, as I am still learning about Perl.
use warnings;
use strict;
use feature 'say';
use LWP::Simple;
use HTML::TableExtract;
use open qw(:encoding(UTF-8) :std);
my $url = 'https://nocable.org/tv-listings/bm95-pomona-ca-91768';
my $page = get($url) or die "Can't load $url: $!";
my $tec = HTML::TableExtract->new();
$tec->parse($page);
foreach my $rowref ($tec->rows)
{
next if not @$rowref;
# Clean up undefined/whitespace/newlines, often found in HTML
my @row = map {
$_ = '' if not defined; # keep undefined fields for formatting
s/^\s*|\s*$//g; # strip leading and trailing whitespace
s/\s+|\n/ /g; # collapse multiple spaces and newlines
$_ # return it
} @$rowref;
say join ' | ', @row;
}
Your major problem seems to be that you're looking at the wrong table. There are multiple tables on that page and you can use the attribs parameter when creating your parser object in order to return the one you're interested in.
my $tec = HTML::TableExtract->new(attribs => {id => 'programming_data' });
Having added that, your code will print the info for all the channels. Then you need to create a way to identify the channels you're interested in. I used a regex:
my @channels = qw(2.1 4.1 7.1);
my $channel_re = '\b(' . join('|', map { quotemeta } @channels) . ')\b';
And in the main loop, you can skip channels that you aren't interested in:
next unless $rowref->[0] =~ /$channel_re/;
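The filter can be exercised on its own; in this sketch the channel cell strings are invented stand-ins for the first column of the table:

```perl
use strict;
use warnings;

my @channels   = qw(2.1 4.1 7.1);
# quotemeta escapes the dots so "2.1" doesn't also match e.g. "201",
# and the \b anchors stop "2.1" from matching inside "12.1"
my $channel_re = '\b(' . join('|', map { quotemeta } @channels) . ')\b';

# Made-up first-column values standing in for $rowref->[0]
for my $cell ('2.1 KCBS', '5.1 KTLA', '7.1 KABC') {
    print "$cell\n" if $cell =~ /$channel_re/;
}
```

Only the 2.1 and 7.1 rows survive the filter; the 5.1 row is skipped.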

Save a row to csv format

I have a set of rows from a DB that I would like to save to a csv file.
Taking into account that the data are plain ASCII chars without any weird characters, would the following suffice?
my $csv_row = join( ', ', @$row );
# save csv_row to file
My concern is whether that would create rows that are acceptable as CSV by any tool, so that I don't have to worry about quoting etc.
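To illustrate the concern, here is a minimal sketch (the row values and the csv_field helper are made up for the example) showing how a plain join breaks on an embedded comma, and the RFC 4180-style quoting rule that a module such as Text::CSV applies for you:

```perl
use strict;
use warnings;

my @row = ('OBI Pharma, Inc.', 594, 101692);

# Plain join: the embedded comma in the first field now looks like a
# field separator, so the row appears to have four columns, not three.
my $naive = join ',', @row;
print "$naive\n";

# Minimal RFC 4180-style quoting: wrap a field in double quotes if it
# contains a comma, quote, or newline, doubling any embedded quotes.
sub csv_field {
    my ($f) = @_;
    return $f unless $f =~ /[",\n]/;
    $f =~ s/"/""/g;
    return qq{"$f"};
}

my $quoted = join ',', map { csv_field($_) } @row;
print "$quoted\n";
```

The second line round-trips correctly through any CSV reader, whereas the first one silently gains a column.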
Update:
Is there any difference with this?
my $csv = Text::CSV->new ( { binary => 1, eol => "\n"} );
my $header = join (',', qw( COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4 ) );
$csv->print( $fh, [$header] );
foreach my $row ( @data ) {
$csv->print($fh, $row );
}
This gives me as a first line:
" COL_NAME1,COL_NAME2,COL_NAME3,COL_NAME4"
Please notice the double quotes and the rest of the rows are without any quotes.
What is the difference from my plain join? Also, do I need the binary option set?
The safest way should be to write clean records with a comma separator. The simpler the better, especially with a format that has so much variation in real life. If needed, double-quote each field.
The true strength in using the module is for reading of "real-life" data. But it makes perfect sense to use it for writing as well, for a uniform approach to CSV. Also, options can then be set in a clear way, and the module can iron out some glitches in data.
The Text::CSV documentation tells us about binary option
Important Note: The default behavior is to accept only ASCII characters in the range from 0x20 (space) to 0x7E (tilde). This means that the fields can not contain newlines. If your data contains newlines embedded in fields, or characters above 0x7E (tilde), or binary data, you must set binary => 1 in the call to new. To cover the widest range of parsing options, you will always want to set binary.
I'd say use it. Since you write a file this may be it for options, along with eol (or use say method). But do scan the many useful options and review their defaults.
As for your header, the print method expects an array reference where each field is an element, not a single string with comma-separated fields. So it is wrong to say
my $header = join (',', qw(COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4)); # WRONG
$csv->print( $fh, [$header] );
since the $header is a single string, which is then made the sole element of the (anonymous) array reference created by [ ... ]. So it prints this string as the first field in the row, and since it detects the separator , within that field it also double-quotes it. Instead, you should have
$csv->print($fh, [qw(COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4)]);
or, better, assign the column names to @header and then do $csv->print($fh, \@header).
This is also an example of why it is good to use the module for writing: if a comma slips into an element of the array that is supposed to be a single field, it is handled correctly by double-quoting.
A complete example
use warnings;
use strict;
use Text::CSV;
my $csv = Text::CSV->new ( { binary => 1, eol => "\n" } )
or die "Cannot use CSV: " . Text::CSV->error_diag();
my $file = 'output.csv';
open my $fh_out, '>', $file or die "Can't open $file for writing: $!";
my @headers = qw( COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4 );
my @data = 1..4;
$csv->print($fh_out, \@headers);
$csv->print($fh_out, \@data);
close $fh_out;
which produces the file output.csv:
COL_NAME1,COL_NAME2,COL_NAME3,COL_NAME4
1,2,3,4

Perl - Need to append duplicates in a file and write unique value only

I have searched a fair bit and hope I'm not duplicating something someone has already asked. I have what amounts to a CSV that is specifically formatted (as required by a vendor). There are four values that are being delimited as follows:
"Name","Description","Tag","IPAddresses"
The list is quite long (and there are ~150 unique names--only 2 in the sample below) but it basically looks like this:
"2B_AppName-Environment","desc","tag","192.168.1.1"
"2B_AppName-Environment","desc","tag","192.168.22.155"
"2B_AppName-Environment","desc","tag","10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4"
"6G_ServerName-AltEnv","desc","tag","192.192.192.40"
"6G_ServerName-AltEnv","desc","tag","192.168.50.5"
I am hoping for a way in Perl (or sed/awk, etc.) to come up with the following:
"2B_AppName-Environment","desc","tag","192.168.1.1,192.168.22.155,10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4,192.192.192.40,192.168.50.5"
So basically, the resulting file will APPEND the duplicates to the first match -- there should only be one line per each app/server name with a list of comma-separated IP addresses just like what is shown above.
Note that the "Description" and "Tag" fields don't need to be considered in the duplication removal/append logic -- let's assume these are blank for the example to make things easier. Also, in the vendor-supplied list, the "Name" entries are already sorted together.
This short Perl program should suit you. It expects the path to the input CSV file as a parameter on the command line and prints the result to STDOUT. It keeps track of the appearance of new name fields in the @names array so that it can print the output in the order that each name first appears, and it takes the values for desc and tag from the first occurrence of each unique name.
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({always_quote => 1, eol => "\n"});
my (@names, %data);
while (my $row = $csv->getline(*ARGV)) {
my $name = $row->[0];
if ($data{$name}) {
$data{$name}[3] .= ','.$row->[3];
}
else {
push @names, $name;
$data{$name} = $row;
}
}
for my $name (@names) {
$csv->print(*STDOUT, $data{$name});
}
output
"2B_AppName-Environment","desc","tag","192.168.1.1,192.168.22.155,10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4,192.192.192.40,192.168.50.5"
Update
Here's a version that ignores any record that doesn't have a valid IPv4 address in the fourth field. I've used Regexp::Common as it's the simplest way to get complex regex patterns right. It may need installing on your system.
use strict;
use warnings;
use Text::CSV;
use Regexp::Common;
my $csv = Text::CSV->new({always_quote => 1, eol => "\n"});
my (@names, %data);
while (my $row = $csv->getline(*ARGV)) {
my ($name, $address) = @{$row}[0,3];
next unless $address =~ $RE{net}{IPv4};
if ($data{$name}) {
$data{$name}[3] .= ','.$address;
}
else {
push @names, $name;
$data{$name} = $row;
}
}
for my $name (@names) {
$csv->print(*STDOUT, $data{$name});
}
I would advise you to use a CSV parser like Text::CSV for this type of problem.
Borodin has already posted a good example of how to do this.
One of the approaches that I'd advise you NOT to use is regular expressions.
The following one-liner demonstrates how one could do this, but this is a very fragile approach compared to an actual csv parser:
perl -0777 -ne '
while (m{^((.*)"[^"\n]*"\n(?:(?=\2).*\n)*)}mg) {
$s = $1;
$s =~ s/"\n.*"([^"\n]+)(?=")/,$1/g;
print $s
}' test.csv
Outputs:
"2B_AppName-Environment","desc","tag","192.168.1.1,192.168.22.155,10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4,192.192.192.40,192.168.50.5"
Explanation:
Switches:
-0777: Slurp the entire file
-n: Creates a while(<>){...} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
Code:
while (m{^((.*)"[^"\n]*"\n(?:(?=\2).*\n)*)}mg): Separate the text into matching sections.
$s =~ s/"\n.*"([^"\n]+)(?=")/,$1/g;: Join all ip addresses by a comma in matching sections.
print $s: Print the results.
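For comparison, the same grouping can be done in straightforward Perl without the slurped-file regex. This is only a sketch that assumes fully quoted fields with no embedded quotes or commas (a real CSV parser remains the safer choice); the sample rows are taken from the question:

```perl
use strict;
use warnings;

# Sample input rows, same shape as the vendor file
my @lines = (
    '"2B_AppName-Environment","desc","tag","192.168.1.1"',
    '"2B_AppName-Environment","desc","tag","192.168.22.155"',
    '"6G_ServerName-AltEnv","desc","tag","1.2.3.4"',
);

my (@names, %ips, %meta);
for my $line (@lines) {
    # Grab the four quoted fields; assumes no embedded quotes or commas
    my ($name, $desc, $tag, $ip) = $line =~ /"([^"]*)"/g;
    next unless defined $ip;
    unless (exists $ips{$name}) {
        push @names, $name;             # remember first-seen order
        $meta{$name} = [$desc, $tag];   # keep desc/tag from first record
    }
    push @{ $ips{$name} }, $ip;         # collect IPs per name
}

my @out;
for my $name (@names) {
    my ($desc, $tag) = @{ $meta{$name} };
    push @out, sprintf '"%s","%s","%s","%s"',
        $name, $desc, $tag, join(',', @{ $ips{$name} });
}
print "$_\n" for @out;
```

The two duplicate rows collapse into one line with the IPs joined by commas, preserving the order in which each name first appeared.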

Extracting DNA sequences from FASTA file with BioPerl with non-standard header

I'm trying to extract sequences from a database using the following code:
use strict;
use Bio::SearchIO;
use Bio::DB::Fasta;
my ($file, $id, $start, $end) = ("secondround_merged_expanded.fasta","C7136661:0-107",1,10);
my $db = Bio::DB::Fasta->new($file);
my $seq = $db->seq($id, $start, $end);
print $seq,"\n";
Where the header of the sequence I'm trying to extract is: C7136661:0-107, as in the file:
>C7047455:0-100
TATAATGCGAATATCGACATTCATTTGAACTGTTAAATCGGTAACATAAGCAGCACACCTGGGCAGATAGTAAAGGCATATGATAATAAGCTGGGGGCTA
The code works fine when I switch the header to something more standard (like test). I'm thinking that BioPerl doesn't like the non-standard heading. Any way to fix this so I don't have to recode the FASTA file?
By default, Bio::DB::Fasta will use all non-space characters immediately following the > on the header line to form the key for the sequence. In your case this looks like C7047455:0-100, which is the same as the built-in abbreviation for a subsequence. As documented here, instead of $db->seq($id, $start, $stop) you can use $db->seq("$id:$start-$stop"), so a call to $db->seq('C7136661:0-107') looks like you are asking for $db->seq('C7136661', 0, 107), and that key doesn't exist.
I have no way of knowing what is in your data, but if it is adequate to use just the first part of the header up to the colon as a key then you can use the -makeid callback to modify the key. Then if you use just C7136661 to retrieve the sequence it will work.
This code demonstrates. Note that you will probably already have a .index cache file that you must delete before you see any change in behaviour.
use strict;
use warnings;
use Bio::DB::Fasta;
my ($file, $id, $start, $end) = qw(
secondround_merged_expanded.fasta
C7136661
1 10
);
my $db = Bio::DB::Fasta->new($file, -makeid => \&makeid);
sub makeid {
my ($head) = @_;
$head =~ /^>([^:]+)/ or die qq(Invalid header "$head");
$1;
}
my $seq = $db->seq($id, $start, $end);
print $seq, "\n";
I have a related question to this post. I was wondering if anyone has tried what happens when the position in the query is beyond the limit of the FASTA sequence. So let's say the FASTA contains 100 bases and your query asks for position 102: does this method trap the error? I tried this on some real data and it appears to always return "1"; however, my FASTA sequences contain 0/1, so it is hard to tell whether this is an error code or the output for the wrong base.
I tried looking in the documentation but could not find anything.
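As far as I can tell, Bio::DB::Fasta does not raise an error for out-of-range coordinates, so a defensive check is worthwhile. The module's documentation describes a length method that returns the stored sequence length; the clamp_range helper below is a plain-Perl sketch (the helper name is made up) of how one might guard the query:

```perl
use strict;
use warnings;

# Clamp a 1-based [start, end] request to a sequence of known length;
# returns an empty list when the request lies entirely outside it.
sub clamp_range {
    my ($start, $end, $len) = @_;
    return if $start > $len or $end < 1;
    $start = 1    if $start < 1;
    $end   = $len if $end > $len;
    return ($start, $end);
}

# With Bio::DB::Fasta one would guard the query roughly like this
# (untested sketch, not run here):
#   my $len = $db->length($id);
#   my ($s, $e) = clamp_range($start, $end, $len)
#       or die "$id: range $start-$end is outside 1-$len\n";
#   my $seq = $db->seq($id, $s, $e);

print join(',', clamp_range(95, 102, 100)), "\n";   # request clipped to 95,100
print clamp_range(101, 110, 100) ? "ok\n" : "out of range\n";
```

This way an out-of-range request fails loudly instead of silently returning something ambiguous.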

Mechanize example - quite simple but too complex for me: need explanations

Good day, dear community. I am new to programming and I want to dig deeper into Perl.
So I have a Mechanize example that is quite simple but too complex for me, and I need explanations. I need your help with this!
use strict;
$|++;
use WWW::Mechanize;
use File::Basename;
my $m = WWW::Mechanize->new;
$m->get("http://www.despair.com/indem.html");
my @top_links = @{$m->links};
for my $top_link_num (0..$#top_links) {
next unless $top_links[$top_link_num][0] =~ /^http:/;
$m->follow_link( n=>$top_link_num ) or die "can't follow $top_link_num";
print $m->uri, "\n";
for my $image (grep m{^http://store4}, map $_->[0], @{$m->links}) {
my $local = basename $image;
print " $image...", $m->mirror($image, $local)->message, "\n"
}
$m->back or die "can't go back";
}
Can anybody give me a line-by-line explanation?
I tried the first couple of lines.
However, you need to make sure to first read and understand the following documentation:
1) Perl Intro - especially variable scoping part
2) Perl data
3) Perl Data Structures Cookbook
P.S. As Eric said in the comment, this code is definitely NOT a very good example for someone just starting. It's got way too many non-trivial ideas/concepts/moving parts.
use strict;
# Does not allow undeclared global variables or other unsafe constructs.
# You should ALWAYS code with "use strict; use warnings"
# See http://perldoc.perl.org/strict.html
$|++;
# Turn on autoflush on STDOUT filehandle.
# See "http://perldoc.perl.org/perlvar.html" for "$|" and other special variables.
# P.S. This "++" is a hack - it would be a lot more readable to do "$| = 1;"
# since $| only cares whether the value is zero or non-zero.
use WWW::Mechanize; # Load the module for getting web sites.
use File::Basename; # Load the module for finding script's name/path.
my $m = WWW::Mechanize->new; # Create new object via a constructor (new)
$m->get("http://www.despair.com/indem.html");
# Retrieve the contents of the URL.
# See http://search.cpan.org/dist/WWW-Mechanize/lib/WWW/Mechanize.pm
# for the module's documentation (aka POD)
my @top_links = @{$m->links};
# Declare a "@top_links" array,
# get the list of links on the above page (returns an array reference)
# and dereference that array reference and store it in the @top_links array
for my $top_link_num (0..$#top_links) {
# Loop over all integers between 0 and the last index of the @top_links array
# (e.g. if there were 3 links, loop over 0, 1, 2)
# Assign the current loop value to $top_link_num variable
next unless $top_links[$top_link_num][0] =~ /^http:/;
# go to next iteration of the loop unless the current link's URL is HTTP protocol
# Current link is the element of the array with the current index -
# $top_links[$top_link_num]
# The link data is stored as an array reference,
# with the link URL being the first element of the arrayref
# Therefore, $top_links[$top_link_num][0] - which is the shorthand
# for $top_links[$top_link_num]->[0] as you learned
# from reading Data Structures Cookbook I linked - is the URL
# To check if the URL is HTTP protocol, we check if it starts with http:
# via regular expression - see "http://perldoc.perl.org/perlre.html"
$m->follow_link( n=>$top_link_num ) or die "can't follow $top_link_num";
print $m->uri, "\n";
for my $image (grep m{^http://store4}, map $_->[0], @{$m->links}) {
my $local = basename $image;
print " $image...", $m->mirror($image, $local)->message, "\n"
}
$m->back or die "can't go back";
}