Perl HTML::TableExtract- can't find headers


I'm having a little trouble getting the HTML::TableExtract module in Perl working. The problem (I think) is that the table headers contain HTML markup to produce subscripts and special symbols, so I'm not sure how they should be matched using the headers option. I've tried using the full headers (with tags) and also just the text; neither works. I'm trying to extract the tables from the following page (and similar ones for other isotopes):
http://www.nndc.bnl.gov/nudat2/getdataset.jsp?nucleus=208PB&unc=nds
Since I've had no luck with the headers method, I've also tried just specifying the depth and count in the object constructor (presumably both = 0 since there is only one top level table on the page), but it still doesn't find anything. Any assistance would be greatly appreciated!
Here is my attempt using the headers method:
#!/usr/bin/perl -w
use strict;
use warnings;
use HTML::TableExtract;
my $numArgs = $#ARGV + 1;
if ($numArgs != 1) {
print "Usage: perl convertlevels.pl <HTML levels file>\n";
exit;
}
my $htmlfile = $ARGV[0];
open(INFILE,$htmlfile) or die();
my $OutFileName;
if($htmlfile =~ /getdataset.jsp\?nucleus\=(\d+\w+)/){
$htmlfile =~ /getdataset.jsp\?nucleus\=(\d+\w+)/;
$OutFileName = "/home/dominic/run19062013/src/levels/".$1.".lev";
}
my $htmllines = <INFILE>;
open(OUTFILE,">",$OutFileName) or die();
my $te = new HTML::TableExtract->new(headers => ['E<sub>level</sub> <br> (keV)','XREF','Jπ','T<sub>1/2</sub>'] );
$te->parse_file($htmllines);
if ($te->tables)
{
print "I found a table!";
}else{
print "No tables found :'(";
}
close INFILE;
close OUTFILE;
Please ignore for now what is going on with the OUTFILE- the intention is to reformat the table contents and print into a separate file that can be easily read by another application. The trouble I am having is that the table extract method cannot find any tables, so when I test to see if anything found, the result is always false! I've also tried some of the other options in the constructor of the table extract object, but same story for every attempt! First time user so please excuse my n00bishness.
Thanks!

Related

perl - system command arguments give error

Tried hard to find a solution for this. But I probably need some help. I am trying to pass a bunch of arguments in system command in perl. But I get an irrelevant error. I have my variables correctly declared with the right scope and still get this error below. Here is my code.
#!/usr/bin/perl
use warnings;
use strict;
my $mi = 0;
my $mj = 0;
my @regbyte;
my @databyte;
my $filename;
my @args;
@regbyte = ("00","01","02","03","04","05","06","07","08","09","0A","0B","0C","0D","0E","0F","10","11","12");
@databyte = ("00","01","02","03","04","05","06","07","08","09", "0A", "0B");
for($mi=0; $mi<13; $mi++)
{
for($mj=0; $mj<256; $mj++)
{
$filename = "write_" . $regbyte[$mi] . "_" . $databyte[$mj] . ".atp";
system("perl perl_2_ver2.5.pl", $filename, $regbyte[$mi], $databyte[$mj], "n");
}
}
This is the error message I get.
Global symbol "$databyte" requires explicit package name at perl_2_ver2.8.pl line 20.
Execution of perl_2_ver2.8.pl aborted due to compilation errors.

I'm puzzled about a few things, in particular the trailing "n" you have in your system call. Is that supposed to be "\n"? Because it's unnecessary and wrong in that context.
The main problem is that you have
for ( $mj = 0; $mj < 256; $mj++ ) { .. }
and then access $databyte[$mj] when @databyte has only twelve elements. It's hard to know what you might mean.
Here's how I would write something that works, but may not be your intention.
use strict;
use warnings 'FATAL';
for my $regbyte (0 .. 0x12) {
for my $databyte (0 .. 0x0B) {
my $filename = sprintf "write_%02X_%02X.atp", $regbyte, $databyte;
system("perl perl_2_ver2.5.pl $filename $regbyte $databyte");
}
}
It looks like you want to run your script perl_2_ver2.5.pl with input consisting of all files that look like write_*_*.atp. Is that right?
Unless the directory contains atp files that you don't want to process, you are probably better off using just
while (my $filename = glob 'write*.atp') {
next unless $filename =~ /\Awrite_(\p{hex}{2})_(\p{hex}{2})\.atp\z/;
system("perl perl_2_ver2.5.pl $filename $1 $2");
}
which just processes all the files that do exist and match the pattern.

I copy/pasted your code, replacing only the program parameter for the system call, and I do not get the error you are reporting. However, many of the array elements it accesses don't exist.
You can limit your loops using the array sizes like this ($#regbyte is the last valid index, so the comparison must be <=):
for($mi=0; $mi<=$#regbyte; $mi++)
And I believe you have two alternatives for your system call. If perl_2_ver2.5.pl is executable, you can say (assuming it is in the same directory):
system("./perl_2_ver2.5.pl", $filename, $regbyte[$mi], $databyte[$mj], "n");
Or you have to call:
system("perl" , "./perl_2_ver2.5.pl", $filename, $regbyte[$mi], $databyte[$mj], "n");
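Putting the two points together (loop only over elements that exist, and pass the interpreter and script as separate list items), here is a minimal sketch. The helper name build_commands is made up for illustration, $^X is simply the path of the currently running perl, and the trailing "n" argument from the question is left out because its purpose isn't clear. The actual system call is commented out since it assumes perl_2_ver2.5.pl exists in the current directory.

```perl
use strict;
use warnings;

# Same values as in the question, generated instead of typed out.
my @regbyte  = map { sprintf "%02X", $_ } 0x00 .. 0x12;
my @databyte = map { sprintf "%02X", $_ } 0x00 .. 0x0B;

# Hypothetical helper: returns one argument list per (register, data) pair.
sub build_commands {
    my ($script, $regs, $data) = @_;
    my @commands;
    for my $reg (@$regs) {          # iterate over the arrays themselves,
        for my $byte (@$data) {     # so no index can run past the end
            my $filename = "write_${reg}_${byte}.atp";
            push @commands, [ $^X, $script, $filename, $reg, $byte ];
        }
    }
    return @commands;
}

my @commands = build_commands('perl_2_ver2.5.pl', \@regbyte, \@databyte);
printf "%d commands, first arguments: %s\n",
    scalar @commands, join ' ', @{ $commands[0] }[1 .. 4];

# To actually run them (assumes the script exists):
# for my $cmd (@commands) {
#     system(@$cmd) == 0 or warn "command failed: @$cmd (status $?)\n";
# }
```

With the list form of system, no shell is involved, so each array element reaches the script as exactly one argument.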

How to deal with calling data that is parsed into a hash in perl

I parsed the following XML using Perl and I'm trying to access the spectrum results, but I'm having difficulty because the parsed value is a hash. I keep getting the message "Reference found where even-sized list expected".
<message>
<cmd id="result_data">
<result-file-header>
<path>String</path>
<duration>Float</duration>
<spectra-count>Integer</spectra-count>
</result-file-header>
<scan-results count="Integer">
<scan-result>
<spectrum-index>Integer</spectrum-index>
<time-stamp>Integer</time-stamp>
<tic>Float</tic>
<start-mass>Float</start-mass>
<stop-mass>Float</stop-mass>
<spectrum count="Integer">mass,abundance;mass1,abundance1;
mass2,abundance2</spectrum>
</scan-result>
<scan-result>
<spectrum-index>Integer</spectrum-index>
<time-stamp>Integer</time-stamp>
<tic>Float</tic>
<start-mass>Float</start-mass>
<stop-mass>Float</stop-mass>
<spectrum count="Integer">mass3,abundance3;mass4,abundance4;
mass5,abundance5</spectrum>
</scan-result>
</scan-results>
</cmd>
</message>
Here is the Perl code I'm using:
my $file = "gapiparseddataexample1.txt";
unless(open FILE, '>'.$file) {
die "\nUnable to create $file\n";
}
use warnings;
use XML::Simple;
use Data::Dumper;
my $values= XMLin('samplegapi.xml', ForceArray => [ 'scan-result' ,'result-file-header']);
print Dumper($values);
my $results = $values->{'cmd'}->{'scan-results'}->{'scan-result'};
my $results1=$values->{'cmd'}->{'result-file-header'};
for my $data (@$results) {
print FILE "Spectrum Index",":",$data->{"spectrum-index"},"\n";
print FILE "Total Ion Count",":",$data->{tic},"\n";
%spectrum=$data->{spectrum};
print FILE "Spectrum",":",%spectrum, "\n";
for my $data1 (@$results1) {
print FILE "Duration",":",$data1->{duration},"\n";
}
}
I want to be able to print out the spectrum value pairs.

This:
$spectrum=$data->{spectrum};
print FILE "Spectrum",":", $spectrum->{'content'}, "\n";
for my $data1 (@$results1) {
print FILE "Duration",":",$data1->{duration},"\n";
}
Should give you this (which I assume is what you want):
Spectrum:mass,abundance;mass1,abundance1;
mass2,abundance2
You'll want to remove the newline value from 'content' I imagine (so it doesn't split over two lines).
Explanation for anyone that's curious
The element's text has been shoved into the 'content' key because the element also has an attribute. In this case, one called "count":
<spectrum count="Integer">mass3,abundance3;mass4,abundance4;
mass5,abundance5</spectrum>
This sort of behaviour is common in other languages and other XML parsing libraries too (e.g. sometimes they shove it into an element with the key 0). Sometimes it happens even when elements don't have regular attributes but are of specific types.
If you were to dump $data->{spectrum} (for example with Data::Dumper) you'd see the structure (again, that usually applies in other languages and with other XML parsing libraries too).
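Since the goal was to print the spectrum value pairs, here is a rough sketch of splitting the 'content' string into mass/abundance pairs. The hash below is only a stand-in for what $data->{spectrum} looks like after parsing the sample XML, and spectrum_pairs is a made-up helper name. It assumes pairs are separated by ';' and the two values by ','.

```perl
use strict;
use warnings;

# Stand-in for $data->{spectrum} from the sample XML (assumption: the text
# sits under 'content' because the element carries a 'count' attribute).
my $spectrum = {
    count   => 'Integer',
    content => "mass,abundance;mass1,abundance1;\nmass2,abundance2",
};

# Hypothetical helper: turns "m,a;m1,a1;..." into a list of [mass, abundance].
sub spectrum_pairs {
    my ($content) = @_;
    $content =~ s/\s+//g;                  # drop the embedded newline
    return map  { [ split /,/, $_, 2 ] }   # "mass,abundance" -> [mass, abundance]
           grep { length }                 # skip empty pieces
           split /;/, $content;
}

for my $pair (spectrum_pairs($spectrum->{content})) {
    print "mass=$pair->[0] abundance=$pair->[1]\n";
}
```

In the original script you would pass $data->{spectrum}->{content} instead of the stand-in and print to FILE rather than STDOUT.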

Perl create byte array and file stream

I need to be able to send a file stream and a byte array as a response to an HTTP POST for the testing of a product. I am using CGI perl for the back end, but I am not very familiar with Perl yet, and I am not a developer, I am a Linux Admin. Sending a string based on query strings was very easy, but I am stuck on these two requirements. Below is the script that will return a page with Correct or Incorrect depending on the query string. How can I add logic to return a filestream and byte array as well?
#!/usr/bin/perl
use CGI ':standard';
print header();
print start_html();
my $query = new CGI;
my $value = $ENV{'QUERY_STRING'};
my $number = '12345';
if ( $value == $number ) {
print "<h1>Correct Value</h1>\n";
} else {
print "<h1>Incorrect value, you said: $value</h1>\n";
}
print end_html();

Glad to see new people dabbling in Perl from the sysadmin field. This is precisely how I started.
First off, if you're going to use the CGI.pm module I would suggest you use it to your advantage throughout the script. Where you previously wrote <h1> by hand, the h1() function that CGI.pm exports can generate the markup for you. In the end, you'll end up with much cleaner, more manageable code:
#!/usr/bin/perl
use CGI ':standard';
print header();
print start_html();
my $value = $ENV{'QUERY_STRING'};
my $number = '12345';
if ( $value == $number ) {
print h1("Correct Value");
} else {
print h1("Incorrect value, you said: $value");
}
print end_html();
Note that your comparison operator (==) compares its operands numerically, so it only behaves as expected if the value is a number. To compare strings, use the eq operator.
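To see the difference, here is a small sketch that runs both comparisons on a few sample values (the values and the helper name compare_both are just for illustration). The 'numeric' warning is switched off inside the helper because comparing a non-numeric string with == would otherwise warn.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (numeric result, string result) as 1/0.
sub compare_both {
    my ($value, $number) = @_;
    no warnings 'numeric';    # 'abc' == 12345 would warn otherwise
    return ( ($value == $number ? 1 : 0), ($value eq $number ? 1 : 0) );
}

my $number = '12345';
for my $value ('12345', '12345.0', 'abc') {
    my ($numeric, $string) = compare_both($value, $number);
    printf "%-8s  == says %s, eq says %s\n",
        $value,
        $numeric ? 'equal' : 'different',
        $string  ? 'equal' : 'different';
}
```

The middle value is the interesting one: it is numerically equal to 12345 but is not the same string.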
A little clarification regarding what you mean regarding filestreams and byte arrays ... by file stream, do you mean that you want to print out a file to the client? If so, this would be as easy as:
open(my $fh, '<', "/location/of/file") or die "Cannot open file: $!";
while (<$fh>) {
print $_;
}
close($fh);
This opens a file handle linked to the specified file, read-only, prints the content line by line, then closes it. Keep in mind that this will print out the file as-is, and will not look pretty in an HTML page. If you change the Content-type header to "text/plain" this would probably be more within the lines of what you're looking for. To do this, modify the call which prints the HTTP headers to:
print header(-type => 'text/plain');
If you go this route, you'll want to remove your start_html() and end_html() calls as well.
As for the byte array, I guess I'll need a little bit more information about what is being printed, and how you want it formatted.
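In the meantime, if "byte array" simply means a sequence of raw byte values in the response body, one possible sketch looks like this. The payload bytes, the application/octet-stream content type, and the helper name bytes_to_string are all assumptions for illustration. The headers are printed by hand here so the example does not depend on CGI.pm.

```perl
use strict;
use warnings;

# Hypothetical helper: packs a list of byte values (0..255) into a string.
sub bytes_to_string {
    return pack 'C*', @_;
}

# Made-up payload; replace with whatever bytes the product under test expects.
my $payload = bytes_to_string(0xDE, 0xAD, 0xBE, 0xEF);

binmode STDOUT;    # avoid newline translation of the binary body
print "Content-Type: application/octet-stream\r\n";
print "Content-Length: ", length($payload), "\r\n\r\n";
print $payload;
```

The same binmode call is worth adding when streaming a binary file, so the bytes reach the client unmodified.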

Perl Plucene Index Search

Fooling around more with the Perl Plucene module and, having created my index, I am now trying to search it and return results.
My code to create the index is here...chances are you can skip this and read on:
#!/usr/bin/perl
use Plucene::Document;
use Plucene::Document::Field;
use Plucene::Index::Writer;
use Plucene::Analysis::SimpleAnalyzer;
use Plucene::Search::HitCollector;
use Plucene::Search::IndexSearcher;
use Plucene::QueryParser;
use Try::Tiny;
my $content = $ARGV[0];
my $doc = Plucene::Document->new;
my $i=0;
$doc->add(Plucene::Document::Field->Text(content => $content));
my $analyzer = Plucene::Analysis::SimpleAnalyzer->new();
if (!(-d "solutions" )) {
$i = 1;
}
if ($i)
{
my $writer = Plucene::Index::Writer->new("solutions", $analyzer, 1); #Third param is 1 if creating new index, 0 if adding to existing
$writer->add_document($doc);
my $doc_count = $writer->doc_count;
undef $writer; # close
}
else
{
my $writer = Plucene::Index::Writer->new("solutions", $analyzer, 0);
$writer->add_document($doc);
my $doc_count = $writer->doc_count;
undef $writer; # close
}
It creates a folder called "solutions" and various files to it...I'm assuming indexed files for the doc I created. Now I'd like to search my index...but I'm not coming up with anything. Here is my attempt, guided by the Plucene::Simple examples of CPAN. This is after I ran the above with the param "lol" from the command line.
#!/usr/bin/perl
use Plucene::Simple;
my $plucy = Plucene::Simple->open("solutions");
my @ids = $plucy->search("content : lol");
foreach(@ids)
{
print $_;
}
Nothing is printed, sadly )-=. I feel like querying the index should be simple, but perhaps my own stupidity is limiting my ability to do this.

Three things I discovered in time:
Plucene is a grossly inefficient proof-of-concept and the Java implementation of Lucene is BY FAR the way to go if you are going to use this tool. Here is some proof: http://www.kinosearch.com/kinosearch/benchmarks.html
Lucy is a superior choice that does the same thing and has more documentation and community (as per the comment on the question).
How to do what I asked in this problem.
I will share two scripts - one to import a file into a new Plucene index and one to search through that index and retrieve it. A truly working example of Plucene...can't really find it easily on the Internet. Also, I had tremendous trouble CPAN-ing these modules...so I ended up going to the CPAN site (just Google), getting the tar's and putting them in my Perl lib (I'm on Strawberry Perl, Windows 7) myself, however haphazard. Then I would try to run them and CPAN all the dependencies that it cried for. This is a sloppy way to do things...but it's how I did them and now it works.
#!/usr/bin/perl
use strict;
use warnings;
use Plucene::Simple;
my $content_1 = $ARGV[0];
my $content_2 = $ARGV[1];
my %documents;
%documents = (
"".$content_2 => {
content => $content_1
}
);
print $content_1;
my $index = Plucene::Simple->open( "solutions" );
for my $id (keys %documents)
{
$index->add($id => $documents{$id});
}
$index->optimize;
So what does this do...you call the script with two command line arguments of your choosing - it creates a key-value pair of the form "second argument" => { content => "first argument" }. Think of this like the XMLs in the tutorial at the apache site (http://lucene.apache.org/solr/api/doc-files/tutorial.html). The second argument acts as the document ID, and "content" is the field name.
Anywho, this will make a folder in the directory the script was run in - in that folder will be files made by lucene - THIS IS YOUR INDEX!! All we need to do now is search that index using the power of Lucene, something made easy by Plucene. The script is the following:
#!/usr/bin/perl
use strict;
use warnings;
use Plucene::Simple;
my $content_1 = $ARGV[0];
my $index = Plucene::Simple->open( "solutions" );
my (@ids, $error);
my $query = $content_1;
@ids = $index->search($query);
foreach(@ids)
{
print $_."---seperator---";
}
You run this script by calling it from the command line with ONE argument - for example's sake let it be the same first argument as you called the previous script. If you do that you will see that it prints your second argument from the example before! So you have retrieved that value! And given that you have other key-value pairs with the same value, this will print those too! With "---seperator---" between them!

Writing multiple CSV files using Perl

I am trying to create multiple CSV files, which I'm going to use to load tables into MySQL (I had trouble writing code to insert straight into MySQL, and thought this would be easier for me to write, even if it is a bit convoluted). The code compiles correctly and creates the files, but only the first file receives any data (and the first file goes into MySQL fine).
use Text::CSV;
use IO::File;
my $GeneNumber = 4;
my @genearray;
my @Cisarray;
my @Cisgene;
$csv = Text::CSV->new ({ binary => 1, eol => $/ });
$iogene = new IO::File "> Gene.csv";
$iocis = new IO::File "> Cis.csv";
$iocisgene = new IO::File ">Cisgene.csv";
for(my $i=1; $i<=$GeneNumber; $i++)
{
@genearray=();
push(@genearray, 'Gene'.$i);
push(@genearray, rand());
push(@genearray, rand());
my $CisNumber=int(rand(2)+1);
for (my $j=1;$j<=$CisNumber;$j++){
@Cisgene=();
@Cisarray=();
push(@Cisgene, 'Gene'.$i);
push(@Cisgene, 'Cis'.$i.$j);
my $cisgeneref = \@cisgeneref;
$status = $csv->print ($iocisgene, $cisgeneref);
$csv->eol();
push (@Cisarray, 'Cis'.$i.$j);
push (@Cisarray, rand());
my $cisref = \@cisref;
$status = $csv->print ($iocis, $cisref);
$csv->eol();
}
my $generef= \@genearray;
$status = $csv->print ($iogene, $generef);
$csv->eol();
}
I am guessing the problem is something to do with
$status = $csv->print ($iocisgene, $cisgeneref);
I tried creating three versions of:
$csv = Text::CSV->new ({ binary => 1, eol => $/ });
however I still encountered the same problem.
Thanks

This looks like one of those (very common) cases where adding use strict and use warnings to the top of your program will help you to track down the problem easily.
In particular, the lines:
my $cisgeneref = \@cisgeneref;
and:
my $cisref = \@cisref;
look rather suspect as they take references to arrays (@cisgeneref and @cisref) which you have not used previously in the program.
I suspect that you really wanted:
my $cisgeneref = \@Cisgene;
and:
my $cisref = \@Cisarray;
Attempting to write Perl code without use strict and use warnings is a terrible idea.
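To illustrate how strict would have flagged this, here is a small sketch that compiles both variants of the suspect line inside a string eval and reports what Perl says. The helper name compile_error is made up for the demonstration. In the real program you would simply put use strict and use warnings at the top and read the error at compile time.

```perl
use strict;
use warnings;

# Hypothetical helper: compiles a snippet under strict and returns the
# error message, or an empty string if the snippet compiled and ran.
sub compile_error {
    my ($code) = @_;
    my $ok = eval "use strict; use warnings; $code; 1";
    return $ok ? '' : $@;
}

# The line from the question: refers to an array that was never declared.
my $typo = compile_error(
    'my @Cisgene = ("Gene1", "Cis11"); my $cisgeneref = \@cisgeneref;'
);

# The suggested fix: refers to the array that was actually filled.
my $fixed = compile_error(
    'my @Cisgene = ("Gene1", "Cis11"); my $cisgeneref = \@Cisgene;'
);

print "typo version:  ", ($typo  ? "rejected: $typo"  : "compiled\n");
print "fixed version: ", ($fixed ? "rejected: $fixed" : "compiled\n");
```

Without strict, the misspelled name silently creates a new empty array, which is why Cis.csv and Cisgene.csv receive no data.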

Why don't you simply print the string to the files, without those libs:
open GENE, ">", "Gene.csv" or print "Can't create new file: $!\n\n";
print GENE "This;could;be;a;CSV File\nWith;2 lines;and;5 columns;=)";
close GENE or print "Can't close file: $!";
P.S. My CSV files use semicolons rather than commas as separators.
Also, @davorg's suggestion of using strict and warnings is highly recommended.