Writing multiple CSV files using Perl

I am trying to create multiple CSV files which I will then load into MySQL tables (I had trouble writing code that inserts straight into MySQL, and thought this approach would be easier for me to write, even if it is a bit convoluted). The code compiles correctly and creates the files, but only the first file receives any data (and that first file loads into MySQL fine).
use Text::CSV;
use IO::File;

my $GeneNumber = 4;
my @genearray;
my @Cisarray;
my @Cisgene;

$csv = Text::CSV->new({ binary => 1, eol => $/ });
$iogene    = new IO::File "> Gene.csv";
$iocis     = new IO::File "> Cis.csv";
$iocisgene = new IO::File "> Cisgene.csv";

for (my $i = 1; $i <= $GeneNumber; $i++) {
    @genearray = ();
    push(@genearray, 'Gene'.$i);
    push(@genearray, rand());
    push(@genearray, rand());
    my $CisNumber = int(rand(2) + 1);
    for (my $j = 1; $j <= $CisNumber; $j++) {
        @Cisgene  = ();
        @Cisarray = ();
        push(@Cisgene, 'Gene'.$i);
        push(@Cisgene, 'Cis'.$i.$j);
        my $cisgeneref = \@cisgeneref;
        $status = $csv->print($iocisgene, $cisgeneref);
        $csv->eol();
        push(@Cisarray, 'Cis'.$i.$j);
        push(@Cisarray, rand());
        my $cisref = \@cisref;
        $status = $csv->print($iocis, $cisref);
        $csv->eol();
    }
    my $generef = \@genearray;
    $status = $csv->print($iogene, $generef);
    $csv->eol();
}
I am guessing the problem is something to do with
$status = $csv->print ($iocisgene, $cisgeneref);
I tried creating three versions of:
$csv = Text::CSV->new ({ binary => 1, eol => $/ });
however I still encountered the same problem.
Thanks

This looks like one of those (very common) cases where adding use strict and use warnings to the top of your program will help you to track down the problem easily.
In particular, the lines:
my $cisgeneref = \@cisgeneref;
and:
my $cisref = \@cisref;
look rather suspect as they take references to arrays (@cisgeneref and @cisref) which you have not used previously in the program.
I suspect that you really wanted:
my $cisgeneref = \@Cisgene;
and:
my $cisref = \@Cisarray;
Attempting to write Perl code without use strict and use warnings is a terrible idea.
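With those references pointing at the arrays you actually populate, the inner loop would look something like this (a minimal sketch using the variable names from your original code):

for (my $j = 1; $j <= $CisNumber; $j++) {
    @Cisgene  = ();
    @Cisarray = ();
    push @Cisgene, 'Gene'.$i, 'Cis'.$i.$j;
    # reference the array that was just filled, not an unrelated name
    $csv->print($iocisgene, \@Cisgene);
    push @Cisarray, 'Cis'.$i.$j, rand();
    $csv->print($iocis, \@Cisarray);
}

Note that because eol => $/ was passed to the Text::CSV constructor, print already appends the end-of-line string after each row, so the separate $csv->eol() calls are not needed.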

Why don't you simply print the string to the files, without those libs:
open GENE, ">", "Gene.csv" or print "Can't create new file: $!\n\n";
print GENE "This;could;be;a;CSV File\nWith;2 lines;and;5 columns;=)";
close GENE or print "Can't close file: $!";
P.S. In my example the CSV doesn't use commas, but semicolons instead.
Also, @davorg's suggestion of using strict and warnings is highly recommended.
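If you want to stay with Text::CSV but use semicolons, the separator is configurable through the sep_char constructor option (a small sketch):

my $csv = Text::CSV->new({ binary => 1, sep_char => ';', eol => "\n" });
$csv->print($iogene, [ 'Gene1', 0.42, 0.17 ]);   # writes: Gene1;0.42;0.17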

Related

XML::Simple returns "Out of memory" error for large XMLs

This might take a while to explain, but I have a file (XMLList.txt) that contains the paths to multiple IDOC XMLs. The contents of the XMLList.txt look like this:
/usr/local/sterlingcommerce/data/archive/SFGprdr/SFTPGET/2017/Dec/week_4/AU_DHL_PW_Inbound_Delivery_from_Pfizer_20171220071754.xml
/usr/local/sterlingcommerce/data/archive/SFGprdr/SFTPGET/2017/Dec/week_4/AU_DHL_PW_Inbound_Delivery_from_Pfizer_20171220083310.xml
/usr/local/sterlingcommerce/data/archive/SFGprdr/SFTPGET/2017/Dec/week_4/CCMastOut_MQ_GLB_1_20171220154826.xml
I'm attempting to create a Perl script that reads each XML file and extracts just the values of the tags DOCNUM, SNDPRN and RCVPRN into a pipe-delimited file "report.csv".
Another thing to note is that my XML files could be:
All on a single line - example
<?xml version="1.0" encoding="UTF-8"?><ZDELVRY073PL><IDOC BEGIN="1">
<EDI_DC40 SEGMENT="1"><TABNAM>EDI_DC40</TABNAM><MANDT>400</MANDT>
<DOCNUM>0000000443474886</DOCNUM><DOCREL>731</DOCREL><STATUS>30</STATUS>
<DIRECT>1</DIRECT><OUTMOD>4</OUTMOD><IDOCTYP>DELVRY07</IDOCTYP>
<CIMTYP>ZDELVRY073PL</CIMTYP><MESTYP>ZIBDADV</MESTYP><MESCOD>IBG</MESCOD>
<SNDPOR>SAPQ01</SNDPOR><SNDPRT>LS</SNDPRT><SNDPRN>Q01CLNT400</SNDPRN>
<RCVPOR>XMLDIST_MT</RCVPOR><RCVPRT>LS</RCVPRT><RCVPFC>LS</RCVPFC>
<RCVPRN>AU_DHL</RCVPRN>.... </EDI_DC40></IDOC>
or multiline XML:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<INVOIC02>
<IDOC>
<EDI_DC40>
<TABNAM/>
<DOCNUM>0000000658056255</DOCNUM>
<DIRECT/>
<IDOCTYP>INVOIC02</IDOCTYP>
<MESTYP>INVOIC</MESTYP>
<SNDPOR>SAPP01</SNDPOR>
<SNDPRT/>
<SNDPRN>ALE400</SNDPRN>
<RCVPOR>XMLINVOICE</RCVPOR>
<RCVPRT>KU</RCVPRT>
<RCVPRN>C18BASWARE</RCVPRN>
<CREDAT>20171220</CREDAT>
<CRETIM>222323</CRETIM>
</EDI_DC40>
The script I've used so far seems to work for small XMLs. However, some XMLs > 50 MB throw this error:
Out of memory! Out of memory! Callback called exit at
/usr/opt/perl5/lib/site_perl/5.10.1/XML/SAX/Base.pm
line 1941 (#1)
(F) A subroutine invoked from an external package via call_sv()
exited by calling exit.
Out of memory!
So, here's the code I'm using. Would like your help tweaking this:
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;

# use module
use XML::Simple;
use Data::Dumper;

# create object
my $xml = new XML::Simple;

my $file_list = 'XMLList.txt';
open(my $fh_i, '<:encoding(UTF-8)', $file_list)
    or die "Could not open file '$file_list' $!";

my $csv_out = 'report.csv';
open(my $fh_o, '>', $csv_out)
    or die "Could not open file '$csv_out' $!";

while (my $row = <$fh_i>) {
    $row =~ s/\R//g;
    my $data = $xml->XMLin($row);
    print $fh_o "$data->{IDOC}->{EDI_DC40}->{DOCNUM}|";
    print $fh_o "$data->{IDOC}->{EDI_DC40}->{SNDPRN}|";
    print $fh_o "$data->{IDOC}->{EDI_DC40}->{RCVPRN}\n";
}
close $fh_o;
I recommend that people stop using XML::Simple when they have a problem using it. That module is nice to get started, but it's not meant to be a long-term solution. Even then, see Why is XML::Simple "Discouraged"?
XML::Twig is what I often use for these tasks. You can set up handlers for tags and get that part of the tree. You process it and move on. That might be as simple as something like this where I set up a subroutine to process each EDI_DC40 as I encounter it:
use Text::CSV_XS;
use XML::Twig;

my $csv = Text::CSV_XS->new;

my $twig = XML::Twig->new(
    twig_handlers => {
        'EDI_DC40' => \&process_EDI_DC40,
    },
);

$twig->parsefile( $ARGV[0] );

sub process_EDI_DC40 {
    my( $twig, $thingy ) = @_;
    my @values = map { $thingy->first_child( $_ )->text }
        qw(DOCNUM RCVPRN SNDPRN);
    $csv->say( *STDOUT, \@values );
}
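To cover the XMLList.txt part of the question, you could run that same handler over each path in the list (a sketch that assumes each line of XMLList.txt is one file path, as shown in the question):

open my $list_fh, '<', 'XMLList.txt' or die "Can't open XMLList.txt: $!";
while ( my $path = <$list_fh> ) {
    chomp $path;
    next unless length $path;
    # XML::Twig parses as a stream, handing each EDI_DC40 to the handler above
    XML::Twig->new(
        twig_handlers => { 'EDI_DC40' => \&process_EDI_DC40 },
    )->parsefile( $path );
}

For very large documents you can also call $twig->purge inside the handler, or use twig_roots, so parts of the tree that have already been processed are released.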
First off, if the file contains newlines,
while (my $row = <$fh_i>){
$row =~ s/\R//g;
my $data = $xml->XMLin($row);
is going to read one line at a time from the file and attempt to do an XML conversion on that line alone instead of the whole document. I would recommend that you slurp each file into a buffer and use regex to eliminate newlines and carriage returns before XMLin conversion. Also, XMLin will die unceremoniously if there are any XML errors in the file, so you want to run it in an eval block.
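A sketch of that suggestion, keeping XML::Simple and reusing the $xml, $fh_i and $fh_o handles from the question (this addresses the per-line parsing and the error handling; the memory problem on very large files is still better solved with a streaming parser such as XML::Twig):

while ( my $path = <$fh_i> ) {
    chomp $path;

    # slurp the whole document instead of feeding XMLin one line of the path list
    open my $xml_fh, '<:encoding(UTF-8)', $path or die "Can't open '$path': $!";
    my $doc = do { local $/; <$xml_fh> };
    close $xml_fh;
    $doc =~ s/\r?\n//g;

    # XMLin dies on malformed XML, so trap it
    my $data = eval { $xml->XMLin($doc) };
    if ($@) {
        warn "Skipping '$path': $@";
        next;
    }
    print $fh_o join('|', map { $data->{IDOC}{EDI_DC40}{$_} } qw(DOCNUM SNDPRN RCVPRN)), "\n";
}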

Perl HTML::TableExtract- can't find headers

I'm having a little trouble getting the HTML::TableExtract module in Perl working. The problem (I think) is that the table headers contain HTML code to produce subscripts and special symbols, so I'm not sure how these should be searched for using the headers method. I've tried using the full headers (with tags), and also just the text, neither of which works. I'm trying to extract the tables from the following page (and similar ones for other isotopes):
http://www.nndc.bnl.gov/nudat2/getdataset.jsp?nucleus=208PB&unc=nds
Since I've had no luck with the headers method, I've also tried just specifying the depth and count in the object constructor (presumably both = 0 since there is only one top level table on the page), but it still doesn't find anything. Any assistance would be greatly appreciated!
Here is my attempt using the headers method:
#!/usr/bin/perl -w
use strict;
use warnings;
use HTML::TableExtract;

my $numArgs = $#ARGV + 1;
if ($numArgs != 1) {
    print "Usage: perl convertlevels.pl <HTML levels file>\n";
    exit;
}
my $htmlfile = $ARGV[0];
open(INFILE, $htmlfile) or die();

my $OutFileName;
if ($htmlfile =~ /getdataset.jsp\?nucleus\=(\d+\w+)/) {
    $htmlfile =~ /getdataset.jsp\?nucleus\=(\d+\w+)/;
    $OutFileName = "/home/dominic/run19062013/src/levels/" . $1 . ".lev";
}
my $htmllines = <INFILE>;
open(OUTFILE, ">", $OutFileName) or die();

my $te = new HTML::TableExtract->new(headers => ['E<sub>level</sub> <br> (keV)', 'XREF', 'Jπ', 'T<sub>1/2</sub>']);
$te->parse_file($htmllines);

if ($te->tables) {
    print "I found a table!";
} else {
    print "No tables found :'(";
}
close INFILE;
close OUTFILE;
Please ignore for now what is going on with the OUTFILE - the intention is to reformat the table contents and print them into a separate file that can be easily read by another application. The trouble I am having is that the table extract method cannot find any tables, so when I test to see if anything was found, the result is always false! I've also tried some of the other options in the constructor of the table extract object, but it's the same story for every attempt! First-time user, so please excuse my n00bishness.
Thanks!

How do I read in an editable file that contains words I don't want stemmed, using Lingua::Stem's add_exceptions($exceptions_hash_ref) in Perl?

I am using Perl's Lingua::Stem module (Lingua::Stem) and I want to have a text file or other editable file format to contain a list of words I do not want stemmed. I want to be able to add words to the file any time.
Their example shows:
add_exceptions($exceptions_hash_ref);
What is the best way to do this?
I used their method in hard coding some exceptions, but I want to do this with a file.
# adding default exceptions
Lingua::Stem::add_exceptions({ 'emily'  => 'emily',
                               'driven' => 'driven',
                             });
You can define a function to load exceptions from the given file:
sub load_exceptions {
    my $fname = shift;
    my %list;
    open (my $in, "<", $fname) or die("load_exceptions: $fname");
    while (<$in>) {
        chomp;
        $list{$_} = $_;
    }
    close $in;
    return \%list;
}
And use it:
Lingua::Stem::add_exceptions(load_exceptions("notstem.txt"));
Example input file:
emily
driven
Assuming your "editable" file is whitespace separated, like so:
emily emily
driven driven
Your code could be:
open my $fh, "<", "excep.txt" or die $!;
my $href = { map split, <$fh> };
Lingua::Stem::add_exceptions($href);
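Either way, once the exceptions are registered you use the stemmer as usual (a small sketch; stem() is Lingua::Stem's exportable interface and returns a reference to the list of stemmed words):

use Lingua::Stem qw(stem);

Lingua::Stem::add_exceptions({ 'emily' => 'emily', 'driven' => 'driven' });

my $stemmed = stem('emily', 'driving', 'driven');
print "@$stemmed\n";   # the exception words come back unchanged; 'driving' is stemmed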

Perl Plucene Index Search

I'm fooling around more with the Perl Plucene module and, having created my index, I am now trying to search it and return results.
My code to create the index is here...chances are you can skip this and read on:
#!/usr/bin/perl
use Plucene::Document;
use Plucene::Document::Field;
use Plucene::Index::Writer;
use Plucene::Analysis::SimpleAnalyzer;
use Plucene::Search::HitCollector;
use Plucene::Search::IndexSearcher;
use Plucene::QueryParser;
use Try::Tiny;

my $content = $ARGV[0];
my $doc = Plucene::Document->new;
my $i = 0;

$doc->add(Plucene::Document::Field->Text(content => $content));
my $analyzer = Plucene::Analysis::SimpleAnalyzer->new();

if (!(-d "solutions")) {
    $i = 1;
}

if ($i) {
    # Third param is 1 if creating a new index, 0 if adding to an existing one
    my $writer = Plucene::Index::Writer->new("solutions", $analyzer, 1);
    $writer->add_document($doc);
    my $doc_count = $writer->doc_count;
    undef $writer;    # close
} else {
    my $writer = Plucene::Index::Writer->new("solutions", $analyzer, 0);
    $writer->add_document($doc);
    my $doc_count = $writer->doc_count;
    undef $writer;    # close
}
It creates a folder called "solutions" and writes various files to it...I'm assuming these are the index files for the doc I created. Now I'd like to search my index...but I'm not coming up with anything. Here is my attempt, guided by the Plucene::Simple examples on CPAN. This is after I ran the above with the argument "lol" from the command line.
#!/usr/bin/perl
use Plucene::Simple;

my $plucy = Plucene::Simple->open("solutions");
my @ids = $plucy->search("content : lol");

foreach (@ids) {
    print $_;
}
Nothing is printed, sadly )-=. I feel like querying the index should be simple, but perhaps my own stupidity is limiting my ability to do this.
Three things I discovered in time:
Plucene is a grossly inefficient proof-of-concept and the Java implementation of Lucene is BY FAR the way to go if you are going to use this tool. Here is some proof: http://www.kinosearch.com/kinosearch/benchmarks.html
Lucy is a superior choice that does the same thing and has more documentation and community (as per the comment on the question).
How to do what I asked in this problem.
I will share two scripts - one to import a file into a new Plucene index and one to search through that index and retrieve results. A truly working example of Plucene is hard to find on the Internet. Also, I had tremendous trouble CPAN-ing these modules, so I ended up going to the CPAN site (just Google it), downloading the tarballs and putting them in my Perl lib myself (I'm on Strawberry Perl, Windows 7), however haphazard that is. Then I would try to run them and CPAN all the dependencies they cried for. This is a sloppy way to do things...but it's how I did it and now it works.
#!/usr/bin/perl
use strict;
use warnings;
use Plucene::Simple;

my $content_1 = $ARGV[0];
my $content_2 = $ARGV[1];

my %documents = (
    "" . $content_2 => {
        content => $content_1,
    },
);
print $content_1;

my $index = Plucene::Simple->open("solutions");
for my $id (keys %documents) {
    $index->add($id => $documents{$id});
}
$index->optimize;
So what does this do? You call the script with two command-line arguments of your choosing - it creates a document keyed by the second argument whose content field holds the first argument. Think of this like the XMLs in the tutorial at the Apache site (http://lucene.apache.org/solr/api/doc-files/tutorial.html). The second argument acts as the document ID; the field name here is "content".
Anywho, this will make a folder in the directory the script was run in - in that folder will be files made by lucene - THIS IS YOUR INDEX!! All we need to do now is search that index using the power of Lucene, something made easy by Plucene. The script is the following:
#!/usr/bin/perl
use strict;
use warnings;
use Plucene::Simple;

my $content_1 = $ARGV[0];

my $index = Plucene::Simple->open("solutions");
my (@ids, $error);
my $query = $content_1;

@ids = $index->search($query);

foreach (@ids) {
    print $_ . "---seperator---";
}
You run this script by calling it from the command line with ONE argument - for example's sake, let it be the same first argument you used when you called the previous script. If you do that, you will see that it prints your second argument from the example before! So you have retrieved that value! And if you have other key-value pairs with the same value, this will print those too, with "---seperator---" between them!

How do I access a hash from another subroutine?

I am trying to create some scripts for web testing and I use the following piece of code to set up variables from a config file:
package setVariables;

sub readConfig {
    open(FH, "workflows.config") or die $!;
    while (<FH>) {
        ($s_var, $s_val) = split("=", $_);
        chomp($s_var);
        chomp($s_val);
        $args{$s_var} = $s_val;
        print "set $s_var = $s_val\n";
    }
    close(FH);
}
for example: var1=val1
var2=val2
var3=val3
etc...
I want to be able to pass the values set by this subroutine to a subroutine in another package. This is what I have for the package I want it passed into.
package startTest;
use setVariables;

sub startTest {
    my %args = %setVariables::args;
    my $s_var = $setVariables::s_var;
    my $s_val = $setVariables::s_var;
    setVariables::readConfig();    # runs the readConfig sub to set variables

    my $sel = Test::WWW::Selenium->new( host        => "localhost",
                                        port        => 4444,
                                        browser     => $args{"browser"},
                                        browser_url => $args{"url"} );
    $sel->open_ok("/index.aspx");
    $sel->set_speed($args{"speed"});
    $sel->type_ok("userid", $args{"usrname"});
    $sel->type_ok("password", $args{"passwd"});
    $sel->click_ok("//button[\@value='Submit']");
    $sel->wait_for_page_to_load_ok("30000");
    sleep($args{"sleep"});
}
Unfortunately it's not holding on to the variables as-is, and I don't know how to reference them.
Thank you for any help.
Your code has some problems. Let's fix those first.
# Package names should start with upper case unless they are pragmas.
package SetVariables;

# Do this EVERYWHERE. It will save you hours of debugging.
use strict;
use warnings;

sub readConfig {
    # Use the three-argument form of open()
    open( my $fh, '<', "workflows.config" )
        or die "Error opening config file: $!\n";

    my %config;

    # Use an explicit variable rather than $_
    while ( my $line = <$fh> ) {
        chomp $line;    # One chomp of the line is sufficient.
        my ($s_var, $s_val) = split "=", $line;
        $config{$s_var} = $s_val;
        print "set $s_var = $s_val\n";
    }
    close $fh;

    return \%config;
}
Then use like so:
use SetVariables;
my $config = SetVariables::readConfig();
print "$_ is $config->{$_}\n"
for keys %$config;
But rather than do all this yourself, check out the many, many config file modules on CPAN. Consider Config::Any, Config::IniFiles, Config::JSON.
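For example, with Config::IniFiles the reading routine collapses to a couple of lines (a sketch assuming an INI-style file called workflows.ini with a [selenium] section; the file, section and key names are just for illustration):

use Config::IniFiles;

# workflows.ini (hypothetical):
#   [selenium]
#   browser=*firefox
#   url=http://localhost
my $cfg = Config::IniFiles->new( -file => 'workflows.ini' )
    or die "Can't read workflows.ini";
my $browser = $cfg->val( 'selenium', 'browser' );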
You note in your comment that you are trying to work with multiple files, your main code and a couple of packages.
One pattern that is common is to load your config in your main code and pass it (or select elements of it) to consuming code:
package LoadConfig;

sub read_config {
    my $file = shift;
    my $config;
    # Do stuff to read a file into your config object;
    return $config;
}
1;
Meanwhile in another file:
package DoStuff;

sub run_some_tests {
    my $foo = shift;
    my $bar = shift;
    # Do stuff here
    return;
}

sub do_junk {
    my $config = shift;
    my $foo = $config->{foo};
    # Do junk
    return;
}
1;
And in your main script:
use DoStuff;
use LoadConfig;

my $config = LoadConfig::read_config('my_config_file.cfg');

run_some_tests( $config->{foo}, $config->{bar} );
do_junk( $config );
So in run_some_tests() I extract a couple elements from the config and pass them in individually. In do_junk() I just pass in the whole config variable.
Are your users going to see the configuration file or just programmers? If it's just programmers, put your configuration in a Perl module, then use use to import it.
The only reason to use a configuration file when only programmers will ever touch it is if you are compiling the program. Since Perl programs are scripts, don't bother with the overhead of parsing a configuration file; just write it as Perl.
The exception is when the file is for your users and its format is simpler than Perl.
PS: There's already a module called Config. Call yours My_config and load it like this:
use FindBin '$RealBin';
use lib $RealBin;
use My_config;
See:
perldoc FindBin
perldoc Config
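A minimal My_config.pm along those lines might look like this (the %config name and its keys are just for illustration):

package My_config;
use strict;
use warnings;
use Exporter 'import';
our @EXPORT = qw(%config);

our %config = (
    browser => '*firefox',
    url     => 'http://localhost',
    sleep   => 5,
);

1;

After use My_config; the %config hash is available directly in the calling code.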
I would suggest using a regular format, such as YAML, to store the configuration data. You can then use YAML::LoadFile to read back a hash reference of the configuration data and then use it.
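For instance (a sketch assuming the configuration lives in a file such as workflows.yml; LoadFile is exported by the YAML module on request):

use YAML qw(LoadFile);

# workflows.yml (hypothetical):
#   browser: '*firefox'
#   url: http://localhost
#   sleep: 5
my $config = LoadFile('workflows.yml');
print $config->{browser}, "\n";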
Alternatively, if you don't want to use YAML or some other configuration format with pre-written modules, you'll need your reading routine to actually return either a hash or a hashref.
If you need some more background information, check out perlref, perlreftut and perlintro.
All you need to do is collect the variables in a hash and return a reference to it in readConfig:
my %vars = ( var1 => 'val1',
             var2 => 'val2',
             var3 => 'val3',
           );
return \%vars;
and in startTest:
my $set_vars = setVariables::readConfig();