Can you save to disk using XPath's setNodeText function? - perl

I'm using an XPath like "school/student[4]". Can the setNodeText function save the change to the hard disk? My changes only seem to be made in memory.

If I understand correctly, you are trying to change a document then write it to disk.
use XML::LibXML qw( );

my $parser = XML::LibXML->new();
my $doc = $parser->parse_fh(...);

my $root = $doc->documentElement();
for my $node ($root->findnodes('//school/student[4]')) {
    $node->removeChildNodes();
    $node->appendText("New text");
}

open(my $fh, '>:raw', ...) or die $!;
print($fh $doc->toString());
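As a hedged aside, XML::LibXML documents can also serialize themselves straight to a file, so you can skip managing the filehandle yourself (the filename below is illustrative):

$doc->toFile('school.xml');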

You can dump the XML structure using the undocumented method getNodeAsXML. The output isn't guaranteed to be valid XML (e.g. no XML declaration), but it usually does the trick.
my $str = $xp->getNodeAsXML();
print $str;
Source: http://www.perlmonks.org/?node_id=567212
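Putting the two pieces together, here is a minimal sketch of the change-then-save flow, assuming $xp is an XML::XPath object; the filename is illustrative, and since getNodeAsXML is undocumented, treat this as a best-effort approach rather than a guaranteed API:

use XML::XPath;

my $xp = XML::XPath->new(filename => 'school.xml');

# setNodeText only changes the in-memory tree...
$xp->setNodeText('school/student[4]', 'New text');

# ...so serialize the modified tree back to disk yourself.
open my $out, '>', 'school.xml' or die $!;
print $out $xp->getNodeAsXML();
close $out;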

Related

XML::Simple returns "Out of memory" error for large XMLs

This might take a while to explain, but I have a file (XMLList.txt) that contains the paths to multiple IDOC XMLs. The contents of the XMLList.txt look like this:
/usr/local/sterlingcommerce/data/archive/SFGprdr/SFTPGET/2017/Dec/week_4/AU_DHL_PW_Inbound_Delivery_from_Pfizer_20171220071754.xml
/usr/local/sterlingcommerce/data/archive/SFGprdr/SFTPGET/2017/Dec/week_4/AU_DHL_PW_Inbound_Delivery_from_Pfizer_20171220083310.xml
/usr/local/sterlingcommerce/data/archive/SFGprdr/SFTPGET/2017/Dec/week_4/CCMastOut_MQ_GLB_1_20171220154826.xml
I'm attempting to create a Perl script that reads each XML file and extracts just the values of the DOCNUM, SNDPRN and RCVPRN tags from each file into a pipe-delimited file, "report.csv".
Another thing to note is that my XML files could be:
All on a single line, for example:
<?xml version="1.0" encoding="UTF-8"?><ZDELVRY073PL><IDOC BEGIN="1">
<EDI_DC40 SEGMENT="1"><TABNAM>EDI_DC40</TABNAM><MANDT>400</MANDT>
<DOCNUM>0000000443474886</DOCNUM><DOCREL>731</DOCREL><STATUS>30</STATUS>
<DIRECT>1</DIRECT><OUTMOD>4</OUTMOD><IDOCTYP>DELVRY07</IDOCTYP>
<CIMTYP>ZDELVRY073PL</CIMTYP><MESTYP>ZIBDADV</MESTYP><MESCOD>IBG</MESCOD>
<SNDPOR>SAPQ01</SNDPOR><SNDPRT>LS</SNDPRT><SNDPRN>Q01CLNT400</SNDPRN>
<RCVPOR>XMLDIST_MT</RCVPOR><RCVPRT>LS</RCVPRT><RCVPFC>LS</RCVPFC>
<RCVPRN>AU_DHL</RCVPRN>.... </EDI_DC40></IDOC>
or multiline XML:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<INVOIC02>
<IDOC>
<EDI_DC40>
<TABNAM/>
<DOCNUM>0000000658056255</DOCNUM>
<DIRECT/>
<IDOCTYP>INVOIC02</IDOCTYP>
<MESTYP>INVOIC</MESTYP>
<SNDPOR>SAPP01</SNDPOR>
<SNDPRT/>
<SNDPRN>ALE400</SNDPRN>
<RCVPOR>XMLINVOICE</RCVPOR>
<RCVPRT>KU</RCVPRT>
<RCVPRN>C18BASWARE</RCVPRN>
<CREDAT>20171220</CREDAT>
<CRETIM>222323</CRETIM>
</EDI_DC40>
The script I've used so far seems to work for small XMLs. However, some XMLs > 50 MB throw this error:
Out of memory! Out of memory! Callback called exit at
/usr/opt/perl5/lib/site_perl/5.10.1/XML/SAX/Base.pm
line 1941 (#1)
(F) A subroutine invoked from an external package via call_sv()
exited by calling exit.
Out of memory!
So, here's the code I'm using; I'd like your help tweaking it:
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;

# use module
use XML::Simple;
use Data::Dumper;

# create object
my $xml = new XML::Simple;

my $file_list = 'XMLList.txt';
open(my $fh_i, '<:encoding(UTF-8)', $file_list)
    or die "Could not open file '$file_list' $!";

my $csv_out = 'report.csv';
open(my $fh_o, '>', $csv_out)
    or die "Could not open file '$csv_out' $!";

while (my $row = <$fh_i>) {
    $row =~ s/\R//g;
    my $data = $xml->XMLin($row);
    print $fh_o "$data->{IDOC}->{EDI_DC40}->{DOCNUM}|";
    print $fh_o "$data->{IDOC}->{EDI_DC40}->{SNDPRN}|";
    print $fh_o "$data->{IDOC}->{EDI_DC40}->{RCVPRN}\n";
}
close $fh_o;
close $fh_o;
I recommend that people stop using XML::Simple when they have a problem using it. That module is nice for getting started, but it's not meant to be a long-term solution. Even then, see Why is XML::Simple "Discouraged"?
XML::Twig is what I often use for these tasks. You can set up handlers for tags and get that part of the tree. You process it and move on. That might be as simple as something like this, where I set up a subroutine to process each EDI_DC40 as I encounter it:
use Text::CSV_XS;
use XML::Twig;

my $csv = Text::CSV_XS->new;

my $twig = XML::Twig->new(
    twig_handlers => {
        'EDI_DC40' => \&process_EDI_DC40,
    },
);

$twig->parsefile( $ARGV[0] );

sub process_EDI_DC40 {
    my( $twig, $thingy ) = @_;

    my @values = map { $thingy->first_child( $_ )->text }
        qw(DOCNUM RCVPRN SNDPRN);

    $csv->say( *STDOUT, \@values );
}
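One addition worth making for the multi-gigabyte case: the memory savings come from discarding each twig once it has been handled, which XML::Twig's purge method does. A hedged variant of the handler above:

sub process_EDI_DC40 {
    my( $twig, $thingy ) = @_;

    my @values = map { $thingy->first_child( $_ )->text }
        qw(DOCNUM RCVPRN SNDPRN);

    $csv->say( *STDOUT, \@values );

    $twig->purge;   # release the portion of the tree parsed so far
}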
First off, if the file contains newlines,
while (my $row = <$fh_i>) {
    $row =~ s/\R//g;
    my $data = $xml->XMLin($row);
is going to read one line at a time from the file and attempt to do an XML conversion on that line alone instead of the whole document. I would recommend that you slurp each file into a buffer and use regex to eliminate newlines and carriage returns before XMLin conversion. Also, XMLin will die unceremoniously if there are any XML errors in the file, so you want to run it in an eval block.
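A minimal sketch of that advice, reusing the question's $fh_i, $fh_o and $xml variables (the error handling is illustrative):

while (my $path = <$fh_i>) {
    chomp $path;

    # Slurp the whole file so multi-line XML is parsed as one document.
    my $xml_text = do {
        open my $fh, '<:encoding(UTF-8)', $path
            or die "Could not open file '$path' $!";
        local $/;    # disable the input record separator: read everything
        <$fh>;
    };

    # XMLin dies on malformed XML, so wrap it in eval and skip bad files.
    my $data = eval { $xml->XMLin($xml_text) };
    if ($@) {
        warn "Skipping '$path': $@";
        next;
    }

    print $fh_o join('|',
        @{ $data->{IDOC}{EDI_DC40} }{qw(DOCNUM SNDPRN RCVPRN)}), "\n";
}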

How to read from URL line-wise?

I'm looking for the "moral equivalent" of the (fictitious) openremote below:
my $handle = openremote( 'http://some.domain.org/huge.tsv' ) or die $!;
while ( <$handle> ) {
    chomp;
    # etc.
    # do stuff with $_
}
close $handle;
IOW, I'm looking for a way to open a read handle to a remote file so that I can read from it line-by-line. (Typically this file will be larger than I want to read entirely into memory. This means that solutions based on stuffing the value returned by LWP::Simple::get (for example) into an IO::String are not suitable.)
I'm sure this is really basic stuff, but I have not been able to find it after a lot of searching.
Here's a "solution" much like the other responses but it cheats a bit by using IO::All
use IO::All;

my $http_io = io->http("http://some.domain.org/huge.tsv");
while (my $line = $http_io->getline || $http_io->getline) {
    print $line;
}
After you have an object via io->http, you can use IO methods on it (like getline(), etc.).
Cheers.
You can use LWP::UserAgent's parameter :content_file => $filename to save the big file to disk directly, without filling the memory with it, and then you can read that file in your program.
$ua->get( $url, ':content_file' => $filename );
Or you can use the parameter :content_cb => \&callback, and in the callback subroutine you can process the data chunk by chunk as it is downloaded. This is probably the approach you need.
$ua->get( $url, ':content_cb' => \&callback );
sub callback {
    my ( $chunk, $response, $protocol ) = @_;
    # Do whatever you like with $chunk
}
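Since the question asks for line-wise processing and chunks can end mid-line, here is a hedged sketch of reassembling lines inside the callback (the buffering is my own illustration, not part of the LWP API):

my $buffer = '';
$ua->get( $url, ':content_cb' => sub {
    my ( $chunk, $response, $protocol ) = @_;
    $buffer .= $chunk;

    # Emit every complete line; keep any trailing partial line buffered.
    while ($buffer =~ s/\A([^\n]*)\n//) {
        my $line = $1;
        # do stuff with $line
    }
});

# A final line with no trailing newline may remain in the buffer.
if (length $buffer) {
    # do stuff with $buffer
}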
Read (a little) more about this with perldoc LWP::UserAgent.
Use LWP::Simple coupled with IO::String like so:
#!/usr/bin/env perl

use strict;
use warnings;

use LWP::Simple;
use IO::String;

my $handle = IO::String->new(get("http://stackoverflow.com"));
while (defined (my $line = <$handle>)) {
    print $line;
}
close $handle;
Hope it works for you.
Paul

FTP + uncompress + readline

I want to extract some data from a large-ish (3+ GB, gzipped) FTP download, and do this on the fly, to avoid dumping the full download on my disk.
To extract the desired data I need to examine the uncompressed stream line-by-line.
So I'm looking for the moral equivalent of
use PerlIO::gzip;

open my $handle, '<:gzip', 'ftp://ftp.foobar.com/path/to/blotto.txt.gz'
    or die $!;

for my $line (<$handle>) {
    # etc.
}

close($handle);
FWIW: I know how to open a read handle to ftp://ftp.foobar.com/path/to/blotto.txt.gz (with Net::FTP's retr), but I have not yet figured out how to add a :gzip layer to this open handle.
It took me a lot longer than it should have to find the answer to the question above, so I thought I'd post it for the next person who needs it.
OK, the answer is (IMO) not at all obvious: binmode($handle, ':gzip').
Here's a fleshed-out example:
use strict;
use Net::FTP;
use PerlIO::gzip;

my $ftp = Net::FTP->new('ftp.foobar.com') or die $@;
$ftp->login or die $ftp->message;    # anonymous FTP

my $handle = $ftp->retr('/path/to/blotto.txt.gz') or die $ftp->message;
binmode($handle, ':gzip');

for my $line (<$handle>) {
    # etc.
}

close($handle);
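One hedged caveat: a gzip stream must arrive byte-for-byte intact, and some FTP servers default to ASCII transfer mode, so it may be worth forcing binary mode before calling retr:

$ftp->binary or die $ftp->message;    # ensure the .gz is not mangled in transit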
The code below is from the IO::Compress FAQ:
use Net::FTP;
use IO::Uncompress::Gunzip qw(:all);
my $ftp = new Net::FTP ...
my $retr_fh = $ftp->retr($compressed_filename);
gunzip $retr_fh => $outFilename, AutoClose => 1
    or die "Cannot uncompress '$compressed_filename': $GunzipError\n";
To get the data line by line, change it to this:
use Net::FTP;
use IO::Uncompress::Gunzip qw(:all);
my $ftp = new Net::FTP ...
my $retr_fh = $ftp->retr($compressed_filename);
my $gunzip = new IO::Uncompress::Gunzip $retr_fh, AutoClose => 1
    or die "Cannot uncompress '$compressed_filename': $GunzipError\n";

while (<$gunzip>) {
    ...
}

How do I read in an editable file that contains words that I don't want stemmed using Lingua::Stem's add_exceptions($exceptions_hash_ref) in perl?

I am using Perl's Lingua::Stem module (Lingua::Stem) and I want to have a text file or other editable file format to contain a list of words I do not want stemmed. I want to be able to add words to the file any time.
Their example shows:
add_exceptions($exceptions_hash_ref);
What is the best way to do this?
I used their method to hard-code some exceptions, but I want to do this with a file.
# adding default exceptions
Lingua::Stem::add_exceptions({
    'emily'  => 'emily',
    'driven' => 'driven',
});
You can define a function to load exceptions from the given file:
sub load_exceptions {
    my $fname = shift;
    my %list;

    open (my $in, "<", $fname) or die("load_exceptions: $fname");
    while (<$in>) {
        chomp;
        $list{$_} = $_;
    }
    close $in;

    return \%list;
}
And use it:
Lingua::Stem::add_exceptions(load_exceptions("notstem.txt"));
Example input file:
emily
driven
Assuming your "editable" file is whitespace separated, like so:
emily emily
driven driven
Your code could be:
open my $fh, "<", "excep.txt" or die $!;
my $href = { map split, <$fh> };
Lingua::Stem::add_exceptions($href);

How do I process the response as a file without using the :content_file option?

Example code:
my $ua = LWP::UserAgent->new;
my $response = $ua->get('http://example.com/file.zip');
if ($response->is_success) {
# get the filehandle for $response->content
# and process the data
}
else { die $response->status_line }
I need to open the content as a file without first saving it to disk. How would you do this?
You can open a fake filehandle that points to a scalar in memory. If the filename argument of open is a scalar reference, Perl treats the contents of that scalar as the file data rather than as a filename.
open my $fh, '<', $response->content_ref;
while ( <$fh> ) {
    # pretend it's a file
}
}
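Since the example URL fetches a .zip, the same in-memory handle can be fed to a decompressor. Here is a sketch assuming IO::Uncompress::Unzip (the module choice is mine, not from the answer above):

use IO::Uncompress::Unzip qw($UnzipError);

open my $zip_fh, '<', $response->content_ref or die $!;
my $unzip = IO::Uncompress::Unzip->new($zip_fh)
    or die "Cannot open zip stream: $UnzipError";

while (my $line = <$unzip>) {
    # process each line of the first member in the archive
}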
Not quite a file, but here is a relevant SO question: What is the easiest way in pure Perl to stream from another HTTP resource?