XML parsing using perl


I tried to research a simple question I have but couldn't find an answer. I am trying to get data from the web in XML and parse it using Perl. I know how to loop over repeating elements, but I am stuck when an element does not repeat (I know this might be silly). If the elements repeat, I put them in an array and get the data. But when there is only a single element, it throws an error saying 'Not an array reference'. I want my code to handle both cases (single and multiple elements). The code I am using is as follows:
use LWP::Simple;
use XML::Simple;
use Data::Dumper;
open (FH, ">:utf8","xmlparsed1.txt");
my $db1 = "pubmed";
my $query = "13054692";
my $q = 16354118; #for multiple MeSH terms
my $xml = new XML::Simple;
$urlxml = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=$db1&id=$query&retmode=xml&rettype=abstract";
$dataxml = get($urlxml);
$data = $xml->XMLin("$dataxml");
#print FH Dumper($data);
foreach $e(@{$data->{PubmedArticle}->{MedlineCitation}->{MeshHeadingList}->{MeshHeading}})
{
print FH $e->{DescriptorName}{content}, ' $$ ';
}
Also, can I do something such that the separator $$ will not get printed after the last element?
I also tried the following code:
$mesh = $data->{PubmedArticle}->{MedlineCitation}->{MeshHeadingList}->{MeshHeading};
while (my ($key, $value) = each(%$mesh)){
print FH "$value";
}
But, this prints all the childnodes and I just want the content node.

Perl's XML::Simple will take a single item and return it as a scalar, and if the value repeats it sends it back as an array reference. So, to make your code work, you just have to force MeshHeading to always return an array reference:
$data = $xml->XMLin("$dataxml", ForceArray => [qw( MeshHeading )]);

I think you missed the part of "perldoc XML::Simple" that talks about the ForceArray option:
check out ForceArray because you'll almost certainly want to turn it on
Then you will always get an array, even if the array contains only one element.

As others have pointed out, the ForceArray option will solve this particular problem. However you'll undoubtedly strike another problem soon after due to XML::Simple's assumptions not matching yours. As the author of XML::Simple, I strongly recommend you read Stepping up from XML::Simple to XML::LibXML - if nothing else it will teach you more about XML::Simple.

Since $data->{PubmedArticle}-> ... ->{MeshHeading} can be either a single value (a hash reference when there is one <MeshHeading>) or an array reference, depending on how many <MeshHeading> tags are present in the document, you need to examine the value's type with ref and conditionally dereference it. Since I am unaware of any terse Perl idioms for doing this, your best bet is to write a function:
sub toArray {
my $meshes = shift;
if (!defined $meshes) { return () }
elsif (ref $meshes eq 'ARRAY') { return @$meshes }
else { return ($meshes) }
}
and then use it like so:
foreach my $e (toArray($data->{PubmedArticle}->{MedlineCitation}->{MeshHeadingList}->{MeshHeading})) { ... }
To prevent ' $$ ' from being printed after the last element, instead of looping over the list, concatenate all the elements together with join:
print FH join ' $$ ', map { $_->{DescriptorName}{content} }
toArray($data->{PubmedArticle}->{MedlineCitation}->{MeshHeadingList}->{MeshHeading});

This is a place where XML::Simple is being...simple. It deduces whether there's an array or not by whether something occurs more than once. Read the doc and look for the ForceArray option to address this.
To only include the ' $$ ' between elements, replace your loop with
print FH join ' $$ ', map $_->{DescriptorName}{content}, @{$data->{PubmedArticle}->{MedlineCitation}->{MeshHeadingList}->{MeshHeading}};
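The ref check and the join idiom from the answers above can be tried without any network access. This is only a minimal sketch: the two hand-written structures are assumptions that imitate what XML::Simple returns without ForceArray, and as_list and mesh_line are hypothetical helper names, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list whether the value is undef,
# a single item, or an array reference.
sub as_list {
    my ($value) = @_;
    return ()      unless defined $value;
    return @$value if ref $value eq 'ARRAY';
    return ($value);
}

# Join the DescriptorName contents with ' $$ ' between items only.
sub mesh_line {
    my ($mesh) = @_;
    return join ' $$ ', map { $_->{DescriptorName}{content} } as_list($mesh);
}

# Hand-written stand-ins for the parsed structure (assumed shapes).
my $single = { DescriptorName => { content => 'Humans' } };
my $many   = [
    { DescriptorName => { content => 'Humans' } },
    { DescriptorName => { content => 'Mice' } },
];

print mesh_line($single), "\n";
print mesh_line($many),   "\n";
```

With ForceArray enabled for MeshHeading the helper becomes unnecessary, because the value is then always an array reference.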

Related

Retain quotes on CSV fields that were quoted in the input

I have a CSV file such that a few of the fields are quoted regardless of whether they need to be. What I wish to do is load this file, modify a few of the values, and produce the modified CSV with the quoted fields intact.
I'm currently using Perl's Text::CSV package to attempt to solve this problem, but have run into a bit of a roadblock. The following is a small test script to demonstrate the problem:
use Text::CSV;
my $csv = Text::CSV->new ({'binary' => 1, 'allow_loose_quotes' => 1, 'keep_meta_info' => 1});
my $line = q^hello,"world"^;
print qq^input: $line\n^;
$csv->parse($line);
my @flds = $csv->fields();
$csv->combine(@flds);
print 'output: ', $csv->string(), "\n";
produces:
input: hello,"world"
output: hello,world
According to Text::CSV's documentation, an is_quoted() function exists to test if a field had been quoted in the input, but if I use this to add surrounding quotes to a field, I get unexpected results:
my $csv = Text::CSV->new ({'binary' => 1, 'allow_loose_quotes' => 1, 'keep_meta_info' => 1});
my $line = q^hello,"world"^;
print qq^input: $line\n^;
$csv->parse($line);
my @flds = $csv->fields();
for my $idx (0..$#flds) {
if ($csv->is_quoted($idx)) {
$flds[$idx] = qq^"$flds[$idx]"^;
}
}
$csv->combine(@flds);
print 'output: ', $csv->string(), "\n";
Producing:
input: hello,"world"
output: hello,"""world"""
where I believe the quotes I've added before the combine() are being seen as part of the field, and so are being escaped with a second double quote as combine() is processing.
What would be the best way to ensure quoted fields are left intact from input to output? I'm not certain the application will accept always_quote'ed fields... Is there some combination of Text::CSV object attributes that will allow for keeping quotes intact? Or perhaps am I left with adjusting the record post-combine?

It's a shame but it appears that while keep_meta_info gives you access to the metadata there's no option to tell Text::CSV to reapply the is_quoted state on output.
Depending on how complex your record is you could just reassemble it yourself. But then you'd have to cope with changes to string fields that were previously safely unquoted but after your processing now require quotes. That will depend on the types of changes you introduce, i.e. whether or not you ever expect that a previously "safe" string value will become unsafe. If the answer is "never" (i.e. 0.00000% chance), then you should just do the reassembly yourself and document what you've done.
Post-processing would require that you CSV-parse the string to handle the possibility of commas and other unsafe characters inside strings, so that may not be an option.
Or, you could dive into the code for Text::CSV and implement the desired functionality. I.e. allow the user to force quoting of a specific field on output. I played around with it, and it looks like part of the required mechanism might be in place but unfortunately all I have access to is the XS version, which delegates to native code, so I can't delve deeper at this time. This is as far as I got:
Original combine method. Note the setting of _FFLAGS to undef.
sub combine
{
my $self = shift;
my $str = "";
$self->{_FIELDS} = \@_;
$self->{_FFLAGS} = undef;
$self->{_STATUS} = (@_ > 0) && $self->Combine (\$str, \@_, 0);
$self->{_STRING} = \$str;
$self->{_STATUS};
} # combine
My attempt. I guessed that the second argument to Combine might be the flags, but since the (lowercase) combine API is based on receiving an array and not an arrayref, there's no way to pass two arrays in. I changed it to expect two arrayrefs and tried passing the second to Combine but that failed with "Can't call method "print" on unblessed reference".
sub combine2
{
my $self = shift;
my $str = "";
my $f = shift;
my $g = shift;
$self->{_FIELDS} = $f;
$self->{_FFLAGS} = $g;
$self->{_STATUS} = (@$f > 0) && $self->Combine (\$str, $f, $g);
$self->{_STRING} = \$str;
$self->{_STATUS};
} # combine
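The "reassemble it yourself" route mentioned in this answer could look roughly like the following. It is a sketch under stated assumptions, not Text::CSV functionality: join_csv_row is a hypothetical helper, the separator is assumed to be a comma and the quote character a double quote, and the flags array is assumed to be filled from $csv->is_quoted($idx) after parse(). To cover the "previously safe" case, a field is also quoted when its new content contains a quote, comma, or line break.

```perl
use strict;
use warnings;

# Hypothetical helper: build one CSV line, quoting a field when it was
# quoted in the input or when its content now requires quoting.
sub join_csv_row {
    my ($fields, $was_quoted) = @_;
    my @out;
    for my $i (0 .. $#$fields) {
        my $field = $fields->[$i] // '';
        my $needs_quotes = $field =~ /[",\r\n]/;
        if ($was_quoted->[$i] || $needs_quotes) {
            (my $escaped = $field) =~ s/"/""/g;   # double embedded quotes
            push @out, qq{"$escaped"};
        }
        else {
            push @out, $field;
        }
    }
    return join ',', @out;
}

# The flags would come from $csv->is_quoted($idx) after parse().
print join_csv_row([ 'hello', 'world' ], [ 0, 1 ]), "\n";
```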

Perl Array Dereference Problem with DBI::fetchall_arrayref

I'm a Perl newbie and am having issues with dereferencing an array that is a result of fetchall_arrayref in the DBI module:
my $sql = "SELECT DISTINCT home_room FROM $classlist";
my $sth = $dbh->prepare($sql);
$sth->execute;
my $teachers = $sth->fetchall_arrayref;
foreach my $teacher (@{$teachers}) {
print $teacher;
}
Running this will print the reference instead of the values in the array.
However, when I run:
my $arrref = [1,2,4,5];
foreach (@{$arrref}) {
print "$_\n";
}
I get the values of the array.
What am I doing wrong? Thank you for your help!
Jeff

From the doc
The fetchall_arrayref method can be
used to fetch all the data to be
returned from a prepared and executed
statement handle. It returns a
reference to an array that contains
one reference per row.
So in your example, $teacher is an ARRAY ref.
So you will need to loop through this array ref
foreach my $teacher (@{$teachers}) {
foreach my $titem (@$teacher) {
print $titem;
}
}

if you want to extract only the teacher column, you want to use:
my @teachers = @{$dbh->selectcol_arrayref($sql)};

fetchall_arrayref fetches all the results of the query, so what you're actually getting back is a reference to an array of arrays. Each row returned will be an arrayref of the columns. Since your query has only one column, you can say:
my $teachers = $sth->fetchall_arrayref;
foreach my $teacher (@{$teachers}) {
print $teacher->[0];
}
to get what you want.
See more:
Arrays of arrays in Perl.

You have a reference to an array of rows. Each row is a reference to an array of fields.
foreach my $teacher_row (@$teachers) {
my ($home_room) = @$teacher_row;
print $home_room;
}
You would have seen the difference with Data::Dumper.
use Data::Dumper;
print(Dumper($teachers));
print(Dumper($arrref));

$sth->fetchall_arrayref returns a reference to an array that contains one reference per row!
Take a look at DBI docs here.

Per the documentation of DBI's fetchall_arrayref():
The fetchall_arrayref method can be
used to fetch all the data to be
returned from a prepared and executed
statement handle. It returns a
reference to an array that contains
one reference per row.
You're one level of indirection away:
my $sql = "SELECT DISTINCT home_room FROM $classlist";
my $sth = $dbh->prepare($sql);
$sth->execute;
my $teachers = $sth->fetchall_arrayref;
foreach my $teacher (@{$teachers}) {
local $" = ', ';
print "@{$teacher}\n";
}
The data structure might be a little hard to visualize sometimes. When that happens I resort to Data::Dumper so that I can insert lines like this:
print Dumper $teacher;
I've found that sometimes by dumping the datastructure I get an instant map to use as a reference-point when creating code to manipulate the structure. I recently worked through a real nightmare of a structure just by using Dumper once in awhile to straighten my head out.

You can use map to dereference the returned structure:
my @teachers = map { $_->[0] } @$teachers;
Now you have a simple array of teachers.
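The shape of the result can be explored without a database connection. In this sketch the literal array of arrays is an assumed stand-in for what $sth->fetchall_arrayref returns for a one-column query, and first_column is a hypothetical helper name.

```perl
use strict;
use warnings;

# Stand-in for $sth->fetchall_arrayref: one array reference per row.
my $teachers = [ ['Room 101'], ['Room 102'], ['Room 203'] ];

# Hypothetical helper: pull column 0 out of every row.
sub first_column {
    my ($rows) = @_;
    return map { $_->[0] } @$rows;
}

my @home_rooms = first_column($teachers);
print "$_\n" for @home_rooms;
```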

perl print array from subroutine

#! /usr/local/bin/perl
sub getClusters
{
my @clusters = `/qbo/bin/getclusters|grep -v 'qboc33'`;
chomp(@clusters);
return \@clusters;
}
ummm okay .. how do I get at this array to print since ...
foreach $cluster (getClusters())
{ print $cluster."\n"; }
doesn't seem to work.
Thanks.

You are returning a reference, and not dereferencing it anywhere.
foreach $cluster (@{getClusters()})
OR
return @clusters;
Either should fix it (with slightly different effects), with the first one being preferred (your array is kind of big).
You'd use the non-referenced array return for limited number of elements, usually for the purpose of multi-return (thus, usually, limited to 2 or 3, known-length arrays).

If you ran your program under use strict; use warnings;, it would have told you why it failed. As Amadan said, you need to dereference the reference you return.

Perl Solution
#!/usr/local/bin/perl
use strict;
use warnings;
main();
sub main{
{
local $" = "\n";
print "@{getClusters()}";
}
}
sub getClusters{
my @tArray = `/qbo/bin/getclusters|grep -v 'qboc33'`;
chomp @tArray;
return \@tArray;
}
Notice
You don't need a foreach loop for debugging; you can set the $" variable (the list separator) to separate interpolated array elements however you like (e.g. ', ' or, as in the code above, "\n").
Returning an array ref is a plus, don't send back the full array (good job)
use strict/warnings, especially when debugging
try to avoid system calls using ``

To make it easy, you can first receive the return value and then print it like
use strict;
use warnings;
my $cluster_array = getClusters();
my @cluster_return = @{$cluster_array};
foreach my $cluster(@cluster_return){
print"$cluster\n";
}
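The difference between the two fixes discussed above (dereference at the call site, or return the list itself) can be seen side by side. This is a sketch only: the hard-coded cluster names are made-up stand-ins for the output of the external command, and both sub names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the external command's output.
sub get_clusters_ref {
    my @clusters = ('qboc01', 'qboc02', 'qboc07');
    return \@clusters;          # caller receives one scalar: a reference
}

sub get_clusters_list {
    my @clusters = ('qboc01', 'qboc02', 'qboc07');
    return @clusters;           # caller receives a copy of the elements
}

print "$_\n" for @{ get_clusters_ref() };   # dereference first
print "$_\n" for get_clusters_list();       # already a list
```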

Converting code to perl sub, but not sure I'm doing it right

I'm working from a question I posted earlier (here), and trying to convert the answer to a sub so I can use it multiple times. Not sure that it's done right though. Can anyone provide a better or cleaner sub?
I have a good deal of experience programming, but my primary language is PHP. It's frustrating to know how to execute in one language, but not be able to do it in another.
sub search_for_key
{
my ($args) = @_;
foreach $row(@{$args->{search_ary}}){
print "@$row[0] : @$row[1]\n";
}
my $thiskey = NULL;
my @result = map { $args->{search_ary}[$_][0] } # Get the 0th column...
grep { @$args->{search_in} =~ /$args->{search_ary}[$_][1]/ } # ... of rows where the
0 .. $#array; # first row matches
$thiskey = @result;
print "\nReturning: " . $thiskey . "\n";
return $thiskey;
}
search_for_key({
'search_ary' => $ref_cam_make,
'search_in' => 'Canon EOS Rebel XSi'
});
---Edit---
From the answers so far, I've cobbled together the function below. I'm new to Perl, so I don't really understand much of the syntax. All I know is that it throws an error (Not an ARRAY reference at line 26.) about that grep line.
Since I seem to not have given enough info, I will also mention that:
I am calling this function like this (which may or may not be correct):
search_for_key({
'search_ary' => $ref_cam_make,
'search_in' => 'Canon EOS Rebel XSi'
});
And $ref_cam_make is an array I collect from a database table like this:
$ref_cam_make = $sth->fetchall_arrayref;
And it is in the structure like this (if I understood how to make the associative fetch work properly, I would like to use it like that instead of by numeric keys):
Reference Array
Associative
row[1][cam_make_id]: 13, row[1][name]: Sony
Numeric
row[1][0]: 13, row[1][1]: Sony
row[0][0]: 19, row[0][1]: Canon
row[2][0]: 25, row[2][1]: HP
sub search_for_key
{
my ($args) = @_;
foreach my $row(@{$args->{search_ary}}){
print "@$row[0] : @$row[1]\n";
}
print grep { $args->{search_in} =~ @$args->{search_ary}[$_][1] } @$args->{search_ary};
}

You are moving in the direction of a 2D array, where the [0] element is some sort of ID number and the [1] element is the camera make. Although reasonable in a quick-and-dirty way, such approaches quickly lead to unreadable code. Your project will be easier to maintain and evolve if you work with richer, more declarative data structures.
The example below uses hash references to represent the camera brands. An even nicer approach is to use objects. When you're ready to take that step, look into Moose.
use strict;
use warnings;
demo_search_feature();
sub demo_search_feature {
my @camera_brands = (
{ make => 'Canon', id => 19 },
{ make => 'Sony', id => 13 },
{ make => 'HP', id => 25 },
);
my @test_searches = (
"Sony's Cyber-shot DSC-S600",
"Canon cameras",
"Sony HPX-32",
);
for my $ts (@test_searches){
print $ts, "\n";
my @hits = find_hits($ts, \@camera_brands);
print ' => ', cb_stringify($_), "\n" for @hits;
}
}
sub cb_stringify {
my $cb = shift;
return sprintf 'id=%d make=%s', $cb->{id}, $cb->{make};
}
sub find_hits {
my ($search, $camera_brands) = @_;
return grep { $search =~ $_->{make} } @$camera_brands;
}

This whole sub is really confusing, and I'm a fairly regular perl user. Here are some blanket suggestions.
Do not create your own undef ever -- use undef then return at the bottom return $var // 'NULL'.
Do not ever do this: foreach $row, because foreach my $row is less prone to create problems. Localizing variables is good.
Do not needlessly concatenate, for it offends the style god: not this, print "\nReturning: " . $thiskey . "\n";, but print "\nReturning: $thiskey\n";, or if you don't need the first \n: say "Returning: $thiskey"; (5.10 and later)
grepping over 0 .. $#array; is categorically lame, just grep over the array: grep {} @{$foo[0]}, and with that code being so complex you almost certainly don't want grep (though I don't understand what you're doing to be honest). Check out perldoc -q first -- in short grep doesn't stop until the end.
Lastly, do not assign an array to a scalar: $thiskey = @result; is an implicit $thiskey = scalar @result; (see perldoc -q scalar for more info). What you probably want is to return the array reference. Something like this (which eliminates $thiskey)
printf "\nReturning: %s\n", join ', ', @result;
@result ? \@result : 'NULL';

If you're intending to return whether a match is found, this code should work (inefficiently). If you're intending to return the key, though, it won't -- the scalar value of @result (which is what you're getting when you say $thiskey = @result;) is the number of items in the list, not the first entry.
$thiskey = @result; should probably be changed to $thiskey = $result[0];, if you want mostly-equivalent functionality to the code you based this off of. Note that it won't account for multiple matches anymore, though, unless you return @result in its entirety, which kinda makes more sense anyway.
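The scalar-versus-list distinction these answers describe fits in a few lines. A minimal sketch with made-up values in @result:

```perl
use strict;
use warnings;

my @result = (19, 25);

my $count   = @result;      # scalar context: number of elements
my ($first) = @result;      # list context: first element
my $same    = $result[0];   # explicit index, same value as $first

print "count=$count first=$first same=$same\n";
```

The parentheses around $first are what switch the assignment to list context; without them the variable receives the element count.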

Perl throws an error message about syntax

So, building off a question about string matching (this thread), I am working on implementing that info in solution 3 into a working solution to the problem I am working on.
However, I am getting errors, specifically about this line of the below function:
next if @$args->{search_in} !~ /@$cur[1]/;
syntax error at ./db_index.pl line 16, near "next "
My question as a perl newbie is what am I doing wrong here?
sub search_for_key
{
my ($args) = @_;
foreach $row(@{$args->{search_ary}}){
print "@$row[0] : @$row[1]\n";
}
my $thiskey = NULL;
foreach $cur (@{$args->{search_ary}}){
print "\n" . @$cur[1] . "\n"
next if @$args->{search_in} !~ /@$cur[1]/;
$thiskey = @$cur[0];
last;
}
return $thiskey;
}

You left off the semicolon at the end of the previous line. That's what caused the syntax error, anyway. I think you're also misusing $args, but it's hard to be sure about that without knowing how you're calling this function.

There are several issues here.
Are you adding use strict; and use warnings; at the top of your script before you do anything else? You only posted the sub, but it is clear that you are not using these.
What is NULL? (strict will not let you use bare-words...) Be sure to read What is Truth in Perl? The more Perly way to deal with "true" or "false" is defined / undef, or exists, or a specific test for a value chosen as a convention.
Missing ; after print "\n" . @$cur[1] . "\n"
Your data structures seem way too complicated. From what I can tell, you are passing a reference to a hash of arrays, true? When your data structures get really obscure, back up and look at what you are trying to do...
Perl gives you plenty of ways to shoot yourself in the foot. It is not strictly typed and you will do yourself (and your readers) a favor by naming references as a derivative of what they refer to. So instead of $args use $ref2HoArefs for example.
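Putting these points together, the sub might be reworked along the following lines. This is a sketch rather than the asker's final code: the rows in $ref_cam_make are assumed sample data shaped like fetchall_arrayref output, undef replaces the NULL bareword, and \Q...\E is an addition so that names containing regex metacharacters match literally.

```perl
use strict;
use warnings;

# Sketch of the lookup: return the ID of the first row whose name
# (column 1) appears in the search string, or undef when none match.
sub search_for_key {
    my ($args) = @_;
    for my $row (@{ $args->{search_ary} }) {
        my ($id, $name) = @$row;
        return $id if $args->{search_in} =~ /\Q$name\E/;
    }
    return;    # undef in scalar context, empty list in list context
}

# Assumed rows, shaped like fetchall_arrayref output.
my $ref_cam_make = [ [19, 'Canon'], [13, 'Sony'], [25, 'HP'] ];

my $key = search_for_key({
    search_ary => $ref_cam_make,
    search_in  => 'Canon EOS Rebel XSi',
});
print defined $key ? "Found: $key\n" : "No match\n";
```

Returning from inside the loop stops at the first match, which avoids the grep-runs-to-the-end behavior and the array-in-scalar-context surprise.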

Side note: are you sure you can't just use a hash for what you're doing? It seems awfully complicated to do something so simple:
my %hash = (
key1 => 'value1',
key2 => 'value2',
);
exists $hash{$search_in}; # true/false.
my $result = $hash{$search_in}; # returns 'value1' when $search_in is 'key1'
Or if you need to search by value:
my %flip = reverse %hash;
$result = $flip{$search_in};
And if you really need a regex key ( or value ) lookup:
sub string_match {
my ($lookup_hash, $key ) = @_;
for my $hash_key ( keys %{ $lookup_hash } ){
return $hash_key if $key =~ $lookup_hash->{$hash_key};
}
return; # not found.
}
my $k = string_match({
'whitespace at end' => qr/\s+$/,
'whitespace at start' => qr/^\s+/,
}, "Some Garbage string "); # k == whitespace at end
