Perl libXML find node by attribute value - perl

I have a very large XML document that I am iterating through. The XML uses mostly attributes rather than node values. I may need to find numerous nodes in the file to piece together one grouping of information. They are tied together via different ref tag values. Currently, each time I need to locate one of the nodes to extract data from, I loop through the entire XML and match on the attribute to find the correct node. Is there a more efficient way to select a node with a given attribute value instead of constantly looping and comparing? My current code is so slow it is almost useless.
Currently I am doing something like this numerous times in the same file for numerous different nodes and attribute combinations.
my $searchID = "1234";
foreach my $nodes ($xc->findnodes('/plm:PLMXML/plm:ExternalFile')) {
my $ID = $nodes->findvalue('@id');
my $File = $nodes->findvalue('@locationRef');
if ( $searchID eq $ID ) {
print "The File Name = $File\n";
}
}
In the above example I am looping and using an "if" to compare for an ID match. I was hoping I could do something like this below to just match the node by attribute instead... and would it be any more efficient than looping?
my $searchID = "1234";
$nodes = ($xc->findnodes('/plm:PLMXML/plm:ExternalFile[@id=$searchID]'));
my $File = $nodes->findvalue('@locationRef');
print "The File Name = $File\n";

Do one pass to extract the information you need into a more convenient format or to build an index.
my %nodes_by_id;
for my $node ($xc->findnodes('//*[@id]')) {
$nodes_by_id{ $node->getAttribute('id') } = $node;
}
Then your loops become
my $node = $nodes_by_id{'1234'};
(And stop using findvalue instead of getAttribute.)
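The one-pass index described above can be sketched as a complete script. The sample document, the namespace URI http://example.com/plm, and the attribute values are made up for illustration; only the indexing loop comes from the answer.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample document; the namespace URI is a placeholder.
my $xml = <<'XML';
<PLMXML xmlns="http://example.com/plm">
  <ExternalFile id="1234" locationRef="a.pdf"/>
  <ExternalFile id="5678" locationRef="b.pdf"/>
</PLMXML>
XML

my $doc = XML::LibXML->load_xml(string => $xml);
my $xc  = XML::LibXML::XPathContext->new($doc);
$xc->registerNs(plm => 'http://example.com/plm');

# One pass: index every element that carries an id attribute.
my %nodes_by_id;
for my $node ($xc->findnodes('//*[@id]')) {
    $nodes_by_id{ $node->getAttribute('id') } = $node;
}

# Each later lookup is a hash access, not another scan of the document.
my $node = $nodes_by_id{'1234'};
my $file = $node ? $node->getAttribute('locationRef') : undef;
print "The File Name = $file\n" if defined $file;
```

If ids are only unique per element type, one hash per element name (or a key that combines element name and id) may be a better fit.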

If you will be doing this for lots of IDs, then ikegami's answer is worth reading.
I was hoping I could do something like this below to just match the node by attribute instead
...
$nodes = ($xc->findnodes('/plm:PLMXML/plm:ExternalFile[@id=$searchID]'));
Sort of.
For a given ID, yes, you can do
$nodes = $xc->findnodes("/plm:PLMXML/plm:ExternalFile[\@id=$searchID]");
... provided that $searchID is known to be numeric. Notice that double quotes in Perl make variables interpolate, so you should escape the @ in @id, because it is part of the literal string, not a Perl array, whereas you want the value of $searchID to become part of the XPath string, so it is not escaped.
Note also that because you are calling findnodes in scalar context, you get an XML::LibXML::NodeList object, not the actual node, nor an arrayref; for the latter you will need to use square brackets instead of round ones, as I have done in the next example.
Alternatively, if your search id may not be numeric but you know for sure that it is safe to be put in an XPath string (e.g. doesn't have any quotes), you can do the following:
$nodes = [ $xc->findnodes('/plm:PLMXML/plm:ExternalFile[@id="' . $searchID . '"]') ];
print $nodes->[0]->getAttribute('locationRef'); # if you're 100% sure it exists
Notice here that the resulting string will enclose the value in quotation marks.
Finally, it is possible to skip straight to:
print $xc->findvalue('/plm:PLMXML/plm:ExternalFile[@id="' . $searchID . '"]/@locationRef');
... providing you know that there is only one node with that id.

I think you just need to do some study on XPath expressions. For example, you could do something like this:
my $search_id = "1234";
my $query = "/plm:PLMXML/plm:ExternalFile[\@id = '$search_id']";
foreach my $node ($xc->findnodes($query)) {
# ...
}
In the XPath expression you can also combine multiple attribute checks, e.g.:
[@id = '$search_id' and contains(@pathname, '.pdf')]
One XPath Tutorial of many

If you have a DTD for your document that declares the id attribute as DTD ID, and you make sure the DTD is read when parsing the document, you can access the elements with a certain id efficiently via $doc->getElementById($id).

Related

What's the most efficient way to check multiple hash references in perl

I have a multidimensional data structure for tracking different characteristics of files I am comparing and merging data for. The structure is set up as such:
$cumulative{$slice} = {
DATA => $data,
META => $got_meta,
RECOVER => $recover,
DISPO => $dispo,
DIR => $dir,
};
All of the keys, save DIR (which is just a simple string), are references to hashes, or arrays. I would like to have a simple search for KEYS that match "BASE" for the value DIR points to for each of the $slice keys. My initial thought was to use grep, but I'm not sure how to do that. I thought something like this would be ok:
my (@base_slices) = grep { $cumulative{$_}->{DIR} eq "BASE" } @{$cumulative{$_}};
I was wrong. Is there a way to do this without a loop, or is that pretty much the only way to check those values? Thanks!
Edit: Thanks to Ikegami for answering succinctly, even without my fully representing the outcome of the search. I have changed the question a little bit to more clearly explain the issue I was having.
This is wrong:
@{$cumulative{$slice}}
It gets the value of the array referenced by $cumulative{$slice}. But $cumulative{$slice} is not a reference to an array; it's a reference to a hash. This expression makes no sense, and results in the error
Not an ARRAY reference
What would be correct? Well, it's not quite clear what you want.
Maybe you want the keys of the elements of %cumulative whose DIR attribute equals BASE.
my @matching_keys = # 3. Save the results.
grep { $cumulative{ $_ }->{ DIR } eq "BASE" } # 2. Filter them.
keys( %cumulative ); # 1. Get the keys.
(The -> is optional between indexes, so $cumulative{ $_ }{ DIR } is also fine.)
Maybe you don't need the keys. Maybe you want the values of the elements of %cumulative whose DIR attribute equals BASE.
my @matching_values = # 3. Save the results.
grep { $_->{ DIR } eq "BASE" } # 2. Filter them.
values( %cumulative ); # 1. Get the values.
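Both variants can be tried on a small stand-in for %cumulative. The slice names s1 to s3 and the DIR values below are invented for the example; the keys are sorted first only so that the printed order is stable.

```perl
use strict;
use warnings;

# Hypothetical data shaped like the structure in the question.
my %cumulative = (
    s1 => { DIR => 'BASE',  DATA => {} },
    s2 => { DIR => 'OTHER', DATA => {} },
    s3 => { DIR => 'BASE',  DATA => {} },
);

# Keys whose DIR is BASE.
my @matching_keys = grep { $cumulative{$_}{DIR} eq 'BASE' } sort keys %cumulative;

# The hashrefs themselves, when the keys are not needed.
my @matching_values = grep { $_->{DIR} eq 'BASE' } values %cumulative;

print "keys: @matching_keys\n";
print "number of matching values: ", scalar(@matching_values), "\n";
```

grep still visits every element internally, so this is a loop in all but syntax; it just keeps the filter in a single expression.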
This was posted for the initial form of the question, before the edit, and reflects what I did and/or did not understand in that formulation.
The use of @{$cumulative{$_}}, with $_ presumably standing for $slice, indicates that the value for key $slice is expected to be an arrayref. However, the question shows there to be a hashref. This is either an error or the question misrepresents the problem.
If the expression in grep accurately represents the problem, for values of $slice that are given or can be built at will, then just feed that list of $slice values to the shown grep
my @base_slices = grep { $cumulative{$_}{DIR} eq 'BASE' } @slice_vals;
or
my @base_slices =
grep { $cumulative{$_}{DIR} eq 'BASE' }
map { generate_list_of_slice_values($_) }
LIST-OF-INPUTS;
That generate_list_of_slice_values() stands for whatever way the values for $slice get acquired dynamically from some input.†
There is no need for a dereferencing arrow for the key DIR (a syntax convenience), and no need for parentheses around @base_slices since having an array already provides the needed list context.
Please clarify what $slice is meant to be and I'll update.
† The code in map's block gets elements of LIST-OF-INPUTS one at a time (as $_) and whatever it evaluates with each is joined into its return list. That is passed to grep for filtering: elements of its input list are provided to the code in the block one at a time as $_ and those for which the code evaluates to "true" (in Perl's sense) pass, forming the grep's return list.

glob is not picking all the files matching a pattern, is there an issue with the syntax

I have a directory where there are more than 10 files which start with pattern "my_Report".
When I tried using glob for this job, it picked only a single file. Is there any problem with the syntax below?
$g_map{"Rep"} = glob ("data_1/reports/my_Report*");
Alternatively , i tried using grep to find all the files and stored it in a hash
$g_map{"Rep"} = [grep {!/\.xh$/} <data_1/reports/my_Report*>];
My Requirement is to find all the files with specific pattern from the directory and store it in a hash with key "Rep"
How do i achieve the same with glob?
Thanks in Advance
Your first call is in scalar context. In scalar context, glob returns (at most) a single result.
To retrieve all the matching files, use list context (like you do in your second call)
$g_map{"Rep"} = [ glob("data_1/reports/my_Report*") ]
or if you are expecting one result or just want the first result
($g_map{"Rep"}) = glob("data_1/reports/my_Report*");
glob returns a list, but you're calling it in scalar context, which is why you're only getting a single result. Try this:
@{ $g_map{Rep} } = glob ("data_1/reports/my_Report*");
That'll turn $g_map{Rep} hash key into an array reference, and all of the files will be stored in it.
You can access it like this:
for (@{ $g_map{Rep} }){
print "filename: $_\n";
}
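The difference between the two contexts can be seen in a small self-contained sketch. The scratch directory and file names are created only for this example, and it assumes the temporary directory path contains no whitespace, since glob splits its pattern on whitespace.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical setup: a scratch directory holding three matching files.
my $dir = tempdir(CLEANUP => 1);
for my $n (1 .. 3) {
    open my $fh, '>', "$dir/my_Report$n.txt" or die "Cannot create file: $!";
    close $fh;
}

my %g_map;

# Scalar context: glob hands back a single file name per call.
my $first = glob("$dir/my_Report*");

# List context inside an anonymous array: every match is stored.
$g_map{Rep} = [ glob("$dir/my_Report*") ];

print "scalar context gave one name: $first\n";
print "list context gave ", scalar @{ $g_map{Rep} }, " names\n";
```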

Perl - Data comparison taking huge time

open(INFILE1,"INPUT.txt");
my $modfile = 'Data.txt';
open MODIFIED,'>',$modfile or die "Could not open $modfile : $!";
for (;;) {
my $line1 = <INFILE1>;
last if not defined $line1;
my $line2 = <INFILE1>;
last if not defined $line2;
my ($tablename1, $colname1,$sql1) = split(/\t/, $line1);
my ($tablename2, $colname2,$sql2) = split(/\t/, $line2);
if ($tablename1 eq $tablename2)
{
my $sth1 = $dbh->prepare($sql1);
$sth1->execute;
my $hash_ref1 = $sth1->fetchall_hashref('KEY');
my $sth2 = $dbh->prepare($sql2);
$sth2->execute;
my $hash_ref2 = $sth2->fetchall_hashref('KEY');
my @fieldname = split(/,/, $colname1);
my $colcnt=0;
my $rowcnt=0;
foreach $key1 ( keys(%{$hash_ref1}) )
{
foreach (@fieldname)
{
$colname =$_;
my $strvalue1='';
@val1 = $hash_ref1->{$key1}->{$colname};
if (defined @val1)
{
my @filtered = grep /@val1/, @metadata;
my $strvalue1 = substr(@filtered[0],index(@filtered[0],'||') + 2);
}
my $strvalue2='';
@val2 = $hash_ref2->{$key1}->{$colname};
if (defined @val2)
{
my @filtered = grep /@val2/, @metadata2;
my $strvalue2 = substr(@filtered[0],index(@filtered[0],'||') + 2);
}
if ($strvalue1 ne $strvalue2 )
{
$colcnt = $colcnt + 1;
print MODIFIED "$tablename1\t$colname\t$strvalue1\t$strvalue2\n";
}
}
}
if ($colcnt>0)
{
print "modified count is $colcnt\n";
}
%$hash_ref1 = ();
%$hash_ref2 = ();
}
The program reads an input file in which every line contains three strings separated by tabs. The first is the table name, the second is all the column names separated by commas, and the third contains the SQL to be run. As this utility compares data, there are two rows for every table name, one for each DB. So data needs to be picked from each respective DB and then compared column by column.
The SQL returns an ID in the result set, and if the value comes from the DB then it needs to be translated to a string by reading from an array (that array contains 100K records with key and value separated by ||).
Now I ran this for one set of tables which contains 18K records in each DB. There are 8 columns picked from the DB in each SQL. So for every record out of 18K, and then for every field in that record (i.e. 8), this script is taking a lot of time.
My question is whether someone can look and see if it can be improved so that it takes less time.
File contents sample
INPUT.TXT
TABLENAME COL1,COL2 select COL1,COL2 from TABLENAME where ......
TABLENAMEB COL1,COL2 select COL1,COL2 from TABLENAMEB where ......
Metadata array contains something like this(there are two i.e. for each db)
111||Code 1
222||Code 2
Please suggest
Your code does look a bit unusual, and could gain clarity from using subroutines vs. just using loops and conditionals. Here are a few other suggestions.
The excerpt
for (;;) {
my $line1 = <INFILE1>;
last if not defined $line1;
my $line2 = <INFILE1>;
last if not defined $line2;
...;
}
is overly complicated: Not everyone knows the C-ish for(;;) idiom. You have lots of code duplication. And aren't you actually saying loop while I can read two lines?
while (defined(my $line1 = <INFILE1>) and defined(my $line2 = <INFILE1>)) {
...;
}
Yes, that line is longer, but I think it's a bit more self-documenting.
Instead of doing
if ($tablename1 eq $tablename2) { the rest of the loop }
you could say
next if $tablename1 ne $tablename2;
the rest of the loop;
and save a level of indentation. Better indentation means better readability, which makes it easier to write good code. And better code might perform better.
What are you doing at foreach $key1 (keys ...) — something tells me you didn't use strict! (Just a hint: lexical variables with my can perform slightly better than global variables)
Also, doing $colname = $_ inside a for-loop is a dumb thing, for the same reason.
for my $key1 (keys ...) {
...;
for my $colname (@fieldname) { ... }
}
my $strvalue1='';
@val1 = $hash_ref1->{$key1}->{$colname};
if (defined @val1)
{
my @filtered = grep /@val1/, @metadata;
my $strvalue1 = substr(@filtered[0],index(@filtered[0],'||') + 2);
}
I don't think this does what you think it does.
From the $hash_ref1 you retrieve a single element, then assign that element to an array (a collection of multiple values).
Then you called defined on this array. An array cannot be undefined, and what you are doing is quite deprecated. Calling defined function on a collection returns info about the memory management, but does not indicate ① whether the array is empty or ② whether the first element in that array is defined.
Interpolating an array into a regex isn't likely to be useful: The elements of the array are joined with the value of $", usually a whitespace, and the resulting string treated as a regex. This will wreak havoc if there are metacharacters present.
When you only need the first value of a list, you can force list context, but assign to a single scalar like
my ($filtered) = produce_a_list;
This frees you from weird subscripts you don't need and that only slow you down.
Then you assign to a $strvalue1 variable you just declared. This shadows the outer $strvalue1. They are not the same variable. So after the if branch, you still have the empty string in $strvalue1.
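The shadowing effect is easy to reproduce in isolation. This sketch uses made-up names ($strvalue, $val) rather than the ones from the question, and contrasts declaring a new variable inside the block with assigning to the outer one.

```perl
use strict;
use warnings;

my $strvalue = '';            # outer variable
my $val      = 'Code 1';      # hypothetical looked-up value

if (defined $val) {
    my $strvalue = $val;      # new inner variable, shadows the outer one
}
my $after_shadow = $strvalue; # the outer variable is still ''

if (defined $val) {
    $strvalue = $val;         # no "my": assigns to the outer variable
}
my $after_assign = $strvalue; # now holds 'Code 1'

print "shadowed: [$after_shadow], assigned: [$after_assign]\n";
```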
I would write this code like
my $val1 = $hash_ref1->{$key1}{$colname};
my $strvalue1 = defined $val1
? do {
my ($filtered) = grep /\Q$val1/, @metadata;
substr $filtered, 2 + index $filtered, '||'
} : '';
But this would be even cheaper if you pre-split @metadata into pairs and test for equality with the correct field. This would remove some of the bugs that are still lurking in that code.
$x = $x + 1 is commonly written $x++.
Emptying the hashrefs at the end of the iteration is unnecessary: the hashrefs are assigned a new value at the next iteration of the loop. Also, it is unnecessary to assist Perl's garbage collection for such simple tasks.
About the metadata: 100K records is a lot, so either put it in a database itself, or at the very least a hash. Especially for so many records, using a hash is a lot faster than looping through all entries and using slow regexes … aargh!
Create the hash from the file, once at the beginning of the program
my %metadata;
while (<METADATA>) {
chomp;
my ($key, $value) = split /\|\|/;
$metadata{$key} = $value; # assumes each key only has one value
}
Simply look up the key inside the loop
my $strvalue1 = defined $val1 ? $metadata{$val1} // '' : '';
That should be so much faster.
(Oh, and please consider using better names for variables. $strvalue1 doesn't tell me anything, except that it is a stringy value (d'oh). $val1 is even worse.)
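Put together, the hash-based translation might look like the following sketch. It reads the metadata from an in-memory list instead of a file handle, and the helper name translate is invented for the example; the two sample lines are the ones shown in the question.

```perl
use strict;
use warnings;

# Metadata lines in the "key||value" format from the question.
my @metadata_lines = ('111||Code 1', '222||Code 2');

# Build the lookup table once, before any comparison loop runs.
my %metadata;
for my $line (@metadata_lines) {
    my ($key, $value) = split /\|\|/, $line, 2;
    $metadata{$key} = $value;   # assumes each key has only one value
}

# Hypothetical helper: translate an ID to its string, or '' if unknown.
sub translate {
    my ($id) = @_;
    return '' unless defined $id;
    return $metadata{$id} // '';
}

print translate('111'), "\n";           # known ID
print '[', translate('999'), "]\n";     # unknown ID gives an empty string
print '[', translate(undef), "]\n";     # undefined value gives an empty string
```

Each lookup is then a single hash access instead of a regex scan over 100K entries for every field of every row.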
This is not really an answer but it won't really fit well in a comment either so, until you provide some more information, here are some observations.
Inside your inner for loop, there is:
@val1 = $hash_ref1->{$key1}->{$colname};
Did you mean @val1 = @{ $hash_ref1->{$key1}->{$colname} };?
Later, you check if (defined @val1)? What did you really want to check? As perldoc -f defined points out:
Use of "defined" on aggregates (hashes and arrays) is
deprecated. It used to report whether memory for that aggregate
had ever been allocated. This behavior may disappear in future
versions of Perl. You should instead use a simple test for size:
In your case, if (defined @val1) will always be true.
Then, you have my @filtered = grep /@val1/, @metadata; Where did @metadata come from? What did you actually intend to check?
Then you have my $strvalue1 = substr(@filtered[0],index(@filtered[0],'||') + 2);
There is some interesting stuff going on in there.
You will need to verbalize what you are actually trying to do.
I strongly suspect there is a single SQL query you can run that will give you what you want but we first need to know what you want.

What does the Perl split function return when there is no value between tokens?

I'm trying to split a string using the split function but there isn't always a value between tokens.
Ex: ABC,123,,,,,,XYZ
I don't want to skip the multiple tokens though. These values are in specific positions in the string. However, when I do a split, and then try to step through my resulting array, I get "Use of uninitialized value" warnings.
I've tried comparing the value using $splitvalues[x] eq "" and I've tried using defined($splitvalues[x]) , but I can't for the life of me figure out how to identify what the split function is putting in to my array when there is no value between tokens.
Here's the snippet of my code (now with more crunchy goodness):
my @matrixDetail = ();
#some other processing happens here that is based on matching data from the
#@oldDetail array with the first field of the @matrixLine array. If it does
#match, then I do the split
if($IHaveAMatch)
{
@matrixDetail = split(',', $matrixLine[1]);
}
else
{
@matrixDetail = ('','','','','','','');
}
my $newDetailString =
(($matrixDetail[0] eq '') ? $oldDetail[0] : $matrixDetail[0])
. (($matrixDetail[1] eq '') ? $oldDetail[1] : $matrixDetail[1])
.
.
.
. (($matrixDetail[6] eq '') ? $oldDetail[6] : $matrixDetail[6]);
Because this is just snippets, I've left some of the other logic out, but the if statement is inside a sub that technically returns the @matrixDetail array back. If I don't find a match in my matrix and set the array equal to the array of empty strings manually, then I get no warnings. It's only when the split populates @matrixDetail.
Also, I should mention, I've been writing code for nearly 15 years, but only very recently have I needed to work with Perl. The logic in my script is sound (or at least, it works), I'm just being anal about cleaning up my warnings and trying to figure out this little nuance.
#!perl
use warnings;
use strict;
use Data::Dumper;
my $str = "ABC,123,,,,,,XYZ";
my @elems = split ',', $str;
print Dumper \@elems;
This gives:
$VAR1 = [
'ABC',
'123',
'',
'',
'',
'',
'',
'XYZ'
];
It puts in an empty string.
Edit: Note that the documentation for split() states that "by default, empty leading fields are preserved, and empty trailing ones are deleted." Thus, if your string is ABC,123,,,,,,XYZ,,,, then your returned list will be the same as the above example, but if your string is ,,,,ABC,123, then you will have a list with three empty strings in elements 0, 1, and 2 (in addition to 'ABC' and '123').
Edit 2: Try dumping out the @matrixDetail and @oldDetail arrays. It's likely that one of those isn't the length that you think it is. You might also consider checking the number of elements in those two lists before trying to use them to make sure you have as many elements as you're expecting.
I suggest to use Text::CSV from CPAN. It is a ready made solution which already covers all the weird edge cases of parsing CSV formatted files.
delims with nothing between them give empty strings when split. Empty strings evaluate as false in boolean context.
If you know that your "details" input will never contain "0" (or other scalar that evaluates to false), this should work:
my @matrixDetail = split(',', $matrixLine[1]);
die if @matrixDetail > @oldDetail;
my $newDetailString = "";
for my $i (0..$#oldDetail) {
$newDetailString .= $matrixDetail[$i] || $oldDetail[$i]; # thanks canSpice
}
say $newDetailString;
(there are probably other scalars besides empty string and zero that evaluate to false but I couldn't name them off the top of my head.)
TMTOWTDI:
$matrixDetail[$_] ||= $oldDetail[$_] for 0..$#oldDetail;
my $newDetailString = join("", @matrixDetail);
edit: for loops now go from 0 to $#oldDetail instead of $#matrixDetail since trailing ",,," are not returned by split.
edit2: if you can't be sure that real input won't evaluate as false, you could always just test the length of your split elements. This is safer, definitely, though perhaps less elegant ^_^
Empty fields in the middle will be ''. Empty fields on the end will be omitted, unless you specify a third parameter to split large enough (or -1 for all).
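The effect of the third parameter can be checked with a short sketch. The input string is the one from the question with three trailing commas appended for illustration.

```perl
use strict;
use warnings;

my $str = 'ABC,123,,,,,,XYZ,,,';

# Default: empty fields in the middle are kept, trailing ones are dropped.
my @default = split /,/, $str;

# A limit of -1 keeps every field, including trailing empty ones.
my @all = split /,/, $str, -1;

print scalar(@default), " fields by default\n";   # 8 fields by default
print scalar(@all), " fields with limit -1\n";    # 11 fields with limit -1
```

With the -1 limit every position in the line is present as a defined (possibly empty) string, so indexing by position no longer runs off the end of the array.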

Best way to prevent output of a duplicate item in Perl in realtime during a loop

I see a lot of 'related' questions showing up, but none I looked at answer this specific scenario.
During a while/for loop that parses a result set generated from a SQL select statement, what is the best way to prevent the next line from being outputted if the line before it contains the same field data (whether it be the 1st field or the xth field)?
For example, if two rows were:
('EML-E','jsmith@mail.com','John','Smith')
('EML-E','jsmith2@mail.com','John','Smith')
What is the best way to print only the first row based on the fact that 'EML-E' is the same in both rows?
Right now, I'm doing this:
Storing the first field (specific to my scenario) into a 2-element array (dupecatch[1])
Checking if dupecatch[0] = dupecatch[1] (duplicate - escape loop using 's')
After row is processed, set dupecatch[0] = dupecatch[1]
while ($DBS->SQLFetch() == *PLibdata::RET_OK)
{
$s=0; #s = 1 to escape out of inside loop
while ($i != $array_len and $s==0)
{
$rowfetch = $DBS->{Row}->GetCharValue($array_col[$i]);
if($i==0){$dupecatch[1] = $rowfetch;} #dupecatch prevents duplicate primary key field entries
if($dupecatch[0] ne $dupecatch[1])
{
dosomething($rowfetch);
}
else{$s++;}
$i++;
}
$i=0;
$dupecatch[0]=$dupecatch[1];
}
That is the standard way if you only care about duplicate items in a row (consecutive duplicates), but $dupecatch[0] is normally named $old and $dupecatch[1] is normally just the variable in question. You can tell the array is not a good fit because you only ever refer to its indices.
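As a minimal sketch of that naming, with the result set replaced by a hard-coded list of rows (the third row is invented so that there is a non-duplicate to print):

```perl
use strict;
use warnings;

# Hypothetical rows standing in for the fetched result set.
my @rows = (
    [ 'EML-E', 'jsmith@mail.com',  'John', 'Smith' ],
    [ 'EML-E', 'jsmith2@mail.com', 'John', 'Smith' ],
    [ 'EML-F', 'adoe@mail.com',    'Ann',  'Doe'   ],
);

my $old;        # first field of the previous row
my @printed;    # rows that survive the check
for my $row (@rows) {
    my $key = $row->[0];
    next if defined $old && $key eq $old;   # same key as the row before: skip
    $old = $key;
    push @printed, $row;
    print join(',', @$row), "\n";
}
```

This only suppresses duplicates that are adjacent, so it relies on the query returning rows ordered by that field.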
If you want to avoid all duplicates you can use a %seen hash:
my %seen;
while (defined (my $row = get_data())) {
next if $seen{$row->[0]}++; #skip all but the first instance of the key
do_stuff();
}
I suggest using DISTINCT in your SQL statement. That's probably by far the easiest fix.