Get specific data form Xapian database with Perl - perl

I'm writing a perl script to retrieve search results from a Xapian database.
I uses the Search::Xapian module and tried the basic Xapian Query Example. This basic program allow to make a query and get a array of results sorted by relevancy. My problem is that the get_data() method return the whole datas from the document (url, filname, abstract, author, ...) mixed together as a string.
I searched in the CPAN module for a method to get each data one by one but I didn't find it.
Is it possible to get the filename, url, author, ... one by one to put them in a specific variable ?

You've not posted the code to produce this, or details of your setup. See the simplesearch.pl example, rather than print it out, assign what you want to a variable:
# Display the results.
printf "%i results found.\n", $mset->get_matches_estimated();
printf "Results 1-%i:\n", $mset->size();
foreach my $m ($mset->items()) {
printf "%i: %i%% docid=%i [%s]\n", $m->get_rank() + 1, $m->get_percent(), $m->get_docid(), $m->get_document()->get_data();
}

Related

XML::Twig parsing same name tag in same path

I am trying to help out a client who was unhappy with an EMR (Electronic Medical Records) system and wanted to switch but the company said they couldn't extract patient demographic data from the database (we asked if they can get us name, address, dob in a csv file of some sort, very basic stuff) - yet they claim they couldn't do that. (crazy considering they are using a sql database).
Anyway - the way they handed over the patients were in xml files and there are about 40'000+ of them. But they contain a lot more than the demographics.
After doing some research and having done extensive Perl programming 15 years ago (I admit it got rusty over the years) - I thought this should be a good task to get done in Perl - and I came across the XML::Twig module which seems to be able to do the trick.
Unfortunately the xml code that is of interest looks like this:
<==snip==>
<patient extension="Patient ID Number"> // <--Patient ID is 5 digit number)
<name>
<family>Patient Family name</family>
<given>Patient First/Given name</given>
<given>Patient Middle Initial</given>
</name>
<birthTime value=YEARMMDD"/>
more fields for address etc.are following in the xml file.
<==snip==>
Here is what I coded:
my $twig=XML::Twig->new( twig_handlers => {
'patient/name/family' => \&get_family_name,
'patient/name/given' => \&get_given_name
});
$twig->parsefile('test.xml');
my #fields;
sub get_family_name {my($twig,$data)=#_;$fields[0]=$data->text;$twig->purge;}
sub get_given_name {my($twig,$data)=#_;$fields[1]=$data->text;$twig->purge;}
I have no problems reading out all the information that have unique tags (family, city, zip code, etc.) but XML:Twig only returns the middle initial for the tag.
How can I address the first occurrence of "given" and assign it to $fields[1] and the second occurrence of "given" to $fields[2] for instance - or chuck the middle initial.
Also how do I extract the "Patient ID" or the "birthTime" value with XML::Twig - I couldn't find a reference to that.
I tried using $data->findvalue('birthTime') but that came back empty.
I looked at: Perl, XML::Twig, how to reading field with the same tag which was very helpful but since the duplicate tags are in the same path it is different and I can't seem to find an answer. Does XML::Twig only return the last value found when finding a match while parsing a file? Is there a way to extract all occurrences of a value?
Thank you for your help in advance!
It is very easy to assume from the documentation that you're supposed to use callbacks for everything. But it's just as valid to parse the whole document and interrogate it in its entirety, especially if the data size is small
It's unclear from your question whether each patient has a separate XML file to themselves, and you don't show what encloses the patient elements, but I suggest that you use a compromise approach and write a handler for just the patient elements which extracts all of the information required
I've chosen to build a hash of information %patient out of each patient element and push it onto an array #patients that contains all the data in the file. If you have only one patient per file then this will need to be changed
I've resolved the problem with the name/given elements by fetching all of them and joining them into a single string with intervening spaces. I hope that's suitable
This is completely untested as I have only a tablet to hand at present, so beware. It does stand a chance of compiling, but I would be surprised if it has no bugs
use strict;
use warnings 'all';
use XML::Twig;
my #patients;
my $twig = XML::Twig->new(
twig_handlers => { patient => \&get_patient }
);
$twig->parsefile('test.xml');
sub get_patient {
my ($twig, $pat) = #_;
my %patient;
$patient{id} = $pat>att('extension');
my $name = $pat->first_child('name');yy
$patient{family} = $name->first_child_trimmed_text('family');
$patient{given} = join ' ', $name->children_trimmed_text('given');
$patient{dob} = $pat->first_child('birthTime')->att('value');
push #patients, \%patient;
}

Perl XML::SAX - character() method error

I'm new to using Perl XML::SAX and I encountered a problem with the characters event that is triggered. I'm trying to parse a very large XML file using perl.
My goal is to get the content of each tag (I do not know the tag names - given any xml file, I should be able to crack the record pattern and return every record with its data and tag like Tag:Data).
While working with small files, everything is ok. But when running on a large file, the characters{} event does partial reading of the content. There is no specific pattern in the way it cuts down the reading. Sometimes its the starting few characters of data and sometimes its last few characters and sometimes its just one letter from the actual data.
The Sax Parser is:
$myhandler = MyFilter->new();
$parser = XML::SAX::ParserFactory->parser(Handler => $myhandler);
$parser->parse_file($filename);
And, I have written my own Handler called MyFilter and overridding the character method of the parser.
sub characters {
my ($self, $element) = #_;
$globalvar = $element->{Data};
print "content is: $globalvar \n";
}
Even this print statement, reads the values partially at times.
I also tried loading the Parsesr Package before calling the $parser->parse() as:
$XML::SAX::ParserPackage = "XML::SAX::ExpatXS";
Stil doesn't work. Could anyone help me out here? Thanks in advance!
Sounds like you need XML::Filter::BufferText.
http://search.cpan.org/dist/XML-Filter-BufferText/BufferText.pm
From the description "One common cause of grief (and programmer error) is that XML parsers aren't required to provide character events in one chunk. They can, but are not forced to, and most don't. This filter does the trivial but oft-repeated task of putting all characters into a single event."
It's very easy to use once you have it installed and will solve your partial character data problem.

Using hash as a reference is deprecated

I searched SO before asking this question, I am completely new to this and have no idea how to handle these errors. By this I mean Perl language.
When I put this
%name->{#id[$#id]} = $temp;
I get the error Using a hash as a reference is deprecated
I tried
$name{#id[$#id]} = $temp
but couldn't get any results back.
Any suggestions?
The correct way to access an element of hash %name is $name{'key'}. The syntax %name->{'key'} was valid in Perl v5.6 but has since been deprecated.
Similarly, to access the last element of array #id you should write $id[$#id] or, more simply, $id[-1].
Your second variation should work fine, and your inability to retrieve the value has an unrelated reason.
Write
$name{$id[-1]} = 'test';
and
print $name{$id[-1]};
will display test correctly
%name->{...}
has always been buggy. It doesn't do what it should do. As such, it now warns when you try to use it. The proper way to index a hash is
$name{...}
as you already believe.
Now, you say
$name{#id[$#id]}
doesn't work, but if so, it's because of an error somewhere else in the code. That code most definitely works
>perl -wE"#id = qw( a b c ); %name = ( a=>3, b=>4, c=>5 ); say $name{#id[$#id]};"
Scalar value #id[$#id] better written as $id[$#id] at -e line 1.
5
As the warning says, though, the proper way to index an array isn't
#id[...]
It's actually
$id[...]
Finally, the easiest way to get the last element of an array is to use index -1. The means your code should be
$name{ $id[-1] }
The popular answer is to just not dereference, but that's not correct. In other words %$hash_ref->{$key} and %$hash_ref{$key} are not interchangeable. The former is required to access a hash reference nested as an element in another hash reference.
For many moons it has been common place to nest hash references. In fact there are several modules that parse data and store it in this kind of data structure. Instantly depreciating the behavior without module updates was not a good thing. At times my data is trapped in a nested hash and the only way to get it is to do something like.
$new_hash_ref = $target_hash_ref->{$key1}
$new_hash_ref2 = $target_hash_ref->{$key2}
$new_hash_ref3 = $target_hash_ref->{$key3}
because I can't
foreach my $i(keys(%$target_hash_ref)) {
foreach(%$target_hash_ref->{$i} {
#do stuff with $_
}
}
anymore.
Yes the above is a little strange, but creating new variables just to avoid accessing a data structure in a certain way is worse. Am I missing something?
If you want one item from an array or hash use $. For a list of items use # and % respectively. Your use of # as a reference returned a list instead of an item which perl may have interpreted as a hash.
This code demonstrates your reference of a hash of arrays.
#!/usr/bin perl -w
my %these = ( 'first'=>101,
'second'=>102,
);
my #those = qw( first second );
print $these{$those[$#those]};
prints '102'

Problems check username input against flat file for user creation

I am working on a user login and am having trouble with the user creation part. My problem is that I am trying to check the input username against a text file to see if that username already exists. I can't seem to get it to compare the input username to the array that I have brought in. I have tried two different ways of accomplishing this. One using an array and another using something I read online that I don't quite understand. Any help or explanation would be greatly appreciated.
Here is my attempt using an array to compare off of
http://codepad.org/G7xmsf3z
Here is my second attempt
http://codepad.org/SbeqmdbG
In your first attempt, try to put the if inside of the loop:
foreach my $pair(#incomingarray) {
(my $name,my $value) = split (/:/, $pair);
if ($name eq $username) {
print p("Username is already taken, try again");
close(YYY);
print end_html();
}
else {
open(YYY, ">>password.txt");
print YYY $username.":".$hashpass."\n";
print p("Your account has been created sucessfully");
close(YYY);
print end_html();
}
}
In you second attempt, I think you should try and change the line:
if (%users eq $username) {
with this one:
if (defined $users{$username}) {
As has been stated above regarding locking the flatfile from other processes there is the issue with scaling too. the more users you have the slower the lookup will be.
I started years ago with a flat file, believing I would never scale enough to require a real database and didn't want to learn how to use mySQL for example. Eventually after flatfile corruptions and long lookup times I had no choice but to move to a database.
Later you will find yourself wanting to store user preferences and such, it's easy to add a new field to a database. Flatfile will end up having the overhead of splitting each line into separate fields.
I'd suggest you do it properly with a database.
As in my comment, you should not be using a flatfile to hold your user info. You should use a proper database that will handle concurrent access for you rather than having to understand and code up how to deal with all of that yourself!
If you insist on using an array, you can search it with grep() if it is not "too large":
if (grep /^$username:/, #incomingarray) {
print "user name '$username' is already registered, try again\n";
}
else {
print "user name '$username' is not already registered\n";
}
I see some other problems in your code as well.
You should always prefer lexical (my) variables over package (our) variables.
Why do you think (erroneously) that $name and $username cannot be lexical variables?
You should always use the 3-arg form of open() and check its return value like in your 2nd code example. Your open() in the 1st code example is how it was done many many years ago.

After querying DB I can't print data as well as text anymore to browser

I'm in a web scripting class, and honestly and unfortunately, it has come second to my networking and design and analysis classes. Because of this I find I encounter problems that may be mundane but can't find the solution to it easily.
I am writing a CGI form that is supposed to work with a MySQL DB. I can insert and delete into the DB just fine. My problem comes when querying the DB.
My code compiles fine and I don't get errors when trying to "display" the info in the DB through the browser but the data and text doesn't in fact display. The code in question is here:
print br, 'test';
my $dbh = DBI->connect("DBI:mysql:austinc4", "*******", "*******", {RaiseError => 1} );
my $usersstatement = "select * from users";
my $projstatment = "select * from projects";
# Get the handle
my $userinfo = $dbh->query($usersstatement);
my $projinfo = $dbh->query($projstatement);
# Fetch rows
while (#userrow = $userinfo->fetchrow()) {
print $userrow[0], br;
}
print 'end';
This code is in an if statement that is surrounded by the print header, start_html, form, /form, end_html. I was just trying to debug and find out what was happening and printed the statements test and end. It prints out test but doesn't print out end. It also doesn't print out the data in my DB, which happens to come before I print out end.
What I believe I am doing is:
Connecting to my DB
Forming a string the contains the command/request to the DB
Getting a handle for my query I perform on the DB
Fetching a row from my handle
Printing the first field in the row I fetched from my table
But I don't see why my data wouldn't print out as well as the end text. I looked in DB and it does in fact contain data in the DB and the table that I am trying to get data from.
This one has got me stumped, so I appreciate any help. Thanks again. =)
Solution:
I was using a that wasn't supported by the modules I was including. This leads me to another question. How can I detect errors like this? My program does in fact compile correctly and the webpage doesn't "break". Aside from me double checking that all the methods I do use are valid, do I just see something like text not being displayed and assume that an error like this occurred?
Upon reading the comments, the reason your program is broken is because query() does not execute an SQL query. Therefore you are probably calling an undefined subroutine unless this is a wrapper you have defined elsewhere.
Here is my original posting of helpful hints, which still apply:
I hope you have use CGI, use DBI, etc... and use CGI::Carp and use strict;
Look in /var/log/apache2/access.log or error.log for the bugs
Realize that the first thing a CGI script prints MUST be a valid header or the web server and browser become unhappy and often nothing else displays.
Because of #3 print the header first BEFORE you do anything, especially before you connect to the database where the script may die or print something else because otherwise the errors or other messages will be emitted before the header.
If you still don't see an error go back to #2.
CGIs that use CGI.pm can be run from a command line in a terminal session without going through the webserver. This is also a good way to debug.