Query Expansion in Perl - perl

Is there any existing implementation for query expansion in Perl?
By query expansion I mean, when the user enter a query in our database
it will expand the search based on related terms.
In principle we have a XML file (e.g. MESH) with which
we want to refer to for query expansion.

Bio::DB::MeSH - Term retrieval from a Web MeSH database
my $mesh = Bio::DB::MeSH->new();
my $term = $mesh->get_exact_term('Butter');
print $term->description;

You got a serviceable answer already but there is a far deeper and more robust alternative for more serious usage: UMLS::Similarity and UMLS::Interface. The problem being these are a little bit of a bear to install, require MySQL, take up quite a bit of disk space, and require you to have the MeSH stuff locally and make sure your usage is in compliance with a couple dozen related licenses for the dictionaries/sources.
I do not mean to disparage Bio::DB::MeSH, it's useful and part of a bigger picture (BioPerl), but it has fragile heuristics and is at the mercy of availability and trivial HTML changes in the target/source site (it was broken the last time I used it for example though it was easy to patch locally).

Related

How to localize CPAN module and dependencies

I am trying to localize a CPAN module MooX::Options using Locale::TextDomain after having read "On the state of I18N in perl".
In the discussion in the pull request the question came up how to deal with messages not originating in the module itself, but in a dependency. In this specific case, when you specify an option on the command line which is not defined anywhere in the code, you'll get the warning:
Unknown option: xyz
originating in the module Getopt::Long, which in itself is not localized yet.
The question is how to deal with these. I see basically three strategies:
Ignore them, which I find dissatisfactory.
Try to someway or other catch all the corner cases and messages in the module I'm currently localizing (in this case MooX::Options), and this way working around the missing localization in the dependent modules. This option seems brittle, as I'd have to constantly adapt to changes in the base modules. Sometimes, it might by next to impossible to catch messages, as they're written to output streams directly by the modules (as is the case in this example).
Try to localize the dependent modules themselves. This option seems hard to achieve, as different projects might use different I18N tools and strategies themselves and the dependency graph might be huge.
All in all, I think this problem is more general and not that specific to perl and cpan modules. So, I'm interessted in your thoughts, strategies and approaches.
I have rather strong opionions on the idea of translating computing terms, and most people disagree with my views, so take what I am saying with a grain of salt.
I do not understand the point of internationalizing a library for parsing command line options unless you want to further ghettoize what is already a small group of users of said library.
Would wget be more useful to Turkish users if instead it was called wal or wgetir? Or, instead of wget --mirror, should Turkish users write getir --ayna? What about that w?
If you just translate the messages, what is the point of outputting a help message in response to wget -h when the Turkish equivalent would be wget -y?
The fact is, almost all attempts at translating programming related terms I have seen are simply awful. The people who are most eager to translate are usually not in command of either human language — Nor do they seem to understand what they are translating.
However, as a result of these eager people, I find that at least the Turkish translations of pretty much any software I touch is just awful. Whatever Danish translations I have seen did not fare much better, but, at least, they were tolerable owing to the greater commonality of structure between Danish and English.
I think everyone's energy is better spent on actually making sure their programs handle content, including names of external resources/references, in different languages well, rather than giving me error messages in some Frankenstein language, or letting me specify command line options whose mnemonics do not match their descriptions etc, or presenting menus that contain of strings of words that really do not convey any meaning.
I have felt this way for the last for many decades now ... Even when I was patching IBM PC keyboard drivers with hex editors so people at various places could type reports in WordStar, and create charts in Harvard Graphics.
So, my unpopular advice is to put your energy elsewhere ...
For example, use exception objects so the user of your library (who is likely a programmer and will understand "Directory not found" much more readily than "Kütük bulunamadı") can deduce in a human-language independent way what happened, and what message to show the user. I haven't looked closely at MooX::Options, but I notice there is at least one string croak.
Here is an actual error message from an IBM product:
Belirtilen kütük örüntüsüyle eşleşen hiçbir kütük bulunamadı
You can ask every one of the almost 200 million Turkic people on earth what a "kütük örüntüsü" is, and only the person who actually came up with this non-sensical string of characters will be able to tell you that it corresponds to "file pattern". What, then, do they gain by using the phrase "kütük örüntüsü" versus "file pattern"? Nothing.
However, they lose the ability to communicate with, and, also, compete with, programmers in the English speaking world.
PS: Apologies for all Turkish examples, but I feel most comfortable drawing abominable examples based on my native language.

How to add logging information to perl legacy code

I have a medium to large size system built in perl, that has been developed during the last 15 years and is built of many scripts and pm files,
and in order to improve the system i need more data, the easiest way as i see it to get this data is to have every function in the code to print out the start and end time to some log so it will be possible the understand what is taking the most time.
however this is an old system and some parts are less maintainable than others and on top of it i need it to be running which means in order to get real data i need it to print this out from production.
what i want to do is to override in some way the function declration to wrap each function start in a line like
NAME start STARTTIME PARAMS
and when it leaves the function
NAME ended STARTTIME PARAMS
does anybody can point me to the right direction?
Thanks
Take a look at Devel::NYTProf. It can profile the amount of time that all of your subs are taking (and do a lot more). It doesn't involve a lot of messy code modification; instead you just run your script with it:
perl -d:NYTProf your_script.pl
Previous answers are spot on (I especially recommend Devel::NYTProf). However, a more general technique you could apply in general to gather data about your subroutines' behaviour is fiddling with the symbol table, "appending" (or prepending) code to the actual sub's code.
A couple of pointers:
In Perl, can I call a method before executing every function in a package? (this answer shows a code example you could adapt to your particular situation)
Hook::LexWrap is a module that lets you augment subroutine behaviour in several ways, without touching the original code
HTH
sounds like you need to use a profiler
http://www.perl.org/about/whitepapers/perl-profiling.html
Perl profilers usually have a huge impact on the program performance, so using them in production may not be a great idea.
You can try Devel::ContinuousProfiler that claims to have very low impact (I myself have never used it, just discovered it this morning!)

What are "not so well defined problems" that LISP is supposed to solve?

Most people agree that LISP helps to solve problems that are not well defined, or that are not fully understood at the beginning of the project.
"Not fully understood"" might indicate that we don't know what problem we are trying to solve, so the developer refines the problem domain continuously. But isn't this process language independent?
All this refinement does not take away the need for, say, developing algorithms/solutions for the final problem that does need to be solved. And that is the actual work.
So, I'm not sure what advantage LISP provides if the developer has no idea where he's going i.e. solving a problem that is not finalised yet.
Lisp (not "LISP") has a number of advantages when you're facing problems that are not well-defined. First of all, you have a REPL where you can quickly experiment with -- that helps in sketching out quick functions and trying to play with them, leading to a very rapid development cycle. Second, having a dynamically typed language is working well in this context too: with a statically typed language you need to "design more" before you begin, and changing the design leads to changing more code -- in contrast, with Lisps you just write the code and the data it operates on can change as needed. In addition to these, there's the usual benefits of a functional language -- one with first class lambda functions, etc (eg, garbage collection).
In general, these advantage have been finding their way into other languages. For example, Javascript has everything that I listed so far. But there is one more advantage for Lisps that is still not present in other languages -- macros. This is an important tool to use when your problem calls for a domain specific language. Basically, in Lisp you can extend the language with constructs that are specific to your problem -- even if these constructs lead to a completely different language.
Finally, you need to plan ahead for what happens when the code becomes more than a quick experiment. In this case you want your language to cope with "growing scripts into applications" -- for example, having a module system means that you can get a more "serious"
application. For example, in Racket you can get your solution separated into such modules, where each can be written in its own language -- it even has a statically typed language which makes it possible to start with a dynamically typed development cycle and once the code becomes more stable and/or big enough that maintenance becomes difficult, you can switch some modules into the static language and get the usual benefits from that. Racket is actually unique among Lisps and Schemes in this kind of support, but even with others the situation is still far more advanced than in non-Lisp languages.
In AI (Artificial Intelligence) historically Lisp was seen as the AI assembly language. It was used to build higher-level languages which help to work with the problem domain in a more direct way. Many of these domains need a lot of 'knowledge' for finding usable answers.
A typical example is an expert system for, say, oil exploration. The expert system gets as inputs (geological) observations and gives information about the chances to find oil, what kind of oil, in what depths, etc. To do that it needs 'expert knowledge' how to interpret the data. When you start such a project to develop such an expert system it is typically not clear what kind of inferences are needed, what kind of 'knowledge' experts can provide and how this 'knowledge' can be written down for a computer.
In this case one typically develops new languages on top of Lisp and you are not working with a fixed predefined language.
As an example see this old paper about Dipmeter Advisor, a Lisp-based expert system developed by Schlumberger in the 1980s.
So, Lisp does not solve any problems. But it was originally used to solve problems that are complex to program, by providing new language layers which should make it easier to express the domain 'knowledge', rules, constraints, etc. to find solutions which are not straight forward to compute.
The "big" win with a language that allows for incremental development is that you (typically) has a read-eval-print loop (or "listener" or "console") that you interact with, plus you tend to not need to lose state when you compile and load new code.
The ability to keep state around from test run to test run means that lengthy computations that are untouched by your changes can simply be kept around instead of being re-computed.
This allows you to experiment and iterate faster. Being able to iterate faster means that exploration is less of a hassle. Very useful for exploratory programming, something that is typical with dealing with less well-defined problems.

Are MooseX::Declare and MooseX::Method::Signatures production ready?

From the current version (0.98) of the Moose::Manual::MooseX are the lines:
We have high hopes for the future of
MooseX::Method::Signatures and
MooseX::Declare. However, these
modules, while used regularly in
production by some of the more insane
members of the community, are still
marked alpha just in case backwards
incompatible changes need to be made.
I noticed that for MooseX::Method::Signatures the change log for September 2009 mentions the removal of the "scary ALPHA disclaimer".
So, are these still "alpha"?
Would I still be considered one of the "more insane" to use them?
I'd say they are production ready - I'm using them in production - but there are several things to consider:
Performance
MooseX::Declare and dependencies do almost all of their magic at compile time. Depending on the size of your program, you might find anywhere from half a second to several seconds of additional initialization overhead. If this a problem, don't use MooseX::Declare.
At runtime, the main overhead is type and argument checking, which you should (ideally) be doing anyway. That said, Moose type constraints have some overheads, namely coercion and the more complex (MooseX::Types::Structured-style) constraints. Don't use these if performance is an issue.
Stability
MooseX::Declare and MooseX::Method::Signature's external syntax is now stable. But it is important to know that the internals are subject to extreme change. (fortunately, changes for the better)
To give you an idea, the signature itself is grabbed using a big block of C code stolen from the Perl tokenizer (toke.c). This can break in some situations since it isn't actually parsing anything. The bit inside the brackets is parsed using PPI, which is designed for pure Perl, but the resulting PPI tree is then hacked up to get something useful. Devel::Declare itself is a hack - after it sees specific keywords (e.g. 'role', 'class', 'method') the Devel::Declare-using module must rewrite the source code by hand, with no interaction with the real Perl parser.
Corner cases may cause Perl to segfault. Or rewrite the source code badly, so you get syntax errors but have no idea what's causing them without -MO::Deparse. If you mess up the MooseX::Declare syntax by accident, there is no guarantee that the module will detect this and give you a sensible error. The ALPHA message may have gone, but this is still doing dark and scary things internally, and you should be prepared for that.
UPDATE
MooseX::Declare has not been updated much, and you may wish to look at alternatives such as Moops. Personally, I have decided to stick with pure Moose until Perl itself begins to support class/method/has syntax natively, which is possibly on the cards.
I think it's a matter of differing perspectives as much as anything -- rafl is one of the aforementioned "more insane members of the community" while Rolsky is more conservative. It's up to you to decide who you agree with, and really I think that the most important variable is your own code.
MooseX::Declare is good code. It won't randomly blow up your machine, it's not awful for performance, and it offers a lot of nifty stuff while reducing the amount of boilerplate that you have to write. But it might change in the future, making your code refuse to compile until it's updated; it might make your editor and other development tools confused when it sees syntax that it can't parse, it might piss off your collaborators by making them learn a new module to work with your code, or it might piss off your boss by making it so any future maintainer has to learn a new module to work with your code. Which of those things apply to you, and to what degree? You know better than I do, I hope.
There are people who feel that the maturity and stability of MooseX::Delcare, Devel::Declare on which it's based, or even Moose itself are not yet ready for "prime time". I also know of two large companies with millions of visitors a month, who have MooseX::Declare in their production environment. I personally am happy with the stack I am provided with Moose and do not see a need yet to bring in MooseX::Declare. I know people who's opinion I deeply respect who refuse to write new code without the declarative sugar from MooseX::Declare.
All of this is to say, the decision on whether something is or is not production ready is highly dependent upon your production environment, your development needs, and taste for risk. Without being in your shoes we can't possibly give an informed decision as to how well any given tool matches that profile.
MooseX::Method::Signatures (MXMS), and MooseX::Declare which uses it, is not production ready. This is not because the code isn't stable, but because it is appallingly slow. Simply using the method keyword, no types or arguments, is a 500-1000x runtime performance hit over a regular method call. My Macbook Pro can do about 6,000 simple method calls per second using MXMS vs 5,000,000 with plain Perl.
Method::Signatures, in contrast, has almost no performance hit above what it would normally cost to do the requested checks. The syntax is almost exactly the same as MXMS and it supports Moose (and Mouse) types. Both rely on the same underlying syntax modifying technique. (Full disclosure, I am the author of Method::Signatures.)
If you like MooseX::Declare but want the performance of Method::Signatures, try Method::Signatures::Modifiers.
It depends on what you mean by "production ready". I wouldn't depend on them until their velocity slows down quite a bit. I like my production stuff to not need frequent care from external code changes, API adjustments, and so on. That's not something particular to Moose, but any young project.
You have to judge how much that matters to you. In some situations, pushing stuff into production is a lengthy process, so you must be circumspect with such things. At the other extreme, some places let you edit files directly on the production server. That is, you have to define your tolerance before anyone can tell you which side a given MooseX module is on.
MooseX::Declare and MooseX::Method::Signatures are working well but they can have really nasty penalty depending on what does your code do. This can be fixed by just not using method keyword or using Method::Signatures::Modifiers.
Performance penalty I am seeing is around 2-5x compared to Method::Signatures::Modifiers (5x being mostly for one specific class I was using). And it seems that it is mostly compile time or maybe first time initialization, because it is getting under 2x when the computation is longer.
Method::Signatures::Modifiers has better errors but you have to turn this optimization off when you use debugger (it goes haywire because it does not see these methods, you can see for yourself in -MO=Deparse output).
It may be worth it to get rid of Perl argument shifting hell.

When should I use OO Perl?

I'm just learning Perl.
When is it advisable to use OO Perl instead of non-OO Perl?
My tendency would be to always prefer OO unless the project is just a code snippet of < 10 lines.
TIA
From Damian Conway:
10 criteria for knowing when to use object-oriented design
Design is large, or is likely to become large
When data is aggregated into obvious structures, especially if there’s a lot of data in each aggregate
For instance, an IP address is not a good candidate: There’s only 4 bytes of information related to an IP address. An immigrant going through customs has a lot of data related to him, such as name, country of origin, luggage carried, destination, etc.
When types of data form a natural hierarchy that lets us use inheritance.
Inheritance is one of the most powerful feature of OO, and the ability to use it is a flag.
When operations on data varies on data type
GIFs and JPGs might have their cropping done differently, even though they’re both graphics.
When it’s likely you’ll have to add data types later
OO gives you the room to expand in the future.
When interactions between data is best shown by operators
Some relations are best shown by using operators, which can be overloaded.
When implementation of components is likely to change, especially in the same program
When the system design is already object-oriented
When huge numbers of clients use your code
If your code will be distributed to others who will use it, a standard interface will make maintenence and safety easier.
When you have a piece of data on which many different operations are applied
Graphics images, for instance, might be blurred, cropped, rotated, and adjusted.
When the kinds of operations have standard names (check, process, etc)
Objects allow you to have a DB::check, ISBN::check, Shape::check, etc without having conflicts between the types of check.
There is a good discussion about same subject # PerlMonks.
Having Moose certainly makes it easier to always use OO from the word go. The only real exception is if compilation start-up is an issue (Moose does currently have a compile time overhead).
I don't think you should measure it by lines of code.
You are right, often when you are just writing a simple script OO is probably too much overhead, but I think you should be more flexible regarding the 10 lines aproach.
In all cases when you are using OO Perl Rememebr to use Moose (or Mouse)
This question doesn't have that much to do with Perl. The question is "when, given a choice, should I use OO?" That "given a choice" bit is because in some languages (Java, for example), you really don't have any choice.
The answer is "when it makes sense". Think about the problem you're trying to solve. Does the problem fit into the OO concepts of classes and object? If it does, great, use OO. Otherwise use some other paradigm.
Perl is fairly flexible, and you can easily write procedural, functional, or OO Perl, or even mix them together. Don't get hung up on doing OO because everyone else is. Learn to use the right approach for each task.
All of this takes experience and practice, so make sure to try all these approaches out, and maybe even take some smaller problems and solve them in multiple ways to see how each works.
Damian Conway has a passage in Perl Best Practices about this. It is not a rule that you have to follow it, but it is probably better advice that I can give without knowing a lot about what you are doing.
Here is the publisher's page if that is a better place to link to the book.