Perl module for text comparison - perl

Can anyone suggest a Perl module which can compare two strings and return a degree to which they match? I searched CPAN extensively, and although there are similar modules like String::Approx and Data::Compare, they are not what I am looking for. Suppose I have two strings : I love you, and I boht you. I want functionality which will compare these two strings, taking into account numerous parameters, the matching of words in correct order (love as the first word in a string should not "match" love as the 4th word in the 2nd string, even though both strings have that word), words not matching but spelt almost similarly (like say love and loge), number of words, etc and return an index, say a number from 0 to 1 on a scale of 1, representing the degree of similarity between the two strings. Is there any such Perl module?

There are many such modules. Often, though, you'll have to make use of them in some special way to account for your own assumptions. Most of the string comparison tools like this just implement some algorithm for comparing one string to another. Most assume that if you have specific policy decisions to make, you'll code them yourself.
Personally, I am not sure I'd recommend Text::Levenshtein because of bugs and lack of ut8 support. I don't have a better recommendation either, though.
However, these searches will reveal lots of potential modules you could look into and determine what works best for your purpose (based on the names of common algorithms for doing this sort of thing):
https://metacpan.org/search?q=levenshtein
https://metacpan.org/search?q=wagner+fischer
https://metacpan.org/search?q=edit+distance
If you're interested in spoken similarities, you can also look into phonetic comparisons:
https://metacpan.org/search?q=phonetic
https://metacpan.org/search?q=soundex
https://metacpan.org/search?q=metaphone

Related

What is "dont" and "isnt" in the pertained GloVe vector files (e.g. glove.6B.50d.txt)?

I found these 2 words "dont" and "isnt" in the vector file glove.6B.50d.txt downloaded from https://nlp.stanford.edu/projects/glove/. I wonder if they were originally "don't" and "isn't". This will likely depend on the sentence_to_word parsing algorithms they used. If someone is familiar, please confirm if this is the case.
A secondary question is if this is a common way to deal with apostrophe for words like "don't", "isn't", "hasn't" and so on. i.e. just filter replace that apostrophe with an empty string such that "don" and "t" becomes one word.
Finally, I am also not sure if GloVe comes with API to do sentence_to_word parsing so you can be consistent with what the researchers have done originally.
I think dont and isnt really are originally don't and isn't. I have seen a few other such examples. I suspect this is just the specific way GloVe researchers handle this.

Uniform list pretty-printer

It is known that default printer can be confusing wrt lists because of no output for empty lists and 3 different notations being mixed (, vs (x;y;z) vs 1 2 3) and not obvious indentation/columnization (which is apparently optimized for table data). I am currently using -3! but it is still not ideal.
Is there a ready-made pretty-printer that has consistent uniform output format (basically what I am used to in any other language where list is not special)?
I've started using .j.j for string outputs in error messages more recently in preference to -3!. Mainly I think it is easier to parse in a text log, but also doesn't truncate in the same way.
It still transforms atoms and lists differently so it might not exactly meet your needs, if you really want that you could compose it with the old "ensure this is a list" trick:
myPrinter:('[.j.j;(),])
You might need to supply some examples to better explain your issues and your use-case for pretty-printing.
In general -3! is the most clear visual representation of the data. It is the stringified equivalent to another popular display method which is 0N!.
The parse function is useful for understanding how the interpreter reads/executes commands but I don't think that will be useful in your case

Perl operators are "discovered" and not designed?

Just reading this page: https://github.com/book/perlsecret/blob/master/lib/perlsecret.pod , and was really surprised with the statements like:
Discovered by Philippe Bruhat, 2012.
Discovered by Abigail, 2010. (Alternate nickname: "grappling hook")
Discovered by Rafaël Garcia-Suarez, 2009.
Discovered by Philippe Bruhat, 2007.
and so on...
The above operators are DISCOVERED, so they are not intentional by perl-design?
Thats mean than here is possibility than perl sill have some random character sequences what in right order doing something useful like the ()x!! "operator"?
Is here any other language what have discovered operatos?
From the page you linked:
They are like operators in the sense that these Perl programmers see
them often enough to recognize them without thinking about their
smaller parts, and eventually add them to their toolbox. And they are
like secrets in the sense that they have to be discovered by their
future user (or be transmitted by a fellow programmer), because they
are not explicitly documented.
That is, they are not really their own operators, but they are made up of smaller operators compounded to do something combinedly.
For example, the 'venus' operator (0+ or +0) numifies the object on its left or right. That's what adding zero in any form does, "secret" operator or not.
Perl has a bunch of operators that do special things, as well as characters that do special things when interpreted in a specific context. Rather than these being actual "operators" (i.e., not explicitly recognized by the Perl parser), think of them as combinations of certain functions/operations. For example ()X!!, which is known as the "Enterprise" operator, consists of () which is a list, followed by x, which is a repetition operator, followed by !! (the "bang bang" operator), which performs a boolean conversion. This is one of the reasons that Perl is so expressive.

What's a Good package for Phonetic Representation for Various Human Languages?

I'm currently working on a project for which I think being able to come up with phonetic representations of words in various languages would be really helpful. I know Aspell does this pretty well, but I don't think there's a very easy way to get at their phonetic representations, so I ask: is there some other good package for getting the phonetic representation of a word given the word and the language/dialect/accent/whatever it's coming from?
This doesn't need to be in any particular language, but if it were Perl, that would be best.
I've already tried Soundex, Metaphone, DoubleMetaphone, and everything else in Text::Phonetic, and none of that stuff was very good – definitely nowhere near as good as the stuff in Aspell.
The first thing that springs to mind is Soundex. Of course, there is a Perl module Soundex, too. While this is designed to generate a soundex "key" from input it might be useful in mapping different variants to a common key.
There is a package Text::Aspell in CPAN. Might be useful.
I you are trying to make a google style suggestion/correction system, it's not based on just phonetics or AI, but on a massive amount of user input. When a user makes a search, and doesn't click in any link but corrects the input and searches again, it gives google a lot of data about "correct" writing than a phonetics test or dictionary matching.
The main problem is in human language itself, it's not that people speak or write in a deterministic way, let alone in multiple languages.
Of course , i might be wrong, but if you need a library that let's you do this:
getLanguage(string);
I want to see that working, really.

What is Perl's secret of getting small code do so much?

I've seen many (code-golf) Perl programs out there and even if I can't read them (Don't know Perl) I wonder how you can manage to get such a small bit of code to do what would take 20 lines in some other programming language.
What is the secret of Perl? Is there a special syntax that allows you to do complex tasks in few keystrokes? Is it the mix of regular expressions?
I'd like to learn how to write powerful and yet short programs like the ones you know from the code-golf challenges here. What would be the best place to start out? I don't want to learn "clean" Perl - I want to write scripts even I don't understand anymore after a week.
If there are other programming languages out there with which I can write even shorter code, please tell me.
There are a number of factors that make Perl good for code golfing:
No data typing. Values can be used interchangeably as strings and numbers.
"Diagonal" syntax. Usually referred to as TMTOWTDI (There's more than one way to do it.)
Default variables. Most functions act on $_ if no argument is specified. (A few act
on #_.)
Functions that take multiple arguments (like split) often have defaults that
let you omit some arguments or even all of them.
The "magic" readline operator, <>.
Higher order functions like map and grep
Regular expressions are integrated into the syntax (i.e. not a separate library)
Short-circuiting operators return the last value tested.
Short-circuiting operators can be used for flow control.
Additionally, without strictures (which are off be default):
You don't need to declare variables.
Barewords auto-quote to strings.
undef becomes either 0 or '' depending on context.
Now that that's out of the way, let me be very clear on one point:
Golf is a game.
It's great to aspire to the level of perl-fu that allows you to be good at it, but in the name of $DIETY do not golf real code. For one, it's a horrible waste of time. You could spend an hour trying to trim out a few characters. Golfed code is fragile: it almost always makes major assumptions and blithely ignores error checking. Real code can't afford to be so careless. Finally, your goal as a programmer should be to write clear, robust, and maintainable code. There's a saying in programming: Always write your code as if the person who will maintain it is a violent sociopath who knows where you live.
So, by all means, start golfing; but realize that it's just playing around and treat it as such.
Most people miss the point of much of Perl's syntax and default operators. Perl is largely a "DWIM" (do what I mean) language. One of it's major design goals is to "make the common things easy and the hard things possible".
As part of that, Perl designers talk about Huffman coding of the syntax and think about what people need to do instead of just giving them low-level primitives. The things that you do often should take the least amount of typing, and functions should act like the most common behavior. This saves quite a bit of work.
For instance, the split has many defaults because there are some use cases where leaving things off uses the common case. With no arguments, split breaks up $_ on whitespace because that's a very common use.
my #bits = split;
A bit less common but still frequent case is to break up $_ on something else, so there's a slightly longer version of that:
my #bits = split /:/;
And, if you wanted to be explicit about the data source, you can specify the variable too:
my #bits = split /:/, $line;
Think of this as you would normally deal with life. If you have a common task that you perform frequently, like talking to your bartender, you have a shorthand for it the covers the usual case:
The usual
If you need to do something, slightly different, you expand that a little:
The usual, but with onions
But you can always note the specifics
A dirty Bombay Sapphire martini shaken not stirred
Think about this the next time you go through a website. How many clicks does it take for you to do the common operations? Why are some websites easy to use and others not? Most of the time, the good websites require you to do the least amount of work to do the common things. Unlike my bank which requires no fewer than 13 clicks to make a credit card bill payment. It should be really easy to give them money. :)
This doesn't answer the whole question, but in regards to writing code you won't be able to read in a couple days, here's a few languages that will encourage you to write short, virtually unreadable code:
J
K
APL
Golfscript
Perl has a lot of single character special variables that provide a lot of shortcuts eg $. $_ $# $/ $1 etc. I think it's that combined with the built in regular expressions, allows you to write some very concise but unreadable code.
Perl's special variables ($_, $., $/, etc.) can often be used to make code shorter (and more obfuscated).
I'd guess that the "secret" is in providing native operations for often repeated tasks.
In the domain that perl was originally envisioned for you often have to
Take input linewise
Strip off whitespace
Rip lines into words
Associate pairs of data
...
and perl simple provided operators to do these things. The short variable names and use of defaults for many things is just gravy.
Nor was perl the first language to go this way. Many of the features of perl were stolen more-or-less intact (or often slightly improved) from sed and awk and various shells. Good for Larry.
Certainly perl wasn't the last to go this way, you'll find similar features in python and php and ruby and ... People liked the results and weren't about to give them up just to get more regular syntax.
What's Java's secret of copying a variable in only one line, without worrying about buses and memory? Answer: the code is transformed to bigger code. Same for every language ever invented.