Predict whether name is forename or surname from author string (lexical analysis) - perl

I'm writing a parser that converts messy author strings into neatly formatted strings in the following format: ^([A-Z]\. )+[[:surname:]]$. Some examples below:
Smith JS => J. S. Smith
John Smith => J. Smith
John S Smith => J. S. Smith
J S Smith => J. S. Smith
I've managed to get quite far using various regular expressions to cover most of these, but have hit a wall for instances where a full name is provided in an unknown order. For example:
Smith John
John Smith
Smith John Stone
Obviously regular expressions won't be able to discern what order the forename, surname and middle name(s) are in, so my thought is to perform a lexical analysis on the author string, returning a type and confidence score for each token. Has anyone coded such a solution before, preferably in Perl? If so, I imagine my code would look something like this:
use strict;
use warnings;
use UnknownModule::NamePredictor qw( predict_name );
my $messy_author = "Smith John Stone";
my #names = split(' ',$messy_author);
for my $name (#names){
my ($type,$confidence) = predict_name($name);
}
I've seen a post here explaining the problem I have, but no viable solution has been suggested. I'd be quite surprised if no one has coded such a solution before if I'm honest, as there are huge training sets available. I may go down this route myself if it hasn't been done already.
Other things to consider:
I don't need this to be perfect. I'm looking for precision >90% ideally.
I have >100,000 messy author strings to play with. My goal is to pass as many cleanly as possible, evaluate and improve the approach over time.
These are definitely author strings, but they're muddled together in lots of different formats, hence the challenge I've set myself.
For everyone trying to point out that names aren't necessarily possible to categorise. In short, yes of course there will be those instances, hence why I'm gunning for imperfect precision. However, the majority can be categorised pretty comfortably. I know this simply because my human brain, with all its clever pattern recognising abilities, allows me to do it pretty well.
UPDATE: In the absence of an existing solution I've been looking at creating a model from a Support Vector Machine, using LIBSVM. I should be able to build a large and accurate training and test datasets using forenames and surnames taken from PubMed, which has a nice library of >25M articles containing categorised names. Unfortunately these don't have middle names though, just initials.

Related

Perl module for text comparison

Can anyone suggest a Perl module which can compare two strings and return a degree to which they match? I searched CPAN extensively, and although there are similar modules like String::Approx and Data::Compare, they are not what I am looking for. Suppose I have two strings : I love you, and I boht you. I want functionality which will compare these two strings, taking into account numerous parameters, the matching of words in correct order (love as the first word in a string should not "match" love as the 4th word in the 2nd string, even though both strings have that word), words not matching but spelt almost similarly (like say love and loge), number of words, etc and return an index, say a number from 0 to 1 on a scale of 1, representing the degree of similarity between the two strings. Is there any such Perl module?
There are many such modules. Often, though, you'll have to make use of them in some special way to account for your own assumptions. Most of the string comparison tools like this just implement some algorithm for comparing one string to another. Most assume that if you have specific policy decisions to make, you'll code them yourself.
Personally, I am not sure I'd recommend Text::Levenshtein because of bugs and lack of ut8 support. I don't have a better recommendation either, though.
However, these searches will reveal lots of potential modules you could look into and determine what works best for your purpose (based on the names of common algorithms for doing this sort of thing):
https://metacpan.org/search?q=levenshtein
https://metacpan.org/search?q=wagner+fischer
https://metacpan.org/search?q=edit+distance
If you're interested in spoken similarities, you can also look into phonetic comparisons:
https://metacpan.org/search?q=phonetic
https://metacpan.org/search?q=soundex
https://metacpan.org/search?q=metaphone

Looking for a fast hash-function

I'm looking for a special hash-function. Let's say I have a large list of strings, if I order them by their hash-values they should be ordered quasi randomly.
The most important point is: it must be super fast. I've tried md5 and sha1 and they're using to much cpu power.
Clashes are not a problem.
I'm using javascript, so it shouldn't be too complicated to implement.
Take a look at Murmur hash. It has a nice space/collision trade-off:
http://sites.google.com/site/murmurhash/
It looks as if you want the sort of hash function used in a hash table, not the sort used to detect duplicates or tampering.
Googling will yield you a wealth of information on alternative hash functions. To start with, stay away from cryptographic signature hashes (like MD-5 or SHA-1), they solve another problem.
You can read this, or this, or this, to start with.
If speed is paramount, you can implement a simple ad-hoc hash, e.g. take the first and last letter and order your string by the last and then first letter. The result would look, as you say, "quasi random" and it would be fast. For instance, part of my answer sorted that way would look like this:
ca ad-hoc
el like
es simple
gt taking
hh hash
nc can
ti implement
uy you
Hsieh, Murmur, Bob Jenkin's comes to my mind.
A nice page about hash functions that has some tests for quality and a simple S-box hash as well.

Why is there so much "magic" in Perl?

Looking through the perlsub and perlop manpages I've noticed that there are many references to "magic" and "magical" there (just search any of them for "magic"). I wonder why is Perl so rich in them.
Some examples:
print ++($foo = 'zz') # prints 'aaa'
printf "%d: %s", $! = 1, $! # prints '1: Operation not permitted'
while (my $line = <FH>) { ... } # $line is tested for definedness, not truth
use warnings; print "0 but true" + 1 # "0 but true" is a valid number!
When a Perl feature is described as "magic":
It means that that feature is
implemented by NBA star Magic Johnson.
Whenever Perl executes "magic", it is
actually sending an RPC call to a
remote receiver implanted in Magic
himself. He computes the answer, and
then sends a return message. The use
of Mr. Johnson for all the hard parts
of Perl provides a great abstraction
layer and simplifies porting to new
platforms. It's way easier than, say,
the Apache Portable Runtime.
Source: perrin on Perl Monks
It's official! Perl is more magical.
Hits from the following Google searches:
25 site:ruby-doc.org magic
36 site:docs.python.org magic
497 site:perldoc.perl.org magic
Magic, in Perl parlance is simply the word given to attributes applied to variables / functions that allow an extension of their functionality. Some of this functionality is available directly from Perl, and some requires the use of the C api.
A perfect example of magic is the tie interface which allows you to define your own implementation of a variable. Every operation that can be done to a variable (fetching or storing a value for instance) is exposed for reimplementation, allowing for elegant and logical syntactic constructs like a hash with values stored on disk, which are transparently loaded and saved behind the scenes.
Magic can also refer to the special ways that certain builtins can behave, such as how the first argument to map or grep can either be a block or a bare expression:
my #squares = map {$_**2} 1 .. 10;
my #roots = map sqrt, 1 .. 10;
which is not a behavior available to user defined subroutines.
Many other features of Perl, such as operator overloading or variables that can return different values when used with numeric or string operators are implemented with magic. Context could be seen as magic as well.
In a nutshell, magic is any time that a Perl construct behaves differently than a naive interpretation would suggest, an exception to the rule. Magic is of course very powerful, and should not be wielded without great care. Magic Johnson is of course involved in the execution of all magic (see FM's answer), but that is beyond the scope of this explaination.
I wonder why is Perl so rich in them.
To make things easy.
You'll find that most "magic" in Perl is to simplify the syntax for common tasks.
Because perl always Does What I Mean for some values of always.
I think (opinion more than fact) that this has to do with the organic growth viewpoint that Perl's creator Larry Wall has with the Perl language. Python is a study in the opposite approach, whose style often makes Perl hackers cringe at the perception of being forced to conform to a stylistic regime.
Some of it has to do with Perl being designed to be "efficient" at writing quick scripts to do Perl*-ish* tasks, in both wall clock time, and in keystrokes. Some of it has to do with the TMTOWTDI mantra of Perl and its followers.
Programmers tend to be opinionated about Perl's frequent usage of "magic", for some it is an eye-straining visual cacophony of chaos and disrespect for orderliness (which harkens back to the days of computer Priesthood in white lab coats behind a glass window), for others it is a shining example of getting things done efficiently, if not always obviously to the novice or outsider.
Perl's design philosophy is that simple things must be simple. This sounds good,and to some extent it is. However, there's a tradeoff involved: Making every simple thing a one-liner results in tons of special case hacks to save a few lines of code. Different people have different preferences regarding making simple operations within a language simple versus making the language specification simple. Perl is at one extreme. Java is at the other, at least among languages that people actually use. Python and C# are somewhere in between.

Why is Perl the best choice for most string manipulation tasks?

I've heard that Perl is the go-to language for string manipulation (and line noise ;). Can someone provide examples and comparisons with other language(s) to show me why?
It is very subjective, so I wouldn't say that Perl is the best choice, but it is certainly a valid choice for string manipulation. Other alternatives are Tcl, Python, AWK, etc.
I like Perl's capabilities because it has excellent support (better than POSIX as pointed out in the comment) for fast regexs and the implicit variables makes it easy to do basic string crunching with very little code.
If you have a *nix background a lot of what you already know will apply to Perl as well, which makes it fairly easy to pick up for a lot of people.
Perl -> Practical Extraction and Reporting Language
Perl's strength(when it comes to string processing) lies in it's very powerful Regular expression engine.
Because of this there are many people in the field of BioInformatics using Perl as their
main tool, hence the large number of posts about BioPerl on PerlMonks . In BioInformatics they work with strings a lot , they call them "sequences"(I don't know much about this).
Perlmonks.org is the heart of the Perl community, check out the immense number of hits
when you search for site:perlmonks.org regex 20,000 hits
You cannot ignore the sheer number of modules on CPAN:
375 modules under the namespace String on CPAN(Perl's module repository)
241 in Regex namespace
156 in Regexp namespace.
This is very clear evidence that Perl is a very powerful language when it comes to string processing.
So if you want to do some string processing and you're using Perl, you've got it covered :)
To address the second part of your question: Perl's reputation for line noise comes from 4 kinds of people:
Overly clever (for their own good) hackers (or sometimes just hacks) who value cleverness and showing off over readability. "If it was hard to write it should be hard to read" is NOT just a mythical attitude.
People who wouldn't know good software development if it hit them over the head with a cluebat. Such as people who save a couple of characters in a program by using $_ instead of a named variable. In a nested scope. Or never heard of comments. Or self-documenting identifiers. Or whitespace.
People who think that software development == code golf. More seriously, that the less the amount of characters in the code, the more readable it is, because they misunderstand what "conciseness" means in code.
(NOTE: first 2 sets are not mutually exclusive)
People who code/hack in perl (e.g. SysAdmins) who have very little training, experience or incentive to do software development. E.g. the percentage of people using Perl who do quick and dirty hacks with bad style and worse code quality is probably higher than, say Python.
Just for reference, 80% of awful Perl "code" in my $work falls under this - it was written by financial analysts who are smart enough to pick up a Perl book and some earlier scripts, clone off a script that does what business need is, and don't have CS/programming background to worry about how readable/maintainable their code was.
In other (and less snide) words, you can write beautiful, incredibly readable and easy to maintain software in Perl. It all depends on who does the writing, what their priorities and skills are. Also, just like with any other language, you can write a miserable write-only mess with it.
The difference from other languages is that very often, the write-onlyness of said mess, when done in Perl, does indeed consist of very high density of non-letter characters (sygils and special characters in poorly written RegExes). This high density can indeed, asymptotically approximate line noise.
Because It is what is perl made for. Because Perl is expressive, powerful and fast. I have beaten many times specialized products with small and dirty script in perl written in few minutes. For example, outer join and large join vs. MySQL (just because can't do merge join), ETL processing vs. Java Hadoop (because I have years experience to write it effectively and perl IO layer is just great) and so and so.
It's a very subjective question. Perhaps the true answer is that Perl has a nice syntax (incl. the regex syntax) that makes people want to sign it high praises over other languages? IMHO, any language that supports a rich regex syntax would be considerablly powerfull at string manipulation.
Kids these days! Back in the day, all we had was SNOBOL -- and we liked it! Try it sometime...you never know, you might want something respectable to fall back on when this Perl fad runs its course!
Perl is widely used for string manipulation tasks as its string manipulation API is easy to learn. And also its regex is widely used. It has been in use for a very long time and anyone with a Unix background would pick up perl very easily. Historically, perl was developed in the late 80's for report processing tasks and was "originally" developed for text processing tasks. So till date, the trend continues as anyone with a string manipulation task or text processing task would opt for perl as the first choice. Its not that other languages like python arent up to the task, but perl's popular in this area.
I like Perl a lot, write books about it, publish a magazine about it, and so on. I don't think I would ever say it's the best language to do anything in. A lot of that has to do with the task you need to do. For many string processing tasks, ETL, data cleanup, and so in, Perl is a very strong and capable language. You wouldn't have that much trouble doing simple tasks.
Your comment sounds like it comes from the early 1990s though, when the rest of the world hadn't caught up. Many of the dynamic languages are now up to task, so you might not have to switch languages. If you decide to use Perl and run into problems, there are plenty of people here who are willing to help, and not all of us will fault you if you choose something else. :)
At the beginning, Perl was developed for easy report processing and dealing with text files, thus it's got a very strong REGEX support. Most of the info on REGEX you can find in perldoc.
Perl was the go-to language for a long time. The problem is it can be pretty messy and difficult to maintain (some people can write Perl that avoids this, but it is very easy to wrote ugly code). I would not tell you to avoid Perl, but many have moved on to some modern alternatives.
I would recommend learning one of the newer scripting languages such as Python or Ruby. Both will work very well for your needs, and can easily handle more difficult tasks later on. They're both quite nice to work in, after having written C and Perl for so long.
In short, Perl would be a good hammer for this nail. Python and Ruby would be nail-guns.
I disagree that Perl is the best language for text processing. Simple things are easy; to replace foo with bar:
$data =~ s/foo/bar/g;
Harder things are not simple, though. Look at Data::SExpression, for example. It is a lot of code to do something very simple.
An similar implementation in Haskell with PArrow looks something like:
import Text.ParserCombinators.PArrow
data Atom = QuotedString String | Symbol String
deriving (Show, Eq)
data Sexp = Sexp [Sexp] | Atom Atom
deriving (Eq)
quotedString :: Char -> Char -> MD a Atom
quotedString quoteChar escapeChar = between q q inside >>^ QuotedString
where q = char quoteChar
inside = many $ (char escapeChar >>> anyChar) <+> notChar quoteChar
doubleQuotedString, symbol :: MD a Atom
doubleQuotedString = quotedString '"' '\\'
symbol = word >>^ Symbol
atom, sexp :: MD a Sexp
atom = (doubleQuotedString <+> symbol) >>^ Atom
sexp = atom <+> (between (char '(') (char ')') sexp' >>^ Sexp)
where sexp' = sepBy1 sexp spaces
Just sayin'. Perl is not the end-all-and-be-all of text manipulation. There are many reasons to prefer Perl to other languages, but parsing is not one of them.

Can aptitude for learning Programming paradigms be influenced by culture or native language's grammar? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
It is well known that different people have different aptitudes regarding various programming paradigms (e.g. some people have trouble learning non-procedural, especially functional languages. Some people have trouble understanding pointers - see Joel Spolsky's blog for musings on that. Some people have trouble grasping recursion).
I was recently reading about a study that looked at how the grammar of someone's native language affected their speed of learning math. Can't find that article now but a quick googling found this reference.
That led me to wondering whether someone's native culture or first language might affect their aptitude towards various programming paradigms. I'm more curious about positive influences - e.g. some trait that make it easier/faster for someone to learn a particular paradigm, for example native language grammar being very recursion-oriented.
To be clear, I'm looking for how culture/language grammare may affect the difference between aptitude of the same person towards various paradigms as opposed to how it affects overall aptitude towards programming between different persons.
Important: the only answers I'm interested in are either references to scientific studies, or personal observations from someone intimately familiar with a particular culture/language, including from their own experience.
E.g. I'm not interested in your opinion of how Chinese being your first language affects anything unless you speak Chinese or worked with extremely large set of Chinese-native programmers extensively.
I'm OK with your guesstimates not based on scientific studies, but please be sure to supply your reasoning about plausible causes of your observation.
I'm not interested in culture-bashing (any such commends will be deleted or flagged for deletion).
I'm also not particularly interested in culture-building - we all know Linus is from Finland and Tetris was written in Russia and Larry Wall is an American. Any culture/nation can produce a brilliant mind in any discipline. I'm interested in averages.
Disclaimer: I was a Cultural Anthropologist before I got into programming, so you know I'm going to be on a high horse, here.
Obviously, a person's history will have an impact on their aptitude for any particular task, but I think this has less to do with the structure or grammar of a person's language than it does with the particular material conditions of the culture in which that language is spoken.
For example, a pair of Anthropologists in the 60's went to various African communities and tested people's susceptibility to various optical illusions. Here is a classic one:
In this illusion, the bottom line looks longer, because the angled lines connecting it make it appear to be off in the distance.
These Anthropologists found that in many African cultures, the illusion doesn't work at all - people consider the lines to be the same length. By refining their study, they found that the only people who were susceptible to the illusion were people who had grown up in an urban environment. They hypothesized that the illusion did not work on people from remote jungle environments, because these people had little or no experience with right angles and seeing things at very long distances.
My point with this is that even if you successfully found a correlation between programmers' native languages and their abilities with certain aspects of programming, you couldn't be sure that the correlation wasn't spurious. For example, you might think that Asians tend to be bad drivers, and you might even be able to demonstrate this statistically. If you then concluded, however, that "bad driving" is some sort of fundamental characteristic of Asian-ness, you would be ignoring the fact that Asians are more likely to be from Asia, and thus to have had much less experience driving cars (or even being in cars) while growing up than Westerners (and especially Americans) have had.
With programming, we might think that a particular language inhibits programming ability, and not take note of the fact that the society in which that language is spoken has much less access to computers, and thus people growing up with that language appear to have less programming aptitude or ability to understand certain programming concepts.
In short, I wouldn't give much credence to the idea that language inhibits anyone's ability to understand anything in particular. The human mind is much too flexible and adaptable for that to be true.
This seems analogous to the Sapir-Whorf Hypothesis - that the facilities of a language affect the ease which which one can cogitate about certain subjects, or in the words of the Wikipedia article:
"The linguistic relativity principle (also known as the Sapir-Whorf Hypothesis) is the idea that the varying cultural concepts and categories inherent in different languages affect the cognitive classification of the experienced world in such a way that speakers of different languages think and behave differently because of it."
( http://en.wikipedia.org/wiki/Linguistic_relativity )
While there appears to be little definitive information here, the discussions appear to be relevant to the question, and perhaps worthy of further exploration.
Just a few random thoughts. I think the influence is generally very weak and can most of the time be neglected but they do exist and sometimes they can make us feel them.
In Chinese grammar, for example, we don't quite distinguish between plural and singular forms, but I wouldn't think we Chinese have any noticeable difficulty understanding the concepts of scalar and array in Perl. The reason might be this: although we generally don't need particular suffixes or changes in form to indicate whether something is singular or plural, we do have the concepts of plural and singular and we mostly depend upon the context to tell them apart. Grammar-wise, the context in Chinese may possibly be way more important than that in those languages belonging to indo-european family. We omit a lot of things sometimes when they have already been mentioned and sometimes when we just presume that these things can be implicitly well understood by the listener. In either case, we don't need those indefinite and definite articles (a, an, the) or those relative pronouns like, that, which and who, to indicate whether they're being mentioned for the first time or yet another time again. Maybe that's partially why I feel very comfortable with Perl's default variable "$". print; chomp; split; all act upon $, which has never ever been mentioned. But this is quite subjective.
I think the Chinese language is more characterized by implicitness and fuzziness than Indo-european languages. For example, We never ever pay attention to subject verb agreement and we never ever do verbal conjugation to denote tenses. This could mean that the Chinese are inclined use a not quite so logical mode of thinking. One of my teachers onced used an example to try to generalize (or maybe over-generalize)the difference between Chinese non-logical mode of thinking and American logical mode of thinking.
If the American version of quarrelling should be this:
“I can lick you.”
“No, you can’t.”
“Yes, I can.”
“No, you can’t.”
“I can.”
“you can’t.”
“Can!”
“Can’t!”
The Chinese version (translated in English) would be something like this:
I can lick you.
How dare you!
What if I dare?
Then you try.
Try? Hm, you wait and see.
Wait and see? I’m not afraid.
Not afraid? OK. You don’t run away.
Who runs away? Come on and lick
Well, I agree that there may be some differences between Chinese way of thinking and that of other countries but the example looks like a stereotype because the Chinese may easily switch to the use of the American version. Back to the question, I think the language and culture may indeed influence a programmer's learning process in one way or another but this influence is defninitely not decidingly noticeable. Maybe because of the culture you're exposed to makes you feel a little bit uncomfortable to get used to some notions in some programming language, recursion or whatever, but time will solve it.
I was recently reading about a study that looked at how the grammar of someone's native language affected their speed of learning math. ... Important: the only answers I'm interested in are either references to scientific studies, or personal observations from someone intimately familiar with a particular culture/language, including from their own experience.
I learned a lot of maths before I started programming (enough to count as "intimately familiar"), and IMO programming is relatively easy: more tangible.
Sometimes I've wondered whether it's beneficial to know more than one human language: if you only know one language, then you might think of the words "cat" and "dog" as being values, i.e. synonymous with cat and dog objects; but if you're fluent in more than one language, then "cat" and "dog" become pointers: because for example the French words "chat" and "chien" are referring/pointing to the same objects as "cat" and "dog", and so clearly there's a distinction between the word and the object.
It's disappointing that you post the question without linking to the article which inspired it. I thought of "reverse polish notation" and wondered whether that was at all the kind of differences in "grammar" that were considered in the original study.
The reference you cite seems to rest on the assumption that making it easier helps with learning. In my understanding, there is a countereffect: without enough challange, you're not learning enough.
There are theories/studies (anyone with a link?) that development of language created crucial pressure on expanding the cerebral cortex and thus "made us human". (in very darwinistic terms: more grey matter ==> better language capabilities ==> better teamwork ==> better survival as a group). So language complexity can't be all bad for learning.
(My only qualification is being an eager follower of The Frontal Cortex blog, so take this with a grain of salt.)
In german we have a strange ordering of numbers: 10^0 and 10^1 positions are switched, but others are normal, (e.g. 25 is 'five and twenty', 125 is 'one hundred five and twenty'). It's been claimed that this makes learning numbers harder, and thus german should adopt a more intuitive ordering.
I guess that it helps a lot with doing additions in your head - at least if you stay below 100 or 200 - You can first add the 10^0 position and already say it / write it down while taking any carry into account for the 10^1 position.
(That doesn't continue for 10^2, I guess that would be done in writing by the majority anyway)
Also: abstractions. There are languages where numbers aren't abstracted from objects, "two coconuts" and "two sabretooth tigers" don't share a common "two" word / concept. Such a language would probably be very bad for developing math skills. Here the abstraction (separating number and object) in language is important.
Generally, I'd say the language has a strong effect on shaping a developing mind, and I see no reason why this should not extend to culture.
Of course it's still open what would be the "right kind of complexity" - for what, and how particular language features affect general improvement vs. establishment of an elite (i.e. "sharpening the skills of the gifted, while hampering the rest").
Interesting Question, no doubt - looking forward to other replies.