How was the Google Books' Popular passages feature developed? - text-processing

I'm curious if anyone understands, knows or can point me to comprehensive literature or source code on how Google created their popular passage blocks feature. However, if you know of any other application that can do the same please post your answer too.
If you do not know what I am writing about here is a link to an example of Popular Passages. When you look at the overview of the book Modelling the legal decision process for information technology applications ... By Georgios N. Yannopoulos you can see something like:
Popular passages
... direction, indeterminate. We have
not settled, because we have not
anticipated, the question which will
be raised by the unenvisaged case when
it occurs; whether some degree of
peace in the park is to be sacrificed
to, or defended against, those
children whose pleasure or interest it
is to use these things. When the
unenvisaged case does arise, we
confront the issues at stake and can
then settle the question by choosing
between the competing interests in the
way which best satisfies us. In
doing...‎ Page 86
Appears in 15 books from 1968-2003
This would be a world fit for
"mechanical" jurisprudence. Plainly
this world is not our world; human
legislators can have no such knowledge
of all the possible combinations of
circumstances which the future may
bring. This inability to anticipate
brings with it a relative
indeterminacy of aim. When we are bold
enough to frame some general rule of
conduct (eg, a rule that no vehicle
may be taken into the park), the
language used in this context fixes
necessary conditions which anything
must satisfy...‎ Page 86
Appears in 8 books from 1968-2000
more
It must be an intensive pattern matching process. I can only think of n-gram models, text corpus, automatic plagisrism detection. But, sometimes n-grams are probabilistic models for predicting the next item in a sequence and text corpus (to my knowledge) are manually created. And, in this particular case, popular passages, there can be a great deal of words.
I am really lost. If I wanted to create such a feature, how or where should I start? Also, include in your response what programming languages are best suited for this stuff: F# or any other functional lang, PERL, Python, Java... (I am becoming a F# fan myself)
PS: can someone include the tag automatic-plagiarism-detection, because i can't

Read this ACM paper by Kolak and Schilit, the Google researchers who developed Popular Passages. There are also a few relevant slides from this MapReduce course taught by Baldridge and Lease at The University of Texas at Austin.

In the small sample I looked over, it looks like all the passages picked were inline or block quotes. Just a guess, but perhaps Google Books looks for quote marks/differences in formatting and a citation, then uses a parsed version of the bibliography to associate the quote with the source. Hooray for style manuals.
This approach is obviously of no help to detect plagiarism, and is of little help if the corpus isn't in a format that preserves text formatting.

If you know which books are citing or referencing other books you don't need to look at all possible books only the books that are citing each other. If is is scientific reference often line and page numbers are included with the quote or can be found in the bibliography at the end of the book, so maybe google parses only this informations?
Google scholar certainly has the information about citing from paper to paper maybe from book to book too.

Related

Does Program-o uses NLP ?

I am trying to make chat bot. I searched for some solutions and programs to help me.
Can someone tell me if Program-o uses natural language processing?
I have searched on google but i didn't find the answer.
Program-O is basically the engine that uses recursive pattern-matching on AIML to find a suitable response.
The answer given here explains in a bit more detail NLP in AIML
The pertinent paragraph being:
If by "natural language processing" you mean what is commonly called a "learning bot," the ALICE (AIML) bot does not meet the definition. The ALICE program (whose "brain" is the AIML scripting language) is a pattern-matching program. It searches a fairly large database - usually about 40,000 entries - for a phrase or term that matches one in the input, then selects a reply from the set designated by the closest match. It neither writes to its own files or generates spontaneous output. It doesn't "learn" by itself. Any changes or new information must be hard-coded into the AIML files by the botmaster.

How to localize CPAN module and dependencies

I am trying to localize a CPAN module MooX::Options using Locale::TextDomain after having read "On the state of I18N in perl".
In the discussion in the pull request the question came up how to deal with messages not originating in the module itself, but in a dependency. In this specific case, when you specify an option on the command line which is not defined anywhere in the code, you'll get the warning:
Unknown option: xyz
originating in the module Getopt::Long, which in itself is not localized yet.
The question is how to deal with these. I see basically three strategies:
Ignore them, which I find dissatisfactory.
Try to someway or other catch all the corner cases and messages in the module I'm currently localizing (in this case MooX::Options), and this way working around the missing localization in the dependent modules. This option seems brittle, as I'd have to constantly adapt to changes in the base modules. Sometimes, it might by next to impossible to catch messages, as they're written to output streams directly by the modules (as is the case in this example).
Try to localize the dependent modules themselves. This option seems hard to achieve, as different projects might use different I18N tools and strategies themselves and the dependency graph might be huge.
All in all, I think this problem is more general and not that specific to perl and cpan modules. So, I'm interessted in your thoughts, strategies and approaches.
I have rather strong opionions on the idea of translating computing terms, and most people disagree with my views, so take what I am saying with a grain of salt.
I do not understand the point of internationalizing a library for parsing command line options unless you want to further ghettoize what is already a small group of users of said library.
Would wget be more useful to Turkish users if instead it was called wal or wgetir? Or, instead of wget --mirror, should Turkish users write getir --ayna? What about that w?
If you just translate the messages, what is the point of outputting a help message in response to wget -h when the Turkish equivalent would be wget -y?
The fact is, almost all attempts at translating programming related terms I have seen are simply awful. The people who are most eager to translate are usually not in command of either human language — Nor do they seem to understand what they are translating.
However, as a result of these eager people, I find that at least the Turkish translations of pretty much any software I touch is just awful. Whatever Danish translations I have seen did not fare much better, but, at least, they were tolerable owing to the greater commonality of structure between Danish and English.
I think everyone's energy is better spent on actually making sure their programs handle content, including names of external resources/references, in different languages well, rather than giving me error messages in some Frankenstein language, or letting me specify command line options whose mnemonics do not match their descriptions etc, or presenting menus that contain of strings of words that really do not convey any meaning.
I have felt this way for the last for many decades now ... Even when I was patching IBM PC keyboard drivers with hex editors so people at various places could type reports in WordStar, and create charts in Harvard Graphics.
So, my unpopular advice is to put your energy elsewhere ...
For example, use exception objects so the user of your library (who is likely a programmer and will understand "Directory not found" much more readily than "Kütük bulunamadı") can deduce in a human-language independent way what happened, and what message to show the user. I haven't looked closely at MooX::Options, but I notice there is at least one string croak.
Here is an actual error message from an IBM product:
Belirtilen kütük örüntüsüyle eşleşen hiçbir kütük bulunamadı
You can ask every one of the almost 200 million Turkic people on earth what a "kütük örüntüsü" is, and only the person who actually came up with this non-sensical string of characters will be able to tell you that it corresponds to "file pattern". What, then, do they gain by using the phrase "kütük örüntüsü" versus "file pattern"? Nothing.
However, they lose the ability to communicate with, and, also, compete with, programmers in the English speaking world.
PS: Apologies for all Turkish examples, but I feel most comfortable drawing abominable examples based on my native language.

Language requirements for AI development [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Why is Lisp used for AI?
What makes a language suitable for Artificial Intelligence development?
I've heard that LISP and Prolog are widely used in this field. What features make them suitable for AI?
Overall I would say the main thing I see about languages "preferred" for AI is that they have high order programming along with many tools for abstraction.
It is high order programming (aka functions as first class objects) that tends to be a defining characteristic of most AI languages http://en.wikipedia.org/wiki/Higher-order_programming that I can see. That article is a stub and it leaves out Prolog http://en.wikipedia.org/wiki/Prolog which allows high order "predicates".
But basically high order programming is the idea that you can pass a function around like a variable. Surprisingly a lot of the scripting languages have functions as first class objects as well. LISP/Prolog are a given as AI languages. But some of the others might be surprising. I have seen several AI books for Python. One of them is http://www.nltk.org/book. Also I have seen some for Ruby and Perl. If you study more about LISP you will recognize a lot of its features are similar to modern scripting languages. However LISP came out in 1958...so it really was ahead of its time.
There are AI libraries for Java. And in Java you can sort of hack functions as first class objects using methods on classes, it is harder/less convenient than LISP but possible. In C and C++ you have function pointers, although again they are much more of a bother than LISP.
Once you have functions as first class objects, you can program much more generically than is otherwise possible. Without functions as first class objects, you might have to construct sum(array), product(array) to perform the different operations. But with functions as first class objects you could compute accumulate(array, +) and accumulate(array, *). You could even do accumulate(array, getDataElement, operation). Since AI is so ill defined that type of flexibility is a great help. Now you can build much more generic code that is much easier to extend in ways that were not originally even conceived.
And Lambda (now finding its way all over the place) becomes a way to save typing so that you don't have to define every function. In the previous example, instead of having to make getDataElement(arrayelement) { return arrayelement.GPA } somewhere you can just say accumulate(array, lambda element: return element.GPA, +). So you don't have to pollute your namespace with tons of functions to only be called once or twice.
If you go back in time to 1958, basically your choices were LISP, Fortran, or Assembly. Compared to Fortran LISP was much more flexible (unfortunately also less efficient) and offered much better means of abstraction. In addition to functions as first class objects, it also had dynamic typing, garbage collection, etc. (stuff any scripting language has today). Now there are more choices to use as a language, although LISP benefited from being first and becoming the language that everyone happened to use for AI. Now look at Ruby/Python/Perl/JavaScript/Java/C#/and even the latest proposed standard for C you start to see features from LISP sneaking in (map/reduce, lambdas, garbage collection, etc.). LISP was way ahead of its time in the 1950's.
Even now LISP still maintains a few aces in the hole over most of the competition. The macro systems in LISP are really advanced. In C you can go and extend the language with library calls or simple macros (basically a text substitution). In LISP you can define new language elements (think your own if statement, now think your own custom language for defining GUIs). Overall LISP languages still offer ways of abstraction that the mainstream languages still haven't caught up with. Sure you can define your own custom compiler for C and add all the language constructs you want, but no one does that really. In LISP the programmer can do that easily via Macros. Also LISP is compiled and per the programming language shootout, it is more efficient than Perl, Python, and Ruby in general.
Prolog basically is a logic language made for representing facts and rules. What are expert systems but collections of rules and facts. Since it is very convenient to represent a bunch of rules in Prolog, there is an obvious synergy there with expert systems.
Now I think using LISP/Prolog for every AI problem is not a given. In fact just look at the multitude of Machine Learning/Data Mining libraries available for Java. However when you are prototyping a new system or are experimenting because you don't know what you are doing, it is way easier to do it with a scripting language than a statically typed one. LISP was the earliest languages to have all these features we take for granted. Basically there was no competition at all at first.
Also in general academia seems to like functional languages a lot. So it doesn't hurt that LISP is functional. Although now you have ML, Haskell, OCaml, etc. on that front as well (some of these languages support multiple paradigms...).
The main calling card of both Lisp and Prolog in this particular field is that they support metaprogramming concepts like lambdas. The reason that is important is that it helps when you want to roll your own programming language within a programming language, like you will commonly want to do for writing expert system rules.
To do this well in a lower-level imperative language like C, it is generally best to just create a separate compiler or language library for your new (expert system rule) language, so you can write your rules in the new language and your actions in C. This is the principle behind things like CLIPS.
The two main things you want are the ability to do experimental programming and the ability to do unconventional programming.
When you're doing AI, you by definition don't really know what you're doing. (If you did, it wouldn't be AI, would it?) This means you want a language where you can quickly try things and change them. I haven't found any language I like better than Common Lisp for that, personally.
Similarly, you're doing something not quite conventional. Prolog is already an unconventional language, and Lisp has macros that can transform the language tremendously.
What do you mean by "AI"? The field is so broad as to make this question unanswerable. What applications are you looking at?
LISP was used because it was better than FORTRAN. Prolog was used, too, but no one remembers that. This was when people believed that symbol-based approaches were the way to go, before it was understood how hard the sensing and expression layers are.
But modern "AI" (machine vision, planners, hell, Google's uncanny ability to know what you 'meant') is done in more efficient programming languages that are more sustainable for a large team to develop in. This usually means C++ these days--but it's not like anyone thinks of C++ as a good language for AI.
Hell, you can do a lot of what was called "AI" in the 70s in MATLAB. No one's ever called MATLAB "a good language for AI" before, have they?
Functional programming languages are easier to parallelise due to their stateless nature. There seems to already be a subject about it with some good answers here: Advantages of stateless programming?
As said, its also generally simpler to build programs that generate programs in LISP due to the simplicity of the language, but this is only relevant to certain areas of AI such as evolutionary computation.
Edit:
Ok, I'll try and explain a bit about why parallelism is important to AI using Symbolic AI as an example, as its probably the area of AI that I understand best. Basically its what everyone was using back in the day when LISP was invented, and the Physical Symbol Hypothesis on which it is based is more or less the same way you would go about calculating and modelling stuff in LISP code. This link explains a bit about it:
http://www.cs.st-andrews.ac.uk/~mkw/IC_Group/What_is_Symbolic_AI_full.html
So basically the idea is that you create a model of your environment, then searching through it to find a solution. One of the simplest to algorithms to implement is a breadth first search, which is an exhaustive search of all possible states. While producing an optimal result, it is usually prohibitively time consuming. One way to optimise this is by using a heuristic (A* being an example), another is to divide the work between CPUs.
Due to statelessness, in theory, any node you expand in your search could be ran in a separate thread without the complexity or overhead involved in locking shared data. In general, assuming the hardware can support it, then the more highly you can parallelise a task the faster you will get your result. An example of this could be the folding#home project, which distributes work over many GPUs to find optimal protein folding configurations (that may not have anything to do with LISP, but is relevant to parallelism).
As far as I know from LISP is that is a Functional Programming Language, and with it you are able to make "programs that make programs. I don't know if my answer suits your needs, see above links for more information.
Pattern matching constructs with instantiation (or the ability to easily construct pattern matching code) are a big plus. Pattern matching is not totally necessary to do A.I., but it can sure simplify the code for many A.I. tasks. I'm finding this also makes F# a convenient language for A.I.
Languages per se (without libraries) are suitable/comfortable for specific areas of research/investigation and/or learning/studying ("how to do the simplest things in the hardest way").
Suitability for commercial development is determined by availability of frameworks, libraries, development tools, communities of developers, adoption by companies. For ex., in internet you shall find support for any, even the most exotic issue/areas (including, of course, AI areas), for ex., in C# because it is mainstream.
BTW, what specifically is context of question? AI is so broad term.
Update:
Oooops, I really did not expect to draw attention and discussion to my answer.
Under ("how to do the simplest things in the hardest way"), I mean that studying and learning, as well as academic R&D objectives/techniques/approaches/methodology do not coincide with objectives of (commercial) development.
In student (or even academic) projects one can write tons of code which would probably require one line of code in commercial RAD (using of component/service/feature of framework or library).
Because..! oooh!
Because, there is no sense to entangle/develop any discussion without first agreeing on common definitions of terms... which are subjective and depend on context... and are not so easy to be formulate in general/abstract context.
And this is inter-disciplinary matter of whole areas of different sciences
The question is broad (philosophical) and evasively formulated... without beginning and end... having no definitive answers without of context and definitions...
Are we going to develop here some spec proposal?

About Dijkstra's paper

I am reading Coders at Work.
I came across this paragraph in Donald Knuth's interview.
Seibel: It seems a lot of the people I’ve talked to had direct access to a machine when they were starting out. Yet Dijkstra has a paper I’m sure you’re familiar with, where he basically says we shouldn’t let computer-science students touch a machine for the first few years of their training; they should spend all their time manipulating symbols.
I want link to that paper. Which one paper is that? (He wrote too many :-)
Maybe this one?
Excerpt, from near the end:
Before we part, I would like to invite you to consider the following way of doing justice to computing's radical novelty in an introductory programming course.
On the one hand, we teach what looks like the predicate calculus, but we do it very differently from the philosophers. In order to train the novice programmer in the manipulation of uninterpreted formulae, we teach it more as boolean algebra, familiarizing the student with all algebraic properties of the logical connectives. To further sever the links to intuition, we rename the values {true, false} of the boolean domain as {black, white}.
On the other hand, we teach a simple, clean, imperative programming language, with a skip and a multiple assignment as basic statements, with a block structure for local variables, the semicolon as operator for statement composition, a nice alternative construct, a nice repetition and, if so desired, a procedure call. To this we add a minimum of data types, say booleans, integers, characters and strings. The essential thing is that, for whatever we introduce, the corresponding semantics is defined by the proof rules that go with it.
Right from the beginning, and all through the course, we stress that the programmer's task is not just to write down a program, but that his main task is to give a formal proof that the program he proposes meets the equally formal functional specification. While designing proofs and programs hand in hand, the student gets ample opportunity to perfect his manipulative agility with the predicate calculus. Finally, in order to drive home the message that this introductory programming course is primarily a course in formal mathematics, we see to it that the programming language in question has not been implemented on campus so that students are protected from the temptation to test their programs. And this concludes the sketch of my proposal for an introductory programming course for freshmen.
I found a manuscript of Dijkstra's "Cruelty" lecture.

Suggested content for a lunch-time "Introduction to Scala" talk

I'm going to be giving a short (30-40 mins) lunch-time talk on Scala to technical staff at my company. I'd like some suggestions for what would be the most appropriate content. Most people attending will have experience in Java and/or C# (plus various other languages).
What are the key things to cover? I'd like to give a brief introduction to the Scala syntax so that people don't feel lost when looking at code examples. I'll also cover some of the history behind the language and its designers. What would help people get the most out of the talk?
People are almost certainly coming to talk to get an answer to the question, "Why should I use Scala?" Anything you can provide to help them answer that will be valuable.
Keep the discussion of the history and the personalities behind Scala to a minimum.
A whirlwind tour of the syntax is useful, but keep it short.
Spend a good chunk of the talk demonstrating examples and comparisons to Java. Show cases where Scala shines. You should literally be running and executing code so that people get a real, hands-on feel for how things work.
Make sure to cover weaknesses, too! Provide an objective and balanced overview.
I gave a similar talk - mostly to those with a Java background. I felt that taking a piece of real Java (about 30 lines) and iteratively adding scala features worked pretty well. The 30 lines of Java eventually ended up as 6 (six!) of scala. The point being (of course) that 6 lines are more readable and maintainable than 30.
I converted the scala to line-by-line Java equivalent and then introduced:
Type inference
Option
Closures
Pattern-matching (on lists)
Type aliases
Tail recursion
I found that this segment took quite a long time because the audience were very interested in the minutiae of scala's syntax (especially around function-expressions). Before undertaking the pattern-matching bit, I had a slide explaining the various things you could use in a match.
Tough. One has to balance the new and the familiar. For instance:
Talk about traits, how they differ from interfaces and multiple inheritance. Note that most methods in all of Scala collections can actually be found on the trait Traversable, which has a single abstract method: foreach.
Speak of functions and partial functions, show map/filter/foreach, and how they make use of functions.
Talk about pattern matching -- show how unapply is used to enable representation independence, while at the same time case classes make the common case easy.
Above all AVOID any topic that might be difficult to understand quickly, or you may waste time on them. For example of great topics I wouldn't talk about: self types, variance, for-comprehensions.
Pick more topics than you have time for. Let the public steer the conversation towards the topcis they are more interested in. If anyone starts to boggle down a topic too much, say you'll be pleased to explain it in more details later, and ask if they would mind if you moved to another topic. On the other hand, if everyone seems to be picking up on one thing in particular, stay with it. Otherwise, it might feel like you want to hide something.
I gave a presentation on re-writing Java classes in Scala. It has lots of examples of Java -> Scala and (hopefully) makes the gains obvious. Feel free to borrow any content you want... presentation took 1hr 10minutes so you might want to cut some stuff out.
Presentation: http://www.colinhowe.co.uk/downloads/rewriting-java-in-scala.ppt
Video: http://skillsmatter.com/podcast/java-jee/re-writing-java-classes-in-scala-and-making-your-code-lovely
You could do worse than running through Jonas Bonér's presentation, Pragmatic Real-World Scala. Perhaps skip some advanced topics in there on different applications of traits and self-type annotations.