I am trying to make chat bot. I searched for some solutions and programs to help me.
Can someone tell me if Program-o uses natural language processing?
I have searched on google but i didn't find the answer.
Program-O is basically the engine that uses recursive pattern-matching on AIML to find a suitable response.
The answer given here explains in a bit more detail NLP in AIML
The pertinent paragraph being:
If by "natural language processing" you mean what is commonly called a "learning bot," the ALICE (AIML) bot does not meet the definition. The ALICE program (whose "brain" is the AIML scripting language) is a pattern-matching program. It searches a fairly large database - usually about 40,000 entries - for a phrase or term that matches one in the input, then selects a reply from the set designated by the closest match. It neither writes to its own files or generates spontaneous output. It doesn't "learn" by itself. Any changes or new information must be hard-coded into the AIML files by the botmaster.
Related
I am trying to localize a CPAN module MooX::Options using Locale::TextDomain after having read "On the state of I18N in perl".
In the discussion in the pull request the question came up how to deal with messages not originating in the module itself, but in a dependency. In this specific case, when you specify an option on the command line which is not defined anywhere in the code, you'll get the warning:
Unknown option: xyz
originating in the module Getopt::Long, which in itself is not localized yet.
The question is how to deal with these. I see basically three strategies:
Ignore them, which I find dissatisfactory.
Try to someway or other catch all the corner cases and messages in the module I'm currently localizing (in this case MooX::Options), and this way working around the missing localization in the dependent modules. This option seems brittle, as I'd have to constantly adapt to changes in the base modules. Sometimes, it might by next to impossible to catch messages, as they're written to output streams directly by the modules (as is the case in this example).
Try to localize the dependent modules themselves. This option seems hard to achieve, as different projects might use different I18N tools and strategies themselves and the dependency graph might be huge.
All in all, I think this problem is more general and not that specific to perl and cpan modules. So, I'm interessted in your thoughts, strategies and approaches.
I have rather strong opionions on the idea of translating computing terms, and most people disagree with my views, so take what I am saying with a grain of salt.
I do not understand the point of internationalizing a library for parsing command line options unless you want to further ghettoize what is already a small group of users of said library.
Would wget be more useful to Turkish users if instead it was called wal or wgetir? Or, instead of wget --mirror, should Turkish users write getir --ayna? What about that w?
If you just translate the messages, what is the point of outputting a help message in response to wget -h when the Turkish equivalent would be wget -y?
The fact is, almost all attempts at translating programming related terms I have seen are simply awful. The people who are most eager to translate are usually not in command of either human language — Nor do they seem to understand what they are translating.
However, as a result of these eager people, I find that at least the Turkish translations of pretty much any software I touch is just awful. Whatever Danish translations I have seen did not fare much better, but, at least, they were tolerable owing to the greater commonality of structure between Danish and English.
I think everyone's energy is better spent on actually making sure their programs handle content, including names of external resources/references, in different languages well, rather than giving me error messages in some Frankenstein language, or letting me specify command line options whose mnemonics do not match their descriptions etc, or presenting menus that contain of strings of words that really do not convey any meaning.
I have felt this way for the last for many decades now ... Even when I was patching IBM PC keyboard drivers with hex editors so people at various places could type reports in WordStar, and create charts in Harvard Graphics.
So, my unpopular advice is to put your energy elsewhere ...
For example, use exception objects so the user of your library (who is likely a programmer and will understand "Directory not found" much more readily than "Kütük bulunamadı") can deduce in a human-language independent way what happened, and what message to show the user. I haven't looked closely at MooX::Options, but I notice there is at least one string croak.
Here is an actual error message from an IBM product:
Belirtilen kütük örüntüsüyle eşleşen hiçbir kütük bulunamadı
You can ask every one of the almost 200 million Turkic people on earth what a "kütük örüntüsü" is, and only the person who actually came up with this non-sensical string of characters will be able to tell you that it corresponds to "file pattern". What, then, do they gain by using the phrase "kütük örüntüsü" versus "file pattern"? Nothing.
However, they lose the ability to communicate with, and, also, compete with, programmers in the English speaking world.
PS: Apologies for all Turkish examples, but I feel most comfortable drawing abominable examples based on my native language.
I've read quite a few articles giving a bit of background information on how Facebook implemented their Graph Search. All of which seem to just glance over the actual implementation details of the parser they are using.
Such as https://www.facebook.com/notes/facebook-engineering/under-the-hood-building-graph-search-beta/10151240856103920
From that page:
We combined various parsing techniques to build a substring parser:
suppose a user inputs, say, "friends New York" and that we have
defined a comprehensive set of all the potential page titles our
system can handle. Our parser could then generate exactly the Graph
Search titles that contain the user's input, including things like
"friends who live in New York" and "friends who have visited New
York." If we could find a way to appropriately rank those suggested
titles for the Graph Search typeahead, we would have a good start.
I'm really interested in learning about the methods one would use to tackle this problem. What Algorithm / Techniques would be used to write such a system ?
Any links would be much appreciated too.
I was thinking about implementing something similar.. wanted to ask Q here on SO and found that this is already asked..
Here is what I have been thinking to start with -
Assume facebook search engine "knows" about the underlying data store (a complex graph). So the search engine understands key words like "Friends", "Relative" and other such relationships and does not treat them like a trivial word in english language.
In such case, a good idea could be to parse the user input (using client side javascript) to a JSON and send it over to the search engine .. a couple of benefits .. the parsing can be done on client side, save network bandwidth by not sending unwanted data, server side handling for the parsed input as JSON is way better..etc
Lets call this JSON fbJSON.. because apart from being a JSON .. it adheres to a certain format.. You can create a spec for your format.. such that the JSON that is sent over to search engine necessarily contains some information.. this can make life a bit easier .. just like we have geoJSON etc..
Use an NLP program to parse the user input into fbJSON [I still have to think about this]
This is a broad approach upon which i m embarking upon.. the only bottleneck is point #4..because I do not have much experience with NLPs..
I have googled (well, DuckDuckGo'ed, actually) till I'm blue in the face, but cannot find a list of language codes of the type en-GB or fr-CA anywhere.
There are excellent resources about the components, in particular the W3C I18n page, but I was hoping for a simple alphabetical listing, fairly canonical if possible (something like this one). Cannot find.
Can anyone point me in the right direction? Many thanks!
There are several language code systems and several region code systems, as well as their combinations. As you refer to a W3C page, I presume that you are referring to the system defined in BCP 47. That system is orthogonal in the sense that codes like en-GB and fr-CA simply combine a language code and a region code. This means a very large number of possible combinations, most of which make little sense, like ab-AX, which means Abkhaz as spoken in Åland (I don’t think anyone, still less any community, speaks Abkhaz there, though it is theoretically possible of course).
So any list of language-region combinations would be just a pragmatic list of combinations that are important in some sense, or supported by some software in some special sense.
The specifications that you have found define the general principles and also the authoritative sources on different “subtags” (like primary language code and region code). For the most important parts, the official registration authority maintains the three- and two-letter ISO 639 codes for languages, and the ISO site contains the two-letter ISO 3166 codes for regions. The lists are quite readable, and I see no reason to consider using other than these primary resources, especially regarding possible changes.
There are 2 components in play here :
The language tag which is generally defined by ISO 639-1 alpha-2
The region tag which is generally defined by ISO 3166-1 alpha-2
You can mix and match languages and regions in whichever combination makes sense to you so there is no list of all possibilities.
BTW, you're effectively using a BCP47 tag, which defines the standards for each locale segment.
Unicode maintains such a list :
http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/index.html
Even better, you can have it in an XML format (ideal to parse the list) and with also the usual writing systems used by each language :
http://unicode.org/repos/cldr/trunk/common/supplemental/supplementalData.xml
(look in /LanguageData)
One solution would be to parse this list, it would give you all of the keys needed to create the list you are looking for.
http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
I think you can take it from here http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html
This can be found at Unicode's Common Locale Data Repository. Specifically, a JSON file of this information is available in their cldr-json repo
We have a working list that we work off of for language code/language name referencing for Localizejs. Hope that helps
List of Language Codes in YAML or JSON?
List of primary language subtags, with common region subtags for each language (based on population of language speakers in each region):
https://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html
For example, for English:
en-US (320,000,000)
en-IN (250,000,000)
en-NG (110,000,000)
en-PK (100,000,000)
en-PH (68,000,000)
en-GB (64,000,000)
(Jukka K. Korpela and tigrish give good explanations for why any combination of language + region code is valid, but it might be helpful to have a list of codes most likely to be in actual use. s-f's link has such useful information sorted by region, so it might also be helpful to have this information sorted by language.)
Let's say I have a title string, written in different languages.
Is there way to check which language is each string?
I have not played with it but you should look at NSLinguisticTagger and its - (NSOrthography *)orthographyAtIndex:(NSUInteger)charIndex effectiveRange:(NSRangePointer)effectiveRange method. From the NSOrthography docs:
The NSOrthography class describes the linguistic content of a piece of
text, typically used for the purposes of spelling and grammar
checking.
An NSOrthography instance describes:
Which scripts the text contains. A dominant language and possibly
other languages for each of these scripts. A dominant script and
language for the text as a whole. Scripts are uniformly described by
standard four-letter tags (Latn, Grek, Cyrl, etc.) with the supertags
Jpan and Kore typically used for Japanese and Korean text, Hans and
Hant for Chinese text; the tag Zyyy is used if a specific script
cannot be identified. See Internationalization Programming Topics for
more information on internationalization.
Languages are uniformly described by BCP-47 tags , preferably in
canonical form; the tag und is used if a specific language cannot be
determined.
You can simply use the Google Transalate REST API to find the language.
And you can use something like RestKit to make the REST requests to the google servers.
You could use N-gram sampling frequencies techniques. I am not an expert, but they are rumored to work well in practice.
See netspeak and papers like this etc etc.
There's Google translation API available that allows language conversation. I am sure there must be something you will find that returns matched language for your string. See Google Translate APIs for details.
I'm curious if anyone understands, knows or can point me to comprehensive literature or source code on how Google created their popular passage blocks feature. However, if you know of any other application that can do the same please post your answer too.
If you do not know what I am writing about here is a link to an example of Popular Passages. When you look at the overview of the book Modelling the legal decision process for information technology applications ... By Georgios N. Yannopoulos you can see something like:
Popular passages
... direction, indeterminate. We have
not settled, because we have not
anticipated, the question which will
be raised by the unenvisaged case when
it occurs; whether some degree of
peace in the park is to be sacrificed
to, or defended against, those
children whose pleasure or interest it
is to use these things. When the
unenvisaged case does arise, we
confront the issues at stake and can
then settle the question by choosing
between the competing interests in the
way which best satisfies us. In
doing... Page 86
Appears in 15 books from 1968-2003
This would be a world fit for
"mechanical" jurisprudence. Plainly
this world is not our world; human
legislators can have no such knowledge
of all the possible combinations of
circumstances which the future may
bring. This inability to anticipate
brings with it a relative
indeterminacy of aim. When we are bold
enough to frame some general rule of
conduct (eg, a rule that no vehicle
may be taken into the park), the
language used in this context fixes
necessary conditions which anything
must satisfy... Page 86
Appears in 8 books from 1968-2000
more
It must be an intensive pattern matching process. I can only think of n-gram models, text corpus, automatic plagisrism detection. But, sometimes n-grams are probabilistic models for predicting the next item in a sequence and text corpus (to my knowledge) are manually created. And, in this particular case, popular passages, there can be a great deal of words.
I am really lost. If I wanted to create such a feature, how or where should I start? Also, include in your response what programming languages are best suited for this stuff: F# or any other functional lang, PERL, Python, Java... (I am becoming a F# fan myself)
PS: can someone include the tag automatic-plagiarism-detection, because i can't
Read this ACM paper by Kolak and Schilit, the Google researchers who developed Popular Passages. There are also a few relevant slides from this MapReduce course taught by Baldridge and Lease at The University of Texas at Austin.
In the small sample I looked over, it looks like all the passages picked were inline or block quotes. Just a guess, but perhaps Google Books looks for quote marks/differences in formatting and a citation, then uses a parsed version of the bibliography to associate the quote with the source. Hooray for style manuals.
This approach is obviously of no help to detect plagiarism, and is of little help if the corpus isn't in a format that preserves text formatting.
If you know which books are citing or referencing other books you don't need to look at all possible books only the books that are citing each other. If is is scientific reference often line and page numbers are included with the quote or can be found in the bibliography at the end of the book, so maybe google parses only this informations?
Google scholar certainly has the information about citing from paper to paper maybe from book to book too.