Which type of NLP method should I choose? (Facebook)

So I'm going to build a prototype of a Social Web application that:
- incorporates users' Facebook data (working hours, home and office locations)
to create a web app where friends, and friends of friends, who have similar routes can drive/bike with each other.
However, for this app to be useful it needs to extract key information, e.g. someone's working hours, or the fact that they have to work late and post this on Facebook. I'm now reading about a lot of methods but I don't know which one to choose:
- Sentiment analysis
- Lexical analysis
- Syntactic parsing
Thanks in advance.

Ultimately what you want is a human-like intelligence that can read between the lines of all the posts to extract information. So, in general terms, you face the same Too Hard (currently) problem that everyone else in every branch of NLP faces. I point that out because it means your question becomes: which imperfect approximation should I use?
Personally, I'd start with a simple text matcher. Look for strings like "Starting work today at 9". Gather your core list of sentences.
Then you realize there are variations due to rephrasing: "Start work today at 9", "Starting today at 9", "9 is my start time today", etc. Bring in a sentence analyzer at this point; instead of a string of ASCII codes, the sentence turns into a string of nouns, adjectives and verbs.
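As a rough illustration of that step, here is a minimal sketch using NLTK's tokenizer and part-of-speech tagger (assuming the punkt and averaged_perceptron_tagger data packages are installed):

# Minimal sketch: map rephrasings onto the same part-of-speech pattern.
import nltk

sentences = [
    "Starting work today at 9",
    "Start work today at 9",
    "9 is my start time today",
]

for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)  # [(word, POS tag), ...]
    print(tagged)

# The first sentence comes out roughly as:
#   [('Starting', 'VBG'), ('work', 'NN'), ('today', 'NN'), ('at', 'IN'), ('9', 'CD')]
# Matching on (verb "start", noun "work", a cardinal number) now covers
# variants that a literal string match would miss.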
You also have synonyms: "Starting my job today at 9", "Starting down the office today at 9", "Starting work today an hour later than normal". WordNet (and semantic networks generally) can help a bit. The last example there, though, not only requires parsing a fairly complicated clause, but also knowing their usual start time is 8. (Oh, on all the above you needed to know if they meant 9am or 9pm...)
By this point you realize you are gathering lots of fuzzy data. That is when you bring in some machine learning, to have it discover for you that one combination of the verb "start", the noun "work", the time-noun "today" and the number "9" is useful to you, and another isn't (e.g. "At work today learnt that new drama starts at 9. Forgot to set recorder. Aaarrggh!").
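To make that machine-learning stage concrete, here is a minimal sketch using a bag-of-words classifier from scikit-learn. The training posts and labels are invented for illustration; a real system would need many more labelled examples.

# Sketch of the ML stage: classify posts by whether they contain a usable
# start time. Training data below is invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

posts = [
    "Starting work today at 9",
    "Start work today at 9",
    "9 is my start time today",
    "At work today learnt that new drama starts at 9",
    "Forgot to set the recorder for the 9 o'clock show",
]
labels = [1, 1, 1, 0, 0]  # 1 = tells us a start time, 0 = does not

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(posts, labels)

print(model.predict(["Starting an hour later today"]))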

I think what you are looking for is a customized Named Entity Recognizer. NLTK could be a good starting point. However, the default NE chunker in NLTK is a maximum-entropy chunker trained on the ACE corpus, and it has not been trained to recognize dates and times. So you need to train your own classifier if you want to do that.
The link below gives a neat and detailed explanation of how to do this.
http://mattshomepage.com/articles/2016/May/23/nltk_nec/
Also, there is a module called timex in nltk_contrib which might help you with your needs.
https://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/timex.py
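I haven't used timex myself, but it is essentially regex-based tagging of temporal expressions. A minimal sketch in the same spirit (the patterns below are illustrative, not the module's actual rules):

# Tag temporal expressions with regexes before further processing.
import re

TIME = re.compile(r"\b(\d{1,2}(?::\d{2})?(?:\s*[ap]m)?)\b", re.IGNORECASE)
RELATIVE = re.compile(r"\b(today|tomorrow|tonight)\b", re.IGNORECASE)

def tag_times(text):
    text = TIME.sub(r"<TIMEX>\1</TIMEX>", text)
    return RELATIVE.sub(r"<TIMEX>\1</TIMEX>", text)

print(tag_times("Starting work today at 9"))
# Starting work <TIMEX>today</TIMEX> at <TIMEX>9</TIMEX>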
Cheers!

Related

org-mode - Using tasks to automatically generate a weekly dinner menu

I have lots of trouble getting my meals done efficiently and I came up with an idea to try to make it better with Emacs Org-Mode.
I would like to have a task every Friday that repeated itself with .+1w (every seven days after you finish the task) that built my shopping list and looked like the following:
* TODO Shopping list
  SCHEDULED: <2020-01-03 Fri .+1w>
  12 eggs (Spanish omelet)
  1 olive oil (Spanish omelet)
  Pasta (Spaghetti bolognese)
  ...
Of course, we need to have some file called Recipes.org that contains the list of ingredients for our daily meals and looks like the following:
* Spaghetti bolognese
** Pasta
** Tomato sauce
* Spanish omelet
** Eggs
...
The content of the shopping list must be generated automatically every week by some script that randomly picks seven recipes for that week's dinners and concatenates all the needed ingredients into the shopping list.
Is this already implemented in org-mode? If not, does anyone know how it might be implemented?
Thank you very much
While there is no specific 'random-recipe shopping-list generator' built in (that I know of), org-mode provides many different ways to organise this kind of information.
In general implementing your idea with a bit of elisp shouldn't be too difficult - once you have lists of ingredients for each recipe, you could sort and dedupe before turning them into checkbox lists.
You'd then write something up for the random weekly selections. In the end most of the work will be in compiling the recipes and ingredients into org format.
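For instance, here is a rough sketch of such a generator in Python rather than elisp (the recipe data is invented); it picks recipes at random and prints an org-mode checkbox shopping list:

# Pick seven recipes at random and emit an org-mode shopping list.
import random
from collections import Counter

recipes = {
    "Spaghetti bolognese": ["Pasta", "Tomato sauce"],
    "Spanish omelet": ["12 eggs", "Olive oil"],
    "Lentil soup": ["Lentils", "Carrots", "Olive oil"],
    # ... one entry per recipe in Recipes.org
}

week = random.sample(sorted(recipes), k=min(7, len(recipes)))

needed = Counter()
for recipe in week:
    needed.update(recipes[recipe])  # count each ingredient across the week

print("* TODO Shopping list")
for ingredient, count in sorted(needed.items()):
    print(f"- [ ] {ingredient} (x{count})")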
A quick search shows there is at least one existing package for creating shopping lists from recipes: shopping-lisp.el. Again, you'd need to translate a bunch of recipes into s-expressions and add the random choice element in elisp.
Here's a fun post from Sacha Chua on using org-mode and org-tables for tracking recipes and ingredients. Her blog is a great resource for emacs ideas.
Finally, let me just say that I love emacs, and use org-mode daily for checklists, calendars and generally organising my life.
Sometimes though, despite the temptation to do everything in org-mode, we have to concede that there may be better, purpose built tools out there for such things that might save us a lot of time and effort... for example the myriad of cooking websites and apps that come ready with recipes, shopping list functions, standardised measures, etc.
Then again, sometimes it's just more fun to build your own.

Assigning weights to intents

I'm just getting started with Watson Conversation and I ran into the following issue.
I have a general intent for #greetings (hello, hi, greetings....) and another intent for #general-issues (I have a problem, I have an issue, stuff's not working, etc.....). If the user says: "hello, I have a problem with your product.", it matches the #greetings intent and replies back, but in this case I want to match the #general-issue intent.
As per: https://console.bluemix.net/docs/services/conversation/dialog-overview.html#dialog-overview
I would expect nodes at the top of the list to be matched first, so I placed the #greetings node at the bottom of the dialogue tree to give "higher weight" nodes a chance to match first, but it doesn't seem to work every time.
Is duplicating the #greeting intents in #general-issue the only solution here?
So, trying to help based on my experience: you can use intents[0].confidence in your favor.
For example, in my workspace I created one node condition with:
intents[0].confidence > 0.75
With this, Watson recognizes the intent only if the user types something very similar to the trained examples for the #greetings intent. In my tests this works very well.
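If you prefer to make that decision in your application rather than in a node condition, the same check can be applied to the JSON returned by the /message endpoint (a minimal sketch; the response shape follows the Conversation documentation):

# Apply the confidence threshold in application code.
def best_intent(response, threshold=0.75):
    """Return the top intent name, or None if Watson is not confident enough."""
    intents = response.get("intents", [])
    if intents and intents[0]["confidence"] > threshold:
        return intents[0]["intent"]
    return None

sample_response = {
    "intents": [
        {"intent": "greetings", "confidence": 0.63},
        {"intent": "general-issues", "confidence": 0.31},
    ]
}

print(best_intent(sample_response))  # None: 0.63 is below the 0.75 threshold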
See more about building a Complex dialog using Watson Conversation.
See more about Confidence in Conversation here.
So here are two other approaches you can take.
Treat off-topic as contamination.
When building a conversational system, it's important to know what your end users are actually saying, so collect questions from real users.
You will find that not many people say a greeting and a question together. I haven't computed the statistics across the projects I've done, but anecdotally I have not seen it happen often.
Knowing this, you can try removing off-topic/chit-chat from your intents, as it does not fully reflect the domains you want to train on.
To counter this, you can create a more detailed second workspace with off topic/chit-chat. If you do not get a good hit on the primary workspace, you can call out to the second one. You can improve this by adding chit-chat to counter examples in the primary workspace.
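A rough sketch of that two-workspace fallback at the application layer; send_message is a hypothetical helper wrapping whichever SDK call you use to post to a workspace's /message endpoint and return the parsed JSON:

# Two-workspace routing: domain questions first, chit-chat as a fallback.
CONFIDENCE_FLOOR = 0.5  # illustrative threshold; tune it on real traffic

def route(text, send_message, primary_ws, chitchat_ws):
    response = send_message(primary_ws, text)
    intents = response.get("intents", [])
    if intents and intents[0]["confidence"] >= CONFIDENCE_FLOOR:
        return "primary", response
    # No good hit on the domain workspace, so try the chit-chat one.
    return "chitchat", send_message(chitchat_ws, text)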
You can also mitigate this by simply wording your initial response to the user. For example, if your initial response is a hello, have the system also ask a question. Or have it progress the conversation where a hello becomes redundant.
Detect possible compounded intents.
At the moment, this is only easily possible at the application layer.
Setting alternate_intents to true will return the top 10 intents and their confidences.
Before going further: if the top intent's confidence is < 0.2, it needs more training (so there is no need to proceed).
If it is > 0.2, you can map the confidences on a graph and visually see whether the top two intents stand out.
To have your application see this, you can use the k-means algorithm to create two buckets (k=2): relevant and irrelevant.
Once you see more than one relevant intent, you can take action to ignore the chit-chat/off-topic part.
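As a sketch of that bucketing step (the confidence values are invented for illustration):

# Split the returned intent confidences into two buckets with k-means (k=2).
import numpy as np
from sklearn.cluster import KMeans

confidences = np.array([0.46, 0.42, 0.07, 0.05, 0.04, 0.03, 0.02, 0.02, 0.01, 0.01])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(confidences.reshape(-1, 1))
relevant_bucket = km.cluster_centers_.argmax()  # bucket with the higher centre
relevant = confidences[km.labels_ == relevant_bucket]

print(relevant)  # [0.46 0.42] -> two relevant intents, likely a compound utterance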
There are more details and sample code here.

How to localize CPAN module and dependencies

I am trying to localize a CPAN module MooX::Options using Locale::TextDomain after having read "On the state of I18N in perl".
In the discussion in the pull request the question came up how to deal with messages not originating in the module itself, but in a dependency. In this specific case, when you specify an option on the command line which is not defined anywhere in the code, you'll get the warning:
Unknown option: xyz
originating in the module Getopt::Long, which in itself is not localized yet.
The question is how to deal with these. I see basically three strategies:
Ignore them, which I find dissatisfactory.
Try to somehow catch all the corner cases and messages in the module I'm currently localizing (in this case MooX::Options), thereby working around the missing localization in the dependencies. This option seems brittle, as I'd have to constantly adapt to changes in the base modules. Sometimes it might be next to impossible to catch messages, as they're written to output streams directly by the modules (as is the case in this example).
Try to localize the dependent modules themselves. This option seems hard to achieve, as different projects might use different I18N tools and strategies themselves and the dependency graph might be huge.
All in all, I think this problem is more general and not specific to Perl and CPAN modules. So I'm interested in your thoughts, strategies and approaches.
I have rather strong opinions on the idea of translating computing terms, and most people disagree with my views, so take what I am saying with a grain of salt.
I do not understand the point of internationalizing a library for parsing command line options unless you want to further ghettoize what is already a small group of users of said library.
Would wget be more useful to Turkish users if instead it was called wal or wgetir? Or, instead of wget --mirror, should Turkish users write getir --ayna? What about that w?
If you just translate the messages, what is the point of outputting a help message in response to wget -h when the Turkish equivalent would be wget -y?
The fact is, almost all attempts at translating programming-related terms I have seen are simply awful. The people who are most eager to translate are usually not in command of either language, nor do they seem to understand what they are translating.
However, as a result of these eager people, I find that the Turkish translations of pretty much any software I touch are just awful. Whatever Danish translations I have seen did not fare much better, but at least they were tolerable, owing to the greater commonality of structure between Danish and English.
I think everyone's energy is better spent on actually making sure their programs handle content, including names of external resources/references, in different languages well, rather than giving me error messages in some Frankenstein language, letting me specify command line options whose mnemonics do not match their descriptions, or presenting menus that consist of strings of words that do not convey any meaning.
I have felt this way for many decades now... even back when I was patching IBM PC keyboard drivers with hex editors so people at various places could type reports in WordStar and create charts in Harvard Graphics.
So, my unpopular advice is to put your energy elsewhere ...
For example, use exception objects, so the user of your library (who is likely a programmer, and will understand "Directory not found" much more readily than "Kütük bulunamadı") can deduce in a human-language-independent way what happened, and decide what message to show the end user. I haven't looked closely at MooX::Options, but I notice there is at least one string croak.
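To illustrate the idea (sketched in Python rather than Perl, with invented option names): the library raises a structured exception carrying machine-readable fields, and the calling program decides what message to show, in whichever language it likes.

# The library raises structured exceptions instead of printing translated text.
class UnknownOptionError(Exception):
    def __init__(self, option):
        self.option = option  # machine-readable field, no human language baked in
        super().__init__(f"Unknown option: {option}")

def parse_args(argv, known=("verbose", "output")):
    for arg in argv:
        name = arg.lstrip("-").split("=")[0]
        if name not in known:
            raise UnknownOptionError(name)

try:
    parse_args(["--xyz"])
except UnknownOptionError as e:
    # The application, not the library, chooses the wording and the language:
    print(f"Bilinmeyen seçenek: {e.option}")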
Here is an actual error message from an IBM product:
Belirtilen kütük örüntüsüyle eşleşen hiçbir kütük bulunamadı
You can ask every one of the almost 200 million Turkic people on earth what a "kütük örüntüsü" is, and only the person who actually came up with this nonsensical string of characters will be able to tell you that it corresponds to "file pattern". What, then, do they gain by using the phrase "kütük örüntüsü" instead of "file pattern"? Nothing.
However, they lose the ability to communicate with, and, also, compete with, programmers in the English speaking world.
PS: Apologies for all Turkish examples, but I feel most comfortable drawing abominable examples based on my native language.

Why do we use a reversed URL identifier in Xcode?

Why do we use a reversed URL identifier like com.yourcompany.noname within Xcode?
Same as in Java: to uniquely identify yourself. The assumption is that if you own a URL, no one else will use the same string.
As for why it's reversed, that's guesswork, but I'd say the question is backwards: it's hostnames that originally got it "wrong" by starting with the most specific part, and that perpetuated down through history. A URL of the form http:com.yourcompany.noname/bigdir/littledir/file#fragment would make much more sense(*): you start with the most global thing and end with the tiniest detail, just as when reading the time, or Arabic numerals.
(Most date formats also get this wrong; the only logically consistent format is YYYY/MM/DD, if we write numbers the way we do, with the smallest unit on the right.)
*) Also, if I remember correctly, the creator of the URL is on record saying that his biggest regret is the two slashes. EDIT: found it
Let's think about it philosophically for a moment.
Consider the case of normal URLs, e.g. noname.yourcompany.com. The highest level domain for this URL is com, since it's included in a gigantic set of other URLs besides the one you're given. For instance apple.com and microsoft.com both belong to the com top level domain. Then, yourcompany is the next highest level domain, since it belongs to your company and not Apple or Microsoft, but may itself include subdomains of its own.
In this respect, we can see that when we follow what we call 'normal URLs' from top to bottom, we are actually reading right to left. In programming languages, when we're doing scope resolution, we want to read left to right, because that's the direction in which most of us write code, and we usually start from broad categories and narrow down when we're trying to find that one elusive function we might be looking for.
That's why, in a namespace scheme designed to resemble Internet domains, we end up with names that look backwards. In a certain sense, it's the Web addresses that are "wrong".

How was the Google Books' Popular passages feature developed?

I'm curious whether anyone understands, knows, or can point me to comprehensive literature or source code on how Google created their Popular Passages feature. If you know of any other application that can do the same, please post your answer too.
If you do not know what I am writing about, here is a link to an example of Popular Passages. When you look at the overview of the book Modelling the legal decision process for information technology applications ... by Georgios N. Yannopoulos, you can see something like:
Popular passages

... direction, indeterminate. We have not settled, because we have not anticipated, the question which will be raised by the unenvisaged case when it occurs; whether some degree of peace in the park is to be sacrificed to, or defended against, those children whose pleasure or interest it is to use these things. When the unenvisaged case does arise, we confront the issues at stake and can then settle the question by choosing between the competing interests in the way which best satisfies us. In doing... Page 86
Appears in 15 books from 1968-2003

This would be a world fit for "mechanical" jurisprudence. Plainly this world is not our world; human legislators can have no such knowledge of all the possible combinations of circumstances which the future may bring. This inability to anticipate brings with it a relative indeterminacy of aim. When we are bold enough to frame some general rule of conduct (eg, a rule that no vehicle may be taken into the park), the language used in this context fixes necessary conditions which anything must satisfy... Page 86
Appears in 8 books from 1968-2000

more
It must be an intensive pattern-matching process. I can only think of n-gram models, text corpora, and automatic plagiarism detection. But n-grams are probabilistic models for predicting the next item in a sequence, and text corpora (to my knowledge) are created manually. And in this particular case, popular passages, there can be a great number of words.
I am really lost. If I wanted to create such a feature, how or where should I start? Also, include in your response which programming languages are best suited for this: F# or any other functional language, Perl, Python, Java... (I am becoming an F# fan myself.)
PS: Can someone add the tag automatic-plagiarism-detection? I can't.
Read this ACM paper by Kolak and Schilit, the Google researchers who developed Popular Passages. There are also a few relevant slides from this MapReduce course taught by Baldridge and Lease at The University of Texas at Austin.
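The core trick in that line of work is n-gram "shingling": break every book into overlapping word n-grams and index them, so a passage shared by several books shows up as a run of shared shingles. A minimal sketch of the matching step (sample texts invented):

# Two documents that share a passage share a run of overlapping word n-grams.
def shingles(text, n=8):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

book_a = "when the unenvisaged case does arise we confront the issues at stake"
book_b = ("as the author says when the unenvisaged case does arise "
          "we confront the issues at stake and more besides")

shared = shingles(book_a) & shingles(book_b)
print(len(shared), "shared 8-grams")  # any overlap flags a candidate passage

At Google's scale that index would presumably be built with something like MapReduce, which is why the course slides above are relevant.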
In the small sample I looked over, it looks like all the passages picked were inline or block quotes. Just a guess, but perhaps Google Books looks for quote marks/differences in formatting plus a citation, then uses a parsed version of the bibliography to associate the quote with the source. Hooray for style manuals.
This approach is obviously of no help for detecting plagiarism, and is of little help if the corpus isn't in a format that preserves text formatting.
If you know which books cite or reference other books, you don't need to look at all possible books, only at the books that cite each other. If it is a scientific reference, line and page numbers are often included with the quote, or can be found in the bibliography at the end of the book, so maybe Google parses only this information?
Google Scholar certainly has citation information from paper to paper, and maybe from book to book too.