How can I filter the results via WordNet synonyms / remove the negative synonyms from the results - lucene.net

Suppose I am trying to build an app which returns synonyms via WordNet, based on Lucene.net.
A customized synonym analyzer and a WordNet synonym engine have been built and they do work.
The synonym dictionary I am currently using was downloaded from http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz,
but unfortunately, the results are sometimes not what I have been expecting, because some synonyms are negative and might
cause a horrible experience (take 'black', for example). So here is the problem I need to solve:
How can I filter/remove the negative synonyms from the results?
Thanks in advance. Any help is appreciated ...
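One possible approach (not from the original post): post-filter whatever the synonym engine returns against a blocklist of negative terms, or against polarity scores from a sentiment resource such as SentiWordNet. Below is a minimal Python sketch of the blocklist variant; the poster's stack is Lucene.net/C#, so this only illustrates the filtering step, and the blocklist contents are invented:

# Hypothetical post-filter: drop synonyms that appear in a
# hand-maintained blocklist of negative terms.
NEGATIVE_TERMS = {"dirty", "evil", "soiled"}  # extend as bad results appear

def filter_synonyms(synonyms):
    """Keep only synonyms not present in the negative-term blocklist."""
    return [s for s in synonyms if s.lower() not in NEGATIVE_TERMS]

print(filter_synonyms(["black", "dark", "evil"]))  # ['black', 'dark']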

Related

ECL - showing differences results for debugging

I'm using the Epsilon Comparison Language for the first time. I am writing code to compare two models; in particular, I want to show some information on the default output stream console when the code finds differences between the models. I want to visualize, for example, the name of the involved rule and the differences between the fields under investigation. When the comparison ends without differences, I can visualize all that I need using the matchInfo variable in a "do" block. How can I solve the problem when the code finds some differences? Thanks.
ECL does not provide any built-in differencing or difference visualisation capabilities. If you need such capabilities, my suggestion would be to consider using EMFCompare.

Elasticsearch - is there a method to match using "almost ident"

I use Facebook and Google Maps to get full geo-entity data values (country, city, street, zip...).
I store these values in my MongoDB.
I noticed that some locations differ in the way they were written on Facebook and on Google; for (an unreal) example, Facebook wrote the name of 'Hawaii' with an 'e' - Haweii.
I use match_all fields (country + city + street...) to search for entities at the same location, but since some are written a bit differently, I will not find them.
Is there a way to make Elasticsearch search for 'Hawaii' and any other option that sounds like Hawaii but is written a bit differently?
Thanks for any help!
Using the Google API one can get a location's full details.
To match words that sound similar you can use the phonetic analyzer. You can also give a fuzzy query a try to match words with spelling mistakes. Neither of them is foolproof, though, and they may result in false positives. I guess you'll have to experiment a little to come up with a solution that best fits your needs.
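For illustration, here is a hedged sketch of a fuzzy query using the official elasticsearch Python client; the host, index name ('geo_entities'), and field name ('city') are invented placeholders:

# Fuzzy match: 'fuzziness: AUTO' tolerates small edit distances,
# so 'Hawaii' will also match the misspelled 'Haweii'.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="geo_entities",
    body={
        "query": {
            "match": {
                "city": {"query": "Hawaii", "fuzziness": "AUTO"}
            }
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"])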
If you have a known set of differences between Facebook and Google Maps, you could look at using synonyms at either index time or query time to accommodate differences in the APIs; there are merits to taking either approach.
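As a sketch of the index-time variant, here is one way to declare a synonym filter when creating the index, again via the Python client; the index name, analyzer name, and the synonym pairs are invented placeholders:

# Known spelling variants from the two APIs collapse to one token
# at index time, so either spelling matches at query time.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="geo_entities",
    body={
        "settings": {
            "analysis": {
                "filter": {
                    "geo_synonyms": {"type": "synonym", "synonyms": ["haweii, hawaii"]}
                },
                "analyzer": {
                    "geo_city": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "geo_synonyms"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {"city": {"type": "text", "analyzer": "geo_city"}}
        },
    },
)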

How to search for multiple tags around one location?

I'm trying to figure out the best solution to find all nodes of certain types around a given GPS location.
Let's say I want to get all cafes, pubs, restaurants and parks around a given point X.xx,Y.yy.
[out:json];(node[amenity][leisure](around:500,52.2740711,10.5222147););out;
This returns nothing, because I think it searches for nodes that are both amenity and leisure, which is not possible.
[out:json];(node[amenity or leisure](around:500,52.2740711,10.5222147););out;
[out:json];(node[amenity,leisure](around:500,52.2740711,10.5222147););out;
[out:json];(node[amenity;leisure](around:500,52.2740711,10.5222147););out;
[out:json];(node[amenity|leisure](around:500,52.2740711,10.5222147););out;
[out:json];(node[amenity]|[leisure](around:500,52.2740711,10.5222147););out;
[out:json];(node[amenity],[leisure](around:500,52.2740711,10.5222147););out;
[out:json];(node[amenity];[leisure](around:500,52.2740711,10.5222147););out;
These solutions result in an error (400: Bad Request).
The only working solution I found is the following one, which results in really long queries:
[out:json];(node[amenity=cafe](around:500,52.2740711,10.5222147);node[leisure=park](around:500,52.2740711,10.5222147);node[amenity=pub](around:500,52.2740711,10.5222147);node[amenity=restaurant](around:500,52.2740711,10.5222147););out;
Isn't there an easier solution without multiple "around" statements?
EDIT:
Found this one, which is a little bit shorter, but it still has multiple "around" statements.
[out:json];(node["leisure"~"park"](around:400,52.2784715,10.5249662);node["ameni‌​ty"~"cafe|pub|restaurant"](around:400,52.2784715,10.5249662););out;
What you're probably looking for is regular expression support for keys (not only values).
Here's an example based on your query above:
[out:json];
node[~"^(amenity|leisure)$"~"."](around:500,52.2740711,10.5222147);
out;
NB: Since version 0.7.54 (released in Q1/2017) Overpass API also supports filter criteria with 'or' conditions. See this example on how to use this new (if: ) filter.
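For completeness, a small sketch of running the regex-key query above from Python with requests; the public endpoint URL is the usual one, but note that it is rate-limited:

import requests

query = """
[out:json];
node[~"^(amenity|leisure)$"~"."](around:500,52.2740711,10.5222147);
out;
"""

resp = requests.post("https://overpass-api.de/api/interpreter", data={"data": query})
resp.raise_for_status()
for element in resp.json()["elements"]:
    # each element carries its OSM id and tag dictionary
    print(element["id"], element.get("tags", {}))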

Avoiding "text-search query contains only stop words or doesn't contain lexemes, ignored" in postgresql logs

I'm using the texticle (https://github.com/tenderlove/texticle) library to do full-text postgresql searches.
The library generates sql like the following:
to_tsvector('spanish', "games"."system"::text) @@ plainto_tsquery('spanish', 'Genesis'::text)
If someone does a search for '&', then I get a warning in my logs:
text-search query contains only stop words or doesn't contain lexemes, ignored
How can I avoid this? Should I have the application know about the various stopwords and not send postgresql the query if the search term is comprised only of stopwords? Or can I tell postgresql to somehow ignore this warning?
The warning is perfectly descriptive. Basically, you have to handle this somewhere in your programming logic. If your searches are in a stored procedure, you can test the value and see if it contains lexemes.
Otherwise you can set the logging level to ERROR so warnings do not get logged, or you can handle it at the application level.
None of these are perfect. If I search for "the" I should get a similar warning, so you probably want to be able to check after it is converted to a tsquery (i.e. in the database itself).
Otherwise, my advice is to ignore the warning.
This function could be used to determine whether a query contains only stop words:
numnode(plainto_tsquery('spanish', '&')) = 0
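A hedged sketch of that check from application code using psycopg2; the connection string is a placeholder, and the real search would run in the else branch:

import psycopg2

conn = psycopg2.connect("dbname=mydb")
term = "&"  # raw user input

with conn.cursor() as cur:
    # numnode() counts lexemes and operators in the parsed tsquery;
    # 0 means the input was entirely stop words (or empty).
    cur.execute("SELECT numnode(plainto_tsquery('spanish', %s))", (term,))
    (node_count,) = cur.fetchone()

if node_count == 0:
    print("only stop words; skipping full-text search")
else:
    pass  # run the actual to_tsvector @@ plainto_tsquery search here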
Personally, I always sanitize user input. It should be considered evil for the most part, so in this case I would definitely do some filtering at the application level.

Machine learning and code generation from strings

The problem: given a set of hand-categorized strings (or a set of ordered vectors of strings), generate a categorization function to categorize more input. In my case, that data (or most of it) is not natural language.
The question: are there any tools out there that will do that? I'm thinking of some kind of reasonably polished, download-install-and-go kind of thing, as opposed to some library or a brittle academic program.
(Please don't get stuck on details as the real details would restrict answers to less generally useful responses AND are under NDA.)
As an example of what I'm looking at: the input I want to filter is computer-generated status strings pulled from logs. Error messages (as an example) would be filtered based on who needs to be informed or what action needs to be taken.
Doing Things Manually
If the error messages are being generated automatically and the list of exceptions behind the messages is not terribly large, you might just want to have a table that directly maps each error message type to the people who need to be notified.
This should make it easy to keep track of exactly who/which-groups will be getting what types of messages and to update the routing of messages should you decide that some of the messages are being misdirected.
Typically, a small fraction of the types of errors make up a large fraction of error reports. For example, Microsoft noticed that 80% of crashes were caused by 20% of the bugs in their software. So, to get something useful, you wouldn't even need to start with a complete table covering every type of error message. Instead, you could start with just a list that maps the most common errors to the right person and routes everything else to a person for manual routing. Each time an error is routed manually, you could then add an entry to the routing table so that errors of that type are handled automatically in the future.
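A minimal sketch of such a routing table in Python; every error type and address here is invented:

# Known error types route automatically; anything unknown goes to a
# triage inbox, and once routed by hand it can be added to the table.
ROUTING_TABLE = {
    "DiskFullError": ["ops-team@example.com"],
    "AuthFailure": ["security@example.com"],
}

def route(error_type):
    return ROUTING_TABLE.get(error_type, ["triage@example.com"])

print(route("DiskFullError"))  # ['ops-team@example.com']
print(route("SomeNewError"))   # ['triage@example.com']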
Document Classification
Unless the error messages are being editorialized by the people who submit them and you want to use this information when routing them, I wouldn't recommend treating this as a document classification task. However, if this is what you want to do, here's a list of reasonably good packages for document classification, organized by programming language:
Python - To do this using the Python-based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book (a short sketch follows this list).
Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J, Weka, Lucene Mahout, and, as adi92 mentioned, Mallet.
Learning Rules with Weka - If rules are what you want, Weka might be of particular interest, since it includes a rule set based learner. You'll find a tutorial on using Weka for text categorization here.
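As promised above, a short sketch of the NLTK approach from the Python item; the training strings and labels are invented stand-ins for hand-categorized log messages:

import nltk

train = [
    ("disk quota exceeded on /var", "ops"),
    ("login failed for user admin", "security"),
    ("disk full on /home", "ops"),
    ("invalid password attempt", "security"),
]

def features(text):
    # bag-of-words presence features, as in the NLTK book's chapter
    return {"contains(%s)" % w: True for w in text.lower().split()}

classifier = nltk.NaiveBayesClassifier.train(
    [(features(text), label) for text, label in train]
)
print(classifier.classify(features("disk quota exceeded again")))  # likely 'ops'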
Mallet has a bunch of classifiers which you can train and deploy entirely from the command line.
Weka is nice too because it has a huge number of classifiers and preprocessors for you to play with.
Have you tried spam or email filters? By using text files that have been marked with appropriate categories, you should be able to categorize further text input. That's what those programs do, anyway, but instead of labeling your outputs as 'spam' and 'not spam', you could use other categories.
You could also try something involving AdaBoost for a more hands-on approach to rolling your own. This library from Google looks promising, but probably doesn't meet your ready-to-deploy requirements.
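If you do want to experiment with the AdaBoost route, here is a hedged sketch using scikit-learn (not the Google library mentioned above); the training strings are invented:

# Boosted decision stumps over bag-of-words counts.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

texts = ["disk quota exceeded", "login failed", "disk full", "bad password"]
labels = ["ops", "security", "ops", "security"]

model = make_pipeline(CountVectorizer(), AdaBoostClassifier(n_estimators=50))
model.fit(texts, labels)
print(model.predict(["disk quota warning"]))  # likely ['ops']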