Elasticsearch - is there a method to match using "almost ident" - mongodb

I use Facebook and Google Maps to get full geo entity data (country, city, street, zip...).
I store these values in MongoDB.
I noticed that some locations differ in the way they are written on Facebook and on Google. For a (made-up) example, Facebook spells 'Hawaii' with an 'e': Haweii.
I match on all fields (country + city + street...) to search for entities at the same location, but since some are written slightly differently I do not find them.
Is there a way to make Elasticsearch search for 'Hawaii' and any other variant that sounds like Hawaii but is written slightly differently?
Thanks for any help!
Using the Google API one can get a location's full details.

To match words that sound similar you can use the phonetic analyzer. You can also give the fuzzy query a try to match words with spelling mistakes. Neither is foolproof, though, and both may produce false positives. You'll have to experiment a little to find the solution that best fits your needs.
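As a rough illustration of the fuzzy approach, here is a minimal Python sketch that builds an Elasticsearch request body with one fuzzy `match` clause per address field. The function name and the field names (`country`, `city`) are assumptions made for the example, not taken from the question's actual mapping, and the resulting dict would still have to be sent with whatever client you use.

```python
def build_fuzzy_location_query(location: dict) -> dict:
    """Build a bool query with one fuzzy match clause per address field.

    `location` maps field names (hypothetical here) to the values to search for.
    """
    clauses = [
        # "AUTO" lets Elasticsearch pick the allowed edit distance
        # based on the length of each term.
        {"match": {field: {"query": value, "fuzziness": "AUTO"}}}
        for field, value in location.items()
    ]
    # Every field must match, but each one tolerates small spelling differences.
    return {"query": {"bool": {"must": clauses}}}


body = build_fuzzy_location_query({"country": "USA", "city": "Haweii"})
```

With `fuzziness` set to `"AUTO"`, a one-letter difference such as Haweii/Hawaii should fall within the allowed edit distance, but the same tolerance is also what produces the false positives mentioned above.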

If you have a known set of differences between Facebook and Google Maps, you could use synonyms at either index time or query time to accommodate the differences between the APIs. There are merits to either approach.
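A minimal sketch of the synonym approach, assuming you keep a small table of known spelling variants: the Python function below turns that table into index settings with a `synonym` token filter. The filter name `geo_synonyms`, the analyzer name `geo_names`, and the variant table are all hypothetical.

```python
def build_synonym_settings(variants: dict) -> dict:
    """Build index settings mapping known spelling variants to one canonical name.

    `variants` maps a canonical spelling to the list of alternative spellings.
    """
    # Rules use the "variant1, variant2 => canonical" syntax. Everything is
    # lowercased because the lowercase filter runs before the synonym filter.
    rules = [
        f"{', '.join(alt.lower() for alt in alternatives)} => {canonical.lower()}"
        for canonical, alternatives in variants.items()
    ]
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "geo_synonyms": {"type": "synonym", "synonyms": rules}
                },
                "analyzer": {
                    "geo_names": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "geo_synonyms"],
                    }
                },
            }
        }
    }


settings = build_synonym_settings({"Hawaii": ["Haweii"]})
```

Using the analyzer at index time bakes the normalisation into the index, while using it only as a search analyzer lets you change the synonym list without reindexing.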

Related

Does Algolia have a search with recommendation?

I was wondering if the Algolia service provides some kind of recommendations mechanism when doing a search.
I could not find in the API documentation anything related with providing the client with more refined and intelligent search alternatives based on the index data.
The scenario I am trying to describe is the following (this example is a bit over-the-top):
Given a user is searching for "red car" then the system provides more specific search alternatives and possibly related items that exist in the database (e.g. ferrari, red driving gloves, fast and furious soundtrack :-) )
Update
For future reference, I ended up doing a basic recommendation system using the text search capabilities of Algolia.
Summarising, when a car is saved, its attributes (color, speed, engine, etc.) are used to create synonym indexes, for example for the engine ferrari in the Engine index:
{
synonyms: ['red', 'ferrari', 'fast'],
value: 'ferrari'
}
Finally, each index must declare the synonyms attribute as the one to search in and value as the result returned by a search.
Algolia does not provide that kind of "intelligence" out of the box.
Something you can do to approximate what you're looking for is using synonyms and a combination of other parameters:
- Define synonym groups such as "car,Ferrari,driving gloves", "red,dark red,tangerine,orange", ...
- When sending a search query, set optionalWords to the list of words contained in that query. This will make each word of your query optional.
- Also set removeStopWords to true so that words like "the", "a" (...) are ignored, to improve relevance.
With a well-defined list of synonyms, your original query will be interpreted in many other ways, which increases the variety of possible results.
Be aware though that it could also impact the relevance of your results, as users might for instance not want to look for gloves when they search for a car!
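The query-side part of this advice could be sketched as follows in Python. The function name is made up for the example. It only builds the search parameters (`optionalWords` and `removeStopWords` as described above) and leaves sending them to whichever Algolia client you use. The synonym groups themselves are assumed to be configured on the index already.

```python
def build_search_params(query: str) -> dict:
    """Build search parameters that make every word of the query optional."""
    return {
        "query": query,
        # Each word of the query becomes optional, so records matching
        # only some of the words (or their synonyms) can still be returned.
        "optionalWords": query.split(),
        # Ignore words like "the" or "a" when matching.
        "removeStopWords": True,
    }


params = build_search_params("red car")
```

Splitting on whitespace is the simplest possible tokenisation and is only an assumption here. Queries with punctuation may need more care.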

How to search for multiple tags around one location?

I'm trying to figure out what's the best solution to find all nodes of certain types around a given GPS-Location.
Let's say I want to get all cafes, pubs, restaurants and parks around a given point X.xx,Y.yy.
[out:json];(node[amenity][leisure](around:500,52.2740711,10.5222147););out;
This returns nothing, I think because it searches for nodes that are both amenity and leisure, which is not possible.
[out:json];(node[amenity or leisure](around:500,52.2740711,10.5222147););out;
[out:json];(node[amenity,leisure](around:500,52.2740711,10.5222147););out;
[out:json];(node[amenity;leisure](around:500,52.2740711,10.5222147););out;
[out:json];(node[amenity|leisure](around:500,52.2740711,10.5222147););out;
[out:json];(node[amenity]|[leisure](around:500,52.2740711,10.5222147););out;
[out:json];(node[amenity],[leisure](around:500,52.2740711,10.5222147););out;
[out:json];(node[amenity];[leisure](around:500,52.2740711,10.5222147););out;
These attempts all result in an error (400: Bad Request).
The only working solution I found is the following one, which results in really long queries:
[out:json];(node[amenity=cafe](around:500,52.2740711,10.5222147);node[leisure=park](around:500,52.2740711,10.5222147);node[amenity=pub](around:500,52.2740711,10.5222147);node[amenity=restaurant](around:500,52.2740711,10.5222147););out;
Isn't there an easier solution without multiple "around" statements?
EDIT:
Found this, which is a little bit shorter but still needs multiple "around" statements.
[out:json];(node["leisure"~"park"](around:400,52.2784715,10.5249662);node["amenity"~"cafe|pub|restaurant"](around:400,52.2784715,10.5249662););out;
What you're probably looking for is regular expression support for keys (not only values).
Here's an example based on your query above:
[out:json];
node[~"^(amenity|leisure)$"~"."](around:500,52.2740711,10.5222147);
out;
NB: Since version 0.7.54 (released in Q1/2017) Overpass API also supports filter criteria with 'or' conditions. See this example on how to use this new (if: ) filter.
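If the query is generated from code, the key-regex form above can be assembled with a small helper. This is only a sketch in Python. The function name and parameters are made up for illustration, and it assumes the keys are plain tag names without regex metacharacters.

```python
def build_around_query(keys, radius_m, lat, lon):
    """Build an Overpass QL query for nodes carrying any of the given tag keys.

    Uses a regular expression on the key and "." as the value pattern,
    so a single "around" filter covers all keys.
    """
    # Assumes keys contain no regex metacharacters (e.g. "amenity", "leisure").
    key_pattern = "|".join(keys)
    return (
        "[out:json];"
        f'node[~"^({key_pattern})$"~"."](around:{radius_m},{lat},{lon});'
        "out;"
    )


query = build_around_query(["amenity", "leisure"], 500, 52.2740711, 10.5222147)
```

Note that this matches every amenity and leisure value, not only cafes, pubs, restaurants and parks, so the results may still need filtering by value afterwards.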

Which features should be added for NER in search result snippets

I want to cluster queries with the help of the snippets that search engines currently return for them. While using the noun phrases in the snippets worked well for Google results, I felt that I should try a different approach for Bing snippets and hence went for Named Entity Extraction.
I have identified the following entities that can be extracted as of now using standard tools:
Person Names
Organisation Names
Locations
But I think I should be extracting more entities. Could anyone help me out here to identify more entities that may be useful?
This is an endless list, once you get to real data problems.
For example, dates are a common thing to extract. But booking codes such as airline tickets, or tracking codes such as those for parcels, are also something Google Mail already recognizes and extracts.
I don't think this is a very good question for a Q/A site. Plus, you may want to read more literature and see what kind of data you can get; which entities you want to extract is clearly data-driven. When analyzing log files, for example, you might be interested in extracting host names, IPs, usernames and daemon/service names.
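To make the data-driven point concrete, here is a small Python sketch that pulls two such entity types (ISO-style dates and IPv4 addresses) out of a log line with regular expressions. The patterns and the sample line are illustrative assumptions only. Real data usually needs stricter patterns or a trained extractor.

```python
import re

# Deliberately simple patterns: the IPv4 one does not check that each
# octet is <= 255, and the date one only covers the YYYY-MM-DD form.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
ISO_DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_entities(text: str) -> dict:
    """Return the IPv4-looking and ISO-date-looking substrings found in text."""
    return {
        "ip": IPV4_PATTERN.findall(text),
        "date": ISO_DATE_PATTERN.findall(text),
    }


entities = extract_entities("2017-03-01 sshd: failed login for root from 192.168.0.12")
```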

Address Unification

I'm creating a business directory where I need to display results based on area and keywords. The problem is the scope might be across countries that have fairly irregular address structures. I currently have the following as form fields (and their respective database fields)
Fields (All required):
- Address 1
- Address 2
- Area <------key search criteria
- Keywords <------key search criteria
The problem is I'm not sure how reliable this setup is. I would have to rely on the data entry when searching to be relevant enough for it to work, and that goes against validating everything before inserting to the database. Is there a standard way of looking up areas across countries? And if so, how?
I decided to solve this by running (and verifying) addresses through batch geocoding, which converts the addresses to 'geocodes' that can be used with mapping plugins. There seem to be a lot of solutions for this (Google "batch geocode addresses"), although you may have to research further for accuracy. I initially started with OpenLayers for mapping but found Leaflet faster to understand and deploy (with an emphasis on mobile), though I am speaking only from my own experience of learning it and being able to implement it in time.

Implement "Did you mean?" with Core Data

I'm working on an iOS app. I have a Core Data database with a lot of company names.
When the user enters a company name that does not exist, I would like to show "similar" company names. For example, if the user entered "Aple", I would like to show "Did you mean Apple?".
I know that the technique of finding strings that match a pattern approximately (rather than exactly) is called approximate string matching or, colloquially, fuzzy string searching.
In theory, there are many algorithms, more or less valid: the Levenshtein distance computing algorithm and so on.
But in practice, is there someone who has already implemented something similar that can be used easily with core data?
I found a solution: use the NSString category available on GitHub, NSString-DamerauLevenshtein.
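For readers who want to see the idea behind such a category, here is a language-neutral sketch in Python (not the code of the GitHub project). It computes the optimal string alignment variant of the Damerau-Levenshtein distance and uses it to rank candidate names. The function names, the threshold of 2, and the sample names are assumptions for illustration.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions and transpositions of adjacent characters."""
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def did_you_mean(query: str, names: list, max_distance: int = 2) -> list:
    """Return the names within max_distance of the query, closest first."""
    q = query.lower()
    scored = [(osa_distance(q, name.lower()), name) for name in names]
    return [name for dist, name in sorted(scored) if dist <= max_distance]


suggestions = did_you_mean("Aple", ["Apple", "Google", "Amazon"])
```

Comparing the query against every stored name is linear in the number of names, so for a large Core Data store you would probably want to narrow the candidates first (for example by first letter or length).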
Try looking at Soundex if SQLite is your underlying data store. SQLite has a soundex() function, although it is only available when SQLite is compiled with the SQLITE_SOUNDEX option.