Implement "Did you mean?" with Core Data - iphone

I'm working on an iOS app. I have a Core Data database with a lot of company names.
When the user inserts a company name that does not exist, I would like to show "similar" company names. For example, if the user entered "Aple", I would like to show "Did you mean Apple?".
I know that the technique of finding strings that match a pattern approximately (rather than exactly) is called approximate string matching or, colloquially, fuzzy string searching.
In theory, there are many algorithms of varying suitability: the Levenshtein distance algorithm and so on.
But in practice, has anyone already implemented something like this that can be used easily with Core Data?

I found a solution: use this NSString category, available on GitHub: NSString-DamerauLevenshtein.
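If you want to see the idea behind that category, here is a minimal sketch in Python of a "did you mean?" lookup using plain Levenshtein distance (the Damerau variant additionally counts transpositions); the company list and the distance threshold of 2 are illustrative assumptions:

    # Sketch: suggest the closest known company name within a small edit distance.
    def levenshtein(a, b):
        # Classic dynamic-programming edit distance (insert/delete/substitute).
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    def did_you_mean(query, companies, max_distance=2):
        # Return the closest known name within the allowed distance, or None.
        best = min(companies, key=lambda name: levenshtein(query.lower(), name.lower()))
        return best if levenshtein(query.lower(), best.lower()) <= max_distance else None

    print(did_you_mean("Aple", ["Apple", "Microsoft", "IBM"]))  # -> Apple

In practice you would fetch the candidate company names from Core Data first and run the distance check over that array.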

Try looking at Soundex. SQLite has a built-in soundex() function, although it is only compiled in when the SQLITE_SOUNDEX option is defined, so check whether your build includes it if SQLite is your underlying data store.
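For reference, here is a rough sketch in Python of what Soundex computes; real implementations differ in some edge cases (for example, the treatment of 'h' and 'w'):

    # Minimal American Soundex sketch: first letter plus three digits.
    def soundex(name):
        codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        name = name.lower()
        digits = [codes.get(c, "") for c in name]
        result, prev = [], digits[0]
        for d in digits[1:]:
            if d and d != prev:   # collapse adjacent duplicate codes
                result.append(d)
            prev = d
        return (name[0].upper() + "".join(result) + "000")[:4]

    print(soundex("Apple"), soundex("Aple"))  # both A140, so they'd match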

Related

Elasticsearch - is there a method to match using "almost ident"

I use Facebook and Google Maps to get full geo-entity data values (country, city, street, zip...).
I store these values in my MongoDB.
I noticed that some locations differ in the way they are written on Facebook and on Google. For (an unreal) example, Facebook wrote the name 'Hawaii' with an 'e': Haweii.
I match on all fields (country + city + street...) to search for entities at the same location, but since some are written a bit differently, I will not find them.
Is there a way to make Elasticsearch search for 'Hawaii' and any other option that sounds like Hawaii but is written a bit differently?
Thanks for any help!
Using the Google API one can get a location's full details.
To match words that sound similar you can use the phonetic analyzer. You can also give the fuzzy query a try to match words with spelling mistakes. Neither is foolproof, though, and both may produce false positives. I guess you'll have to experiment a little to come up with a solution that best fits your needs.
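For example, a fuzzy query could look like the sketch below; the index and field names ("places", "city") are made up, and the exact request syntax varies across Elasticsearch versions:

    # Sketch: fuzzy term matching tolerates a small number of edits.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    resp = es.search(index="places", body={
        "query": {
            "fuzzy": {
                "city": {
                    "value": "Haweii",
                    "fuzziness": "AUTO"  # 1-2 edits allowed, depending on term length
                }
            }
        }
    })
    for hit in resp["hits"]["hits"]:
        print(hit["_source"])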
If you have a known set of differences between the Facebook and Google Maps data, you could look at using synonyms at either index time or query time to accommodate the differences between the APIs; there are merits to taking either approach.
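As a sketch of the index-time variant, assuming you have already collected the known variants (the index settings and the haweii => hawaii rule are purely illustrative):

    # Sketch: an analyzer with a synonym filter that canonicalizes known variants.
    settings = {
        "settings": {
            "analysis": {
                "filter": {
                    "api_synonyms": {
                        "type": "synonym",
                        "synonyms": ["haweii => hawaii"]  # known Facebook/Google variants
                    }
                },
                "analyzer": {
                    "place_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "api_synonyms"]
                    }
                }
            }
        }
    }
    # es.indices.create(index="places", body=settings), then map the location
    # fields to use "place_analyzer".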

Fuzzy string matching: which tool?

I have a large number of strings containing a product name and a few other properties (size, volume, age, etc.). But the strings are not standardized at all. Product names might be misspelled, and volume might be in a different notation (0.5l, 1/2 liter, 500ml, etc.). The number of variations is limited, though; there are, for instance, only a few hundred products. What tools can I use to analyze each string and tell me whether it contains certain tokens? My guess is that some sort of learning mechanism would be useful, but I'm not sure which tools offer that. I've looked at Elasticsearch, but I'm not sure if that's the way to go. All my data is currently in a PostgreSQL db and I've looked at pg_trgm as well. Again, not sure if that fits my need.
One solution I've been thinking about is maintaining a list of proper keywords and, per string, checking whether the string contains any of the keywords. I'm not sure if this would work and, if it would, how to implement it efficiently and effectively in PostgreSQL.
EDIT
Here are a few example lines I'm trying to extract keywords from:
wine Bardolo red 1L 12b 12%
La Tulipe, 13* box 3 bottles, 2005
Great Johnny Walker 7CL 22% red label
Wisky Jonny Walken .7 Red limited editon
I've done quite some searching by now but have yet to find a proper way to solve this problem.
I've used the pg_trgm extension for a similar task (comparing misspelled address lines and company names), along with a clustering algorithm (which may not be needed in your case).
It did its job, with some data preparation (regexp replacements).
It may not be very easy, but I'm sure it's possible to solve your problem too. And the index support in pg_trgm is great.
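For illustration, a pg_trgm lookup from Python could look like the sketch below; the products table and the name column are my assumptions, not from the original setup:

    # Sketch: trigram similarity search with pg_trgm.
    import psycopg2

    conn = psycopg2.connect("dbname=products")
    cur = conn.cursor()
    # One-time setup:
    #   CREATE EXTENSION IF NOT EXISTS pg_trgm;
    #   CREATE INDEX idx_name_trgm ON products USING gin (name gin_trgm_ops);
    cur.execute(
        "SELECT name, similarity(name, %s) AS sim "
        "FROM products WHERE name %% %s "   # %% escapes pg_trgm's % operator
        "ORDER BY sim DESC LIMIT 5",
        ("Jonny Walken", "Jonny Walken"),
    )
    print(cur.fetchall())  # e.g. rows for 'Johnny Walker' variants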

Database design decision: normalization or repetition?

I'm building a delivery system; so far, my design looks like this:
The problem is that, very frequently, I'll need a (very hierarchical) structure (array, JSON, objects...) that looks like this:
The problem with this is that it creates a lot of repetition of StreetAddress, DeliveryPoint, and Customer, since each Itinerary would create lots of them, and itineraries look very much like one another.
The good part is that everything would be pretty with just a few joins.
With the first schema, it would be very awkward to create the second structure, but it's possible.
Any ideas on how to control the repetition and still get an easy to query schema for the above structure?
I'm using:
PostgreSQL 9.1
PHP 5.5
Symfony Framework Standard Edition 2.4.0-BETA1 (With Doctrine)
[In case anyone wants to know how I drew the schemas: www.gliffy.com]
Repetition and normalization are not always opposing questions.
Here's the basic problem:
Normalization doesn't care about repetition per se but about functional dependency
Repetition is the wrong question; functional dependency is the right question. For addresses it is remarkably hard to determine functional dependencies, because there are so many conventions out there, and even if you did, you'd still run into formatting issues.
A simple way to get to the bottom of this is asking about reasons why a given piece of data may change. Good, normalized design limits the reasons why a given piece of data may need to change. Now, with that in mind, it looks like you need to store historical locations for customers and it looks to me like you might want to do something slightly different.
Instead of:
Delivery -> customer -> street address -> itinerary
It looks to me like it would make more sense to:
Customer -> street address
And
delivery -> itinerary -> street address
In this model you may have duplicate information and you may need to have dates in the street address indicating when it is valid to and from, but that doesn't strike me as a normalization problem especially given the normalization problems that addresses already pose. But from there you can easily track the customer the delivery was made to, while in your model it isn't clear you can track the street address or itinerary of a given delivery.
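To make the proposed split concrete, here is a rough DDL sketch; the table and column names are my guesses, and it is wrapped in a Python string only to keep all examples in one language:

    # Sketch: Customer -> street_address, and delivery -> itinerary -> street_address.
    schema = """
    CREATE TABLE street_address (
        id         serial PRIMARY KEY,
        line1      text NOT NULL,
        valid_from date,
        valid_to   date  -- validity window instead of forced deduplication
    );
    CREATE TABLE customer (
        id         serial PRIMARY KEY,
        address_id integer REFERENCES street_address(id)
    );
    CREATE TABLE itinerary (
        id         serial PRIMARY KEY,
        address_id integer REFERENCES street_address(id)
    );
    CREATE TABLE delivery (
        id           serial PRIMARY KEY,
        customer_id  integer REFERENCES customer(id),
        itinerary_id integer REFERENCES itinerary(id)
    );
    """
    # e.g. run it with psycopg2: conn.cursor().execute(schema)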

How to auto-tag content, algorithms and suggestions needed

I am working with some really large databases of newspaper articles; I have them in a MySQL database and I can query them all.
I am now searching for ways to help me tag these articles with somewhat descriptive tags.
All these articles are accessible from a URL that looks like this:
http://web.site/CATEGORY/this-is-the-title-slug
So at least I can use the category to figure out what type of content we are working with. However, I also want to tag based on the article text.
My initial approach was doing this:
Get all articles
Get all words, remove all punctuation, split by space, and count them by occurrence
Analyze them, and filter out common non-descriptive words like "them", "I", "this", "these", "their", etc.
Once all the common words were filtered out, the only thing left was tag-worthy words.
But this turned out to be a rather manual task, and not a very pretty or helpful approach.
This also suffered from the problem of words or names that are split by a space. For example, if 1,000 articles contain the name "John Doe" and 1,000 articles contain the name "John Hanson", I would only get the word "John" out of it, not the full first and last name.
Automatically tagging articles is really a research problem and you can spend a lot of time re-inventing the wheel when others have already done much of the work. I'd advise using one of the existing natural language processing toolkits like NLTK.
To get started, I would suggest implementing a proper tokeniser (much better than splitting by whitespace), and then taking a look at chunking and stemming algorithms.
You might also want to count frequencies for n-grams, i.e. sequences of words, instead of individual words. This would take care of "words split by a space". Toolkits like NLTK have built-in functions for this.
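A minimal NLTK sketch of bigram counting (the sample text is made up, and the tokeniser model needs a one-time download):

    # Sketch: count word bigrams so "John Doe" and "John Hanson" stay distinct.
    from collections import Counter
    import nltk
    # nltk.download("punkt")  # one-time tokeniser model download

    text = "John Doe met John Hanson. John Doe left."
    tokens = nltk.word_tokenize(text)       # a proper tokeniser, not a split-by-space
    bigrams = Counter(nltk.bigrams(tokens))
    print(bigrams.most_common(3))           # [(('John', 'Doe'), 2), ...]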
Finally, as you iteratively improve your algorithm, you might want to train on a random subset of the database and then check how well the algorithm tags the remaining articles.
You should use a metric such as tf-idf to get the tags out:
Count the frequency of each term per document. This is the term frequency, tf(t, D). The more often a term occurs in the document D, the more important it is for D.
Count, per term, the number of documents the term appears in. This is the document frequency, df(t). The higher df, the less the term discriminates among your documents and the less interesting it is.
Divide tf by the log of df: tfidf(t, D) = tf(t, D) / log(df(t) + 1).
For each document, declare the top k terms by their tf-idf score to be the tags for that document.
Various implementations of tf-idf are available; for Java and .NET there's Lucene, and for Python there's scikit-learn.
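For instance, a scikit-learn sketch of the top-k extraction described above (the toy documents, the English stop-word list, and k = 3 are illustrative choices; note that scikit-learn's idf weighting differs in detail from the formula above):

    # Sketch: tag each document with its top-3 tf-idf terms.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the match was won in the last minute",
            "parliament passed the budget vote today"]
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(docs)
    terms = vec.get_feature_names_out()
    for row in tfidf.toarray():
        tags = [terms[i] for i in row.argsort()[::-1][:3] if row[i] > 0]
        print(tags)  # the highest-scoring terms become the document's tags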
If you want to do better than this, use language models. That requires some knowledge of probability theory.
Take a look at Kea. It's an open source tool for extracting keyphrases from text documents.
Your problem has also been discussed many times at http://metaoptimize.com/qa:
http://metaoptimize.com/qa/questions/1527/what-are-some-good-toolkits-to-get-lda-like-tagging-of-my-documents
http://metaoptimize.com/qa/questions/1060/tag-analysis-for-document-recommendation
If I understand your question correctly, you'd like to group the articles into similarity classes. For example, you might assign article 1 to 'Sports', article 2 to 'Politics', and so on. Or if your classes are much finer-grained, the same articles might be assigned to 'Dallas Mavericks' and 'GOP Presidential Race'.
This falls under the general category of 'clustering' algorithms. There are many possible choices of such algorithms, but this is an active area of research (meaning it is not a solved problem, and thus none of the algorithms are likely to perform quite as well as you'd like).
I'd recommend you look at Latent Dirichlet Allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) or 'LDA'. I don't have personal experience with any of the LDA implementations available, so I can't recommend a specific system (perhaps others more knowledgeable than I might be able to recommend a user-friendly implementation).
You might also consider the agglomerative clustering implementations available in LingPipe (see http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html), although I suspect an LDA implementation might prove somewhat more reliable.
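As a starting point, here is a small LDA sketch using scikit-learn's implementation (the toy corpus and the choice of two topics are illustrative, and I'm not claiming this particular implementation is the most reliable):

    # Sketch: LDA gives each document a fractional mixture over topics.
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the team won the game", "the senate passed the bill",
            "the striker scored twice", "voters went to the polls"]
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(counts)  # rows sum to 1: partial class membership
    print(doc_topics.round(2))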
A couple questions to consider while you're looking at clustering systems:
Do you want to allow fractional class membership - e.g. consider an article discussing the economic outlook and its potential effect on the presidential race; can that document belong partly to the 'economy' cluster and partly to the 'election' cluster? Some clustering algorithms allow partial class assignment and some do not
Do you want to create a set of classes manually (i.e., list out 'economy', 'sports', ...), or do you prefer to learn the set of classes from the data? Manual class labels may require more supervision (manual intervention), but if you choose to learn from the data, the 'labels' will likely not be meaningful to a human (e.g., class 1, class 2, etc.), and even the contents of the classes may not be terribly informative. That is, the learning algorithm will find similarities and cluster documents it considers similar, but the resulting clusters may not match your idea of what a 'good' class should contain.
Your approach seems sensible and there are two ways you can improve the tagging.
Use a known list of keywords/phrases for your tagging and if the count of the instances of this word/phrase is greater than a threshold (probably based on the length of the article) then include the tag.
Use a part-of-speech tagging algorithm to help reduce the article to a sensible set of phrases, and use a sensible method to extract tags from those. Once you have the articles reduced with such an algorithm, you will be able to identify some good candidate words/phrases to use in your keyword/phrase list for method 1.
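A sketch of method 2 with NLTK's part-of-speech tagger and a simple chunk grammar (the grammar, the sentence, and the candidate output are illustrative):

    # Sketch: keep noun phrases as tag candidates instead of raw word counts.
    import nltk
    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")  # one-time

    sentence = "The government announced a new climate policy in Copenhagen."
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    grammar = "NP: {<JJ>*<NN.*>+}"   # optional adjectives followed by nouns
    tree = nltk.RegexpParser(grammar).parse(tagged)
    candidates = [" ".join(word for word, tag in subtree.leaves())
                  for subtree in tree.subtrees() if subtree.label() == "NP"]
    print(candidates)  # e.g. ['government', 'new climate policy', 'Copenhagen']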
If the content is an image or video, please check out the following blog article:
http://scottge.net/2015/06/30/automatic-image-and-video-tagging/
There are basically two approaches to automatically extract keywords from images and videos.
Multiple Instance Learning (MIL)
Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and the variants
In the above blog article, I list the latest research papers to illustrate the solutions. Some of them even include demo sites and source code.
If the content is a large text document, please check out this blog article:
Best Key Phrase Extraction APIs in the Market
http://scottge.net/2015/06/13/best-key-phrase-extraction-apis-in-the-market/
Thanks, Scott
Assuming you have pre-defined set of tags, you can use the Elasticsearch Percolator API like this answer suggests:
Elasticsearch - use a "tags" index to discover all tags in a given string
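Roughly, the percolator stores one query per tag and then "percolates" each article through them; in the sketch below, the index name, field names, and tag query are made up, and the API details vary across Elasticsearch versions:

    # Sketch: register a query per tag, then ask which tag queries match an article.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    es.indices.create(index="tags", body={
        "mappings": {"properties": {
            "query": {"type": "percolator"},
            "body": {"type": "text"}
        }}
    })
    es.index(index="tags", id="economy", refresh=True,
             body={"query": {"match": {"body": "budget economy inflation"}}})
    resp = es.search(index="tags", body={
        "query": {"percolate": {"field": "query",
                                "document": {"body": "parliament passed the budget"}}}
    })
    print([hit["_id"] for hit in resp["hits"]["hits"]])  # -> ['economy']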
Are you talking about named-entity recognition? If so, Anupam Jain is right: it's a research problem, typically tackled with deep learning and CRFs. As of 2017, work on named-entity recognition focuses on semi-supervised learning techniques.
The link below is a related NER paper:
http://ai2-website.s3.amazonaws.com/publications/semi-supervised-sequence.pdf
Also, the link below is about key-phrase extraction on Twitter:
http://jkx.fudan.edu.cn/~qzhang/paper/keyphrase.emnlp2016.pdf

Check if NSString contains a common first name on iPhone

I am wondering what the best approach would be to check whether or not a common first name is contained within an NSString on an iPhone app. I've got a sorted flat text file of ~5500 common American first names delimited by new lines. The NSString I am searching within for a name is not very long, most likely the size of a normal sentence.
My original plan was to load the sorted list into memory and then iterate over every word in the NSString performing a binary search of the list to determine whether or not that word was a common name.
Am I better off trying to put this name list into Core Data or an SQLite table and performing a query with that? My understanding is that I would not have to load the entire list into memory if I went that route.
I am guessing this situation is a common problem with word dictionaries for word games, so I'm just wondering what the best practice is for fast lookups. Thanks!
SQLite sounds ideal for this in terms of both speed of lookup and minimising memory usage. It would also make it potentially possible to update the first name list over the internet if so desired.
Using Core Data (which is in effect an elaborate wrapper around SQLite) would be overkill in this instance, especially as you don't require its ORM-like capabilities.
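To illustrate the lookup shape (sketched in Python for brevity; on the iPhone you would go through the SQLite C API or a wrapper, and the table and file names here are assumptions):

    # Sketch: an indexed names table plus a per-word membership check.
    import sqlite3

    conn = sqlite3.connect("names.db")
    conn.execute("CREATE TABLE IF NOT EXISTS names (name TEXT PRIMARY KEY)")
    conn.executemany("INSERT OR IGNORE INTO names VALUES (?)",
                     [("paul",), ("mary",), ("john",)])

    def is_common_name(word):
        # The primary-key index makes this a B-tree lookup, not a table scan.
        row = conn.execute("SELECT 1 FROM names WHERE name = ?",
                           (word.lower(),)).fetchone()
        return row is not None

    print([w for w in "I saw Paul at the store".split() if is_common_name(w)])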
An NSSet might be useful as well. Dave DeLong's answer to another question demonstrates that NSSet has constant look-up time, i.e. O(1).
Load your names into an NSMutableSet one by one. This will be the slowest part, but it only needs to be done once. If your file is a simple line-delimited file of names, it may be easier to use the standard C library to read it, since line-by-line input is not well supported by Cocoa.
After that, simply use [nameSet containsObject:name] to check whether it is in the list.
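The same idea sketched in Python, assuming the line-delimited names.txt from the question (build the set once, then every membership check is constant time):

    # Sketch: one-time set construction, then O(1) lookups per word.
    with open("names.txt") as f:
        name_set = {line.strip() for line in f}   # analogous to the NSMutableSet

    sentence = "I saw Paul at the store"
    print([word for word in sentence.split() if word in name_set])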
A couple of drawbacks to this approach:
The name you want to test must be in the same case as the name in the set; that is, “paul” and “Paul” are different strings. You can circumvent this by converting all names to lowercase before inserting them into the set, and then also converting the name you want to check to lowercase before testing it against the set.
It might be easier just to go with the already-accepted answer.