Check if NSString contains a common first name on iPhone

I am wondering what the best approach would be to check whether an NSString contains a common first name in an iPhone app. I've got a sorted flat text file of ~5500 common American first names, delimited by newlines. The NSString I am searching is not very long, most likely the length of a normal sentence.
My original plan was to load the sorted list into memory and then iterate over every word in the NSString, performing a binary search of the list to determine whether or not that word was a common name.
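For concreteness, that plan could be sketched roughly as follows in plain C, which compiles unchanged inside an Objective-C source file. The tiny kNames array is a hypothetical stand-in for the real name file (it must be sorted and lowercase), and sentence_contains_name is an illustrative name, not an existing API.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the ~5500-name file: sorted and lowercase. */
static const char *const kNames[] = { "anna", "john", "mary", "paul", "zoe" };
static const size_t kNameCount = sizeof kNames / sizeof kNames[0];

/* bsearch passes a pointer to the key and a pointer to an array element. */
static int compare_name(const void *key, const void *element) {
    return strcmp((const char *)key, *(const char *const *)element);
}

/* Returns 1 if any word in `sentence` is in the sorted name list, else 0.
   Words are runs of ASCII letters; words longer than 63 letters are truncated. */
int sentence_contains_name(const char *sentence) {
    char word[64];
    size_t len = 0;
    for (const char *p = sentence; ; ++p) {
        if (isalpha((unsigned char)*p)) {
            if (len < sizeof word - 1)
                word[len++] = (char)tolower((unsigned char)*p);
        } else {
            if (len > 0) {                     /* a word just ended: look it up */
                word[len] = '\0';
                if (bsearch(word, kNames, kNameCount, sizeof kNames[0], compare_name))
                    return 1;
                len = 0;
            }
            if (*p == '\0')
                break;
        }
    }
    return 0;
}
```

Each lookup costs about log2(5500), roughly 13 string comparisons, so a sentence-length input needs only a few hundred comparisons in total.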
Am I better off putting this name list into Core Data or a SQLite table and performing a query against that? My understanding is that I would not have to load the entire list into memory if I went that route.
I am guessing this situation is a common problem with word dictionaries for word games, so I'm just wondering what the best practice is for fast lookups. Thanks!

SQLite sounds ideal for this in terms of both speed of lookup and minimising memory usage. It would also make it potentially possible to update the first name list over the internet if so desired.
Using Core Data (which is in effect an elaborate wrapper around SQLite) would be overkill in this instance, especially as you don't require the ORM-like capabilities.

An NSSet might be useful as well. Dave DeLong's answer for another question demonstrates that NSSets have constant look-up times, i.e. O(1).
Load your names into an NSMutableSet one by one. This will be the slowest part but will only need to be done once. If your file is a simple line-delimited file of names, it may be easier to use the standard C library for reading the file, since line-by-line input is not well-supported by Cocoa.
After that, simply use [nameSet containsObject:name] to check whether it is in the list.
A couple of drawbacks to this approach:
The name you want to test must be in the same case as the name in the set; that is, “paul” and “Paul” are different strings. You can circumvent this by converting all names to lowercase before inserting them into the set, and then also converting the name you want to check to lowercase before checking it against the set.
It might be easier just to go with the already-accepted answer.
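A minimal sketch of the loading step described above, using the C standard library and lowercasing each name as it is read. MAX_NAME_LEN, lowercase_in_place and load_names are hypothetical names chosen for illustration, and the sketch assumes ASCII names shorter than MAX_NAME_LEN - 1 characters. In the app, each resulting string would then be wrapped in an NSString and added to the NSMutableSet.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_NAME_LEN 32  /* assumed upper bound on one line, including the terminator */

/* Lowercases an ASCII string in place so "Paul" and "paul" compare equal. */
void lowercase_in_place(char *s) {
    for (; *s; ++s)
        *s = (char)tolower((unsigned char)*s);
}

/* Reads up to `capacity` newline-delimited names from `fp` into `names`,
   lowercasing each one. Returns the number of names stored. */
size_t load_names(FILE *fp, char names[][MAX_NAME_LEN], size_t capacity) {
    size_t count = 0;
    char line[MAX_NAME_LEN];
    while (count < capacity && fgets(line, sizeof line, fp)) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip the line ending */
        if (line[0] == '\0')
            continue;                          /* skip blank lines */
        lowercase_in_place(line);
        strcpy(names[count++], line);
    }
    return count;
}
```

The same lowercase_in_place call would be applied to each word being tested before the containsObject: check.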

Related

How to create and search in a language dictionary

I'm working on an Xcode project.
I need to add a dictionary of words, like English dictionary for example. Then, when required, I need to find some words in it.
I have been thinking about creating an array from a file to store all the words, then sorting it in some way (I don't know how yet), and finally doing a binary search or something similar to find the word I'm looking for.
Do you think this is a good idea?
Do you have any suggestions for sorting the array?
Is binary search a good idea?
The best way would be to use a database (e.g. SQL-based) or Core Data to store it. That is, unless you require the words to be entirely in memory, which would be the case if you often listed all of them. But even then it can be solved by lazy-loading.

Thoughts on data model

For my app I thought of two different data models, but I cannot see which one would be the best both in performance and filesize. In my app I have to store Recipes, which will consist of an array with ingredients, an array with instructions, an array with tips and some properties to select some recipes (e.g. a rating, type of dish).
I thought of two different models. The first would be to convert the arrays to NSData and store them all in the Core Data model. As the arrays are localized, there will be multiple arrays of the same kind in there (e.g. instructionsEN, instructionsFR, instructionsNL). As it is not necessary to query the arrays, I'm happy with the fact that I have to convert the arrays to NSData.
The other model would be a Core Data model that only contains the properties to filter a recipe, and an identifier for a .plist file that is stored in the main bundle or the documents directory (as some of these files will be created by us, and some are created by the user). This .plist file will contain all the instructions, ingredients etc. Again, there are multiple arrays of the same kind for different localizations.
I hope you can help me decide which of these options would be best in terms of performance and disk space. I would also appreciate it if you could think of a different solution.
If you're going to use Core Data, you should generally go all the way. In that case, you would have an NSManagedObject Ingredient. I would probably put a method on Ingredient like stringValueForLocale: that would take care of returning the best value. This means that a given ingredient can be translated once and is reusable for all recipes.
You would then have a Component entity that would have an Ingredient, a quantity value and a unit. A Recipe would have a 1:M property components that would point to these. Component should likely have an englishDescription as well, which would return a printable value like "1/4c sugar" while frenchDescription might print "50g de sucre" (note the volume/mass conversion there; Component is probably where you'd manage this.)
Instructions are a bit different, since they are less likely to be reusable. I guess you might get lucky and "Beat eggs to hard peaks." might show up in several recipes, but unless you're going to actively look for those kinds of reuse, it's probably more trouble than it's worth. Instructions are also the natural place to address cultural differences. In France, eggs are often stored at room temperature. In America, they are always refrigerated. To correctly translate a French recipe to American English, you sometimes have to include an extra step like "bring eggs to room temperature." (But it depends on the recipe, since it doesn't always matter.) It generally makes sense to do this in the instructions rather than in the Ingredients.
I'd probably create an Instructions entity with stringValuesForLocale: (that would return an array of strings). Then you could do some profiling and decide whether to break this up into separate LocalizedInstructions entities so that you didn't have to fault all of the localizations. The advantage of this design is that you can change your mind later about the internal database layout, and it doesn't impact higher levels. In either case, however, I'd probably store the actual instructions as an NSData encoding an NSArray. It's probably not worth the trouble and cost of creating a bunch of individual LocalizedInstruction entities.

Implement "Did you mean?" with Core Data

I'm working on an iOS app. I have a Core Data database with a lot of company names.
When the user enters a company name that does not exist, I would like to show "similar" company names. For example, if the user entered "Aple", I would like to show "Did you mean Apple?".
I know that the technique of finding strings that match a pattern approximately (rather than exactly) is called approximate string matching or, colloquially, fuzzy string searching.
In theory, there are many algorithms, more or less valid: the Levenshtein distance computing algorithm and so on.
But in practice, is there someone who has already implemented something similar that can be used easily with core data?
I found a solution. Use this NSString's category available on GitHub: NSString-DamerauLevenshtein.
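For readers who want to see what such a distance computes, here is a minimal sketch of the plain Levenshtein distance in C. It is not the code from that category and it is not the Damerau variant (it does not count a transposition as a single edit). The function name levenshtein is illustrative, and it compares bytes, so it assumes ASCII input.

```c
#include <stdlib.h>
#include <string.h>

static size_t min3(size_t a, size_t b, size_t c) {
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Two-row dynamic-programming Levenshtein distance over bytes.
   Returns (size_t)-1 if memory allocation fails. */
size_t levenshtein(const char *a, const char *b) {
    size_t la = strlen(a), lb = strlen(b);
    size_t *prev = malloc((lb + 1) * sizeof *prev);
    size_t *curr = malloc((lb + 1) * sizeof *curr);
    if (!prev || !curr) {
        free(prev);
        free(curr);
        return (size_t)-1;
    }

    for (size_t j = 0; j <= lb; ++j)
        prev[j] = j;                 /* turning "" into b[0..j) takes j insertions */

    for (size_t i = 1; i <= la; ++i) {
        curr[0] = i;                 /* turning a[0..i) into "" takes i deletions */
        for (size_t j = 1; j <= lb; ++j) {
            size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = min3(prev[j] + 1,          /* deletion */
                           curr[j - 1] + 1,      /* insertion */
                           prev[j - 1] + cost);  /* substitution or match */
        }
        size_t *tmp = prev;          /* the row just filled becomes the previous row */
        prev = curr;
        curr = tmp;
    }

    size_t result = prev[lb];
    free(prev);
    free(curr);
    return result;
}
```

A "Did you mean?" feature would compute this against candidate names and suggest those within a small distance, for example 1 or 2. Running it over every company name is linear in the size of the database, so some pre-filtering is likely needed for large data sets.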
Try looking at Soundex. SQLite provides a soundex() function, but only when it is built with the SQLITE_SOUNDEX compile-time option, so check that it is available if SQLite is your underlying data store.

Fastest method of checking if multiple different strings are a substring of a 2nd string

Context:
I'm creating a program which will sort and rename my media files which are named e.g. The.Office.s04e03.DIVX.WaREZKiNG.avi into an organized folder structure, which will consist of a list of folders for each TV Series, each folder will have a list of folders for the seasons, and those folders will contain the media files.
The problem:
I am unsure what the best method is for reading a file name and determining which part of that name is the TV show. For example, in "The.Office.s04e03.DIVX.WaREZKiNG.avi", The Office is the name of the series. I decided to have a list of all TV shows and to check whether each TV show is a substring of the file name, but as far as I know this means I have to check every single series against the name for every file.
My question: How should I determine if a string contains one of many other strings?
Thanks
The Aho-Corasick algorithm[1] efficiently solves the "does this possibly long string exactly contain any of these many short strings" problem.
However, I suspect this isn't really the problem you want to solve. It seems to me that you want something to extract the likely components from a string that is in one of possibly many different formats. I suspect that having a few different regexps for likely providers, video formats, season/episode markers, perhaps a database of show names, etc, is really what you want. Then you can independently run these different 'information extractors' on your filenames to pull out their structure.
[1] http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
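As an illustration of one such "information extractor", here is a rough C sketch that handles only the sNNeNN season/episode marker from the example file name. It uses a hand-rolled scan rather than a regexp library, and parse_episode and its parameters are hypothetical names. Other formats (1x03, "Season 4 Episode 3", and so on) would each need their own extractor.

```c
#include <ctype.h>
#include <stddef.h>

/* Finds the first "sNNeNN" marker (case-insensitive, two digits each) in
   `filename`. On success, writes the text before the marker into `series`
   with '.' turned into ' ', stores the two numbers, and returns 1.
   Returns 0 if no marker is found. */
int parse_episode(const char *filename, char *series, size_t series_size,
                  int *season, int *episode) {
    if (series_size == 0)
        return 0;
    for (size_t i = 0; filename[i] != '\0'; ++i) {
        const unsigned char *m = (const unsigned char *)filename + i;
        /* Short-circuit evaluation stops at the terminator, so no read goes past it. */
        if (tolower(m[0]) == 's' && isdigit(m[1]) && isdigit(m[2]) &&
            tolower(m[3]) == 'e' && isdigit(m[4]) && isdigit(m[5])) {
            size_t n = 0;
            for (size_t k = 0; k < i && n + 1 < series_size; ++k)
                series[n++] = (filename[k] == '.') ? ' ' : filename[k];
            while (n > 0 && series[n - 1] == ' ')
                --n;                       /* drop the separator before the marker */
            series[n] = '\0';
            *season = (m[1] - '0') * 10 + (m[2] - '0');
            *episode = (m[4] - '0') * 10 + (m[5] - '0');
            return 1;
        }
    }
    return 0;
}
```

The extracted series text can then be compared against a list of known show names, which is a much smaller problem than testing every show against every file name.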
It depends on the overall structure of the filenames. For instance, is the series name always first? If so, a tree structure would work well. Is there a standard separator between words (a period in your example)? If so, you can split the string on it and create a case-insensitive hashtable of interesting words to boost performance.
Extracting seasons and episodes is more difficult, however. A simple solution would be to implement an algorithm to handle each format you uncover, although by using hints you could create an interesting parser if you wanted to. (That is likely overkill, however.)

Optimizing a Cocoa/Objective-C search

I'm performing a search of a large plist file which contains dictionaries, tens of thousands of them, each with 2 key/string pairs. My search algorithm goes through the dictionaries, and when it finds a text match in either of the strings in a dictionary, the contents of that dictionary are inserted into the results. Here is how it works:
NSDictionary *eachEntry;
NSArray *rawGlossaryArray = [[NSArray alloc] initWithContentsOfFile:thePath]; // this contains the contents of the plist
for (eachEntry in rawGlossaryArray)
{
    GlossaryEntry *anEntry = [[GlossaryEntry alloc] initWithDictionary:eachEntry];
    NSRange titleResultsRange = [anEntry.title rangeOfString:filterString options:NSCaseInsensitiveSearch];
    NSRange defResultsRange = [anEntry.definition rangeOfString:filterString options:NSCaseInsensitiveSearch];
    if (titleResultsRange.length > 0 || defResultsRange.length > 0) {
        // store that item in the glossary dictionary with the name as the key
        [glossaryDictionary setObject:anEntry forKey:anEntry.title];
    }
    [anEntry release];
}
Each time a search is performed, there is a delay of around 3-4 seconds in my iPhone app (on the device at least; everything runs pretty quickly in the simulator). Can anyone advise on how I might optimize this search?
Without looking at the data set I can't be sure, but if you profile it I expect you will find that you are spending the vast majority of your time in -rangeOfString:options:. If that is the case, you will not be able to improve performance without fundamentally changing the data structure you are using to store your data.
You might want to construct some sort of trie, with strings and substrings pointing to the objects. It is a much more complicated thing to set up, and insertions into it will be more expensive, but lookups would be very fast. Given that you are serializing out the structure anyway, expensive inserts should not be much of an issue.
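A minimal sketch of that idea in C, under some loud assumptions: only ASCII letters are indexed, matches do not span word boundaries, and the trie answers only "does any indexed text contain this substring". A real index would also store, at each node, the ids of the entries that pass through it. All names here (TrieNode, trie_index_text, trie_contains) are illustrative.

```c
#include <ctype.h>
#include <stdlib.h>

#define ALPHABET 26   /* assumption: only ASCII letters are indexed */

typedef struct TrieNode {
    struct TrieNode *child[ALPHABET];
} TrieNode;

TrieNode *trie_new(void) {
    return calloc(1, sizeof(TrieNode));
}

/* Inserts the run of letters starting at `s`; stops at the first non-letter. */
static int trie_insert(TrieNode *node, const char *s) {
    for (; *s; ++s) {
        int c = tolower((unsigned char)*s) - 'a';
        if (c < 0 || c >= ALPHABET)
            break;
        if (!node->child[c]) {
            node->child[c] = trie_new();
            if (!node->child[c])
                return 0;                  /* out of memory */
        }
        node = node->child[c];
    }
    return 1;
}

/* Indexes every suffix of `text`, so that any substring of a word in `text`
   is a prefix of some inserted suffix. */
int trie_index_text(TrieNode *root, const char *text) {
    for (; *text; ++text)
        if (!trie_insert(root, text))
            return 0;
    return 1;
}

/* Returns 1 if `query` (letters only) occurs inside any indexed word. */
int trie_contains(const TrieNode *node, const char *query) {
    for (; *query; ++query) {
        int c = tolower((unsigned char)*query) - 'a';
        if (c < 0 || c >= ALPHABET || !node->child[c])
            return 0;
        node = node->child[c];
    }
    return 1;
}

void trie_free(TrieNode *node) {
    if (!node)
        return;
    for (int i = 0; i < ALPHABET; ++i)
        trie_free(node->child[i]);
    free(node);
}
```

A lookup walks one node per query character regardless of how many entries are indexed, which is where the speed comes from. The cost is memory, since every suffix of every word is stored.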
That just cries out for a database that you pre-populate and bundle with the application.
A few suggestions:
You're doing a lot of allocing and releasing in that loop. Could you create a single GlossaryEntry before the loop, then just reload its contents inside the loop? This would avoid a bunch of alloc/releases.
Rather than loading the file each time, could you lazy load it once and keep it cached in memory (maybe in a singleton type object)? Generally this isn't a good idea on the iPhone, but you could have some code in your "didReceiveMemoryWarning" handler that would free the cache if it became an issue.
You should run your application in Instruments and see what the bottleneck really is. Optimizing in the blind is really difficult, and the tools that make the bottlenecks clear are good.
There's also the possibility that this isn't optimizable. I'm not sure if it's actually hanging the UI in your app or just taking a long time. If it's blocking the UI you need to get out of the main thread to do this work. Same with any significant work to keep an app responsive.
Try the following, and see if you get any improvement:
1) use
- (NSRange)rangeOfString:(NSString *)aString options:(NSStringCompareOptions)mask
and as mask, pass the value NSLiteralSearch. This may speed up the search considerably, as described in the Apple documentation (String Programming Guide for Cocoa):
NSLiteralSearch Performs a byte-for-byte comparison. Differing literal sequences (such as composed character sequences) that would otherwise be considered equivalent are considered not to match. Using this option can speed some operations dramatically.
2) From the documentation (String Programming Guide for Cocoa):
If you simply want to determine whether a string contains a given pattern, you can use a predicate:
BOOL match = [myPredicate evaluateWithObject:myString];
For more about predicates, see Predicate Programming Guide.
You're probably getting the best performance you're likely to get, given your current data structures. You need to change how you're accessing the data, in order to get better performance.
Suggestions, in no particular order:
Don't create your GlossaryEntry objects in a loop while you're filtering them. Rather than storing the data in a Property List, just archive your array of GlossaryEntry objects. See the NSCoding documentation.
Rather than searching through tens of thousands of strings at every keystroke, generate an index of common substrings (maybe 2 or 3 letters), and create an NSDictionary that maps from that common substring to the set of results to use as an index. You can create the index at build time, rather than at run-time. If you can slice up your data set into several smaller pieces, the linear search for matching strings will be considerably faster.
Store your data in an SQLite database, and use SQL to query it - probably overkill for just this problem, but allows for more sophisticated searches in the future, if you'll need them.
If creating a simple index doesn't work well enough, you'll need to create a search tree style data structure.
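To make the substring-index suggestion concrete, here is a rough sketch in C of a two-letter index. It assumes the entry texts are already lowercase ASCII and that entries are added one at a time in increasing id order. BigramIndex, index_add and index_search are hypothetical names, and this sketch builds the index at run time rather than at build time.

```c
#include <stdlib.h>
#include <string.h>

#define LETTERS 26

typedef struct {
    size_t *ids;       /* ids of entries whose text contains this letter pair */
    size_t count;
    size_t capacity;
} Bucket;

typedef struct {
    Bucket buckets[LETTERS * LETTERS];   /* must start zero-initialized */
} BigramIndex;

/* Maps two lowercase ASCII letters to a bucket number, or -1 for anything else. */
static int bigram_slot(char a, char b) {
    if (a < 'a' || a > 'z' || b < 'a' || b > 'z')
        return -1;
    return (a - 'a') * LETTERS + (b - 'a');
}

/* Adds entry `id` under every letter pair in `text`. Returns 0 on allocation failure. */
int index_add(BigramIndex *idx, size_t id, const char *text) {
    for (size_t i = 0; text[i] && text[i + 1]; ++i) {
        int slot = bigram_slot(text[i], text[i + 1]);
        if (slot < 0)
            continue;
        Bucket *b = &idx->buckets[slot];
        if (b->count > 0 && b->ids[b->count - 1] == id)
            continue;                      /* this entry is already listed here */
        if (b->count == b->capacity) {
            size_t cap = b->capacity ? b->capacity * 2 : 8;
            size_t *grown = realloc(b->ids, cap * sizeof *grown);
            if (!grown)
                return 0;
            b->ids = grown;
            b->capacity = cap;
        }
        b->ids[b->count++] = id;
    }
    return 1;
}

/* Writes up to `max` ids of entries containing `query` into `out` and returns
   how many were written. `texts` is the array that was indexed. Queries that
   do not start with two lowercase letters return 0 here and would need a
   plain linear scan instead. */
size_t index_search(const BigramIndex *idx, const char *const *texts,
                    const char *query, size_t *out, size_t max) {
    if (!query[0] || !query[1])
        return 0;
    int slot = bigram_slot(query[0], query[1]);
    if (slot < 0)
        return 0;
    const Bucket *b = &idx->buckets[slot];
    size_t found = 0;
    for (size_t i = 0; i < b->count && found < max; ++i)
        if (strstr(texts[b->ids[i]], query))   /* confirm on the short candidate list */
            out[found++] = b->ids[i];
    return found;
}
```

The expensive substring check then runs only over entries that share the query's first two letters, instead of over all tens of thousands of entries.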
You should profile it in Instruments to find where the bottleneck actually is. If I had to guess, I would say the bottleneck is [[NSArray alloc] initWithContentsOfFile:thePath].
Having said that, you'd probably get the best performance by storing the data in an SQLite database (which you would search with SQL) instead of using a plist.