What is the fastest way to search through strings in Objective-C? - iphone

I am implementing a sort of autocomplete for an iOS app. The data I am using for the autocomplete values is a comma-separated text file with about 100,000 strings. This is what I am doing now:
Read the text file, and create an NSArray with 100,000 NSString.
As the user types, do [array containsObject:text]
Surely there is a better/faster way to do this lookup. Any thoughts?

Absolutely, there is! It's not "in Objective-C" though: most likely, you would need to code it yourself.
The idea is to convert your list of strings to a trie (prefix tree), a data structure that lets you search by prefix very quickly. Searching for possible completions in a trie is very fast, but the structure itself takes some work to build. A quick search on the internet revealed that there is no readily available implementation in Objective-C, but you may be able to port an implementation in another language, use a C implementation, or even write your own if you are not particularly pressed for time.
Perhaps an easier approach would be to sort your strings alphabetically, and run a binary search on the prefix that has been entered so far. Though not as efficient as a trie, the sorted array approach will be acceptable for 100K strings, because binary search gets to the right spot in about seventeen comparisons (log2 of 100,000 is roughly 16.6).
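A prefix tree (trie) is the structure that answers prefix queries directly. Here is a minimal sketch of one, written in Python only to keep it short; the names build_trie, trie_completions and END are made up for illustration, and an Objective-C port would likely use nested NSMutableDictionary objects instead of dicts.

```python
END = ""  # sentinel key marking "a word ends here"; never collides with a 1-char key


def build_trie(words):
    """Build a nested-dict trie: one dict level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word  # store the full word at its terminal node
    return root


def trie_completions(root, prefix, limit=10):
    """Return up to `limit` stored words that start with `prefix`, in sorted order."""
    node = root
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return []  # no stored word has this prefix
    out = []
    stack = [node]
    while stack and len(out) < limit:
        cur = stack.pop()
        if END in cur:
            out.append(cur[END])
        # Push children in reverse order so the smallest character is popped first.
        for ch in sorted((k for k in cur if k != END), reverse=True):
            stack.append(cur[ch])
    return out
```

Walking down the prefix costs one dictionary lookup per typed character, regardless of how many strings are stored; the price is noticeably more memory than a flat sorted array.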

The simplest is probably binary search. See -[NSArray indexOfObject:inSortedRange:options:usingComparator:].
In particular, I'd try something like this:
Pre-sort the array that you save to the file
When you load the array, possibly re-sort it with sortedArrayUsingSelector:@selector(compare:) (if you are worried about it being accidentally unsorted or the Unicode sort order changing for some edge cases). This should be approximately O(n) assuming the array is mostly sorted already.
To find the first potential match, [array indexOfObject:searchString inSortedRange:(NSRange){0,[array count]} options:NSBinarySearchingInsertionIndex|NSBinarySearchingFirstEqual usingComparator:^NSComparisonResult(id a, id b) { return [a compare:b]; }] (note that usingComparator: takes a block, not a selector)
Walk down the array until the entries no longer contain searchString as a prefix. You probably want to do case/diacritic/width-insensitive comparisons to determine whether it is a prefix (NSAnchoredSearch|NSCaseInsensitiveSearch|NSDiacriticInsensitiveSearch|NSWidthInsensitiveSearch)
This may not "correctly" handle all locales (Turkish in particular), but neither will replacing compare: with localizedCompare:, nor will naïve string-folding. (It is only 9 lines long, but took about a day of work time to get right and has about 40 lines of code and 200 lines of test, so I probably shouldn't share it here.)
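The sort, binary-search, then walk steps can be sketched as follows. This is Python rather than Objective-C, purely for brevity; build_index and completions are hypothetical names, and str.casefold() stands in for the case-insensitive comparison (it does not cover diacritic or width folding, nor locale-specific rules such as Turkish).

```python
import bisect


def build_index(words):
    """Sort once by a case-folded key; keep the original spelling alongside."""
    pairs = sorted((w.casefold(), w) for w in words)
    keys = [k for k, _ in pairs]
    originals = [w for _, w in pairs]
    return keys, originals


def completions(index, prefix, limit=10):
    """Binary-search for the first key >= prefix, then walk while the prefix matches."""
    keys, originals = index
    p = prefix.casefold()
    i = bisect.bisect_left(keys, p)  # insertion index, like NSBinarySearchingInsertionIndex
    out = []
    while i < len(keys) and keys[i].startswith(p) and len(out) < limit:
        out.append(originals[i])
        i += 1
    return out
```

The index is built once at load time; each keystroke then costs one binary search plus a short linear walk over the matches that are actually returned.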

Related

Lisp - Extracting info from a list of comma separated values

I've tried searching this but have yet to find something that suits anything close to my needs. I'm trying to create an AutoCAD LISP routine that takes a text file, which is a list of comma-separated values, and places a block at coordinates defined by the list. BUT, only for items on the list where the last entry starts with "HP"
That sounds a bit complex, but the text file is basically a UTM survey output, and looks like this:
1000,Easting,Northing,Elevation,Identifier
1001,Easting,Northing,Elevation,Identifier
Etc.
The identifier is a variety of values, but I want to extract the Northing, Easting, and Elevation, and insert a block (this last part I've got) at that location when the identifier begins with "HP". The list can be long and the number of HPs can be anywhere from 1 to 5000. I'm assuming there's a "for x=1:end, do" type of loop that can be made that reuses the same variables over and over.
I'm a newbie to LISP so I'm stuck in that spot between "here are I've-never-programmed-before tutorials to make hello world" and "here is a library of the 3000 different commands in alphabetical order"
I believe the functions you need to solve this are open, read-line or read-char, close, strlen, and substr. The first four relate to reading and writing files in AutoLISP. The last two manipulate the string variables that were pulled from the file. With them, you can find the "HP" within the text. To loop through the same code, three functions come to mind: repeat, while, and foreach.
For a list of functions to quickly reference with their descriptions, here's a good starting point. This particular page has the information broken up by category instead of alphabetical order.
https://help.solidworks.com/2022/English/api/draftsightlispreference/html/lisp_functions_overview.htm
Here are a few tutorials where AutoLisp code is used to write and read other files:
https://www.afralisp.net/autolisp/tutorials/file-handling.php
https://www.afralisp.net/autolisp/tutorials/external-data.php
Lastly, here's an example of AutoLisp writing and reading attributes from and to blocks.
https://github.com/GitHubUser5376/AttributeImportExport
You can use Lee Mac's Read CSV function to get a list of the CSV values.
For the "HP" detection, yes, you will have to go through each record (using one of the loop options mentioned above: while, repeat, or foreach) and use
(substr Identifier 1 2)
to check whether the first two characters are "HP".
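The overall flow (read each line, split on commas, keep only the rows whose identifier starts with "HP") is the same in any language. Here it is sketched in Python just to show the logic; hp_points is a made-up name, and in AutoLISP the same steps would go through open, read-line and substr as described above.

```python
import csv


def hp_points(lines):
    """Return (easting, northing, elevation) for rows whose identifier starts with 'HP'."""
    points = []
    for row in csv.reader(lines):
        if len(row) < 5:
            continue  # skip blank or short lines
        _number, easting, northing, elevation, identifier = row[:5]
        # Same check as (substr Identifier 1 2) compared against "HP".
        if identifier.strip().startswith("HP"):
            points.append((float(easting), float(northing), float(elevation)))
    return points
```

Each returned tuple is one insertion point for the block; the loop reuses the same variables on every pass, which is what the question was asking for.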

Is there a Regex to search for all arrays that are present in a project?

I am working on a large project in Xcode. I want to search, using the Find Navigator, for all arrays regardless of their name. I only care about any array that has this format, someArray[index].
Some Examples That Should Match
people[12]
section[0].rows[0]
Should Not Match
people[index]
section[section].row[row]
The regex should only return arrays, it should not return any dictionaries or other types that are not a subscripted array.
Why am I doing this? Well, it appears there have been some issues within our app where devs have not properly handled index out of bounds errors or nil values. There are far too many arrays for me to manually go through line by line to find them, so this is the best option I've come up with and it may not even be possible. If anyone has other recommendations, please feel free to share.
You can create a regex to match any word followed by another word with optional period enclosed by brackets. Something like:
\w+\[\w+(\.\w+)?\]
For more info about the regex above you can check this link
For numbers only use \d+ instead:
\w+\[\d+\]
For more info about the regex above you can check this link
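A quick way to sanity-check the digits-only pattern against the examples from the question, here using Python's re module (Xcode's Find Navigator uses a different regex engine, but \w, \d and escaped brackets behave the same way for this pattern):

```python
import re

# Identifier followed by a literal numeric subscript, e.g. people[12]
NUMERIC_SUBSCRIPT = re.compile(r"\w+\[\d+\]")


def find_numeric_subscripts(source):
    """Return every identifier[digits] occurrence found in the given text."""
    return NUMERIC_SUBSCRIPT.findall(source)
```

Note that a regex only sees text: it cannot tell an array from a dictionary keyed by integers, so the hits still need a quick manual review.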

Tokenize the mmaped string using strtok

I have mmapped a huge file containing 4k * 4k floats. Since it is a text file, I need to mmap it as a char string. Now I need to parse the floats and write them into a 2D array. If I tokenize it using strtok, it fails, since the mmapped string is not modifiable. If I copy the string into a std::string and then tokenize using the getline function, it works, but I feel I will lose the performance gained from mmap. How do I optimally solve this problem?
You can try some different solutions, but you will have to benchmark to find out which one is the best for you. It's not always clear that mmap()ing a file and processing the memory-mapped pages directly is the best solution. Especially if you make a single sequential pass through the file, a loop that read()s pieces at a time into a buffer can be faster, even if you use madvise() together with mmap(). Again, benchmark if you want to know what is fastest for you.
Some solutions you might try:
mmap() with PROT_WRITE and MAP_PRIVATE and then use your existing strtok() code. This will allow strtok() to write the NUL bytes it wants to write, without having those changes be reflected in the file. If you choose this solution, you should probably call madvise(MADV_DONTNEED) on the parts of the file you have already processed, else memory usage will grow linearly.
Implement your own variant of strtok() that returns the length of the matched token instead of a NUL-terminated string. It's not difficult, using memchr(). This way you don't need to modify the memory. You might then need to pass the resulting tokens to functions that take a string and a length instead of a NUL-terminated string. There aren't many such functions in the C library, but even so you might be able to get away with calling functions like strtod() if the tokens are guaranteed to end in some non-digit delimiter. Or you can copy them into a small stack-allocated buffer (they're floats, they can't be that long, right?)
Use a read()-and-process loop instead of mmap().
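The second option (a tokenizer that reports offset and length instead of writing NUL bytes) can be sketched like this. It is in Python to keep it short and is not tuned for speed; iter_tokens, parse_floats and the delimiter set are assumptions for illustration. In C the same loop would use memchr() or a hand-rolled scan plus strtod().

```python
import mmap

DELIMS = b" \t\r\n,"  # assumed delimiters; adjust to the real file format


def iter_tokens(buf):
    """Yield (start, length) for each run of non-delimiter bytes; buf is never modified."""
    n = len(buf)
    i = 0
    while i < n:
        while i < n and buf[i] in DELIMS:  # skip delimiters
            i += 1
        start = i
        while i < n and buf[i] not in DELIMS:  # scan the token
            i += 1
        if i > start:
            yield start, i - start


def parse_floats(path):
    """Map the file read-only and parse every token as a float."""
    with open(path, "rb") as f:
        # Note: mmap raises ValueError for an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the few bytes of each token are copied out for conversion.
            return [float(mm[s:s + length]) for s, length in iter_tokens(mm)]
```

Because the mapping is read-only and only small slices are copied, the file contents are never duplicated wholesale, which is the property the strtok() approach loses.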

Perl module for text comparison

Can anyone suggest a Perl module which can compare two strings and return a degree to which they match? I searched CPAN extensively, and although there are similar modules like String::Approx and Data::Compare, they are not what I am looking for. Suppose I have two strings : I love you, and I boht you. I want functionality which will compare these two strings, taking into account numerous parameters, the matching of words in correct order (love as the first word in a string should not "match" love as the 4th word in the 2nd string, even though both strings have that word), words not matching but spelt almost similarly (like say love and loge), number of words, etc and return an index, say a number from 0 to 1 on a scale of 1, representing the degree of similarity between the two strings. Is there any such Perl module?
There are many such modules. Often, though, you'll have to make use of them in some special way to account for your own assumptions. Most of the string comparison tools like this just implement some algorithm for comparing one string to another. Most assume that if you have specific policy decisions to make, you'll code them yourself.
Personally, I am not sure I'd recommend Text::Levenshtein because of bugs and lack of UTF-8 support. I don't have a better recommendation either, though.
However, these searches will reveal lots of potential modules you could look into and determine what works best for your purpose (based on the names of common algorithms for doing this sort of thing):
https://metacpan.org/search?q=levenshtein
https://metacpan.org/search?q=wagner+fischer
https://metacpan.org/search?q=edit+distance
If you're interested in spoken similarities, you can also look into phonetic comparisons:
https://metacpan.org/search?q=phonetic
https://metacpan.org/search?q=soundex
https://metacpan.org/search?q=metaphone
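To make the "code the policy yourself" point concrete, here is a minimal sketch of turning an edit distance into a 0-to-1 score. It is written in Python rather than Perl and is not taken from any CPAN module; the function names and the normalization (one minus distance over the longer length) are my own assumptions. The distance is the standard Wagner-Fischer dynamic program.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        cur = [i]
        for j, item_b in enumerate(b, 1):
            cost = 0 if item_a == item_b else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Map edit distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Passing strings compares character by character (similarity("love", "loge") is 0.75); passing "I love you".split() and "I boht you".split() compares word by word, which respects word order. Combining the two levels into a single index is exactly the kind of policy decision left to the caller.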

Test Suite for Double Metaphone?

I've translated Double-Metaphone into ActionScript3 and I want to test it (obviously) before I release the source to ... um ... the open.
I'm looking for a long list of names with the primary and secondary codes. Google does not find anything except one list with pairs of names (presumably they should match).
Thanks
You could find someone else's double metaphone implementation, run it on the same long list of words, and compare the results to your own.
For long lists of words, I like infochimps. They have lots of word lists, like this one of 350,000 English words or this one of place names, and many more.
Here are implementations you can compare your results against. Here is an online example, except that it tests only one word at a time - I guess you'll have to download and run one of the scripts to test a large list of words.
For each word, two codes will be returned; you'll probably want to test that both codes match the ones returned by another implementation. You probably know that the reference implementation is here with full source code here, but including the links anyway for others' benefit.
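The comparison itself is a small harness. Here is a sketch in Python, where `mine` and `reference` are placeholders for whatever two Double Metaphone functions you end up with (each is assumed to take a word and return a primary/secondary pair); neither name refers to a real library.

```python
def compare_implementations(words, mine, reference):
    """Return (word, my_codes, reference_codes) for every word where the two disagree."""
    mismatches = []
    for word in words:
        got = tuple(mine(word))
        expected = tuple(reference(word))
        if got != expected:  # both primary and secondary codes must match
            mismatches.append((word, got, expected))
    return mismatches
```

An empty result over a large word list is decent evidence that the port is faithful; any entries that do come back are ready-made regression test cases.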