Searching Array for similar spelled words/objects - iphone

I am looking for a solution similar to Google's "Did You Mean: Word"
I have an array of vehicles entered as
2011ChevroletMalibu
2011FordF150
2009FordProbe
etc...
In my app I have three text fields:
Year, Make, and Model.
When the user types in 2011 Chevrolet Malabu (notice that Malibu is misspelled) and hits search,
I'd like to reply "Did you mean: 2011 Chevrolet Malibu".
Does anyone have suggestions on how to "search for similar"? Thanks!

AFAIK there is no function in the iOS SDK that would solve this for you; however, there are many algorithms available that do just that.
Look out for:
Soundex (phonetic string comparison)
Levenshtein distance
Oliver 1993
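For illustration, here is a minimal Levenshtein distance sketch in Swift (classic dynamic programming; the function name and the example threshold are my own, not anything from the SDK):

// Classic dynamic-programming Levenshtein distance: the minimum number
// of single-character insertions, deletions and substitutions needed
// to turn one string into the other.
func levenshtein(_ a: String, _ b: String) -> Int {
    let s = Array(a), t = Array(b)
    if s.isEmpty { return t.count }
    if t.isEmpty { return s.count }
    var prev = Array(0...t.count)          // distances for the previous row
    var curr = [Int](repeating: 0, count: t.count + 1)
    for i in 1...s.count {
        curr[0] = i
        for j in 1...t.count {
            let cost = s[i - 1] == t[j - 1] ? 0 : 1
            curr[j] = min(prev[j] + 1,         // deletion
                          curr[j - 1] + 1,     // insertion
                          prev[j - 1] + cost)  // substitution
        }
        swap(&prev, &curr)
    }
    return prev[t.count]
}

levenshtein("Malabu", "Malibu")  // 1

For the "did you mean" behaviour, compute the distance from the query to every entry in your array and suggest the entry with the smallest distance below some threshold (say, 2 or 3 edits).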

Louie, I think it is not that simple. You need to use some kind of phonetic search. If your app uses data provided by a web service and you have MS SQL Server >= 2000 behind that web service, you can use the SOUNDEX function in your searches. But if you want to implement your own phonetic search, that is a big challenge.

Look at Soundex "hashes", or, if you have the CPU for it, the Levenshtein distance. Soundex will be cheaper to compute; the Levenshtein distance will give better results.
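For comparison, here is a simplified Soundex sketch in Swift. It skips standard Soundex's H/W separator rule, so treat it as an approximation rather than a faithful implementation:

// Simplified Soundex: maps a word to a 4-character code so that
// similar-sounding words collide, e.g. "Malibu" and "Malabu"
// both become "M410". (Standard Soundex's H/W rule is omitted.)
func soundex(_ word: String) -> String {
    let codes: [Character: Character] = [
        "B": "1", "F": "1", "P": "1", "V": "1",
        "C": "2", "G": "2", "J": "2", "K": "2",
        "Q": "2", "S": "2", "X": "2", "Z": "2",
        "D": "3", "T": "3", "L": "4",
        "M": "5", "N": "5", "R": "6"
    ]
    let letters = word.uppercased().filter { $0.isLetter }
    guard let first = letters.first else { return "" }
    var result = String(first)
    var lastCode = codes[first]
    for ch in letters.dropFirst() where result.count < 4 {
        if let code = codes[ch], code != lastCode {
            result.append(code)
        }
        lastCode = codes[ch]
    }
    while result.count < 4 { result.append("0") }
    return result
}

soundex("Malibu") == soundex("Malabu")  // true, both "M410"

A cheap strategy is to bucket your array by Soundex code first, then rank the candidates inside the matching bucket by Levenshtein distance.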

Scanning invoices using OCR in swift

I am currently working on scanning invoices with OCR scanning. All invoices use the "OCRB" font, and have the same formatting.
The bottom of a sample invoice looks like this
This is what the user needs to scan.
I have tried many different libraries to detect what I want, but most of them don't give me the correct result. The best results came from Firebase ML Vision text recognition.
But the resulting output I get is this:
I can check whether the values are correct, except for the amount presented in the middle. In this case it is presented as "3557 00", but if the user moves the camera a bit further to the right, the result I get is "557 00". Since both ML Kit and the other libraries cut around the word, I have no idea whether the full sum is present or not.
If I got a single space before the word, I could tell that there is a full "word", in this case a sum.
Does anyone have ideas about which library to use to get the best result?
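One library-agnostic way to guard against cropped frames: only accept the amount when a fixed token on each side of it was also recognized. A minimal sketch in Swift, assuming the amount field is bracketed by '#' delimiters (that bracketing is my assumption about this layout; substitute whatever fixed characters your invoices actually print around the amount):

import Foundation

// Only trust the scanned amount when both surrounding delimiters were
// recognized, so a cropped frame like "557 00" is rejected.
// The '#' delimiters are an assumption about this invoice layout.
func extractAmount(from recognizedText: String) -> (whole: Int, fraction: Int)? {
    let pattern = #"#\s*(\d+)\s+(\d{2})\s*#"#
    guard let regex = try? NSRegularExpression(pattern: pattern),
          let match = regex.firstMatch(
              in: recognizedText,
              range: NSRange(recognizedText.startIndex..., in: recognizedText)),
          let wholeRange = Range(match.range(at: 1), in: recognizedText),
          let fractionRange = Range(match.range(at: 2), in: recognizedText),
          let whole = Int(recognizedText[wholeRange]),
          let fraction = Int(recognizedText[fractionRange])
    else { return nil }  // a delimiter is missing: the frame was probably cropped
    return (whole, fraction)
}

extractAmount(from: "# 3557 00 #")  // (3557, 0)
extractAmount(from: "557 00")       // nil, leading delimiter lost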

Returning the number of results?

I'm looking to return the number of results found in an ajax fashion on Algolia instant search.
A little field saying something like "There are X results" that refines as the characters are typed.
I've read that you utilise 'nbHits', but I'm unsure how to go about it, being from a design background.
Thanks in advance for any help.
The instantsearch.js stats widget shows the number of results and the speed of the search. If you don't want to use the widget, I believe you can still use {{nbHits}} inside your template wherever you want the number to print.
Very easy when you know how. Thanks for pointing me in the right direction, Josh.
This works:
search.addWidget(
  instantsearch.widgets.stats({
    container: '.no-of-results'
  })
);

Unicode character usage statistics [closed]

I am looking for some statistical data on the usage of Unicode characters in textual documents (with any markup). Googling brought no results.
Background: I am currently developing a finite-state-machine-based text processing tool. Statistical data on characters might help in searching for the right transitions. For instance, Latin characters are probably the most used, so it might make sense to check for those first.
Has anyone by chance gathered or seen such statistics?
(I'm not focused on specific languages or locales. Think general-purpose parser like an XML parser.)
To sum up current findings and ideas:
Tom Christiansen gathered such statistics for the PubMed Open Access Corpus (see this question). I have asked whether he could share these statistics, and am waiting for the answer.
As @Boldewyn and @nwellnhof suggested, I could run the analysis on a complete Wikipedia dump or on CommonCrawl data. I think these are good suggestions; I'll probably go with CommonCrawl.
So sorry, this is not an answer, but a good research direction.
UPDATE: I have written a small Hadoop job and run it on one of the CommonCrawl segments. I have posted my results in a spreadsheet here. Below are the first 50 characters:
0x000020 14627262
0x000065 7492745 e
0x000061 5144406 a
0x000069 4791953 i
0x00006f 4717551 o
0x000074 4566615 t
0x00006e 4296796 n
0x000072 4293069 r
0x000073 4025542 s
0x00000a 3140215
0x00006c 2841723 l
0x000064 2132449 d
0x000063 2026755 c
0x000075 1927266 u
0x000068 1793540 h
0x00006d 1628606 m
0x00fffd 1579150
0x000067 1279990 g
0x000070 1277983 p
0x000066 997775 f
0x000079 949434 y
0x000062 851830 b
0x00002e 844102 .
0x000030 822410 0
0x0000a0 797309
0x000053 718313 S
0x000076 691534 v
0x000077 682472 w
0x000031 648470 1
0x000041 624279 A
0x00006b 555419 k
0x000032 548220 2
0x00002c 513342 ,
0x00002d 510054 -
0x000043 498244 C
0x000054 495323 T
0x000045 455061 E
0x00004d 426545 M
0x000050 423790 P
0x000049 405276 I
0x000052 393218 R
0x000044 381975 D
0x00004c 365834 L
0x000042 353770 B
0x000033 334689 3
0x00004e 325299 N
0x000029 302497 )
0x000028 301057 (
0x000035 298087 5
0x000046 295148 F
To be honest, I have no idea whether these results are representative; as I said, I only analysed one segment, but they look quite plausible to me. One can also easily spot that the markup has already been stripped off, so the distribution is not directly suitable for my XML parser. But it gives valuable hints on which character ranges to check first.
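As an illustration of how such a distribution can be used, a scanner can test the hot ASCII range before falling back to general Unicode classification. A minimal sketch in Swift (the character class itself is made up for illustration; the point is the ordering of the checks):

// Fast path first: the stats above suggest the vast majority of input
// scalars fall in plain ASCII, so test that range before consulting
// the full Unicode property tables.
@inline(__always)
func isNameCharacter(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar {
    case "a"..."z", "A"..."Z", "0"..."9", "_", "-", ".":
        return true  // common case, no table lookup needed
    default:
        return scalar.properties.isAlphabetic  // rare case, full Unicode
    }
}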
The link to http://emojitracker.com/ in the near-duplicate question is, I personally think, the most promising resource for this. I have not examined the sources (I don't speak Ruby), but from a real-time Twitter feed of character frequencies I would expect quite a different result than from static web pages, and probably a radically different language distribution (I see a lot more Arabic and Turkish on Twitter than in my otherwise ordinary life). It's probably not exactly what you are looking for, but if we just go by the title of your question (which is probably what most visitors followed to get here), then that is what I would suggest as the answer.
Of course, this raises the question of what kind of usage you are attempting to model. For static XML, which you seem to be after, maybe the Common Crawl data set is a better starting point after all. Text coming out of an editorial process (however informal) looks quite different from spontaneous text.
Out of the options suggested so far, Wikipedia (and/or Wiktionary) is probably the easiest, since it is small enough for local download, far better standardized than a random web dump (all UTF-8, all properly tagged, most of it tagged by language and proofread for markup errors, orthography, and occasionally facts), and yet large enough (and probably already overkill by an order of magnitude or more) to give you credible statistics. But again, if the domain is different from the one you actually want to model, the numbers will probably be wrong nevertheless.

iPhone: Turning latitude/longitude into "major cross-streets"

Using the MKReverseGeocoder or GoogleAPI or MapKit...
Is there a simple way to turn a latitude/longitude into "nearest major cross-streets"?
A user might not have any idea where "12345 Pineapple" is located... so I want to show something like "Pineapple and Main"... or (larger, major roads) like "US-140 and Hwy 76".
I don't really care what "major" is defined as... perhaps any road with higher speed limits... or more than 3 lanes... etc.
I don't really care what "close by" is defines as... perhaps within 0-10 miles... or just "closest found".
There is no built-in method to do this with MapKit or the Google API.
See this SO question for some reading about this in relation to the GoogleMaps API:
Is there a way to find the nearest cross streets for an address?
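For the first half of the problem (the nearest street, though not yet the cross-street), reverse geocoding does work. A minimal sketch using CLGeocoder, which replaced the now-deprecated MKReverseGeocoder; the "major road" ranking would still be your own logic on top of external road data:

import CoreLocation

// Reverse-geocode a coordinate to the street it lies on. This gives a
// single street name; finding the nearest intersecting road requires
// your own lookup against road data (e.g. OpenStreetMap).
let geocoder = CLGeocoder()
let location = CLLocation(latitude: 37.3318, longitude: -122.0312)

geocoder.reverseGeocodeLocation(location) { placemarks, error in
    guard let placemark = placemarks?.first, error == nil else {
        print("Reverse geocoding failed: \(String(describing: error))")
        return
    }
    // thoroughfare is the street name, e.g. "Pineapple St"
    print(placemark.thoroughfare ?? "No street found")
}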

Lucene .Net Searching with TermVector

In Lucene.Net, I am creating a document for searching for a word, and I want to display the 10 words before and the 10 words after it. I have used TermVector:
Lucene.Net.Documents.Field fldContent =
new Lucene.Net.Documents.Field("content", content,
Lucene.Net.Documents.Field.Store.YES,
Lucene.Net.Documents.Field.Index.TOKENIZED,
Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS);
Can anyone help me work out the keyword position and extract the nearest 15 words?
Please send some code.
Thanks,
Ashish
Ashish,
Check out the following link.
http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
Hitesh.
You should be looking at the Lucene Highlighter; it extracts a snippet of text surrounding the query term. This link gives an example.
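Separately from Lucene's Highlighter, the windowing logic itself is simple once you have the keyword's token position. A minimal sketch (written in Swift just to show the idea; the function and its parameters are illustrative, not part of Lucene.Net):

import Foundation

// Return the keyword plus up to `radius` words on each side of its
// first occurrence, or nil if the keyword is not present.
func snippet(around keyword: String, in content: String, radius: Int = 10) -> String? {
    let words = content.split(whereSeparator: { $0.isWhitespace }).map(String.init)
    guard let index = words.firstIndex(where: {
        $0.caseInsensitiveCompare(keyword) == .orderedSame
    }) else { return nil }
    let start = max(0, index - radius)
    let end = min(words.count, index + radius + 1)
    return words[start..<end].joined(separator: " ")
}

In Lucene.Net itself, the TermVector.WITH_POSITIONS_OFFSETS you already store records the positions and character offsets of each hit, which is what the Highlighter draws on to build exactly this kind of window.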