SQLite for iOS - Accent (tilde) insensitive match in an FTS table

I enabled FTS in SQLite for iPhone and tried the following query. It works, although it is very slow:
SELECT field FROM table_fts WHERE replace(replace(replace(replace(replace(lower(field), 'á','a'),'é','e'),'í','i'),'ó','o'),'ú','u') LIKE replace(replace(replace(replace(replace(lower('%string%'), 'á','a'),'é','e'),'í','i'),'ó','o'),'ú','u')
But the same approach does not work with MATCH: the query returns no results and reports no error.
SELECT field FROM table_fts WHERE replace(replace(replace(replace(replace(lower(field), 'á','a'),'é','e'),'í','i'),'ó','o'),'ú','u') MATCH replace(replace(replace(replace(replace(lower('string'), 'á','a'),'é','e'),'í','i'),'ó','o'),'ú','u')
Is there an error in my query, or is there another approach for a tilde-insensitive search? I looked for answers on the web with no success.

Two approaches:
First, you can violate normal form and add columns to your table that contain an ASCII-only representation of your searchable fields. Before searching against this secondary column, also remove international characters from the string being searched for (that way you're looking for an ASCII-only string in a field with the ASCII-only representation).
By the way, if you want a more general-purpose conversion of international characters to ASCII, you can try something like:
- (NSString *)replaceInternationalCharactersIn:(NSString *)text
{
NSData *stringData = [text dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
return [[[NSString alloc] initWithData:stringData encoding:NSASCIIStringEncoding] autorelease];
}
Second, you could presumably use sqlite3_create_function() to write your own function (one that invokes some variation of the above) and use it right in your SQL statements. See the SQLite documentation.
Update:
By the way, given that you're doing FTS, the sqlite3_create_function() approach is probably not possible, since MATCH generally expects the FTS table or one of its columns on its left-hand side rather than an expression. It strikes me that you could either do FTS on a field containing the ASCII-only string, or write your own tokenizer that does something along those lines.
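A minimal sketch of the first approach (a shadow column holding a folded copy of the text). It is written in Python with the standard sqlite3 module only because that is compact; the table and column names (items, title, title_folded) and the fold helper are made up for illustration. It assumes that "accent-insensitive" means dropping combining marks after Unicode decomposition, which covers accented Latin letters but may not suit every language.

```python
import sqlite3
import unicodedata

def fold(text):
    """Lowercase and strip combining accents (e.g. 'Canción' -> 'cancion')."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

conn = sqlite3.connect(":memory:")
# title keeps the original text; title_folded is the searchable shadow column.
conn.execute("CREATE TABLE items (title TEXT, title_folded TEXT)")
for title in ["Canción de cuna", "Árbol", "Camión rojo"]:
    conn.execute("INSERT INTO items VALUES (?, ?)", (title, fold(title)))

def search(term):
    # Fold the search term the same way as the stored column.
    pattern = "%" + fold(term) + "%"
    rows = conn.execute(
        "SELECT title FROM items WHERE title_folded LIKE ? ORDER BY rowid",
        (pattern,),
    )
    return [row[0] for row in rows]
```

The same folded column could be the one an FTS table indexes, so that MATCH runs against plain unaccented text while the original column is what gets displayed.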

Related

Like does not work with Russian symbols? (Swift, SQLite)

I use like with lowercaseString and Russian symbols, but LOWER doesn't convert them to lowercase in the query. I tried to create my own function, but it didn't work for me. How can I solve this problem?
Having studied the SQLite documentation, I learned that you need to link the ICU library. How can this be done with this library?
Library: stephencelis/SQLite.swift (https://github.com/stephencelis/SQLite.swift)
Thanks for help.
// in name value: ПРИВЕТ from database
let search_name = "Привет"
user.filter(name.lowercaseString.like("%" + search_name.lowercased() + "%"))
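SQLite's LOWER only handles ASCII. If you want case-insensitive matching for Russian (or any other non-ASCII characters), use FTS3/FTS4 https://www.sqlite.org/fts3.html (or FTS5 https://www.sqlite.org/fts5.html) with a Unicode-aware tokenizer such as unicode61; the default "simple" tokenizer of FTS3/FTS4 also folds case for ASCII only.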
SQLite.swift has the corresponding full text search modules https://github.com/stephencelis/SQLite.swift/blob/master/Documentation/Index.md#full-text-search
To use it in your project with an existing database, create a virtual table via the FTS module and filter the query using .match:
// CREATE VIRTUAL TABLE "table" USING fts4("row0", "row1"), if not exists
try db.run(table.create(.FTS4(row0, row1), ifNotExists: true))
// SELECT * FROM "table" WHERE "row0" MATCH 'textToMatch*'
try db.prepare(table.filter(row0.match("\(textToMatch)*")))
// SELECT * FROM "table" WHERE "table" MATCH 'textToMatch*' (matches any column)
try db.prepare(table.match("\(textToMatch)*"))
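For comparison, the custom-function route mentioned in the question can be sketched with Python's built-in sqlite3 module (this is not SQLite.swift code; the table users and the function name ulower are hypothetical). It registers a Unicode-aware lowercase function and applies it to both sides of LIKE. Unlike FTS, this scans every row, so it is only a reasonable choice for small tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Register a Unicode-aware lowercase function under the hypothetical name "ulower".
conn.create_function("ulower", 1, lambda s: s.lower() if s is not None else None)

conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("ПРИВЕТ",), ("Hello",)])

def find(term):
    # Lowercase both the column and the search term before comparing.
    rows = conn.execute(
        "SELECT name FROM users WHERE ulower(name) LIKE '%' || ulower(?) || '%'",
        (term,),
    )
    return [row[0] for row in rows]
```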

supporting internationalization for NSString's

I have a bunch of the following line of code:
[NSString stringWithFormat:@"%@ and %@", subject.title, secondsubject.title];
[NSString stringWithFormat:@"%@ and %d others", subject.title, [newsfeeditem count] - 1];
and a lot more in the app. Basically, I am building a Facebook-style news feed whose entries come from string constants, such as "blah liked blah". Where and how should I define these string constants so that internationalization is easy? Should I have a file just for storing string constants?
See the String Resources section of the Resource Programming Guide. The key section for this particular problem is "Formatting String Resources."
You'd have something like:
[NSString stringWithFormat:NSLocalizedString(@"%1$@ and %2$@", @"two nouns combined by 'and'"),
    subject.title, secondsubject.title];
The %1$@ is the location of the first substitution. This lets you rearrange the text. Then you would have string resource files like:
English:
/* two nouns combined by 'and' */
"%1$@ and %2$@" = "%1$@ and %2$@";
Spanish:
/* two nouns combined by 'and' */
"%1$@ and %2$@" = "%1$@ y %2$@";
You need to be very thoughtful about these kinds of combinations. First, you can never build up a sentence out of parts of sentences in a translatable way. You almost always need to translate the entire message in one go. What I mean is that you can't have one string that says @"I'm going to delete" and another string that says @"%@ and %@" and glue them together. The word order is too variable between languages.
Similarly, complex lists of things can cause all kinds of headaches due to various agreement rules. Some languages have special plural rules, gender agreements, and similar issues. As much as possible, keep your messages simple, short, and static.
But the above tools are very useful for solving the problem you're discussing. See the docs for more details.
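To illustrate why positional placeholders matter, here is a small language-neutral sketch in Python. The TRANSLATIONS table is a hypothetical stand-in for the .strings files, Python's {0}/{1} fields stand in for %1$@/%2$@, and the "xx" entry is an invented language whose grammar puts the second noun first.

```python
# Hypothetical per-language tables standing in for Localizable.strings files.
TRANSLATIONS = {
    "en": {"%1$@ and %2$@": "{0} and {1}"},
    "es": {"%1$@ and %2$@": "{0} y {1}"},
    # A made-up language whose word order swaps the two nouns.
    "xx": {"%1$@ and %2$@": "{1} + {0}"},
}

def localized(key, lang, *args):
    # Fall back to English when the language or key is missing.
    template = TRANSLATIONS.get(lang, {}).get(key, TRANSLATIONS["en"][key])
    return template.format(*args)
```

The caller always passes the arguments in the same order; only the translated template decides where each one lands.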

Problem while using NSPredicate

SQL query:
select * from test_mart
where replace(replace(replace(replace(replace(replace(lower(name),'+',''),'_',''),'the ',''),' the',''),'a ',''),' a','')='tariq'
I could run this query easily if I were using plain SQLite, but the current project uses Core Data, and I am not very familiar with NSPredicate.
The requirement is to ignore everything except alphanumeric characters, which means stripping special characters before comparing.
The characters that should be valid in the comparison would be
ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890
But we should not fail the comparison for the following characters
:;,~`!@#$%^&*()_-+="'/?.>,<|\
Or for the following words
'the' 'an' 'a'
Some examples:
'Walmart' would be seen as the same payee as 'Wal-Mart'
'The Shoe Store' would be seen as the same payee as 'Shoe Store'
'Domino's Pizza' would be seen as the same payee as 'Dominos Pizza'
'Test Payee;' would be seen as the same payee as 'Test Payee'
Can anyone suggest appropriate predicates or regular expressions?
Thanks
I would have an extra field in the data base which would be a processed version of the original with all the irrelevant characters stripped out. Then use that for comparisons.
You might want to look at the Soundex algorithm, which may suit your purposes better.
It seems to me that you would want to normalize your data before it ever gets saved into the Core Data store. So if you're given "Wal-Mart", normalize it to "walmart" once, and then save it. Then you won't be doing all of this expensive on-the-fly comparison many, many times.
The normalization would be fairly simple, given your rules:
Strip the words "a", "an", and "the"
Remove punctuation
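The rules above can be sketched as follows. This is written in Python for brevity, and normalize_payee is a made-up name. It also lowercases the input, which the examples imply, and it treats only ASCII letters and digits as valid, matching the character list in the question.

```python
import re

ARTICLES = {"a", "an", "the"}

def normalize_payee(name):
    # Lowercase, then drop everything that is not an ASCII letter, digit, or whitespace.
    cleaned = re.sub(r"[^a-z0-9\s]", "", name.lower())
    # Remove the articles as whole words and collapse runs of whitespace.
    words = [w for w in cleaned.split() if w not in ARTICLES]
    return " ".join(words)
```

With the result saved in an extra attribute (say, a hypothetical normalizedName), the Core Data lookup reduces to a plain equality predicate against the normalized search string.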

Make Lucene index a value and store another

I want Lucene.NET to store a value while indexing a modified, stripped-down version of the stored value. For example, consider the value:
this_example-has some/weird (chars) 100%
I want it stored right like that (so that I can retrieve exactly that for showing in the results list), but I want lucene to index it as:
this example has some weird chars 100
(you see, like a "sanitized" version of the original value) for a simplified search.
I figure this would be the job of an analyzer, but I don't want to mess with rolling my own. Ideally, the solution should remove everything that is not a letter, a number or quotes, replacing the removed chars by a white-space before indexing.
Any suggestions on how to implement that?
This is because I am indexing products for an e-commerce search, and some have really creepy names. I think this would improve search accuracy.
Thanks in advance.
If you don't want a custom analyzer, try storing the value as a separate non-indexed field, and use a simple regex to generate the sanitized version.
var input = "this_example-has some/weird (chars) 100%";
var output = Regex.Replace(input, @"[\W_]+", " ");
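As a rough illustration of the "store one value, index another" idea, here is a toy in-memory index in Python. It is not the Lucene.NET API; sanitize, add, and lookup are made-up names, and the extra lowercasing and trimming are assumptions on top of the regex above.

```python
import re
from collections import defaultdict

def sanitize(value):
    # Same pattern as above: runs of non-word characters or underscores become one space.
    return re.sub(r"[\W_]+", " ", value).strip().lower()

index = defaultdict(set)   # sanitized token -> set of stored (original) values

def add(value):
    # Index the sanitized tokens but keep the original value for display.
    for token in sanitize(value).split():
        index[token].add(value)

def lookup(term):
    # Single-token lookup only; returns the original, unsanitized values.
    return sorted(index.get(sanitize(term), set()))

add("this_example-has some/weird (chars) 100%")
```

In Lucene terms, the original string would go into a stored, non-indexed field and the sanitized string into an indexed field of the same document.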
You mention that you need another Analyzer for some searching functionality. Don't forget the PerFieldAnalyzerWrapper, which will allow you to use different analyzers within the same document.
public static void Main() {
var wrapper = new PerFieldAnalyzerWrapper(defaultAnalyzer: new StandardAnalyzer(Version.LUCENE_29));
wrapper.AddAnalyzer(fieldName: "id", analyzer: new KeywordAnalyzer());
IndexWriter writer = null; // TODO: Retrieve these.
Document document = null;
writer.AddDocument(document, analyzer: wrapper);
}
You are correct that this is the work of the analyzer. I'd start by using a tool like Luke to see what the standard analyzer does with your term before deciding what to use; it tends to do a good job of stripping noise characters and words.

Extracting data (strings) from a large string

A long time ago I had to extract data from a string, and I went with a while loop that walked through the whole string char by char, extracting the bits of data that I needed. It wasn't very efficient, but it worked.
In my latest app I would like to try to do it the way a good engineer would. Are there ways to search the string for an expression, or maybe a substring?
For example out of the html in the string, there is a line that will contain a team name.
<td width="25%"><span class="teamname">Blue Bombers</span></td>
Is there a call I can make that would find the "teamname" and then extract the team name from between the > and <?
I could go char by char, saving the last 10 chars to a string until the string equals "teamname", then keep going until I hit the >, and save everything I get until I hit a < again. But I guess that's taking the easy, inefficient way.
Many Thanks
-Code
You can get the range of the string "class" using rangeOfString: (which returns an NSRange), then apply your logic from there. That should cut down on the character-by-character searching.
Your code should look like the following:
if ([substring rangeOfString:@"class"].location != NSNotFound) {
    // "class" was found
} else {
    // "class" was not found
}
If that's the only part of the string you're interested in, then just find a starting point like "teamname" via -rangeOfString:. If there's more than one occurrence, make repeated calls with -rangeOfString:options:range:.
If you need more comprehensive parsing, however..
If this string is actual XHTML then you may be able to use one of the various XML parsers, e.g. TouchXML, and then find what you need via DOM lookups. However if (as seems likely) it's not pure XHTML then this is unlikely to help. In that case you might try loading up the HTML in an offscreen UIWebView and using JavaScript calls to find specific elements.
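To show what the parser-based route looks like compared with scanning characters by hand, here is a minimal sketch using Python's standard html.parser, which tolerates HTML that is not well-formed XML. It stands in for the parsers named above rather than reproducing any of them; TeamNameParser and team_names are made-up names, and nested spans are not handled.

```python
from html.parser import HTMLParser

class TeamNameParser(HTMLParser):
    """Collects the text of every <span class="teamname"> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "teamname" in classes:
            self._inside = True
            self.names.append("")

    def handle_data(self, data):
        # Accumulate text only while inside a matching span.
        if self._inside:
            self.names[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False

def team_names(html):
    parser = TeamNameParser()
    parser.feed(html)
    parser.close()
    return [name.strip() for name in parser.names]
```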