I want to create a Full Text Search that accepts emojis in the query, or another type of index to search on text. For example, I have this text: Playa 🌊🌞🌴 #CobolIquique h' and PostgreSQL parses the emojis weirdly.
Debugging with SELECT * FROM ts_debug('english','Playa 🌊🌞🌴 #CobolIquique h'); I get the following result:
And I don't know why the token is considered a space symbol. If I debug the parser with SELECT * FROM ts_parse('default', 'Playa 🌊🌞🌴 #CobolIquique h'); I just get the same tokens, and among the token types from ts_token_type('default') there is no emoji type (or anything similar). So, how can I create a parser that splits the string correctly on spaces without treating emojis as blanks? Or how can I create a text index that accepts emojis in queries?
To create a new parser, different from the default one, you have to be a C programmer and write your own PostgreSQL extension. This extension should define the following functions:
start_function();
gettoken_function();
end_function();
lextypes_function();
headline_function(); // optional
As an example you can examine the pg_tsparser module.
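Once the C functions are compiled into an extension, the parser is registered with CREATE TEXT SEARCH PARSER. A minimal sketch, assuming the extension exports functions with these placeholder names:

-- Hypothetical names; the C extension must actually export these.
CREATE TEXT SEARCH PARSER emoji_parser (
    START    = emoji_parser_start,
    GETTOKEN = emoji_parser_gettoken,
    END      = emoji_parser_end,
    LEXTYPES = emoji_parser_lextypes
);

-- Then build a configuration on top of it:
CREATE TEXT SEARCH CONFIGURATION emoji_cfg (PARSER = emoji_parser);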
I'm using postgres full text search (among other things) to provide autocomplete functionality for usernames and tags. However, I'd like autocomplete to match the column value 'dashed-tag-example' against a ts_query like 'dashedtag:*'.
My understanding is that, to do this without duplicating the column in my table, I need to create a dictionary along the lines of the simple dictionary that strips characters like '-'. Is it possible to create such a dictionary using SQL (i.e. something I could put in a rails migration)?
It seems like it should somehow be possible to define a dictionary (or do I need a parser?) that uses postgres's regexp substitution functions, but I can't seem to find any examples online of how to create a dictionary (parser?) like that. Is this possible? How?
The dictionary is too late; you would need a different parser, which would require writing C code.
The simple and pragmatic solution is to use replace() to strip the - when you construct the tsvector.
You don't need to create a new column for that; simply search like this:
SELECT ... FROM ...
WHERE to_tsvector('english', replace(col, '-', ''))
   @@ to_tsquery('english', replace('search-string', '-', ''));
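If the table is large, the same expression can back an index: to_tsvector with an explicit configuration and replace() are both immutable, so an expression index works. A sketch with hypothetical table and column names:

CREATE INDEX mytable_col_fts_idx ON mytable
    USING GIN (to_tsvector('english', replace(col, '-', '')));

The WHERE clause has to repeat the exact same expression for the planner to use the index.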
I would like to use a postgres tsquery on a column that has strings that all contain numbers, like this:
FRUIT-239476234
If I try to make a tsquery out of this:
select to_tsquery('FRUIT-239476234');
What I get is:
'fruit' & '-239476234'
I want to be able to search by just the numeric portion of this value like so:
239476234
It seems that it is unable to match this because it is interpreting my hyphen as a "negative sign" and doesn't think 239476234 matches -239476234. How can I tell postgres to treat all of my characters as text and not try to be smart about numbers and hyphens?
An answer from the future: once version 13 of PostgreSQL is released, you will be able to use the dict_int module to do this.
CREATE EXTENSION dict_int;
ALTER TEXT SEARCH DICTIONARY intdict (MAXLEN = 100, ABSVAL=true);
ALTER TEXT SEARCH CONFIGURATION english ALTER MAPPING FOR int WITH intdict;
select to_tsquery('FRUIT-239476234');
to_tsquery
-----------------------
'fruit' & '239476234'
But you would probably be better off creating your own TEXT SEARCH DICTIONARY, as well as copying the 'english' CONFIGURATION and modifying the copy, rather than modifying the default ones in place. Otherwise you run the risk that upgrading will silently lose your changes.
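A minimal sketch of that safer route (my_intdict and my_english are placeholder names):

CREATE EXTENSION dict_int;
CREATE TEXT SEARCH DICTIONARY my_intdict (
    TEMPLATE = intdict_template, MAXLEN = 100, ABSVAL = true
);
CREATE TEXT SEARCH CONFIGURATION my_english (COPY = english);
ALTER TEXT SEARCH CONFIGURATION my_english
    ALTER MAPPING FOR int WITH my_intdict;

Queries then name the copy explicitly, e.g. to_tsquery('my_english', 'FRUIT-239476234'), or you can point default_text_search_config at it.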
If you don't want to wait for v13, you could back-patch this change and compile into your own version of the extension for a prior server.
This is done by the text search parser, which is not configurable (short of writing your own parser in C, which is supported).
The simplest solution is to pre-process all search strings by replacing - with a space.
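For example, combined with plainto_tsquery so that no boolean operators are needed in the pre-processed string:

SELECT plainto_tsquery('english', replace('FRUIT-239476234', '-', ' '));
-- 'fruit' & '239476234'

Apply the same replace() when building the tsvector so that both sides tokenize identically.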
Postgres full text search includes these functions: plainto_tsquery, to_tsquery, and to_tsvector.
I don't get the difference between them; the results always contain the same words, except that the tsvector also shows the position of each word.
SELECT plainto_tsquery('simple', 'The & Fat & Rats');
The result will be like this:
plainto_tsquery: 'fat' & 'rat'
to_tsquery: 'fat' & 'rat'
to_tsvector: 'fat':2 'rat':3
I have tried longer queries, but I haven't found a bigger difference than that.
I already read the documentation, but I didn't get the difference there either.
I am happy for any help.
"plainto_tsquery" takes a phrase in plain English (or in this case plain "simple"--although your question is not consistent. "simple" does not strip out the word 'the', the way you show, unless you made nonstandard modifications to it) and converts it to a tsquery. Since "&" is punctuation, it gets ignored. But then it adds '&' in between the words, because that is what "plainto_tsquery" does. So those changes are not visible, because you chose a poor example to feed to plainto_tsquery.
"to_tsquery" compiles the query you gave it into the structure used for searching. But then, because you are selecting it rather than using it with a ts query operator, it converts it back to text again so it can display it. It requires that what you feed it already looks mostly like a tsquery (for example, has boolean operators between each word), otherwise it throws an error. Surely you noticed that when you tried longer queries?
"to_tsvector" creates a tsvector. This is not a tsquery, rather it is what the tsquery gets applied to.
Background
I have search indexes containing Greek characters. Many people don't know how to type Greek, so they enter something called "beta-code". Beta-code can be converted into Greek. For example, beta-code "NO/MOU" would be converted to "νόμου". Characters such as a slash or a parenthesis are used to indicate an accent.
Desired Behavior
I want users to be able to search using either beta-code or text in the Greek script. I figured out that the Whoosh Variations class provides the mechanism I need and it almost solves my problem.
Problem
The Variations class works well except when a slash or a parenthesis is used to indicate an accent in a user's query. The problem is that the query is parsed such that the special characters denoting the accent cause the words to be split up. For example, a search for "NO/MOU" results in the Variations class being asked to find variations of "no" and "mou" instead of "NO/MOU".
Question
Is there a way to influence how the query is parsed such that slashes and parentheses are included in the search words (i.e. so that a search for "NO/MOU" results in a search for the token "NO/MOU" instead of "no" and "mou")?
The search parser uses a Tokenizer class to break the search string into individual terms. Whoosh will use the class that is associated with the schema. For example, in the case below, the SimpleAnalyzer() will be used when searching the "content" field.
from whoosh.fields import Schema, NUMERIC, TEXT
from whoosh.analysis import SimpleAnalyzer

schema = Schema(verse_id=NUMERIC(unique=True, stored=True),
                content=TEXT(analyzer=SimpleAnalyzer()))
By default, the SimpleAnalyzer() uses the following regular expression to tokenize search terms: r"\w+(\.?\w+)*"
To use a different regular expression, pass it as the first argument to SimpleAnalyzer. For example, to include the beta-code characters (slashes, parentheses, etc.) in tokens, use the following:
from whoosh.util.text import rcompile  # in some older Whoosh versions this lives in whoosh.util

SimpleAnalyzer(rcompile(r"[\w/*()=\+|&']+(\.?[\w/*()=\+|&']+)*"))
Searches will now allow terms to include the special beta-code characters and the Variations class will be able to convert the term to the unicode version.
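A quick sanity check of the new tokenizer (note that SimpleAnalyzer also lowercases, which should be harmless here since the term reaches the Variations class after analysis):

from whoosh.analysis import SimpleAnalyzer
from whoosh.util.text import rcompile

beta = SimpleAnalyzer(rcompile(r"[\w/*()=\+|&']+(\.?[\w/*()=\+|&']+)*"))
print([t.text for t in beta("Searching NO/MOU here")])
# ['searching', 'no/mou', 'here']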
I have a tableview (linked to a database) and a search bar. When I type something in the search bar, I do a quick search in the database and display the results as I type.
The query looks like this:
SELECT * FROM MyTable WHERE name LIKE '%NAME%'
Everything works fine as long as I use only ASCII characters. What I want is to type ASCII characters and to match their equivalent with diacritics. For instance, if I type "Alizee" I would expect it to match "Alizée".
Is there a way to make the query locale-insensitive? I've read about the COLLATE option in SQL, but it seems to be of no use with SQLite. I've also read that iPhone SDK 3.0 has "Localized collation" but I was unable to find any documentation about what this means...
Thank you.
There are a few options for solving this:
Replacing all accented chars in the query before executing it, e.g. "Psychédélices" => "Psychedelices", "À contre-courant" => "A contre-courant", "Tempête" => "Tempete", etc. But this only works for the input, so you must not have accented chars in the database itself. A simple solution, but far from perfect.
Using a 3rd party library, namely ICU (links below). Not sure if it's the best choice for iPhone though.
Writing one or more custom C functions that will do the comparison. More in the links below.
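A very rough sketch of the custom-function route: register an unaccent() SQL function with sqlite3_create_function and wrap both sides of the LIKE in it. All names are mine, and the fold table below is deliberately tiny; a real version needs the full mapping (or ICU):

#include <sqlite3.h>
#include <string.h>
#include <ctype.h>

/* Illustrative only: fold a few UTF-8 accented letters to lowercase ASCII. */
static void fold_utf8(const char *in, char *out, size_t cap)
{
    static const struct { const char *seq; char plain; } map[] = {
        {"\xc3\xa9", 'e'}, {"\xc3\xa8", 'e'}, {"\xc3\xaa", 'e'},  /* é è ê */
        {"\xc3\xa0", 'a'}, {"\xc3\xa2", 'a'}, {"\xc3\xb4", 'o'},  /* à â ô */
    };
    size_t o = 0;
    while (*in && o + 1 < cap) {
        int matched = 0;
        for (size_t i = 0; i < sizeof map / sizeof map[0]; i++) {
            size_t n = strlen(map[i].seq);
            if (strncmp(in, map[i].seq, n) == 0) {
                out[o++] = map[i].plain;
                in += n;
                matched = 1;
                break;
            }
        }
        if (!matched)
            out[o++] = (char)tolower((unsigned char)*in++);
    }
    out[o] = '\0';
}

/* SQL function unaccent(X): X with the accents above folded away. */
static void unaccent_func(sqlite3_context *ctx, int argc, sqlite3_value **argv)
{
    char buf[1024];
    const char *s = (const char *)sqlite3_value_text(argv[0]);
    (void)argc;
    if (s == NULL) { sqlite3_result_null(ctx); return; }
    fold_utf8(s, buf, sizeof buf);
    sqlite3_result_text(ctx, buf, -1, SQLITE_TRANSIENT);
}

/* Call once after sqlite3_open(): */
void register_unaccent(sqlite3 *db)
{
    sqlite3_create_function(db, "unaccent", 1, SQLITE_UTF8, NULL,
                            unaccent_func, NULL, NULL);
}

The query then becomes SELECT * FROM MyTable WHERE unaccent(name) LIKE '%' || unaccent(?) || '%';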
A few posts here on StackOverflow that discuss the various options:
How to sort text in sqlite3 with specified locale?
Case-insensitive UTF-8 string collation for SQLite (C/C++)
How to implement the accent/diacritic insensitive search in Sqlite?
Also a couple of external links:
SQLite and native UNICODE LIKE support in C/C++
sqlite case and accent insensitive searches
I'm not sure about SQL, but I think you can use the NSDiacriticInsensitivePredicateOption to compare in-memory NSStrings.
An example would be an NSArray full of the strings you're searching over. You could just iterate over the array comparing strings using the NSDiacriticInsensitivePredicateOption as your comparison option and displaying the successful matches.
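A minimal Objective-C sketch of that idea (names and searchText are placeholders); in a predicate format string, [c] means case-insensitive and [d] means diacritic-insensitive:

NSPredicate *match = [NSPredicate predicateWithFormat:@"SELF CONTAINS[cd] %@", searchText];
NSArray *results = [names filteredArrayUsingPredicate:match];
// With searchText = @"Alizee", @"Alizée" is among the matches.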