Postgres full text search dictionary strip special characters - postgresql

I'm using Postgres full text search (among other things) to provide autocomplete functionality for usernames and tags. However, I'd like autocomplete to match the column value 'dashed-tag-example' against a ts_query like 'dashedtag:*'.
My understanding is that, to do this without duplicating the column in my table, I need to create a dictionary along the lines of the simple dictionary that strips characters like '-'. Is it possible to create such a dictionary using SQL (i.e. something I could put in a Rails migration)?
It seems like it should somehow be possible to define a dictionary (or do I need a parser?) that uses Postgres's regexp substitution functions, but I can't find any examples online of how to create a dictionary (or parser) like that. Is this possible? How?

The dictionary is too late; you would need a different parser, which would require writing C code.
The simple and pragmatic solution is to use replace() to strip the - when you construct the tsvector.
You don't need to create a new column for that, simply search like this:
SELECT ... FROM ...
WHERE to_tsvector('english', replace(col, '-', ''))
   @@ to_tsquery('english', replace('search-string', '-', ''));
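If the autocomplete lookup needs to stay fast, the same expression can also be indexed; a minimal sketch, assuming a table named users with a tags column (both names are illustrative):
-- Index the exact expression used in the WHERE clause so the planner can use it.
CREATE INDEX users_tags_fts_idx ON users
    USING GIN (to_tsvector('english', replace(tags, '-', '')));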

Related

How can I use tsvector on a string with numbers?

I would like to use a postgres tsquery on a column that has strings that all contain numbers, like this:
FRUIT-239476234
If I try to make a tsquery out of this:
select to_tsquery('FRUIT-239476234');
What I get is:
'fruit' & '-239476234'
I want to be able to search by just the numeric portion of this value like so:
239476234
It seems that it is unable to match this because it is interpreting my hyphen as a "negative sign" and doesn't think 239476234 matches -239476234. How can I tell postgres to treat all of my characters as text and not try to be smart about numbers and hyphens?
An answer from the future: once version 13 of PostgreSQL is released, you will be able to use the dict_int module to do this.
CREATE EXTENSION dict_int;
ALTER TEXT SEARCH DICTIONARY intdict (MAXLEN = 100, ABSVAL=true);
ALTER TEXT SEARCH CONFIGURATION english ALTER MAPPING FOR int WITH intdict;
select to_tsquery('FRUIT-239476234');
to_tsquery
-----------------------
'fruit' & '239476234'
But you would probably be better off creating your own TEXT SEARCH DICTIONARY, as well as copying the 'english' CONFIGURATION and modifying the copy, rather than modifying the default ones in place. Otherwise you run the risk that an upgrade will silently lose your changes.
If you don't want to wait for v13, you could back-patch this change and compile it into your own version of the extension for a prior server.
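A minimal sketch of that safer approach, assuming the dict_int extension is installed (it provides the intdict_template text search template); the dictionary and configuration names below are illustrative:
-- Create a private dictionary and configuration instead of altering the
-- built-in ones, so a future upgrade cannot silently revert the change.
CREATE TEXT SEARCH DICTIONARY intdict_abs (
    TEMPLATE = intdict_template,
    MAXLEN = 100,
    ABSVAL = true
);
CREATE TEXT SEARCH CONFIGURATION english_nums ( COPY = pg_catalog.english );
ALTER TEXT SEARCH CONFIGURATION english_nums
    ALTER MAPPING FOR int WITH intdict_abs;
SELECT to_tsquery('english_nums', 'FRUIT-239476234');
-- 'fruit' & '239476234'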
This is done by the text search parser, which is not configurable (short of writing your own parser in C, which is supported).
The simplest solution is to pre-process all search strings by replacing - with a space.
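As a sketch of that pre-processing done directly in SQL (using plainto_tsquery so the two resulting words don't need an explicit operator between them):
SELECT plainto_tsquery('english', replace('FRUIT-239476234', '-', ' '));
-- 'fruit' & '239476234'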

Postgres - difference between to_tsquery, to_tsvector and plainto_tsquery

Postgres full text search includes these functions for searching: plainto_tsquery, to_tsquery and to_tsvector.
I don't get the difference between them; the results always contain the same words, except that the tsvector also attaches each word's position number.
SELECT plainto_tsquery('simple', 'The & Fat & Rats');
Running the same input through each of the three functions gives results like this:
plainto_tsquery: 'fat' & 'rat'
to_tsquery: 'fat' & 'rat'
to_tsvector: 'fat':2 'rat':3
I have tried longer queries, but I haven't found a bigger difference than that.
I already read the documentation, but I didn't get the difference there either.
I'd be happy for any help.
"plainto_tsquery" takes a phrase in plain English (or in this case plain "simple"--although your question is not consistent. "simple" does not strip out the word 'the', the way you show, unless you made nonstandard modifications to it) and converts it to a tsquery. Since "&" is punctuation, it gets ignored. But then it adds '&' in between the words, because that is what "plainto_tsquery" does. So those changes are not visible, because you chose a poor example to feed to plainto_tsquery.
"to_tsquery" compiles the query you gave it into the structure used for searching. But then, because you are selecting it rather than using it with a ts query operator, it converts it back to text again so it can display it. It requires that what you feed it already looks mostly like a tsquery (for example, has boolean operators between each word), otherwise it throws an error. Surely you noticed that when you tried longer queries?
"to_tsvector" creates a tsvector. This is not a tsquery, rather it is what the tsquery gets applied to.

Postgres - Full Text Search to accept emojis

I want to create a full text search index that accepts emojis in the query, or another type of index to search on text. For example, I have this text: Playa 🌊🌞🌴 #CobolIquique h, and PostgreSQL parses it weirdly around the emojis.
Debugging with SELECT * FROM ts_debug('english','Playa 🌊🌞🌴 #CobolIquique h'); the emojis are classified as space symbols, and I don't know why. If I debug the parser with SELECT * FROM ts_parse('default', 'Playa 🌊🌞🌴 #CobolIquique h'); I get the same tokens, and among the token types from ts_token_type('default') there is no emoji type (or anything similar). So how can I create a parser that splits the string correctly on spaces without treating emojis as blank space? Or how can I create a text index that can use emojis in queries?
To create a new parser that is different from the default one, you need to be a C programmer and write your own PostgreSQL extension. This extension must define the following functions:
start_function();
gettoken_function();
end_function();
lextypes_function();
headline_function(); // optional
As an example, you can examine the pg_tsparser module.
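For context, once those C functions are compiled into an extension, registering the parser and a configuration on top of it is plain SQL. This is only a sketch; the function and object names below are hypothetical:
-- Hypothetical C functions exposed by the custom extension.
CREATE TEXT SEARCH PARSER emoji_parser (
    START = emoji_start,
    GETTOKEN = emoji_gettoken,
    END = emoji_end,
    LEXTYPES = emoji_lextypes
    -- HEADLINE = emoji_headline (optional)
);
CREATE TEXT SEARCH CONFIGURATION emoji_english ( PARSER = emoji_parser );
-- Token-type-to-dictionary mappings are then added with
-- ALTER TEXT SEARCH CONFIGURATION emoji_english ADD MAPPING FOR ... WITH ...;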

Postgres: LIKE query against a string returns no results, though results exist

I feel like I'm missing something really simple here. I'm trying to get a list of results from Postgres, and I know the rows exist since I can see them when I do a simple SELECT *. However, when I try to do a LIKE query, I get nothing back.
Sample data:
{
id: 1,
userId: 2,
cache: '{"dataset":"{\"id\":4,\"name\":\"Test\",\"directory\":\"/data/"...'
};
Example:
SELECT cache FROM storage_table WHERE cache LIKE '%"dataset":"{\"id\":4%';
Is there some escaping I need?
The LIKE operator supports escaping of wildcards (e.g. so that you can search for an underscore or % sign). The default escape character is a backslash.
In order to tell the LIKE operator that you do not want to use the backslash as the escape character, you need to define a different one, e.g. ~:
SELECT cache
FROM storage_table
WHERE cache LIKE '%"dataset":"{\"id\":4%' ESCAPE '~';
You can use any character that does not appear in your data, e.g. ESCAPE '#' or ESCAPE '§'.
SQLFiddle example: http://sqlfiddle.com/#!15/7703f/1
But you should seriously consider upgrading to a more recent Postgres version that fully supports JSON. Not only will your queries be more robust, they can also be a lot faster due to the ability to index a JSON document.
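As a sketch of what that could look like (assuming the cache column parses as JSON and that the nested "dataset" value is itself a JSON string, as in the sample data above):
-- Extract the nested "dataset" JSON text, parse it, and filter on its id.
SELECT cache
FROM storage_table
WHERE (cache::jsonb ->> 'dataset')::jsonb @> '{"id": 4}';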

Table query in iPhone app

I have a tableview (linked to a database) and a search bar. When I type something in the search bar, I do a quick search in the database and display the results as I type.
The query looks like this:
SELECT * FROM MyTable WHERE name LIKE '%NAME%'
Everything works fine as long as I use only ASCII characters. What I want is to type ASCII characters and to match their equivalent with diacritics. For instance, if I type "Alizee" I would expect it to match "Alizée".
Is there a way to make the query locale-insensitive? I've read about the COLLATE option in SQL, but it seems to be of no use with SQLite. I've also read that iPhone SDK 3.0 has "Localized collation" but I was unable to find any documentation about what this means...
Thank you.
There are a few options for solving this:
1. Replacing all accented chars in the query before executing it, e.g.
   "Psychédélices" => "Psychedelices"
   "À contre-courant" => "A contre-courant"
   "Tempête" => "Tempete"
   etc., but this only works for the input, so you must not have accented chars in the database itself. A simple solution but far from perfect (see the sketch after this list).
2. Using a 3rd party library, namely ICU (links below). Not sure if it's the best choice for iPhone though.
3. Writing one or more custom C functions that will do the comparison. More in the links below.
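A crude SQL-side variant of the first option, folding a handful of accented characters in the column itself at query time (slow, since it cannot use an index, and it only covers the characters listed explicitly; the table and column names follow the question):
-- Fold a few common accented characters to ASCII before the LIKE comparison;
-- extend the nested replace() calls as needed.
SELECT *
FROM MyTable
WHERE replace(replace(replace(name, 'é', 'e'), 'è', 'e'), 'ê', 'e') LIKE '%Alizee%';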
A few posts here on StackOverflow that discuss the various options:
How to sort text in sqlite3 with specified locale?
Case-insensitive UTF-8 string collation for SQLite (C/C++)
How to implement the accent/diacritic insensitive search in Sqlite?
Also a couple of external links:
SQLite and native UNICODE LIKE support in C/C++
sqlite case and accent insensitive searches
I'm not sure about SQL, but I think you can definitely use the NSDiacriticInsensitivePredicateOption to compare in-memory NSStrings.
An example would be an NSArray full of the strings you're searching over. You could just iterate over the array comparing strings using the NSDiacriticInsensitivePredicateOption as your comparison option and displaying the successful matches.