Weird behaviour with PostgreSQL tsvector and tsquery around emails

I've been playing around with PostgreSQL's text search capability and I've encountered what I consider weird behaviour. This is on PostgreSQL 8.3, so it may not be current behaviour:
select to_tsvector('some@email.com') @@ to_tsquery('some@email.com:*');
select to_tsvector('some@email.com') @@ to_tsquery('some@email.c:*');
The first query matches but the second fails...
Does anyone know what is going on here?
I've tried escaping the @ and . characters, but no luck.
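In case it helps, here is what I've been using to inspect how the parser tokenizes each string (ts_debug has been available since 8.3):
select * from ts_debug('english', 'some@email.com');
select * from ts_debug('english', 'some@email.c');
If the two strings are assigned different token types, I assume that could explain why the prefix query doesn't match, but I'm not sure.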

Related

PostgreSQL full text search returning wrong results

I'm using a PostgreSQL full text tsvector column.
But I found a problem:
When I search for "calça", the results contain the following:
1- calça red
2- calça blue
3- calçado red
Why "calçado" is being returned when I search for "calça" ?
Is there any configuration so I can solve this?
Thanks.
It isn't just a matter of one string containing the other. The Portuguese stemmer thinks this is how they should be stemmed. If you turn the longer word into 'calçadot', for example, it is no longer stemmed, because (presumably) 'adot' is not recognized as a Portuguese suffix that ought to be removed the way 'ado' is.
If you don't want stemming at all, then you could change the config to 'simple', which doesn't stem. But at that point, maybe you don't want full text search at all, and could just use LIKE instead with a pg_trgm index.
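A minimal sketch of that alternative, with a hypothetical products table and name column:
create extension if not exists pg_trgm;
create index products_name_trgm_idx on products using gin (name gin_trgm_ops);
select * from products where name like '%calça%';
Note that a plain substring match like this would still find 'calçado' (since 'calça' is literally contained in it); the point is just that LIKE gives you exact, unstemmed matching semantics that you control, and the trigram index keeps '%...%' patterns fast.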
If it is just this particular word that you don't want stemmed, I think you can set up a synonym dictionary that maps calçado to itself, which will bypass stemming.
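A sketch of that approach, assuming a file calcado.syn under tsearch_data in the share directory containing the single line "calçado calçado", and hypothetical names for the dictionary and configuration:
create text search dictionary calcado_syn (
    template = synonym,
    synonyms = calcado
);
create text search configuration pt_syn (copy = portuguese);
alter text search configuration pt_syn
    alter mapping for asciiword, word
    with calcado_syn, portuguese_stem;
Because the synonym dictionary recognizes 'calçado' first and maps it to itself, the Portuguese stemmer never sees it, while every other word still falls through to portuguese_stem. You would then build the tsvector with to_tsvector('pt_syn', ...).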

PostgreSQL lower() and upper() not working with some Turkish characters

In PostgreSQL 11.4, I realized that the lower() and upper() functions are not working with some Turkish characters, i.e. 'İ' and 'ı'.
select lower('İ'); -- returns 'İ' instead of 'i'
select upper('ı'); -- returns 'ı' instead of 'I'
Although the functions work properly with the other Turkish characters, i.e. "ç, ö, ş, ğ, ü", there seems to be no proper workaround other than replace() or stored procedures. So I am wondering whether this is a bug that will be solved in a future version of PostgreSQL. If not, is there a smarter workaround for this problem, at least for now?
Note: I use Windows 10, but I am not sure if the problem is directly related to Windows.
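One workaround that may help, assuming your PostgreSQL build includes ICU support (ICU collations such as "tr-x-icu" are created by initdb on ICU-enabled builds): lower() and upper() honour an explicit ICU collation, which applies Turkish case-mapping rules:
select lower('İ' collate "tr-x-icu"); -- expected: 'i'
select upper('ı' collate "tr-x-icu"); -- expected: 'I'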

Odd to_tsquery results for s:* and t:*

I was experimenting with PostgreSQL's text search feature - particularly with the normalization function to_tsquery.
I was using the english dictionary (config) and for some reason s and t won't normalize. I understand why i and a would not, but s and t? Interesting.
Are they matched to single space and tab?
Here is the query:
select
to_tsquery('english', 'a:*') as for_a,
to_tsquery('english', 's:*') as for_s,
to_tsquery('english', 't:*') as for_t,
to_tsquery('english', 'u:*') as for_u;
fiddle just in case.
You will see that 'u:*' comes back as 'u:*' while 'a:*' returns nothing.
The letters s and t are considered stop words in the english text search dictionary, so they get discarded. You can read the stop word list in tsearch_data/english.stop under the Postgres share directory, which you can locate by running pg_config --sharedir.
With pg 11 on ubuntu/debian/mint, that would be
cat /usr/share/postgresql/11/tsearch_data/english.stop
Quoting from the docs,
Stop words are words that are very common, appear in almost every document, and have no discrimination value. Therefore, they can be ignored in the context of full text searching.
It is best to set aside English grammar and think of words in a programmatic and logical way, as described above. Full text search does not try to infer context based on sentence structure, so it has no use for these words. After all, it's called full text search and not natural language search.
As to how they arrived at the decision to add s and t to the stop word list, statistical analysis must have revealed these characters to be noise.
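If you want to verify this yourself, ts_debug shows which tokens the dictionary treats as stop words:
select token, lexemes from ts_debug('english', 's t u a');
Stop words are recognized by the dictionary but produce an empty lexemes array ({}), which is why they disappear from the resulting tsquery.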

PQgetvalue() strips spaces from result of string_agg()

I have a GNU C++ project that uses the PostgreSQL API and for some reason, it strips spaces from the result of a certain query. Other environments (psql and pgAdmin) don't. The query is:
SELECT string_agg(my_varchar, ', ') FROM my_table;
Notice the space after the comma in the delimiter. Instead of 1046976, 1046977 being returned by PQgetvalue(), I get 1046976,1046977. Just for kicks, I tried changing the delimiter to silly things like string_agg(my_varchar, ',:) ' and string_agg(my_varchar, ', :)'. It doesn't strip the space if the space is in the middle of the delimiter.
Again, I don't have this problem if I do the same queries in db browsers like psql and pgAdmin; they don't strip the space in any of those queries.
Yes, I considered the possibility that the engine might be confused because the columns being aggregated are varchars while the data are 7-bit integers. I changed the query to something that is truly a varchar, but the spaces were still stripped.
Looking at https://www.postgresql.org/docs/9.4/static/functions-aggregate.html, I see that string_agg() expects its arguments to be texts or byteas. Well, I never got an error, but to be sure, I tried string_agg(my_varchar::text, ', '::text). It didn't make a difference.
I don't know a great deal about this API, but it doesn't appear to connect to the db with any options, so I don't think there's much to say about the configuration.
I'm running this in GNU C++ v4.9.2 on Debian 8.10. The PostgreSQL engine and API are 9.4.
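One server-side sanity check (using the same hypothetical table and column names as above) is to compare the aggregate's length with its text, since length() is computed before the value ever reaches the client library:
SELECT length(string_agg(my_varchar, ', ')), string_agg(my_varchar, ', ') FROM my_table;
If the length accounts for the space (e.g. 16 for '1046976, 1046977'), the server is returning the delimiter intact and whatever strips it must be on the client side.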

sphinx dash in author names causing problems when searching

I've read all the posts about dashes and tried pretty much everything mentioned in them, yet cannot figure out a strange problem I'm having.
For example, I have an author name like this:
Arturo Pérez-Reverte
A search for 'pérez-reverte' will not turn up anything, nor will 'pérez\-reverte', so escaping the dash is not the issue.
But a search for 'spider-man' will return hits, proving that the dash seems to be working.
However, a search for 'perez reverte' also finds a hit because it searches each word separately and finds the 'reverte' in 'perez-reverte' (but doesn't seem to find the 'perez').
A search for either 'pérez' or 'perez' finds the same number of documents, suggesting that the accent is not an issue (I do have a charset_table which accounts for accented characters).
So I'm very confused as to what's happening here. If it isn't the accent and it isn't the dash, what could it be?
I don't have any ignore_chars set, I'm using UTF-8 and have a charset_table to treat accented characters as regular characters.
The only difference between these two terms is that one of them is a title (spider-man) and the other an author, but they are both part of the same Sphinx index declaration, so I don't see that as an issue in any way.
Any help would be greatly appreciated.
After much fighting with it, I found out that even though my database is all UTF-8 with the proper collation, I needed to add this to sphinx.conf for everything to work properly:
sql_query_pre = SET NAMES utf8
sql_query_pre = SET CHARACTER SET utf8
After doing that, and having the proper charset_table, everything seems to be working fine.
Hope this helps someone else.