Search with Turkish characters - postgresql

I have a problem with Turkish upper and lower case when searching the database with LIKE (and in Elasticsearch).
For example, I have a posts table which contains a post titled 'DENEME YAZI'.
If I run this query:
select * from posts where title like '%deneme%';
or:
select * from posts where title like '%YAZI%';
I get the correct result, but if I run:
select * from posts where title like '%yazı%';
it doesn't return any records. My database locale is tr_TR.UTF-8.
How can I get correct results without entering the exact word?

You must use ILIKE for case insensitive matches:
select * from posts where title ilike '%yazı%';
However, there is the additional complication of peculiar rules in the Turkish locale. Upper case of 'ı' is 'I'. But not the other way round. Lower case of 'I' is 'i':
db=# SELECT lower(upper('ı'));
lower
-------
i
You could solve that by applying upper() on either side of the LIKE expression:
select upper('DENEME YAZI') like ('%' || upper('yazı') || '%');

Just applying UPPER (or LOWER) to both sides of the expression is not a solution on its own. You have to handle the problematic Turkish characters (ı/I and i/İ) yourself.
İ and i are the upper- and lower-case forms of the same letter in the Turkish alphabet.
I and ı are the upper- and lower-case forms of another letter in the Turkish alphabet.
But even with UTF-8, Latin5, or Windows-1254 encoding and collation settings in PostgreSQL:
UPPER('İ') returns 'İ' OK
UPPER('i') returns 'I' Not OK
UPPER('I') returns 'I' OK
UPPER('ı') returns 'İ' Not OK
so
SELECT ... FROM ... WHERE ... UPPER('İZMİR') like UPPER('izmir') returns false
SELECT ... FROM ... WHERE ... UPPER('ISPARTA') like UPPER('ısparta') returns false.
Here's a more precise, though not perfect, solution (not perfect because of performance issues):
SELECT ... FROM ... WHERE ...
UPPER(REPLACE(REPLACE(COLUMNX, 'i', 'İ'), 'ı', 'I')) =
UPPER(REPLACE(REPLACE(myvalue, 'i', 'İ'), 'ı', 'I'))
or
SELECT ... FROM ... WHERE ...
UPPER(TRANSLATE(COLUMNX, 'ıi', 'Iİ')) = UPPER(TRANSLATE(myvalue, 'ıi', 'Iİ'))
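A minimal sketch wrapping the TRANSLATE workaround above in a reusable helper; tr_upper is a hypothetical name, not a built-in:
-- Hypothetical helper: map the problematic Turkish characters first, then upper-case.
CREATE OR REPLACE FUNCTION tr_upper(txt text)
RETURNS text
LANGUAGE sql IMMUTABLE AS $$
  SELECT upper(translate(txt, 'ıi', 'Iİ'));
$$;
-- Usage: normalize both sides the same way before comparing.
SELECT * FROM posts WHERE tr_upper(title) LIKE '%' || tr_upper('yazı') || '%';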

Related

Postgres: Retrieve first n words from column

I know that I can do a text search in Postgres with full text search and get results like this:
select ts_headline('german',content, tq, 'MaxFragments=4, MinWords=5, MaxWords=12,
ShortWord=3, StartSel = <strong>, StopSel = </strong>') as highlight, ...
FROM to_tsquery('german', 'test') tq ...
Is there a similar way to apply the same limitations directly to content, i.e. to get up to 12 words straight from the column content?
You could use regular expressions:
SELECT (regexp_match(
          regexp_replace(content, '[^\w\s]+', ' ', 'g'),
          '^\s*((?:\w+\s+){9}\w+)'
       ))[1]
FROM ...
That will first replace everything that is not a space or alphanumerical character with a space and then return the first 10 words.
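If the word count should be a parameter rather than the hard-coded 9, the same idea could be wrapped in a function. A sketch, assuming PostgreSQL 9.6+ for regexp_match; first_words is a hypothetical name:
-- Hypothetical helper: first n words of a text value, using the same
-- regexp_replace + regexp_match approach as above.
CREATE OR REPLACE FUNCTION first_words(txt text, n int)
RETURNS text
LANGUAGE sql IMMUTABLE AS $$
  SELECT (regexp_match(
            regexp_replace(txt, '[^\w\s]+', ' ', 'g'),
            format('^\s*((?:\w+\s+){%s}\w+)', n - 1)
         ))[1];
$$;
-- Usage: SELECT first_words(content, 12) FROM ...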

How to find/replace weird whitespace in string

In my SQL database I found a string with a weird whitespace character that cannot be replaced with REPLACE(string, ' ', '') or RTRIM, and I can't even find it with string = '% %'. The character is even carried over to a new table when using SELECT string INTO.
If I select this string in Management Studio and copy it, it seems to be a normal space and then everything works, but I can't do anything directly from the database. What else can I do? Is it some kind of error, or can I try some special character for this?
First, you must identify the character.
You can do that by using a tally table (or a CTE) and the UNICODE function.
The following script will return a table with two columns: one contains a character and the other its Unicode code point:
DECLARE @Str nvarchar(100) = N'This is a string containing 1 number and some words.';
with Tally(n) as
(
    SELECT TOP(LEN(@Str)) ROW_NUMBER() OVER(ORDER BY @@SPID)
    FROM sys.objects a
    --CROSS JOIN sys.objects b -- (uncomment if there are not enough rows in the tally cte)
)
SELECT SUBSTRING(@Str, n, 1) As TheChar,
       UNICODE(SUBSTRING(@Str, n, 1)) As TheCode
FROM Tally
WHERE n <= LEN(@Str)
You can also add a condition to the where clause to only include "special" chars:
AND SUBSTRING(@Str, n, 1) NOT LIKE '[a-zA-Z0-9]'
Then you can replace it by its Unicode value using NCHAR (I've used 32 in this example since it's the code point of a "regular" space):
SELECT REPLACE(@Str, NCHAR(32), '|')
Result:
This|is|a|string|containing|1|number|and|some|words.
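In practice the offending character often turns out to be a non-breaking space (code point 160). A sketch of the clean-up once you have identified the code point this way; dbo.MyTable and MyColumn are hypothetical names:
-- Assuming the character found above turned out to be NCHAR(160), a non-breaking space:
UPDATE dbo.MyTable
SET    MyColumn = REPLACE(MyColumn, NCHAR(160), ' ')
WHERE  MyColumn LIKE '%' + NCHAR(160) + '%';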

TSQL: Special (national) character detection (SQL Server 2016)

I found a lot of techniques to detect so-called special chars ($%##) available on an English keyboard; however, I see that a national character like á behaves differently, even though my LIKE condition should catch it, as it should select anything but a-z and 0-9. What is the trick here? In the sample below I'm missing my special á. I'm on SQL Server 2016 with default settings in the US.
;WITH cte AS (SELECT 'Euro a€' St UNION SELECT 'adgkjb$' St UNION SELECT 'Bravo Endá' St)
SELECT * FROM cte WHERE St LIKE '%[^a-zA-Z0-9 ]%'
St
adgkjb$
Euro a€
SELECT CAST(N'€' AS VARBINARY(8)) --0xAC20
SELECT CAST(N'á' AS VARBINARY(8)) --0xE100
SQL Server appears to be "helping" with the ranges of characters due to the default collation. If you explicitly list all of the valid characters it will work as desired. Alternatively, you can force a collation on the pattern match that won't interpret the ranges in the pattern as containing non-ASCII characters.
-- Explicit pattern for "bad" characters.
declare @InvalidCharactersPattern as VarChar(100) = '%[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ]%';
-- Query some sample data with the explicit pattern and an explicitly specified collation.
select Sample,
  case when Sample like @InvalidCharactersPattern then 'Bad Character' else 'Okay' end as ExplicitStatus,
  case when Sample like '%[^a-zA-Z0-9 ]%' collate Latin1_General_100_BIN
    then 'Bad Character' else 'Okay' end as CollationStatus
from ( values ( 'a' ), ( 'A' ), ( 'á' ), ( 'Foo' ), ( 'F&o' ), ( '$%^&' ) ) as Samples( Sample );
-- Server collation.
select ServerProperty( 'collation' ) as ServerCollation;
-- Available collations.
select name, description
from sys.fn_helpcollations()
order by name;
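As a usage sketch, the same binary-collation pattern applied to a real column to find the offending rows; dbo.MyTable and MyColumn are hypothetical names:
-- Rows containing anything outside a-z, A-Z, 0-9 and space; the binary collation
-- keeps accented letters like 'á' from being folded into the a-z range.
select MyColumn
from dbo.MyTable
where MyColumn like '%[^a-zA-Z0-9 ]%' collate Latin1_General_100_BIN;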

Postgres reverse LIKE lookup indexing and performance

We have a musicians table containing records with multiple string fields, say:
"Jimi", "Hendrix", "Guitar"
"Phil", "Collins", "Drums"
"Sting", "", "Bass"
"Ringo", "Starr", "Drums"
"Paul", "McCartney", "Bass"
I want to pass postgres a long string, say:
"It is known that Jimi liked to set light to his guitar and smash up
all the drums while on stage."
and I want to get back the records that have any matches - preferably in order of most matches first:
"Jimi", "Hendrix", "Guitar"
"Phil", "Collins", "Drums"
"Ringo", "Starr", "Drums"
Because I need the search to be case insensitive, I'm constructing a query like this:
select * from musicians
where lowercase_string like '%'||firstname||'%'
   or lowercase_string like '%'||lastname||'%'
   or lowercase_string like '%'||instrument||'%'
and then looping through the results (in Ruby in my case) to capture the record with the most matches.
This is however very slow in the SQL stage (1 minute+).
I've tried adding a lower-case GIN index using pg_trgm as suggested here - but it's not helping - presumably because the LIKE query is back to front?
Thanks!
From my testing, it seems that no trigram index could help your query at all, and no other index type could speed up this kind of (I)LIKE / FTS based search either.
I should mention that all of the queries below do use the trigram indexes when they are run "reversed": when the table contains the documents (which are indexed) and your parameter is the query. The (I)LIKE variant, for example, is 2-3 times faster with such an index.
These are the queries I've tested:
select *
from musicians
where :input_string ilike '%' || firstname || '%'
or :input_string ilike '%' || lastname || '%'
or :input_string ilike '%' || instrument || '%'
At first, FTS seemed a great idea, but my testing shows that even without ranking, it is 60-100 times slower than the (I)LIKE variant. (So even though you don't have to post-process results with these methods, they are not worth it.)
select *
from musicians
where to_tsvector(:input_string) @@ (plainto_tsquery(firstname) || plainto_tsquery(lastname) || plainto_tsquery(instrument))
However, ORDER BY rank doesn't slow down that much further: it is 70-120 times slower than the (I)LIKE variant.
select *
from musicians
where to_tsvector(:input_string) @@ (plainto_tsquery(firstname) || plainto_tsquery(lastname) || plainto_tsquery(instrument))
order by ts_rank(to_tsvector(:input_string), plainto_tsquery(firstname) || plainto_tsquery(lastname) || plainto_tsquery(instrument))
Then, for a last effort, I tried the (fairly new) "word similarity" operators of the trigram module: <% and %> (available from PostgreSQL 9.6).
select *
from musicians
where :input_string %> firstname
or :input_string %> lastname
or :input_string %> instrument
select *
from musicians
where firstname <% :input_string
or lastname <% :input_string
or instrument <% :input_string
These were somewhat faster than FTS: around 50-70 times slower than the (I)LIKE variant.
(Partially working) rextester: it runs against PostgreSQL 9.5, so the 9.6 operators obviously won't work there.
Update: if a full-word match is enough for you, you can actually reverse your query to be able to use indexes. You'll need to "parse" your query (a.k.a. the "long string") though:
with long_string(ls) as (
values (:input_string)
),
words(word) as (
select s
from long_string, regexp_split_to_table(ls, '[^[:alnum:]]+') s
where s <> ''
)
select musicians.*
from musicians, words
where firstname ilike word
or lastname ilike word
or instrument ilike word
group by musicians.id
Note: I parsed the query into complete words. You could use some other logic there, or it could even be parsed on the client side.
The default btree index shines here, as it is much faster than the trigram index with (I)LIKE (we don't need trigrams anyway, as we are looking for complete word matches here):
with long_string(ls) as (
values (:input_string)
),
words(word) as (
select s
from long_string, regexp_split_to_table(lower(ls), '[^[:alnum:]]+') s
where s <> ''
)
select musicians.*
from musicians, words
where lower(firstname) = word
or lower(lastname) = word
or lower(instrument) = word
group by musicians.id
http://rextester.com/PSABJ6745
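For reference, a sketch of the plain btree expression indexes assumed to back the lower(...) = word comparisons above (the rextester setup may differ):
-- One expression index per searched column; btree equality is all we need here.
create index on musicians (lower(firstname));
create index on musicians (lower(lastname));
create index on musicians (lower(instrument));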
You could even get the match count with something like
sum((lower(firstname) = word)::int
+ (lower(lastname) = word)::int
+ (lower(instrument) = word)::int)
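Putting that together, a sketch of the reversed query above extended to return the match count and order by it:
-- match counting and ordering folded into the word-split query from above
with long_string(ls) as (
values (:input_string)
),
words(word) as (
select s
from long_string, regexp_split_to_table(lower(ls), '[^[:alnum:]]+') s
where s <> ''
)
select musicians.*,
sum((lower(firstname) = word)::int
+ (lower(lastname) = word)::int
+ (lower(instrument) = word)::int) as matches
from musicians, words
where lower(firstname) = word
or lower(lastname) = word
or lower(instrument) = word
group by musicians.id
order by matches desc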
The ilike option with match ordering:
with long_string (ls) as (values
('It is known that Jimi liked to set light to his guitar and smash up all the drums while on stage.')
)
select musicians.*, matches
from
musicians
cross join
long_string
cross join lateral
(select
(ls ilike format ('%%%s%%', first_name) and first_name != '')::int +
(ls ilike format ('%%%s%%', last_name) and last_name != '')::int +
(ls ilike format ('%%%s%%', instrument) and instrument != '')::int
as matches
) m
where matches > 0
order by matches desc
;
first_name | last_name | instrument | matches
------------+-----------+------------+---------
Jimi | Hendrix | Guitar | 2
Phil | Collins | Drums | 1
Ringo | Starr | Drums | 1

TRIM not working with newlines and tabs from xpath in PostgreSQL?

With this query
SELECT trim(title) FROM (
SELECT
unnest( xpath('//p[#class="secTitle1"]', xmlText )::varchar[] ) AS title
FROM t1
) as t2
and XML input text with newlines and spaces,
<root>
...
<p class="x">
text text
text text
</p><p> ...</p>
...
</root>
The trim() has no effect (!). Is it a PostgreSQL bug? How can I apply fn:normalize-space() within the XPath? Do I need something like "WHERE title is not null"? (Oracle is simpler...) How can I do this simple query with PostgreSQL?
Workaround
I need a well-behaved built-in function, not a workaround... But I need to get work done and show results, so I am using a regular expression...
SELECT id, TRIM(regexp_replace(tit, E'[\\n\\r\\t ]+', ' ', 'g')) AS tit
FROM (
SELECT
id, -- xpath returns array of 1, 2, or more strings
unnest( xpath('//p[@class="secTitle1"]', texto )::VARCHAR[] ) AS tit
FROM t
) AS tmp
So, a "only simple space trim" is not friendly, not util (!).
EDIT after @mu's comment
I tried
SELECT id, TRIM(tit, E'\\n\\r\\t') AS tit
and
SELECT id, TRIM(tit, '\n\r\t') AS tit
Both do NOT work.
THE QUESTION REMAINS:
Is there a TRIM option or PostgreSQL configuration to make TRIM work as required?
Can I use normalize-space() in the XPath? How?
I am using PostgreSQL 9.1 - do I need to upgrade?
It works in 9.2, and it works on 8.4 too.
postgres=# select trim(unnest(string_to_array(e'\t\tHello\n\t\tHello\n\t\tHello', e'\n')), e'\t');
btrim
-------
Hello
Hello
Hello
(3 rows)
Your regexp replaces every \n, \r or \t character anywhere in the string, but trim() with the character set "\n\r\t" only strips characters from the ends of the string - and written as E'\\n\\r\\t' or '\n\r\t' that set contains literal backslashes and the letters n, r, t rather than the control characters. It has a different meaning than you expect.
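As a sketch, here is your query with trim() given a set of actual control characters (single backslashes inside an E'' string); note this only strips leading and trailing whitespace, so collapsing internal runs of whitespace still needs the regexp approach:
-- strips spaces, newlines, CRs and tabs from both ends of each xpath result
SELECT id, trim(tit, E' \n\r\t') AS tit
FROM (
  SELECT id,
         unnest( xpath('//p[@class="secTitle1"]', texto)::varchar[] ) AS tit
  FROM t
) AS tmp;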