How can I ignore accented characters when string matching in SPARQL - unicode

I have no idea how to compare different labels without taking accents into account.
The following query doesn't return the place because "Ibáñez" has accents in Spanish DBpedia, but it is accented differently in my data source.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT DISTINCT ?iri
WHERE {
  ?iri rdfs:label ?label .
  ?label bif:contains "'Blasco Ibañez'" .
  ?iri ?location ?city .
  FILTER (?location = <http://dbpedia.org/ontology/location> || ?location = <http://dbpedia.org/ontology/wikiPageWikiLink>) .
  ?city bif:contains "valencia"
} LIMIT 100
Is there a way to ignore the accents?

The issue is the current configuration of the Spanish DBpedia endpoint. (You may find the query I used to check their configuration interesting.)
Their virtuoso.ini must be adjusted to include --
[I18N]
XAnyNormalization=3
-- as described in the documentation of the INI file, and as further discussed in the article about "normalization of UNICODE3 accented chars in free-text index and queries", as cited in comments by @StanislavKralin.
(Note -- as of this writing, there's a typo in the doc; the section about "WideFileNames = 1/2/3/0" should say it's about "XAnyNormalization = 1/2/3/0")

Related

Prefix/wildcard searches with 'websearch_to_tsquery' in PostgreSQL Full Text Search?

I'm currently using the websearch_to_tsquery function for full text search in PostgreSQL. It all works well except for the fact that I no longer seem to be able to do partial matches.
SELECT ts_headline('english', q."Content", websearch_to_tsquery('english', {request.Text}), 'MaxFragments=3,MaxWords=25,MinWords=2') Highlight, *
FROM (
    SELECT ts_rank_cd(f."SearchVector", websearch_to_tsquery('english', {request.Text})) AS Rank, *
    FROM public."FileExtracts" f, websearch_to_tsquery('english', {request.Text}) as tsq
    WHERE f."SearchVector" @@ tsq
    ORDER BY rank DESC
) q
Searches for customer work, but cust* and cust:* do not.
I've had a look through the documentation and a number of articles but I can't find a lot of info on it. I haven't worked with it before so hopefully it's just something simple that I'm doing wrong?
You can't do this with websearch_to_tsquery, but you can do it with to_tsquery (which allows adding a :* wildcard) and add the websearch syntax yourself in your backend.
For example, in a Node.js environment you could do something like this:
let trimmedSearch = req.query.search.trim()
let searchArray = trimmedSearch.split(/\s+/) // split on runs of whitespace
let searchWithStar = searchArray.join(' & ') + ':*' // join the words with AND; wildcard on the last word
let escapedSearch = yourEscapeFunction(searchWithStar)
and then use it in your SQL
search_column @@ to_tsquery('english', ${escapedSearch})
You need to write the tsquery directly if you want to use partial matching. plainto_tsquery doesn't pass through partial match notation either, so what were you doing before you switched to websearch_to_tsquery?
Anything that applies a stemmer is going to have a hard time handling partial match. What is it supposed to do: take off the notation, stem the part, then add it back on again? Not do stemming on the whole string? Not do stemming on just the token containing the partial-match indicator? And how would it even know partial match was intended, rather than it just being another piece of punctuation?
To add something on top of the other good answers here, you can also compose your query with both websearch_to_tsquery and to_tsquery to have everything from both worlds:
select * from your_table where ts_vector_col @@ to_tsquery('simple', websearch_to_tsquery('simple', 'partial query')::text || ':*')
Another solution I have come up with is to do the text transform as part of the query, so building the tsquery looks like this:
to_tsquery(concat(regexp_replace(trim(' all the search terms here '), '\W+', ':* & ', 'g'), ':*'));
(trim) Removes leading/trailing whitespace
(regexp_replace) Splits the search string on non word chars and adds trailing wildcards to each term, then ANDs the terms (:* & )
(concat) Adds a trailing wildcard to the final term
(to_tsquery) Converts to a tsquery
You can test the string manipulation by running
SELECT concat(regexp_replace(trim(' all the search terms here '), '\W+', ':* & ', 'gm'), ':*')
the result should be
all:* & the:* & search:* & terms:* & here:*
So you have multi word partial matches e.g. searching spi ma would return results matching spider man
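The same string manipulation can also live in application code instead of the query. Here's a minimal Python sketch of the trim/split/join steps above (the function name to_prefix_tsquery is made up):

```python
import re

def to_prefix_tsquery(search: str) -> str:
    """Turn raw user input into a tsquery string with a trailing
    wildcard on every term, ANDed together."""
    terms = re.split(r"\W+", search.strip())  # split on runs of non-word characters
    return " & ".join(t + ":*" for t in terms if t)
```

Passing the result as a parameter to to_tsquery gives the same 'all:* & the:* & search:* & terms:* & here:*' query as the pure-SQL version.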

How to use regex capture groups in postgres stored procedures (if possible at all)?

In a system, I'm using a standard urn (RFC8141) as one of the fields. From that urn, one can derive a unique identifier. The weird thing about the urns described in RFC8141 is that you can have two different urns which are equal.
In order to check for unique keys, I need to extract different parts of the urn that make a unique key. To do so, I have this regex (Regex which matches URN by rfc8141):
\A(?i:urn:(?!urn:)(?<nid>[a-z0-9][a-z0-9-]{1,31}[^-]):(?<nss>(?:[-a-z0-9()+,.:=@;$_!*'&~\/]|%[0-9a-f]{2})+)(?:\?\+(?<rcomponent>.*?))?(?:\?=(?<qcomponent>.*?))?(?:#(?<fcomponent>.*?))?)\z
which results in five named capture groups (nid, nss, rcomponent, qcomponent and fcomponent). Only the nid and nss are important to check for uniqueness/equality. Or: even if the components change, as long as nid and nss are the same, two items/records are equal (no matter the values of the components). nid is checked case-insensitively, nss is checked case-sensitively.
Now, in order to check for uniqueness/equality, I'm defining a 'cleaned urn', which is the primary key. I've added a trigger, so I can extract the different capture groups. What I'd like to do is:
extract the nid and nss (see regex) of the urn
capture them by name. This is where I don't know how to do it: how can I capture these two capture groups in a postgresql stored procedure?
add them as 'cleaned urn', lowercasing nid (so to have case-insensitivity on that part) and url-encoding or url-decoding the string (one of the two, it doesn't matter, as long as it's consistent). (I'm also not sure if there is a url encode/decode function in Postgres, but that'll be another question once this one is solved :) ).
Example:
all these urns are equal/equivalent (and I want the primary key to be urn:example:a123,z456):
urn:example:a123,z456
URN:example:a123,z456
urn:EXAMPLE:a123,z456
urn:example:a123,z456?+abc (?+ denotes the start of the rcomponent)
urn:example:a123,z456?=xyz/something (?= denotes the start of the qcomponent)
urn:example:a123,z456#789 (# denotes the start of the fcomponent)
urn:example:a123%2Cz456
URN:EXAMPLE:a123%2cz456
urn:example:A123,z456 and urn:Example:A123,z456 both have key urn:example:A123,z456, which is different from the previous examples (because of the case-sensitiveness of the A123,z456).
just for completeness: urn:example:a123,z456?=xyz/something is different from urn:example:a123,z456/something?=xyz: everything after ?= (or ?+ or #) can be omitted, so the /something is part of the primary key in the latter case, but not in the former. (That's what the regex is actually capturing already.)
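For comparison, the normalization rules above can be sketched in Python, where named capture groups do work (the helper name cleaned_urn is made up; it lowercases the nid and percent-decodes the nss, matching the description above):

```python
import re
from urllib.parse import unquote

# Named-group version of the RFC 8141 regex from the question
URN_RE = re.compile(
    r"\Aurn:(?!urn:)"
    r"(?P<nid>[a-z0-9][a-z0-9-]{1,31}[^-]):"
    r"(?P<nss>(?:[-a-z0-9()+,.:=@;$_!*'&~/]|%[0-9a-f]{2})+)"
    r"(?:\?\+(?P<rcomponent>.*?))?"
    r"(?:\?=(?P<qcomponent>.*?))?"
    r"(?:#(?P<fcomponent>.*?))?\Z",
    re.IGNORECASE,
)

def cleaned_urn(urn: str) -> str:
    """Build the 'cleaned urn' key: nid lowercased, nss
    percent-decoded, r/q/f-components dropped."""
    m = URN_RE.match(urn)
    if not m:
        raise ValueError(f"not a valid RFC 8141 urn: {urn}")
    return f"urn:{m['nid'].lower()}:{unquote(m['nss'])}"
```

All the equivalent example urns listed above then map to the same key, while urn:example:A123,z456 keeps its distinct, case-sensitive nss.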
== EDIT 1: unnamed capture groups ==
with unnamed capture groups, this would be doing the same:
select
  g[2] as nid,
  g[3] as nss,
  g[4] as rcomp,
  g[5] as qcomp,
  g[6] as fcomp
from (
  select regexp_matches('uRn:example:a123,z456?=xyz/something',
    '\A(urn:(?!urn:)([a-z0-9][a-z0-9-]{1,31}[^-]):((?:[-a-z0-9()+,.:=@;$_!*''&~\/]|%[0-9a-f]{2})+)(?:\?\+(.*?))?(?:\?=(.*?))?(?:#(.*?))?)$', 'i')
    as g
) as ar;
(g[1] is the full match, which I don't need)
I updated the query:
case-insensitive matching should be done with a flag
no named capturing groups (Postgres seems to have issues with named capture groups)
and did a select on the array, splitting the array into columns.
Named captures don't seem to be supported, and there seem to be some issues with the greedy/lazy lookup and the negative lookahead. So, here's a solution that works fine:
DO $$
BEGIN
  if not exists (SELECT 1 FROM pg_type WHERE typname = 'urn') then
    CREATE TYPE urn AS (nid text, nss text, rcomp text, qcomp text, fcomp text);
  end if;
END
$$;

CREATE or REPLACE FUNCTION spliturn(urnstring text)
RETURNS urn as $$
DECLARE
  urn urn;
  urnregex text = concat(
    '\A(urn:(?!urn:)',
    '([a-z0-9][a-z0-9-]{1,31}[^-]):',
    '((?:[-a-z0-9()+,.:=@;$_!*''&~\/]|%[0-9a-f]{2})+)',
    '(?:\?\+(.*?))??',
    '(?:\?=(.*?))??',
    '(?:#(.*?))??',
    ')$');
BEGIN
  select
    lower(g[2]) as nid,
    g[3] as nss,
    g[4] as rcomp,
    g[5] as qcomp,
    g[6] as fcomp
  into urn
  from (select regexp_matches(urnstring, urnregex, 'i') as g) as ar;
  RETURN urn;
END;
$$ language 'plpgsql' immutable strict;
note
no named groups (?<...>)
case-insensitive search is indicated with a flag
\z is replaced with $ to match the end of the string
a quote is escaped with another quote ('') to allow quotes inside the literal
the double ?? makes the optional groups non-greedy (Table 9-14)

How to keep the upper case and lower case letters in a column alias in the results in Redshift

In Redshift we are trying to give more meaningful aliases to the columns we return from our queries, as we are importing the results into Tableau. The issue is that Redshift folds all letters to lower case, i.e. "Event Date" comes back as "event date". Any idea how to keep the alias as given?
I know I'm a bit late to the party but for anyone else looking, you can enable case sensitivity, so if you want to return a column with camel casing for example
SET enable_case_sensitive_identifier TO true;
Then in your query wrap what you want to return the column as in double quotes
SELECT column AS "thisName"
Or as per OP's example
SELECT a.event_date AS "Event Date"
https://docs.aws.amazon.com/redshift/latest/dg/r_enable_case_sensitive_identifier.html
Edit: To have this behaviour as default for the cluster you will need to create/update a parameter group in Configurations => Workload Management. You can't change the settings for the default parameter group. Note, you will need to reboot the cluster after applying the parameter group for the changes to take effect.
No, you cannot do this in Redshift. All columns are lowercase only.
You can enforce upper case only by using
set describe_field_name_in_uppercase to on;
Also see the examples here: https://docs.aws.amazon.com/redshift/latest/dg/r_names.html. You can see that the upper case characters are returned as lower case, and it says "identifiers are case-insensitive and are folded to lowercase in the database".
You can of course rename the column to include uppercase within Tableau.
I was going through the AWS docs for Redshift and it looks like the INITCAP function can solve your use case.
For reference => https://docs.aws.amazon.com/redshift/latest/dg/r_INITCAP.html
Brief description (copied)
The INITCAP function makes the first letter of each word in a string uppercase, and any subsequent letters are made (or left) lowercase. Therefore, it is important to understand which characters (other than space characters) function as word separators. A word separator character is any non-alphanumeric character, including punctuation marks, symbols, and control characters. All of the following characters are word separators:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
And in your case you have declared the field name as event_date, which INITCAP will convert to Event_Date.
And next you can use the REPLACE function to replace the underscore '_'.
For reference => https://docs.aws.amazon.com/redshift/latest/dg/r_REPLACE.html
You need to put
set describe_field_name_in_uppercase to on;
in your Tableau's Initial SQL.

PostgreSQL prevent non-matching tsqueries from matching tsvector

Given the following query:
select to_tsvector('fat cat ate rat') @@ plainto_tsquery('cats ate');
This query will return true as a result. Now, what if I don't want "cats" to also match the word "cat", is there any way I can prevent this?
Also, is there any way I can make sure that the tsquery matches the entire string in that particular order (e.g. the "cats ate" is counted as a single token rather than two). At the moment the following query will also match:
select to_tsvector('fat cat ate rat') @@ plainto_tsquery('ate cats');
cat matching cats is due to English stemming, english probably being your default text search configuration. See the result of show default_text_search_config to be sure.
It can be avoided by using the simple configuration. Try the function calls with explicit text configurations:
select to_tsvector('simple', 'fat cat ate rat') @@ plainto_tsquery('simple', 'cats ate');
Or change it with:
set default_text_search_config='simple';

Use multiple words in FullText Search input string

I have a basic stored procedure that performs a full text search against 3 columns in a table by passing in a @Keyword parameter. It works fine with one word but falls over when I try to pass in more than one word. I'm not sure why. The error says:
Syntax error near 'search item' in the full-text search condition 'this is a search item'
SELECT S.[SeriesID],
       S.[Name] as 'SeriesName',
       P.[PackageID],
       P.[Name]
FROM [Series] S
INNER JOIN [PackageSeries] PS ON S.[SeriesID] = PS.[SeriesID]
INNER JOIN [Package] P ON PS.[PackageID] = P.[PackageID]
WHERE CONTAINS ((S.[Name], S.[Description], S.[Keywords]), @Keywords)
  AND (S.[IsActive] = 1) AND (P.[IsActive] = 1)
ORDER BY [Name] ASC
You will have to do some pre-processing on your @Keyword parameter before passing it into the SQL statement. SQL Server expects keyword searches to be separated by boolean operators or surrounded in quotes. So, if you are searching for the phrase, it will have to be in quotes:
SET @Keyword = '"this is a search item"'
If you want to search for all the words then you'll need something like
SET @Keyword = '"this" AND "is" AND "a" AND "search" AND "item"'
For more information, see the T-SQL CONTAINS syntax, looking in particular at the Examples section.
As an additional note, be sure to replace the double-quote character (with a space) so you don't mess up your full-text query. See this question for details on how to do that: SQL Server Full Text Search Escape Characters?
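Putting those two points together, the pre-processing can be sketched in application code; here's a rough Python version (the helper name build_contains_param is made up) that strips embedded double quotes, quotes each word, and ANDs the terms:

```python
def build_contains_param(keywords: str) -> str:
    """Prepare a multi-word search string for SQL Server CONTAINS:
    embedded double quotes are replaced with spaces so they cannot
    break the full-text query, each word is quoted, and the words
    are joined with AND."""
    words = keywords.replace('"', ' ').split()
    if not words:
        return '""'  # same empty-input fallback as the STRING_AGG example below
    return ' AND '.join(f'"{w}"' for w in words)
```

The result for 'this is a search item' is exactly the '"this" AND "is" AND "a" AND "search" AND "item"' form shown above, ready to bind to @Keyword.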
Further to Aaron's answer, provided you are using SQL Server 2016 or greater (compatibility level 130), you could use the in-built string functions to pre-process your input string. E.g.
SELECT
    @QueryString = ISNULL(STRING_AGG('"' + value + '*"', ' AND '), '""')
FROM
    STRING_SPLIT(@Keywords, ' ');
Which will produce a query string you can pass to CONTAINS or FREETEXT that looks like this:
'"this*" AND "is*" AND "a*" AND "search*" AND "item*"'
or, when @Keywords is null:
""