Postgres Full-Text Search with Hyphen and Numerals

I have observed what seems to me an odd behavior of Postgres' to_tsvector function.
SELECT to_tsvector('english', 'abc-xyz');
returns
'abc':2 'abc-xyz':1 'xyz':3
However,
SELECT to_tsvector('english', 'abc-001');
returns
'-001':2 'abc':1
Why not something like this?
'abc':2 'abc-001':1 '001':3
And what should I do to be able to search by the numeric portion alone, without the hyphen?

It seems the text search parser treats a hyphen followed by digits as the sign of a signed integer. Debug with ts_debug():
SELECT * FROM ts_debug('english', 'abc-001');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+--------------+------------+---------
asciiword | Word, all ASCII | abc | {simple} | simple | {abc}
int | Signed integer | -001 | {simple} | simple | {-001}
Other text search configurations (like 'simple' instead of 'english') won't help as the parser itself is "at fault" here (debatable).
A simple workaround (other than modifying the parser, which I have never tried) would be to pre-process strings and replace hyphens with an em dash (—) or just blanks, so that they are identified as "Space symbols". (Actual signed integers lose their negative sign in the process.)
SELECT to_tsvector('english', translate('abc-001', '-', '—'))
    @@ to_tsquery('english', '001'); -- true now

This can be circumvented with the absval option of the dict_int module, available since PostgreSQL 13. See the official documentation.
But in case you're stuck with an earlier version, here is a generalized "number or negative number" workaround applied to a query.
select regexp_replace($$'test' & '1':* & '2'$$::tsquery::text,
'''([.\d]+''(:\*)?)', '(''\1 | ''-\1)', 'g')::tsquery;
This results in:
'test' & ( '1':* | '-1':* ) & ( '2' | '-2' )
It replaces lexemes that look like positive numbers with "number or negative number" kind of subqueries.
The double cast ::tsquery::text is just there to show how you would pass a tsquery cast to text.
Note that it handles prefix matching numeric lexemes as well.
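The same rewrite can also be done client-side before the query text is sent to the server. Below is a minimal sketch in Python, assuming the query is already available in the text form of a tsquery. The function name add_negative_alternatives is hypothetical, and the pattern is a direct translation of the regexp_replace call above.

```python
import re

# Matches a quoted lexeme made only of digits and dots, optionally followed
# by the prefix-match marker ":*". Group 1 holds everything after the
# opening quote, e.g. "1':*" or "2'".
NUMERIC_LEXEME = re.compile(r"'([.\d]+'(:\*)?)")


def add_negative_alternatives(tsquery_text):
    """Rewrite each numeric lexeme as (number | negative number)."""
    return NUMERIC_LEXEME.sub(r"('\1 | '-\1)", tsquery_text)


print(add_negative_alternatives("'test' & '1':* & '2'"))
```

The returned string differs from the tsquery output shown above only in whitespace inside the parentheses, which Postgres adds itself when it prints a tsquery.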

How to remove multiple characters between 2 special characters in a column in SSIS expression

I want to remove the multiple characters starting from '#' till the ';' in derived column expression in SSIS.
For example,
my input column values are,
and want the output as,
Note: Length after '#' is not fixed.
Already tried in SQL but want to do it via SSIS derived column expression.
First of all: Please do not post pictures. We prefer copy-and-pastable sample data. And please try to provide a minimal, complete and reproducible example, best served as DDL, INSERT and code as I do it here for you.
And just to mention this: If you control the input, you should not mix information within one string... If this is needed, try to use a "real" text container like XML or JSON.
SQL Server is not meant for string manipulation. There is no RegEx or repeated/nested pattern matching, so we would have to use a recursive, procedural, or looping approach. But if performance is not so important, you might use an XML hack.
--DDL and INSERT
DECLARE @tbl TABLE(ID INT IDENTITY,YourString VARCHAR(1000));
INSERT INTO @tbl VALUES('Here is one without')
,('One#some comment;in here')
,('Two comments#some comment;in here#here is the second;and some more text')
--The query
SELECT t.ID
,t.YourString
,CAST(REPLACE(REPLACE((SELECT t.YourString AS [*] FOR XML PATH('')),'#','<!--'),';','--> ') AS XML) SeeTheIntermediateXML
,CAST(REPLACE(REPLACE((SELECT t.YourString AS [*] FOR XML PATH('')),'#','<!--'),';','--> ') AS XML).value('.','nvarchar(max)') CleanedValue
FROM @tbl t
The result
+----+-------------------------------------------------------------------------+-----------------------------------------+
| ID | YourString | CleanedValue |
+----+-------------------------------------------------------------------------+-----------------------------------------+
| 1 | Here is one without | Here is one without |
+----+-------------------------------------------------------------------------+-----------------------------------------+
| 2 | One#some comment;in here | One in here |
+----+-------------------------------------------------------------------------+-----------------------------------------+
| 3 | Two comments#some comment;in here#here is the second;and some more text | Two comments in here and some more text |
+----+-------------------------------------------------------------------------+-----------------------------------------+
The idea in short:
Using some string methods we can wrap your unwanted text in XML comments.
Look at this
Two comments<!--some comment--> in here<!--here is the second--> and some more text
Reading this XML with .value() the content will be returned without the comments.
Hint 1: Use '-->;' in your replacement to keep the semi-colon as delimiter.
Hint 2: If there might be a semi-colon ; somewhere else in your string, you would see the --> in the result. In this case you'd need a third REPLACE() against the resulting string.
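For comparison, the transformation that the XML hack emulates is a single regular-expression replacement in environments that have one. Below is a minimal sketch in Python (not an SSIS expression; the function name strip_hash_comments is hypothetical) that replaces everything from '#' up to the next ';' with a blank, as in the result table above.

```python
import re


def strip_hash_comments(value):
    """Replace each '#...;' run (shortest match) with a single blank."""
    return re.sub(r"#[^;]*;", " ", value)


samples = [
    "Here is one without",
    "One#some comment;in here",
    "Two comments#some comment;in here#here is the second;and some more text",
]
for sample in samples:
    print(strip_hash_comments(sample))
```

A '#' that has no closing ';' is left untouched in this sketch. Whether that is the desired behavior depends on the data.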

Is there a way to see raw string values using SQL / presto SQL / athena?

Edit after asked to better specify my need:
TL;DR: How can I show escaped whitespace characters (such as \r) in the Athena console when running a query? That is, "abcdef\r" instead of "abcdef ".
I have a dataset with a column that contains some strings of variable length, all of them with a trailing whitespace.
Now, since I had analyzed this data before, using python, I know that this whitespace is a \r; however, if in Athena I SELECT my_column, it obviously doesn't show the escaped whitespace.
Essentially, what I'm trying to achieve:
my_column | ..
----------+--------
abcdef\r | ..
ghijkl\r | ..
What I'm getting instead:
my_column | ..
----------+--------
abcdef | ..
ghijkl | ..
If you're asking why I would want that: it's just to avoid having to run this data through Python if I ever run into this situation again, so that I can immediately see whether there are any odd escaped characters in my strings.
Any help is much appreciated.

PostgreSQL tuple format

Is there any document describing the tuple format that PostgreSQL server adheres to? The official documentation appears arcane about this.
A single tuple seems simple enough to figure out, but when it comes to arrays of tuples, arrays of composite tuples, and finally nested arrays of composite tuples, it is impossible to be certain about the format simply by looking at the output.
I am asking this following my initial attempt at implementing pg-tuple, a parser that's still missing today, to be able to parse PostgreSQL tuples within Node.js
Examples
create type type_A as (
a int,
b text
);
with a simple text: (1,hello)
with a complex text: (1,"hello world!")
create type type_B as (
c type_A,
d type_A[]
);
simple-value array: {"(2,two)","(3,three)"}
for type_B[] we can get:
{"(\"(7,inner)\",\"{\"\"(88,eight-1)\"\",\"\"(99,nine-2)\"\"}\")","(\"(77,inner)\",\"{\"\"(888,eight-3)\"\",\"\"(999,nine-4)\"\"}\")"}
It gets even more complex for multi-dimensional arrays of composite types.
UPDATE
Since it feels like there is no specification at all, I have started working on reversing it. Not sure if it can be done fully though, because from some initial examples it is often unclear what formatting rules are applied.
As Nick posted, according to docs:
the whitespace will be ignored if the field type is integer, but not
if it is text.
and
The composite output routine will put double quotes around field
values if they are empty strings or contain parentheses, commas,
double quotes, backslashes, or white space.
and
Double quotes and backslashes embedded in field values will be
doubled.
and now quoting Nick himself:
nested elements are converted to strings, and then quoted / escaped
like any other string
Here is a shortened example that shows each row's composite representation next to its individual column values:
a=# create table playground (t text, ta text[],f float,fa float[]);
CREATE TABLE
a=# insert into playground select 'space here',array['','bs\'],8.0,array[null,8.1];
INSERT 0 1
a=# insert into playground select 'no_space',array[null,'nospace'],9.0,array[9.1,8.0];
INSERT 0 1
a=# select playground,* from playground;
playground | t | ta | f | fa
---------------------------------------------------+------------+----------------+---+------------
("space here","{"""",""bs\\\\""}",8,"{NULL,8.1}") | space here | {"","bs\\"} | 8 | {NULL,8.1}
(no_space,"{NULL,nospace}",9,"{9.1,8}") | no_space | {NULL,nospace} | 9 | {9.1,8}
(2 rows)
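The quoting rules cited above can be sketched in a few lines of Python. This is only an illustration of the documented output rules, not PostgreSQL's actual implementation. The function names are hypothetical, and each field is assumed to be already converted to its own text representation (None stands for NULL).

```python
def quote_composite_field(value):
    """Render one field according to the composite output rules quoted above."""
    if value is None:
        return ""  # a NULL field is written as nothing at all
    needs_quotes = value == "" or any(
        ch in '(),"\\' or ch.isspace() for ch in value
    )
    if not needs_quotes:
        return value
    # Embedded backslashes and double quotes are doubled.
    escaped = value.replace("\\", "\\\\").replace('"', '""')
    return '"' + escaped + '"'


def format_composite(fields):
    """Join already-rendered field texts into one composite literal."""
    return "(" + ",".join(quote_composite_field(f) for f in fields) + ")"


# Text forms of the columns t, ta, f, fa from the first row above.
row = ["space here", '{"","bs\\\\"}', "8", "{NULL,8.1}"]
print(format_composite(row))
```

Feeding the result back in as a single field of an outer composite applies the same doubling once more, which is how the nesting levels accumulate.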
If you go for deeper nested quoting, look at:
a=# select nested,* from (select playground,* from playground) nested;
nested | playground | t | ta | f | fa
-------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+------------+----------------+---+------------
("(""space here"",""{"""""""",""""bs\\\\\\\\""""}"",8,""{NULL,8.1}"")","space here","{"""",""bs\\\\""}",8,"{NULL,8.1}") | ("space here","{"""",""bs\\\\""}",8,"{NULL,8.1}") | space here | {"","bs\\"} | 8 | {NULL,8.1}
("(no_space,""{NULL,nospace}"",9,""{9.1,8}"")",no_space,"{NULL,nospace}",9,"{9.1,8}") | (no_space,"{NULL,nospace}",9,"{9.1,8}") | no_space | {NULL,nospace} | 9 | {9.1,8}
(2 rows)
As you can see, the output again follows the rules above.
So, in short, the answers to your questions would be:
Why is an array normally presented inside double quotes, while an empty array is suddenly an open value? (The text representation of an empty array does not contain a comma, a space, or any other character that requires quoting.)
Why is a single " suddenly presented as \""? (According to the rules above, the text representation of 'one\ two' is "one\\ two", and the text representation of that in turn is ""one\\\\two"", which is exactly what you get.)
Why does unicode-formatted text change the escaping for \? How can we tell the difference then? (According to the docs,
PostgreSQL also accepts "escape" string constants, which are an
extension to the SQL standard. An escape string constant is specified
by writing the letter E (upper or lower case) just before the opening
single quote
), so it is not unicode text, but the way you tell Postgres to interpret backslash sequences in the text as escapes rather than as literal characters. E.g. E'\'' will be interpreted as ', while '\'' will make it wait for a closing ' before it can be interpreted. In your example E'\\ text', the text representation will be "\\ text": we add a backslash for the backslash and put the value in double quotes, all as described in the online docs.
The way that { and } are escaped is not always clear. (I could not answer this question, because the question itself was not clear.)

Search text between symbol

I have this text (taken from concatenated field row)
Astronomic Event 2013/1434H - Aceh ....
How do we search it by the keywords 2013 or 1434h?
I tried the code below, but it returns no rows.
to_tsvector result:
'2013/1434h':8,12 'aceh':1 'bin.....
Sample Case:
WITH sample_table as
(SELECT to_tsvector('Astronomic Event 2013/1434H - Aceh') sample_content)
SELECT *
FROM sample_table, to_tsquery('2013') q
WHERE sample_content @@ q
How do we search it by the keywords 2013 or 1434h?
It seems like you want to replace:
to_tsquery('1434h') q
with:
to_tsquery('1434h | 2013') q
http://www.postgresql.org/docs/current/static/functions-textsearch.html
Side note: the to_tsquery() syntax is extremely capricious. It leaves little room for free-form input, and many of the assumptions in Postgres are anything but end-user friendly.
More often than not, you'll be better off using plainto_tsquery(), which allows any amount of garbage to be thrown at it. Thus, consider pre-processing the string before issuing the query. For instance, you could split the string, and OR the original parts together:
where sc.text_index @@ (plainto_tsquery('1434h') || plainto_tsquery('2013'))
Doing so will make your code a bit more complex, but it won't rely on your users needing to understand that (contrary to what they're accustomed to in Google) they should enter 'quick & brown & fox & jumps & lazy & dog' instead of plain 'The quick brown fox jumps over the lazy dog'.
Edit: I ended up actually trying your sample query, and it seems you're actually running into a parser issue:
# SELECT alias, description, token FROM ts_debug('Astronomic Event 2013/1434H - Aceh');
alias | description | token
-----------+-------------------+------------
asciiword | Word, all ASCII | Astronomic
blank | Space symbols |
asciiword | Word, all ASCII | Event
blank | Space symbols |
file | File or path name | 2013/1434H
blank | Space symbols |
blank | Space symbols | -
asciiword | Word, all ASCII | Aceh
(8 rows)
http://www.postgresql.org/docs/current/static/textsearch-parsers.html
It looks like you might need to write (or find) and configure an app-specific parser. Having never done this personally, the best I can do is to highlight that Postgres allows this and includes a sample:
http://www.postgresql.org/docs/current/static/test-parser.html
Alternatively, change your tsvector-related trigger so that it matches e.g. \d{4}/\d+[a-zA-Z] or whatever seems most appropriate, and adds spaces accordingly, before converting it to a tsvector. Something as simple as the following might do the trick if you never need to store file names:
SELECT alias, description, token
FROM ts_debug(replace('Astronomic Event 2013/1434H - Aceh', '/', ' / '));
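The regex-based variant mentioned above could also run in application code before the text reaches to_tsvector. Below is a sketch in Python using the pattern \d{4}/\d+[a-zA-Z] from this answer. The function name split_year_pair is hypothetical.

```python
import re

# Four digits, a slash, then digits followed by one letter, e.g. 2013/1434H.
YEAR_PAIR = re.compile(r"(\d{4})/(\d+[a-zA-Z])")


def split_year_pair(text):
    """Put blanks around the slash so the parser sees two separate tokens."""
    return YEAR_PAIR.sub(r"\1 / \2", text)


print(split_year_pair("Astronomic Event 2013/1434H - Aceh"))
```

Unlike the plain replace() on every '/', this leaves other slash-separated text, such as real file paths, untouched.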

bytea type & nulls, Postgres

I'm using a bytea type in PostgreSQL, which, to my understanding, contains just a series of bytes. However, I can't get it to play well with nulls. For example:
=# select length(E'aa\x00aa'::bytea);
length
--------
2
(1 row)
I was expecting 5. Also:
=# select md5(E'aa\x00aa'::bytea);
md5
----------------------------------
4124bc0a9335c27f086f24ba207a4912
(1 row)
That's the MD5 of "aa", not "aa\x00aa". Clearly, I'm Doing It Wrong, but I don't know what I'm doing wrong. I'm also on an older version of Postgres (8.1.11) for reasons outside of my control. (I'll see if this behaves the same on the latest Postgres as soon as I get home...)
Try this:
# select length(E'aa\\000aa'::bytea);
length
--------
5
Updated: Why didn't the original work? First, understand the difference between one backslash and two:
pg=# select E'aa\055aa', length(E'aa\055aa') ;
?column? | length
----------+--------
aa-aa | 5
(1 row)
pg=# select E'aa\\055aa', length(E'aa\\055aa') ;
?column? | length
----------+--------
aa\055aa | 8
In the first case, I'm writing a string literal with four unescaped characters ('a') and one escaped character. The backslash is consumed by the parser in a first pass, which converts the full \055 to a single character ('-' in this case).
In the second case, the first backslash just escapes the second: the parser translates the pair \\ to a single \, and 055 is seen as three ordinary characters.
Now, when converting text to bytea, escape sequences (in already parsed or produced text) are parsed and interpreted again! (Yes, this is confusing.)
So, when I write
select E'aa\000aa'::bytea;
in the first parsing, the literal E'aa\000aa' is converted to an internal text with a null character in the third position (and depending on your postgresql version, the null character is interpreted as an EOS, and the text is assumed to be of length two - or in other versions an illegal string error is thrown).
Instead, when I write
select E'aa\\000aa'::bytea;
in the first parsing pass, the literal is converted to the string aa\000aa (eight characters) and assigned to a text value; then, in the cast to bytea, it is parsed again, and the character sequence \000 is interpreted as a null byte.
IMO postgresql kind of sucks here.
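To make the second parsing pass concrete, here is a small Python sketch of what the cast to bytea does with text in the escape format: \\ becomes one backslash byte and \ooo (three octal digits) becomes the byte with that value. This is a simplified model for illustration, not PostgreSQL's actual code, and decode_bytea_escape is a hypothetical name.

```python
def decode_bytea_escape(text):
    """Decode bytea escape-format text (already past the SQL string parser)."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            # Ordinary character: copy its bytes unchanged.
            out.extend(ch.encode("utf-8"))
            i += 1
        elif text[i + 1:i + 2] == "\\":
            # Doubled backslash: one literal backslash byte.
            out.append(0x5C)
            i += 2
        else:
            # Backslash followed by exactly three octal digits.
            octal = text[i + 1:i + 4]
            if len(octal) != 3 or any(c not in "01234567" for c in octal):
                raise ValueError("invalid escape at position %d" % i)
            value = int(octal, 8)
            if value > 255:
                raise ValueError("octal value out of range at position %d" % i)
            out.append(value)
            i += 4
    return bytes(out)


# The text that reaches the cast after the literal E'aa\\000aa' is parsed:
print(len(decode_bytea_escape("aa\\000aa")))
```

The input here is the eight-character string aa\000aa, i.e. what remains after the first pass has turned \\ into \. The decoder turns it into five bytes, with a null byte in the middle.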
You can use regular strings or dollar-quoted strings instead of escaped strings:
# select length('aa\000aa'::bytea);
length
════════
5
(1 row)
# select length($$aa\000aa$$::bytea);
length
════════
5
(1 row)
I think that dollar-quoted strings are a better option because, if the configuration parameter standard_conforming_strings is off, then PostgreSQL recognizes backslash escapes in both regular and escape string constants. However, as of PostgreSQL 9.1, the default is on, meaning that backslash escapes are recognized only in escape string constants.