Postgresql: how to make full text search ignore certain tokens?

Is there a magic function or operator to ignore some tokens?
select to_tsvector('the quick. brown fox') @@ 'brown' -- returns true
select to_tsvector('the quick,brown fox') @@ 'brown' -- returns true
select to_tsvector('the quick.brown fox') @@ 'brown' -- returns false, should return true
select to_tsvector('the quick/brown fox') @@ 'brown' -- returns false, should return true

I'm afraid you are probably stuck. If you run your text through ts_debug, you will see that 'quick.brown' is parsed as a hostname and 'quick/brown' is parsed as a filesystem path. Sadly, the parser really isn't that clever.
My only suggestion is to preprocess your text and convert these characters ('.' and '/') to spaces. You could easily create a function in plpgsql to do that.
nicg=# select ts_debug('the quick.brown fox');
ts_debug
---------------------------------------------------------------------
(asciiword,"Word, all ASCII",the,{english_stem},english_stem,{})
(blank,"Space symbols"," ",{},,)
(host,Host,quick.brown,{simple},simple,{quick.brown})
(blank,"Space symbols"," ",{},,)
(asciiword,"Word, all ASCII",fox,{english_stem},english_stem,{fox})
(5 rows)
As you can see from the above, you don't get separate tokens for quick and brown.
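A minimal sketch of that preprocessing idea, written as a plain SQL function rather than plpgsql. The function name to_tsvector_split and the 'english' configuration are assumptions for illustration, not part of the answer above; translate() maps '.' and '/' to spaces before the parser sees the text, so quick and brown should be parsed as separate words.

```sql
-- Hypothetical helper: replace '.' and '/' with spaces, then build the tsvector.
CREATE OR REPLACE FUNCTION to_tsvector_split(doc text)
RETURNS tsvector
LANGUAGE sql IMMUTABLE AS
$$
    SELECT to_tsvector('english', translate(doc, './', '  '));
$$;

-- The two cases from the question, run through the helper.
SELECT to_tsvector_split('the quick.brown fox') @@ 'brown'::tsquery;
SELECT to_tsvector_split('the quick/brown fox') @@ 'brown'::tsquery;
```

The trade-off is that genuine hostnames and file paths in the text are split into their parts as well, so this only fits data where those token types are not needed.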

Related

PostgreSQL ERROR: invalid input syntax for integer: "1e+06"

The full error message is:
ERROR: invalid input syntax for integer: "1e+06"
SQL state: 22P02
Context: In PL/R function sample
The query I'm using is:
WITH a as
(
    SELECT a.tract_id_alias,
           array_agg(a.pgid ORDER BY a.pgid) as pgids,
           array_agg(a.sample_weight_geo ORDER BY a.pgid) as block_weights
    FROM results_20161109.block_microdata_res_joined a
    WHERE a.tract_id_alias in (66772, 66773, 66785, 66802, 66805, 66806, 66813)
      AND a.bldg_count_res > 0
    GROUP BY a.tract_id_alias
)
SELECT NULL::INTEGER agent_id,
       a.tract_id_alias,
       b.year,
       unnest(shared.sample(a.pgids,
                            b.n_agents,
                            1 * b.year,
                            True,
                            a.block_weights)
       ) as pgid
FROM a
LEFT JOIN results_20161109.initial_agent_count_by_tract_res_11 b
       ON a.tract_id_alias = b.tract_id_alias
ORDER BY b.year, a.tract_id_alias, pgid;
And the shared.sample function I'm using is:
CREATE OR REPLACE FUNCTION shared.sample(ids bigint[], size integer, seed integer DEFAULT 1, with_replacement boolean DEFAULT false, probabilities numeric[] DEFAULT NULL::numeric[])
  RETURNS integer[] AS
$BODY$
    set.seed(seed)
    if (length(ids) == 1) {
        s = rep(ids,size)
    } else {
        s = sample(ids,size, with_replacement,probabilities)
    }
    return(s)
$BODY$
  LANGUAGE plr VOLATILE
  COST 100;
ALTER FUNCTION shared.sample(bigint[], integer, integer, boolean, numeric[])
  OWNER TO "server-superusers";
I'm pretty new to this stuff, so any help would be appreciated.

The function is not the problem. As the error message says, the string '1e+06' cannot be cast to integer.
Apparently, the column n_agents in your table results_20161109.initial_agent_count_by_tract_res_11 is not an integer column. Is it perhaps of type text or varchar? (That information would help in your question.)
Either way, the assignment cast does not work for the target type integer, but it does work for numeric:
Does not work:
SELECT '1e+06'::text::int; -- error as in question
Works:
SELECT '1e+06'::text::numeric::int;
If my assumptions hold, you can use this as a stepping stone.
Replace b.n_agents in your query with b.n_agents::numeric::int.
It's your responsibility to make sure the numbers stay in integer range, or you get the next exception.
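To test the assumption about the column type before changing the query, something like the following could be run. This is only a sketch: it reuses the table and column names from the question, and the second query additionally assumes that n_agents really is text or varchar.

```sql
-- Which data type does the column actually have?
SELECT pg_typeof(n_agents)
FROM results_20161109.initial_agent_count_by_tract_res_11
LIMIT 1;

-- Assuming text/varchar: list the rows whose value is not a plain
-- integer literal (for example '1e+06') and would break a direct cast.
SELECT tract_id_alias, year, n_agents
FROM results_20161109.initial_agent_count_by_tract_res_11
WHERE n_agents !~ '^\s*[-+]?\d+\s*$';
```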
If that did not nail it, you need to look into function overloading:
Is there a way to disable function overloading in Postgres
And function type resolution:
PostgreSQL function call
The schema search path is relevant in many related cases, but you did schema-qualify all objects, so we can rule that out.
How does the search_path influence identifier resolution and the "current schema"
Your query generally looks good. I had a look and only found minor improvements:
SELECT NULL::int AS agent_id -- never omit the AS keyword for column alias
     , a.tract_id_alias
     , b.year
     , s.pgid
FROM (
    SELECT tract_id_alias
         , array_agg(pgid) AS pgids
         , array_agg(sample_weight_geo) AS block_weights
    FROM ( -- use a subquery, cheaper than CTE
        SELECT tract_id_alias
             , pgid
             , sample_weight_geo
        FROM results_20161109.block_microdata_res_joined
        WHERE tract_id_alias IN (66772, 66773, 66785, 66802, 66805, 66806, 66813)
        AND bldg_count_res > 0
        ORDER BY pgid -- sort once in a subquery. cheaper.
    ) sub
    GROUP BY 1
) a
LEFT JOIN results_20161109.initial_agent_count_by_tract_res_11 b USING (tract_id_alias)
LEFT JOIN LATERAL
    unnest(shared.sample(a.pgids
                       , b.n_agents
                       , b.year -- why "1 * b.year"?
                       , true
                       , a.block_weights)) s(pgid) ON true
ORDER BY b.year, a.tract_id_alias, s.pgid;

Strange result searching with to_tsquery under Postgresql

I got a strange result searching for an expression like pro-physik.de with tsquery.
If I ask for pro-physik:* by tsquery I want to get all entries starting with pro-physik. Unfortunately those entries with pro-physik.de are missing.
Here are 2 examples to demonstrate the problem:
Query 1:
select
to_tsvector('simple', 'pro-physik.de') @@
to_tsquery('simple', 'pro-physik:*') = true
Result 1: false (should be true)
Query 2:
select
to_tsvector('simple', 'pro-physik.de') @@
to_tsquery('simple', 'pro-p:*') = true
Result 2: true
Does anybody have an idea how I could solve this problem?

The core of the problem is that the parser will parse pro-physik.de as a hostname:
SELECT alias, token FROM ts_debug('simple', 'pro-physik.de');
 alias |     token
-------+---------------
 host  | pro-physik.de
(1 row)
Compare this:
SELECT alias, token FROM ts_debug('simple', 'pro-physik-de');
      alias      |     token
-----------------+---------------
 asciihword      | pro-physik-de
 hword_asciipart | pro
 blank           | -
 hword_asciipart | physik
 blank           | -
 hword_asciipart | de
(6 rows)
Now pro-physik and pro-p are not hostnames, so you get
SELECT to_tsquery('simple', 'pro-physik:*');
              to_tsquery
---------------------------------------
 'pro-physik':* & 'pro':* & 'physik':*
(1 row)
SELECT to_tsquery('simple', 'pro-p:*');
         to_tsquery
-----------------------------
 'pro-p':* & 'pro':* & 'p':*
(1 row)
The tsvector contains only the single lexeme 'pro-physik.de'. The first tsquery will not match because physik is not a prefix of pro-physik.de, and the second will match because pro-p, pro and p are all prefixes of it.
As a workaround, use full text search like this:
select
to_tsvector('simple', replace('pro-physik.de', '.', ' ')) ##
to_tsquery('simple', replace('pro-physik:*', '.', ' '))
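A sketch of how the same workaround might be applied to a table instead of literals. The table entries, its column name and the index name are hypothetical; the point is only that the identical replace() expression has to be used when indexing and when querying.

```sql
-- Hypothetical table holding the searchable strings.
CREATE TABLE entries (name text);
INSERT INTO entries (name) VALUES ('pro-physik.de'), ('pro-physik-de');

-- Expression index over the preprocessed text.
CREATE INDEX entries_name_fts_idx
    ON entries
    USING gin (to_tsvector('simple', replace(name, '.', ' ')));

-- The query repeats the indexed expression, so the index can be used.
SELECT name
FROM entries
WHERE to_tsvector('simple', replace(name, '.', ' '))
      @@ to_tsquery('simple', 'pro-physik:*');
```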

How to cast varchar to boolean

I have a variable 'x' which is varchar in staging table, but it is set to boolean in target table which has 'true' and 'false' values. How can I convert varchar to boolean in postgresql?

If the varchar column contains one of the strings (case-insensitive):
t, true, y, yes, on, 1
f, false, n, no, off, 0
you can simply cast it to boolean, e.g.:
select 'true'::boolean, 'false'::boolean;
 bool | bool
------+------
 t    | f
(1 row)
See SQLFiddle.
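Applied to the staging-to-target case from the question, the cast could look like the sketch below. The table definitions are hypothetical stand-ins; only the column name x is taken from the question.

```sql
-- Hypothetical tables standing in for the staging and target tables.
CREATE TEMP TABLE staging (x varchar);
CREATE TEMP TABLE target (x boolean);

INSERT INTO staging (x) VALUES ('true'), ('false'), ('Yes'), ('0');

-- The explicit cast converts every accepted string to a boolean;
-- any other string (for example 'maybe') would raise an error here.
INSERT INTO target (x)
SELECT x::boolean
FROM staging;

SELECT x FROM target;
```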

For Redshift, I had the best luck with the following:
SELECT DECODE(column_name,
'false', '0',
'true', '1'
)::integer::boolean from table_name;
This simply maps the varchar strings to '0' or '1', which Redshift can then cast first to integer and finally to boolean.
A big advantage of this approach is that it can be expanded to include any additional strings you would like to map. For example:
SELECT DECODE(column_name,
'false', '0',
'no', '0',
'true', '1',
'yes', '1'
)::integer::boolean from table_name;
You can read more about the DECODE method here.

In AWS Redshift, unfortunately, @klin's answer doesn't work, as mentioned by others. Inspired by the answer of @FoxMulder900, DECODE seems the way to go, but there is no need to cast to an integer first:
SELECT DECODE(original,
'true', true, -- decode true
'false', false, -- decode false
false -- an optional default value
) as_boolean FROM bar;
The following works:
WITH bar (original) AS
(SELECT 'false' UNION SELECT 'true' UNION SELECT 'null') -- dumb data
SELECT original, DECODE(original,
'true', true, -- decode true
'false', false, -- decode false
false -- an optional default value
) as_boolean FROM bar;
which gives:
 original | as_boolean
----------+------------
 false    | false
 null     | false
 true     | true
I hope this helps redshift users.

For old PostgreSQL versions and in Redshift, casting won't work, but the following does:
SELECT boolin(textout('true'::varchar)), boolin(textout('false'::varchar));
See SQLFiddle; also see the discussion on the PostgreSQL list.

If you can assume anything besides true is false, then you could use:
select
column_name = 'true' AS column_name_as_bool
from
table_name;

Search with Turkish characters

I have a problem with database search using LIKE, and with Elasticsearch, when Turkish upper- and lower-case letters are involved.
For example, I have a posts table which contains a post titled 'DENEME YAZI'.
If I run this query:
select * from posts where title like '%deneme%';
or:
select * from posts where title like '%YAZI%';
I get the correct result, but if I run:
select * from posts where title like '%yazı%';
it doesn't return any records. My database encoding is tr_TR.UTF-8.
How can I get correct results without entering the exact word?

You must use ILIKE for case insensitive matches:
select * from posts where title ilike '%yazı%';
However, there is the additional complication of peculiar rules in the Turkish locale. Upper case of 'ı' is 'I'. But not the other way round. Lower case of 'I' is 'i':
db=# SELECT lower(upper('ı'));
 lower
-------
 i
You could solve that by applying upper() on either side of the LIKE expression:
select upper('DENEME YAZI') like ('%' || upper('yazı') || '%');

Applying just a single UPPER (or LOWER) on either side of the expression is not a solution. You have to handle the problematic Turkish characters (ı/I and i/İ) yourself.
İ and i are the same letter in the Turkish alphabet.
I and ı are the same letter in the Turkish alphabet.
But even with UTF-8, Latin5 or Windows-1254 encoding and collation settings in PostgreSQL:
UPPER('İ') returns 'İ' OK
UPPER('i') returns 'I' Not OK
UPPER('I') returns 'I' OK
UPPER('ı') returns 'İ' Not OK
so
SELECT ... FROM ... WHERE ... UPPER('İZMİR') like UPPER('izmir') returns false
SELECT ... FROM ... WHERE ... UPPER('ISPARTA') like UPPER('ısparta') returns false.
Here is a more precise solution, though not a perfect one because of performance issues:
SELECT ... FROM ... WHERE ...
UPPER(REPLACE(REPLACE(COLUMNX, 'i', 'İ'), 'ı', 'I')) = UPPER(REPLACE(REPLACE(myvalue,
'i', 'İ'), 'ı', 'I'))
or
SELECT ... FROM ... WHERE ...
UPPER(TRANSLATE(COLUMNX,'ıi','Iİ')) = UPPER(TRANSLATE(myvalue,'ıi','Iİ'))
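The TRANSLATE variant can be wrapped in a small helper so the mapping lives in one place. This is a sketch only: the function name tr_upper is hypothetical, the posts table and title column come from the question, and a UTF-8 database is assumed.

```sql
-- Hypothetical helper: map the two problematic lower-case letters to their
-- Turkish upper-case forms first, then upper-case everything else.
CREATE OR REPLACE FUNCTION tr_upper(txt text)
RETURNS text
LANGUAGE sql IMMUTABLE AS
$$
    SELECT upper(translate(txt, 'ıi', 'Iİ'));
$$;

-- Case-insensitive substring search that treats ı/I and i/İ as pairs.
SELECT *
FROM posts
WHERE tr_upper(title) LIKE '%' || tr_upper('yazı') || '%';
```

Because the pattern starts with a wildcard, an ordinary B-tree index on tr_upper(title) would not help this particular query, which is in line with the performance caveat above.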

Modifying a SQL query within the Select statement

I have a stored proc that is called from my asp.net page. The field "RecordUpdated" returns TRUE or FALSE. I need that column to return YES (if true) or NO (if false)
How would I do that?
SET @SQL = 'SELECT RecID, Vendor_Number, Vendor_Name, Invoice_Number, Item_Number, RecordAddDate, RecordUpdated FROM Vendor_Invoice_Log'
SET @SQL = @SQL + ' ORDER BY Item_Number, Vendor_Number'
PRINT @SQL

If all you are doing is trying to change TRUE to YES and FALSE to NO, you can do this:
select case
when recordupdated = true then 'YES'
when recordupdated = false then 'NO'
end recordupdates ...
Of course your code doesn't actually execute, so I am uncertain why you showed lines 2 and 3.
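Placed inside the dynamic SQL string from the question, the CASE expression might look like the sketch below. It assumes SQL Server (T-SQL) and that RecordUpdated is either a bit column or holds the strings 'TRUE'/'FALSE'; the comparison against 'TRUE' and the nvarchar(max) declaration are assumptions, not taken from the question. Single quotes inside the string literal have to be doubled.

```sql
DECLARE @SQL nvarchar(max);

-- Build the statement; note the doubled quotes around TRUE, YES and NO.
SET @SQL = 'SELECT RecID, Vendor_Number, Vendor_Name, Invoice_Number, Item_Number, RecordAddDate, '
         + 'CASE WHEN RecordUpdated = ''TRUE'' THEN ''YES'' ELSE ''NO'' END AS RecordUpdated '
         + 'FROM Vendor_Invoice_Log';
SET @SQL = @SQL + ' ORDER BY Item_Number, Vendor_Number';

-- Inspect the generated statement before running it.
PRINT @SQL;
```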

In situations where you need to return one or the other of something when doing a SELECT, a CASE expression such as
CASE
WHEN RecordUpdated = true
THEN 'YES'
WHEN RecordUpdated = false
THEN 'NO'
END AS RecordUpdated
is pretty useful.

Sounds like a job for CASE()...
