Optimization of search on concatenated firstname and lastname in PostgreSQL - postgresql

I've written a SQL query in Postgres that searches for a user by both firstname and lastname. My question is simply whether it can be optimized, since it will be used a lot.
CREATE INDEX users_firstname_special_idx ON users(firstname text_pattern_ops);
CREATE INDEX users_lastname_special_idx ON users(lastname text_pattern_ops);
SELECT id, firstname, lastname FROM users WHERE firstname || ' ' || lastname ILIKE ('%' || 'sen' || '%') LIMIT 25;
If I run an EXPLAIN I get the following output:
Limit (cost=0.00..1.05 rows=1 width=68)
-> Seq Scan on users (cost=0.00..1.05 rows=1 width=68)
Filter: (((firstname || ' '::text) || lastname) ~~* '%sen%'::text)
As I understand it, I should try to make Postgres skip the "Filter:" step. Is that correct?
Hope you guys have some suggestions.
Cheers.

If the pattern has a % wildcard anywhere other than at the end (such as a leading one), a B-tree index cannot help; you need a trigram index.
In your case, however, you are doing something odd: you concatenate firstname and lastname with a space in the middle. Since the search string '%sen%' contains no space, it can only ever match inside firstname alone or inside lastname alone, never across the boundary. A better solution is therefore:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX users_firstname_special_idx ON users USING gist (firstname gist_trgm_ops);
CREATE INDEX users_lastname_special_idx ON users USING gist (lastname gist_trgm_ops);
SELECT id, firstname || ' ' || lastname AS fullname
FROM users
WHERE firstname ILIKE ('%sen%') OR lastname ILIKE ('%sen%')
LIMIT 25;

The situation you describe is covered exactly in the PostgreSQL documentation:
Indexes on Expressions
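A minimal sketch of what that chapter suggests, applied to this query (index names are illustrative; the trigram variant assumes the pg_trgm extension):

```sql
-- B-tree expression index: helps equality and left-anchored LIKE
-- on the concatenated name, but not infix patterns like '%sen%':
CREATE INDEX users_fullname_idx ON users ((firstname || ' ' || lastname));

-- For infix ILIKE patterns, a trigram expression index can serve instead:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX users_fullname_trgm_idx ON users
    USING gin ((firstname || ' ' || lastname) gin_trgm_ops);
```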

How/When do views generate columns

Consider the following setup:
CREATE TABLE person (
first_name varchar,
last_name varchar,
age INT
);
INSERT INTO person (first_name, last_name, age)
VALUES ('pete', 'peterson', 16),
('john', 'johnson', 20),
('dick', 'dickson', 42),
('rob', 'robson', 30);
CREATE OR REPLACE VIEW adult_view AS
SELECT
first_name || ' ' || last_name as full_name,
age
FROM person;
If I run:
SELECT *
FROM adult_view
WHERE age > 18;
Will the view generate the full_name column for pete even though he gets filtered out?
Similarly, if I run:
SELECT age
FROM adult_view;
Will the view generate any full_name columns at all?
Concerning your first query:
EXPLAIN (VERBOSE, COSTS OFF)
SELECT full_name
FROM adult_view
WHERE age > 18;
QUERY PLAN
══════════════════════════════════════════════════════════════════════════════════
Seq Scan on laurenz.person
Output: (((person.first_name)::text || ' '::text) || (person.last_name)::text)
Filter: (person.age > 18)
Query Identifier: 5675263059379476127
So the table is scanned, people aged 18 or younger are filtered out, and then the result columns are computed. So full_name won't be computed for Pete.
Concerning your second query:
EXPLAIN (VERBOSE, COSTS OFF)
SELECT age
FROM adult_view
WHERE age > 18;
QUERY PLAN
════════════════════════════════════════
Seq Scan on laurenz.person
Output: person.age
Filter: (person.age > 18)
Query Identifier: -8981994317495194105
(4 rows)
full_name is not calculated at all.

PostgreSQL Trigram indexes vs btree

I have this searching query for my cake database, which is currently very slow and I'm looking to improve it. I am running PostgreSQL v. 9.6.
Table structure:
Table: Cakes
=====
id int
cake_id varchar
cake_short_name varchar
cake_full_name varchar
has_recipe boolean
createdAt Datetime
updatedAt DateTime
Table: CakeViews
=========
id int
cake_id varchar
createdAt Datetime
updatedAt DateTime
Query:
WITH myconstants (myquery, queryLiteral) AS (
    VALUES ('%a%', 'a')
)
SELECT
    full_count,
    cake_id,
    cake_short_name,
    cake_full_name,
    has_recipe,
    views
FROM (
    SELECT
        count(*) OVER () AS full_count,
        cake_id,
        cake_short_name,
        cake_full_name,
        has_recipe,
        cast((SELECT count(*)
              FROM "CakeViews" AS cv
              WHERE "createdAt" > CURRENT_DATE - 3
                AND c.cake_id = cv.cake_id) AS integer) AS views
    FROM "Cakes" c, myconstants
    WHERE has_recipe = true
      AND (cake_full_name ILIKE myquery OR cake_short_name ILIKE myquery)
       OR cake_full_name ILIKE lower(queryLiteral)
       OR cake_short_name ILIKE lower(queryLiteral)
) t, myconstants
ORDER BY views DESC,
    CASE
        WHEN cake_short_name ILIKE lower(queryLiteral) THEN 1
        WHEN cake_full_name ILIKE lower(queryLiteral) THEN 1
    END,
    CASE
        WHEN has_recipe = true AND cake_short_name ILIKE myquery THEN length(cake_short_name)
        WHEN has_recipe = true AND cake_full_name ILIKE myquery THEN length(cake_full_name)
    END
LIMIT 10
I have ideas for the following indices, but they don't speed up the query that much:
CREATE EXTENSION pg_trgm;
CREATE INDEX idx_cakes_cake_short_name ON public."Cakes" (lower(cake_short_name) varchar_pattern_ops);
CREATE INDEX idx_cakes_cake_id ON public."Cakes" (cake_short_name);
CREATE INDEX idx_cakeviews_cake_id ON public."CakeViews" (cake_id);
CREATE INDEX idx_cakes_cake_short_name ON public."Cakes" USING gin (cake_short_name gin_trgm_ops);
CREATE INDEX idx_cakes_cake_full_name ON public."Cakes" USING gin (cake_full_name gin_trgm_ops);
Questions:
What indices would be better or which am I missing?
Is my query inefficient?
EDIT: Explain Analyze output: here
The pattern '%a%' doesn't contain any complete trigram, so the index will not be useful there; the whole table must be scanned. But with a longer search string, the trigram indexes might be useful.
The index on "CakeViews" (cake_id) would be better if it were on "CakeViews" (cake_id, "createdAt"). Except that none of your cakes seem to have any views, so if that is generally the case I guess it wouldn't matter.
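A sketch of that suggestion (the index name is illustrative):

```sql
CREATE INDEX idx_cakeviews_cake_id_createdat
    ON public."CakeViews" (cake_id, "createdAt");
```

With both columns in the index, the correlated count(*) subquery can be answered by a single index range scan per cake (and, in the best case, an index-only scan) instead of visiting the table.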

Matching on postal code and city name - very slow in PostgreSQL

I am trying to update address fields in mytable with data from othertable.
If I match on postal codes and search for city names from othertable in mytable, it works reasonably fast. But as I don't have postal codes in all cases, I also want to look for names only in a 2nd query. This takes hours (>12h). Any ideas what I can do to speed up the query? Please note that indexing did not help. Index scans in (2) weren't faster.
Code for matching on postal code + name (1)
update mytable t1 set
    admin1 = t.admin1,
    admin2 = t.admin2,
    admin3 = t.admin3,
    postal_code = t.postal_code,
    lat = t.lat,
    lng = t.lng
from othertable t
where t.postal_code = t1.postal_code and t1.country = t.country
and upper(t1.address) like '%' || t.admin1 || '%' --checks whether the city name from othertable appears in the address in t1
and t1.admin1 is null;
Code for matching on name only (2)
update mytable t1 set
    admin1 = t.admin1,
    admin2 = t.admin2,
    admin3 = t.admin3,
    postal_code = t.postal_code,
    lat = t.lat,
    lng = t.lng
from othertable t
where t1.country = t.country
and upper(t1.address) like '%' || t.admin1 || '%' --checks whether the city name from othertable appears in the address in t1
and t1.admin1 is null;
Query plan 1:
"Update on mytable t1 (cost=19084169.53..19205622.16 rows=13781 width=1918)"
" -> Merge Join (cost=19084169.53..19205622.16 rows=13781 width=1918)"
" Merge Cond: (((t1.postal_code)::text = (othertable.postal_code)::text) AND (t1.country = othertable.country))"
" Join Filter: (upper((t1.address)::text) ~~ (('%'::text || othertable.admin1) || '%'::text))"
" -> Sort (cost=18332017.34..18347693.77 rows=6270570 width=1661)"
" Sort Key: t1.postal_code, t1.country"
" -> Seq Scan on mytable t1 (cost=0.00..4057214.31 rows=6270570 width=1661)"
" Filter: (admin1 IS NULL)"
" -> Materialize (cost=752152.19..766803.71 rows=2930305 width=92)"
" -> Sort (cost=752152.19..759477.95 rows=2930305 width=92)"
" Sort Key: othertable.postal_code, othertable.country"
" -> Seq Scan on othertable (cost=0.00..136924.05 rows=2930305 width=92)"
Query plan 2:
"Update on mytable t1 (cost=19084169.53..27246633167.33 rows=5464884210 width=1918)"
" -> Merge Join (cost=19084169.53..27246633167.33 rows=5464884210 width=1918)"
" Merge Cond: (t1.country = othertable.country)"
" Join Filter: (upper((t1.address)::text) ~~ (('%'::text || othertable.admin1) || '%'::text))"
" -> Sort (cost=18332017.34..18347693.77 rows=6270570 width=1661)"
" Sort Key: t1.country"
" -> Seq Scan on mytable t1 (cost=0.00..4057214.31 rows=6270570 width=1661)"
" Filter: (admin1 IS NULL)"
" -> Materialize (cost=752152.19..766803.71 rows=2930305 width=92)"
" -> Sort (cost=752152.19..759477.95 rows=2930305 width=92)"
" Sort Key: othertable.country"
" -> Seq Scan on othertable (cost=0.00..136924.05 rows=2930305 width=92)"
In the second query, you are joining (more or less) on city name, but othertable has several entries per city name, so you are updating each row of mytable several times, with unpredictable values (which lat/lng or admin2/admin3 will be the last one applied?).
If othertable has entries without a postal code, use them by adding an extra condition AND othertable.postal_code IS NULL.
Otherwise, you will want a subset of othertable that returns one row per (admin1, country) value. You would replace select * from othertable with the following query. Of course you might want to adjust it to pick a different lat/lng/admin2/admin3 than the first one.
-- note: first() is not a built-in PostgreSQL aggregate; define it first,
-- or use e.g. min() or (array_agg(...))[1] instead
SELECT admin1, country,
       first(postal_code) postal_code, first(lat) lat, first(lng) lng,
       first(admin2) admin2, first(admin3) admin3
FROM othertable
GROUP BY admin1, country
Worse, the second query overwrites what was updated by the first query, so you must skip those records by adding AND t1.postal_code IS NULL.
The entire query could be
UPDATE mytable t1
SET
admin1 = t.admin1,
admin2 = t.admin2,
admin3 = t.admin3,
postal_code = t.postal_code,
lat = t.lat,
lng = t.lng
FROM (
SELECT admin1, country, first(postal_code) postal_code, first(lat) lat, first(lng) lng, first(admin2) admin2, first(admin3) admin3
FROM othertable
GROUP BY admin1,country) t
WHERE t1.country = t.country
AND upper(t1.address) like '%' || t.admin1 || '%' --checks whether the city name from othertable appears in the address in t1
AND t1.admin1 is null
AND t1.postal_code is null;
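Since first() is not a built-in PostgreSQL aggregate, the same one-row-per-(admin1, country) subquery can also be sketched with DISTINCT ON (picking the lowest postal_code per group is an arbitrary choice here):

```sql
SELECT DISTINCT ON (admin1, country)
       admin1, country, postal_code, lat, lng, admin2, admin3
FROM othertable
ORDER BY admin1, country, postal_code;  -- deterministic row per group
```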

LIKE search of joined and concatenated records is really slow (PostgreSQL)

I'm returning a unique list of ids from the users table, where specific columns in a related table (positions) contain a matching string.
The related table may have multiple records for each user record.
The query is taking a really, really long time (it's not scalable), so I'm wondering if I'm structuring the query wrong in some fundamental way?
Users Table:
id | name
-----------
1 | frank
2 | kim
3 | jane
Positions Table:
id | user_id | title | company | description
--------------------------------------------------
1 | 1 | manager | apple | 'Managed a team of...'
2 | 1 | assistant | apple | 'Assisted the...'
3 | 2 | developer | huawei | 'Build a feature that...'
For example: I want to return the user's id if a related positions record contains "apple" in either the title, company or description columns.
Query:
select
distinct on (users.id) users.id,
users.name,
...
from users
where (
select
string_agg(distinct positions.description, ', ') ||
string_agg(distinct positions.title, ', ') ||
string_agg(distinct positions.company, ', ')
from positions
where positions.users_id::int = users.id
group by positions.users_id::int) like '%apple%'
UPDATE
I like the idea of moving this into a join clause. But what I'm looking to do is filter users conditional on below. And I'm not sure how to do both in a join.
1) finding the keyword in title, company, description
or
2) finding the keyword with full-text search in an associated string version of a document in another table.
(select
to_tsvector(string_agg(distinct documents.content, ', '))
from documents
where users.id = documents.user_id
group by documents.user_id) @@ to_tsquery('apple')
So I was originally thinking it might look like,
select
distinct on (users.id) users.id,
users.name,
...
from users
where (
(select
string_agg(distinct positions.description, ', ') ||
string_agg(distinct positions.title, ', ') ||
string_agg(distinct positions.company, ', ')
from positions
where positions.users_id::int = users.id
group by positions.users_id::int) like '%apple%')
or
(select
to_tsvector(string_agg(distinct documents.content, ', '))
from documents
where users.id = documents.user_id
group by documents.user_id) @@ to_tsquery('apple'))
But then it was really slow - I can confirm the slowness is from the first condition, not the full-text search.
Might not be the best solution, but a quick option is:
SELECT DISTINCT ON ( u.id ) u.id,
u.name
FROM users u
JOIN positions p ON (
p.user_id = u.id
AND ( description || title || company )
LIKE '%apple%'
);
Basically I got rid of the subquery, the unnecessary string_agg usage, the grouping on the positions table, etc.
What it does is a conditional join; removing duplicates is covered by DISTINCT ON.
PS! I used the table aliases u and p to shorten the example.
EDIT: adding also WHERE example as requested
SELECT DISTINCT ON ( u.id ) u.id,
u.name
FROM users u
JOIN positions p ON ( p.user_id = u.id )
WHERE ( p.description || p.title || p.company ) LIKE '%apple%'
OR ...your other conditions...;
EDIT2: new details revealed setting new requirements of the original question. So adding new example for updated ask:
Since you are doing lookups into two different tables (positions and uploads) with an OR condition, a simple JOIN wouldn't work.
But both lookups are verification-type lookups: you only check whether %apple% occurs, so you do not need to aggregate, group by, or convert the data.
EXISTS, which returns TRUE as soon as the first match is found, seems to be what you need anyway. So remove all the unnecessary parts: SELECT 1 ... LIMIT 1 returns a row if a match exists and nothing otherwise (the latter makes EXISTS evaluate to FALSE), which gives you the same result.
So here is how you could solve it:
SELECT DISTINCT ON ( u.id ) u.id,
u.name
FROM users u
WHERE EXISTS (
SELECT 1
FROM positions p
WHERE p.users_id = u.id::int
AND ( description || title || company ) LIKE '%apple%'
LIMIT 1
)
OR EXISTS (
SELECT 1
FROM uploads up
WHERE up.user_id = u.id::int -- your query referenced a 'documents' table that doesn't exist in your example, so I used the 'uploads' table from your FROM clause instead, assuming a 'content' column exists there
AND up.content LIKE '%apple%'
LIMIT 1
);
NB! Your example queries reference tables/aliases like documents that don't appear anywhere in the FROM part. Either your real query was cut down with inconsistent naming, or there is a typo somewhere; verify this and adjust my example query accordingly.

LATERAL JOIN not using trigram index

I want to do some basic geocoding of addresses using Postgres. I have an address table that has around 1 million raw address strings:
=> \d addresses
Table "public.addresses"
Column | Type | Modifiers
---------+------+-----------
address | text |
I also have a table of location data:
=> \d locations
Table "public.locations"
Column | Type | Modifiers
------------+------+-----------
id | text |
country | text |
postalcode | text |
latitude | text |
longitude | text |
Most of the address strings contain postalcodes, so my first attempt was to do a like and a lateral join:
EXPLAIN SELECT * FROM addresses a
JOIN LATERAL (
SELECT * FROM locations
WHERE address ilike '%' || postalcode || '%'
ORDER BY LENGTH(postalcode) DESC
LIMIT 1
) AS l ON true;
That gave the expected result, but it was slow. Here's the query plan:
QUERY PLAN
--------------------------------------------------------------------------------------
Nested Loop (cost=18383.07..18540688323.77 rows=1008572 width=91)
-> Seq Scan on addresses a (cost=0.00..20997.72 rows=1008572 width=56)
-> Limit (cost=18383.07..18383.07 rows=1 width=35)
-> Sort (cost=18383.07..18391.93 rows=3547 width=35)
Sort Key: (length(locations.postalcode))
-> Seq Scan on locations (cost=0.00..18365.33 rows=3547 width=35)
Filter: (a.address ~~* (('%'::text || postalcode) || '%'::text))
I tried adding a trigram index on the address column, as mentioned at https://stackoverflow.com/a/13452528/36191, but the query plan for the above query doesn't make use of it and is unchanged.
CREATE INDEX idx_address ON addresses USING gin (address gin_trgm_ops);
I have to remove the order by and limit in the lateral join query for the index to get used, which doesn't give me the results I want. Here's the query plan for the query without ORDER or LIMIT:
QUERY PLAN
-----------------------------------------------------------------------------------------------
Nested Loop (cost=39.35..129156073.06 rows=3577682241 width=86)
-> Seq Scan on locations (cost=0.00..12498.55 rows=709455 width=28)
-> Bitmap Heap Scan on addresses a (cost=39.35..131.60 rows=5043 width=58)
Recheck Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))
-> Bitmap Index Scan on idx_address (cost=0.00..38.09 rows=5043 width=0)
Index Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))
Is there something I can do to get the query to use the index, or is there a better way to rewrite this query?
Why?
The query cannot use the index in principle. You would need an index on the table locations, but the one you have is on the table addresses.
You can verify my claim by setting:
SET enable_seqscan = off;
(In your session only, and for debugging only. Never use it in production.) It's not that the index would be more expensive than a sequential scan; there is simply no way for Postgres to use it for your query at all.
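For example, as a session-local check against the original query:

```sql
SET enable_seqscan = off;  -- debugging only, never in production

EXPLAIN SELECT * FROM addresses a
JOIN LATERAL (
    SELECT * FROM locations
    WHERE address ILIKE '%' || postalcode || '%'
    ORDER BY LENGTH(postalcode) DESC
    LIMIT 1
) AS l ON true;
-- locations is still scanned sequentially: the index cannot be used at all

RESET enable_seqscan;
```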
Aside: [INNER] JOIN ... ON true is just an awkward way of saying CROSS JOIN ...
Why is the index used after removing ORDER and LIMIT?
Because Postgres can rewrite this simple form to:
SELECT *
FROM addresses a
JOIN locations l ON a.address ILIKE '%' || l.postalcode || '%';
You'll see the exact same query plan. (At least I do in my tests on Postgres 9.5.)
Solution
You need an index on locations.postalcode. And while using LIKE or ILIKE you would also need to bring the indexed expression (postalcode) to the left side of the operator. ILIKE is implemented with the operator ~~* and this operator has no COMMUTATOR (a logical necessity), so it's not possible to flip operands around. Detailed explanation in these related answers:
Can PostgreSQL index array columns?
PostgreSQL - text Array contains value similar to
Is there a way to usefully index a text column containing regex patterns?
A solution is to use the trigram similarity operator % or its inverse, the distance operator <-> in a nearest neighbour query instead (each is commutator for itself, so operands can switch places freely):
SELECT *
FROM addresses a
JOIN LATERAL (
SELECT *
FROM locations
ORDER BY postalcode <-> a.address
LIMIT 1
) l ON address ILIKE '%' || postalcode || '%';
Find the most similar postalcode for each address, and then check if that postalcode actually matches fully.
This way, a longer postalcode will be preferred automatically since it's more similar (smaller distance) than a shorter postalcode that also matches.
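A quick illustration of that preference with made-up strings (<-> is trigram distance, i.e. 1 - similarity(); exact values depend on the inputs):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- The longer code shares more trigrams with the address string,
-- so its distance is smaller and it sorts first:
SELECT code, '12345 Main St' <-> code AS distance
FROM (VALUES ('12345'), ('123')) AS codes(code)
ORDER BY distance;
```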
A bit of uncertainty remains. Depending on possible postal codes, there could be false positives due to matching trigrams in other parts of the string. There is not enough information in the question to say more.
Here, [INNER] JOIN instead of CROSS JOIN makes sense, since we add an actual join condition.
The manual:
This can be implemented quite efficiently by GiST indexes, but not by GIN indexes.
So:
CREATE INDEX locations_postalcode_trgm_gist_idx ON locations
USING gist (postalcode gist_trgm_ops);
It's a long shot, but how does the following alternative perform?
SELECT DISTINCT ON ((x.a).address) (x.a).*, l.*
FROM (
SELECT a, l.id AS lid, LENGTH(l.postalcode) AS pclen
FROM addresses a
LEFT JOIN locations l ON (a.address ilike '%' || l.postalcode || '%') -- this should be fast, but produce many rows
) x
LEFT JOIN locations l ON (l.id = x.lid)
ORDER BY (x.a).address, pclen DESC -- this is where it will be slow, as it'll have to sort the entire results, to filter them by DISTINCT ON
It can work if you turn the lateral join inside out. But even then it might still be really slow:
SELECT DISTINCT ON (address) *
FROM (
SELECT *
FROM locations
,LATERAL(
SELECT * FROM addresses
WHERE address ilike '%' || postalcode || '%'
OFFSET 0 -- force fencing, might be redundant
) a
) q
ORDER BY address, LENGTH(postalcode) DESC
The downside is that you can implement paging only on postalcodes, not addresses.