Optimize PostgreSQL query with levenshtein() function

I have a table with approximately 7 million records. The table has a first_name and last_name column which I want to search on using the levenshtein() distance function.
select levenshtein('JOHN', first_name) as fn_distance,
levenshtein('DOE', last_name) as ln_distance,
id,
first_name as "firstName",
last_name as "lastName"
from person
where first_name is not null
and last_name is not null
and levenshtein('JOHN', first_name) <= 2
and levenshtein('DOE', last_name) <= 2
order by 1, 2
limit 50;
The above search is slow (4-5 seconds). What can I do to improve performance? Should I create indexes on the two columns, or something else?
After I added the indexes below:
create index first_name_idx on person using gin (first_name gin_trgm_ops);
create index last_name_idx on person using gin(last_name gin_trgm_ops);
The query now takes ~11 secs. :(
New query:
select similarity('JOHN', first_name) as fnsimilarity,
similarity('DOW', last_name) as lnsimilarity,
first_name as "firstName",
last_name as "lastName",
npi
from person
where first_name is not null
and last_name is not null
and similarity('JOHN', first_name) >= 0.2
and similarity('DOW', last_name) >= 0.2
order by 1 desc, 2 desc, npi
limit 50;

There is no built-in index type that supports levenshtein distances. I'm not aware of any 3rd party index implementations to do so either.
Another string similarity measure, trigram similarity, does have an index method to support it. Maybe you can switch to using that measure instead.
You need to write the query using the % operator, not the similarity function. So it would look something like this:
set pg_trgm.similarity_threshold TO 0.2;
select similarity('JOHN', first_name) as fnsimilarity,
similarity('DOW', last_name) as lnsimilarity,
first_name as "firstName",
last_name as "lastName",
npi
from person
where first_name is not null
and last_name is not null
and 'JOHN' % first_name
and 'DOW' % last_name
order by 1 desc, 2 desc, npi
limit 50;
But note that 0.2 is a very low cutoff, and the lower the cutoff, the less efficient the index.
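If you still need actual Levenshtein distances in the result, one option is to let the indexable % operators do the pre-filtering and then rank the few surviving rows with levenshtein(). This is only a sketch: it assumes the trigram indexes above are in place, that the fuzzystrmatch extension (which provides levenshtein()) is already installed, and the threshold value is just a starting point to tune.
set pg_trgm.similarity_threshold TO 0.4;  -- assumption: tune for your data
select levenshtein('JOHN', first_name) as fn_distance,
levenshtein('DOE', last_name) as ln_distance,
first_name as "firstName",
last_name as "lastName",
npi
from person
where 'JOHN' % first_name  -- can use first_name_idx
and 'DOE' % last_name      -- can use last_name_idx
order by 1, 2, npi
limit 50;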

Related

Filter taking too much time in PostgreSQL on gender field

I have one table with 100M plus rows which looks like this
Create table member (
id bigint,
gender text,
-- ...other fields
primary key (id)
);
The gender field has two possible values: 'M' or 'F'.
Whenever I use the gender field, the query takes too much time. I have indexes on other fields like id, member details, and mobile number.
select
count(1) filter (where mod.is_active and m.gender = 'M') as male,
count(1) filter (where mod.is_active and m.gender = 'F') as female
from member_other_details mod
inner join member m on m.id = mod.member_id
This query is taking hours to complete.
How can I optimize it?
Personally, I would execute this query:
select m.gender,count(*)
from member_other_details mod inner join member m on m.id = mod.member_id
where mod.is_active
group by m.gender
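If that is still slow and most of the time goes into scanning member_other_details, a partial index might help; this is only a sketch, assuming is_active is a boolean column as used in your query, and the index name is illustrative:
-- Assumption: is_active is boolean; the partial index contains only active
-- members, so the join input can be read from a much smaller structure.
create index member_other_details_active_idx
on member_other_details (member_id)
where is_active;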

How to fill the null values using last fill for the same id in PSQL?

I have a table in PostgreSQL as follows, and I want the latest record for each id. If the latest record for an id contains a NULL value in any column, I want to replace it with the next-latest non-NULL value in the same column.
data
id  ingdt       code  gender  address
1   27-10-2018  NULL  NULL    street1
1   24-10-2018  1234  NULL    street2
1   20-08-2017  3245  M       street2
2   24-09-2018  NULL  F       Astreet
2   24-10-2018  2857  F       Bstreet
3   24-08-2018  3489  M       NULL
3   22-08-2018  5802  M       Cstreet
Expected Output
final_output
id  ingdt       code  gender  address
1   27-10-2018  1234  M       street1
2   24-10-2018  2857  F       Bstreet
3   24-08-2018  3489  M       Cstreet
Tried
insert into final_output select * from (
(select code, id from data where code != null order by ingdt limit 1) x join
(select gender, id from data where gender != null order by ingdt limit 1) y join
(select address, id from data where address != null order by ingdt limit 1)z on y.id=x.id)
demo:db<>fiddle
Using window functions can help you:
SELECT DISTINCT
id,
max(ingdt) OVER (PARTITION BY id),
first_value(code) OVER (PARTITION BY id ORDER BY code IS NULL, ingdt DESC) AS code,
first_value(gender) OVER (PARTITION BY id ORDER BY gender IS NULL, ingdt DESC) AS gender,
first_value(address) OVER (PARTITION BY id ORDER BY address IS NULL, ingdt DESC) AS address
FROM mytable
ORDER BY id
Explaining first_value(...) OVER (...):
A window function can group your rows into separate frames. This is done by the keyword PARTITION BY. In this case I am generating one frame per id.
Then I check whether or not the value of a column is NULL, which yields true or false. I sort this result like any boolean column, with false first (meaning NOT NULL first). If there are several NOT NULL rows, the latest one is taken (ingdt DESC). This ordering is done separately for each frame.
first_value() calculates the first value of the ordered frame.
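For illustration, this is what that sort key looks like for the id = 1 rows of the sample data above (output sketched from those values):
-- Booleans sort false before true, so rows with a non-NULL code come first;
-- within that group, the latest ingdt wins.
SELECT code, code IS NULL AS code_missing, ingdt
FROM mytable
WHERE id = 1
ORDER BY code IS NULL, ingdt DESC;
-- code | code_missing | ingdt
-- 1234 | f            | 24-10-2018
-- 3245 | f            | 20-08-2017
-- NULL | t            | 27-10-2018
first_value(code) over that ordering therefore returns 1234, which is exactly the value in the expected output.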

Using ANY inside HAVING clause in Postgres?

Let's say I've the following schema :
CREATE TABLE author(
id SERIAL PRIMARY KEY,
name TEXT NOT NULL
);
CREATE TABLE article(
id SERIAL PRIMARY KEY,
rating NUMERIC NOT NULL,
author_id INTEGER NOT NULL REFERENCES author
);
CREATE INDEX ON article(author_id);
I would like to fetch all authors and their top 5 articles, provided the author has at least one article with rating > 4.
It was tempting to write this:
SELECT au.id AS author,
json_agg(ar.*) AS articles
FROM
author au
JOIN LATERAL
(SELECT *
FROM article
WHERE author_id = au.id
ORDER BY rating DESC LIMIT 5) ar ON (TRUE)
GROUP BY au.id
HAVING any(ar.rating) > 4;
While any(ar.rating) > 4 looks like a filter expression on each group, any(ar.rating) is not an aggregated value. So, it seems reasonable for Postgres to reject this query. Is it possible to write the query with HAVING?
As an alternative, I wrote this query to fetch the results:
SELECT au.id AS author,
json_agg(ar.*) AS articles
FROM
(SELECT au.*
FROM author au
WHERE EXISTS
(SELECT 1
FROM article
WHERE rating > 4 AND author_id = au.id)) au
JOIN LATERAL
(SELECT *
FROM article
WHERE author_id = au.id
ORDER BY rating DESC LIMIT 5) ar ON (TRUE)
GROUP BY au.id;
This however doesn't combine both the grouping and checking for the existence of an article with rating > 4 in a single step. Is there a better way to write this query?
If you insist on using ANY, you have to use array_agg to aggregate that column into an array:
HAVING
4 < ANY(array_agg(ar.rating))
But if any rating is higher than 4, that also means the maximum is higher than 4, so a more readable form is:
HAVING
4 < max(ar.rating)
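Putting it together with your original lateral join, the whole statement would look something like this (a sketch of the same query with the suggested HAVING clause):
SELECT au.id AS author,
json_agg(ar.*) AS articles
FROM author au
JOIN LATERAL
(SELECT *
FROM article
WHERE author_id = au.id
ORDER BY rating DESC LIMIT 5) ar ON (TRUE)
GROUP BY au.id
HAVING 4 < max(ar.rating);
Because ar already holds each author's top 5 ratings, max(ar.rating) is the author's overall maximum, so this HAVING clause is equivalent to the EXISTS check in your alternative query.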

PostgreSQL: select SUM of values from arrays

Condition:
There are two tables with arrays.
Note that food.id and price.food_id are declared as arrays.
CREATE TABLE food (
id integer[] NOT NULL,
name character varying(255)
);
INSERT INTO food VALUES ('{1}', 'Apple');
INSERT INTO food VALUES ('{1,1}', 'Orange');
INSERT INTO food VALUES ('{1,2}', 'banana');
and
CREATE TABLE price (
id bigint NOT NULL,
food_id integer[],
value double precision DEFAULT 0
);
INSERT INTO price VALUES (44, '{1}', 500);
INSERT INTO price VALUES (55, '{1,1}', 100);
INSERT INTO price VALUES (66, '{1,2}', 200);
I need to get the sum of the price values for all the products in the food table.
Please help me write a SQL query.
ANSWER:
{1} - Apple - 800 (500+100+200)
What about this:
select
name,
sum(value)
from
(select unnest(id) as food_id, name from food) food_cte
join (select distinct id, unnest(food_id) as food_id, value from price) price_cte using (food_id)
group by
name
It is difficult to understand your question, but this query at least returns 800 for Apple.
Try the following command:
SELECT F.ID, F.NAME, SUM(P.VALUE) FROM FOOD F, PRICE P WHERE F.ID = P.FOOD_ID GROUP BY F.ID, F.NAME;

Indexing to reduce cost of SORT

I have this table:
TopScores
Username char(255)
Score int
DateAdded datetime2
which will have a lot of rows.
I run the following query (code from a stored procedure) against it to get the top 5 high scorers, plus the score for a particular Username together with the person directly above them in position and the person directly below:
WITH Rankings
AS (SELECT Row_Number() OVER (ORDER BY Score DESC, DateAdded DESC) AS Pos,
--if score same, latest date higher
Username,
Score
FROM TopScores)
SELECT TOP 5 Pos,
Username,
Score
FROM Rankings
UNION ALL
SELECT Pos,
Username,
Score
FROM Rankings
WHERE Pos BETWEEN (SELECT Pos FROM Rankings WHERE Username = @User) - 1
AND (SELECT Pos FROM Rankings WHERE Username = @User) + 1
I had to index the table so I added clustered: ci_TopScores(Username) first and nonclustered: nci_TopScores(Dateadded, Score).
The query plan showed that the clustered index was completely ignored (before I created the nonclustered index, I tested and the clustered one was used by the query), and logical reads were higher than a plain table scan without any index.
Sort was the highest-costing operator, so I adjusted the indexes to clustered: ci_TopScores(Score desc, DateAdded desc) and nonclustered: nci_TopScores(Username).
Sort still costs the same, and the nonclustered nci_TopScores(Username) is completely ignored again.
How can I avoid the high cost of sort and index this table effectively?
The CTE does not use Username, so it is no surprise it does not use that index.
A CTE is just syntax; you are evaluating that CTE four times.
Try a #temp table so it is only evaluated once.
But you need to think about the indexes.
I would skip the Row_Number() and just put an identity primary key on the #temp to serve as Pos.
I would skip any other indexes on #temp.
For TopScores, an index on Score desc, DateAdded desc, Username asc will help (a sketch follows below),
but it won't help if it is fragmented, and that is an index that will fragment as you insert.
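A minimal sketch of what that could look like (names are illustrative; the column types simply mirror TopScores):
-- Hypothetical #temp: the identity primary key doubles as Pos.
create table #temp
(
pos int identity(1,1) primary key,
Username char(255),
Score int,
DateAdded datetime2
);
-- Index on the base table to support the ordered insert below.
create index ix_TopScores_Score_DateAdded_Username
on TopScores (Score desc, DateAdded desc, Username asc);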
insert into #temp (Score, DateAdded, Username)
select Score, DateAdded, Username
from TopScores
order by Score desc, DateAdded desc, Username asc
select * from (select top 5 * from #temp order by pos) as top5
union
select three.*
from #temp
join #temp as three
on #temp.Username = @User
and abs(three.pos - #temp.pos) <= 1;
So what if there is a table scan on #temp.Username?
One scan does not take as long as creating an index,
and that index would be severely fragmented anyway.