PostgreSQL slow in custom function and PHP, but fast when entered directly in psql, using text search with GIN index

I have 3 tables: Person, Names, and Notes. Each person has multiple names and optional notes. I have full text search on some columns of Names and Notes (see below). The search works perfectly from a custom function, from PHP, and from psql as long as the word I search for is in the db. The problem is that when the word I search for is not present in the db, the query gets very slow in PHP and in the custom function but is still fast in psql: less than 1s in psql, more than 10s in the others.
Tables:
Person | id, birthday
Name | person_id, name, fs_name
Notes | person_id, note, fs_note
Besides the PK and FK indexes, there are GIN indexes on fs_name and fs_note.
Function/Query
create or replace function queryNameFunc (TEXT)
returns TABLE(id int, name TEXT) as $$
select id, name
from person_name pnr
inner join person pr on (pnr.person_id=pr.id)
left join personal_notes psr on (psr.person_id = pr.id)
where pr.id in
(select distinct(id)
from person_name pn
inner join person p on (p.id = pn.person_id)
left join personal_notes ps on (ps.person_id = p.id)
where tname @@ to_tsquery($1)
limit 20);
$$ language SQL;
The where condition is trimmed down here. For example, if I pass 'john & james' as $1 and the data is in the db, the result is fast, but if 'john' and 'james' are not in the db, it is slow. It got slower now that I have 1M records in person and 3M+ in names (all dummy records). Any idea how to fix this? I tried restarting the server and restarting PostgreSQL.

The database has to prepare the inner query before it has any knowledge of the parameter value. This might result in a bad query plan. To avoid this problem in a function, use the plpgsql language and use EXECUTE inside the function:
CREATE OR REPLACE FUNCTION queryNameFunc (TEXT) RETURNS TABLE(id INT, name TEXT) AS $$
BEGIN
RETURN QUERY EXECUTE '
SELECT
id,
name
FROM
person_name pnr
INNER JOIN person pr ON (pnr.person_id=pr.id)
LEFT JOIN personal_notes psr ON (psr.person_id = pr.id)
WHERE
pr.id IN(
SELECT
DISTINCT(id)
FROM
person_name pn
INNER JOIN person p ON (p.id = pn.person_id)
LEFT JOIN personal_notes ps ON (ps.person_id = p.id)
WHERE tname @@ to_tsquery($1)
LIMIT 20)' USING $1;
END;
$$ LANGUAGE plpgsql;
This works in version 8.4 and later. On versions before 9.0 you also have to install plpgsql first:
CREATE LANGUAGE plpgsql;
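To check whether the parameterised plan really differs from the one psql gets for a literal, the two can be compared with EXPLAIN ANALYZE. The following is only a sketch: person_name_demo and name_search are hypothetical stand-ins for the tables and query in the question, and the 'simple' text search configuration is an assumption. On newer versions the server may still build a custom plan for the first executions of a prepared statement, so the difference is not guaranteed to show up.

```sql
-- Hypothetical stand-in for the question's name table (not the real schema).
CREATE TEMP TABLE person_name_demo (person_id int, tname tsvector);
INSERT INTO person_name_demo
SELECT g, to_tsvector('simple', 'name' || g)
FROM generate_series(1, 1000) AS g;
CREATE INDEX person_name_demo_tname_idx ON person_name_demo USING gin (tname);
ANALYZE person_name_demo;

-- Parameterised query, planned without knowing the search term in advance.
-- On PostgreSQL 12 or later, a generic plan can be forced first with:
--   SET plan_cache_mode = force_generic_plan;
PREPARE name_search(text) AS
SELECT person_id
FROM person_name_demo
WHERE tname @@ to_tsquery('simple', $1)
LIMIT 20;

EXPLAIN ANALYZE EXECUTE name_search('john & james');

-- Same query with a literal, as it would be typed into psql.
EXPLAIN ANALYZE
SELECT person_id
FROM person_name_demo
WHERE tname @@ to_tsquery('simple', 'john & james')
LIMIT 20;

DEALLOCATE name_search;
```

If the first plan shows a sequential scan and the second a bitmap scan on the GIN index, the parameterised plan is the likely cause of the slowdown.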

Related

dynamically select the most occurring value from a column which occurs in multiple tables

The Customer, Musician, and Staff tables in my database include a column called FirstName. The query below returns the most frequently occurring FirstName across those three tables, and returns multiple FirstNames if more than one occurs the same number of times.
WITH AllFirstNames AS (
SELECT FirstName
FROM Customer
UNION ALL
SELECT FirstName
FROM Musician
UNION ALL
SELECT FirstName
FROM Staff
), FirstNameOccurrences AS (
SELECT FirstName,
COUNT(*) AS Occurrences
FROM AllFirstNames
GROUP BY FirstName
)
SELECT FirstName AS MostOccurringFirstNames
FROM AllFirstNames
WHERE FirstName IN (
SELECT FirstName
FROM FirstNameOccurrences
WHERE Occurrences IN (
SELECT MAX(Occurrences)
FROM FirstNameOccurrences
)
)
GROUP BY MostOccurringFirstNames;
This only works if the tables which include the FirstName column are listed in the query that builds the temporary AllFirstNames table. If a new table with a FirstName column is added to the database, the query has to be updated manually. What do I need to change in the query that builds AllFirstNames so that it dynamically UNION ALLs the FirstName columns from all tables that have one? I understand that this will only work if the same naming convention is used throughout the database's lifetime.
The query below lists all the tables that include a FirstName column, but I don't know where to go from there.
SELECT table_name
FROM information_schema.columns
WHERE column_name = 'FirstName';
This does sound like a strange database design, but you can do that by creating a function that iterates over all tables.
The following function counts the distinct values per table.
create or replace function count_names()
returns table(tablename text, firstname text, occurrences bigint)
as
$$
declare
l_row record;
begin
for l_row in select distinct table_schema, table_name, column_name
from information_schema.columns
where table_schema = 'public'
and column_name = 'firstname'
loop
return query execute
format('select %L as tablename, cast(%I as text), count(*) occurrences from %I.%I group by %I',
l_row.table_name, l_row.column_name, l_row.table_schema, l_row.table_name, l_row.column_name);
end loop;
end;
$$
language plpgsql;
The above runs a count()/group by for every table that has a column named firstname in the schema public. The result can then be summed. I included the source table name in the result for debugging purposes, but it's not really needed.
With that function you can do something like this:
select firstname, sum(occurrences) num_names
from count_names()
group by firstname
order by num_names desc
limit 10;
Dynamic SQL is best created using the format() function to properly deal with identifiers. The column and table names you used in your question suggest you created them using the dreaded double quotes ("FirstName" is something different than FirstName). You should really rethink that and avoid double-quoted identifiers in SQL.
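The question asked for the most frequent name including ties, which a plain limit does not give. A possible sketch, assuming the count_names() function above has been created and at least one table in schema public has a firstname column (the output column name most_occurring_first_name is made up for illustration):

```sql
-- Sum the per-table counts, then keep every name that reaches the maximum,
-- so ties are all returned.
WITH totals AS (
    SELECT firstname, sum(occurrences) AS num_names
    FROM count_names()
    GROUP BY firstname
)
SELECT firstname AS most_occurring_first_name, num_names
FROM totals
WHERE num_names = (SELECT max(num_names) FROM totals);
```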

conditional join with input value in PostgreSQL function

I have three tables:
create table id_table (
id integer
);
insert into id_table values (1),(2),(3);
create table alphabet_table (
id integer,
letter text
);
insert into alphabet_table values (1,'a'),(2,'b'),(3,'c');
create table greek_table (
id integer,
letter text
);
insert into greek_table values (1,'alpha'),(2,'beta');
I would like to create a function that joins id_table with either alphabet_table or greek_table on id. The choice of table depends on an input value passed to the function. I wrote:
CREATE OR REPLACE FUNCTION choose_letters(letter_type text)
RETURNS table (id integer,letter text) AS $$
BEGIN
RETURN QUERY
select t1.id,
case when letter_type = 'alphabet' then t2.letter else t3.letter end as letter
from id_table t1,
alphabet_table t2 ,
greek_table t3
where t1.id = t2.id and t1.id = t3.id;
END;
$$LANGUAGE plpgsql;
I ran select choose_letters('alphabet'). The problem with this code is that when id_table joins with alphabet_table, it does not pick up id 3. It seems that inner joins are done for both alphabet_table and greek_table (so it only picks up the common ids, 1 and 2). To avoid this problem, I wrote:
CREATE OR REPLACE FUNCTION choose_letters(letter_type text)
RETURNS table (id integer, letter text) AS $$
BEGIN
RETURN QUERY
select t1.id,
case when letter_type = 'alphabet' then t2.letter else t3.letter end as letter
from id_table t1
left join alphabet_table t2 on t1.id=t2.id
left join greek_table t3 on t1.id=t3.id
where t2.letter is not null or t3.letter is not null;
END;
$$LANGUAGE plpgsql;
Now it picks up all 3 ids when id_table and alphabet_table join. However, when I ran select choose_letters('greek'), id 3 appears with a null value in the letter column, despite the fact that I specified t3.letter is not null in the where clause.
What I'm looking for is that when I run select choose_letters('alphabet'), the output is (1,'a'), (2,'b'), (3,'c'), and select choose_letters('greek') produces (1,'alpha'), (2,'beta'). No missing values and no nulls. How can I accomplish this?
Learn to use proper, explicit JOIN syntax. Simple rule: Never use commas in the FROM clause.
You can do what you want with LEFT JOIN and some other logic:
select i.id,
coalesce(a.letter, g.letter) as letter
from id_table i left join
alphabet_table a
on i.id = a.id and letter_type = 'alphabet' left join
greek_table g
on i.id = g.id and letter_type <> 'alphabet'
where a.id is not null or g.id is not null;
The condition using letter_type needs to be in the on clauses. Otherwise, alphabet_table will always have a match.
Gordon Linoff's answer above is certainly correct, but here is an alternative way to write the code.
It may or may not be better from a performance perspective, but it is logically equivalent. If performance is a concern, you'd need to run EXPLAIN ANALYZE on the query and inspect the plan, and do other profiling.
One good part is that the inner joins make the join clause and where clause simpler and easier to reason about. It's also more straightforward for the execution engine to parallelize the query.
On the downside it looks like code duplication, however, the DRY principle is often misapplied to SQL. Repeating code is less important than repeating data reads. What you're aiming to do is not scan the same data multiple times.
If there is no index on the fields you are joining or the letter_type then this could end up doing a full table scan twice, and be worse. If you do have the indexes then it can do it with index range scans nicely.
SELECT
i.id,
a.letter
FROM id_table i
INNER JOIN alphabet_table a
ON i.id = a.id
WHERE letter_type = 'alphabet'
UNION ALL
SELECT
i.id,
g.letter
FROM id_table i
INNER JOIN greek_table g
ON i.id = g.id
WHERE letter_type <> 'alphabet'
The first problem is that your tables are not structured properly. You should have created a single table like char_table (id int, letter text, type text), where type specifies whether the letter is alphabet or Greek.
Another solution is to write two SQL queries, one in the IF branch and the other in the ELSE branch.
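A minimal sketch of that IF/ELSE variant, assuming the id_table, alphabet_table and greek_table definitions from the question. Each branch runs only the join it needs, so no null filtering is required:

```sql
CREATE OR REPLACE FUNCTION choose_letters(letter_type text)
RETURNS TABLE (id integer, letter text) AS $$
BEGIN
    IF letter_type = 'alphabet' THEN
        -- Only the alphabet table is joined in this branch.
        RETURN QUERY
        SELECT i.id, a.letter
        FROM id_table i
        JOIN alphabet_table a ON a.id = i.id;
    ELSE
        -- Any other value falls back to the Greek table.
        RETURN QUERY
        SELECT i.id, g.letter
        FROM id_table i
        JOIN greek_table g ON g.id = i.id;
    END IF;
END;
$$ LANGUAGE plpgsql;
```

With the sample data, select * from choose_letters('greek') should return only ids 1 and 2, because the inner join drops id 3.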

SELECT clause in FOR LOOP control using plpgsql

I am trying to write a script that outputs all the foos that are used by only one user. If a foo is used by more than one user, it shouldn't be output.
Here are my tables:
foos (id, value)
users (id, name)
used (foo_id, user_id)
And my non-working script:
FUNCTION output_unshared_foos ()
RETURNS foos AS
$a$
DECLARE
foocounts RECORD;
BEGIN
SELECT u.foo_id, count(*)
INTO foocounts -- store in the local variable
FROM used u
GROUP BY u.foo_id;
FOR f IN SELECT * FROM foos
LOOP
IF (SELECT fc.count < 2 FROM foocounts fc WHERE fc.foo_id = f.id) THEN
RETURN NEXT f;
END IF;
END LOOP;
END
$a$ language plpgsql;
It doesn't seem to work: every row is returned and the condition seems to be always true.
Your first problem is that you can't store the result of a query that returns more than one row in a single variable (the SELECT u.foo_id, count(*) INTO ... part). PL/pgSQL does not raise a runtime error here: without STRICT, it silently keeps only the first row and discards the rest.
Your function also doesn't compile, because the record f is not declared and a function defined as RETURNS foos can't use RETURN NEXT; that requires RETURNS SETOF foos.
But your approach is wrong (even if it worked). Doing row-by-row processing is almost always the wrong choice in SQL. SQL and relational databases are meant to handle sets, not single rows.
Your problem can be solved with a single query:
select foo_id
from used
group by foo_id
having count(distinct user_id) = 1
This returns all foo ids that are used by exactly one user.
If you need the additional information from the foos table, you can join the above query to the foos table:
select f.*
from foos f
join (
select foo_id
from used
group by foo_id
having count(distinct user_id) = 1
) u on f.id = u.foo_id
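If a function is still wanted, the set-based query can be wrapped in a plain SQL function instead of a loop. This is a sketch that assumes the foos and used tables from the question; the STABLE marker is an assumption based on the function only reading data:

```sql
CREATE OR REPLACE FUNCTION output_unshared_foos()
RETURNS SETOF foos AS $$
    -- Return every foo that is referenced by exactly one distinct user.
    SELECT f.*
    FROM foos f
    JOIN (
        SELECT foo_id
        FROM used
        GROUP BY foo_id
        HAVING count(DISTINCT user_id) = 1
    ) u ON f.id = u.foo_id;
$$ LANGUAGE sql STABLE;
```

It can then be called with select * from output_unshared_foos();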

Get columns that differ between 2 rows

I have a table company with 60 columns. The goal is to create a tool to find, compare and eliminate duplicates in this table.
Example: I find 2 companies that potentially are the same, but I need to know which values (columns) differ between these 2 rows in order to continue.
I think it is possible to compare column by column, 60 times, but I am looking for a simpler and more generic solution.
Something like:
SELECT * FROM company where co_id=22
SHOW DIFFERENCE
SELECT * FROM company where co_id=33
The result should be the column names that differ.
For this you may use an intermediate key/value representation of the rows, with JSON functions or alternatively with the hstore extension (now only of historical interest). JSON comes built-in with every reasonably recent version of PostgreSQL, whereas hstore must be installed in the database with CREATE EXTENSION.
Demo:
CREATE TABLE table1 (id int primary key, t1 text, t2 text, t3 text);
Let's insert two rows that differ by the primary key and one other column (t3).
INSERT INTO table1 VALUES
(1,'foo','bar','baz'),
(2,'foo','bar','biz');
Solution with json
First we get a key/value representation of the rows with the original row number, then we pair the rows based on their original row number and filter out those with the same "value" column.
WITH rowcols AS (
select rn, key, value
from (select row_number() over () as rn,
row_to_json(table1.*) as r from table1) AS s
cross join lateral json_each_text(s.r)
)
select r1.key from rowcols r1 join rowcols r2
on (r1.rn=r2.rn-1 and r1.key = r2.key)
-- IS DISTINCT FROM also reports columns where only one side is NULL
where r1.value is distinct from r2.value;
Sample result:
key
-----
id
t3
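When the two rows to compare are known by id, as in the question, the row-number pairing can be skipped. The following sketch assumes the demo table1 above and PostgreSQL 9.5 or later (for to_jsonb); the output column names are made up for illustration:

```sql
-- Expand each row into key/value pairs and keep the keys whose values differ.
SELECT a.key   AS column_name,
       a.value AS first_value,
       b.value AS second_value
FROM jsonb_each_text((SELECT to_jsonb(t) FROM table1 t WHERE t.id = 1)) AS a
JOIN jsonb_each_text((SELECT to_jsonb(t) FROM table1 t WHERE t.id = 2)) AS b
  USING (key)
WHERE a.value IS DISTINCT FROM b.value;
```

For the two demo rows this should list id and t3 along with both values.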
Solution with hstore
SELECT skeys(h1-h2) from
(select hstore(t.*) as h1 from table1 t where id=1) h1
CROSS JOIN
(select hstore(t.*) as h2 from table1 t where id=2) h2;
h1-h2 computes the difference key by key and skeys() outputs the result as a set.
Result:
skeys
-------
id
t3
The select-list might be refined with skeys((h1-h2)-'id'::text) to always remove id which, as the primary key, will obviously always differ between rows.
Here's a stored procedure that should get you most of the way...
While this should work "as is", it has no error checking, which you should add.
It gets all the columns in the table, and loops over them. A difference is when the count of the distinct items is more than one.
Also, the output is:
The count of the number of differences
Messages for each column where there is a difference
It might be more useful to return a rowset of the columns with the differences. Anyway, good luck!
Usage:
SELECT showdifference('public','company','co_id',22,33)
CREATE OR REPLACE FUNCTION showdifference(p_schema text, p_tablename text,p_idcolumn text,p_firstid integer, p_secondid integer)
RETURNS INTEGER AS
$BODY$
DECLARE
l_diffcount INTEGER;
l_column text;
l_dupcount integer;
column_cursor CURSOR FOR select column_name from information_schema.columns where table_name = p_tablename and table_schema = p_schema and column_name <> p_idcolumn;
BEGIN
-- need error checking here, to ensure the table and schema exist and the columns exist
-- Should also check that the records ids exist.
-- Should also check that the column type of the id field is integer
-- Set the number of differences to zero.
l_diffcount := 0;
-- use a cursor to iterate over the columns found in information_schema.columns
-- open the cursor
OPEN column_cursor;
LOOP
FETCH column_cursor INTO l_column;
EXIT WHEN NOT FOUND;
-- build a query to see if there is a difference between the columns. If there is raise a notice
EXECUTE 'select count(distinct ' || quote_ident(l_column) || ' ) from ' || quote_ident(p_schema) || '.' || quote_ident(p_tablename) || ' where ' || quote_ident(p_idcolumn) || ' in ('|| p_firstid || ',' || p_secondid ||')'
INTO l_dupcount;
IF l_dupcount > 1 THEN
-- increment the counter
l_diffcount := l_diffcount +1;
RAISE NOTICE '% has % differences', l_column, l_dupcount ; -- for "real" you might want to return a rowset and could do something here
END IF;
END LOOP;
-- close the cursor
CLOSE column_cursor;
RETURN l_diffcount;
END;
$BODY$
LANGUAGE plpgsql VOLATILE STRICT
COST 100;

Ordering a query on a field in a return record

I've got a query that calls a function in its select clause. The function returns a record type. In the calling query, I want to order by one of the fields in the returned record and if possible I'd also like to return the fields of the record as fields of the calling query. To make this clear, here's a simplified version of the code:
CREATE OR REPLACE FUNCTION getStatus(lastContact timestamptz, lastAlTime timestamptz, lastGps timestamptz, out status varchar, out toelichting varchar, out colorLevel integer)
RETURNS record AS
$BODY$
BEGIN
status := 'controle_status_ok';
toelichting := '';
colorLevel := 3;
END
$BODY$
LANGUAGE 'plpgsql' VOLATILE
COST 100;
ALTER FUNCTION getStatus(timestamptz, timestamptz, timestamptz, out varchar, out varchar, out integer) OWNER TO xyz;
Using this function, I want to have a query like this one:
SELECT
id,
name,
getStatus(tabel3.lastcontact, tabel4.lastchanged, tabel5.lastfound) as status
FROM
tabel1
left join tabel2 on ...
left join tabel3 on ...
left join tabel4 on ...
left join tabel5 on ...
ORDER BY
status
Postgres comes with the following error:
ERROR: could not identify an ordering operator for type record
HINT: Use an explicit ordering operator or modify the query.
The question: how should I order by the value of colorLevel that's been returned by getStatus?
Additional question: can I return the three fields of the getStatus function as fields of the query that calls the getStatus function?
Use
ORDER BY (status).colorlevel
to reference a column of your record type. Note that ORDER BY accepts an output-column alias only on its own, not inside an expression, so status has to be a real column of a subquery, as in the queries below.
As an aside: I used lower case (colorlevel instead of colorLevel) because identifiers are folded to lower case anyway unless double-quoted, and using mixed-case identifiers is generally a bad idea in PostgreSQL.
The same syntax applies to your additional question. I also use a subquery to optimize the query:
SELECT id
, name
, (x.status).status
, (x.status).toelichting
, (x.status).colorLevel
FROM tabel
, (SELECT getStatus(now(), now(), now()) as status) x
ORDER BY (x.status).colorlevel
Read about accessing composite types in the manual.
Answer after additional input
To use columns from your tables, put it all in a subquery. I am trying to avoid calling the function multiple times, because that may be expensive.
SELECT
id,
name,
(status).status,
(status).toelichting,
(status).colorLevel
FROM (
SELECT
id,
name,
getStatus(tabel3.lastcontact, tabel4.lastchanged, tabel5.lastfound) as status
FROM
tabel1
left join tabel2 on ...
left join tabel3 on ...
left join tabel4 on ...
left join tabel5 on ...
) x
ORDER BY
(status).colorlevel
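On PostgreSQL 9.3 or later, a different technique can be used in place of the subquery: calling the function in the FROM clause with LATERAL, which exposes its OUT parameters as ordinary columns. This is only a sketch, assuming the getStatus function from the question exists; the VALUES list is a hypothetical stand-in for the joined tables:

```sql
-- Each OUT parameter of getStatus becomes a column of s,
-- so it can be selected and ordered by directly.
SELECT t.id,
       t.name,
       s.status,
       s.toelichting,
       s.colorlevel
FROM (VALUES (1, 'first'), (2, 'second')) AS t(id, name)
CROSS JOIN LATERAL getStatus(now(), now(), now()) AS s
ORDER BY s.colorlevel, t.id;
```

With real tables, the arguments would be the timestamp columns of the joined tables instead of now().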