PostgreSQL DISTINCT and other fields gives missing FROM-clause - postgresql

I want to run a DISTINCT query on a PostgreSQL 9.4 table with about 300,000 records. It takes almost 8 seconds. I read in this post that the form below could speed it up, and it really did: down to 0.26 sec.
SELECT COUNT(*) FROM (SELECT DISTINCT column_name FROM table_name) AS temp;
Is much faster than
COUNT(DISTINCT(column_name))
When I write it this way I get the result, but I want to add a WHERE clause. The following works, but takes over 7 seconds:
SELECT COUNT(DISTINCT(species)) FROM darwincore2
WHERE darwincore2.dataeier ILIKE '%nnog%'
This works (0.26 sec.) but fails when I add the WHERE clause:
SELECT COUNT(*) FROM (SELECT DISTINCT species FROM darwincore2) as temp
WHERE darwincore2.dataeier ILIKE '%nnog%'
with:
ERROR: missing FROM-clause entry for table "darwincore2"
Anyone know how I can fix this? Or am I trying to do something that does not work?

The WHERE clause belongs in the subquery:
SELECT COUNT(*)
FROM (
    SELECT DISTINCT species
    FROM darwincore2
    WHERE darwincore2.dataeier ILIKE '%nnog%'
) AS temp
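An equivalent rewrite (same table and column names as above) replaces DISTINCT with GROUP BY. The planner often handles the two the same way, but it can be worth comparing the plans:

```sql
-- Sketch: filter first, then count one output row per species.
SELECT COUNT(*)
FROM (
    SELECT species
    FROM darwincore2
    WHERE dataeier ILIKE '%nnog%'
    GROUP BY species
) AS temp;
```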


In PostgreSQL, how to write a query returning multiple rows and columns

My original query was like this:
SELECT *,
(SELECT COUNT(*), user_id FROM user_sessions WHERE company_id=companies.id GROUP BY user_id) as user_sessions
FROM companies
Which comes back with this error:
error: more than one row returned by a subquery used as an expression
I found a way past that error with this:
SELECT *,
ARRAY (SELECT COUNT(*), user_id FROM user_sessions WHERE company_id=companies.id GROUP BY user_id) as user_sessions
FROM companies
But then it has this error:
error: subquery must return only one column
If I remove either COUNT(*) or user_id from the returned columns it works, however I need both sets of data. How do I return more than one column in a sub-query like this?
I guess a join should do the trick:
select *
from companies
join (
    select count(*), company_id, user_id
    from user_sessions
    group by company_id, user_id
) as user_sessions
on companies.id = user_sessions.company_id
For anyone who runs into this in the future: I tested this, and the best way to do it is what @Matt mentioned in the comments above:
Replace COUNT(*), user_id with ARRAY[COUNT(*), user_id]
Works perfectly.
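Spelled out, the fix from the comment applied to the ARRAY version of the query would look like this (same table names as in the question, and assuming a Postgres version that accepts an array of arrays built this way); each element of user_sessions is then a two-element array holding the count and the user id:

```sql
SELECT *,
    ARRAY (
        SELECT ARRAY[COUNT(*), user_id]
        FROM user_sessions
        WHERE company_id = companies.id
        GROUP BY user_id
    ) AS user_sessions
FROM companies;
```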

Most efficient way to remove duplicates - Postgres

I have always deleted duplicates with this kind of query:
delete from test a
using test b
where a.ctid < b.ctid
and a.col1=b.col1
and a.col2=b.col2
and a.col3=b.col3
Also, I have seen this query being used:
DELETE FROM test WHERE test.ctid NOT IN
(SELECT ctid FROM (
    SELECT DISTINCT ON (col1, col2) *
    FROM test) sub);
And even this one (repeated until you run out of duplicates):
delete from test ju where ju.ctid in
(select ctid from (
    select distinct on (col1, col2) * from test ou
    where (select count(*) from test inr
           where inr.col1 = ou.col1 and inr.col2 = ou.col2) > 1
) sub);
Now I have run into a table with 5 million rows, which has indexes on the columns that have to match in the WHERE clause. And now I wonder:
Which of all these methods, which apparently do the same thing, is the most efficient, and why?
I just ran the second one and it took over 45 minutes to remove the duplicates. I'm just curious which would be the most efficient one, in case I have to remove duplicates from another huge table. Whether the table has a primary key in the first place doesn't matter; you can always create one.
demo:db<>fiddle
Finding duplicates can be easily achieved by using row_number() window function:
SELECT ctid
FROM(
SELECT
*,
ctid,
row_number() OVER (PARTITION BY col1, col2, col3 ORDER BY ctid)
FROM test
)s
WHERE row_number >= 2
This groups tied rows and adds a row counter, so every row with row_number >= 2 is a duplicate that can be deleted:
DELETE
FROM test
WHERE ctid IN
(
SELECT ctid
FROM(
SELECT
*,
ctid,
row_number() OVER (PARTITION BY col1, col2, col3 ORDER BY ctid)
FROM test
)s
WHERE row_number >= 2
)
I don't know if this solution is faster than your attempts, but you could give it a try.
Furthermore - as @a_horse_with_no_name already stated - I would recommend using your own unique identifier instead of ctid, for performance reasons.
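A sketch of the same idea with a surrogate key instead of ctid (assuming you can add one; the serial column named id here is hypothetical):

```sql
-- One-time setup: give the table a proper unique identifier.
ALTER TABLE test ADD COLUMN id serial PRIMARY KEY;

-- Then delete every row but the first of each duplicate group.
DELETE FROM test
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               row_number() OVER (PARTITION BY col1, col2, col3
                                  ORDER BY id) AS rn
        FROM test
    ) s
    WHERE rn >= 2
);
```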
Edit:
For my test data, your first version seems to be a little faster than my solution. Your second version seems slower, and your third version does not work for me (after fixing the syntax errors it returns no result).
demo:db<>fiddle

How can I select the newest 40 records into a temp table?

I would like to get the latest 40 records into a temp table, something like this:
SELECT * INTO #MY_TEMP
FROM
(
SELECT TOP 40 *
FROM SOME_TABLE
ORDER BY RECORD_DATE DESC
)
However I am getting an error:
An ORDER BY clause is not allowed in a derived table.
I saw a few workarounds mentioned in other postings involving TOP PERCENT, but my select already uses a TOP and it is not working.
How can I get these records into my temp table?
You cannot use ORDER BY in a derived table like that. Instead, remove the subselect entirely; your SQL should look like this:
SELECT TOP 40 * INTO #MY_TEMP
FROM SOME_TABLE
ORDER BY RECORD_DATE DESC
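As a side note, since the rest of this page is about PostgreSQL: there the same idea is written with LIMIT and CREATE TEMP TABLE ... AS instead of TOP and SELECT ... INTO #temp (table and column names taken from the question):

```sql
CREATE TEMP TABLE my_temp AS
SELECT *
FROM some_table
ORDER BY record_date DESC
LIMIT 40;
```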

Postgres poor performance "in-clause"

I have this query:
with serie as (
select to_char(kj, 'yyyymmdd')::numeric
from generate_series('2016-02-06 01:56:00','2016-02-06 23:57:00', '1 day'::interval) kj
)
select col1,col2,col3
from foreign_table
where col3 in (select * from serie) -- from CTE serie here is only one number 20160216
And its performance is poor, the foreign table has an index on col3.
But if I write the values from the CTE serie manually, it performs fast:
select col1,col2,col3
from foreign_table
where col3 in (20160216,20160217)
I put one more value there just to show that it is fast with more than one value as well.
And if I use "=" in the first query instead of "in", it also performs fast:
with serie as (
select to_char(kj, 'yyyymmdd')::numeric
from generate_series('2016-02-06 01:56:00','2016-02-06 23:57:00', '1 day'::interval) kj
)
select col1,col2,col3
from foreign_table
where col3 = (select * from serie) -- I can write "=" in this case because I have just one number returned from CTE
(I am using Postgres 9.5.1.)
Why does Postgres perform so poorly with an in-clause with a CTE, compared to writing the values manually or using "="? Obviously I cannot write the values manually every time, since I need this query to be universal, and I cannot use "=" either, for the same reason.
So, any ideas here?
btw: This is not the only case where an in-clause performed poorly compared to the other two methods I showed here.
These are the query plans; I have other queries that are not affected by the foreign table, and once I find them I will put them here as well:
http://i.imgur.com/zeiXwwW.png
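One rewrite worth trying (a sketch, not a guaranteed fix): collapse the CTE into an array with ARRAY(...) and compare with = ANY. Like the "=" variant that is already fast, the subquery then becomes a single parameter, which gives the planner - and, for a foreign table, postgres_fdw - a condition it may be able to ship to the remote side:

```sql
select col1, col2, col3
from foreign_table
where col3 = any (
    array(
        select to_char(kj, 'yyyymmdd')::numeric
        from generate_series('2016-02-06 01:56:00',
                             '2016-02-06 23:57:00',
                             '1 day'::interval) kj
    )
);
```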

Best alternative for FOUND_ROWS() in PostgreSQL

I was searching the net for an alternative to MySQL's FOUND_ROWS(): how to get the total number of matching rows when using LIMIT and a WHERE clause. I need this for pagination.
I found a few different approaches, but I don't have much experience (I migrated to PostgreSQL a week ago). Which approach will give the best performance?
SELECT stuff, count(*) OVER() AS total_count
FROM table
WHERE condition
ORDER BY stuff OFFSET 40 LIMIT 20
BEGIN;
SELECT * FROM mytable OFFSET X LIMIT Y;
SELECT COUNT(*) AS total FROM mytable;
END;
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT id, username, title, date FROM posts ORDER BY date DESC LIMIT 20;
SELECT count(*) AS total FROM posts;
END;
select * into tmp_tbl from tbl where [limitations];
select * from tmp_tbl offset 10 limit 10;
select count(*) from tmp_tbl;
drop table tmp_tbl;
If there is another approach not described here that would give better performance, please let me know.
I am using PostgreSQL version 9.3.4 and PDO for PHP.
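For comparison, the first approach is usually the most convenient for pagination, because the total arrives with the page itself in a single scan; a sketch against the posts table from above (the WHERE condition here is hypothetical, and the window count reports the number of rows matching it, not just the 20 returned):

```sql
SELECT id, username, title, date,
       count(*) OVER () AS total_count   -- total matches, same in every row
FROM posts
WHERE date > now() - interval '30 days'  -- hypothetical filter
ORDER BY date DESC
OFFSET 40 LIMIT 20;
```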