View doesn't increase performance of correlated subquery? - postgresql

I have a table User and a table Orders. I want to find the names of users who made more than 100 orders, which I can do with a query like the one below:
SELECT U.name
FROM User U
WHERE 100 < (
    SELECT COUNT(*)
    FROM Orders O
    WHERE O.uid = U.uid
)
This is slow because of the correlated subquery.
I therefore thought I could optimize it by creating a view that contains how many orders each user has made, like below:
View UserOrderCount
uid  orderCount
0    11
1    108
2    100
3    99
4    32
5    67
Then the query is much simpler:
SELECT U.name
FROM User U, UserOrderCount C
WHERE 100 < C.orderCount AND U.uid = C.uid;
But this turns out to take more time, and I can't figure out why...
Please shed some light on this, thanks in advance!
EDIT:
Here is how the view is created:
CREATE VIEW UserOrderCount
AS
select U.uid, count(*) AS orderCount
from User U, orders O
where U.uid = O.uid
group by U.uid;

Creating a view would not be expected to make this faster. The same amount of work still needs to be done behind the scenes: the view makes the text of your query simpler to look at, not simpler to execute.
Note that you are also comparing two different queries, one using a subquery and one using a join. That is orthogonal to the use of a view: you could have used the view in a subquery, or done the join without the view. Mixing the two changes together is likely to cause confusion.
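For instance, the same correlated form can be routed through the view, which shows the two choices are independent (a sketch using the names from the question; note that user is a reserved word in PostgreSQL, so the table may need quoting):
-- Same shape as the original correlated query, just reading from the view.
SELECT U.name
FROM "User" U
WHERE 100 < (
    SELECT C.orderCount
    FROM UserOrderCount C
    WHERE C.uid = U.uid
);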
With materialized views, new in PostgreSQL 9.3, the computed counts would actually be stored, so execution would be faster. But the price you pay is that you need to refresh the materialized view periodically, and you will be reading out-of-date counts in the meantime.
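A minimal sketch of that approach, reusing the names from the question (it counts straight from Orders; users with no orders simply won't appear, which is fine for a > 100 filter):
-- Store the per-user counts once, instead of recomputing them per query.
CREATE MATERIALIZED VIEW UserOrderCount AS
SELECT O.uid, count(*) AS orderCount
FROM Orders O
GROUP BY O.uid;
-- Re-run periodically; the counts are stale between refreshes.
REFRESH MATERIALIZED VIEW UserOrderCount;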

SELECT U.name, count(*)
FROM User U
JOIN Orders O ON O.uid = U.uid
GROUP BY U.name
HAVING count(*) > 100
You do not need a view for that...

Related

PostgreSQL how to GROUP BY single field from returned table

So I have a complicated query; to simplify, let it be like this:
SELECT
t.*,
SUM(a.hours) AS spent_hours
FROM (
SELECT
person.id,
person.name,
person.age,
SUM(contacts.id) AS contact_count
FROM
person
JOIN contacts ON contacts.person_id = person.id
) AS t
JOIN activities AS a ON a.person_id = t.id
GROUP BY t.id
Such a query works fine in MySQL, but Postgres needs to know that the GROUP BY field is unique, and even though it actually is, in this case I would need to GROUP BY every field returned from the t subquery.
I can do that, but I don't believe it will work efficiently on big data.
I can't JOIN with activities directly in the first query, as a person can have several contacts, which would make the query count the activity hours several times, once for every joined contact.
Is there a Postgres way to make this query work? Maybe a way to force Postgres to treat t.id as unique, or some other solution that achieves the same thing the Postgres way?
This query will not work on either database system: there is an aggregate function in the inner query, but you are not grouping by anything (unless you use window functions). There is a special case for MySQL, though: you can allow this usage by disabling "sql_mode=only_full_group_by". So MySQL accepts it because of that engine setting, but you cannot do the same in PostgreSQL.
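For completeness, a sketch of what the inner query needs in either system (assuming contact_count was indeed meant to be an aggregate):
-- Group the inner query so each person appears exactly once.
SELECT
    person.id,
    person.name,
    person.age,
    COUNT(contacts.id) AS contact_count
FROM person
JOIN contacts ON contacts.person_id = person.id
GROUP BY person.id, person.name, person.age
-- In PostgreSQL 9.1+, if person.id is the primary key, GROUP BY person.id
-- alone suffices: the other columns are functionally dependent on it.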
I knew MySQL allowed indeterminate grouping, but I honestly never knew how it implemented it... it always seemed imprecise to me, conceptually.
So depending on what that means (I'm too lazy to look it up), you might need one of two possible solutions, or maybe a third.
If your intent is to see all rows (perform the aggregate function but not consolidate/group rows), then you want a windowing function, invoked with PARTITION BY. Here is a really dumbed-down version in your query:
SELECT
    t.*,
    SUM(a.hours) OVER (PARTITION BY t.id) AS spent_hours
FROM t
JOIN activities AS a ON a.person_id = t.id
This means you get all the records of table t, not one record per t.id, but each row also carries the sum of hours across all rows sharing that value of id.
For example the sum column would look like this:
Name   Hours  Sum Hours
-----  -----  ---------
Smith     20        120
Jones     30         30
Smith    100        120
Whereas a group by would have had Smith once and could not have displayed the hours column in detail.
If you really did only want one row per t.id, then Postgres will require you to tell it how to determine which row. In the example above for Smith, do you want to see the 20 or the 100?
There is another possibility, but I think I'll let you reply first. My gut tells me option 1 is what you're after and you want the analytic function.

Filtering over database views is much slower than direct query

I noticed a big difference in query plan when doing regular query versus creating database view and then querying the view.
case 1 basic query:
SELECT <somequery> WHERE <some-filter> <some-group-by>
case 2 database view:
CREATE VIEW myview AS SELECT <some-query> <some-group-by>;
SELECT * FROM myview WHERE <some-filter>;
I have noticed that in case 2, Postgres will join/aggregate everything possible and only then apply the filter, whereas in case 1 it doesn't touch the rows filtered out by the WHERE clause. So case 2 is a lot slower.
Are there any tricks to work around this while keeping the database view?
Your view has to re-create the dataset it filters from every time you run the SELECT.
The easiest fix is to change the view to a materialized view. If your data is not changing every two minutes, a materialized view saves the computed result set, so your filter can then work against that "saved" dataset. The second thing you can do is add indexes to the materialized view.
Example Here: https://hashrocket.com/blog/posts/materialized-view-strategies-using-postgresql
create materialized view matview.account_balances as
select
    name,
    coalesce(
        sum(amount) filter (where post_time <= current_timestamp),
        0
    ) as balance
from accounts
left join transactions using(name)
group by name;
create index on matview.account_balances (name);
create index on matview.account_balances (balance);
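Remember that a materialized view is a snapshot, so schedule refreshes to match how stale your data may be (the CONCURRENTLY option, PostgreSQL 9.4+, avoids blocking readers but requires a unique index on the view):
-- A plain refresh; readers are locked out while it rebuilds.
REFRESH MATERIALIZED VIEW matview.account_balances;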
This is the simplest way to reduce the runtime of your query.
Hope this helps.

T-SQL different JOIN approaches, same results, which one would you prefer?

These are three approaches to writing the same join. I would like to hear a word on the performance of these three queries.
Thank you
SELECT *
FROM tableA A
LEFT JOIN tableB B
    INNER JOIN tableC C
        ON C.ColumnC = B.ColumnB
    ON B.ColumnB = A.ColumnB
WHERE ColumnX = 'XY'
Versus
SELECT *
FROM tableA A
LEFT JOIN tableB B
    ON B.ColumnB = A.ColumnB
INNER JOIN tableC C
    ON C.ColumnC = B.ColumnB
WHERE ColumnX = 'XY'
Versus Common Table Expression
WITH T...
It does not matter.
SQL Server has a cost-based optimizer (as opposed to a rule-based optimizer). That means that the engine is able to figure out that both of your first two options are identical. Run your estimated and actual execution plans and you will see that this is the case.
The only reason you would choose one option over the other is for readability's sake. I go with your second option, because it's a lot easier to read when there are a great many joins involved. ON clauses in reverse order become quite difficult to track.
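One quick way to compare the variants yourself, alongside the execution plans mentioned above (standard T-SQL session settings; the figures appear in the Messages tab in Management Studio):
-- Emit I/O and timing details for each statement that follows.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- Run option 1, then option 2, and compare logical reads and elapsed time.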
In my experience, any of the above could be quicker depending on your tables.
As you're setting up joins, you want to start as restrictively as possible (without negatively affecting your end result, obviously). The same logic applies to the WHERE clause, for the same reason: by starting with the most restrictive conditions, you limit the number of rows being joined, and thus the number evaluated by the WHERE clause and then returned/manipulated in the SELECT clause.
For my answers below regarding the three specific scenarios, I'm assuming a sufficiently complicated query that does more than just combine data from multiple tables (i.e., a query answering a specific question).
If Table A is huge and Tables B & C are smaller and more directly related to the data you're trying to isolate, then the first option would likely be fastest.
If Table B or C are huge and Table A is more related to your desired data, the second option would likely be fastest.
As far as option 3 goes, I love CTEs, but I try to use them only when I need to. Using a CTE will speed up your overall query if the data joined, manipulated, and returned by the CTE is related to the rest of the query in only a limited fashion. Including tables that are only partially related to your end result in your primary chain of joins will needlessly slow down your query; if you can split that data out into a CTE, it can run quickly by itself and then be folded back into the main query at the end.
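For reference, here is one way option 3 could be spelled out, since the question truncates it (a sketch only; it assumes ColumnX belongs to tableA and narrows the CTE to the join key):
-- The INNER JOIN runs inside the CTE; the outer query LEFT JOINs its result.
WITH T AS (
    SELECT B.ColumnB
    FROM tableB B
    INNER JOIN tableC C ON C.ColumnC = B.ColumnB
)
SELECT *
FROM tableA A
LEFT JOIN T ON T.ColumnB = A.ColumnB
WHERE A.ColumnX = 'XY';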

Left Join Hangs

I am trying to figure out what could be causing a LEFT JOIN to hang. I've narrowed the problem down to a specific table, but I can't for the life of me figure out what might be going on. Basically, I have two tables; let's call them table A and table B. When I left join table A to table B (it's a 1-to-1 relationship, with table B not always having a record related to table A), the query hangs. When I inner join table A to table B, it runs in about half a second, returning about 27,000 records. Why is it that a left join, which should take a bit longer but not by much, hangs? Could I have bad data in table B? The fields I'm joining are bigints. I'm stumped on this one. Any help would be much appreciated.
Here is my sql:
select
    RegMemberTrip.idmember,
    RegParent1.idMember_Parent1,
    regparent1.idParent1
from regmembertrip
left join regparent1
    on RegMemberTrip.idmember = regparent1.idMember_Parent1
where regmembertrip.IDRound = 25
RegParent1 is a view
If I change the WHERE criterion to '= 24' it works fine; IDRound = 25 is fairly new data. And like I said, if I keep this the way it is (idround = 25) but use an inner join, it works fine.
Thanks,
Ben
Have you tried the execution plan viewer in Management Studio? Are you sure your left join is not in fact producing a giant cartesian product across A and B?
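One quick check along those lines, using the names from the question (assuming the view is cheap enough to scan on its own): duplicate join keys on the view side would fan the left join out into something cartesian-like.
-- Any rows returned here multiply matches in the LEFT JOIN.
SELECT idMember_Parent1, COUNT(*) AS dupes
FROM regparent1
GROUP BY idMember_Parent1
HAVING COUNT(*) > 1;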

PostgreSQL slow COUNT() - is trigger the only solution?

I have a table with posts, which are categorized by:
type
tag
language
All of those "categories" are stored in separate tables (e.g. posts_types) and connected via junction tables (e.g. posts_types_assignment).
COUNTing in PostgreSQL is really slow (I have more than 500k records in that table), and I need to get the number of posts categorized by any combination of type/tag/lang.
If I solved it with triggers, the code would be full of multi-level loops, which doesn't look nice and is hard to maintain.
Is there any other solution for efficiently getting the current number of posts categorized under any type/tag/language combination?
Let me get this straight.
You have a table posts. You have a table posts_types. The two have a many-to-many join through posts_types_assignment. And you have some query like this that is slow:
SELECT count(*)
FROM posts p
JOIN posts_types_assignment pta1
ON p.id = pta1.post_id
JOIN posts_types pt1
ON pt1.id = pta1.post_type_id
AND pt1.type = 'language'
AND pt1.name = 'English'
JOIN posts_types_assignment pta2
ON p.id = pta2.post_id
JOIN posts_types pt2
ON pt2.id = pta2.post_type_id
AND pt2.type = 'tag'
AND pt2.name = 'awesome'
And you would like to know why it is painfully slow.
My first note is that PostgreSQL would have to do a lot less work if you had the identifiers in the posts table rather than in the join tables. But that is a moot point; the decision has been made.
My more useful note is that I believe PostgreSQL has a query optimizer similar to Oracle's. To limit the combinatorial explosion of possible query plans it has to consider, it only looks at plans that start with some table and then repeatedly join in one more data set at a time. However, no such query plan works well here. You can start with pt1, get 1 record, then go to pta1, get a bunch of records, join p and wind up with the same number of records, then join pta2, and now you get a huge number of records, then join to pt2 and get just a few records. Joining to pta2 is the slow step, because the database has no idea which records you want, and therefore has to create a temporary result set for every combination of a post and a piece of metadata (type, language or tag) on it.
If this is indeed your problem, then the right plan looks like this. Join pt1 to pta1, put an index on it. Join pt2 to pta2, then join to the result of the first query, then join to p. Then count. This means that we don't get huge result sets.
If this is the case, there is no way to tell the query optimizer that, just this once, you want it to think up a new type of execution plan. But there is a way to force it.
CREATE TEMPORARY TABLE t1
AS
SELECT pta.*
FROM posts_types pt
JOIN posts_types_assignment pta
    ON pt.id = pta.post_type_id
WHERE pt.type = 'language'
    AND pt.name = 'English';
CREATE INDEX idx1 ON t1 (post_id);
CREATE TEMPORARY TABLE t2
AS
SELECT pta.*
FROM posts_types pt
JOIN posts_types_assignment pta
    ON pt.id = pta.post_type_id
JOIN t1
    ON t1.post_id = pta.post_id
WHERE pt.type = 'tag'
    AND pt.name = 'awesome';
SELECT COUNT(*)
FROM posts p
JOIN t2
    ON p.id = t2.post_id;
Barring random typos, etc, this is likely to perform somewhat better. If it doesn't, double check the indexes on your tables.
As btilly notes, and if he has correctly guessed the schema, the table design does not help: it seems (at first sight, at least) that having three tables, post_tag(post_id, tag), post_lang(post_id, lang) and post_type(post_id, type), would be more natural and much more efficient.
Apart from that (or in addition to it), one could think of a table or materialized view that summarizes all the possible counts, with columns (lang, type, tag, nposts). Of course, computing this in full would be VERY slow, but (apart from the first time) it can be maintained either in full, "in the background" at some interval (if the data does not vary much and you don't require exact counts), or eagerly with triggers.
See for example here
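A hedged sketch of that summary idea, assuming the three narrow tables suggested above (all names are illustrative):
-- Precompute every (lang, type, tag) combination; refresh in the background.
CREATE MATERIALIZED VIEW post_counts AS
SELECT l.lang, ty.type, tg.tag, COUNT(*) AS nposts
FROM post_lang l
JOIN post_type ty USING (post_id)
JOIN post_tag tg USING (post_id)
GROUP BY l.lang, ty.type, tg.tag;
REFRESH MATERIALIZED VIEW post_counts;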