Correct sum with multiple subrecords (postgresql)


This may have been asked several times before, but I do not know how to get a correct sum from both parent and child.
Here are the tables:
CREATE TABLE co
(coid int4, coname text);
INSERT INTO co
(coid, coname)
VALUES
(1, 'Volvo'),
(2, 'Ford'),
(3, 'Jeep'),
(4, 'Toyota')
;
CREATE TABLE inv
(invid int4, invco int4, invsum numeric(10,2));
INSERT INTO inv
(invid, invco, invsum)
VALUES
(1,1,100),
(2,1,100),
(3,2,100),
(4,3,100),
(5,4,100)
;
CREATE TABLE po
(poid int4, poinv int4, posum int4);
INSERT INTO po
(poid, poinv, posum)
VALUES
(1,1,50),
(2,1,50),
(3,3,100),
(4,4,100)
;
I started with this simple query:
SELECT coname, sum(invsum)
FROM inv
LEFT JOIN co ON coid=invco
GROUP BY 1
ORDER BY 1
Which gave a correct result:
coname sum
Ford 100
Jeep 100
Toyota 100
Volvo 200
Then I added the po record and the sums became incorrect:
SELECT coname, sum(posum) as po, sum(invsum)
FROM inv
LEFT JOIN co ON coid=invco
LEFT JOIN po ON poinv=invid
GROUP BY 1
ORDER BY 1
Which multiplied the sum for Volvo:
coname po sum
Ford 100 100
Jeep 100 100
Toyota (null) 100 (no records for po = correct)
Volvo 100 300 (wrong sum for inv)
How do I construct a query that gives correct result with multiple subrecords of po? (Window function?)
Sqlfiddle: http://sqlfiddle.com/#!15/0d90c/12

Do the aggregation before the joins. This is a little complicated in your case, because the relationship between co and po seems to require inv:
SELECT co.coname, p.posum, i.invsum
FROM co LEFT JOIN
(SELECT i.invco, sum(i.invsum) as invsum
FROM inv i
GROUP BY i.invco
) i
ON co.coid = i.invco LEFT JOIN
(SELECT i.invco, sum(po.posum) as posum
FROM po JOIN
inv i
ON po.poinv = i.invid
GROUP BY i.invco
) p
ON co.coid = p.invco
ORDER BY 1;
Note: I presume the logic is to keep everything in the co table, even if there are no matches in the other tables. The LEFT JOIN should start with this table, the one with all the rows you want to keep.
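To check the idea of aggregating before joining, here is a minimal sketch that loads the question's sample data into an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for PostgreSQL here, and the function name company_totals is made up for the illustration. The query has the same shape as the one above: each subquery collapses inv (or po joined to inv) to one row per company before it is joined to co.

```python
import sqlite3


def company_totals():
    """Return (coname, posum, invsum) per company, aggregating before joining."""
    con = sqlite3.connect(":memory:")  # SQLite stand-in for PostgreSQL
    con.executescript("""
        CREATE TABLE co  (coid INTEGER, coname TEXT);
        CREATE TABLE inv (invid INTEGER, invco INTEGER, invsum NUMERIC);
        CREATE TABLE po  (poid INTEGER, poinv INTEGER, posum INTEGER);
        INSERT INTO co  VALUES (1,'Volvo'),(2,'Ford'),(3,'Jeep'),(4,'Toyota');
        INSERT INTO inv VALUES (1,1,100),(2,1,100),(3,2,100),(4,3,100),(5,4,100);
        INSERT INTO po  VALUES (1,1,50),(2,1,50),(3,3,100),(4,4,100);
    """)
    rows = con.execute("""
        SELECT co.coname, p.posum, i.invsum
        FROM co
        -- one row per company: invoice total
        LEFT JOIN (SELECT invco, SUM(invsum) AS invsum
                   FROM inv
                   GROUP BY invco) AS i ON co.coid = i.invco
        -- one row per company: po total, reached through inv
        LEFT JOIN (SELECT inv.invco, SUM(po.posum) AS posum
                   FROM po JOIN inv ON po.poinv = inv.invid
                   GROUP BY inv.invco) AS p ON co.coid = p.invco
        ORDER BY co.coname
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in company_totals():
        print(row)
```

Because both subqueries return at most one row per company, neither join can multiply the other's sum, so Volvo comes out with 200 for inv and 100 for po.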

Related

Choose "strongest" intersected area

I have a materialized view which is the result of a spatial join using st_intersect of two polygon layers, table1 and table2. Features of table1 can be intersected by several polygons of table2. This is how I create the mview:
SELECT g.field1,
att.ogc_fid,
st_intersection(g.geom, att.geom) AS intersect_geom,
st_area(g.geom) AS geom_area,
st_area(st_intersection(g.geom, att.geom)) AS intersect_area
FROM table1 g
JOIN table2 att ON g.geom && att.geom;
field1  | ogc_fid | intersect_geom | geom_area | intersect_area
aa12345 | 1       | 123123         | 123131    | 1313123414
aa12345 | 3       | 1              | 1         | 1
bb12345 | 2       | 4124141        | 13141     | 14415151
bb12345 | 1       | 1243141414     | 1231313   | 13131323
From this mview I want to pick just the strongest intersected area and join it to a description coming from table2. I have tried the code below:
select a.*, b.desc
from table1 a
left join lateral
(
select desc
table2
where table2.ogc_fid= table1.ogc_fid
order by (intersect_area/geom_area) DESC NULLS LAST
limit 1
) b
field1  | ogc_fid | intersect_geom | geom_area | intersect_area | desc
aa12345 | 1       | 123123         | 123131    | 1313123414     | desc for 1
bb12345 | 2       | 4124141        | 13141     | 14415151       | desc for 2
But the results are not the expected ones. I went through other threads, but I'm stuck when trying to get just one result (the strongest) and create a table of those strongest intersections, so that each feature in table1 has only its strongest intersection.

If I understood you right, you have done the hard bit already. You just need to pick the one record per field from the view and join with table2... So try this:
SELECT DISTINCT ON (m.field1) m.field1, m.ogc_fid, b."desc"  -- desc is a reserved word, so it is quoted
FROM mview AS m
INNER JOIN table2 AS b ON b.ogc_fid = m.ogc_fid
ORDER BY m.field1, (m.intersect_area / m.geom_area) DESC;
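DISTINCT ON is specific to PostgreSQL, so to try the "one strongest row per field1" idea without a PostGIS setup, the sketch below swaps in a different technique: ROW_NUMBER() over a partition, run against an in-memory SQLite database from Python. The areas are made-up numbers, the geometry column is left out, and the description column is named descr here (a hypothetical name) to sidestep the reserved word.

```python
import sqlite3


def strongest_intersections():
    """Return one (field1, ogc_fid, descr) row per field1: the largest area ratio."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE mview (field1 TEXT, ogc_fid INTEGER,
                            geom_area REAL, intersect_area REAL);
        CREATE TABLE table2 (ogc_fid INTEGER, descr TEXT);
        -- hypothetical areas, chosen only for the illustration
        INSERT INTO mview VALUES
            ('aa12345', 1, 100.0, 80.0),
            ('aa12345', 3, 100.0, 20.0),
            ('bb12345', 2,  50.0, 30.0),
            ('bb12345', 1,  50.0,  5.0);
        INSERT INTO table2 VALUES
            (1, 'desc for 1'), (2, 'desc for 2'), (3, 'desc for 3');
    """)
    rows = con.execute("""
        SELECT field1, ogc_fid, descr
        FROM (
            SELECT m.field1, m.ogc_fid, b.descr,
                   ROW_NUMBER() OVER (
                       PARTITION BY m.field1
                       ORDER BY m.intersect_area / m.geom_area DESC
                   ) AS rn
            FROM mview AS m
            JOIN table2 AS b ON b.ogc_fid = m.ogc_fid
        ) AS ranked
        WHERE rn = 1          -- keep only the strongest row of each group
        ORDER BY field1
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    print(strongest_intersections())
```

The window-function form also runs on PostgreSQL, so it can serve as a portable alternative if DISTINCT ON is not an option.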

How to make postgres (cursor?) start at particular row

I have created the following query:
select t.id, t.row_id, t.content, t.location, t.retweet_count, t.favorite_count, t.happened_at,
a.id, a.screen_name, a.name, a.description, a.followers_count, a.friends_count, a.statuses_count,
c.id, c.code, c.name,
t.parent_id
from tweets t
join accounts a on a.id = t.author_id
left outer join countries c on c.id = t.country_id
where t.row_id > %s
-- order by t.row_id
limit 100
Here %s is a number that starts at 0 and is incremented by 100 after each query. I want to fetch all records from the database this way, increasing only the %s in the WHERE condition. I found this approach at https://ivopereira.net/efficient-pagination-dont-use-offset-limit. I also added a column to my table that corresponds to the row number (I named it row_id). The problem is that the first time I run this query, it returns rows that have a row_id of 3 million. I would like the cursor (not sure if my terminology is correct) to start with the rows with row_id 1 through 100 and continue from there. The table contains 7 million rows. Am I missing something obvious with which I could achieve my goal?
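One thing worth checking is the commented-out ORDER BY: without it, LIMIT is free to return any 100 rows that satisfy the WHERE condition, not the 100 with the smallest row_id. A minimal sketch of the paging loop follows, using an in-memory SQLite table as a stand-in for the real tweets table (so the placeholder is ? rather than %s). The helper name fetch_pages and the sample ids are hypothetical. It also continues from the last row_id actually seen instead of adding a fixed 100, which stays correct if row_id has gaps.

```python
import sqlite3


def fetch_pages(con, page_size):
    """Yield pages of (row_id, content) in row_id order (keyset pagination)."""
    last_seen = 0  # assumes row_id values are positive
    while True:
        page = con.execute(
            "SELECT row_id, content FROM tweets "
            "WHERE row_id > ? ORDER BY row_id LIMIT ?",
            (last_seen, page_size),
        ).fetchall()
        if not page:
            break
        yield page
        last_seen = page[-1][0]  # continue after the last row actually returned


def demo():
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE tweets (row_id INTEGER, content TEXT)")
    # rows inserted out of order and with gaps, to show that neither matters
    ids = [7, 1, 250, 3, 120, 5]
    con.executemany(
        "INSERT INTO tweets VALUES (?, ?)", [(i, f"tweet {i}") for i in ids]
    )
    pages = [[row[0] for row in page] for page in fetch_pages(con, 2)]
    con.close()
    return pages


if __name__ == "__main__":
    print(demo())
```

On a 7 million row table this only stays fast if row_id is indexed, since each page is an ordered range scan starting at the last seen value.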

Snowflake "Exploding Join" issue while doing left join for multiple tables

I am trying to do some left joins on multiple tables and facing the following issue.
Row Counts of tables
Table 1: 1.6M
Table 2: 1.7M
Table 3: 1.5M
When I left join Table 1 and Table 2 with the following query, I get a row count of 1.8 M (acceptable):
SELECT Table1.ID1, Table1.ID2, Table2.Name, Table2.City
FROM Table1
LEFT JOIN Table2
ON Table1.ID1 = Table2.ID1
AND Table1.ID2 = Table2.ID2
AND Table1.Source_System = Table2.Source_System
;
Similarly, when I left join Table 1 and Table 3 with the following query, I get a row count of 1.9 M (acceptable):
SELECT Table1.ID1, Table1.ID2, Table3.Name, Table3.City
FROM Table1
LEFT JOIN Table3
ON Table1.ID1 = Table3.ID1
AND Table1.ID2 = Table3.ID2
AND Table1.Source_System = Table3.Source_System
;
But when I left join Tables 1, 2 and 3 with the following query, I get a row count of 11.9 G (ISSUE):
SELECT
Table1.ID1, Table1.ID2,
Table2.Name, Table2.City,
Table3.Name as Name1, Table3.City as City1
FROM Table1
LEFT JOIN Table2
ON Table1.ID1 = Table2.ID1
AND Table1.ID2 = Table2.ID2
AND Table1.Source_System = Table2.Source_System
LEFT JOIN Table3
ON Table1.ID1 = Table3.ID1
AND Table1.ID2 = Table3.ID2
AND Table1.Source_System = Table3.Source_System
;

It seems you have assumed that table1 and table2 join in a 1:1 ratio, that table1 and table3 also join in a 1:1 ratio, and therefore that joining all three tables should again give roughly 1:1.
But a left join keeps every row of table1, so any growth beyond 1.6M comes from keys that match more than one row. If only part of table1 matches table2, the matching keys must carry correspondingly more duplicates to reach 1.8M, and the smaller the matching share, the more duplicates per key. When the same keys are duplicated in both table2 and table3, the duplicates multiply: a key with 100 matches in each table yields 100 × 100 = 10,000 rows. That per-key cross join is the kind of effect needed to explain the roughly four orders of magnitude of growth you see.
This can be seen via:
SELECT Table1.ID1, Table1.ID2, Table1.Source_System, count(*) as counts
FROM Table1
LEFT JOIN Table2
ON Table1.ID1 = Table2.ID1
AND Table1.ID2 = Table2.ID2
AND Table1.Source_System = Table2.Source_System
GROUP BY 1,2,3
ORDER BY counts DESC
;
This shows the row count for each distinct key combination, with the worst contributors to the combinatorial explosion at the top.
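The multiplication is easy to reproduce on a toy scale. The sketch below uses an in-memory SQLite database from Python as a stand-in for Snowflake. The tables t1, t2, t3 and their single join key id1 are simplified, hypothetical versions of the real tables. Key 1 has 3 matches in t2 and 4 in t3, key 2 has one match in each, and key 3 has none.

```python
import sqlite3


def join_row_counts():
    """Show how duplicate keys in two joined tables multiply per key."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE t1 (id1 INTEGER);
        CREATE TABLE t2 (id1 INTEGER, name TEXT);
        CREATE TABLE t3 (id1 INTEGER, name TEXT);
        INSERT INTO t1 VALUES (1), (2), (3);
        INSERT INTO t2 VALUES (1,'a'), (1,'b'), (1,'c'), (2,'d');
        INSERT INTO t3 VALUES (1,'w'), (1,'x'), (1,'y'), (1,'z'), (2,'v');
    """)

    def scalar(sql):
        return con.execute(sql).fetchone()[0]

    counts = {
        "t1_t2": scalar(
            "SELECT COUNT(*) FROM t1 LEFT JOIN t2 ON t1.id1 = t2.id1"),
        "t1_t3": scalar(
            "SELECT COUNT(*) FROM t1 LEFT JOIN t3 ON t1.id1 = t3.id1"),
        "all_three": scalar(
            "SELECT COUNT(*) FROM t1 "
            "LEFT JOIN t2 ON t1.id1 = t2.id1 "
            "LEFT JOIN t3 ON t1.id1 = t3.id1"),
    }
    # per-key contribution to the three-table join, worst offenders first
    per_key = con.execute("""
        SELECT t1.id1, COUNT(*) AS counts
        FROM t1
        LEFT JOIN t2 ON t1.id1 = t2.id1
        LEFT JOIN t3 ON t1.id1 = t3.id1
        GROUP BY t1.id1
        ORDER BY counts DESC, t1.id1
    """).fetchall()
    con.close()
    return counts, per_key


if __name__ == "__main__":
    print(join_row_counts())
```

Each two-table join grows only a little (3 rows become 5 and 6), but the three-table join returns 14 rows, 12 of which come from the single key that is duplicated on both sides.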

A left join that produces more rows than the left-hand table should not be considered acceptable. It is a warning sign about your join condition or your data. Either investigate the offending records in the tables so the problem is avoided in the first place, or keep adjusting your SQL until the join returns exactly the row count of the reference table. Otherwise, it is very common for a left join against a table with even a few duplicate records to multiply the row count, as you are seeing here.
Try reading these questions to help: here and here
To find those rows, use the following SQL in each table to list the rows that share the same ID1, ID2 and Source_System values,
i.e.:
Select ID1, ID2 ,Source_System, COUNT(*) AS NUM_RECORDS_DUPS
FROM TABLE1
GROUP BY ID1, ID2 , Source_System
HAVING COUNT(*)>1 -- Filtering on duplicate rows that has more than a row satisfying the join condition
Run the same query against each of the tables to find those records, then either add another condition that makes the join unique, aggregate the table on the join keys, or ask for the data to be cleansed.
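As a rough illustration of the "aggregate the table on the joining keys" option, the sketch below (in-memory SQLite from Python, with made-up rows) first lists the duplicated keys in table2 and then joins table1 to a version of table2 collapsed to one row per key. Picking MIN(name) is an arbitrary assumption made only for the example. Which row should survive is a decision about your data.

```python
import sqlite3


def dedupe_demo():
    """Find duplicate join keys, then join against a one-row-per-key subquery."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE table1 (id1 INTEGER, id2 INTEGER, source_system TEXT);
        CREATE TABLE table2 (id1 INTEGER, id2 INTEGER, source_system TEXT,
                             name TEXT);
        INSERT INTO table1 VALUES (1,1,'A'), (2,1,'A');
        INSERT INTO table2 VALUES (1,1,'A','Ann'), (1,1,'A','Bob'),
                                  (2,1,'A','Cy');
    """)
    # keys that appear more than once in table2
    dups = con.execute("""
        SELECT id1, id2, source_system, COUNT(*) AS num_records_dups
        FROM table2
        GROUP BY id1, id2, source_system
        HAVING COUNT(*) > 1
    """).fetchall()
    # collapse table2 to one row per key before joining
    joined = con.execute("""
        SELECT t1.id1, t1.id2, t2.name
        FROM table1 AS t1
        LEFT JOIN (SELECT id1, id2, source_system, MIN(name) AS name
                   FROM table2
                   GROUP BY id1, id2, source_system) AS t2
               ON t1.id1 = t2.id1
              AND t1.id2 = t2.id2
              AND t1.source_system = t2.source_system
        ORDER BY t1.id1
    """).fetchall()
    con.close()
    return dups, joined


if __name__ == "__main__":
    print(dedupe_demo())
```

The joined result has exactly as many rows as table1, which is the property the answer asks you to aim for.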

Have you tried adding a DISTINCT clause?
SELECT DISTINCT columns, of, choice
FROM Table1
LEFT JOIN Table2 on ...
LEFT JOIN Table3 on ...
I think what's happening is you have dups that left join on another giant set of dups.

Use the proper keys to join the two tables, it solves the issue.

Find equal twin record postgresql

I have a table company with 60 columns. The goal is to create a tool to find, compare and eliminate duplicates in this table.
Example: I have a record with id 22 and I know it has a twin because I run this (simplified code):
SELECT min(co_id),co_name,count(*) FROM co
GROUP BY co_name
HAVING count(*) > 1
The result shows there is one twin (count 2), and min(co_id) gives me the oldest id.
My question is: how do I search for the twin's co_id, passing only the oldest id?
Something like:
SELECT co_id FROM co
WHERE co_name EQUAL TO co_id='22'
LIMIT 2
Sample data:
id co_name
22 Volvo
23 Volvo
24 Ford
25 Ford
I know id 22 and I want to search for the twin 23 based on the content of 22.
The closest I have found is this, which is far from generic and a nightmare when comparing 60 fields:
SELECT id,
(SELECT max(b.id) from co b
WHERE a.co_name = b.co_name
LIMIT 1) as twin
FROM co a
WHERE id='22'
How do I do this in a more simple and generic way? I just want the twin record co_id.
Thank you in advance!

select max_co, co_name from (
select max(co_id) max_co, min(co_id) min_co, co_name from co
group by co_name having count(*) > 1) t -- subquery alias, required by PostgreSQL before version 16
where min_co = 22; -- pass your known (oldest) co_id here

You can join your table with itself:
SELECT c1.*
FROM
co_name c1 INNER JOIN co_name c2
ON c1.co_name=c2.co_name
AND c1.id>c2.id
this will return all duplicated records (but not the original record with the lowest id). Or since you're using Postgresql you can use a window function:
SELECT *
FROM (
SELECT
id,
co_name,
row_number() OVER (PARTITION by co_name ORDER BY id) as row
FROM
co_name
) s
WHERE
row>1;
Please see an example here.
If you want to compare multiple columns, the JOIN solution would be more flexible. I don't know exactly how you want to compare your columns or how exactly you define "twin" rows, but a query like this should help:
SELECT c1.*
FROM
co_name c1 INNER JOIN co_name c2
ON (
c1.co_name=c2.co_name
OR c1.co_city=c2.co_city
OR c1.co_owner=c2.co_owner
OR ...
) AND c1.id>c2.id
if you just want duplicated records of id=22 then you can try with this:
SELECT c1.*
FROM
co_name c1 INNER JOIN co_name c2
ON c1.co_name=c2.co_name
AND c1.id>c2.id
WHERE
c2.id=22
or if you just want a single twin, comparing 60 columns, you can try with this query:
SELECT MIN(c1.id) as Twin /* or MAX(c1.id), depending what you're after; id must be qualified because both c1 and c2 have it */
FROM
co_name c1 INNER JOIN co_name c2
ON (
c1.co_name=c2.co_name
OR c1.co_city=c2.co_city
OR c1.co_owner=c2.co_owner
OR ...
) AND c1.id>c2.id
WHERE
c2.id=22
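Here is a small runnable sketch of the self-join idea, using the question's sample rows in an in-memory SQLite database from Python. The table is called co with columns id, co_name and co_city. The city values are invented for the illustration. Unlike the queries above, it assumes a "twin" must match on every compared column (AND rather than OR) and uses <> so that any id of the pair can be passed in.

```python
import sqlite3


def find_twins(known_id):
    """Return ids of rows whose compared columns all equal those of known_id."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE co (id INTEGER PRIMARY KEY, co_name TEXT, co_city TEXT);
        -- co_city values are hypothetical
        INSERT INTO co VALUES
            (22, 'Volvo', 'Gothenburg'),
            (23, 'Volvo', 'Gothenburg'),
            (24, 'Ford',  'Dearborn'),
            (25, 'Ford',  'Detroit');
    """)
    rows = con.execute("""
        SELECT c1.id
        FROM co AS c1
        INNER JOIN co AS c2
                ON c1.co_name = c2.co_name
               AND c1.co_city = c2.co_city   -- add further columns here
               AND c1.id <> c2.id            -- exclude the row itself
        WHERE c2.id = ?
        ORDER BY c1.id
    """, (known_id,)).fetchall()
    con.close()
    return [r[0] for r in rows]


if __name__ == "__main__":
    print(find_twins(22))
```

With this data, 22 and 23 find each other, while 24 has no twin because 25 differs in co_city. Note that plain = never matches NULL values, so nullable columns would need extra handling.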

I found one solution that works on 60 columns if I use variables instead of hardcoding the column names in the query. Thanks everybody for all the input. Some of the answers were along the same track.
SELECT id,
(SELECT max(b.id) from co b
WHERE concat(a.co_name,etc) = concat(b.co_name,etc)
LIMIT 1) as twin
FROM co a
WHERE id='22'
It is not the best one, but it fetches one twin at a time, and it is far from generic. Thanks for pointing me in the right direction. A generic solution would be nicer.

SSRS 2005 column chart: show series label missing when data count is zero

I have a pretty simple chart with a likely common issue. I've searched for several hours on the interweb but only get so far in finding a similar situation.
The basics of what I'm pulling contain a created_by, person_id and risk score.
The risk score can be:
1 VERY LOW
2 LOW
3 MODERATE STABLE
4 MODERATE AT RISK
5 HIGH
6 VERY HIGH
I want to get a headcount of persons at each risk score and display a risk count even if there is a count of 0 for that risk score but SSRS 2005 likes to suppress zero counts.
I've tried this in the point labels
=IIF(IsNothing(count(Fields!person_id.value)),0,count(Fields!person_id.value))
Ex: I'm missing values for "1 LOW" as the creator does not have any "1 LOW" they've assigned risk scores for.
Here's a screenshot of what I get, but I'd like to have a column even when that count doesn't exist in the returned results.
@Nathan
Example scenario:
select professor.name, grades.score, student.person_id
from student
inner join grades on student.person_id = grades.person_id
inner join professor on student.professor_id = professor.professor_id
where
student.professor_id = @professor
Not all students are necessarily in the grades table.
I have a =Count(Fields!person_id.Value) for my data points & series is grouped on =Fields!score.Value
If there were a bunch of A, B and D grades but no C's or F's, how would I show labels for the potentially non-existent counts?

In your example, the problem is that no results are returned for grades that are not linked to any students. To solve this ideally there would be a table in your source system which listed all the possible values of "score" (e.g. A - F) and you would join this into your query such that at least one row was returned for each possible value.
If such a table doesn't exist and the possible score values are known and static, then you could manually create a list of them in your query. In the example below I create a subquery that returns a combination of all professors and all possible scores (A - F) and then LEFT join this to the grades and students tables (left join means that the professor/score rows will be returned even if no students have those scores in the "grades" table).
SELECT
professor.name
, professorgrades.score
, student.person_id
FROM
(
SELECT professor_id, score
FROM professor
CROSS JOIN
(
SELECT 'A' AS score
UNION
SELECT 'B'
UNION
SELECT 'C'
UNION
SELECT 'D'
UNION
SELECT 'E'
UNION
SELECT 'F'
) availablegrades
) professorgrades
INNER JOIN professor ON professorgrades.professor_id = professor.professor_id
LEFT JOIN grades ON professorgrades.score = grades.score
LEFT JOIN student ON grades.person_id = student.person_id AND
professorgrades.professor_id = student.professor_id
WHERE professorgrades.professor_id = 1
See a live example of how this works here: SQLFIDDLE
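For a quick local check of the same approach, the sketch below builds tiny professor, student and grades tables in an in-memory SQLite database from Python (a stand-in for the SQL Server source behind SSRS) and adds a COUNT on top of the query so the zero counts are visible. All rows, and the function name grade_counts, are made up for the illustration.

```python
import sqlite3


def grade_counts(professor_id):
    """Return (score, headcount) for every possible score, including zeros."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE professor (professor_id INTEGER, name TEXT);
        CREATE TABLE student   (person_id INTEGER, professor_id INTEGER);
        CREATE TABLE grades    (person_id INTEGER, score TEXT);
        -- hypothetical sample rows
        INSERT INTO professor VALUES (1, 'Prof One');
        INSERT INTO student   VALUES (10, 1), (11, 1), (12, 1);
        INSERT INTO grades    VALUES (10, 'A'), (11, 'A'), (12, 'B');
    """)
    rows = con.execute("""
        SELECT pg.score, COUNT(student.person_id) AS headcount
        FROM (
            -- every professor paired with every possible score
            SELECT professor_id, score
            FROM professor
            CROSS JOIN (
                SELECT 'A' AS score UNION SELECT 'B' UNION SELECT 'C'
                UNION SELECT 'D' UNION SELECT 'E' UNION SELECT 'F'
            ) AS availablegrades
        ) AS pg
        LEFT JOIN grades  ON pg.score = grades.score
        LEFT JOIN student ON grades.person_id = student.person_id
                         AND pg.professor_id = student.professor_id
        WHERE pg.professor_id = ?
        GROUP BY pg.score
        ORDER BY pg.score
    """, (professor_id,)).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    print(grade_counts(1))
```

COUNT(student.person_id) ignores the NULLs produced by the left joins, so scores with no students come back as 0 instead of disappearing, which gives the chart a row to draw for every series value.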

SELECT RS.RiskScoreId, RS.Description, SUM(DT.RiskCount) AS RiskCount
FROM (
SELECT RiskScoreId, 1 AS RiskCount
FROM People
UNION ALL
SELECT RiskScoreId, 0 AS RiskCount
FROM RiskScores
) DT
INNER JOIN RiskScores RS ON RS.RiskScoreId = DT.RiskScoreId
GROUP BY RS.RiskScoreId, RS.Description
ORDER BY RS.RiskScoreId
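The query above pads the data with one zero-valued row per risk score, so every score survives the GROUP BY even when nobody has it. A minimal sketch to try it out follows, using an in-memory SQLite database from Python as a stand-in for the real source. The People and RiskScores rows are hypothetical, and only three of the six scores are loaded to keep it short.

```python
import sqlite3


def risk_counts():
    """Return (RiskScoreId, Description, RiskCount) for every score, zeros included."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE RiskScores (RiskScoreId INTEGER, Description TEXT);
        CREATE TABLE People     (PersonId INTEGER, RiskScoreId INTEGER);
        -- hypothetical sample rows
        INSERT INTO RiskScores VALUES
            (1, 'VERY LOW'), (2, 'LOW'), (3, 'MODERATE STABLE');
        INSERT INTO People VALUES (100, 2), (101, 2), (102, 3);
    """)
    rows = con.execute("""
        SELECT RS.RiskScoreId, RS.Description, SUM(DT.RiskCount) AS RiskCount
        FROM (
            SELECT RiskScoreId, 1 AS RiskCount FROM People      -- one per person
            UNION ALL
            SELECT RiskScoreId, 0 AS RiskCount FROM RiskScores  -- zero-row padding
        ) AS DT
        INNER JOIN RiskScores AS RS ON RS.RiskScoreId = DT.RiskScoreId
        GROUP BY RS.RiskScoreId, RS.Description
        ORDER BY RS.RiskScoreId
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    print(risk_counts())
```

UNION ALL matters here: plain UNION would remove duplicate (RiskScoreId, 1) rows and undercount the people.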