PostgreSQL: adding a "TOTAL" row with more than one GROUP BY column

I was following this topic:
PostgreSQL - making first row show as total of other rows
and I used this query to accomplish something similar in my code:
with w as ( select fruits, sum(a) a, sum(b) b, sum(c) c
from basket
group by fruits )
select * from w union all select 'total', sum(a), sum(b), sum(c) from w
It works fine, but I now need to put two more columns before the sum columns, similar to the fruits one, and I'm getting an error:
"... must appear in the GROUP BY clause or be used in an aggregate function"
Any help on how to do the same as the example above, but with two more columns like "fruits"?
(Sorry, my rep didn't let me continue the previous topic.)

It was an easier fix than I thought.
with w as ( select fruits, vegetables, cereals, sum(a) a, sum(b) b, sum(c) c
from basket
group by fruits, vegetables, cereals )
select * from w union all select 'total', null, null, sum(a), sum(b), sum(c) from w
Two NULLs in the last select solved the problem: they pad the total row so that both branches of the UNION ALL return the same number of columns.
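A quick way to check the padded total row is to run the same query against a throwaway database. This is only a sketch: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the three basket rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table basket (fruits text, vegetables text, cereals text,"
    " a int, b int, c int)"
)
conn.executemany(
    "insert into basket values (?, ?, ?, ?, ?, ?)",
    [
        ("apple", "carrot", "oats", 1, 2, 3),
        ("apple", "carrot", "oats", 4, 5, 6),
        ("pear", "leek", "rye", 10, 20, 30),
    ],
)

# Same shape as the query above: grouped rows plus one padded total row.
query = """
with w as (
    select fruits, vegetables, cereals, sum(a) a, sum(b) b, sum(c) c
    from basket
    group by fruits, vegetables, cereals
)
select * from w
union all
select 'total', null, null, sum(a), sum(b), sum(c) from w
"""
rows = conn.execute(query).fetchall()
total_row = [row for row in rows if row[0] == "total"][0]
for row in rows:
    print(row)
```

The two grouping columns come back as None (NULL) in the total row, which is usually what a report wants there.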

What you need is called "grouping sets"; the keyword here is ROLLUP. PostgreSQL did not support it when this was written; GROUPING SETS, ROLLUP and CUBE arrived in PostgreSQL 9.5. On older versions you have to continue on the path you have chosen (subselects, CTEs, etc.).
Atri Sharma was working on this feature at the time.

Related

Postgres - Insert nearest neighbour distance into another table

So I have three tables (A, B, C). In tables A and B I have points, and I want to insert into C each row from A, and some columns from the closest point from B to each point in A, as well as the distance between them. I know that the query to get the nearest neighbour is this:
SELECT DISTINCT ON (A.id5) A.state, B.way, ST_Distance(A.geom, B.geom) INTO C
FROM A, B
WHERE ST_DWithin(A.geom, B.geom, 150)
ORDER BY A.id5, ST_Distance(A.geom, B.geom)
But I need to get that into a bigger INSERT query, and I tried to do it this way:
INSERT INTO complete(id_door, distance, id_way,Y, X, geom, check)
(SELECT A.state, (select distinct on (A.id5) ST_DISTANCE(A.geom,B.geom) from A order by A.id5, st_distance(A.geom,B.geom)), b.way, ST_Y(B.geom), ST_X(B.geom) ,B.geom, V.check
FROM A, B, C, V
WHERE
ST_INTERSECTS(A.geom, V.geom)
AND ST_DWithin(A.geom, B.geom,150))
But this is not the right way, because I get the error:
psycopg2.ProgrammingError: more than one row returned by a subquery used as an expression
I cannot copy all the distances from A and B to C and then delete all but the closest because it is a huge table and I would run out of memory, so I need a way to only insert the rows with the info from the closest point from B to A.
What am I doing wrong here? Thank you in advance
UPDATE:
After some help, I have learned that I should use a Lateral in the Select query, but I'm not sure how to use it.
I need the SELECT to take each row in table A and find its nearest neighbour in table B, which I guess is done with the query stated previously, and then insert into table C some columns from A, some columns from the nearest neighbour (table B), and some columns from table V, which is selected by an intersect condition. The main problem is how to organize all that into the SELECT so I don't get an error.
This is where I am at this point:
INSERT INTO C (id_door, distance, id_way,Y, X, geom, check)
(SELECT A.state, l.*, V.check
FROM A, B, C, V
lateral (select st_distance(a.geom,b.geom), b.way, ST_Y(B.geom), ST_X(B.geom) ,B.geom
From B
Where ST_DWithin(a.geom, b.geom,150))
Order by a.geom<->b.geom limit 1) l
WHERE
ST_INTERSECTS(A.geom, V.geom)
You can use a lateral join: a subquery in the FROM clause that can reference tables listed before it in the same FROM clause.
-- Edited according to new information in answer --
Insert into C (id_door, distance, id_way,Y, X, geom, check)
select l.*
from a,
lateral (select a.state, st_distance(a.geom,b.geom),
b.way, ST_Y(B.geom), ST_X(B.geom), B.geom,
v.check
from b, v
where ST_DWithin(a.geom, b.geom,150)
and st_dwithin(a.geom,v.geom,0)
and st_intersects(a.geom,v.geom)
order by a.geom<->b.geom, v.geom limit 1) l
If you want more than one record for each point in A, increase the limit from 1 to the desired value.
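To see what the lateral subquery computes, it can help to think of it as a loop: for every row of A, rank the rows of B by distance, keep those within range, and take the first `limit` of them. The sketch below does that in plain Python. The point coordinates, the `nearest_within` helper and the planar distance are assumptions made for illustration, not PostGIS behaviour.

```python
import math

# Hypothetical planar points standing in for tables A and B: (id, x, y).
points_a = [(1, 0.0, 0.0), (2, 100.0, 100.0), (3, 1000.0, 1000.0)]
points_b = [(10, 3.0, 4.0), (11, 50.0, 50.0), (12, 110.0, 100.0)]

MAX_DISTANCE = 150.0  # plays the role of ST_DWithin(a.geom, b.geom, 150)


def nearest_within(a, candidates, max_distance, limit=1):
    """Return up to `limit` (b_id, distance) pairs closest to point a."""
    _, ax, ay = a
    scored = [(b_id, math.hypot(bx - ax, by - ay)) for b_id, bx, by in candidates]
    in_range = [pair for pair in scored if pair[1] <= max_distance]
    return sorted(in_range, key=lambda pair: pair[1])[:limit]


# One "lateral" evaluation per row of A. Points of A with no neighbour in
# range produce no output row, as with the comma join in the query above.
rows_for_c = [
    (a[0], b_id, dist)
    for a in points_a
    for b_id, dist in nearest_within(a, points_b, MAX_DISTANCE)
]
print(rows_for_c)
```

Passing a larger `limit` mirrors raising the LIMIT in the SQL: each point of A then contributes several rows.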

LEAD and LAG on a large table (1 billion rows)

I have a table T as follows with 1 Billion records. Currently, this table has no Primary key or Indexes.
create table T(
day_c date,
str_c varchar2(20),
comm_c varchar2(20),
src_c varchar2(20)
);
some sample data:
insert into T
select to_date('20171011','yyyymmdd') day_c,'st1' str_c,'c1' comm_c,'s1' src_c from dual
union
select to_date('20171012','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171013','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171014','yyyymmdd'),'st1','c1','s2' from dual
union
select to_date('20171015','yyyymmdd'),'st1','c1','s2' from dual
union
select to_date('20171016','yyyymmdd'),'st1','c1','s2' from dual
union
select to_date('20171017','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171018','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171019','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171020','yyyymmdd'),'st1','c1','s1' from dual;
The expected result is to generate the date ranges for the changes in column src_c.
I have the following code snippet which provides the desired result. However, it is slow as the cost of running lag and lead is quite high on the table.
WITH EndsMarked AS (
SELECT
day_c,str_c,comm_c,src_c,
CASE WHEN src_c= LAG(src_c,1) OVER (ORDER BY day_c)
THEN 0 ELSE 1 END AS IS_START,
CASE WHEN src_c= LEAD(src_c,1) OVER (ORDER BY day_c)
THEN 0 ELSE 1 END AS IS_END
FROM T
), GroupsNumbered AS (
SELECT
day_c,str_c,comm_c,
src_c,
IS_START,
IS_END,
COUNT(CASE WHEN IS_START = 1 THEN 1 END)
OVER (ORDER BY day_c) AS GroupNum
FROM EndsMarked
WHERE IS_START=1 OR IS_END=1
)
SELECT
str_c,comm_c,src_c,
MIN(day_c) AS GROUP_START,
MAX(day_c) AS GROUP_END
FROM GroupsNumbered
GROUP BY str_c,comm_c, src_c,GroupNum
ORDER BY groupnum;
Output :
STR_C COMM_C SRC_C GROUP_START GROUP_END
st1 c1 s1 11-OCT-17 13-OCT-17
st1 c1 s2 14-OCT-17 16-OCT-17
st1 c1 s1 17-OCT-17 20-OCT-17
Any suggestion to speed up?
Oracle database :12c.
SGA Memory:20GB
Total CPU:22
Do you order by day_c only, or do you need to partition by str_c and comm_c first? It seems you do, in which case I am not sure your query is correct, and Sentinel's solution will need to be adjusted accordingly.
Then:
For some reason (which escapes me), it appears that the match_recognize clause (available only since Oracle 12.1) is faster than analytic functions, even when the work done seems to be the same.
In your problem, (1) you must read 1 billion rows from disk, which can't be done faster than the hardware allows. (Do you REALLY need to do this on all 1 billion rows, or could you archive a large portion of your table, perhaps after identifying GROUP_START and GROUP_END?) (2) You must order the data by day_c no matter what method you use, and that is time-consuming.
With that said, the tabibitosan method (see Sentinel's answer) will be faster than the start-of-group method (which is close to, but simpler than what you currently have).
The match_recognize solution, which will probably be faster than any solution based on analytic functions, looks like this:
select str_c, comm_c, src_c, group_start, group_end
from t
match_recognize(
partition by str_c, comm_c
order by day_c
measures x.src_c as src_c,
first(day_c) as group_start,
last(day_c) as group_end
pattern ( x y* )
define y as src_c = x.src_c
)
-- Add ORDER BY clause here, if needed
;
Here is a quick explanation of how this works; for developers who are not familiar with match_recognize, I provided links to a few good tutorials in a Comment below this Answer.
The match_recognize clause partitions the input rows by str_c and comm_c and orders them by day_c. So far this is exactly the same work that analytic functions do.
Then in the PATTERN and DEFINE clauses I declare and define two "classes" of rows, which will be flagged as X and Y, respectively. X is any row (there are no restrictions on it in the DEFINE clause). However, Y is restricted: it must have the same src_c as the last X row preceding it.
So, in each partition, reading from the earliest row to the latest, I am looking for any number of matches, where a match consists of an arbitrary row (marked X) followed by as many Y rows as possible, and Y means "same src_c as the first row in this match". This identifies sequences of rows where src_c did not change.
For each match that is found, the clause will output the src_c value from the X row (which is the same, really, for all the rows in that match), and the first and the last value in the day_c column for that match. That is what we need to put in the SELECT clause of the overall query.
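For readers without an Oracle 12c instance at hand, the X Y* pattern can be imitated in a few lines of Python. This is a sketch of the logic only, not of how Oracle executes it: itertools.groupby collects consecutive rows with the same key, which is what "one X row followed by as many matching Y rows as possible" amounts to once the rows are sorted by partition and day. The rows are the question's sample data, with dates written as ISO strings.

```python
from itertools import groupby

# The question's sample rows as (day_c, str_c, comm_c, src_c).
# ISO date strings sort in the same order as the dates themselves.
sample = [
    ("2017-10-11", "s1"), ("2017-10-12", "s1"), ("2017-10-13", "s1"),
    ("2017-10-14", "s2"), ("2017-10-15", "s2"), ("2017-10-16", "s2"),
    ("2017-10-17", "s1"), ("2017-10-18", "s1"), ("2017-10-19", "s1"),
    ("2017-10-20", "s1"),
]
table_rows = [(day, "st1", "c1", src) for day, src in sample]

# PARTITION BY str_c, comm_c ORDER BY day_c
ordered = sorted(table_rows, key=lambda r: (r[1], r[2], r[0]))

# Each run of consecutive rows with the same (str_c, comm_c, src_c) is one match.
ranges = []
for (str_c, comm_c, src_c), run in groupby(ordered, key=lambda r: (r[1], r[2], r[3])):
    days = [r[0] for r in run]
    ranges.append((str_c, comm_c, src_c, days[0], days[-1]))  # first/last day_c

for row in ranges:
    print(row)
```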
You can eliminate one CTE by using the Tabibito-san (Traveler) method:
with Groups as (
select t.*
, row_number() over (order by day_c)
- row_number() over (partition by str_c
, comm_c
, src_c
order by day_c) GroupNum
from t
)
select str_c
, comm_c
, src_c
, min(day_c) GROUP_START
, max(day_c) GROUP_END
from Groups
group by str_c
, comm_c
, src_c
, GroupNum
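The query above can be tried on the question's ten sample rows without Oracle. The sketch below runs it in SQLite through Python's sqlite3 module, assuming an SQLite build with window functions (3.25 or later). Dates are stored as ISO text, the CTE is renamed `grouped`, and an ORDER BY is added; those three changes are mine, not the answer's.

```python
import sqlite3

# In-memory SQLite as a stand-in for Oracle; dates are kept as ISO text.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (day_c text, str_c text, comm_c text, src_c text)")
sample = [
    ("2017-10-11", "s1"), ("2017-10-12", "s1"), ("2017-10-13", "s1"),
    ("2017-10-14", "s2"), ("2017-10-15", "s2"), ("2017-10-16", "s2"),
    ("2017-10-17", "s1"), ("2017-10-18", "s1"), ("2017-10-19", "s1"),
    ("2017-10-20", "s1"),
]
conn.executemany("insert into t values (?, 'st1', 'c1', ?)", sample)

# The difference of the two row numbers is constant inside each unbroken
# run of src_c, so it can serve as a group label.
query = """
with grouped as (
    select t.*,
           row_number() over (order by day_c)
         - row_number() over (partition by str_c, comm_c, src_c
                              order by day_c) as groupnum
    from t
)
select str_c, comm_c, src_c,
       min(day_c) as group_start,
       max(day_c) as group_end
from grouped
group by str_c, comm_c, src_c, groupnum
order by group_start
"""
date_ranges = conn.execute(query).fetchall()
for row in date_ranges:
    print(row)
```

On this data the label is 0 for the first s1 run and 3 for both later runs; the s2 run and the second s1 run stay apart only because src_c is also in the GROUP BY.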

Create multiple incrementing columns using with recursive in postgresql

I'm trying to create a table with the following columns:
I want to use a with recursive table to do this. The following code however is giving the following error:
'ERROR: column "b" does not exist'
WITH recursive numbers AS
(
SELECT 1,2,4 AS a, b, c
UNION ALL
SELECT a+1, b+1, c+1
FROM Numbers
WHERE a + 1 <= 10
)
SELECT * FROM numbers;
I'm stuck because when I just include one column this works perfectly. Why is there an error for multiple columns?
This appears to be a simple syntax issue: you are aliasing the columns incorrectly. SELECT 1,2,4 AS a, b, c does not name three columns; it selects five expressions: 1, 2, 4 AS a, b, and c.
Break it down to just: Select 1,2,4 as a,b,c and you see the error, but Select 1 a,2 b,4 c works fine.
b is unknown in the base select because it is interpreted as a column name, yet no table with such a column is in scope. Additionally, the union would fail because the base select has 5 columns and the recursive select has 3.
DEMO: http://rextester.com/IUWJ67486
One can define the columns outside the select making it easier to manage or change names.
WITH recursive numbers (a,b,c) AS
(
SELECT 1,2,4
UNION ALL
SELECT a+1, b+1, c+1
FROM Numbers
WHERE a + 1 <= 10
)
SELECT * FROM numbers;
Or use this approach, which aliases the columns inside the first select. Whatever aliases the second select uses (someReallyLongAlias and so on), the result takes its column names (a, b, c) from the first select of the union. Note that not only the column names but also the data types come from the first query, so the types must match between the two queries.
WITH recursive numbers AS
(
SELECT 1 as a ,2 as b,4 as c
UNION ALL
SELECT a+1 someReallyLongAlias
, b+1 someReallyLongAliasAgain
, c+1 someReallyLongAliasYetAgain
FROM Numbers
WHERE a<5
)
SELECT * FROM numbers;
Lastly, if you truly want to stop at 5, then the where clause should be WHERE a < 5. The image depicts this whereas the query does not, so I'm not sure what your end goal is here.
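If the rextester demo is unavailable, the column-list version can be checked locally. A minimal sketch, using Python's sqlite3 (SQLite also supports WITH RECURSIVE) rather than PostgreSQL:

```python
import sqlite3

# SQLite stands in for PostgreSQL here; both accept this recursive CTE.
conn = sqlite3.connect(":memory:")

# Column names are declared once, next to the CTE name.
query = """
with recursive numbers (a, b, c) as (
    select 1, 2, 4
    union all
    select a + 1, b + 1, c + 1
    from numbers
    where a + 1 <= 10
)
select a, b, c from numbers order by a
"""
number_rows = conn.execute(query).fetchall()
for row in number_rows:
    print(row)
```

With the original condition a + 1 <= 10 this yields ten rows, from (1, 2, 4) up to (10, 11, 13).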

Full outer join on multiple tables in PostgreSQL

In PostgreSQL, I have N tables, each consisting of two columns: id and value. Within each table, id is a unique identifier and value is numeric.
I would like to join all the tables using id and, for each id, create a sum of values of all the tables where the id is present (meaning the id may be present only in subset of tables).
I was trying the following query:
SELECT COALESCE(a.id, b.id, c.id) AS id,
COALESCE(a.value,0) + COALESCE(b.value,0) + COALESCE(c.value,0) AS value
FROM
a
FULL OUTER JOIN
b
ON (a.id=b.id)
FULL OUTER JOIN
c
ON (b.id=c.id)
But it doesn't work for cases when the id is present in a and c, but not in b.
I suppose I would have to do some bracketing like:
SELECT COALESCE(x.id, c.id) AS id, COALESCE(x.value,0) + COALESCE(c.value,0) AS value
FROM
(SELECT COALESCE(a.id, b.id) AS id, COALESCE(a.value,0) + COALESCE(b.value,0) AS value
FROM
a
FULL OUTER JOIN
b
ON (a.id=b.id)
) AS x
FULL OUTER JOIN
c
ON (x.id = c.id)
That was only 3 tables and the code is ugly enough already, imho. Is there an elegant, systematic way to do the join for N tables without getting lost in my code?
I would also like to point out that I did some simplifications in my example. Tables a, b, c, ..., are actually results of quite complex queries over several materialized views. But the syntactical problem remains the same.
As I understand it, you need to sum the values from N tables and group them by id, correct?
For that I would do this:
Select x.id, sum (x.value) from (
Select * from a
Union all
Select * from b
Union all........
) as x group by x.id;
Since the N tables all have the same columns, you can UNION ALL them into one big table holding every (id, value) tuple from all tables. Use UNION ALL because UNION removes duplicates!
Then just sum all the values grouped by id.
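Here is a small runnable check of that idea, again using Python's sqlite3 as a stand-in for PostgreSQL. The three tables and their rows are made up, and they include the awkward case from the question: id 1 is present in a and c but not in b.

```python
import sqlite3

# Hypothetical data: id 1 is in a and c only, id 2 in a and b, id 3 in b and c.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table a (id int, value numeric);
create table b (id int, value numeric);
create table c (id int, value numeric);
insert into a values (1, 10), (2, 20);
insert into b values (2, 5), (3, 7);
insert into c values (1, 1), (3, 3);
""")

# Stack all tables, then aggregate once; no joins and no COALESCE needed.
query = """
select x.id, sum(x.value) as value
from (
    select * from a
    union all
    select * from b
    union all
    select * from c
) as x
group by x.id
order by x.id
"""
sums = conn.execute(query).fetchall()
print(sums)
```

Adding an N-th table only means adding one more UNION ALL branch.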

SSRS 2005 column chart: show series label missing when data count is zero

I have a pretty simple chart with a likely common issue. I've searched for several hours on the interweb but only get so far in finding a similar situation.
The data I'm pulling basically contains a created_by, a person_id and a risk score.
the risk score can be:
1 VERY LOW
2 LOW
3 MODERATE STABLE
4 MODERATE AT RISK
5 HIGH
6 VERY HIGH
I want to get a headcount of persons at each risk score and display a risk count even if there is a count of 0 for that risk score but SSRS 2005 likes to suppress zero counts.
I've tried this in the point labels:
=IIF(IsNothing(count(Fields!person_id.value)),0,count(Fields!person_id.value))
Ex: I'm missing values for "1 LOW" as the creator does not have any "1 LOW" they've assigned risk scores for.
Here's a screenshot of what I get, but I'd like to have a column even for a score that doesn't exist in the returned results.
@Nathan
Example scenario:
select professor.name, grades.score, student.person_id
from student
inner join grades on student.person_id = grades.person_id
inner join professor on student.professor_id = professor.professor_id
where
student.professor_id = @professor
Not all students are necessarily in the grades table.
I have a =Count(Fields!person_id.Value) for my data points & series is grouped on =Fields!score.Value
If there were a bunch of A, B and D grades but no C's and F's, how would I show labels for potentially non-existent counts?
In your example, the problem is that no results are returned for grades that are not linked to any students. To solve this ideally there would be a table in your source system which listed all the possible values of "score" (e.g. A - F) and you would join this into your query such that at least one row was returned for each possible value.
If such a table doesn't exist and the possible score values are known and static, then you could manually create a list of them in your query. In the example below I create a subquery that returns a combination of all professors and all possible scores (A - F) and then LEFT join this to the grades and students tables (left join means that the professor/score rows will be returned even if no students have those scores in the "grades" table).
SELECT
professor.name
, professorgrades.score
, student.person_id
FROM
(
SELECT professor_id, score
FROM professor
CROSS JOIN
(
SELECT 'A' AS score
UNION
SELECT 'B'
UNION
SELECT 'C'
UNION
SELECT 'D'
UNION
SELECT 'E'
UNION
SELECT 'F'
) availablegrades
) professorgrades
INNER JOIN professor ON professorgrades.professor_id = professor.professor_id
LEFT JOIN grades ON professorgrades.score = grades.score
LEFT JOIN student ON grades.person_id = student.person_id AND
professorgrades.professor_id = student.professor_id
WHERE professorgrades.professor_id = 1
See a live example of how this works here: SQLFIDDLE
SELECT RS.RiskScoreId, RS.Description, SUM(DT.RiskCount) AS RiskCount
FROM (
SELECT RiskScoreId, 1 AS RiskCount
FROM People
UNION ALL
SELECT RiskScoreId, 0 AS RiskCount
FROM RiskScores
) DT
INNER JOIN RiskScores RS ON RS.RiskScoreId = DT.RiskScoreId
GROUP BY RS.RiskScoreId, RS.Description
ORDER BY RS.RiskScoreId
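The query above works because every risk score contributes one row with a count of 0, so a score nobody has still shows up in the result. A small sketch to verify that, using Python's sqlite3 instead of SQL Server; the PersonId column and all sample rows are assumptions for illustration.

```python
import sqlite3

# Hypothetical lookup table and people; nobody has risk score 2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table RiskScores (RiskScoreId int, Description text);
create table People (PersonId int, RiskScoreId int);
insert into RiskScores values (1, 'VERY LOW'), (2, 'LOW'), (3, 'MODERATE STABLE');
insert into People values (100, 1), (101, 1), (102, 3);
""")

# Each person counts 1, each known score counts 0, then sum per score.
query = """
SELECT RS.RiskScoreId, RS.Description, SUM(DT.RiskCount) AS RiskCount
FROM (
    SELECT RiskScoreId, 1 AS RiskCount
    FROM People
    UNION ALL
    SELECT RiskScoreId, 0 AS RiskCount
    FROM RiskScores
) DT
INNER JOIN RiskScores RS ON RS.RiskScoreId = DT.RiskScoreId
GROUP BY RS.RiskScoreId, RS.Description
ORDER BY RS.RiskScoreId
"""
risk_counts = conn.execute(query).fetchall()
for row in risk_counts:
    print(row)
```

Because the dataset now returns a row for score 2 with a count of 0, the chart has a data point to draw and label.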