How to get back aggregate values across 2 dimensions using Python Cubes? - postgresql
Situation
Using Python 3, Django 1.9, Cubes 1.1, and Postgres 9.5.
These are my data tables, in text form:
Store table
------------------------------
| id | code | address |
|-----|------|---------------|
| 1 | S1 | Kings Row |
| 2 | S2 | Queens Street |
| 3 | S3 | Jacks Place |
| 4 | S4 | Diamonds Alley|
| 5 | S5 | Hearts Road |
------------------------------
Product table
------------------------------
| id | code | name |
|-----|------|---------------|
| 1 | P1 | Saucer 12 |
| 2 | P2 | Plate 15 |
| 3 | P3 | Saucer 13 |
| 4 | P4 | Saucer 14 |
| 5 | P5 | Plate 16 |
| and many more .... |
|1000 |P1000 | Bowl 25 |
|----------------------------|
Sales table
----------------------------------------
| id | product_id | store_id | amount |
|-----|------------|----------|--------|
| 1 | 1 | 1 |7.05 |
| 2 | 1 | 2 |9.00 |
| 3 | 2 | 3 |1.00 |
| 4 | 2 | 3 |1.00 |
| 5 | 2 | 5 |1.00 |
| and many more .... |
| 1000| 20 | 4 |1.00 |
|--------------------------------------|
The relationships are:
Sales belongs to Store
Sales belongs to Product
Store has many Sales
Product has many Sales
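For reference, here is a minimal DDL sketch of that schema (table and column names are taken from the tables above; the exact types and constraints are my assumptions):

CREATE TABLE store (
    id      serial PRIMARY KEY,
    code    text NOT NULL,
    address text
);

CREATE TABLE product (
    id   serial PRIMARY KEY,
    code text NOT NULL,
    name text
);

CREATE TABLE sales (
    id         serial PRIMARY KEY,
    product_id int NOT NULL REFERENCES product(id),  -- Sales belongs to Product
    store_id   int NOT NULL REFERENCES store(id),    -- Sales belongs to Store
    amount     numeric NOT NULL
);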
What I want to achieve
I want to use Cubes to display a paginated cross-tabulation in the following manner:
Given the stores S1-S3:
-------------------------
| product | S1 | S2 | S3 |
|---------|----|----|----|
|Saucer 12|7.05|9 | 0 |
|Plate 15 |0 |0 | 2 |
| and many more .... |
|------------------------|
Note the following:
Even though there are no sales records for Saucer 12 under store S3, I display 0 instead of null or none.
I want to be able to sort by store, say in descending order of S3.
The cells indicate the SUM of the amounts for that particular product in that particular store.
I also want to have pagination.
What I tried
This is the configuration I used:
"cubes": [
{
"name": "sales",
"dimensions": ["product", "store"],
"joins": [
{"master":"product_id", "detail":"product.id"},
{"master":"store_id", "detail":"store.id"}
]
}
],
"dimensions": [
{ "name": "product", "attributes": ["code", "name"] },
{ "name": "store", "attributes": ["code", "address"] }
]
This is the code I used:
result = browser.aggregate(
    drilldown=['Store', 'Product'],
    order=[("Product.name", "asc"), ("Store.name", "desc"), ("total_products_sale", "desc")]
)
I didn't get what I wanted. Instead, I got this:
----------------------------------------------
| product_id | store_id | total_products_sale |
|------------|----------|---------------------|
| 1 | 1 | 7.05 |
| 1 | 2 | 9 |
| 2 | 3 | 2.00 |
| and many more .... |
|---------------------------------------------|
This is the whole table with no pagination, and a product not sold in a store does not show up as a zero.
My question
How do I get what I want?
Do I need to create another data table that aggregates everything by store and product before I use cubes to run the query?
Update
I have read more. I realised that what I want is called dicing, as I need to go across 2 dimensions. See: https://en.wikipedia.org/wiki/OLAP_cube#Operations
Cross-posted at Cubes GitHub issues to get more attention.
This is a pure SQL solution using crosstab() from the additional tablefunc module to pivot the aggregated data. It typically performs better than any client-side alternative. If you are not familiar with crosstab(), read this first:
PostgreSQL Crosstab Query
And this about the "extra" column in the crosstab() output:
Pivot on Multiple Columns using Tablefunc
SELECT product_id, product
     , COALESCE(s1, 0) AS s1  -- 1. ... displayed 0 instead of null
     , COALESCE(s2, 0) AS s2
     , COALESCE(s3, 0) AS s3
     , COALESCE(s4, 0) AS s4
     , COALESCE(s5, 0) AS s5
FROM   crosstab(
   'SELECT s.product_id, p.name, s.store_id, s.sum_amount
    FROM   product p
    JOIN  (
       SELECT product_id, store_id
            , sum(amount) AS sum_amount  -- 3. SUM total of product spent in store
       FROM   sales
       GROUP  BY product_id, store_id
       ) s ON p.id = s.product_id
    ORDER  BY s.product_id, s.store_id;'
 , 'VALUES (1),(2),(3),(4),(5)'  -- desired store_id's
   ) AS ct (product_id int, product text  -- "extra" column
          , s1 numeric, s2 numeric, s3 numeric, s4 numeric, s5 numeric)
ORDER  BY s3 DESC;  -- 2. ... descending order for S3
Produces your desired result exactly (plus product_id).
To include products that have never been sold replace [INNER] JOIN with LEFT [OUTER] JOIN.
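For illustration, the source query inside crosstab() would then look like this (note that product_id must now come from product, because the sales side can be NULL):

'SELECT p.id AS product_id, p.name, s.store_id, s.sum_amount
 FROM   product p
 LEFT   JOIN (
    SELECT product_id, store_id
         , sum(amount) AS sum_amount
    FROM   sales
    GROUP  BY product_id, store_id
    ) s ON p.id = s.product_id
 ORDER  BY p.id, s.store_id'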
SQL Fiddle with base query.
The tablefunc module is not installed on sqlfiddle.
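In your own database you install it once per database (PostgreSQL 9.1+):

CREATE EXTENSION IF NOT EXISTS tablefunc;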
Major points
Read the basic explanation in the reference answer for crosstab().
I am including product_id because product.name is hardly unique. This might otherwise lead to sneaky errors conflating two different products.
You don't need the store table in the query if referential integrity is guaranteed.
ORDER BY s3 DESC works because s3 references the output column, where NULL values have been replaced with COALESCE. Otherwise we would need DESC NULLS LAST to sort NULL values last:
PostgreSQL sort by datetime asc, null first?
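For illustration, without the COALESCE the last line of the query above would have to read:

ORDER BY s3 DESC NULLS LAST;  -- only needed when NULLs can surface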
For building crosstab() queries dynamically consider:
Dynamic alternative to pivot with CASE and GROUP BY
I also want to have pagination.
That last item is fuzzy. Simple pagination can be had with LIMIT and OFFSET:
Displaying data in grid view page by page
I would consider a MATERIALIZED VIEW to materialize results before pagination. If you have a stable page size, I would add page numbers to the MV for easy and fast results.
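A minimal sketch of that idea, assuming a materialized view named sales_pivot (an assumed name, not from the question) built from the crosstab query above:

CREATE MATERIALIZED VIEW sales_pivot AS
SELECT product_id, product
     , COALESCE(s1, 0) AS s1, COALESCE(s2, 0) AS s2, COALESCE(s3, 0) AS s3
     , COALESCE(s4, 0) AS s4, COALESCE(s5, 0) AS s5
FROM   crosstab(
   'SELECT s.product_id, p.name, s.store_id, s.sum_amount
    FROM   product p
    JOIN  (
       SELECT product_id, store_id, sum(amount) AS sum_amount
       FROM   sales
       GROUP  BY product_id, store_id
       ) s ON p.id = s.product_id
    ORDER  BY s.product_id, s.store_id'
 , 'VALUES (1),(2),(3),(4),(5)'
   ) AS ct (product_id int, product text
          , s1 numeric, s2 numeric, s3 numeric, s4 numeric, s5 numeric);

-- page 3 with a page size of 20 rows:
SELECT *
FROM   sales_pivot
ORDER  BY s3 DESC, product_id DESC  -- tiebreaker makes pages deterministic
LIMIT  20
OFFSET 40;

Refresh it when the underlying data changes: REFRESH MATERIALIZED VIEW sales_pivot;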
To optimize performance for big result sets, consider:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Optimize query with OFFSET on large table
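Those links describe keyset pagination with a row-value comparison. A rough sketch against the assumed sales_pivot view from above, where (118.00, 257) stands in for the s3 and product_id values of the last row on the previous page:

SELECT *
FROM   sales_pivot
WHERE  (s3, product_id) < (118.00, 257)  -- values from the last row already displayed
ORDER  BY s3 DESC, product_id DESC       -- both DESC, so the row comparison is valid
LIMIT  20;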
Related
Efficient way to retrieve all values from a column that start with other values from the same column in PostgreSQL
For the sake of simplicity, suppose you have a table with numbers like:

| number |
----------
| 123    |
| 1234   |
| 12345  |
| 123456 |
| 111    |
| 1111   |
| 2      |
| 700    |

What would be an efficient way of retrieving the shortest numbers (call them roots or whatever) and all values derived from them, e.g.:

| root | derivatives         |
-------------------------------
| 123  | 1234, 12345, 123456 |
| 111  | 1111                |

Numbers 2 & 700 are excluded from the list because they're unique, and thus have no derivatives. An output as the above would be ideal, but since it's probably difficult to achieve, the next best thing would be something like below, which I can then post-process:

| root | derivative |
----------------------
| 123  | 1234       |
| 123  | 12345      |
| 123  | 123456     |
| 111  | 1111       |

My naive initial attempt to at least identify roots (see below) has been running for 4 hours now with a dataset of ~500k items, but the real one I'd have to inspect consists of millions.

select number
from   numbers n1
where  exists (
   select number
   from   numbers n2
   where  n2.number <> n1.number
   and    n2.number like n1.number || '_%'
);
This works if number is an integer or bigint:

select min(a.number) as root, b.number as derivative
from   nums a
cross  join lateral generate_series(1, 18) as gs(power)
join   nums b on b.number / (10 ^ gs.power)::bigint = a.number
group  by b.number
order  by root, derivative;

EDIT: I moved a non-working query to the bottom. It fails for reasons outlined by @Morfic in the comments.

We can do a similar and simpler join using like for character types:

select min(a.number) as root, b.number as derivative
from   numchar a
join   numchar b on b.number like a.number || '%'
                and b.number != a.number
group  by b.number
order  by root, derivative;

Updated fiddle.

Faulty solution follows

If number is a character type, then try this:

with groupings as (
   select number
        , case when number like (lag(number) over (order by number)) || '%'
               then 0 else 1 end as newgroup
   from   numchar
), groupnums as (
   select number, sum(newgroup) over (order by number) as groupnum
   from   groupings
), matches as (
   select min(number) over (partition by groupnum) as root
        , number as derivative
   from   groupnums
)
select *
from   matches
where  root != derivative;

There should be only a single sort on groupnum in this execution since the column is your table's primary key.

db<>fiddle here
PostgreSQL - Setting null values to missing rows in a join statement
SQL newbie here. I'm trying to write a query that generates a scoring table, setting null for a student's grades in a module for which they haven't yet taken their exams (on PostgreSQL). So I start with tables that look something like this:

student_evaluation:
| student_id | module_id | course_id | grade |
|------------|-----------|-----------|-------|
| 1          | 1         | 1         | 3     |
| 1          | 1         | 1         | 7     |
| 1          | 2         | 1         | 8     |
| 2          | 4         | 2         | 9     |

course_module:
| module_id | course_id |
|-----------|-----------|
| 1         | 1         |
| 2         | 1         |
| 3         | 1         |
| 4         | 2         |

In our use case, a course is made up of several modules. Each module has a single exam, but a student who failed his exam may have a couple of retries. The same module may also be present in different courses, but an exam attempt only counts for one instance of the module (i.e. student A passed module 1's exam on course 1; if course 2 also has module 1, student A has to retake the same exam for course 2 if he also has access to that course). So the output should look like this:

| student_id | module_id | course_id | grade |
|------------|-----------|-----------|-------|
| 1          | 1         | 1         | 3     |
| 1          | 1         | 1         | 7     |
| 1          | 2         | 1         | 8     |
| 1          | 3         | 1         | null  |
| 2          | 4         | 2         | 9     |

I feel like this should have been a simple task, but I think I have a very flawed understanding of how outer and cross joins work. I have tried stuff like:

SELECT se.student_id, se.module_id, se.course_id, se.grade
FROM   student_evaluation se
RIGHT  OUTER JOIN course_module
       ON  course_module.course_id = se.course_id
       AND course_module.module_id = se.module_id

or

SELECT se.student_id, se.module_id, se.course_id, se.grade
FROM   student_evaluation se
CROSS  JOIN course_module
WHERE  course_module.course_id = se.course_id

Neither worked. These all feel wrong, but I'm lost as to what would be the proper way to go about this. Thank you in advance.
I think you need both join types: first use a cross join to build a list of all combinations of students and modules, then use an outer join to add the grades.

SELECT sc.student_id, sc.module_id, sc.course_id, se.grade
FROM   student_evaluation se
RIGHT  JOIN (
          SELECT s.student_id, c.module_id, c.course_id
          FROM  (SELECT DISTINCT student_id FROM student_evaluation) AS s
          CROSS JOIN course_module AS c
       ) AS sc
       USING (student_id, module_id, course_id);
PostgreSQL One ID multiple values
I have a Postgres table where one ID may have multiple Channel values as follows:

| ID | Channel | Column 3 | Column 4 |
|----|---------|----------|----------|
| 1  | Sports  | x        | null     |
| 1  | Organic | x        | z        |
| 2  | Organic | null     | q        |
| 3  | Arts    | b        | w        |
| 3  | Organic | e        | r        |
| 4  | Sports  | sp       | t        |

No ID will have a duplicate channel name, and no ID will be both Sports and Arts. That is, ID 1 could have a Sports and Organic channel, or an Arts and Organic channel, but not two Sports or two Organic entries, and not both a Sports and an Arts channel. I want all IDs to be in the query, but if there is a non-Organic channel I prefer that. The result I would want would be:

| ID | Channel | Column 3 | Column 4 |
|----|---------|----------|----------|
| 1  | Sports  | x        | null     |
| 2  | Organic | null     | q        |
| 3  | Arts    | b        | w        |
| 4  | Sports  | sp       | t        |

I feel like there is some CTE here, a rank and partition or something, that could do the trick, but I'm just not getting it. I'm only including Columns 3 and 4 to show there are extra columns. Does anyone have any ideas on the code to deploy here?
You could use DISTINCT ON with an appropriate ORDER BY clause:

SELECT DISTINCT ON (id)
       id, channel, column3, column4
FROM   atable
ORDER  BY id, channel = 'Organic';

This relies on the fact that FALSE < TRUE.
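You can check that ordering directly:

SELECT FALSE < TRUE;  -- true, so non-'Organic' rows sort first within each id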
I ended up using a window function instead:

ROW_NUMBER() OVER (PARTITION BY salesforce_id
                   ORDER BY CASE WHEN channel = 'Organic' THEN 0 ELSE 1 END DESC
                          , timestamp DESC) AS id_rank

I didn't include in the original question that I had a timestamp! This works now. Thanks.
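For completeness, a sketch of the full query this slots into (the table name atable follows the answer above; the quoting of "timestamp" is my assumption):

SELECT *
FROM  (
   SELECT *
        , ROW_NUMBER() OVER (PARTITION BY salesforce_id
                             ORDER BY CASE WHEN channel = 'Organic' THEN 0 ELSE 1 END DESC
                                    , "timestamp" DESC) AS id_rank
   FROM   atable
) ranked
WHERE  id_rank = 1;  -- keeps the preferred (non-Organic, most recent) row per id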
SELECT DISTINCT on an ordered subquery's table
I'm working on a problem involving these two tables.

books
   isbn    |                  title                  |      author
-----------+-----------------------------------------+------------------
1840918626 | Hogwarts: A History                     | Bathilda Bagshot
3458400871 | Fantastic Beasts and Where to Find Them | Newt Scamander
9136884926 | Advanced Potion-Making                  | Libatius Borage

transactions
 id | patron_id |    isbn    | checked_out_date | checked_in_date
----+-----------+------------+------------------+-----------------
  1 |         1 | 1840918626 | 2012-05-05       | 2012-05-06
  2 |         4 | 9136884926 | 2012-05-05       | 2012-05-06
  3 |         2 | 3458400871 | 2012-05-05       | 2012-05-06
  4 |         3 | 3458400871 | 2018-04-29       | 2018-05-02
  5 |         2 | 9136884926 | 2018-05-03       | NULL
  6 |         1 | 3458400871 | 2018-05-03       | 2018-05-05
  7 |         5 | 3458400871 | 2018-05-05       | NULL

The query: "Make a list of all book titles and denote whether or not a copy of that book is checked out." So pretty much just the first table with a checked_out column.

I'm trying to SELECT DISTINCT on a subquery with the checked-out books first, but that doesn't work. I've researched, and others say to accomplish this with a GROUP BY clause instead of DISTINCT, but the examples they provide are one-column queries, and when more columns are added it doesn't work. This is my closest attempt:

SELECT DISTINCT ON (title) title, checked_out
FROM  (
   SELECT b.title, t.checked_in_date IS NULL AS checked_out
   FROM   transactions t
   NATURAL JOIN books b
   ORDER  BY checked_out DESC
) t;
Or you can join only transactions where books are not checked in:

SELECT b.title
     , t.isbn IS NOT NULL AS checked_out
     , t.checked_out_date
FROM   books b
LEFT   JOIN transactions t ON t.isbn = b.isbn
                          AND t.checked_in_date IS NULL
ORDER  BY checked_out DESC;
I adjusted your attempt a little bit. Basically I changed the way your data is joined:

SELECT DISTINCT ON (title) title, checked_out
FROM  (
   SELECT b.title, t.checked_in_date IS NULL AS checked_out
   FROM   books b
   LEFT   OUTER JOIN transactions t USING (isbn)
   ORDER  BY checked_out DESC
) t;
Equivalent to unpivot() in PostgreSQL
Is there an unpivot equivalent function in PostgreSQL?
Create an example table:

CREATE TEMP TABLE foo (id int, a text, b text, c text);
INSERT INTO foo VALUES
  (1, 'ant', 'cat', 'chimp'),
  (2, 'grape', 'mint', 'basil');

You can 'unpivot' or 'uncrosstab' using UNION ALL:

SELECT id, 'a' AS colname, a AS thing FROM foo
UNION ALL
SELECT id, 'b' AS colname, b AS thing FROM foo
UNION ALL
SELECT id, 'c' AS colname, c AS thing FROM foo
ORDER BY id;

This runs 3 different subqueries on foo, one for each column we want to unpivot, and returns, in one table, every record from each of the subqueries. But that will scan the table N times, where N is the number of columns you want to unpivot. This is inefficient, and a big problem when, for example, you're working with a very large table that takes a long time to scan.

Instead, use:

SELECT id
     , unnest(array['a', 'b', 'c']) AS colname
     , unnest(array[a, b, c]) AS thing
FROM   foo
ORDER  BY id;

This is easier to write, and it will only scan the table once. array[a, b, c] returns an array object with the values of a, b, and c as its elements. unnest(array[a, b, c]) breaks the result into one row for each of the array's elements.
You could use VALUES() and JOIN LATERAL to unpivot the columns.

Sample data:

CREATE TABLE test(id int, a INT, b INT, c INT);
INSERT INTO test(id, a, b, c) VALUES (1,11,12,13), (2,21,22,23), (3,31,32,33);

Query:

SELECT t.id, s.col_name, s.col_value
FROM   test t
JOIN   LATERAL (VALUES ('a', t.a), ('b', t.b), ('c', t.c)) s(col_name, col_value)
       ON TRUE;

DBFiddle Demo

Using this approach it is possible to unpivot multiple groups of columns at once (see the sketch at the end of this answer).

EDIT

Using Zack's suggestion:

SELECT t.id, col_name, col_value
FROM   test t
CROSS  JOIN LATERAL (VALUES ('a', t.a), ('b', t.b), ('c', t.c)) s(col_name, col_value);

which is equivalent to:

SELECT t.id, col_name, col_value
FROM   test t
     , LATERAL (VALUES ('a', t.a), ('b', t.b), ('c', t.c)) s(col_name, col_value);

db<>fiddle demo
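For instance, a hypothetical table with two parallel groups of columns (all names here are invented for the example) can be unpivoted in a single pass:

CREATE TABLE quarterly(id int, q1_sales int, q1_returns int, q2_sales int, q2_returns int);
INSERT INTO quarterly VALUES (1, 100, 5, 120, 7);

SELECT t.id, s.quarter, s.sales, s.returns
FROM   quarterly t
CROSS  JOIN LATERAL (
   VALUES ('Q1', t.q1_sales, t.q1_returns)  -- one row per column group
        , ('Q2', t.q2_sales, t.q2_returns)
   ) s(quarter, sales, returns);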
Great article by Thomas Kellerer, found here:

Unpivot with Postgres

Sometimes it's necessary to normalize de-normalized tables - the opposite of a "crosstab" or "pivot" operation. Postgres does not support an UNPIVOT operator like Oracle or SQL Server, but simulating it is very simple.

Take the following table that stores aggregated values per quarter:

create table customer_turnover
(
  customer_id integer,
  q1          integer,
  q2          integer,
  q3          integer,
  q4          integer
);

And the following sample data:

customer_id | q1  | q2  | q3  | q4
------------+-----+-----+-----+----
          1 | 100 | 210 | 203 | 304
          2 | 150 | 118 | 422 | 257
          3 | 220 | 311 | 271 | 269

But we want the quarters to be rows (as they should be in a normalized data model). In Oracle or SQL Server this could be achieved with the UNPIVOT operator, but that is not available in Postgres. However, Postgres' ability to use the VALUES clause like a table makes this actually quite easy:

select c.customer_id, t.*
from   customer_turnover c
cross  join lateral (
   values
     (c.q1, 'Q1'),
     (c.q2, 'Q2'),
     (c.q3, 'Q3'),
     (c.q4, 'Q4')
) as t(turnover, quarter)
order  by customer_id, quarter;

will return the following result:

customer_id | turnover | quarter
------------+----------+--------
          1 |      100 | Q1
          1 |      210 | Q2
          1 |      203 | Q3
          1 |      304 | Q4
          2 |      150 | Q1
          2 |      118 | Q2
          2 |      422 | Q3
          2 |      257 | Q4
          3 |      220 | Q1
          3 |      311 | Q2
          3 |      271 | Q3
          3 |      269 | Q4

The equivalent query with the standard UNPIVOT operator would be:

select customer_id, turnover, quarter
from   customer_turnover c
UNPIVOT (turnover for quarter in (q1 as 'Q1', q2 as 'Q2', q3 as 'Q3', q4 as 'Q4'))
order  by customer_id, quarter;
FYI for those of us looking for how to unpivot in Redshift: the long-form solution given by Stew appears to be the only way to accomplish this. For those who cannot see it there, here is the text pasted below:

We do not have built-in functions that will do pivot or unpivot. However, you can always write SQL to do that.

create table sales (regionid integer, q1 integer, q2 integer, q3 integer, q4 integer);
insert into sales values (1,10,12,14,16), (2,20,22,24,26);

select * from sales order by regionid;

 regionid | q1 | q2 | q3 | q4
----------+----+----+----+----
        1 | 10 | 12 | 14 | 16
        2 | 20 | 22 | 24 | 26
(2 rows)

pivot query

create table sales_pivoted (regionid, quarter, sales)
as
select regionid, 'Q1', q1 from sales
UNION ALL
select regionid, 'Q2', q2 from sales
UNION ALL
select regionid, 'Q3', q3 from sales
UNION ALL
select regionid, 'Q4', q4 from sales
;

select * from sales_pivoted order by regionid, quarter;

 regionid | quarter | sales
----------+---------+-------
        1 | Q1      |    10
        1 | Q2      |    12
        1 | Q3      |    14
        1 | Q4      |    16
        2 | Q1      |    20
        2 | Q2      |    22
        2 | Q3      |    24
        2 | Q4      |    26
(8 rows)

unpivot query

select regionid, sum(Q1) as Q1, sum(Q2) as Q2, sum(Q3) as Q3, sum(Q4) as Q4
from
  (select regionid,
     case quarter when 'Q1' then sales else 0 end as Q1,
     case quarter when 'Q2' then sales else 0 end as Q2,
     case quarter when 'Q3' then sales else 0 end as Q3,
     case quarter when 'Q4' then sales else 0 end as Q4
   from sales_pivoted)
group by regionid
order by regionid;

 regionid | q1 | q2 | q3 | q4
----------+----+----+----+----
        1 | 10 | 12 | 14 | 16
        2 | 20 | 22 | 24 | 26
(2 rows)

Hope this helps, Neil
Pulling slightly modified content from the link in the comment from @a_horse_with_no_name into an answer, because it works:

Installing hstore

If you don't have hstore installed and are running PostgreSQL 9.1+, you can use the handy:

CREATE EXTENSION hstore;

For lower versions, look for the hstore.sql file in share/contrib and run it in your database.

Assuming that your source (e.g., wide data) table has one 'id' column, named id_field, and any number of 'value' columns, all of the same type, the following will create an unpivoted view of that table:

CREATE VIEW vw_unpivot AS
SELECT id_field
     , (h).key   AS column_name
     , (h).value AS column_value
FROM (
   SELECT id_field, each(hstore(foo) - 'id_field'::text) AS h
   FROM   zcta5 AS foo
) AS unpiv;

This works with any number of 'value' columns. All of the resulting values will be text, unless you cast, e.g., (h).value::numeric.
Just use JSON:

with data (id, name) as (
   values (1, 'a'), (2, 'b')
)
select t.*
from   data, lateral jsonb_each_text(to_jsonb(data)) with ordinality as t
order  by data.id, t.ordinality;

This yields:

| key  | value | ordinality |
|------|-------|------------|
| id   | 1     | 1          |
| name | a     | 2          |
| id   | 2     | 1          |
| name | b     | 2          |

dbfiddle
I wrote a horrible unpivot function for PostgreSQL. It's rather slow but it at least returns results like you'd expect an unpivot operation to. https://cgsrv1.arrc.csiro.au/blog/2010/05/14/unpivotuncrosstab-in-postgresql/ Hopefully you can find it useful..
Depending on what you want to do, something like this can be helpful:

with wide_table as (
   select 1 a, 2 b, 3 c
   union all
   select 4 a, 5 b, 6 c
)
select unnest(array[a, b, c])
from   wide_table;
You can use FROM UNNEST() array handling to unpivot a dataset, in tandem with a correlated subquery (works with PG 9.4).

FROM UNNEST() is more powerful and flexible than the typical method of using FROM (VALUES ....) to unpivot datasets. This is because FROM UNNEST() is variadic (with n-ary arity). By using a correlated subquery the need for the lateral ORDINAL clause is eliminated, and Postgres keeps the resulting parallel columnar sets in the proper ordinal sequence.

This is, BTW, FAST -- in practical use spawning 8 million rows in < 15 seconds on a 24-core system.

WITH _students AS ( /** CTE **/
   SELECT * FROM
   ( SELECT 'jane'::TEXT  , 'doe'::TEXT , 1::INT
     UNION
     SELECT 'john'::TEXT  , 'doe'::TEXT , 2::INT
     UNION
     SELECT 'jerry'::TEXT , 'roe'::TEXT , 3::INT
     UNION
     SELECT 'jodi'::TEXT  , 'roe'::TEXT , 4::INT
   ) s ( fn, ln, id )
) /** end WITH **/
SELECT s.id
     , ax.fanm      -- field labels, now expanded to two rows
     , ax.anm       -- field data, now expanded to two rows
     , ax.someval   -- manually incl. data
     , ax.rankednum -- manually assigned ranks
     , ax.genser    -- auto-generated ranks
FROM _students s
    , UNNEST /** MULTI-UNNEST() BLOCK **/
(
   ( SELECT ARRAY[ fn, ln ]::text[] AS anm -- expanded into two rows by outer UNNEST()
     /** CORRELATED SUBQUERY **/
     FROM _students s2 WHERE s2.id = s.id -- outer relation
   )
 , ( /** ordinal relationship preserved in variadic UNNEST() **/
     SELECT ARRAY[ 'first name', 'last name' ]::text[] -- expanded into two rows
        AS fanm
   )
 , ( SELECT ARRAY[ 'z', 'x', 'y' ] -- only 3 rows generated, but ordinal relationship kept
        AS someval
   )
 , ( SELECT ARRAY[ 1, 2, 3, 4, 5 ] -- 5 rows generated, ordinal relationship kept
        AS rankednum
   )
 , ( SELECT ARRAY( /** you may go wild ... **/
        SELECT generate_series(1, 15, 3)
           AS genser
     )
   )
) ax ( anm, fanm, someval, rankednum, genser );

RESULT SET:

+--------+------------+--------+---------+-----------+-------
| id     | fanm       | anm    | someval | rankednum | [etc.]
+--------+------------+--------+---------+-----------+-------
| 2      | first name | john   | z       | 1         | .
| 2      | last name  | doe    | y       | 2         | .
| 2      | [null]     | [null] | x       | 3         | .
| 2      | [null]     | [null] | [null]  | 4         | .
| 2      | [null]     | [null] | [null]  | 5         | .
| 1      | first name | jane   | z       | 1         | .
| 1      | last name  | doe    | y       | 2         | .
| 1      |            |        | x       | 3         | .
| 1      |            |        |         | 4         | .
| 1      |            |        |         | 5         | .
| 4      | first name | jodi   | z       | 1         | .
| 4      | last name  | roe    | y       | 2         | .
| 4      |            |        | x       | 3         | .
| 4      |            |        |         | 4         | .
| 4      |            |        |         | 5         | .
| 3      | first name | jerry  | z       | 1         | .
| 3      | last name  | roe    | y       | 2         | .
| 3      |            |        | x       | 3         | .
| 3      |            |        |         | 4         | .
| 3      |            |        |         | 5         | .
+--------+------------+--------+---------+-----------+-------
Here's a way that combines the hstore and CROSS JOIN approaches from other answers. It's a modified version of my answer to a similar question, which is itself based on the method at https://blog.sql-workbench.eu/post/dynamic-unpivot/ and another answer to that question.

-- Example wide data with a column for each year...
WITH example_wide_data("id", "2001", "2002", "2003", "2004") AS (
   VALUES
     (1, 4, 5, 6, 7),
     (2, 8, 9, 10, 11)
)

-- that is tidied to have "year" and "value" columns
SELECT
   id,
   r.key AS year,
   r.value AS value
FROM example_wide_data w
CROSS JOIN each(hstore(w.*)) AS r(key, value)
WHERE
   -- This chooses columns that look like years.
   -- In other cases you might need a different condition.
   r.key ~ '^[0-9]{4}$';

It has a few benefits over other solutions:

By using hstore and not jsonb, it hopefully minimises issues with type conversions (although hstore does convert everything to text).

The columns don't need to be hard-coded or known in advance. Here, columns are chosen by a regex on the name, but you could use any SQL logic based on the name, or even the value.

It doesn't require PL/pgSQL - it's all SQL.