joining with a DISTINCT ON on an ordered subquery in sqlalchemy - postgresql

Here is (an extremely simplified version of) my problem.
I'm using Postgresql as the backend and trying to build a sqlalchemy query
from another query.
Table setup
Here are the tables with some random data for the example.
You can assume that each table was declared in sqlalchemy declaratively, with
the name of the mappers being respectively Item and ItemVersion.
At the end of the question you can find a link where I put the code for
everything in this question, including the table definitions.
Some items.
item
+----+
| id |
+----+
|  1 |
|  2 |
|  3 |
+----+
A table containing versions of each item. Each has at least one.
item_version
+----+---------+---------+-----------+
| id | item_id | version | text      |
+----+---------+---------+-----------+
|  1 |       1 |       0 | item_1_v0 |
|  2 |       1 |       1 | item_1_v1 |
|  3 |       2 |       0 | item_2_v0 |
|  4 |       3 |       0 | item_3_v0 |
+----+---------+---------+-----------+
The query
Now, for a given sqlalchemy query over Item, I want a function that returns
another query, but this time over (Item, ItemVersion), where the Items are
the same as in the original query (and in the same order!), and where the
ItemVersion are the corresponding latest versions for each Item.
Here is an example in SQL, which is pretty straightforward:
First a random query over the item table
SELECT item.id as item_id
FROM item
WHERE item.id != 2
ORDER BY item.id DESC
which corresponds to
+---------+
| item_id |
+---------+
|       3 |
|       1 |
+---------+
Then from that query, if I want to join the right versions, I can do
SELECT sq2.item_id AS item_id,
       sq2.item_version_id AS item_version_id,
       sq2.item_version_text AS item_version_text
FROM (
    SELECT DISTINCT ON (sq.item_id)
           sq.item_id AS item_id,
           iv.id AS item_version_id,
           iv.text AS item_version_text
    FROM (
        SELECT item.id AS item_id
        FROM item
        WHERE id != 2
        ORDER BY id DESC) AS sq
    JOIN item_version AS iv
      ON iv.item_id = sq.item_id
    ORDER BY sq.item_id, iv.version DESC) AS sq2
ORDER BY sq2.item_id DESC
Note that it has to be wrapped in a subquery a second time: DISTINCT ON
requires the ORDER BY to start with sq.item_id, so the original ordering of
the item query is discarded.
Now the challenge is to write a function that does that in sqlalchemy.
Here is what I have so far.
First the initial sqlalchemy query over the items:
session.query(Item).filter(Item.id != 2).order_by(desc(Item.id))
Then I'm able to build my second query but without the original ordering. In
other words I don't know how to do the second subquery wrapping that I did in
SQL to get back the ordering that was discarded by the DISTINCT ON.
def join_version(session, query):
    sq = aliased(Item, query.subquery('sq'))
    sq2 = session.query(sq, ItemVersion) \
        .distinct(sq.id) \
        .join(ItemVersion) \
        .order_by(sq.id, desc(ItemVersion.version))
    return sq2
I think this SO question could be part of the answer but I'm not quite
sure how.
The code to run everything in this question (database creation, population and
a failing unit test with what I have so far) can be found here. Normally
if you can fix the join_version function, it should make the test pass!

Ok so I found a way. It's a bit of a hack but still only queries the database twice so I guess I will survive! Basically I'm querying the database for the Items first, and then I do another query for the ItemVersions, filtering on item_id, and then reordering with a trick I found here (this is also relevant).
Here is the code:
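from sqlalchemy import desc
from sqlalchemy.orm import aliased

def join_version(session, query):
    # First query: the items, in their original order.
    items = query.all()
    item_ids = [i.id for i in items]
    # Second query: latest version per item via DISTINCT ON.
    items_v_sq = session.query(ItemVersion) \
        .distinct(ItemVersion.item_id) \
        .filter(ItemVersion.item_id.in_(item_ids)) \
        .order_by(ItemVersion.item_id, desc(ItemVersion.version)) \
        .subquery('sq')
    sq = aliased(ItemVersion, items_v_sq)
    # Restore the original order; idx() is provided by the intarray extension.
    items_v = session.query(sq) \
        .order_by('idx(array{}, sq.item_id)'.format(item_ids))
    return zip(items, items_v)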
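The final zip(items, items_v) relies on both result sets coming back in exactly the same order. Here is a minimal sketch of a pairing step that does not depend on that, assuming plain tuples stand in for the mapped rows; the helper name pair_latest_versions is hypothetical and not part of the code above:

```python
def pair_latest_versions(item_ids, versions):
    """Pair each item id with its highest-version row, keeping item order.

    item_ids: ids in the order returned by the original query.
    versions: iterable of (item_id, version, text) tuples in any order.
    """
    latest = {}
    for item_id, version, text in versions:
        current = latest.get(item_id)
        # Keep only the row with the highest version number per item.
        if current is None or version > current[1]:
            latest[item_id] = (item_id, version, text)
    # Walk the ids in their original order so that ordering is preserved.
    return [(item_id, latest[item_id]) for item_id in item_ids if item_id in latest]


# Sample data from the question: the item query returns ids 3 and 1.
versions = [
    (1, 0, "item_1_v0"),
    (1, 1, "item_1_v1"),
    (2, 0, "item_2_v0"),
    (3, 0, "item_3_v0"),
]
pairs = pair_latest_versions([3, 1], versions)
```

Looking rows up by item id rather than by position means a missing or reordered version row cannot silently pair an item with the wrong version.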

Related

Hibernate - SQL query: How to get all child descendants starting with specific node

I have the following sample data (items) with some kind of recursion. For the sake of simplicity I limited the sample to 2 levels. As a matter of fact, the hierarchy could grow quite deep.
+----+--------------------------+----------+------------------+--------+
| ID | Item - Name              | ParentID | MaterializedPath | Color  |
+----+--------------------------+----------+------------------+--------+
|  1 | Parent 1                 | null     | 1                | green  |
|  2 | Parent 2                 | null     | 2                | green  |
|  4 | Parent 2 Child 1         | 2        | 2.4              | orange |
|  6 | Parent 2 Child 1 Child 1 | 4        | 2.4.6            | red    |
|  7 | Parent 2 Child 1 Child 2 | 4        | 2.4.7            | orange |
|  3 | Parent 1 Child 1         | 1        | 1.3              | orange |
|  5 | Parent 1 Child 1 Child   | 3        | 1.3.5            | red    |
+----+--------------------------+----------+------------------+--------+
I need to get, via SQL, all children which are not orange for a given starting ID.
For example, when starting with ID=1 the result should be 1, 1.3.5. When starting with ID=4 the result should be 2.4.6.
I read a little bit and found that a CTE should be used. I tried the following simplified definition
WITH w1( id, parent_item_id) AS
( SELECT
i.id,
i.parent_item_id
FROM
item i
WHERE
id = 4
UNION ALL
SELECT
i.id,
i.parent_item_id
FROM
item, JOIN w1 ON i.parent_item_id = w1.id
);
However, this won't even execute as an SQL statement. I have several questions about this:
Can a CTE be used with Hibernate?
Is there a way to get the result via SQL queries (more or less as a recursive pattern)?
I'm somewhat lost with the recursive pattern combined with the selection by color for the end result.
Your query is invalid for the following reasons:
As documented in the manual a recursive CTE requires the RECURSIVE keyword
Your JOIN syntax is wrong. You need to remove the , and give the items table an alias.
If you need the color column, just add it to both SELECTs inside the CTE and filter the rows in the final SELECT.
If that is changed, the following works fine:
WITH recursive w1 (id, parent_item_id, color) AS
(
  SELECT i.id,
         i.parent_item_id,
         i.color
  FROM item i
  WHERE id = 4
  UNION ALL
  SELECT i.id,
         i.parent_item_id,
         i.color
  FROM item i --<< missing alias
    JOIN w1 ON i.parent_item_id = w1.id
)
select *
from w1
where color <> 'orange'
Note that the column list for the CTE definition is optional, so you can just write with recursive w1 as ....
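To try the corrected query against the sample data from the question, here is a small sketch using Python's built-in sqlite3 module. SQLite stands in for the real database here, and the table layout (item with id, parent_item_id, color) is assumed from the query, not taken from an actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item (id INTEGER PRIMARY KEY, parent_item_id INTEGER, color TEXT)"
)
# Sample rows from the question: (id, parent id, color).
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?)",
    [
        (1, None, "green"), (2, None, "green"), (4, 2, "orange"),
        (6, 4, "red"), (7, 4, "orange"), (3, 1, "orange"), (5, 3, "red"),
    ],
)

QUERY = """
WITH RECURSIVE w1 (id, parent_item_id, color) AS (
    SELECT i.id, i.parent_item_id, i.color
    FROM item i
    WHERE i.id = ?
    UNION ALL
    SELECT i.id, i.parent_item_id, i.color
    FROM item i
    JOIN w1 ON i.parent_item_id = w1.id
)
SELECT id FROM w1 WHERE color <> 'orange' ORDER BY id
"""


def non_orange_descendants(start_id):
    """Return ids of the start node and its descendants that are not orange."""
    return [row[0] for row in conn.execute(QUERY, (start_id,))]


from_1 = non_orange_descendants(1)  # ids 1 and 5 (paths 1 and 1.3.5)
from_4 = non_orange_descendants(4)  # id 6 (path 2.4.6)
```

The color filter sits in the final SELECT, not inside the CTE, so orange nodes are still traversed and their non-orange children are found.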

UPDATE from temp table picking the "last" row per group

Suppose there is a table with data:
+----+-------+
| id | value |
+----+-------+
|  1 |     0 |
|  2 |     0 |
+----+-------+
I need to do a bulk update, and I use COPY FROM STDIN for a fast insert into a temp table without constraints, so it can contain duplicate values in the id column.
Temp table to update from:
+----+-------+
| id | value |
+----+-------+
|  1 |     1 |
|  2 |     1 |
|  1 |     2 |
|  2 |     2 |
+----+-------+
If I simply run a query like:
UPDATE test target SET value = source.value FROM tmp_test source WHERE target.id = source.id;
I get wrong results:
+----+-------+
| id | value |
+----+-------+
|  1 |     1 |
|  2 |     1 |
+----+-------+
I need the target table to contain the values that appeared last in the temporary table.
What is the most efficient way to do this, given that the target table may contain millions of records, and the temporary table may contain tens of thousands?
Assuming you want to take the value from the row that was inserted last into the temp table, physically, you can (ab-)use the system column ctid, signifying the physical location:
UPDATE test AS target
SET value = source.value
FROM (
    SELECT DISTINCT ON (id)
           id, value
    FROM tmp_test
    ORDER BY id, ctid DESC
    ) source
WHERE target.id = source.id
AND target.value <> source.value; -- skip empty updates
About DISTINCT ON:
Select first row in each GROUP BY group?
This builds on an implementation detail and is not backed by the SQL standard. If some insert method does not write rows in sequence (like a future "parallel" INSERT), it breaks. Currently, it should work. About ctid:
How do I decompose ctid into page and row numbers?
If you want a safe way, you need to add some user column to signify the order of rows, like a serial column. But do you really care? Your tiebreaker seems rather arbitrary. See:
Temporary sequence within a SELECT
AND target.value <> source.value
skips empty updates - assuming both columns are NOT NULL. Else, use:
AND target.value IS DISTINCT FROM source.value
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
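The safer variant, with a user column recording the row order, can be sketched as follows. This is only an illustration using Python's sqlite3 module, not PostgreSQL, so it swaps DISTINCT ON for a correlated subquery; the column name seq is an assumption. In PostgreSQL you would keep the DISTINCT ON query and order by id, seq DESC instead of ctid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test (id INTEGER PRIMARY KEY, value INTEGER NOT NULL);
    INSERT INTO test VALUES (1, 0), (2, 0);

    -- seq is the assumed user column that records insertion order.
    CREATE TABLE tmp_test (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        id INTEGER,
        value INTEGER
    );
    INSERT INTO tmp_test (id, value) VALUES (1, 1), (2, 1), (1, 2), (2, 2);
""")

# For every target row, take the value of the last staged row with the same id.
conn.execute("""
    UPDATE test
    SET value = (
        SELECT s.value
        FROM tmp_test s
        WHERE s.id = test.id
        ORDER BY s.seq DESC
        LIMIT 1
    )
    WHERE EXISTS (SELECT 1 FROM tmp_test s WHERE s.id = test.id)
""")

rows = conn.execute("SELECT id, value FROM test ORDER BY id").fetchall()
```

With an explicit ordering column, the result no longer depends on where rows happen to be stored physically.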

Given a row representing a path, union a total column

Say I have a table like the following that represents a path from a -> b -> c -> d -> e -> f:
+------+----+--------+
| from | to | weight |
+------+----+--------+
| a    | b  |      1 |
| b    | c  |      2 |
| c    | d  |      1 |
| d    | e  |      1 |
| e    | f  |      3 |
+------+----+--------+
Each row knows where it came from and where it is going.
I would like to union a total row that takes the starting name, ending name, and a total weight like so:
+------+----+--------+
| from | to | weight |
+------+----+--------+
| a    | f  |      8 |
+------+----+--------+
The first table is a result of a CTE expression, and I can easily get the total of the previous query with SUM, but I'm unable to get the LAST_VALUE to work in a similar way to:
WITH RECURSIVE cte AS (
...
)
SELECT *
FROM cte
UNION ALL
SELECT 'total', FIRST_VALUE(from), LAST_VALUE(to), SUM(weight)
FROM cte
The FIRST_VALUE and LAST_VALUE functions require OVER clauses which seem to add unnecessary complications to what I would expect, so I think I am going the wrong direction with that. Any ideas on how to achieve this?
So I made a strange solution that:
Selects the first from value (partitioned by TRUE)
Selects the last to value (partitioned by TRUE again)
Cross joins the sum of all weights, limited to 1
WITH RECURSIVE cte AS (
  ...
)
SELECT *
FROM cte
UNION ALL (
  SELECT FIRST_VALUE(from) OVER (PARTITION BY TRUE),
         LAST_VALUE(to) OVER (PARTITION BY TRUE),
         total
  FROM cte
  CROSS JOIN (
    SELECT SUM(weight) as total
    FROM cte
  ) tmp
  LIMIT 1
);
Is it hacky? Yes. Does it work? Also yes. I'm sure there are better solutions, and I would love to hear them.
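One possible alternative that avoids window functions entirely: the start of the path is the only "from" value that never appears as a "to", and the end is the only "to" that never appears as a "from". Below is a minimal sketch with Python's sqlite3 module, assuming the CTE result is a single chain without cycles; the table name path is hypothetical and stands in for the CTE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "from" and "to" are reserved words, so they are quoted as identifiers.
    CREATE TABLE path ("from" TEXT, "to" TEXT, weight INTEGER);
    INSERT INTO path VALUES
        ('a', 'b', 1), ('b', 'c', 2), ('c', 'd', 1), ('d', 'e', 1), ('e', 'f', 3);
""")

# Start node: never a destination. End node: never a source.
total = conn.execute("""
    SELECT
        (SELECT "from" FROM path WHERE "from" NOT IN (SELECT "to" FROM path)),
        (SELECT "to" FROM path WHERE "to" NOT IN (SELECT "from" FROM path)),
        SUM(weight)
    FROM path
""").fetchone()
```

Unlike FIRST_VALUE and LAST_VALUE without an ORDER BY, this does not depend on the order in which the rows happen to be returned.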

How to get back aggregate values across 2 dimensions using Python Cubes?

Situation
Using Python 3, Django 1.9, Cubes 1.1, and Postgres 9.5.
These are my data tables in text format:
Store table
------------------------------
| id  | code | address       |
|-----|------|---------------|
| 1   | S1   | Kings Row     |
| 2   | S2   | Queens Street |
| 3   | S3   | Jacks Place   |
| 4   | S4   | Diamonds Alley|
| 5   | S5   | Hearts Road   |
------------------------------
Product table
------------------------------
| id  | code | name          |
|-----|------|---------------|
| 1   | P1   | Saucer 12     |
| 2   | P2   | Plate 15      |
| 3   | P3   | Saucer 13     |
| 4   | P4   | Saucer 14     |
| 5   | P5   | Plate 16      |
| and many more ....         |
|1000 |P1000 | Bowl 25       |
|----------------------------|
Sales table
----------------------------------------
| id  | product_id | store_id | amount |
|-----|------------|----------|--------|
| 1   | 1          | 1        | 7.05   |
| 2   | 1          | 2        | 9.00   |
| 3   | 2          | 3        | 1.00   |
| 4   | 2          | 3        | 1.00   |
| 5   | 2          | 5        | 1.00   |
| and many more ....                   |
| 1000| 20         | 4        | 1.00   |
|--------------------------------------|
The relationships are:
Sales belongs to Store
Sales belongs to Product
Store has many Sales
Product has many Sales
What I want to achieve
I want to use cubes to be able to do a display by pagination in the following manner:
Given the stores S1-S3:
--------------------------
| product | S1 | S2 | S3 |
|---------|----|----|----|
|Saucer 12|7.05|9   |0   |
|Plate 15 |0   |0   |2   |
| and many more ....     |
|------------------------|
Note the following:
Even though there were no records in sales for Saucer 12 under Store S3, I displayed 0 instead of null or none.
I want to be able to sort by store, say in descending order for S3.
The cells indicate the SUM total of that particular product spent in that particular store.
I also want to have pagination.
What I tried
This is the configuration I used:
"cubes": [
    {
        "name": "sales",
        "dimensions": ["product", "store"],
        "joins": [
            {"master":"product_id", "detail":"product.id"},
            {"master":"store_id", "detail":"store.id"}
        ]
    }
],
"dimensions": [
    { "name": "product", "attributes": ["code", "name"] },
    { "name": "store", "attributes": ["code", "address"] }
]
This is the code I used:
result = browser.aggregate(drilldown=['Store','Product'],
order=[("Product.name","asc"), ("Store.name","desc"), ("total_products_sale", "desc")])
I didn't get what I want.
I got it like this:
-----------------------------------------------
| product_id | store_id | total_products_sale |
|------------|----------|---------------------|
| 1          | 1        | 7.05                |
| 1          | 2        | 9                   |
| 2          | 3        | 2.00                |
| and many more ....                          |
|---------------------------------------------|
which is the whole table with no pagination, and if a product was not sold in a store, that combination does not show up as zero.
My question
How do I get what I want?
Do I need to create another data table that aggregates everything by store and product before I use cubes to run the query?
Update
I have read more. I realised that what I want is called dicing as I needed to go across 2 dimensions. See: https://en.wikipedia.org/wiki/OLAP_cube#Operations
Cross-posted at Cubes GitHub issues to get more attention.
This is a pure SQL solution using crosstab() from the additional tablefunc module to pivot the aggregated data. It typically performs better than any client-side alternative. If you are not familiar with crosstab(), read this first:
PostgreSQL Crosstab Query
And this about the "extra" column in the crosstab() output:
Pivot on Multiple Columns using Tablefunc
SELECT product_id, product
     , COALESCE(s1, 0) AS s1 -- 1. ... displayed 0 instead of null
     , COALESCE(s2, 0) AS s2
     , COALESCE(s3, 0) AS s3
     , COALESCE(s4, 0) AS s4
     , COALESCE(s5, 0) AS s5
FROM crosstab(
    'SELECT s.product_id, p.name, s.store_id, s.sum_amount
     FROM product p
     JOIN (
        SELECT product_id, store_id
             , sum(amount) AS sum_amount -- 3. SUM total of product spent in store
        FROM sales
        GROUP BY product_id, store_id
        ) s ON p.id = s.product_id
     ORDER BY s.product_id, s.store_id;'
   , 'VALUES (1),(2),(3),(4),(5)' -- desired store_id's
   ) AS ct (product_id int, product text -- "extra" column
          , s1 numeric, s2 numeric, s3 numeric, s4 numeric, s5 numeric)
ORDER BY s3 DESC; -- 2. ... descending order for S3
Produces your desired result exactly (plus product_id).
To include products that have never been sold replace [INNER] JOIN with LEFT [OUTER] JOIN.
SQL Fiddle with base query.
The tablefunc module is not installed on sqlfiddle.
Major points
Read the basic explanation in the reference answer for crosstab().
I am including product_id because product.name is hardly unique. This might otherwise lead to sneaky errors conflating two different products.
You don't need the store table in the query if referential integrity is guaranteed.
ORDER BY s3 DESC works, because s3 references the output column where NULL values have been replaced with COALESCE. Else we would need DESC NULLS LAST to sort NULL values last:
PostgreSQL sort by datetime asc, null first?
For building crosstab() queries dynamically consider:
Dynamic alternative to pivot with CASE and GROUP BY
I also want to have pagination.
That last item is fuzzy. Simple pagination can be had with LIMIT and OFFSET:
Displaying data in grid view page by page
I would consider a MATERIALIZED VIEW to materialize results before pagination. If you have a stable page size I would add page numbers to the MV for easy and fast results.
To optimize performance for big result sets, consider:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Optimize query with OFFSET on large table
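To experiment with the pivot, the zero-filling, and LIMIT / OFFSET pagination without installing tablefunc, here is a rough sketch using Python's sqlite3 module. It uses conditional aggregation (CASE inside SUM) instead of crosstab(), so it is a different technique from the query above. The store list is fixed at S1-S3, and the helper fetch_page with its page size is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (
        id INTEGER PRIMARY KEY, product_id INTEGER, store_id INTEGER, amount REAL
    );
    INSERT INTO product VALUES (1, 'Saucer 12'), (2, 'Plate 15'), (3, 'Saucer 13');
    INSERT INTO sales VALUES
        (1, 1, 1, 7.05), (2, 1, 2, 9.00), (3, 2, 3, 1.00),
        (4, 2, 3, 1.00), (5, 2, 5, 1.00);
""")

# One SUM(CASE ...) per store column; COALESCE turns "no sales" into 0.
# LEFT JOIN keeps products that have never been sold.
PIVOT = """
SELECT p.id, p.name,
       COALESCE(SUM(CASE WHEN s.store_id = 1 THEN s.amount END), 0) AS s1,
       COALESCE(SUM(CASE WHEN s.store_id = 2 THEN s.amount END), 0) AS s2,
       COALESCE(SUM(CASE WHEN s.store_id = 3 THEN s.amount END), 0) AS s3
FROM product p
LEFT JOIN sales s ON s.product_id = p.id
GROUP BY p.id, p.name
ORDER BY s3 DESC, p.id
LIMIT ? OFFSET ?
"""


def fetch_page(page, page_size=2):
    """Return one page of the pivoted result, sorted by store S3 descending."""
    return conn.execute(PIVOT, (page_size, page * page_size)).fetchall()


first_page = fetch_page(0)
second_page = fetch_page(1)
```

Adding p.id as a second sort key keeps the page boundaries stable when many products have the same S3 total.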

Query join result appears to be incorrect

I have no idea what's going on here. Maybe I've been staring at this code for too long.
The query I have is as follows:
CREATE VIEW v_sku_best_before AS
SELECT
    sw.sku_id,
    sw.sku_warehouse_id "A",
    sbb.sku_warehouse_id "B",
    sbb.best_before,
    sbb.quantity
FROM SKU_WAREHOUSE sw
LEFT OUTER JOIN SKU_BEST_BEFORE sbb
    ON sbb.sku_warehouse_id = sw.warehouse_id
ORDER BY sbb.best_before
I can post the table definitions if that helps, but I'm not sure it will. Suffice to say that SKU_WAREHOUSE.sku_warehouse_id is an identity column, and SKU_BEST_BEFORE.sku_warehouse_id is a child that uses that identity as a foreign key.
Here's the result when I run the query:
+--------+-----+----+-------------+----------+
| sku_id | A   | B  | best_before | quantity |
+--------+-----+----+-------------+----------+
|  20251 | 643 | 11 | <<null>>    |      140 |
+--------+-----+----+-------------+----------+
(1 row)
The join specifies that the sku_warehouse_id columns have to be equal, but when I pull the ID from each table (labelled as A and B) they're different.
What am I doing wrong?
Perhaps just sw.sku_warehouse_id instead of sw.warehouse_id?
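To see the effect of that one-column difference, here is a small reproduction using Python's sqlite3 module. The table definitions were not posted, so the schemas below are hypothetical, and the single row in each table is chosen to match the output shown in the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical minimal schemas; only the id columns matter here.
    CREATE TABLE sku_warehouse (
        sku_warehouse_id INTEGER PRIMARY KEY, sku_id INTEGER, warehouse_id INTEGER
    );
    CREATE TABLE sku_best_before (
        sku_warehouse_id INTEGER, best_before TEXT, quantity INTEGER
    );
    INSERT INTO sku_warehouse VALUES (643, 20251, 11);
    INSERT INTO sku_best_before VALUES (11, NULL, 140);
""")

TEMPLATE = """
SELECT sw.sku_warehouse_id AS a, sbb.sku_warehouse_id AS b
FROM sku_warehouse sw
LEFT OUTER JOIN sku_best_before sbb
    ON sbb.sku_warehouse_id = sw.{column}
"""

# Original condition: compares the child key with warehouse_id by mistake.
wrong = conn.execute(TEMPLATE.format(column="warehouse_id")).fetchall()
# Suggested condition: compares the child key with the identity column.
fixed = conn.execute(TEMPLATE.format(column="sku_warehouse_id")).fetchall()
```

With the original condition the unrelated row is joined because warehouse_id happens to equal 11. With the corrected condition, column B is NULL unless a best-before row really belongs to that sku_warehouse_id.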