UPDATE from temp table picking the "last" row per group - postgresql

Suppose there is a table with data:
+----+-------+
| id | value |
+----+-------+
|  1 |     0 |
|  2 |     0 |
+----+-------+
I need to do a bulk update. I use COPY FROM STDIN for a fast insert into a temp table without constraints, so it can contain duplicate values in the id column.
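Roughly, the load step looks like this (a sketch; the column layout is assumed to match the target table, and there is deliberately no primary key, so duplicate ids can occur):
CREATE TEMP TABLE tmp_test (id int, value int);  -- no constraints, duplicates allowed
COPY tmp_test (id, value) FROM STDIN;            -- bulk rows streamed from the client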
Temp table to update from:
+----+-------+
| id | value |
+----+-------+
|  1 |     1 |
|  2 |     1 |
|  1 |     2 |
|  2 |     2 |
+----+-------+
If I simply run a query like this:
UPDATE test target SET value = source.value FROM tmp_test source WHERE target.id = source.id;
I get the wrong result:
+----+-------+
| id | value |
+----+-------+
|  1 |     1 |
|  2 |     1 |
+----+-------+
I need the target table to contain the values that appeared last in the temporary table.
What is the most effective way to do this, given that the target table may contain millions of records, and the temporary table may contain tens of thousands?

Assuming you want to take the value from the row that was inserted last into the temp table, physically, you can (ab-)use the system column ctid, signifying the physical location:
UPDATE test AS target
SET    value = source.value
FROM  (
   SELECT DISTINCT ON (id)
          id, value
   FROM   tmp_test
   ORDER  BY id, ctid DESC
   ) source
WHERE  target.id = source.id
AND    target.value <> source.value;  -- skip empty updates
About DISTINCT ON:
Select first row in each GROUP BY group?
This builds on an implementation detail and is not backed by the SQL standard. If some insert method does not write rows in sequence (like a future "parallel" INSERT), it breaks. Currently, it should work. About ctid:
How do I decompose ctid into page and row numbers?
If you want a safe way, you need to add a user column to signify the order of rows, like a serial column. But do you really care? Your tiebreaker seems rather arbitrary. See:
Temporary sequence within a SELECT
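A minimal sketch of that safer variant, assuming the temp table can get an extra serial column (the column name seq is made up for illustration):
CREATE TEMP TABLE tmp_test (seq serial, id int, value int);
COPY tmp_test (id, value) FROM STDIN;  -- seq is filled from the sequence in insert order

UPDATE test AS target
SET    value = source.value
FROM  (
   SELECT DISTINCT ON (id)
          id, value
   FROM   tmp_test
   ORDER  BY id, seq DESC  -- the row loaded last per id wins
   ) source
WHERE  target.id = source.id
AND    target.value <> source.value;  -- skip empty updates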
AND target.value <> source.value
skips empty updates - assuming both columns are NOT NULL. Else, use:
AND target.value IS DISTINCT FROM source.value
See:
How do I (or can I) SELECT DISTINCT on multiple columns?

Related

How to update delta table based on lookup DataFrame?

I need to update a Delta table based on lookup file rows.
The lookup file has two columns, a_acc & b_acc, and it can have multiple rows.
I need to update the b_acc value in the Delta table based on the a_acc column value from the lookup.
Lookup table:
+-----+-----+
|a_acc|b_acc|
+-----+-----+
| 4636| 1999|
| 1023|  892|
| 3333| 1111|
+-----+-----+
Delta table:
+-----+-----+
|a_acc|b_acc|
+-----+-----+
| 4636| 0123|
| 1023|  843|
| 3333| 3232|
+-----+-----+
Output Delta table:
+-----+-----+
|a_acc|b_acc|
+-----+-----+
| 4636| 1999|
| 1023|  892|
| 3333| 1111|
+-----+-----+
For a single row I can update the values in the Delta table:
deltaTable.update(condition = "a_acc = '4636'", set = {"b_acc": "1999"})
But how can I loop over all the values in the lookup table and update them correspondingly?
This is exactly the MERGE operation:
You can upsert data from a source table, view, or DataFrame into a target Delta table by using the MERGE SQL operation. Delta Lake supports inserts, updates and deletes in MERGE, and it supports extended syntax beyond the SQL standards to facilitate advanced use cases.
You can use merge to update the values (b_acc) in delta table when matching key found in lookup table (a_acc).
deltaTable.alias("dt").merge(
source = Lookuptable.alias("lt"),
condition = "dt.a_acc = lt.a_acc"
).whenMatchedUpdate(set =
{
"b_acc": "lt.b_acc"
}
).execute()
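If you prefer SQL, the same upsert can be written as a MERGE statement, assuming both sides are available as tables or temp views named delta_table and lookup_table (illustrative names):
MERGE INTO delta_table AS dt
USING lookup_table AS lt
ON dt.a_acc = lt.a_acc
WHEN MATCHED THEN
  UPDATE SET dt.b_acc = lt.b_acc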
Hope this can help you.
from pyspark.sql.functions import coalesce, col

d1 = [[4636, 1999], [1023, 892], [3333, 1111], [444, 123]]
lookup_table = spark.createDataFrame(d1, ['a_acc', 'b_acc'])
d2 = [[4636, 123], [1023, 843], [3333, 3232], [567, 221]]
delta_table = spark.createDataFrame(d2, ['a_acc', 'b_acc'])
(
    delta_table
    .join(lookup_table.withColumnRenamed("b_acc", "lookup_b_acc"), ["a_acc"], "left")
    .withColumn("b_acc", coalesce(col("lookup_b_acc"), col("b_acc")))
    .sort("a_acc")
    .show()
)
+-----+-----+------------+
|a_acc|b_acc|lookup_b_acc|
+-----+-----+------------+
| 567| 221| null|
| 1023| 892| 892|
| 3333| 1111| 1111|
| 4636| 1999| 1999|
+-----+-----+------------+

Querying data with additional column that creates a number for ordering purposes

I am trying to create a "queue" system by adding an arbitrary column that creates a number based on a condition and date, to sort the importance of a row.
For example, below is the query result I pulled in Postgres:
Table: task
Result:
description | status/condition | task_created
bla         | A                | 2019-12-01 07:00:00
pikachu     | A                | 2019-12-01 16:32:10
abcdef      | B                | 2019-12-02 18:34:22
doremi      | B                | 2019-12-02 15:09:43
lalala      | A                | 2019-12-03 22:10:59
In the above, each task has a date/timestamp and status/condition applied to them. I would like to create another column that gives a number to a row where it prioritises the older tasks first, BUT if the condition is B, then we take the older task of those in B as first priority.
The expected end result (based on the example) should be:
Table1: task
description | status/condition | task_created        | priority index
bla         | A                | 2019-12-01 07:00:00 | 3
pikachu     | A                | 2019-12-01 16:32:10 | 4
abcdef      | B                | 2019-12-02 18:34:22 | 2
doremi      | B                | 2019-12-02 15:09:43 | 1
lalala      | A                | 2019-12-03 22:10:59 | 5
For the priority number, 1 is the most urgent to do/resolve, while 5 is the least.
How would I go about adding this additional column to the existing query, especially since there is another condition apart from just the task_created date/time?
Any help is appreciated. Many thanks!
You probably want the rank() or dense_rank() window function (depending on your needs).
If you don't need a conditional order on the status, you can use this one:
SELECT *,
       rank() OVER (
           ORDER BY status DESC, task_created
       ) AS priority_index
FROM   task
If you need a custom order based on the value of the status:
SELECT *,
       rank() OVER (
           ORDER BY
               CASE status
                   WHEN 'B' THEN 1
                   WHEN 'A' THEN 2
                   WHEN 'C' THEN 3
                   ELSE 4
               END, task_created
       ) AS priority_index
FROM   task
If you have only a few values, this is good enough, because we can simply spell out your custom order. But if you have a lot of values and the ordering information is fixed, then it should have its own table, as sketched below.
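A sketch of that variant, assuming a small lookup table (status_order and its columns are illustrative names) that maps each status to its sort position:
-- lookup table holding the custom ordering
CREATE TABLE status_order (status text PRIMARY KEY, sort_pos int NOT NULL);
INSERT INTO status_order VALUES ('B', 1), ('A', 2), ('C', 3);

SELECT t.*,
       rank() OVER (ORDER BY so.sort_pos, t.task_created) AS priority_index
FROM   task t
JOIN   status_order so ON so.status = t.status;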

joining with a DISTINCT ON on an ordered subquery in sqlalchemy

Here is (an extremely simplified version of) my problem.
I'm using Postgresql as the backend and trying to build a sqlalchemy query
from another query.
Table setup
Here are the tables with some random data for the example.
You can assume that each table was declared in sqlalchemy declaratively, with
the name of the mappers being respectively Item and ItemVersion.
At the end of the question you can find a link where I put the code for
everything in this question, including the table definitions.
Some items.
item
+----+
| id |
+----+
|  1 |
|  2 |
|  3 |
+----+
A table containing versions of each item. Each has at least one.
item_version
+----+---------+---------+-----------+
| id | item_id | version | text      |
+----+---------+---------+-----------+
|  1 |       1 |       0 | item_1_v0 |
|  2 |       1 |       1 | item_1_v1 |
|  3 |       2 |       0 | item_2_v0 |
|  4 |       3 |       0 | item_3_v0 |
+----+---------+---------+-----------+
The query
Now, for a given sqlalchemy query over Item, I want a function that returns
another query, but this time over (Item, ItemVersion), where the Items are
the same as in the original query (and in the same order!), and where the
ItemVersion are the corresponding latest versions for each Item.
Here is an example in SQL, which is pretty straightforward:
First a random query over the item table
SELECT item.id as item_id
FROM item
WHERE item.id != 2
ORDER BY item.id DESC
which corresponds to
+---------+
| item_id |
+---------+
|       3 |
|       1 |
+---------+
Then from that query, if I want to join the right versions, I can do
SELECT sq2.item_id AS item_id,
sq2.item_version_id AS item_version_id,
sq2.item_version_text AS item_version_text
FROM (
SELECT DISTINCT ON (sq.item_id)
sq.item_id AS item_id,
iv.id AS item_version_id,
iv.text AS item_version_text
FROM (
SELECT item.id AS item_id
FROM item
WHERE id != 2
ORDER BY id DESC) AS sq
JOIN item_version AS iv
ON iv.item_id = sq.item_id
ORDER BY sq.item_id, iv.version DESC) AS sq2
ORDER BY sq2.item_id DESC
Note that it has to be wrapped in a subquery a second time because the
DISTINCT ON discards the ordering.
Now the challenge is to write a function that does that in sqlalchemy.
Here is what I have so far.
First the initial sqlalchemy query over the items:
session.query(Item).filter(Item.id != 2).order_by(desc(Item.id))
Then I'm able to build my second query but without the original ordering. In
other words I don't know how to do the second subquery wrapping that I did in
SQL to get back the ordering that was discarded by the DISTINCT ON.
def join_version(session, query):
    sq = aliased(Item, query.subquery('sq'))
    sq2 = session.query(sq, ItemVersion) \
        .distinct(sq.id) \
        .join(ItemVersion) \
        .order_by(sq.id, desc(ItemVersion.version))
    return sq2
I think this SO question could be part of the answer but I'm not quite
sure how.
The code to run everything in this question (database creation, population and
a failing unit test with what I have so far) can be found here. Normally
if you can fix the join_version function, it should make the test pass!
Ok so I found a way. It's a bit of a hack but still only queries the database twice so I guess I will survive! Basically I'm querying the database for the Items first, and then I do another query for the ItemVersions, filtering on item_id, and then reordering with a trick I found here (this is also relevant).
Here is the code:
def join_version(session, query):
    items = query.all()
    item_ids = [i.id for i in items]
    items_v_sq = session.query(ItemVersion) \
        .distinct(ItemVersion.item_id) \
        .filter(ItemVersion.item_id.in_(item_ids)) \
        .order_by(ItemVersion.item_id, desc(ItemVersion.version)) \
        .subquery('sq')
    sq = aliased(ItemVersion, items_v_sq)
    items_v = session.query(sq) \
        .order_by('idx(array{}, sq.item_id)'.format(item_ids))
    return zip(items, items_v)

Update intermediate result

EDIT
As requested a little background of what I want to achieve. I have a table that I want to query but I don't want to change the table itself. Next the result of the SELECT query (what I called the 'intermediate table') needs to be cleaned a bit. For example certain cells of certain rows need to be swapped and some strings need to be trimmed. Of course this could all be done as postprocessing in, e.g., Python, but I was hoping to do all of this with one query statement.
Being new to Postgresql I want to update the intermediate table that results from a SELECT statement. So I basically want to edit the resulting table from a SELECT statement in one query. I'd like to prevent having to store the intermediate result.
I've tried the following 'with clause':
with result as (
    select
        a
    from
        b
)
update result as r
set
    a = 'd'
...but that results in ERROR: relation "result" does not exist, while the following does work:
with result as (
    select
        a
    from
        b
)
select
    *
from
    result
As I said, I'm new to Postgresql so it is entirely possible that I'm using the wrong approach.
Depending on the complexity of the transformations you want to perform, you might be able to munge it into the SELECT, which would let you get away with a single query:
WITH foo AS (SELECT lower(name), freq, cumfreq, rank, vec FROM names WHERE name LIKE 'G%')
SELECT ... FROM foo WHERE ...
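For instance, the swapping and trimming described in the question could be folded into the SELECT itself, roughly like this (a sketch; the columns x, y, s and the swap condition are made up for illustration):
WITH result AS (
    SELECT
        CASE WHEN x > y THEN y ELSE x END AS x,  -- swap two cells when a condition holds
        CASE WHEN x > y THEN x ELSE y END AS y,
        trim(s) AS s                             -- trim a string column
    FROM b
)
SELECT * FROM result;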
Or, for more or less unlimited manipulation options, you could create a temp table that will disappear at the end of the current transaction. That doesn't get the job done in a single query, but it does get it all done on the SQL server, which might still be worthwhile.
db=# BEGIN;
BEGIN
db=# CREATE TEMP TABLE foo ON COMMIT DROP AS SELECT * FROM names WHERE name LIKE 'G%';
SELECT 4677
db=# SELECT * FROM foo LIMIT 5;
name | freq | cumfreq | rank | vec
----------+-------+---------+------+-----------------------
GREEN | 0.183 | 11.403 | 35 | 'KRN':1 'green':1
GONZALEZ | 0.166 | 11.915 | 38 | 'KNSL':1 'gonzalez':1
GRAY | 0.106 | 15.921 | 69 | 'KR':1 'gray':1
GONZALES | 0.087 | 18.318 | 94 | 'KNSL':1 'gonzales':1
GRIFFIN | 0.084 | 18.659 | 98 | 'KRFN':1 'griffin':1
(5 rows)
db=# UPDATE foo SET name = lower(name);
UPDATE 4677
db=# SELECT * FROM foo LIMIT 5;
name | freq | cumfreq | rank | vec
--------+-------+---------+-------+---------------------
grube | 0.002 | 67.691 | 7333 | 'KRP':1 'grube':1
gasper | 0.001 | 69.999 | 9027 | 'KSPR':1 'gasper':1
gori | 0.000 | 81.360 | 28946 | 'KR':1 'gori':1
goeltz | 0.000 | 85.471 | 47269 | 'KLTS':1 'goeltz':1
gani | 0.000 | 86.202 | 51743 | 'KN':1 'gani':1
(5 rows)
db=# COMMIT;
COMMIT
db=# SELECT * FROM foo;
ERROR: relation "foo" does not exist

Query join result appears to be incorrect

I have no idea what's going on here. Maybe I've been staring at this code for too long.
The query I have is as follows:
CREATE VIEW v_sku_best_before AS
SELECT
    sw.sku_id,
    sw.sku_warehouse_id "A",
    sbb.sku_warehouse_id "B",
    sbb.best_before,
    sbb.quantity
FROM SKU_WAREHOUSE sw
LEFT OUTER JOIN SKU_BEST_BEFORE sbb
    ON sbb.sku_warehouse_id = sw.warehouse_id
ORDER BY sbb.best_before
I can post the table definitions if that helps, but I'm not sure it will. Suffice to say that SKU_WAREHOUSE.sku_warehouse_id is an identity column, and SKU_BEST_BEFORE.sku_warehouse_id is a child that uses that identity as a foreign key.
Here's the result when I run the query:
+--------+-----+----+-------------+----------+
| sku_id |  A  | B  | best_before | quantity |
+--------+-----+----+-------------+----------+
|  20251 | 643 | 11 |    <<null>> |      140 |
+--------+-----+----+-------------+----------+
(1 row)
The join specifies that the sku_warehouse_id columns have to be equal, but when I pull the ID from each table (labelled as A and B) they're different.
What am I doing wrong?
Perhaps just sw.sku_warehouse_id instead of sw.warehouse_id?
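That is, the join would compare the two ids of the same kind (a sketch of the corrected ON clause):
LEFT OUTER JOIN SKU_BEST_BEFORE sbb
    ON sbb.sku_warehouse_id = sw.sku_warehouse_id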