I need to update a Delta table based on the rows of a lookup file.
The lookup file has two columns, a_acc and b_acc, and it can have multiple rows.
I need to update the b_acc value in the Delta table wherever its a_acc value matches an a_acc value in the lookup.
Lookup table
+-----+-----+
|a_acc|b_acc|
+-----+-----+
| 4636| 1999|
| 1023|  892|
| 3333| 1111|
+-----+-----+
Delta table
+-----+-----+
|a_acc|b_acc|
+-----+-----+
| 4636| 0123|
| 1023|  843|
| 3333| 3232|
+-----+-----+
Output Delta table:
+-----+-----+
|a_acc|b_acc|
+-----+-----+
| 4636| 1999|
| 1023|  892|
| 3333| 1111|
+-----+-----+
For a single row I can update the value in the Delta table:
deltaTable.update(condition = "a_acc = '4636'", set = {"b_acc": "1999"})
But how can I loop over all the values in the lookup table and update the Delta table accordingly?
This is exactly what the MERGE operation is for:
You can upsert data from a source table, view, or DataFrame into a target Delta table by using the MERGE SQL operation. Delta Lake supports inserts, updates and deletes in MERGE, and it supports extended syntax beyond the SQL standards to facilitate advanced use cases.
You can use merge to update the value (b_acc) in the Delta table whenever a matching key (a_acc) is found in the lookup table.
deltaTable.alias("dt").merge(
    source = Lookuptable.alias("lt"),
    condition = "dt.a_acc = lt.a_acc"
).whenMatchedUpdate(set =
    {
        "b_acc": "lt.b_acc"
    }
).execute()
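Here deltaTable and Lookuptable are assumed to already exist. A minimal sketch of how they might be built (the paths and the CSV format are placeholders, not taken from the question):
from delta.tables import DeltaTable

# Placeholder paths/format -- adjust to where your data actually lives.
deltaTable = DeltaTable.forPath(spark, "/path/to/delta_table")
Lookuptable = spark.read.option("header", True).csv("/path/to/lookup_file.csv")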
Hope this helps.
from pyspark.sql.functions import coalesce, col

d1 = [[4636, 1999], [1023, 892], [3333, 1111], [444, 123]]
lookup_table = spark.createDataFrame(d1, ['a_acc', 'b_acc'])
d2 = [[4636, 123], [1023, 843], [3333, 3232], [567, 221]]
delta_table = spark.createDataFrame(d2, ['a_acc', 'b_acc'])

# Alternative: left-join the lookup table and take its b_acc when a match exists
(
    delta_table
    .join(lookup_table.withColumnRenamed("b_acc", "lookup_b_acc"), ["a_acc"], "left")
    .withColumn("b_acc", coalesce(col("lookup_b_acc"), col("b_acc")))
    .sort("a_acc")
    .show()
)
+-----+-----+------------+
|a_acc|b_acc|lookup_b_acc|
+-----+-----+------------+
| 567| 221| null|
| 1023| 892| 892|
| 3333| 1111| 1111|
| 4636| 1999| 1999|
+-----+-----+------------+
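The join above only displays the result. If you also need to persist it, one possible follow-up is sketched below; the Delta table path is a placeholder, and the helper column is dropped first so the schema matches the original table.
# Sketch only: "/path/to/delta_table" is a placeholder, not from the question.
(
    delta_table
    .join(lookup_table.withColumnRenamed("b_acc", "lookup_b_acc"), ["a_acc"], "left")
    .withColumn("b_acc", coalesce(col("lookup_b_acc"), col("b_acc")))
    .drop("lookup_b_acc")              # keep only the original columns
    .write.format("delta")
    .mode("overwrite")                 # replace the existing table contents
    .save("/path/to/delta_table")
)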
I have a problem in Spark (v2.2.2) / Scala (v2.11.8). I mostly work with Scala/Spark in a functional style.
I have a list of persons with a rented date (report_date) like below.
These are CSV files which I will convert to Parquet and read as DataFrames.
Table: Person
+-------------------+-----------+
| ID |report_date|
+-------------------+-----------+
| 123| 2011-09-25|
| 111| 2017-08-23|
| 222| 2018-09-30|
| 333| 2020-09-30|
| 444| 2019-09-30|
+-------------------+-----------+
I want to find out the start_date of the address for the period during which the person rented it, grouping on ID.
Table: Address
+-------------------+----------+----------+
| ID |start_date|close_date|
+-------------------+----------+----------+
| 123|2008-09-23|2009-09-23|
| 123|2009-09-24|2010-09-23|
| 123|2010-09-24|2011-09-23|
| 123|2011-09-30|2012-09-23|
| 123|2012-09-24| null|
| 111|2013-09-23|2014-09-23|
| 111|2014-09-24|2015-09-23|
| 111|2015-09-24|2016-09-23|
| 111|2016-09-24|2017-09-23|
| 111|2017-09-24| null|
| 222|2018-09-24| null|
+-------------------+----------+----------+
For example: for ID 123 the rented date 2011-09-20 falls in the Address period (start_date, close_date) = (2010-09-24, 2011-09-23) (row 3 in Address). From here I have to fetch the start_date 2010-09-24.
I have to do this on the entire dataset by joining the tables, i.e. fetch start_date from the Address table into the Person table.
I also need to handle the case where close_date is null.
Sometimes the rented date will not fall in any of the periods; in that case we need to take the row where rented_date < close_date.
Thanks in Advance.
First of all
I have a list of person with rented_date like below. These are csv file which I will convert into parquet and read as a dataframe.
There is no need to convert anything; you can read the CSV directly with Spark:
spark.read.csv("path")
spark.read.format("csv").load("path")
I am not sure what your expectation for the null fields is, so I would filter them out for now:
val dfAddressNotNull = dfAddress.filter($"close_date".isNotNull)  // dfAddress is the Address DataFrame read above
Of course now you need to join them together, and since the data in Address is the relevant one, I would do a left join.
val joinedDf = dfAddressNotNull.join(dfPerson, Seq("ID"), "left")
Now you have Addresses and Persons combined.
If you now filter like this:
joinedDf.filter($"report_date" >= $"start_date" && $"report_date" < $"close_date")
you should get something close to what you want to achieve.
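If the rows with a null close_date should be kept instead of filtered out (i.e. treated as still-open periods), a small variation of the filter above, sketched under that assumption:
import org.apache.spark.sql.functions.col

// Sketch: treat a null close_date as "no upper bound" instead of dropping the row.
val matchedDf = joinedDf.filter(
  col("report_date") >= col("start_date") &&
  (col("close_date").isNull || col("report_date") < col("close_date"))
)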
Suppose there is a table with data:
+----+-------+
| id | value |
+----+-------+
| 1 | 0 |
| 2 | 0 |
+----+-------+
I need to do a bulk update, and I use COPY FROM STDIN for a fast insert into a temp table without constraints, so it can contain duplicate values in the id column.
Temp table to update from:
+----+-------+
| id | value |
+----+-------+
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
+----+-------+
If I simply run a query like this:
UPDATE test target SET value = source.value FROM tmp_test source WHERE target.id = source.id;
I got wrong results:
+----+-------+
| id | value |
+----+-------+
| 1 | 1 |
| 2 | 1 |
+----+-------+
I need the target table to contain the values that appeared last in the temporary table.
What is the most effective way to do this, given that the target table may contain millions of records and the temporary table may contain tens of thousands?
Assuming you want to take the value from the row that was inserted last into the temp table, physically, you can (ab-)use the system column ctid, signifying the physical location:
UPDATE test AS target
SET    value = source.value
FROM  (
   SELECT DISTINCT ON (id)
          id, value
   FROM   tmp_test
   ORDER  BY id, ctid DESC
   ) source
WHERE  target.id = source.id
AND    target.value <> source.value;  -- skip empty updates
About DISTINCT ON:
Select first row in each GROUP BY group?
This builds on an implementation detail and is not backed by the SQL standard. If some insert method does not write rows in sequence (like a future "parallel" INSERT), it breaks. Currently, it should work. About ctid:
How do I decompose ctid into page and row numbers?
If you want a safe way, you need to add some user column to signify the order of rows, like a serial column. But do you really care? Your tiebreaker seems rather arbitrary. See:
Temporary sequence within a SELECT
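For illustration, a minimal sketch of that safe variant (the seq column is invented here; it is not part of the original temp table):
-- Sketch: add an explicit insertion-order column instead of relying on ctid.
CREATE TEMP TABLE tmp_test (
   seq   bigserial   -- records insertion order
 , id    int
 , value int
);

-- COPY tmp_test (id, value) FROM STDIN ...   (seq is filled automatically)

UPDATE test AS target
SET    value = source.value
FROM  (
   SELECT DISTINCT ON (id)
          id, value
   FROM   tmp_test
   ORDER  BY id, seq DESC   -- last inserted row per id wins
   ) source
WHERE  target.id = source.id
AND    target.value <> source.value;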
AND target.value <> source.value
skips empty updates - assuming both columns are NOT NULL. Else, use:
AND target.value IS DISTINCT FROM source.value
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
How would you create a trigger that uses the values of the row being inserted in a calculation, so that the value actually stored is the transformed one?
Let's say I have this table labor_rates,
+---------------+-----------------+--------------+------------+
| labor_rate_id | rate_per_minute | unit_minutes | created_at |
+---------------+-----------------+--------------+------------+
| bigint | numeric | numeric | timestamp |
+---------------+-----------------+--------------+------------+
Each time a new record is created, I need the rate to be calculated as rate / unit (the smallest unit here is a minute).
So example, when inserting a new record:
INSERT INTO labor_rates(rate, unit)
VALUES (60, 480);
It would create a new record with these values:
+---------------+-----------------+--------------+----------------------------+
| labor_rate_id | rate_per_minute | unit_minutes | created_at |
+---------------+-----------------+--------------+----------------------------+
| 1000000 | 1.1979 | 60 | 2017-03-16 01:59:47.208111 |
+---------------+-----------------+--------------+----------------------------+
One could argue that this should be left as a calculated field instead of storing the calculated value. But in this case, it would be best if the calculated value is stored.
I am fairly new to triggers so any help would be much appreciated.
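A minimal sketch of such a trigger, under the assumption that the raw rate and the unit length are supplied in rate_per_minute and unit_minutes and that the trigger overwrites rate_per_minute with the per-minute value before the row is stored (the function and trigger names are made up):
-- Sketch only: assumes the raw values arrive in rate_per_minute / unit_minutes.
CREATE OR REPLACE FUNCTION labor_rates_per_minute()
  RETURNS trigger
  LANGUAGE plpgsql AS
$func$
BEGIN
   NEW.rate_per_minute := NEW.rate_per_minute / NEW.unit_minutes;
   RETURN NEW;
END
$func$;

CREATE TRIGGER labor_rates_per_minute_bi
BEFORE INSERT ON labor_rates
FOR EACH ROW
EXECUTE PROCEDURE labor_rates_per_minute();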
Here is (an extremely simplified version of) my problem.
I'm using Postgresql as the backend and trying to build a sqlalchemy query
from another query.
Table setup
Here are the tables with some random data for the example.
You can assume that each table was declared in sqlalchemy declaratively, with
the name of the mappers being respectively Item and ItemVersion.
At the end of the question you can find a link where I put the code for
everything in this question, including the table definitions.
Some items.
item
+----+
| id |
+----+
| 1 |
| 2 |
| 3 |
+----+
A table containing versions of each item. Each has at least one.
item_version
+----+---------+---------+-----------+
| id | item_id | version | text |
+----+---------+---------+-----------+
| 1 | 1 | 0 | item_1_v0 |
| 2 | 1 | 1 | item_1_v1 |
| 3 | 2 | 0 | item_2_v0 |
| 4 | 3 | 0 | item_3_v0 |
+----+---------+---------+-----------+
The query
Now, for a given sqlalchemy query over Item, I want a function that returns
another query, but this time over (Item, ItemVersion), where the Items are
the same as in the original query (and in the same order!), and where the
ItemVersion are the corresponding latest versions for each Item.
Here is an example in SQL, which is pretty straightforward:
First a random query over the item table
SELECT item.id as item_id
FROM item
WHERE item.id != 2
ORDER BY item.id DESC
which corresponds to
+---------+
| item_id |
+---------+
| 3 |
| 1 |
+---------+
Then from that query, if I want to join the right versions, I can do
SELECT sq2.item_id AS item_id,
sq2.item_version_id AS item_version_id,
sq2.item_version_text AS item_version_text
FROM (
SELECT DISTINCT ON (sq.item_id)
sq.item_id AS item_id,
iv.id AS item_version_id,
iv.text AS item_version_text
FROM (
SELECT item.id AS item_id
FROM item
WHERE id != 2
ORDER BY id DESC) AS sq
JOIN item_version AS iv
ON iv.item_id = sq.item_id
ORDER BY sq.item_id, iv.version DESC) AS sq2
ORDER BY sq2.item_id DESC
Note that it has to be wrapped in a subquery a second time because the
DISTINCT ON discards the ordering.
Now the challenge is to write a function that does that in sqlalchemy.
Here is what I have so far.
First the initial sqlalchemy query over the items:
session.query(Item).filter(Item.id != 2).order_by(desc(Item.id))
Then I'm able to build my second query but without the original ordering. In
other words I don't know how to do the second subquery wrapping that I did in
SQL to get back the ordering that was discarded by the DISTINCT ON.
def join_version(session, query):
    sq = aliased(Item, query.subquery('sq'))
    sq2 = session.query(sq, ItemVersion) \
        .distinct(sq.id) \
        .join(ItemVersion) \
        .order_by(sq.id, desc(ItemVersion.version))
    return sq2
I think this SO question could be part of the answer but I'm not quite
sure how.
The code to run everything in this question (database creation, population and
a failing unit test with what I have so far) can be found here. Normally
if you can fix the join_version function, it should make the test pass!
Ok so I found a way. It's a bit of a hack but still only queries the database twice so I guess I will survive! Basically I'm querying the database for the Items first, and then I do another query for the ItemVersions, filtering on item_id, and then reordering with a trick I found here (this is also relevant).
Here is the code:
def join_version(session, query):
    items = query.all()
    item_ids = [i.id for i in items]
    items_v_sq = session.query(ItemVersion) \
        .distinct(ItemVersion.item_id) \
        .filter(ItemVersion.item_id.in_(item_ids)) \
        .order_by(ItemVersion.item_id, desc(ItemVersion.version)) \
        .subquery('sq')
    sq = aliased(ItemVersion, items_v_sq)
    items_v = session.query(sq) \
        .order_by('idx(array{}, sq.item_id)'.format(item_ids))
    return zip(items, items_v)
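For completeness, a hypothetical usage example (session, Item and desc come from the setup described above):
# Build the original query over Item, then zip it with the latest versions.
base_query = session.query(Item).filter(Item.id != 2).order_by(desc(Item.id))
for item, item_version in join_version(session, base_query):
    print(item.id, item_version.version, item_version.text)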
I have no idea what's going on here. Maybe I've been staring at this code for too long.
The query I have is as follows:
CREATE VIEW v_sku_best_before AS
SELECT
sw.sku_id,
sw.sku_warehouse_id "A",
sbb.sku_warehouse_id "B",
sbb.best_before,
sbb.quantity
FROM SKU_WAREHOUSE sw
LEFT OUTER JOIN SKU_BEST_BEFORE sbb
ON sbb.sku_warehouse_id = sw.warehouse_id
ORDER BY sbb.best_before
I can post the table definitions if that helps, but I'm not sure it will. Suffice to say that SKU_WAREHOUSE.sku_warehouse_id is an identity column, and SKU_BEST_BEFORE.sku_warehouse_id is a child that uses that identity as a foreign key.
Here's the result when I run the query:
+--------+-----+----+-------------+----------+
| sku_id | A | B | best_before | quantity |
+--------+-----+----+-------------+----------+
| 20251 | 643 | 11 | <<null>> | 140 |
+--------+-----+----+-------------+----------+
(1 row)
The join specifies that the sku_warehouse_id columns have to be equal, but when I pull the ID from each table (labelled as A and B) they're different.
What am I doing wrong?
Perhaps just sw.sku_warehouse_id instead of sw.warehouse_id?
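In other words, a sketch of the corrected view, identical to the definition above except for the join condition:
CREATE OR REPLACE VIEW v_sku_best_before AS
SELECT
    sw.sku_id,
    sw.sku_warehouse_id "A",
    sbb.sku_warehouse_id "B",
    sbb.best_before,
    sbb.quantity
FROM SKU_WAREHOUSE sw
LEFT OUTER JOIN SKU_BEST_BEFORE sbb
    ON sbb.sku_warehouse_id = sw.sku_warehouse_id   -- was sw.warehouse_id
ORDER BY sbb.best_before;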