Efficient way to conditionally update values in 2 records in PostgreSQL - postgresql

I have a table that contains 4 columns
id | category | score | enabled
1 | news | 95 | t
id -- serial
category -- varchar
score -- float
enabled -- bool
I want to update enabled to False if there's another record with a higher score.
For example, if I have:
id | category | score | enabled
1 | news | 95 | t
Then, after some operation, a new record with the same category is inserted:
id | category | score | enabled
1 | news | 95 | t
2 | news | 100 | f
Since the score for id=2 is higher, I want to change enabled for id=2 to True and change enabled for id=1 to False.
I'm wondering if I can combine these operations into 1 query. Right now I do 2 SELECT queries to get the 2 records, then compare the scores locally, and then change the enabled value (if needed).
So simply,
SELECT id, score
FROM table
WHERE category = %s
AND enabled = True
SELECT id, score
FROM table
WHERE category = %s
AND id = (SELECT max(id) FROM table WHERE category = %s)
if score2 >= score1:
UPDATE table SET enabled = True
WHERE id = id2
UPDATE table SET enabled = False
WHERE id = id1
It works, but it seems very inefficient. Any way to improve these queries?

You can do that with a single update:
update the_table
set enabled = (score = t.max_score)
from (
select id, category, max(score) over (partition by category) as max_score
from the_table
where category = 'news'
) t
where t.id = the_table.id
and t.category = the_table.category;
This will set the enabled flags for all rows with the same category in a single statement.
Online example: https://rextester.com/DXR80618
If you happen to have more than one row with the same highest score for one category, the above statement will change enabled to true for all of them.
E.g.
id | category | score
---+----------+------
1 | news | 95
2 | news | 100
3 | news | 100
If you don't want that, and e.g. always pick the one with the lowest id to be the enabled row, you can use the following:
update the_table
set enabled = (rn = 1)
from (
select id, category,
row_number() over (partition by category order by score desc, id) as rn
from the_table
where category = 'news'
) t
where t.id = the_table.id
and t.category = the_table.category;
Online example: https://rextester.com/JPA61125
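If the statement is issued from application code, as in the question, it can be wrapped in a small helper. The following is only a sketch under a few assumptions: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for PostgreSQL, refresh_enabled is a hypothetical name, and the tie-break (lowest id wins) is written as a correlated subquery instead of row_number() so that it also runs on older SQLite builds. With a PostgreSQL driver such as psycopg2 the placeholder would be %s rather than ?.

```python
import sqlite3


def refresh_enabled(conn, category):
    """Enable only the top-scoring row of one category (lowest id wins ties)."""
    conn.execute(
        """
        UPDATE the_table
        SET enabled = (id = (SELECT t2.id
                             FROM the_table AS t2
                             WHERE t2.category = the_table.category
                             ORDER BY t2.score DESC, t2.id
                             LIMIT 1))
        WHERE category = ?
        """,
        (category,),
    )


# In-memory stand-in for the real table; SQLite stores booleans as 0/1.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE the_table ("
    "id INTEGER PRIMARY KEY, category TEXT, score REAL, enabled INTEGER)"
)
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?, ?, ?)",
    [
        (1, "news", 95, 1),
        (2, "news", 100, 0),
        (3, "news", 100, 0),   # same top score as id 2
        (4, "sports", 50, 1),  # other category, must stay untouched
    ],
)

refresh_enabled(conn, "news")
rows = conn.execute("SELECT id, enabled FROM the_table ORDER BY id").fetchall()
print(rows)
```

Calling the helper right after each insert replaces the two SELECTs and two UPDATEs from the question with a single round trip.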

Related

Finding duplicate records posted within a lapse of time, in PostgreSQL

I'm trying to find duplicate rows in a large database (300,000 records). Here's an example of how it looks:
| id | title | thedate |
|----|---------|------------|
| 1 | Title 1 | 2021-01-01 |
| 2 | Title 2 | 2020-12-24 |
| 3 | Title 3 | 2021-02-14 |
| 4 | Title 2 | 2021-05-01 |
| 5 | Title 1 | 2021-01-13 |
I found this excellent (i.e. fast) answer here: Find duplicate rows with PostgreSQL
-- adapted from @MatthewJ answering in https://stackoverflow.com/questions/14471179/find-duplicate-rows-with-postgresql/14471928#14471928
select * from (
SELECT id, title, TO_DATE(thedate,'YYYY-MM-DD'),
ROW_NUMBER() OVER(PARTITION BY title ORDER BY id asc) AS Row
FROM table1
) dups
where
dups.Row > 1
Which I'm trying to use as a base to solve my specific problem: I need to find duplicates according to column values like in the example, but only for records posted within 15 days of each other (the date of record insertion in the column "thedate" in my DB).
I reproduced it in this fiddle http://sqlfiddle.com/#!15/ae109/2, where id 5 (same title as id 1, and posted within 15 days of each other) should be the only acceptable answer.
How would I implement that condition in the query?
With the LAG function you can get the date from the previous row with the same title and then filter based on the time difference.
WITH with_prev AS (
SELECT
*,
LAG(thedate, 1) OVER (PARTITION BY title ORDER BY thedate) AS prev_date
FROM table1
)
SELECT id, title, thedate
FROM with_prev
WHERE thedate::timestamp - prev_date::timestamp < INTERVAL '15 days'
You don't necessarily need window functions for this; you can use a plain old self-join, like:
select p.id, p.thedate, n.id, n.thedate, p.title
from table1 p
join table1 n on p.title = n.title and p.thedate < n.thedate
where n.thedate::date - p.thedate::date < 15
http://sqlfiddle.com/#!15/a3a73a/7
This has the advantage that it might use some of your indexes on the table, and also, you can decide if you want to use the data (i.e. the ID) of the previous row or the next row from each pair.
If your date column however is not unique, you'll need to be a little more specific in your join condition, like:
select p.id, p.thedate, n.id, n.thedate, p.title
from table1 p
join table1 n on p.title = n.title and p.thedate <= n.thedate and p.id <> n.id
where n.thedate::date - p.thedate::date < 15
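Note that with <= and p.id <> n.id, two rows sharing the same title and date match in both directions, so that pair is reported twice. A possible way around this is to break date ties on id. The sketch below tries that idea on the question's sample data, assuming Python's built-in sqlite3 module as a stand-in for PostgreSQL (so julianday() replaces the ::date subtraction) and one hypothetical extra row, id 6, that duplicates the date of id 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, title TEXT, thedate TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [
        (1, "Title 1", "2021-01-01"),
        (2, "Title 2", "2020-12-24"),
        (3, "Title 3", "2021-02-14"),
        (4, "Title 2", "2021-05-01"),
        (5, "Title 1", "2021-01-13"),
        (6, "Title 3", "2021-02-14"),  # hypothetical row: same title and date as id 3
    ],
)

# p is the earlier row of each pair; equal dates are ordered by id,
# so every pair shows up exactly once.
pairs = conn.execute(
    """
    SELECT p.id, n.id, p.title
    FROM table1 AS p
    JOIN table1 AS n
      ON p.title = n.title
     AND (p.thedate < n.thedate
          OR (p.thedate = n.thedate AND p.id < n.id))
    WHERE julianday(n.thedate) - julianday(p.thedate) < 15
    ORDER BY p.id
    """
).fetchall()
print(pairs)
```

In PostgreSQL the same join condition should work unchanged, with the WHERE clause kept as n.thedate::date - p.thedate::date < 15.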

Delete Duplicate Data on PostgreSQL

How do I delete duplicate data from a table that holds data like this?
I want to keep only the row with the latest updated_at for each attribute id.
For example:
attribute id | created at | product_id
1 | 2020-04-28 15:31:11 | 112235
4 | 2020-04-28 15:30:25 | 112235
1 | 2020-04-29 15:30:25 | 112236
4 | 2020-04-29 15:30:25 | 112236
You can use an EXISTS condition.
delete from the_table t1
where exists (select *
from the_table t2
where t2.created_at > t1.created_at
and t2.attribute_id = t1.attribute_id);
This will delete every row for which another row with the same attribute_id and a greater created_at value exists (thus keeping only the row with the highest created_at for each attribute_id). Note that if the highest created_at value is shared by two rows, both of those rows are kept for that attribute_id.
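The behaviour, including the tie case, can be checked with a small experiment. This is a sketch only, assuming Python's built-in sqlite3 module as a stand-in for PostgreSQL; the two rows for attribute 7 are hypothetical and were added to show a tie.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE the_table (attribute_id INTEGER, created_at TEXT, product_id INTEGER)"
)
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?, ?)",
    [
        (1, "2020-04-28 15:31:11", 112235),
        (4, "2020-04-28 15:30:25", 112235),
        (1, "2020-04-29 15:30:25", 112236),
        (4, "2020-04-29 15:30:25", 112236),
        (7, "2020-05-01 10:00:00", 112237),  # hypothetical tie on created_at
        (7, "2020-05-01 10:00:00", 112238),  # hypothetical tie on created_at
    ],
)

# Same EXISTS condition as above; the outer table is referenced by name
# instead of the t1 alias.
conn.execute(
    """
    DELETE FROM the_table
    WHERE EXISTS (SELECT 1
                  FROM the_table AS t2
                  WHERE t2.attribute_id = the_table.attribute_id
                    AND t2.created_at > the_table.created_at)
    """
)

remaining = conn.execute(
    "SELECT attribute_id, created_at, product_id FROM the_table "
    "ORDER BY attribute_id, created_at, product_id"
).fetchall()
print(remaining)
```

Attributes 1 and 4 are reduced to their newest row, while both tied rows for attribute 7 survive.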

How do I write postgres conditional SELECT query?

I have a table that has 4 columns.
id | name | score | approve
--------------------
1 | foo | 90 | f
2 | foo | 80 | t
I want to
SELECT id WHERE name='foo'
with these conditions:
if approve is True, then return that one (only one will be true for the same name)
otherwise select the one that has highest score
I was looking into IF...ELSE but cannot even come up with a query that executes (let alone one that works...)
How to set up the query command for this type of queries?
In SQL, you can often use some logic by defining the right order and limit:
select id
from my_table
where name = 'foo'
order by approve desc, score desc
limit 1
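The ordering works because true sorts after false, so approve desc puts an approved row first and score desc only decides among the rest. A minimal sketch to try it out, assuming Python's built-in sqlite3 module as a stand-in for PostgreSQL (booleans stored as 0/1); pick_id and the 'bar' rows are hypothetical additions:

```python
import sqlite3


def pick_id(conn, name):
    """Return the approved row's id for a name, else the highest-scoring one."""
    row = conn.execute(
        "SELECT id FROM my_table WHERE name = ? "
        "ORDER BY approve DESC, score DESC LIMIT 1",
        (name,),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY, name TEXT, score REAL, approve INTEGER)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [
        (1, "foo", 90, 0),
        (2, "foo", 80, 1),  # approved, wins despite the lower score
        (3, "bar", 70, 0),  # hypothetical: no approved row for 'bar'
        (4, "bar", 85, 0),
    ],
)

foo_id = pick_id(conn, "foo")
bar_id = pick_id(conn, "bar")
print(foo_id, bar_id)
```

One caveat for PostgreSQL: with desc, NULLs sort first by default, so if approve can be NULL, adding nulls last to that sort key may be needed.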

Sum with different condition for every line

In my Postgresql 9.3 database I have a table stock_rotation:
+----+-----------------+---------------------+------------+---------------------+
| id | quantity_change | stock_rotation_type | article_id | date |
+----+-----------------+---------------------+------------+---------------------+
| 1 | 10 | PURCHASE | 1 | 2010-01-01 15:35:01 |
| 2 | -4 | SALE | 1 | 2010-05-06 08:46:02 |
| 3 | 5 | INVENTORY | 1 | 2010-12-20 08:20:35 |
| 4 | 2 | PURCHASE | 1 | 2011-02-05 16:45:50 |
| 5 | -1 | SALE | 1 | 2011-03-01 16:42:53 |
+----+-----------------+---------------------+------------+---------------------+
Types:
SALE has negative quantity_change
PURCHASE has positive quantity_change
INVENTORY resets the actual number in stock to the given value
In this implementation, to get the current value that an article has in stock, you need to sum up all quantity changes since the latest INVENTORY for the specific article (including the inventory value). I do not know why it is implemented this way and unfortunately it would be quite hard to change this now.
My question now is how to do this for more than a single article at once.
My latest attempt was this:
WITH latest_inventory_of_article as (
SELECT MAX(date)
FROM stock_rotation
WHERE stock_rotation_type = 'INVENTORY'
)
SELECT a.id, sum(quantity_change)
FROM stock_rotation sr
INNER JOIN article a ON a.id = sr.article_id
WHERE sr.date >= (COALESCE(
(SELECT date FROM latest_inventory_of_article),
'1970-01-01'
))
GROUP BY a.id
But the date for the latest stock_rotation of type INVENTORY can be different for every article.
I was trying to avoid looping over multiple article ids to find this date.
In this case I would use a separate subquery to get the latest inventory date per article. This reads stock_rotation twice, but it should work (if the table is too big for that, you may need a different approach):
SELECT sr.article_id, sum(quantity_change)
FROM stock_rotation sr
LEFT JOIN (
SELECT article_id, MAX(date) AS date
FROM stock_rotation
WHERE stock_rotation_type = 'INVENTORY'
GROUP BY article_id) AS latest_inventory
ON latest_inventory.article_id = sr.article_id
WHERE sr.date >= COALESCE(latest_inventory.date, '1970-01-01')
GROUP BY sr.article_id
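To see what the LEFT JOIN and COALESCE buy you, the query can be run against the question's sample rows plus an article that never had an inventory. This is a sketch, assuming Python's built-in sqlite3 module as a stand-in for PostgreSQL (timestamps stored as text, which compare correctly in this format); the rows for article 2 are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock_rotation ("
    "id INTEGER PRIMARY KEY, quantity_change INTEGER, "
    "stock_rotation_type TEXT, article_id INTEGER, date TEXT)"
)
conn.executemany(
    "INSERT INTO stock_rotation VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, "PURCHASE", 1, "2010-01-01 15:35:01"),
        (2, -4, "SALE", 1, "2010-05-06 08:46:02"),
        (3, 5, "INVENTORY", 1, "2010-12-20 08:20:35"),
        (4, 2, "PURCHASE", 1, "2011-02-05 16:45:50"),
        (5, -1, "SALE", 1, "2011-03-01 16:42:53"),
        (6, 3, "PURCHASE", 2, "2011-01-10 09:00:00"),  # hypothetical article 2,
        (7, -1, "SALE", 2, "2011-01-11 10:00:00"),     # which has no INVENTORY row
    ],
)

stock = conn.execute(
    """
    SELECT sr.article_id, SUM(sr.quantity_change)
    FROM stock_rotation AS sr
    LEFT JOIN (
        SELECT article_id, MAX(date) AS date
        FROM stock_rotation
        WHERE stock_rotation_type = 'INVENTORY'
        GROUP BY article_id
    ) AS latest_inventory
      ON latest_inventory.article_id = sr.article_id
    WHERE sr.date >= COALESCE(latest_inventory.date, '1970-01-01')
    GROUP BY sr.article_id
    ORDER BY sr.article_id
    """
).fetchall()
print(stock)
```

Article 1 is summed from its latest inventory onwards (5 + 2 - 1), and article 2 is summed over all of its rows because the COALESCE falls back to the 1970 date.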
You can use DISTINCT ON together with ORDER BY to get the latest INVENTORY row for each article_id in the WITH clause.
Then you can join that with the original table to get all later rows and add the values:
WITH latest_inventory as (
SELECT DISTINCT ON (article_id) id, article_id, date
FROM stock_rotation
WHERE stock_rotation_type = 'INVENTORY'
ORDER BY article_id, date DESC
)
SELECT article_id, sum(sr.quantity_change)
FROM stock_rotation sr
JOIN latest_inventory li USING (article_id)
WHERE sr.date >= li.date
GROUP BY article_id;
Here is my take on it: First, build the list of products at their last inventory state, using a window function. Then, join it back to the entire list, filtering on operations later than the inventory date for the item.
with initial_inventory as
(
select article_id, date, quantity_change from
(select article_id, date, quantity_change, rank() over (partition by article_id order by date desc) as rnk
from stock_rotation
where stock_rotation_type = 'INVENTORY'
) a
where rnk = 1
)
-- left join keeps articles that have no movements after their latest inventory
select ii.article_id, ii.quantity_change + coalesce(sum(sr.quantity_change), 0)
from initial_inventory ii
left join stock_rotation sr on ii.article_id = sr.article_id and sr.date > ii.date
group by ii.article_id, ii.quantity_change

Update Count column in Postgresql

I have a single table laid out as such:
id | name | count
1 | John |
2 | Jim |
3 | John |
4 | Tim |
I need to fill out the count column such that the result is the number of times the specific name shows up in the column name.
The result should be:
id | name | count
1 | John | 2
2 | Jim | 1
3 | John | 2
4 | Tim | 1
I can get the count of occurrences of unique names easily using:
SELECT COUNT(name)
FROM table
GROUP BY name
But that doesn't fit into an UPDATE statement due to it returning multiple rows.
I can also get it narrowed down to a single row by doing this:
SELECT COUNT(name)
FROM table
WHERE name = 'John'
GROUP BY name
But that doesn't allow me to fill out the entire column, just the 'John' rows.
You can do that with a common table expression:
with counted as (
select name, count(*) as name_count
from the_table
group by name
)
update the_table
set "count" = c.name_count
from counted c
where c.name = the_table.name;
Another (slower) option would be to use a correlated subquery:
update the_table
set "count" = (select count(*)
from the_table t2
where t2.name = the_table.name);
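The correlated variant is easy to try end to end. A minimal sketch on the question's sample data, assuming Python's built-in sqlite3 module as a stand-in for PostgreSQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE the_table (id INTEGER PRIMARY KEY, name TEXT, "count" INTEGER)'
)
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?, ?)",
    [(1, "John", None), (2, "Jim", None), (3, "John", None), (4, "Tim", None)],
)

# For every row, count how many rows share its name and store the result.
conn.execute(
    """
    UPDATE the_table
    SET "count" = (SELECT count(*)
                   FROM the_table AS t2
                   WHERE t2.name = the_table.name)
    """
)

rows = conn.execute('SELECT id, name, "count" FROM the_table ORDER BY id').fetchall()
print(rows)
```

The subquery is evaluated once per updated row, which is why this form tends to be slower than the set-based update on large tables.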
But in general it is a bad idea to store values that can easily be calculated on the fly:
select id,
name,
count(*) over (partition by name) as name_count
from the_table;
Another method: using a derived table
UPDATE tb
SET count = t.count
FROM (
SELECT count(NAME)
,NAME
FROM tb
GROUP BY 2
) t
WHERE t.NAME = tb.NAME