How to drop rows if a variable is less than x, in SQL - PostgreSQL

I have the following query code
query = """
with double_entry_book as (
SELECT to_address as address, value as value
FROM `bigquery-public-data.crypto_ethereum.traces`
WHERE to_address is not null
AND block_timestamp < '2022-01-01 00:00:00'
AND status = 1
AND (call_type not in ('delegatecall', 'callcode', 'staticcall') or call_type is null)
union all
-- credits
SELECT from_address as address, -value as value
FROM `bigquery-public-data.crypto_ethereum.traces`
WHERE from_address is not null
AND block_timestamp < '2022-01-01 00:00:00'
AND status = 1
AND (call_type not in ('delegatecall', 'callcode', 'staticcall') or call_type is null)
)
SELECT address,
sum(value) / 1000000000000000000 as balance
from double_entry_book
group by address
order by balance desc
LIMIT 15000000
"""
In the last part, I want to drop rows where "balance" is less than, say, 0.02, and then group, order, etc. I imagine this should be simple. Any help would be appreciated!

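We can delete in a CTE and use RETURNING to get the ids of the rows being deleted. The deleted rows are still visible to the main query of the same statement, because all parts of a statement see the same snapshot; they are gone once the statement has completed.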
CREATE TABLE t (
id serial,
variale int);
insert into t (variale) values
(1),(2),(3),(4),(5);
5 rows affected
with del as
(delete from t
where variale < 3
returning id)
select
t.id,
t.variale,
del.id ids_being_deleted
from t
left join del
on t.id = del.id;
id | variale | ids_being_deleted
-: | ------: | ----------------:
1 | 1 | 1
2 | 2 | 2
3 | 3 | null
4 | 4 | null
5 | 5 | null
select * from t;
id | variale
-: | ------:
3 | 3
4 | 4
5 | 5
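For the query in the question, where balance is an aggregate (sum(value)), the small balances can also be filtered out without deleting anything, by adding a HAVING clause after GROUP BY. The following is only a minimal sketch, not the BigQuery query itself: it assumes Python's built-in sqlite3 module, a made-up double_entry_book table whose values are already expressed in whole units, and the 0.02 threshold from the question.

```python
import sqlite3

# Hypothetical stand-in for the double_entry_book CTE from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE double_entry_book (address TEXT, value REAL)")
conn.executemany(
    "INSERT INTO double_entry_book VALUES (?, ?)",
    [("a", 5.0), ("a", -1.0), ("b", 0.01), ("c", 3.0), ("c", -2.995), ("d", 1.5)],
)

# HAVING filters on the aggregated balance, after GROUP BY has been applied.
rows = conn.execute(
    """
    SELECT address, SUM(value) AS balance
    FROM double_entry_book
    GROUP BY address
    HAVING SUM(value) >= 0.02
    ORDER BY balance DESC
    """
).fetchall()

print(rows)  # addresses "b" and "c" are dropped because their balance is below 0.02
```

In the original query the condition would presumably read `having sum(value) / 1000000000000000000 >= 0.02`, placed between `group by address` and `order by balance desc`.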

Related

Get different LIMIT on each group on postgresql rank

To get 2 rows from each group I can use ROW_NUMBER() with the condition <= 2, but what if I want a different limit for each group, e.g. 3 rows for section_id 1, 1 row for section_id 2 and 1 row for section_id 3?
Given the following table:
db=# SELECT * FROM xxx;
id | section_id | name
----+------------+------
1 | 1 | A
2 | 1 | B
3 | 1 | C
4 | 1 | D
5 | 2 | E
6 | 2 | F
7 | 3 | G
8 | 2 | H
(8 rows)
I get the first 2 rows (ordered by name) for each section_id, i.e. a result similar to:
id | section_id | name
----+------------+------
1 | 1 | A
2 | 1 | B
5 | 2 | E
6 | 2 | F
7 | 3 | G
(5 rows)
Current Query:
SELECT
*
FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY section_id ORDER BY name) AS r,
t.*
FROM
xxx t) x
WHERE
x.r <= 2;
Create a table to contain the section limits, then join. The big advantage is that when new sections are added or limits change, maintenance is reduced to a single table update, which comes at very little cost. See the example:
select s.section_id, s.name
from (select section_id, name
, row_number() over (partition by section_id order by name) rn
from sections
) s
left join section_limits sl on (sl.section_id = s.section_id)
where
s.rn <= coalesce(sl.limit_to,2);
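The limits-table idea can be tried against the question's sample data roughly as follows. This is a sketch under assumptions: it uses Python's sqlite3 module (window functions need SQLite 3.25 or later) instead of PostgreSQL, the table xxx from the question, and a section_limits table whose name and limit_to column follow the query above but whose contents are made up here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xxx (id INTEGER, section_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO xxx VALUES (?, ?, ?)",
    [(1, 1, "A"), (2, 1, "B"), (3, 1, "C"), (4, 1, "D"),
     (5, 2, "E"), (6, 2, "F"), (7, 3, "G"), (8, 2, "H")],
)

# Hypothetical limits: 3 rows for section 1, 1 row for section 2.
# Section 3 has no entry, so it falls back to the default of 2.
conn.execute("CREATE TABLE section_limits (section_id INTEGER, limit_to INTEGER)")
conn.executemany("INSERT INTO section_limits VALUES (?, ?)", [(1, 3), (2, 1)])

rows = conn.execute(
    """
    SELECT s.id, s.section_id, s.name
    FROM (SELECT id, section_id, name,
                 ROW_NUMBER() OVER (PARTITION BY section_id ORDER BY name) AS rn
          FROM xxx) AS s
    LEFT JOIN section_limits AS sl ON sl.section_id = s.section_id
    WHERE s.rn <= COALESCE(sl.limit_to, 2)
    ORDER BY s.section_id, s.name
    """
).fetchall()

for row in rows:
    print(row)
```

Changing a limit then only requires an UPDATE on section_limits; the query itself stays the same.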
Just fix up your where clause:
with numbered as (
select row_number() over (partition by section_id
order by name) as r,
t.*
from xxx t
)
select *
from numbered
where (section_id = 1 and r <= 3)
or (section_id = 2 and r <= 1)
or (section_id = 3 and r <= 1);

Count the number of consecutive entries fulfilling a condition within a GROUP BY

I've got a list of users who are behind on their bills, and I want to generate an entry for each of them that says how many consecutive bills they've been behind on. So here's the table:
user | bill_date | outstanding_balance
---------------------------------------
a | 2017-03-01 | 90
a | 2016-12-01 | 60
a | 2016-09-01 | 30
b | 2017-03-01 | 50
b | 2016-12-01 | 0
b | 2016-09-01 | 40
c | 2017-03-01 | 0
c | 2016-12-01 | 0
c | 2016-09-01 | 1
And I want a query that would generate the following table:
user | consecutive_billing_periods_behind
-----------------------------------------
a | 3
b | 1
c | 0
In other words, if you've paid up at any point, I want to ignore all of the earlier entries, and only count how many billing periods you've been behind since you've been last paid up. How do I do this most simply?
If I understood the question correctly, you first need to find the last date on which any given customer had paid their bill, i.e. the last date their outstanding balance was 0. You can start from this subquery, which lists every date on which a user's balance was 0:
(SELECT
user1,
bill_date AS no_outstanding_bill_date
FROM table1
WHERE outstanding_balance = 0)
Then you need to get the last bill date and add a field to each row that flags whether the bill is outstanding. After that, keep only the rows between the last clear day and the last bill date of each customer with this WHERE clause:
WHERE bill_date >= last_clear_day AND bill_date <= last_bill_date
If you put the pieces together, you get the result with this query:
SELECT
DISTINCT
user1,
sum(is_outstanding_bill)
OVER (
PARTITION BY user1 ) AS consecutive_billing_periods_behind
FROM (
SELECT
user1,
last_value(bill_date)
OVER (
PARTITION BY user1
ORDER BY bill_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS last_bill_date,
CASE WHEN outstanding_balance > 0
THEN 1
ELSE 0 END AS is_outstanding_bill,
bill_date,
outstanding_balance,
nvl(max(t2.no_outstanding_bill_date)
OVER (
PARTITION BY user1 ), min(bill_date)
OVER (
PARTITION BY user1 )) AS last_clear_day
FROM table1 t1
LEFT JOIN (SELECT
user1,
bill_date AS no_outstanding_bill_date
FROM table1
WHERE outstanding_balance = 0) t2 USING (user1)
) table2
WHERE bill_date >= last_clear_day AND bill_date <= last_bill_date
Since we are using distinct you will not need the group by clause.
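select
user,
count(case when min_balance > 0 then 1 end)
as consecutive_billing_periods_behind
from
(
select
user,
-- running minimum from the newest bill backwards: it stays above zero
-- only until the most recent bill with a zero balance
min(outstanding_balance)
over (partition by user order by bill_date desc) as min_balance
from tbl
) t
group by user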
Or:
select
user,
count(*)
as consecutive_billing_periods_behind
from
(
select
user,
bill_date,
max(case when outstanding_balance = 0 then bill_date end) over
(partition by user)
as max_bill_date_with_zero_balance
from tbl
) t
where
-- If the user never had outstanding_balance = 0, count all rows.
max_bill_date_with_zero_balance is null
or
-- Otherwise count only the rows after the last zero balance.
bill_date > max_bill_date_with_zero_balance
group by user
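One way to check the running-minimum idea against the sample data is sketched below. A minimum taken from the newest bill backwards stays above zero exactly until the first paid-up bill, so counting the rows where it is positive gives the current streak. The sketch assumes Python's sqlite3 module (SQLite 3.25 or later for window functions) and renames the column to user_id, since user is a reserved word in PostgreSQL; the table name tbl is taken from the answers above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (user_id TEXT, bill_date TEXT, outstanding_balance INTEGER)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [("a", "2017-03-01", 90), ("a", "2016-12-01", 60), ("a", "2016-09-01", 30),
     ("b", "2017-03-01", 50), ("b", "2016-12-01", 0), ("b", "2016-09-01", 40),
     ("c", "2017-03-01", 0), ("c", "2016-12-01", 0), ("c", "2016-09-01", 1)],
)

rows = conn.execute(
    """
    SELECT user_id,
           COUNT(CASE WHEN min_balance > 0 THEN 1 END) AS periods_behind
    FROM (SELECT user_id,
                 -- running minimum, newest bill first
                 MIN(outstanding_balance)
                     OVER (PARTITION BY user_id ORDER BY bill_date DESC) AS min_balance
          FROM tbl) AS t
    GROUP BY user_id
    ORDER BY user_id
    """
).fetchall()

print(rows)
```

Because every user keeps at least one row in the subquery, a user who is fully paid up still appears, with a count of 0.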

Can window function LAG reference the column which value is being calculated?

I need to calculate the value of some column X based on other columns of the current record and on the value of X for the previous record (using some partition and order). Basically, I need to implement a query of the form
SELECT <some fields>,
<some expression using LAG(X) OVER (PARTITION BY ... ORDER BY ...)> AS X
FROM <table>
This is not possible because only existing columns can be used in a window function, so I'm looking for a way around this.
Here is an example. I have a table with events. Each event has type and time_stamp.
create table event (id serial, type integer, time_stamp integer);
I want to find "duplicate" events (to skip them). By duplicate I mean the following. Let's order all events of a given type by time_stamp ascending. Then
the first event is not a duplicate
all events that follow a non-duplicate and fall within some time frame after it (that is, their time_stamp is not greater than the time_stamp of the previous non-duplicate plus some constant TIMEFRAME) are duplicates
the next event whose time_stamp is greater than that of the previous non-duplicate by more than TIMEFRAME is not a duplicate
and so on
For this data
insert into event (type, time_stamp)
values
(1, 1), (1, 2), (2, 2), (1,3), (1, 10), (2,10),
(1,15), (1, 21), (2,13),
(1, 40);
and TIMEFRAME=10 result should be
time_stamp | type | duplicate
-----------------------------
1 | 1 | false
2 | 1 | true
3 | 1 | true
10 | 1 | true
15 | 1 | false
21 | 1 | true
40 | 1 | false
2 | 2 | false
10 | 2 | true
13 | 2 | false
I could calculate the value of duplicate field based on current time_stamp and time_stamp of the previous non-duplicate event like this:
WITH evt AS (
SELECT
time_stamp,
CASE WHEN
time_stamp - LAG(current_non_dupl_time_stamp) OVER w >= TIMEFRAME
THEN
time_stamp
ELSE
LAG(current_non_dupl_time_stamp) OVER w
END AS current_non_dupl_time_stamp
FROM event
WINDOW w AS (PARTITION BY type ORDER BY time_stamp ASC)
)
SELECT time_stamp, time_stamp != current_non_dupl_time_stamp AS duplicate
FROM evt
But this does not work because the field which is calculated cannot be referenced in LAG:
ERROR: column "current_non_dupl_time_stamp" does not exist.
So the question: can I rewrite this query to achieve the effect I need?
Naive recursive chain knitter (note that the query below uses a time frame of 20, not 10):
-- temp view to avoid nested CTE
CREATE TEMP VIEW drag AS
SELECT e.type,e.time_stamp
, ROW_NUMBER() OVER www as rn -- number the records
, FIRST_VALUE(e.time_stamp) OVER www as fst -- the "group leader"
, EXISTS (SELECT * FROM event x
WHERE x.type = e.type
AND x.time_stamp < e.time_stamp) AS is_dup
FROM event e
WINDOW www AS (PARTITION BY type ORDER BY time_stamp)
;
WITH RECURSIVE ttt AS (
SELECT d0.*
FROM drag d0 WHERE d0.is_dup = False -- only the "group leaders"
UNION ALL
SELECT d1.type, d1.time_stamp, d1.rn
, CASE WHEN d1.time_stamp - ttt.fst > 20 THEN d1.time_stamp
ELSE ttt.fst END AS fst -- new "group leader"
, CASE WHEN d1.time_stamp - ttt.fst > 20 THEN False
ELSE True END AS is_dup
FROM drag d1
JOIN ttt ON d1.type = ttt.type AND d1.rn = ttt.rn+1
)
SELECT * FROM ttt
ORDER BY type, time_stamp
;
Results:
CREATE TABLE
INSERT 0 10
CREATE VIEW
type | time_stamp | rn | fst | is_dup
------+------------+----+-----+--------
1 | 1 | 1 | 1 | f
1 | 2 | 2 | 1 | t
1 | 3 | 3 | 1 | t
1 | 10 | 4 | 1 | t
1 | 15 | 5 | 1 | t
1 | 21 | 6 | 1 | t
1 | 40 | 7 | 40 | f
2 | 2 | 1 | 2 | f
2 | 10 | 2 | 2 | t
2 | 13 | 3 | 2 | t
(10 rows)
An alternative to a recursive approach is a custom aggregate. Once you master the technique of writing your own aggregates, creating transition and final functions is easy and logical.
State transition function:
create or replace function is_duplicate(st int[], time_stamp int, timeframe int)
returns int[] language plpgsql as $$
begin
if st is null or st[1] + timeframe <= time_stamp
then
st[1] := time_stamp;
end if;
st[2] := time_stamp;
return st;
end $$;
Final function:
create or replace function is_duplicate_final(st int[])
returns boolean language sql as $$
select st[1] <> st[2];
$$;
Aggregate:
create aggregate is_duplicate_agg(time_stamp int, timeframe int)
(
sfunc = is_duplicate,
stype = int[],
finalfunc = is_duplicate_final
);
Query:
select *, is_duplicate_agg(time_stamp, 10) over w
from event
window w as (partition by type order by time_stamp asc)
order by type, time_stamp;
id | type | time_stamp | is_duplicate_agg
----+------+------------+------------------
1 | 1 | 1 | f
2 | 1 | 2 | t
4 | 1 | 3 | t
5 | 1 | 10 | t
7 | 1 | 15 | f
8 | 1 | 21 | t
10 | 1 | 40 | f
3 | 2 | 2 | f
6 | 2 | 10 | t
9 | 2 | 13 | f
(10 rows)
Read in the documentation: 37.10. User-defined Aggregates and CREATE AGGREGATE.
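The logic of the state transition function can be illustrated outside the database as well. The sketch below is plain Python, not part of the PostgreSQL solution; the function name mark_duplicates is made up for illustration. Like the aggregate, it keeps the time_stamp of the last non-duplicate per type and starts a new group once the current time_stamp reaches that value plus the time frame.

```python
from itertools import groupby


def mark_duplicates(events, timeframe):
    """Return (type, time_stamp, is_duplicate) for (type, time_stamp) pairs."""
    result = []
    # Sorting by (type, time_stamp) mirrors PARTITION BY type ORDER BY time_stamp.
    for event_type, group in groupby(sorted(events), key=lambda e: e[0]):
        anchor = None  # time_stamp of the last non-duplicate, like st[1]
        for _, time_stamp in group:
            if anchor is None or anchor + timeframe <= time_stamp:
                anchor = time_stamp  # this event starts a new group
                result.append((event_type, time_stamp, False))
            else:
                result.append((event_type, time_stamp, True))
    return result


events = [(1, 1), (1, 2), (2, 2), (1, 3), (1, 10), (2, 10),
          (1, 15), (1, 21), (2, 13), (1, 40)]
flags = [dup for _, _, dup in mark_duplicates(events, 10)]
print(flags)
```

The state has to be carried from row to row, which is why a plain LAG cannot express it and an aggregate with its own state (or recursion) is needed.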
This feels more like a recursive problem than a window-function problem. The following query produces the desired results:
WITH RECURSIVE base(type, time_stamp) AS (
-- 3. base of recursive query
SELECT x.type, x.time_stamp, y.next_time_stamp
FROM
-- 1. start with the initial records of each type
( SELECT type, min(time_stamp) AS time_stamp
FROM event
GROUP BY type
) x
LEFT JOIN LATERAL
-- 2. for each of the initial records, find the next TIMEFRAME (10) in the future
( SELECT MIN(time_stamp) next_time_stamp
FROM event
WHERE type = x.type
AND time_stamp > (x.time_stamp + 10)
) y ON true
UNION ALL
-- 4. recursive join, same logic as base
SELECT e.type, e.time_stamp, z.next_time_stamp
FROM event e
JOIN base b ON (e.type = b.type AND e.time_stamp = b.next_time_stamp)
LEFT JOIN LATERAL
( SELECT MIN(time_stamp) next_time_stamp
FROM event
WHERE type = e.type
AND time_stamp > (e.time_stamp + 10)
) z ON true
)
-- The actual query:
-- 5a. All records from base are not duplicates
SELECT time_stamp, type, false
FROM base
UNION
-- 5b. All records from event that are not in base are duplicates
SELECT time_stamp, type, true
FROM event
WHERE (type, time_stamp) NOT IN (SELECT type, time_stamp FROM base)
ORDER BY type, time_stamp
There are a lot of caveats with this. It assumes no duplicate time_stamp for a given type. Really, the joins should be based on a unique id rather than on type and time_stamp. I didn't test this much, but it may at least suggest an approach.
This is my first time trying a LATERAL join, so there may be a way to simplify that more. What I really wanted was a recursive CTE whose recursive part uses MIN(time_stamp) based on time_stamp > (x.time_stamp + 10), but aggregate functions are not allowed in the recursive part of a CTE in that manner. It seems the lateral join can be used there instead.

Iterate through rows, compare them against each other and store results in another table

I have a table that contains the following rows:
product_id | order_date
A | 12/04/12
A | 01/11/13
A | 01/21/13
A | 03/05/13
B | 02/14/13
B | 03/09/13
What I now need is an overview for each month: how many products have been bought for the first time (= have not been bought the month before), how many are existing products (= have been bought the month before) and how many have not been purchased within a given month. Taking the sample above as input, the script should deliver the following result, regardless of the period of time covered by the data:
month | new | existing | nopurchase
12/2012 | 1 | 0 | 0
01/2013 | 0 | 1 | 0
02/2013 | 1 | 0 | 1
03/2013 | 1 | 1 | 0
Would be great to get a first hint how this could be solved so I'm able to continue.
Thanks!
with t as (
select product_id pid, date_trunc('month', order_date)::date od
from t
group by 1, 2
)
select od,
sum(is_new::integer) "new",
sum(is_existing::integer) existing,
sum(not_purchased::integer) nopurchase
from (
select od,
lag(t_pid) over(partition by s_pid order by od) is null and t_pid is not null is_new,
lag(t_pid) over(partition by s_pid order by od) is not null and t_pid is not null is_existing,
lag(t_pid) over(partition by s_pid order by od) is not null and t_pid is null not_purchased
from (
select t.pid t_pid, s.pid s_pid, s.od
from
t
right join
(
select pid, s.od
from
t
cross join
(
select date_trunc('month', d)::date od
from
generate_series(
(select min(od) from t),
(select max(od) from t),
'1 month'
) s(d)
) s
group by pid, s.od
) s on t.od = s.od and t.pid = s.pid
) s
) s
group by 1
order by 1
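The classification the query performs can be sketched in plain Python as follows. This is only an illustration with hypothetical names (monthly_overview, month_index), and it assumes the sample dates are in month/day/year order. For every product and every month between the first and the last order, a purchase in the current month is compared with a purchase in the previous month, which is what lag(t_pid) does in the query.

```python
from datetime import date


def month_index(d):
    """Number months consecutively so that month - 1 is always the previous month."""
    return d.year * 12 + d.month - 1


def monthly_overview(orders):
    bought = {(pid, month_index(d)) for pid, d in orders}
    products = sorted({pid for pid, _ in bought})
    months = [m for _, m in bought]
    overview = {}
    for m in range(min(months), max(months) + 1):
        new = existing = nopurchase = 0
        for pid in products:
            now, before = (pid, m) in bought, (pid, m - 1) in bought
            if now and not before:
                new += 1
            elif now and before:
                existing += 1
            elif before:  # bought last month, but not this month
                nopurchase += 1
        overview[f"{m % 12 + 1:02d}/{m // 12}"] = (new, existing, nopurchase)
    return overview


orders = [("A", date(2012, 12, 4)), ("A", date(2013, 1, 11)),
          ("A", date(2013, 1, 21)), ("A", date(2013, 3, 5)),
          ("B", date(2013, 2, 14)), ("B", date(2013, 3, 9))]
overview = monthly_overview(orders)
for month, counts in overview.items():
    print(month, counts)
```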

T-SQL: How to use GROUP BY and get the value which exceeds 60%?

Sorry for the bad title, I don't know how to describe my problem.
I have the following table:
| ItemID | Date |
-------------------------
| 1 | 01.01.10 |
| 1 | 03.01.10 |
| 1 | 05.01.10 |
| 1 | 06.01.10 |
| 1 | 10.01.10 |
| 2 | 05.01.10 |
| 2 | 10.01.10 |
| 2 | 20.01.10 |
Now I want to GROUP BY ItemID, and for the date I want to get the value which exceeds 60%. What I mean is that for item 1 I have five rows, so each has a share of 20%, and for item 2 I have three rows, so each has a share of 33.33%. So for item 1 I need the 3rd and for item 2 the 2nd value, so that the result looks like this:
| ItemID | Date |
-------------------------
| 1 | 06.01.10 |
| 2 | 10.01.10 |
Is there an easy way to get this data? Maybe using OVER?
Thank you
Torben
with NumItems as
( select itemID, count(*) as numOfItems from table group by itemID)
),
rowNums as
(
select itemID,Date, row_number() over (partition by ItemID order by date asc) as rowNum
from table
)
select itemID, min(Date) from
rowNums a inner join NumItems b on a.itemID = b.ItemID
where cast(b.rowNum as float) / cast(numOfItems as float) >= 0.6
group by itemID
That should do it, although I am certain it can be written with only one table scan. It should work nicely, though.
The provided script contained a few errors. Below is a working one:
with NumItems as
(
select itemID, count(*) as numOfItems from table group by itemID
),
rowNums as
(
select itemID, Date, row_number() over (partition by ItemID order by date asc) as rowNum
from table
)
select a.itemID, min(a.Date) from
rowNums a inner join NumItems b on a.itemID = b.ItemID
where cast(a.rowNum as float) / cast(numOfItems as float) >= 0.6
group by a.itemID
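The row-number-over-count idea can be checked on the sample data with a short Python sketch. The function name first_date_reaching and its strict parameter are made up for illustration, and the dates are assumed to be in day.month.year order. The comparison uses integers (rank * 100 against percent * count) to avoid floating-point rounding at exactly 60%.

```python
from collections import defaultdict
from datetime import date


def first_date_reaching(rows, percent=60, strict=False):
    """Per item, return the earliest date whose rank/count reaches the threshold."""
    by_item = defaultdict(list)
    for item_id, d in rows:
        by_item[item_id].append(d)
    result = {}
    for item_id, dates in by_item.items():
        dates.sort()
        count = len(dates)
        for rank, d in enumerate(dates, start=1):
            # rank / count >= percent / 100, rewritten without division
            if strict:
                reached = rank * 100 > percent * count
            else:
                reached = rank * 100 >= percent * count
            if reached:
                result[item_id] = d
                break
    return result


rows = [(1, date(2010, 1, 1)), (1, date(2010, 1, 3)), (1, date(2010, 1, 5)),
        (1, date(2010, 1, 6)), (1, date(2010, 1, 10)),
        (2, date(2010, 1, 5)), (2, date(2010, 1, 10)), (2, date(2010, 1, 20))]
print(first_date_reaching(rows))
print(first_date_reaching(rows, strict=True))
```

With >= (as in the queries above) item 1 yields its 3rd date, 05.01.10, which matches the question's wording; with a strict > it yields 06.01.10, which is the value shown in the question's result table.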