Select the maximum rows of sorted subgroups - postgresql

Using PostgreSQL 11, I have a table containing a DAY and MONTH_TO_DAY entry for each day of every month. I would like to select the most recent MONTH_TO_DAY entry for each account.
My table is:
+------+------------+--------------+------------+--------------------------+
|id |account |code |interval |timestamp |
+------+------------+--------------+------------+--------------------------+
|387276|ALPBls6EsP |52 |MONTH_TO_DAY|2020-09-01 01:05:00.000000|
|387275|ALPBls6EsP |52 |DAY |2020-09-01 01:05:00.000000|
|387272|YkON8lk8A8 |25 |MONTH_TO_DAY|2020-09-01 01:05:00.000000|
|387271|YkON8lk8A8 |25 |DAY |2020-08-01 01:05:00.000000|
|387273|ALPBls6EsP |32 |MONTH_TO_DAY|2020-08-31 01:05:00.000000|
|387274|ALPBls6EsP |32 |DAY |2020-08-31 01:05:00.000000|
|387272|ALPBls6EsP |27 |MONTH_TO_DAY|2020-08-30 01:05:00.000000|
|387271|ALPBls6EsP |27 |DAY |2020-08-30 01:05:00.000000|
+------+------------+--------------+------------+--------------------------+
If it helps, the entries are always in descending order timewise.
In a query asking for all accounts, since the 31st is the last day of month 08 and the 1st is the most recent entry of month 09, my expected output would be:
+------+------------+--------------+------------+--------------------------+
|id |account |code |interval |timestamp |
+------+------------+--------------+------------+--------------------------+
|387276|ALPBls6EsP |52 |MONTH_TO_DAY|2020-09-01 01:05:00.000000|
|387272|YkON8lk8A8 |25 |MONTH_TO_DAY|2020-09-01 01:05:00.000000|
|387273|ALPBls6EsP |32 |MONTH_TO_DAY|2020-08-31 01:05:00.000000|
+------+------------+--------------+------------+--------------------------+
I was thinking I'd group entries by month (truncating the day and time), and then select the row with the maximum timestamp in each group. I can get the right timestamps with this, but I can't figure out how to get any of the other fields:
SELECT max(timestamp)
FROM mytable
GROUP BY date_trunc('month', mytable.timestamp);
I also thought I could use distinct on, something like the below, but I'm not too familiar with distinct on or date_trunc and I can't figure out how to use them together:
SELECT distinct on (timestamp)
*
FROM mytable
ORDER BY date_trunc('month', mytable.timestamp)

You do want distinct on, but you want to apply it to the account:
select distinct on (account) *
from mytable
where interval = 'MONTH_TO_DAY'
order by account, timestamp desc;
If you want the latest by account by month, then this should work:
select distinct on (date_trunc('month', timestamp), account) *
from mytable
where interval = 'MONTH_TO_DAY'
order by date_trunc('month', timestamp), account, timestamp desc;
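If distinct on feels opaque, the same result can be had with row_number(); a sketch of the equivalent month-by-month query, assuming the same table and columns:
select id, account, code, interval, timestamp
from (
    select t.*,
           row_number() over (partition by date_trunc('month', timestamp), account
                              order by timestamp desc) as rn
    from mytable t
    where interval = 'MONTH_TO_DAY'
) ranked
where rn = 1;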

Related

DB2 - difference between two timestamp in days

How do I get the exact difference between two timestamps in days in DB2? I mean, if one date is FROM_DATE = 5/6/2015 2:22:27.000000 PM and TO_DATE = 3/30/2015 2:33:52.000000 PM, then the timestamp difference should show 36 days. I tried using the formula below:
((24*DAYS(From_Date)+MIDNIGHT_SECONDS(From_Date)/3600) -
(24*DAYS(To_Date)+MIDNIGHT_SECONDS(To_Date)/3600))/(24)
But this is giving me the difference as 37 days.
What about
SELECT days_between ('2015-05-06-02.22.27.000000', '2015-03-30-02.33.52.000000')
FROM SYSIBM.sysdummy1
It returns 36.
Wrong formula. MIDNIGHT_SECONDS(...)/3600 is integer division, so minutes and seconds are thrown away before the subtraction; both example times truncate to hour 14, the few minutes' shortfall disappears, and the hour-based formula reports a full 37 days. Counting whole seconds first keeps the precision. Check out the following (D1 is the hour-based formula, D2 the seconds-based one):
SELECT
    FROM_DATE, TO_DATE,
    (
      (24*DAYS(From_Date) + MIDNIGHT_SECONDS(From_Date)/3600)
    - (24*DAYS(To_Date)   + MIDNIGHT_SECONDS(To_Date)/3600)
    )/24 AS D1,
    (
      (DAYS(From_Date)*bigint(86400) + MIDNIGHT_SECONDS(From_Date))
    - (DAYS(To_Date)  *bigint(86400) + MIDNIGHT_SECONDS(To_Date))
    )/86400 AS D2
FROM
(
    VALUES
      (TIMESTAMP('2015-05-06-14.22.27'), TIMESTAMP('2015-03-30-14.33.52'))
    , (TIMESTAMP('2015-03-31-14.22.27'), TIMESTAMP('2015-03-30-14.33.52'))
    , (TIMESTAMP('2015-04-01-14.22.27'), TIMESTAMP('2015-03-30-14.33.52'))
) T(FROM_DATE, TO_DATE);
|FROM_DATE |TO_DATE |D1 |D2 |
|--------------------------|--------------------------|-----------|--------------------|
|2015-05-06-14.22.27.000000|2015-03-30-14.33.52.000000|37 |36 |
|2015-03-31-14.22.27.000000|2015-03-30-14.33.52.000000|1 |0 |
|2015-04-01-14.22.27.000000|2015-03-30-14.33.52.000000|2 |1 |
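For reference, the seconds-based arithmetic works as a standalone expression too; a minimal sketch with the original example timestamps (just D2 pulled out of the query above, returning 36):
SELECT
  (   DAYS(TIMESTAMP('2015-05-06-14.22.27')) * bigint(86400)
    + MIDNIGHT_SECONDS(TIMESTAMP('2015-05-06-14.22.27'))
    - DAYS(TIMESTAMP('2015-03-30-14.33.52')) * bigint(86400)
    - MIDNIGHT_SECONDS(TIMESTAMP('2015-03-30-14.33.52'))
  ) / 86400 AS full_days
FROM SYSIBM.SYSDUMMY1;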

I need a type of group-sort that I couldn't figure out with ROW_NUMBER on T-SQL

I have a table with an id column and two other columns. I want a numbering like the ROW_NUMBER function gives, with the result looking like this:
id |col1 |col2 |what I want
------------------------------
1 |x |a |1
2 |x |b |2
3 |x |a |3
4 |x |a |3
5 |x |c |4
6 |x |c |4
7 |x |c |4
Please consider that:
there's only one x, so "partition by col1" is OK;
there are two separate runs of a's, and they must be counted separately (not 1,2,1,1,3,3,3);
sorting must be by id, not by col2 (so "order by col2" is NOT OK).
I want the number to increase by one any time col2 changes compared to the previous row.
row_number() over (partition by col1 order by col2) DOESN'T WORK, because I need the ordering by id.
Using LAG and a windowed COUNT appears to get you what you are after:
WITH Previous AS (
    SELECT V.id,
           V.col1,
           V.col2,
           V.[What I want],
           --Default the first row's "previous" value to its own col2
           LAG(V.col2, 1, V.col2) OVER (ORDER BY V.id ASC) AS PrevCol2
    FROM (VALUES (1, 'x', 'a', 1),
                 (2, 'x', 'b', 2),
                 (3, 'x', 'a', 3),
                 (4, 'x', 'a', 3),
                 (5, 'x', 'c', 4),
                 (6, 'x', 'c', 4),
                 (7, 'x', 'c', 4)) V (id, col1, col2, [What I want]))
SELECT P.id,
       P.col1,
       P.col2,
       P.[What I want],
       --Count the rows where col2 differs from the previous row's col2;
       --the running count steps up exactly when a new run starts
       COUNT(CASE P.col2 WHEN P.PrevCol2 THEN NULL ELSE 1 END) OVER (ORDER BY P.id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) + 1 AS [What you get]
FROM Previous P;
DB<>Fiddle
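For what it's worth, the same change-counting idea is often written with a windowed SUM over a 0/1 flag instead of COUNT over NULLs; a sketch under the same assumptions (mytable stands in for your table; SQL Server 2012+ for LAG):
WITH Previous AS (
    SELECT id, col1, col2,
           LAG(col2, 1, col2) OVER (ORDER BY id) AS PrevCol2
    FROM mytable
)
SELECT id, col1, col2,
       --Flag each row that starts a new run of col2 values, then running-sum the flags
       1 + SUM(CASE WHEN col2 = PrevCol2 THEN 0 ELSE 1 END)
               OVER (ORDER BY id ROWS UNBOUNDED PRECEDING) AS grp
FROM Previous;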

Querying historical balance data from a transactions table

I need help getting the total balance of all customers on a daily basis, as if I were backtracking through the data.
I have the following table structures in a Postgres database:
Table1: accounts (acc)
|id|acc_created|
|1 |2019-01-01 |
|2 |2019-01-01 |
|3 |2019-01-01 |
Table2: transactions
|transaction_id|acc_id|balance|txn_created |
|1 |1 |100 |2019-01-01 07:00:00|
|2 |1 |50 |2019-01-01 16:32:10|
|3 |1 |25 |2019-01-01 22:10:59|
|4 |2 |200 |2019-01-02 18:34:22|
|5 |3 |150 |2019-01-02 15:09:43|
|6 |1 |125 |2019-01-04 04:52:31|
|7 |1 |0 |2019-01-05 05:10:00|
|8 |2 |300 |2019-01-05 12:34:56|
|9 |3 |120 |2019-01-06 23:59:59|
The transactions table shows the balance after a transaction is made on the account.
To be honest, I am unsure how to write the query, or whether I am overthinking the situation. I know it would involve last_value() and coalesce(), and possibly lag() and lead(). Essentially, the criteria I would like to fulfill are:
It takes the last balance value of that day, for that account.
(i.e. the balance for acc_id = '1' on 2019-01-01 would be $25; acc_id = '2' and '3' would be $0)
For days where there are no transaction made by an account, the balance would take from the previous balance of that account.
(i.e. the balance for acc_id = '1' on 2019-01-03 would be $25)
Lastly, I would like the total balance of all accounts aggregated by date.
(i.e. at the end of 2019-01-02, the total balance should be $375 (= 25 + 200 + 150))
I have tried the query below:
SELECT date_trunc('day',date), sum(balance_of_day) FROM (
SELECT txn_created as date,
acc_id,
row_number() over (partition BY acc_id ORDER BY txn_created ASC) as order_of_created,
last_value(balance) over (partition by acc_id ORDER BY txn_created RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as balance_of_day
FROM transactions) X
where X.order_of_created = 1
GROUP BY 1
However, this only gives me a total balance on days when at least one account made a transaction.
The expected end result (based on the example) should be:
|date |total_balance|
|2019-01-01 |25 |
|2019-01-02 |375 |
|2019-01-03 |375 |
|2019-01-04 |475 |
|2019-01-05 |450 |
|2019-01-06 |420 |
I won't need to present the different account numbers, just the total accumulated balance from all customers at the end of the day. Please let me know how I can solve this! Many thanks!
You can use a few cool Postgres features to accomplish this. First, to get the last balance per day, use DISTINCT ON:
SELECT DISTINCT on(acc_id, txn_created::date)
transaction_id, acc_id, balance, txn_created::date as day
FROM transactions
ORDER BY acc_id, txn_created::date, txn_created desc;
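For the sample data, this should boil down to one row per account per day with activity, along these lines:
|transaction_id|acc_id|balance|day |
|3 |1 |25 |2019-01-01|
|6 |1 |125 |2019-01-04|
|7 |1 |0 |2019-01-05|
|4 |2 |200 |2019-01-02|
|8 |2 |300 |2019-01-05|
|5 |3 |150 |2019-01-02|
|9 |3 |120 |2019-01-06|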
To figure out the balance on any given day, we'll use a daterange per row that includes the current row and excludes the next row, partitioned by acc_id:
SELECT transaction_id, acc_id, balance, daterange(day, lead(day, 1) OVER (partition by acc_id order by day), '[)')
FROM (
SELECT DISTINCT on(acc_id, txn_created::date)
transaction_id, acc_id, balance, txn_created::date as day
FROM transactions
ORDER BY acc_id, txn_created::date, txn_created desc
) sub;
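To make that concrete, for acc_id = 1 in the sample data the ranges should come out as follows (the last one is open-ended because lead() returns NULL for the newest row):
balance 25  -> [2019-01-01,2019-01-04)
balance 125 -> [2019-01-04,2019-01-05)
balance 0   -> [2019-01-05,)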
Lastly, join to generate_series. We can join where the date in generate_series is contained by the daterange we created in the last step. The dateranges are intentionally not overlapping, so we can query on any date safely.
WITH balances as (
SELECT transaction_id, acc_id, balance, daterange(day, lead(day, 1) OVER (partition by acc_id order by day), '[)') as drange
FROM (
SELECT DISTINCT on(acc_id, txn_created::date)
transaction_id, acc_id, balance, txn_created::date as day
FROM transactions
ORDER BY acc_id, txn_created::date, txn_created desc
) sub
)
SELECT d::date, sum(balance)
FROM generate_series('2019-01-01'::date, '2019-01-06'::date, '1 day') as g(d)
JOIN balances ON d::date <# drange
GROUP BY d::date;
d | sum
------------+-----
2019-01-01 | 25
2019-01-02 | 375
2019-01-03 | 375
2019-01-04 | 475
2019-01-05 | 450
2019-01-06 | 420
(6 rows)
Here's a fiddle.
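One possible refinement, if you'd rather not hardcode the series bounds: derive them from the transactions table itself. A sketch of the final query, assuming the same balances CTE as above and a non-empty table:
SELECT d::date, sum(balance)
FROM generate_series((SELECT min(txn_created)::date FROM transactions),
                     (SELECT max(txn_created)::date FROM transactions),
                     '1 day') AS g(d)
JOIN balances ON d::date <# drange
GROUP BY d::date
ORDER BY d::date;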

dataframe spark scala take the (MAX-MIN) for each group

I have a dataframe from a processing step that looks like this:
+---------+------+-----------+
|Time |group |value |
+---------+------+-----------+
| 28371| 94| 906|
| 28372| 94| 864|
| 28373| 94| 682|
| 28374| 94| 574|
| 28383| 95| 630|
| 28384| 95| 716|
|    28385|    95|        913|
+---------+------+-----------+
I would like to take (the value at the max Time - the value at the min Time) for each group, to get this result:
+------+-----------+
|group | value |
+------+-----------+
| 94| -332|
|    95|        283|
+------+-----------+
Thank you in advance for the help
df.groupBy("groupCol").agg(max("value")-min("value"))
Based on the question edit by the OP, here is a way to do this in PySpark. The idea is to compute the row numbers in ascending and descending order of time per group and use those values for subtraction.
from pyspark.sql import Window
from pyspark.sql import functions as func
w_asc = Window.partitionBy(df.groupCol).orderBy(df.time)
w_desc = Window.partitionBy(df.groupCol).orderBy(func.desc(df.time))
df = df.withColumn('rnum_asc', func.row_number().over(w_asc)) \
       .withColumn('rnum_desc', func.row_number().over(w_desc))
df.groupBy(df.groupCol) \
.agg((func.max(func.when(df.rnum_desc==1,df.value))-func.max(func.when(df.rnum_asc==1,df.value))).alias('diff')).show()
It would have been easier if the window function first_value were available in Spark SQL. A generic way to solve this using SQL is:
select distinct groupCol,diff
from (
select t.*
,first_value(val) over(partition by groupCol order by time desc) -
 first_value(val) over(partition by groupCol order by time) as diff
from tbl t
) t
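As a sanity check, the same query run against the sample rows inline (PostgreSQL VALUES syntax; columns renamed to ts/grp/val to sidestep keyword quirks) should return (94, -332) and (95, 283):
select distinct grp, diff
from (
    select t.*,
           -- value at the latest ts minus value at the earliest ts, per group
           first_value(val) over (partition by grp order by ts desc) -
           first_value(val) over (partition by grp order by ts) as diff
    from (values
        (28371, 94, 906), (28372, 94, 864), (28373, 94, 682), (28374, 94, 574),
        (28383, 95, 630), (28384, 95, 716), (28385, 95, 913)
    ) as t(ts, grp, val)
) t;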

SQL calculating stock per month

I have a specific task and don't know how to accomplish it. I hope someone can help me =)
I have stock_move table:
product_id |location_id |location_dest_id |product_qty |date_expected |
-----------|------------|-----------------|------------|--------------------|
327 |80 |84 |10 |2014-05-28 00:00:00 |
327 |80 |84 |10 |2014-05-23 00:00:00 |
327 |80 |84 |10 |2014-02-26 00:00:00 |
327 |80 |85 |10 |2014-02-21 00:00:00 |
327 |80 |84 |10 |2014-02-12 00:00:00 |
327 |84 |85 |20 |2014-02-06 00:00:00 |
322 |84 |80 |120 |2015-12-16 00:00:00 |
322 |80 |84 |30 |2015-12-10 00:00:00 |
322 |80 |84 |30 |2015-12-04 00:00:00 |
322 |80 |84 |15 |2015-11-26 00:00:00 |
i.e. it's a table of product moves from one warehouse to another.
I can calculate the stock at an arbitrary date with something like this:
select
coalesce(si.product_id, so.product_id) as "Product",
(coalesce(si.stock, 0) - coalesce(so.stock, 0)) as "Stock"
from
(
select
product_id
,sum(product_qty * price_unit) as stock
from stock_move
where
location_dest_id = 80
and date_expected < now()
group by product_id
) as si
full outer join (
select
product_id
,sum(product_qty * price_unit) as stock
from stock_move
where
location_id = 80
and date_expected < now()
group by product_id
) as so
on si.product_id = so.product_id
As a result I get the current stock:
Product |Stock |
--------|------|
325 |1058 |
313 |34862 |
304 |2364 |
BUT what should I do if I need the stock per month?
Something like this:
Month |Total Stock |
--------|------------|
Jan |130238 |
Feb |348262 |
Mar |2323364 |
How can I sum the product quantities from the start of the period to the end of each month?
My only idea is to use 24 subqueries to get the stock for each month (example below).
Jan |Feb | Mar |
----|----|-----|
123 |234 |345 |
And after this, rotate rows and columns?
I think this is clumsy, but I don't know another way... Help me pls =)
Something like this could give you monthly "ending" inventory snapshots. The trick is that your data may omit certain months for certain parts, but those parts still have a balance (i.e. 50 received in January, nothing happens in February, but you still want February to show a running total of 50).
One way to handle this is to come up with all possible part/date combinations. I assumed 2014-01-01 + 24 months in this example, but that's easily changed in the all_months subquery. For example, you may only want to start with the minimum date from the stock_move table.
with all_months as (
    --One row per month for the window of interest (2014-01-01 + 24 months)
    select '2014-01-01'::date + interval '1 month' * generate_series(0, 23) as month_begin
),
stock_calc as (
    --Signed movement values for location 80: incoming positive, outgoing negative...
    select
        product_id, date_expected,
        date_trunc('month', date_expected)::date as month_expected,
        case
            when location_id = 80 then -product_qty * price_unit
            when location_dest_id = 80 then product_qty * price_unit
            else 0
        end as qty
    from stock_move
    union all
    --...plus a zero-quantity row for every product/month combination,
    --so months without any movement still appear
    select distinct
        s.product_id, m.month_begin::date, m.month_begin::date, 0
    from stock_move s
    cross join all_months m
),
running_totals as (
    select
        product_id, date_expected, month_expected,
        sum(qty) over (partition by product_id order by date_expected) as end_qty,
        --rn = 1 marks the last row of each product/month
        row_number() over (partition by product_id, month_expected
                           order by date_expected desc) as rn
    from stock_calc
)
select
    product_id, month_expected, end_qty
from running_totals
where
    rn = 1
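If you then want the single Month | Total Stock view from the question, one more rollup should do it; a sketch that would replace the final select above (it reuses the running_totals CTE):
select month_expected as month, sum(end_qty) as total_stock
from running_totals
where rn = 1
group by month_expected
order by month_expected;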