Replace DataFrame rows with most recent data based on key - date

I have a dataframe that looks like this:
user_id  val  date
1        10   2015-02-01
1        11   2015-01-01
2        12   2015-03-01
2        13   2015-02-01
3        14   2015-03-01
3        15   2015-04-01
I need to run a function that calculates (let's say) the sum of val as of a given date: for each user, take the row with the most recent date on or before that date. If a user has a more recent row within that cutoff, use it; if not, keep the older one.
For example, if I run the function with the date 2015-03-15, the table becomes:
user_id  val  date
1        10   2015-02-01
2        12   2015-03-01
3        14   2015-03-01
Giving me a sum of 36.
If I run the function with the date 2015-04-15, then the table will be:
user_id  val  date
1        10   2015-02-01
2        12   2015-03-01
3        15   2015-04-01
(User 3's row was replaced with a more recent date).
I know this is fairly esoteric, but I thought I could bounce it off all of you, as I have been trying to think of a simple way of doing this.

try this:
In [36]: df.loc[df.date <= '2015-03-15']
Out[36]:
   user_id  val       date
0        1   10 2015-02-01
1        1   11 2015-01-01
2        2   12 2015-03-01
3        2   13 2015-02-01
4        3   14 2015-03-01

In [39]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').agg({'date':'last', 'val':'last'}).reset_index()
Out[39]:
   user_id       date  val
0        1 2015-02-01   10
1        2 2015-03-01   12
2        3 2015-03-01   14

or:

In [40]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[40]:
   user_id  val       date
0        1   10 2015-02-01
1        2   12 2015-03-01
2        3   14 2015-03-01

In [41]: df.loc[df.date <= '2015-04-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[41]:
   user_id  val       date
0        1   10 2015-02-01
1        2   12 2015-03-01
2        3   15 2015-04-01
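To wrap this up as the function the question describes, a minimal sketch (the helper name sum_as_of is just illustrative; it rebuilds the sample frame from the question):

import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3],
    'val':     [10, 11, 12, 13, 14, 15],
    'date':    pd.to_datetime(['2015-02-01', '2015-01-01', '2015-03-01',
                               '2015-02-01', '2015-03-01', '2015-04-01']),
})

def sum_as_of(df, cutoff):
    """Sum val over each user's most recent row on or before cutoff."""
    latest = (df.loc[df.date <= cutoff]
                .sort_values('date')
                .groupby('user_id')
                .last()                # most recent row per user within the cutoff
                .reset_index())
    return latest['val'].sum()

print(sum_as_of(df, '2015-03-15'))   # 36
print(sum_as_of(df, '2015-04-15'))   # 37

For 2015-03-15 this reproduces the 36 above; for 2015-04-15 it gives 37, since user 3's row is replaced by the more recent 2015-04-01 one.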

Related

Create a range of dates in a pyspark DataFrame

I have the following abstracted DataFrame (my original DF has 60+ billion rows):
Id  Date        Val1  Val2
1   2021-02-01  10    2
1   2021-02-05  8     4
2   2021-02-03  2     0
1   2021-02-07  12    5
2   2021-02-05  1     3
My expected output is:
Id  Date        Val1  Val2
1   2021-02-01  10    2
1   2021-02-02  10    2
1   2021-02-03  10    2
1   2021-02-04  10    2
1   2021-02-05  8     4
1   2021-02-06  8     4
1   2021-02-07  12    5
2   2021-02-03  2     0
2   2021-02-04  2     0
2   2021-02-05  1     3
Basically, what I need is: if Val1 or Val2 changes between two dates, every date in between must carry the values from the earlier date (to see it more clearly, look at Id 2).
I know I can do this in many ways (window function, udf, ...), but my question is: since my original DF has more than 60 billion rows, what is the best-performing approach for this processing?
I think the best approach (performance-wise) is an inner join (probably with a broadcast). If you are worried about the number of records, I suggest running them in batches (split by record count, by date, or even by a random number). The general idea is just to avoid processing everything at once.
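A rough PySpark sketch of that idea, assuming Spark 2.4+ (for sequence/explode); the column names and calendar bounds come from the sample above, while the variable names and the lead()-based end_date are my own illustration:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "2021-02-01", 10, 2), (1, "2021-02-05", 8, 4), (2, "2021-02-03", 2, 0),
     (1, "2021-02-07", 12, 5), (2, "2021-02-05", 1, 3)],
    ["Id", "Date", "Val1", "Val2"],
).withColumn("Date", F.to_date("Date"))

# Each reading stays valid until the day before the next reading for the same Id
# (the last reading just keeps its own date).
w = Window.partitionBy("Id").orderBy("Date")
bounded = df.withColumn(
    "end_date", F.coalesce(F.date_sub(F.lead("Date").over(w), 1), F.col("Date"))
)

# Small calendar spanning the whole period, broadcast to every executor.
calendar = spark.sql(
    "SELECT explode(sequence(to_date('2021-02-01'), to_date('2021-02-07'))) AS cal_date"
)

filled = (
    bounded.join(
        F.broadcast(calendar),
        (F.col("cal_date") >= F.col("Date")) & (F.col("cal_date") <= F.col("end_date")),
    )
    .select("Id", F.col("cal_date").alias("Date"), "Val1", "Val2")
    .orderBy("Id", "Date")
)
filled.show()

Broadcasting keeps the small calendar on every executor so the big side is only scanned once, and the batching suggested above (by date range, for example) can be layered on top of it.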

Is there a way to do a selective sum using a time interval in Postgres?

I have two tables. The first table has columns id, start_date, and end_date; the second table has columns id, timestamp, and value. Is there a way to compute a sum over table 2 based on the intervals in table 1?
Table 1:
id  start_date           end_date
5   2000-01-01 01:00:00  2000-01-05 02:45:00
5   2000-01-10 01:00:00  2000-01-15 02:45:00
6   2000-01-01 01:00:00  2000-01-05 02:45:00
6   2000-01-11 01:00:00  2000-01-12 02:45:00
6   2000-01-15 01:00:00  2000-01-20 02:45:00
Table 2:
id  timestamp            value
5   2000-01-01 05:00:00  1
5   2000-01-01 06:00:00  2
6   2000-01-01 05:00:00  1
6   2000-01-11 05:00:00  2
6   2000-01-15 05:00:00  2
6   2000-01-15 05:30:00  2
Desired result:
id  start_date           end_date             sum
5   2000-01-01 01:00:00  2000-01-05 02:45:00  3
5   2000-01-10 01:00:00  2000-01-15 02:45:00  null
6   2000-01-01 01:00:00  2000-01-05 02:45:00  1
6   2000-01-11 01:00:00  2000-01-12 02:45:00  2
6   2000-01-15 01:00:00  2000-01-20 02:45:00  4
Try this:
SELECT a.id, a.start_date, a.end_date, sum(b.value) AS sum
FROM table1 AS a
LEFT JOIN table2 AS b
ON b.id = a.id
AND b.timestamp >= a.start_date
AND b.timestamp < a.end_date
GROUP BY a.id, a.start_date, a.end_date
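If you want to sanity-check that join logic outside the database, here is a small pandas sketch of the same left join plus interval filter, built from the sample rows in the question (min_count=1 is only there to keep the empty interval as null):

import pandas as pd

t1 = pd.DataFrame({
    "id": [5, 5, 6, 6, 6],
    "start_date": pd.to_datetime(["2000-01-01 01:00", "2000-01-10 01:00",
                                  "2000-01-01 01:00", "2000-01-11 01:00",
                                  "2000-01-15 01:00"]),
    "end_date": pd.to_datetime(["2000-01-05 02:45", "2000-01-15 02:45",
                                "2000-01-05 02:45", "2000-01-12 02:45",
                                "2000-01-20 02:45"]),
})
t2 = pd.DataFrame({
    "id": [5, 5, 6, 6, 6, 6],
    "timestamp": pd.to_datetime(["2000-01-01 05:00", "2000-01-01 06:00",
                                 "2000-01-01 05:00", "2000-01-11 05:00",
                                 "2000-01-15 05:00", "2000-01-15 05:30"]),
    "value": [1, 2, 1, 2, 2, 2],
})

m = t1.merge(t2, on="id", how="left")
inside = (m["timestamp"] >= m["start_date"]) & (m["timestamp"] < m["end_date"])
m["value"] = m["value"].where(inside)          # values outside the interval count as null
result = (m.groupby(["id", "start_date", "end_date"], as_index=False)["value"]
            .sum(min_count=1))                 # min_count=1 keeps null for empty intervals
print(result)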

Postgres how to fragment a 100 row result with 10 items each on a single column?

I am entirely new to Postgres, and I have a query someone created in the past that gets the temperature of a machine. It usually returns about 100 rows, so I aggregated the result, and it is messy and long. My sample table looks like this:
id  temperature  Date
----------------------------
1   10           2020-01-01
2   10           2020-01-01
3   11           2020-01-01
4   11           2020-01-01
5   11           2020-01-01
6   22           2020-01-01
7   10           2020-01-01
8   10           2020-01-01
9   10           2020-01-01
10  10           2020-01-01
11  10           2020-01-01
12  10           2020-01-01
13  10           2020-01-01
14  10           2020-01-01
15  10           2020-01-01
16  10           2020-01-01
17  10           2020-01-01
18  10           2020-01-01
19  10           2020-01-01
20  10           2020-01-01
and my current result looks like this:
Cooker Name | temperature
Psalmonela  | [10,10,11,11,11,22,10,10,10,10,10,10,10,10,10,10,10,10,10,10]
Because this is a JSON aggregate and the result sometimes runs to 100 rows, is there a way to JSON-aggregate it so that it shows 10 items per aggregate?
(SELECT to_jsonb(ARRAY(
    SELECT temperature
    FROM stations
    WHERE station_id = 'c23d77d5-0dc5-40c4-asda-22123132'
    LIMIT 10 OFFSET 10
))) AS normal_temp1,
My initial thought was to return the results 10 at a time using LIMIT/OFFSET like this; however, is there a way to merge those fragmented results back into a single column called temperature, so that the result reads vertically?
My current output looks like this (screenshot omitted). As a newbie, I added a lot of subqueries with the same code, each with a different LIMIT and OFFSET, so the report now has many extra columns, which is disliked. Can I merge those columns into one column called temperature so it looks nicer?
(Screenshot of the report before the change omitted.)

aggregating with a condition in groupby spark dataframe

I have a dataframe
id  lat  long  lag_lat  lag_long  detector  lag_interval  gpsdt  lead_gpsdt
1   12   13    12       13        1         [1.5,3.5]     4      4.5
1   12   13    12       13        1         null          4.5    5
1   12   13    12       13        1         null          5      5.5
1   12   13    12       13        1         null          5.5    6
1   13   14    12       13        2         null          6      6.5
1   13   14    13       14        2         null          6.5    null
2   13   14    13       14        2         [0.5,1.5]     2.5    3.5
2   13   14    13       14        2         null          3.5    4
2   13   14    13       14        2         null          4      null
I want to apply a condition while aggregating after a groupBy: when grouping by col("id") and col("detector"), if lag_interval has any non-null value within that group, then the aggregation should return two columns,
min("lag_interval.col1") and max("lead_gpsdt")
If that condition is not met, then I want
min("gpsdt"), max("lead_gpsdt")
This is the approach I am using so far, to which I want to add that condition:
df.groupBy("detector", "id").agg(
  first("lat-long").alias("start_coordinate"),
  last("lat-long").alias("end_coordinate"),
  struct(min("gpsdt"), max("lead_gpsdt")).as("interval")
)
Desired output:
id  interval  start_coordinate  end_coordinate
1   [1.5,6]   [12,13]           [13,14]
1   [6,6.5]   [13,14]           [13,14]
2   [0.5,4]   [13,14]           [13,14]
For more explanation:
Looking at one group that groupBy("id", "detector") pulls out: if any value of col("lag_interval") in that group is non-null, then we need the aggregation min(lag_interval.col1), max(lead_gpsdt). That condition applies to this set of rows:
id  lat  long  lag_lat  lag_long  detector  lag_interval  gpsdt  lead_gpsdt
1   12   13    12       13        1         [1.5,3.5]     4      4.5
1   12   13    12       13        1         null          4.5    5
1   12   13    12       13        1         null          5      5.5
1   12   13    12       13        1         null          5.5    6
And if every value of col("lag_interval") in the group is null, then the aggregation output needs to be
min("gpsdt"), max("lead_gpsdt")
That condition applies to this set of rows:
id  lat  long  lag_lat  lag_long  detector  lag_interval  gpsdt  lead_gpsdt
1   13   14    12       13        2         null          6      6.5
1   13   14    13       14        2         null          6.5    null
The conditional dilemma you have can be solved with the simple built-in when function, as suggested below:
import org.apache.spark.sql.functions._

df.groupBy("id", "detector")
  .agg(
    struct(
      // when every lag_interval in the group is null, its min() is null too, so fall back to min(gpsdt)
      when(isnull(min("lag_interval.col1")), min("gpsdt")).otherwise(min("lag_interval.col1")).as("min"),
      max("lead_gpsdt").as("max")
    ).as("interval")
  )
which should give you the following output:
+---+--------+----------+
|id |detector|interval |
+---+--------+----------+
|2 |2 |[0.5, 4.0]|
|1 |2 |[6.0, 6.5]|
|1 |1 |[1.5, 6.0]|
+---+--------+----------+
and I guess you already have an idea of how to add first("lat-long").alias("start_coordinate") and last("lat-long").alias("end_coordinate"), as you have done.
I hope the answer is helpful.
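For anyone on the Python API, a rough PySpark equivalent of the same when/isnull trick; the struct layout of lag_interval (fields col1 and col2) is assumed from the snippet above, and the lat/long columns are left out for brevity:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Minimal rows mirroring the question's sample data.
rows = [
    (1, 1, (1.5, 3.5), 4.0, 4.5), (1, 1, None, 4.5, 5.0),
    (1, 1, None, 5.0, 5.5),       (1, 1, None, 5.5, 6.0),
    (1, 2, None, 6.0, 6.5),       (1, 2, None, 6.5, None),
    (2, 2, (0.5, 1.5), 2.5, 3.5), (2, 2, None, 3.5, 4.0),
    (2, 2, None, 4.0, None),
]
df = spark.createDataFrame(
    rows,
    "id int, detector int, lag_interval struct<col1:double,col2:double>, "
    "gpsdt double, lead_gpsdt double",
)

result = df.groupBy("id", "detector").agg(
    F.struct(
        # if every lag_interval in the group is null, its min() is null too,
        # so fall back to min(gpsdt)
        F.when(F.isnull(F.min("lag_interval.col1")), F.min("gpsdt"))
         .otherwise(F.min("lag_interval.col1")).alias("min"),
        F.max("lead_gpsdt").alias("max"),
    ).alias("interval")
)
result.show(truncate=False)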

How to query with lead() values not in current range

I'm having problems when the values lead() needs are not within the range of the current query: rows on the range's edge return null lead() values.
Let’s say I have a simple table to keep track of continuous counters
create table anytable
( wseller  integer      NOT NULL,
  wday     date         NOT NULL,
  wshift   smallint     NOT NULL,
  wcounter numeric(9,1) )
with the following values
wseller  wday        wshift  wcounter
1        2016-11-30  1       100.5
1        2017-01-03  1       102.5
1        2017-01-25  2       103.2
1        2017-02-05  2       106.1
2        2015-05-05  2       81.1
2        2017-01-01  1       92.1
2        2017-01-01  2       93.1
3        2016-12-01  1       45.2
3        2017-01-05  1       50.1
and I want the net units for the current year:
wseller  wday        wshift  units
1        2017-01-03  1       2
1        2017-01-25  2       0.7
1        2017-02-05  2       2.9
2        2017-01-01  1       11
2        2017-01-01  2       1
3        2017-01-05  1       4.9
If I use
select wseller, wday, wshift,
       wcounter - lead(wcounter) over (partition by wseller order by wseller, wday desc, wshift desc)
from anytable
where wday >= '2017-01-01'
it gives me a null for the oldest row of each wseller partition. I'm using this query within a large CTE.
What am I doing wrong?
Window functions only see the rows that remain after the WHERE clause has been applied, so the filter removes the earlier rows that lead() needs. Move the condition to an outer query:
select *
from (
select
wseller, wday, wshift,
wcounter - lead(wcounter) over (partition by wseller order by wday desc, wshift desc)
from anytable
) s
where wday >= '2017-01-01'
order by wseller, wday, wshift
wseller | wday | wshift | ?column?
---------+------------+--------+----------
1 | 2017-01-03 | 1 | 2.0
1 | 2017-01-25 | 2 | 0.7
1 | 2017-02-05 | 2 | 2.9
2 | 2017-01-01 | 1 | 11.0
2 | 2017-01-01 | 2 | 1.0
3 | 2017-01-05 | 1 | 4.9
(6 rows)