Create a range of dates in a pyspark DataFrame - scala

I have the following abstracted DataFrame (my original DF has 60 billion+ rows):
Id Date Val1 Val2
1 2021-02-01 10 2
1 2021-02-05 8 4
2 2021-02-03 2 0
1 2021-02-07 12 5
2 2021-02-05 1 3
My expected output is:
Id Date Val1 Val2
1 2021-02-01 10 2
1 2021-02-02 10 2
1 2021-02-03 10 2
1 2021-02-04 10 2
1 2021-02-05 8 4
1 2021-02-06 8 4
1 2021-02-07 12 5
2 2021-02-03 2 0
2 2021-02-04 2 0
2 2021-02-05 1 3
Basically, what I need is: if Val1 or Val2 changes within a period, every date between those two dates must carry the value from the previous date. (To see this more clearly, look at Id 2.)
I know that I can do this in many ways (window functions, UDFs, ...), but my question is: since my original DF has more than 60 billion rows, what is the best approach for this processing?

I think the best approach performance-wise is an inner join (probably with broadcasting). If you are worried about the number of records, I suggest you process them in batches (split by record count, by date, or even by a random key). The general idea is just to avoid running everything at once.
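A minimal sketch of that idea on the sample data (assumptions: Spark 2.4+ for sequence, a hard-coded calendar range just for illustration, and a window last(..., ignorenulls) to carry values forward after the join — not necessarily the exact join the answer had in mind):
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# The question's sample rows (dates as strings, then cast).
df = spark.createDataFrame(
    [(1, "2021-02-01", 10, 2), (1, "2021-02-05", 8, 4), (2, "2021-02-03", 2, 0),
     (1, "2021-02-07", 12, 5), (2, "2021-02-05", 1, 3)],
    ["Id", "Date", "Val1", "Val2"],
).withColumn("Date", F.to_date("Date"))

# Small calendar covering the overall period -- cheap to broadcast.
# In the real job this range would come from the data or a date dimension.
calendar = spark.sql(
    "SELECT explode(sequence(to_date('2021-02-01'), to_date('2021-02-07'))) AS Date")

# Each Id only needs the dates between its own first and last observation.
bounds = df.groupBy("Id").agg(F.min("Date").alias("d_min"), F.max("Date").alias("d_max"))

w = Window.partitionBy("Id").orderBy("Date")
filled = (
    bounds
    .join(F.broadcast(calendar), calendar["Date"].between(bounds["d_min"], bounds["d_max"]))
    .select("Id", calendar["Date"])
    .join(df, ["Id", "Date"], "left")                              # keep observed values
    .withColumn("Val1", F.last("Val1", ignorenulls=True).over(w))  # forward-fill the gaps
    .withColumn("Val2", F.last("Val2", ignorenulls=True).over(w))
)
filled.orderBy("Id", "Date").show()
On the sample this reproduces the expected rows per Id; whether the broadcast is feasible depends only on how wide the calendar needs to be, not on the 60 billion fact rows.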

Related

Postgres how to fragment a 100 row result with 10 items each on a single column?

I am entirely new to Postgres and I have a query that someone created in the past that gets the temperature of a machine. It usually returns 100 rows, so I aggregated the result, and it is messy and long. My sample table looks like this:
id temperature Date
-------------------------
1 10 2020-01-01
2 10 2020-01-01
3 11 2020-01-01
4 11 2020-01-01
5 11 2020-01-01
6 22 2020-01-01
7 10 2020-01-01
8 10 2020-01-01
9 10 2020-01-01
10 10 2020-01-01
11 10 2020-01-01
12 10 2020-01-01
13 10 2020-01-01
14 10 2020-01-01
15 10 2020-01-01
16 10 2020-01-01
17 10 2020-01-01
18 10 2020-01-01
19 10 2020-01-01
20 10 2020-01-01
and my current result looks like this:
Cooker Name|temperature |
Psalmonela |[10,10,11,11,11,22,10,10,10,10,10,10,10,10,10,10,10,10,10,10] |
Because this is a JSON aggregate function and the results sometimes run to 100 rows, is there a way I could JSON-aggregate it and show 10 items per aggregate?
(SELECT to_jsonb(ARRAY(
    SELECT temperature
    FROM stations
    WHERE station_id = 'c23d77d5-0dc5-40c4-asda-22123132'
    LIMIT 10 OFFSET 10
))) AS normal_temp1,
My initial thought was to return the results 10 at a time using this offset. However, is there a way I could merge all those fragmented results into one column called temperature, so the result is vertical?
My current output looks like this: as a newbie, I added a lot of subqueries with the same code, each with a different LIMIT and OFFSET. Can I merge those columns into one column called temperature so it looks nicer? Right now they dislike that there are more columns than before.
This is what the report looked like before.

How to perform a pivot() and write.parquet on each partition of pyspark dataframe?

I have a spark dataframe df as below:
key date val col3
1 1 10 1
1 2 12 1
2 1 5 1
2 2 7 1
3 1 30 2
3 2 20 2
4 1 12 2
4 2 8 2
5 1 0 2
5 2 12 2
I want to:
1) df_pivot = df.groupBy(['date', 'col3']).pivot('key').sum('val')
2) df_pivot.write.parquet('location')
But my data can get really big, with millions of unique keys and unique col3 values.
Is there any way I can do the above operations per partition of col3?
E.g. for the partition where col3 == 1, do the pivot and write the parquet.
Note: I do not want to use a for loop!
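For what it's worth, a sketch of one way to avoid a driver-side loop, under the assumption that splitting the written files by col3 is enough (it does not shrink the pivoted schema, and the output path is made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A small stand-in for the question's frame.
df = spark.createDataFrame(
    [(1, 1, 10, 1), (1, 2, 12, 1), (2, 1, 5, 1), (2, 2, 7, 1), (3, 1, 30, 2),
     (3, 2, 20, 2), (4, 1, 12, 2), (4, 2, 8, 2), (5, 1, 0, 2), (5, 2, 12, 2)],
    ["key", "date", "val", "col3"],
)

# Pivot once, then let the Parquet writer split the result by col3 so each
# col3 value lands in its own directory (.../col3=1, .../col3=2, ...).
df_pivot = df.groupBy("date", "col3").pivot("key").sum("val")
df_pivot.write.mode("overwrite").partitionBy("col3").parquet("/tmp/pivot_by_col3")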

Combining rows in a spark dataframe

If I have an input as below:
sno name time
1 hello 1
1 hello 2
1 hai 3
1 hai 4
1 hai 5
1 how 6
1 how 7
1 are 8
1 are 9
1 how 10
1 how 11
1 are 12
1 are 13
1 are 14
I want to combine the rows having the same value in name into the output format below:
sno name timestart timeend
1 hello 1 2
1 hai 3 5
1 how 6 7
1 are 8 9
1 how 10 11
1 are 12 14
The input will be sorted by time, and only consecutive records with the same name must be merged.
I am trying to do this with Spark but I cannot figure out a way to do it using Spark functions, since I am new to Spark. Any suggestions on the approach will be appreciated.
I tried thinking of writing a user-defined function and applying maps to the data frame, but I could not come up with the right logic for the function.
PS: I am trying to do this using Scala Spark.
One way to do so would be to use a plain SQL query.
Let's say df is your input dataframe.
def query(viewName: String): String =
  s"SELECT sno, name, MIN(time) AS timestart, MAX(time) AS timeend FROM $viewName GROUP BY sno, name"
val viewName = "dataframe"
df.createOrReplaceTempView(viewName)
spark.sql(query(viewName)).show()
You can of course use the DataFrame API instead. This would be something like:
import org.apache.spark.sql.functions.{max, min}
import spark.implicits._
df.groupBy($"sno", $"name")
  .agg(min($"time").as("timestart"), max($"time").as("timeend"))

Replace DataFrame rows with most recent data based on key

I have a dataframe that looks like this:
user_id val date
1 10 2015-02-01
1 11 2015-01-01
2 12 2015-03-01
2 13 2015-02-01
3 14 2015-03-01
3 15 2015-04-01
I need to run a function that calculates (let's say) the sum of val as of a given date. If a user has a more recent row on or before that date, use it; if not, keep the older one.
For example. If I run the function with the date 2015-03-15, then the table will be:
user_id val date
1 10 2015-02-01
2 12 2015-03-01
3 14 2015-03-01
Giving me a sum of 36.
If I run the function with the date 2015-04-15, then the table will be:
user_id val date
1 10 2015-02-01
2 12 2015-03-01
3 15 2015-04-01
(User 3's row was replaced with a more recent date).
I know this is fairly esoteric, but I thought I could bounce this off all of you, as I have been trying to think of a simple way of doing this.
try this:
In [36]: df.loc[df.date <= '2015-03-15']
Out[36]:
user_id val date
0 1 10 2015-02-01
1 1 11 2015-01-01
2 2 12 2015-03-01
3 2 13 2015-02-01
4 3 14 2015-03-01
In [39]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').agg({'date':'last', 'val':'last'}).reset_index()
Out[39]:
user_id date val
0 1 2015-02-01 10
1 2 2015-03-01 12
2 3 2015-03-01 14
or:
In [40]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[40]:
user_id val date
0 1 10 2015-02-01
1 2 12 2015-03-01
2 3 14 2015-03-01
In [41]: df.loc[df.date <= '2015-04-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[41]:
user_id val date
0 1 10 2015-02-01
1 2 12 2015-03-01
2 3 15 2015-04-01
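If the end goal is the single sum described in the question, the same filter/sort/groupby/last chain can be wrapped up; a small sketch (the sum_as_of helper and the reconstructed frame are illustrative, not part of the original answer):
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "val": [10, 11, 12, 13, 14, 15],
    "date": pd.to_datetime(["2015-02-01", "2015-01-01", "2015-03-01",
                            "2015-02-01", "2015-03-01", "2015-04-01"]),
})

def sum_as_of(df, cutoff):
    """Keep each user's most recent row on or before cutoff, then sum val."""
    latest = (df.loc[df.date <= cutoff]
                .sort_values("date")
                .groupby("user_id")
                .last()
                .reset_index())
    return latest["val"].sum()

print(sum_as_of(df, "2015-03-15"))  # 36, as in the question
print(sum_as_of(df, "2015-04-15"))  # 37, after user 3's newer row replaces the old one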

KDB: Aggregate across consecutive rows with common label

I would like to sum across consecutive rows that share the same label. Any very simple ways to do this?
Example: I start with this table...
qty flag
1 OFF
3 ON
2 ON
2 OFF
9 OFF
4 ON
... and would like to generate...
qty flag
1 OFF
5 ON
11 OFF
4 ON
One method:
q)show t:flip`qty`flag!(1 3 2 2 9 4;`OFF`ON`ON`OFF`OFF`ON)
qty flag
--------
1 OFF
3 ON
2 ON
2 OFF
9 OFF
4 ON
q)show result:select sum qty by d:sums differ flag,flag from t
d flag| qty
----------| ---
1 OFF | 1
2 ON | 5
3 OFF | 11
4 ON | 4
Then to get it in the format you require:
q)`qty`flag#0!result
qty flag
--------
1 OFF
5 ON
11 OFF
4 ON