I have the following abstracted DataFrame (my original DF has 60 billion lines +)
Id Date Val1 Val2
1 2021-02-01 10 2
1 2021-02-05 8 4
2 2021-02-03 2 0
1 2021-02-07 12 5
2 2021-02-05 1 3
My expected ouput is:
Id Date Val1 Val2
1 2021-02-01 10 2
1 2021-02-02 10 2
1 2021-02-03 10 2
1 2021-02-04 10 2
1 2021-02-05 8 4
1 2021-02-06 8 4
1 2021-02-07 12 5
2 2021-02-03 2 0
2 2021-02-04 2 0
2 2021-02-05 1 3
Basically, what I need is: if Val1 or Val2 changes in a period of time, all the values between this two dates must have have the value from previous date. (To be more clearly, look at ID 2).
I know that I can do this in many ways (window function, udf,...) but my doubt is, since my original DF has more than 60 billion lines, what is the best approach to do this processing?
I think the best approach (performance-wise) is performing an inner join (probably with broadcasting). If you worry about the number of records, I suggest you run them by batch (could be the number of records, or by date, or even a random number). The general idea is just to avoid running all at once.
I have a spark dataframe df as below:
key| date | val | col3
1 1 10 1
1 2 12 1
2 1 5 1
2 2 7 1
3 1 30 2
3 2 20 2
4 1 12 2
4 2 8 2
5 1 0 2
5 2 12 2
I want to:
1) df_pivot = df.groupBy(['date', 'col3']).pivot('key').sum('val')
2) df_pivot.write.parquet('location')
But my data can get really big with millions of unique keys and unique col3.
Is there any way where i do the above operations per partition of col3?
Eg: For partition where col3==1, do the pivot and write the parquet
Note: I do not want to use a for loop!
This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
Closed 4 years ago.
I have the following dataframe
id col1 col2 col3 col4
1 1 10 100 A
1 1 20 101 B
1 1 30 102 C
2 1 10 80 D
2 1 20 90 E
2 1 30 100 F
2 1 40 104 G
So, I want to return a new dataframe, in which I can have in olnly one row the values for the same (col1, col2), and also create a new column with some oeration over both col3 columns, for example
id(1) col1(1) col2(1) col3(1) col4(1) id(2) col1(2) col2(2) col3(3) col4(4) new_column
1 1 10 100 A 2 1 10 80 D (100-80)*100
1 1 20 101 B 2 1 20 90 E (101-90)*100
1 1 30 102 C 2 1 30 100 F (102-100)*100
- - - - - 2 1 40 104 G -
I tried ordering, grouping by (col1, col2) but the grouping returns a RelationalGroupedDataset that I cannot do anything appart of aggregation functions. SO I will appreciate any help. I'm using Scala 2.11 Thanks!
what about joining the df with itself?
something like:
df.as("left")
.join(df.as("right"), Seq("col1", "col2"), "outer")
.where($"left.id" =!= $"right.id")
This question already has answers here:
How to aggregate values into collection after groupBy?
(3 answers)
Closed 4 years ago.
I have a csv file in hdfs : /hdfs/test.csv, I like to group below data using spark & scala, I need a output some this like this.
I want to group by A1...AN column based on A1 column and the output should be something like this
all the rows should be grouped like below.
OUTPUt:
JACK , ABCD, ARRAY("0,1,0,1", "2,9,2,9")
JACK , LMN, ARRAY("0,1,0,3", "0,4,3,T")
JACK, HBC, ARRAY("1,T,5,21", "E7,4W,5,8)
Input:
++++++++++++++++++++++++++++++
name A1 A1 A2 A3..AN
--------------------------------
JACK ABCD 0 1 0 1
JACK LMN 0 1 0 3
JACK ABCD 2 9 2 9
JAC HBC 1 T 5 21
JACK LMN 0 4 3 T
JACK HBC E7 4W 5 8
I need a below output in spark scala
JACK , ABCD, ARRAY("0,1,0,1", "2,9,2,9")
JACK , LMN, ARRAY("0,1,0,3", "0,4,3,T")
JACK, HBC, ARRAY("1,T,5,21", "E7,4W,5,8)
You can achieve this by having the columns as an array.
import org.apache.spark.sql.functions.{collect_set, concat_ws, array, col}
val aCols = 1.to(250).map( x -> col(s"A$x"))
val concatCol = concat_ws(",", array(aCols : _*))
groupedDf = df.withColumn("aConcat", concatCol).
groupBy("name", "A").
agg(collect_set("aConcat"))
If you're okay with duplicates you can also use collect_list instead of collect_set.
Your input has two different columns called A1. I will assume the groupBy category is called A, while the element to put in that final array is A1.
If you load the data into a DataFrame, you can do this to achieve the output specified:
import org.apache.spark.sql.functions.{collect_set, concat_ws}
val grouped = someDF
.groupBy($"name", $"A")
.agg(collect_set(concat_ws(",", $"A1", $"A2", $"A3", $"A4")).alias("grouped"))
This question already has answers here:
Retrieve top n in each group of a DataFrame in pyspark
(6 answers)
get TopN of all groups after group by using Spark DataFrame
(1 answer)
Closed 5 years ago.
I have one DataFrame which contains these values :
Dept_id | name | salary
1 A 10
2 B 100
1 D 100
2 C 105
1 N 103
2 F 102
1 K 90
2 E 110
I want the result in this form :
Dept_id | name | salary
1 N 103
1 D 100
1 K 90
2 E 110
2 C 105
2 F 102
Thanks In Advance :).
the solution is similar to Retrieve top n in each group of a DataFrame in pyspark which is in pyspark
If you do the same in scala, then it should be as below
df.withColumn("rank", rank().over(Window.partitionBy("Dept_id").orderBy($"salary".desc)))
.filter($"rank" <= 3)
.drop("rank")
I hope the answer is helpful