spark Group By data-frame columns without aggregation [duplicate] - scala

This question already has answers here:
How to aggregate values into collection after groupBy?
(3 answers)
Closed 4 years ago.
I have a CSV file in HDFS at /hdfs/test.csv, and I would like to group the data below using Spark and Scala; I need an output like the one that follows.
I want to group the A1...AN columns based on the A1 column, so that all the rows are grouped as below.
OUTPUT:
JACK, ABCD, ARRAY("0,1,0,1", "2,9,2,9")
JACK, LMN, ARRAY("0,1,0,3", "0,4,3,T")
JACK, HBC, ARRAY("1,T,5,21", "E7,4W,5,8")
Input:
name  A1    A1  A2  A3..AN
JACK  ABCD  0   1   0   1
JACK  LMN   0   1   0   3
JACK  ABCD  2   9   2   9
JACK  HBC   1   T   5   21
JACK  LMN   0   4   3   T
JACK  HBC   E7  4W  5   8

You can achieve this by concatenating the A columns into a single value and then collecting those values per group.
import org.apache.spark.sql.functions.{collect_set, concat_ws, array, col}
val aCols = (1 to 250).map(x => col(s"A$x"))   // adjust 250 to the actual number of A columns
val concatCol = concat_ws(",", array(aCols: _*))
val groupedDf = df
  .withColumn("aConcat", concatCol)
  .groupBy("name", "A")
  .agg(collect_set("aConcat"))
If you're okay with duplicates, you can also use collect_list instead of collect_set.
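For reference, a minimal end-to-end sketch of that approach, assuming the sample layout above (one name column, one category column, four value columns) and that the CSV at /hdfs/test.csv has a header row:

import org.apache.spark.sql.functions.{array, col, collect_list, concat_ws}

// load the file and rename the columns positionally so the category column is "A"
// (hypothetical names, matching the four value columns in the sample data)
val df = spark.read.option("header", "true").csv("/hdfs/test.csv")
  .toDF("name", "A", "A1", "A2", "A3", "A4")

val aCols = (1 to 4).map(i => col(s"A$i"))
val grouped = df
  .withColumn("aConcat", concat_ws(",", array(aCols: _*)))
  .groupBy("name", "A")
  .agg(collect_list("aConcat").alias("values"))   // collect_list keeps duplicates, collect_set drops them

grouped.show(false)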

Your input has two different columns called A1. I will assume the groupBy category is called A, while the elements that go into the final array are A1 through A4.
If you load the data into a DataFrame, you can do this to achieve the output specified:
import org.apache.spark.sql.functions.{collect_set, concat_ws}
import spark.implicits._   // for the $"col" syntax

val grouped = someDF
  .groupBy($"name", $"A")
  .agg(collect_set(concat_ws(",", $"A1", $"A2", $"A3", $"A4")).alias("grouped"))
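A quick way to try this out is with an in-memory copy of the sample rows (a hypothetical someDF, using only the four value columns shown in the question):

import spark.implicits._

val someDF = Seq(
  ("JACK", "ABCD", "0", "1", "0", "1"),
  ("JACK", "LMN",  "0", "1", "0", "3"),
  ("JACK", "ABCD", "2", "9", "2", "9"),
  ("JACK", "HBC",  "1", "T", "5", "21"),
  ("JACK", "LMN",  "0", "4", "3", "T"),
  ("JACK", "HBC",  "E7", "4W", "5", "8")
).toDF("name", "A", "A1", "A2", "A3", "A4")

grouped.show(false) should then contain one collected array per (name, A) pair; the ordering inside the set is not guaranteed.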

Related

splitting string column into multiple columns based on key value item using spark scala

I have a dataframe where one column contains several pieces of information in a 'key=value' format.
There are almost 30 different 'key=value' pairs that can appear in that column; I will use 4 columns
here for illustration (_age, _city, _sal, tag).
id name properties
0 A {_age=10, _city=A, _sal=1000}
1 B {_age=20, _city=B, _sal=3000, tag=XYZ}
2 C {_city=BC, tag=ABC}
How can I convert this string column into multiple columns?
I need to do this with the Spark Scala DataFrame API.
The expected output is:
id name _age _city _sal tag
0 A 10 A 1000
1 B 20 B 3000 XYZ
2 C BC ABC
Short answer (this only applies once properties is a struct column, so that the * wildcard can expand its fields):
df
  .select(
    col("id"),
    col("name"),
    col("properties.*"),
    ..
  )
Try this:
import org.apache.spark.sql.functions.{explode, first, regexp_replace, split}
val s = df.withColumn("dummy", explode(split(regexp_replace($"properties", "\\{|\\}", ""), ",")))
val result = s.drop("properties")
  .withColumn("col1", split($"dummy", "=")(0))
  .withColumn("col1-value", split($"dummy", "=")(1))
  .drop("dummy")
result.groupBy("id", "name").pivot("col1").agg(first($"col1-value")).orderBy($"id").show
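For completeness, the snippet above can be exercised against an in-memory copy of the sample data; note that splitting on "," leaves a leading space on all but the first key, so wrapping the key expression in trim(...) may be needed to get clean pivot column names. A sketch of the test data (assuming an active SparkSession named spark):

import spark.implicits._

// in-memory copy of the sample rows from the question
val df = Seq(
  (0, "A", "{_age=10, _city=A, _sal=1000}"),
  (1, "B", "{_age=20, _city=B, _sal=3000, tag=XYZ}"),
  (2, "C", "{_city=BC, tag=ABC}")
).toDF("id", "name", "properties")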

How to explode array column given a formula?

I am trying a very basic thing in spark scala, but couldn't figure it out.
I have data like this:
col1 col2
John 1
Jack 2
And, I want to achieve this:
col1 col2
John 1
John 0
John 2
Jack 2
Jack 1
Jack 3
That is, for each row I want to create two more rows, one with col2 - 1 and one with col2 + 1.
I tried to use explode, but couldn't figure out how to do it properly.
val exploded_df = df.withColumn("col2", explode($"col2" -1, $"col2", $"col2" +1 ))
And, got:
too many arguments for method explode: (e: org.apache.spark.sql.Column)org.apache.spark.sql.Column
You need to specify an array column to explode:
val exploded_df = df.withColumn("col2", explode(array($"col2" - 1, $"col2", $"col2" + 1)))
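A runnable sketch against the sample data (assuming col2 is numeric; each input row fans out into the three values col2 - 1, col2 and col2 + 1):

import org.apache.spark.sql.functions.{array, explode}
import spark.implicits._

val df = Seq(("John", 1), ("Jack", 2)).toDF("col1", "col2")

// array(...) builds the three candidate values, explode turns each one into its own row
val exploded_df = df.withColumn("col2", explode(array($"col2" - 1, $"col2", $"col2" + 1)))
exploded_df.show()   // John -> 0, 1, 2 and Jack -> 1, 2, 3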

filter rows based on combination of 2 columns in Spark DF

input DF:
A B
1 1
2 1
2 2
3 3
3 1
3 2
3 3
3 4
I am trying to filter the rows based on the combination of
(A, Max(B))
Output Df:
A B
1 1
2 3
3 4
I am able to do this with df.groupBy(), but there are also other columns in the DF that I want to keep in the result without including them in the groupBy.
So the filtering condition should only be with respect to these two columns and not the other columns in the DF. Any suggestions please?
As suggested in "How to get other columns when using Spark DataFrame groupby?", you can use window functions:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
df.withColumn("maxB", max(col("B")).over(Window.partitionBy("A"))).where(...)
where ... is replaced by a predicate based on A and maxB.
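A concrete version of that filter might look like this (a sketch; maxB is only a helper column and is dropped at the end):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max}

val result = df
  .withColumn("maxB", max(col("B")).over(Window.partitionBy("A")))
  .where(col("B") === col("maxB"))   // keep only the rows holding the per-A maximum of B
  .drop("maxB")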

Combining rows in a Spark dataframe

If I have an input as below:
sno name time
1 hello 1
1 hello 2
1 hai 3
1 hai 4
1 hai 5
1 how 6
1 how 7
1 are 8
1 are 9
1 how 10
1 how 11
1 are 12
1 are 13
1 are 14
I want to combine the fields having similar values in name as the below output format:
sno name timestart timeend
1 hello 1 2
1 hai 3 5
1 how 6 7
1 are 8 9
1 how 10 11
1 are 12 14
The input is sorted by time, and only records that have the same name over consecutive time intervals must be merged.
I am trying to do this using Spark but cannot figure out a way to do it with Spark functions since I am new to Spark. Any suggestions on the approach will be appreciated.
I tried thinking of writing a user-defined function and applying maps to the data frame but I could not come up with the right logic for the function.
PS: I am trying to do this using scala spark.
One way to do so would be to use a plain SQL query.
Let's say df is your input dataframe.
def query(viewName: String): String =
  s"SELECT sno, name, MIN(time) AS timestart, MAX(time) AS timeend FROM $viewName GROUP BY sno, name"

val viewName = "dataframe"
df.createOrReplaceTempView(viewName)
spark.sql(query(viewName))
You can of course do the same with the DataFrame API. This would be something like:
df.groupBy($"sno", $"name")
  .agg(min($"time").as("timestart"), max($"time").as("timeend"))
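Note that the expected output keeps the two separate runs of "how" (6-7 and 10-11) apart, which a plain group by name would merge. If consecutive runs must stay separate, a window-based, gaps-and-islands style sketch along these lines (assuming the (sno, name, time) layout above) could be used instead:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, max, min, sum, when}

// flag the start of each new run of identical names, then number the runs with a running sum
val byTime  = Window.partitionBy("sno").orderBy("time")
val flagged = df.withColumn(
  "newRun",
  when(lag("name", 1).over(byTime) === col("name"), 0).otherwise(1)
)
val merged = flagged
  .withColumn("runId", sum("newRun").over(byTime))
  .groupBy("sno", "name", "runId")
  .agg(min("time").as("timestart"), max("time").as("timeend"))
  .drop("runId")
  .orderBy("timestart")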

How to add columns using Scala

I have a df as below and want to add an additional column using Scala.
Id Name
1 ab
2 BC
1 Cd
2 mf
3 Hh
Expected output should be below
Id name repeatedcount
1 ab 2
2 BC 2
1 Cd 2
2 mf 2
3 Hh 3
I'm using DF.groupBy($"id").count.show() but I'm getting a different output.
Can someone please help me with this?
val grouped = df.groupBy($"id").count
val res = df.join(grouped, Seq("id"))
  .withColumnRenamed("count", "repeatedcount")
The groupBy gives the count for each id; joining that back to the original dataframe attaches the count to every row.
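An equivalent sketch that avoids the join is to compute the count as a window function over the same id partition:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

// count of rows sharing the same Id, attached to every row
val res = df.withColumn("repeatedcount", count("Id").over(Window.partitionBy("Id")))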