Merge rows into List for similar values in SPARK - scala

Spark version 2.0.2.6 and Scala version 2.11.11.
I have the following CSV file:
sno name number
1 hello 1
1 hello 2
2 hai 12
2 hai 22
2 hai 32
3 how 43
3 how 44
3 how 45
3 how 46
4 are 33
4 are 34
4 are 45
4 are 44
4 are 43
I want output as:
sno name number
1 hello [1,2]
2 hai [12,22,32]
3 how [43,44,45,46]
4 are [33,34,44,45,43]
The order of the elements in the list is not important.
Using DataFrames or RDDs, whichever is appropriate.
Thanks,
Tom

import org.apache.spark.sql.functions._
df.groupBy("sno", "name").agg(collect_list("number").alias("number")).sort("sno").show()
+---+-----+--------------------+
|sno| name| number|
+---+-----+--------------------+
| 1|hello| [1, 2]|
| 2| hai| [12, 22, 32]|
| 3| how| [43, 44, 45, 46]|
| 4| are|[33, 34, 45, 44, 43]|
+---+-----+--------------------+
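The answer above assumes the CSV has already been loaded into df. A minimal sketch for loading it in the spark-shell (where spark is already available); the file path and delimiter are assumptions to adjust for the real file:
// Hypothetical path and delimiter -- change these to match the actual CSV.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .csv("/path/to/input.csv")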

Creating a new column using info from another df

I'm trying to create a new column based on information from another data table.
df1
Loc Time Wage
1 192 1
3 192 2
1 193 3
5 193 3
7 193 5
2 194 7
df2
Loc City
1 NYC
2 Miami
3 LA
4 Chicago
5 Houston
6 SF
7 DC
desired output:
Loc Time Wage City
1 192 1 NYC
3 192 2 LA
1 193 3 NYC
5 193 3 Houston
7 193 5 DC
2 194 7 Miami
The actual dataframes vary quite largely in terms of row counts, but it's something along those lines. I think this might be achievable through .map, but I haven't found much documentation on that online. join doesn't really seem to fit this situation.
join is exactly what you need. Try running this in the spark-shell:
import spark.implicits._
val col1 = Seq("loc", "time", "wage")
val data1 = Seq((1, 192, 1), (3, 192, 2), (1, 193, 3), (5, 193, 3), (7, 193, 5), (2, 194, 7))
val col2 = Seq("loc", "city")
val data2 = Seq((1, "NYC"), (2, "Miami"), (3, "LA"), (4, "Chicago"), (5, "Houston"), (6, "SF"), (7, "DC"))
val df1 = spark.sparkContext.parallelize(data1).toDF(col1: _*)
val df2 = spark.sparkContext.parallelize(data2).toDF(col2: _*)
val outputDf = df1.join(df2, Seq("loc")) // join on the column "loc"
outputDf.show()
This will output
+---+----+----+-------+
|loc|time|wage| city|
+---+----+----+-------+
| 1| 192| 1| NYC|
| 1| 193| 3| NYC|
| 2| 194| 7| Miami|
| 3| 192| 2| LA|
| 5| 193| 3|Houston|
| 7| 193| 5| DC|
+---+----+----+-------+
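A side note, not part of the original answer: if you also needed to keep df1 rows whose loc has no match in df2 (not the case in this example), a left outer join is one option. A minimal sketch:
// Keeps every row of df1; city is null where loc has no match in df2.
val leftJoined = df1.join(df2, Seq("loc"), "left_outer")
leftJoined.show()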

Perform bucketing properly on spark query

Let's consider a dataset:
name    age
Max     33
Adam    32
Zim     41
Muller  62
Now, if we run this query on dataset x:
x.as("a").join(x.as("b")).where(
$"b.age" - $"a.age" <= 10 and
$"a.age" > $"b.age").show()
name  age  name  age
Max   33   Zim   41
Adam  32   Max   33
Adam  32   Zim   41
That is my desired result.
Now, conceptually if I have a very big dataset, I might want to use bucketing to reduce search space.
So, doing bucketing with:
val buck_x = x.withColumn("buc_age", floor($"age"/ 10))
which gives me:
name    age  buc_age
Max     33   3
Adam    32   3
Zim     41   4
Muller  62   6
After explode, I get the following result:
val exp_x = buck_x.withColumn("buc_age", explode(array($"buc_age" -1, $"buc_age", $"buc_age" + 1)))
name    age  buc_age
Max     33   2
Max     33   3
Max     33   4
Adam    32   2
Adam    32   3
Adam    32   4
Zim     41   3
Zim     41   4
Zim     41   5
Muller  62   5
Muller  62   6
Muller  62   7
Now, after final query,
exp_x.as("a").join(exp_x.as("b")).where(
$"a.buc_age" === $"b.buc_age" and
$"b.age" - $"a.age" <= 10 and
$"b.age" > $"a.age").show()
I get the following result.
name  age  buc_age  name  age  buc_age
Max   33   3        Zim   41   3
Max   33   4        Zim   41   4
Adam  32   2        Max   33   2
Adam  32   3        Zim   41   3
Adam  32   3        Max   33   3
Adam  32   4        Zim   41   4
Adam  32   4        Max   33   4
Clearly, this is not what I expected: I am getting more rows than I should. How can I solve this while still using buckets?
Drop your bucketing columns and then select distinct rows, essentially undoing the duplication caused by explode:
exp_x.select(exp_x.columns.map(c => col(c).as(c + "_a")): _*).
  join(exp_x.select(exp_x.columns.map(c => col(c).as(c + "_b")): _*)).
  where(
    $"buc_age_a" === $"buc_age_b" and
    $"age_b" - $"age_a" <= 10 and
    $"age_b" > $"age_a").
  drop("buc_age_a", "buc_age_b").
  distinct.
  show
+------+-----+------+-----+
|name_a|age_a|name_b|age_b|
+------+-----+------+-----+
| Adam| 32| Zim| 41|
| Adam| 32| Max| 33|
| Max| 33| Zim| 41|
+------+-----+------+-----+
There is really no need for an explode.
Instead, this approach unions two inner self joins. The two joins find cases where:
A and B are in the same bucket, and B is older
B is one bucket higher, but no more than 10 years older
This should perform better than using the explode, since fewer comparisons are performed (because the sets being joined here are one third of the exploded size).
import spark.implicits._
import org.apache.spark.sql.functions.floor

val namesDF = Seq(("Max", 33), ("Adam", 32), ("Zim", 41), ("Muller", 62)).toDF("name", "age")
val buck_x = namesDF.withColumn("buc_age", floor($"age" / 10))
// same bucket, where b is still older
val same = buck_x.as("a").join(buck_x.as("b"), ($"a.buc_age" === $"b.buc_age" && $"b.age" > $"a.age"), "inner")
// adjacent buckets -- b is one bucket higher but still no more than 10 years older
val diff = buck_x.as("a").join(buck_x.as("b"), ($"a.buc_age" + 1 === $"b.buc_age" && $"b.age" <= $"a.age" + 10), "inner")
val result = same.union(diff)
The result (you can do a drop to remove excess columns like in Charlie's answer):
result.show(false)
+----+---+-------+----+---+-------+
|name|age|buc_age|name|age|buc_age|
+----+---+-------+----+---+-------+
|Adam|32 |3 |Max |33 |3 |
|Max |33 |3 |Zim |41 |4 |
|Adam|32 |3 |Zim |41 |4 |
+----+---+-------+----+---+-------+
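Because both sides of these self-joins carry the same column names, the unioned result has duplicate name/age/buc_age columns. One possible way to make it unambiguous, in the spirit of the aliasing used in the first answer (a sketch, not part of the original answer; the names below are my own):
// Rename the columns on each side so the joined result has unique names.
val a = buck_x.toDF("name_a", "age_a", "buc_age_a")
val b = buck_x.toDF("name_b", "age_b", "buc_age_b")
val sameRenamed = a.join(b, ($"buc_age_a" === $"buc_age_b") && ($"age_b" > $"age_a"))
val diffRenamed = a.join(b, ($"buc_age_a" + 1 === $"buc_age_b") && ($"age_b" <= $"age_a" + 10))
val resultRenamed = sameRenamed.union(diffRenamed).drop("buc_age_a", "buc_age_b")
resultRenamed.show(false)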

I got an error in PySpark that states: TypeError: 'Column' object is not callable

What I did was try a groupby with collect_list:
Data:
id dates quantity
-- ----- -----
12 2012-03-02 1
32 2012-02-21 4
43 2012-03-02 4
5 2012-12-02 5
42 2012-12-02 7
21 2012-31-02 9
3 2012-01-02 5
2 2012-01-02 5
3 2012-01-02 7
2 2012-01-02 1
3 2012-01-02 3
21 2012-01-02 6
21 2012-03-23 5
21 2012-03-24 3
21 2012-04-25 1
21 2012-07-23 6
21 2012-01-02 8
Code:
new_df = df.groupby('id').agg(F.collect_list("dayid"),F.collect_list("quantity"))
The code seems to work fine for me; the only issue is that you used dayid as the column in collect_list, which does not exist in your data (you probably meant dates). The rest looks fine.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
sc = spark.sparkContext
dataset1 = [{'id' : 12,'dates' : '2012-03-02','quantity' : 1},
{'id' : 32,'dates' : '2012-02-21','quantity' : 4},
{'id' : 12,'dates' : '2012-03-02','quantity' : 1},
{'id' : 32,'dates' : '2012-02-21','quantity' : 4}]
rdd1 = sc.parallelize(dataset1)
df1 = spark.createDataFrame(rdd1)
df1.show()
+----------+---+--------+
| dates| id|quantity|
+----------+---+--------+
|2012-03-02| 12| 1|
|2012-02-21| 32| 4|
|2012-03-02| 12| 1|
|2012-02-21| 32| 4|
+----------+---+--------+
new_df = df1.groupby('id').agg(F.collect_list("id"), F.collect_list("quantity"))
new_df.show()
+---+----------------+----------------------+
| id|collect_list(id)|collect_list(quantity)|
+---+----------------+----------------------+
| 32| [32, 32]| [4, 4]|
| 12| [12, 12]| [1, 1]|
+---+----------------+----------------------+

reshape dataframe from column to rows in scala

I want to reshape a dataframe in Spark using Scala. I found that most of the examples use groupBy and pivot. In my case I don't want to use groupBy. This is how my dataframe looks:
tagid timestamp value
1 1 2016-12-01 05:30:00 5
2 1 2017-12-01 05:31:00 6
3 1 2017-11-01 05:32:00 4
4 1 2017-11-01 05:33:00 5
5 2 2016-12-01 05:30:00 100
6 2 2017-12-01 05:31:00 111
7 2 2017-11-01 05:32:00 109
8 2 2016-12-01 05:34:00 95
And I want my dataframe to look like this:
timestamp 1 2
1 2016-12-01 05:30:00 5 100
2 2017-12-01 05:31:00 6 111
3 2017-11-01 05:32:00 4 109
4 2017-11-01 05:33:00 5 NA
5 2016-12-01 05:34:00 NA 95
I used pivot without groupBy and it throws an error.
df.pivot("tagid")
error: value pivot is not a member of org.apache.spark.sql.DataFrame
How do I convert this? Thank you.
Doing the following should solve your issue. pivot is only available after a groupBy (it is defined on RelationalGroupedDataset, not on DataFrame), which is why calling it directly on the dataframe fails:
import org.apache.spark.sql.functions.first
df.groupBy("timestamp").pivot("tagid").agg(first($"value"))
You should get the final dataframe as:
+-------------------+----+----+
|timestamp |1 |2 |
+-------------------+----+----+
|2017-11-01 05:33:00|5 |null|
|2017-11-01 05:32:00|4 |109 |
|2017-12-01 05:31:00|6 |111 |
|2016-12-01 05:30:00|5 |100 |
|2016-12-01 05:34:00|null|95 |
+-------------------+----+----+
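A further note, not from the original answer: if the set of tag ids is known in advance, you can pass it to pivot explicitly, which spares Spark the extra pass it otherwise runs to discover the distinct pivot values. A sketch, assuming the ids are 1 and 2 as in the example:
import org.apache.spark.sql.functions.first
df.groupBy("timestamp").pivot("tagid", Seq(1, 2)).agg(first($"value"))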
For more information you can check out the Databricks blog.

Dataframe groupBy, get corresponding rows value, based on result of aggregate function [duplicate]

This question already has answers here:
How to select the first row of each group?
I have a dataframe with columns named c1, c2, c3, c4. I want to group it on one column, apply an aggregate function (e.g. min/max) to another column, and get the corresponding values of the remaining columns based on the result of the aggregation.
Example :
c1 c2 c3 c4
1 23 1 1
1 45 2 2
1 91 3 3
1 90 4 4
1 71 5 5
1 42 6 6
1 72 7 7
1 44 8 8
1 55 9 9
1 21 0 0
Should result in:
c1 c2 c3 c4
1 91 3 3
Let the dataframe be df.
df.groupBy($"c1").agg(max($"c2"), ??, ??)
Can someone please tell me what should go in place of the ?? placeholders?
I know a solution to this problem using RDDs; I wanted to explore whether it can be solved more easily using the DataFrame/Dataset API.
You can do this in two steps:
calculate the aggregated data frame;
join the data frame back with the original data frame and filter based on the condition;
so:
val maxDF = df.groupBy("c1").agg(max($"c2").as("maxc2"))
// maxDF: org.apache.spark.sql.DataFrame = [c1: int, maxc2: int]
df.join(maxDF, Seq("c1")).where($"c2" === $"maxc2").drop($"maxc2").show
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| 1| 91| 3| 3|
+---+---+---+---+
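As the linked duplicate ("How to select the first row of each group?") suggests, a window function is another common way to do this without the extra join. A sketch, assuming the same df:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Rank rows within each c1 group by c2 descending, then keep the top-ranked row.
val w = Window.partitionBy("c1").orderBy($"c2".desc)
df.withColumn("rn", row_number().over(w)).where($"rn" === 1).drop("rn").show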