Perform bucketing properly on a Spark query - Scala

Let's consider a dataset:
name    age
Max     33
Adam    32
Zim     41
Muller  62
Now, if we run this query on dataset x:
x.as("a").join(x.as("b")).where(
  $"b.age" - $"a.age" <= 10 and
  $"b.age" > $"a.age").show()
name    age   name    age
Max     33    Zim     41
Adam    32    Max     33
Adam    32    Zim     41
That is my desired result.
Now, conceptually, if I have a very big dataset, I might want to use bucketing to reduce the search space.
So, doing bucketing with:
val buck_x = x.withColumn("buc_age", floor($"age" / 10))
which gives me:
name    age   buc_age
Max     33    3
Adam    32    3
Zim     41    4
Muller  62    6
Exploding each row into its own bucket plus the two neighbouring buckets:
val exp_x = buck_x.withColumn("buc_age", explode(array($"buc_age" - 1, $"buc_age", $"buc_age" + 1)))
gives the following result:
name    age   buc_age
Max     33    2
Max     33    3
Max     33    4
Adam    32    2
Adam    32    3
Adam    32    4
Zim     41    3
Zim     41    4
Zim     41    5
Muller  62    5
Muller  62    6
Muller  62    7
Now, running the final query:
exp_x.as("a").join(exp_x.as("b")).where(
  $"a.buc_age" === $"b.buc_age" and
  $"b.age" - $"a.age" <= 10 and
  $"b.age" > $"a.age").show()
I get the following result.
name    age   buc_age   name    age   buc_age
Max     33    3         Zim     41    3
Max     33    4         Zim     41    4
Adam    32    2         Max     33    2
Adam    32    3         Zim     41    3
Adam    32    3         Max     33    3
Adam    32    4         Zim     41    4
Adam    32    4         Max     33    4
Clearly, this is not the same as my expected result: I am getting duplicate rows, because a matching pair can share more than one exploded bucket. How can I solve this while still using buckets?

Drop your bucketing columns and then select distinct rows, essentially undoing the duplication caused by explode:
exp_x.select(exp_x.columns.map(c => col(c).as(c + "_a")): _*)
  .join(exp_x.select(exp_x.columns.map(c => col(c).as(c + "_b")): _*))
  .where(
    $"buc_age_a" === $"buc_age_b" and
    $"age_b" - $"age_a" <= 10 and
    $"age_b" > $"age_a")
  .drop("buc_age_a", "buc_age_b")
  .distinct
  .show
+------+-----+------+-----+
|name_a|age_a|name_b|age_b|
+------+-----+------+-----+
| Adam| 32| Zim| 41|
| Adam| 32| Max| 33|
| Max| 33| Zim| 41|
+------+-----+------+-----+

There is really no need for an explode.
Instead, this approach unions two inner self joins. The two joins find cases where:
A and B are in the same bucket, and B is older
B is one bucket more, but no more than 10 years older
This should perform better than using the explode, since fewer comparisons are performed (because the sets being joined here are one third of the exploded size).
val namesDF = Seq(("Max", 33), ("Adam", 32), ("Zim", 41), ("Muller", 62)).toDF("name", "age")
val buck_x = namesDF.withColumn("buc_age", floor($"age" / 10))
// same bucket where b is still older
val same = buck_x.as("a").join(buck_x.as("b"), ($"a.buc_age" === $"b.buc_age" && $"b.age" > $"a.age"), "inner")
// different buckets -- b is one bucket higher but still no more than 10 ages different
val diff = buck_x.as("a").join(buck_x.as("b"), ($"a.buc_age" + 1 === $"b.buc_age" && $"b.age" <= $"a.age" + 10), "inner")
val result = same.union(diff)
The result (you can do a drop to remove excess columns like in Charlie's answer):
result.show(false)
+----+---+-------+----+---+-------+
|name|age|buc_age|name|age|buc_age|
+----+---+-------+----+---+-------+
|Adam|32 |3      |Max |33 |3      |
|Max |33 |3      |Zim |41 |4      |
|Adam|32 |3      |Zim |41 |4      |
+----+---+-------+----+---+-------+
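Since both join conditions are plain arithmetic, the bucket logic can be sanity-checked on ordinary Scala collections, without Spark. This is only a sketch using the question's rows; the `bucket` helper stands in for `floor($"age" / 10)`:

```scala
// The example rows from the question: (name, age)
val people = Seq(("Max", 33), ("Adam", 32), ("Zim", 41), ("Muller", 62))

// Bucket width 10, mirroring floor($"age" / 10) for non-negative ages
def bucket(age: Int): Int = age / 10

// All pairs where b is older than a by at most 10 years
val wanted = for {
  a <- people; b <- people
  if b._2 > a._2 && b._2 - a._2 <= 10
} yield (a._1, b._1)

// The same pairs, pruned first by the two-bucket condition the joins use:
// b sits either in a's bucket or in the next one up. Because the bucket
// width equals the 10-year window, this pruning cannot lose a pair.
val viaBuckets = for {
  a <- people; b <- people
  if bucket(b._2) == bucket(a._2) || bucket(b._2) == bucket(a._2) + 1
  if b._2 > a._2 && b._2 - a._2 <= 10
} yield (a._1, b._1)
```

Both collections contain exactly the three expected pairs, which is why the union of the two bucket-equality joins reproduces the naive self-join's result.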


How to get count of group by two columns

Below is myDf:
fi_Sk sec_SK END_DATE
89 42 20160122
89 42 20150330
51 43 20140116
51 43 20130616
82 43 20100608
82 43 20160608
Below is my code:
val count = myDf.withColumn("END_DATE", unix_timestamp(col("END_DATE"), dateFormat))
.groupBy(col("sec_SK"),col("fi_Sk"))
.agg(count("sec_SK").as("Visits"), max("END_DATE").as("Recent_Visit"))
.withColumn("Recent_Visit", from_unixtime(col("Recent_Visit"), dateFormat))
I am getting the visit counts incorrectly; I need to group by (fi_Sk, sec_SK) to count visits.
The result should be like below:
fi_Sk sec_SK Visits END_DATE
89 42 2 20160122
51 43 2 20140116
82 43 2 20160608
Currently I am getting:
fi_Sk sec_SK Visits END_DATE
89 42 2 20160122
51 43 2 20140116
groupBy with aggregation would collapse all the rows in a group into one row, but the expected output seems to be that you want the count populated on each row in the group. A window function is the appropriate solution for you:
import org.apache.spark.sql.expressions.Window
def windowSpec = Window.partitionBy("fi_Sk", "sec_SK")
import org.apache.spark.sql.functions._
df.withColumn("Visits", count("fi_Sk").over(windowSpec))
// .sort("fi_Sk", "END_DATE")
// .show(false)
//
// +-----+------+--------+------+
// |fi_Sk|sec_SK|END_DATE|Visits|
// +-----+------+--------+------+
// |51   |43    |20130616|2     |
// |51   |43    |20140116|2     |
// |82   |43    |20100608|2     |
// |82   |43    |20160608|2     |
// |89   |42    |20150330|2     |
// |89   |42    |20160122|2     |
// +-----+------+--------+------+
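To see what the window count does, here is a plain-collections sketch of the same idea (no Spark): every row survives and simply gains the size of its (fi_Sk, sec_SK) group:

```scala
// Sample rows from the question: (fi_Sk, sec_SK, END_DATE)
val rows = Seq(
  (89, 42, "20160122"), (89, 42, "20150330"),
  (51, 43, "20140116"), (51, 43, "20130616"),
  (82, 43, "20100608"), (82, 43, "20160608"))

// Mimics count(...).over(Window.partitionBy("fi_Sk", "sec_SK")):
// no rows are collapsed, each is annotated with its group's size
val groupSize = rows.groupBy(r => (r._1, r._2)).map { case (k, g) => k -> g.size }
val withVisits = rows.map(r => (r._1, r._2, r._3, groupSize((r._1, r._2))))
```

All six input rows come back, each carrying a Visits value of 2, whereas a groupBy aggregation would have returned only three rows.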

Reshape dataframe from columns to rows in Scala

I want to reshape a dataframe in Spark using Scala. I found that most of the examples use groupBy and pivot. In my case I don't want to use groupBy. This is what my dataframe looks like:
tagid timestamp value
1 1 2016-12-01 05:30:00 5
2 1 2017-12-01 05:31:00 6
3 1 2017-11-01 05:32:00 4
4 1 2017-11-01 05:33:00 5
5 2 2016-12-01 05:30:00 100
6 2 2017-12-01 05:31:00 111
7 2 2017-11-01 05:32:00 109
8 2 2016-12-01 05:34:00 95
And I want my dataframe to look like this:
timestamp 1 2
1 2016-12-01 05:30:00 5 100
2 2017-12-01 05:31:00 6 111
3 2017-11-01 05:32:00 4 109
4 2017-11-01 05:33:00 5 NA
5 2016-12-01 05:34:00 NA 95
I used pivot without groupBy and it throws an error:
df.pivot("tagid")
error: value pivot is not a member of org.apache.spark.sql.DataFrame.
How do I convert this? Thank you.
Doing the following should solve your issue.
df.groupBy("timestamp").pivot("tagid").agg(first($"value"))
You should then have the final dataframe as:
+-------------------+----+----+
|timestamp |1 |2 |
+-------------------+----+----+
|2017-11-01 05:33:00|5 |null|
|2017-11-01 05:32:00|4 |109 |
|2017-12-01 05:31:00|6 |111 |
|2016-12-01 05:30:00|5 |100 |
|2016-12-01 05:34:00|null|95 |
+-------------------+----+----+
For more information you can check out the Databricks blog.
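Conceptually, the pivot groups on timestamp and spreads tagid out into columns. A plain-Scala sketch of that reshaping (no Spark; only a subset of the question's rows is used here):

```scala
// (timestamp, tagid, value) rows, a subset of the question's data
val rows = Seq(
  ("2016-12-01 05:30:00", 1, 5), ("2016-12-01 05:30:00", 2, 100),
  ("2017-11-01 05:33:00", 1, 5), ("2016-12-01 05:34:00", 2, 95))
val tags = Seq(1, 2)

// One output row per timestamp; a missing (timestamp, tag) pair becomes
// None, where Spark's pivot would show null
val pivoted = rows.groupBy(_._1).toSeq.sortBy(_._1).map { case (ts, vs) =>
  val byTag = vs.map(r => r._2 -> r._3).toMap
  (ts, tags.map(byTag.get))
}
```

This makes it clear why groupBy is unavoidable: without grouping on timestamp first, there is no row to hang the per-tag columns on.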

Merge rows into a list for similar values in Spark

Spark version 2.0.2.6 and Scala Version 2.11.11
I have the following CSV file:
sno name number
1 hello 1
1 hello 2
2 hai 12
2 hai 22
2 hai 32
3 how 43
3 how 44
3 how 45
3 how 46
4 are 33
4 are 34
4 are 45
4 are 44
4 are 43
I want output as:
sno name number
1 hello [1,2]
2 hai [12,22,32]
3 how [43,44,45,46]
4 are [33,34,44,45,43]
Order of the elements in the list is not important.
Using DataFrames or RDDs, whichever is appropriate.
Thanks
Tom
import org.apache.spark.sql.functions._
df.groupBy("sno", "name").agg(collect_list("number").alias("number")).sort("sno").show()
+---+-----+--------------------+
|sno| name| number|
+---+-----+--------------------+
| 1|hello| [1, 2]|
| 2| hai| [12, 22, 32]|
| 3| how| [43, 44, 45, 46]|
| 4| are|[33, 34, 45, 44, 43]|
+---+-----+--------------------+
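collect_list just gathers each group's values into an array; the same shape falls out of a plain Scala groupBy. A sketch with a subset of the sample rows (the lists are sorted here only to make the result deterministic, since collect_list does not guarantee order either):

```scala
// (sno, name, number) rows
val rows = Seq(
  (1, "hello", 1), (1, "hello", 2),
  (2, "hai", 12), (2, "hai", 22), (2, "hai", 32))

// Group on (sno, name) and keep the numbers of each group as a list
val merged = rows.groupBy(r => (r._1, r._2)).toSeq
  .map { case ((sno, name), g) => (sno, name, g.map(_._3).sorted) }
  .sortBy(_._1)
```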

Dataframe groupBy, get corresponding rows value, based on result of aggregate function [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 6 years ago.
I have a dataframe with columns named c1, c2, c3, c4. I want to group it on one column, use an aggregate function such as min/max on another column, and get the corresponding values of the other columns based on the result of the aggregate function.
Example :
c1 c2 c3 c4
1 23 1 1
1 45 2 2
1 91 3 3
1 90 4 4
1 71 5 5
1 42 6 6
1 72 7 7
1 44 8 8
1 55 9 9
1 21 0 0
Should result in:
c1 c2 c3 c4
1 91 3 3
Let the dataframe be df:
df.groupBy($"c1").agg(max($"c2"), ??, ??)
Can someone please help with what should go in place of ??
I know a solution to this problem using RDDs. I wanted to explore whether this can be solved in an easier way using the DataFrame/Dataset API.
You can do this in two steps:
calculate the aggregated data frame;
join the data frame back with the original data frame and filter based on the condition;
so:
val maxDF = df.groupBy("c1").agg(max($"c2").as("maxc2"))
// maxDF: org.apache.spark.sql.DataFrame = [c1: int, maxc2: int]
df.join(maxDF, Seq("c1")).where($"c2" === $"maxc2").drop($"maxc2").show
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| 1| 91| 3| 3|
+---+---+---+---+
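The two-step pattern (aggregate, then join back and filter) can be sketched with plain collections; a Map lookup plays the role of the join on c1:

```scala
// (c1, c2, c3, c4) sample rows from the question
val df = Seq((1, 23, 1, 1), (1, 45, 2, 2), (1, 91, 3, 3), (1, 90, 4, 4))

// Step 1: the aggregated side, like maxDF above -- max(c2) per c1
val maxC2 = df.groupBy(_._1).map { case (c1, rows) => c1 -> rows.map(_._2).max }

// Step 2: "join" back by lookup and keep only rows whose c2 equals the group max
val result = df.filter(r => r._2 == maxC2(r._1))
```

Note that if several rows tie for the maximum c2 in a group, this pattern (and the join version above) keeps all of them.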

de-aggregate for table columns in Greenplum

I am using Greenplum, and I have data like:
id | val
----+-----
12 | 12
12 | 23
12 | 34
13 | 23
13 | 34
13 | 45
(6 rows)
somehow I want the result like:
id | step
----+-----
12 | 12
12 | 11
12 | 11
13 | 23
13 | 11
13 | 11
(6 rows)
How it comes about:
First, there should be a window function that executes a de-aggregate function partitioned by id.
The column val holds a cumulative value, and what I want to get is the step values.
Maybe I can do it like:
select deagg(val) over (partition by id) from table_name;
So I need the deagg function.
Thanks for your help!
P.S. Greenplum is based on PostgreSQL v8.2.
You can just use the LAG function:
SELECT id,
       val - LAG(val, 1, 0) OVER (PARTITION BY id ORDER BY val) AS step
FROM yourTable
Note carefully that lag() has three parameters. The first is the column for which to find the lag, the second indicates to look at the previous record, and the third will cause lag to return a default value of zero.
Here is the table this query would generate:
id | val | lag(val, 1, 0) | val - lag(val, 1, 0)
----+-----+----------------+----------------------
12 | 12 | 0 | 12
12 | 23 | 12 | 11
12 | 34 | 23 | 11
13 | 23 | 0 | 23
13 | 34 | 23 | 11
13 | 45 | 34 | 11
Second note: This answer assumes that you want to compute your rolling difference in order of val ascending. If you want a different order you can change the ORDER BY clause of the partition.
val seems to be a cumulative sum. You can "unaggregate" it by subtracting the previous val from the current val, e.g., by using the lag function. Just note you'll have to treat the first value in each group specially, as lag will return null:
SELECT id, val - COALESCE(LAG(val) OVER (PARTITION BY id ORDER BY val), 0) AS val
FROM mytable;
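The lag arithmetic is easy to verify outside the database. A plain-Scala sketch (Scala since the rest of this document's examples use it) that, per id and in ascending val order, subtracts the previous cumulative value, defaulting to 0 for the first row:

```scala
// (id, cumulative val) rows from the question
val rows = Seq((12, 12), (12, 23), (12, 34), (13, 23), (13, 34), (13, 45))

// Per id, in ascending val order: step = val - lag(val, 1, 0)
val steps = rows.groupBy(_._1).toSeq.sortBy(_._1).flatMap { case (id, vs) =>
  val cum = vs.map(_._2).sorted
  // pair each value with its predecessor (0 for the first in the group)
  cum.zip(0 +: cum.init).map { case (cur, prev) => (id, cur - prev) }
}
```

This reproduces the step column shown in the table above: 12, 11, 11 for id 12 and 23, 11, 11 for id 13.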