How to get count of group by two columns - scala

The below is myDf
fi_Sk sec_SK END_DATE
89 42 20160122
89 42 20150330
51 43 20140116
51 43 20130616
82 43 20100608
82 43 20160608
The below is my code:
val count = myDf.withColumn("END_DATE", unix_timestamp(col("END_DATE"), dateFormat))
.groupBy(col("sec_SK"),col("fi_Sk"))
.agg(count("sec_SK").as("Visits"), max("END_DATE").as("Recent_Visit"))
.withColumn("Recent_Visit", from_unixtime(col("Recent_Visit"), dateFormat))
I am getting Visits incorrectly; I need to group by (fi_Sk and sec_SK) for counting visits.
The result should be like below:
fi_Sk sec_SK Visits END_DATE
89 42 2 20160122
51 43 2 20140116
82 43 2 20160608
Currently I am getting:
fi_Sk sec_SK Visits END_DATE
89 42 2 20160122
51 43 2 20140116

groupBy and aggregation would aggregate all the rows in a group into one row, but the expected output suggests that you want to populate the count for each row in the group. A window function is the appropriate solution here:
import org.apache.spark.sql.expressions.Window
def windowSpec = Window.partitionBy("fi_Sk", "sec_SK")
import org.apache.spark.sql.functions._
df.withColumn("Visits", count("fi_Sk").over(windowSpec))
// .sort("fi_Sk", "END_DATE")
// .show(false)
//
// +-----+------+--------+------+
// |fi_Sk|sec_SK|END_DATE|Visits|
// +-----+------+--------+------+
// |51 |42 |20130616|2 |
// |51 |42 |20140116|2 |
// |89 |44 |20100608|1 |
// |89 |42 |20150330|2 |
// |89 |42 |20160122|2 |
// +-----+------+--------+------+
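For completeness, here is a minimal, self-contained sketch of the window approach above, assembled end to end. It assumes a Spark 2.x session in scope as spark; the column names come from the question, and the date format is assumed to be "yyyyMMdd" based on the sample END_DATE values:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Sample data copied from the question
val myDf = Seq(
  ("89", "42", "20160122"), ("89", "42", "20150330"),
  ("51", "43", "20140116"), ("51", "43", "20130616"),
  ("82", "43", "20100608"), ("82", "43", "20160608")
).toDF("fi_Sk", "sec_SK", "END_DATE")

val windowSpec = Window.partitionBy("fi_Sk", "sec_SK")

// Visits = number of rows per (fi_Sk, sec_SK) group; Recent_Visit = latest END_DATE in the group
myDf
  .withColumn("Visits", count("fi_Sk").over(windowSpec))
  .withColumn("Recent_Visit", max(unix_timestamp($"END_DATE", "yyyyMMdd")).over(windowSpec))
  .withColumn("Recent_Visit", from_unixtime($"Recent_Visit", "yyyyMMdd"))
  .sort("fi_Sk", "END_DATE")
  .show(false)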

Related

Perform bucketing properly on spark query

Let's consider a dataset:
name   age
Max    33
Adam   32
Zim    41
Muller 62
Now, if we run this query on dataset x:
x.as("a").join(x.as("b")).where(
  $"b.age" - $"a.age" <= 10 and
  $"b.age" > $"a.age").show()
name age name age
Max  33  Zim  41
Adam 32  Max  33
Adam 32  Zim  41
That is my desired result.
Now, conceptually, if I have a very big dataset, I might want to use bucketing to reduce the search space.
So, doing bucketing with:
val buck_x = x.withColumn("buc_age", floor($"age"/ 10))
which gives me:
name   age buc_age
Max    33  3
Adam   32  3
Zim    41  4
Muller 62  6
After explode, I get the following result:
val exp_x = buck_x.withColumn("buc_age", explode(array($"buc_age" -1, $"buc_age", $"buc_age" + 1)))
name   age buc_age
Max    33  2
Max    33  3
Max    33  4
Adam   32  2
Adam   32  3
Adam   32  4
Zim    41  3
Zim    41  4
Zim    41  5
Muller 62  5
Muller 62  6
Muller 62  7
Now, after the final query,
exp_x.as("a").join(exp_x.as("b")).where(
  $"a.buc_age" === $"b.buc_age" and
  $"b.age" - $"a.age" <= 10 and
  $"b.age" > $"a.age").show()
I get the following result.
name age buc_age name age buc_age
Max  33  3       Zim  41  3
Max  33  4       Zim  41  4
Adam 32  2       Max  33  2
Adam 32  3       Zim  41  3
Adam 32  3       Max  33  3
Adam 32  4       Zim  41  4
Adam 32  4       Max  33  4
Clearly, this is not what I expected; I am getting more rows than expected. How can I solve this while still using buckets?
Drop your bucketing columns and then select distinct rows, essentially undoing the duplication caused by explode:
exp_x.select(exp_x.columns.map(c => col(c).as(c + "_a")): _*)
  .join(exp_x.select(exp_x.columns.map(c => col(c).as(c + "_b")): _*))
  .where(
    $"buc_age_a" === $"buc_age_b" and
    $"age_b" - $"age_a" <= 10 and
    $"age_b" > $"age_a")
  .drop("buc_age_a", "buc_age_b")
  .distinct
  .show
+------+-----+------+-----+
|name_a|age_a|name_b|age_b|
+------+-----+------+-----+
| Adam| 32| Zim| 41|
| Adam| 32| Max| 33|
| Max| 33| Zim| 41|
+------+-----+------+-----+
There is really no need for an explode.
Instead, this approach unions two inner self joins. The two joins find cases where:
A and B are in the same bucket, and B is older
B is one bucket more, but no more than 10 years older
This should perform better than using the explode, since fewer comparisons are performed (because the sets being joined here are one third of the exploded size).
val namesDF = Seq(("Max", 33), ("Adam", 32), ("Zim", 41), ("Muller", 62)).toDF("name", "age")
val buck_x = namesDF.withColumn("buc_age", floor($"age" / 10))
// same bucket where b is still older
val same = buck_x.as("a").join(buck_x.as("b"), ($"a.buc_age" === $"b.buc_age" && $"b.age" > $"a.age"), "inner")
// different buckets -- b is one bucket higher but still no more than 10 years older
val diff = buck_x.as("a").join(buck_x.as("b"), ($"a.buc_age" + 1 === $"b.buc_age" && $"b.age" <= $"a.age" + 10), "inner")
val result = same.union(diff)
The result (you can do a drop to remove excess columns like in Charlie's answer):
result.show(false)
+----+---+-------+----+---+-------+
|name|age|buc_age|name|age|buc_age|
+----+---+-------+----+---+-------+
|Adam|32 |3 |Max |33 |3 |
|Max |33 |3 |Zim |41 |4 |
|Adam|32 |3 |Zim |41 |4 |
+----+---+-------+----+---+-------+
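If the duplicated column names in result are awkward to work with downstream, one option (a sketch reusing the a/b aliases from the joins above, selecting and renaming instead of dropping) is:
// Flatten the self-join output into unambiguous column names before the union
val flat = same.select(
    $"a.name".as("name_a"), $"a.age".as("age_a"),
    $"b.name".as("name_b"), $"b.age".as("age_b"))
  .union(diff.select(
    $"a.name".as("name_a"), $"a.age".as("age_a"),
    $"b.name".as("name_b"), $"b.age".as("age_b")))
flat.show(false)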

Full outer join in RDD scala spark

I have two files below:
file1
0000003 杉山______ 26 F
0000005 崎村______ 50 F
0000007 梶川______ 42 F
file2
0000005 82 79 16 21 80
0000001 46 39 8 5 21
0000004 58 71 20 10 6
0000009 60 89 33 18 6
0000003 30 50 71 36 30
0000007 50 2 33 15 62
Now, I would like to join rows that have the same value in field 1.
I want something like this:
0000005 崎村______ 50 F 82 79 16 21 80
0000003 杉山______ 26 F 30 50 71 36 30
0000007 梶川______ 42 F 50 2 33 15 62
You can use the DataFrame join concept instead of RDD joining; that will be easier. You can refer to my sample code below. Hope it helps.
I am assuming your data is in the same format as you mentioned above. If it is in CSV or any other format, then you can skip Step 2 and update Step 1 as per the data format. If you require the output in RDD format, you can use Step 5; otherwise you can ignore it, as per the comment in the code snippet.
I have modified the data (like A______, B______, C______) just for readability.
//Step 1: Load file1 and file2 into corresponding DataFrames in text format
val df1 = spark.read.format("text").load("<path of file1>")
val df2 = spark.read.format("text").load("<path of file2>")
//Step 2: Split the single "value" column into multiple columns, including the join key
val file1 = df1
  .withColumn("col1", split($"value", " ")(0))
  .withColumn("col2", split($"value", " ")(1))
  .withColumn("col3", split($"value", " ")(2))
  .withColumn("col4", split($"value", " ")(3))
  .select("col1", "col2", "col3", "col4")
/*
+-------+-------+----+----+
|col1 |col2 |col3|col4|
+-------+-------+----+----+
|0000003|A______|26 |F |
|0000005|B______|50 |F |
|0000007|C______|42 |F |
+-------+-------+----+----+
*/
val file2 = df2
  .withColumn("col1", split($"value", " ")(0))
  .withColumn("col2", split($"value", " ")(1))
  .withColumn("col3", split($"value", " ")(2))
  .withColumn("col4", split($"value", " ")(3))
  .withColumn("col5", split($"value", " ")(4))
  .withColumn("col6", split($"value", " ")(5))
  .select("col1", "col2", "col3", "col4", "col5", "col6")
/*
+-------+----+----+----+----+----+
|col1 |col2|col3|col4|col5|col6|
+-------+----+----+----+----+----+
|0000005|82 |79 |16 |21 |80 |
|0000001|46 |39 |8 |5 |21 |
|0000004|58 |71 |20 |10 |6 |
|0000009|60 |89 |33 |18 |6 |
|0000003|30 |50 |71 |36 |30 |
|0000007|50 |2 |33 |15 |62 |
+-------+----+----+----+----+----+
*/
//Step 3: alias the DataFrames so columns can be referred to by qualified names, for readability
val file01 = file1.as("f1")
val file02 = file2.as("f2")
//Step4: Joining files on Key
file01.join(file02,col("f1.col1") === col("f2.col1"))
/*
+-------+-------+----+----+-------+----+----+----+----+----+
|col1 |col2 |col3|col4|col1 |col2|col3|col4|col5|col6|
+-------+-------+----+----+-------+----+----+----+----+----+
|0000005|B______|50 |F |0000005|82 |79 |16 |21 |80 |
|0000003|A______|26 |F |0000003|30 |50 |71 |36 |30 |
|0000007|C______|42 |F |0000007|50 |2 |33 |15 |62 |
+-------+-------+----+----+-------+----+----+----+----+----+
*/
// Step 5: if you want the data in RDD format, you can use the command below
file01.join(file02,col("f1.col1") === col("f2.col1")).rdd.collect
/*
Array[org.apache.spark.sql.Row] = Array([0000005,B______,50,F,0000005,82,79,16,21,80], [0000003,A______,26,F,0000003,30,50,71,36,30], [0000007,C______,42,F,0000007,50,2,33,15,62])
*/
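Side note, since the question title mentions a full outer join: if keys that appear in only one of the files should also be kept, the same Step 4 join can take an explicit join type. A sketch, assuming file01 and file02 are built exactly as above:
// "full_outer" keeps unmatched keys from both sides; the missing side comes back as nulls
file01.join(file02, col("f1.col1") === col("f2.col1"), "full_outer").show(false)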
I found the solution; here is my code:
// Build (key, rest-of-line) pairs from each file, keyed on the first field
val rddPair1 = logData1.map { x =>
  val data = x.split(" ")
  val index = 0
  var value = ""
  val key = data(index)
  for (i <- 0 to data.length - 1) {
    if (i != index) {
      value += data(i) + " "
    }
  }
  (key, value.trim)
}
val rddPair2 = logData2.map { x =>
  val data = x.split(" ")
  val index = 0
  var value = ""
  val key = data(index)
  for (i <- 0 to data.length - 1) {
    if (i != index) {
      value += data(i) + " "
    }
  }
  (key, value.trim)
}
// Inner join on the key, then print each matched record
rddPair1.join(rddPair2).collect().foreach(f =>
  println(f._1 + " " + f._2._1 + " " + f._2._2)
)
result:
0000003 杉山______ 26 F 30 50 71 36 30
0000005 崎村______ 50 F 82 79 16 21 80
0000007 梶川______ 42 F 50 2 33 15 62
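If a true full outer join is needed on the RDD side as well (the title asks for one, and keys 0000001, 0000004 and 0000009 appear only in file2), the pair RDDs above can be combined with fullOuterJoin instead of join. A sketch, assuming rddPair1 and rddPair2 are defined exactly as above; the "-" placeholder for a missing side is just illustrative:
// fullOuterJoin yields (key, (Option[left], Option[right])); a missing side comes back as None
rddPair1.fullOuterJoin(rddPair2).collect().foreach { case (key, (left, right)) =>
  println(key + " " + left.getOrElse("-") + " " + right.getOrElse("-"))
}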

DataFrame user-defined function not applied unless I change column name

I want to convert a DataFrame column using an implicit function definition.
I have my DataFrame type defined, which contains additional functions:
class NpDataFrame(df: DataFrame) {
  def bytes2String(colName: String): DataFrame = df
    .withColumn(colName + "_tmp", udf((x: Array[Byte]) => bytes2String(x)).apply(col(colName)))
    .drop(colName)
    .withColumnRenamed(colName + "_tmp", colName)
}
Then I define my implicit conversion class:
object NpDataFrameImplicits {
implicit def toNpDataFrame(df: DataFrame): NpDataFrame = new NpDataFrame(df)
}
So finally, here is what I do in a small FunSuite unit test:
test("example: call to bytes2String") {
val df: DataFrame = ...
df.select("header.ID").show() // (1)
df.bytes2String("header.ID").withColumnRenamed("header.ID", "id").select("id").show() // (2)
df.bytes2String("header.ID").select("header.ID").show() // (3)
}
Show #1
+-------------------------------------------------+
|ID |
+-------------------------------------------------+
|[62 BF 58 0C 6C 59 48 9C 91 13 7B 97 E7 29 C0 2F]|
|[5C 54 49 07 00 24 40 F4 B3 0E E7 2C 03 B8 06 3C]|
|[5C 3E A2 21 01 D9 4C 1B 80 4E F9 92 1D 4A FE 26]|
|[08 C1 55 89 CE 0D 45 8C 87 0A 4A 04 90 2D 51 56]|
+-------------------------------------------------+
Show #2
+------------------------------------+
|id |
+------------------------------------+
|62bf580c-6c59-489c-9113-7b97e729c02f|
|5c544907-0024-40f4-b30e-e72c03b8063c|
|5c3ea221-01d9-4c1b-804e-f9921d4afe26|
|08c15589-ce0d-458c-870a-4a04902d5156|
+------------------------------------+
Show #3
+-------------------------------------------------+
|ID |
+-------------------------------------------------+
|[62 BF 58 0C 6C 59 48 9C 91 13 7B 97 E7 29 C0 2F]|
|[5C 54 49 07 00 24 40 F4 B3 0E E7 2C 03 B8 06 3C]|
|[5C 3E A2 21 01 D9 4C 1B 80 4E F9 92 1D 4A FE 26]|
|[08 C1 55 89 CE 0D 45 8C 87 0A 4A 04 90 2D 51 56]|
+-------------------------------------------------+
As you can see here, the third show (i.e. without the column renaming) does not work as expected and shows a non-converted ID column. Does anyone know why?
EDIT:
Output of df.select(col("header.ID") as "ID").bytes2String("ID").show():
+------------------------------------+
|ID |
+------------------------------------+
|62bf580c-6c59-489c-9113-7b97e729c02f|
|5c544907-0024-40f4-b30e-e72c03b8063c|
|5c3ea221-01d9-4c1b-804e-f9921d4afe26|
|08c15589-ce0d-458c-870a-4a04902d5156|
+------------------------------------+
Let me explain what is happening with your conversion function using the example below.
First, create a data frame:
val jsonString: String =
"""{
| "employee": {
| "id": 12345,
| "name": "krishnan"
| },
| "_id": 1
|}""".stripMargin
val jsonRDD: RDD[String] = sc.parallelize(Seq(jsonString, jsonString))
val df: DataFrame = sparkSession.read.json(jsonRDD)
df.printSchema()
Output structure:
root
|-- _id: long (nullable = true)
|-- employee: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- name: string (nullable = true)
A conversion function similar to yours:
def myConversion(myDf: DataFrame, colName: String): DataFrame = {
myDf.withColumn(colName + "_tmp", udf((x: Long) => (x+1).toString).apply(col(colName)))
.drop(colName)
.withColumnRenamed(colName + "_tmp", colName)
}
Scenario 1#: do the conversion for a root-level field.
myConversion(df, "_id").show()
myConversion(df, "_id").select("_id").show()
Result:
+----------------+---+
| employee|_id|
+----------------+---+
|[12345,krishnan]| 2|
|[12345,krishnan]| 2|
+----------------+---+
+---+
|_id|
+---+
| 2|
| 2|
+---+
Scenario 2#: do the conversion for employee.id. Here, when we use employee.id, the data frame gets a new top-level column (literally named employee.id) added at the root level; the nested field itself is untouched, which is why select("employee.id") still returns the original value. This is the correct behavior.
myConversion(df, "employee.id").show()
myConversion(df, "employee.id").select("employee.id").show()
Result:
+---+----------------+-----------+
|_id| employee|employee.id|
+---+----------------+-----------+
| 1|[12345,krishnan]| 12346|
| 1|[12345,krishnan]| 12346|
+---+----------------+-----------+
+-----+
| id|
+-----+
|12345|
|12345|
+-----+
Scenario 3#: select the inner field up to the root level and then perform the conversion.
myConversion(df.select("employee.id"), "id").show()
Result:
+-----+
| id|
+-----+
|12346|
|12346|
+-----+
My new conversion function takes a struct-type field, performs the conversion, and stores the result back into the struct-type field itself. Here, we pass the employee field and convert only the id field, but the change is applied to the employee field at the root level.
case class Employee(id: String, name: String)
def myNewConversion(myDf: DataFrame, colName: String): DataFrame = {
myDf.withColumn(colName + "_tmp", udf((row: Row) => Employee((row.getLong(0)+1).toString, row.getString(1))).apply(col(colName)))
.drop(colName)
.withColumnRenamed(colName + "_tmp", colName)
}
Your scenario 3#, using my conversion function:
myNewConversion(df, "employee").show()
myNewConversion(df, "employee").select("employee.id").show()
Result:
+---+----------------+
|_id| employee|
+---+----------------+
| 1|[12346,krishnan]|
| 1|[12346,krishnan]|
+---+----------------+
+-----+
| id|
+-----+
|12346|
|12346|
+-----+
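Applied back to the original question, one way to keep the converted value reachable as header.ID is to rebuild the header struct itself, the same idea as the Employee case class above but using struct. This is only a sketch: it assumes Spark 2.x, a bytes2String(bytes: Array[Byte]): String helper as in the question, and that every other field of header (if any) is re-listed inside struct:
import org.apache.spark.sql.functions.{col, struct, udf}

val toReadableId = udf((bytes: Array[Byte]) => bytes2String(bytes))

// Rebuild "header" so that select("header.ID") now sees the converted string;
// any additional header fields would need to be listed inside struct(...) as well.
val converted = df.withColumn("header", struct(toReadableId(col("header.ID")).as("ID")))
converted.select("header.ID").show(false)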

Spark Dataframe Group by having New Indicator Column

I need to group by the "KEY" column and check whether the "TYPE_CODE" column has both "PL" and "JL" values; if so, I need to add an indicator column with "Y", else "N".
Example :
//Input Values
val values = List(List("66","PL") ,
List("67","JL") , List("67","PL"),List("67","PO"),
List("68","JL"),List("68","PO")).map(x =>(x(0), x(1)))
import spark.implicits._
//created a dataframe
val cmc = values.toDF("KEY","TYPE_CODE")
cmc.show(false)
------------------------
KEY |TYPE_CODE |
------------------------
66 |PL |
67 |JL |
67 |PL |
67 |PO |
68 |JL |
68 |PO |
-------------------------
Expected Output:
For each "KEY", if its "TYPE_CODE" values include both PL and JL, then Y,
else N:
-----------------------------------------------------
KEY |TYPE_CODE | Indicator
-----------------------------------------------------
66 |PL | N
67 |JL | Y
67 |PL | Y
67 |PO | Y
68 |JL | N
68 |PO | N
---------------------------------------------------
For example,
67 has both PL & JL - So "Y"
66 has only PL - So "N"
68 has only JL - So "N"
One option:
1) collect TYPE_CODE as a list;
2) check if it contains the specific strings;
3) then flatten the list with explode:
(cmc.groupBy("KEY")
.agg(collect_list("TYPE_CODE").as("TYPE_CODE"))
.withColumn("Indicator",
when(array_contains($"TYPE_CODE", "PL") && array_contains($"TYPE_CODE", "JL"), "Y").otherwise("N"))
.withColumn("TYPE_CODE", explode($"TYPE_CODE"))).show
+---+---------+---------+
|KEY|TYPE_CODE|Indicator|
+---+---------+---------+
| 68| JL| N|
| 68| PO| N|
| 67| JL| Y|
| 67| PL| Y|
| 67| PO| Y|
| 66| PL| N|
+---+---------+---------+
Another option:
Group by KEY and use agg to create two separate indicator columns (one for JL and one for PL), then calculate the combined indicator
join with the original DataFrame
Altogether:
val indicators = cmc.groupBy("KEY").agg(
sum(when($"TYPE_CODE" === "PL", 1).otherwise(0)) as "pls",
sum(when($"TYPE_CODE" === "JL", 1).otherwise(0)) as "jls"
).withColumn("Indicator", when($"pls" > 0 && $"jls" > 0, "Y").otherwise("N"))
val result = cmc.join(indicators, "KEY")
.select("KEY", "TYPE_CODE", "Indicator")
This might be slower than @Psidom's answer, but might be safer: collect_list might be problematic if you have a huge number of matches for a specific key (that list would have to be stored in a single worker's memory).
EDIT:
In case the input is known to be unique (i.e. JL / PL would only appear once per key, at most), indicators could be created using simple count aggregation, which is (arguably) easier to read:
val indicators = cmc
.where($"TYPE_CODE".isin("PL", "JL"))
.groupBy("KEY").count()
.withColumn("Indicator", when($"count" === 2, "Y").otherwise("N"))
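As with the first variant, these indicators still need to be joined back to the original DataFrame to attach the flag to every row. A sketch, assuming the same imports (org.apache.spark.sql.functions._ and spark.implicits._) as the snippets above; a left join plus a default covers keys that have neither PL nor JL and are therefore absent from indicators:
val resultSimple = cmc.join(indicators, Seq("KEY"), "left")
  .withColumn("Indicator", coalesce($"Indicator", lit("N")))  // keys missing from indicators get "N"
  .select("KEY", "TYPE_CODE", "Indicator")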

de-aggregate for table columns in Greenplum

I am using Greenplum, and I have data like:
id | val
----+-----
12 | 12
12 | 23
12 | 34
13 | 23
13 | 34
13 | 45
(6 rows)
Somehow I want a result like:
id | step
----+-----
12 | 12
12 | 11
12 | 11
13 | 23
13 | 11
13 | 11
(6 rows)
How it should work:
First, there should be a window function, which executes a de-aggregate function partitioned by id.
The val column holds cumulative values, and what I want to get is the step values.
Maybe I can do it like:
select deagg(val) over (partition by id) from table_name;
So I need the deagg function.
Thanks for your help!
P.S. Greenplum is based on PostgreSQL 8.2.
You can just use the LAG function:
SELECT id,
       val - LAG(val, 1, 0) OVER (PARTITION BY id ORDER BY val) AS step
FROM yourTable
Note carefully that lag() takes three parameters here. The first is the column for which to find the lag, the second is the offset (1, i.e. look at the previous record), and the third is the default value (zero) returned when there is no previous record.
Here is a table showing the values this query would generate:
id | val | lag(val, 1, 0) | val - lag(val, 1, 0)
----+-----+----------------+----------------------
12 | 12 | 0 | 12
12 | 23 | 12 | 11
12 | 34 | 23 | 11
13 | 23 | 0 | 23
13 | 34 | 23 | 11
13 | 45 | 34 | 11
Second note: This answer assumes that you want to compute your rolling difference in order of val ascending. If you want a different order you can change the ORDER BY clause of the partition.
val seems to be a cumulative sum. You can "unaggregate" it by subtracting the previous val from the current val, e.g., by using the lag function. Just note you'll have to treat the first value in each group specially, as lag will return null:
SELECT id, val - COALESCE(LAG(val) OVER (PARTITION BY id ORDER BY val), 0) AS val
FROM mytable;