Spark dataframe groupby and order group? - scala

I have the following data:
+-------+----+----+
|user_id|time|item|
+-------+----+----+
| 1| 5| ggg|
| 1| 5| ddd|
| 1| 20| aaa|
| 1| 20| ppp|
| 2| 3| ccc|
| 2| 3| ttt|
| 2| 20| eee|
+-------+----+----+
This can be generated by the following code:
val df = sc.parallelize(Array(
(1, 20, "aaa"),
(1, 5, "ggg"),
(2, 3, "ccc"),
(1, 20, "ppp"),
(1, 5, "ddd"),
(2, 20, "eee"),
(2, 3, "ttt"))).toDF("user_id", "time", "item")
How can I get this result?
+---------+------+------+----------+
| user_id | time | item | order_id |
+---------+------+------+----------+
| 1 | 5 | ggg | 1 |
| 1 | 5 | ddd | 1 |
| 1 | 20 | aaa | 2 |
| 1 | 20 | ppp | 2 |
| 2 | 3 | ccc | 1 |
| 2 | 3 | ttt | 1 |
| 2 | 20 | eee | 2 |
+---------+------+------+----------+
I want to group by user_id and time, order by time, and rank each time group within its user_id. Thanks!

To rank the rows, you can use the dense_rank window function; the final ordering can be achieved with an orderBy transformation:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank
// Ordering by user_id inside a user_id partition is redundant, so order by time only.
val w = Window.partitionBy("user_id").orderBy("time")
val result = df
.withColumn("order_id", dense_rank().over(w))
.orderBy("user_id", "time")
result.show()
+-------+----+----+--------+
|user_id|time|item|order_id|
+-------+----+----+--------+
| 1| 5| ddd| 1|
| 1| 5| ggg| 1|
| 1| 20| aaa| 2|
| 1| 20| ppp| 2|
| 2| 3| ttt| 1|
| 2| 3| ccc| 1|
| 2| 20| eee| 2|
+-------+----+----+--------+
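Note that the order of rows within the same (user_id, time) group is not guaranteed, so the item column may come out in a different order than in the expected result.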
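To make the ranking rule concrete, here is a minimal plain-Python sketch (not Spark code) of what dense_rank over a window partitioned by user_id and ordered by time computes. The function name dense_rank_by_user is made up for this illustration, and the stable sort is only an assumption used to get a reproducible row order.

```python
from itertools import groupby

# Same rows as the question's DataFrame: (user_id, time, item).
rows = [
    (1, 20, "aaa"), (1, 5, "ggg"), (2, 3, "ccc"), (1, 20, "ppp"),
    (1, 5, "ddd"), (2, 20, "eee"), (2, 3, "ttt"),
]


def dense_rank_by_user(rows):
    """Return (user_id, time, item, order_id) tuples sorted by user_id, time."""
    ranked = []
    # Stable sort: rows with equal (user_id, time) keep their input order.
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for _, user_rows in groupby(ordered, key=lambda r: r[0]):
        rank, last_time = 0, None
        for user_id, time, item in user_rows:
            # The rank only advances when the time value changes,
            # so ties share a rank and no rank numbers are skipped.
            if time != last_time:
                rank += 1
                last_time = time
            ranked.append((user_id, time, item, rank))
    return ranked


result = dense_rank_by_user(rows)
for row in result:
    print(row)
```

Rows that share a time within a user get the same order_id, which is the difference from row_number.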

Related

How to solve the following issue with Apache Spark with an optimal solution

I need to solve the following problem without GraphFrames. Please help.
Input Dataframe
|-----------+-----------+--------------|
| ID | prev | next |
|-----------+-----------+--------------|
| 1 | 1 | 2 |
| 2 | 1 | 3 |
| 3 | 2 | null |
| 9 | 9 | null |
|-----------+-----------+--------------|
output dataframe
|-----------+------------|
| bill_id | item_id |
|-----------+------------|
| 1 | [1, 2, 3] |
| 9 | [9] |
|-----------+------------|
This is probably quite inefficient, but it works. It is inspired by how graphframes does connected components. Basically join with itself on the prev column until it doesn't get any lower, then group.
import pyspark.sql.functions as F  # F.collect_set is used in the final aggregation
df = sc.parallelize([(1, 1, 2), (2, 1, 3), (3, 2, None), (9, 9, None)]).toDF(['ID', 'prev', 'next'])
df.show()
+---+----+----+
| ID|prev|next|
+---+----+----+
| 1| 1| 2|
| 2| 1| 3|
| 3| 2|null|
| 9| 9|null|
+---+----+----+
converged = False
count = 0
while not converged:
    # Look up each row's prev as an ID to find that row's own prev (lower_prev).
    step = df.join(df.selectExpr('ID as prev', 'prev as lower_prev'), 'prev', 'left').cache()
    print('step', count)
    step.show()
    # Converged once no row can be pointed at a lower prev.
    converged = step.where('prev != lower_prev').count() == 0
    df = step.selectExpr('ID', 'lower_prev as prev')
    print('df', count)
    df.show()
    count += 1
step 0
+----+---+----+----------+
|prev| ID|next|lower_prev|
+----+---+----+----------+
| 2| 3|null| 1|
| 1| 2| 3| 1|
| 1| 1| 2| 1|
| 9| 9|null| 9|
+----+---+----+----------+
df 0
+---+----+
| ID|prev|
+---+----+
| 3| 1|
| 1| 1|
| 2| 1|
| 9| 9|
+---+----+
step 1
+----+---+----------+
|prev| ID|lower_prev|
+----+---+----------+
| 1| 3| 1|
| 1| 1| 1|
| 1| 2| 1|
| 9| 9| 9|
+----+---+----------+
df 1
+---+----+
| ID|prev|
+---+----+
| 3| 1|
| 1| 1|
| 2| 1|
| 9| 9|
+---+----+
df.groupBy('prev').agg(F.collect_set('ID').alias('item_id')).withColumnRenamed('prev', 'bill_id').show()
+-------+---------+
|bill_id| item_id|
+-------+---------+
| 1|[1, 2, 3]|
| 9| [9]|
+-------+---------+
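The iteration above can also be sketched in plain Python (not Spark code) to show why it converges. This is a minimal sketch that assumes every prev value also appears as an ID and that each chain ends at a row pointing to itself, as in the sample data; resolve_roots is a name invented for this example.

```python
from collections import defaultdict

# ID -> prev, taken from the input table.
prev = {1: 1, 2: 1, 3: 2, 9: 9}


def resolve_roots(prev):
    """Repeatedly replace each pointer with its target's pointer until stable."""
    current = dict(prev)
    while True:
        # One pass corresponds to one self-join on the prev column.
        stepped = {node: current[parent] for node, parent in current.items()}
        if stepped == current:
            return current
        current = stepped


roots = resolve_roots(prev)

# Group the IDs by their final root, like groupBy('prev') with collect_set.
bills = defaultdict(list)
for item_id in sorted(roots):
    bills[roots[item_id]].append(item_id)
bills = dict(bills)
print(bills)
```

Each pass shortens every chain, so the number of passes grows with the longest chain, which is the reason the Spark version can be slow on deep chains.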

Join 2 DataFrames with different dimensions in Scala

Hi, I have 2 different DataFrames:
scala> d1.show()
+--------+-------+
| fecha|eventos|
+--------+-------+
|20180404| 3|
|20180405| 7|
|20180406| 10|
|20180409| 4|
....
scala> d1.count()
res3: Long = 60
scala> d2.show()
+--------+----------+
| fecha|TotalEvent|
+--------+----------+
| 0| 23534|
|20180322| 10|
|20180326| 50|
|20180402| 6|
|20180403| 118|
|20180404| 1110|
...
scala> d2.count()
res7: Long = 74
I would like to join them by fecha without losing data, and then create a new column with the math operation (TotalEvent - eventos)*100/TotalEvent.
Something like this:
+---------+-------+----------+--------+
|fecha |eventos|TotalEvent| KPI |
+---------+-------+----------+--------+
| 0| | 23534 | 100.00|
| 20180322| | 10 | 100.00|
| 20180326| | 50 | 100.00|
| 20180402| | 6 | 100.00|
| 20180403| | 118 | 100.00|
| 20180404| 3 | 1110 | 99.73|
| 20180405| 7 | 1204 | 99.42|
| 20180406| 10 | 1526 | 99.34|
| 20180407| | 14 | 100.00|
| 20180409| 4 | 1230 | 99.67|
| 20180410| 11 | 1456 | 99.24|
| 20180411| 6 | 1572 | 99.62|
| 20180412| 5 | 1450 | 99.66|
| 20180413| 7 | 1214 | 99.42|
.....
The problem is that I can't find a way to do it.
When I use:
scala> d1.join(d2,d2("fecha").contains(d1("fecha")), "left").show()
I lose the data that isn't in both tables.
+--------+-------+--------+----------+
| fecha|eventos| fecha|TotalEvent|
+--------+-------+--------+----------+
|20180404| 3|20180404| 1110|
|20180405| 7|20180405| 1204|
|20180406| 10|20180406| 1526|
|20180409| 4|20180409| 1230|
|20180410| 11|20180410| 1456|
....
Additionally, how can I add a new column with the math operation?
Thank you
I would recommend left-joining df2 with df1 and calculating KPI based on whether eventos is null or not in the joined dataset (using when/otherwise):
import org.apache.spark.sql.functions._
val df1 = Seq(
("20180404", 3),
("20180405", 7),
("20180406", 10),
("20180409", 4)
).toDF("fecha", "eventos")
val df2 = Seq(
("0", 23534),
("20180322", 10),
("20180326", 50),
("20180402", 6),
("20180403", 118),
("20180404", 1110),
("20180405", 100),
("20180406", 100)
).toDF("fecha", "TotalEvent")
df2.
join(df1, Seq("fecha"), "left_outer").
withColumn( "KPI",
round( when($"eventos".isNull, 100.0).
otherwise(($"TotalEvent" - $"eventos") * 100.0 / $"TotalEvent"),
2
)
).show
// +--------+----------+-------+-----+
// | fecha|TotalEvent|eventos| KPI|
// +--------+----------+-------+-----+
// | 0| 23534| null|100.0|
// |20180322| 10| null|100.0|
// |20180326| 50| null|100.0|
// |20180402| 6| null|100.0|
// |20180403| 118| null|100.0|
// |20180404| 1110| 3|99.73|
// |20180405| 100| 7| 93.0|
// |20180406| 100| 10| 90.0|
// +--------+----------+-------+-----+
Note that if the more precise raw KPI is wanted instead, just remove the wrapping round( , 2).
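As a cross-check of the KPI rule, here is a small plain-Python sketch (not Spark code) of the same left join and when/otherwise logic, using the sample values from this answer. The function name kpi_rows and the dictionary representation are assumptions made for the illustration.

```python
# fecha -> eventos and fecha -> TotalEvent, same sample values as df1 and df2.
d1 = {"20180404": 3, "20180405": 7, "20180406": 10, "20180409": 4}
d2 = {
    "0": 23534, "20180322": 10, "20180326": 50, "20180402": 6,
    "20180403": 118, "20180404": 1110, "20180405": 100, "20180406": 100,
}


def kpi_rows(d1, d2):
    """Left join d2 with d1 on fecha and compute the KPI for every d2 row."""
    rows = []
    for fecha, total in d2.items():  # every row of the left side is kept
        eventos = d1.get(fecha)      # None plays the role of null
        if eventos is None:
            kpi = 100.0              # the when(eventos.isNull, 100.0) branch
        else:
            kpi = round((total - eventos) * 100.0 / total, 2)
        rows.append((fecha, total, eventos, kpi))
    return rows


rows = kpi_rows(d1, d2)
for row in rows:
    print(row)
```

As in the joined output above, 20180409 exists only in d1 and is therefore dropped; keeping such rows would need a full outer join instead.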
I would do this in several steps. First join, then add the calculated column, then fill in the nulls:
val df2a = df2.withColumnRenamed("fecha", "fecha2") // to avoid ambiguous column names after the join
val df3 = df1.join(df2a, df1("fecha") === df2a("fecha2"), "outer")
val kpi = df3.withColumn("KPI", ($"TotalEvent" - $"eventos") / $"TotalEvent" * 100).na.fill(100, Seq("KPI"))
kpi.show()
+--------+-------+--------+----------+-----------------+
| fecha|eventos| fecha2|TotalEvent| KPI|
+--------+-------+--------+----------+-----------------+
| null| null|20180402| 6| 100.0|
| null| null| 0| 23534| 100.0|
| null| null|20180322| 10| 100.0|
|20180404| 3|20180404| 1110|99.72972972972973|
|20180406| 10| null| null| 100.0|
| null| null|20180403| 118| 100.0|
| null| null|20180326| 50| 100.0|
|20180409| 4| null| null| 100.0|
|20180405| 7| null| null| 100.0|
+--------+-------+--------+----------+-----------------+
I solved the problem by combining both suggestions I received:
val dfKPI = d1
  .join(right = d2, usingColumns = Seq("cliente", "fecha"), "outer")
  .orderBy("fecha")
  .withColumn("KPI", round(
    when($"eventos".isNull, 100.0)
      .otherwise(($"TotalEvent" - $"eventos") * 100.0 / $"TotalEvent"),
    2))

Spark - How to apply rules defined in a dataframe to another dataframe

I'm trying to solve this kind of problem with Spark 2, but I can't find a solution.
I have a dataframe A :
+----+-------+------+
|id |COUNTRY| MONTH|
+----+-------+------+
| 1 | US | 1 |
| 2 | FR | 1 |
| 4 | DE | 1 |
| 5 | DE | 2 |
| 3 | DE | 3 |
+----+-------+------+
And a dataframe B :
+-------+------+------+
|COLUMN |VALUE | PRIO |
+-------+------+------+
|COUNTRY| US | 5 |
|COUNTRY| FR | 15 |
|MONTH | 3 | 2 |
+-------+------+------+
The idea is to apply "rules" of dataframe B on dataframe A in order to get this result :
dataframe A' :
+----+-------+------+------+
|id |COUNTRY| MONTH| PRIO |
+----+-------+------+------+
| 1 | US | 1 | 5 |
| 2 | FR | 1 | 15 |
| 4 | DE | 1 | 20 |
| 5 | DE | 2 | 20 |
| 3 | DE | 3 | 2 |
+----+-------+------+------+
I tried something like this:
dfB.collect.foreach( r =>
var dfAp = dfA.where(r.getAs("COLUMN") == r.getAs("VALUE"))
dfAp.withColumn("PRIO", lit(r.getAs("PRIO")))
)
But I'm sure it's not the right way.
What strategies are there to solve this problem in Spark?
Working under the assumption that the set of rules is reasonably small (possible concerns are the size of the data and the size of the generated expression, which in the worst case can crash the planner), the simplest solution is to collect the rules into a local collection and map it to a SQL expression:
import org.apache.spark.sql.functions.{coalesce, col, lit, when}
val df = Seq(
(1, "US", "1"), (2, "FR", "1"), (4, "DE", "1"),
(5, "DE", "2"), (3, "DE", "3")
).toDF("id", "COUNTRY", "MONTH")
val rules = Seq(
("COUNTRY", "US", 5), ("COUNTRY", "FR", 15), ("MONTH", "3", 2)
).toDF("COLUMN", "VALUE", "PRIO")
val prio = coalesce(rules.as[(String, String, Int)].collect.map {
case (c, v, p) => when(col(c) === v, p)
} :+ lit(20): _*)
df.withColumn("PRIO", prio)
+---+-------+-----+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+-----+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+-----+----+
You can replace coalesce with least or greatest to apply the smallest or the largest matching value respectively.
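The difference between the coalesce and least variants can be illustrated with a plain-Python sketch (not Spark code). The helper names are invented for this example, and the last row (id 6) is a hypothetical addition, not part of the question's data, chosen because it matches two rules.

```python
rules = [("COUNTRY", "US", 5), ("COUNTRY", "FR", 15), ("MONTH", "3", 2)]
DEFAULT_PRIO = 20

rows = [
    {"id": 1, "COUNTRY": "US", "MONTH": "1"},
    {"id": 2, "COUNTRY": "FR", "MONTH": "1"},
    {"id": 4, "COUNTRY": "DE", "MONTH": "1"},
    {"id": 5, "COUNTRY": "DE", "MONTH": "2"},
    {"id": 3, "COUNTRY": "DE", "MONTH": "3"},
    {"id": 6, "COUNTRY": "US", "MONTH": "3"},  # hypothetical: matches two rules
]


def matching_prios(row, rules):
    """Priorities of all rules whose column/value pair matches the row."""
    return [prio for column, value, prio in rules if row[column] == value]


def first_match(row, rules, default=DEFAULT_PRIO):
    """Like coalesce: the first matching rule wins, else the default."""
    matches = matching_prios(row, rules)
    return matches[0] if matches else default


def smallest_match(row, rules, default=DEFAULT_PRIO):
    """Like least: the smallest candidate wins, and the default is a candidate."""
    return min(matching_prios(row, rules) + [default])


for row in rows:
    print(row["id"], first_match(row, rules), smallest_match(row, rules))
```

Because the default is one of the candidates in the least/greatest form, it takes part in the comparison as well, which is worth keeping in mind if some rule priorities are larger than the default (for least) or smaller than it (for greatest).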
With a larger set of rules you could take the following steps.
First, melt the data to convert it to a long format:
val dfLong = df.melt(Seq("id"), df.columns.tail, "COLUMN", "VALUE")
Then join by column and value, and aggregate PRIO by id with an appropriate aggregation function (for example min):
val priorities = dfLong.join(rules, Seq("COLUMN", "VALUE"))
.groupBy("id")
.agg(min("PRIO").alias("PRIO"))
Finally, outer join the output with df by id and fill the missing priorities with the default:
df.join(priorities, Seq("id"), "leftouter").na.fill(20)
+---+-------+-----+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+-----+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+-----+----+
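The melt, join and aggregate steps can be sketched in plain Python (not Spark code) as follows. The melt function below is this example's own helper and only mimics the long-format reshaping; the default of 20 and the use of min mirror the answer.

```python
rows = [
    {"id": 1, "COUNTRY": "US", "MONTH": "1"},
    {"id": 2, "COUNTRY": "FR", "MONTH": "1"},
    {"id": 4, "COUNTRY": "DE", "MONTH": "1"},
    {"id": 5, "COUNTRY": "DE", "MONTH": "2"},
    {"id": 3, "COUNTRY": "DE", "MONTH": "3"},
]
# (COLUMN, VALUE) -> PRIO, used as the lookup side of the join.
rules = {("COUNTRY", "US"): 5, ("COUNTRY", "FR"): 15, ("MONTH", "3"): 2}


def melt(rows, id_col):
    """Turn each wide row into (id, column, value) triples (long format)."""
    return [
        (row[id_col], column, value)
        for row in rows
        for column, value in row.items()
        if column != id_col
    ]


long_rows = melt(rows, "id")

# Inner join on (column, value), then keep the minimum PRIO per id.
matched = {}
for row_id, column, value in long_rows:
    prio = rules.get((column, value))
    if prio is not None:
        matched[row_id] = min(prio, matched.get(row_id, prio))

# Left outer join back to the original rows, defaulting to 20.
result = [dict(row, PRIO=matched.get(row["id"], 20)) for row in rows]
for row in result:
    print(row)
```

The point of the long format is that the rules become data to join against rather than an expression that grows with the number of rules.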
Let's assume the set of rules in dataframe B is limited.
I have created a dataframe "df" for the table below:
+---+-------+-----+
| id|COUNTRY|MONTH|
+---+-------+-----+
| 1| US| 1|
| 2| FR| 1|
| 4| DE| 1|
| 5| DE| 2|
| 3| DE| 3|
+---+-------+-----+
Using a UDF:
val code = udf{(x:String,y:Int)=>if(x=="US") "5" else if (x=="FR") "15" else if (y==3) "2" else "20"}
df.withColumn("PRIO",code($"COUNTRY",$"MONTH")).show()
Output:
+---+-------+-----+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+-----+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+-----+----+

Pyspark Join Tables

I'm new to PySpark. I have 'Table A' and 'Table B', and I need to join them to get 'Table C'. Can anyone help me, please?
I'm using DataFrames.
I don't know how to join these tables in the right way.
Table A:
+--+----------+-----+
|id|year_month| qt |
+--+----------+-----+
| 1| 2015-05| 190 |
| 2| 2015-06| 390 |
+--+----------+-----+
Table B:
+----------+-----+
|year_month| sem |
+----------+-----+
| 2016-01| 1 |
| 2015-02| 1 |
| 2015-03| 1 |
| 2016-04| 1 |
| 2015-05| 1 |
| 2015-06| 1 |
| 2016-07| 2 |
| 2015-08| 2 |
| 2015-09| 2 |
| 2016-10| 2 |
| 2015-11| 2 |
| 2015-12| 2 |
+----------+-----+
Table C:
The join adds columns and also adds rows:
+--+----------+-----+-----+
|id|year_month| qt | sem |
+--+----------+-----+-----+
| 1| 2015-05 | 0 | 1 |
| 1| 2016-01 | 0 | 1 |
| 1| 2015-02 | 0 | 1 |
| 1| 2015-03 | 0 | 1 |
| 1| 2016-04 | 0 | 1 |
| 1| 2015-05 | 190 | 1 |
| 1| 2015-06 | 0 | 1 |
| 1| 2016-07 | 0 | 2 |
| 1| 2015-08 | 0 | 2 |
| 1| 2015-09 | 0 | 2 |
| 1| 2016-10 | 0 | 2 |
| 1| 2015-11 | 0 | 2 |
| 1| 2015-12 | 0 | 2 |
| 2| 2015-05 | 0 | 1 |
| 2| 2016-01 | 0 | 1 |
| 2| 2015-02 | 0 | 1 |
| 2| 2015-03 | 0 | 1 |
| 2| 2016-04 | 0 | 1 |
| 2| 2015-05 | 0 | 1 |
| 2| 2015-06 | 390 | 1 |
| 2| 2016-07 | 0 | 2 |
| 2| 2015-08 | 0 | 2 |
| 2| 2015-09 | 0 | 2 |
| 2| 2016-10 | 0 | 2 |
| 2| 2015-11 | 0 | 2 |
| 2| 2015-12 | 0 | 2 |
+--+----------+-----+-----+
Code:
from pyspark import HiveContext
sqlContext = HiveContext(sc)
lA = [(1,"2015-05",190),(2,"2015-06",390)]
tableA = sqlContext.createDataFrame(lA, ["id","year_month","qt"])
tableA.show()
lB = [("2016-01",1),("2015-02",1),("2015-03",1),("2016-04",1),
("2015-05",1),("2015-06",1),("2016-07",2),("2015-08",2),
("2015-09",2),("2016-10",2),("2015-11",2),("2015-12",2)]
tableB = sqlContext.createDataFrame(lB,["year_month","sem"])
tableB.show()
It's not really a join; it's more of a Cartesian product (cross join).
Spark 2:
import pyspark.sql.functions as psf
tableA.crossJoin(tableB)\
.withColumn(
"qt",
psf.when(tableB.year_month == tableA.year_month, psf.col("qt")).otherwise(0))\
.drop(tableA.year_month)
Spark 1.6:
tableA.join(tableB)\
.withColumn(
"qt",
psf.when(tableB.year_month == tableA.year_month, psf.col("qt")).otherwise(0))\
.drop(tableA.year_month)
+---+---+----------+---+
| id| qt|year_month|sem|
+---+---+----------+---+
| 1| 0| 2015-02| 1|
| 1| 0| 2015-03| 1|
| 1|190| 2015-05| 1|
| 1| 0| 2015-06| 1|
| 1| 0| 2016-01| 1|
| 1| 0| 2016-04| 1|
| 1| 0| 2015-08| 2|
| 1| 0| 2015-09| 2|
| 1| 0| 2015-11| 2|
| 1| 0| 2015-12| 2|
| 1| 0| 2016-07| 2|
| 1| 0| 2016-10| 2|
| 2| 0| 2015-02| 1|
| 2| 0| 2015-03| 1|
| 2| 0| 2015-05| 1|
| 2|390| 2015-06| 1|
| 2| 0| 2016-01| 1|
| 2| 0| 2016-04| 1|
| 2| 0| 2015-08| 2|
| 2| 0| 2015-09| 2|
| 2| 0| 2015-11| 2|
| 2| 0| 2015-12| 2|
| 2| 0| 2016-07| 2|
| 2| 0| 2016-10| 2|
+---+---+----------+---+
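For intuition, the cross join plus conditional qt can be sketched in plain Python (not Spark code) with itertools.product. The variable names are chosen for this example, and the tuple layout follows Table C (id, year_month, qt, sem).

```python
from itertools import product

table_a = [(1, "2015-05", 190), (2, "2015-06", 390)]
table_b = [
    ("2016-01", 1), ("2015-02", 1), ("2015-03", 1), ("2016-04", 1),
    ("2015-05", 1), ("2015-06", 1), ("2016-07", 2), ("2015-08", 2),
    ("2015-09", 2), ("2016-10", 2), ("2015-11", 2), ("2015-12", 2),
]

# Every row of A is paired with every row of B; qt survives only where
# the two year_month values agree, otherwise it becomes 0.
table_c = [
    (a_id, b_month, qt if a_month == b_month else 0, sem)
    for (a_id, a_month, qt), (b_month, sem) in product(table_a, table_b)
]

for row in table_c:
    print(row)
```

The result has len(table_a) * len(table_b) rows, which is why this approach only suits cases where at least one side is small.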

Spark SQL DataFrame transformation involving partitioning and lagging

I want to transform a Spark SQL DataFrame like this:
animal value
------------
cat 8
cat 5
cat 6
dog 2
dog 4
dog 3
rat 7
rat 4
rat 9
into a DataFrame like this:
animal value previous-value
-----------------------------
cat 8 0
cat 5 8
cat 6 5
dog 2 0
dog 4 2
dog 3 4
rat 7 0
rat 4 7
rat 9 4
I sort of want to partition by animal, and then, for each animal, previous-value lags one row behind value (with a default value of 0), and then put the partitions back together again.
This can be accomplished using a window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import sqlContext.implicits._
val df = sc.parallelize(Seq(
  ("cat", 8, "01:00"), ("cat", 5, "02:00"), ("cat", 6, "03:00"),
  ("dog", 2, "02:00"), ("dog", 4, "04:00"), ("dog", 3, "06:00"),
  ("rat", 7, "01:00"), ("rat", 4, "03:00"), ("rat", 9, "05:00")
)).toDF("animal", "value", "time")
df.show
+------+-----+-----+
|animal|value| time|
+------+-----+-----+
| cat| 8|01:00|
| cat| 5|02:00|
| cat| 6|03:00|
| dog| 2|02:00|
| dog| 4|04:00|
| dog| 3|06:00|
| rat| 7|01:00|
| rat| 4|03:00|
| rat| 9|05:00|
+------+-----+-----+
I've added a "time" field to illustrate orderBy.
val w1 = Window.partitionBy($"animal").orderBy($"time")
val previous_value = lag($"value", 1).over(w1)
val df1 = df.withColumn("previous", previous_value)
df1.show
+------+-----+-----+--------+
|animal|value| time|previous|
+------+-----+-----+--------+
| dog| 2|02:00| null|
| dog| 4|04:00| 2|
| dog| 3|06:00| 4|
| cat| 8|01:00| null|
| cat| 5|02:00| 8|
| cat| 6|03:00| 5|
| rat| 7|01:00| null|
| rat| 4|03:00| 7|
| rat| 9|05:00| 4|
+------+-----+-----+--------+
If you want to replace nulls with 0:
val df2 = df1.na.fill(0)
df2.show
+------+-----+-----+--------+
|animal|value| time|previous|
+------+-----+-----+--------+
| dog| 2|02:00| 0|
| dog| 4|04:00| 2|
| dog| 3|06:00| 4|
| cat| 8|01:00| 0|
| cat| 5|02:00| 8|
| cat| 6|03:00| 5|
| rat| 7|01:00| 0|
| rat| 4|03:00| 7|
| rat| 9|05:00| 4|
+------+-----+-----+--------+
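What the window does can be modelled in a few lines of plain Python (not Spark code): group by animal, order each group by time, and carry the previous value forward. The function name with_previous is an assumption of this sketch, and it applies the default of 0 directly rather than filling nulls afterwards.

```python
from collections import defaultdict

rows = [
    ("cat", 8, "01:00"), ("cat", 5, "02:00"), ("cat", 6, "03:00"),
    ("dog", 2, "02:00"), ("dog", 4, "04:00"), ("dog", 3, "06:00"),
    ("rat", 7, "01:00"), ("rat", 4, "03:00"), ("rat", 9, "05:00"),
]


def with_previous(rows, default=0):
    """Return (animal, value, time, previous) with a one-row lag per animal."""
    by_animal = defaultdict(list)
    for animal, value, time in rows:
        by_animal[animal].append((time, value))  # the partitionBy step

    out = []
    for animal, entries in by_animal.items():
        entries.sort()          # the orderBy(time) step within the partition
        previous = default      # the first row of a partition has no predecessor
        for time, value in entries:
            out.append((animal, value, time, previous))
            previous = value
    return out


lagged = with_previous(rows)
for row in lagged:
    print(row)
```

Spark's lag also has an overload that takes a default value as a third argument, which may make the separate na.fill step unnecessary.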
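This piece of code would work, although collect_list does not guarantee the order of the collected values after a shuffle:
import org.apache.spark.sql.functions.collect_list
import spark.implicits._
val df = spark.read.format("CSV").option("header","true").load("/home/shivansh/Desktop/foo.csv")
val df2 = df.groupBy("animal").agg(collect_list("value") as "listValue")
val desiredDF = df2.rdd.flatMap { row =>
  val animal = row.getAs[String]("animal")
  val valueList = row.getAs[Seq[String]]("listValue").toList
  // Pair each value with its predecessor; the first value is paired with "0".
  val newlist = valueList zip ("0" :: valueList)
  newlist.map(a => (animal, a._1, a._2))
}.toDF("animal","value","previousValue")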
On the Spark shell:
scala> val df=spark.read.format("CSV").option("header","true").load("/home/shivansh/Desktop/foo.csv")
df: org.apache.spark.sql.DataFrame = [animal: string, value: string]
scala> df.show()
+------+-----+
|animal|value|
+------+-----+
| cat| 8|
| cat| 5|
| cat| 6|
| dog| 2|
| dog| 4|
| dog| 3|
| rat| 7|
| rat| 4 |
| rat| 9|
+------+-----+
scala> val df2=df.groupBy("animal").agg(collect_list("value") as "listValue")
df2: org.apache.spark.sql.DataFrame = [animal: string, listValue: array<string>]
scala> df2.show()
+------+----------+
|animal| listValue|
+------+----------+
| rat|[7, 4 , 9]|
| dog| [2, 4, 3]|
| cat| [8, 5, 6]|
+------+----------+
scala> val desiredDF=df2.rdd.flatMap{row=>
| val animal=row.getAs[String]("animal")
| val valueList=row.getAs[Seq[String]]("listValue").toList
| val newlist=valueList zip "0"::valueList
| newlist.map(a=>(animal,a._1,a._2))
| }.toDF("animal","value","previousValue")
desiredDF: org.apache.spark.sql.DataFrame = [animal: string, value: string ... 1 more field]
scala> desiredDF.show()
+------+-----+-------------+
|animal|value|previousValue|
+------+-----+-------------+
| rat| 7| 0|
| rat| 4 | 7|
| rat| 9| 4 |
| dog| 2| 0|
| dog| 4| 2|
| dog| 3| 4|
| cat| 8| 0|
| cat| 5| 8|
| cat| 6| 5|
+------+-----+-------------+