How to declare hundreds of features in Spark using Scala

I have a very big table in the following structure:
user, product, action
user1, productA, actionA
user1, productA, actionB
user1, productA, actionB
user2, productF, actionA
user3, productZ, actionC
I would like to transpose it to the following:
Stage1: retrieve specific products X actions
user, productA_actionA, productB_actionA, …, productA_actionB, productB_actionB…
user1, 1, 0, ..., 0,0, ...
user1, 0, 0, ..., 1,0, ...
user1, 0, 0, ..., 1,0, ...
user2, 0, 0, ..., 0,0, ...
I have the array that contains the specific combinations:
[(productA,actionA) ,(productB,actionA) ,… ,(productA,actionB) ,(productB,actionB) …]
Stage2: group by user and sum the product/action columns
user, productA_actionA, productB_actionA, …, productA_actionB, productB_actionB…
user1, 1, 0, ..., **2**,0, ...
user2, 0, 0, ..., 0,0, ...
I tried using the withColumn function for each feature but this takes forever:
for ( (productID,productAction) <- productsCombination ) {
newTable = newTable.withColumn("Product_"+productID+"_"+productAction, when(col("product_action_id") === productAction and col("product_id") === productID, $"product_count").otherwise(0))
}
Here's an example that shows what I want to do:
Any advice?

I couldn't fully understand the question, so this answer is based on the output shown in your screenshot.
As T. Gawęda said, you should use pivot. Note that pivot is only available in Spark 1.6+.
Considering this is your source DataFrame
scala> df.show()
+-----+-------+---------------+
| User|Product| Action|
+-----+-------+---------------+
|user1| A| Viewed|
|user1| A| Viewed|
|user1| A| Viewed|
|user1| C| AddToCart|
|user1| A|RemovedFromCart|
|user2| B| Viewed|
|user2| B| Viewed|
|user3| A| Viewed|
|user3| A| AddToCart|
|user4| B| AddToCart|
|user5| A| Viewed|
+-----+-------+---------------+
Since you need to pivot on two columns, you can first combine them into a single column with the concat_ws function. Then group by User, pivot on the combined column, and use count on Product as the aggregate function.
df.withColumn("combined", concat_ws("_", $"Product", $"Action"))
.groupBy("User")
.pivot("combined")
.agg(count($"Product")).show()
+-----+-----------+-----------------+--------+-----------+--------+-----------+
| User|A_AddToCart|A_RemovedFromCart|A_Viewed|B_AddToCart|B_Viewed|C_AddToCart|
+-----+-----------+-----------------+--------+-----------+--------+-----------+
|user1| 0| 1| 3| 0| 0| 1|
|user2| 0| 0| 0| 0| 2| 0|
|user3| 1| 0| 1| 0| 0| 0|
|user4| 0| 0| 0| 1| 0| 0|
|user5| 0| 0| 1| 0| 0| 0|
+-----+-----------+-----------------+--------+-----------+--------+-----------+
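The question mentions an array of specific product/action combinations. A minimal sketch, assuming that list is available up front: pivot accepts the wanted values as a second argument, so only those columns are produced and Spark does not need to scan for distinct values first. The local session, the cut-down sample data, and the name `wanted` are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{concat_ws, count}

// Assumption: a local session for illustration (spark-shell already provides `spark`).
val spark = SparkSession.builder().master("local[*]").appName("pivot-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("user1", "A", "Viewed"),
  ("user1", "A", "Viewed"),
  ("user1", "C", "AddToCart"),
  ("user2", "B", "Viewed")
).toDF("User", "Product", "Action")

// Hypothetical list of the combinations of interest, joined the same way as the pivot key.
val wanted = Seq(("A", "Viewed"), ("B", "Viewed")).map { case (p, a) => s"${p}_$a" }

val result = df
  .withColumn("combined", concat_ws("_", $"Product", $"Action"))
  .groupBy("User")
  .pivot("combined", wanted) // only the listed values become columns
  .agg(count($"Product"))
  .na.fill(0)                // combinations with no rows may come back as null

result.show()
```

Rows whose combination is not in the list (here C_AddToCart) are simply ignored by the aggregation, while their users still appear in the result.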

Related

How to apply conditional counts (with reset) to grouped data in PySpark?

I have PySpark code that effectively groups up rows numerically, and increments when a certain condition is met. I'm having trouble figuring out how to transform this code, efficiently, into one that can be applied to groups.
Take this sample dataframe df
df = sqlContext.createDataFrame(
[
(33, [], '2017-01-01'),
(33, ['apple', 'orange'], '2017-01-02'),
(33, [], '2017-01-03'),
(33, ['banana'], '2017-01-04')
],
('ID', 'X', 'date')
)
This code achieves what I want for this sample df, which is to order by date and to create groups ('grp') that increment when the size column goes back to 0.
df \
.withColumn('size', size(col('X'))) \
.withColumn(
"grp",
sum((col('size') == 0).cast("int")).over(Window.orderBy('date'))
).show()
This is partly based on Pyspark - Cumulative sum with reset condition
Now what I am trying to do is apply the same approach to a dataframe that has multiple IDs - achieving a result that looks like
df2 = sqlContext.createDataFrame(
[
(33, [], '2017-01-01', 0, 1),
(33, ['apple', 'orange'], '2017-01-02', 2, 1),
(33, [], '2017-01-03', 0, 2),
(33, ['banana'], '2017-01-04', 1, 2),
(55, ['coffee'], '2017-01-01', 1, 1),
(55, [], '2017-01-03', 0, 2)
],
('ID', 'X', 'date', 'size', 'group')
)
edit for clarity
1) For the first date of each ID - the group should be 1 - regardless of what shows up in any other column.
2) However, for each subsequent date, I need to check the size column. If the size column is 0, then I increment the group number. If it is any non-zero, positive integer, then I continue the previous group number.
I've seen a few ways to handle this in pandas, but I'm having difficulty understanding how they apply in PySpark and how grouped data differs between pandas and Spark (e.g. do I need to use something called UDAFs?)

Create a column zero_or_first by checking whether the size is zero or the row is the first row. Then sum.
from pyspark.sql import Window
from pyspark.sql import functions as F
df2 = sqlContext.createDataFrame(
[
(33, [], '2017-01-01', 0, 1),
(33, ['apple', 'orange'], '2017-01-02', 2, 1),
(33, [], '2017-01-03', 0, 2),
(33, ['banana'], '2017-01-04', 1, 2),
(55, ['coffee'], '2017-01-01', 1, 1),
(55, [], '2017-01-03', 0, 2),
(55, ['banana'], '2017-01-01', 1, 1)
],
('ID', 'X', 'date', 'size', 'group')
)
w = Window.partitionBy('ID').orderBy('date')
df2 = df2.withColumn('row', F.row_number().over(w))
df2 = df2.withColumn('zero_or_first', F.when((F.col('size')==0)|(F.col('row')==1), 1).otherwise(0))
df2 = df2.withColumn('grp', F.sum('zero_or_first').over(w))
df2.orderBy('ID').show()
Here is the output. The computed column grp matches group, which holds the expected result.
+---+---------------+----------+----+-----+---+-------------+---+
| ID| X| date|size|group|row|zero_or_first|grp|
+---+---------------+----------+----+-----+---+-------------+---+
| 33| []|2017-01-01| 0| 1| 1| 1| 1|
| 33| [banana]|2017-01-04| 1| 2| 4| 0| 2|
| 33|[apple, orange]|2017-01-02| 2| 1| 2| 0| 1|
| 33| []|2017-01-03| 0| 2| 3| 1| 2|
| 55| [coffee]|2017-01-01| 1| 1| 1| 1| 1|
| 55| [banana]|2017-01-01| 1| 1| 2| 0| 1|
| 55| []|2017-01-03| 0| 2| 3| 1| 2|
+---+---------------+----------+----+-----+---+-------------+---+

I added a window function, and created an index within each ID. Then I expanded the conditional statement to also reference that index. The following seems to produce my desired output dataframe - but I am interested in knowing if there is a more efficient way to do this.
window = Window.partitionBy('ID').orderBy('date')
df \
.withColumn('size', size(col('X'))) \
.withColumn('index', rank().over(window).alias('index')) \
.withColumn(
"grp",
sum(((col('size') == 0) | (col('index') == 1)).cast("int")).over(window)
).show()
which yields
+---+---------------+----------+----+-----+---+
| ID| X| date|size|index|grp|
+---+---------------+----------+----+-----+---+
| 33| []|2017-01-01| 0| 1| 1|
| 33|[apple, orange]|2017-01-02| 2| 2| 1|
| 33| []|2017-01-03| 0| 3| 2|
| 33| [banana]|2017-01-04| 1| 4| 2|
| 55| [coffee]|2017-01-01| 1| 1| 1|
| 55| []|2017-01-03| 0| 2| 2|
+---+---------------+----------+----+-----+---+

Rank per row over multiple columns in Spark Dataframe

I am using Spark with Scala to transform a DataFrame, and I would like to add new columns that give the rank of each column's value within its row.
Example -
Input DF-
+---+---+---+
|c_0|c_1|c_2|
+---+---+---+
| 11| 11| 35|
| 22| 12| 66|
| 44| 22| 12|
+---+---+---+
Expected DF-
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 11| 11| 35| 2| 3| 1|
| 22| 12| 66| 2| 3| 1|
| 44| 22| 12| 1| 2| 3|
+---+---+---+--------+--------+--------+
This has already been answered for R (Rank per row over multiple columns in R),
but I need to do the same in Spark SQL using Scala. Thanks for the help!
Edit (4/1): I encountered a scenario where equal values should still get different ranks. I edited the first row to replicate the situation.

If I understand correctly, you want to have the rank of each column, within each row.
Let's first define the data, and the columns to "rank".
val df = Seq((11, 21, 35),(22, 12, 66),(44, 22 , 12))
.toDF("c_0", "c_1", "c_2")
val cols = df.columns
Then we define a UDF that finds the index of an element in an array.
val pos = udf((a : Seq[Int], elt : Int) => a.indexOf(elt)+1)
We finally create a sorted array (in descending order) and use the UDF to find the rank of each column.
val ranks = cols.map(c => pos(col("array"), col(c)).as(c+"_rank"))
df.withColumn("array", sort_array(array(cols.map(col) : _*), false))
.select((cols.map(col)++ranks) :_*).show
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 11| 21| 35| 3| 2| 1|
| 22| 12| 66| 2| 3| 1|
| 44| 22| 12| 1| 2| 3|
+---+---+---+--------+--------+--------+
EDIT:
As of Spark 2.4, the pos UDF that I defined can be replaced by the built-in function array_position(column: Column, value: Any), which works exactly the same way (the first index is 1). This avoids a UDF, which can be slightly less efficient.
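A hedged sketch of the array_position variant, assuming Spark 2.4 or later. It calls the function through a SQL expression string so that the value argument can be another column; the local session and the helper column name `sorted_vals` are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, expr, sort_array}

// Assumption: a local session for illustration (spark-shell already provides `spark`).
val spark = SparkSession.builder().master("local[*]").appName("rank-sketch").getOrCreate()
import spark.implicits._

val df = Seq((11, 21, 35), (22, 12, 66), (44, 22, 12)).toDF("c_0", "c_1", "c_2")
val cols = df.columns

// array_position is 1-based and returns 0 when the value is not found.
val ranks = cols.map(c => expr(s"array_position(sorted_vals, $c)").as(c + "_rank"))

val ranked = df
  .withColumn("sorted_vals", sort_array(array(cols.map(col): _*), asc = false))
  .select(cols.map(col) ++ ranks: _*)

ranked.show()
```

Like the UDF version, this returns the position of the first match, so equal values within a row share the same rank.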
EDIT2:
The code above will generate duplicate ranks if a row contains duplicate values. If you want to avoid that, you can create the array, zip it with its indices to remember which column is which, sort it, and zip it again to get the final rank. It would look like this:
// Column index -> column name
val colMap = df.columns.zipWithIndex.map(_.swap).toMap
// Maps each column index to its 1-based rank; sortBy is stable, so ties keep column order.
val zip = udf((s: Seq[Int]) => s
.zipWithIndex
.sortBy(-_._1)
.map(_._2)
.zipWithIndex
.map { case (colIdx, pos) => colIdx -> (pos + 1) }
.toMap)
val ranks = (0 until cols.size)
.map(i => col("zip").getItem(i).as(colMap(i) + "_rank"))
val result = df
.withColumn("zip", zip(array(cols.map(col) : _*)))
.select(cols.map(col) ++ ranks :_*)
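The zip, sort, zip logic can be checked without Spark. A minimal sketch in plain Scala; the function name `rowRanks` is hypothetical, and it uses a reversed ordering rather than negating the value, which is a small substitution to avoid overflow on Int.MinValue.

```scala
// Descending rank of each position in a row. Equal values get distinct ranks,
// and the leftmost column wins the tie because sortBy is a stable sort.
def rowRanks(values: Seq[Int]): Seq[Int] = {
  // Original column indices, ordered by descending value.
  val order = values.zipWithIndex.sortBy(_._1)(Ordering[Int].reverse).map(_._2)
  // Column index -> 1-based rank.
  val rankByIndex = order.zipWithIndex.map { case (colIdx, pos) => colIdx -> (pos + 1) }.toMap
  values.indices.map(rankByIndex)
}

// First row of the question after the edit: the two 11s get ranks 2 and 3.
assert(rowRanks(Seq(11, 11, 35)) == Seq(2, 3, 1))
assert(rowRanks(Seq(44, 22, 12)) == Seq(1, 2, 3))
```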

One way to go about this would be to use windows.
val df = Seq((11, 21, 35),(22, 12, 66),(44, 22 , 12))
.toDF("c_0", "c_1", "c_2")
(0 to 2)
.map("c_"+_)
.foldLeft(df)((d, column) =>
d.withColumn(column+"_rank", rank() over Window.orderBy(desc(column))))
.show
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 22| 12| 66| 2| 3| 1|
| 11| 21| 35| 3| 2| 2|
| 44| 22| 12| 1| 1| 3|
+---+---+---+--------+--------+--------+
But this is not a good idea. All the data will end up in one partition which will cause an OOM error if all the data does not fit inside one executor.
Another way would require sorting the dataframe three times, but at least it would scale to any data size.
Let's define a function that zips a dataframe with consecutive indices (it exists for RDDs but not for dataframes):
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField}
def zipWithIndex(df : DataFrame, name : String) : DataFrame = {
val rdd = df.rdd.zipWithIndex
.map{ case (row, i) => Row.fromSeq(row.toSeq :+ (i+1)) }
val newSchema = df.schema.add(StructField(name, LongType, false))
df.sparkSession.createDataFrame(rdd, newSchema)
}
And let's use it on the same dataframe df:
(0 to 2)
.map("c_"+_)
.foldLeft(df)((d, column) =>
zipWithIndex(d.orderBy(desc(column)), column+"_rank"))
.show
which provides the exact same result as above.

You could use a window function. Note that this is susceptible to OOM errors if you have too much data, but I wanted to introduce the concept of window functions here.
inputDF.createOrReplaceTempView("my_df")
val expectedDF = spark.sql("""
select
c_0
, c_1
, c_2
, rank(c_0) over (order by c_0 desc) c_0_rank
, rank(c_1) over (order by c_1 desc) c_1_rank
, rank(c_2) over (order by c_2 desc) c_2_rank
from my_df""")
expectedDF.show()
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 44| 22| 12| 3| 3| 1|
| 11| 21| 35| 1| 2| 2|
| 22| 12| 66| 2| 1| 3|
+---+---+---+--------+--------+--------+

Filtering dataframe using hashmap

I have a hashmap in which I stored the values
Map(862304021470656 -> List(0.0, 0.0, 0.0, 0.0, 1.540980096E9, 74.365111, 22.302669, 0.0),866561010400483 -> List(0.0, 1.0, 1.0, 2.0, 1.543622306E9, 78.0204, 10.005262, 56.0))
This is the dataframe
| id| lt| ln| evt| lstevt| s| d|agl|chg| d1| d2| d3| d4|ebt|ibt|port| a1| a2| a3| a4|nos|dfrmd|
+---------------+---------+---------+----------+----------+---+---+---+---+---+---+---+---+---+---+----+---+---+---+---+---+-----+
|862304021470656|25.284158|82.435973|1540980095|1540980095| 0| 39|298| 0| 0| 1| 1| 2| 0| 5| 97| 12| -1| -1| 22| 0| 0|
|862304021470656|25.284158|82.435973|1540980105|1540980105| 0| 0|298| 0| 0| 1| 1| 2| 0| 5| 97| 12| -1| -1| 22| 0| 0|
|862304021470656|25.284724|82.434222|1540980155|1540980155| 14| 47|289| 0| 0| 1| 1| 2| 0| 5| 97| 11| -1| -1| 22| 0| 0|
|866561010400483|25.284858|82.433831|1544980165|1540980165| 12| 42|295| 0| 0| 1| 1| 2| 0| 5| 97| 12| -1| -1| 22| 0| 0|
I want to filter the dataframe by comparing each row's evt value with the element at index 4 of the list stored in the map for that row's id (the map key is the id column of the dataframe), keeping only the rows whose evt value is greater than that element.

Here's one way using a UDF to fetch the evt value for comparison:
import org.apache.spark.sql.functions._
val df = Seq(
(862304021470656L, 25.284158, 82.435973, 1540980095),
(862304021470656L, 25.284158, 82.435973, 1540980105),
(862304021470656L, 25.284724, 82.434222, 1540980155),
(866561010400483L, 25.284858, 82.433831, 1544980165)
).toDF("id", "lt", "ln", "evt")
val listMap = Map(
862304021470656L -> List(0.0, 0.0, 0.0, 0.0, 1.540980096E9, 74.365111, 22.302669, 0.0),
866561010400483L -> List(0.0, 1.0, 1.0, 2.0, 1.543622306E9, 78.0204, 10.005262, 56.0)
)
def evtLimit(m: Map[Long, List[Double]], evtIdx: Int) = udf(
(id: Long) => m.get(id) match {
case Some(ls) => if (evtIdx < ls.size) ls(evtIdx) else Double.MaxValue
case None => Double.MaxValue
}
)
df.where($"evt" > evtLimit(listMap, 4)($"id")).show
// +---------------+---------+---------+----------+
// | id| lt| ln| evt|
// +---------------+---------+---------+----------+
// |862304021470656|25.284158|82.435973|1540980105|
// |862304021470656|25.284724|82.434222|1540980155|
// |866561010400483|25.284858|82.433831|1544980165|
// +---------------+---------+---------+----------+
Note that the UDF returns Double.MaxValue when the key is missing from the Map or the list is too short for the requested index, so such rows are filtered out. That can be revised to suit specific business requirements.

You can do this with a simple join:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = ... //your main Dataframe
// A Map has no toDF, so convert it to a Seq of (id, list) pairs first
val map = Map(..your data here..).toSeq.toDF("id", "list")
// size (not length) gives the number of elements of an array column
val join = df.join(map, "id").filter(size($"list") >= 5 /* <-- just in case */)
val res = join.filter($"evt" > $"list"(4))
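A fuller sketch of the join approach with a cut-down version of the sample data. The local session, the name `limits`, and the broadcast hint (reasonable only when the map is small) are assumptions added for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, size}

// Assumption: a local session for illustration (spark-shell already provides `spark`).
val spark = SparkSession.builder().master("local[*]").appName("map-filter-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (862304021470656L, 1540980095L),
  (862304021470656L, 1540980105L),
  (866561010400483L, 1544980165L)
).toDF("id", "evt")

// The lookup map turned into a two-column DataFrame (id, list).
val limits = Map(
  862304021470656L -> Seq(0.0, 0.0, 0.0, 0.0, 1.540980096E9),
  866561010400483L -> Seq(0.0, 1.0, 1.0, 2.0, 1.543622306E9)
).toSeq.toDF("id", "list")

val res = df
  .join(broadcast(limits), "id")                              // inner join drops ids missing from the map
  .filter(size($"list") > 4 && $"evt" > $"list".getItem(4))   // compare evt with the element at index 4
  .drop("list")

res.show()
```

Unlike the UDF version, ids that are absent from the map are dropped by the inner join rather than by a sentinel value.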

Understanding pivot and agg

I have the following columns in DataFrame df:
c_id p_id type values
278230 57371100 11 1
278230 57371100 12 1
...
I execute the following code and expect to see columns 11_total and 12_total:
df
.groupBy($"c_id",$"p_id")
.pivot("type")
.agg(sum("values") as "total")
.na.fill(0)
.show()
Instead, I get columns 11 and 12:
+-----------+----------+---+---+
| c_id| p_id| 11| 12|
+-----------+----------+---+---+
| 278230| 57371100| 0| 1|
| 337790| 72031970| 3| 0|
| 320710| 71904400| 0| 1|
Why?

That's because Spark appends the aggregation alias to the pivot column names only when there are multiple aggregations, where it is needed to tell the columns apart:
val df = Seq(
(278230, 57371100, 11, 1),
(278230, 57371100, 12, 2),
(337790, 72031970, 11, 1),
(337790, 72031970, 11, 2),
(337790, 72031970, 12, 3)
).toDF("c_id", "p_id", "type", "values")
df.groupBy($"c_id", $"p_id").pivot("type").
agg(sum("values").as("total")).
show
// +------+--------+---+---+
// | c_id| p_id| 11| 12|
// +------+--------+---+---+
// |337790|72031970| 3| 3|
// |278230|57371100| 1| 2|
// +------+--------+---+---+
df.groupBy($"c_id", $"p_id").pivot("type").
agg(sum("values").as("total"), max("values").as("max")).
show
// +------+--------+--------+------+--------+------+
// | c_id| p_id|11_total|11_max|12_total|12_max|
// +------+--------+--------+------+--------+------+
// |337790|72031970| 3| 2| 3| 3|
// |278230|57371100| 1| 1| 2| 2|
// +------+--------+--------+------+--------+------+
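If the suffix is wanted even with a single aggregation, one option is to rename the pivoted columns afterwards. A minimal sketch, assuming the grouping keys are known; the local session and the names `keys` and `renamed` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

// Assumption: a local session for illustration (spark-shell already provides `spark`).
val spark = SparkSession.builder().master("local[*]").appName("pivot-rename-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (278230, 57371100, 11, 1),
  (278230, 57371100, 12, 2),
  (337790, 72031970, 11, 1)
).toDF("c_id", "p_id", "type", "values")

val keys = Seq("c_id", "p_id")
val pivoted = df.groupBy(keys.map(col): _*).pivot("type").agg(sum("values"))

// Append the suffix to every column that is not a grouping key.
val renamed = pivoted.columns.foldLeft(pivoted) { (d, name) =>
  if (keys.contains(name)) d else d.withColumnRenamed(name, s"${name}_total")
}

renamed.printSchema()
```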

Find and replace not working - dataframe spark scala

I have the following dataframe:
df.show
+----------+-----+
| createdon|count|
+----------+-----+
|2017-06-28| 1|
|2017-06-17| 2|
|2017-05-20| 1|
|2017-06-23| 2|
|2017-06-16| 3|
|2017-06-30| 1|
I want to replace the count value with 0 wherever it is greater than 1, i.e., the resulting dataframe should be:
+----------+-----+
| createdon|count|
+----------+-----+
|2017-06-28| 1|
|2017-06-17| 0|
|2017-05-20| 1|
|2017-06-23| 0|
|2017-06-16| 0|
|2017-06-30| 1|
I tried the following expression:
df.withColumn("count", when(($"count" > 1), 0)).show
but the output was
+----------+--------+
| createdon| count|
+----------+--------+
|2017-06-28| null|
|2017-06-17| 0|
|2017-05-20| null|
|2017-06-23| 0|
|2017-06-16| 0|
|2017-06-30| null|
I don't understand why null is displayed for the value 1, or how to fix it. Can anyone help me?

You need to chain otherwise after when to specify the value for rows where the condition doesn't hold; without it, those rows get null. In your case, that value is the count column itself:
df.withColumn("count", when(($"count" > 1), 0).otherwise($"count"))

This can be done using a udf function too:
def replaceWithZero = udf((col: Int) => if(col > 1) 0 else col) //udf function
df.withColumn("count", replaceWithZero($"count")).show(false) //calling udf function
Note: a udf should be chosen only when no built-in function does the job, as it requires serialization and deserialization of column data.