Run a Cumulative/Iterative Custom Method on a Column in Spark Scala - scala

Hi, I am new to Spark/Scala. I have been trying (and so far failing) to create a column in a Spark dataframe based on a particular recursive formula.
Here it is in pseudocode:
someDf.col2[0] = 0
for i > 0
someDf.col2[i] = x * someDf.col1[i-1] + (1-x) * someDf.col2[i-1]
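To make the formula concrete, here is the same recurrence on a plain Scala Seq (a tiny sketch; the x and col1 values are arbitrary placeholders):
val x = 0.5
val col1 = Seq(0.0, 0.0, 1.0, 1.0)
// col2(0) = 0; col2(i) = x * col1(i-1) + (1 - x) * col2(i-1)
// .init drops the extra trailing element so col2 lines up with col1
val col2 = col1.scanLeft(0.0)((prev, c1) => x * c1 + (1 - x) * prev).init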
To give more detail, here is my starting point:
This dataframe is the result of aggregations at the level of both dates and individual ids.
All further calculations have to happen with respect to that particular id, and have to take into consideration what happened in the previous week.
To illustrate this I have simplified the values to zeros and ones, removed the multipliers x and 1 - x, and initialized col2 to zero.
var someDf = Seq(("2016-01-10 00:00:00.0","385608",0,0),
("2016-01-17 00:00:00.0","385608",0,0),
("2016-01-24 00:00:00.0","385608",1,0),
("2016-01-31 00:00:00.0","385608",1,0),
("2016-02-07 00:00:00.0","385608",1,0),
("2016-02-14 00:00:00.0","385608",1,0),
("2016-01-17 00:00:00.0","105010",0,0),
("2016-01-24 00:00:00.0","105010",1,0),
("2016-01-31 00:00:00.0","105010",0,0),
("2016-02-07 00:00:00.0","105010",1,0)
).toDF("dates", "id", "col1", "col2")
someDf.show()
+--------------------+------+----+----+
| dates| id|col1|col2|
+--------------------+------+----+----+
|2016-01-10 00:00:...|385608| 0| 0|
|2016-01-17 00:00:...|385608| 0| 0|
|2016-01-24 00:00:...|385608| 1| 0|
|2016-01-31 00:00:...|385608| 1| 0|
|2016-02-07 00:00:...|385608| 1| 0|
|2016-02-14 00:00:...|385608| 1| 0|
|2016-01-17 00:00:...|105010| 0| 0|
|2016-01-24 00:00:...|105010| 1| 0|
|2016-01-31 00:00:...|105010| 0| 0|
|2016-02-07 00:00:...|105010| 1| 0|
+--------------------+------+----+----+
What I have tried so far vs. what is desired:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val date_id_window = Window.partitionBy("id").orderBy(asc("dates"))
someDf.withColumn("col2", lag($"col1", 1).over(date_id_window) +
  lag($"col2", 1).over(date_id_window)).show()
+--------------------+------+----+----+ / +--------------------+
| dates| id|col1|col2| / | what_col2_should_be|
+--------------------+------+----+----+ / +--------------------+
|2016-01-17 00:00:...|105010| 0|null| / | 0|
|2016-01-24 00:00:...|105010| 1| 0| / | 0|
|2016-01-31 00:00:...|105010| 0| 1| / | 1|
|2016-02-07 00:00:...|105010| 1| 0| / | 1|
+--------------------+------+----+----+ / +--------------------+
|2016-01-10 00:00:...|385608| 0|null| / | 0|
|2016-01-17 00:00:...|385608| 0| 0| / | 0|
|2016-01-24 00:00:...|385608| 1| 0| / | 0|
|2016-01-31 00:00:...|385608| 1| 1| / | 1|
|2016-02-07 00:00:...|385608| 1| 1| / | 2|
|2016-02-14 00:00:...|385608| 1| 1| / | 3|
+--------------------+------+----+----+ / +--------------------+
Is there a way to do this with a Spark dataframe? I have seen multiple cumulative-type computations, but never one that includes the same column. I believe the problem is that the newly computed value for row i-1 is not considered; instead the old value of row i-1 is used, which is always 0.
Any help would be appreciated.

A Dataset should work just fine:
val x = 0.1
case class Record(dates: String, id: String, col1: Int)

someDf.drop("col2").as[Record]
  .groupByKey(_.id)
  .flatMapGroups((_, records) => {
    // sort each id's rows by date, then fold the recurrence over them
    val sorted = records.toSeq.sortBy(_.dates)
    sorted.scanLeft((null: Record, 0.0)) {
      // col2 here is the previously computed value; this uses the current row's col1
      case ((_, col2), record) => (record, x * record.col1 + (1 - x) * col2)
    }.tail  // scanLeft emits the seed element first, so drop it
  })
  .select($"_1.*", $"_2".alias("col2"))
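Note that the expression above computes col2 from the current row's col1, i.e. col2[i] = x * col1[i] + (1 - x) * col2[i-1]. If you need the exact recurrence from the question, which uses the previous row's col1, a small variant of the same scanLeft works (a sketch reusing Record and x from above):
someDf.drop("col2").as[Record]
  .groupByKey(_.id)
  .flatMapGroups((_, records) => {
    val sorted = records.toSeq.sortBy(_.dates)
    sorted.scanLeft((null: Record, 0.0)) {
      case ((prev, col2), record) =>
        // the first row has no predecessor, so its previous col1 counts as 0
        val prevCol1 = if (prev == null) 0 else prev.col1
        (record, x * prevCol1 + (1 - x) * col2)
    }.tail
  })
  .select($"_1.*", $"_2".alias("col2"))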

You can use the rowsBetween API with the Window function you are already using, and you should get the desired output:
val date_id_window = Window.partitionBy("id").orderBy(asc("dates"))
someDf.withColumn("col2", sum(lag($"col1", 1).over(date_id_window)).over(date_id_window.rowsBetween(Long.MinValue, 0)))
.withColumn("col2", when($"col2".isNull, lit(0)).otherwise($"col2"))
.show()
Given the input dataframe as
+--------------------+------+----+----+
| dates| id|col1|col2|
+--------------------+------+----+----+
|2016-01-10 00:00:...|385608| 0| 0|
|2016-01-17 00:00:...|385608| 0| 0|
|2016-01-24 00:00:...|385608| 1| 0|
|2016-01-31 00:00:...|385608| 1| 0|
|2016-02-07 00:00:...|385608| 1| 0|
|2016-02-14 00:00:...|385608| 1| 0|
|2016-01-17 00:00:...|105010| 0| 0|
|2016-01-24 00:00:...|105010| 1| 0|
|2016-01-31 00:00:...|105010| 0| 0|
|2016-02-07 00:00:...|105010| 1| 0|
+--------------------+------+----+----+
You should get the following output dataframe after applying the above logic:
+--------------------+------+----+----+
| dates| id|col1|col2|
+--------------------+------+----+----+
|2016-01-17 00:00:...|105010| 0| 0|
|2016-01-24 00:00:...|105010| 1| 0|
|2016-01-31 00:00:...|105010| 0| 1|
|2016-02-07 00:00:...|105010| 1| 1|
|2016-01-10 00:00:...|385608| 0| 0|
|2016-01-17 00:00:...|385608| 0| 0|
|2016-01-24 00:00:...|385608| 1| 0|
|2016-01-31 00:00:...|385608| 1| 1|
|2016-02-07 00:00:...|385608| 1| 2|
|2016-02-14 00:00:...|385608| 1| 3|
+--------------------+------+----+----+
I hope the answer is helpful

You should apply transformations to your dataframe rather than treating it as a var. One way to get what you want is to use Window's rowsBetween to cumulatively sum the values of col1 for the rows within each window partition up through the previous row (i.e. row -1):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy("id").orderBy("dates").rowsBetween(Long.MinValue, -1)
val newDF = someDf.
  withColumn(
    "col2", sum($"col1").over(window)
  ).withColumn(
    "col2", when($"col2".isNull, 0).otherwise($"col2")
  ).orderBy("id", "dates")
newDF.show
+--------------------+------+----+----+
| dates| id|col1|col2|
+--------------------+------+----+----+
|2016-01-17 00:00:...|105010| 0| 0|
|2016-01-24 00:00:...|105010| 1| 0|
|2016-01-31 00:00:...|105010| 0| 1|
|2016-02-07 00:00:...|105010| 1| 1|
|2016-01-10 00:00:...|385608| 0| 0|
|2016-01-17 00:00:...|385608| 0| 0|
|2016-01-24 00:00:...|385608| 1| 0|
|2016-01-31 00:00:...|385608| 1| 1|
|2016-02-07 00:00:...|385608| 1| 2|
|2016-02-14 00:00:...|385608| 1| 3|
+--------------------+------+----+----+
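As a side note, the two withColumn steps can be collapsed into one using coalesce (a small variant of the same code; newDF2 is just an illustrative name):
val newDF2 = someDf
  .withColumn("col2", coalesce(sum($"col1").over(window), lit(0)))
  .orderBy("id", "dates")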

Related

Is there any generic function to find the column names in pyspark

How can I get the column names in a generic way? I want col1, col2, … instead of _1, _2, …
+---+---+---+---+---+---+---+---+---+---+---+---+
| _1| _2| _3| _4| _5| _6| _7| _8| _9|_10|_11|_12|
+---+---+---+---+---+---+---+---+---+---+---+---+
| 0| 0| 0| 1| 0| 1| 0| 0| 0| 1| 0| |
| 0| 0| 0| 1| 0| 1| 0| 0| 0| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
Assuming df is your dataframe, you can just rename the columns:
for col in df.columns:
df = df.withColumnRenamed(col, col.replace("_", "col"))
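Since the rest of this page uses Scala, here is the same idea in Scala (a sketch, assuming a DataFrame df with columns _1 .. _12):
// rename every column in one go: _1 -> col1, _2 -> col2, ...
val renamed = df.toDF(df.columns.map(_.replace("_", "col")): _*)
// or one column at a time, mirroring the loop above
val renamed2 = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumnRenamed(c, c.replace("_", "col"))
}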

Filtering on multiple columns in Spark dataframes

Suppose I have a dataframe in Spark as shown below -
val df = Seq(
(0,0,0,0.0),
(1,0,0,0.1),
(0,1,0,0.11),
(0,0,1,0.12),
(1,1,0,0.24),
(1,0,1,0.27),
(0,1,1,0.30),
(1,1,1,0.40)
).toDF("A","B","C","rate")
Here is how it looks -
scala> df.show()
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 0| 0| 0| 0.0|
| 1| 0| 0| 0.1|
| 0| 1| 0|0.11|
| 0| 0| 1|0.12|
| 1| 1| 0|0.24|
| 1| 0| 1|0.27|
| 0| 1| 1| 0.3|
| 1| 1| 1| 0.4|
+---+---+---+----+
A, B and C are the advertising channels in this case. 0 and 1 represent absence and presence of a channel, respectively. 2^3 gives the 8 combinations in the dataframe.
I want to filter records from this dataframe that show the presence of exactly 2 channels at a time (AB, AC, BC). Here is how I want my output to be -
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 1| 1| 0|0.24|
| 1| 0| 1|0.27|
| 0| 1| 1| 0.3|
+---+---+---+----+
I can write 3 statements to get the output by doing -
scala> df.filter($"A" === 1 && $"B" === 1 && $"C" === 0).show()
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 1| 1| 0|0.24|
+---+---+---+----+
scala> df.filter($"A" === 1 && $"B" === 0 && $"C" === 1).show()
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 1| 0| 1|0.27|
+---+---+---+----+
scala> df.filter($"A" === 0 && $"B" === 1 && $"C" === 1).show()
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 0| 1| 1| 0.3|
+---+---+---+----+
However, I want to achieve this using either a single statement that does my job or a function that helps me get the output.
I was thinking of using a case statement to match the values. However, in general my dataframe might consist of more than 3 channels -
scala> df.show()
+---+---+---+---+----+
| A| B| C| D|rate|
+---+---+---+---+----+
| 0| 0| 0| 0| 0.0|
| 0| 0| 0| 1| 0.1|
| 0| 0| 1| 0| 0.1|
| 0| 0| 1| 1|0.59|
| 0| 1| 0| 0| 0.1|
| 0| 1| 0| 1|0.89|
| 0| 1| 1| 0|0.39|
| 0| 1| 1| 1| 0.4|
| 1| 0| 0| 0| 0.0|
| 1| 0| 0| 1|0.99|
| 1| 0| 1| 0|0.49|
| 1| 0| 1| 1| 0.1|
| 1| 1| 0| 0|0.79|
| 1| 1| 0| 1| 0.1|
| 1| 1| 1| 0| 0.1|
| 1| 1| 1| 1| 0.1|
+---+---+---+---+----+
In this scenario I would want my output as -
scala> df.show()
+---+---+---+---+----+
| A| B| C| D|rate|
+---+---+---+---+----+
| 0| 0| 1| 1|0.59|
| 0| 1| 0| 1|0.89|
| 0| 1| 1| 0|0.39|
| 1| 0| 0| 1|0.99|
| 1| 0| 1| 0|0.49|
| 1| 1| 0| 0|0.79|
+---+---+---+---+----+
which shows rates for paired presence of channels => (AB, AC, AD, BC, BD, CD).
Kindly help.
One way could be to sum the channel columns and then keep only the rows where the result of the sum is 2 (a generalized sketch for any number of channels follows the output below).
import org.apache.spark.sql.functions._
df.withColumn("res", $"A" + $"B" + $"C").filter($"res" === lit(2)).drop("res").show
The output is:
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 1| 1| 0|0.24|
| 1| 0| 1|0.27|
| 0| 1| 1| 0.3|
+---+---+---+----+
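Since the question mentions dataframes with more than 3 channels, the same idea generalizes by summing every channel column (a sketch, assuming every column other than rate is a 0/1 channel):
import org.apache.spark.sql.functions.col
// keep rows where exactly two channels are present
val channelCols = df.columns.filter(_ != "rate").map(col)
df.filter(channelCols.reduce(_ + _) === 2).show()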

How to union 2 dataframes without creating additional rows?

I have 2 dataframes and I wanted to do .filter($"item" === "a") while keeping all the "S/N" numbers.
I tried the following but it ended up with additional rows when I use union. Is there a way to union 2 dataframes without creating additional rows?
var DF1 = Seq(
("1","a",2),
("2","a",3),
("3","b",3),
("4","b",4),
("5","a",2)).
toDF("S/N","item", "value")
var DF2 = Seq(
("1","a",2),
("2","a",3),
("3","b",3),
("4","b",4),
("5","a",2)).
toDF("S/N","item", "value")
DF2 = DF2.filter($"item" === "a")
val DF3 = DF1.withColumn("item", lit(0)).withColumn("value", lit(0))
DF1.show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 3| b| 3|
| 4| b| 4|
| 5| a| 2|
+---+----+-----+
DF2.show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 5| a| 2|
+---+----+-----+
DF3.show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| 0| 0|
| 2| 0| 0|
| 3| 0| 0|
| 4| 0| 0|
| 5| 0| 0|
+---+----+-----+
DF2.union(DF3).show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 5| a| 2|
| 1| 0| 0|
| 2| 0| 0|
| 3| 0| 0|
| 4| 0| 0|
| 5| 0| 0|
+---+----+-----+
Left outer join your S/Ns with the filtered dataframe, then use coalesce to get rid of the nulls:
val DF3 = DF1.select("S/N")
val DF4 = (DF3.join(DF2, Seq("S/N"), joinType="leftouter")
.withColumn("item", coalesce($"item", lit(0)))
.withColumn("value", coalesce($"value", lit(0))))
DF4.show
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 3| 0| 0|
| 4| 0| 0|
| 5| a| 2|
+---+----+-----+
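An equivalent variant uses na.fill instead of the two coalesce calls (a sketch assuming the same DF1 and filtered DF2 as above; DF5 is just an illustrative name, and "item" is filled with the string "0" because it is a string column):
val DF5 = DF1.select("S/N")
  .join(DF2, Seq("S/N"), "left_outer")
  .na.fill(Map("item" -> "0", "value" -> 0))
DF5.show()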

How to transform row-information into columns?

I have the following DataFrame in Spark 2.2 and Scala 2.11.8:
+--------+---------+-------+-------+----+-------+
|event_id|person_id|channel| group|num1| num2|
+--------+---------+-------+-------+----+-------+
| 560| 9410| web| G1| 0| 5|
| 290| 1430| web| G1| 0| 3|
| 470| 1370| web| G2| 0| 18|
| 290| 1430| web| G2| 0| 5|
| 290| 1430| mob| G2| 1| 2|
+--------+---------+-------+-------+----+-------+
Here is the code used to create the equivalent DataFrame (note that this snippet is in PySpark syntax):
df = sqlCtx.createDataFrame(
[(560,9410,"web","G1",0,5),
(290,1430,"web","G1",0,3),
(470,1370,"web","G2",0,18),
(290,1430,"web","G2",0,5),
(290,1430,"mob","G2",1,2)],
["event_id","person_id","channel","group","num1","num2"]
)
The column group can only have two values: G1 and G2. I need to transform these values of the column group into new columns as follows:
+--------+---------+-------+--------+-------+--------+-------+
|event_id|person_id|channel| num1_G1|num2_G1| num1_G2|num2_G2|
+--------+---------+-------+--------+-------+--------+-------+
| 560| 9410| web| 0| 5| 0| 0|
| 290| 1430| web| 0| 3| 0| 0|
| 470| 1370| web| 0| 0| 0| 18|
| 290| 1430| web| 0| 0| 0| 5|
| 290| 1430| mob| 0| 0| 1| 2|
+--------+---------+-------+--------+-------+--------+-------+
How can I do it?
AFAIK (at least I couldn't find a way to perform a PIVOT without aggregation), we must use an aggregation function when pivoting in Spark.
Scala version:
scala> df.groupBy("event_id","person_id","channel")
.pivot("group")
.agg(max("num1") as "num1", max("num2") as "num2")
.na.fill(0)
.show
+--------+---------+-------+-------+-------+-------+-------+
|event_id|person_id|channel|G1_num1|G1_num2|G2_num1|G2_num2|
+--------+---------+-------+-------+-------+-------+-------+
| 560| 9410| web| 0| 5| 0| 0|
| 290| 1430| web| 0| 3| 0| 5|
| 470| 1370| web| 0| 0| 0| 18|
| 290| 1430| mob| 0| 0| 1| 2|
+--------+---------+-------+-------+-------+-------+-------+
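The pivoted column names come out as G1_num1, G2_num2, and so on, while the question asked for num1_G1, num2_G1, etc. If the exact names matter, the pivoted columns can be renamed afterwards (a sketch; pivoted and renamed are illustrative names, and df is assumed to be the Scala equivalent of the DataFrame shown above):
import org.apache.spark.sql.functions.max
val pivoted = df.groupBy("event_id", "person_id", "channel")
  .pivot("group", Seq("G1", "G2"))
  .agg(max("num1") as "num1", max("num2") as "num2")
  .na.fill(0)
// swap prefix and suffix: G1_num1 -> num1_G1, etc.
val renamed = Seq("G1", "G2")
  .flatMap(g => Seq("num1", "num2").map(n => (s"${g}_$n", s"${n}_$g")))
  .foldLeft(pivoted) { case (acc, (from, to)) => acc.withColumnRenamed(from, to) }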

Spark: Transpose DataFrame Without Aggregating

I have looked at a number of questions online, but they don't seem to do what I'm trying to achieve.
I'm using Apache Spark 2.0.2 with Scala.
I have a dataframe:
+----------+-----+----+----+----+----+----+
|segment_id| val1|val2|val3|val4|val5|val6|
+----------+-----+----+----+----+----+----+
| 1| 100| 0| 0| 0| 0| 0|
| 2| 0| 50| 0| 0| 20| 0|
| 3| 0| 0| 0| 0| 0| 0|
| 4| 0| 0| 0| 0| 0| 0|
+----------+-----+----+----+----+----+----+
which I want to transpose to
+----+-----+----+----+----+
|vals| 1| 2| 3| 4|
+----+-----+----+----+----+
|val1| 100| 0| 0| 0|
|val2| 0| 50| 0| 0|
|val3| 0| 0| 0| 0|
|val4| 0| 0| 0| 0|
|val5| 0| 20| 0| 0|
|val6| 0| 0| 0| 0|
+----+-----+----+----+----+
I've tried using pivot() but I couldn't get to the right answer. I ended up looping through my val{x} columns, and pivoting each as per below, but this is proving to be very slow.
val d = df.select('segment_id, 'val1)
+----------+-----+
|segment_id| val1|
+----------+-----+
| 1| 100|
| 2| 0|
| 3| 0|
| 4| 0|
+----------+-----+
d.groupBy('val1).sum().withColumnRenamed('val1', 'vals')
+----+-----+----+----+----+
|vals| 1| 2| 3| 4|
+----+-----+----+----+----+
|val1| 100| 0| 0| 0|
+----+-----+----+----+----+
Then I use union() on each iteration of val{x} to append the result to my first dataframe.
+----+-----+----+----+----+
|vals| 1| 2| 3| 4|
+----+-----+----+----+----+
|val2| 0| 50| 0| 0|
+----+-----+----+----+----+
Is there a more efficient way to do a transpose where I do not want to aggregate the data?
Thanks :)
Unfortunately there is no case in which both of the following hold:
using a Spark DataFrame is justified by the amount of data, and
transposing that data is feasible.
You have to remember that DataFrame, as implemented in Spark, is a distributed collection of rows and each row is stored and processed on a single node.
You could express transposition on a DataFrame as pivot:
import org.apache.spark.sql.functions._

// pack every value column into a (column-name, value) struct and explode them
val kv = explode(array(df.columns.tail.map {
  c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))

df
  .withColumn("kv", kv)
  .select($"segment_id", $"kv.k", $"kv.v")
  .groupBy($"k")
  .pivot("segment_id")
  .agg(first($"v"))
  .orderBy($"k")
  .withColumnRenamed("k", "vals")
but it is merely toy code with no practical application. In practice it is no better than collecting the data:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val (header, data) = df.collect.map(_.toSeq.toArray).transpose match {
  case Array(h, t @ _*) =>
    (h.map(_.toString), t.map(_.collect { case x: Int => x }))
}

val rows = df.columns.tail.zip(data).map { case (x, ys) => Row.fromSeq(x +: ys) }

val schema = StructType(
  StructField("vals", StringType) +: header.map(StructField(_, IntegerType))
)

spark.createDataFrame(sc.parallelize(rows), schema)
For a DataFrame defined as:
val df = Seq(
(1, 100, 0, 0, 0, 0, 0),
(2, 0, 50, 0, 0, 20, 0),
(3, 0, 0, 0, 0, 0, 0),
(4, 0, 0, 0, 0, 0, 0)
).toDF("segment_id", "val1", "val2", "val3", "val4", "val5", "val6")
both would give you the desired result:
+----+---+---+---+---+
|vals| 1| 2| 3| 4|
+----+---+---+---+---+
|val1|100| 0| 0| 0|
|val2| 0| 50| 0| 0|
|val3| 0| 0| 0| 0|
|val4| 0| 0| 0| 0|
|val5| 0| 20| 0| 0|
|val6| 0| 0| 0| 0|
+----+---+---+---+---+
That being said, if you need efficient transposition of a distributed data structure, you'll have to look somewhere else. There are a number of structures, including CoordinateMatrix and BlockMatrix, which can distribute data across both dimensions and can be transposed.
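For example, a minimal sketch of a distributed transpose with MLlib's CoordinateMatrix (the entries below are made-up placeholders, not the question's data):
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
val entries = sc.parallelize(Seq(
  MatrixEntry(0, 0, 100.0),  // row 0, column 0
  MatrixEntry(1, 1, 50.0),   // row 1, column 1
  MatrixEntry(1, 4, 20.0)    // row 1, column 4
))
val mat = new CoordinateMatrix(entries)
val transposed = mat.transpose()  // swaps row and column indices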
In Python, this can be done in a simple way.
I normally use the transpose function from pandas after converting the Spark DataFrame:
spark_df.toPandas().T
Here is the pandas-on-Spark equivalent for PySpark:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html
Here is the solution code for your problem:
Step 1: Choose the columns
d = df.select('val1','val2','val3','val4','val5','val6','segment_id')
This part produces a data frame like this:
+----+----+----+----+----+----+----------+
|val1|val2|val3|val4|val5|val6|segment_id|
+----+----+----+----+----+----+----------+
| 100|   0|   0|   0|   0|   0|         1|
|   0|  50|   0|   0|  20|   0|         2|
|   0|   0|   0|   0|   0|   0|         3|
|   0|   0|   0|   0|   0|   0|         4|
+----+----+----+----+----+----+----------+
Step 2: Transpose the whole table. Note that .T is a pandas / pandas-on-Spark operation, so at this point d needs to be a pandas or pandas-on-Spark DataFrame (it is not available on a plain Spark DataFrame).
d_transposed = d.T.sort_index()
This produces a data frame like this:
+----------+----+----+----+----+
|segment_id|   1|   2|   3|   4|
+----------+----+----+----+----+
|      val1| 100|   0|   0|   0|
|      val2|   0|  50|   0|   0|
|      val3|   0|   0|   0|   0|
|      val4|   0|   0|   0|   0|
|      val5|   0|  20|   0|   0|
|      val6|   0|   0|   0|   0|
+----------+----+----+----+----+
Step 3: You need to rename the segment_id to vals:
d_transposed.withColumnRenamed("segment_id","vals")
+----+----+----+----+----+
|vals|   1|   2|   3|   4|
+----+----+----+----+----+
|val1| 100|   0|   0|   0|
|val2|   0|  50|   0|   0|
|val3|   0|   0|   0|   0|
|val4|   0|   0|   0|   0|
|val5|   0|  20|   0|   0|
|val6|   0|   0|   0|   0|
+----+----+----+----+----+
Here is your full code:
d = df.select('val1','val2','val3','val4','val5','val6','segment_id')
d_transposed = d.T.sort_index()
d_transposed.withColumnRenamed("segment_id","vals")
This should be a perfect solution.
Another option is to melt the dataframe into (metric, vals, value) triples and then pivot the segment ids back into columns:
val seq = Seq((1,100,0,0,0,0,0),(2,0,50,0,0,20,0),(3,0,0,0,0,0,0),(4,0,0,0,0,0,0))
val df1 = seq.toDF("segment_id", "val1", "val2", "val3", "val4", "val5", "val6")
df1.show()

val schema = df1.schema

// one (segment_id, column name, value) triple per cell
val df2 = df1.flatMap(row => {
  val metric = row.getInt(0)
  (1 until row.size).map(i => {
    (metric, schema(i).name, row.getInt(i))
  })
})
val df3 = df2.toDF("metric", "vals", "value")
df3.show()

import org.apache.spark.sql.functions._

// pivot the segment ids back into columns
val df4 = df3.groupBy("vals").pivot("metric").agg(first("value"))
df4.show()
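Since groupBy/pivot does not guarantee row ordering, you may want to sort the final result so the rows come out as val1 .. val6:
df4.orderBy("vals").show()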