Comparing previous data with current data in Spark Scala

I want to compare previous data with current data month by month. I have data like below.
Data-set 1 (Prev)            Data-set 2 (Latest)
Year-Month   Sum-count       Year-Month   Sum-count
--           --              201808       48
201807       30              201807       22
201806       20              201806       20
201805       35              201805       20
201804       12              201804       9
201803       15              --           --
I have data sets as shown above. I want to compare both data sets based on the Year-Month and Sum-count columns and find the difference as a percentage.
I am using Spark 2.3.0 and Scala 2.11.
Here is my code:
import org.apache.spark.sql.functions.lag

val mdf = spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("delimiter", ",")
  .option("charset", "utf-8")
  .load("c:\\test.csv")

mdf.createOrReplaceTempView("test")

val res = spark.sql("select year-month, SUM(Sum-count) as SUM_AMT from test d group by year-month")

val win = org.apache.spark.sql.expressions.Window.orderBy("data_ym")

val res1 = res.withColumn("Prev_month", lag("SUM_AMT", 1, 0).over(win))
  .withColumn("percentage", col("Prev_month") / sum("SUM_AMT").over())
  .show()
I need output like this: if the percentage is more than 10%, then I need to set the flag to F.
set1     cnt   set2     cnt   output(Percentage)   Flag
201807   30    201807   22    7%                   T
201806   20    201806   20    0%                   T
201805   35    201805   20    57%                  F
Please help me on this.
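A side note before the answers, offered as an assumption since the actual error message isn't shown: column names that contain a hyphen have to be backtick-quoted in Spark SQL, otherwise year-month is parsed as a subtraction of two columns. A minimal sketch of the quoted query (assuming the CSV headers are Year-Month and Sum-count as in the sample data):
// hyphenated identifiers must be wrapped in backticks inside the SQL string
val res = spark.sql(
  "select `Year-Month`, SUM(`Sum-count`) as SUM_AMT from test group by `Year-Month`")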

This can be done in the following way:
import spark.implicits._
import org.apache.spark.sql.functions.{abs, when}

val data1 = List(
  ("201807", 30),
  ("201806", 20),
  ("201805", 35),
  ("201804", 12),
  ("201803", 15)
)
val data2 = List(
  ("201808", 48),
  ("201807", 22),
  ("201806", 20),
  ("201805", 20),
  ("201804", 9)
)

val df1 = data1.toDF("Year-month", "Sum-count")
val df2 = data2.toDF("Year-month", "Sum-count")

val joined = df1.alias("df1").join(df2.alias("df2"), "Year-month")

joined
  .withColumn("output(Percentage)", abs($"df1.Sum-count" - $"df2.Sum-count").divide($"df1.Sum-count"))
  .withColumn("Flag", when($"output(Percentage)" > 0.1, "F").otherwise("T"))
  .show(false)
Output:
+----------+---------+---------+-------------------+----+
|Year-month|Sum-count|Sum-count|output(Percentage) |Flag|
+----------+---------+---------+-------------------+----+
|201807 |30 |22 |0.26666666666666666|F |
|201806 |20 |20 |0.0 |T |
|201805 |35 |20 |0.42857142857142855|F |
|201804 |12 |9 |0.25 |F |
+----------+---------+---------+-------------------+----+
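A possible follow-up sketch, not part of the answer above: the inner join drops 201808 and 201803, which each appear in only one dataset. A full outer join keeps them; how a missing previous month should count towards the percentage is an assumption and is marked as such below.
import org.apache.spark.sql.functions.{abs, coalesce, lit, when}

val fullJoined = df1.alias("df1").join(df2.alias("df2"), Seq("Year-month"), "full_outer")
  .select(
    $"Year-month",
    coalesce($"df1.Sum-count", lit(0)).alias("prev_count"),  // 0 when the month is missing from data-set 1
    coalesce($"df2.Sum-count", lit(0)).alias("curr_count"))  // 0 when the month is missing from data-set 2

fullJoined
  .withColumn("output(Percentage)",
    when($"prev_count" === 0, lit(1.0))  // assumption: treat a missing previous month as a 100% change
      .otherwise(abs($"prev_count" - $"curr_count") / $"prev_count"))
  .withColumn("Flag", when($"output(Percentage)" > 0.1, "F").otherwise("T"))
  .show(false)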

Here's my solution:
import spark.implicits._
import org.apache.spark.sql.functions._

val values1 = List(
  List("1201807", "30"),
  List("1201806", "20"),
  List("1201805", "35"),
  List("1201804", "12"),
  List("1201803", "15")
).map(x => (x(0), x(1)))

val values2 = List(
  List("201808", "48"),
  List("1201807", "22"),
  List("1201806", "20"),
  List("1201805", "20"),
  List("1201804", "9")
).map(x => (x(0), x(1)))

val df1 = values1.toDF
val df2 = values2.toDF

df1.join(df2, Seq("_1"), "full").toDF("set", "cnt1", "cnt2")
  .withColumn("percentage1", col("cnt1") / sum("cnt1").over() * 100)
  .withColumn("percentage2", col("cnt2") / sum("cnt2").over() * 100)
  .withColumn("percentage", abs(col("percentage2") - col("percentage1")))
  .withColumn("flag", when(col("percentage") > 10, "F").otherwise("T"))
  .na.drop()
  .show()
Here's the result :
+-------+----+----+------------------+------------------+------------------+----+
| set|cnt1|cnt2| percentage1| percentage2| percentage|flag|
+-------+----+----+------------------+------------------+------------------+----+
|1201804| 12| 9|10.714285714285714| 7.563025210084033| 3.15126050420168| T|
|1201807| 30| 22|26.785714285714285|18.487394957983195| 8.29831932773109| T|
|1201806| 20| 20|17.857142857142858| 16.80672268907563|1.0504201680672267| T|
|1201805| 35| 20| 31.25| 16.80672268907563|14.443277310924369| F|
+-------+----+----+------------------+------------------+------------------+----+
I hope it helps :)

Related

Dividing a set of columns by its average in Pyspark

I have to divide a set of columns in a pyspark.sql.dataframe by their respective column averages, but I am not able to find a correct way to do it. Below are sample data and my present code.
Input Data
columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
("2", "45", "55", "tracey smith"),
("3", "33", "23", "amy sanders")]
df = spark.createDataFrame(data=data,schema=columns)
Col1  Col2  Col3  Name
1     40    56    john jones
2     45    55    tracey smith
3     33    23    amy sanders
Expected Output
Col1  Col2  Col3  Name
0.5   1.02  1.25  john jones
1     1.14  1.23  tracey smith
1.5   0.84  0.51  amy sanders
My function as of now (not working):
# function to divide a few columns by the column average and overwrite the column
def avg_scaling(df):
    # List of columns which have to be scaled by their average
    col_list = ['col1', 'col2', 'col3']
    for i in col_list:
        df = df.withcolumn(i, col(i)/df.select(f.avg(df[i])))
    return df

new_df = avg_scaling(df)
You can make use of a Window here, partitioned on a pseudo column, and compute the average over that window.
The code goes like this:
columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
("2", "45", "55", "tracey smith"),
("3", "33", "23", "amy sanders")]
df = spark.createDataFrame(data=data,schema=columns)
df.show()
+----+----+----+------------+
|Col1|Col2|Col3| Name|
+----+----+----+------------+
| 1| 40| 56| john jones|
| 2| 45| 55|tracey smith|
| 3| 33| 23| amy sanders|
+----+----+----+------------+
from pyspark.sql import Window
from pyspark.sql import functions as F

def avg_scaling(df, cols_to_scale):
    w = Window.partitionBy(F.lit(1))
    for col in cols_to_scale:
        df = df.withColumn(f"{col}", F.round(F.col(col) / F.avg(col).over(w), 2))
    return df

new_df = avg_scaling(df, ["Col1", "Col2", "Col3"])
new_df.show()
+----+----+----+------------+
|Col1|Col2|Col3| Name|
+----+----+----+------------+
| 0.5|1.02|1.25| john jones|
| 1.5|0.84|0.51| amy sanders|
| 1.0|1.14|1.23|tracey smith|
+----+----+----+------------+

How to sum(case when then) in SparkSQL DataFrame just like sql?

I'm new to Spark SQL, and I want to calculate the percentage of each status in my data.
Here is my data:
A B
11 1
11 3
12 1
13 3
12 2
13 1
11 1
12 2
So, I can do it in SQL like this:
select (C.oneTotal / C.total) as onePercentage,
(C.twoTotal / C.total) as twotPercentage,
(C.threeTotal / C.total) as threPercentage
from (select count(*) as total,
A,
sum(case when B = '1' then 1 else 0 end) as oneTotal,
sum(case when B = '2' then 1 else 0 end) as twoTotal,
sum(case when B = '3' then 1 else 0 end) as threeTotal
from test
group by A) as C;
But with the Spark SQL DataFrame API, first I calculate the total count for every status, like below:
// wrong code
val cc = transDF.select("transData.*").groupBy("A")
.agg(count("transData.*").alias("total"),
sum(when(col("B") === "1", 1)).otherwise(0)).alias("oneTotal")
sum(when(col("B") === "2", 1).otherwise(0)).alias("twoTotal")
The problem is that the sum(when) result is zero. Am I using it incorrectly?
How can I implement this in Spark SQL just like my SQL above, and then calculate the proportion of every status?
Thank you for your help. In the end, I solved it with sum(when). Below is my current code.
val cc = transDF.select("transData.*").groupBy("A")
  .agg(count("transData.*").alias("total"),
    sum(when(col("B") === "1", 1).otherwise(0)).alias("oneTotal"),
    sum(when(col("B") === "2", 1).otherwise(0)).alias("twoTotal"))
  .select(col("total"),
    col("A"),
    (col("oneTotal") / col("total")).alias("oneRate"),
    (col("twoTotal") / col("total")).alias("twoRate"))
Thanks again.
You can use sum(when(...)) or also count(when(...)), the second option being shorter to write:
val df = Seq(
  (11, 1),
  (11, 3),
  (12, 1),
  (13, 3),
  (12, 2),
  (13, 1),
  (11, 1),
  (12, 2)
).toDF("A", "B")

df
  .groupBy($"A")
  .agg(
    count("*").as("total"),
    count(when($"B" === "1", $"A")).as("oneTotal"),
    count(when($"B" === "2", $"A")).as("twoTotal"),
    count(when($"B" === "3", $"A")).as("threeTotal")
  )
  .select(
    $"A",
    ($"oneTotal" / $"total").as("onePercentage"),
    ($"twoTotal" / $"total").as("twoPercentage"),
    ($"threeTotal" / $"total").as("threePercentage")
  )
  .show()
gives
+---+------------------+------------------+------------------+
| A| onePercentage| twoPercentage| threePercentage|
+---+------------------+------------------+------------------+
| 12|0.3333333333333333|0.6666666666666666| 0.0|
| 13| 0.5| 0.0| 0.5|
| 11|0.6666666666666666| 0.0|0.3333333333333333|
+---+------------------+------------------+------------------+
Alternatively, you could produce a "long" table with window functions:
import org.apache.spark.sql.expressions.Window

df
  .groupBy($"A", $"B").count()
  .withColumn("total", sum($"count").over(Window.partitionBy($"A")))
  .select(
    $"A",
    $"B",
    ($"count" / $"total").as("percentage")
  )
  .orderBy($"A", $"B")
  .show()
+---+---+------------------+
| A| B| percentage|
+---+---+------------------+
| 11| 1|0.6666666666666666|
| 11| 3|0.3333333333333333|
| 12| 1|0.3333333333333333|
| 12| 2|0.6666666666666666|
| 13| 1| 0.5|
| 13| 3| 0.5|
+---+---+------------------+
As far as I understood, you want to implement the logic of the SQL shown in the question. One way is the example below.
package examples

import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AggTest extends App {

  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder.appName(getClass.getName)
    .master("local[*]").getOrCreate

  import spark.implicits._

  val df = Seq(
    (11, 1),
    (11, 3),
    (12, 1),
    (13, 3),
    (12, 2),
    (13, 1),
    (11, 1),
    (12, 2)
  ).toDF("A", "B")

  df.show(false)
  df.createOrReplaceTempView("test")

  spark.sql(
    """
      |select (C.oneTotal / C.total) as onePercentage,
      |       (C.twoTotal / C.total) as twotPercentage,
      |       (C.threeTotal / C.total) as threPercentage
      |from (select count(*) as total,
      |             A,
      |             sum(case when B = '1' then 1 else 0 end) as oneTotal,
      |             sum(case when B = '2' then 1 else 0 end) as twoTotal,
      |             sum(case when B = '3' then 1 else 0 end) as threeTotal
      |      from test
      |      group by A) as C
    """.stripMargin).show
}
Result :
+---+---+
|A |B |
+---+---+
|11 |1 |
|11 |3 |
|12 |1 |
|13 |3 |
|12 |2 |
|13 |1 |
|11 |1 |
|12 |2 |
+---+---+
+------------------+------------------+------------------+
| onePercentage| twotPercentage| threPercentage|
+------------------+------------------+------------------+
|0.3333333333333333|0.6666666666666666| 0.0|
| 0.5| 0.0| 0.5|
|0.6666666666666666| 0.0|0.3333333333333333|
+------------------+------------------+------------------+
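One small follow-up, an assumption about what is wanted rather than part of the original answer: the result above has no A column, so the rows cannot be traced back to their group. Adding the grouping key to the outer select fixes that, keeping everything else unchanged:
spark.sql(
  """
    |select C.A,
    |       (C.oneTotal / C.total) as onePercentage,
    |       (C.twoTotal / C.total) as twotPercentage,
    |       (C.threeTotal / C.total) as threPercentage
    |from (select count(*) as total,
    |             A,
    |             sum(case when B = '1' then 1 else 0 end) as oneTotal,
    |             sum(case when B = '2' then 1 else 0 end) as twoTotal,
    |             sum(case when B = '3' then 1 else 0 end) as threeTotal
    |      from test
    |      group by A) as C
  """.stripMargin).show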

Spark: create a sessionId based on timestamp

I would like to do the following transformation. Given a data frame that records whether a user is logged in, my aim is to create a sessionId for each record based on the timestamp and a pre-defined value TIMEOUT = 20.
A session period is defined as: [first record --> first record + TIMEOUT]
For instance, the original DataFrame would look like the following:
scala> val df = sc.parallelize(List(
  ("user1", 0),
  ("user1", 3),
  ("user1", 15),
  ("user1", 22),
  ("user1", 28),
  ("user1", 41),
  ("user1", 45),
  ("user1", 85),
  ("user1", 90)
)).toDF("user_id", "timestamp")
df: org.apache.spark.sql.DataFrame = [user_id: string, timestamp: int]
+-------+---------+
|user_id|timestamp|
+-------+---------+
|user1 |0 |
|user1 |3 |
|user1 |15 |
|user1 |22 |
|user1 |28 |
|user1 |41 |
|user1 |45 |
|user1 |85 |
|user1 |90 |
+-------+---------+
The goal is:
+-------+---------+----------+
|user_id|timestamp|session_id|
+-------+---------+----------+
|user1 |0 | 0 |-> first record (session 0: period [0->20])
|user1 |3 | 0 |
|user1 |15 | 0 |
|user1 |22 | 1 |-> 22 not in [0->20]->new session(period 22->42)
|user1 |28 | 1 |
|user1 |41 | 1 |
|user1 |45 | 2 |-> 45 not in [22->42]->newsession(period 45->65)
|user1 |85 | 3 |
|user1 |90 | 3 |
+-------+---------+----------+
Is there any elegant solution to this problem, preferably in Scala?
Thanks in advance!
This may not be an elegant solution, but it worked for the given data format.
sc.parallelize(List(
  ("user1", 0),
  ("user1", 3),
  ("user1", 15),
  ("user1", 22),
  ("user1", 28),
  ("user1", 41),
  ("user1", 45),
  ("user1", 85),
  ("user1", 90))).toDF("user_id", "timestamp").map { x =>
    val userId = x.getAs[String]("user_id")
    val timestamp = x.getAs[Int]("timestamp")
    val session = timestamp / 20
    (userId, timestamp, session)
  }.toDF("user_id", "timestamp", "session").show()
Result
You can change timestamp / 20 according to your need.
Please see my code. There are two issues here:
1. I don't think the performance is good.
2. I use "user_id" to join; if this doesn't meet your requirement, you can add a new column with the same value to both timeSetFrame and newSessionSec.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import scala.util.control.Breaks
import ss.implicits._

var newSession = ss.sparkContext.parallelize(List(
  ("user1", 0), ("user1", 3), ("user1", 15), ("user1", 22),
  ("user1", 28), ("user1", 41), ("user1", 45), ("user1", 85),
  ("user1", 90))).zipWithIndex().toDF("tmp", "index")

// pull user_id and timestamp back out of the tuple column produced by zipWithIndex
val getUser_id = udf((s: Row) => s.getString(0))
val gettimestamp = udf((s: Row) => s.getInt(1))

val newSessionSec = newSession.withColumn("user_id", getUser_id($"tmp"))
  .withColumn("timestamp", gettimestamp($"tmp")).drop("tmp") // .show()

val timeSet: Array[Int] = newSessionSec.select("timestamp").collect().map(s => s.getInt(0))
val timeSetFrame = ss.sparkContext.parallelize(Seq(("user1", timeSet))).toDF("user_id", "tset")
val newSessionThird = newSessionSec.join(timeSetFrame, Seq("user_id"), "outer") // .show

// walk through the user's timestamps, counting session breaks until the current timestamp is reached
val getSessionID = udf((ts: Int, aa: Seq[Int]) => {
  var result = 0
  var begin = 0
  val loop = new Breaks
  loop.breakable {
    for (time <- aa) {
      if (time > (begin + 20)) {
        begin = time
        result += 1
      }
      if (time == ts) {
        loop.break
      }
    }
  }
  result
})

newSessionThird.withColumn("sessionID", getSessionID($"timestamp", $"tset")).drop("tset", "index").show()
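For completeness, a minimal sketch of a stateful alternative that follows the session definition in the question: a new session opens whenever a timestamp falls outside the current session's [start, start + TIMEOUT] window. It is not taken from either answer above and assumes a DataFrame df with user_id and timestamp columns as in the question.
import spark.implicits._

case class Event(user_id: String, timestamp: Int)

val TIMEOUT = 20

val sessions = df.as[Event]
  .groupByKey(_.user_id)
  .flatMapGroups { (user, events) =>
    // sort this user's events so the walk below sees them in time order
    val sorted = events.toSeq.sortBy(_.timestamp)
    var sessionId = -1
    var sessionStart = 0
    sorted.map { e =>
      // open a new session when the timestamp is outside [sessionStart, sessionStart + TIMEOUT]
      if (sessionId < 0 || e.timestamp > sessionStart + TIMEOUT) {
        sessionId += 1
        sessionStart = e.timestamp
      }
      (e.user_id, e.timestamp, sessionId)
    }
  }
  .toDF("user_id", "timestamp", "session_id")

sessions.orderBy("user_id", "timestamp").show()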

How to pivot a table into a timeseries table in Scala

I have the following table:
index  0      1     2     id
1      9.69   1.18  0.59  62
2      7.38   2.18  0.87  62
3      10.02  1.16  0.29  62
I'm trying to pivot it into a time-series-like table.
Expected Output:
data                 id
[9.69, 7.38, 10.02]  62
[1.18, 2.18, 1.16]   62
[0.59, 0.87, 0.29]   62
I tried the following code:
val table = df.groupBy(df.col("id")).pivot("index").sum("0").cache()
val tablets = table.map(x => new transform(1.until(x.length).map(x.getDouble(_)).toList, x.getString(0)))
case class transform(data:List[Double], start:String)
But it gives only this output:
[9.69, 7.38, 10.02] 62
How can I iterate through all columns and get the desired output table as above?
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

class pivot(df: DataFrame) {

  val col1Names = df.drop("id").columns.tail

  val kv = explode(array(df.select(col1Names.map(col): _*).columns.map {
    c => struct(lit(c).alias("k"), col(c).alias("v"))
  }: _*))

  val tempdf = df.withColumn("kv", kv)
    .select("index", "kv.k", "kv.v", "id")
    .groupBy("id", "k")
    .pivot("index")
    .agg(first("v"))
    .drop("k")

  val col2Names = tempdf.columns.tail

  val finaldf = tempdf.withColumn("data", array(col2Names.map(col): _*)).drop(col2Names: _*)
}
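A possible usage sketch for the class above (assuming the same input DataFrame df as in the question; the vals inside the class are computed when it is instantiated):
// instantiate the class with the input DataFrame and read off its result
val pivoted = new pivot(df)
pivoted.finaldf.show(false)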
In your solution you have used groupBy and sum, which generate one aggregated row for each group. That's why you were getting only one row in your result.
The solution to your problem is a bit complex. I have used a combination of withColumn, explode, array, struct, pivot, groupBy, agg, drop, col, select and alias. The following is the solution.
val df = Seq(
  (1, 9.69, 1.18, 0.59, 62),
  (2, 7.38, 2.18, 0.87, 62),
  (3, 10.02, 1.16, 0.29, 62)
).toDF("index", "0", "1", "2", "id")
As described in your question, you should already have a dataframe like the one below after reading the input above:
+-----+-----+----+----+---+
|index|0 |1 |2 |id |
+-----+-----+----+----+---+
|1 |9.69 |1.18|0.59|62 |
|2 |7.38 |2.18|0.87|62 |
|3 |10.02|1.16|0.29|62 |
+-----+-----+----+----+---+
If so, then the following solution should work.
val col1Names = df.drop("id").columns.tail

val kv = explode(array(df.select(col1Names.map(col): _*).columns.map {
  c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))

val tempdf = df.withColumn("kv", kv)
  .select("index", "kv.k", "kv.v", "id")
  .groupBy("id", "k")
  .pivot("index")
  .agg(first("v"))
  .orderBy("k")
  .drop("k")

val col2Names = tempdf.columns.tail

val finaldf = tempdf.withColumn("data", array(col2Names.map(col): _*)).drop(col2Names: _*).sort($"data".desc)
You should get the following output:
+---+-------------------+
|id |data |
+---+-------------------+
|62 |[9.69, 7.38, 10.02]|
|62 |[1.18, 2.18, 1.16] |
|62 |[0.59, 0.87, 0.29] |
+---+-------------------+

Spark dataframe data aggregation

I have the below requirement to aggregate data on a Spark dataframe in Scala.
I have a Spark dataframe with two columns.
mo_id sales
201601 11.01
201602 12.01
201603 13.01
201604 14.01
201605 15.01
201606 16.01
201607 17.01
201608 18.01
201609 19.01
201610 20.01
201611 21.01
201612 22.01
As shown above, the dataframe has two columns, 'mo_id' and 'sales'.
I want to add a new column (agg_sales) to the dataframe which should have the sum of sales up to the current month, as shown below.
mo_id sales agg_sales
201601 10 10
201602 20 30
201603 30 60
201604 40 100
201605 50 150
201606 60 210
201607 70 280
201608 80 360
201609 90 450
201610 100 550
201611 110 660
201612 120 780
Description:
For the month 201603 agg_sales will be sum of sales from 201601 to 201603.
For the month 201604 agg_sales will be sum of sales from 201601 to 201604.
and so on.
Can anyone please help with this?
Versions used: Spark 1.6.2 and Scala 2.10.
You are looking for a cumulative sum which can be accomplished with a window function:
scala> val df = sc.parallelize(Seq((201601, 10), (201602, 20), (201603, 30), (201604, 40), (201605, 50), (201606, 60), (201607, 70), (201608, 80), (201609, 90), (201610, 100), (201611, 110), (201612, 120))).toDF("id","sales")
df: org.apache.spark.sql.DataFrame = [id: int, sales: int]
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val ordering = Window.orderBy("id")
ordering: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@75d454a4
scala> df.withColumn("agg_sales", sum($"sales").over(ordering)).show
16/12/27 21:11:35 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+------+-----+-------------+
| id|sales| agg_sales |
+------+-----+-------------+
|201601| 10| 10|
|201602| 20| 30|
|201603| 30| 60|
|201604| 40| 100|
|201605| 50| 150|
|201606| 60| 210|
|201607| 70| 280|
|201608| 80| 360|
|201609| 90| 450|
|201610| 100| 550|
|201611| 110| 660|
|201612| 120| 780|
+------+-----+-------------+
Note that I defined the ordering on the ids; you would probably want some sort of timestamp to order the summation.
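A variant sketch for data that does have a natural partition key (called part_id here purely for illustration; it is not in the original data): partitioning the window avoids the single-partition warning above, and the frame is made explicit with Long.MinValue to 0, i.e. unbounded preceding up to the current row, which works on Spark 1.6.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val byPart = Window
  .partitionBy("part_id")          // hypothetical partition column, e.g. a store or product id
  .orderBy("id")
  .rowsBetween(Long.MinValue, 0)   // cumulative frame: unbounded preceding to current row

df.withColumn("agg_sales", sum($"sales").over(byPart)).show()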