Spark: create a sessionId based on timestamp - scala

I would like to do the following transformation. Given a DataFrame that records whether a user is logged in, my aim is to create a sessionId for each record based on the timestamp and a pre-defined value TIMEOUT = 20.
A session period is defined as: [first record --> first record + TIMEOUT]
For instance, the original DataFrame would look like the following:
scala> val df = sc.parallelize(List(
("user1",0),
("user1",3),
("user1",15),
("user1",22),
("user1",28),
("user1",41),
("user1",45),
("user1",85),
("user1",90)
)).toDF("user_id","timestamp")
df: org.apache.spark.sql.DataFrame = [user_id: string, timestamp: int]
+-------+---------+
|user_id|timestamp|
+-------+---------+
|user1 |0 |
|user1 |3 |
|user1 |15 |
|user1 |22 |
|user1 |28 |
|user1 |41 |
|user1 |45 |
|user1 |85 |
|user1 |90 |
+-------+---------+
The goal is:
+-------+---------+----------+
|user_id|timestamp|session_id|
+-------+---------+----------+
|user1 |0 | 0 |-> first record (session 0: period [0->20])
|user1 |3 | 0 |
|user1 |15 | 0 |
|user1 |22 | 1 |-> 22 not in [0->20]->new session(period 22->42)
|user1 |28 | 1 |
|user1 |41 | 1 |
|user1 |45 | 2 |-> 45 not in [22->42]->new session(period 45->65)
|user1 |85 | 3 |
|user1 |90 | 3 |
+-------+---------+----------+
Is there an elegant solution to this problem, preferably in Scala?
Thanks in advance!
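For reference, the rule above can be sketched with a typed Dataset by grouping per user and replaying the timestamps in order. This is a minimal sketch, not a definitive solution: it assumes spark is the SparkSession, spark.implicits._ is in scope, df is the DataFrame from the question, and each user's records fit in memory; the Event case class and value names are only illustrative.
// Open a new session whenever a timestamp falls outside [sessionStart, sessionStart + TIMEOUT].
import spark.implicits._

case class Event(user_id: String, timestamp: Int) // illustrative helper type
val TIMEOUT = 20

val withSessions = df.as[Event]
  .groupByKey(_.user_id)
  .flatMapGroups { (user, events) =>
    var sessionId = -1
    var sessionStart = Int.MinValue
    events.toSeq.sortBy(_.timestamp).map { e =>
      if (sessionId < 0 || e.timestamp > sessionStart + TIMEOUT) {
        sessionId += 1            // start a new session
        sessionStart = e.timestamp
      }
      (e.user_id, e.timestamp, sessionId)
    }
  }
  .toDF("user_id", "timestamp", "session_id")

withSessions.orderBy("user_id", "timestamp").show()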

This may not be an elegant solution, but it worked for the given data format.
sc.parallelize(List(
("user1", 0),
("user1", 3),
("user1", 15),
("user1", 22),
("user1", 28),
("user1", 41),
("user1", 45),
("user1", 85),
("user1", 90))).toDF("user_id", "timestamp").map { x =>
val userId = x.getAs[String]("user_id")
val timestamp = x.getAs[Int]("timestamp")
val session = timestamp / 20
(userId, timestamp, session)
}.toDF("user_id", "timestamp", "session").show()
Result:
You can change timestamp / 20 according to your needs.
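Note that fixed-width bucketing only approximates the rule in the question, where each window starts at the session's first record rather than at a multiple of TIMEOUT. A quick check against the expected output (TIMEOUT = 20 assumed):
// timestamps 41, 85 and 90 fall into different buckets than their expected session_id
Seq(0 -> 0, 3 -> 0, 15 -> 0, 22 -> 1, 28 -> 1, 41 -> 1, 45 -> 2, 85 -> 3, 90 -> 3)
  .foreach { case (ts, expected) =>
    println(s"timestamp=$ts bucket=${ts / 20} expected_session=$expected")
  }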

Please see my code.
There are two issues here:
1. I don't think the performance is good.
2. I use "user_id" to join; if that doesn't meet your requirements, you can add a new column with the same value to both timeSetFrame and newSessionSec.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import scala.util.control.Breaks
import ss.implicits._ // ss is the SparkSession

var newSession = ss.sparkContext.parallelize(List(
("user1", 0), ("user1", 3), ("user1", 15), ("user1", 22),
("user1", 28), ("user1", 41), ("user1", 45), ("user1", 85),
("user1", 90))).zipWithIndex().toDF("tmp", "index")
val getUser_id = udf( ( s : Row) => {
s.getString(0)
})
val gettimestamp = udf( ( s : Row) => {
s.getInt(1)
})
val newSessionSec = newSession.withColumn( "user_id", getUser_id($"tmp"))
.withColumn( "timestamp", gettimestamp($"tmp")).drop( "tmp") //.show()
val timeSet : Array[Int] = newSessionSec.select("timestamp").collect().map( s=>s.getInt(0))
val timeSetFrame = ss.sparkContext.parallelize( Seq(( "user1",timeSet))).toDF( "user_id", "tset")
val newSessionThird = newSessionSec.join( timeSetFrame, Seq("user_id"), "outer") // .show
val getSessionID = udf( ( ts: Int, aa: Seq[Int]) => {
var result = 0
var begin = 0
val loop = new Breaks
loop.breakable {
for (time <- aa) {
if (time > (begin + 20)) {
begin = time
result += 1
}
if (time == ts) {
loop.break;
}
}
}
result
})
newSessionThird.withColumn( "sessionID", getSessionID( $"timestamp", $"tset")).drop("tset", "index").show()
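If collecting the timestamps to the driver and joining on user_id is a concern, one possible variant is to build the per-user timestamp array with a window instead and reuse the same getSessionID UDF. This is only a sketch; the window name w is illustrative and it still assumes the per-user array fits in memory.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

val w = Window.partitionBy("user_id").orderBy("timestamp")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

newSessionSec
  .withColumn("tset", collect_list($"timestamp").over(w)) // ascending timestamps per user
  .withColumn("sessionID", getSessionID($"timestamp", $"tset"))
  .drop("tset", "index")
  .show()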

Related

Scala - Return the largest string within each group

DataSet:
+---+--------+
|age| name|
+---+--------+
| 33| Will|
| 26|Jean-Luc|
| 55| Hugh|
| 40| Deanna|
| 68| Quark|
| 59| Weyoun|
| 37| Gowron|
| 54| Will|
| 38| Jadzia|
| 27| Hugh|
+---+--------+
Here is my attempt, but it just returns the length of the largest string rather than the string itself:
AgeName.groupBy("age")
.agg(max(length(AgeName("name")))).show()
The usual row_number trick should work if you specify the Window correctly. Using @LeoC's example,
val df = Seq(
(35, "John"),
(22, "Jennifer"),
(22, "Alexander"),
(35, "Michelle"),
(22, "Celia")
).toDF("age", "name")
val df2 = df.withColumn(
"rownum",
expr("row_number() over (partition by age order by length(name) desc)")
).filter("rownum = 1").drop("rownum")
df2.show
+---+---------+
|age| name|
+---+---------+
| 22|Alexander|
| 35| Michelle|
+---+---------+
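The same window can also be expressed with the Column-based API rather than a SQL expression string. A small variant, assuming org.apache.spark.sql.functions._ and spark.implicits._ are in scope; the names byAge and df3 are just illustrative:
import org.apache.spark.sql.expressions.Window

val byAge = Window.partitionBy("age").orderBy(length($"name").desc)

val df3 = df
  .withColumn("rownum", row_number().over(byAge))
  .filter($"rownum" === 1)
  .drop("rownum")

df3.show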
Here's one approach using the Spark higher-order function aggregate, as shown below:
val df = Seq(
(35, "John"),
(22, "Jennifer"),
(22, "Alexander"),
(35, "Michelle"),
(22, "Celia")
).toDF("age", "name")
df.
groupBy("age").agg(collect_list("name").as("names")).
withColumn(
"longest_name",
expr("aggregate(names, '', (acc, x) -> case when length(acc) < length(x) then x else acc end)")
).
show(false)
// +---+----------------------------+------------+
// |age|names |longest_name|
// +---+----------------------------+------------+
// |22 |[Jennifer, Alexander, Celia]|Alexander |
// |35 |[John, Michelle] |Michelle |
// +---+----------------------------+------------+
Note that higher-order functions are available only on Spark 2.4+.
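For older Spark versions, a common workaround with the same effect is to take the max of a struct keyed by the name length and then pull the name back out. A sketch using the same df, assuming org.apache.spark.sql.functions._ is in scope:
df
  .groupBy("age")
  .agg(max(struct(length(col("name")).as("len"), col("name"))).as("longest"))
  .select(col("age"), col("longest").getField("name").as("longest_name"))
  .show(false)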
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object BasicDatasetTest {
def main(args: Array[String]): Unit = {
val spark=SparkSession.builder()
.master("local[*]")
.appName("BasicDatasetTest")
.getOrCreate()
val pairs=List((33,"Will"),(26,"Jean-Luc"),
(55, "Hugh"),
(26, "Deanna"),
(26, "Quark"),
(55, "Weyoun"),
(33, "Gowron"),
(55, "Will"),
(26, "Jadzia"),
(27, "Hugh"))
val schema=new StructType(Array(
StructField("age",IntegerType,false),
StructField("name",StringType,false))
)
val dataRDD=spark.sparkContext.parallelize(pairs).map(record=>Row(record._1,record._2))
val dataset=spark.createDataFrame(dataRDD,schema)
val ageNameGroup=dataset.groupBy("age","name")
.agg(max(length(col("name"))))
.withColumnRenamed("max(length(name))","length")
ageNameGroup.printSchema()
val ageGroup=dataset.groupBy("age")
.agg(max(length(col("name"))))
.withColumnRenamed("max(length(name))","length")
ageGroup.printSchema()
ageGroup.createOrReplaceTempView("age_group")
ageNameGroup.createOrReplaceTempView("age_name_group")
spark.sql("select ag.age,ang.name from age_group as ag, age_name_group as ang " +
"where ag.age=ang.age and ag.length=ang.length")
.show()
}
}

How to sum(case when then) in SparkSQL DataFrame just like sql?

I'm new to Spark SQL, and I want to calculate the percentage of every status in my data.
Here is my data:
A B
11 1
11 3
12 1
13 3
12 2
13 1
11 1
12 2
So, I can do it in SQL like this:
select (C.oneTotal / C.total) as onePercentage,
(C.twoTotal / C.total) as twotPercentage,
(C.threeTotal / C.total) as threPercentage
from (select count(*) as total,
A,
sum(case when B = '1' then 1 else 0 end) as oneTotal,
sum(case when B = '2' then 1 else 0 end) as twoTotal,
sum(case when B = '3' then 1 else 0 end) as threeTotal
from test
group by A) as C;
But with the Spark SQL DataFrame API, first I calculate the total count for every status like below:
// wrong code
val cc = transDF.select("transData.*").groupBy("A")
.agg(count("transData.*").alias("total"),
sum(when(col("B") === "1", 1)).otherwise(0)).alias("oneTotal")
sum(when(col("B") === "2", 1).otherwise(0)).alias("twoTotal")
The problem is that the result of sum(when) is zero. Am I using it incorrectly?
How can I implement this in Spark SQL just like my SQL above, and then calculate the proportion of every status?
Thank you for your help. In the end, I solved it with sum(when); below is my current code.
val cc = transDF.select("transData.*").groupBy("A")
.agg(count("transData.*").alias("total"),
sum(when(col("B") === "1", 1).otherwise(0)).alias("oneTotal"),
sum(when(col("B") === "2", 1).otherwise(0)).alias("twoTotal"))
.select(col("total"),
col("A"),
col("oneTotal") / col("total").alias("oneRate"),
col("twoTotal") / col("total").alias("twoRate"))
Thanks again.
You can use sum(when(...)) or also count(when(...)); the second option is shorter to write, since count simply ignores the nulls that when produces when the condition does not match:
val df = Seq(
(11, 1),
(11, 3),
(12, 1),
(13, 3),
(12, 2),
(13, 1),
(11, 1),
(12, 2)
).toDF("A", "B")
df
.groupBy($"A")
.agg(
count("*").as("total"),
count(when($"B"==="1",$"A")).as("oneTotal"),
count(when($"B"==="2",$"A")).as("twoTotal"),
count(when($"B"==="3",$"A")).as("threeTotal")
)
.select(
$"A",
($"oneTotal"/$"total").as("onePercentage"),
($"twoTotal"/$"total").as("twoPercentage"),
($"threeTotal"/$"total").as("threePercentage")
)
.show()
gives
+---+------------------+------------------+------------------+
| A| onePercentage| twoPercentage| threePercentage|
+---+------------------+------------------+------------------+
| 12|0.3333333333333333|0.6666666666666666| 0.0|
| 13| 0.5| 0.0| 0.5|
| 11|0.6666666666666666| 0.0|0.3333333333333333|
+---+------------------+------------------+------------------+
Alternatively, you could produce a "long" table with window functions:
import org.apache.spark.sql.expressions.Window
df
.groupBy($"A",$"B").count()
.withColumn("total",sum($"count").over(Window.partitionBy($"A")))
.select(
$"A",
$"B",
($"count"/$"total").as("percentage")
).orderBy($"A",$"B")
.show()
+---+---+------------------+
| A| B| percentage|
+---+---+------------------+
| 11| 1|0.6666666666666666|
| 11| 3|0.3333333333333333|
| 12| 1|0.3333333333333333|
| 12| 2|0.6666666666666666|
| 13| 1| 0.5|
| 13| 3| 0.5|
+---+---+------------------+
As far as I understood, you want to implement the logic of the SQL shown in the question.
One way is the example below:
package examples
import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object AggTest extends App {
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
val spark = SparkSession.builder.appName(getClass.getName)
.master("local[*]").getOrCreate
import spark.implicits._
val df = Seq(
(11, 1),
(11, 3),
(12, 1),
(13, 3),
(12, 2),
(13, 1),
(11, 1),
(12, 2)
).toDF("A", "B")
df.show(false)
df.createOrReplaceTempView("test")
spark.sql(
"""
|select (C.oneTotal / C.total) as onePercentage,
| (C.twoTotal / C.total) as twotPercentage,
| (C.threeTotal / C.total) as threPercentage
|from (select count(*) as total,
| A,
| sum(case when B = '1' then 1 else 0 end) as oneTotal,
| sum(case when B = '2' then 1 else 0 end) as twoTotal,
| sum(case when B = '3' then 1 else 0 end) as threeTotal
| from test
| group by A) as C
""".stripMargin).show
}
Result:
+---+---+
|A |B |
+---+---+
|11 |1 |
|11 |3 |
|12 |1 |
|13 |3 |
|12 |2 |
|13 |1 |
|11 |1 |
|12 |2 |
+---+---+
+------------------+------------------+------------------+
| onePercentage| twotPercentage| threPercentage|
+------------------+------------------+------------------+
|0.3333333333333333|0.6666666666666666| 0.0|
| 0.5| 0.0| 0.5|
|0.6666666666666666| 0.0|0.3333333333333333|
+------------------+------------------+------------------+
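If you also want to see which A each percentage row belongs to, a small tweak is to carry C.A through the outer select. This is just a sketch against the same test temp view:
spark.sql(
  """
    |select C.A,
    |       (C.oneTotal / C.total) as onePercentage,
    |       (C.twoTotal / C.total) as twoPercentage,
    |       (C.threeTotal / C.total) as threePercentage
    |from (select A,
    |             count(*) as total,
    |             sum(case when B = '1' then 1 else 0 end) as oneTotal,
    |             sum(case when B = '2' then 1 else 0 end) as twoTotal,
    |             sum(case when B = '3' then 1 else 0 end) as threeTotal
    |      from test
    |      group by A) as C
  """.stripMargin).show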

How to split Comma-separated multiple columns into multiple rows?

I have a data frame with N fields, as mentioned below. The number of columns and the length of the values will vary.
Input Table:
+--------------+-----------+-----------------------+
|Date |Amount |Status |
+--------------+-----------+-----------------------+
|2019,2018,2017|100,200,300|IN,PRE,POST |
|2018 |73 |IN |
|2018,2017 |56,89 |IN,PRE |
+--------------+-----------+-----------------------+
I have to convert it into the below format with one sequence column.
Expected Output Table:
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019|100   |IN    |1       |
|2018|200   |PRE   |2       |
|2017|300   |POST  |3       |
|2018|73    |IN    |1       |
|2018|56    |IN    |1       |
|2017|89    |PRE   |2       |
+----+------+------+--------+
I have tried using explode, but explode only takes one array at a time.
var df = dataRefined.withColumn("TOT_OVRDUE_TYPE", explode(split($"TOT_OVRDUE_TYPE", "\\,"))).toDF
var df1 = df.withColumn("TOT_OD_TYPE_AMT", explode(split($"TOT_OD_TYPE_AMT", "\\,"))).show
Does someone know how I can do it? Thank you for your help.
Here is another approach using posexplode for each column and joining all produced dataframes into one:
import org.apache.spark.sql.functions.{posexplode, monotonically_increasing_id, col}
val df = Seq(
(Seq("2019", "2018", "2017"), Seq("100", "200", "300"), Seq("IN", "PRE", "POST")),
(Seq("2018"), Seq("73"), Seq("IN")),
(Seq("2018", "2017"), Seq("56", "89"), Seq("IN", "PRE")))
.toDF("Date","Amount", "Status")
.withColumn("idx", monotonically_increasing_id)
df.columns.filter(_ != "idx").map{
c => df.select($"idx", posexplode(col(c))).withColumnRenamed("col", c)
}
.reduce((ds1, ds2) => ds1.join(ds2, Seq("idx", "pos")))
.select($"Date", $"Amount", $"Status", $"pos".plus(1).as("Sequence"))
.show
Output:
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
You can achieve this by using the built-in DataFrame functions arrays_zip, split and posexplode.
Explanation:
scala>val df=Seq((("2019,2018,2017"),("100,200,300"),("IN,PRE,POST")),(("2018"),("73"),("IN")),(("2018,2017"),("56,89"),("IN,PRE"))).toDF("date","amount","status")
scala>:paste
df.selectExpr("""posexplode(
arrays_zip(
split(date,","), //split date string with ',' to create array
split(amount,","),
split(status,","))) //zip arrays
as (p,colum) //pos explode on zip arrays will give position and column value
""")
.selectExpr("colum.`0` as Date", //get 0 column as date
"colum.`1` as Amount",
"colum.`2` as Status",
"p+1 as Sequence") //add 1 to the position value
.show()
Result:
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
Yes, I personally also find explode a bit annoying and in your case I would probably go with a flatMap instead:
import spark.implicits._
import org.apache.spark.sql.Row
val df = spark.sparkContext.parallelize(Seq((Seq(2019,2018,2017), Seq(100,200,300), Seq("IN","PRE","POST")),(Seq(2018), Seq(73), Seq("IN")),(Seq(2018,2017), Seq(56,89), Seq("IN","PRE")))).toDF()
val transformedDF = df
.flatMap{case Row(dates: Seq[Int], amounts: Seq[Int], statuses: Seq[String]) =>
dates.indices.map(index => (dates(index), amounts(index), statuses(index), index+1))}
.toDF("Date", "Amount", "Status", "Sequence")
Output:
transformedDF.show
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
Assuming the number of data elements in each column is the same for each row:
First, I recreated your DataFrame
import org.apache.spark.sql._
import scala.collection.mutable.ListBuffer
val df = Seq(("2019,2018,2017", "100,200,300", "IN,PRE,POST"), ("2018", "73", "IN"),
("2018,2017", "56,89", "IN,PRE")).toDF("Date", "Amount", "Status")
Next, I split the rows and added a sequence value, then converted back to a DF:
val exploded = df.rdd.flatMap(row => {
val buffer = new ListBuffer[(String, String, String, Int)]
val dateSplit = row(0).toString.split("\\,", -1)
val amountSplit = row(1).toString.split("\\,", -1)
val statusSplit = row(2).toString.split("\\,", -1)
val seqSize = dateSplit.size
for(i <- 0 to seqSize-1)
buffer += Tuple4(dateSplit(i), amountSplit(i), statusSplit(i), i+1)
buffer.toList
}).toDF((df.columns:+"Sequence"): _*)
I'm sure there are other ways to do it without first converting the DF to an RDD, but this will still result in a DF with the correct answer.
Let me know if you have any questions.
I took advantage of transpose to zip all sequences by position and then did a posexplode. The selects on the DataFrames are dynamic, to satisfy the condition in the question that the number of columns and the length of the values will vary.
import org.apache.spark.sql.functions._
val df = Seq(
("2019,2018,2017", "100,200,300", "IN,PRE,POST"),
("2018", "73", "IN"),
("2018,2017", "56,89", "IN,PRE")
).toDF("Date", "Amount", "Status")
df: org.apache.spark.sql.DataFrame = [Date: string, Amount: string ... 1 more field]
scala> df.show(false)
+--------------+-----------+-----------+
|Date |Amount |Status |
+--------------+-----------+-----------+
|2019,2018,2017|100,200,300|IN,PRE,POST|
|2018 |73 |IN |
|2018,2017 |56,89 |IN,PRE |
+--------------+-----------+-----------+
scala> def transposeSeqOfSeq[S](x:Seq[Seq[S]]): Seq[Seq[S]] = { x.transpose }
transposeSeqOfSeq: [S](x: Seq[Seq[S]])Seq[Seq[S]]
scala> val myUdf = udf { transposeSeqOfSeq[String] _}
myUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(ArrayType(StringType,true),true))))
scala> val df2 = df.select(df.columns.map(c => split(col(c), ",") as c): _*)
df2: org.apache.spark.sql.DataFrame = [Date: array<string>, Amount: array<string> ... 1 more field]
scala> df2.show(false)
+------------------+---------------+---------------+
|Date |Amount |Status |
+------------------+---------------+---------------+
|[2019, 2018, 2017]|[100, 200, 300]|[IN, PRE, POST]|
|[2018] |[73] |[IN] |
|[2018, 2017] |[56, 89] |[IN, PRE] |
+------------------+---------------+---------------+
scala> val df3 = df2.withColumn("allcols", array(df.columns.map(c => col(c)): _*))
df3: org.apache.spark.sql.DataFrame = [Date: array<string>, Amount: array<string> ... 2 more fields]
scala> df3.show(false)
+------------------+---------------+---------------+------------------------------------------------------+
|Date |Amount |Status |allcols |
+------------------+---------------+---------------+------------------------------------------------------+
|[2019, 2018, 2017]|[100, 200, 300]|[IN, PRE, POST]|[[2019, 2018, 2017], [100, 200, 300], [IN, PRE, POST]]|
|[2018] |[73] |[IN] |[[2018], [73], [IN]] |
|[2018, 2017] |[56, 89] |[IN, PRE] |[[2018, 2017], [56, 89], [IN, PRE]] |
+------------------+---------------+---------------+------------------------------------------------------+
scala> val df4 = df3.withColumn("ab", myUdf($"allcols")).select($"ab", posexplode($"ab"))
df4: org.apache.spark.sql.DataFrame = [ab: array<array<string>>, pos: int ... 1 more field]
scala> df4.show(false)
+------------------------------------------------------+---+-----------------+
|ab |pos|col |
+------------------------------------------------------+---+-----------------+
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|0 |[2019, 100, IN] |
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|1 |[2018, 200, PRE] |
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|2 |[2017, 300, POST]|
|[[2018, 73, IN]] |0 |[2018, 73, IN] |
|[[2018, 56, IN], [2017, 89, PRE]] |0 |[2018, 56, IN] |
|[[2018, 56, IN], [2017, 89, PRE]] |1 |[2017, 89, PRE] |
+------------------------------------------------------+---+-----------------+
scala> val selCols = (0 until df.columns.length).map(i => $"col".getItem(i).as(df.columns(i))) :+ ($"pos"+1).as("Sequence")
selCols: scala.collection.immutable.IndexedSeq[org.apache.spark.sql.Column] = Vector(col[0] AS `Date`, col[1] AS `Amount`, col[2] AS `Status`, (pos + 1) AS `Sequence`)
scala> df4.select(selCols:_*).show(false)
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019|100 |IN |1 |
|2018|200 |PRE |2 |
|2017|300 |POST |3 |
|2018|73 |IN |1 |
|2018|56 |IN |1 |
|2017|89 |PRE |2 |
+----+------+------+--------+
This is why I love the Spark core APIs: just with the help of map and flatMap you can handle many problems. Just pass your df and an instance of SQLContext to the method below and it will give the desired result:
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

def reShapeDf(df: DataFrame, sqlContext: SQLContext): DataFrame = {
  // pull the three comma-separated string columns out of each Row
  val rdd = df.rdd.map(m => (m.getAs[String](0), m.getAs[String](1), m.getAs[String](2)))
  // split each column on "," and zip the pieces together position by position
  val rdd1 = rdd.flatMap(a => a._1.split(",").zip(a._2.split(",")).zip(a._3.split(",")))
  // flatten the nested ((date, amount), status) tuples into (date, amount, status)
  val rdd2 = rdd1.map {
    case ((a, b), c) => (a, b, c)
  }
  // rebuild a DataFrame with the original schema
  sqlContext.createDataFrame(rdd2.map(m => Row.fromTuple(m)), df.schema)
}
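A possible usage sketch, assuming the comma-separated df from the question and that spark.sqlContext is available; the name reshaped is only illustrative. Note that this method keeps the original three columns and does not add the Sequence column:
val reshaped = reShapeDf(df, spark.sqlContext)
reshaped.show(false)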

Add new record before another in Spark

I have a Dataframe:
| ID | TIMESTAMP | VALUE |
1 15:00:01 3
1 17:04:02 2
I want to add, using Spark/Scala, a new record before each record whose value is 2, with the same timestamp minus 1 second.
The output would be:
| ID | TIMESTAMP | VALUE |
1 15:00:01 3
1 17:04:01 2
1 17:04:02 2
Thanks
You need a .flatMap()
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
val data = (spark.createDataset(Seq(
(1, "15:00:01", 3),
(1, "17:04:02", 2)
)).toDF("ID", "TIMESTAMP_STR", "VALUE")
.withColumn("TIMESTAMP", $"TIMESTAMP_STR".cast("timestamp").as("TIMESTAMP"))
.drop("TIMESTAMP_STR")
.select("ID", "TIMESTAMP", "VALUE")
)
data.as[(Long, java.sql.Timestamp, Long)].flatMap(r => {
if(r._3 == 2) {
Seq(
(r._1, new java.sql.Timestamp(r._2.getTime() - 1000L), r._3),
(r._1, r._2, r._3)
)
} else {
Some(r._1, r._2, r._3)
}
}).toDF("ID", "TIMESTAMP", "VALUE").show()
Which results in:
+---+-------------------+-----+
| ID| TIMESTAMP|VALUE|
+---+-------------------+-----+
| 1|2019-03-04 15:00:01| 3|
| 1|2019-03-04 17:04:01| 2|
| 1|2019-03-04 17:04:02| 2|
+---+-------------------+-----+
You can introduce a new array column: when value = 2 then Array(-1, 0), otherwise Array(0). Then explode that column and add its elements to the timestamp as seconds. The below should work for you; check this out:
scala> val df = Seq((1,"15:00:01",3),(1,"17:04:02",2)).toDF("id","timestamp","value")
df: org.apache.spark.sql.DataFrame = [id: int, timestamp: string ... 1 more field]
scala> val df2 = df.withColumn("timestamp",'timestamp.cast("timestamp"))
df2: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 1 more field]
scala> df2.show(false)
+---+-------------------+-----+
|id |timestamp |value|
+---+-------------------+-----+
|1 |2019-03-04 15:00:01|3 |
|1 |2019-03-04 17:04:02|2 |
+---+-------------------+-----+
scala> val df3 = df2.withColumn("newc", when($"value"===lit(2),lit(Array(-1,0))).otherwise(lit(Array(0))))
df3: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 2 more fields]
scala> df3.show(false)
+---+-------------------+-----+-------+
|id |timestamp |value|newc |
+---+-------------------+-----+-------+
|1 |2019-03-04 15:00:01|3 |[0] |
|1 |2019-03-04 17:04:02|2 |[-1, 0]|
+---+-------------------+-----+-------+
scala> val df4 = df3.withColumn("c_explode",explode('newc)).withColumn("timestamp2",to_timestamp(unix_timestamp('timestamp)+'c_explode))
df4: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 4 more fields]
scala> df4.select($"id",$"timestamp2",$"value").show(false)
+---+-------------------+-----+
|id |timestamp2 |value|
+---+-------------------+-----+
|1 |2019-03-04 15:00:01|3 |
|1 |2019-03-04 17:04:01|2 |
|1 |2019-03-04 17:04:02|2 |
+---+-------------------+-----+
scala>
If you want the time part alone, then you can do like
scala> df4.withColumn("timestamp",from_unixtime(unix_timestamp('timestamp2),"HH:mm:ss")).select($"id",$"timestamp",$"value").show(false)
+---+---------+-----+
|id |timestamp|value|
+---+---------+-----+
|1 |15:00:01 |3 |
|1 |17:04:01 |2 |
|1 |17:04:02 |2 |
+---+---------+-----+

Comparing Previous data with Current data in spark scala

I want to compare previous data with current data month by month. I have data like below.
Data-set 1 (Prev)          Data-set 2 (Latest)
Year-Month   Sum-count     Year-Month   Sum-count
--           --            201808       48
201807       30            201807       22
201806       20            201806       20
201805       35            201805       20
201804       12            201804       9
201803       15            --           --
I have data sets as shown above. I want to compare both data sets based on the Year-Month column and the Sum-count, and need to find out the difference as a percentage.
I am using Spark 2.3.0 and Scala 2.11.
Here is my code:
import org.apache.spark.sql.functions.lag
val mdf = spark.read.format("csv").
option("InferSchema","true").
option("header","true").
option("delimiter",",").
option("charset","utf-8").
load("c:\\test.csv")
mdf.createOrReplaceTempView("test")
val res= spark.sql("select year-month,SUM(Sum-count) as SUM_AMT from test d group by year-month")
val win = org.apache.spark.sql.expressions.Window.orderBy("data_ym")
val res1 = res.withColumn("Prev_month", lag("SUM_AMT", 1,0).over(win)).withColumn("percentage",col("Prev_month") / sum("SUM_AMT").over()).show()
I need output like this: if the percentage is more than 10%, then I need to set the flag to F.
set1     cnt   set2     cnt   output(Percentage)   Flag
201807   30    201807   22    7%                   T
201806   20    201806   20    0%                   T
201805   35    201805   20    57%                  F
Please help me on this.
This can be done in the following way:
import org.apache.spark.sql.functions._
import spark.implicits._

val data1 = List(
("201807", 30),
("201806", 20),
("201805", 35),
("201804", 12),
("201803", 15)
)
val data2 = List(
("201808", 48),
("201807", 22),
("201806", 20),
("201805", 20),
("201804", 9)
)
val df1 = data1.toDF("Year-month", "Sum-count")
val df2 = data2.toDF("Year-month", "Sum-count")
val joined = df1.alias("df1").join(df2.alias("df2"), "Year-month")
joined
.withColumn("output(Percentage)", abs($"df1.Sum-count" - $"df2.Sum-count").divide($"df1.Sum-count"))
.withColumn("Flag", when($"output(Percentage)" > 0.1, "F").otherwise("T"))
.show(false)
Output:
+----------+---------+---------+-------------------+----+
|Year-month|Sum-count|Sum-count|output(Percentage) |Flag|
+----------+---------+---------+-------------------+----+
|201807 |30 |22 |0.26666666666666666|F |
|201806 |20 |20 |0.0 |T |
|201805 |35 |20 |0.42857142857142855|F |
|201804 |12 |9 |0.25 |F |
+----------+---------+---------+-------------------+----+
Here's my solution:
val values1 = List(List("1201807", "30")
,List("1201806", "20") ,
List("1201805", "35"),
List("1201804","12"),
List("1201803","15")
).map(x =>(x(0), x(1)))
val values2 = List(List("201808", "48")
,List("1201807", "22") ,
List("1201806", "20"),
List("1201805","20"),
List("1201804","9")
).map(x =>(x(0), x(1)))
import spark.implicits._
import org.apache.spark.sql.functions._
val df1 = values1.toDF
val df2 = values2.toDF
df1.join(df2, Seq("_1"), "full").toDF("set", "cnt1", "cnt2")
.withColumn("percentage1", col("cnt1")/sum("cnt1").over() * 100)
.withColumn("percentage2", col("cnt2")/sum("cnt2").over() * 100)
.withColumn("percentage", abs(col("percentage2") - col("percentage1")))
.withColumn("flag", when(col("percentage") > 10, "F").otherwise("T")).na.drop().show()
Here's the result :
+-------+----+----+------------------+------------------+------------------+----+
| set|cnt1|cnt2| percentage1| percentage2| percentage|flag|
+-------+----+----+------------------+------------------+------------------+----+
|1201804| 12| 9|10.714285714285714| 7.563025210084033| 3.15126050420168| T|
|1201807| 30| 22|26.785714285714285|18.487394957983195| 8.29831932773109| T|
|1201806| 20| 20|17.857142857142858| 16.80672268907563|1.0504201680672267| T|
|1201805| 35| 20| 31.25| 16.80672268907563|14.443277310924369| F|
+-------+----+----+------------------+------------------+------------------+----+
I hope it helps :)