Spark DataFrames Scala - jump to next group during a loop

I have a dataframe as below that records the quarter (15-minute interval) and date at which the same incident occurs, and for which IDs.
I would like to mark an ID and date if the incident happens in two consecutive quarters. This is how I did it:
import scala.collection.mutable.ArrayBuffer

val arrArray = dtf.collect.map(x => (x(0).toString, x(1).toString, x(2).toString.toInt))
if (arrArray.length > 0) {
  var bufBAQDate = ArrayBuffer[Any]()
  for (a <- 1 to arrArray.length - 1) {
    val (strBAQ1, strDate1, douTime1) = arrArray(a - 1)
    val (strBAQ2, strDate2, douTime2) = arrArray(a)
    if (douTime2 - douTime1 == 15 && strBAQ1 == strBAQ2 && strDate1 == strDate2) {
      bufBAQDate = (strBAQ2, strDate2) +: bufBAQDate
      //println(strBAQ2 + " " + strDate2 + " " + douTime2)
    }
  }
  val vecBAQDate = bufBAQDate.distinct.toVector.reverse
}
Is there a better way of doing it? Since the same incident can happen many times to one ID during a single day, it would be better to jump to the next ID and/or date once an ID and date are marked. I don't want to create nested loops to filter the dataframe.

Note that your current solution misses 20210103, as 1400 - 1345 = 55.
I think this does the trick:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType
// plus import spark.implicits._ for the $"..." syntax, if not already in scope

val windowSpec = Window.partitionBy("ID")
  .orderBy("datetime_seconds")

val consecutiveIncidents = dtf.withColumn("raw_datetime", concat($"Date", $"Time"))
  .withColumn("datetime", to_timestamp($"raw_datetime", "yyyyMMddHHmm"))
  .withColumn("datetime_seconds", $"datetime".cast(LongType))
  .withColumn("time_delta", ($"datetime_seconds" - lag($"datetime_seconds", 1).over(windowSpec)) / 60)
  .filter($"time_delta" === lit(15))
  .select("ID", "Date")
  .distinct()
  .collect()
  .map { case Row(id, date) => (id, date) }
  .toList
Basically - convert the datetimes to timestamps, then look for records with the same ID and consecutive times, with their times separated by 15 minutes.
This is done by using lag over a window partitioned by ID and ordered by the time.
In order to calculate the time difference, the timestamp is converted to Unix epoch seconds.
If you don't want to count day-crossing incidents, you can add the date to the partitionBy clause of the window, as in the sketch below.
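A minimal sketch of that variant (same column names as above; windowSpecPerDay is just an illustrative name):

// Hypothetical variant: partitioning by both ID and Date keeps lag from
// comparing the last incident of one date with the first incident of the next.
val windowSpecPerDay = Window.partitionBy("ID", "Date")
  .orderBy("datetime_seconds")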

Related

pyspark how to get the count of records which are not matching with the given date format

I have a csv file that contains (FileName, ColumnName, Rule and RuleDetails) as headers.
As per the rule details, I need to get the count of values in the column (INSTALLDATE) that do not match the RuleDetails date format.
I have to pass ColumnName and RuleDetails dynamically.
I tried the below code:
from pyspark.sql.functions import *

DateFields = []
for rec in df_tabledef.collect():
    if rec["Rule"] == "DATEFORMAT":
        DateFields.append(rec["Columnname"])
        DateFormatValidvalues = [str(x) for x in rec["Ruledetails"].split(",") if x]
        DateFormatString = ",".join([str(elem) for elem in DateFormatValidvalues])
DateColsString = ",".join([str(elem) for elem in DateFields])

output = (
    df_tabledata.select(DateColsString)
    .where(
        DateColsString
        not in (datetime.strptime(DateColsString, DateFormatString), "DateFormatString")
    )
    .count()
)
display(output)
The expected output is the count of records which do not match the given date format.
For example, if 4 out of 10 records are not in (YYYY-MM-DD) format, then the count should be 4.
I get an error message when I run the above code.

How to get the value of the previous row in a Scala Apache RDD[Row]?

I need to get the value of the previous or next row while I'm iterating through an RDD[Row]:
(10,1,string1)
(11,1,string2)
(21,1,string3)
(22,1,string4)
I need to sum (concatenate) the strings for rows where the difference between the 1st values is not higher than 3; the 2nd value is the ID. So the result should be:
(1, string1string2)
(1, string3string4)
I tried using groupBy, reduce and partitioning, but I still can't achieve what I want.
I'm trying to do something like this (I know it's not the proper way):
rows.groupBy(row => {
  row(1)
}).map(rowList => {
  rowList.reduce((acc, next) => {
    diff = next(0) - acc(0)
    if (diff <= 3) {
      val strings = acc(2) + next(2)
      (acc(1), strings)
    } else {
      //create new group to aggregate strings
      (acc(1), acc(2))
    }
  })
})
I wonder whether my idea is the proper way to solve this problem.
Looking for help!
I think you can use sqlContext to solve your problem by using the lag function.
Create RDD:
val rdd = sc.parallelize(List(
  (10, 1, "string1"),
  (11, 1, "string2"),
  (21, 1, "string3"),
  (22, 1, "string4"))
)
Create DataFrame:
val df = rdd.map(rec => (rec._1, rec._2, rec._3)).toDF("a", "b", "c")
Register your DataFrame:
df.registerTempTable("df")
Query the result:
val res = sqlContext.sql("""
  SELECT CASE WHEN l < 3 THEN ROW_NUMBER() OVER (ORDER BY b) - 1
              ELSE ROW_NUMBER() OVER (ORDER BY b)
         END m, b, c
  FROM (
    SELECT b,
           (a - CASE WHEN lag(a, 1) OVER (ORDER BY a) IS NOT NULL
                     THEN lag(a, 1) OVER (ORDER BY a)
                     ELSE 0
                END) l, c
    FROM df) A
""")
Show the Results:
res.show
I hope this will help.
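Note that the query above only labels each row with a group id m; to reach the desired (id, concatenated strings) output, one more aggregation over m is still needed. A rough sketch, assuming the res DataFrame from above (collect_list does not guarantee ordering, and on older Spark versions it may require a HiveContext):

// Hypothetical follow-up step: collapse each group label m into a single row,
// concatenating its strings.
res.registerTempTable("res")
val grouped = sqlContext.sql("""
  SELECT b, concat_ws('', collect_list(c)) AS strings
  FROM res
  GROUP BY m, b
""")
grouped.show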

How to calculate rolling statistics in Scala

I have experience handling DataFrames in Python, but I am facing the problem below while writing Scala.
case class Transaction(
  transactionId: String,
  accountId: String,
  transactionDay: Int,
  category: String,
  transactionAmount: Double)
I created a list like this:
val transactions: List[Transaction] = transactionslines.map { line =>
  val split = line.split(',')
  Transaction(split(0), split(1), split(2).toInt, split(3), split(4).toDouble)
}.toList
Contents of the list:
Transaction(T000942,A38,28,EE,694.54)
Transaction(T000943,A35,28,CC,828.57)
Transaction(T000944,A26,28,DD,290.18)
Transaction(T000945,A17,28,CC,627.62)
Transaction(T000946,A42,28,FF,616.73)
Transaction(T000947,A20,28,FF,86.84)
Transaction(T000948,A14,28,BB,620.32)
Transaction(T000949,A14,28,AA,260.02)
Transaction(T000950,A32,28,AA,600.34)
Can anyone help me calculate, for each account number, statistics over the previous five days of transactions, not including transactions from the day the statistics are being calculated for? For example, on day 10 you should consider only the transactions from days 5 to 9 (this is called a rolling time window of five days). The statistics I need to calculate are:
• The total transaction value of transactions of type “AA” in the previous 5 days per account
• The average transaction value of the previous 5 days of transactions per account
The output ideally should contain one line per day per account id, and each line should contain each of the calculated statistics.
My code for the first 5 days looks like:
val a = transactions.
  filter(trans => trans.transactionDay <= 5).
  groupBy(_.accountId).
  mapValues(trans => (
    trans.map(amount => amount.transactionAmount).sum / trans.map(amount => amount.transactionAmount).size,
    trans.filter(trans => trans.category == "AA").map(amount => amount.transactionAmount).sum
  ))
a.foreach { println }
I would like to know if there is an elegant way to calculate those statistics. Note that transaction days range over [1..29], so ideally I would like code that calculates those rolling statistics up to the 29th day, and not only for the first 5 days.
Thanks a lot!!
Perhaps not 'elegant' but the following will provide the two desired statistics by accountID for a particular day:
def calcStats(ts: List[Transaction], day: Int): Map[String, (Double, Double)] = {
  // all transactions for date range, grouped by accountID
  val atsById = ts
    .filter(trans => trans.transactionDay >= day - 5 && trans.transactionDay < day)
    .groupBy(_.accountId)
  // "AA" transactions summed by account id
  val aaSums = atsById.mapValues(_.filter(_.category == "AA"))
    .mapValues(_.map(_.transactionAmount).sum)
  // sum of all transactions by account id
  val allSums = atsById.mapValues(_.map(_.transactionAmount).sum)
  // count of all transactions by account id
  val allCounts = atsById.mapValues(_.length)
  // output is Map(account id -> (aaSum, allAve))
  allSums.map { case (id, sum) =>
    id -> (aaSums.getOrElse(id, 0.0), sum / allCounts(id))
  }
}
examples (using your provided data):
scala> calcStats(transactions, 30)
res13: Map[String,(Double, Double)] = Map(A32 -> (600.34,600.34), A26 ->
(0.0,290.18), A38 -> (0.0,694.54), A20 -> (0.0,86.84), A17 -> (0.0,627.62),
A42 -> (0.0,616.73), A35 -> (0.0,828.57), A14 -> (260.02,440.17))
scala> calcStats(transactions, 1)
res14: Map[String,(Double, Double)] = Map()
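Since the question asks for the statistics for every day up to the 29th, the same function can simply be evaluated once per day. A small usage sketch, assuming the transactions list from the question:

// One entry per day; each value maps accountId -> (sum of "AA", average of all)
// over the preceding five-day window.
val statsByDay: Map[Int, Map[String, (Double, Double)]] =
  (1 to 29).map(day => day -> calcStats(transactions, day)).toMap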
object ScalaStatistics extends App {
  import scala.io.Source

  // Define a case class Transaction which represents a transaction
  case class Transaction(
    transactionId: String,
    accountId: String,
    transactionDay: Int,
    category: String,
    transactionAmount: Double
  )

  // The full path to the file to import
  val fileName = "C:\\Users\\Downloads\\transactions.txt"

  // The lines of the CSV file (dropping the first to remove the header)
  val transactionslines = Source.fromFile(fileName).getLines().drop(1)

  // Here we split each line up by commas and construct Transactions
  val transactions: List[Transaction] = transactionslines.map { line =>
    val split = line.split(',')
    Transaction(split(0), split(1), split(2).toInt, split(3), split(4).toDouble)
  }.toList

  println(transactions)
/* Data
* transactionId accountId transactionDay category transactionAmount
T0001 A27 1 GG 338.11
T0002 A5 1 BB 677.89
T0003 A32 1 DD 499.86
T0004 A42 1 DD 801.81
T0005 A19 1 BB 14.42
T0006 A46 1 FF 476.88
T0007 A29 1 FF 439.49
T0008 A49 1 DD 848.9
T0009 A47 1 BB 400.42
T00010 A23 1 BB 416.36
T00011 A45 1 GG 662.98
T00012 A2 1 DD 775.37
T00013 A33 1 BB 185.4
T00014 A44 1 CC 397.19
T00015 A43 1 CC 825.05
T00016 A16 1 BB 786.14
T00017 A33 1 DD 352.64
T00018 A14 1 DD 733.77
T00019 A40 1 FF 435.5
T00020 A32 1 EE 284.34
T00021 A25 1 AA 291.76
*
*/
  /* Question 1:
   * Calculate the total transaction value for all transactions for each day.
   * The output should contain one line for each day and each line should include the day and the total value.
   */
  val question1 = transactions
    .groupBy(_.transactionDay)
    .mapValues(_.map(_.transactionAmount).sum)
  println(question1)

  /* Question 2:
   * Calculate the average value of transactions per account for each type of transaction (there are seven in total).
   * The output should contain one line per account, each line should include the account id and the average value
   * for each transaction type (ie 7 fields containing the average values).
   */
  val question2 = transactions
    .groupBy(trans => (trans.accountId, trans.category))
    .mapValues(trans =>
      trans.map(amount => amount.transactionAmount).sum /
        trans.map(amount => amount.transactionAmount).size
    )
  println(question2)
  /* Question 3:
   * For each day, calculate statistics for each account number for the previous five days of transactions,
   * not including transactions from the day statistics are being calculated for. For example, on day 10 you
   * should consider only the transactions from days 5 to 9 (this is called a rolling time window of five days).
   * The statistics we require to be calculated are:
   * - The maximum transaction value in the previous 5 days of transactions per account
   * - The average transaction value of the previous 5 days of transactions per account
   * - The total transaction value of transactions types “AA”, “CC” and “FF” in the previous 5 days per account
   */
  var day = 0

  def day_factory(day: Int) =
    for (i <- 1 until day)
      yield i

  day_factory(30).foreach {
    case (i) => {
      day = i

      val question3_initial = transactions
        .filter(trans =>
          trans.transactionDay >= day - 5 && trans.transactionDay < day
        )
        .groupBy(_.accountId)
        .mapValues(trans => trans.map(day => day -> day.transactionAmount))
        .mapValues(trans =>
          (
            trans.map(t => day).max,
            trans.map(t => t._2).max,
            trans.map(t => t._2).sum / trans.map(t => t._2).size
          )
        )

      val question3_AA = transactions
        .filter(trans =>
          trans.transactionDay >= day - 5 && trans.transactionDay < day
        )
        .groupBy(_.accountId)
        .mapValues(trans =>
          trans
            .filter(trans => trans.category == "AA")
            .map(amount => amount.transactionAmount)
            .sum
        )

      val question3_CC = transactions
        .filter(trans =>
          trans.transactionDay >= day - 5 && trans.transactionDay < day
        )
        .groupBy(_.accountId)
        .mapValues(trans =>
          trans
            .filter(trans => trans.category == "CC")
            .map(amount => amount.transactionAmount)
            .sum
        )

      val question3_FF = transactions
        .filter(trans =>
          trans.transactionDay >= day - 5 && trans.transactionDay < day
        )
        .groupBy(_.accountId)
        .mapValues(trans =>
          trans
            .filter(trans => trans.category == "FF")
            .map(amount => amount.transactionAmount)
            .sum
        )

      val question3_final = question3_initial.map { case (k, v) =>
        (k, v._1) -> (v._2, v._3, question3_AA(k), question3_CC(k), question3_FF(k))
      }
      println(question3_final)
    }
  }
}

Watermarking for Spark structured streaming with three way joins

I have 3 streams of data: foo, bar and baz.
I need to join these streams with a LEFT OUTER JOIN in the following chain: foo -> bar -> baz.
Here's an attempt to mimic these streams with the built-in rate stream:
val rateStream = session.readStream
  .format("rate")
  .option("rowsPerSecond", 5)
  .option("numPartitions", 1)
  .load()

val fooStream = rateStream
  .select(col("value").as("fooId"), col("timestamp").as("fooTime"))

val barStream = rateStream
  .where(rand() < 0.5) // Introduce misses for ease of debugging
  .select(col("value").as("barId"), col("timestamp").as("barTime"))

val bazStream = rateStream
  .where(rand() < 0.5) // Introduce misses for ease of debugging
  .select(col("value").as("bazId"), col("timestamp").as("bazTime"))
Here's a first approach to joining these streams together, under the assumption that the potential delays for foo, bar and baz are small (~5 seconds):
val foobarStream = fooStream
  .withWatermark("fooTime", "5 seconds")
  .join(
    barStream.withWatermark("barTime", "5 seconds"),
    expr("""
      barId = fooId AND
      fooTime >= barTime AND
      fooTime <= barTime + interval 5 seconds
    """),
    joinType = "leftOuter"
  )

val foobarbazQuery = foobarStream
  .join(
    bazStream.withWatermark("bazTime", "5 seconds"),
    expr("""
      bazId = fooId AND
      bazTime >= fooTime AND
      bazTime <= fooTime + interval 5 seconds
    """),
    joinType = "leftOuter")
  .writeStream
  .format("console")
  .start()
With the setup above, I'm able to observe the following tuples of data:
(some_foo, some_bar, some_baz)
(some_foo, some_bar, null)
but I am still missing (some_foo, null, some_baz) and (some_foo, null, null).
Any ideas how to properly configure the watermarks in order to get all combinations?
UPDATE:
After adding an additional watermark to foobarStream, surprisingly on barTime:

val foobarbazQuery = foobarStream
  .withWatermark("barTime", "1 minute")
  .join(/* ... */)

I'm able to get the (some_foo, null, some_baz) combination, but am still missing (some_foo, null, null)...
I'm leaving some information here just for reference.
Chaining stream-stream joins doesn't work correctly, because Spark only supports a global watermark (instead of operator-wise watermarks), which may lead to dropping intermediate outputs between joins.
The Apache Spark community identified this issue and discussed it a while ago. Here's a link for more details:
https://lists.apache.org/thread.html/cc6489a19316e7382661d305fabd8c21915e5faf6a928b4869ac2b4a#%3Cdev.spark.apache.org%3E
(Disclaimer: I'm the author who initiated the mail thread.)
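As an aside (not a fix for the dropped intermediate results), Spark 2.4+ lets you choose how several input watermarks are combined into the single global watermark. This is only a hedged sketch of that setting; "max" advances the global watermark with the fastest input, which can cause late rows to be dropped:

// "min" (the default) takes the slowest input's watermark; "max" takes the fastest.
session.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "max")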

How to filter rows before and after a certain period (date)?

My objective is to select dates before/after a certain period. I have a start period and an end period, and I want to filter rows where close_time falls between the two periods (along with some other filters, like category and origin): start period <= close_time >= end period.
I have tried using:
var StartTime == '2017-03-14'
var EndTime == '2017-03-14'
val df1 = df.withColumn(
"X_Field",
when($"category" === "incident" and $"origin" === "phone" and StartTime <== $"close_time" >== EndTime, 1).otherwise(0)
)
I get errors. What is the right syntax to do this? Thanks!
First - unlike with equality, the right operators to use for greater-or-equal and less-or-equal are >= and <=, not >== and <==.
Second, the expression StartTime <= $"close_time" >= EndTime is not valid - the first part (StartTime <= $"close_time") evaluates to a Boolean condition, which you then try to compare with another String (>= EndTime).
Instead, you can use between:
val df1 = df.withColumn("X_Field", when(
  $"category" === "incident" and
  $"origin" === "phone" and
  ($"close_time" between (StartTime, EndTime)), 1).otherwise(0)
)
Which is simply shorthand for:
val df1 = df.withColumn("X_Field", when(
  $"category" === "incident" and
  $"origin" === "phone" and
  ($"close_time" >= StartTime and $"close_time" <= EndTime), 1).otherwise(0)
)
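As a side note, the variable definitions in the question also wouldn't compile: Scala assignment uses = (not ==) and string literals use double quotes. A minimal sketch keeping the question's names and values:

// Plain String values; val is enough since they are never reassigned.
val StartTime = "2017-03-14"
val EndTime = "2017-03-14"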