How to calculate rolling statistics in Scala

I have experience handling DataFrames in Python, but I am running into the problem below while writing the same logic in Scala.
case class Transaction(
    transactionId: String,
    accountId: String,
    transactionDay: Int,
    category: String,
    transactionAmount: Double)
I created a list like this:
val transactions: List[Transaction] = transactionslines.map { line =>
  val split = line.split(',')
  Transaction(split(0), split(1), split(2).toInt, split(3), split(4).toDouble)
}.toList
Contents of the list:
Transaction(T000942,A38,28,EE,694.54)
Transaction(T000943,A35,28,CC,828.57)
Transaction(T000944,A26,28,DD,290.18)
Transaction(T000945,A17,28,CC,627.62)
Transaction(T000946,A42,28,FF,616.73)
Transaction(T000947,A20,28,FF,86.84)
Transaction(T000948,A14,28,BB,620.32)
Transaction(T000949,A14,28,AA,260.02)
Transaction(T000950,A32,28,AA,600.34)
Can anyone help me calculate statistics for each account for the previous five days of transactions, not including transactions from the day the statistics are being calculated for? For example, on day 10 you should consider only the transactions from days 5 to 9 (this is called a rolling time window of five days). The statistics I need to calculate are:
•The total transaction value of transactions type “AA” in the previous 5 days per account
•The average transaction value of the previous 5 days of transactions per account
The output should ideally contain one line per day per account id, and each line should contain each of the calculated statistics.
My code for the first 5 days looks like:
val a = transactions
  .filter(trans => trans.transactionDay <= 5)
  .groupBy(_.accountId)
  .mapValues { trans =>
    val amounts = trans.map(_.transactionAmount)
    (amounts.sum / amounts.size,
     trans.filter(_.category == "AA").map(_.transactionAmount).sum)
  }
a.foreach(println)
I would like to know if there is an elegant way to calculate those statistics. Note that transaction days range over [1..29], so ideally the code should calculate those rolling statistics up to the 29th day and not only for the first 5 days.
Thanks a lot!!

Perhaps not 'elegant', but the following will provide the two desired statistics by accountId for a particular day:
def calcStats(ts: List[Transaction], day: Int): Map[String, (Double, Double)] = {
  // all transactions in the five-day window before `day`, grouped by accountId
  val atsById = ts
    .filter(trans => trans.transactionDay >= day - 5 && trans.transactionDay < day)
    .groupBy(_.accountId)
  // "AA" transactions summed by account id
  val aaSums = atsById.mapValues(_.filter(_.category == "AA"))
    .mapValues(_.map(_.transactionAmount).sum)
  // sum of all transactions by account id
  val allSums = atsById.mapValues(_.map(_.transactionAmount).sum)
  // count of all transactions by account id
  val allCounts = atsById.mapValues(_.length)
  // output is Map(account id -> (aaSum, allAve))
  allSums.map { case (id, sum) =>
    id -> (aaSums.getOrElse(id, 0.0), sum / allCounts(id))
  }
}
Examples (using your provided data):
scala> calcStats(transactions, 30)
res13: Map[String,(Double, Double)] = Map(A32 -> (600.34,600.34), A26 ->
(0.0,290.18), A38 -> (0.0,694.54), A20 -> (0.0,86.84), A17 -> (0.0,627.62),
A42 -> (0.0,616.73), A35 -> (0.0,828.57), A14 -> (260.02,440.17))
scala> calcStats(transactions, 1)
res14: Map[String,(Double, Double)] = Map()
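To cover the whole day range from the question, calcStats can simply be called once per day. A minimal sketch (days 1 to 29, as stated in the question):
val statsByDay: Map[Int, Map[String, (Double, Double)]] =
  (1 to 29).map(day => day -> calcStats(transactions, day)).toMap

// one line per day per account id, matching the desired output format
statsByDay.toSeq.sortBy(_._1).foreach { case (day, stats) =>
  stats.foreach { case (id, (aaSum, avg)) =>
    println(s"day=$day account=$id aaTotal=$aaSum average=$avg")
  }
}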

object ScalaStatistics extends App {
  import scala.io.Source
  // Define a case class Transaction which represents a transaction
  case class Transaction(
      transactionId: String,
      accountId: String,
      transactionDay: Int,
      category: String,
      transactionAmount: Double
  )
  // The full path to the file to import
  val fileName = "C:\\Users\\Downloads\\transactions.txt"
  // The lines of the CSV file (dropping the first to remove the header)
  val transactionslines = Source.fromFile(fileName).getLines().drop(1)
  // Here we split each line up by commas and construct Transactions
  val transactions: List[Transaction] = transactionslines.map { line =>
    val split = line.split(',')
    Transaction(split(0), split(1), split(2).toInt, split(3), split(4).toDouble)
  }.toList
  println(transactions)
/* Data
* transactionId accountId transactionDay category transactionAmount
T0001 A27 1 GG 338.11
T0002 A5 1 BB 677.89
T0003 A32 1 DD 499.86
T0004 A42 1 DD 801.81
T0005 A19 1 BB 14.42
T0006 A46 1 FF 476.88
T0007 A29 1 FF 439.49
T0008 A49 1 DD 848.9
T0009 A47 1 BB 400.42
T00010 A23 1 BB 416.36
T00011 A45 1 GG 662.98
T00012 A2 1 DD 775.37
T00013 A33 1 BB 185.4
T00014 A44 1 CC 397.19
T00015 A43 1 CC 825.05
T00016 A16 1 BB 786.14
T00017 A33 1 DD 352.64
T00018 A14 1 DD 733.77
T00019 A40 1 FF 435.5
T00020 A32 1 EE 284.34
T00021 A25 1 AA 291.76
*
*/
  /* Question 1
   * Calculate the total transaction value for all transactions for each day.
   * The output should contain one line for each day and each line should include the day and the total value.
   */
  val question1 = transactions
    .groupBy(_.transactionDay)
    .mapValues(_.map(_.transactionAmount).sum)
  println(question1)
  /* Question 2
   * Calculate the average value of transactions per account for each type of transaction (there are seven in total).
   * The output should contain one line per account; each line should include the account id and the average value
   * for each transaction type (i.e. 7 fields containing the average values).
   */
  val question2 = transactions
    .groupBy(trans => (trans.accountId, trans.category))
    .mapValues { trans =>
      val amounts = trans.map(_.transactionAmount)
      amounts.sum / amounts.size
    }
  println(question2)
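  // Note: question2 has one entry per (account, category) pair. To get one line
  // per account with all seven category averages, as the question asks, the map
  // can be regrouped. A sketch, assuming the seven types are "AA" through "GG":
  val categories = List("AA", "BB", "CC", "DD", "EE", "FF", "GG")
  val question2PerAccount = question2
    .groupBy { case ((accountId, _), _) => accountId }
    .mapValues { entries =>
      val byCategory = entries.map { case ((_, category), avg) => category -> avg }
      categories.map(c => byCategory.getOrElse(c, 0.0))
    }
  println(question2PerAccount)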
  /* Question 3
   * For each day, calculate statistics for each account number for the previous five days of transactions,
   * not including transactions from the day statistics are being calculated for. For example, on day 10 you
   * should consider only the transactions from days 5 to 9 (this is called a rolling time window of five days).
   * The statistics we require to be calculated are:
   * - The maximum transaction value in the previous 5 days of transactions per account
   * - The average transaction value of the previous 5 days of transactions per account
   * - The total transaction value of transactions types "AA", "CC" and "FF" in the previous 5 days per account
   */
  // Sum of the transactions of one category in a window, per account id
  def categorySums(window: List[Transaction], category: String): Map[String, Double] =
    window
      .filter(_.category == category)
      .groupBy(_.accountId)
      .mapValues(_.map(_.transactionAmount).sum)

  (1 to 29).foreach { day =>
    // transactions in the rolling five-day window [day - 5, day)
    val window = transactions.filter(trans =>
      trans.transactionDay >= day - 5 && trans.transactionDay < day
    )
    val aaSums = categorySums(window, "AA")
    val ccSums = categorySums(window, "CC")
    val ffSums = categorySums(window, "FF")
    // (maximum, average) of all transactions in the window, per account id
    val maxAndAvg = window
      .groupBy(_.accountId)
      .mapValues { trans =>
        val amounts = trans.map(_.transactionAmount)
        (amounts.max, amounts.sum / amounts.size)
      }
    // output is Map((account id, day) -> (max, average, aaSum, ccSum, ffSum))
    val question3_final = maxAndAvg.map { case (id, (max, avg)) =>
      (id, day) -> (max, avg,
        aaSums.getOrElse(id, 0.0),
        ccSums.getOrElse(id, 0.0),
        ffSums.getOrElse(id, 0.0))
    }
    println(question3_final)
  }
}
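One version note: on Scala 2.13+, Map.mapValues is deprecated and returns a lazy MapView, so the groupBy/mapValues chains above would recompute their values on every access unless materialized. A 2.13-style variant of question1, going through .view and materializing with .toMap:
val question1Strict = transactions
  .groupBy(_.transactionDay)
  .view
  .mapValues(_.map(_.transactionAmount).sum)
  .toMap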

Related

Entity Framework equivalent of datepart(qq, getdate)

I have the following SQL query, which I want to translate to an EF query:
Select Year(_r.IssueDate) as Year, datepart(qq, _r.IssueDate) as Quarter, count(*) as TotalBillableRecords, SUM(_r.BillableHours) as TotalHours
FROM
(select _b.Name, _i.IssueDate, _ie.BillableHours from Beneficiaries _b
join [Beneficiaries.Invoices] _i on _b.Id = _i.BeneficiaryId
join [Beneficiaries.Invoices.Entries] _ie on _ie.InvoiceId = _i.Id
where BeneficiaryId = 1) as _r
group by Year(_r.IssueDate), datepart(qq, _r.IssueDate)
This results in the output I want, and it is what I would also like to get from my LINQ expression.
How can I translate this to an EF query? I've tried it like so:
var query =
    from beneficiary in _dbContext.Beneficiaries
    where beneficiary.Id == beneficiaryId
    from invoice in beneficiary.Invoices
    //where invoice.IssueDate >= since
    //where invoice.IssueDate.Month > notBefore && invoice.IssueDate.Month <= notAfter
    from invoiceEntry in invoice.InvoiceEntries
    group new
    {
        beneficiary,
        invoiceEntry,
    }
    by new
    {
        Year = beneficiary.InvoiceMeta.IssueDate.Year,
        Quarter = (beneficiary.InvoiceMeta.IssueDate.Month - 1) / 3 + 1,
    }
    into @group
    select new
    {
        Year = @group.Key.Year,
        Quarter = @group.Key.Quarter,
        Hours = @group.Sum(x => x.invoiceEntry.BillableHours),
    };
var y = query.ToList();
var y = query.ToList();
But the result is the following:
[0] = Quarter = 3, Year = 2022, Hours = 6729.0
And that's it, only one entry.
What I'm noticing is that it only takes the last quarter (composed of 3 months), presumably because the last quarter is the one that is not finished yet.
The class hierarchy is
Beneficiary 1-* Invoice 1-* Entries

Spark DataFrames Scala - jump to next group during a loop

I have a dataframe as below; it records the quarter and date on which the same incident occurred, and for which IDs.
I would like to mark an ID and date if the incident happens in two consecutive quarters. This is how I did it:
import scala.collection.mutable.ArrayBuffer

val arrArray = dtf.collect.map(x => (x(0).toString, x(1).toString, x(2).toString.toInt))
if (arrArray.length > 0) {
  var bufBAQDate = ArrayBuffer[Any]()
  for (a <- 1 to arrArray.length - 1) {
    val (strBAQ1, strDate1, douTime1) = arrArray(a - 1)
    val (strBAQ2, strDate2, douTime2) = arrArray(a)
    if (douTime2 - douTime1 == 15 && strBAQ1 == strBAQ2 && strDate1 == strDate2) {
      bufBAQDate = (strBAQ2, strDate2) +: bufBAQDate
      //println(strBAQ2 + " " + strDate2 + " " + douTime2)
    }
  }
  val vecBAQDate = bufBAQDate.distinct.toVector.reverse
}
Is there a better way of doing it? As the same incident can happen many times to one ID during a single day, it would be better to jump to the next ID and/or date once an ID and a date are marked. I don't want to create nested loops to filter the dataframe.
Note that your current solution misses 20210103: on the raw HHMM integers, 1400 - 1345 = 55, not 15.
I think this does the trick:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{concat, lag, lit, to_timestamp}
import org.apache.spark.sql.types.LongType
// assumes spark.implicits._ is in scope for the $"..." column syntax

val windowSpec = Window.partitionBy("ID")
  .orderBy("datetime_seconds")
val consecutiveIncidents = dtf.withColumn("raw_datetime", concat($"Date", $"Time"))
.withColumn("datetime", to_timestamp($"raw_datetime", "yyyyMMddHHmm"))
.withColumn("datetime_seconds", $"datetime".cast(LongType))
.withColumn("time_delta", ($"datetime_seconds" - lag($"datetime_seconds", 1).over(windowSpec)) / 60)
.filter($"time_delta" === lit(15))
.select("ID", "Date")
.distinct()
.collect()
.map { case Row(id, date) => (id, date) }
.toList
Basically: convert the datetimes to timestamps, then look for records with the same ID and consecutive times separated by 15 minutes.
This is done by using lag over a window partitioned by ID and ordered by the time.
In order to calculate the time difference, the timestamp is converted to Unix epoch seconds.
If you don't want to count day-crossing incidents, you can add the date to the partitionBy clause of the window, as sketched below.
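A minimal sketch of that variant, assuming the same column names as above:
val dayWindowSpec = Window.partitionBy("ID", "Date")
  .orderBy("datetime_seconds")
With this window, lag never reaches back across a date boundary, so incidents spanning midnight are not matched.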

How to get value of previous row in scala apache rdd[row]?

I need to get the value of the previous or next row while I'm iterating through an RDD[Row]:
(10,1,string1)
(11,1,string2)
(21,1,string3)
(22,1,string4)
I need to concatenate the strings of rows where the difference between the 1st values is not higher than 3. The 2nd value is an ID. So the result should be:
(1, string1string2)
(1, string3string4)
I tried using groupBy, reduce, and partitioning, but I still can't achieve what I want.
I'm trying to make something like this (I know it's not the proper way):
rows.groupBy(row => {
  row(1)
}).map(rowList => {
  rowList.reduce((acc, next) => {
    diff = next(0) - acc(0)
    if (diff <= 3) {
      val strings = acc(2) + next(2)
      (acc(1), strings)
    } else {
      //create new group to aggregate strings
      (acc(1), acc(2))
    }
  })
})
I wonder if my idea is a proper way to solve this problem.
Looking for help!
I think you can solve your problem with sqlContext, using the lag function.
Create RDD:
val rdd = sc.parallelize(List(
(10, 1, "string1"),
(11, 1, "string2"),
(21, 1, "string3"),
(22, 1, "string4"))
)
Create DataFrame:
val df = rdd.map(rec => (rec._1.toInt, rec._2.toInt, rec._3)).toDF("a", "b", "c")
Register your Dataframe:
df.registerTempTable("df")
Query the result:
val res = sqlContext.sql("""
SELECT CASE WHEN l < 3 THEN ROW_NUMBER() OVER (ORDER BY b) - 1
ELSE ROW_NUMBER() OVER (ORDER BY b)
END m, b, c
FROM (
SELECT b,
(a - CASE WHEN lag(a, 1) OVER (ORDER BY a) is not null
THEN lag(a, 1) OVER (ORDER BY a)
ELSE 0
END) l, c
FROM df) A
""")
Show the Results:
res.show
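The query above only assigns the group number m; to produce the final (id, concatenated strings) pairs from the question, one more aggregation over m is needed. A sketch, assuming a Spark version where collect_list and concat_ws are available in SQL (older 1.x versions need a HiveContext for collect_list):
res.registerTempTable("grouped")
val strings = sqlContext.sql("""
  SELECT b, concat_ws('', collect_list(c)) strings
  FROM grouped
  GROUP BY m, b
""")
// note: collect_list does not guarantee element order
strings.show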
I hope this helps.

LINQ Query Order by Month name non-alphabetically

How do I order this query by month, not alphabetically? I have two columns in my db: one for the month name and one for the month number.
Right now it orders the months alphabetically.
var querythpmsick = (from r in db.SickLeaveRequestForms
                     where r.EmployeeID == id
                     group r by r.MonthOfHoliday into g
                     select new { Value = g.Key, Count1 = g.Sum(h => h.SickLeaveTaken) }
                    ).OrderBy(e => e.Value);
If you have both the month name and the month number in your DB, then just group by the number instead of the month name:
var querythpmsick = (from r in db.SickLeaveRequestForms
                     where r.EmployeeID == 2
                     group r by r.MonthOfHolidayInt into g
                     select new
                     {
                         Value = g.Key,
                         Count1 = g.Sum(h => h.SickLeaveTaken),
                         MonthName = g.Select(d => d.MonthOfHoliday)
                     })
                    .OrderBy(e => e.Value);
Here is a DotNetFiddle with the example: https://dotnetfiddle.net/6vE4lo

Spark DataFrame - How to partition the data based on condition

I have an employee data set. I need to partition the data based on employee salary, using some condition. I created a DataFrame and converted it to a custom DataFrame object, then created a custom partitioner for salary.
class SalaryPartition(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    import com.csc.emp.spark.tutorial.PartitonObj._
    key.asInstanceOf[Emp].EMPLOYEE_ID match {
      case salary if salary < 10000 => 1
      case salary if salary >= 10001 && salary < 20000 => 2
      case _ => 3
    }
  }
}
Question: how can I invoke/call my custom partitioner? I couldn't find partitionBy on DataFrame. Is there an alternative way?
Just code for my comment:
val empDS = List(Emp(5, 1000), Emp(4, 15000), Emp(3, 30000), Emp(2, 2000)).toDS()
println(s"Original partitions number: ${empDS.rdd.partitions.size}")
println("-- Original partition: data --")
empDS.rdd.mapPartitionsWithIndex((index, it) => {
it.foreach(r => println(s"Partition $index: $r")); it
}).count()
val getSalaryGrade = (salary: Int) => salary match {
case salary if salary < 10000 => 1
case salary if salary >= 10001 && salary < 20000 => 2
case _ => 3
}
val getSalaryGradeUDF = udf(getSalaryGrade)
val salaryGraded = empDS.withColumn("salaryGrade", getSalaryGradeUDF($"salary"))
val repartitioned = salaryGraded.repartition($"salaryGrade")
println()
println(s"Partitions number after: ${repartitioned.rdd.partitions.size}")
println("-- Reparitioned partition: data --")
repartitioned.as[Emp].rdd.mapPartitionsWithIndex((index, it) => {
it.foreach(r => println(s"Partition $index: $r")); it
}).count()
Output is:
Original partitions number: 2
-- Original partition: data --
Partition 1: Emp(3,30000)
Partition 0: Emp(5,1000)
Partition 1: Emp(2,2000)
Partition 0: Emp(4,15000)
Partitions number after: 5
-- Reparitioned partition: data --
Partition 1: Emp(3,30000)
Partition 3: Emp(5,1000)
Partition 3: Emp(2,2000)
Partition 4: Emp(4,15000)
Note (a guess): several partitions are possible with the same "salaryGrade".
Advice: "groupBy" or similar looks like a more reliable solution.
To stay with Dataset entities, "groupByKey" can be used:
empDS.groupByKey(x => getSalaryGrade(x.salary)).mapGroups((index, it) => {
it.foreach(r => println(s"Group $index: $r")); index
}).count()
Output:
Group 1: Emp(5,1000)
Group 3: Emp(3,30000)
Group 1: Emp(2,2000)
Group 2: Emp(4,15000)