Spark DataFrame - How to partition the data based on a condition - Scala

I have an employee data set that I need to partition by employee salary based on a condition. I created a DataFrame and converted it to a custom DataFrame object, then created a custom partitioner for salary.
class SalaryPartition(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    import com.csc.emp.spark.tutorial.PartitonObj._
    key.asInstanceOf[Emp].EMPLOYEE_ID match {
      case salary if salary < 10000 => 1
      case salary if salary >= 10001 && salary < 20000 => 2
      case _ => 3
    }
  }
}
Question: how can I invoke/call my custom partitioner? I couldn't find partitionBy on DataFrame. Is there an alternative way?

Just code for my comment:
import org.apache.spark.sql.functions.udf
import spark.implicits._ // assuming a SparkSession named spark

val empDS = List(Emp(5, 1000), Emp(4, 15000), Emp(3, 30000), Emp(2, 2000)).toDS()

println(s"Original partitions number: ${empDS.rdd.partitions.size}")
println("-- Original partition: data --")
empDS.rdd.mapPartitionsWithIndex((index, it) => {
  it.foreach(r => println(s"Partition $index: $r")); it
}).count()

val getSalaryGrade = (salary: Int) => salary match {
  case salary if salary < 10000 => 1
  case salary if salary >= 10001 && salary < 20000 => 2
  case _ => 3
}
val getSalaryGradeUDF = udf(getSalaryGrade)

val salaryGraded = empDS.withColumn("salaryGrade", getSalaryGradeUDF($"salary"))
val repartitioned = salaryGraded.repartition($"salaryGrade")

println
println(s"Partitions number after: ${repartitioned.rdd.partitions.size}")
println("-- Repartitioned partition: data --")
repartitioned.as[Emp].rdd.mapPartitionsWithIndex((index, it) => {
  it.foreach(r => println(s"Partition $index: $r")); it
}).count()
Output is:
Original partitions number: 2
-- Original partition: data --
Partition 1: Emp(3,30000)
Partition 0: Emp(5,1000)
Partition 1: Emp(2,2000)
Partition 0: Emp(4,15000)
Partitions number after: 5
-- Repartitioned partition: data --
Partition 1: Emp(3,30000)
Partition 3: Emp(5,1000)
Partition 3: Emp(2,2000)
Partition 4: Emp(4,15000)
Note: there is no guarantee of a one-to-one mapping between "salaryGrade" values and partitions, since repartitioning by an expression uses hash partitioning.
Advice: "groupBy" or something similar looks like a more reliable solution.
To stay with Dataset entities, "groupByKey" can be used:
empDS.groupByKey(x => getSalaryGrade(x.salary)).mapGroups((index, it) => {
  it.foreach(r => println(s"Group $index: $r")); index
}).count()
Output:
Group 1: Emp(5,1000)
Group 3: Emp(3,30000)
Group 1: Emp(2,2000)
Group 2: Emp(4,15000)
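For completeness: a custom Partitioner like the one in the question can only be invoked through the pair-RDD API (there is no partitionBy(partitioner) on DataFrames/Datasets), and getPartition must return an index in the range [0, numPartitions). A minimal sketch, assuming the Emp(id, salary) case class used above; SalaryRangePartitioner is a hypothetical 0-based variant keyed on the salary value itself:
import org.apache.spark.Partitioner

// Hypothetical 0-based variant of the partitioner, keyed on the salary value
class SalaryRangePartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key.asInstanceOf[Int] match {
    case salary if salary < 10000 => 0
    case salary if salary < 20000 => 1
    case _                        => 2
  }
}

// partitionBy exists only on pair RDDs, so key the records first
val partitionedBySalary = empDS.rdd
  .keyBy(_.salary)                            // RDD[(Int, Emp)]
  .partitionBy(new SalaryRangePartitioner(3)) // apply the custom partitioner
  .values                                     // back to RDD[Emp], partitioned by salary range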

Related

Spark DataFrames Scala - jump to next group during a loop

I have a dataframe as below that records the quarter and date during which the same incident occurs for various IDs.
I would like to mark an ID and date if the incident happens in two consecutive quarters. This is how I did it:
val arrArray = dtf.collect.map(x => (x(0).toString, x(1).toString, x(2).toString.toInt))
if (arrArray.length > 0) {
  var bufBAQDate = ArrayBuffer[Any]()
  for (a <- 1 to arrArray.length - 1) {
    val (strBAQ1, strDate1, douTime1) = arrArray(a - 1)
    val (strBAQ2, strDate2, douTime2) = arrArray(a)
    if (douTime2 - douTime1 == 15 && strBAQ1 == strBAQ2 && strDate1 == strDate2) {
      bufBAQDate = (strBAQ2, strDate2) +: bufBAQDate
      //println(strBAQ2+" "+strDate2+" "+douTime2)
    }
  }
  val vecBAQDate = bufBAQDate.distinct.toVector.reverse
}
Is there a better way of doing it? Since the same incident can happen many times to one ID during a single day, it would be better to jump to the next ID and/or date once an ID and date have been marked. I don't want to create nested loops to filter the dataframe.
Note that your current solution misses 20210103, as 1400 - 1345 = 55.
I think this does the trick
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType
import spark.implicits._ // for the $ column syntax

val windowSpec = Window.partitionBy("ID")
  .orderBy("datetime_seconds")
val consecutiveIncidents = dtf.withColumn("raw_datetime", concat($"Date", $"Time"))
  .withColumn("datetime", to_timestamp($"raw_datetime", "yyyyMMddHHmm"))
  .withColumn("datetime_seconds", $"datetime".cast(LongType))
  .withColumn("time_delta", ($"datetime_seconds" - lag($"datetime_seconds", 1).over(windowSpec)) / 60)
  .filter($"time_delta" === lit(15))
  .select("ID", "Date")
  .distinct()
  .collect()
  .map { case Row(id, date) => (id, date) }
  .toList
Basically: convert the datetimes to timestamps, then look for records with the same ID and consecutive times separated by 15 minutes.
This is done by using lag over a window partitioned by ID and ordered by the time.
In order to calculate the time difference, the timestamp is converted to Unix epoch seconds.
If you don't want to count day-crossing incidents, you can add the date to the partitionBy clause of the window, as sketched below.
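For example, the day-scoped variant of the window might look like this (assuming the same column names):
// Hypothetical variant: lag only compares rows that share both ID and Date
val windowSpecPerDay = Window.partitionBy("ID", "Date").orderBy("datetime_seconds")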

How to get the value of the previous row in a Scala Apache Spark RDD[Row]?

I need to get the value from the previous or next row while I'm iterating through an RDD[Row]:
(10,1,string1)
(11,1,string2)
(21,1,string3)
(22,1,string4)
I need to concatenate the strings of rows where the difference between the first values is not higher than 3. The second value is the ID. So the result should be:
(1, string1string2)
(1, string3string4)
I tried using groupBy, reduce, and partitioning, but I still can't achieve what I want.
I'm trying to do something like this (I know it's not the proper way):
rows.groupBy(row => {
  row(1)
}).map(rowList => {
  rowList.reduce((acc, next) => {
    diff = next(0) - acc(0)
    if (diff <= 3) {
      val strings = acc(2) + next(2)
      (acc(1), strings)
    } else {
      // create new group to aggregate strings
      (acc(1), acc(2))
    }
  })
})
I wonder if my idea is the right way to solve this problem. Looking for help!
I think you can use sqlContext to solve your problem by using the lag function.
Create RDD:
val rdd = sc.parallelize(List(
  (10, 1, "string1"),
  (11, 1, "string2"),
  (21, 1, "string3"),
  (22, 1, "string4"))
)
Create DataFrame:
val df = rdd.map(rec => (rec._1.toInt, rec._2.toInt, rec._3)).toDF("a", "b", "c")
Register your DataFrame:
df.registerTempTable("df")
Query the result:
val res = sqlContext.sql("""
  SELECT CASE WHEN l < 3 THEN ROW_NUMBER() OVER (ORDER BY b) - 1
              ELSE ROW_NUMBER() OVER (ORDER BY b)
         END m, b, c
  FROM (SELECT b,
               (a - CASE WHEN lag(a, 1) OVER (ORDER BY a) IS NOT NULL
                         THEN lag(a, 1) OVER (ORDER BY a)
                         ELSE 0
                    END) l, c
        FROM df) A
""")
Show the results:
res.show
I hope this will help.
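As a side note, the same grouping can also be expressed with the DataFrame API: use lag to detect gaps larger than 3 and a running sum to assign a group id, then aggregate the strings per group. A sketch, assuming the df("a", "b", "c") frame created above (note that collect_list ordering after a shuffle is not strictly guaranteed):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("b").orderBy("a")

val grouped = df
  .withColumn("prev_a", lag(col("a"), 1).over(w))
  // start a new group whenever there is no previous row or the gap exceeds 3
  .withColumn("new_group", when(col("prev_a").isNull || col("a") - col("prev_a") > 3, 1).otherwise(0))
  .withColumn("group_id", sum(col("new_group")).over(w.rowsBetween(Window.unboundedPreceding, Window.currentRow)))
  .groupBy(col("b"), col("group_id"))
  .agg(concat_ws("", collect_list(col("c"))).as("strings"))
  .select("b", "strings")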

How to calculate rolling statistics in Scala

I have experience handling DataFrames in Python, but I am facing the problem below while writing in Scala.
case class Transaction(
  transactionId: String,
  accountId: String,
  transactionDay: Int,
  category: String,
  transactionAmount: Double)
I created a list like this:
val transactions: List[Transaction] = transactionslines.map { line =>
  val split = line.split(',')
  Transaction(split(0), split(1), split(2).toInt, split(3), split(4).toDouble)
}.toList
Contents of the list:
Transaction(T000942,A38,28,EE,694.54)
Transaction(T000943,A35,28,CC,828.57)
Transaction(T000944,A26,28,DD,290.18)
Transaction(T000945,A17,28,CC,627.62)
Transaction(T000946,A42,28,FF,616.73)
Transaction(T000947,A20,28,FF,86.84)
Transaction(T000948,A14,28,BB,620.32)
Transaction(T000949,A14,28,AA,260.02)
Transaction(T000950,A32,28,AA,600.34)
Can anyone help me calculate statistics for each account number over the previous five days of transactions, not including transactions from the day the statistics are being calculated for? For example, on day 10 you should consider only the transactions from days 5 to 9 (this is called a rolling time window of five days). The statistics I need to calculate are:
• The total transaction value of transactions of type “AA” in the previous 5 days per account
• The average transaction value of the previous 5 days of transactions per account
The output should ideally contain one line per day per account id, and each line should contain each of the calculated statistics. My code for the first 5 days looks like this:
val a = transactions
  .filter(trans => trans.transactionDay <= 5)
  .groupBy(_.accountId)
  .mapValues(trans => (
    trans.map(amount => amount.transactionAmount).sum / trans.map(amount => amount.transactionAmount).size,
    trans.filter(trans => trans.category == "AA").map(amount => amount.transactionAmount).sum
  ))
a.foreach { println }
I would like to know if there is an elegant way to calculate those statistics. Note that transaction days range over [1..29], so ideally I would like code that calculates those rolling statistics up to the 29th day and not only for the first 5 days.
Thanks a lot!
Perhaps not 'elegant' but the following will provide the two desired statistics by accountID for a particular day:
def calcStats(ts: List[Transaction], day: Int): Map[String, (Double, Double)] = {
  // all transactions for the date range, grouped by accountId
  val atsById = ts
    .filter(trans => trans.transactionDay >= day - 5 && trans.transactionDay < day)
    .groupBy(_.accountId)
  // "AA" transactions summed by account id
  val aaSums = atsById.mapValues(_.filter(_.category == "AA"))
    .mapValues(_.map(_.transactionAmount).sum)
  // sum of all transactions by account id
  val allSums = atsById.mapValues(_.map(_.transactionAmount).sum)
  // count of all transactions by account id
  val allCounts = atsById.mapValues(_.length)
  // output is Map(account id -> (aaSum, allAve))
  allSums.map { case (id, sum) =>
    id -> (aaSums.getOrElse(id, 0.0), sum / allCounts(id))
  }
}
examples (using your provided data):
scala> calcStats(transactions, 30)
res13: Map[String,(Double, Double)] = Map(A32 -> (600.34,600.34), A26 ->
(0.0,290.18), A38 -> (0.0,694.54), A20 -> (0.0,86.84), A17 -> (0.0,627.62),
A42 -> (0.0,616.73), A35 -> (0.0,828.57), A14 -> (260.02,440.17))
scala> calcStats(transactions, 1)
res14: Map[String,(Double, Double)] = Map()
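To produce one line per day rather than a single day, calcStats can simply be applied for each day in the range; a minimal sketch, assuming transaction days run from 1 to 29 as in the question:
// One entry per day: day -> (accountId -> (aaSum, average)) over the preceding 5-day window
val rollingStats: Map[Int, Map[String, (Double, Double)]] =
  (1 to 29).map(day => day -> calcStats(transactions, day)).toMap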
object ScalaStatistcs extends App {
  import scala.io.Source

  // Define a case class Transaction which represents a transaction
  case class Transaction(
    transactionId: String,
    accountId: String,
    transactionDay: Int,
    category: String,
    transactionAmount: Double
  )

  // The full path to the file to import
  val fileName = "C:\\Users\\Downloads\\transactions.txt"

  // The lines of the CSV file (dropping the first to remove the header)
  val transactionslines = Source.fromFile(fileName).getLines().drop(1)

  // Here we split each line up by commas and construct Transactions
  val transactions: List[Transaction] = transactionslines.map { line =>
    val split = line.split(',')
    Transaction(split(0), split(1), split(2).toInt, split(3), split(4).toDouble)
  }.toList

  println(transactions)
/* Data
* transactionId accountId transactionDay category transactionAmount
T0001 A27 1 GG 338.11
T0002 A5 1 BB 677.89
T0003 A32 1 DD 499.86
T0004 A42 1 DD 801.81
T0005 A19 1 BB 14.42
T0006 A46 1 FF 476.88
T0007 A29 1 FF 439.49
T0008 A49 1 DD 848.9
T0009 A47 1 BB 400.42
T00010 A23 1 BB 416.36
T00011 A45 1 GG 662.98
T00012 A2 1 DD 775.37
T00013 A33 1 BB 185.4
T00014 A44 1 CC 397.19
T00015 A43 1 CC 825.05
T00016 A16 1 BB 786.14
T00017 A33 1 DD 352.64
T00018 A14 1 DD 733.77
T00019 A40 1 FF 435.5
T00020 A32 1 EE 284.34
T00021 A25 1 AA 291.76
*
*/
  /* Question 1:
   * Calculate the total transaction value for all transactions for each day.
   * The output should contain one line for each day and each line should include the day and the total value.
   */
  val question1 = transactions
    .groupBy(_.transactionDay)
    .mapValues(_.map(_.transactionAmount).sum)
  println(question1)

  /* Question 2:
   * Calculate the average value of transactions per account for each type of transaction (there are seven in total).
   * The output should contain one line per account, each line should include the account id and the average value
   * for each transaction type (ie 7 fields containing the average values).
   */
  val question2 = transactions
    .groupBy(trans => (trans.accountId, trans.category))
    .mapValues(trans =>
      trans.map(amount => amount.transactionAmount).sum / trans
        .map(amount => amount.transactionAmount)
        .size
    )
  println(question2)

  /* Question 3:
   * For each day, calculate statistics for each account number for the previous five days of transactions,
   * not including transactions from the day statistics are being calculated for. For example, on day 10 you
   * should consider only the transactions from days 5 to 9 (this is called a rolling time window of five days).
   * The statistics we require to be calculated are:
   * - The maximum transaction value in the previous 5 days of transactions per account
   * - The average transaction value of the previous 5 days of transactions per account
   * - The total transaction value of transactions types "AA", "CC" and "FF" in the previous 5 days per account
   */
  var day = 0

  def day_factory(day: Int) =
    for (i <- 1 until day)
      yield i

  day_factory(30).foreach { case (i) =>
    day = i

    val question3_initial = transactions
      .filter(trans =>
        trans.transactionDay >= day - 5 && trans.transactionDay < day
      )
      .groupBy(_.accountId)
      .mapValues(trans => trans.map(day => day -> day.transactionAmount))
      .mapValues(trans =>
        (
          trans.map(t => day).max,
          trans.map(t => t._2).max,
          trans.map(t => t._2).sum / trans.map(t => t._2).size
        )
      )

    val question3_AA = transactions
      .filter(trans =>
        trans.transactionDay >= day - 5 && trans.transactionDay < day
      )
      .groupBy(_.accountId)
      .mapValues(trans =>
        trans
          .filter(trans => trans.category == "AA")
          .map(amount => amount.transactionAmount)
          .sum
      )

    val question3_CC = transactions
      .filter(trans =>
        trans.transactionDay >= day - 5 && trans.transactionDay < day
      )
      .groupBy(_.accountId)
      .mapValues(trans =>
        trans
          .filter(trans => trans.category == "CC")
          .map(amount => amount.transactionAmount)
          .sum
      )

    val question3_FF = transactions
      .filter(trans =>
        trans.transactionDay >= day - 5 && trans.transactionDay < day
      )
      .groupBy(_.accountId)
      .mapValues(trans =>
        trans
          .filter(trans => trans.category == "FF")
          .map(amount => amount.transactionAmount)
          .sum
      )

    val question3_final = question3_initial.map { case (k, v) =>
      (k, v._1) -> (v._2, v._3, question3_AA(k), question3_CC(k), question3_FF(k))
    }
    println(question3_final)
  }
}

How to work on an index value

This is the data in text file format. I need to find the top salary for every city.
first_name last_name city county salary
--------------------------------------------------------
James Butt New Orleans Orleans 250000
Josephine Darakjy Brighton Livingston 300000
Art Venere Bridgeport Gloucester 400000
Leota Dilliard Bridgeport Gloucester 430000
> val scq = sc.textFile("path.txt")
> scq.flatMap(al=>al.split("\n")).sortBy(_._5,ascending = false).collect.take(5).foreach(println)
// sorting on salary
But I am getting the error value _5 is not a member of String, and when I use toString it gives the error value _5 is not a member of Char.
How should it be handled?
Try this:
> val scq = sc.textFile("path.txt")
> val d = scq.map(_.split("\t")).sortBy(_.apply(4), ascending = false)
This will produce an RDD[Array[String]] as the output. If you want to view the rows as tuples, you can do the following:
> val d1 = d.map(c => (c(0), c(1), c(2), c(3), c(4))) // prefer a case class over this
> d1.collect.foreach(println)
This will produce the following output:
(Leota,Dilliard,Bridgeport,Gloucester,430000)
(Art,Venere,Bridgeport,Gloucester,400000)
(Josephine,Darakjy,Brighton,Livingston,300000)
(James,Butt,New Orleans,Orleans,250000)
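Since the stated goal is the top salary for every city rather than a single global ordering (and note that sortBy on the raw string compares salaries lexicographically), a reduceByKey over (city, salary) pairs may be closer to what is needed. A sketch, assuming tab-separated fields as above:
// (city, salary) pairs reduced to the highest salary per city
val topSalaryPerCity = scq
  .map(_.split("\t"))
  .filter(c => c.length == 5 && c(4).forall(_.isDigit)) // skip header/separator lines
  .map(c => (c(2), c(4).toInt))                         // city is column 2, salary is column 4
  .reduceByKey((a, b) => math.max(a, b))
topSalaryPerCity.collect.foreach(println)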

Apache Spark groupByKey alternative

I have the below columns in my table: [col1, col2, key1, col3, txn_id, dw_last_updated]. Of these, txn_id and key1 are the primary key columns. In my dataset I can have multiple records for the combination of (txn_id, key1), and out of those records I need to pick the latest one based on dw_last_updated.
I'm using logic like this. I'm consistently hitting memory issues, and I believe it's partly because of groupByKey()... Is there a better alternative?
case class Fact(col1: Int,
                col2: Int,
                key1: String,
                col3: Int,
                txn_id: Double,
                dw_last_updated: Long)
sc.textFile(s3path).map { row =>
  val parts = row.split("\t")
  Fact(parts(0).toInt,
       parts(1).toInt,
       parts(2),
       parts(3).toInt,
       parts(4).toDouble,
       parts(5).toLong)
}.map { t => ((t.txn_id, t.key1), t) }.groupByKey(512).map {
  case ((txn_id, key1), sequence) =>
    val newrecord = sequence.maxBy {
      case Fact(col1, col2, key1, col3, txn_id, dw_last_updated) => dw_last_updated
    }
    (newrecord.col1 + "\t" + newrecord.col2 + "\t" + newrecord.key1 +
      "\t" + newrecord.col3 + "\t" + newrecord.txn_id + "\t" + newrecord.dw_last_updated)
}
Appreciate your thoughts / suggestions...
rdd.groupByKey collects all values per key, requiring enough memory to hold the whole sequence of values for a key on a single node. Its use is discouraged. See [1].
Given that we are interested in only one value per key, the record with the maximum dw_last_updated, a more memory-efficient way is rdd.reduceByKey, where the reduce function picks the later of the two records for the same key, using that timestamp as the discriminant.
rdd.reduceByKey{case (record1,record2) => max(record1, record2)}
Applied to your case, it should look like this:
case class Fact(...)

object Fact {
  def parse(s: String): Fact = ???
  def maxByTs(f1: Fact, f2: Fact): Fact =
    if (f1.dw_last_updated.toLong > f2.dw_last_updated.toLong) f1 else f2
}

val factById = sc.textFile(s3path).map { row =>
  val fact = Fact.parse(row)
  ((fact.txn_id, fact.key1), fact)
}
val maxFactById = factById.reduceByKey(Fact.maxByTs)
Note that I've defined utility operations on the Fact companion object to keep the code tidy. I also advise giving named variables to each transformation step or logical group of steps; it makes the program more readable.
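If the original tab-separated output is still needed, a final map over the deduplicated records can rebuild it, for example:
// Drop the composite key and serialize each Fact back to a tab-separated line
val output = maxFactById.values.map { f =>
  Seq(f.col1, f.col2, f.key1, f.col3, f.txn_id, f.dw_last_updated).mkString("\t")
}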