Creating "running total" in Scala - scala

I have a history DataFrame that has the following structure
id amount date
12345 150 1/1/2016
12345 50 1/4/2016
12345 250 1/4/2016
12345 950 1/9/2016
I would like to have a cumulative sum of the amount with respect to date, where the sum for a given date is the total of all amounts on that date and earlier for the same ID. Results should be generated even for dates that have no entries in the source DataFrame, as long as they fall between the start and end dates. The expected output for the example input is shown below.
ID date cumulative_sum
12345 1/1/2016 150
12345 1/2/2016 150
12345 1/3/2016 150
12345 1/4/2016 450
12345 1/5/2016 450
12345 1/6/2016 450
12345 1/7/2016 450
12345 1/8/2016 450
12345 1/9/2016 1400
Does anyone know how to calculate this sort of running total?

Basically, you first find subtotals for each date (doesn't really have to happen as a separate step, but this makes things a little more generic - I'll explain why below):
val subtotals = data
  .groupBy(_.date)
  .mapValues(_.map(_.amount).sum)
  .withDefault(_ => 0)
Now, you can scan through the date range, and sum things up with something like this:
(0 to numberOfDays)
  .map(offset => startDate.plusDays(offset))
  .scanLeft(startDate.minusDays(1) -> 0) { case ((_, sum), date) =>
    date -> (subtotals(date) + sum)
  }.drop(1)
This is how you would do it in "plain Scala". Now, because you mentioned "DataFrame" in your question, I suspect you are actually using Spark. This makes it a little more complicated, because the data may be distributed. The good news is that, while you may have a huge number of transactions, there aren't enough days in the history of the world to make it impossible for you to process the aggregated data as a single task.
So, you just need to replace the first step above with a distributed equivalent:
val subtotals = dataFrame
  .rdd
  .map(tx => tx.date -> tx.amount) // assumes a typed Dataset, so that .date and .amount are available
  .reduceByKey(_ + _)
  .collect
  .toMap
  .withDefault(_ => 0)
And now you can do the second step in exactly the same way as shown above.
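Putting the two steps together, here is a minimal, untested sketch. It assumes java.time.LocalDate dates, a single id as in the example, an RDD called transactions of a case class with date and amount fields, and startDate/endDate bounding the range you want to fill; those names are placeholders, not part of the original answer:
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Distributed subtotal per date, then a purely local scan over the whole date range.
val subtotals: Map[LocalDate, Int] = transactions
  .map(tx => tx.date -> tx.amount)
  .reduceByKey(_ + _)
  .collect()
  .toMap
  .withDefaultValue(0)

val numberOfDays = ChronoUnit.DAYS.between(startDate, endDate)
val runningTotals = (0L to numberOfDays)
  .map(offset => startDate.plusDays(offset))
  .scanLeft(startDate.minusDays(1) -> 0) { case ((_, sum), date) =>
    date -> (subtotals(date) + sum) // dates with no transactions just carry the previous sum forward
  }.drop(1)
For multiple ids you would key the subtotals by (id, date) instead and run the scan once per id.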

Related

Cumulative function in spark scala

I have tried this to calculate a cumulative value, but if the date field is the same, those values are all added into the cumulative field at once. Can someone suggest a solution? Similar to this question
val windowval = (Window.partitionBy($"userID").orderBy($"lastModified")
  .rangeBetween(Window.unboundedPreceding, 0))
val df_w_cumsum = ms1_userlogRewards
  .withColumn("totalRewards", sum($"noOfJumps").over(windowval))
  .orderBy($"lastModified".asc)
df_w_cumsum.filter($"batchType".isNull)
  .filter($"userID" === "355163")
  .select($"userID", $"noOfJumps", $"totalRewards", $"lastModified")
  .show()
Note that your very first totalRewards=147 is the sum of the previous value 49 + all the values with timestamp "2019-08-07 18:25:06": 49 + (36 + 0 + 60 + 2) = 147.
The first option would be to aggregate all the values with the same timestamp first, e.g. groupBy($"userID", $"lastModified").agg(sum($"noOfJumps").as("noOfJumps")) (or something like that), and then run your aggregate sum. This will remove duplicate timestamps altogether.
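In code, that first option might look something like the sketch below (untested); the names ms1_userlogRewards, userID, lastModified and noOfJumps come from the question, and spark is assumed to be the active SparkSession:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._ // for the $"..." column syntax

// Collapse rows that share a timestamp, then take the running sum per user.
val collapsed = ms1_userlogRewards
  .groupBy($"userID", $"lastModified")
  .agg(sum($"noOfJumps").as("noOfJumps"))

val w = Window.partitionBy($"userID")
  .orderBy($"lastModified")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)

val withTotals = collapsed.withColumn("totalRewards", sum($"noOfJumps").over(w))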
The second option is to use row_number to define an order among rows with the same lastModified field first, and then run your aggregate sum with .orderBy($"lastModified", $"row_number") (or something like that). This should keep all records and give you partial sums along the way: totalRewards = 49 -> 85 -> 85 -> 145 -> 147 (or something similar, depending on the order defined by row_number)
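And a rough sketch of the second option, again using the question's names; rn is a hypothetical tie-breaker column introduced just for this example:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, sum}
import spark.implicits._ // for the $"..." column syntax

// Number rows that share a timestamp, then order the running sum by both columns.
val tieBreak = Window.partitionBy($"userID").orderBy($"lastModified")
val running = Window.partitionBy($"userID")
  .orderBy($"lastModified", $"rn")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val withPartialTotals = ms1_userlogRewards
  .withColumn("rn", row_number().over(tieBreak))
  .withColumn("totalRewards", sum($"noOfJumps").over(running))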
I think you want to sum by userid and timestamp.
So you need to partition by userID and lastModified and use a window function to sum, like the following:
import org.apache.spark.sql.functions.{col, sum}
import org.apache.spark.sql.expressions.Window

val window = Window.partitionBy("userID", "lastModified")
df.withColumn("cumulativeSum", sum(col("noOfJumps")).over(window))

KDB: how to search a table with a list

I have the following table t:
t:([]sym:3#`ibm;time:10:01:01 10:01:04 10:01:08;price:100 101 105;val:("hello";"world";"test"))
How do I perform the following query:
select from t where val in ("hello"; "test")
Wherein I am expecting the following result:
sym time price val
---------------------------
ibm 10:01:01 100 hello
ibm 10:01:08 105 test
It looks like your query does return the result you require.
Alternatively, the keyword 'like' can be used.
When we use a where clause at the end of a select statement, the where section expects a single boolean value per row to tell it whether that row should be selected.
When we do where val in "hello", it will actually return a boolean for each element of the string that it matches on (when it isn't wrapped):
q)val:"hello"
q)val in "hello"
11111b
Therefore, to obtain a single boolean we use the keyword like:
q)val like "hello"
1b
Furthermore, when passing a list of strings to a where clause, an 'each-right' adverb should be used to instruct the where clause to operate on each instance of the list.
q)val like/: ("hello";"test")
10b
However, we are again faced with multiple booleans, when the where clause expects a single one.
Therefore we use the keyword any to return results when either hello or test is present.
q)any val like/: ("hello";"test")
1b
And we can see this gives the required results:
q)select from t where any val like/: ("hello";"test")
sym time price val
--------------------------
ibm 10:01:01 100 "hello"
ibm 10:01:08 105 "test"
Hope this helps
q) select from t where any val like/: ("hello"; "test")
Output:
sym time price val
---------------------------
ibm 10:01:01 100 hello
ibm 10:01:08 105 test

Scala RDD Operation

I am new to Scala.
I have a CSV file stored in HDFS. I am reading that file in Scala using:
val salesdata = sc.textFile("hdfs://localhost:9000/home/jayshree/sales.csv")
Here is a small sample of the "sales" data:
C_ID T_ID ITEM_ID ITEM_Price
5 199 1 500
33 235 1 500
20 249 3 749
35 36 4 757
19 201 4 757
17 94 5 763
39 146 5 763
42 162 5 763
49 41 6 824
3 70 6 824
24 161 6 824
48 216 6 824
I have to perform the following operations on it:
1. Apply a discount to each item on the price column d (ITEM_Price), say a 30% discount. The formula is d = d - 0.3*d.
2. Find the customer-wise minimum and maximum item value after applying the 30% discount to each item.
I tried to multiply the values of the ITEM_Price column by 30. The problem is that the value of d is taken as a string, so multiplying it by a number repeats the string that many times (e.g. "500" * 3 gives "500500500").
I can convert it into a DataFrame and do it there, but I just want to know whether we can do these operations on an RDD without converting it into a DataFrame.
Discount
case class Transaction(cId: Int, tId: Int, itemId: Int, itemPrice: Double)
val salesdata: RDD[String] => map the RDD; within the map, split each line by your separator and convert the resulting Array into the Transaction case class, calling array(i).toInt (and .toDouble for the price) to cast the fields. In this step your target is to get an RDD[Transaction].
Map the RDD again and copy your transactions, applying the discount (t => t.copy(itemPrice = 0.7 * t.itemPrice))
You will have a new RDD[Transaction]
Customer wise
Take the last RDD and apply keyBy(_.cId) to get an RDD[(Int, Transaction)] where your key is the client.
reduceByKey, adding the prices for each item. Goal => RDD[(Int, Double)] where you get the total for each client.
Find your target clients! A minimal sketch of these steps follows below.
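Here is a rough, untested sketch of those steps; it assumes the CSV is comma-separated with no header row, and the last step computes the customer-wise minimum and maximum discounted price that the question asks for (swap in reduceByKey(_ + _) on the prices if you want totals per client instead):
import org.apache.spark.rdd.RDD

// Step 1: parse each line into the Transaction case class defined above
val transactions: RDD[Transaction] = salesdata.map { line =>
  val f = line.split(",")
  Transaction(f(0).toInt, f(1).toInt, f(2).toInt, f(3).toDouble)
}

// Step 2: apply the 30% discount
val discounted: RDD[Transaction] = transactions.map(t => t.copy(itemPrice = 0.7 * t.itemPrice))

// Step 3: key by customer, then reduce to the min and max discounted price per customer
val prices: RDD[(Int, Double)] = discounted.keyBy(_.cId).mapValues(_.itemPrice)
val minByCustomer = prices.reduceByKey((a, b) => math.min(a, b))
val maxByCustomer = prices.reduceByKey((a, b) => math.max(a, b))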
Since you want more of a guide, let's look at this outside of Spark for a second and think about things as typical Scala collections.
Your data would look like this:
val data = Array(
  (5, 199, 1, 500),
  (33, 235, 1, 500),
  ...
)
I think you will have no trouble mapping your salesdata RDD of strings to an RDD of Array or Tuple4 using a split or regular expression or something.
Let's go with a tuple. Then you can do this:
data.map {
  case (cId, tId, item, price) => (cId, tId, item, price * .7)
}
That maps the original RDD of tuples to another RDD of tuples where the last values, the prices, are reduced by 30%. So the result is a Tuple4[Int, Int, Int, Double].
To be honest, I don't know what you mean by customer-wise min and max, but maybe it is something like this:
data.map {
  case (cId, tId, item, price) => (cId, tId, item, price * .7)
}.groupBy(_._1)
  .mapValues { tuples =>
    val discountedPrices = tuples.map(_._4)
    (discountedPrices.min, discountedPrices.max)
  }
First, I do a groupBy, which produces a Map from cId (the first value in the tuple, which explains the ._1) to a collection of full tuples--so a Map of cId to a collection of rows pertaining to that cId. In Spark, this would produce a PairRDD.
Map and PairRDD both have a mapValues function, which allows me to preserve the keys (the cIds) while transforming each collection of tuples. In this case, I simply map the collection to a collection of discounted prices by getting the 4th item in each tuple. Then I call min and max on that collection and return a tuple of those values.
So the result is a Map of customer ID to a tuple of the min and max of the discounted prices. The beauty of the RDD API is that it follows the conventional Scala collection API so closely, so it is basically the same thing.

Iterate over current row values in kdb query

Consider the table:
q)trade
stock price amt time
-----------------------------
ibm 121.3 1000 09:03:06.000
bac 5.76 500 09:03:23.000
usb 8.19 800 09:04:01.000
and the list:
q)x: 10000 20000
The following query:
q)select from trade where price < x[first where (x - price) > 100f]
'length
fails as above. How can I pass the current row value of price in each iteration of the search query?
While price[0] in the square brackets above works, that's obviously not what I want. I even tried price[i] but that gives the same error.

how to use multiple arguments in kdb where query?

I want to select max elements from a table within the next 5, 10, 30 minutes etc.
I suspect this is not possible with multiple elements in the where clause.
Using both normal < and </: is failing. My code/query is below:
`select max price from dat where time</: (09:05:00; 09:10:00; 09:30:00)`
Any ideas what I am doing wrong here?
The idea is to get the max price for each row within the next 5, 10, 30... minutes of the time in that row, and not just 3 max prices for the entire table.
select max price from dat where time</: time+\:(5 10 30)
This won't work but should give the general idea.
To further clarify, I want to calculate the max price in 5, 10, 30 minute intervals from time[i] of each row of the table. So for each table row, the max price within x+5, x+10, x+30 minutes, where x is the time entry in that row.
You could try something like this:
select c1:max price[where time <09:05:00],c2:max price[where time <09:10:00],c3:max price from dat where time< 09:30:00
You can parameterize this query however you like. So if you have a list of times, l:09:05:00 09:10:00 09:15:00 09:20:00 ..., you can create a function, using a functional form of the query above, that works for different lengths of l, something like:
q)f:{[t]?[dat;enlist (<;`time;max t);0b;(`$"c",/:string til count t)!flip (max;flip (`price;flip (where;((<),/:`time,/:t))))]}
q)f l
You can extend f to take different functions instead of max, work for different tables etc.
This works but takes a lot of time. For 20k records it takes ~20 seconds, which is too much! Any way to make it faster?
dat: update tmlst: time+\:mtf*60 from dat;
dat[`pxs]: {[x;y] {[x; ts] raze flip raze {[x;y] select min price from x where time<y}[x] each ts }[x; y`tmlst]} [dat] each dat;
This constructs a step dictionary to map the times to your buckets:
q)-1_select max price by(`s#{((neg w),x)!x,w:(type x)$0W}09:05:00 09:10:00 09:30:00)time from dat
You may also be able to abuse wj:
q)wj[{(prev x;x)}09:05:00 09:10:00 09:30:00;`time;([]time:09:05:00 09:10:00 09:30:00);(delete sym from dat;(max;`price))]
If all your buckets are the same size, it's much easier:
q)select max price by 300 xbar time from dat where time<09:30:00 / 300-second (5-min) buckets