Rolling sum over a Seq with Date as key - Scala

I'm working on a small personal project and am trying to rewrite the codebase from Python to Scala so that I can become a little more competent as a functional programmer.
I am working with a Seq that contains stock data and need to create a running sum of volume traded for each day.
My code so far is:
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat

case class SymbolData(date: DateTime, open: Double, high: Double, low: Double, close: Double, adjClose: Double, volume: Int)

def dateTimeHelper(date: String): DateTime = {
  DateTimeFormat.forPattern("yyyy-MM-dd").parseDateTime(date)
}

val sampleData: Seq[SymbolData] = Seq(
  SymbolData(dateTimeHelper("2019-01-01"), 1.0, 1.0, 1.0, 1.0, 1.0, 10),
  SymbolData(dateTimeHelper("2019-01-02"), 3.0, 2.0, 5.0, 2.0, 8.0, 20),
  SymbolData(dateTimeHelper("2019-01-03"), 1.0, 1.0, 1.0, 1.0, 1.0, 10),
  SymbolData(dateTimeHelper("2019-01-04"), 4.0, 3.0, 2.5, 2.3, 5.3, 7))
Not all dates may be present, so I do not think a sliding window will be appropriate. For the output I need a Seq of Ints containing the sum of the last 2 days of volume, for example:
Seq(10, 30, 30, 17) // 2019-01-01 has only 1 day with a sum of 10 since there is no data for 2018-12-31; 2019-01-02 would be 30 since we have the 1st and 2nd of Jan present, etc...
This is not overly difficult to do in plain Python, but with Scala there seem to be quite a few options (recursive use of folds?) and I am struggling with the syntax and implementation. Would anyone be able to shed some light on this?

You say "not all dates may be present" but you don't specify how date gaps should be handled.
Here I guessed that the output should include all 2-day sums, gap days included.
import java.time.LocalDate
import java.time.temporal.ChronoUnit.DAYS

case class SymbolData(date: LocalDate,
                      open: Double,
                      high: Double,
                      low: Double,
                      close: Double,
                      adjClose: Double,
                      volume: Int)

val sampleData: List[SymbolData] = List(
  SymbolData(LocalDate.parse("2019-01-01"), 1.0, 1.0, 1.0, 1.0, 1.0, 10),
  SymbolData(LocalDate.parse("2019-01-02"), 3.0, 2.0, 5.0, 2.0, 8.0, 20),
  SymbolData(LocalDate.parse("2019-01-03"), 1.0, 1.0, 1.0, 1.0, 1.0, 10),
  SymbolData(LocalDate.parse("2019-01-04"), 4.0, 3.0, 2.5, 2.3, 5.3, 7),
  // 1 day gap
  SymbolData(LocalDate.parse("2019-01-06"), 4.4, 3.3, 2.2, 2.3, 1.3, 13),
  // 2 day gap
  SymbolData(LocalDate.parse("2019-01-09"), 2.4, 2.2, 1.5, 3.1, 0.9, 21),
  SymbolData(LocalDate.parse("2019-01-10"), 2.4, 2.2, 1.5, 3.1, 0.9, 11)
)
// index the volume by date; absent dates default to zero volume
val volByDate = sampleData.foldLeft(Map.empty[LocalDate, Int]) {
  case (m, sd) => m + (sd.date -> sd.volume)
}.withDefaultValue(0)

val startDate = sampleData.head.date
val endDate   = sampleData.last.date

// walk the full date range, emitting each day's 2-day sum
val rslt = List.unfold(startDate) { date =>  // List.unfold requires Scala 2.13
  if (date isAfter endDate) None
  else Some(volByDate(date) + volByDate(date.minus(1L, DAYS)) -> date.plus(1L, DAYS))
}
//rslt: List[Int] = List(10, 30, 30, 17, 7, 13, 13, 0, 21, 32)
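If you are on a Scala version before 2.13 (no List.unfold), the same walk over the date range can be written with Iterator.iterate; a minimal sketch reusing volByDate, startDate, and endDate from above:

val rsltPre213: List[Int] =
  Iterator.iterate(startDate)(_.plus(1L, DAYS))
    .takeWhile(!_.isAfter(endDate))                                  // stop after endDate
    .map(date => volByDate(date) + volByDate(date.minus(1L, DAYS)))  // 2-day sum per day
    .toList
// rsltPre213: List(10, 30, 30, 17, 7, 13, 13, 0, 21, 32)

And if instead the intended output is one sum per row actually present in the data (skipping gap days), a sketch under that reading of the question:

val presentOnly: Seq[Int] =
  sampleData.map(sd => sd.volume + volByDate(sd.date.minus(1L, DAYS)))
// presentOnly: List(10, 30, 30, 17, 13, 21, 32)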

Related

Swift reduce tuple array custom data model

I am attempting to reduce the "y" value components for each matching "x" value in this CustomDataModel array, and return a CustomDataModel array (or simply an array of tuples, [(Double, Double)]).
My knowledge of Swift is rudimentary at best, and my attempt at a chain of functions, shown below, reduces across the entire array rather than per "x" value (the filter predicate $0.x == $0.x is always true). I'm not sure how to limit the reduce to only the like values.
let reducedArray = yValues.filter( {$0.x == $0.x}).map({$0.y}).reduce(0, +)
Below is the model and some dummy data:
struct CustomDataModel {
    var x: Double
    var y: Double
}

let yValues: [CustomDataModel] = [
    CustomDataModel(x: 0.0, y: 10.0),
    CustomDataModel(x: 1.0, y: 5.0),
    CustomDataModel(x: 1.0, y: 12.5),
    CustomDataModel(x: 1.0, y: 14.5),
    CustomDataModel(x: 1.0, y: 18.45),
    CustomDataModel(x: 5.0, y: 11.4),
    CustomDataModel(x: 5.0, y: 9.4),
    CustomDataModel(x: 5.0, y: 18.4),
    CustomDataModel(x: 5.0, y: 5.0),
    CustomDataModel(x: 9.0, y: 7.6),
    CustomDataModel(x: 9.0, y: 13.5),
    CustomDataModel(x: 9.0, y: 18.5),
    CustomDataModel(x: 9.0, y: 17.6),
    CustomDataModel(x: 9.0, y: 14.3),
    CustomDataModel(x: 14.0, y: 19.6),
    CustomDataModel(x: 14.0, y: 17.8),
    CustomDataModel(x: 14.0, y: 20.1),
    CustomDataModel(x: 14.0, y: 21.5),
    CustomDataModel(x: 14.0, y: 23.4),
]
Ideally the output I would have would look like this:
print(reducedArray)
//[(0.0, 10.0), (1.0, 50.45), (5.0, 44.2), (9.0, 71.5), (14.0, 102.4)]
You need to group the yValues array by x into a Dictionary, map each group's values to y, and then reduce. Here's how:
let reducedArray = Dictionary(grouping: yValues, by: \.x).mapValues({ $0.map(\.y).reduce(0, +) })
print(reducedArray)
Update: If you require the result in [(Double, Double)] format, you can map the above dictionary's key-value pairs into tuples (and sort by x if order matters, since a Dictionary is unordered).

Merge List[List[_]] conditionally

I want to merge a List[List[Double]] based on the values of the elements in the inner Lists. Here's what I have so far:
// inner Lists are (timestamp, ID, measurement)
val data = List(List(60, 0, 3.4), List(60, 1, 2.5), List(120, 0, 1.1),
                List(180, 0, 5.6), List(180, 1, 4.4), List(180, 2, 6.7))
data
  .foldLeft(List[List[Double]]())(
    (ret, ll) =>
      // if this is the first list, just add it to the return val
      if (ret.isEmpty) {
        List(ll)
      // if the timestamps match, add a new (ID, measurement) pair to this inner list
      } else if (ret(0)(0) == ll(0)) {
        {{ret(0) :+ ll(1)} :+ ll(2)} :: ret.drop(1)
      // if this is a new timestamp, add it to the beginning of the return val
      } else {
        ll :: ret
      }
  )
This works, but it doesn't smell optimal to me (especially the right-additions ':+'). For my use case, I have a pretty big (~25,000 inner Lists) List of elements, which are themselves all length-3 Lists. At most, there will be a fourfold degeneracy, because the inner lists are List(timestamp, ID, measurement) groups, and there are only four unique IDs. Essentially, I want to smush together all of the measurements that have the same timestamps.
Does anyone see a more optimal way of doing this?
I actually start with a List[Double] of timestamps and a List[Double] of measurements for each of the four IDs, if there's a better way of starting from that point.
Here is a slightly shorter way to do it:
data
  .groupBy(_(0))
  .mapValues(_.flatMap(_.tail))
  .toList
  .map(kv => kv._1 :: kv._2)
The result is 1:1 the same as what your algorithm produces (this answer's output first, your foldLeft's second):
List(List(180.0, 0.0, 5.6, 1.0, 4.4, 2.0, 6.7), List(120.0, 0.0, 1.1), List(60.0, 0.0, 3.4, 1.0, 2.5))
List(List(180.0, 0.0, 5.6, 1.0, 4.4, 2.0, 6.7), List(120.0, 0.0, 1.1), List(60.0, 0.0, 3.4, 1.0, 2.5))
Explanation (the intermediate shapes are sketched below):
group by timestamp
in the grouped values, drop the redundant timestamps and flatten to a single list
tack the timestamp back onto the flat list of ids-&-measurements
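To make those steps concrete, here is roughly what each stage produces for the sample data (a sketch; the Map's key order is not guaranteed):

val grouped = data.groupBy(_(0))
// 60.0  -> List(List(60.0, 0.0, 3.4), List(60.0, 1.0, 2.5))
// 120.0 -> List(List(120.0, 0.0, 1.1))
// 180.0 -> List(List(180.0, 0.0, 5.6), List(180.0, 1.0, 4.4), List(180.0, 2.0, 6.7))

val flattened = grouped.mapValues(_.flatMap(_.tail))
// 180.0 -> List(0.0, 5.6, 1.0, 4.4, 2.0, 6.7), and so on for the other keys

val merged = flattened.toList.map(kv => kv._1 :: kv._2)
// List(List(180.0, 0.0, 5.6, 1.0, 4.4, 2.0, 6.7), List(120.0, 0.0, 1.1), List(60.0, 0.0, 3.4, 1.0, 2.5))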
Here is a possibility:
data
  .groupBy(_(0))
  .map { case (tstp, values) => tstp :: values.flatMap(_.tail) }
The idea is just to group inner lists by their first element and then flatten the resulting values.
which returns:
List(List(180.0, 0.0, 5.6, 1.0, 4.4, 2.0, 6.7), List(120.0, 0.0, 1.1), List(60.0, 0.0, 3.4, 1.0, 2.5))
What about representing your measurements with a case class?
case class Measurement(timestamp: Int, id: Int, value: Double)

val measurementData = List(Measurement(60, 0, 3.4), Measurement(60, 1, 2.5),
                           Measurement(120, 0, 1.1), Measurement(180, 0, 5.6),
                           Measurement(180, 1, 4.4), Measurement(180, 2, 6.7))

measurementData.foldLeft(List[List[Measurement]]())({
  // first measurement: start the first group
  case (Nil, m) => List(List(m))
  // same timestamp as the current group: merge into it
  case ((x :: xs) :: rest, m) if x.timestamp == m.timestamp => (m :: x :: xs) :: rest
  // new timestamp: start a new group at the front
  case (groups, m) => List(m) :: groups
})
// List(List(Measurement(180,2,6.7), Measurement(180,1,4.4), Measurement(180,0,5.6)),
//      List(Measurement(120,0,1.1)),
//      List(Measurement(60,1,2.5), Measurement(60,0,3.4)))
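If you don't care about preserving the encounter order of the timestamps, a groupBy gives the same grouping more directly (a sketch; groupBy does not guarantee key order, but it does preserve element order within each group):

val byTimestamp: Map[Int, List[Measurement]] = measurementData.groupBy(_.timestamp)
// 60  -> List(Measurement(60,0,3.4), Measurement(60,1,2.5))
// 120 -> List(Measurement(120,0,1.1))
// 180 -> List(Measurement(180,0,5.6), Measurement(180,1,4.4), Measurement(180,2,6.7))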

Displaying output under a certain format

I'm quite new to Scala and Spark, and have some questions about formatting results in an output file.
I have a Map in which each key is associated with a list of lists (Map[Int, List[List[Double]]]), such as:
(2, List(List(x1,x2,x3), List(y1,y2,y3), ...)).
I am supposed to display, for each key, the values inside its lists of lists, such as:
2 x1,x2,x3
2 y1,y2,y3
1 z1,z2,z3
and so on.
When I use the saveAsTextFile function, it doesn't give me what I want in the output. Does anybody know how I can do it?
EDIT:
This is one of my functions:
def PrintCluster(vectorsByKey: Map[Int, List[Double]], vectCentroidPairs: Map[Int, Int]): Map[Int, List[Double]] = {
  var vectorsByCentroid: Map[Int, List[Double]] = Map()
  val SortedCentroid = vectCentroidPairs.groupBy(_._2).mapValues(x => x.map(_._1).toList).toSeq.sortBy(_._1).toMap
  SortedCentroid.foreach { case (centroid, vect) =>
    var nbVectors = vect.length
    for (i <- 0 to nbVectors - 1) {
      var vectValues = vectorsByKey(vect(i))
      println(centroid + " " + vectValues)
      vectorsByCentroid += (centroid -> (vectValues))
    }
  }
  return vectorsByCentroid
}
I know it's wrong, because I can only associate one value with each unique key, so the Map ends up holding just a single List per key. I thought that to use the saveAsTextFile function I necessarily had to use a Map structure, but I don't really know.
Create a sample RDD as per your input data:
import org.apache.spark.rdd.RDD

val rdd: RDD[Map[Int, List[List[Double]]]] = spark.sparkContext.parallelize(
  Seq(Map(
    2 -> List(List(-4.4, -2.0, 1.5), List(-3.3, -5.4, 3.9), List(-5.8, -3.3, 2.3), List(-5.2, -4.0, 2.8)),
    1 -> List(List(7.3, 1.0, -2.0), List(9.8, 0.4, -1.0), List(7.5, 0.3, -3.0), List(6.1, -0.5, -0.6), List(7.8, 2.2, -0.7), List(6.6, 1.4, -1.1), List(8.1, -0.0, 2.7)),
    3 -> List(List(-3.0, 4.0, 1.4), List(-4.0, 3.9, 0.8), List(-1.4, 4.3, -0.5), List(-1.6, 5.2, 1.0)))
  )
)
Then transform the RDD[Map[Int, List[List[Double]]]] to RDD[(Int, String)]:
val result: RDD[(Int, String)] = rdd.flatMap(m => {
  m.map {
    case (key, lists) => lists.map(list => (key, list.mkString(",")))  // comma-separated, as in the desired output
  }
}).flatMap(z => z)
result.foreach(println)
result.saveAsTextFile("location")
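Note that saveAsTextFile writes a directory of part-NNNNN files, one per partition. If a single output file is wanted and the result is small, the RDD can be coalesced to one partition first (the path is just a placeholder):

result.coalesce(1).saveAsTextFile("location")  // single part file; fine for small results only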
Starting from a plain Map[Int, List[List[Double]]], printing it in the wanted format is simple: first convert the map to a list, then apply flatMap. Using the data supplied in a comment:
val map: Map[Int, List[List[Double]]] = Map(
  2 -> List(List(-4.4, -2.0, 1.5), List(-3.3, -5.4, 3.9), List(-5.8, -3.3, 2.3), List(-5.2, -4.0, 2.8)),
  1 -> List(List(7.3, 1.0, -2.0), List(9.8, 0.4, -1.0), List(7.5, 0.3, -3.0), List(6.1, -0.5, -0.6), List(7.8, 2.2, -0.7), List(6.6, 1.4, -1.1), List(8.1, -0.0, 2.7)),
  3 -> List(List(-3.0, 4.0, 1.4), List(-4.0, 3.9, 0.8), List(-1.4, 4.3, -0.5), List(-1.6, 5.2, 1.0))
)
val list = map.toList.flatMap(t => t._2.map((t._1, _)))  // one (key, innerList) pair per inner list
val result = for (t <- list) yield t._1 + "\t" + t._2.mkString(",")

// Saving the result to a file
import java.io._
val pw = new PrintWriter(new File("fileName.txt"))
result.foreach(line => pw.println(line))
pw.close()
Will print out:
2 -4.4,-2.0,1.5
2 -3.3,-5.4,3.9
2 -5.8,-3.3,2.3
2 -5.2,-4.0,2.8
1 7.3,1.0,-2.0
1 9.8,0.4,-1.0
1 7.5,0.3,-3.0
1 6.1,-0.5,-0.6
1 7.8,2.2,-0.7
1 6.6,1.4,-1.1
1 8.1,-0.0,2.7
3 -3.0,4.0,1.4
3 -4.0,3.9,0.8
3 -1.4,4.3,-0.5
3 -1.6,5.2,1.0

How to define Tuple1 in Scala?

I tried to use (1,), but it doesn't work. What's the syntax to define a Tuple1 in Scala?
scala> val a=(1,)
<console>:1: error: illegal start of simple expression
val a=(1,)
For tuples with cardinality 2 or more, you can use parentheses; however, for cardinality 1, you need to use Tuple1:
scala> val tuple1 = Tuple1(1)
tuple1: Tuple1[Int] = (1,)
scala> val tuple2 = ('a', 1)
tuple2: (Char, Int) = (a,1)
scala> val tuple3 = ('a', 1, "name")
tuple3: (Char, Int, java.lang.String) = (a,1,name)
scala> tuple1._1
res0: Int = 1
scala> tuple2._2
res1: Int = 1
scala> tuple3._1
res2: Char = a
scala> tuple3._3
res3: String = name
To declare the type, use Tuple1[T], for example: val t: Tuple1[Int] = Tuple1(22)
A tuple is, by definition, an ordered list of elements. While Tuple1 exists, I haven't seen it used explicitly, since you would normally just use the single element directly. Nevertheless, there is no syntactic sugar here; you need to write Tuple1(1).
There is a valid use case in Spark that requires Tuple1: creating a DataFrame with a single column.
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
data.toDF("features").show()
It will throw an error:
"value toDF is not a member of Seq[org.apache.spark.ml.linalg.Vector]"
To make it work, we have to wrap each row in a Tuple1:
val data = Seq(
  Tuple1(Vectors.sparse(5, Seq((1, 1.0), (3, 7.0)))),
  Tuple1(Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0)),
  Tuple1(Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
)
or a better way:
val data = Seq(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
).map(Tuple1.apply)
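In both cases, toDF itself only becomes available after importing the SparkSession implicits, so the snippets assume something like the following is in scope (the session name spark and the app name here are assumptions, not part of the original answer):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("tuple1-demo")  // hypothetical app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._   // brings toDF into scope for Seq[Tuple1[Vector]]

data.toDF("features").show()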

Array of doubles in scala

I just want a quick way to create an array (or vector) of doubles that doesn't come out as type NumericRange.
I've tried:
val ys = Array(9. to 1. by -1.)
But this returns type Array[scala.collection.immutable.NumericRange[Double]]
Is there a way to coerce this to regular type Array[Double]?
scala> (9d to 1d by -1d).toArray
res0: Array[Double] = Array(9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0)
I think this is slightly more concise and readable (the : _* ascription expands the range into varargs for Array's apply):
Array(9d to 1 by -1 : _*)
res0: Array[Double] = Array(9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0)
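And since the question mentions "array (or vector)", the same range converts directly to other collection types as well:

(9d to 1d by -1d).toVector
// Vector(9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0)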