I am trying to sum the "y" values for each distinct "x" value in this CustomDataModel array and return a CustomDataModel array (or simply an array of tuples, [(Double, Double)]).
My knowledge of Swift is rudimentary at best. My attempt at chaining functions, shown below, reduces across the entire array rather than per "x" value, and I'm not sure how to limit the reduce to elements with the same "x".
let reducedArray = yValues.filter( {$0.x == $0.x}).map({$0.y}).reduce(0, +)
The expression above sums all y values instead of summing them per x.
Below is the model and some dummy data:
struct CustomDataModel {
var x : Double
var y : Double
}
let yValues: [CustomDataModel] = [
CustomDataModel(x: 0.0, y: 10.0),
CustomDataModel(x: 1.0, y: 5.0),
CustomDataModel(x: 1.0, y: 12.5),
CustomDataModel(x: 1.0, y: 14.5),
CustomDataModel(x: 1.0, y: 18.45),
CustomDataModel(x: 5.0, y: 11.4),
CustomDataModel(x: 5.0, y: 9.4),
CustomDataModel(x: 5.0, y: 18.4),
CustomDataModel(x: 5.0, y: 5.0),
CustomDataModel(x: 9.0, y: 7.6),
CustomDataModel(x: 9.0, y: 13.5),
CustomDataModel(x: 9.0, y: 18.5),
CustomDataModel(x: 9.0, y: 17.6),
CustomDataModel(x: 9.0, y: 14.3),
CustomDataModel(x: 14.0, y: 19.6),
CustomDataModel(x: 14.0, y: 17.8),
CustomDataModel(x: 14.0, y: 20.1),
CustomDataModel(x: 14.0, y: 21.5),
CustomDataModel(x: 14.0, y: 23.4),
]
Ideally, the output would look like this:
print(reducedArray)
//[(0.0, 10.0), (1.0, 50.45), (5.0, 44.2), (9.0, 71.5), (14.0, 102.4)]
You need to group the yValues array by x into a Dictionary, map each group to its y values, and then reduce each group. Here's how:
let reducedArray = Dictionary(grouping: yValues, by: \.x).mapValues({ $0.map(\.y).reduce(0, +) })
print(reducedArray)
Update: If you need the result in [(Double, Double)] format, you can map the dictionary's key-value pairs to tuples. A Dictionary is unordered, so sort the tuples by x if the order matters.
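The group-then-sum-then-tuples idea is independent of Swift, so here is a minimal sketch of it in Python. The function name `sum_y_per_x` and the shortened sample list are made up for illustration; the sort at the end stands in for the ordering step that an unordered dictionary makes necessary.

```python
from collections import defaultdict

def sum_y_per_x(points):
    """Group (x, y) pairs by x, sum the y values, and return tuples sorted by x."""
    totals = defaultdict(float)
    for x, y in points:
        totals[x] += y  # accumulate y under its x key
    # Dictionary items become (x, total) tuples; sorting makes the order deterministic.
    return sorted(totals.items())

# Hypothetical, shortened version of the dummy data from the question.
points = [
    (0.0, 10.0),
    (1.0, 5.0), (1.0, 12.5), (1.0, 14.5), (1.0, 18.45),
    (5.0, 11.4), (5.0, 9.4),
]
result = sum_y_per_x(points)
print(result)
```

Because the sums are floating point, compare them with a tolerance rather than with exact equality.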
I'm working on a small personal project and am trying to rewrite the codebase from Python to Scala so that I can become a more competent functional programmer.
I am working with a Seq that contains stock data and need to create a running sum of volume traded for each day.
My code so far is:
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat
case class SymbolData(date: DateTime, open: Double, high: Double, low: Double, close: Double, adjClose: Double, volume: Int)
def dateTimeHelper(date: String): DateTime = {
DateTimeFormat.forPattern("yyyy-MM-dd").parseDateTime(date)
}
val sampleData: Seq[SymbolData] = Seq(
SymbolData(dateTimeHelper("2019-01-01"), 1.0, 1.0, 1.0, 1.0, 1.0, 10),
SymbolData(dateTimeHelper("2019-01-02"), 3.0, 2.0, 5.0, 2.0, 8.0, 20),
SymbolData(dateTimeHelper("2019-01-03"), 1.0, 1.0, 1.0, 1.0, 1.0, 10),
SymbolData(dateTimeHelper("2019-01-04"), 4.0, 3.0, 2.5, 2.3, 5.3, 7))
Not all dates may be present, so I do not think a sliding window is appropriate. For the output I need a Seq of Ints containing the sum of the last 2 days of data, for example:
Seq(10, 30, 30, 17) // 2019-01-01 has only 1 day with sum value of 10 since there is no data for 2018-12-31, 2019-01-02 would be 30 since we have 2nd and 1st of Jan present, etc...
This is not overly difficult to do in plain Python, but Scala seems to offer quite a few options (recursive use of folds?) and I am struggling with the syntax and implementation. Would anyone be able to shed some light on this?
You say "not all dates may be present" but you don't specify how date gaps should be handled.
Here I guessed that output should include all 2-day sums, gap days included.
import java.time.LocalDate
import java.time.temporal.ChronoUnit.DAYS
case class SymbolData(date : LocalDate
,open : Double
,high : Double
,low : Double
,close : Double
,adjClose : Double
,volume : Int)
val sampleData: List[SymbolData] = List(
SymbolData(LocalDate.parse("2019-01-01"), 1.0, 1.0, 1.0, 1.0, 1.0, 10),
SymbolData(LocalDate.parse("2019-01-02"), 3.0, 2.0, 5.0, 2.0, 8.0, 20),
SymbolData(LocalDate.parse("2019-01-03"), 1.0, 1.0, 1.0, 1.0, 1.0, 10),
SymbolData(LocalDate.parse("2019-01-04"), 4.0, 3.0, 2.5, 2.3, 5.3, 7),
// 1 day gap
SymbolData(LocalDate.parse("2019-01-06"), 4.4, 3.3, 2.2, 2.3, 1.3, 13),
// 2 day gap
SymbolData(LocalDate.parse("2019-01-09"), 2.4, 2.2, 1.5, 3.1, 0.9, 21),
SymbolData(LocalDate.parse("2019-01-10"), 2.4, 2.2, 1.5, 3.1, 0.9, 11)
)
val volByDate = sampleData.foldLeft(Map.empty[LocalDate,Int]){
case (m,sd) => m + (sd.date -> sd.volume)
}.withDefaultValue(0)
val startDate = sampleData.head.date
val endDate = sampleData.last.date
val rslt = List.unfold(startDate){ date => //<--Scala 2.13
if (date isAfter endDate) None
else
Some(volByDate(date) + volByDate(date.minus(1L,DAYS)) -> date.plus(1L,DAYS))
}
//rslt: List[Int] = List(10, 30, 30, 17, 7, 13, 13, 0, 21, 32)
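Since the question mentions a Python background, here is a minimal Python sketch of the same lookup-with-default idea. It is not a translation of the Scala above: the function name `two_day_sums` and the `include_gaps` flag are made up for illustration. With the flag off it returns sums only for dates that are present, which is the other plausible reading of the question.

```python
from datetime import date, timedelta

def two_day_sums(volume_by_date, include_gaps=True):
    """Sum each day's volume with the previous calendar day's volume (0 if missing)."""
    if not volume_by_date:
        return []
    one_day = timedelta(days=1)
    if include_gaps:
        # Walk every calendar day from the first to the last date, gaps included.
        first, last = min(volume_by_date), max(volume_by_date)
        days = [first + i * one_day for i in range((last - first).days + 1)]
    else:
        # Only report days that actually have data.
        days = sorted(volume_by_date)
    return [volume_by_date.get(d, 0) + volume_by_date.get(d - one_day, 0) for d in days]

# Same dates and volumes as the sample data above.
volumes = {
    date(2019, 1, 1): 10, date(2019, 1, 2): 20, date(2019, 1, 3): 10,
    date(2019, 1, 4): 7, date(2019, 1, 6): 13, date(2019, 1, 9): 21,
    date(2019, 1, 10): 11,
}
print(two_day_sums(volumes))                      # [10, 30, 30, 17, 7, 13, 13, 0, 21, 32]
print(two_day_sums(volumes, include_gaps=False))  # [10, 30, 30, 17, 13, 21, 32]
```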
I want to merge a List[List[Double]] based on the values of the elements in the inner Lists. Here's what I have so far:
// inner Lists are (timestamp, ID, measurement)
val data = List(List(60, 0, 3.4), List(60, 1, 2.5), List(120, 0, 1.1),
List(180, 0, 5.6), List(180, 1, 4.4), List(180, 2, 6.7))
data
.foldLeft(List[List[Double]]())(
(ret, ll) =>
// if this is the first list, just add it to the return val
if (ret.isEmpty){
List(ll)
// if the timestamps match, add a new (ID, measurement) pair to this inner list
} else if (ret(0)(0) == ll(0)){
{{ret(0) :+ ll(1)} :+ ll(2)} :: ret.drop(1)
// if this is a new timestamp, add it to the beginning of the return val
} else {
ll :: ret
}
)
This works, but it doesn't smell optimal to me (especially the right-additions ':+'). For my use case, I have a pretty big (~25,000 inner Lists) List of elements, which are themselves all length-3 Lists. At most, there will be a fourfold degeneracy, because the inner lists are List(timestamp, ID, measurement) groups, and there are only four unique IDs. Essentially, I want to smush together all of the measurements that have the same timestamps.
Does anyone see a more optimal way of doing this?
I actually start with a List[Double] of timestamps and a List[Double] of measurements for each of the four IDs, if there's a better way of starting from that point.
Here is a slightly shorter way to do it:
data.
groupBy(_(0)).
mapValues(_.flatMap(_.tail)).
toList.
map(kv => kv._1 :: kv._2)
The result is exactly the same as what your algorithm produces:
List(List(180.0, 0.0, 5.6, 1.0, 4.4, 2.0, 6.7), List(120.0, 0.0, 1.1), List(60.0, 0.0, 3.4, 1.0, 2.5))
List(List(180.0, 0.0, 5.6, 1.0, 4.4, 2.0, 6.7), List(120.0, 0.0, 1.1), List(60.0, 0.0, 3.4, 1.0, 2.5))
Explanation:
1. group by timestamp
2. in the grouped values, drop the redundant timestamps, and flatten to a single list
3. tack the timestamp back onto the flat list of ids-&-measurements
Here is a possibility:
input
.groupBy(_(0))
.map { case (tstp, values) => tstp :: values.flatMap(_.tail) }
The idea is to group the inner lists by their first element and then flatten the resulting values.
This returns:
List(List(180.0, 0.0, 5.6, 1.0, 4.4, 2.0, 6.7), List(120.0, 0.0, 1.1), List(60.0, 0.0, 3.4, 1.0, 2.5))
What about representing your measurements with a case class?
case class Measurement(timestamp: Int, id: Int, value: Double)
val measurementData = List(Measurement(60, 0, 3.4), Measurement(60, 1, 2.5),
Measurement(120, 0, 1.1), Measurement(180, 0, 5.6),
Measurement(180, 1, 4.4), Measurement(180, 2, 6.7))
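measurementData.foldLeft(List.empty[(Int, List[Measurement])]) {
  // same timestamp as the newest group (input assumed sorted by timestamp): add to that group
  case ((ts, ms) :: rest, m) if ts == m.timestamp => (ts, m :: ms) :: rest
  // first element or a new timestamp: start a new group at the front
  case (acc, m) => (m.timestamp, List(m)) :: acc
}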
I have a DataFrame that contains feature vectors created by the VectorAssembler. It also contains null values, which I now want to replace with a vector:
val nil = Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,1.0, 1.0, 1.0, 1.0, 1.0,1.0, 1.0, 1.0, 1.0, 1.0)
df.na.fill(nil) // does not work.
What is the right way to do this?
EDIT:
I found a way thanks to the answer:
val nil = Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,1.0, 1.0, 1.0, 1.0, 1.0,1.0, 1.0, 1.0, 1.0, 1.0)
import sc.implicits._
var fill = Seq(Tuple1(nil)).toDF("replacement")
val dates = data.schema.fieldNames.filter(e => e.contains("1"))
data = data.crossJoin(broadcast(fill))
for(e <- dates){
data = data.withColumn(e, coalesce(data.col(e), $"replacement"))
}
data = data.drop("replacement")
If the nulls come from additional rows that were added, you can cross join with a one-row replacement DataFrame and coalesce:
import org.apache.spark.sql.functions._
val df = Seq((1, None), (2, Some(nil))).toDF("id", "vector")
val fill = Seq(Tuple1(nil)).toDF("replacement")
df.crossJoin(broadcast(fill)).withColumn("vector", coalesce($"vector", $"replacement")).drop("replacement")
Suppose we have a simple graph like this:
val users = sc.parallelize(Array(
(1L, Seq("M", 2014, 40376, null, "N", 1, "Rajastan")),
(2L, Seq("M", 2009, 20231, null, "N", 1, "Rajastan")),
(3L, Seq("F", 2016, 40376, null, "N", 1, "Rajastan"))
))
val edges = sc.parallelize(Array(
Edge(1L, 2L, ""),
Edge(1L, 3L, ""),
Edge(2L, 3L, "")))
val graph = Graph(users, edges)
I'd like to compute how much each vertex is similar to its neighbors on each attribute.
The ideal output (an RDD or DataFrame) would hold these results:
1L: 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0
2L: 0.5, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0
3L: 0.0, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0
For instance, the first value for 1L means that, of its 2 neighbors, just 1 shares the same value...
I am playing with aggregateMessages just to count how many neighbors have a similar attribute value, but to no avail so far:
val result = graph.aggregateMessages[(Int, Seq[Any])](
// build the message
sendMsg = {
// map function
triplet =>
// send message to destination vertex
triplet.sendToDst(1, triplet.srcAttr)
// send message to source vertex
triplet.sendToSrc(1, triplet.dstAttr)
}, // trying to count neighbors with similar property
{ case ((cnt1, sender), (cnt2, receiver)) =>
val prop1 = if(sender(0) == receiver(0)) 1d else 0d
val prop2 = if(Math.abs(sender(1).asInstanceOf[Int] - receiver(1).asInstanceOf[Int])<3) 1d else 0d
val prop3 = if(sender(2) == receiver(2)) 1d else 0d
val prop4 = if(sender(3) == receiver(3)) 1d else 0d
val prop5 = if(sender(4) == receiver(4)) 1d else 0d
val prop6 = if(sender(5) == receiver(5)) 1d else 0d
val prop7 = if(sender(6) == receiver(6)) 1d else 0d
(cnt1 + cnt2, Seq(prop1, prop2, prop3, prop4, prop5, prop6, prop7))
}
)
This gives me the correct neighborhood size for each vertex, but it does not sum up the values correctly:
//> (1,(2,List(0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0)))
//| (2,(2,List(0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)))
//| (3,(2,List(1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0)))
It doesn't sum the values because there is no sum in your code. Moreover, your logic is wrong: mergeMsg receives two messages, not a (message, current) pair, so the comparison has to happen in sendMsg and the merge should only add things up. Try something like this:
import breeze.linalg.DenseVector
def compareAttrs(xs: Seq[Any], ys: Seq[Any]) =
DenseVector(xs.zip(ys).map{ case (x, y) => if (x == y) 1L else 0L}.toArray)
val result = graph.aggregateMessages[(Long, DenseVector[Long])](
triplet => {
val comparedAttrs = compareAttrs(triplet.dstAttr, triplet.srcAttr)
triplet.sendToDst(1L, comparedAttrs)
triplet.sendToSrc(1L, comparedAttrs)
},
{ case ((cnt1, v1), (cnt2, v2)) => (cnt1 + cnt2, v1 + v2) }
)
result.mapValues(kv => (kv._2.map(_.toDouble) / kv._1.toDouble)).collect
// Array(
// (1,DenseVector(0.5, 0.0, 0.5, 1.0, 1.0, 1.0, 1.0)),
// (2,DenseVector(0.5, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0)),
// (3,DenseVector(0.0, 0.0, 0.5, 1.0, 1.0, 1.0, 1.0)))
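To make the send/merge split concrete without a Spark cluster, here is a small plain-Python simulation of the same flow. It is only an illustrative sketch, not GraphX: the names `compare_attrs` and `neighbor_similarity` and the dict-based bookkeeping are made up. Each edge sends one `(1, comparison vector)` message to both endpoints, and merging only adds counts and vectors element-wise.

```python
def compare_attrs(xs, ys):
    """Return 1 where the two attribute lists agree and 0 where they differ."""
    return [1 if x == y else 0 for x, y in zip(xs, ys)]

def neighbor_similarity(vertices, edges):
    merged = {}  # vertex id -> (message count, element-wise sum of comparison vectors)

    def send(vertex_id, message):
        # Merge step: it only ever combines two messages of the same shape.
        if vertex_id in merged:
            count, total = merged[vertex_id]
            merged[vertex_id] = (count + message[0],
                                 [a + b for a, b in zip(total, message[1])])
        else:
            merged[vertex_id] = message

    for src, dst in edges:
        # Send step: compare the two endpoints once and message both of them.
        compared = compare_attrs(vertices[src], vertices[dst])
        send(dst, (1, compared))
        send(src, (1, compared))

    # Divide each summed vector by the number of neighbors.
    return {v: [s / count for s in total] for v, (count, total) in merged.items()}

# Same vertices and edges as the graph in the question (null becomes None).
vertices = {
    1: ["M", 2014, 40376, None, "N", 1, "Rajastan"],
    2: ["M", 2009, 20231, None, "N", 1, "Rajastan"],
    3: ["F", 2016, 40376, None, "N", 1, "Rajastan"],
}
edges = [(1, 2), (1, 3), (2, 3)]
similarity = neighbor_similarity(vertices, edges)
print(similarity)
```

Like the Scala answer, this uses plain equality for every attribute, so the second value is 0.0 everywhere rather than reflecting the "within 3 years" rule from the question.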
I have these documents in my MongoDB collection:
{x: 1, y: 60, z:100}
{x: 1, y: 60, z:100}
{x: 1, y: 60, z:100}
{x: 2, y: 60, z:100}
{x: 2, y: 60, z:100}
{x: 3, y: 60, z:100}
{x: 4, y: 60, z:100}
{x: 4, y: 60, z:100}
{x: 5, y: 60, z:100}
{x: 6, y: 60, z:100}
{x: 6, y: 60, z:100}
{x: 6, y: 60, z:100}
{x: 7, y: 60, z:100}
{x: 7, y: 60, z:100}
I want to query the distinct values of x (i.e. [1, 2, 3, 4, 5, 6, 7]) ... but I only want a part of them (similar to what we can obtain with skip(a) and limit(b)).
How do I do that with the Java driver for MongoDB (or with spring-data-mongodb, if possible)?
In the mongo shell this is simple with the aggregation framework:
db.collection.aggregate([{$group:{_id:'$x'}}, {$skip:3}, {$limit:5}])
For Java, see: use aggregation framework in java
Depending on your use case, you may find this approach to be more performant than aggregation. Here's a mongo shell example function.
function getDistinctValues(skip, limit) {
var q = {x:{$gt: MinKey()}}; // query
var s = {x:1}; // sort key
var results = [];
for(var i = 0; i < skip; i++) {
var result = db.test.find(q).limit(1).sort(s).toArray()[0];
if(!result) {
return results;
}
q.x.$gt = result.x;
}
for(var i = 0; i < limit; i++) {
var result = db.test.find(q).limit(1).sort(s).toArray()[0];
if(!result) {
break;
}
results.push(result.x);
q.x.$gt = result.x;
}
return results;
}
We are basically just finding the values one at a time, using the query and sort to skip past values we have already seen. You can easily improve on this by adding more arguments to make the function more flexible. Also, creating an index on the property you want to find distinct values for will improve performance.
A less obvious improvement would be to skip the "skip" phase altogether and specify a value to continue from. Here's a mongo shell example function.
function getDistinctValues(limit, lastValue) {
var q = {x:{$gt: lastValue === undefined ? MinKey() : lastValue}}; // query
var s = {x:1}; // sort key
var results = [];
for(var i = 0; i < limit; i++) {
var result = db.test.find(q).limit(1).sort(s).toArray()[0];
if(!result) {
break;
}
results.push(result.x);
q.x.$gt = result.x;
}
return results;
}
If you do decide to go with the aggregation technique, make sure you add a $sort stage after the $group stage. Otherwise your results will not show up in a predictable order.
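As a sketch of that stage order, the helper below builds the pipeline as plain data, with `$sort` placed between `$group` and `$skip`. The function name `distinct_page_pipeline` is made up for illustration; only the stage operators themselves come from MongoDB.

```python
def distinct_page_pipeline(field, skip, limit):
    """Build an aggregation pipeline that pages through the distinct values of a field."""
    return [
        {"$group": {"_id": "$" + field}},  # one document per distinct value
        {"$sort": {"_id": 1}},             # fix the order before paging
        {"$skip": skip},
        {"$limit": limit},
    ]

pipeline = distinct_page_pipeline("x", 3, 5)
print(pipeline)
```

Assuming a PyMongo collection object, this list could be passed to its `aggregate` method; the same four stages in the same order apply in the mongo shell or the Java driver.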