Spark: fill DataFrame nulls with a Vector - Scala

I have a DataFrame that contains feature vectors created by VectorAssembler; it also contains null values. I now want to replace the null values with a default vector:
val nil = Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
df.na.fill(nil) // does not work: na.fill has no overload for vector types
What is the right way to do this?
EDIT:
I found a way thanks to the answer:
val nil = Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
import spark.implicits._ // the implicits live on the SparkSession, not the SparkContext
val fill = Seq(Tuple1(nil)).toDF("replacement")
val dates = data.schema.fieldNames.filter(e => e.contains("1")) // the columns to fill, selected by name
data = data.crossJoin(broadcast(fill))
for (e <- dates) {
  data = data.withColumn(e, coalesce(data.col(e), $"replacement"))
}
data = data.drop("replacement")

If the problem was created by a join adding additional rows, you can cross join with a one-row replacement DataFrame and coalesce:
import org.apache.spark.sql.functions._
val df = Seq((1, None), (2, Some(nil))).toDF("id", "vector")
val fill = Seq(Tuple1(nil)).toDF("replacement")
df.crossJoin(broadcast(fill))
  .withColumn("vector", coalesce($"vector", $"replacement"))
  .drop("replacement")
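If you'd rather avoid the cross join, a zero-argument UDF is another option. This is just a sketch reusing the nil default vector from above; udf and coalesce come from the org.apache.spark.sql.functions import already shown:
// zero-argument UDF that always returns the default vector
val fillVector = udf(() => nil)
// non-null vectors pass through untouched; nulls get the default
df.withColumn("vector", coalesce($"vector", fillVector()))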

Related

Swift reduce tuple array custom data model

I am attempting to reduce the "y" value components for each similar "x" value in this CustomDataModel array, and return a CustomDataModel array (or simply an array of tuples, [(Double, Double)]).
My knowledge of Swift is rudimentary at best, and my attempt to use a function chain, as follows, reduces across the entire array and not per "x" value. I'm not sure how to limit the reduce function to only the like values.
let reducedArray = yValues.filter( {$0.x == $0.x}).map({$0.y}).reduce(0, +)
The above function reduces all y values, not the y values per x.
Below is the model and some dummy data:
struct CustomDataModel {
    var x: Double
    var y: Double
}
let yValues: [CustomDataModel] = [
    CustomDataModel(x: 0.0, y: 10.0),
    CustomDataModel(x: 1.0, y: 5.0),
    CustomDataModel(x: 1.0, y: 12.5),
    CustomDataModel(x: 1.0, y: 14.5),
    CustomDataModel(x: 1.0, y: 18.45),
    CustomDataModel(x: 5.0, y: 11.4),
    CustomDataModel(x: 5.0, y: 9.4),
    CustomDataModel(x: 5.0, y: 18.4),
    CustomDataModel(x: 5.0, y: 5.0),
    CustomDataModel(x: 9.0, y: 7.6),
    CustomDataModel(x: 9.0, y: 13.5),
    CustomDataModel(x: 9.0, y: 18.5),
    CustomDataModel(x: 9.0, y: 17.6),
    CustomDataModel(x: 9.0, y: 14.3),
    CustomDataModel(x: 14.0, y: 19.6),
    CustomDataModel(x: 14.0, y: 17.8),
    CustomDataModel(x: 14.0, y: 20.1),
    CustomDataModel(x: 14.0, y: 21.5),
    CustomDataModel(x: 14.0, y: 23.4),
]
Ideally the output would look like this:
print(reducedArray)
//[(0.0, 10.0), (1.0, 50.45), (5.0, 44.2), (9.0, 71.5), (14.0, 102.4)]
You need to group the yValues array by x into a Dictionary, map each group's values to y, and then reduce. Here's how:
let reducedArray = Dictionary(grouping: yValues, by: \.x).mapValues({ $0.map(\.y).reduce(0, +) })
print(reducedArray)
Update: If you require the result in [(Double, Double)] format, you can map the above result to tuples.
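For example, a minimal sketch using the reducedArray dictionary from above; dictionary order is unspecified, so sort by x to match the expected output:
let tupleArray = reducedArray
    .map { ($0.key, $0.value) }
    .sorted { $0.0 < $1.0 }
print(tupleArray)
// [(0.0, 10.0), (1.0, 50.45), (5.0, 44.2), (9.0, 71.5), (14.0, 102.4)]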

Rolling sum over a Seq with Date as key

I'm working on a small personal project and am trying to rewrite the codebase from Python to Scala so that I can become a more competent functional programmer.
I am working with a Seq that contains stock data and need to create a running sum of volume traded for each day.
My code so far is:
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat
case class SymbolData(date: DateTime, open: Double, high: Double, low: Double, close: Double, adjClose: Double, volume: Int)
def dateTimeHelper(date: String): DateTime = {
  DateTimeFormat.forPattern("yyyy-MM-dd").parseDateTime(date)
}
val sampleData: Seq[SymbolData] = Seq(
  SymbolData(dateTimeHelper("2019-01-01"), 1.0, 1.0, 1.0, 1.0, 1.0, 10),
  SymbolData(dateTimeHelper("2019-01-02"), 3.0, 2.0, 5.0, 2.0, 8.0, 20),
  SymbolData(dateTimeHelper("2019-01-03"), 1.0, 1.0, 1.0, 1.0, 1.0, 10),
  SymbolData(dateTimeHelper("2019-01-04"), 4.0, 3.0, 2.5, 2.3, 5.3, 7))
Not all dates may be present, so I do not think a sliding window will be appropriate. For the output I need a Seq of Ints containing the volume summed over the last 2 days (the current day and the previous day), for example:
Seq(10, 30, 30, 17) // 2019-01-01 is just 10 since there is no data for 2018-12-31; 2019-01-02 is 30 since both the 1st and 2nd of Jan are present, etc.
This is not overly difficult in plain Python, but with Scala there seem to be quite a few options (recursive use of folds?) and I am struggling with the syntax and implementation. Would anyone be able to shed some light on this?
You say "not all dates may be present" but you don't specify how date gaps should be handled.
Here I guessed that output should include all 2-day sums, gap days included.
import java.time.LocalDate
import java.time.temporal.ChronoUnit.DAYS
case class SymbolData(date:     LocalDate,
                      open:     Double,
                      high:     Double,
                      low:      Double,
                      close:    Double,
                      adjClose: Double,
                      volume:   Int)
val sampleData: List[SymbolData] = List(
  SymbolData(LocalDate.parse("2019-01-01"), 1.0, 1.0, 1.0, 1.0, 1.0, 10),
  SymbolData(LocalDate.parse("2019-01-02"), 3.0, 2.0, 5.0, 2.0, 8.0, 20),
  SymbolData(LocalDate.parse("2019-01-03"), 1.0, 1.0, 1.0, 1.0, 1.0, 10),
  SymbolData(LocalDate.parse("2019-01-04"), 4.0, 3.0, 2.5, 2.3, 5.3, 7),
  // 1 day gap
  SymbolData(LocalDate.parse("2019-01-06"), 4.4, 3.3, 2.2, 2.3, 1.3, 13),
  // 2 day gap
  SymbolData(LocalDate.parse("2019-01-09"), 2.4, 2.2, 1.5, 3.1, 0.9, 21),
  SymbolData(LocalDate.parse("2019-01-10"), 2.4, 2.2, 1.5, 3.1, 0.9, 11)
)
val volByDate = sampleData.foldLeft(Map.empty[LocalDate, Int]) {
  case (m, sd) => m + (sd.date -> sd.volume)
}.withDefaultValue(0)
val startDate = sampleData.head.date
val endDate = sampleData.last.date
val rslt = List.unfold(startDate) { date => // List.unfold requires Scala 2.13
  if (date isAfter endDate) None
  else Some(volByDate(date) + volByDate(date.minus(1L, DAYS)) -> date.plus(1L, DAYS))
}
//rslt: List[Int] = List(10, 30, 30, 17, 7, 13, 13, 0, 21, 32)
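If instead you want one sum per input row, skipping gap days entirely (matching your expected Seq(10, 30, 30, 17)), here is a sketch using sliding over the same sampleData:
// pair each row with its predecessor; only add the predecessor's volume
// when the two dates are actually consecutive
val perRow = sampleData.head.volume ::
  sampleData.sliding(2).map {
    case List(prev, cur) =>
      if (DAYS.between(prev.date, cur.date) == 1) prev.volume + cur.volume
      else cur.volume
  }.toList
//perRow: List[Int] = List(10, 30, 30, 17, 13, 21, 32) with the gap-day sample above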

How to make two arrays/datasets the same size in Spark Scala?

I have two arrays/datasets:
scala> data1.collect
res2: Array[Array[Double]] = Array(Array(1.0, 100.0), Array(0.7, 100.0), Array(0.8, 50.0))
scala> data2.collect
res3: Array[Array[Double]] = Array(Array(0.25, 0.0, 0.0), Array(1.0, 125.0, 0.0), Array(0.5, 0.0, 20.0), Array(0.5, 0.0, 15.0))
I want data1 and data2 to have the same size (the same number of inner arrays; data1 has 3 and data2 has 4). I want to append as many Array(0.0, 0.0) rows to data1 as are needed to match data2's size.
How can I do that?
First, find out how many new rows you need to add to the data1 dataset. Using some example data from the question:
import spark.implicits._ // needed for toDF and as

val data1 = Seq(Seq(1.0, 100.0), Seq(0.7, 100.0), Seq(0.8, 50.0))
  .toDF("col1").as[Array[Double]]
val data2 = Seq(Seq(0.25, 0.0, 0.0), Seq(1.0, 125.0, 0.0), Seq(0.5, 0.0, 20.0), Seq(0.5, 0.0, 15.0))
  .toDF("col1").as[Array[Double]]
val diff = data2.count() - data1.count()
In this case diff will have a value of 1.
Next, create a new dataset with the appropriate number of rows, containing only the Array(0.0, 0.0) rows that should be appended. Then add this new dataset to data1 using union:
val appendData = Seq.fill(diff.toInt)(Array(0.0, 0.0)).toDF("col1").as[Array[Double]]
val data3 = data1.union(appendData)
Result:
+------------+
| col1|
+------------+
|[1.0, 100.0]|
|[0.7, 100.0]|
| [0.8, 50.0]|
| [0.0, 0.0]|
+------------+
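If you don't know up front which of the two datasets is the shorter one, the same idea generalizes; this is just a sketch reusing the names above:
// pad whichever side is shorter with Array(0.0, 0.0) rows
val delta = data2.count() - data1.count()
val pad = Seq.fill(math.abs(delta).toInt)(Array(0.0, 0.0)).toDF("col1").as[Array[Double]]
val (padded1, padded2) =
  if (delta > 0) (data1.union(pad), data2) else (data1, data2.union(pad))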

Merge List[List[_]] conditionally

I want to merge a List[List[Double]] based on the values of the elements in the inner Lists. Here's what I have so far:
// inner Lists are (timestamp, ID, measurement)
val data = List(List(60, 0, 3.4), List(60, 1, 2.5), List(120, 0, 1.1),
                List(180, 0, 5.6), List(180, 1, 4.4), List(180, 2, 6.7))
data.foldLeft(List[List[Double]]()) { (ret, ll) =>
  // if this is the first list, just add it to the return val
  if (ret.isEmpty) {
    List(ll)
  // if the timestamps match, add a new (ID, measurement) pair to this inner list
  } else if (ret(0)(0) == ll(0)) {
    ((ret(0) :+ ll(1)) :+ ll(2)) :: ret.drop(1)
  // if this is a new timestamp, add it to the beginning of the return val
  } else {
    ll :: ret
  }
}
This works, but it doesn't smell optimal to me (especially the right-additions ':+'). For my use case, I have a pretty big (~25,000 inner Lists) List of elements, which are themselves all length-3 Lists. At most, there will be a fourfold degeneracy, because the inner lists are List(timestamp, ID, measurement) groups, and there are only four unique IDs. Essentially, I want to smush together all of the measurements that have the same timestamps.
Does anyone see a more optimal way of doing this?
I actually start with a List[Double] of timestamps and a List[Double] of measurements for each of the four IDs, if there's a better way of starting from that point.
Here is a slightly shorter way to do it:
data
  .groupBy(_(0))
  .mapValues(_.flatMap(_.tail))
  .toList
  .map(kv => kv._1 :: kv._2)
The result is exactly the same, 1:1, as what your algorithm produces; both outputs are shown below:
List(List(180.0, 0.0, 5.6, 1.0, 4.4, 2.0, 6.7), List(120.0, 0.0, 1.1), List(60.0, 0.0, 3.4, 1.0, 2.5))
List(List(180.0, 0.0, 5.6, 1.0, 4.4, 2.0, 6.7), List(120.0, 0.0, 1.1), List(60.0, 0.0, 3.4, 1.0, 2.5))
Explanation:
group by timestamp
in the grouped values, drop the redundant timestamps, and flatten to a single list
tack the timestamp back onto the flat list of ids-&-measurements
Here is a possibility:
data
  .groupBy(_(0))
  .map { case (tstp, values) => tstp :: values.flatMap(_.tail) }
The idea is just to group inner lists by their first element and then flatten the resulting values.
which returns:
List(List(180.0, 0.0, 5.6, 1.0, 4.4, 2.0, 6.7), List(120.0, 0.0, 1.1), List(60.0, 0.0, 3.4, 1.0, 2.5))
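On Scala 2.13 the grouping and flattening can also be done in one pass with groupMapReduce; a sketch over the same data:
data
  .groupMapReduce(_.head)(_.tail)(_ ++ _) // Map(timestamp -> ids & measurements)
  .toList
  .map { case (tstp, rest) => tstp :: rest }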
What about representing your measurements with a case class?
case class Measurement(timestamp: Int, id: Int, value: Double)
val measurementData = List(
  Measurement(60, 0, 3.4), Measurement(60, 1, 2.5),
  Measurement(120, 0, 1.1), Measurement(180, 0, 5.6),
  Measurement(180, 1, 4.4), Measurement(180, 2, 6.7))
// group consecutive measurements that share a timestamp
measurementData.foldLeft(List.empty[List[Measurement]]) {
  // first measurement: start the first group
  case (Nil, m) => List(List(m))
  // same timestamp as the current group: add the measurement to it
  case (group :: rest, m) if group.head.timestamp == m.timestamp => (m :: group) :: rest
  // new timestamp: start a new group
  case (groups, m) => List(m) :: groups
}
// List(List(Measurement(180,2,6.7), Measurement(180,1,4.4), Measurement(180,0,5.6)),
//      List(Measurement(120,0,1.1)),
//      List(Measurement(60,1,2.5), Measurement(60,0,3.4)))

How to compute vertex similarity to neighbors in GraphX

Suppose we have a simple graph like:
val users = sc.parallelize(Array(
  (1L, Seq("M", 2014, 40376, null, "N", 1, "Rajastan")),
  (2L, Seq("M", 2009, 20231, null, "N", 1, "Rajastan")),
  (3L, Seq("F", 2016, 40376, null, "N", 1, "Rajastan"))
))
val edges = sc.parallelize(Array(
  Edge(1L, 2L, ""),
  Edge(1L, 3L, ""),
  Edge(2L, 3L, "")))
val graph = Graph(users, edges)
I'd like to compute how similar each vertex is to its neighbors on each attribute.
The ideal output (an RDD or DataFrame) would hold these results:
1L: 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0
2L: 0.5, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0
3L: 0.0, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0
For instance, the first value for 1L means that of its 2 neighbors, just 1 shares the same value...
I am playing with aggregateMessages just to count how many neighbors share an attribute value, but to no avail so far:
val result = graph.aggregateMessages[(Int, Seq[Any])](
  // build the message
  sendMsg = {
    // map function
    triplet =>
      // send message to destination vertex
      triplet.sendToDst(1, triplet.srcAttr)
      // send message to source vertex
      triplet.sendToSrc(1, triplet.dstAttr)
  }, // trying to count neighbors with similar property
  { case ((cnt1, sender), (cnt2, receiver)) =>
    val prop1 = if (sender(0) == receiver(0)) 1d else 0d
    val prop2 = if (Math.abs(sender(1).asInstanceOf[Int] - receiver(1).asInstanceOf[Int]) < 3) 1d else 0d
    val prop3 = if (sender(2) == receiver(2)) 1d else 0d
    val prop4 = if (sender(3) == receiver(3)) 1d else 0d
    val prop5 = if (sender(4) == receiver(4)) 1d else 0d
    val prop6 = if (sender(5) == receiver(5)) 1d else 0d
    val prop7 = if (sender(6) == receiver(6)) 1d else 0d
    (cnt1 + cnt2, Seq(prop1, prop2, prop3, prop4, prop5, prop6, prop7))
  }
)
This gives me the correct neighborhood size for each vertex, but it is not summing up the values correctly:
//> (1,(2,List(0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0)))
//| (2,(2,List(0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)))
//| (3,(2,List(1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0)))
It doesn't sum the values because there is no sum in your code. Moreover, your logic is wrong: mergeMsg receives pairs of messages, not (message, current value) pairs. Try something like this:
import breeze.linalg.DenseVector

def compareAttrs(xs: Seq[Any], ys: Seq[Any]) =
  DenseVector(xs.zip(ys).map { case (x, y) => if (x == y) 1L else 0L }.toArray)

val result = graph.aggregateMessages[(Long, DenseVector[Long])](
  triplet => {
    val comparedAttrs = compareAttrs(triplet.dstAttr, triplet.srcAttr)
    triplet.sendToDst(1L, comparedAttrs)
    triplet.sendToSrc(1L, comparedAttrs)
  },
  { case ((cnt1, v1), (cnt2, v2)) => (cnt1 + cnt2, v1 + v2) }
)
result.mapValues(kv => (kv._2.map(_.toDouble) / kv._1.toDouble)).collect
// Array(
// (1,DenseVector(0.5, 0.0, 0.5, 1.0, 1.0, 1.0, 1.0)),
// (2,DenseVector(0.5, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0)),
// (3,DenseVector(0.0, 0.0, 0.5, 1.0, 1.0, 1.0, 1.0)))
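Note that compareAttrs checks plain equality on every attribute. If, as in your prop2, the year (index 1) should match within a tolerance of 3, a variant could special-case that position; compareAttrsTol is a hypothetical name and this is only a sketch:
// sketch: exact match everywhere except index 1, where a difference
// of less than 3 counts as a match
def compareAttrsTol(xs: Seq[Any], ys: Seq[Any]) =
  DenseVector(xs.zip(ys).zipWithIndex.map {
    case ((x: Int, y: Int), 1) => if (math.abs(x - y) < 3) 1L else 0L
    case ((x, y), _)           => if (x == y) 1L else 0L
  }.toArray)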