How to compute vertex similarity to neighbors in GraphX - Scala

Suppose we have a simple graph like:
val users = sc.parallelize(Array(
  (1L, Seq("M", 2014, 40376, null, "N", 1, "Rajastan")),
  (2L, Seq("M", 2009, 20231, null, "N", 1, "Rajastan")),
  (3L, Seq("F", 2016, 40376, null, "N", 1, "Rajastan"))
))
val edges = sc.parallelize(Array(
  Edge(1L, 2L, ""),
  Edge(1L, 3L, ""),
  Edge(2L, 3L, "")))
val graph = Graph(users, edges)
I'd like to compute how much each vertex is similar to its neighbors on each attribute.
The ideal output (an RDD or DataFrame) would hold these results:
1L: 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0
2L: 0.5, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0
3L: 0.0, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0
For instance, the first value for 1L means that out of 2 neighbors, just 1 shares the same value.
I am playing with aggregateMessages just to count how many neighbors have a similar attribute value, but to no avail so far:
val result = graph.aggregateMessages[(Int, Seq[Any])](
  // build the message
  sendMsg = {
    // map function
    triplet =>
      // send message to destination vertex
      triplet.sendToDst(1, triplet.srcAttr)
      // send message to source vertex
      triplet.sendToSrc(1, triplet.dstAttr)
  },
  // trying to count neighbors with similar property
  { case ((cnt1, sender), (cnt2, receiver)) =>
    val prop1 = if (sender(0) == receiver(0)) 1d else 0d
    val prop2 = if (Math.abs(sender(1).asInstanceOf[Int] - receiver(1).asInstanceOf[Int]) < 3) 1d else 0d
    val prop3 = if (sender(2) == receiver(2)) 1d else 0d
    val prop4 = if (sender(3) == receiver(3)) 1d else 0d
    val prop5 = if (sender(4) == receiver(4)) 1d else 0d
    val prop6 = if (sender(5) == receiver(5)) 1d else 0d
    val prop7 = if (sender(6) == receiver(6)) 1d else 0d
    (cnt1 + cnt2, Seq(prop1, prop2, prop3, prop4, prop5, prop6, prop7))
  }
)
This gives me the correct neighborhood size for each vertex, but it is not summing up the values correctly:
//> (1,(2,List(0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0)))
//| (2,(2,List(0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)))
//| (3,(2,List(1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0)))

It doesn't sum the values because there is no sum in your code. Moreover, your logic is wrong: mergeMsg combines two messages, not a (message, current state) pair. Try something like this:
import breeze.linalg.DenseVector

def compareAttrs(xs: Seq[Any], ys: Seq[Any]) =
  DenseVector(xs.zip(ys).map { case (x, y) => if (x == y) 1L else 0L }.toArray)

val result = graph.aggregateMessages[(Long, DenseVector[Long])](
  triplet => {
    val comparedAttrs = compareAttrs(triplet.dstAttr, triplet.srcAttr)
    triplet.sendToDst(1L, comparedAttrs)
    triplet.sendToSrc(1L, comparedAttrs)
  },
  { case ((cnt1, v1), (cnt2, v2)) => (cnt1 + cnt2, v1 + v2) }
)

result.mapValues(kv => kv._2.map(_.toDouble) / kv._1.toDouble).collect
// Array(
// (1,DenseVector(0.5, 0.0, 0.5, 1.0, 1.0, 1.0, 1.0)),
// (2,DenseVector(0.5, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0)),
// (3,DenseVector(0.0, 0.0, 0.5, 1.0, 1.0, 1.0, 1.0)))
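Note that compareAttrs uses strict equality on every attribute, so the second value (the year) never matches here, unlike the "difference < 3" rule in your own attempt. A minimal sketch of a tolerance-aware variant (the name compareAttrsWithTolerance and the hard-coded index 1 are my assumptions):
import breeze.linalg.DenseVector

// Sketch: treat attribute index 1 (the year, stored as Int) as a match when
// the values differ by less than 3; use strict equality everywhere else.
def compareAttrsWithTolerance(xs: Seq[Any], ys: Seq[Any]): DenseVector[Long] =
  DenseVector(xs.zip(ys).zipWithIndex.map {
    case ((x: Int, y: Int), 1) => if (math.abs(x - y) < 3) 1L else 0L
    case ((x, y), _)           => if (x == y) 1L else 0L
  }.toArray)
Plugging this in place of compareAttrs should reproduce the 0.5 values in the second column of your ideal output.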

Related

Update specific indices of Seq by another Seq in Scala

I have a first seq, for example:
val s: Seq[Double] = List.fill(6)(0.0)
and a subsequence of the indices of s:
val subInd: Seq[Int] = List(2, 4, 5)
Now, what I want to do is update s at positions 2, 4 and 5 with another Seq, which has the same length as subInd:
val t: Seq[Double] = List(5.0, 6.0, 7.0)
such that:
updateBySeq(s, subInd, t) == List(0.0, 0.0, 5.0, 0.0, 6.0, 7.0)
I have searched on this site and found Update multiple values in a sequence, where the second answer comes close to the functionality I want.
However, the difference is that the function provided would update s at the indices contained in subInd with a single value. I, on the other hand, want them to correspond to multiple, unique values in a third Seq t.
I have tried various things, like using recursion and ListBuffers instead of Lists to incrementally update the elements of s, but they either left s unchanged or I got an error because I violated some immutability constraint.
This should work:
def updateListByIndexes[T](data: List[T])(indexes: List[Int], update: List[T]): List[T] = {
  val updateMap = (indexes lazyZip update).toMap
  data.iterator.zipWithIndex.map {
    case (elem, idx) => updateMap.getOrElse(key = idx, default = elem)
  }.toList
}
Which you can use like this:
val s = List.fill(6)(0.0)
// s: List[Double] = List(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
val subInd = List(2, 4, 5)
// subInd: List[Int] = List(2, 4, 5)
val t = List(5.0, 6.0, 7.0)
// t: List[Double] = List(5.0, 6.0, 7.0)
updateListByIndexes(s)(subInd, t)
// res: List[Double] = List(0.0, 0.0, 5.0, 0.0, 6.0, 7.0)
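Note that lazyZip was introduced in Scala 2.13. On 2.12 and earlier, a plain strict zip does the same job here (a sketch; the 212 suffix is just mine):
def updateListByIndexes212[T](data: List[T])(indexes: List[Int], update: List[T]): List[T] = {
  // Strict zip instead of lazyZip; builds one small intermediate list.
  val updateMap = indexes.zip(update).toMap
  data.zipWithIndex.map { case (elem, idx) => updateMap.getOrElse(idx, elem) }
}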

Merge List[List[_]] conditionally

I want to merge a List[List[Double]] based on the values of the elements in the inner Lists. Here's what I have so far:
// inner Lists are (timestamp, ID, measurement)
val data = List(List(60, 0, 3.4), List(60, 1, 2.5), List(120, 0, 1.1),
                List(180, 0, 5.6), List(180, 1, 4.4), List(180, 2, 6.7))
data.foldLeft(List[List[Double]]())((ret, ll) =>
  // if this is the first list, just add it to the return val
  if (ret.isEmpty) {
    List(ll)
  // if the timestamps match, add a new (ID, measurement) pair to this inner list
  } else if (ret(0)(0) == ll(0)) {
    {{ret(0) :+ ll(1)} :+ ll(2)} :: ret.drop(1)
  // if this is a new timestamp, add it to the beginning of the return val
  } else {
    ll :: ret
  }
)
This works, but it doesn't smell optimal to me (especially the right-appends :+). For my use case, I have a pretty big List (~25,000 inner Lists) whose elements are themselves all length-3 Lists. At most there will be a fourfold degeneracy, because the inner lists are List(timestamp, ID, measurement) groups and there are only four unique IDs. Essentially, I want to smush together all of the measurements that have the same timestamps.
Does anyone see a more optimal way of doing this?
I actually start with a List[Double] of timestamps and a List[Double] of measurements for each of the four IDs, if there's a better way of starting from that point.
Here is a slightly shorter way to do it:
data
  .groupBy(_(0))
  .mapValues(_.flatMap(_.tail))
  .toList
  .map(kv => kv._1 :: kv._2)
The result is exactly the same, 1:1, as what your algorithm produces (both outputs printed below):
List(List(180.0, 0.0, 5.6, 1.0, 4.4, 2.0, 6.7), List(120.0, 0.0, 1.1), List(60.0, 0.0, 3.4, 1.0, 2.5))
List(List(180.0, 0.0, 5.6, 1.0, 4.4, 2.0, 6.7), List(120.0, 0.0, 1.1), List(60.0, 0.0, 3.4, 1.0, 2.5))
Explanation:
group by timestamp
in the grouped values, drop the redundant timestamps, and flatten to single list
tack the timestamp back onto the flat list of ids-&-measurements
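A side note: on Scala 2.13, mapValues returns a lazy MapView, so you may want a .toMap there. A sketch of the same pipeline using 2.13's groupMap instead:
data
  .groupMap(_.head)(_.tail)  // group by timestamp, keep only the (id, measurement) tails
  .toList
  .map { case (ts, rest) => ts :: rest.flatten }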
Here is a possibility:
data
  .groupBy(_(0))
  .map { case (tstp, values) => tstp :: values.flatMap(_.tail) }
The idea is just to group the inner lists by their first element and then flatten the resulting values. This returns:
List(List(180.0, 0.0, 5.6, 1.0, 4.4, 2.0, 6.7), List(120.0, 0.0, 1.1), List(60.0, 0.0, 3.4, 1.0, 2.5))
What about representing your measurements with a case class?
case class Measurement(timestamp: Int, id: Int, value: Double)

val measurementData = List(
  Measurement(60, 0, 3.4), Measurement(60, 1, 2.5),
  Measurement(120, 0, 1.1), Measurement(180, 0, 5.6),
  Measurement(180, 1, 4.4), Measurement(180, 2, 6.7))

measurementData.foldLeft(List[Measurement]())({
  case (Nil, m) => List(m)
  case (x :: xs, m) if x.timestamp == m.timestamp => m :: xs
  case (xs, m) => m :: xs
})
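Note that this fold keeps only the last Measurement in each run of equal timestamps, so earlier measurements for that timestamp are dropped rather than merged. If you want the same merged shape as the other answers while keeping the case class, a sketch using groupBy:
measurementData
  .groupBy(_.timestamp)  // Map[Int, List[Measurement]]
  .map { case (ts, ms) => ts.toDouble :: ms.flatMap(m => List(m.id.toDouble, m.value)) }
  .toList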

Displaying output under a certain format

I'm quite new to Scala and Spark, and have some questions about displaying results in an output file.
I actually have a Map in which each key is associated with a list of lists (Map[Int, List[List[Double]]]), such as:
(2, List(List(x1, x2, x3), List(y1, y2, y3), ...)).
I am supposed to display for each key the values inside the lists of lists, such as:
2 x1,x2,x3
2 y1,y2,y3
1 z1,z2,z3
and so on.
When I use the saveAsTextFile function, it doesn't give me what I want in the output. Does anybody know how I can do it?
EDIT:
This is one of my functions:
def PrintCluster(vectorsByKey: Map[Int, List[Double]], vectCentroidPairs: Map[Int, Int]): Map[Int, List[Double]] = {
  var vectorsByCentroid: Map[Int, List[Double]] = Map()
  val SortedCentroid = vectCentroidPairs.groupBy(_._2).mapValues(x => x.map(_._1).toList).toSeq.sortBy(_._1).toMap
  SortedCentroid.foreach { case (centroid, vect) =>
    var nbVectors = vect.length
    for (i <- 0 to nbVectors - 1) {
      var vectValues = vectorsByKey(vect(i))
      println(centroid + " " + vectValues)
      vectorsByCentroid += (centroid -> vectValues)
    }
  }
  return vectorsByCentroid
}
I know it's wrong, because I can only associate one unique key with a group of values; that is why it returns only the first List for each key in the Map. I thought that to use the saveAsTextFile function I necessarily had to use a Map structure, but I don't really know.
Create a sample RDD as per your input data:
val rdd: RDD[Map[Int, List[List[Double]]]] = spark.sparkContext.parallelize(Seq(Map(
  2 -> List(List(-4.4, -2.0, 1.5), List(-3.3, -5.4, 3.9), List(-5.8, -3.3, 2.3), List(-5.2, -4.0, 2.8)),
  1 -> List(List(7.3, 1.0, -2.0), List(9.8, 0.4, -1.0), List(7.5, 0.3, -3.0), List(6.1, -0.5, -0.6), List(7.8, 2.2, -0.7), List(6.6, 1.4, -1.1), List(8.1, -0.0, 2.7)),
  3 -> List(List(-3.0, 4.0, 1.4), List(-4.0, 3.9, 0.8), List(-1.4, 4.3, -0.5), List(-1.6, 5.2, 1.0))
)))
Transform the RDD[Map[Int, List[List[Double]]]] into an RDD[(Int, String)]:
val result: RDD[(Int, String)] = rdd.flatMap { i =>
  i.map { case (x, y) => y.map(list => (x, list.mkString(" "))) }
}.flatMap(z => z)

result.foreach(println)
result.saveAsTextFile("location")
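One caveat (my note, not part of the answer): saveAsTextFile writes each element's toString, so an RDD[(Int, String)] produces lines like (2,-4.4 -2.0 1.5). To get plain key/value lines, render each pair to a single string first:
result
  .map { case (key, values) => s"$key $values" }  // one plain-text line per pair
  .saveAsTextFile("location")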
Printing a Map[Int, List[List[Double]]] in the wanted format is simple: first convert the map to a list, then apply flatMap. Using the data supplied in a comment:
val map: Map[Int, List[List[Double]]] = Map(
  2 -> List(List(-4.4, -2.0, 1.5), List(-3.3, -5.4, 3.9), List(-5.8, -3.3, 2.3), List(-5.2, -4.0, 2.8)),
  1 -> List(List(7.3, 1.0, -2.0), List(9.8, 0.4, -1.0), List(7.5, 0.3, -3.0), List(6.1, -0.5, -0.6), List(7.8, 2.2, -0.7), List(6.6, 1.4, -1.1), List(8.1, -0.0, 2.7)),
  3 -> List(List(-3.0, 4.0, 1.4), List(-4.0, 3.9, 0.8), List(-1.4, 4.3, -0.5), List(-1.6, 5.2, 1.0))
)

val list = map.toList.flatMap(t => t._2.map((t._1, _)))
val result = for (t <- list) yield t._1 + "\t" + t._2.mkString(",")

// Saving the result to a file
import java.io._
val pw = new PrintWriter(new File("fileName.txt"))
result.foreach(line => pw.println(line))
pw.close()
Will print out:
2 -4.4,-2.0,1.5
2 -3.3,-5.4,3.9
2 -5.8,-3.3,2.3
2 -5.2,-4.0,2.8
1 7.3,1.0,-2.0
1 9.8,0.4,-1.0
1 7.5,0.3,-3.0
1 6.1,-0.5,-0.6
1 7.8,2.2,-0.7
1 6.6,1.4,-1.1
1 8.1,-0.0,2.7
3 -3.0,4.0,1.4
3 -4.0,3.9,0.8
3 -1.4,4.3,-0.5
3 -1.6,5.2,1.0

Spark fill DataFrame with Vector for null

I have a DataFrame that contains feature vectors created by the VectorAssembler; it also contains null values. I now want to replace the null values with a vector:
val nil = Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
                        1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
df.na.fill(nil) // does not work
What is the right way to do this?
EDIT:
I found a way thanks to the answer:
val nil = Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
                        1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
import sc.implicits._

var fill = Seq(Tuple1(nil)).toDF("replacement")
val dates = data.schema.fieldNames.filter(e => e.contains("1"))

data = data.crossJoin(broadcast(fill))
for (e <- dates) {
  data = data.withColumn(e, coalesce(data.col(e), $"replacement"))
}
data = data.drop("replacement")
If the nulls are created by adding some additional rows, you can join with a replacement:
import org.apache.spark.sql.functions._

val df = Seq((1, None), (2, Some(nil))).toDF("id", "vector")
val fill = Seq(Tuple1(nil)).toDF("replacement")

df.crossJoin(broadcast(fill))
  .withColumn("vector", coalesce($"vector", $"replacement"))
  .drop("replacement")
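A possible alternative (my sketch, not part of the original answer) is to skip the join and patch the nulls with a UDF; this assumes the column holds org.apache.spark.ml.linalg.Vector values, as produced by VectorAssembler:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// For reference-typed arguments like Vector, Spark invokes the function with
// null on null input, so we can substitute the replacement vector ourselves.
val fillVec = udf { v: Vector => if (v == null) nil else v }
df.withColumn("vector", fillVec($"vector"))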

How to compare to previous values in Seq[Double]

I am new to functional programming. I have a Seq[Double] and I'd like to check, for each value, whether it is higher (1), lower (-1) or equal (0) compared to the previous value, like:
val g = Seq(0.1, 0.3, 0.5, 0.5, 0.5, 0.3)
and I'd like to have a result like:
val result = Seq(1, 1, 0, 0, 0, -1)
Is there a more concise way than:
val g = Seq(0.1, 0.3, 0.5, 0.5, 0.5, 0.3)
g.sliding(2).toList.map { xs =>
  if (xs(0) == xs(1)) 0
  else if (xs(0) > xs(1)) -1
  else 1
}
Use compare:
g.sliding(2).map { case Seq(x, y) => y compare x }.toList
compare is added by an enrichment trait called OrderedProxy.
That's rather concise in my opinion but I'd make it a function and pass it into map to make it more readable. I used pattern matching and guards.
//High, low, equal
scala> def hlo(x: Double, y: Double): Int = y - x match {
| case 0.0 => 0
| case x if x < 0.0 => -1
| case x if x > 0.0 => 1
| }
hlo: (x: Double, y: Double)Int
scala> g.sliding(2).map(xs => hlo(xs(0), xs(1))).toList
res9: List[Int] = List(1, 1, 0, 0, -1)
I agree with Travis Brown's comment from above so am proposing it as an answer.
Reversing the order of the values in the zip, just to match the order of g. This has the added benefit of using tuples instead of a sequence so no pattern matching is needed.
(g, g.tail).zipped.toList.map(t => t._2 compare t._1)
res0: List[Int] = List(1, 1, 0, 0, -1)