Modifying Spark GraphX PageRank to do random walk with restart - Scala

I am trying to implement random walk with restart by modifying the Spark GraphX implementation of the PageRank algorithm.
def randomWalkWithRestart(graph: Graph[VertexProperty, EdgeProperty], patientID: String, numIter: Int = 10, alpha: Double = 0.15, tol: Double = 0.01): Unit = {
  var rankGraph: Graph[Double, Double] = graph
    // Associate the degree with each vertex
    .outerJoinVertices(graph.outDegrees) { (vid, vdata, deg) => deg.getOrElse(0) }
    // Set the weight on the edges based on the degree
    .mapTriplets(e => 1.0 / e.srcAttr, TripletFields.Src)
    // Set the vertex attributes to the initial pagerank values
    .mapVertices((id, attr) => alpha)

  var iteration = 0
  var prevRankGraph: Graph[Double, Double] = null
  while (iteration < numIter) {
    rankGraph.cache()

    // Compute the outgoing rank contributions of each vertex, perform local preaggregation, and
    // do the final aggregation at the receiving vertices. Requires a shuffle for aggregation.
    val rankUpdates = rankGraph.aggregateMessages[Double](
      ctx => ctx.sendToDst(ctx.srcAttr * ctx.attr), _ + _, TripletFields.Src)

    // Apply the final rank updates to get the new ranks, using join to preserve ranks of vertices
    // that didn't receive a message. Requires a shuffle for broadcasting updated ranks to the
    // edge partitions.
    prevRankGraph = rankGraph
    rankGraph = rankGraph.joinVertices(rankUpdates) {
      (id, oldRank, msgSum) => alpha + (1.0 - alpha) * msgSum
    }.cache()

    rankGraph.edges.foreachPartition(x => {}) // also materializes rankGraph.vertices
    //logInfo(s"PageRank finished iteration $iteration.")
    prevRankGraph.vertices.unpersist(false)
    prevRankGraph.edges.unpersist(false)

    iteration += 1
  }
}
I believe the (id, oldRank, msgSum) => alpha + (1.0 - alpha) * msgSum part should be changed, but I am not sure how: I need to add the restart probability to this line. Furthermore, the restart probability should be initialized somewhere before the while loop, and it has to be updated inside the while loop.
Any suggestions would be appreciated.
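For what it's worth, a minimal, untested sketch of the usual random-walk-with-restart variant of that loop: the restart mass alpha goes only to the query vertex, and the ranks are initialized to that restart vector instead of a constant. The lookup of the patient's VertexId from patientID is hypothetical, since it depends on what VertexProperty looks like.

import org.apache.spark.graphx._

// Hypothetical lookup: resolve the patient's VertexId from its String ID;
// adjust the predicate to however VertexProperty actually stores the ID.
val src: VertexId = graph.vertices
  .filter { case (_, prop) => prop.toString == patientID }
  .first()._1

// Initialize ranks to the restart vector: all probability mass on the query vertex.
var rankGraph: Graph[Double, Double] = graph
  .outerJoinVertices(graph.outDegrees) { (vid, vdata, deg) => deg.getOrElse(0) }
  .mapTriplets(e => 1.0 / e.srcAttr, TripletFields.Src)
  .mapVertices((id, attr) => if (id == src) 1.0 else 0.0)

// ... same iteration as above, except the join becomes:
rankGraph = rankGraph.joinVertices(rankUpdates) {
  (id, oldRank, msgSum) => (if (id == src) alpha else 0.0) + (1.0 - alpha) * msgSum
}.cache()

With this update each iteration computes r = alpha * q + (1 - alpha) * W * r, where q is the indicator vector of the query vertex, which is the standard random-walk-with-restart fixed point.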

Related

Matching Torch STFT with Accelerate

I'm trying to re-implement Torch's STFT code in Swift with Accelerate / vDSP, to produce a log mel spectrogram by post-processing the STFT, so I can use the mel spectrogram as an input for a CoreML port of OpenAI's Whisper.
PyTorch's native STFT / mel code produces one spectrogram (clipped due to importing raw float 32s into Photoshop, lol), and mine differs:
[images of the PyTorch spectrogram and my spectrogram omitted]
Obviously the two things to notice are the values, and the lifted frequency components.
The STFT docs here (https://pytorch.org/docs/stable/generated/torch.stft.html) define each element as:

X[ω, m] = Σ_{k=0}^{win_length-1} window[k] * input[m × hop_length + k] * exp(−j * (2π · ω · k) / win_length)
I believe I'm properly handling window[k] * input[m × hop_length + k], but I'm a bit lost as to how to calculate the exponent, what −j refers to in the documentation, and how to express the final exponential in vDSP. Also, if it's a sum, how do I get the 200 elements I need!?
[image: my log mel spectrogram omitted]
My code follows:
func processData(audio: [Int16]) -> [Float]
{
    assert(self.sampleCount == audio.count)

    var audioFloat: [Float] = [Float](repeating: 0, count: audio.count)
    vDSP.convertElements(of: audio, to: &audioFloat)
    vDSP.divide(audioFloat, 32768.0, result: &audioFloat)
    // Up to this point, Python and Swift are numerically identical

    // Insert numFFT/2 samples before and numFFT/2 after so we have an extra numFFT amount to process
    // TODO: Is this strictly necessary?
    audioFloat.insert(contentsOf: [Float](repeating: 0, count: self.numFFT / 2), at: 0)
    audioFloat.append(contentsOf: [Float](repeating: 0, count: self.numFFT / 2))

    // Split-complex arrays holding the FFT results
    var allSampleReal = [[Float]](repeating: [Float](repeating: 0, count: self.numFFT / 2), count: self.melSampleCount)
    var allSampleImaginary = [[Float]](repeating: [Float](repeating: 0, count: self.numFFT / 2), count: self.melSampleCount)

    // Step 2 - we need to create a 200 x 3000 matrix of STFTs - note we appear to want to output complex numbers (?)
    for m in 0 ..< self.melSampleCount
    {
        // Slice numFFT samples every hop count (barf) and make a mel spectrum out of it
        // audioFrame ends up holding split-complex numbers
        var audioFrame = Array<Float>(audioFloat[(m * self.hopCount) ..< ((m * self.hopCount) + self.numFFT)])

        // Copy of audioFrame's original samples
        let audioFrameOriginal = audioFrame

        assert(audioFrame.count == self.numFFT)

        // Split-complex arrays holding a single FFT result of our audio frame,
        // which gets appended to the allSample split-complex arrays
        var sampleReal: [Float] = [Float](repeating: 0, count: self.numFFT / 2)
        var sampleImaginary: [Float] = [Float](repeating: 0, count: self.numFFT / 2)

        sampleReal.withUnsafeMutableBytes { unsafeReal in
            sampleImaginary.withUnsafeMutableBytes { unsafeImaginary in

                vDSP.multiply(audioFrame,
                              hanningWindow,
                              result: &audioFrame)

                var complexSignal = DSPSplitComplex(realp: unsafeReal.bindMemory(to: Float.self).baseAddress!,
                                                    imagp: unsafeImaginary.bindMemory(to: Float.self).baseAddress!)

                audioFrame.withUnsafeBytes { unsafeAudioBytes in
                    vDSP.convert(interleavedComplexVector: [DSPComplex](unsafeAudioBytes.bindMemory(to: DSPComplex.self)),
                                 toSplitComplexVector: &complexSignal)
                }

                // Step 3 - creating the FFT
                self.fft.forward(input: complexSignal, output: &complexSignal)
            }
        }

        // We need to match: https://pytorch.org/docs/stable/generated/torch.stft.html
        // At this point, I'm unsure how to continue?
        // let twoπ = Float.pi * 2
        // let freqstep: Float = Float(16000 / (self.numFFT / 2))
        //
        // var w: Float = 0.0
        // for k in 0 ..< self.numFFT / 2
        // {
        //     let j: Float = sampleImaginary[k]
        //     let sample = audioFrame[k]
        //
        //     let exponent = -j * ((twoπ * freqstep * Float(k)) / Float(self.numFFT / 2))
        //
        //     w += powf(sample, exponent)
        // }

        allSampleReal[m] = sampleReal
        allSampleImaginary[m] = sampleImaginary
    }

    // We now have allSample split-complex arrays holding 3000 FFT results of 200 dimensions each
    // We create a flattened 3000 x 200 array of DSPSplitComplex values
    var flattenedReal: [Float] = allSampleReal.flatMap { $0 }
    var flattenedImaginary: [Float] = allSampleImaginary.flatMap { $0 }
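Not a full answer, but on the formula itself: the j in the docs is the imaginary unit, so the exponential is just the DFT's complex sinusoid. By Euler's formula,

exp(−j · θ) = cos(θ) − j · sin(θ)

and the sum over k is exactly what a forward FFT of the windowed frame computes, for every frequency bin ω at once. So there is nothing to evaluate by hand: self.fft.forward already produces X[ω, m], one complex value per bin (real part in sampleReal[ω], imaginary part in sampleImaginary[ω]), which is where the ~200 elements per frame come from. One mismatch to watch for, if I remember the packing correctly: torch.stft with one-sided output returns n_fft/2 + 1 bins, while vDSP's real-to-complex FFT returns n_fft/2 complex values with the DC and Nyquist terms packed together (and an overall factor-of-2 scaling), so the two outputs don't line up element-for-element without unpacking and rescaling.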

Optimize "range-join" in plain scala (not Spark!)

I have two ordered sequences: one (large) is a range of positions, the other (small) is a sequence of attributes defined on position_from/position_to, which I'd like to join.
So for each element of positions, I currently traverse the other sequence, which is not optimal:
def interpolateCurveOnPos(position: Seq[Double], curveAttributes: Seq[CurveSegment]) = {
  position.map { pos =>
    // range join
    val cs = curveAttributes.find(c => pos >= c.position_von && pos < c.position_bis).get
    // interpolate curve attribute
    cs.curve_von + (pos - cs.position_von) * (cs.curve_bis - cs.curve_von) / (cs.position_bis - cs.position_von)
  }
}
What I've tried:
As the index at which the matching CurveSegment is found will always increase, I've introduced some state variables to reduce the search for the correct entry:
def interpolateCurveOnPos(position: Seq[Double], curveAttributes: Seq[CurveSegment]) = {
  var idxSave = 0
  var csSave: CurveSegment = curveAttributes.head
  position.map { pos =>
    // range join, starting from the last index that matched
    val cs = curveAttributes.drop(idxSave).find(c => pos >= c.position_von && pos < c.position_bis).get
    if (cs != csSave) {
      csSave = cs
      idxSave = idxSave + 1
    }
    // interpolate
    cs.curve_von + (pos - cs.position_von) * (cs.curve_bis - cs.curve_von) / (cs.position_bis - cs.position_von)
  }
}
I wonder if there is a more elegant way to do it?
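One option, since both sequences are ordered: walk them together with a BufferedIterator, so the scan over curveAttributes never restarts, giving O(n + m) overall instead of O(n · m). A sketch under the stated ordering, assuming every position falls inside exactly one segment and a CurveSegment shaped as the code implies:

case class CurveSegment(position_von: Double, position_bis: Double,
                        curve_von: Double, curve_bis: Double)

def interpolateCurveOnPos(position: Seq[Double], curveAttributes: Seq[CurveSegment]): Seq[Double] = {
  val segments = curveAttributes.iterator.buffered
  position.map { pos =>
    // advance past segments that end at or before pos; the cursor never moves backwards
    while (segments.head.position_bis <= pos) segments.next()
    val cs = segments.head
    cs.curve_von + (pos - cs.position_von) * (cs.curve_bis - cs.curve_von) / (cs.position_bis - cs.position_von)
  }
}

The mutable cursor is confined to the iterator, so from the outside this is still a pure position => curve mapping.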

How to optimize this algorithm that finds all maximal matchings in a graph?

In my app people give grades to each other, out of ten points. Each day, an algorithm computes a match for as many people as possible (it's impossible to compute a match for everyone). It builds a graph where vertices are users and edges are the grades.
I simplify the problem by saying that if two people give a grade to each other, there is an edge between them whose weight is the average of their respective grades. But if A gives a grade to B and B doesn't, there is no edge between them and they can never match: this way, the graph is no longer directed.
I would like everybody to be happy on average, but at the same time I would like as few people as possible to have no match.
Being very deterministic, I made an algorithm that finds ALL maximal matchings in a graph. I did that because I thought I could analyse all these maximal matchings and apply a value function that could look like:
V(M) = exp(|M| / max|M|) * sum(weights of all edges in M)
That is to say, a matching is highly valued if its cardinality is close to the cardinality of the maximum matching, and if the sum of the grades between people is high. I put an exponential function on the ratio |M| / max|M| because I consider it a big problem if that ratio is lower than 0.8 (so the exp will be arranged to sharply decrease V as |M| / max|M| approaches 0.8).
I would then take the matching where V(M) is maximal. The big problem, though, is that my function that computes all maximal matchings takes a lot of time. For only 15 vertices and 20 edges, it takes almost 10 minutes...
Here is the algorithm (in Swift):
import Foundation

struct Edge: CustomStringConvertible {
    var description: String {
        return "e(\(v1), \(v2))"
    }

    let v1: Int
    let v2: Int
    let w: Int?

    init(_ arrint: [Int])
    {
        v1 = arrint[0]
        v2 = arrint[1]
        w = nil
    }

    init(_ v1: Int, _ v2: Int)
    {
        self.v1 = v1
        self.v2 = v2
        w = nil
    }

    init(_ v1: Int, _ v2: Int, _ w: Int)
    {
        self.v1 = v1
        self.v2 = v2
        self.w = w
    }
}
let mygraph: [Edge] =
[
    Edge([1, 2]),
    Edge([1, 5]),
    Edge([2, 5]),
    Edge([2, 3]),
    Edge([3, 4]),
    Edge([3, 6]),
    Edge([5, 6]),
    Edge([2, 6]),
    Edge([4, 1]),
    Edge([3, 5]),
    Edge([4, 2]),
    Edge([7, 1]),
    Edge([7, 2]),
    Edge([8, 1]),
    Edge([9, 8]),
    Edge([11, 2]),
    Edge([11, 8]),
    Edge([12, 13]),
    Edge([1, 6]),
    Edge([4, 7]),
    Edge([5, 7]),
    Edge([3, 5]),
    Edge([9, 1]),
    Edge([10, 11]),
    Edge([10, 4]),
    Edge([10, 2]),
    Edge([10, 1]),
    Edge([10, 12]),
]
// remove all the edges and vertices "touching" the edges and vertices in "edgePath"
func reduce(graph: [Edge], edgePath: [Edge]) -> [Edge]
{
    var alreadyUsedV: [Int] = []
    for edge in edgePath
    {
        alreadyUsedV.append(edge.v1)
        alreadyUsedV.append(edge.v2)
    }
    return graph.filter({ edge in
        return alreadyUsedV.first(where: { edge.v1 == $0 }) == nil && alreadyUsedV.first(where: { edge.v2 == $0 }) == nil
    })
}
func findAllMaximalMatching(graph Gi: [Edge]) -> [[Edge]]
{
    var matchings: [[Edge]] = []
    var G = Gi          // current graph (reduced at each depth)
    var M: [Edge] = []  // current matching being built
    var Cx: [Int] = []  // current path in the possibilities tree
                        // eg: Cx[1] = 3: for depth 1, we are at the 3rd edge
    var d: Int = 0      // current depth
    var debug_it = 0

    while(true)
    {
        if(G.count == 0) // if there is no available edge in the graph, it means we have a matching
        {
            if(M.count > 0) // security: if the initial graph is empty we cannot return an empty matching
            {
                matchings.append(M)
            }
            if(d == 0)
            {
                // depth = 0, we cannot decrement d, we have finished all the tree possibilities
                break
            }
            d = d - 1
            _ = M.popLast()
            G = reduce(graph: Gi, edgePath: M)
        }
        else
        {
            let indexForThisDepth = Cx.count > d ? Cx[d] + 1 : 0
            if(G.count < indexForThisDepth + 1)
            {
                // depth ended
                _ = Cx.popLast()
                if(d == 0)
                {
                    break
                }
                d = d - 1
                _ = M.popLast()
                // reduce from the initial graph down to the decremented depth
                G = reduce(graph: Gi, edgePath: M)
            }
            else
            {
                // matching not finished being built
                M.append(G[indexForThisDepth])
                if(indexForThisDepth == 0)
                {
                    Cx.append(indexForThisDepth)
                }
                else
                {
                    Cx[d] = indexForThisDepth
                }
                d = d + 1
                G = reduce(graph: G, edgePath: M)
            }
        }
        debug_it += 1
    }

    print("matching counts: \(matchings.count)")
    print("iterations: \(debug_it)")
    return matchings
}
let m = findAllMaximalMatching(graph: mygraph)
// we have computed all the maximal matchings; now we loop through all of them to find the one with maximal V(Mi)
// ....
Finally, my question is: how can I optimize this algorithm that finds all maximal matchings, and compute my value function on them to find the best matching for my app, in polynomial time?
I may be missing something since the question is quite complicated, but why not simply use the maximum flow problem, with every vertex appearing twice and the edge weights being the average grades where they exist? It will return the maximal flow if configured correctly, and it runs in polynomial time.
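One caveat worth adding: a graph can contain exponentially many maximal matchings, so any algorithm that enumerates all of them is inherently super-polynomial; to get polynomial time you have to search for a good matching without enumerating. As a cheap baseline in that spirit (not the max-flow reduction above, and not guaranteed to maximise V(M)), a greedy scan of the edges by descending weight always produces a maximal matching in O(E log E). A sketch, in Scala for brevity, with an Edge mirroring the struct above:

final case class Edge(v1: Int, v2: Int, w: Double)

// Take each edge, heaviest first, whenever both endpoints are still free.
// The result is always a maximal matching, though not necessarily the one
// that maximises V(M).
def greedyMatching(edges: Seq[Edge]): Seq[Edge] = {
  val used = scala.collection.mutable.Set.empty[Int]
  edges.sortBy(e => -e.w).filter { e =>
    val bothFree = !used(e.v1) && !used(e.v2)
    if (bothFree) { used += e.v1; used += e.v2 }
    bothFree
  }
}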

Functional version of a typical nested while loop

I hope this question may please functional programming lovers. Could I ask for a way to translate the following fragment of code into a pure functional implementation in Scala, with a good balance between readability and execution speed?
Description: for each element in a sequence, produce a sub-sequence containing the elements that come after the current element (including itself) with a distance smaller than a given threshold. Once the threshold is crossed, it is not necessary to process the remaining elements.
def getGroupsOfElements(input: Seq[Element]): Seq[Seq[Element]] = {
  val maxDistance = 10 // put any number you may
  var outerSequence = Seq.empty[Seq[Element]]
  for (index <- 0 until input.length) {
    var anotherIndex = index + 1
    var innerSequence = Seq(input(index))
    // let's assume a separate function for computing the distance
    while (anotherIndex < input.length && (input(index) - input(anotherIndex)) < maxDistance) {
      innerSequence = innerSequence ++ Seq(input(anotherIndex))
      anotherIndex = anotherIndex + 1
    }
    outerSequence = outerSequence ++ Seq(innerSequence)
  }
  outerSequence
}
You know, this would be a ton easier if you added a description of what you're trying to accomplish along with the code.
Anyway, here's something that might get close to what you want.
def getGroupsOfElements(input: Seq[Element]): Seq[Seq[Element]] =
  input.tails.filter(_.nonEmpty).map(x => x.takeWhile(y => distance(x.head, y) < maxDistance)).toSeq
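A quick check of that version, assuming integer elements and absolute difference as the distance (Element and distance aren't shown in the question, so both are stand-ins):

val maxDistance = 10
def distance(a: Int, b: Int): Int = (b - a).abs

def getGroupsOfElements(input: Seq[Int]): Seq[Seq[Int]] =
  input.tails.filter(_.nonEmpty) // drop the trailing empty tail, whose head would throw
    .map(x => x.takeWhile(y => distance(x.head, y) < maxDistance))
    .toSeq

getGroupsOfElements(Seq(1, 3, 8, 20, 21))
// => Seq(Seq(1, 3, 8), Seq(3, 8), Seq(8), Seq(20, 21), Seq(21))

Each tail keeps its leading element, so every element gets its own group, and takeWhile stops at the first element past the threshold, matching the early exit of the imperative inner loop.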

Exclude items from training set data

I have my data in two tables: colors and excluded_colors.
colors contains all colors.
excluded_colors contains some colors that I wish to exclude from my training set.
I am trying to split the data into a training and testing set and ensure that the colors in excluded_colors are not in my training set but do exist in the testing set.
In order to achieve the above, I did this
var colors = spark.sql("""
  select colors.*
  from colors
  LEFT JOIN excluded_colors
    ON excluded_colors.color_id = colors.color_id
  where excluded_colors.color_id IS NULL
""")

val trainer: (Int => Int) = (arg: Int) => 0
val sqlTrainer = udf(trainer)
val tester: (Int => Int) = (arg: Int) => 1
val sqlTester = udf(tester)

val splits = colors.randomSplit(Array(0.7, 0.3))
val train_colors = splits(0).select("color_id").withColumn("test", sqlTrainer(col("color_id")))
val test_colors = splits(1).select("color_id").withColumn("test", sqlTester(col("color_id")))
However, I'm realizing that by doing the above the colors in excluded_colors are completely ignored. They are not even in my testing set.
Question
How can I split the data in 70/30 while also ensuring that the colors in excluded_colors are not in training but are present in testing.
What we want to do is remove the "excluded colors" from the training set, have them in the testing set, and keep a training/test split of 70/30.
What we need is a bit of math.
Given the total dataset (TD) and the excluded colors dataset (E), we can say for the train dataset (Tr) and the test dataset (Ts) that:
|Tr| = x * (|TD| - |E|)
|Ts| = |E| + (1 - x) * (|TD| - |E|)
We also know that |Tr| = 0.7 * |TD|.
Hence x = 0.7 * |TD| / (|TD| - |E|). For example, with |TD| = 1000 rows and |E| = 100 rows, x = 700 / 900 ≈ 0.78.
Now that we know the sampling factor x, we can say:
Tr = (TD - E).sample(withReplacement = false, fraction = x)
// where (TD - E) is the result of the SQL expression above
Ts = E.union((TD - E).except(Tr))
// the test set is every excluded color plus whatever from (TD - E) was not sampled into Tr
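Putting it together, a minimal DataFrame sketch of that recipe (table and column names taken from the question; sample() is approximate, so the split is ~70/30 rather than exact):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("color-split").getOrCreate()

val td = spark.table("colors")           // total dataset (TD)
val e  = spark.table("excluded_colors")  // excluded colors (E)

// TD - E: same result as the LEFT JOIN ... IS NULL query above
val tdMinusE = td.join(e, Seq("color_id"), "left_anti")

// sampling factor so that |train| ≈ 0.7 * |TD|
val x = 0.7 * td.count() / (td.count() - e.count())

val train = tdMinusE.sample(withReplacement = false, fraction = x)

// test = every excluded row, plus whatever from TD - E was not sampled into train
val excludedRows = td.join(e, Seq("color_id"), "left_semi")
val test = excludedRows.union(tdMinusE.except(train))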