Optimize "range-join" in plain Scala (not Spark!)

I have two ordered sequences: one (large) is a range of positions, the other (small) is a sequence of attributes defined on position_from/position_to intervals (position_von/position_bis in the code), and I'd like to join them.
As written, for each element of positions I need to traverse the other sequence, which is not optimal:
def interpolateCurveOnPos(position: Seq[Double], curveAttributes: Seq[CurveSegment]) = {
  position.map { pos =>
    // range join
    val cs = curveAttributes.find(c => pos >= c.position_von && pos < c.position_bis).get
    // interpolate curve attribute
    val curve = cs.curve_von + (pos - cs.position_von) * (cs.curve_bis - cs.curve_von) / (cs.position_bis - cs.position_von)
    curve // last expression is the lambda's result; `return` here would exit the whole method
  }
}
What I've tried:
Since the index at which the matching CurveSegment is found will always increase, I've introduced some state variables to narrow the search for the correct entry:
def interpolateCurveOnPos(position: Seq[Double], curveAttributes: Seq[CurveSegment]) = {
  var idxSave = 0
  var csSave: CurveSegment = curveAttributes.head
  position.map { pos =>
    // range join, skipping the segments that were already passed
    val cs = curveAttributes.drop(idxSave).find(c => pos >= c.position_von && pos < c.position_bis).get
    if (cs != csSave) {
      csSave = cs
      idxSave = idxSave + 1
    }
    // interpolate
    val curve = cs.curve_von + (pos - cs.position_von) * (cs.curve_bis - cs.curve_von) / (cs.position_bis - cs.position_von)
    curve // last expression is the lambda's result; `return` here would exit the whole method
  }
}
I wonder if there is a more elegant way to do it?
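One possible direction, sketched here under assumptions: because both sequences are ordered, a single forward pass with one cursor into the segments (a merge-style join) visits each segment at most once. The `CurveSegment` field types below are assumed to be `Double`, since the question does not show the class. Returning `Option` instead of calling `.get` is also a choice made for this sketch, so positions outside every segment do not throw.

```scala
// Assumed shape of the question's CurveSegment (field types are a guess).
final case class CurveSegment(
    position_von: Double,
    position_bis: Double,
    curve_von: Double,
    curve_bis: Double)

object RangeJoin {
  // Both inputs must be sorted ascending; positions is assumed to be a strict collection.
  def interpolateCurveOnPos(
      positions: Seq[Double],
      curveAttributes: Seq[CurveSegment]): Seq[Option[Double]] = {
    val segments = curveAttributes.toIndexedSeq // constant-time indexed access
    var idx = 0 // cursor into segments; it only ever moves forward
    positions.map { pos =>
      // Skip every segment that ends at or before pos.
      while (idx < segments.length && pos >= segments(idx).position_bis) idx += 1
      if (idx < segments.length && pos >= segments(idx).position_von) {
        val cs = segments(idx)
        Some(cs.curve_von + (pos - cs.position_von) * (cs.curve_bis - cs.curve_von) / (cs.position_bis - cs.position_von))
      } else None // pos lies in a gap or outside all segments
    }
  }
}
```

With n positions and m segments this does O(n + m) work, compared with O(n * m) for a `find` per position.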

Related

How to optimize this algorithm that finds all maximal matchings in a graph?

In my app, people grade each other out of ten points. Each day, an algorithm computes a match for as many people as possible (it's impossible to compute a match for everyone). It builds a graph where the vertices are users and the edges are the grades.
I simplify the problem by saying that if two people grade each other, there is an edge between them weighted by the average of their two grades. But if A grades B and B doesn't grade A, there is no edge between them and they can never match. This way, the graph is no longer directed.
I would like everybody to be happy on average, but at the same time I would like as few people as possible to be left without a match.
Being very deterministic, I wrote an algorithm that finds ALL maximal matchings in a graph. I did that because I thought I could analyse all these maximal matchings and apply a value function that could look like:
V(Matching) = exp(|M| / max(|M|)) * sum(weight of all Edge in M)
That is to say, a matching is high-valued if its cardinality is close to that of the maximum matching and if the sum of the grades between people is high. I apply an exponential function to the ratio |M|/max|M| because I consider it a big problem if that ratio is lower than 0.8 (so the exponential will be tuned to sharply decrease V as |M|/max|M| drops toward 0.8).
I would then take the matching for which V(M) is maximal. The big problem, though, is that my function that computes all maximal matchings takes a lot of time. For only 15 vertices and 20 edges, it takes almost 10 minutes...
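As an illustration only, the value function above could be written as the following minimal Scala sketch. The `Edge` class with a `Double` weight and the `maxSize` parameter (standing for max(|M|)) are names assumed for this example, and the exponential is left untuned.

```scala
object MatchingValue {
  // Hypothetical weighted edge; w is the average of the two grades.
  final case class Edge(v1: Int, v2: Int, w: Double)

  // V(M) = exp(|M| / max(|M|)) * (sum of the weights of all edges in M)
  def value(matching: Seq[Edge], maxSize: Int): Double =
    math.exp(matching.size.toDouble / maxSize) * matching.map(_.w).sum
}
```

For a matching of two edges with weights 8 and 6, where the maximum matching also has two edges, this gives e * 14.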
Here is the algorithm (in Swift) :
import Foundation
struct Edge : CustomStringConvertible {
    var description: String {
        return "e(\(v1), \(v2))"
    }
    let v1:Int
    let v2:Int
    let w:Int?
    init(_ arrint:[Int])
    {
        v1 = arrint[0]
        v2 = arrint[1]
        w = nil
    }
    init(_ v1:Int, _ v2:Int)
    {
        self.v1 = v1
        self.v2 = v2
        w = nil
    }
    init(_ v1:Int, _ v2:Int, _ w:Int)
    {
        self.v1 = v1
        self.v2 = v2
        self.w = w
    }
}
let mygraph:[Edge] =
[
Edge([1, 2]),
Edge([1, 5]),
Edge([2, 5]),
Edge([2, 3]),
Edge([3, 4]),
Edge([3, 6]),
Edge([5, 6]),
Edge([2,6]),
Edge([4,1]),
Edge([3,5]),
Edge([4,2]),
Edge([7,1]),
Edge([7,2]),
Edge([8,1]),
Edge([9,8]),
Edge([11,2]),
Edge([11, 8]),
Edge([12,13]),
Edge([1,6]),
Edge([4,7]),
Edge([5,7]),
Edge([3,5]),
Edge([9,1]),
Edge([10,11]),
Edge([10,4]),
Edge([10,2]),
Edge([10,1]),
Edge([10, 12]),
]
// remove all the edges and vertices "touching" the edges and vertices in "edgePath"
func reduce (graph:[Edge], edgePath:[Edge]) -> [Edge]
{
    var alreadyUsedV:[Int] = []
    for edge in edgePath
    {
        alreadyUsedV.append(edge.v1)
        alreadyUsedV.append(edge.v2)
    }
    return graph.filter({ edge in
        return alreadyUsedV.first(where:{ edge.v1 == $0 }) == nil && alreadyUsedV.first(where:{ edge.v2 == $0 }) == nil
    })
}
func findAllMaximalMatching(graph Gi:[Edge]) -> [[Edge]]
{
    var matchings:[[Edge]] = []
    var G = Gi // current graph (reduced at each depth)
    var M:[Edge] = [] // current matching being built
    var Cx:[Int] = [] // current path in the possibilities tree
    // e.g. Cx[1] = 3 : at depth 1, we are at the edge with index 3
    var d:Int = 0 // current depth
    var debug_it = 0
    while(true)
    {
        if(G.count == 0) // if there is no available edge in the graph, we have a matching
        {
            if(M.count > 0) // safety: if the initial graph is empty we cannot return an empty matching
            {
                matchings.append(M)
            }
            if(d == 0)
            {
                // depth = 0, we cannot decrement d: the whole tree of possibilities has been explored
                break
            }
            d = d - 1
            _ = M.popLast()
            G = reduce(graph: Gi, edgePath: M)
        }
        else
        {
            let indexForThisDepth = Cx.count > d ? Cx[d] + 1 : 0
            if(G.count < indexForThisDepth + 1)
            {
                // depth ended
                _ = Cx.popLast()
                if( d == 0)
                {
                    break
                }
                d = d - 1
                _ = M.popLast()
                // reduce from the initial graph to the decremented depth
                G = reduce(graph: Gi, edgePath: M)
            }
            else
            {
                // the matching is not fully built yet
                M.append( G[indexForThisDepth] )
                if(indexForThisDepth == 0)
                {
                    Cx.append(indexForThisDepth)
                }
                else
                {
                    Cx[d] = indexForThisDepth
                }
                d = d + 1
                G = reduce(graph: G, edgePath: M)
            }
        }
        debug_it += 1
    }
    print("matching counts : \(matchings.count)")
    print("iterations : \(debug_it)")
    return matchings
}
let m = findAllMaximalMatching(graph: mygraph)
// we have computed all the maximal matchings; now we loop through them to find the one with the maximum V(Mi)
// ....
Finally, my question is: how can I optimize this algorithm to find all maximal matchings and compute my value function on them, so as to find the best matching for my app in polynomial time?
I may be missing something since the question is quite complicated, but why not simply use the maximum flow problem, with every vertex appearing twice and the edge weights being the average grades where they exist? Configured correctly, it returns the maximum flow and runs in polynomial time.

merge sort performance compared to insertion sort

For any array of length greater than 10, is it safe to say that merge sort performs fewer comparisons among the array's elements than does insertion sort on the same array because the best case for the run time of merge sort is O(N log N) while for insertion sort, it's O(N)?
My take on this. First off, you are talking about comparisons, but swaps matter as well.
In insertion sort, the worst case (an array sorted in reverse order) takes n^2 - n comparisons and swaps combined, that is n(n-1)/2 of each (55 + 55 = 110 for 11 elements, for example). But if the array is even partially sorted in the required order (many elements are already at their correct positions or not far from them), the number of swaps and comparisons may drop significantly. The right position for each element is found quickly, so far fewer actions are needed than for a reversed array. So, as you can see for arr2, which is almost sorted, the number of actions becomes linear in the input size: 13 comparisons and 3 swaps.
var arr1 = [11,10,9,8,7,6,5,4,3,2,1];
var arr2 = [1,2,3,4,5,6,7,8,11,10,9];
function InsertionSort(arr) {
  var compNum = 0, swapNum = 0;
  for(var i = 1; i < arr.length; i++) {
    var temp = arr[i], j = i - 1;
    while(j >= 0) {
      compNum++; // count every comparison, including the one that ends the scan
      if(temp < arr[j]) { arr[j + 1] = arr[j]; swapNum++; } else break;
      j--;
    }
    arr[j + 1] = temp;
  }
  console.log(arr, "Number of comparisons: " + compNum, "Number of swaps: " + swapNum);
}
InsertionSort(arr1); // worst case, 11^2 - 11 = 110 actions
InsertionSort(arr2); // almost sorted array, 13 comparisons + 3 swaps = 16 actions
In merge sort we always do approx. n*log n actions; the properties of the input array don't matter. So, as you can see, both of our arrays get sorted in 39 actions:
var arr1 = [11,10,9,8,7,6,5,4,3,2,1];
var arr2 = [1,2,3,4,5,6,7,8,11,10,9];
var actions = 0;
function mergesort(arr, left, right) {
  if(left >= right) return;
  var middle = Math.floor((left + right)/2);
  mergesort(arr, left, middle);
  mergesort(arr, middle + 1, right);
  merge(arr, left, middle, right);
}
function merge(arr, left, middle, right) {
  var l = middle - left + 1, r = right - middle, temp_l = [], temp_r = [];
  for(var i = 0; i < l; i++) temp_l[i] = arr[left + i];
  for(var i = 0; i < r; i++) temp_r[i] = arr[middle + i + 1];
  var i = 0, j = 0, k = left;
  while(i < l && j < r) {
    if(temp_l[i] <= temp_r[j]) {
      arr[k] = temp_l[i]; i++;
    } else {
      arr[k] = temp_r[j]; j++;
    }
    k++; actions++; // one action per element written back
  }
  while(i < l) { arr[k] = temp_l[i]; i++; k++; actions++;}
  while(j < r) { arr[k] = temp_r[j]; j++; k++; actions++;}
}
mergesort(arr1, 0, arr1.length - 1);
console.log(arr1, "Number of actions: " + actions); // 39, close to 11 * log2(11) (approx. 38)
actions = 0;
mergesort(arr2, 0, arr2.length - 1);
console.log(arr2, "Number of actions: " + actions); // 39, close to 11 * log2(11) (approx. 38)
So, answering your question:
For any array of length greater than 10, is it safe to say that merge sort performs fewer comparisons among the array's elements than does insertion sort on the same array
I would say that no, it isn't safe to say so. Merge sort can perform more actions than insertion sort in some cases. The size of the array isn't important here. What matters when comparing insertion sort with merge sort is how far your array is from the sorted state. I hope it helps :)
BTW, merge sort and insertion sort have been united in a hybrid stable sorting algorithm called Timsort to get the best from both of them. Check it out if interested.
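To check the claim about comparisons specifically, here is a small Scala sketch that counts only element comparisons for both algorithms. The object and function names are invented for this example, and the merge sort is a plain top-down variant, not any library's implementation.

```scala
object SortComparisons {
  // Number of element comparisons insertion sort performs on xs.
  def insertionSortComparisons(xs: Seq[Int]): Int = {
    val a = xs.toArray
    var comparisons = 0
    for (i <- 1 until a.length) {
      val temp = a(i)
      var j = i - 1
      var searching = true
      while (j >= 0 && searching) {
        comparisons += 1
        if (temp < a(j)) { a(j + 1) = a(j); j -= 1 } else searching = false
      }
      a(j + 1) = temp
    }
    comparisons
  }

  // Returns (sorted vector, number of element comparisons) for a top-down merge sort.
  def mergeSortComparisons(xs: Vector[Int]): (Vector[Int], Int) =
    if (xs.length <= 1) (xs, 0)
    else {
      val (left, right) = xs.splitAt(xs.length / 2)
      val (ls, lc) = mergeSortComparisons(left)
      val (rs, rc) = mergeSortComparisons(right)
      val out = Vector.newBuilder[Int]
      var i = 0; var j = 0; var c = 0
      while (i < ls.length && j < rs.length) {
        c += 1
        if (ls(i) <= rs(j)) { out += ls(i); i += 1 } else { out += rs(j); j += 1 }
      }
      out ++= ls.drop(i); out ++= rs.drop(j) // leftovers need no comparisons
      (out.result(), lc + rc + c)
    }
}
```

On the already sorted input `1 to 11`, insertion sort needs only 10 comparisons and this merge sort needs more, while on the reversed input insertion sort needs 55. So the answer to the question depends on the input order, not just its length.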

Functional version of a typical nested while loop

I hope this question may please functional programming lovers. Could I ask for a way to translate the following fragment of code to a pure functional implementation in Scala with a good balance between readability and execution speed?
Description: for each element in a sequence, produce a sub-sequence containing the elements that come after the current element (including itself) at a distance smaller than a given threshold. Once the threshold is crossed, it is not necessary to process the remaining elements.
def getGroupsOfElements(input: Seq[Element]): Seq[Seq[Element]] = {
  val maxDistance = 10 // put any number you like
  var outerSequence = Seq.empty[Seq[Element]]
  for (index <- input.indices) {
    var anotherIndex = index + 1
    var innerSequence = Seq(input(index))
    // let's assume a separate function for computing the distance;
    // the bound is checked first so we never read past the end of input
    while (anotherIndex < input.length && distance(input(index), input(anotherIndex)) < maxDistance) {
      innerSequence = innerSequence :+ input(anotherIndex)
      anotherIndex = anotherIndex + 1
    }
    outerSequence = outerSequence :+ innerSequence
  }
  outerSequence
}
You know, this would be a ton easier if you added a description of what you're trying to accomplish along with the code.
Anyway, here's something that might get close to what you want.
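// distance and maxDistance are assumed to be defined elsewhere
def getGroupsOfElements(input: Seq[Element]): Seq[Seq[Element]] =
  input.tails
    .filter(_.nonEmpty) // tails ends with an empty sequence; drop it
    .map(x => x.takeWhile(y => distance(x.head, y) < maxDistance))
    .toSeq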
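To make the `tails` / `takeWhile` idea concrete, here is a self-contained sketch that uses `Int` in place of `Element`. The `distance` function is a stand-in assumed for this example, since the question leaves it unspecified.

```scala
object Groups {
  // Hypothetical distance between two elements (absolute difference).
  def distance(a: Int, b: Int): Int = math.abs(a - b)

  def getGroupsOfElements(input: Seq[Int], maxDistance: Int): Seq[Seq[Int]] =
    input.tails
      .filter(_.nonEmpty) // the last tail is empty and would yield an empty group
      .map(tail => tail.takeWhile(y => distance(tail.head, y) < maxDistance))
      .toList
}
```

For example, `Groups.getGroupsOfElements(Seq(1, 3, 12, 15, 40), 10)` yields the groups (1, 3), (3, 12), (12, 15), (15) and (40). `takeWhile` stops at the first element that crosses the threshold, which matches the early exit of the original `while` loop.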

Scala - Project euler #8

I'm currently learning Scala and I'm trying to solve some of the Euler Challenges with it.
I have some problems getting the answer to the 8th challenge and I really don't know where my bug is.
object Product{
  def main(args: Array[String]): Unit = {
var s = "7316717653133062491922511967442657474235534919493496983520312774506326239578318016984801869478851843858615607891129494954595017379583319528532088055111254069874715852386305071569329096329522744304355766896648950445244523161731856403098711121722383113622298934233803081353362766142828064444866452387493035890729629049156044077239071381051585930796086670172427121883998797908792274921901699720888093776657273330010533678812202354218097512545405947522435258490771167055601360483958644670632441572215539753697817977846174064955149290862569321978468622482839722413756570560574902614079729686524145351004748216637048440319989000889524345065854122758866688116427171479924442928230863465674813919123162824586178664583591245665294765456828489128831426076900422421902267105562632111110937054421750694165896040807198403850962455444362981230987879927244284909188845801561660979191338754992005240636899125607176060588611646710940507754100225698315520005593572972571636269561882670428252483600823257530420752963450";
    var len = 13;
    var bestSet = s.substring(0,len);
    var currentSet = "";
    var i = 0;
    var compare = 0;
    for(i <- 1 until s.length - len){
      currentSet = s.substring(i,i+len);
      compare = compareBlocks(bestSet,currentSet);
      if(compare == 1) bestSet = currentSet;
    }
    println(bestSet);
    var result = 1L;
    var c = ' ';
    for(c <- bestSet.toCharArray){
      result = result * c.asDigit.toLong;
    }
    println(result);
  }
  def compareBlocks(block1: String, block2: String): Int = {
    var i = 0;
    var v1 = 0;
    var v2 = 0;
    if((block1 contains "0") && !(block2 contains "0")) return 1;
    if(!(block1 contains "0") && (block2 contains "0")) return -1;
    if((block1 contains "0") && (block2 contains "0")) return 0;
    var chars = block1.toCharArray;
    for(i <- 0 until chars.length){
      v1 = v1 + chars(i).asDigit;
    }
    chars = block2.toCharArray;
    for(i <- 0 until chars.length)
    {
      v2 = v2 + chars(i).asDigit;
    }
    if(v1 < v2) return 1;
    if(v2 < v1) return -1;
    return 0;
  }
}
My result is:
9753697817977 <- Digit sequence
8821658160 <- Multiplication
Using the Euler Project to challenge yourself and learn a new language is a pretty good idea, but just coming up with the correct answer doesn't mean that you're using the language well.
It's obvious from your code that you have yet to learn idiomatic Scala. Would it surprise you to learn that the desired product can be calculated from the 1000-character input string with just one line of code? That one line of code will:
turn each input character into a digit (Int)
slide a fixed size (13-digit) window over all the digits
multiply all the digits within each window
select the maximum from all those products
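The four steps above can be sketched as follows. The wrapper object and function name are assumptions made for this example, and the input is assumed to be a non-empty string of decimal digits.

```scala
object LargestProduct {
  // digit conversion -> sliding window -> product per window -> maximum
  def maxWindowProduct(digits: String, window: Int): Long =
    digits.map(_.asDigit.toLong).sliding(window).map(_.product).max
}
```

For instance, `LargestProduct.maxWindowProduct("102345", 3)` evaluates to 60 (from 3 * 4 * 5). Each window is compared by its product here, and the digits are widened to `Long` because a product of 13 digits can exceed the `Int` range.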
There's a handy little web site that has solved Euler challenges in Scala. I recommend that every time you solve an Euler problem, compare your code with what's found on that site. (But be careful. It's too easy to look ahead at solutions that you haven't tackled yet.)

modifying spark GraphX pageRank to do random walk with restart

I am trying to implement random walk with restart by modifying the Spark GraphX implementation of PageRank algorithm.
def randomWalkWithRestart(graph: Graph[VertexProperty, EdgeProperty], patientID: String , numIter: Int = 10, alpha: Double = 0.15, tol: Double = 0.01): Unit = {
  var rankGraph: Graph[Double, Double] = graph
    // Associate the degree with each vertex
    .outerJoinVertices(graph.outDegrees) { (vid, vdata, deg) => deg.getOrElse(0) }
    // Set the weight on the edges based on the degree
    .mapTriplets( e => 1.0 / e.srcAttr, TripletFields.Src )
    // Set the vertex attributes to the initial pagerank values
    .mapVertices( (id, attr) => alpha )
  var iteration = 0
  var prevRankGraph: Graph[Double, Double] = null
  while (iteration < numIter) {
    rankGraph.cache()
    // Compute the outgoing rank contributions of each vertex, perform local preaggregation, and
    // do the final aggregation at the receiving vertices. Requires a shuffle for aggregation.
    val rankUpdates = rankGraph.aggregateMessages[Double](
      ctx => ctx.sendToDst(ctx.srcAttr * ctx.attr), _ + _, TripletFields.Src)
    // Apply the final rank updates to get the new ranks, using join to preserve ranks of vertices
    // that didn't receive a message. Requires a shuffle for broadcasting updated ranks to the
    // edge partitions.
    prevRankGraph = rankGraph
    rankGraph = rankGraph.joinVertices(rankUpdates) {
      (id, oldRank, msgSum) => alpha + (1.0 - alpha) * msgSum
    }.cache()
    rankGraph.edges.foreachPartition(x => {}) // also materializes rankGraph.vertices
    //logInfo(s"PageRank finished iteration $iteration.")
    prevRankGraph.vertices.unpersist(false)
    prevRankGraph.edges.unpersist(false)
    iteration += 1
  }
}
I believe the (id, oldRank, msgSum) => alpha + (1.0 - alpha) * msgSum part should be changed, but I am not sure how. I need to add the ready state probability to this line.
Furthermore, the ready state probability should be initialized somewhere before the while loop, and it has to be updated inside the while loop.
Any suggestions would be appreciated.
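For reference, one common formulation of random walk with restart keeps a fixed restart vector e that puts all probability mass on the seed vertex and iterates p = alpha * e + (1 - alpha) * W p, where W spreads each vertex's rank evenly over its out-edges. The sketch below shows that update in plain Scala without Spark. The function signature, the `seed` parameter and the edge-list representation are assumptions made for illustration, not GraphX API. GraphX also provides `personalizedPageRank` on `GraphOps`, which may already cover this use case.

```scala
object Rwr {
  // edges: directed (src, dst) pairs; seed: the vertex the walker restarts from.
  def randomWalkWithRestart(
      edges: Seq[(Int, Int)],
      seed: Int,
      alpha: Double = 0.15,
      numIter: Int = 50): Map[Int, Double] = {
    val vertices = (edges.flatMap { case (s, d) => Seq(s, d) } :+ seed).distinct
    val outDegree: Map[Int, Int] = edges.groupBy(_._1).map { case (s, es) => s -> es.size }
    // Restart vector, initialized once before the loop: all mass on the seed.
    val restart: Map[Int, Double] = vertices.map(v => v -> (if (v == seed) 1.0 else 0.0)).toMap
    var rank = restart
    for (_ <- 0 until numIter) {
      // Each vertex sends rank / outDegree along every outgoing edge.
      val incoming: Map[Int, Double] = edges
        .map { case (s, d) => (d, rank(s) / outDegree(s)) }
        .groupBy(_._1)
        .map { case (d, msgs) => d -> msgs.map(_._2).sum }
      // Restart term applies to the seed only; vertices without messages get 0 incoming.
      rank = vertices.map { v =>
        v -> (alpha * restart(v) + (1.0 - alpha) * incoming.getOrElse(v, 0.0))
      }.toMap
    }
    rank
  }
}
```

Compared with the PageRank update in the question, the only change to the per-vertex formula is that the constant `alpha` becomes `alpha * restart(v)`, so only the seed vertex receives the restart mass. Every vertex is also updated on each iteration, not just those that received a message.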