Functional way to build a matrix from the list in scala - scala

I've asked this question on https://users.scala-lang.org/ but haven't got a concrete answer yet. I am given a vector v and I would like to construct a matrix m from it according to the rules specified below. I would like to write the following code in a purely functional way, i.e. m = v.map(...) or similar. I can do it easily in a procedural way like this:
import scala.util.Random
val v = Vector.fill(50)(Random.nextInt(100))
println(v)
val m = Array.fill[Int](10, 10)(0)
def populateMatrix(x: Int): Unit = m(x/10)(x%10) += 1
v.foreach(populateMatrix) // foreach, not map: populateMatrix is called only for its side effect
m foreach { row => row foreach print; println }
In words, I am iterating through v, getting a pair of indices (i,j) from each v(k) and updating the matrix m at these positions, i.e., m(i)(j) += 1. But I am looking for a functional way. It is clear to me how to implement this in, e.g., Mathematica:
v=RandomInteger[{99}, 300]
m=SparseArray[{Rule[{Quotient[#, 10] + 1, Mod[#, 10] + 1}, 1]}, {10, 10}] & /@ v // Total // Normal
But how can I do it in Scala, which is a functional language, too?

Your populateMatrix approach can be "reversed": map the vector into index tuples, group them, count the size of each group, and turn the result into a map (index tuple -> count). Array.tabulate then uses that map to populate each position of the array:
import scala.util.Random
val v = Vector.fill(50)(Random.nextInt(100))
val values = v.map(i => (i/10, i%10))
  .groupBy(identity)
  .view
  .mapValues(_.size)
val result = Array.tabulate(10,10)( (i, j)=> values.getOrElse((i,j), 0))
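The same group-and-count idea can be sketched on a small, deterministic input so the result is easy to check by hand. This is only an illustration: the object name MatrixFromVector and the parameter n (the matrix side length) are assumptions, not part of the original answer, and it returns an immutable Vector[Vector[Int]] instead of an Array.

```scala
object MatrixFromVector {
  // Count how often each (row, col) pair occurs, then tabulate an n x n matrix.
  // Each value x is mapped to the cell (x / n, x % n), as in the question.
  def build(v: Seq[Int], n: Int): Vector[Vector[Int]] = {
    val counts: Map[(Int, Int), Int] =
      v.groupBy(x => (x / n, x % n)).map { case (cell, xs) => cell -> xs.size }
    Vector.tabulate(n, n)((i, j) => counts.getOrElse((i, j), 0))
  }
}

// Usage: with n = 2 the values 0, 1, 1, 3 land in cells (0,0), (0,1), (0,1), (1,1).
val small = MatrixFromVector.build(Seq(0, 1, 1, 3), 2)
small.foreach(row => println(row.mkString(" ")))
```

No mutable state is involved: the counts are computed once and the matrix is derived from them.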

Related

Avoid ListBuffer while preparing an element-wise multiplication of two SparseVectors

I'm trying to implement an element-wise multiplication of two ml.linalg.SparseVector instances (also called a Hadamard product).
A SparseVector represents a vector, but rather than having space taken up by all the "0" values, they are omitted. The vector is represented as two lists of Indices and Values.
For example: SparseVector(indices: [0, 100, 100000], values: [0.25, 1, 0.8]) concisely represents an array of 100,000 elements, where only 3 values are non-zero.
I now need an element-wise multiplication of two of these, and there seems to be no built-in. Conceptually, it should be simple - any indices they don't have in common are dropped, and for the indices in common, the numbers are multiplied together.
For example: SparseVector(indices: [0, 500, 100000], values: [10, 1, 10]) when multiplied with the above should return: SparseVector(indices: [0, 100000], values: [2.5, 8])
Sadly, I've found no built-in for this. I have an approach for doing this in a single pass, but it isn't very scala-y: it has to build up the list in a loop as it discovers which indices are in common, and then grab the corresponding values for each index (which sit at the same position, but in a second array).
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions.udf
import scala.collection.mutable.ListBuffer
// Return a new SparseVector whose values are the element-wise product (Hadamard product)
val multSparseVectors = udf((v1: SparseVector, v2: SparseVector) => {
  // val commonIndexes = v1.indices.intersect(v2.indices); // Missing scale factors are assumed to have a value of 0, so only common elements remain
  // TODO: No clear way to map common indices to the values that go with those indices. E.g. no "valueForIndex" method
  // new SparseVector(v1.size, commonIndexes, commonIndexes.map(i => v1.valueForIndex(i) * v2.valueForIndex(i)).toArray);
  val indices = ListBuffer[Int](); // TODO: Some way to do this without mutable lists?
  val values = ListBuffer[Double]();
  var v1Pos = 0; // Current index of SparseVector v1 (we will be making a single pass)
  var v2pos = 0; // Current index of SparseVector v2 (we will be making a single pass)
  while(v1Pos < v1.indices.length && v2pos < v2.indices.length) {
    // Advance our position in SparseVector 2 until we've matched or passed the current SparseVector 1 index
    while(v2pos < v2.indices.length && v2.indices(v2pos) < v1.indices(v1Pos))
      v2pos += 1;
    // Only record a product if we are still inside v2 and the indices match
    if(v2pos < v2.indices.length && v1.indices(v1Pos) == v2.indices(v2pos)) {
      indices += v1.indices(v1Pos);
      values += v1.values(v1Pos) * v2.values(v2pos);
    }
    v1Pos += 1;
  }
  new SparseVector(v1.size, indices.toArray, values.toArray);
})
spark.udf.register("multSparseVectors", multSparseVectors)
Can anyone think of a way that I can do this using a map or similar? My main goal is I want to avoid having to make multiple O(N) passes over the second vector to "lookup" the position of a value in the indices list so that I can grab the corresponding values entry, because this would take O(K + N*2) time when I know there's an O(K + N) solution possible.
I've come up with a solution by boiling this problem into a more general one:
Finding the indices at which two arrays intersect
Given an answer to the above question (where do the two arrays v1.indices and v2.indices intersect?), we can trivially use those positions to extract the new SparseVector indices, and the values from each vector to be multiplied together.
The solution is given below:
%scala
import scala.annotation.tailrec
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions.udf
// This fanciness from https://stackoverflow.com/a/71928709/529618 finds the indices at which two lists intersect
@tailrec
def indicesOfIntersection(left: List[Int], right: List[Int], lidx: Int = 0, ridx: Int = 0, result: List[(Int, Int)] = Nil): List[(Int, Int)] = (left, right) match {
  case (Nil, _) | (_, Nil) => result.reverse
  case (l::tail, r::_) if l < r => indicesOfIntersection(tail, right, lidx+1, ridx, result)
  case (l::_, r::tail) if l > r => indicesOfIntersection(left, tail, lidx, ridx+1, result)
  case (l::ltail, r::rtail) => indicesOfIntersection(ltail, rtail, lidx+1, ridx+1, (lidx, ridx) :: result)
}
// Return a new SparseVector whose values are the element-wise product (Hadamard product)
val multSparseVectors = udf((v1: SparseVector, v2: SparseVector) => {
val intersection = indicesOfIntersection(v1.indices.toList, v2.indices.toList);
new SparseVector(v1.size,
intersection.map{case (x1,_) => v1.indices(x1)}.toArray,
intersection.map{case (x1,x2) => v1.values(x1) * v2.values(x2)}.toArray);
})
spark.udf.register("multSparseVectors", multSparseVectors)
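For trying the merge logic without a Spark session, here is a minimal Spark-free sketch. It assumes a sparse vector is modelled as two parallel arrays (sorted indices and their values); the object name SparseHadamard and its multiply method are hypothetical helpers for illustration, not part of the Spark API.

```scala
import scala.annotation.tailrec

object SparseHadamard {
  // Single-pass merge over two sorted index arrays, collecting
  // (index, product) pairs only where both vectors have an entry.
  def multiply(i1: Array[Int], x1: Array[Double],
               i2: Array[Int], x2: Array[Double]): (Array[Int], Array[Double]) = {
    @tailrec
    def loop(p1: Int, p2: Int, acc: List[(Int, Double)]): List[(Int, Double)] =
      if (p1 >= i1.length || p2 >= i2.length) acc.reverse
      else if (i1(p1) < i2(p2)) loop(p1 + 1, p2, acc)      // index only in the first vector
      else if (i1(p1) > i2(p2)) loop(p1, p2 + 1, acc)      // index only in the second vector
      else loop(p1 + 1, p2 + 1, (i1(p1), x1(p1) * x2(p2)) :: acc)

    val (idx, vals) = loop(0, 0, Nil).unzip
    (idx.toArray, vals.toArray)
  }
}

// Usage with the example vectors from the question.
val (idx, vals) = SparseHadamard.multiply(
  Array(0, 100, 100000), Array(0.25, 1.0, 0.8),
  Array(0, 500, 100000), Array(10.0, 1.0, 10.0))
println(idx.mkString(", "))
println(vals.mkString(", "))
```

Each step advances at least one of the two positions, so the pass is linear in the combined number of stored entries. The two resulting arrays have the shape the SparseVector constructor expects.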

How to trick Scala map method to produce more than one output per each input item?

A quite complex algorithm is being applied to a list of Spark Dataset rows (the list was obtained using groupByKey and flatMapGroups). Most rows are transformed 1:1 from input to output, but some scenarios require more than one output per input. The input row schema can change at any time. map() fits the requirements quite well for the 1:1 transformation, but is there a way to use it to produce 1:n output?
The only workaround I found relies on the foreach method, which has unpleasant overhead caused by creating the initial empty list (remember, unlike the simplified example below, the real-life list structure changes randomly).
My original problem is too complex to share here, but this example demonstrates the concept. Let's have a list of integers. Each should be transformed into its square value and if the input is even it should also transform into one half of the original value:
val X = Seq(1, 2, 3, 4, 5)
val y = X.map(x => x * x) //map is intended for 1:1 transformation so it works great here
val z = X.map(x => for(n <- 1 to 5) (n, x * x)) //this attempt FAILS - generates list of five rows with empty tuples
// this work-around works, but newX definition is problematic
var newX = List[Int]() //in reality defining as head of the input list and dropping result's tail at the end
val za = X.foreach(x => {
newX = x*x :: newX
if(x % 2 == 0) newX = (x / 2) :: newX
})
newX
Is there a better way than foreach construct?
.flatMap produces any number of outputs from a single input.
val X = Seq(1, 2, 3, 4, 5)
X.flatMap { x =>
  // Always emit the square; for even inputs also emit one half of the input
  if (x % 2 == 0) Seq(x*x, x / 2) else Seq(x*x)
}
// => Seq[Int] = List(1, 4, 1, 9, 16, 2, 25)
flatMap in more detail
In X.map(f), f is a function that maps each input to a single output. By contrast, in X.flatMap(g), the function g maps each input to a sequence of outputs. flatMap then takes all the sequences produced (one for each element in X) and concatenates them.
The neat thing is .flatMap works not just for sequences, but for all sequence-like objects. For an option, for instance, Option(x)#flatMap(g) will allow g to return an Option. Similarly, Future(x)#flatMap(g) will allow g to return a Future.
Whenever the number of elements you return depends on the input, you should think of flatMap.
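To make the Option case concrete, here is a small sketch. The helper names parse and half are made up for this illustration; only Option.flatMap, Seq.flatMap and scala.util.Try are standard library calls.

```scala
import scala.util.Try

object FlatMapDemo {
  // Parsing may fail, so the result is an Option.
  def parse(s: String): Option[Int] = Try(s.trim.toInt).toOption

  // Only even numbers have an integer half.
  def half(n: Int): Option[Int] = if (n % 2 == 0) Some(n / 2) else None

  // flatMap chains the two steps; any None short-circuits the chain.
  val ok: Option[Int]      = parse("8").flatMap(half)
  val odd: Option[Int]     = parse("7").flatMap(half)
  val garbage: Option[Int] = parse("x").flatMap(half)

  // On a Seq, each input contributes zero or one output here.
  val halves: Seq[Int] = Seq("8", "7", "x", "10").flatMap(s => parse(s).flatMap(half).toList)
}

println(FlatMapDemo.ok)
println(FlatMapDemo.halves)
```

With map instead of flatMap, the first chain would have the nested type Option[Option[Int]]; flatMap flattens that nesting in the same way it concatenates sequences.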

Adding Sparse Vectors 3.0.0 Apache Spark Scala

I am trying to create a function like the following to add two org.apache.spark.ml.linalg.Vector instances, i.e. two sparse vectors.
Such a vector could look like this:
(28,[1,2,3,4,7,11,12,13,14,15,17,20,22,23,24,25],[0.13028398104008743,0.23648605632753023,0.7094581689825907,0.13028398104008743,0.23648605632753023,0.0,0.14218861229025295,0.3580566057240087,0.14218861229025295,0.13028398104008743,0.26056796208017485,0.0,0.14218861229025295,0.06514199052004371,0.13028398104008743,0.23648605632753023])
For example:
def add_vectors(x: org.apache.spark.ml.linalg.Vector,y:org.apache.spark.ml.linalg.Vector): org.apache.spark.ml.linalg.Vector = {
}
Let's look at a use case:
val x = Vectors.sparse(2, Array(0), Array(1.0)) // [1, 0]
val y = Vectors.sparse(2, Array(1), Array(1.0)) // [0, 1]
I want the output to be
Vectors.sparse(2, Array(0, 1), Array(1.0, 1.0))
Here's another case where they share the same indices:
val x = Vectors.sparse(2, Array(1), Array(1.0))
val y = Vectors.sparse(2, Array(1), Array(1.0))
The output should be
Vectors.sparse(2, Array(1), Array(2.0))
I've realized doing this is harder than it seems. I looked into one possible solution of converting the vectors into breeze, adding them in breeze and then converting it back to a vector. e.g Addition of two RDD[mllib.linalg.Vector]'s. So I tried implementing this.
def add_vectors(x: org.apache.spark.ml.linalg.Vector,y:org.apache.spark.ml.linalg.Vector) ={
val dense_x = x.toDense
val dense_y = y.toDense
val bv1 = new DenseVector(dense_x.toArray)
val bv2 = new DenseVector(dense_y.toArray)
val vectout = Vectors.dense((bv1 + bv2).toArray)
vectout
}
However, this gave me an error in the last line:
val vectout = Vectors.dense((bv1 + bv2).toArray)
Cannot resolve the overloaded method 'dense'.
I'm wondering why this error is occurring and how to fix it.
To answer my own question, I had to think about how sparse vectors are represented. A sparse vector requires 3 arguments: the number of dimensions, an array of indices, and an array of values. For example:
val indices: Array[Int] = Array(1,2)
val norms: Array[Double] = Array(0.5,0.3)
val num_int = 4
val vector: Vector = Vectors.sparse(num_int, indices, norms)
If I convert this SparseVector to an Array, I get the following.
code:
val choiced_array = vector.toArray
choiced_array.foreach(element => print(element + " "))
Output:
0.0 0.5 0.3 0.0
This is the dense representation of the vector. Once you convert the two vectors to arrays, you can add them with the following code (vector_2 being the second vector):
val add: Array[Double] = (vector.toArray, vector_2.toArray).zipped.map(_ + _)
This gives you an array holding the element-wise sum of both. Next, to create your new sparse vector, you need an indices array, as in the construction above:
var i = -1;
val new_indices_pre = add.map( (element:Double) => {
  i = i + 1
  // Keep the position of every non-zero entry (negative sums count too)
  if(element != 0.0)
    i
  else{
    -1
  }
})
Then let's filter out all -1 entries, which mark the positions whose value is zero.
val new_indices = new_indices_pre.filter(element => element != -1)
Remember to also filter the zero values out of the array that holds the sum of the two vectors.
val final_add = add.filter(element => element != 0.0)
Lastly, we can make the new sparse vector:
Vectors.sparse(num_int,new_indices,final_add)
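The dense round trip above allocates an array of the full dimension. As an alternative, here is a hedged sketch that stays sparse by merging the (index, value) pairs directly. It works on plain arrays so it runs without Spark; the object name SparseAdd and its add method are assumptions made for this example.

```scala
object SparseAdd {
  // Add two sparse vectors given as parallel (indices, values) arrays.
  // Entries sharing an index are summed; entries whose sum is zero are dropped.
  def add(i1: Array[Int], x1: Array[Double],
          i2: Array[Int], x2: Array[Double]): (Array[Int], Array[Double]) = {
    val summed: Array[(Int, Double)] =
      (i1.zip(x1) ++ i2.zip(x2))
        .groupBy(_._1)
        .map { case (idx, pairs) => idx -> pairs.map(_._2).sum }
        .filter { case (_, value) => value != 0.0 }
        .toArray
        .sortBy(_._1) // sparse vectors expect increasing indices
    (summed.map(_._1), summed.map(_._2))
  }
}

// Usage with the two cases from the question.
val (idxA, valA) = SparseAdd.add(Array(0), Array(1.0), Array(1), Array(1.0))
val (idxB, valB) = SparseAdd.add(Array(1), Array(1.0), Array(1), Array(1.0))
println(idxA.mkString(",") + " -> " + valA.mkString(","))
println(idxB.mkString(",") + " -> " + valB.mkString(","))
```

The two returned arrays have the shape expected by Vectors.sparse(size, indices, values), so inside a Spark job the result could be wrapped back into a vector with the original size.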

how to accumulate 2D array elements in Scala?

I have a 2D array:
Array(Array(1,1,0), Array(1,0,1))
and I would like to accumulate the values down each column, so that my final output looks like
Array(Array(1,1,0), Array(2,1,1))
If this were a 1D array, I could simply use 'scan', but I'm having trouble using scan on a 2D array.
Can anyone help with this issue?
Here's one way to do it:
val t = Array(Array(1,1,0), Array(1,0,1))
val result = t.scanLeft(Array.fill(t(0).length)(0)) ((x,y) =>
x.zip(y).map(e => e._1 + e._2)).drop(1)
//to see the results
result.foreach(e => println(e.toList))
gives:
List(1, 1, 0)
List(2, 1, 1)
The idea is to create an array filled with zeros (using Array.fill) and then scan the 2D array using that as an accumulator. In the end, drop(1) gets rid of the zero-filled array.
EDIT:
In response to the comment, this solution works for a matrix of any size. The zip function takes care of element-wise addition.
EDIT 2 (Step by step explanation):
You already know about scan on a one-dimensional array. The idea is essentially the same.
We initialize the accumulator with zero. In this case, zero means an array of zeros. Array.fill is used to create an array filled with zeros.
Instead of a single addition, we need to add arrays element-wise. This is what the combination of zip and map do. There are a lot of examples available on the Internet about how these methods work.
Finally, we drop the zero element using Scala's drop(1). The result is an array of arrays containing accumulated values.
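The parallel between the 1D and the 2D case can be sketched side by side. This is only an illustration using immutable Vectors; the object name CumulativeRows and its method names are assumptions for this example.

```scala
object CumulativeRows {
  // 1D running sum: scanLeft emits the seed (0) first, so drop it with tail.
  def running(xs: Vector[Int]): Vector[Int] =
    xs.scanLeft(0)(_ + _).tail

  // 2D version: the seed is a row of zeros and "+" becomes
  // element-wise addition of two rows via zip and map.
  def accumulate(rows: Vector[Vector[Int]]): Vector[Vector[Int]] =
    rows.headOption match {
      case None => Vector.empty
      case Some(first) =>
        rows.scanLeft(Vector.fill(first.length)(0)) { (acc, row) =>
          acc.zip(row).map { case (a, b) => a + b }
        }.tail
    }
}

// Usage with the matrix from the question.
CumulativeRows
  .accumulate(Vector(Vector(1, 1, 0), Vector(1, 0, 1)))
  .foreach(row => println(row.mkString(",")))
```

The headOption match only guards the empty-matrix case, where there is no first row to take the column count from.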
I would solve it like this: for a given row r, sum that row and all previous rows.
val accumulatedMatrix =
  for(row <- array.indices)
    yield array.take(row + 1).foldLeft(Array(0, 0, 0)) {
      case (a, b) => Array(a(0) + b(0), a(1) + b(1), a(2) + b(2))
    }
input:
val array = Array(
Array(1,1,0),
Array(1,0,1)
)
output:
1,1,0
2,1,1
Instead of summing all the previous rows repeatedly, you could also improve it by memoizing the running sum.
Pretty much the same approach as a 1D array:
a.tail.scan(a.head)((acc, value) =>
Array(acc(0) + value(0), acc(1) + value(1), acc(2) + value(2))
)
For an arbitrary number of columns (as long as all rows have the same length):
a.tail.scan(a.head)((acc, value) =>
acc zip value map {case (a,b) => a+b}
)

Selective sampling in spark RDD

I have an RDD of logged events and want to take a few samples from each category.
Data is like below
|xxx|xxxx|xxxx|type1|xxxx|xxxx
|xxx|xxxx|xxxx|type2|xxxx|xxxx|xxxx|xxxx
|xxx|xxxx|xxxx|type3|xxxx|xxxx|xxxx
|xxx|xxxx|xxxx|type3|xxxx|xxxx|xxxx
|xxx|xxxx|xxxx|type4|xxxx|xxxx|xxxx|xxxx|xxxx
|xxx|xxxx|xxxx|type1|xxxx|xxxx
|xxx|xxxx|xxxx|type6|xxxx
My try
eventlist = ['type1', 'type2'....]
orginalRDD = sc.textFile("/path/to/file/*.gz").map(lambda x: x.split("|"))
samplelist = []
for event in eventlist:
    # take(5) already returns a local list, so no collect() is needed
    eventsample = orginalRDD.filter(lambda x: x[3] == event).take(5)
    samplelist.extend(eventsample)
print(samplelist)
I have two questions on this:
1. Is there a better or more efficient way to collect samples based on a specific condition?
2. Is it possible to collect the unsplit lines instead of the split lines?
Python or Scala suggestions are welcome!
If the sample doesn't have to be random, something like this should work just fine:
n = ... # Number of elements you want to sample
pairs = orginalRDD.map(lambda x: (x[3], x))
pairs.aggregateByKey(
    [], # zero value
    lambda acc, x: (acc + [x])[:n], # Add new value and trim to n elements
    lambda acc1, acc2: (acc1 + acc2)[:n]) # Combine two accumulators and trim
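Since Scala suggestions were welcome too, the "keep at most n per key" logic can be sketched on a local collection without Spark. This is an assumption-laden illustration: SamplePerKey and firstN are hypothetical names, and the sample lines have no leading separator so that field 3 is the event type. Keeping the whole line as the value also shows one way to retain unsplit lines.

```scala
object SamplePerKey {
  // Keep at most n values per key, in encounter order. This mirrors the
  // "append, then trim to n" accumulator used with aggregateByKey.
  def firstN[K, V](pairs: Seq[(K, V)], n: Int): Map[K, Vector[V]] =
    pairs.foldLeft(Map.empty[K, Vector[V]]) { case (acc, (k, v)) =>
      val kept = acc.getOrElse(k, Vector.empty[V])
      if (kept.size < n) acc.updated(k, kept :+ v) else acc
    }
}

// Usage: key each raw line by its fourth field, but keep the line unsplit.
val lines = Seq("a|b|c|type1|x", "a|b|c|type2|x", "a|b|c|type1|y", "a|b|c|type1|z")
val keyed = lines.map(line => (line.split("\\|")(3), line))
val sampled = SamplePerKey.firstN(keyed, 2)
sampled.foreach { case (event, rows) => println(event + ": " + rows.mkString("; ")) }
```

In a distributed setting the same function body would play the role of the per-partition step, with a second function merging two accumulators and trimming again, as in the aggregateByKey call above.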
Getting a random sample is a little bit harder. One possible approach is to add a random value and sort before aggregation:
import os
import random
import binascii
def add_random(iter):
    # hexlify works on both Python 2 and 3; str.encode('hex') is Python 2 only
    seed = int(binascii.hexlify(os.urandom(4)), 16)
    rs = random.Random(seed)
    for x in iter:
        yield (rs.random(), x)
(pairs
    .mapPartitions(add_random)
    .sortByKey()
    .values()
    .aggregateByKey(
        [],
        lambda acc, x: (acc + [x])[:n],
        lambda acc1, acc2: (acc1 + acc2)[:n]))
For a DataFrame specific solution see Choosing random items from a Spark GroupedData Object