Linear operations with slices in breeze - scala

Is it somehow possible to perform slice updates on matrices in Breeze? I could not find an implicit value for parameter op.
Breeze 0.11.2.
val idxs = Seq(0,1)
val x = DenseMatrix.rand(3,3)
val y = DenseMatrix.rand(3,3)
x(idxs, idxs) += y(idxs, idxs) // can't find an implicit value for += here.
Analogous code with DenseVectors works properly.
val xv = DenseVector.rand(3)
val yv = DenseVector.rand(3)
xv(idxs) += yv(idxs)
There is an ugly work-around that updates the slices column by column.
val idxs = IndexedSeq(0, 1)
val x:DenseMatrix[Double] = DenseMatrix.zeros(3, 3)
val y = DenseMatrix.rand(3, 3)
for (r <- idxs) {
  val slx = x(::, r)
  val sly = y(::, r)
  slx(idxs) += sly(idxs)
}

It's an oversight. Please open an issue on GitHub.
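Until this is fixed, a minimal work-around sketch (assuming x and y are DenseMatrix[Double] and idxs contains valid row/column indices) is to update the selected entries element by element:
for (r <- idxs; c <- idxs) {
  // Single-element updates go through DenseMatrix.update, so no slice operator is needed.
  x(r, c) += y(r, c)
}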

How to set type of dataset when applying transformations and how to implement transformations without using spark.sql.functions._?

I am a beginner in Scala and have been working on the following problem:
Example dataset named given_dataset, with player number and points scored:
player_no | points
1         | 25.0
1         | 20.0
1         | 21.0
2         | 15.0
2         | 18.0
3         | 24.0
3         | 25.0
3         | 29.0
Problem 1:
I have a dataset and need to calculate total points scored, average points per game, and number of games played. I am unable to explicitly set the data type to "double", "int", or "float" when I apply the transformations. (Perhaps this is because they are untyped transformations?) Would anyone be able to help with this and correct my error?
No data type specified (but code is able to run)
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").orderBy("player_no")
Please note that I would like to retain the player number as I plan to merge total_points_dataset, games_played_dataset, and avg_points_dataset together.
Data type specified, but code crashes!
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").as[Double].orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().as[Int].orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").as[Double].orderBy("player_no")
Problem 2:
I would like to implement the above without using the library spark.sql.functions e.g. through functions such as map, groupByKey etc. If possible, could anyone provide an example for this and point me towards the right direction?
If you don't want to use import org.apache.spark.sql.types.{FloatType, IntegerType, StructType}, then you have to cast either at the time of reading or by using as[(Int, Double)] on the dataset. Below is an example of reading your dataset from a CSV file:
/** A function that splits a line of input into (player_no, points) tuples. */
def parseLine(line: String): (Int, Float) = {
  // Split by commas
  val fields = line.split(",")
  // Extract the player_no and points fields, and convert to integer & float
  val player_no = fields(0).toInt
  val points = fields(1).toFloat
  // Create a tuple that is our result.
  (player_no, points)
}
And then read as below:
val sc = new SparkContext("local[*]", "StackOverflow75354293")
val lines = sc.textFile("data/stackoverflowdata-noheader.csv")
val dataset = lines.map(parseLine)
val total_points_dataset2 = dataset.reduceByKey((x, y) => x + y)
val total_points_dataset2_sorted = total_points_dataset2.sortByKey(ascending = true)
total_points_dataset2_sorted.foreach(println)
val games_played_dataset2 = dataset.countByKey().toList.sorted
games_played_dataset2.foreach(println)
val avg_points_dataset2 =
  dataset
    .mapValues(x => (x, 1))
    .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
    .mapValues(x => x._1 / x._2)
    .sortByKey(ascending = true)
avg_points_dataset2.collect().foreach(println)
I tried running both ways locally and both work fine; we can check the output below as well:
(3,78.0)
(1,66.0)
(2,33.0)
(1,3)
(2,2)
(3,3)
(1,22.0)
(2,16.5)
(3,26.0)
Regarding "Problem 1", try:
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").as[(Int, Double)].orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().as[(Int, Long)].orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").as[(Int, Double)].orderBy("player_no")
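Regarding "Problem 2", here is a minimal sketch that uses only the typed Dataset API (no spark.sql.functions). It assumes a SparkSession named spark is in scope and that given_dataset has the two columns (player_no, points); the names below are illustrative, not from the original post:
import spark.implicits._

val stats = given_dataset
  .as[(Int, Double)]
  .groupByKey { case (playerNo, _) => playerNo }
  .mapGroups { (playerNo, rows) =>
    val points = rows.map(_._2).toSeq
    val total = points.sum
    val games = points.size
    (playerNo, total, games, total / games) // (player_no, total points, games played, average)
  }
  .orderBy("_1")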

In Scala, how do I get access to a specific index in a tuple?

I am implementing a function that takes a random index and returns the element of a tuple at that index.
I know that for a tuple like val a = (1,2,3), I can access an element with a._1.
However, when I use a random index, val index = random_index(an integer smaller than the size of the tuple), a._index doesn't work.
You can use productElement; note that it is zero-based and has a return type of Any:
val a=(1,2,3)
a.productElement(1) // returns 2nd element
If you know random_index only at runtime, the best you can get is (as @GuruStron answered):
val a = (1,2,3)
val i = 1
val x = a.productElement(i)
x: Any // 2
If you know random_index at compile time, you can do:
import shapeless.syntax.std.tuple._
val a = (1,2,3)
val x = a(1)
x: Int // 2 // not just Any
// a(4) // doesn't compile
val i = 1
// a(i) // doesn't compile
https://github.com/milessabin/shapeless/wiki/Feature-overview:-shapeless-2.0.0#hlist-style-operations-on-standard-scala-tuples
Although this a(1) seems to be pretty similar to the standard a._1.
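If the index is only known at runtime and you still want a precise element type, another sketch (for a fixed arity, with illustrative names) is to match on the index yourself:
val a = (1, 2, 3)

// Maps a zero-based runtime index onto the tuple accessors, keeping the Int type.
def at(i: Int): Int = i match {
  case 0 => a._1
  case 1 => a._2
  case 2 => a._3
  case _ => throw new IndexOutOfBoundsException(i.toString)
}

at(1) // 2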

Adding Sparse Vectors 3.0.0 Apache Spark Scala

I am trying to create a function like the following to add two org.apache.spark.ml.linalg.Vectors, i.e. two sparse vectors.
Such a vector could look like the following:
(28,[1,2,3,4,7,11,12,13,14,15,17,20,22,23,24,25],[0.13028398104008743,0.23648605632753023,0.7094581689825907,0.13028398104008743,0.23648605632753023,0.0,0.14218861229025295,0.3580566057240087,0.14218861229025295,0.13028398104008743,0.26056796208017485,0.0,0.14218861229025295,0.06514199052004371,0.13028398104008743,0.23648605632753023])
For example:
def add_vectors(x: org.apache.spark.ml.linalg.Vector,y:org.apache.spark.ml.linalg.Vector): org.apache.spark.ml.linalg.Vector = {
}
Let's look at a use case
val x = Vectors.sparse(2, List(0), List(1)) // [1, 0]
val y = Vectors.sparse(2, List(1), List(1)) // [0, 1]
I want the output to be
Vectors.sparse(2, List(0,1), List(1,1))
Here's another case where they share the same indices
val x = Vectors.sparse(2, List(1), List(1))
val y = Vectors.sparse(2, List(1), List(1))
This output should be
Vectors.sparse(2, List(1), List(2))
I've realized doing this is harder than it seems. I looked into one possible solution of converting the vectors into Breeze, adding them in Breeze, and then converting the result back to a vector, e.g. Addition of two RDD[mllib.linalg.Vector]'s. So I tried implementing this:
def add_vectors(x: org.apache.spark.ml.linalg.Vector, y: org.apache.spark.ml.linalg.Vector) = {
  val dense_x = x.toDense
  val dense_y = y.toDense
  val bv1 = new DenseVector(dense_x.toArray)
  val bv2 = new DenseVector(dense_y.toArray)
  val vectout = Vectors.dense((bv1 + bv2).toArray)
  vectout
}
However, this gave me an error on the last line:
val vectout = Vectors.dense((bv1 + bv2).toArray)
Cannot resolve the overloaded method 'dense'.
I'm wondering why this error is occurring and how to fix it?
To answer my own question, I had to think about how sparse vectors are represented. A sparse vector requires three arguments: the number of dimensions, an array of indices, and finally an array of values. For example:
val indices: Array[Int] = Array(1,2)
val norms: Array[Double] = Array(0.5,0.3)
val num_int = 4
val vector: Vector = Vectors.sparse(num_int, indices, norms)
If I converted this SparseVector to an Array I would get the following.
code:
val choiced_array = vector.toArray
choiced_array.map(element => print(element + " "))
Output:
[0.0, 0.5, 0.3, 0.0]
This is the dense representation of the vector. So once you convert the two vectors to arrays, you can add them with the following code:
val add: Array[Double] = (vector.toArray, vector_2.toArray).zipped.map(_ + _)
This gives you another array containing the element-wise sums. Next, to create your new sparse vector, you need to build the indices array used in the construction:
var i = -1
val new_indices_pre = add.map { element =>
  i = i + 1
  if (element > 0.0) i else -1
}
Then let's filter out all the -1 entries, which indicate a zero at that index:
val new_indices = new_indices_pre.filter(element => element != -1)
Remember to also filter the zero values out of the array holding the sums of the two vectors:
val final_add = add.filter(element => element > 0.0)
Lastly, we can make the new sparse Vector
Vectors.sparse(num_int, new_indices, final_add)
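Putting the pieces together, a self-contained sketch of add_vectors along these lines could look as follows (filtering with != 0.0 instead of > 0.0 also preserves negative sums; this is an illustrative rewrite, not the original code):
import org.apache.spark.ml.linalg.{Vector, Vectors}

def add_vectors(x: Vector, y: Vector): Vector = {
  require(x.size == y.size, "vectors must have the same dimension")
  // Sum the dense representations element-wise.
  val summed = x.toArray.zip(y.toArray).map { case (a, b) => a + b }
  // Keep only the nonzero sums together with their indices.
  val (values, indices) = summed.zipWithIndex.filter { case (v, _) => v != 0.0 }.unzip
  Vectors.sparse(x.size, indices.toArray, values.toArray)
}

// add_vectors(Vectors.sparse(2, Array(0), Array(1.0)), Vectors.sparse(2, Array(1), Array(1.0)))
// => (2,[0,1],[1.0,1.0])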

How to divide a Matrix by another matrix in Scala

I have initialized two matrices (X and Y) in Scala as follows:
var x = ofDim[Int](a1, b1)
var y = ofDim[Int](a2, b2)
x, y, a1, a2, b1 and b2 are variables.
Now I need to divide X by Y (X/Y). How can I achieve that?
There is another approach that uses Apache Commons Math. However, it is important to observe that the division operation involves multiplication and inversion, and some matrices are invertible while others are not: https://en.wikipedia.org/wiki/Invertible_matrix
The following example uses the Apache Commons library (Study.scala):
import org.apache.commons.math3.linear._

object Study {
  def main(args: Array[String]): Unit = {
    val xArray = Array(Array(1.0, 2.0), Array(3.0, 4.0))
    val yArray = Array(Array(1.0, 2.0), Array(3.0, 4.0))

    val x = new Array2DRowRealMatrix(xArray)
    val y = new Array2DRowRealMatrix(yArray)

    val yInverse = new LUDecomposition(y).getSolver().getInverse()
    val w = x.multiply(yInverse)

    for (i <- 0 until w.getRowDimension())
      for (j <- 0 until w.getColumnDimension())
        println(w.getEntry(i, j))
  }
}
Tip: If you intend to use the scala console, you need to specify the classpath ...
scala -classpath .../commons-math3/3.2/commons-math3-3.2.jar
... in the scala session you load the algorithm ...
:load .../Study.scala
... and the results come out by calling the main function of Study (the values are approximate) ...
scala> Study.main(null)
0.99 / 1.11E-16 / 0.0 / 1.02
Try:
import breeze.linalg.{DenseMatrix, inv}
val mx = new DenseMatrix(a1, b1, x.transpose.flatten)
val my = new DenseMatrix(a2, b2, y.transpose.flatten)
mx * inv(my)
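As a concrete hedged sketch (with made-up 2x2 values, converting the Int arrays to Double since inv works on DenseMatrix[Double]):
import breeze.linalg.{DenseMatrix, inv}

val x = Array(Array(1, 2), Array(3, 4))
val y = Array(Array(2, 0), Array(0, 2))

// Breeze's DenseMatrix constructor takes column-major data, hence the transpose.
val mx = new DenseMatrix(2, 2, x.transpose.flatten.map(_.toDouble))
val my = new DenseMatrix(2, 2, y.transpose.flatten.map(_.toDouble))

val result = mx * inv(my) // "X / Y" computed as X * Y^-1
println(result)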
The Breeze library, as mentioned in other responses, is necessary. You can install it using SBT or Maven.
The Breeze project can be downloaded from GitHub.
This is the Maven approach:
<dependency>
  <groupId>org.scalanlp</groupId>
  <artifactId>breeze_2.10</artifactId> <!-- or 2.11 -->
  <version>0.12</version>
</dependency>
The code ...
import breeze.linalg.DenseMatrix

object Division {
  def main(args: Array[String]): Unit = {
    var a1 = 10
    var a2 = 11
    var b1 = 12
    var b2 = 13
    //var x = Array.ofDim[Int](a1,b1)
    //var y = Array.ofDim[Int](a2,b2)
    var x = DenseMatrix(a1, b1)
    var y = DenseMatrix(a2, b2)
    var result = x / y
    print(result)
  }
}

How to take the first element of each element of an RDD[(Double, Double)] and create a diagonal matrix with it?

I'm working on Spark in Scala and I want to transform
Array[(Double, Double)] = Array((0.9398785848878621,1.0), (0.25788885483788343,1.0), (0.6093264774118677,1.0), (0.19736451516248585,0.0), (0.9952925254744414,1.0), (0.6920511147023924,0.0...
into something like
Array[Double]=Array(0.9398785848878621, 0.25788885483788343, 0.6093264774118677, 0.19736451516248585, 0.9952925254744414 , 0.6920511147023924 ...
How can I do it?
Then how can I use this Array[Double] to create a diagonal matrix?
Just take the first part of your tuple:
val a = Array((0.9398785848878621,1.0), (0.25788885483788343,1.0))
val result = a.map(_._1)
Try this:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
val a = Array((0.9398785848878621,1.0), (0.25788885483788343,1.0), ...)
val res1 = a.map(_._1)
val entries = new Array[MatrixEntry](res1.length)
for (i <- 0 until res1.length) {
  entries(i) = MatrixEntry(i, i, res1(i))
}
val mat = new CoordinateMatrix(sc.parallelize(entries))
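A slightly more compact sketch (assuming a and sc are as above; the helper names are illustrative) builds the diagonal entries with zipWithIndex:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// One MatrixEntry(i, i, value) per first tuple element gives the diagonal.
val diagEntries = a.map(_._1).zipWithIndex.map { case (v, i) => MatrixEntry(i, i, v) }
val diag = new CoordinateMatrix(sc.parallelize(diagEntries))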