I have a Vector and want to remove elements from the Vector. How can I do it in Scala?My input is a Vector[2.0, 3.0, 0.3, 1.0, 4.0] -->Vector[Double] and I want a Vector[2.0, 0.3, 4.0] as output, so I want to remove the element with the index 1 and 3 from my input Vector...
def removeElementFromVector(input: Vector) = {
val inputAsArray = input.toArray
inputAsArray
// ...
val reducedInputAsVector = inputAsArray.toVector
}
Yes, you can use filter to achieve it, but we need to add index to remove element at an index:
Ex: Your vector (scala.collection.immutable.Vector[Double]):
scala> val v1 = val v1 = Vector(2.2, 3.3, 4.4, 5.5, 6.6, 4.4)
Output: Vector(2.2, 3.3, 4.4, 5.5, 6.6, 4.4)
Now, we will remove element at index 2:
scala> var indexRemove=2
scala> val v2 = v1.zipWithIndex.filter(x => x._2!=indexRemove).map(x=>x._1).toVector
Output: Vector(2.2, 3.3, 5.5, 6.6, 4.4)
Now, we will remove element at index 3
scala> var indexRemove=3
scala> val v2 = v1.zipWithIndex.filter(x => x._2!=indexRemove).map(x=>x._1).toVector
Output: Vector(2.2, 3.3, 4.4, 6.6, 4.4)
Hope this helps.
My input is: Vector[2.0, 3.0, 0.3, 1.0, 4.0] -->Vector[Double] and I want a Vector[2.0, 3.0, 0.3, 4.0] as output, so I want to remove the element with the index 1 and 3 from my input Vector. Sry for my question was not clear enough..
You can use filter method to remove them.
> val reducedInputVector = input.filter(x => !(Array(1,3) contains input.indexOf(x)))
reducedInputVector: scala.collection.immutable.Vector[Double] = Vector(2.0, 0.3, 4.0)
I solved it with apply:
val vec1 = Vector(2.0,3.0,0.3,1.0, 4.0)
val vec2 = Vectors.dense(vec1.apply(0), vec1.apply(1),vec1.apply(2), vec1.apply(4))
the output is
vec1: scala.collection.immutable.Vector[Double] = Vector(2.0, 3.0, 0.3, 1.0, 4.0)
vec2: org.apache.spark.mllib.linalg.Vector = [2.0,3.0,0.3,4.0]
def deleteItem[A](row: Vector[A], item: Int): Vector[A] = {
val (a,b) = row.splitAt(item)
if (b!=Nil) a ++ b.tail else a
}
Related
I have a first seq, for example:
val s: Seq[Double] = List.fill(6)(0.0)
and a sub sequences of the indices of s:
val subInd: Seq[Int] = List(2, 4, 5)
Now, what I want to do is update s on the positions 2, 4 and 5 by another Seq, which has the length of subInd:
val t: Seq[Double] = List(5.0, 6.0, 7.0)
such, that:
val updateBySeq(s, subInd, t): List[Double] = List(0.0, 0.0, 5.0, 0.0, 6.0, 7.0)
I have searched on this site and found Update multiple values in a sequence where the second answer comes close to the functionality I want to have.
However, the difference is, that the function provided would update s on the indices contained in subInd by one value. I, on the other hand, would want them to correspond to multiple, unique values in a third Seq t.
I have tried various things, like using recursion and ListBuffers, instead of Lists, to incrementally update the elements of s, but either they left s unchanged or I got an error because I violated some immutability constraint.
This should work:
def updateListByIndexes[T](data: List[T])(indexes: List[Int], update: List[T]): List[T] = {
val updateMap = (indexes lazyZip update).toMap
data.iterator.zipWithIndex.map {
case (elem, idx) =>
updateMap.getOrElse(key = idx, default = elem)
}.toList
}
Which you can use like this:
val s = List.fill(6)(0.0)
// s: List[Double] = List(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
val subInd = List(2, 4, 5)
// subInd: List[Int] = List(2, 4, 5)
val t = List(5.0, 6.0, 7.0)
// t: List[Double] = List(5.0, 6.0, 7.0)
updateListByIndexes(s)(subInd, t)
// res: List[Double] = List(0.0, 0.0, 5.0, 0.0, 6.0, 7.0)
I'm quite new to Scala and Spark, and had some questions about displaying results in output file.
I actually have a Map in which each key is associated to a List of List (Map[Int, List<Double>]), such as :
(2, List(x1,x2,x3), List(y1,y2,y3), ...).
I am supposed to display for each key the values inside the lists of lists, such as:
2 x1,x2,x3
2 y1,y2,y3
1 z1,z2,z3
and so on.
When I use the saveAsTextFile function, it doesn't give me what I want in the output. Does anybody know how I can do it?
EDIT :
This is one of my function :
def PrintCluster(vectorsByKey : Map[Int, List[Double]], vectCentroidPairs : Map[Int, Int]) : Map[Int, List[Double]] = {
var vectorsByCentroid: Map[Int, List[Double]] = Map()
val SortedCentroid = vectCentroidPairs.groupBy(_._2).mapValues(x => x.map(_._1).toList).toSeq.sortBy(_._1).toMap
SortedCentroid.foreach { case (centroid, vect) =>
var nbVectors = vect.length
for (i <- 0 to nbVectors - 1) {
var vectValues = vectorsByKey(vect(i))
println(centroid + " " + vectValues)
vectorsByCentroid += (centroid -> (vectValues))
}
}
return vectorsByCentroid
}
I know it's wrong, because I only can affect one unique keys for a group of values. That is why it returns me only the first List for each key in the Map. I thought that for using the saveAsTextFile function, I've had necessarily to use a Map structure, but I don't really know.
create sample rdd as per your input data
val rdd: RDD[Map[Int, List[List[Double]]]] = spark.sparkContext.parallelize(
Seq(Map(
2 -> List(List(-4.4, -2.0, 1.5), List(-3.3, -5.4, 3.9), List(-5.8, -3.3, 2.3), List(-5.2, -4.0, 2.8)),
1 -> List(List(7.3, 1.0, -2.0), List(9.8, 0.4, -1.0), List(7.5, 0.3, -3.0), List(6.1, -0.5, -0.6), List(7.8, 2.2, -0.7), List(6.6, 1.4, -1.1), List(8.1, -0.0, 2.7)),
3 -> List(List(-3.0, 4.0, 1.4), List(-4.0, 3.9, 0.8), List(-1.4, 4.3, -0.5), List(-1.6, 5.2, 1.0)))
)
)
Transform RDD[Map[Int, List[List[Double]]]] to RDD[(Int, String)]
val result: RDD[(Int, String)] = rdd.flatMap(i => {
i.map {
case (x, y) => y.map(list => (x, list.mkString(" ")))
}
}).flatMap(z => z)
result.foreach(println)
result.saveAsTextFile("location")
Using a Map[Int, List[List[Double]]] and simply print it in the format wanted is simple, it can be done by first converting to a list and then applying flatMap. Using the data supplied in a comment:
val map: Map[Int, List[List[Double]]] = Map(
2 -> List(List(-4.4, -2.0, 1.5), List(-3.3, -5.4, 3.9), List(-5.8, -3.3, 2.3), List(-5.2, -4.0, 2.8)),
1 -> List(List(7.3, 1.0, -2.0), List(9.8, 0.4, -1.0), List(7.5, 0.3, -3.0), List(6.1, -0.5, -0.6), List(7.8, 2.2, -0.7), List(6.6, 1.4, -1.1), List(8.1, -0.0, 2.7)),
3 -> List(List(-3.0, 4.0, 1.4), List(-4.0, 3.9, 0.8), List(-1.4, 4.3, -0.5), List(-1.6, 5.2, 1.0))
)
val list = map.toList.flatMap(t => t._2.map((t._1, _)))
val result = for (t <- list) yield t._1 + "\t" + t._2.mkString(",")
// Saving the result to file
import java.io._
val pw = new PrintWriter(new File("fileName.txt"))
result.foreach{ line => pw.println(line)}
pw.close
Will print out:
2 -4.4,-2.0,1.5
2 -3.3,-5.4,3.9
2 -5.8,-3.3,2.3
2 -5.2,-4.0,2.8
1 7.3,1.0,-2.0
1 9.8,0.4,-1.0
1 7.5,0.3,-3.0
1 6.1,-0.5,-0.6
1 7.8,2.2,-0.7
1 6.6,1.4,-1.1
1 8.1,-0.0,2.7
3 -3.0,4.0,1.4
3 -4.0,3.9,0.8
3 -1.4,4.3,-0.5
3 -1.6,5.2,1.0
I have N maps (Map[String, Double]) each having the same set of keys. Let's say something like the following:
map1 = ("elem1": 2.0, "elem2": 4.0, "elem3": 3.0)
map2 = ("elem1": 4.0, "elem2": 1.0, "elem3": 1.0)
map3 = ("elem1": 3.0, "elem2": 10.0, "elem3": 2.0)
I need to return a new map with element-wise average of those input maps:
resultMap = ("elem1": 3.0, "elem2": 5.0, "elem3": 2.0)
What's the cleanest way to do that in scala? Preferrably without using extra external libraries.
This all happens in Spark*. Thus any answers suggesting spark-specific usage could be helpful.
One option is to convert all Maps to Seqs, union them to a single Seq, group by key and take the average of values:
val maps = Seq(map1, map2, map3)
maps.map(_.toSeq).reduce(_++_).groupBy(_._1).mapValues(x => x.map(_._2).sum/x.length)
// res6: scala.collection.immutable.Map[String,Double] = Map(elem1 -> 3.0, elem3 -> 2.0, elem2 -> 5.0)
Since your question is tagged with apache-spark you can get your desired output by combining the maps into RDD[Map[String, Double]] as
scala> val rdd = sc.parallelize(Seq(Map("elem1"-> 2.0, "elem2"-> 4.0, "elem3"-> 3.0),Map("elem1"-> 4.0, "elem2"-> 1.0, "elem3"-> 1.0),Map("elem1"-> 3.0, "elem2"-> 10.0, "elem3"-> 2.0)))
rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[String,Double]] = ParallelCollectionRDD[1] at parallelize at <console>:24
Then you can use flatMap to flatten the entries of maps into individual rows and use groupBy function with key and sum the grouped values and devide it with the size of the grouped maps. You should get Your desired output as
scala> rdd.flatMap(row => row).groupBy(kv => kv._1).mapValues(values => values.map(value => value._2).sum/values.size)
res0: org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionsRDD[5] at mapValues at <console>:27
scala> res0.foreach(println)
[Stage 0:> (0 + 0) / 4](elem2,5.0)
(elem3,2.0)
(elem1,3.0)
Hope the answer is helpful
I am struggling with some very basic spark code. I would like to define a matrix x with 2 columns. This is what I have tried:
scala> val s = breeze.linalg.linspace(-3,3,5)
s: breeze.linalg.DenseVector[Double] = DenseVector(-3.0, -1.5, 0.0, 1.5, 3.0) // in this case I want s to be both column 1 and column 2 of x
scala> val ss = s.toArray ++ s.toArray
ss: Array[Double] = Array(-3.0, -1.5, 0.0, 1.5, 3.0, -3.0, -1.5, 0.0, 1.5, 3.0)
scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
scala> val mat = new RowMatrix(ss, 5, 2)
<console>:17: error: type mismatch;
found : Array[Double]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
val mat = new RowMatrix(ss, 5, 2)
I do not understand how I can get the right transformation in order to pass the values to the distributed matrix ^
EDIT:
Maybe I have been able to solve:
scala> val s = breeze.linalg.linspace(-3,3,5)
s: breeze.linalg.DenseVector[Double] = DenseVector(-3.0, -1.5, 0.0, 1.5, 3.0)
scala> val ss = s.to
toArray toDenseMatrix toDenseVector toScalaVector toString
toVector
scala> val ss = s.toArray ++ s.toArray
ss: Array[Double] = Array(-3.0, -1.5, 0.0, 1.5, 3.0, -3.0, -1.5, 0.0, 1.5, 3.0)
scala> val x = new breeze.linalg.Dense
DenseMatrix DenseVector
scala> val x = new breeze.linalg.DenseMatrix(5, 2, ss)
x: breeze.linalg.DenseMatrix[Double] =
-3.0 -3.0
-1.5 -1.5
0.0 0.0
1.5 1.5
3.0 3.0
scala> val xDist = sc.parallelize(x.toArray)
xDist: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[0] at parallelize at <console>:18
Something like this. This typechecks, but for some reason won't run in my Scala worksheet.
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc= new SparkContext(conf)
// the values for the column in each row
val col = List(-3.0, -1.5, 0.0, 1.5, 3.0) ;
// make two rows of the column values, transpose it,
// make Vectors of the result
val t = List(col,col).transpose.map(r=>Vectors.dense(r.toArray))
// make an RDD from the resultant sequence of Vectors, and
// make a RowMatrix from that.
val rm = new RowMatrix(sc.makeRDD(t));
I just want a quick way to create an array (or vector) of doubles that doesn't come out as type NumericRange.
Ive tried
val ys = Array(9. to 1. by -1.)
But this returns type Array[scala.collection.immutable.NumericRange[Double]]
Is there a way to coerce this to regular type Array[Double]?
scala> (9d to 1d by -1d).toArray
res0: Array[Double] = Array(9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0)
I think it slightly more concise and readable:
Array(9d to 1 by -1 : _*)
res0: Array[Double] = Array(9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0)