How to convert a List of type Any to type Double (Scala)

I am new to Scala and I would like to understand some basic stuff.
First of all, I need to calculate the average of a certain column of a DataFrame and use the result as a variable of type Double.
After some Internet research I was able to calculate the average and, at the same time, put it into a List of type Any by using the following command:
val avgX_List = mainDataFrame.groupBy().agg(mean("_c1")).collect().map(_(0)).toList
where "_c1" is the second column of my dataframe. This line of code returns a List with type List[Any].
To pass the result into a variable I used the following command:
var avgX = avgX_List(0)
hoping that the var avgX would automatically be of type Double, but that obviously did not happen.
So now let the questions begin:
What does map(_(0)) do? I know the basic definition of the map() transformation, but I can't find an explanation for this exact argument.
I know that by using the .toList method at the end of the command my result will be a List of type Any. Is there a way to change this into a List containing elements of type Double, or to convert the existing one?
Do you think it would be more appropriate to pass the column of my DataFrame into a List[Double] and then calculate the average of its elements?
Is the solution I showed above correct in any way for my problem? I know that "it is working" is different from "a correct solution".
Summing up, I need to calculate the average of a certain column of a DataFrame and have the result as a Double variable.
Note: I am Greek and I sometimes find it hard to understand some English coding "slang".

map(_(0)) is a shortcut for map( (r: Row) => r(0) ), which is in turn a shortcut for map( (r: Row) => r.apply(0) ). The apply method returns Any, and so you are losing the right type. Try using map(_.getAs[Double](0)) or map(_.getDouble(0)) instead.
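To answer the typing question concretely, here is a minimal sketch (assuming mainDataFrame and "_c1" as in the question) that keeps the result as a List[Double]:
import org.apache.spark.sql.functions.mean

val avgX_List: List[Double] = mainDataFrame
  .groupBy()
  .agg(mean("_c1"))
  .collect()
  .map(_.getDouble(0)) // typed accessor instead of apply, so no Any
  .toList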
Collecting all entries of the column and then computing the average would be highly counterproductive, because you'd have to send huge amounts of data to the driver and then do all the calculations on that single central node. That would be the exact opposite of what Spark is good at.
You also don't need collect(...).toList, because you can access the 0-th entry directly (it doesn't matter whether you get it from an Array or from a List). Since you are collapsing everything into a single Row anyway, you could get rid of the map step entirely by reordering the methods a little bit:
val avgX = mainDataFrame.groupBy().agg(mean("_c1")).collect()(0).getDouble(0)
It can be written even more concisely using the first() method:
val avgX = mainDataFrame.groupBy().agg(mean("_c1")).first().getDouble(0)

An Any value in Scala can't be converted to Double directly.
Use toString and then toDouble on the final captured result, e.g.:
scala> x
res22: Any = 1.0
scala> x.toString.toDouble
res23: Double = 1.0
Note: instead of using map(...).toList, use (0)(0) directly to get the final value from your result set.
Test sample (Scala, in spark-shell; mean needs the functions import):
import org.apache.spark.sql.functions.mean
val wa = Array("one","two","two")
val wrdd = sc.parallelize(wa,3).map(x=>(x,1))
val wdf = wrdd.toDF("col1","col2")
val x = wdf.groupBy().agg(mean("col2")).collect()(0)(0).toString.toDouble
Output:
scala> val x = wdf.groupBy().agg(mean("col2")).collect()(0)(0).toString.toDouble
x: Double = 1.0

Related

Generating a random sequence using Scala

I need to get a random sequence of 100 values from 10^-10 to 10^10 and store them in an Array using Scala. I tried the following but it didn't work:
Array(scala.math.pow(10,-10).doubleValue to scala.math.pow(10,10).intValue by scala.math.pow(10,5).toLong)
Can anyone help me to figure out how to do this correctly?
So you need to fill() the array with Random elements.
import scala.util.Random
val rndm = new Random(1911L)
Array.fill(100)(rndm.between(math.pow(10,-10), math.pow(10,10)))
//res0: Array[Double] = Array(6.08868427907728E9
// , 3.29548545155816E9
// , 9.52802903383275E9
// , 7.981295238889314E9
// , 1.9462480080050848E9
// . . .
This works because the 2nd parameter to the fill() method is "by-name", i.e. re-evaluated for every element.
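A quick sketch of what "by-name" means in practice, using the rndm instance from above:
Array.fill(3)(rndm.nextDouble()) // three different values: the expression is re-evaluated per element

val x = rndm.nextDouble()
Array.fill(3)(x)                 // the same value three times: x was evaluated once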
UPDATE
Things aren't quite as clean if you don't have the .between() method (it was added in Scala 2.13).
Array.fill(100)(rndm.nextDouble())
.map(_ * math.pow(10,10))
Note that this actually has a floor of 0.0 instead of the desired 0.0000000001. It's very unlikely you'd get an entry that's too small, especially when taking only 100 samples. Still, there are steps you could take to ensure that can't happen.
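One possible guard, sketched here: clamp anything that falls below the desired minimum (1e-10 is the question's lower bound):
val minValue = 1e-10
Array.fill(100)(rndm.nextDouble() * math.pow(10, 10))
  .map(x => math.max(x, minValue)) // anything below the floor is raised to it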

Issue with Double datatype in Scala

I'm new to Scala and am trying to come up with a library in Scala to check whether the double value being passed has a certain precision and scale. What I noticed was that if the value being passed is 1.00001 then I get the value as-is in my called function, but if the value being passed is 0.00001 then I get it as 1.0E-5. Is there any way to preserve the number in Scala?
def checkPrecisionAndScaleFormat(precision: Int, scale: Int)(valueToCheck: Double): Boolean = {
  val value = BigDecimal(valueToCheck)
  value.precision <= precision && value.scale <= scale
}
What I noticed was that if the value being passed is 1.00001 then I get the value as-is in my called function, but if the value being passed is 0.00001 then I get it as 1.0E-5
From your phrasing, it seems like you see 1.00001 and 1.0E-5 when debugging (either by printing or in the debugger). It's important to understand that
this doesn't correspond to any internal difference, it's just how Double.toString is defined in Java.
when you do something like val x = 1.00001, the value isn't exactly 1.00001 but the closest number representable as a Double: 1.000010000000000065512040237081237137317657470703125. The easiest way to see the exact value is actually looking at BigDecimal.exact(valueToCheck).
The only way to preserve the number is not to work with Double to begin with. If it's passed as a string, create the BigDecimal from the string. If it's the result of some calculations as a double, consider doing them on BigDecimals instead. But string representation of a Double simply doesn't carry the information you want.
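A minimal sketch of the difference, assuming a plain Scala REPL:
val fromString = BigDecimal("0.00001")     // built from the original text
val fromDouble = BigDecimal.exact(0.00001) // the exact binary value of the Double

fromString.precision // 1
fromString.scale     // 5
fromDouble.precision // far larger: every digit of the binary approximation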

What's the simplest way to get a Spark DataFrame from arbitrary Array Data in Scala?

I've been breaking my head about this one for a couple of days now. It feels like it should be intuitively easy... Really hope someone can help!
I've built an org.nd4j.linalg.api.ndarray.INDArray of word occurrence from some semi-structured data like this:
import org.nd4j.linalg.factory.Nd4j
import org.nd4s.Implicits._
val docMap = collection.mutable.Map[Int,Map[Int,Int]]() // of the form Map(phrase -> Map(phrasePosition -> word))
val words = ArrayBuffer("word_1","word_2","word_3",..."word_n")
val windows = ArrayBuffer("$phrase,$phrasePosition_1","$phrase,$phrasePosition_2",..."$phrase,$phrasePosition_n")
var matrix = Nd4j.create(windows.length*words.length).reshape(windows.length,words.length)
for (row <- 0 until matrix.shape(0).toInt) {
  for (column <- 0 until matrix.shape(1).toInt) {
    // +1 to (row,column) if word occurs at phrase, phrasePosition indicated by window_n.
  }
}
val finalmatrix = matrix.T.dot(matrix) // to get co-occurrence matrix
So far so good...
Downstream of this point I need to integrate the data into an existing pipeline in Spark, and use that implementation of pca etc, so I need to create a DataFrame, or at least an RDD. If I knew the number of words and/or windows in advance I could do something like:
case class Row(window : String, word_1 : Double, word_2 : Double, ...etc)
val dfSeq = ArrayBuffer[Row]()
for (row <- 0 until matrix.shape(0).toInt) {
  dfSeq += Row(windows(row),matrix.get(NDArrayIndex.point(row), NDArrayIndex.all()))
}
sc.parallelize(dfSeq).toDF("window","word_1","word_2",...etc)
but the number of windows and words is determined at runtime. I'm looking for a Windows x Words org.apache.spark.sql.DataFrame as output; the input is a Windows x Words org.nd4j.linalg.api.ndarray.INDArray.
Thanks in advance for any help you can offer.
Ok, so after several days work it looks like the simple answer is: there isn't one. In fact, it looks like trying to use Nd4j in this context at all is a bad idea for several reasons:
It's (really) hard to get data out of the native INDArray format once you've put it in.
Even using something like guava, the .data() method brings everything onto the heap, which will quickly become expensive.
You've got the added hassle of having to compile an assembly jar or use hdfs etc to handle the library itself.
I did also consider using Breeze which may actually provide a viable solution but carries some of the same problems and can't be used on distributed data structures.
Unfortunately, using native Spark / Scala datatypes, although easier once you know how, is - for someone like me coming from Python + numpy + pandas heaven at least - painfully convoluted and ugly.
Nevertheless, I did implement this solution successfully:
import org.apache.spark.mllib.linalg.{Vectors,Vector,Matrix,DenseMatrix,DenseVector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
//first make a pseudo-matrix from Scala Array[Double]:
var rowSeq = Seq.fill(windows.length)(Array.fill(words.length)(0d))
//iterate through 'rows' and 'columns' to fill it:
for (row <- 0 until windows.length) {
  for (column <- 0 until words.length) {
    // rowSeq(row)(column) += 1 if word occurs at phrase, phrasePosition indicated by window_n.
  }
}
//create Spark DenseMatrix
val rows : Array[Double] = rowSeq.transpose.flatten.toArray
val matrix = new DenseMatrix(windows.length,words.length,rows)
One of the main operations that I needed Nd4J for was matrix.T.dot(matrix) but it turns out that you can't multiply 2 matrices of Type org.apache.spark.mllib.linalg.DenseMatrix together, one of them (A) has to be a org.apache.spark.mllib.linalg.distributed.RowMatrix and - you guessed it - you can't call matrix.transpose() on a RowMatrix, only on a DenseMatrix! Since it's not really relevant to the question, I'll leave that part out, except to explain that what comes out of that step is a RowMatrix. Credit is also due here and here for the final part of the solution:
val rowMatrix: RowMatrix = transposeAndDotDenseMatrix(matrix)
// get DataFrame from RowMatrix via DenseMatrix
val newdense = new DenseMatrix(rowMatrix.numRows().toInt,rowMatrix.numCols().toInt,rowMatrix.rows.collect.flatMap(x => x.toArray)) // the call to collect() here is undesirable...
val matrixRows = newdense.rowIter.toSeq.map(_.toArray)
val df = spark.sparkContext.parallelize(matrixRows).toDF("Rows")
// then separate columns:
val df2 = (0 until words.length)
  .foldLeft(df)((df, num) => df.withColumn(words(num), $"Rows".getItem(num)))
  .drop("Rows")
Would love to hear improvements and suggestions on this, thanks.

Strange slowdown in some simple scala code

I am processing a large number of records (CDRs) that are essentially (who, where, how much). To save space I use a lookup to map the strings to integers, and I aggregate the traffic in a map of maps (who maps to a map of (where maps to how much)).
type CDR = (String, String, Int)
type Lookup = scala.collection.mutable.HashMap[String, (Int, Float)]
type Traffic = scala.collection.mutable.HashMap[Int,scala.collection.mutable.HashMap[Int,Int]]
I have found a strange behavior, when I build the lookup tables in advance the code runs as expected, however when I start processing and build the maps on the fly it slows down as it processes the records.
I use the same function to build the lookup tables for this comparison. I essentially check if the code for the lookup is there; if not, I create a new entry (it is a mutable map), like this:
// Reverse is presumably a mutable HashMap[Int, String] (number -> id)
def index(id: String, map: Lookup, reverse: Reverse): Int = {
  if (map.contains(id)) {
    map(id)._1
  } else {
    val number = if (map.keys.size == 0) 0 else reverse.keys.max + 1
    reverse += (number -> id)
    map += (id -> (number, 0.toFloat))
    number
  }
}
Am I missing something here?
EDIT----> I can no longer reproduce the slowdown. I will assume I was either too tired or dumber than usual. Running time now seems to be same as I expected to be.
What is mapCellRvs? The default Scala Map's .size (and .keys.size, which is the same thing) simply counts all elements by scanning them linearly.
Try replacing mapCellRvs.keys.size == 0 with mapCellRvs.isEmpty ...
Also, reverse.keys.max is linear as well. You may want to just remember the max somewhere separately, rather than compute it every time.
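For example, here is a sketch of the index function that avoids both linear scans by using the map's size as the next id (this assumes entries are only ever added, never removed):
def index(id: String, map: Lookup, reverse: Reverse): Int =
  map.get(id) match {
    case Some((number, _)) => number
    case None =>
      val number = map.size // next free id, O(1), instead of reverse.keys.max
      reverse += (number -> id)
      map += (id -> (number, 0.toFloat))
      number
  }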

scala neat way to match and map out Iterable

I am still learning to code in Scala/Spark and I have a problem and greatly appreciate your help.
I have an Iterable of (Key, Double) pairs.
The keys in the Iterable are a subset of, say, 5 possible keys:
population
age
gender
height
weight
and the Double is the corresponding reading.
My question is that I want to represent my data in a flat format of:
(Double,Double,Double,Double,Double)
which corresponds to the readings from Keys in specific order:
(population,age,gender,height,weight)
but where a key does not exist in the Iterable, I still need to pad it with a 0. So, for example:
Iterable((population,10),(age,21),(gender,0))
I want to be able to represent this as
(10,21,0,0,0) //the last 2 zeros are padded because there is no key matching height and weight.
So far I've been doing an individual match for each key (if it exists, copy the Double; if not, pad with zero), but I want to know if there is a neater way of doing this.
Thanks
Personally, I'd create a Map. So say you've got your Iterable:
val values = Seq(("population", 10), ("age", 21), ("gender", 0)).toIterable
Convert it to a map:
val keyValueMap = values.toMap
And when you extract the values from it, just use the getOrElse function:
keyValueMap.getOrElse("height", defaultHeight)
What I'd do is first define the order of the elements, so,
val keyOrder = List("population", "age", "gender", "height", "weight")
Next, you can just do something like;
val valMap = Map("population" -> 199D, "gender" -> 1D, "weight" -> 50D)
keyOrder.map(k => valMap.getOrElse(k, 0D))
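If a flat 5-tuple is really needed, as in the question, the padded list can then be destructured. A sketch assuming keyOrder and valMap as above:
val List(population, age, gender, height, weight) =
  keyOrder.map(k => valMap.getOrElse(k, 0D))
val flat = (population, age, gender, height, weight)
// flat: (Double, Double, Double, Double, Double) = (199.0, 0.0, 1.0, 0.0, 50.0)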