Scala not a member of Any

Scala not a member of Any - scala

def interpolate(l:List[Tuple2[String,String]]) : List[Tuple2[java.util.Date, Long]] = {
val mapped : List[Tuple2[java.util.Date, Long]] = l.map(item => (format.parse(item._1), item._2.toLong ) )
val results = ListBuffer[Tuple2[java.util.Date, Long]]()
val last : Option[Tuple2[java.util.Date, Long]] = None
mapped.foreach( item =>
if(!last.isEmpty) {
val daysItem = item._1.getTime() / 1000 / 60 / 60 / 24
val daysLast = last.get._1.getTime() / 1000 / 60 / 60 / 24
if( daysItem - daysLast > 1 ) {
val slope = (item._2 - last.get._2) / (daysItem - daysLast)
val days = daysLast until daysItem
val missingChunk : List[Tuple2[java.util.Date, Long]] = days.map( day => (new Date(day * 24 * 60 * 60 * 1000), slope * day)).toList
results ++= missingChunk
}
}
//results += item
last = Some(item)
)
results.toList
}
Error:
<console>:45: error: value last is not a member of Any
possible cause: maybe a semicolon is missing before `value last'?
last = Some(item)
^

There are 2 problems here:
1) multiple statements needs {...} brackets:
from
mapped.foreach((item: (Date, Long)) => item
XXX // OK
YYY // NO
)
to
mapped.foreach { (item: (Date, Long)) => item
XXX // OK
YYY // OK
}
2) val can't be reassigned:
from
val last: Option[Tuple2[java.util.Date, Long]] = None
to
var last: Option[Tuple2[java.util.Date, Long]] = None
Refactor #1
I would try to avoid using var. It seems that with this condition if (last.isDefined) probably we are trying to zip the list with itself:
scala> val l = List(1, 2, 3, 4, 5)
scala> l.zip(l.tail)
List[(Int, Int)] = List((1,2), (2,3), (3,4), (4,5))
Refactoring your example:
import java.util.Date
def interpolate(l: List[(String, String)]): List[(Date, Long)] = {
val mapped: List[(Date, Long)] = l.map(item => (format.parse(item._1), item._2.toLong))
val results = ListBuffer[(Date, Long)]()
mapped.zip(mapped.tail).foreach { case ((lastDate, lastLong), (itemDate, itemLong)) =>
val daysItem = itemDate.getTime / 1000 / 60 / 60 / 24
val daysLast = lastDate.getTime / 1000 / 60 / 60 / 24
if (daysItem - daysLast > 1) {
val slope = (itemLong - lastLong) / (daysItem - daysLast)
val days = daysLast until daysItem
val missingChunk: List[(Date, Long)] = days.map(day => (new Date(day * 24 * 60 * 60 * 1000), slope * day)).toList
results ++= missingChunk
}
}
results.toList
}
Refactor #2
ListBuffer is a mutable collection. In our scenario it seems we are trying to flatten the missingChunks.
Keep refactoring:
def interpolate(l: List[(String, String)]): List[(Date, Long)] = {
val mapped: List[(Date, Long)] = l.map(item => (format.parse(item._1), item._2.toLong))
val missingChunks = mapped.zip(mapped.tail).map { case ((lastDate, lastLong), (itemDate, itemLong)) =>
val daysItem = itemDate.getTime / 1000 / 60 / 60 / 24
val daysLast = lastDate.getTime / 1000 / 60 / 60 / 24
if (daysItem - daysLast > 1) {
val slope = (itemLong - lastLong) / (daysItem - daysLast)
val days = daysLast until daysItem
days.map(day => (new Date(day * 24 * 60 * 60 * 1000), slope * day)).toList
} else List.empty[(Date, Long)]
}
missingChunks.flatten
}

Related

Outlier Elimination in Spark With InterQuartileRange Results in Error

I have the following recursive function that determines the Outlier using the InterQuartileRange method:
def interQuartileRangeFiltering(df: DataFrame): DataFrame = {
#scala.annotation.tailrec
def inner(cols: List[String], acc: DataFrame): DataFrame = cols match {
case Nil => acc
case column :: xs =>
val quantiles = acc.stat.approxQuantile(column, Array(0.25, 0.75), 0.0) // TODO: values should come from config
println(s"$column ${quantiles.size}")
val q1 = quantiles(0)
val q3 = quantiles(1)
val iqr = q1 - q3
val lowerRange = q1 - 1.5 * iqr
val upperRange = q3 + 1.5 * iqr
val filtered = acc.filter(s"$column < $lowerRange or $column > $upperRange")
inner(xs, filtered)
}
inner(df.columns.toList, df)
}
val outlierDF = interQuartileRangeFiltering(incomingDF)
So basically what I'm doing is that I'm recursively iterating over the columns and eliminating the outliers. Strangely it results in an ArrayIndexOutOfBounds Exception and prints the following:
housing_median_age 2
inland 2
island 2
population 2
total_bedrooms 2
near_bay 2
near_ocean 2
median_house_value 0
java.lang.ArrayIndexOutOfBoundsException: 0
at inner$1(<console>:75)
at interQuartileRangeFiltering(<console>:83)
... 54 elided
What is wrong with my approach?

Here is what I came up with and works like a charm:
def outlierEliminator(df: DataFrame, colsToIgnore: List[String])(fn: (String, DataFrame) => (Double, Double)): DataFrame = {
val ID_COL_NAME = "id"
val dfWithId = DataFrameUtils.addColumnIndex(spark, df, ID_COL_NAME)
val dfWithIgnoredCols = dfWithId.drop(colsToIgnore: _*)
#tailrec
def inner(
cols: List[String],
filterIdSeq: List[Long],
dfWithId: DataFrame
): List[Long] = cols match {
case Nil => filterIdSeq
case column :: xs =>
if (column == ID_COL_NAME) {
inner(xs, filterIdSeq, dfWithId)
} else {
val (lowerBound, upperBound) = fn(column, dfWithId)
val filteredIds =
dfWithId
.filter(s"$column < $lowerBound or $column > $upperBound")
.select(col(ID_COL_NAME))
.map(r => r.getLong(0))
.collect
.toList
inner(xs, filteredIds ++ filterIdSeq, dfWithId)
}
}
val filteredIds = inner(dfWithIgnoredCols.columns.toList, List.empty[Long], dfWithIgnoredCols)
dfWithId.except(dfWithId.filter($"$ID_COL_NAME".isin(filteredIds: _*)))
}

ConvergenceException on SimpleCurveFitter in Scala

I'm trying to fit a curve with SimpleCurveFitter of commons.math3.fitting in Scala but I catch an exception :
org.apache.commons.math3.exception.ConvergenceException : Unable to permorm
Qr decomposition on jacobian
However, I have checked my gradient calculations.... I still don't see why the exception is raised.
See the code by yourself
def main(args: Array[String]): Unit = {
var xv: DenseVector[Double] = linspace(0, 3, 300)
var yv: DenseVector[Double] = DenseVector.zeros(300)
for (i <- xv.findAll(x => x < 1.0)) yv.update(i, 1)
for (i <- xv.findAll(x => x >= 1.0)) yv.update(i, exp(-(xv(i) - 1.0)/1))
val wop: Array[WeightedObservedPoint] = new Array[WeightedObservedPoint](xv.length)
for (i <- 0 to xv.length - 1) wop.update(i, new WeightedObservedPoint(1, xv(i), yv(i)))
val f: ParametricUnivariateFunction = new ParametricUnivariateFunction {
override def value(x: Double, parameters: Double*): Double = {
val a = parameters(0)
val b = parameters(1)
1.0 / (1.0 + a * pow(x, 2 * b))
}
override def gradient(x: Double, parameters: Double*): Array[Double] = {
val a = parameters(0)
val b = parameters(1)
val ga = - pow(x, 2 * b) / pow(1 + a * pow(x, 2 * b), 2)
val gb = - (2 * a * pow(x, 2 * b) * log(x)) / pow(1 + a * pow(x, 2 * b), 2)
val grad = Array(ga, gb)
grad
}
}
val wopc = JavaConverters.asJavaCollection(wop)
val cf = SimpleCurveFitter.create(f, Array(1, 1))
val param = cf.fit(wopc)
println(param(0), param(1))
}
Thank you for your help :)

Best way to write Fastest Euclidean distance in Scala

I would like to find the fastest way to write euclidean distance in Scala. After some attemps, i'm here.
def euclidean[V <: Seq[Double]](dot1: V, dot2: V): Double = {
var d = 0D
var i = 0
while( i < dot1.size ) {
val toPow2 = dot1(i) - dot2(i)
d += toPow2 * toPow2
i += 1
}
sqrt(d)
}
Fastest results are obtain with mutable.ArrayBuffer[Double] as V and no collection.parallel._ are authorized for various vector size from 2 up to 10000
For those who desire to test breeze its slower with following distance function :
def euclideanDV(v1: DenseVector[Double], v2: DenseVector[Double]) = norm(v1 - v2)
If anyone knows any pure scala code or library that could help to improve speed it would be greatly appreciated.
The way i tested speed was i follow.
val te1 = 0L
val te2 = 0L
val runNumber = 100000
val warmUp = 60000
(0 until runNumber).foreach{ x =>
val t1 = System.nanoTime
euclidean1(v1, v2)
val t2 = System.nanoTime
euclidean2(v1, v2)
val t3 = System.nanoTime
if( x >= warmUp ) {
te1 += t2 - t1
te2 += t3 - t2
}
}
Here a some of my tries
// Fast on ArrayBuffer, quadratic on List
def euclidean1[V <: Seq[Double]](v1: V, v2: V) =
{
var d = 0D
var i = 0
while( i < v1.size ){
val toPow2 = v1(i) - v2(i)
d += toPow2 * toPow2
i += 1
}
sqrt(d)
}
// Breeze test
def euclideanDV(v1: DenseVector[Double], v2: DenseVector[Double]) = norm(v1 - v2)
// Slower than euclidean1
def euclidean2[V <: Seq[Double]](v1: V, v2: V) =
{
var d = 0D
var i = 0
while( i < v1.size )
{
d += pow(v1(i) - v2(i), 2)
i += 1
}
d
}
// Slower than 1 for Vsize ~< 1000 and a bit better over 1000 on ArrayBuffer
def euclidean3[V <: Seq[Double]](v1: V, v2: V) =
{
var d = 0D
var i = 0
(0 until v1.size).foreach{ i=>
val toPow2 = v1(i) - v2(i)
d += toPow2 * toPow2
}
sqrt(d)
}
// Slower than 1 for Vsize ~< 1000 and a bit better over 1000 on ArrayBuffer
def euclidean3bis(dot1: Seq[Double], dot2: Seq[Double]): Double =
{
var sum = 0D
dot1.indices.foreach{ id =>
val toPow2 = dot1(id) - dot2(id)
sum += toPow2 * toPow2
}
sqrt(sum)
}
// Slower than 1
def euclidean4[V <: Seq[Double]](v1: V, v2: V) =
{
var d = 0D
var i = 0
val vz = v1.zip(v2)
while( i < vz.size )
{
val (a, b) = vz(i)
val toPow2 = a - b
d += toPow2 * toPow2
i += 1
}
d
}
// Slower than 1
def euclideanL1(v1: List[Double], v2: List[Double]) = sqrt(v1.zip(v2).map{ case (a, b) =>
val toPow2 = a - b
toPow2 * toPow2
}.sum)
// Slower than 1
def euclidean5(dot1: Seq[Double], dot2: Seq[Double]): Double =
{
var sum = 0D
dot1.zipWithIndex.foreach{ case (a, id) =>
val toPow2 = a - dot2(id)
sum += toPow2 * toPow2
}
sqrt(sum)
}
// super super slow
def euclidean6(v1: Seq[Double], v2: Seq[Double]) = sqrt(v1.zip(v2).map{ case (a, b) => pow(a - b, 2) }.sum)
// Slower than 1
def euclidean7(dot1: Seq[Double], dot2: Seq[Double]): Double =
{
var sum = 0D
dot1.zip(dot2).foreach{ case (a, b) => sum += pow(a - b, 2) }
sum
}
// Slower than 1
def euclidean8(v1: Seq[Double], v2: Seq[Double]) =
{
def inc(n: Int, v: Double) = {
val toPow2 = v1(n) - v2(n)
v + toPow2 * toPow2
}
#annotation.tailrec
def go(n: Int, v: Double): Double =
{
if( n < v1.size - 1 ) go(n + 1, inc(n, v))
else inc(n, v)
}
sqrt(go(0, 0D))
}
// Slower than 1
def euclideanL2(v1: List[Double], v2: List[Double]) =
{
def inc(vzz: List[(Double, Double)], v: Double): Double =
{
val (a, b) = vzz.head
val toPow2 = a - b
v + toPow2 * toPow2
}
#annotation.tailrec
def go(vzz: List[(Double, Double)], v: Double): Double =
{
if( vzz.isEmpty ) v
else go(vzz.tail, inc(vzz, v))
}
sqrt(go(v1.zip(v2), 0D))
}

I tried tailrecursion on List but not enough efficiently on ArrayBuffer, i totally agree with the fact that proper tools like JMH are needed to test speed efficiency properly. But when order of magnitude is between 10-50% faster, we can be confident that it is better.
Even if it is V <: Seq[Double] it is NOT appropriate for List but for ArrayLike structure.
Here my proposal
def euclideanF[V <: Seq[Double]](v1: V, v2: V) = {
#annotation.tailrec
def go(d: Double, i: Int): Double = {
if( i < v1.size ) {
val toPow2 = v1(i) - v2(i)
go(d + toPow2 * toPow2, i + 1)
}
else d
}
sqrt(go(0D, 0))
}

String Hive function for split key-value pair into two columns

How to split this data T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0 into two columns using hive function
For example
T 32
P 1
A 420
H 60
R 0.30841494477846165
S 0

You can do this with a regex implementation:
def main(args: Array[String]) {
val s = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"
val pattern = "[A-Z]\\_\\d+\\.?\\d*"
var buff = new String()
val r = Pattern.compile(pattern)
val m = r.matcher(s)
while (m.find()) {
buff = buff + (m.group(0))
buff = buff + "\n"
}
buff = buff.toString.replaceAll("\\_", " ")
println("output:\n" + buff)
}
Output:
output:
T 32
P 1
A 420
H 60
R 0.30841494477846165
S 0

If you need to collect the data for further processing, and you're guaranteed it's always paired correctly, you could do something like this.
scala> val str = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"
str: String = T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0
scala> val data = str.split("_").sliding(2,2)
data: Iterator[Array[String]] = non-empty iterator
scala> data.toList // just to see it
res29: List[Array[String]] = List(Array(T, 32), Array(P, 1), Array(A, 420), Array(H, 60), Array(R, 0.30841494477846165), Array(S, 0))

You can split your string, get an array, zipWithIndex and filter based on index to get two arrays col1 and col2 and then use it for printing:
val str = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"
val tmp = str.split('_').zipWithIndex
val col1 = tmp.filter( p => p._2 % 2 == 0 ).map( p => p._1)
val col2 = tmp.filter( p => p._2 % 2 != 0 ).map( p => p._1)
//col1: Array[String] = Array(T, P, A, H, R, S)
//col2: Array[String] = Array(32, 1, 420, 60, ...

Average word length in Spark

I have a list of values and their aggregated lengths of all their occurrences as an array.
Ex: If my sentence is
"I have a cat. The cat looks very cute"
My array looks like
Array((I,1), (have,4), (a,1), (cat,6), (The, 3), (looks, 5), (very ,4), (cute,4))
Now I want to compute the average length of each word. i.e the length / number of occurrences.
I tried to do the coding using Scala as follows:
val avglen = arr.reduceByKey( (x,y) => (x, y.toDouble / x.size.toDouble) )
I'm getting an error as follows at x.size
error: value size is not a member of Int
Please help me where I'm going wrong here.

After your comment I think I got it:
val words = sc.parallelize(Array(("i", 1), ("have", 4),
("a", 1), ("cat", 6),
("the", 3), ("looks", 5),
("very", 4), ("cute", 4)))
val avgs = words.map { case (word, count) => (word, count / word.length.toDouble) }
println("My averages are: ")
avgs.take(100).foreach(println)
Supposing you have a paragraph with those words and You want to calculate the mean size of the words of the paragraph.
In two steps, with a map-reduce approach and in spark-1.5.1:
val words = sc.parallelize(Array(("i", 1), ("have", 4),
("a", 1), ("cat", 6),
("the", 3), ("looks", 5),
("very", 4), ("cute", 4)))
val wordCount = words.map { case (word, count) => count}.reduce((a, b) => a + b)
val wordLength = words.map { case (word, count) => word.length * count}.reduce((a, b) => a + b)
println("The avg length is: " + wordLength / wordCount.toDouble)
I ran this code using an .ipynb connected to a spark-kernel this is the output.

If I understand the problem correctly:
val rdd: RDD[(String, Int) = ???
val ave: RDD[(String, Double) =
rdd.map { case (name, numOccurance) =>
(name, name.length.toDouble / numOccurance)
}

This is a slightly confusing question. If your data is already in an Array[(String, Int)] collection (presumably after a collect() to the driver), then you need not use any RDD transformations. In fact, there's a nifty trick you can run with fold*() to grab the average over a collection:
val average = arr.foldLeft(0.0) { case (sum: Double, (_, count: Int)) => sum + count } / arr.foldLeft(0.0) { case (sum: Double, (word: String, count: Int)) => sum + count / word.length }
Kind of long winded, but it essentially aggregates the total number of characters in the numerator and the number of words in the denominator. Run on your example, I see the following:
scala> val arr = Array(("I",1), ("have",4), ("a",1), ("cat",6), ("The", 3), ("looks", 5), ("very" ,4), ("cute",4))
arr: Array[(String, Int)] = Array((I,1), (have,4), (a,1), (cat,6), (The,3), (looks,5), (very,4), (cute,4))
scala> val average = ...
average: Double = 3.111111111111111
If you have your (String, Int) tuples distributed across an RDD[(String, Int)], you can use accumulators to solve this problem quite easily:
val chars = sc.accumulator(0.0)
val words = sc.accumulator(0.0)
wordsRDD.foreach { case (word: String, count: Int) =>
chars += count; words += count / word.length
}
val average = chars.value / words.value
When running on the above example (placed in an RDD), I see the following:
scala> val arr = Array(("I",1), ("have",4), ("a",1), ("cat",6), ("The", 3), ("looks", 5), ("very" ,4), ("cute",4))
arr: Array[(String, Int)] = Array((I,1), (have,4), (a,1), (cat,6), (The,3), (looks,5), (very,4), (cute,4))
scala> val wordsRDD = sc.parallelize(arr)
wordsRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:14
scala> val chars = sc.accumulator(0.0)
chars: org.apache.spark.Accumulator[Double] = 0.0
scala> val words = sc.accumulator(0.0)
words: org.apache.spark.Accumulator[Double] = 0.0
scala> wordsRDD.foreach { case (word: String, count: Int) =>
| chars += count; words += count / word.length
| }
...
scala> val average = chars.value / words.value
average: Double = 3.111111111111111

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Scala not a member of Any - scala

Related

Outlier Elimination in Spark With InterQuartileRange Results in Error

ConvergenceException on SimpleCurveFitter in Scala

Best way to write Fastest Euclidean distance in Scala

String Hive function for split key-value pair into two columns

Average word length in Spark

Categories

Resources