Scala filter and multiple predicates reporting - scala

I have four predicates:
private def pred1(ep: MyClass): Boolean = ep.attr1.contains(true) && func1(ep)
private def pred2(ep: MyClass): Boolean = ep.attr1.contains(true) && !func1(ep)
private def pred3(ep: MyClass): Boolean = ep.attr1.contains(false) && func2(ep)
private def pred4(ep: MyClass): Boolean = ep.attr1.contains(false) && !func2(ep)
I then have a list that I want to filter by each of the predicates like so.
val es: Seq[MyClass] = ???
val v1 = es.filter(pred1)
val v2 = es.filter(pred2)
val v3 = es.filter(pred3)
val v4 = es.filter(pred4)
How do I get the values of v1, v2, v3, v4 with the correct predicates in a single pass and report them as a 4-tuple (v1, v2, v3, v4), or something similar? I do not want to filter four times; the sequence is huge and four traversals are not optimal.

You can use a fold like this:
es.foldLeft[(Seq[MyClass], Seq[MyClass], Seq[MyClass], Seq[MyClass])]((Nil, Nil, Nil, Nil)) {
  case ((a, b, c, d), i) =>
    (
      if (pred1(i)) a :+ i else a,
      if (pred2(i)) b :+ i else b,
      if (pred3(i)) c :+ i else c,
      if (pred4(i)) d :+ i else d
    )
}
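Note that :+ on a List copies the accumulated sequence on every append, so the fold above can degrade badly on a huge input. A minimal sketch of a variant that prepends in O(1) and reverses each bucket once at the end (assuming the es and predicates above):
val (v1, v2, v3, v4) = {
  val (a, b, c, d) =
    es.foldLeft((List.empty[MyClass], List.empty[MyClass], List.empty[MyClass], List.empty[MyClass])) {
      case ((a, b, c, d), i) =>
        (if (pred1(i)) i :: a else a,
         if (pred2(i)) i :: b else b,
         if (pred3(i)) i :: c else c,
         if (pred4(i)) i :: d else d)
    }
  // flip once so each bucket keeps the original order
  (a.reverse, b.reverse, c.reverse, d.reverse)
}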


Mapping over a collection that might return multiple values or a single value

I'm currently mapping over a collection for validation, and I need to return either a single validation error or multiple ones:
val errors: Seq[Option[ProductErrors]] = products.map { p =>
  if (....) Some(ProductError(...))
  else if (...) Some(ProductError(..))
  else None
}
errors.flatten
So currently I am returning an Option[ProductError] per map iteration, but in some cases I need to return multiple ProductErrors. How can I achieve this?
e.g.
if (...) {
  val p1 = Some(ProductError(...))
  val p2 = Some(ProductError(....))
}
case class ProductErrors(msg: String = "anything")

val products = (1 to 10).toList

def convert(p: Int): Seq[ProductErrors] = {
  if (p < 5) Seq(ProductErrors("less than 5"))
  else if (p < 8 && p % 2 == 1) Seq(ProductErrors("element is odd"), ProductErrors("less than 8"))
  else Seq()
}

val errors = products.map(convert)
// errors.flatten.size
// val res8: Int = 8

// you can just use flatMap here
products.flatMap(convert).size // 8

Class state gets lost between function calls in Flink

I have this class:
case class IDADiscretizer(
  nAttrs: Int,
  nBins: Int = 5,
  s: Int = 5) extends Serializable {

  private[this] val log = LoggerFactory.getLogger(this.getClass)
  private[this] val V = Vector.tabulate(nAttrs)(i => new IntervalHeapWrapper(nBins, i))
  private[this] val randomReservoir = SamplingUtils.reservoirSample((1 to s).toList.iterator, 1)

  def updateSamples(v: LabeledVector): Vector[IntervalHeapWrapper] = {
    val attrs = v.vector.map(_._2)
    val label = v.label
    // TODO: Check for missing values
    attrs
      .zipWithIndex
      .foreach {
        case (attr, i) =>
          if (V(i).getNbSamples < s) {
            V(i) insertValue attr // insert
          } else {
            if (randomReservoir(0) <= s / (i + 1)) {
              //val randVal = Random nextInt s
              //V(i) replace (randVal, attr)
              V(i) insertValue attr
            }
          }
      }
    V
  }

  /**
   * Return the cutpoints for the discretization
   */
  def cutPoints: Vector[Vector[Double]] = V map (_.getBoundaries.toVector)

  def discretize(data: DataSet[LabeledVector]): (DataSet[Vector[IntervalHeapWrapper]], Vector[Vector[Double]]) = {
    val r = data map (x => updateSamples(x))
    val c = cutPoints
    (r, c)
  }
}
Using Flink, I would like to get the cutpoints after the call to discretize, but it seems the information stored in V gets lost. Do I have to use a broadcast variable like in this question? Is there a better way to access the state of the class?
I've tried to call cutPoints in two ways. The first is:
def discretize(data: DataSet[LabeledVector]) = data map (x => updateSamples(x))
Then, called from outside:
val a = IDADiscretizer(nAttrs = 4)
val r = a.discretize(dataSet)
r.print
val cuts = a.cutPoints
Here, cuts is empty, so I tried to compute the discretization as well as the cutpoints inside discretize:
def discretize(data: DataSet[LabeledVector]) = {
  val r = data map (x => updateSamples(x))
  val c = cutPoints
  (r, c)
}
And use it like this:
val a = IDADiscretizer(nAttrs = 4)
val (d, c) = a.discretize(dataSet)
c foreach println
But the same thing happens.
Finally, I've also tried to make V completely public:
val V = Vector.tabulate(nAttrs)(i => new IntervalHeapWrapper(nBins, i))
It is still empty.
What am I doing wrong?
Related questions:
Keep keyed state across multiple transformations
Flink State backend keys atomicy and distribution
Flink: does state access across stream?
Flink: Sharing state in CoFlatMapFunction
Answer
Thanks to @TillRohrmann, what I finally did was:
private[this] def computeCutPoints(x: LabeledVector) = {
  val attrs = x.vector.map(_._2)
  val label = x.label
  attrs
    .zipWithIndex
    .foldLeft(V) {
      case (iv, (v, i)) =>
        iv(i) insertValue v
        iv
    }
}

/**
 * Return the cutpoints for the discretization
 */
def cutPoints(data: DataSet[LabeledVector]): Seq[Seq[Double]] =
  data.map(computeCutPoints _)
    .collect
    .last.map(_.getBoundaries.toVector)

def discretize(data: DataSet[LabeledVector]): DataSet[LabeledVector] =
  data.map(updateSamples _)
And then use it like this:
val a = IDADiscretizer(nAttrs = 4)
val d = a.discretize(dataSet)
val cuts = a.cutPoints(dataSet)
d.print
cuts foreach println
I do not know if it is the best way, but at least it is working now.
The way Flink works is that the user defines operators/user-defined functions which operate on input data coming from a source function. In order to execute a program, the user code is sent to the Flink cluster where it is executed. The results of the computation have to be output to some storage system via a sink function.
Due to this, it is not easily possible to mix local and distributed computations as you are trying to do with your solution. What discretize does is define a map operator which transforms the input DataSet data. This operation will be executed once you call ExecutionEnvironment#execute or DataSet#print, for example. At that point the user code and the definition of IDADiscretizer are sent to the cluster where they are instantiated. Flink will update the values in an instance of IDADiscretizer which is not the same instance as the one you have on the client.
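To make that concrete, here is a minimal sketch (assuming the classes and dataSet from above): state computed on the cluster only becomes visible to the client once it is shipped back explicitly, which is exactly what the .collect call in the workaround above does.
val a = IDADiscretizer(nAttrs = 4)
// Defines a lazy, distributed transformation; nothing runs yet, and the
// client-side instance `a` is never mutated by the cluster.
val d = a.discretize(dataSet)
// collect() triggers execution on the cluster and returns the results
// to the client program as a local Seq.
val localResults = d.collect()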

Scala FoldLeft function

I have below sample data:
Day,JD,Month,Year,PRCP(in),SNOW(in),TAVE (F),TMAX (F),TMIN (F)
1,335,12,1895,0,0,12,26,-2
2,336,12,1895,0,0,-3,11,-16
.
.
.
Now I need to find the hottest day, i.e. the one with the maximum TMAX. I have calculated it with reduceLeft, but I couldn't figure out how to do it with foldLeft. Below is the code:
import scala.io.Source

case class TempData(day: Int, dayOfYear: Int, month: Int, year: Int,
  precip: Double, snow: Double, tave: Double, tmax: Double, tmin: Double)

object TempData {
  def main(args: Array[String]): Unit = {
    val source = Source.fromFile("C:///DataResearch/SparkScala/MN212142_9392.csv.txt")
    val lines = source.getLines().drop(1)
    val data = lines.map { line =>
      val p = line.split(",")
      TempData(p(0).toInt, p(1).toInt, p(2).toInt, p(3).toInt,
        p(4).toDouble, p(5).toDouble, p(6).toDouble, p(7).toDouble, p(8).toDouble)
    }.toArray
    source.close()
    val HottestDay = data.maxBy(_.tmax)
    println(s"Hot day 1 is $HottestDay")
    val HottestDay2 = data.reduceLeft((d1, d2) => if (d1.tmax >= d2.tmax) d1 else d2)
    println(s"Hot day 2 is $HottestDay2")
    val HottestDay3 = data.foldLeft(0.0, 0.0).....
    println(s"Hot day 3 is $HottestDay3")
  }
}
I cannot figure out how to use the foldLeft function for this.
foldLeft is a more general reduceLeft: it does not require the result type to be a supertype of the element type, and it lets you define the value to return when there is nothing to fold over. One can implement reduceLeft in terms of foldLeft like so:
def reduceLeft[B >: A](op: (B, A) => B): B = {
  if (this.isEmpty) throw new UnsupportedOperationException("empty collection")
  else this.tail.foldLeft(this.head)(op)
}
Applying that transformation, assuming that data is not empty, you can thus translate
data.reduceLeft((d1, d2) => if (d1.tmax >= d2.tmax) d1 else d2)
into
data.tail.foldLeft(data.head) { (d1, d2) =>
  if (d1.tmax >= d2.tmax) d1
  else d2
}
If data has size 1, then data.tail is empty and the result is data.head (which is trivially the maximum).
Maybe you are looking for something like this
data.foldLeft(data(0))((a, b) => if (a.tmax >= b.tmax) a else b)
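If data might be empty, a hedged sketch of a safe variant uses Option as the accumulator instead of seeding with data(0):
// Sketch, assuming the TempData class above: stay at None until the first
// element, then keep whichever element has the larger tmax.
val hottest: Option[TempData] =
  data.foldLeft(Option.empty[TempData]) {
    case (Some(best), d) if best.tmax >= d.tmax => Some(best)
    case (_, d) => Some(d)
  }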

.zip three futures in Scala [duplicate]

I need the result variable below to contain a Future[(String, String, String)] with the results of futures f1, f2 and f3, but instead I'm getting a Future[((String, String), String)]. I need the three futures to run in parallel. How can I make this work?
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

def futureA = Future { "A" }
def futureB = Future { "B" }
def futureC = Future { "C" }

def futureFunc = {
  val cond1 = 1
  val cond2 = 0
  val f1 = if (cond1 > 0) futureA else Future { "" }
  val f2 = if (cond2 > 0) futureB else Future { "" }
  val f3 = futureC
  val fx = f1.zip(f2)
  val result = fx.zip(f3)
}
If you create your futures beforehand, you can combine them in a for comprehension and they will run in parallel:
for {
a <- f1
b <- f2
c <- f3
} yield (a, b, c)
res0: scala.concurrent.Future[(String, String, String)]
I tried out a few more solutions; here is the result:
def futureFunc = {
  val cond1 = 1
  val cond2 = 0
  val f1 = if (cond1 > 0) futureA else Future { "" }
  val f2 = if (cond2 > 0) futureB else Future { "" }
  val f3 = futureC

  // #1
  Future.sequence(List(f1, f2, f3)).map {
    case List(a, b, c) => (a, b, c)
  }

  // #2
  for {
    f11 <- f1
    f22 <- f2
    f33 <- f3
  } yield (f11, f22, f33)

  // #3
  f1.zip(f2).zip(f3).map {
    case ((f11, f22), f33) => (f11, f22, f33)
  }
}
The first uses Future.sequence to create a Future[List[String]] and then maps the list into a tuple (for type-safety reasons there is no built-in method for tupling a list).
The second is the for comprehension described by Sascha; as you may know, it is syntactic sugar for map and flatMap, and it is the preferred way to work with futures.
The last one uses zip, as you wanted, but you still need to map over the final future to obtain the tuple you are after.
All of these operations are non-blocking, but they all require you to know exactly which futures you are combining. If the number of futures is not known up front, you can use additional libraries for tupling lists and then apply the first solution. For readability, I think the for comprehension is best.
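As a side note, on Scala 2.12+ the zip-based version can skip the separate map step; a minimal sketch, assuming f1, f2 and f3 as above:
// zipWith combines two futures and applies the function directly,
// so only one pattern match is needed to flatten the nested tuple.
val result: Future[(String, String, String)] =
  f1.zip(f2).zipWith(f3) { case ((a, b), c) => (a, b, c) }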

Tune Nested Loop in Scala

I was wondering if I can tune the following Scala code :
def removeDuplicates(listOfTuple: List[(Class1, Class2)]): List[(Class1, Class2)] = {
  var listNoDuplicates: List[(Class1, Class2)] = Nil
  for (outerIndex <- 0 until listOfTuple.size) {
    if (outerIndex != listOfTuple.size - 1)
      for (innerIndex <- outerIndex + 1 until listOfTuple.size) {
        if (listOfTuple(outerIndex)._1.flag.equals(listOfTuple(innerIndex)._1.flag))
          listNoDuplicates = listOfTuple(outerIndex) :: listNoDuplicates
      }
  }
  listNoDuplicates
}
Usually, if you have something looking like:
var accumulator: A = new A
for (b <- collection) {
  accumulator = update(accumulator, b)
}
val result = accumulator
it can be converted into something like:
val result = collection.foldLeft(new A) { (acc, b) => update(acc, b) }
So here we can first use a map to enforce the uniqueness of flags. Supposing the flag has a type F:
val result = listOfTuples.foldLeft(Map[F, (ClassA, ClassB)]()) {
  (map, tuple) => map + (tuple._1.flag -> tuple)
}
Then the remaining tuples can be extracted from the map and converted to a list:
val uniqList = result.values.toList
This keeps the last tuple encountered for each flag; if you want to keep the first one, replace foldLeft by foldRight and swap the arguments of the lambda.
Example:
case class ClassA(flag: Int)
case class ClassB(value: Int)

val listOfTuples =
  List((ClassA(1), ClassB(2)), (ClassA(3), ClassB(4)), (ClassA(1), ClassB(-1)))

val result = listOfTuples.foldRight(Map[Int, (ClassA, ClassB)]()) {
  (tuple, map) => map + (tuple._1.flag -> tuple)
}
val uniqList = result.values.toList
// uniqList: List((ClassA(1),ClassB(2)), (ClassA(3),ClassB(4)))
Edit: If you need to retain the order of the initial list, use instead:
val uniqList = listOfTuples.filter( result.values.toSet )
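This works because a Scala Set is itself a function from its element type to Boolean, so result.values.toSet can serve directly as the filter predicate while listOfTuples supplies the original ordering.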
This compiles, but as I can't test it, it's hard to say if it does "The Right Thing" (tm):
def removeDuplicates(listOfTuple: List[(Class1, Class2)]): List[(Class1, Class2)] =
  (for {
    outerIndex <- 0 until listOfTuple.size
    if outerIndex != listOfTuple.size - 1
    innerIndex <- outerIndex + 1 until listOfTuple.size
    if listOfTuple(outerIndex)._1.flag == listOfTuple(innerIndex)._1.flag
  } yield listOfTuple(outerIndex)).reverse.toList
Note that you can use == instead of equals (use eq if you need reference equality).
BTW: https://codereview.stackexchange.com/ is better suited for this type of question.
Do not use indexed access with lists (like listOfTuple(i)); indexing into a List has very poor performance. So, some alternatives...
The easiest:
import scala.collection.immutable.SortedSet

def removeDuplicates(listOfTuple: List[(Class1, Class2)]): List[(Class1, Class2)] =
  SortedSet(listOfTuple: _*)(Ordering by (_._1.flag)).toList
This will preserve the last of each group of duplicates. If you want it to preserve the first, pass listOfTuple.reverse instead. Because of the sorting, performance is, at best, O(n log n). So, here's a faster way, using a mutable HashSet:
def removeDuplicates(listOfTuple: List[(Class1, Class2)]): List[(Class1, Class2)] = {
  // Use a mutable hash set to track the flags already seen
  import scala.collection.mutable.HashSet
  val seen = HashSet[Flag]()
  // now fold, keeping only the first tuple for each flag
  listOfTuple.foldLeft(Nil: List[(Class1, Class2)]) {
    case (acc, el) =>
      val result = if (seen(el._1.flag)) acc else el :: acc
      seen += el._1.flag
      result
  }.reverse
}
One can avoid using a mutable HashSet in two ways:
Make seen a var holding an immutable Set, so that it can be reassigned.
Pass the set along with the list being created in the fold, as in the sketch below. The case then becomes:
case ((seen, acc), el) =>
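A minimal sketch of that second variant, assuming the Class1/Class2 and Flag types from above:
def removeDuplicates(listOfTuple: List[(Class1, Class2)]): List[(Class1, Class2)] =
  listOfTuple.foldLeft((Set.empty[Flag], List.empty[(Class1, Class2)])) {
    // thread the seen-flags set and the accumulator through together
    case ((seen, acc), el) =>
      if (seen(el._1.flag)) (seen, acc)
      else (seen + el._1.flag, el :: acc)
  }._2.reverse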