I have the following code, but I want to make it more idiomatic. First, I define three variables as Option[String]:
var p1 = None : Option[String]
var p2 = None : Option[String]
var p3 = None : Option[String]
Then I set these parameters using the function getParameters(), which simply reads them from a text file:
getParameters()
//...
def getParameters(): Unit = {
// assign the outer vars; `val` here would create shadowing locals and leave p1-p3 as None
p1 = Some(getP1())
p2 = Some(getP2())
p3 = Some(getP3())
}
Finally, right after getParameters(), I run another function, getRules, that uses p1, p2 and p3. It expects them to be String instead of Option[String].
val df = getRules(p1,p2,p3)
If any of these three parameters is None, the program should throw an error. I wonder if I am on the right track. What if the number of parameters is bigger, e.g. 10 or 15? What is the best short way to process these parameters?
val valuesOpt = for(a <- p1; b <- p2; c <- p3) yield (a,b,c)
valuesOpt.map{
case (a, b, c) => getRules(a, b, c)
}.getOrElse(throw new Exception("Nope"))
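If the number of parameters grows to 10 or 15, you can collect them in a List[Option[String]] and unpack them all at once. The sequence helper below is a sketch of my own (not from the question), but it keeps the code short:
// Turn List[Option[A]] into Option[List[A]]: None if any element is None.
def sequence[A](xs: List[Option[A]]): Option[List[A]] =
  xs.foldRight(Option(List.empty[A])) { (opt, acc) =>
    for (v <- opt; rest <- acc) yield v :: rest
  }

val params: List[Option[String]] = List(p1, p2, p3 /* ..., p15 */)
val values: List[String] =
  sequence(params).getOrElse(throw new Exception("One or more parameters were None"))
This scales to any number of parameters without nesting, at the cost of losing the distinct names inside the list.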
There are multiple ways to unpack an Option, but in your case I think this is the easiest to read and maintain:
val df = (p1, p2, p3) match {
case (Some(x), Some(y), Some(z)) => getRules(x, y, z)
case _ => throw new Exception("One or more values were NONE!")
}
Edit: Here is a small ScalaFiddle to demonstrate how to use this: https://scalafiddle.io/sf/YVCYBBl/1
I have this class:
case class IDADiscretizer(
nAttrs: Int,
nBins: Int = 5,
s: Int = 5) extends Serializable {
private[this] val log = LoggerFactory.getLogger(this.getClass)
private[this] val V = Vector.tabulate(nAttrs)(i => new IntervalHeapWrapper(nBins, i))
private[this] val randomReservoir = SamplingUtils.reservoirSample((1 to s).toList.iterator, 1)
def updateSamples(v: LabeledVector): Vector[IntervalHeapWrapper] = {
val attrs = v.vector.map(_._2)
val label = v.label
// TODO: Check for missing values
attrs
.zipWithIndex
.foreach {
case (attr, i) =>
if (V(i).getNbSamples < s) {
V(i) insertValue attr // insert
} else {
if (randomReservoir(0) <= s / (i + 1)) {
//val randVal = Random nextInt s
//V(i) replace (randVal, attr)
V(i) insertValue attr
}
}
}
V
}
/**
* Return the cutpoints for the discretization
*
*/
def cutPoints: Vector[Vector[Double]] = V map (_.getBoundaries.toVector)
def discretize(data: DataSet[LabeledVector]): (DataSet[Vector[IntervalHeapWrapper]], Vector[Vector[Double]]) = {
val r = data map (x => updateSamples(x))
val c = cutPoints
(r, c)
}
}
Using Flink, I would like to get the cutpoints after the call to discretize, but it seems the information stored in V gets lost. Do I have to use Broadcast like in this question? Is there a better way to access the state of the class?
I've tried to call cutPoints in two ways. One of them is:
def discretize(data: DataSet[LabeledVector]) = data map (x => updateSamples(x))
Then, called from outside:
val a = IDADiscretizer(nAttrs = 4)
val r = a.discretize(dataSet)
r.print
val cuts = a.cutPoints
Here, cuts is empty, so I tried to compute the discretization as well as the cutpoints inside discretize:
def discretize(data: DataSet[LabeledVector]) = {
val r = data map (x => updateSamples(x))
val c = cutPoints
(r, c)
}
And use it like this:
val a = IDADiscretizer(nAttrs = 4)
val (d, c) = a.discretize(dataSet)
c foreach println
But the same happens.
Finally, I've also tried to make V completely public:
val V = Vector.tabulate(nAttrs)(i => new IntervalHeapWrapper(nBins, i))
It is still empty.
What am I doing wrong?
Related questions:
Keep keyed state across multiple transformations
Flink State backend keys atomicy and distribution
Flink: does state access across stream?
Flink: Sharing state in CoFlatMapFunction
Answer
Thanks to @TillRohrmann, what I finally did was:
private[this] def computeCutPoints(x: LabeledVector) = {
val attrs = x.vector.map(_._2)
val label = x.label
attrs
.zipWithIndex
.foldLeft(V) {
case (iv, (v, i)) =>
iv(i) insertValue v
iv
}
}
/**
* Return the cutpoints for the discretization
*
*/
def cutPoints(data: DataSet[LabeledVector]): Seq[Seq[Double]] =
data.map(computeCutPoints _)
.collect
.last.map(_.getBoundaries.toVector)
def discretize(data: DataSet[LabeledVector]): DataSet[LabeledVector] =
data.map(updateSamples _)
And then use it like this:
val a = IDADiscretizer(nAttrs = 4)
val d = a.discretize(dataSet)
val cuts = a.cutPoints(dataSet)
d.print
cuts foreach println
I do not know if it is the best way, but at least it is working now.
The way Flink works is that the user defines operators/user-defined functions which operate on input data coming from a source function. In order to execute a program, the user code is sent to the Flink cluster where it is executed. The results of the computation have to be output to some storage system via a sink function.
Due to this, it is not possible to easily mix local and distributed computations as you are trying with your solution. What discretize does is to define a map operator which transforms the input DataSet data. This operation will be executed once you call ExecutionEnvironment#execute or DataSet#print, for example. At that point the user code and the definition for IDADiscretizer are sent to the cluster, where they are instantiated. Flink will update the values in an instance of IDADiscretizer which is not the same instance as the one you have on the client.
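As a minimal sketch (my own, assuming dataSet: DataSet[LabeledVector] as in the question), the way around this is to ship the results back to the client explicitly, e.g. with DataSet#collect, which is essentially what the accepted solution above does:
// collect() triggers execution and returns the results to the client, so the
// cutpoints can be derived client-side instead of being read from the
// client-side IDADiscretizer instance (which the cluster never updates).
val discretizer = IDADiscretizer(nAttrs = 4)
val updated: Seq[Vector[IntervalHeapWrapper]] =
  dataSet.map(discretizer.updateSamples _).collect()
val cuts = updated.last.map(_.getBoundaries.toVector)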
I have the sample data below:
Day,JD,Month,Year,PRCP(in),SNOW(in),TAVE (F),TMAX (F),TMIN (F)
1,335,12,1895,0,0,12,26,-2
2,336,12,1895,0,0,-3,11,-16
.
.
.
Now I need to find the hottest day, i.e. the one with the maximum TMAX. I have calculated it with reduceLeft, but I couldn't figure out how to do it with foldLeft. Below is the code:
import scala.io.Source
case class TempData(day:Int , DayOfYear:Int , month:Int , year:Int ,
precip:Double , snow:Double , tave:Double, tmax:Double, tmin:Double)
object TempData {
def main(args:Array[String]) : Unit = {
val source = Source.fromFile("C:///DataResearch/SparkScala/MN212142_9392.csv.txt")
val lines = source.getLines().drop(1)
val data = lines.map { line =>
val p = line.split(",")
TempData(p(0).toInt, p(1).toInt, p(2).toInt, p(3).toInt,
p(4).toDouble, p(5).toDouble, p(6).toDouble, p(7).toDouble, p(8).toDouble)
}.toArray
source.close()
val HottestDay = data.maxBy(_.tmax)
println(s"Hot day 1 is $HottestDay")
val HottestDay2 = data.reduceLeft((d1, d2) => if (d1.tmax >= d2.tmax) d1 else d2)
println(s"Hot day 2 is $HottestDay2")
val HottestDay3 = data.foldLeft(0.0,0.0).....
println(s"Hot day 3 is $HottestDay3")
}
}
I cannot figure out how to use the foldLeft function for this.
foldLeft is a more general reduceLeft (it does not require the result type to be a supertype of the element type, and it lets you define the value returned when there is nothing to fold over). One can implement reduceLeft in terms of foldLeft like so:
def reduceLeft[B >: A](op: (B, A) => B): B = {
  if (this.isEmpty) throw new UnsupportedOperationException("empty collection")
  else this.tail.foldLeft(this.head)(op)
}
Applying that transformation, assuming that data is not empty, you can thus translate
data.reduceLeft((d1, d2) => if (d1.tmax >= d2.tmax) d1 else d2)
into
data.tail.foldLeft(data.head) { (d1, d2) =>
if (d1.tmax >= d2.tmax) d1
else d2
}
If data has size 1, then data.tail is empty and the result is data.head (which is trivially the maximum).
Maybe you are looking for something like this:
data.foldLeft(data(0))((a, b) => if (a.tmax >= b.tmax) a else b)
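Note that data(0) throws if data is empty. An empty-safe sketch (my addition, assuming data: Array[TempData] as above) uses headOption:
val hottest: Option[TempData] =
  data.headOption.map(h => data.foldLeft(h)((a, b) => if (a.tmax >= b.tmax) a else b))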
I have two hash sets here:
var vertexes = new HashSet[String]()
var edges = new HashSet[RDFTriple]() //RDFTriple is a class
I want to put them into a map like this:
val graph = scala.collection.mutable.Map[String, HashSet[_]]()
graph.put("e", edges)
graph.put("v", vertexes)
But now I want to get vertexes and edges back out separately, which failed. I have tried something like the following:
val a = graph.get("v")
a match {
case _ => val v = a
}
val b = graph.get("e")
b match {
case _ => val e = b
}
But v and e end up typed as Option[HashSet[_]], while what I want are HashSet[String] and HashSet[RDFTriple].
How can I do this?
I would appreciate any help; this has been bothering me for too long.
It is not recommended to use different types in the same Map; however, you could solve the problem by matching on Some and using asInstanceOf, like this:
val v = a match {
case Some(a) => a.asInstanceOf[HashSet[String]]
case None => // do something
}
Note that the assignment val v = ... is done outside the match to allow usage of the variable afterwards. The match for the edges is similar.
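For completeness, the analogous match for the edges might look like this (the thrown exception is just an illustrative placeholder):
val e = b match {
  case Some(set) => set.asInstanceOf[HashSet[RDFTriple]]
  case None => throw new NoSuchElementException("no edges in graph")
}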
However, a better solution would be to use a case class for the graph. Then you would avoid a lot of hassle.
case class Graph(vertexes: HashSet[String], edges: HashSet[RDFTriple])
val graph = Graph(vertexes, edges)
val v = graph.vertexes // HashSet[String]
val e = graph.edges // HashSet[RDFTriple]
Is it possible to handle Either in a similar way to Option? Option has a getOrElse function; for Either, I want to return the Left value or process the Right. I'm looking for the fastest way of doing this without any boilerplate like:
val myEither:Either[String, Object] = Right(new Object())
myEither match {
case Left(leftValue) => value
case Right(righValue) =>
"Success"
}
In Scala 2.12,
Either is right-biased, which means that Right is assumed to be the default case to operate on. If it is Left, operations like map, flatMap, ... return the Left value unchanged
so you can do
myEither.map(_ => "Success").merge
if you find it more readable than fold.
You can use .fold:
scala> val r: Either[Int, String] = Right("hello")
r: Either[Int,String] = Right(hello)
scala> r.fold(_ => "got a left", _ => "Success")
res7: String = Success
scala> val l: Either[Int, String] = Left(1)
l: Either[Int,String] = Left(1)
scala> l.fold(_ => "got a left", _ => "Success")
res8: String = got a left
Edit:
Re-reading your question, it's unclear to me whether you want to return the value in the Left or another one (defined elsewhere).
If it is the former, you can pass identity to .fold, however this might change the return type to Any:
scala> r.fold(identity, _ => "Success")
res9: Any = Success
Both cchantep's and Marth's are good solutions to your immediate problem. But more broadly, it's difficult to treat Either as something fully analogous to Option, particularly in letting you express sequences of potentially failable computations for comprehensions. Either has a projection API (used in cchantep's solution), but it is a bit broken. (Either's projections break in for comprehensions with guards, pattern matching, or variable assignment.)
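For illustration (a sketch of my own, not from the original answer), this is the kind of thing that goes wrong with projections in pre-2.12 Scala:
val a: Either[String, Int] = Right(1)
val b: Either[String, Int] = Right(2)

// Pre-2.12, Either itself has no map/flatMap, so you must project:
val sum = for { x <- a.right; y <- b.right } yield x + y // Right(3)

// But a guard changes the type: RightProjection#filter returns
// Option[Either[...]], so this does not compile as written:
// val sumPos = for { x <- a.right if x > 0; y <- b.right } yield x + y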
FWIW, I've written a library to solve this problem. It augments Either with this API. You define a "bias" for your Eithers. "Right bias" means that ordinary flow (map, get, etc) is represented by a Right object while Left objects represent some kind of problem. (Right bias is conventional, although you can also define a left bias if you prefer.) Then you can treat the Either like an Option; it offers a fully analogous API.
import com.mchange.leftright.BiasedEither
import BiasedEither.RightBias._
val myEither:Either[String, Object] = ...
val o = myEither.getOrElse( "Substitute" )
More usefully, you can now treat Either like a true Scala monad, i.e. use flatMap, map, filter, and for comprehensions:
val myEither : Either[String, Point] = ???
val nextEither = myEither.map( _.x ) // Either[String,Int]
or
val myEither : Either[String, Point] = ???
def findGalaxyAtPoint( p : Point ) : Either[String,Galaxy] = ???
val locPopPair : Either[String, (Point, Long)] = {
for {
p <- myEither
g <- findGalaxyAtPoint( p )
} yield {
(p, g.population)
}
}
If all processing steps succeeded, locPopPair will be a Right containing (Point, Long). If anything went wrong, it will be the first Left[String] encountered.
It's slightly more complex, but a good idea to define an empty token. Let's look at a slight variation on the for comprehension above:
val locPopPair : Either[String, (Point, Long)] = {
for {
p <- myEither
g <- findGalaxyAtPoint( p ) if p.x > 1000
} yield {
(p, g.population)
}
}
What would happen if the test p.x > 1000 failed? We'd want to return some Left that signifies "empty", but there is no universally appropriate value (not all Lefts are Left[String]). As of now, the code would throw a NoSuchElementException. But we can specify an empty token ourselves, as below:
import com.mchange.leftright.BiasedEither
val RightBias = BiasedEither.RightBias.withEmptyToken[String]("EMPTY")
import RightBias._
val myEither : Either[String, Point] = ???
def findGalaxyAtPoint( p : Point ) : Either[String,Galaxy] = ???
val locPopPair : Either[String, (Point, Long)] = {
for {
p <- myEither
g <- findGalaxyAtPoint( p ) if p.x > 1000
} yield {
(p, g.population)
}
}
Now, if the p.x > 1000 test fails, there will be no Exception, locPopPair will just be Left("EMPTY").
I guess you can do as follows:
def foo(myEither: Either[String, Object]) =
myEither.right.map(rightValue => "Success")
In Scala 2.12+, you can use myEither.getOrElse:
Right(12).getOrElse(17) // 12
Left(12).getOrElse(17) // 17
Similarly to "mutually recursive types in scala", I am trying to create a mutually recursive type in Scala.
I am trying to make a graph defined with this type (which does compile) :
case class Node(val id : Int, val edges : Set[Node])
But I don't understand how I can actually create something with this type, because in order to initialize node A with edges B and C, I need to at least have a lazy reference to B and C, but I can't simultaneously create their edge sets.
Is it possible to implement this recursive type?
EDIT:
Here is the solution I am currently using to convert an explicit adjacency list to a self-referential one.
def mapToGraph(edgeMap : Map[Int, mutable.Set[Int]]) : List[Node] = {
lazy val nodeMap = edgeMap map (kv => (kv._1, new Node(kv._1, futures.get(kv._1).get)))
lazy val futures : Map[Int, Set[Node]] = edgeMap map (kv => {
val edges = (kv._2 map (e => nodeMap.get(e).get)).toSet
(kv._1, edges)
})
val eval = nodeMap.values.toList
eval // force evaluation of the lazy values - don't really like doing this
}
Or alternatively, from an edge list:
//reads an edgeList into a graph
def readEdgelist(filename : String) : List[Node] = {
lazy val nodes = new mutable.HashMap[Int, Node]()
lazy val edges = new mutable.HashMap[Int, mutable.Buffer[Node]]()
Source.fromFile(filename).getLines() filter (_.nonEmpty) foreach { edgeStr =>
val edge = edgeStr.split('\t')
if (edge.size != 2) goodbye("Not a well-formed edge : " + edgeStr + " size: " + edge.size.toString)
val src = edge(0).toInt
val des = edge(1).toInt
if (!(nodes.contains(src))) nodes.put(src, new Node(src, futures.get(src).get))
if (!(nodes.contains(des))) nodes.put(des, new Node(des, futures.get(des).get))
edges.put(src, edges.getOrElse(src, mutable.Buffer[Node]()) += nodes.get(des).get)
}
lazy val futures : Map[Int, Set[Node]] = nodes map {node => (node._1, edges.getOrElse(node._1, mutable.Buffer[Node]()).toSet)} toMap
val eval = nodes.values.toList
eval
}
Thanks everyone for the advice!
Sounds like you need to work from the bottom up:
val b = Node(1, Set.empty)
val c = Node(2, Set.empty)
val a = Node(3, Set(b, c))
Hope that helps
Chicken & egg... you have three options:
1. Restrict your graphs to directed acyclic graphs (DAGs) and use RKumsher's suggestion.
2. To retain immutability, separate your node instances from your edge sets (two different classes: create the nodes, then create the edge sets/graph).
3. If you prefer the tight correlation, consider using a setter for the edge sets so that you can come back and set them later, after all the nodes are created (see the sketch below).
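A sketch of the third option (names are my own, purely illustrative): create the nodes first with empty edge sets, then wire the edges once all nodes exist:
// A node whose edge set is assigned after construction, allowing cycles.
class GraphNode(val id: Int) {
  private var _edges: Set[GraphNode] = Set.empty
  def edges: Set[GraphNode] = _edges
  def edges_=(es: Set[GraphNode]): Unit = _edges = es
}

val a = new GraphNode(1)
val b = new GraphNode(2)
a.edges = Set(b)
b.edges = Set(a) // a cycle, impossible with the immutable case class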