I have below sample data:
Day,JD,Month,Year,PRCP(in),SNOW(in),TAVE (F),TMAX (F),TMIN (F)
Now I need to find the hottest day, i.e. the one with the maximum TMAX. I have calculated it with reduceLeft, but couldn't figure out how to do it with foldLeft. Below is the code:
import scala.io.Source
case class TempData(day:Int , DayOfYear:Int , month:Int , year:Int ,
precip:Double , snow:Double , tave:Double, tmax:Double, tmin:Double)
object TempData {
def main(args:Array[String]) : Unit = {
val source = Source.fromFile("C:///DataResearch/SparkScala/MN212142_9392.csv.txt")
val lines = source.getLines().drop(1)
val data = lines.map { line =>
val p = line.split(",")
// The CSV has 9 columns, so the valid indices are 0 to 8
TempData(p(0).toInt, p(1).toInt, p(2).toInt, p(3).toInt
, p(4).toDouble, p(5).toDouble, p(6).toDouble, p(7).toDouble, p(8).toDouble)
}.toList
val HottestDay = data.maxBy(_.tmax)
println(s"Hot day 1 is $HottestDay")
val HottestDay2 = data.reduceLeft((d1, d2) => if (d1.tmax >= d2.tmax) d1 else d2)
println(s"Hot day 2 is $HottestDay2")
val HottestDay3 = data.foldLeft(0.0,0.0).....
println(s"Hot day 3 is $HottestDay3")
I cannot figure out how to use the foldLeft function here.
foldLeft is a more general reduceLeft (it does not require the result to be a supertype of the collection type and it allows one to define the value if there's nothing to fold over). One can implement reduceLeft in terms of foldLeft like so:
def reduceLeft[B >: A](op: (B, A) => B): B = {
if (this.isEmpty) throw new UnsupportedOperationException("empty collection")
else this.tail.foldLeft[B](this.head)(op)
}
Applying that transformation, and assuming that data is not empty, you can thus translate
data.reduceLeft((d1, d2) => if (d1.tmax >= d2.tmax) d1 else d2)
into
data.tail.foldLeft(data.head) { (d1, d2) =>
if (d1.tmax >= d2.tmax) d1
else d2
}
If data has size 1, then data.tail is empty and the result is data.head (which is trivially the maximum).
Maybe you are looking for something like this
data.foldLeft(data.head)((a, b) => if (a.tmax >= b.tmax) a else b)
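If the collection might be empty, one option is to fold with an Option as the starting value instead of data.head. The following is only a sketch under assumptions: Reading is a hypothetical, trimmed-down stand-in for TempData, and the sample values are made up.

```scala
object FoldMaxExample {
  // Hypothetical stand-in for TempData with only the fields needed here.
  final case class Reading(day: Int, tmax: Double)

  // foldLeft with an Option accumulator: None means "nothing seen yet",
  // so an empty input yields None instead of throwing.
  def hottest(data: Seq[Reading]): Option[Reading] =
    data.foldLeft(Option.empty[Reading]) {
      case (Some(best), d) if best.tmax >= d.tmax => Some(best) // keep the current maximum (first one wins ties)
      case (_, d)                                  => Some(d)    // first element, or a strictly hotter day
    }

  // Made-up sample data.
  val sample: Seq[Reading] =
    Seq(Reading(1, 30.5), Reading(2, 41.0), Reading(3, 41.0), Reading(4, 12.0))
}
```

Calling FoldMaxExample.hottest(FoldMaxExample.sample) returns the reading for day 2, because ties keep the earlier element, just like the >= comparison in the reduceLeft version.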
I have this class:
case class IDADiscretizer(
nAttrs: Int,
nBins: Int = 5,
s: Int = 5) extends Serializable {
private[this] val log = LoggerFactory.getLogger(this.getClass)
private[this] val V = Vector.tabulate(nAttrs)(i => new IntervalHeapWrapper(nBins, i))
private[this] val randomReservoir = SamplingUtils.reservoirSample((1 to s).toList.iterator, 1)
def updateSamples(v: LabeledVector): Vector[IntervalHeapWrapper] = {
val attrs = v.vector.map(_._2)
val label = v.label
// TODO: Check for missing values
.foreach {
case (attr, i) =>
if (V(i).getNbSamples < s) {
V(i) insertValue attr // insert
} else {
if (randomReservoir(0) <= s / (i + 1)) {
//val randVal = Random nextInt s
//V(i) replace (randVal, attr)
V(i) insertValue attr
// Return the cutpoints for the discretization
def cutPoints: Vector[Vector[Double]] = V map (_.getBoundaries.toVector)
def discretize(data: DataSet[LabeledVector]): (DataSet[Vector[IntervalHeapWrapper]], Vector[Vector[Double]]) = {
val r = data map (x => updateSamples(x))
val c = cutPoints
(r, c)
Using Flink, I would like to get the cutpoints after the call to discretize, but it seems the information stored in V gets lost. Do I have to use Broadcast like in this question? Is there a better way to access the state of the class?
I've tried to call cutPoints in two ways. The first is:
def discretize(data: DataSet[LabeledVector]) = data map (x => updateSamples(x))
Then, called from outside:
val a = IDADiscretizer(nAttrs = 4)
val r = a.discretize(dataSet)
val cuts = a.cutPoints
Here, cuts is empty so I tried to compute the discretization as well as the cutpoints inside discretize:
def discretize(data: DataSet[LabeledVector]) = {
val r = data map (x => updateSamples(x))
val c = cutPoints
(r, c)
And use it like this:
val a = IDADiscretizer(nAttrs = 4)
val (d, c) = a.discretize(dataSet)
c foreach println
But the same happens.
Finally, I've also tried to make V completely public:
val V = Vector.tabulate(nAttrs)(i => new IntervalHeapWrapper(nBins, i))
Still empty
What am I doing wrong?
Related questions:
Keep keyed state across multiple transformations
Flink State backend keys atomicy and distribution
Flink: does state access across stream?
Flink: Sharing state in CoFlatMapFunction
Thanks to @TillRohrmann, what I finally did was:
private[this] def computeCutPoints(x: LabeledVector) = {
val attrs = x.vector.map(_._2)
val label = x.label
.foldLeft(V) {
case (iv, (v, i)) =>
iv(i) insertValue v
// Return the cutpoints for the discretization
def cutPoints(data: DataSet[LabeledVector]): Seq[Seq[Double]] =
data.map(computeCutPoints _)
def discretize(data: DataSet[LabeledVector]): DataSet[LabeledVector] =
data.map(updateSamples _)
And then use it like this:
val a = IDADiscretizer(nAttrs = 4)
val d = a.discretize(dataSet)
val cuts = a.cutPoints(dataSet)
cuts foreach println
I do not know if it is the best way, but at least it is working now.
The way Flink works is that the user defines operators/user-defined functions which operate on input data coming from a source function. In order to execute a program, the user code is sent to the Flink cluster, where it is executed. The results of the computation have to be output to some storage system via a sink function.
Due to this, it is not possible to mix local and distributed computations easily as you are trying with your solution. What discretize does is to define a map operator which transforms the input DataSet data. This operation will be executed once you call ExecutionEnvironment#execute or DataSet#print, for example. Now the user code and the definition for IDADiscretizer is sent to the cluster where they are instantiated. Flink will update the values in an instance of IDADiscretizer which is not the same instance as the one you have on the client.
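The effect can be reproduced without Flink. The sketch below only imitates the shipping step with plain Java serialization; Counter and shipToWorker are hypothetical names invented for the illustration and are not part of Flink's API.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object ClosureCopyExample {
  // Hypothetical stand-in for a stateful helper such as IDADiscretizer.
  final class Counter extends Serializable {
    var seen: Int = 0
    def update(x: Int): Int = { seen += 1; x }
  }

  // Round-trip through Java serialization, loosely imitating how user code
  // reaches a worker: the worker gets a copy, not the client's instance.
  def shipToWorker[T <: Serializable](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  val clientSide = new Counter
  val workerSide = shipToWorker(clientSide)

  // The "distributed" map only touches the copy.
  Seq(1, 2, 3).foreach(workerSide.update)
}
```

After this runs, workerSide.seen is 3 while clientSide.seen is still 0, which mirrors why cutPoints looks empty on the client.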
I want to count up the number of times that a function f returns each value in its range (0 to f_max, inclusive) when applied to a given list l, and return the result as an array, in Scala.
Currently, I accomplish this as follows:
def count (l: List): Array[Int] = {
val arr = new Array[Int](f_max + 1)
l.foreach {
el => arr(f(el)) += 1
}
return arr
}
So arr(n) is the number of times that f returns n when applied to each element of l. This works; however, it is imperative in style, and I am wondering if there is a clean way to do this in a purely functional way.
Thank you
How about a more general approach:
def count[InType, ResultType](l: Seq[InType], f: InType => ResultType): Map[ResultType, Int] = {
l.view // create a view so we don't create new collections after each step
.map(f) // apply your function to every item in the original sequence
.groupBy(x => x) // group the returned values
.map(x => x._1 -> x._2.size) // count returned values
}
val f = (i:Int) => i
count(Seq(1,2,3,4,5,6,6,6,4,2), f)
l.foldLeft(Vector.fill(f_max + 1)(0)) { (acc, el) =>
val result = f(el)
acc.updated(result, acc(result) + 1)
}
Alternatively, a good balance of performance and external purity would be:
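def count(l: List[???]): Vector[Int] = {
val arr = l.foldLeft(Array.fill(f_max + 1)(0)) { (acc, el) =>
val result = f(el)
acc(result) += 1
acc // return the (mutated) accumulator for the next step
}
arr.toVector // expose an immutable result to callers
}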
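For reference, here is a self-contained version of the purely functional fold. The concrete f and fMax are assumptions picked only so the example compiles and runs; inputs are assumed to be non-negative so that f stays within 0 to fMax.

```scala
object CountExample {
  // Hypothetical f and fMax for illustration: f maps a non-negative Int to a bucket in 0..fMax.
  val fMax = 3
  def f(x: Int): Int = x % (fMax + 1)

  // Start from a Vector of zeros and return an updated copy at every step.
  def count(l: List[Int]): Vector[Int] =
    l.foldLeft(Vector.fill(fMax + 1)(0)) { (acc, el) =>
      val bucket = f(el)
      acc.updated(bucket, acc(bucket) + 1)
    }

  // 0, 4 and 8 land in bucket 0; the two 1s in bucket 1; 7 in bucket 3.
  val counts: Vector[Int] = count(List(0, 1, 1, 4, 7, 8))
}
```

Vector.updated copies only a small part of the structure, so this stays reasonably cheap while never mutating shared state.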
Is it possible to handle Either in similar way to Option? In Option, I have a getOrElse function, in Either I want to return Left or process Right. I'm looking for the fastest way of doing this without any boilerplate like:
val myEither:Either[String, Object] = Right(new Object())
myEither match {
case Left(leftValue) => leftValue
case Right(rightValue) => "Success"
}
In Scala 2.12,
Either is right-biased, which means that Right is assumed to be the default case to operate on. If it is Left, operations like map, flatMap, ... return the Left value unchanged
so you can do
myEither.map(_ => "Success").merge
if you find it more readable than fold.
You can use .fold:
scala> val r: Either[Int, String] = Right("hello")
r: Either[Int,String] = Right(hello)
scala> r.fold(_ => "got a left", _ => "Success")
res7: String = Success
scala> val l: Either[Int, String] = Left(1)
l: Either[Int,String] = Left(1)
scala> l.fold(_ => "got a left", _ => "Success")
res8: String = got a left
Re-reading your question it's unclear to me whether you want to return the value in the Left or another one (defined elsewhere)
If it is the former, you can pass identity to .fold, however this might change the return type to Any:
scala> r.fold(identity, _ => "Success")
res9: Any = Success
Both cchantep's and Marth's are good solutions to your immediate problem. But more broadly, it's difficult to treat Either as something fully analogous to Option, particularly in letting you express sequences of potentially fallible computations in for comprehensions. Either has a projection API (used in cchantep's solution), but it is a bit broken. (Either's projections break in for comprehensions with guards, pattern matching, or variable assignment.)
FWIW, I've written a library to solve this problem. It augments Either with this API. You define a "bias" for your Eithers. "Right bias" means that ordinary flow (map, get, etc) is represented by a Right object while Left objects represent some kind of problem. (Right bias is conventional, although you can also define a left bias if you prefer.) Then you can treat the Either like an Option; it offers a fully analogous API.
import com.mchange.leftright.BiasedEither
import BiasedEither.RightBias._
val myEither:Either[String, Object] = ...
val o = myEither.getOrElse( "Substitute" )
More usefully, you can now treat Either like a true scala monad, i.e. use flatMap, map, filter, and for comprehensions:
val myEither : Either[String, Point] = ???
val nextEither = myEither.map( _.x ) // Either[String,Int]
val myEither : Either[String, Point] = ???
def findGalaxyAtPoint( p : Point ) : Either[String,Galaxy] = ???
val locPopPair : Either[String, (Point, Long)] = {
for {
p <- myEither
g <- findGalaxyAtPoint( p )
} yield {
(p, g.population)
}
}
If all processing steps succeeded, locPopPair will be a Right[(Point, Long)]. If anything went wrong, it will be the first Left[String] encountered.
It's slightly more complex, but a good idea to define an empty token. Let's look at a slight variation on the for comprehension above:
val locPopPair : Either[String, (Point, Long)] = {
for {
p <- myEither
g <- findGalaxyAtPoint( p ) if p.x > 1000
} yield {
(p, g.population)
}
}
What would happen if the test p.x > 1000 failed? We'd want to return some Left that signifies "empty", but there is no universally appropriate value (not all Lefts are Left[String]). As of now, the code would throw a NoSuchElementException. But we can specify an empty token ourselves, as below:
import com.mchange.leftright.BiasedEither
val RightBias = BiasedEither.RightBias.withEmptyToken[String]("EMPTY")
import RightBias._
val myEither : Either[String, Point] = ???
def findGalaxyAtPoint( p : Point ) : Either[String,Galaxy] = ???
val locPopPair : Either[String, (Point, Long)] = {
for {
p <- myEither
g <- findGalaxyAtPoint( p ) if p.x > 1000
} yield {
(p, g.population)
}
}
Now, if the p.x > 1000 test fails, there will be no exception; locPopPair will just be Left("EMPTY").
I guess you can do as follows.
def foo(myEither: Either[String, Object]) =
myEither.right.map(rightValue => "Success")
In Scala 2.13, you can use myEither.getOrElse:
Right(12).getOrElse(17) // 12
Left(12).getOrElse(17) // 17
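To compare the standard-library options from the answers above side by side, here is a small sketch that assumes Scala 2.12 or later (where Either is right-biased). The function names and sample values are made up for the illustration.

```scala
object EitherExample {
  // fold handles both sides explicitly and always yields a String here.
  def describe(e: Either[String, Int]): String =
    e.fold(err => s"failed: $err", n => s"got $n")

  // map transforms only the Right; merge then collapses
  // Either[String, String] into a plain String (the Left is returned unchanged).
  def describeViaMerge(e: Either[String, Int]): String =
    e.map(n => s"got $n").merge

  // getOrElse returns the Right value, or the fallback for a Left.
  def valueOrDefault(e: Either[String, Int]): Int =
    e.getOrElse(-1)

  val ok: Either[String, Int]  = Right(42)
  val bad: Either[String, Int] = Left("boom")
}
```

Which one reads best probably depends on whether you want to return the Left value itself (map plus merge) or replace it (fold or getOrElse).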
I am trying to create a frequency distribution.
My data is in the following pattern (ColumnIndex, (Value, countOfValue)) of type (Int, (Any, Long)). For instance, (1, (A, 10)) means for column index 1, there are 10 A's.
My goal is to get the top 100 values for each of my indexes (keys).
Right away I can make it less compute intensive for my workload by doing an initial filter:
val freqNumDist = numRDD.filter(x => x._2._2 > 1)
Now I found an interesting example of a class here, which seems to fit my use case:
class TopNList (val maxSize:Int) extends Serializable {
val topNCountsForColumnArray = new mutable.ArrayBuffer[(Any, Long)]
var lowestColumnCountIndex:Int = -1
var lowestValue = Long.MaxValue
def add(newValue:Any, newCount:Long): Unit = {
if (topNCountsForColumnArray.length < maxSize -1) {
topNCountsForColumnArray += ((newValue, newCount))
} else if (topNCountsForColumnArray.length == maxSize) {
} else {
if (newCount > lowestValue) {
topNCountsForColumnArray.insert(lowestColumnCountIndex, (newValue, newCount))
}
}
}
def updateLowestValue: Unit = {
var index = 0
topNCountsForColumnArray.foreach{ r =>
if (r._2 < lowestValue) {
lowestValue = r._2
lowestColumnCountIndex = index
}
}
}
}
So now I was thinking of putting together an aggregateByKey call that uses this class to get my top 100 values. The problem is that I am unsure how to use this class in aggregateByKey to accomplish this goal.
val initFreq:TopNList = new TopNList(100)
def freqSeq(u: (TopNList), v:(Double, Long)) = (
u.add(v._1, v._2)
)
def freqComb(u1: TopNList, u2: TopNList) = (
u2.topNCountsForColumnArray.foreach(r => u1.add(r._1, r._2))
)
val freqNumDist = numRDD.filter(x => x._2._2 > 1).aggregateByKey(initFreq)(freqSeq, freqComb)
The obvious problem is that nothing is returned by the functions I am using. So I am wondering how to modify this class or do I need to think about this in a whole new light and just cherry pick some of the functions out of this class and add them to the functions I am using for the aggregateByKey?
I'm either thinking about classes wrong or the entire aggregateByKey or both!
Your function implementations (freqSeq, freqComb) return Unit, while aggregateByKey expects them to return TopNList.
If you intentionally keep the style of your solution, the relevant implementation should be
def freqSeq(u: TopNList, v:(Any, Long)) : TopNList = {
u.add(v._1, v._2) // operation gives void result (Unit)
u // return the accumulator, which is of type TopNList
}
def freqComb(u1: TopNList, u2: TopNList) : TopNList = {
u2.topNCountsForColumnArray.foreach (r => u1.add (r._1, r._2) )
u1 // return the merged accumulator
}
Just take a look at the aggregateByKey signature in PairRDDFunctions to see what it expects:
def aggregateByKey[U](zeroValue : U)(seqOp : scala.Function2[U, V, U], combOp : scala.Function2[U, U, U])(implicit evidence$3 : scala.reflect.ClassTag[U]) : org.apache.spark.rdd.RDD[scala.Tuple2[K, U]] = { /* compiled code */ }
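To see how seqOp and combOp fit together, here is a sketch that imitates the aggregation on plain Scala collections, with no Spark dependency. Everything in it is an assumption made for illustration: simulateAggregateByKey is a hypothetical helper, the nested Seq stands in for RDD partitions, and an immutable Vector replaces TopNList as the accumulator.

```scala
object TopNExample {
  type Entry = (String, Long) // (value, count)
  type TopN  = Vector[Entry]  // immutable accumulator, highest counts first

  val n = 2 // keep the top 2 per key in this toy example

  // seqOp: fold one record into an accumulator and return the new accumulator.
  def seqOp(acc: TopN, v: Entry): TopN = (acc :+ v).sortBy(-_._2).take(n)

  // combOp: merge two partial accumulators built on different partitions.
  def combOp(a: TopN, b: TopN): TopN = (a ++ b).sortBy(-_._2).take(n)

  // Apply seqOp within a single partition, starting from the empty zero value.
  private def perPartition(part: Seq[(Int, Entry)]): Map[Int, TopN] =
    part.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).foldLeft(Vector.empty[Entry])(seqOp)
    }

  // Hypothetical stand-in for aggregateByKey: seqOp inside each partition,
  // then combOp across the per-partition results.
  def simulateAggregateByKey(partitions: Seq[Seq[(Int, Entry)]]): Map[Int, TopN] =
    partitions.map(perPartition).foldLeft(Map.empty[Int, TopN]) { (merged, part) =>
      part.foldLeft(merged) { case (m, (k, top)) =>
        m.updated(k, m.get(k).fold(top)(combOp(_, top)))
      }
    }

  // Made-up sample data split over two "partitions".
  val partitions: Seq[Seq[(Int, Entry)]] = Seq(
    Seq(1 -> ("A", 10L), 1 -> ("B", 3L), 2 -> ("X", 7L)),
    Seq(1 -> ("C", 8L), 2 -> ("Y", 9L), 2 -> ("Z", 1L))
  )

  val result: Map[Int, TopN] = simulateAggregateByKey(partitions)
}
```

Both functions return the accumulator, which is exactly what the Unit-returning versions in the question were missing.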
For those who don't know what a 5-card Poker Straight is: http://en.wikipedia.org/wiki/List_of_poker_hands#Straight
I'm writing a small Poker simulator in Scala to help me learn the language, and I've created a Hand class with 5 ordered Cards in it. Each Card has a Rank and Suit, both defined as Enumerations. The Hand class has methods to evaluate the hand rank, and one of them checks whether the hand contains a Straight (we can ignore Straight Flushes for the moment). I know there are a few nice algorithms for determining a Straight, but I wanted to see whether I could design something with Scala's pattern matching, so I came up with the following:
def isStraight() = {
def matchesStraight(ranks: List[Rank.Value]): Boolean = ranks match {
case head :: Nil => true
case head :: tail if (Rank(head.id + 1) == tail.head) => matchesStraight(tail)
case _ => false
That works fine and is fairly readable, but I was wondering if there is any way to get rid of that if. I'd imagine something like the following, though I can't get it to work:
private def isStraight() = {
def matchesStraight(ranks: List[Rank.Value]): Boolean = ranks match {
case head :: Nil => true
case head :: next(head.id + 1) :: tail => matchesStraight(next :: tail)
case _ => false
Any ideas? Also, as a side question, what is the general opinion on the inner matchesStraight definition? Should this rather be private or perhaps done in a different way?
You can't pass information to an extractor, and you can't use information from one value returned in another, except in the if guard, which is there to cover all these cases.
What you can do is create your own extractors to test these things, but it won't gain you much if there isn't any reuse.
For example:
class SeqExtractor[A, B](f: A => B) {
def unapplySeq(s: Seq[A]): Option[Seq[A]] =
if (s map f sliding 2 forall { case Seq(a, b) => a == b } ) Some(s)
else None
}
val Straight = new SeqExtractor((_: Card).rank)
Then you can use it like this:
listOfCards match {
case Straight(cards) => true
case _ => false
But, of course, all that you really want is that if statement in SeqExtractor. So, don't fall too much in love with a solution, as you may miss simpler ways of doing stuff.
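As a variation on the same idea, a Boolean extractor keeps the check in one place and lets the match read without a guard. This is only a sketch under simplifying assumptions: ranks are plain Ints that are already sorted ascending, and the ace-low case is ignored.

```scala
object StraightExtractorExample {
  // Boolean extractor: matches when there are five ranks and each one
  // is exactly one higher than its predecessor.
  // Assumes ranks are plain Ints, already sorted ascending.
  object Straight {
    def unapply(ranks: Seq[Int]): Boolean =
      ranks.size == 5 && ranks.zip(ranks.tail).forall { case (a, b) => b == a + 1 }
  }

  def isStraight(ranks: Seq[Int]): Boolean = ranks match {
    case Straight() => true
    case _          => false
  }
}
```

The if has not disappeared; it has just moved into unapply, which is only worth it if the extractor gets reused.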
You could do something like:
val ids = ranks.map(_.id)
ids.max - ids.min == 4 && ids.distinct.length == 5
Handling aces correctly requires a bit of work, though.
Update: Here's a much better solution:
(ids zip ids.tail).forall{case (p,q) => q%13==(p+1)%13}
The % 13 in the comparison handles aces being both rank 1 and rank 14.
How about something like:
def isStraight(cards:List[Card]) = (cards zip cards.tail) forall { case (c1,c2) => c1.rank+1 == c2.rank}
val cards = List(Card(1),Card(2),Card(3),Card(4))
scala> isStraight(cards)
res2: Boolean = true
This is a completely different approach, but it does use pattern matching. It produces warnings in the match clause which seem to indicate that it shouldn't work. But it actually produces the correct results:
Straight !!! 34567
Straight !!! 34567
Sorry no straight this time
I ignored the suits for now, and I also ignored the possibility of an ace under a 2.
abstract class Rank {
def value : Int
case class Next[A <: Rank](a : A) extends Rank {
def value = a.value + 1
case class Two() extends Rank {
def value = 2
class Hand(a : Rank, b : Rank, c : Rank, d : Rank, e : Rank) {
val cards = List(a, b, c, d, e).sortWith(_.value < _.value)
object Hand{
def unapply(h : Hand) : Option[(Rank, Rank, Rank, Rank, Rank)] = Some((h.cards(0), h.cards(1), h.cards(2), h.cards(3), h.cards(4)))
object Poker {
val two = Two()
val three = Next(two)
val four = Next(three)
val five = Next(four)
val six = Next(five)
val seven = Next(six)
val eight = Next(seven)
val nine = Next(eight)
val ten = Next(nine)
val jack = Next(ten)
val queen = Next(jack)
val king = Next(queen)
val ace = Next(king)
def main(args : Array[String]) {
val simpleStraight = new Hand(three, four, five, six, seven)
val unsortedStraight = new Hand(four, seven, three, six, five)
val notStraight = new Hand (two, two, five, five, ace)
def printIfStraight[A](h : Hand) {
h match {
case Hand(a: A , b : Next[A], c : Next[Next[A]], d : Next[Next[Next[A]]], e : Next[Next[Next[Next[A]]]]) => println("Straight !!! " + a.value + b.value + c.value + d.value + e.value)
case Hand(a,b,c,d,e) => println("Sorry no straight this time")
If you are interested in more stuff like this google 'church numerals scala type system'
How about something like this?
def isStraight = {
cards.map(_.rank).toList match {
case first :: second :: third :: fourth :: fifth :: Nil if
first.id == second.id - 1 &&
second.id == third.id - 1 &&
third.id == fourth.id - 1 &&
fourth.id == fifth.id - 1 => true
case _ => false
You're still stuck with the if (which is in fact larger), but there's no recursion and no custom extractor (which I believe you're using incorrectly with next, which is why your second attempt doesn't work).
If you're writing a poker program, you are already checking for n-of-a-kind. A hand is a straight when it has no n-of-a-kinds (n > 1) and the difference between the minimum denomination and the maximum is exactly four.
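That rule might look like the sketch below. The encoding is an assumption made for the example (ranks as Ints from 2 to 14, with the ace as 14), and the extra check for the ace-low straight is an addition that the rule as stated does not cover.

```scala
object SpreadCheckExample {
  // Assumed encoding: ranks are Ints 2..14, with ace = 14.
  def isStraight(ranks: Seq[Int]): Boolean = {
    val noDuplicates = ranks.distinct.size == ranks.size // no n-of-a-kind with n > 1
    val wheel = ranks.sorted == Seq(2, 3, 4, 5, 14)      // A-2-3-4-5, ace counted low
    ranks.size == 5 && noDuplicates && (ranks.max - ranks.min == 4 || wheel)
  }
}
```

The input does not need to be sorted, since only the minimum, the maximum, and the absence of duplicates matter.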
I was doing something like this a few days ago, for Project Euler problem 54. Like you, I had Rank and Suit as enumerations.
My Card class looks like this:
case class Card(rank: Rank.Value, suit: Suit.Value) extends Ordered[Card] {
def compare(that: Card) = that.rank compare this.rank
}
Note I gave it the Ordered trait so that we can easily compare cards later. Also, when parsing the hands, I sorted them from high to low using sorted, which makes assessing values much easier.
Here is my straight test which returns an Option value depending on whether it's a straight or not. The actual return value (a list of Ints) is used to determine the strength of the hand, the first representing the hand type from 0 (no pair) to 9 (straight flush), and the others being the ranks of any other cards in the hand that count towards its value. For straights, we're only worried about the highest ranking card.
Also, note that you can make a straight with Ace as low, the "wheel", or A2345.
case class Hand(cards: Array[Card]) {
def straight: Option[List[Int]] = {
if( cards.sliding(2).forall { case Array(x, y) => (y compare x) == 1 } )
Some(5 :: cards(0).rank.id :: 0 :: 0 :: 0 :: 0 :: Nil)
else if ( cards.map(_.rank.id).toList == List(12, 3, 2, 1, 0) )
Some(5 :: cards(1).rank.id :: 0 :: 0 :: 0 :: 0 :: Nil)
else None
}
}
Here is a complete idiomatic Scala hand classifier for all hands (handles 5-high straights):
case class Card(rank: Int, suit: Int) { override def toString = s"${"23456789TJQKA" rank}${"♣♠♦♥" suit}" }
object HandType extends Enumeration {
val HighCard, OnePair, TwoPair, ThreeOfAKind, Straight, Flush, FullHouse, FourOfAKind, StraightFlush = Value
case class Hand(hand: Set[Card]) {
val (handType, sorted) = {
def rankMatches(card: Card) = hand count (_.rank == card.rank)
val groups = hand groupBy rankMatches mapValues {_.toList.sorted}
val isFlush = (hand groupBy {_.suit}).size == 1
val isWheel = "A2345" forall {r => hand exists (_.rank == Card.ranks.indexOf(r))} // A,2,3,4,5 straight
val isStraight = groups.size == 1 && (hand.max.rank - hand.min.rank) == 4 || isWheel
val (isThreeOfAKind, isOnePair) = (groups contains 3, groups contains 2)
val handType = if (isStraight && isFlush) HandType.StraightFlush
else if (groups contains 4) HandType.FourOfAKind
else if (isThreeOfAKind && isOnePair) HandType.FullHouse
else if (isFlush) HandType.Flush
else if (isStraight) HandType.Straight
else if (isThreeOfAKind) HandType.ThreeOfAKind
else if (isOnePair && groups(2).size == 4) HandType.TwoPair
else if (isOnePair) HandType.OnePair
else HandType.HighCard
val kickers = ((1 until 5) flatMap groups.get).flatten.reverse
require(hand.size == 5 && kickers.size == 5)
(handType, if (isWheel) (kickers takeRight 4) :+ kickers.head else kickers)
object Hand {
import scala.math.Ordering.Implicits._
implicit val rankOrdering = Ordering by {hand: Hand => (hand.handType, hand.sorted)}