Here is an entropy calculation based on an answer by Jeff Atwood to "How to calculate the entropy of a file?", which is based on:
object MeasureEntropy extends App {
  val s = "measure measure here measure measure measure"

  def entropyValue(s: String) = {
    // word -> number of occurrences
    val m = s.split(" ").toList.groupBy((word: String) => word).mapValues(_.length.toDouble)
    var result: Double = 0.0
    val len = s.split(" ").length
    m map {
      case (key, value: Double) =>
        val frequency: Double = value / len
        result -= frequency * (scala.math.log(frequency) / scala.math.log(2))
    }
    result
  }

  println(entropyValue(s))
}
I'd like to improve this by removing the mutable state relating to:
var result: Double = 0.0;
How can I combine the result into a single calculation over the map function?
Use foldLeft, or in this case /:, which is syntactic sugar for it:
(0d /: m) { case (result, (key, value)) =>
  val frequency = value / len
  result - frequency * (scala.math.log(frequency) / scala.math.log(2))
}
A simple sum will do the trick:
m.map {
  case (key, value: Double) =>
    val frequency: Double = value / len
    -frequency * (scala.math.log(frequency) / scala.math.log(2))
}.sum
It can be written using foldLeft as follows:
def entropyValue(s: String) = {
  val m = s.split(" ").toList.groupBy((word: String) => word).mapValues(_.length.toDouble)
  val len = s.split(" ").length
  m.foldLeft(0.0)((r, t) => r - ((t._2 / len) * (scala.math.log(t._2 / len) / scala.math.log(2))))
}
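For comparison, here is a minimal self-contained sketch of the same calculation without any mutable state. The object name `Entropy` is only an illustration and not part of the original code; the logic follows the sum-based variant above.

```scala
object Entropy {
  // Shannon entropy (in bits) of the word distribution of a space-separated string.
  def entropyValue(s: String): Double = {
    val words = s.split(" ")
    val len = words.length.toDouble
    words
      .groupBy(identity)          // word -> all of its occurrences
      .values
      .map { occurrences =>
        val frequency = occurrences.length / len
        -frequency * (math.log(frequency) / math.log(2))
      }
      .sum
  }
}
```

For the sample string from the question ("measure" five times, "here" once) this should come out at roughly 0.65 bits.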
I have a Map where key = LocalDateTime and value = Group
def someGroup(/.../): List[Group] = { {
}.map(group => (group.completedDt, group)).toMap
There is also a List[Group], where Group(completedDt: LocalDateTime, cost: Int), in which cost is always 0.
An example of what I have:
map: [(2021-04-01T00:00:00.000, 500), (2021-04-03T00:00:00.000, 1000), (2021-04-05T00:00:00.000, 750)]
list: ((2021-04-01T00:00:00.000, 0),(2021-04-02T00:00:00.000, 0),(2021-04-03T00:00:00.000, 0),(2021-04-04T00:00:00.000, 0),(2021-04-05T00:00:00.000, 0))
The expected result is:
list ((2021-04-01T00:00:00.000, 500),(2021-04-02T00:00:00.000, 0),(2021-04-03T00:00:00.000, 1000),(2021-04-04T00:00:00.000, 0),(2021-04-05T00:00:00.000, 750))
Thanks in advance!
Assuming that, when a time appears in both, you want to combine the costs:
import java.time.LocalDateTime

type Group = (LocalDateTime, Int) // completedDt, cost
val groupMap: Map[LocalDateTime, Group] = ???
val groupList: List[Group] = ???
val combined =
  groupList.foldLeft(groupMap) { (acc, group) =>
    val completedDt = group._1
    if (acc.contains(completedDt)) {
      // time already present: add the costs together
      val nv = completedDt -> (acc(completedDt)._2 + group._2)
      acc.updated(completedDt, nv)
    } else acc + (completedDt -> group)
  }.values.toList.sortBy(_._1) // You might need to define an Ordering[LocalDateTime]
The notation in your question leads me to think Group is just a pair, not a case class. It's also worth noting that I'm not sure what having the map be Map[LocalDateTime, Group] vs. Map[LocalDateTime, Int] (and thus by definition a collection of Group) buys you.
EDIT: if you have a general collection of collections of Group, you can fold over the outer collection and then over each inner one:
val groupLists: List[List[Group]] = ???
groupLists.foldLeft(Map.empty[LocalDateTime, Group]) { (acc, lst) =>
  lst.foldLeft(acc) { (m, group) =>
    val completedDt = group._1
    if (m.contains(completedDt)) {
      // look up the running total in the inner accumulator m
      val nv = completedDt -> (m(completedDt)._2 + group._2)
      m.updated(completedDt, nv)
    } else m + (completedDt -> group)
  }
}
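To make the fold concrete, here is a small sketch that runs it on the sample data from the question. It assumes, as the answer does, that Group is a plain (LocalDateTime, Int) pair; the object name `MergeGroups` and the helper `day` are made up for the example, and sorting uses `sortWith` with `isBefore` so that no implicit Ordering is required.

```scala
import java.time.LocalDateTime

object MergeGroups {
  type Group = (LocalDateTime, Int) // (completedDt, cost)

  // Hypothetical helper: midnight on the given day of April 2021.
  def day(d: Int): LocalDateTime = LocalDateTime.of(2021, 4, d, 0, 0)

  val groupMap: Map[LocalDateTime, Group] =
    List(day(1) -> 500, day(3) -> 1000, day(5) -> 750).map(g => g._1 -> g).toMap

  // Every day from the 1st to the 5th, always with cost 0.
  val groupList: List[Group] = (1 to 5).toList.map(d => (day(d), 0))

  val combined: List[Group] =
    groupList.foldLeft(groupMap) { (acc, group) =>
      val (dt, cost) = group
      acc.get(dt) match {
        case Some((_, existing)) => acc.updated(dt, (dt, existing + cost))
        case None                => acc + (dt -> group)
      }
    }.values.toList.sortWith((a, b) => a._1.isBefore(b._1))
}
```

The costs in `MergeGroups.combined` should then be 500, 0, 1000, 0, 750 in date order, matching the expected result in the question.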
Considering this function in Decoder:
final def decodeCollect[F[_], A](dec: Decoder[A], limit: Option[Int])(buffer: BitVector)(implicit cbf: Factory[A, F[A]]): Attempt[DecodeResult[F[A]]] = {
What I really need is dec: Vector[Decoder[A]], like this:
final def decodeCollect[F[_], A](dec: Vector[Decoder[A]], limit: Option[Int])(buffer: BitVector)(implicit cbf: Factory[A, F[A]]): Attempt[DecodeResult[F[A]]] = {
to process a binary format that has fields that are not self describing. Early in the file are description records, and from these come field sizes that have to be applied later in data records. So I want to build up a list of decoders and apply it N times, where N is the number of decoders.
I could write a new function modeled on decodeCollect, but it takes an implicit Factory, so I probably would have to compile the scodec library and add it.
Is there a simpler approach using what exists in the scodec library? Either a way to deal with the factory or a different approach?
I finally hacked a solution in the codec codebase. Now that that door is open, I'll add whatever I need until I succeed.
final def decodeNCollect[F[_], A](dec: Vector[Decoder[A]])(buffer: BitVector)(implicit cbf: Factory[A, F[A]]): Attempt[DecodeResult[F[A]]] = {
  val bldr = cbf.newBuilder
  var remaining = buffer
  var count = 0
  val maxCount = dec.length
  var error: Option[Err] = None
  // apply decoder number `count` to whatever is left of the buffer
  while (count < maxCount && remaining.nonEmpty) {
    dec(count).decode(remaining) match {
      case Attempt.Successful(DecodeResult(value, rest)) =>
        bldr += value
        count += 1
        remaining = rest
      case Attempt.Failure(err) =>
        error = Some(err.pushContext(count.toString))
        remaining = BitVector.empty
    }
  }
  Attempt.fromErrOption(error, DecodeResult(bldr.result, remaining))
}
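The core idea, applying a different decoder at each position while threading the unconsumed input through, can be sketched without scodec at all. Everything below is a hypothetical stand-in: `Dec`, `uintBytes` and `decodeAll` are illustration-only names, `Either` plays the role of Attempt, and `Vector[Byte]` plays the role of BitVector. Unlike the loop above, this sketch reports an error when the input runs out early instead of stopping silently.

```scala
object SequentialDecode {
  // Hypothetical decoder type: consume a prefix of the input and return the
  // decoded value plus the remainder, or an error message.
  type Dec[A] = Vector[Byte] => Either[String, (A, Vector[Byte])]

  // Hypothetical fixed-width field: read n bytes as a big-endian unsigned Int.
  def uintBytes(n: Int): Dec[Int] = { in =>
    if (in.length < n) Left(s"need $n bytes, have ${in.length}")
    else Right((in.take(n).foldLeft(0)((acc, b) => (acc << 8) | (b & 0xff)), in.drop(n)))
  }

  // Apply decoders(0), decoders(1), ... in order; stop at the first failure
  // and prefix the error with the index of the failing decoder.
  def decodeAll[A](decoders: Vector[Dec[A]])(input: Vector[Byte]): Either[String, (Vector[A], Vector[Byte])] =
    decoders.zipWithIndex
      .foldLeft[Either[String, (Vector[A], Vector[Byte])]](Right((Vector.empty[A], input))) {
        case (Right((acc, rest)), (dec, i)) =>
          dec(rest) match {
            case Right((value, rest2)) => Right((acc :+ value, rest2))
            case Left(err)             => Left(s"$i: $err")
          }
        case (failed, _) => failed
      }
}
```

With field widths taken from earlier description records, one would build the `Vector` of decoders first and then call `decodeAll` once per data record.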
final def encodeNSeq[A](encs: Vector[Encoder[A]])(seq: collection.immutable.Seq[A]): Attempt[BitVector] = {
  if (encs.length != seq.length)
    return Attempt.failure(Err("encodeNSeq: length of coders and items does not match"))
  val buf = new collection.mutable.ArrayBuffer[BitVector](seq.size)
  // encode item i with encoder i
  ((seq zip (0 until encs.length)): Seq[(A, Int)]) foreach { case (a, i) =>
    encs(i).encode(a) match {
      case Attempt.Successful(aa) => buf += aa
      case Attempt.Failure(err) => return Attempt.failure(err.pushContext(buf.size.toString))
    }
  }
  // concatenate the encoded pieces pairwise
  def merge(offset: Int, size: Int): BitVector = size match {
    case 0 => BitVector.empty
    case 1 => buf(offset)
    case n =>
      val half = size / 2
      merge(offset, half) ++ merge(offset + half, half + (if (size % 2 == 0) 0 else 1))
  }
  Attempt.successful(merge(0, buf.size))
}
private[codecs] final class VectorNCodec[A](codecs: Vector[Codec[A]]) extends Codec[Vector[A]] {
  def sizeBound = SizeBound(0, Some(codecs.length.toLong))
  def encode(vector: Vector[A]) = Encoder.encodeNSeq(codecs)(vector)
  def decode(buffer: BitVector) =
    Decoder.decodeNCollect[Vector, A](codecs)(buffer)
  override def toString = s"vector($codecs)"
}
def vectorOf[A](valueCodecs: Vector[Codec[A]]): Codec[Vector[A]] =
flatZip { count => new VectorNCodec(valueCodecs) }.
narrow[Vector[A]]({ case (cnt, xs) =>
if (xs.size == cnt) Attempt.successful(xs)
else Attempt.failure(Err(s"Insufficient number of elements: decoded ${xs.size} but should have decoded $cnt"))
}, xs => (xs.size, xs)).
I am trying to create a frequency distribution.
My data is in the following pattern (ColumnIndex, (Value, countOfValue)) of type (Int, (Any, Long)). For instance, (1, (A, 10)) means for column index 1, there are 10 A's.
My goal is to get the top 100 values for each of my column indexes (keys).
Right away I can make it less compute-intensive for my workload by doing an initial filter:
val freqNumDist = numRDD.filter(x => x._2._2 > 1)
I found an interesting example of a class here, which seems to fit my use case:
import scala.collection.mutable

class TopNList(val maxSize: Int) extends Serializable {
  val topNCountsForColumnArray = new mutable.ArrayBuffer[(Any, Long)]
  var lowestColumnCountIndex: Int = -1
  var lowestValue = Long.MaxValue

  def add(newValue: Any, newCount: Long): Unit = {
    if (topNCountsForColumnArray.length < maxSize - 1) {
      topNCountsForColumnArray += ((newValue, newCount))
    } else if (topNCountsForColumnArray.length == maxSize) {
      updateLowestValue
    } else {
      if (newCount > lowestValue) {
        topNCountsForColumnArray.insert(lowestColumnCountIndex, (newValue, newCount))
        updateLowestValue
      }
    }
  }

  // scan the buffer for the smallest count and remember where it is
  def updateLowestValue: Unit = {
    var index = 0
    topNCountsForColumnArray.foreach { r =>
      if (r._2 < lowestValue) {
        lowestValue = r._2
        lowestColumnCountIndex = index
      }
      index += 1
    }
  }
}
So now I was thinking of putting together an aggregateByKey that uses this class in order to get my top 100 values. The problem is that I am unsure how to use this class in aggregateByKey to accomplish this goal.
val initFreq: TopNList = new TopNList(100)
def freqSeq(u: (TopNList), v: (Double, Long)) = (
  u.add(v._1, v._2)
)
def freqComb(u1: TopNList, u2: TopNList) = (
  u2.topNCountsForColumnArray.foreach(r => u1.add(r._1, r._2))
)
val freqNumDist = numRDD.filter(x => x._2._2 > 1).aggregateByKey(initFreq)(freqSeq, freqComb)
The obvious problem is that nothing is returned by the functions I am using. So I am wondering how to modify this class or do I need to think about this in a whole new light and just cherry pick some of the functions out of this class and add them to the functions I am using for the aggregateByKey?
I'm either thinking about classes wrong or the entire aggregateByKey or both!
Your function implementations (freqSeq, freqComb) return Unit, while aggregateByKey expects them to return TopNList.
If you intentionally keep the style of your solution, the relevant implementation should be:
def freqSeq(u: TopNList, v: (Any, Long)): TopNList = {
  u.add(v._1, v._2) // operation gives void result (Unit)
  u // this one is of TopNList type
}
def freqComb(u1: TopNList, u2: TopNList): TopNList = {
  u2.topNCountsForColumnArray.foreach(r => u1.add(r._1, r._2))
  u1 // same here
}
Take a look at the aggregateByKey signature in PairRDDFunctions to see what it expects:
def aggregateByKey[U](zeroValue : U)(seqOp : scala.Function2[U, V, U], combOp : scala.Function2[U, U, U])(implicit evidence$3 : scala.reflect.ClassTag[U]) : org.apache.spark.rdd.RDD[scala.Tuple2[K, U]] = { /* compiled code */ }
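The shape that the signature asks for (a seqOp of type (U, V) => U and a combOp of type (U, U) => U) can be tried out without Spark. The sketch below is an assumption-laden alternative to TopNList, not the class from the question: it uses an immutable Vector as the accumulator U, and the names `TopN`, `Top`, `seqOp` and `combOp` are invented for the example. Two plain lists stand in for two partitions.

```scala
object TopN {
  type Top = Vector[(Any, Long)]

  // seqOp: fold one (value, count) pair into the running top-n, highest count first.
  def seqOp(n: Int): (Top, (Any, Long)) => Top =
    (acc, v) => (acc :+ v).sortBy(-_._2).take(n)

  // combOp: merge two partial top-n results.
  def combOp(n: Int): (Top, Top) => Top =
    (a, b) => (a ++ b).sortBy(-_._2).take(n)

  // Simulate what happens for a single key spread over two partitions.
  def demo(): Top = {
    val partition1: List[(Any, Long)] = List(("a", 7L), ("b", 2L))
    val partition2: List[(Any, Long)] = List(("c", 9L), ("d", 3L))
    val zero: Top = Vector.empty
    val left = partition1.foldLeft(zero)(seqOp(2))
    val right = partition2.foldLeft(zero)(seqOp(2))
    combOp(2)(left, right)
  }
}
```

Because both functions return the accumulator, they have the types that aggregateByKey expects for U = Vector[(Any, Long)]; re-sorting on every insert is simple rather than fast, which is acceptable for small n.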
I'm trying to update the values in a Map[String, WordCount] while iterating over it:
case class WordCount(name: String,
                     id: Int,
                     var count: Double,
                     var links: scala.collection.mutable.HashMap[String, Double],
                     var ent: Double) {
  def withEnt(v: Double): WordCount = { println(s"v: $v"); copy(ent = v) }
}
var targets = mutable.HashMap[String, WordCount]()
The function calEnt is:
def calEnt(total: Double, links: scala.collection.mutable.HashMap[String, Double]): Double = {
  var p: Double = 0.0
  var ent: Double = 0.0
  links.foreach(d => {
    p = d._2 / total
    ent -= p * Math.log(p)
  })
  return ent
}
And I'm using:
targets.map(m => m._2.withEnt(calEnt(m._2.count, m._2.links)))
to iterate over the map, calculate the new value for ent, and update it with withEnt. I can print the new value to the console, but it is not being set inside the map. How can I do that?
Use foreach:
targets.foreach(m => targets(m._1) = m._2.withEnt(calEnt(m._2.count, m._2.links)))
Self-contained example:
val m = scala.collection.mutable.HashMap[Int, Int](1 -> 1, 2 -> 2)
m.foreach(p => m(p._1) = p._2 + 1)
The map method won't modify your targets HashMap. It returns a new, modified HashMap. Try this:
targets = targets.map(m => (m._1, m._2.withEnt(calEnt(m._2.count, m._2.links))))
Note also that we map to pairs of keys (m._1) and modified values, not just to values.
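The difference between the two behaviours can be seen in a small sketch. `Entry` is a hypothetical, trimmed-down stand-in for WordCount, and `UpdateInPlace` is just a name for the example; the update loop iterates over a snapshot of the keys so that the map is not traversed while it is being written to.

```scala
import scala.collection.mutable

object UpdateInPlace {
  // Hypothetical simplified version of WordCount.
  final case class Entry(count: Double, ent: Double)

  // Returns the value of ent after a discarded map call and after an explicit update.
  def demo(): (Double, Double) = {
    val targets = mutable.HashMap("a" -> Entry(2.0, 0.0))

    // map builds a new collection; the result is thrown away here,
    // so targets itself is left unchanged.
    targets.map { case (k, v) => k -> v.copy(ent = 1.0) }
    val afterMap = targets("a").ent

    // Writing back through the key actually replaces the stored value.
    for (k <- targets.keys.toList) targets(k) = targets(k).copy(ent = 1.0)
    (afterMap, targets("a").ent)
  }
}
```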
I am looking for an existing implementation of a union-find or disjoint set data structure in Scala before I attempt to roll my own as the optimisations look somewhat complicated.
I mean this kind of thing - where the two operations union and find are optimised.
Does anybody know of anything existing? I've obviously tried googling around.
I wrote one for myself some time ago, which I believe performs decently. Unlike other implementations, the find is O(1) and the union is O(log(n)). If you have a lot more union operations than find operations, this might not be very useful. I hope you find it useful:
package week2

import scala.collection.immutable.HashSet
import scala.collection.immutable.HashMap

/**
 * Union Find implementation.
 * Find is O(1)
 * Union is O(log(n))
 * Implementation is using a HashTable. Each wrap has a set which maintains the elements in that wrap.
 * When 2 wraps are unioned, both sets are merged. O(log(n)) operation
 * A HashMap is also maintained to find the Wrap associated with each node. O(log(n)) operation in maintaining it.
 *
 * If the input array is null at any index, it is ignored
 */
class UnionFind[T](all: Array[T]) {
  private var dataStruc = HashMap.empty[T, Wrap]
  for (a <- all if (a != null))
    dataStruc = dataStruc + (a -> new Wrap(a))

  var timeU = 0L
  var timeF = 0L

  /**
   * The number of Unions
   */
  private var size = dataStruc.size

  /**
   * Unions the set containing a and b
   */
  def union(a: T, b: T): Wrap = {
    val st = System.currentTimeMillis()
    val first: Wrap = dataStruc.get(a).get
    val second: Wrap = dataStruc.get(b).get
    if (first.contains(b) || second.contains(a))
      first
    else {
      // below is to merge smaller with bigger rather than other way around
      val firstIsBig = (first.set.size > second.set.size)
      val ans = if (firstIsBig) {
        first.set = first.set ++ second.set
        second.set.foreach(a => {
          dataStruc = dataStruc - a
          dataStruc = dataStruc + (a -> first)
        })
        first
      } else {
        second.set = second.set ++ first.set
        first.set.foreach(a => {
          dataStruc = dataStruc - a
          dataStruc = dataStruc + (a -> second)
        })
        second
      }
      timeU = timeU + (System.currentTimeMillis() - st)
      size = size - 1
      ans
    }
  }

  /**
   * true if they are in same set. false if not
   */
  def find(a: T, b: T): Boolean = {
    val st = System.currentTimeMillis()
    val ans = dataStruc.get(a).get.contains(b)
    timeF = timeF + (System.currentTimeMillis() - st)
    ans
  }

  def sizeUnion: Int = size

  class Wrap(e: T) {
    var set = HashSet.empty[T]
    set = set + e

    def add(elem: T): Unit = {
      set = set + elem
    }

    def contains(elem: T): Boolean = set.contains(elem)
  }
}
Here is a simple, short and somewhat efficient mutable implementation of UnionFind:
import scala.collection.mutable

class UnionFind[T]:
  private val map = new mutable.HashMap[T, mutable.HashSet[T]]
  private var size = 0
  def distinct = size

  def addFresh(a: T): Unit =
    val set = new mutable.HashSet[T]
    set += a
    map(a) = set
    size += 1

  def setEqual(a: T, b: T): Unit =
    val ma = map(a)
    val mb = map(b)
    if !ma.contains(b) then
      // redirect the elements of the smaller set to the bigger set
      if ma.size > mb.size then
        ma ++= mb
        mb.foreach { x => map(x) = ma }
      else
        mb ++= ma
        ma.foreach { x => map(x) = mb }
      size = size - 1

  def isEqual(a: T, b: T): Boolean =
    map(a).contains(b)
An immutable implementation of UnionFind can be useful when rollback, backtracking, or proofs are necessary.
A mutable implementation can avoid garbage collection for a speedup.
One could also consider a persistent data structure, which works like an immutable implementation but internally uses some mutable state for speed.
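For reference, the two optimisations the question alludes to are usually path compression and union by size. The sketch below shows the classic array-based disjoint-set forest using both; it is a different technique from the set-based answers above, it assumes the elements are already numbered 0 until n, and the class name `DisjointSet` is only illustrative.

```scala
final class DisjointSet(n: Int) {
  private val parent: Array[Int] = Array.range(0, n) // each element starts as its own root
  private val size: Array[Int] = Array.fill(n)(1)    // tree size, meaningful only at roots
  private var components: Int = n

  // Find the representative of x, then point every node on the path at it.
  def find(x: Int): Int = {
    var root = x
    while (parent(root) != root) root = parent(root)
    var cur = x
    while (parent(cur) != root) { // path compression
      val next = parent(cur)
      parent(cur) = root
      cur = next
    }
    root
  }

  // Merge the sets containing a and b; returns false if they were already merged.
  def union(a: Int, b: Int): Boolean = {
    val ra = find(a)
    val rb = find(b)
    if (ra == rb) false
    else {
      // union by size: hang the smaller tree under the larger one
      val (big, small) = if (size(ra) >= size(rb)) (ra, rb) else (rb, ra)
      parent(small) = big
      size(big) += size(small)
      components -= 1
      true
    }
  }

  def connected(a: Int, b: Int): Boolean = find(a) == find(b)

  def count: Int = components
}
```

For arbitrary element types, one option is to keep a separate Map[T, Int] that assigns each element an index and to delegate to this structure.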