How can we sample from a list of strings based on their probability of occurrence in the list in Scala?

I have a List[(String, Double)] variable where the second element of the tuple denotes the probability of the string in the first element appearing in a corpus. An example would be [(Apple, 0.2), (Banana, 0.3), (Lemon, 0.5)], where an Apple appears with a probability of 0.2 in the list of strings. I want to randomly sample from the list of strings based on their probability of appearance, something along the lines of numpy's random.choice() method. What would be the correct way to do this in Scala?

Another solution:
def choice(samples: Seq[(String, Double)], n: Int): Seq[String] = {
  val (strings, probs) = samples.unzip
  // Cumulative lower bound of each bucket, e.g. (0.2, 0.3, 0.5) -> (0.0, 0.2, 0.5)
  val cumprobs = probs.scanLeft(0.0)(_ + _).init
  // Map a uniform draw in [0, 1) to the string whose bucket contains it
  def p2s(p: Double): String = strings(cumprobs.lastIndexWhere(_ <= p))
  Seq.fill(n)(math.random).map(p2s)
}
Usage (and verification):
>> val ss = choice(Seq(("Apple", 0.2), ("Banana", 0.3), ("Lemon", 0.5)), 10000)
>> ss.groupBy(identity).map { case (k, v) => (k, v.size) }
Map[String, Int] = Map(Banana -> 3013, Lemon -> 4971, Apple -> 2016)
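For a long list of candidates, the linear lastIndexWhere scan inside p2s can be swapped for a binary search over the cumulative probabilities. A minimal sketch using scala.collection.Searching (choiceBinary is a hypothetical name; exact boundary hits have probability zero and are simply mapped to the lower bucket):
import scala.collection.Searching._
import scala.util.Random

def choiceBinary(samples: Seq[(String, Double)], n: Int): Seq[String] = {
  val (strings, probs) = samples.unzip
  // Cumulative upper bound of each bucket, e.g. (0.2, 0.3, 0.5) -> Vector(0.2, 0.5, 1.0)
  val cumprobs = probs.scanLeft(0.0)(_ + _).tail.toIndexedSeq
  def p2s(p: Double): String = cumprobs.search(p) match {
    case Found(i)          => strings(i)
    case InsertionPoint(i) => strings(i)
  }
  Seq.fill(n)(Random.nextDouble()).map(p2s)
}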

A very naive (and inefficient) solution would be to create a List of 100 elements that repeats each of the original elements the number of times needed to respect its probability. Then you can randomly shuffle that List and finally take the first element.
import scala.util.Random

final val percent_100 = BigDecimal(100)

def choice[T](data: List[(T, Double)]): T = {
  val distribution = data.flatMap {
    case (elem, probability) =>
      // Round to two decimal places so each element maps to a whole
      // number of slots out of 100.
      val scaledProbability = BigDecimal(probability).setScale(
        scale = 2,
        BigDecimal.RoundingMode.HALF_EVEN
      )
      val n = (scaledProbability * percent_100).toIntExact
      List.fill(n)(elem)
  }
  Random.shuffle(distribution).head
}
However, I am sure there should be better ways of solving this.
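One cheap improvement, for instance, is to build the expanded distribution once and pick a random index per draw, instead of shuffling the whole list on every call. A sketch under that assumption (choiceIndexed is a hypothetical name):
import scala.util.Random

// Precompute the expanded distribution once (a Vector gives O(1) indexing);
// each subsequent draw is then a single nextInt call.
def choiceIndexed[T](data: List[(T, Double)]): () => T = {
  val distribution: Vector[T] = data.toVector.flatMap {
    case (elem, probability) =>
      val n = (BigDecimal(probability) * 100)
        .setScale(0, BigDecimal.RoundingMode.HALF_EVEN)
        .toIntExact
      Vector.fill(n)(elem)
  }
  () => distribution(Random.nextInt(distribution.size))
}

val draw = choiceIndexed(List(("Apple", 0.2), ("Banana", 0.3), ("Lemon", 0.5)))
draw() // e.g. "Lemon"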

Related

Getting the mode from an RDD

I would like to get the mode (the most common number) from an RDD using Spark + Scala.
I can get it by doing the following, but I think there could be a better way to calculate this. The most important thing is that if more than one value has the same number of repetitions, I need to return all of them.
Let's see my example code:
val l = List(3,4,4,3,3,7,7,7,9)
val rdd = spark.sparkContext.parallelize(l)
val grouped = rdd.map(e => (e, 1)).groupBy(_._1).map(e => (e._1, e._2.size))
val maxRep = grouped.collect().maxBy(_._2)._2
val mode = grouped.filter(e => e._2 == maxRep).map(e => e._1).collect
And the result is right:
Array[Int] = Array(3, 7)
But is there a better way to do this? I mean performance-wise, because the real RDD would be much bigger than this.
This should work and be a little bit more efficient, but only if you are sure the total number of distinct elements is small, since countByValue returns its result to the driver:
val counted = rdd.countByValue()
val max = counted.valuesIterator.max
val maxElements = counted.collect { case (k, v) if v == max => k }
If there could be many distinct elements, consider this memory-safe alternative.
val counted = rdd.map(x => (x, 1L)).reduceByKey(_ + _).cache()
val max = counted.values.max
val maxElements = counted.map { case (k, v) => (v, k) }.lookup(max)
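Applied to the question's sample data, one would expect:
// counted: (3,3), (4,2), (7,3), (9,1)
// max: 3
// maxElements: Seq[Int] = WrappedArray(3, 7)  (order may vary)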
How about getting the max key-value pair from a double groupBy? This keeps the counting distributed rather than collecting it on the driver, so it can cope with bigger data sizes.
rdd.groupBy(identity).mapValues(_.size).groupBy(_._2).max
// res1: (Int, Iterable[(Int, Int)]) = (3,CompactBuffer((3,3), (7,3)))
To get the element
rdd.groupBy(identity).mapValues(_.size).groupBy(_._2).max._2.map(_._1)
// res4: Iterable[Int] = List(3, 7)
The first groupBy with mapValues turns the elements into (element, count) pairs of type RDD[(Int, Int)]; the second groupBy groups those pairs by their count, giving (count, Iterable((element, count))); then max picks the key-value pair with the maximum key, which is the highest count.

Efficient way to select a subset of elements from a Seq in Scala

I have a sequence
val input = Seq(1,3,4,5,9,11...)
I want to randomly select a subset of it. What is the fastest way?
I currently implement it like this:
// ratio is the percentage of the subgroup within the whole group
def randomSelect(ratio: Double): Boolean = {
  val rr = scala.util.Random
  if (rr.nextFloat() < ratio) true else false
}

val ratio = 0.3
val result = input.map(x => (x, randomSelect(ratio))).filter(_._2).map(_._1)
So I first attach a true/false label for each element, and filter out those false elements, and get back the subset of the sequence.
Is there a faster or otherwise better way?
So there are basically two approaches to this:
select n elements at random
include or exclude each element with probability p
Your solution is the latter and can be simplified to:
l.filter(_ => r.nextFloat < p)
(I'm calling the list l, the Random instance r, and your ratio p from here on out.)
If you wanted to sample exactly n elements you could do:
r.shuffle(l).take(n)
I compared these selecting 200 elements from a 1000 element list:
scala> val first = time{
| l.map(x => (x, r.nextFloat < p)).filter(_._2).map(_._1)
| }
Elapsed time: 3249507ns
scala> val second = time {
| r.shuffle(l).take(200)
| }
Elapsed time: 10640432ns
scala> val third = time{
| l.filter(_ => r.nextFloat < p)
| }
Elapsed time: 1689009ns
Dropping your extra two maps appears to roughly halve the running time (which makes complete sense). The shuffle-and-take method is significantly slower, but does guarantee you a fixed number of elements.
I borrowed the timing function from here if you want to do a more rigorous investigation (i.e. average over many trials, rather than 1).
If your list isn't big, a simple filter as suggested by others should suffice:
list.filter(_ => Random.nextDouble < p)
In case you have a big list, the per-element calls to Random could become the bottleneck. One approach to minimizing them is to generate random gaps (0, 1, 2, ...) by which the sampling hops over elements. Below is a simple implementation in Scala:
import scala.util.Random
import scala.math._

def gapSampling(list: List[Double], p: Double): List[Double] = {
  // Geometric gap: number of elements to skip before the next kept one
  def randomGap(p: Double): Double = {
    val epsilon: Double = 1e-10
    val u = max(Random.nextDouble, epsilon)
    floor(log(u) / log(1 - p))
  }
  @scala.annotation.tailrec
  def samplingFcn(acc: List[Double], list: List[Double], p: Double): List[Double] = list match {
    case Nil => acc
    case _ =>
      val gap = randomGap(p).toInt
      val l = list.drop(gap + 1)
      val accNew = l.headOption match {
        case Some(e) => e :: acc
        case None => acc
      }
      samplingFcn(accNew, l, p)
  }
  samplingFcn(List[Double](), list, p).reverse
}
val list = (1 to 100).toList.map(_.toDouble)
gapSampling(list, 0.3)
// res1: List[Double] = List(
// 2.0, 5.0, 7.0, 14.0, 15.0, 18.0, 20.0, 25.0, 26.0, 28.0, 33.0,
// 35.0, 42.0, 43.0, 47.0, 48.0, 50.0, 55.0, 56.0, 59.0, 62.0,
// 69.0, 72.0, 75.0, 76.0, 79.0, 82.0, 93.0, 96.0, 97.0, 98.0
// )
More details about such gap sampling can be found here.
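As a quick sanity check (a sketch, not from the original answer), the fraction of elements kept should come out close to p:
val big = (1 to 100000).toList.map(_.toDouble)
val frac = gapSampling(big, 0.3).size.toDouble / big.size
// frac: Double = roughly 0.3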

Compare the similarity between two text in Scala

I want to compare two texts in Scala and calculate the similarity rate. I started coding this but I'm stuck:
import org.apache.spark._
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]): Unit = {
    val white = "/whiteCat.txt" // "The white cat is eating a white soup"
    val black = "/blackCat.txt" // "The black cat is eating a white sandwich"
    val conf = new SparkConf().setAppName("wordCount")
    val sc = new SparkContext(conf)
    val b = sc.textFile(white)
    val words = b.flatMap(line => line.split("\\W+"))
    val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
    counts.take(10).foreach(println)
    //counts.saveAsTextFile(outputFile)
  }
}
I succeeded in splitting the words of each text and counting the occurrence of each word. For example, in file1 there is:
(The,1)
(white,2)
(cat,1)
(is,1)
(eating,1)
(a,1)
(soup,1)
To calculate the similarity rate, I have to apply this algorithm, but I'm not experienced with Scala:
i = 0
foreach word in the first text
  j = 0
  IF keyFile1[i] == keyFile2[j]
  THEN MIN(valueFile1[i], valueFile2[j]) / MAX(valueFile1[i], valueFile2[j])
  ELSE j++
i++
Can you help me please?
You can use leftOuterJoin to join the two key/value-pair RDDs into an RDD[(String, (Int, Option[Int]))], gather both counts from the tuples, flatten the counts to Int, and apply your min/max formula, as in the following example:
val wordCountsWhite = sc.textFile("/path/to/whitecat.txt").
  flatMap(_.split("\\W+")).
  map((_, 1)).
  reduceByKey(_ + _)
wordCountsWhite.collect
// res1: Array[(String, Int)] = Array(
// (is,1), (eating,1), (cat,1), (white,2), (The,1), (soup,1), (a,1)
// )
val wordCountsBlack = sc.textFile("/path/to/blackcat.txt").
  flatMap(_.split("\\W+")).
  map((_, 1)).
  reduceByKey(_ + _)
wordCountsBlack.collect
// res2: Array[(String, Int)] = Array(
// (is,1), (eating,1), (cat,1), (white,1), (The,1), (a,1), (sandwich,1), (black,1)
// )
val similarityRDD = wordCountsWhite.leftOuterJoin(wordCountsBlack).map {
  case (k: String, (c1: Int, c2: Option[Int])) =>
    // Keep whichever counts are present, then apply the min/max formula
    val counts = Seq(Some(c1), c2).flatten
    (k, counts.min.toDouble / counts.max)
}
similarityRDD.collect
// res4: Array[(String, Double)] = Array(
// (is,1.0), (eating,1.0), (cat,1.0), (white,0.5), (The,1.0), (soup,1.0), (a,1.0)
// )
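If a single overall similarity score is wanted, one might, for instance, average the per-word ratios (a sketch, not part of the original answer):
val ratios = similarityRDD.values.collect()
val similarityRate = ratios.sum / ratios.length
// similarityRate: Double = 0.9285714285714286 for the sample texts above
Note that leftOuterJoin only covers words appearing in the first text; a fullOuterJoin would also account for words unique to the second text.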
That seems straightforward using a for comprehension:
for( a <- counts1; b <- counts2 if a._1==b._1 ) yield Math.min(a._2,b._2)/Math.max(a._2,b._2)
Edit:
sorry, the above code does not work.
Here's modified code with a for comprehension; counts1 and counts2 are the two counts from the question.
val result = for ((t1, t2) <- counts1.cartesian(counts2) if t1._1 == t2._1)
  yield Math.min(t1._2, t2._2).toDouble / Math.max(t1._2, t2._2).toDouble
the result:
result.foreach(println)
1.0
0.5
1.0
1.0
1.0
There are numerous algorithms for finding the similarity between strings. One of these methods is edit distance. There are different definitions of edit distance, with different sets of operations depending on the methodology, but the main idea is finding the minimum series of operations (insertion, deletion, substitution) needed to convert string a into string b.
Levenshtein distance and Longest Common Subsequence are widely known algorithms for finding similarity between strings. But these methods are insensitive to context, so you may want to take a look at this article, in which n-gram similarity and distance are presented. You can also find Scala implementations of these algorithms on GitHub or Rosetta Code.
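For illustration, here is a minimal Levenshtein distance sketch in Scala (a standard dynamic-programming formulation, not taken from any of the linked implementations):
// dp(i)(j) = number of edits needed to turn a.take(i) into b.take(j)
def levenshtein(a: String, b: String): Int = {
  val dp = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    dp(i)(j) = math.min(
      math.min(dp(i - 1)(j) + 1, dp(i)(j - 1) + 1), // deletion / insertion
      dp(i - 1)(j - 1) + cost                       // substitution (or match)
    )
  }
  dp(a.length)(b.length)
}

levenshtein("kitten", "sitting")
// res: Int = 3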
I hope it helps!

Refactoring a small Scala function

I have this function to compute the distance between two n-dimensional points using Pythagoras' theorem.
def computeDistance(neighbour: Point) = math.sqrt(
  coordinates.zip(neighbour.coordinates).map {
    case (c1: Int, c2: Int) => math.pow(c1 - c2, 2)
  }.sum
)
The Point class (simplified) looks like:
class Point(val coordinates: List[Int])
I'm struggling to refactor the method so it's a little easier to read, can anybody help please?
Here's another way that makes the following three assumptions:
The length of the list is the number of dimensions for the point
Each List is correctly ordered, i.e. List(x, y) or List(x, y, z). We do not know how to handle List(x, z, y)
All lists are of equal length
def computeDistance(other: Point): Double = sqrt(
  coordinates.zip(other.coordinates)
    .flatMap(i => List(pow(i._2 - i._1, 2)))
    .fold(0.0)(_ + _)
)
The obvious disadvantage here is that we don't have any safety around list length. The quick fix for this is to simply have the function return an Option[Double] like so:
def computeDistance(other: Point): Option[Double] = {
  if (other.coordinates.length != coordinates.length) {
    return None
  }
  Some(sqrt(coordinates.zip(other.coordinates)
    .flatMap(i => List(pow(i._2 - i._1, 2)))
    .fold(0.0)(_ + _)))
}
I'd be curious if there is a type safe way to ensure equal list length.
EDIT
It was politely pointed out to me that flatMap(x => List(foo(x))) is equivalent to map(foo), which I forgot to refactor when I was originally playing with this. A slightly cleaner version with map instead of flatMap:
def computeDistance(other: Point): Double = sqrt(
  coordinates.zip(other.coordinates)
    .map(i => pow(i._2 - i._1, 2))
    .fold(0.0)(_ + _)
)
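For instance, assuming the simplified Point class from the question:
val a = new Point(List(0, 0, 0))
val b = new Point(List(1, 2, 2))
a.computeDistance(b)
// res: Double = 3.0, since sqrt(1 + 4 + 4) = 3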
Most of your problem is that you're trying to do math with really long variable names. It's almost always painful. There's a reason why mathematicians use single letters and assign temporary variables.
Try this:
import math._

class Point(val coordinates: List[Int]) {
  def c = coordinates
  def d(p: Point) = {
    val delta = for ((a, b) <- c zip p.c) yield pow(a - b, 2)
    sqrt(delta.sum)
  }
}
Consider type aliases and case classes, like this,
type Coord = List[Int]
case class Point(c: Coord) {
  def distTo(p: Point) = {
    val z = (c zip p.c).par
    // Sum squared deltas in parallel, combining partial sums with +
    val pw = z.aggregate(0.0)((a, v) => a + math.pow(v._1 - v._2, 2), _ + _)
    math.sqrt(pw)
  }
}
so that for any two points, for instance,
val p = Point( (1 to 5).toList )
val q = Point( (2 to 6).toList )
we have that
p distTo q
res: Double = 2.23606797749979
Note that method distTo uses aggregate on a parallelised collection of tuples and combines the partial results with the last argument (summation). For high-dimensional points this may prove more efficient than the sequential counterpart.
For simplicity of use, consider also implicit classes, as suggested in a comment above,
implicit class RichPoint(val c: Coord) extends AnyVal {
  def distTo(d: Coord) = Point(c) distTo Point(d)
}
Hence
List(1,2,3,4,5) distTo List(2,3,4,5,6)
res: Double = 2.23606797749979

Scala: groupBy (identity) of List Elements

I'm developing an application that builds pairs of words from (tokenised) text and produces the number of times each pair occurs (even when same-word pairs occur multiple times, that's OK, as it'll be evened out later in the algorithm).
When I use
elements groupBy()
I want to group by the elements' content itself, so I wrote the following:
def self(x: (String, String)) = x
/**
 * Maps a collection of words to a map where the key is a pair of words and
 * the value is the number of times this pair occurs in the passed array.
 */
def producePairs(words: Array[String]): Map[(String, String), Double] = {
  var table = List[(String, String)]()
  words.foreach(w1 =>
    words.foreach(w2 =>
      table = table ::: List((w1, w2))))
  val grouppedPairs = table.groupBy(self)
  val size = int2double(grouppedPairs.size)
  return grouppedPairs.mapValues(_.length / size)
}
Now, I fully realise that this self() trick is a dirty hack. So I thought a little and came up with:
grouppedPairs = table groupBy (x => x)
This way it produced what I want. However, I still feel that I'm clearly missing something and there should be an easier way of doing it. Any ideas at all, dear all?
Also, if you could help me improve the pairs extraction part, that would help a lot – it looks very imperative, C++-ish right now. Many thanks in advance!
I'd suggest this:
def producePairs(words: Array[String]): Map[(String, String), Double] = {
  val table = for (w1 <- words; w2 <- words) yield (w1, w2)
  val grouppedPairs = table.groupBy(identity)
  val size = grouppedPairs.size.toDouble
  grouppedPairs.mapValues(_.length / size)
}
The for comprehension is much easier to read, and there is already a predefined function identity, which is a generalized version of your self.
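For reference, identity is already defined in scala.Predef as simply:
def identity[A](x: A): A = x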
You are creating a list of pairs of all words against all words by iterating over words twice, where I guess you just want the neighbouring pairs. The easiest way is to use a sliding view instead.
def producePairs(words: Array[String]): Map[(String, String), Int] = {
  val pairs = words.sliding(2, 1).map(arr => arr(0) -> arr(1)).toList
  val grouped = pairs.groupBy(t => t)
  grouped.mapValues(_.size)
}
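For example, a quick check on a tiny input:
producePairs(Array("a", "b", "a", "b"))
// res: Map[(String, String), Int] = Map((a,b) -> 2, (b,a) -> 1)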
Another approach would be to fold the list of pairs, summing them up. I'm not sure, though, that this is more efficient:
def producePairs(words: Array[String]): Map[(String, String), Int] = {
  val pairs = words.sliding(2, 1).map(arr => arr(0) -> arr(1))
  pairs.foldLeft(Map.empty[(String, String), Int]) { (m, p) =>
    m + (p -> (m.getOrElse(p, 0) + 1))
  }
}
I see you return a relative number (Double). For simplicity I have just counted the occurrences, so you need to do the final division. I think you want to divide by the total number of pairs (words.size - 1) and not by the number of unique pairs (grouped.size), so that the relative frequencies sum up to 1.0.
An alternative approach which is not of order O(num_words * num_words) but of order O(num_unique_words * num_unique_words) (or something like that):
def producePairs[T <% Traversable[String]](words: T): Map[(String, String), Double] = {
  val counts = words.groupBy(identity).map { case (w, ws) => w -> ws.size }
  val size = (counts.size * counts.size).toDouble
  for (w1 <- counts; w2 <- counts) yield {
    ((w1._1, w2._1) -> ((w1._2 * w2._2) / size))
  }
}
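For instance, on a tiny input (a quick check of the normalisation):
producePairs(List("a", "b", "a"))
// counts: Map(a -> 2, b -> 1); size = 4.0
// res: Map((a,a) -> 1.0, (a,b) -> 0.5, (b,a) -> 0.5, (b,b) -> 0.25)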