What is the Scala equivalent of Python's NumPy np.random.choice? (Random weighted selection in Scala)

I am looking for Scala code equivalent to, or the theory underlying, Python's np.random.choice (NumPy imported as np). I have an implementation that uses np.random.choice to select random moves from a probability distribution.
Python's code
Input list: ['pooh', 'rabbit', 'piglet', 'Christopher'] and probabilities: [0.5, 0.1, 0.1, 0.3]
I want to select one of the values from the input list, given the probability associated with each input element.

The Scala standard library has no equivalent to np.random.choice but it shouldn't be too difficult to build your own, depending on which options/features you want to emulate.
Here, for example, is a way to get an infinite Stream of submitted items, with the probability of any one item weighted relative to the others.
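def weightedSelect[T](input: (T, Int)*): Stream[T] = {
  // Repeat each item as many times as its weight says
  val items: Seq[T] = input.flatMap { x => Seq.fill(x._2)(x._1) }
  // Shuffle the expanded pool, emit it, then reshuffle and repeat forever
  def output: Stream[T] = util.Random.shuffle(items).toStream #::: output
  output
}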
With this, each input item is given a multiplier. So to get an infinite pseudorandom selection of the characters c and v, with c coming up 3/5ths of the time and v coming up 2/5ths of the time:
val cvs = weightedSelect(('c',3),('v',2))
Thus the rough equivalent of the np.random.choice(aa_milne_arr,5,p=[0.5,0.1,0.1,0.3]) example would be:
weightedSelect("pooh" -> 5,
               "rabbit" -> 1,
               "piglet" -> 1,
               "Christopher" -> 3).take(5).toArray
Or perhaps you want independent draws (less artificially regular than reshuffling a fixed pool), possibly from a heavily lopsided distribution.
def weightedSelect[T](items: Seq[T], distribution: Seq[Double]): Stream[T] = {
  assert(items.length == distribution.length)
  assert(math.abs(1.0 - distribution.sum) < 0.001) // must be at least close
  val dsums: Seq[Double] = distribution.scanLeft(0.0)(_ + _).tail
  val distro: Seq[Double] = dsums.init :+ 1.1 // close a possible gap
  Stream.continually(items(distro.indexWhere(_ > util.Random.nextDouble())))
}
The result is still an infinite Stream of the specified elements but the passed-in arguments are a bit different.
val choices: Stream[String] = weightedSelect(List("this", "that"),
                                             Array(4998/5000.0, 2/5000.0))
// let's test the distribution
val (choiceA, choiceB) = choices.take(10000).partition(_ == "this")
choiceA.length //res0: Int = 9995
choiceB.length //res1: Int = 5 (not bad)
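If only a single draw is needed rather than an infinite Stream, the same cumulative-sum idea can be sketched as a plain function. This is a minimal sketch, not a full np.random.choice replacement; the name weightedChoice and the optional rng parameter are assumptions made for illustration.

```scala
import scala.util.Random

// Hypothetical helper: pick one item according to the given weights.
// The weights are rescaled by their total, so they need not sum exactly to 1.
def weightedChoice[T](items: IndexedSeq[T], probs: IndexedSeq[Double], rng: Random = new Random): T = {
  require(items.nonEmpty && items.length == probs.length, "need one weight per item")
  // Running totals, e.g. [0.5, 0.1, 0.1, 0.3] becomes roughly [0.5, 0.6, 0.7, 1.0]
  val cumulative = probs.scanLeft(0.0)(_ + _).tail
  // A uniform draw in [0, total)
  val r = rng.nextDouble() * cumulative.last
  // The first running total above the draw identifies the chosen item
  val idx = cumulative.indexWhere(r < _)
  items(if (idx >= 0) idx else items.length - 1) // guard against rounding at the top end
}

val names = Vector("pooh", "rabbit", "piglet", "Christopher")
val picks = Vector.fill(5)(weightedChoice(names, Vector(0.5, 0.1, 0.1, 0.3)))
```

Passing a seeded Random makes the draws reproducible, which is handy when testing.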

Related

chisel3 arithmetic operations on Doubles

I am having problems performing arithmetic operations on Doubles in Chisel. The examples I have seen use only the following types: Int, UInt, SInt.
I saw here that arithmetic operations were described only for SInt and UInt. What about Double?
I tried to declare my output out as a Double, but I didn't know how; the output of my code is a Double.
Is there a way to declare an input and an output of type Double in a Bundle?
Here is my code:
class hashfunc(val k: Int, val n: Int) extends Module {
  val a = k + k
  val io = IO(new Bundle {
    val b = Input(UInt(k.W))
    val w = Input(UInt(k.W))
    var out = Output(UInt(a.W))
  })
  val tabHash1 = new Array[Array[Double]](n)
  val x = new ArrayBuffer[(Double, Data)]
  val tabHash = new Array[Double](tabHash1.size)
  for (ind <- tabHash1.indices) {
    var sum = 0.0
    for (ind2 <- 0 until x.size) {
      sum += (x(ind2) * tabHash1(ind)(ind2))
    }
    tabHash(ind) = ((sum + io.b) / io.w)
  }
  io.out := tabHash.reduce(_ + _)
}
When I compile the code, I get the following error:
code error
Thank you for your kind attention, looking forward to your responses.
Chisel does have a native FixedPoint type, which may be of use. It is in the experimental package:
import chisel3.experimental.FixedPoint
There is also a project, DspTools, that has simulation support for Doubles. It has some nice features, e.g. it allows modules to be parameterized on the numeric type (Complex, Double, FixedPoint, SInt), so that you can run simulations on Doubles to validate the desired mathematical behavior and then switch to a synthesizable number format that meets your precision criteria.
DspTools is an ongoing research project, and the team would appreciate feedback from outside users.
Operations on floating point numbers (Double in this case) are not supported directly by any HDL. The reason is that, while addition/subtraction/multiplication of fixed point numbers is well defined, floating point hardware is much more complex and comes with a lot of design space trade-offs.
That is to say, a high performance floating point unit is a significant piece of hardware in its own right and would be time shared in any realistic design.
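To illustrate why fixed point arithmetic is "well defined" in the sense above, here is a small software model in plain Scala (not Chisel). It only shows that fixed point addition and multiplication reduce to integer operations plus a shift; the Fixed class and its method names are hypothetical and are not part of Chisel or DspTools.

```scala
// Hypothetical software model of a signed fixed-point value with `frac` fractional bits.
// The real number represented is raw / 2^frac.
final case class Fixed(raw: Long, frac: Int) {
  def toDouble: Double = raw.toDouble / (1L << frac)

  // Addition is a plain integer add when both operands share the same binary point.
  def +(that: Fixed): Fixed = {
    require(frac == that.frac, "binary points must match")
    Fixed(raw + that.raw, frac)
  }

  // Multiplication is an integer multiply followed by a shift to restore the binary point.
  def *(that: Fixed): Fixed = {
    require(frac == that.frac, "binary points must match")
    Fixed((raw * that.raw) >> frac, frac)
  }
}

object Fixed {
  // Quantize a Double to the nearest representable fixed-point value.
  def fromDouble(d: Double, frac: Int): Fixed = Fixed(math.round(d * (1L << frac)), frac)
}

val fa = Fixed.fromDouble(1.5, 8)
val fb = Fixed.fromDouble(2.25, 8)
println((fa + fb).toDouble) // 3.75
println((fa * fb).toDouble) // 3.375
```

A floating point unit, by contrast, also has to handle exponent alignment, normalization and rounding modes, which is where the design trade-offs mentioned above come from.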

How to randomly sample p percent of users in user event stream

I am looking for an algorithm that fairly samples p percent of users from an infinite list of users.
A naive algorithm looks something like this:
// This is naive... what is a better way?
def userIdToRandomNumber(userId: Int): Float = (userId.toString.hashCode % 1000) / 1000.0f
// An event listener will call this every time a new event is received
def sampleEventByUserId(event: Event) = {
  // Process all events for 3 percent of users
  if (userIdToRandomNumber(event.user.userId) <= 0.03) {
    processEvent(event)
  }
}
There are issues with this code, though (hashCode may favor shorter strings, the modulo arithmetic discretizes the values so the sampled fraction is not exactly p, etc.).
What is the "more correct" way of finding a deterministic mapping of userIds to a random number for the function userIdToRandomNumber above?
Try the method(s) below instead of hashCode. Even for short strings, the integer values of the characters ensure that the sum goes over 100. They also avoid the division, so you avoid rounding errors.
def inScope(s: String, p: Double) = modN(s, 100) < p * 100
def modN(s: String, n: Int): Int = {
  var sum = 0
  for (c <- s) { sum += c }
  sum % n
}
Here is a very simple mapping, assuming your dataset is large enough:
For every user, generate a random number x, say in [0, 1].
If x <= p, pick that user
This is a practically used method on large datasets, and gives you entirely random results!
I am hoping you can easily code this in Scala.
EDIT: In the comments, you mention deterministic. I am interpreting that to mean if you sample again, it gives you the same results. For that, simply store x for each user.
Also, this will work for any number of users (even infinite). You just need to generate x for each user once. The mapping is simply userId -> x.
EDIT2: The algorithm in your question is biased. Suppose p = 10%, and there are 1100 users (userIds 1-1100). The first 1000 userIds have a 10% chance of getting picked, the next 100 have a 100% chance. Also, the hashing will map user ids to new values, but there is still no guarantee that modulo 1000 would give you a uniform sample!
I have come up with a deterministic solution to randomly sample users from a stream that is completely random (assuming the random number generator is completely random):
import scala.util.Random

// Seeding with the hash code makes the draw repeatable for a given user
def sample(x: Any, percent: Double): Boolean = {
  new Random(seed = x.hashCode).nextFloat() <= percent
}
// Sample 3 percent of users
if (sample(event.user.userId, 0.03)) {
  processEvent(event)
}
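Another way to get a deterministic userId-to-number mapping is to run the ID through a well-mixing hash and scale the result into [0, 1). The sketch below uses MurmurHash3 from the Scala standard library; the function names userIdToUnit and inSample and the salt parameter are assumptions made for illustration, and the IDs are taken as strings.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical helper: map a user ID to a reproducible value in [0, 1).
// `salt` seeds the hash, so a different salt selects a different sample of users.
def userIdToUnit(userId: String, salt: Int = 0): Double = {
  val h = MurmurHash3.stringHash(userId, salt)
  // Treat the 32-bit hash as unsigned and divide by 2^32
  (h.toLong & 0xffffffffL).toDouble / (1L << 32)
}

// A user is in the sample when its value falls below the fraction p.
def inSample(userId: String, p: Double, salt: Int = 0): Boolean =
  userIdToUnit(userId, salt) < p

// Example: roughly 3 percent of users pass this check, and each user always gets the same answer
val sampledUsers = (1 to 100000).map(_.toString).filter(inSample(_, 0.03))
```

Because nothing is stored per user, this works on an unbounded stream; how close the sampled fraction is to p depends on how evenly the hash spreads the actual IDs.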

What is the most efficient way to iterate a collection in parallel in Scala (TOP N pattern)

I am new to Scala and would like to build a real-time application to match some people. For a given Person, I would like to get the TOP 50 people with the highest matching score.
The idiom is as follows :
val persons = new mutable.HashSet[Person]() // Collection of people
/* Feed omitted */
val personsPar = persons.par // Make it parallel
val person = ... // The given person
val res = personsPar
  .filter(...) // Some filters
  .map { p => (p, computeMatchingScoreAsFloat(person, p)) }
  .toList
  .sortBy(-_._2)
  .take(50)
  .map(t => t._1 + "=" + t._2).mkString("\n")
In the sample code above a HashSet is used, but it could be any type of collection, as I am pretty sure it is not optimal.
The problem is that persons contains over 5M elements, and the computeMatchingScoreAsFloat method computes a kind of correlation value between 2 vectors of 200 floats. This computation takes ~2s on my computer with 6 cores.
My question is: what is the fastest way of doing this TOP N pattern in Scala?
Subquestions :
- What implementation of collection (or something else?) should I use ?
- Should I use Futures ?
NOTE: It has to be computed in parallel, the pure computation of computeMatchingScoreAsFloat alone (with no ranking/TOP N) takes more than a second, and < 200 ms if multi-threaded on my computer
EDIT: Thanks to Guillaume, compute time has been decreased from 2s to 700ms
def top[B](n: Int, t: Traversable[B])(implicit ord: Ordering[B]): collection.mutable.PriorityQueue[B] = {
  // Reverse the ordering so the head of the queue is the smallest element kept so far
  val starter = collection.mutable.PriorityQueue[B]()(ord.reverse)
  t.foldLeft(starter)(
    (myQueue, a) => {
      if (myQueue.length < n) { myQueue.enqueue(a); myQueue } // queue not yet full
      else if (ord.compare(a, myQueue.head) < 0) myQueue      // a is below the current minimum: skip it
      else {
        // a displaces the current minimum
        myQueue.dequeue()
        myQueue.enqueue(a)
        myQueue
      }
    }
  )
}
Thanks
I would propose a few changes:
1- I believe that the filter and map steps require traversing the collection twice. A lazy collection would reduce that to one traversal. Either use a lazy collection (like Stream) or convert yours to one, for instance for a list:
myList.view
2- The sort step requires sorting all elements. Instead, you can do a foldLeft with an accumulator storing the top N records. For an example implementation, see the question "Simplest way to get the top n elements of a Scala Iterable". I would probably test a PriorityQueue instead of a list if you want maximum performance (this is really in its wheelhouse). For instance, something like this:
def IntStream(n: Int): Stream[(Int, Int)] =
  if (n == 0) Stream.empty else (util.Random.nextInt(), util.Random.nextInt()) #:: IntStream(n - 1)
def top[B](n: Int, t: Traversable[B])(implicit ord: Ordering[B]): collection.mutable.PriorityQueue[B] = {
  // Reverse the ordering so the head of the queue is the smallest element kept so far
  val starter = collection.mutable.PriorityQueue[B]()(ord.reverse)
  t.foldLeft(starter)(
    (myQueue, a) => {
      if (myQueue.length < n) { myQueue.enqueue(a); myQueue } // queue not yet full
      else if (ord.compare(a, myQueue.head) < 0) myQueue      // a is below the current minimum: skip it
      else {
        // a displaces the current minimum
        myQueue.dequeue()
        myQueue.enqueue(a)
        myQueue
      }
    }
  )
}
def diff(t2: (Int, Int)) = t2._2
top(10, IntStream(10000))(Ordering.by(diff)) // select top 10
I really think that your problem requires a SINGLE traversal of the collection; with that you should be able to get your run time down to below 1 sec.
Good luck!
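Since the question also asks about Futures, here is one way the single-pass idea could be combined with them: split the data into chunks, keep a bounded heap per chunk in its own Future, then merge the small partial results. This is only a sketch under assumptions: the names topN and parallelTopN, the chunks parameter, and the use of the global ExecutionContext are all illustrative choices, and no timing is claimed for the 5M-element case.

```scala
import scala.collection.mutable
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Keep the n highest-scoring items of one chunk in a single pass (hypothetical helper).
def topN[A](items: Iterator[A], n: Int)(score: A => Float): Vector[(A, Float)] = {
  require(n > 0, "n must be positive")
  // Min-heap on score: the head is the weakest candidate kept so far
  val heap = mutable.PriorityQueue.empty[(A, Float)](Ordering.by[(A, Float), Float](_._2).reverse)
  items.foreach { a =>
    val s = score(a)
    if (heap.size < n) heap.enqueue((a, s))
    else if (s > heap.head._2) { heap.dequeue(); heap.enqueue((a, s)) }
  }
  heap.toVector.sortBy(-_._2) // highest score first
}

// Score the chunks concurrently, then merge the partial top-n lists (at most chunks * n entries).
def parallelTopN[A](items: IndexedSeq[A], n: Int, chunks: Int)(score: A => Float): Vector[(A, Float)] = {
  val chunkSize = math.max(1, (items.length + chunks - 1) / chunks)
  val partials = items.grouped(chunkSize).toVector.map { chunk =>
    Future(topN(chunk.iterator, n)(score))
  }
  val merged = Await.result(Future.sequence(partials), Duration.Inf).flatten
  merged.sortBy(-_._2).take(n)
}

// Usage with a stand-in scoring function
val sampleData = (1 to 1000).toVector
val best = parallelTopN(sampleData, 5, 4)(x => ((x * 37) % 1000).toFloat)
```

The final sort only touches the merged partial results, so its cost does not depend on the size of the full collection.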

Different result returned using Scala Collection par in a series of runs

I have tasks that I want to execute concurrently and each task takes substantial amount of memory so I have to execute them in batches of 2 to conserve memory.
def runme(n: Int = 120) = (1 to n).grouped(2).toList.flatMap { tuple =>
  tuple.par.map { x =>
    println(s"Running $x")
    val s = (1 to 100000).toList // intentionally make the JVM allocate a sizeable chunk of memory
    s.sum.toLong
  }
}
val result = runme()
println(result.size + " => " + result.sum)
The result I expected was 120 => 84609924480, but the output was rather random: the size of the returned collection differed from execution to execution. Most of the time some results were missing, even though the console showed that all the tasks had executed. I thought flatMap waits for the parallel executions in map to complete before returning the complete result. What should I do to always get the right result using par? Thanks
Just for the record: changing the underlying collection in this case shouldn't change the output of your program. The problem is related to this known bug. It's fixed from 2.11.6, so if you use that (or higher) Scala version, you should not see the strange behavior.
And about the overflow, I still think that your expected value is wrong. You can check that the sum is overflowing because the list is of integers (which are 32 bit) while the total sum exceeds the integer limits. You can check it with the following snippet:
val n = 100000
val s = (1 to n).toList // your original code
val yourValue = s.sum.toLong // your original code: the Int sum overflows before the conversion
val correctValue = 1L * n * (n + 1) / 2 // use math formula
var bruteForceValue = 0L // in case you don't trust math :) It's Long because of 0L
for (i <- 1 to n) bruteForceValue += i // iterate through range
println(s"yourValue = $yourValue")
println(s"correctvalue = $correctValue")
println(s"bruteForceValue = $bruteForceValue")
which produces the output
yourValue = 705082704
correctvalue = 5000050000
bruteForceValue = 5000050000
Cheers!
Thanks @kaktusito.
It worked after I changed the grouped list to a Vector or Seq, i.e. from (1 to n).grouped(2).toList.flatMap{... to (1 to n).grouped(2).toVector.flatMap{...
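For comparison, the same "batches of 2" pattern can be sketched with Futures from the standard library instead of par, summing as Long so that the overflow discussed above does not occur. The name runBatched is an assumption for illustration; this is one possible arrangement, not the fix referred to in the answer.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Hypothetical variant of runme: each batch of 2 tasks runs concurrently,
// and the next batch starts only after the current one has finished.
def runBatched(n: Int = 120): List[Long] =
  (1 to n).grouped(2).toList.flatMap { batch =>
    val futures = batch.map { _ =>
      Future {
        // Convert to Long before summing so the total cannot overflow an Int
        (1 to 100000).map(_.toLong).sum
      }
    }
    // Block until both tasks of this batch are done
    Await.result(Future.sequence(futures), Duration.Inf)
  }

val batched = runBatched()
println(batched.size + " => " + batched.sum) // 120 => 600006000000
```

With Long arithmetic each task yields 5000050000, so the total is 120 times that value.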

Calculate sums of even/odd pairs on Hadoop?

I want to create a parallel scanLeft (which computes prefix sums for an associative operator) function for Hadoop (Scalding in particular; see below for how this is done).
Given a sequence of numbers in a hdfs file (one per line) I want to calculate a new sequence with the sums of consecutive even/odd pairs. For example:
input sequence:
0,1,2,3,4,5,6,7,8,9,10
output sequence:
0+1, 2+3, 4+5, 6+7, 8+9, 10
i.e.
1,5,9,13,17,10
I think in order to do this, I need to write an InputFormat and InputSplits classes for Hadoop, but I don't know how to do this.
See this section 3.3 here. Below is an example algorithm in Scala:
// For simplicity, assume the input length is a power of 2
def scanadd(input: IndexedSeq[Int]): IndexedSeq[Int] =
  if (input.length == 1)
    input
  else {
    // Calculate a new collapsed sequence holding the sums of consecutive even/odd pairs
    val collapsed = IndexedSeq.tabulate(input.length / 2)(i => input(2 * i) + input(2 * i + 1))
    // Recursively scan the collapsed values
    val scancollapse = scanadd(collapsed)
    // Now use the scan of the collapsed sequence to calculate the full (inclusive) scan
    val output = IndexedSeq.tabulate(input.length) { i =>
      if (i % 2 == 1)
        // An odd index closes a pair, so its prefix sum is already in the collapsed scan
        scancollapse((i - 1) / 2)
      else if (i == 0)
        input(0)
      else
        // An even index opens a pair: take the sum of all earlier pairs and add the current value
        scancollapse(i / 2 - 1) + input(i)
    }
    output
  }
I understand that this might need a fair bit of optimization to work nicely with Hadoop. Translating it directly would, I think, lead to pretty inefficient Hadoop code. For example, you obviously can't use an IndexedSeq in Hadoop. I would appreciate hearing about any specific problems you see. I think it can probably be made to work well, though.
Superfluous. You meant this code?
val vv = (0 to 1000000).grouped(2).toVector
vv.par.foldLeft((0L, 0L, false))((a, v) =>
  if (a._3) (a._1, a._2 + v.sum, !a._3) else (a._1 + v.sum, a._2, !a._3))
This was the best tutorial I found for writing an InputFormat and RecordReader. I ended up reading the whole split as one ArrayWritable record.
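As a local illustration of the pair-sum step from the question, the computation can be phrased in map/shuffle/reduce terms with ordinary Scala collections: key each value by its pair number (index / 2) and sum per key. This sketch assumes every value's position in the sequence is known, which is precisely what a custom InputFormat or RecordReader would have to supply on Hadoop; the name pairSums is hypothetical.

```scala
// Hypothetical local model of the job: not Hadoop or Scalding code.
def pairSums(xs: Seq[Long]): Seq[Long] =
  xs.zipWithIndex
    .map { case (x, i) => (i / 2, x) }             // "map": emit (pair number, value)
    .groupBy(_._1)                                 // "shuffle": gather the values of each pair
    .toSeq
    .sortBy(_._1)                                  // restore the original pair order
    .map { case (_, kvs) => kvs.map(_._2).sum }    // "reduce": sum each pair

println(pairSums((0L to 10L).toVector)) // the sums 1, 5, 9, 13, 17, 10 from the question
```

The last element of an odd-length input forms a group of its own, which is why 10 passes through unchanged.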