How to randomly sample p percent of users in user event stream - scala

I am looking for an algorithm that fairly samples p percent of users from an infinite list of users.
A naive algorithm looks something like this:
//This is naive.. what is a better way??
def userIdToRandomNumber(userId: Int): Float = userId.toString.hashCode % 1000)/1000.0
//An event listener will call this every time a new event is received
def sampleEventByUserId(event: Event) = {
//Process all events for 3% percent of users
if (userIdToRandomNumber(event.user.userId) <= 0.03) {
processEvent(event)
}
}
There are issues with this code though (hashCode may favor shorter strings, modulo arithmetic is discretizing value so its not exactly p, etc.).
Was is the "more correct" way of finding a deterministic mapping of userIds to a random number for the function userIdToRandomNumber above?

Try the method(s) below instead of the hashCode. Even for short strings, the values of the characters as integers ensure that the sum goes over 100. Also, avoid the division, so you avoid rounding errors
def inScope(s: String, p: Double) = modN(s, 100) < p * 100
def modN(s: String, n: Int): Int = {
var sum = 0
for (c <- s) { sum += c }
sum % n
}

Here is a very simple mapping, assuming your dataset is large enough:
For every user, generate a random number x, say in [0, 1].
If x <= p, pick that user
This is a practically used method on large datasets, and gives you entirely random results!
I am hoping you can easily code this in Scala.
EDIT: In the comments, you mention deterministic. I am interpreting that to mean if you sample again, it gives you the same results. For that, simply store x for each user.
Also, this will work for any number of users (even infinite). You just need to generate x for each user once. The mapping is simply userId -> x.
EDIT2: The algorithm in your question is biased. Suppose p = 10%, and there are 1100 users (userIds 1-1100). The first 1000 userIds have a 10% chance of getting picked, the next 100 have a 100% chance. Also, the hashing will map user ids to new values, but there is still no guarentee that modulo 1000 would give you a uniform sample!

I have come up with a deterministic solution to randomly sample users from a stream that is completely random (assuming the random number generator is completely random):
def sample(x: AnyRef, percent: Double): Boolean = {
new Random(seed=x.hashCode).nextFloat() <= percent
}
//sample 3 percent of users
if (sample(event.user.userId, 0.03)) {
processEvent(event)
}

Related

Looking for advice on improving a custom function in AnyLogic

I'm estimating last mile delivery costs in an large urban network using by-route distances. I have over 8000 customer agents and over 100 retail store agents plotted in a GIS map using lat/long coordinates. Each customer receives deliveries from its nearest store (by route). The goal is to get two distance measures in this network for each store:
d0_bar: the average distance from a store to all of its assigned customers
d1_bar: the average distance between all customers common to a single store
I've written a startup function with a simple foreach loop to assign each customer to a store based on by-route distance (customers have a parameter, "customer.pStore" of Store type). This function also adds, in turn, each customer to the store agent's collection of customers ("store.colCusts"; it's an array list with Customer type elements).
Next, I have a function that iterates through the store agent population and calculates the two average distance measures above (d0_bar & d1_bar) and writes the results to a txt file (see code below). The code works, fortunately. However, the problem is that with such a massive dataset, the process of iterating through all customers/stores and retrieving distances via the openstreetmap.org API takes forever. It's been initializing ("Please wait...") for about 12 hours. What can I do to make this code more efficient? Or, is there a better way in AnyLogic of getting these two distance measures for each store in my network?
Thanks in advance.
//for each store, record all customers assigned to it
for (Store store : stores)
{
distancesStore.print(store.storeCode + "," + store.colCusts.size() + "," + store.colCusts.size()*(store.colCusts.size()-1)/2 + ",");
//calculates average distance from store j to customer nodes that belong to store j
double sumFirstDistByStore = 0.0;
int h = 0;
while (h < store.colCusts.size())
{
sumFirstDistByStore += store.distanceByRoute(store.colCusts.get(h));
h++;
}
distancesStore.print((sumFirstDistByStore/store.colCusts.size())/1609.34 + ",");
//calculates average of distances between all customer nodes belonging to store j
double custDistSumPerStore = 0.0;
int loopLimit = store.colCusts.size();
int i = 0;
while (i < loopLimit - 1)
{
int j = 1;
while (j < loopLimit)
{
custDistSumPerStore += store.colCusts.get(i).distanceByRoute(store.colCusts.get(j));
j++;
}
i++;
}
distancesStore.print((custDistSumPerStore/(loopLimit*(loopLimit-1)/2))/1609.34);
distancesStore.println();
}
Firstly a few simple comments:
Have you tried timing a single distanceByRoute call? E.g. can you try running store.distanceByRoute(store.colCusts.get(0)); just to see how long a single call takes on your system. Routing is generally pretty slow, but it would be good to know what the speed limit is.
The first simple change is to use java parallelism. Instead of using this:
for (Store store : stores)
{ ...
use this:
stores.parallelStream().forEach(store -> {
...
});
this will process stores entries in parallel using standard Java streams API.
It also looks like the second loop - where avg distance between customers is calculated doesn't take account of mirroring. That is to say distance a->b is equal to b->a. Hence, for example, 4 customers will require 6 calculations: 1->2, 1->3, 1->4, 2->3, 2->4, 3->4. Whereas in case of 4 customers your second while loop will perform 9 calculations: i=0, j in {1,2,3}; i=1, j in {1,2,3}; i=2, j in {1,2,3}, which seems wrong unless I am misunderstanding your intention.
Generally, for long running operations it is a good idea to include some traceln to show progress with associated timing.
Please have a look at above and post results. With more information additional performance improvements may be possible.

What is the scala equivalent of Python's Numpy np.random.choice?(Random weighted selection in scala)

I was looking for Scala's equivalent code or underlying theory for pythons np.random.choice (Numpy as np). I have a similar implementation that uses Python's np.random.choice method to select the random moves from the probability distribution.
Python's code
Input list: ['pooh', 'rabbit', 'piglet', 'Christopher'] and probabilies: [0.5, 0.1, 0.1, 0.3]
I want to select one of the value from the input list given the associated probability of each input element.
The Scala standard library has no equivalent to np.random.choice but it shouldn't be too difficult to build your own, depending on which options/features you want to emulate.
Here, for example, is a way to get an infinite Stream of submitted items, with the probability of any one item weighted relative to the others.
def weightedSelect[T](input :(T,Int)*): Stream[T] = {
val items :Seq[T] = input.flatMap{x => Seq.fill(x._2)(x._1)}
def output :Stream[T] = util.Random.shuffle(items).toStream #::: output
output
}
With this each input item is given with a multiplier. So to get an infinite pseudorandom selection of the characters c and v, with c coming up 3/5ths of the time and v coming up 2/5ths of the time:
val cvs = weightedSelect(('c',3),('v',2))
Thus the rough equivalent of the np.random.choice(aa_milne_arr,5,p=[0.5,0.1,0.1,0.3]) example would be:
weightedSelect("pooh"-> 5
,"rabbit" -> 1
,"piglet" -> 1
,"Christopher" -> 3).take(5).toArray
Or perhaps you want a better (less pseudo) random distribution that might be heavily lopsided.
def weightedSelect[T](items :Seq[T], distribution :Seq[Double]) :Stream[T] = {
assert(items.length == distribution.length)
assert(math.abs(1.0 - distribution.sum) < 0.001) // must be at least close
val dsums :Seq[Double] = distribution.scanLeft(0.0)(_+_).tail
val distro :Seq[Double] = dsums.init :+ 1.1 // close a possible gap
Stream.continually(items(distro.indexWhere(_ > util.Random.nextDouble())))
}
The result is still an infinite Stream of the specified elements but the passed-in arguments are a bit different.
val choices :Stream[String] = weightedSelect( List("this" , "that")
, Array(4998/5000.0, 2/5000.0))
// let's test the distribution
val (choiceA, choiceB) = choices.take(10000).partition(_ == "this")
choiceA.length //res0: Int = 9995
choiceB.length //res1: Int = 5 (not bad)

SCALA: Function for Square root of BigInt

I searched internet for a function to find exact square root of BigInt using scala programming language. I didn't get one, But saw one Java Program and I converted that function into Scala version. It is working but I am not sure, whether it can handle very large BigInt. But it returns BigInt only. Not BigDecimal as Square Root. It shows there is some bit manipulation done in the code with some hard coding of numbers like shiftRight(5), BigInt("8") and shiftRight(1). I can understand the logic clearly, But not the hard coding of these bitshift numbers and the number 8. May be these bitshift functions are not available in scala, and thats why it is needed to convert to java BigInteger at few places. These hard coded numbers may impact the precision of the result.I just changed the java code into scala code just copying the exact algorithm. And here is the code I have written in scala:
def sqt(n:BigInt):BigInt = {
var a = BigInt(1)
var b = (n>>5)+BigInt(8)
while((b-a) >= 0) {
var mid:BigInt = (a+b)>>1
if(mid*mid-n> 0) b = mid-1
else a = mid+1
}
a-1
}
My Points are:
Can't we return a BigDecimal instead of BigInt? How can we do that?
How these hardcoded numbers shiftRight(5), shiftRight(1) and 8 are related
to precision of the result.
I tested for one number in scala REPL: The function sqt is giving exact square root of the squared number. but not for the actual number as below:
scala> sqt(BigInt("19928937494873929279191794189"))
res9: BigInt = 141169888768369
scala> res9*res9
res10: scala.math.BigInt = 19928937494873675935734920161
scala> sqt(res10)
res11: BigInt = 141169888768369
scala>
I understand shiftRight(5) means divide by 2^5 ie.by 32 in decimal and so on..but why 8 is added here after shift operation? why exactly 5 shifts? as a first guess?
Your question 1 and question 3 are actually the same question.
How [do] these bitshifts impact [the] precision of the result?
They don't.
How [are] these hardcoded numbers ... related to precision of the result?
They aren't.
There are many different methods/algorithms for estimating/calculating the square root of a number (as can be seen here). The algorithm you've posted appears to be a pretty straight forward binary search.
Pick a number a guaranteed to be smaller than the target (square root of n).
Pick a number b guaranteed to be larger than the target (square root of n).
Calculate mid, the whole number mid-point between a and b.
If mid is larger than (or equal to) the target then move b to mid (-1 because we know it's too large).
If mid is smaller than the target then move a to mid (+1 because we know it's too small).
Repeat 3,4,5 until a is no longer less than b.
Return a-1 as the square root of n rounded down to a whole number.
The bitshifts and hardcoded numbers are used in selecting the initial value of b. But b only has be greater than the target. We could have just done var b = n. Why all the bother?
It's all about efficiency. The closer b is to the target, the fewer iterations are needed to find the result. Why add 8 after the shift? Because 31>>5 is zero, which is not greater than the target. The author chose (n>>5)+8 but he/she might have chosen (n>>7)+12. There are trade-offs.
Can't we return a BigDecimal instead of BigInt? How can we do that?
Here's one way to do that.
def sqt(n:BigInt) :BigDecimal = {
val d = BigDecimal(n)
var a = BigDecimal(1.0)
var b = d
while(b-a >= 0) {
val mid = (a+b)/2
if (mid*mid-d > 0) b = mid-0.0001 //adjust down
else a = mid+0.0001 //adjust up
}
b
}
There are better algorithms for calculating floating-point square root values. In this case you get better precision by using smaller adjustment values but the efficiency gets much worse.
Can't we return a BigDecimal instead of BigInt? How can we do that?
This makes no sense if you want exact roots: if a BigInt's square root can be represented exactly by a BigDecimal, it can be represented by a BigInt. If you don't want exact roots, you'll need to specify precision and modify the algorithm (and for most cases, Double will be good enough and much much much faster than BigDecimal).
I understand shiftRight(5) means divide by 2^5 ie.by 32 in decimal and so on..but why 8 is added here after shift operation? why exactly 5 shifts? as a first guess?
These aren't the only options. The point is that for every positive n, n/32 + 8 >= sqrt(n) (where sqrt is the mathematical square root). This is easiest to show by a bit of calculus (or just by building a graph of the difference). So at the start we know a <= sqrt(n) <= b (unless n == 0 which can be checked separately), and you can verify this remains true on each step.

Merge Sort algorithm efficiency

I am currently taking an online algorithms course in which the teacher doesn't give code to solve the algorithm, but rather rough pseudo code. So before taking to the internet for the answer, I decided to take a stab at it myself.
In this case, the algorithm that we were looking at is merge sort algorithm. After being given the pseudo code we also dove into analyzing the algorithm for run times against n number of items in an array. After a quick analysis, the teacher arrived at 6nlog(base2)(n) + 6n as an approximate run time for the algorithm.
The pseudo code given was for the merge portion of the algorithm only and was given as follows:
C = output [length = n]
A = 1st sorted array [n/2]
B = 2nd sorted array [n/2]
i = 1
j = 1
for k = 1 to n
if A(i) < B(j)
C(k) = A(i)
i++
else [B(j) < A(i)]
C(k) = B(j)
j++
end
end
He basically did a breakdown of the above taking 4n+2 (2 for the declarations i and j, and 4 for the number of operations performed -- the for, if, array position assignment, and iteration). He simplified this, I believe for the sake of the class, to 6n.
This all makes sense to me, my question arises from the implementation that I am performing and how it effects the algorithms and some of the tradeoffs/inefficiencies it may add.
Below is my code in swift using a playground:
func mergeSort<T:Comparable>(_ array:[T]) -> [T] {
guard array.count > 1 else { return array }
let lowerHalfArray = array[0..<array.count / 2]
let upperHalfArray = array[array.count / 2..<array.count]
let lowerSortedArray = mergeSort(array: Array(lowerHalfArray))
let upperSortedArray = mergeSort(array: Array(upperHalfArray))
return merge(lhs:lowerSortedArray, rhs:upperSortedArray)
}
func merge<T:Comparable>(lhs:[T], rhs:[T]) -> [T] {
guard lhs.count > 0 else { return rhs }
guard rhs.count > 0 else { return lhs }
var i = 0
var j = 0
var mergedArray = [T]()
let loopCount = (lhs.count + rhs.count)
for _ in 0..<loopCount {
if j == rhs.count || (i < lhs.count && lhs[i] < rhs[j]) {
mergedArray.append(lhs[i])
i += 1
} else {
mergedArray.append(rhs[j])
j += 1
}
}
return mergedArray
}
let values = [5,4,8,7,6,3,1,2,9]
let sortedValues = mergeSort(values)
My questions for this are as follows:
Do the guard statements at the start of the merge<T:Comparable> function actually make it more inefficient? Considering we are always halving the array, the only time that it will hold true is for the base case and when there is an odd number of items in the array.
This to me seems like it would actually add more processing and give minimal return since the time that it happens is when we have halved the array to the point where one has no items.
Concerning my if statement in the merge. Since it is checking more than one condition, does this effect the overall efficiency of the algorithm that I have written? If so, the effects to me seems like they vary based on when it would break out of the if statement (e.g at the first condition or the second).
Is this something that is considered heavily when analyzing algorithms, and if so how do you account for the variance when it breaks out from the algorithm?
Any other analysis/tips you can give me on what I have written would be greatly appreciated.
You will very soon learn about Big-O and Big-Theta where you don't care about exact runtimes (believe me when I say very soon, like in a lecture or two). Until then, this is what you need to know:
Yes, the guards take some time, but it is the same amount of time in every iteration. So if each iteration takes X amount of time without the guard and you do n function calls, then it takes X*n amount of time in total. Now add in the guards who take Y amount of time in each call. You now need (X+Y)*n time in total. This is a constant factor, and when n becomes very large the (X+Y) factor becomes negligible compared to the n factor. That is, if you can reduce a function X*n to (X+Y)*(log n) then it is worthwhile to add the Y amount of work because you do fewer iterations in total.
The same reasoning applies to your second question. Yes, checking "if X or Y" takes more time than checking "if X" but it is a constant factor. The extra time does not vary with the size of n.
In some languages you only check the second condition if the first fails. How do we account for that? The simplest solution is to realize that the upper bound of the number of comparisons will be 3, while the number of iterations can be potentially millions with a large n. But 3 is a constant number, so it adds at most a constant amount of work per iteration. You can go into nitty-gritty details and try to reason about the distribution of how often the first, second and third condition will be true or false, but often you don't really want to go down that road. Pretend that you always do all the comparisons.
So yes, adding the guards might be bad for your runtime if you do the same number of iterations as before. But sometimes adding extra work in each iteration can decrease the number of iterations needed.

How to fix stackoverflow exception in scala

def factorial(n : Int) : Int = {
if(n==1)
1
else
n * factorial(n-1)
}
println(factorial(500000))
While I am passing large value its throwing stackoverflow exception Can we fix it?
The question seems theoretical, because factorial of 500000 is really a huge number. The result is so huge it is not representable by IEEE Double, and I doubt there is any practical reason why to compute it.
Some math calculators (like SpeedCrunch) let you compute factorial using the gamma function, probably using some approximation for large numbers. The SpeedCrunch result of gamma(500000 + 1)is 1.02280158465190236533 * 10 2632341.
However, if you insist on doing it, this is how it can be done:
Implement factorial using tail recursion instead. See Tail Recursion in Scala: A Simple Example or http://alvinalexander.com/scala/scala-recursion-examples-recursive-programming
Note: you will still get integer arithmetics overflow for large inputs, and the result will be wrong for them. The largest input for a 32b signed integer for which the result will still fit in a 32b signed result is 12 (cf. Factorial does not work for all values)
You can avoid this by using Double to compute the result (you will get approximate result only for large numbers, and Infinity for 500000) or by using BigInt - the calculation will work for all values, but it will get slower.
Following code should produce the correct result, but it might take very long, and the result will be very long - you might perhaps even get out of memory errors. I tried computing factorial of 50000 with it and it took several seconds, and the resulting number was several pages long.
def factorial(n: Long): BigInt = {
#tailrec
def factorialAccumulator(acc: BigInt, n: Long): BigInt = {
if (n == 0) acc
else factorialAccumulator(n*acc, n-1)
}
factorialAccumulator(1, n)
}