I am looking through Rng sources to see how they generate a list of random values.
They define a function fill:
def fill(n: Int): Rng[List[A]] = sequence(List.fill(n)(this))
where sequence is just an invocation of Traverse.sequence from scalaz:
def sequence[T[_], A](x: T[Rng[A]])(implicit T: Traverse[T]): Rng[T[A]] =
T.sequence(x)
In other words they create a temporary list List[Rang[A]] and then apply sequence: List[Rng[A]] => Rng[List[A]]. I see how it works but the temporary list looks list a waste of memory to me. Is it absolutely necessary ? Can it be improved ?
This is a slightly faster implementation. I didn't profile to see if there was a noticeable impact on the heap. I did a rough timing test and it took roughly 70% of the time Rng.fill took to fill a 1M item list with random Ints. I didn't attempt to find out how these scaled with different size lists. See https://gist.github.com/drstevens/77db6bab6b1e995dac13
def fill[A](a: Rng[A], count: Int): Rng[List[A]] =
Stream.from(0).take(count).traverseU(_ => a).map(_.toList)
The interesting thing about this is that the toList isn't evaluated until unsafePerformIO.
Related
I have a function whose purpose is to divide a dataset into arrays of a given size.
For example - I have a dataset with 123 objects of the Foo type, I provide to the function arraysSize 10 so as a result I will have a Dataset[Array[Foo]] with 12 arrays with 10 Foo's and 1 array with 3 Foo.
Right now function is working on collected data - I would like to change it on dataset based because of performance but I dont know how.
This is my current solution:
private def mapToFooArrays(data: Dataset[Foo],
arraysSize: Int): Dataset[Array[Foo]]= {
data.collect().grouped(arraysSize).toSeq.toDS()
}
The reason for doing this transformation is because the data will be sent in the event. Instead of sending 1 million events with information about 1 object, I prefer to send, for example, 10 thousand events with information about 100 objects
IMO, this is a weird use case. I can not think of any efficient solution to do this, as it is going to require a lot of shuffling no matter how we do it.
But, the following is still better, as it avoids collecting to the driver node and will thus be more scalable.
Things to keep in mind -
what is the value of data.count() ?
what is the size of a single Foo ?
what is the value of arraySize ?
what is your executor configuration ?
Based on these factors you will be able to come up with the desiredArraysPerPartition value.
val desiredArraysPerPartition = 50
private def mapToFooArrays(
data: Dataset[Foo],
arraysSize: Int
): Dataset[Array[Foo]] = {
val size = data.count()
val numArrays = (size.toDouble / arrarySize).ceil
val numPartitions = (numArrays.toDouble / desiredArraysPerPartition).ceil
data
.repartition(numPartitions)
.mapPartitions(_.grouped(arrarySize).map(_.toArray))
}
After reading the edited part, I think that 100 size in 10 thousand events with information about 100 objects is not really important. As it is referred as about 100. There can be more than one events with less than 100 Foo's.
If we are not very strict about that 100 size, then there is no need of reshuffle.
We can locally group the Foo's present in each of the partitions. As this grouping is being done locally and not globally, this might result in more than one (potentially one for each partition) Arrays with less than 100 Foo's.
private def mapToFooArrays(
data: Dataset[Foo],
arraysSize: Int
): Dataset[Array[Foo]] =
data
.mapPartitions(_.grouped(arrarySize).map(_.toArray))
I am fairly new to ScalaCheck and somehow I want to generate a list of distinct values (i.e. a set). With my approach the values are not unique.
val valueList: Gen[List[Int]] = Gen.ListOf(Arbitrary.arbitrary[Int])
What are the possible ways to create a list with unique values/a set? Maybe with using suchThat or distinct?
Crude but effective:
Gen.listOf(Arbitrary.arbitrary[Int]).map(_.toSet)
If you wanted a set of a particular size, you could do something like
def setOfN[A](n: Int, gen: Gen[A]): Gen[Set[A]] =
Gen.listOfN(n, gen).flatMap { lst =>
val set = lst.toSet
if (set.size == n) Gen.const(set)
else if (set.size > n) Gen.const(set.take(n))
else setOfN(n - set.size, gen.retryUntil(x => !set(x))).flatMap(_ ++ set)
}
(It's worth noting that retryUntil can be fragile, especially as n grows relative to the number of commonly generated values (e.g. for Int Scalacheck generates 0, +/- 1, etc. fairly frequently))
And of course, since there's an org.scalacheck.util.Buildable instance for Set
Gen.containerOf[Set, Int](Arbitrary.arbitrary[Int])
Following up on what Levi Ramsey said, you might want to check out https://github.com/rallyhealth/scalacheck-ops for the ability to generate sets of a specific size.
import org.scalacheck.Arbitrary.arbitrary
import org.scalacheck.Gen
import org.scalacheck.ops._
Gen.setOfN(5, arbitrary[Int]).map(_.toVector)
I have a comparator like this:
lazy val seq = mapping.toSeq.sortWith { case ((_, set1), (_, set2)) =>
// Just propose all the most connected nodes first to the users
// But also allow less connected nodes to pop out sometimes
val popOutChance = random.nextDouble <= 0.1D && set2.size > 5
if (popOutChance) set1.size < set2.size else set1.size > set2.size
}
It is my intention to compare sets sizes such that smaller sets may appear higher in a sorted list with 10% chance.
But compiler does not let me do that and throws an Exception: java.lang.IllegalArgumentException: Comparison method violates its general contract! once I try to use it in runtime. How can I override it?
I think the problem here is that, every time two elements are compared, the outcome is random, thus violating the transitive property required of a comparator function in any sorting algorithm.
For example, let's say that some instance a compares as less than b, and then b compares as less than c. These results should imply that a compares as less than c. However, since your comparisons are stochastic, you can't guarantee that outcome. In fact, you can't even guarantee that a will be less than b next time they're compared.
So don't do that. No sort algorithm can handle it. (Such an approach also violates the referential transparency principle of functional programming and will make your program much harder to reason about.)
Instead, what you need to do is to decorate your map's members with a randomly assigned weighting - before attempting to sort them - so that they can be sorted consistently. However, since this happens at the start of a sort operation, the result of the sort will be different each time, which I think is what you're looking for.
It's not clear what type mapping has in your example, but it appears to be something like: Map[Any, Set[_]]. (You can replace the types as required - it's not that important to this approach. For example, say mapping actually has the type Map[String, Set[SomeClass]], then you would replace references below to Any with String and Set[_] to Set[SomeClass].)
First, we'll create a case class that we'll use to score and compare the map elements. Then we'll map the contents of mapping to a sequence of elements of this case class. Next, we sort those elements. Finally, we extract the tuple from the decorated class. The result should look something like this:
final case class Decorated(x: (Any, Set[_]), rand: Double = random.nextDouble)
extends Ordered[Decorated] {
// Calculate a rank for this element. You'll need to change this to suit your precise
// requirements. Here, if rand is less than 0.1 (a 10% chance), I'm adding 5 to the size;
// otherwise, I'll report the actual size. This allows transitive comparisons, since
// rand doesn't change once defined. Values are negated so bigger sets come to the fore
// when sorted.
private def rank: Int = {
if(rand < 0.1) -(x._2.size + 5)
else -x._2.size
}
// Compare this element with another, by their ranks.
override def compare(that: Decorated): Int = rank.compare(that.rank)
}
// Now sort your mapping elements as follows and convert back to tuples.
lazy val seq = mapping.map(x => Decorated(x)).toSeq.sorted.map(_.x)
This should put the elements with larger sets towards the front, but there's 10% chance that sets appear 5 bigger and so move up the list. The result will be different each time the last line is re-executed, since map will create new random values for each element. However, during sorting, the ranks will be fixed and will not change.
(Note that I'm setting the rank to a negative value. The Ordered[T] trait sorts elements in ascending order, so that - if we sorted purely by set size - smaller sets would come before larger sets. By negating the rank value, sorting will put larger sets before smaller sets. If you don't want this behavior, remove the negations.)
When calling nextString() from the built-in scala.util.Random library, what time does it take to run? Is that O(n)?
Yes, it's O(n). It can't be any lower, because it creates a new string and that has O(n) cost. It shouldn't be any higher, because creating a random number is O(1) and that's enough to pick a character or word or something. And in practice it's actually O(n).
The constant factor is pretty high, though, due to how it's implemented. If it is important to you to make random strings really fast, you should get your own high-performance random number generator and pack chars into a char array.
Couldn't find anything on Scala docs, but from the source code:
def nextString(length: Int) = {
def safeChar() = {
val surrogateStart: Int = 0xD800
val res = nextInt(surrogateStart - 1) + 1
res.toChar
}
List.fill(length)(safeChar()).mkString
}
I would say O(n), assuming O(1) from nextInt(), on the length of the string asked
This might be the least important Scala question ever, but it's bothering me. How would I generate a list of n random number. What I have so far:
def n_rands(n : Int) = {
val r = new scala.util.Random
1 to n map { _ => r.nextInt(100) }
}
Which works, but doesn't look very Scalarific to me. I'm open to suggestions.
EDIT
Not because it's relevant so much as it's amusing and obvious in retrospect, the following looks like it works:
1 to 20 map r.nextInt
But the index of each entry in the returned list is also the upper bound of that last. The first number must be less than 1, the second less than 2, and so on. I ran it three or four times and noticed "Hmmm, the result always starts with 0..."
You can either use Don's solution or:
Seq.fill(n)(Random.nextInt)
Note that you don't need to create a new Random object, you can use the default companion object Random, as stated above.
How about:
import util.Random.nextInt
Stream.continually(nextInt(100)).take(10)
regarding your EDIT,
nextInt can take an Int argument as an upper bound for the random number, so 1 to 20 map r.nextInt is the same as 1 to 20 map (i => r.nextInt(i)), rather than a more useful compilation error.
1 to 20 map (_ => r.nextInt(100)) does what you intended. But it's better to use Seq.fill since that more accurately represents what you're doing.