I am processing a large number of records (CDRS) that are essentially (who, where, how much), to save space I use a lookup to map the strings into integer and aggregate the traffic on a map of maps (who maps to a map (where maps how much)
type CDR = (String, String, Int)
type Lookup = scala.collection.mutable.HashMap[String, (Int, Float)]
type Traffic = scala.collection.mutable.HashMap[Int,scala.collection.mutable.HashMap[Int,Int]]enter code here
I have found a strange behavior, when I build the lookup tables in advance the code runs as expected, however when I start processing and build the maps on the fly it slows down as it processes the records.
I use the same function to build the lookup tables for this comparison. I essentially check if the code for the lookup is there, if not i create a new entry (it is a mutable map), like this:
def index(id: String, map: Lookup, reverse: Reverse): Int = {
if (map.contains(id)) {
map(id)._1
} else {
val number = if (map.keys.size == 0) 0 else reverse.keys.max + 1
reverse += ( number -> id)
map += (id -> (number, 0.toFloat))
number
}
}
Am I missing something here?
EDIT----> I can no longer reproduce the slowdown. I will assume I was either too tired or dumber than usual. Running time now seems to be same as I expected to be.
What is mapCellRvs? Default scala Map's .size (and .keys.size, which is the same thing) simply counts all elements by scanning them linearly.
Try replacing mapCellRvs.keys.size == 0 with mapCellRvs.isEmpty ...
Also, reverse.keys.max is linear as well. You may want to just remember the max somewhere separately, rather than compute it every time.
Related
I am new to Scala and I would like to understand some basic stuff.
First of all, I need to calculate the average of a certain column of a DataFrame and use the result as a double type variable.
After some Internet research I was able to calculate the average and at the same time pass it into a List type Any by using the following command:
val avgX_List = mainDataFrame.groupBy().agg(mean("_c1")).collect().map(_(0)).toList
where "_c1" is the second column of my dataframe. This line of code returns a List with type List[Any].
To pass the result into a variable I used the following command:
var avgX = avgX_List(0)
hoping that the var avgX would be type double automatically but that didn't happen obviously.
So now let the questions begin:
What does map(_(0)) do? I know the basic definition of the map() transformation but I can't find an explanation with this exact argument
I know that by using .toList method in the end of the command my result will be a List with type Any. Is there a way that I could change this into List which contains type Double elements? Or even convert this one
Do you think that it would be much more appropriate to pass the column of my Dataframe into a List[Double] and then calculate the average of its elements?
Is the solution I showed above at any point of view correct based on my problem? I know that "it is working" is different from "correct solution"?
Summing up, I need to calculate the average of a certain column of a Dataframe and have the result as a double type variable.
Note that: I am Greek and I find it hard sometimes to understand some English coding "slang".
map(_(0)) is a shortcut for map( (r: Row) => r(0) ), which is in turn a shortcut for map( (r: Row) => r.apply(0) ). The apply method returns Any, and so you are losing the right type. Try using map(_.getAs[Double](0)) or map(_.getDouble(0)) instead.
Collecting all entries of the column and then computing the average would be highly counterproductive, because you'd have to send huge amounts of data to the master node, and then do all the calculations on this single central node. That would be the exact opposite of what Spark is good for.
You also don't need collect(...).toList, because you can access the 0-th entry directly (it doesn't matter whether you get it from an Array or from a List). Since you are collapsing everything into a single Row anyway, you could get rid of the map step entirely by reordering the methods a little bit:
val avgX = mainDataFrame.groupBy().agg(mean("_c1")).collect()(0).getDouble(0)
It can be written even shorter using the first method:
val avgX = mainDataFrame.groupBy().agg(mean("_c1")).first().getDouble(0)
#Any dataType in Scala can't be directly converted to Double.
#Use toString & then toDouble on final captured result.
#Eg-
#scala> x
#res22: Any = 1.0
#scala> x.toString.toDouble
#res23: Double = 1.0
#Note- Instead of using map().toList() directly use (0)(0) to get the final value from your resultset.
#TestSample(Scala)-
val wa = Array("one","two","two")
val wrdd = sc.parallelize(wa,3).map(x=>(x,1))
val wdf = wrdd.toDF("col1","col2")
val x = wdf.groupBy().agg(mean("col2")).collect()(0)(0).toString.toDouble
#O/p-
#scala> val x = wdf.groupBy().agg(mean("col2")).collect()(0)(0).toString.toDouble
#x: Double = 1.0
I have a comparator like this:
lazy val seq = mapping.toSeq.sortWith { case ((_, set1), (_, set2)) =>
// Just propose all the most connected nodes first to the users
// But also allow less connected nodes to pop out sometimes
val popOutChance = random.nextDouble <= 0.1D && set2.size > 5
if (popOutChance) set1.size < set2.size else set1.size > set2.size
}
It is my intention to compare sets sizes such that smaller sets may appear higher in a sorted list with 10% chance.
But compiler does not let me do that and throws an Exception: java.lang.IllegalArgumentException: Comparison method violates its general contract! once I try to use it in runtime. How can I override it?
I think the problem here is that, every time two elements are compared, the outcome is random, thus violating the transitive property required of a comparator function in any sorting algorithm.
For example, let's say that some instance a compares as less than b, and then b compares as less than c. These results should imply that a compares as less than c. However, since your comparisons are stochastic, you can't guarantee that outcome. In fact, you can't even guarantee that a will be less than b next time they're compared.
So don't do that. No sort algorithm can handle it. (Such an approach also violates the referential transparency principle of functional programming and will make your program much harder to reason about.)
Instead, what you need to do is to decorate your map's members with a randomly assigned weighting - before attempting to sort them - so that they can be sorted consistently. However, since this happens at the start of a sort operation, the result of the sort will be different each time, which I think is what you're looking for.
It's not clear what type mapping has in your example, but it appears to be something like: Map[Any, Set[_]]. (You can replace the types as required - it's not that important to this approach. For example, say mapping actually has the type Map[String, Set[SomeClass]], then you would replace references below to Any with String and Set[_] to Set[SomeClass].)
First, we'll create a case class that we'll use to score and compare the map elements. Then we'll map the contents of mapping to a sequence of elements of this case class. Next, we sort those elements. Finally, we extract the tuple from the decorated class. The result should look something like this:
final case class Decorated(x: (Any, Set[_]), rand: Double = random.nextDouble)
extends Ordered[Decorated] {
// Calculate a rank for this element. You'll need to change this to suit your precise
// requirements. Here, if rand is less than 0.1 (a 10% chance), I'm adding 5 to the size;
// otherwise, I'll report the actual size. This allows transitive comparisons, since
// rand doesn't change once defined. Values are negated so bigger sets come to the fore
// when sorted.
private def rank: Int = {
if(rand < 0.1) -(x._2.size + 5)
else -x._2.size
}
// Compare this element with another, by their ranks.
override def compare(that: Decorated): Int = rank.compare(that.rank)
}
// Now sort your mapping elements as follows and convert back to tuples.
lazy val seq = mapping.map(x => Decorated(x)).toSeq.sorted.map(_.x)
This should put the elements with larger sets towards the front, but there's 10% chance that sets appear 5 bigger and so move up the list. The result will be different each time the last line is re-executed, since map will create new random values for each element. However, during sorting, the ranks will be fixed and will not change.
(Note that I'm setting the rank to a negative value. The Ordered[T] trait sorts elements in ascending order, so that - if we sorted purely by set size - smaller sets would come before larger sets. By negating the rank value, sorting will put larger sets before smaller sets. If you don't want this behavior, remove the negations.)
New to Scala. I'm iterating a for loop 100 times. 10 times I want condition 'a' to be met and 90 times condition 'b'. However I want the 10 a's to occur at random.
The best way I can think is to create a val of 10 random integers, then loop through 1 to 100 ints.
For example:
val z = List.fill(10)(100).map(scala.util.Random.nextInt)
z: List[Int] = List(71, 5, 2, 9, 26, 96, 69, 26, 92, 4)
Then something like:
for (i <- 1 to 100) {
whenever i == to a number in z: 'Condition a met: do something'
else {
'condition b met: do something else'
}
}
I tried using contains and == and =! but nothing seemed to work. How else can I do this?
Your generation of random numbers could yield duplicates... is that OK? Here's how you can easily generate 10 unique numbers 1-100 (by generating a randomly shuffled sequence of 1-100 and taking first ten):
val r = scala.util.Random.shuffle(1 to 100).toList.take(10)
Now you can simply partition a range 1-100 into those who are contained in your randomly generated list and those who are not:
val (listOfA, listOfB) = (1 to 100).partition(r.contains(_))
Now do whatever you want with those two lists, e.g.:
println(listOfA.mkString(","))
println(listOfB.mkString(","))
Of course, you can always simply go through the list one by one:
(1 to 100).map {
case i if (r.contains(i)) => println("yes: " + i) // or whatever
case i => println("no: " + i)
}
What you consider to be a simple for-loop actually isn't one. It's a for-comprehension and it's a syntax sugar that de-sugares into chained calls of maps, flatMaps and filters. Yes, it can be used in the same way as you would use the classical for-loop, but this is only because List is in fact a monad. Without going into too much details, if you want to do things the idiomatic Scala way (the "functional" way), you should avoid trying to write classical iterative for loops and prefer getting a collection of your data and then mapping over its elements to perform whatever it is that you need. Note that collections have a really rich library behind them which allows you to invoke cool methods such as partition.
EDIT (for completeness):
Also, you should avoid side-effects, or at least push them as far down the road as possible. I'm talking about the second example from my answer. Let's say you really need to log that stuff (you would be using a logger, but println is good enough for this example). Doing it like this is bad. Btw note that you could use foreach instead of map in that case, because you're not collecting results, just performing the side effects.
Good way would be to compute the needed stuff by modifying each element into an appropriate string. So, calculate the needed strings and accumulate them into results:
val results = (1 to 100).map {
case i if (r.contains(i)) => ("yes: " + i) // or whatever
case i => ("no: " + i)
}
// do whatever with results, e.g. print them
Now results contains a list of a hundred "yes x" and "no x" strings, but you didn't do the ugly thing and perform logging as a side effect in the mapping process. Instead, you mapped each element of the collection into a corresponding string (note that original collection remains intact, so if (1 to 100) was stored in some value, it's still there; mapping creates a new collection) and now you can do whatever you want with it, e.g. pass it on to the logger. Yes, at some point you need to do "the ugly side effect thing" and log the stuff, but at least you will have a special part of code for doing that and you will not be mixing it into your mapping logic which checks if number is contained in the random sequence.
(1 to 100).foreach { x =>
if(z.contains(x)) {
// do something
} else {
// do something else
}
}
or you can use a partial function, like so:
(1 to 100).foreach {
case x if(z.contains(x)) => // do something
case _ => // do something else
}
Say I have a map that looks like this
val map = Map("Shoes" -> 1, "heels" -> 2, "sneakers" -> 3, "dress" -> 4, "jeans" -> 5, "boyfriend jeans" -> 6)
And also I have a set or collection that looks like this:
val set = Array(Array("Shoes", "heels", "sneakers"), Array("dress", "maxi dress"), Array("jeans", "boyfriend jeans", "destroyed jeans"))
I would like to perform a filter operation on my map so that only one element in each of my set retains. Expected output should be something like this:
map = Map("Shoes" -> 1, "dress" -> 4 ,"jeans" -> 5)
The purpose of doing this is so that if I have multiple sets that indicate different categories of outfits, my output map doesn't "repeat" itself on technically the same objects.
Any help is appreciated, thanks!
So first get rid of the confusion that your sets are actually arrays. For the rest of the example I will use this definition instead:
val arrays = Array(Array("Shoes", "heels", "sneakers"), Array("dress", "maxi dress"), Array("jeans", "boyfriend jeans", "destroyed jeans"))
So in a sense you have an array of arrays of equivalent objects and want to remove all but one of them?
Well first you have to find which of the elements in an array are actually used as keys in the mep. So we just filter out all elements that are not used as keys:
array.filter(map.keySet)
Now, we have to chose one element. As you said, we just take the first one:
array.filter(map.keySet).head
As your "sets" are actually arrays, this is really the first element in your array that is also used as a key. If you would actually use sets this code would still work as sets actually have a "first element". It is just highly implementations specific and it might not even be deterministic over various executions of the same program. At least for immutable sets it should however be deterministic over several calls to head, i.e., you should always get the same element.
Instead of the first element we are actually interested in all other elements, as we want to remove them from the map:
array.filter(map.keySet).tail
Now, we just have to remove those from the map:
map -- array.filter(map.keySet).tail
And to do it for all arrays:
map -- arrays.flatMap(_.filter(map.keySet).tail)
This works fine as long as the arrays are disjoined. If they are not, we can not take the initial map to filter the array in every step. Instead, we have to use one array to compute a new map, then take the next starting with the result from the last and so on. Luckily, we do not have to do much:
arrays.foldLeft(map){(m,a) => m -- a.filter(m.keySet).tail}
Note: Sets are also functions from elements to Boolean, this is, why this solution works.
This code solves the problem:
var newMap = map
set.foreach { list =>
var remove = false
list.foreach { _key =>
if (remove) {
newMap -= _key
}
if (newMap.contains(_key)) {
remove = true
}
}
}
I'm completely new at Scala. I have taken this as my first Scala
example, please any hints from Scala's Gurus is welcome.
The basic idea is to use groupBy. Something like
map.groupBy{ case (k,v) => g(k) }.
map{ case (_, kvs) => kvs.head }
This is the general way to group similar things (using some function g). Now the question is just how to make the g that you need. One way is
val g = set.zipWithIndex.
flatMap{ case (a, i) => a.map(x => x -> i) }.
toMap
which labels each set with a number, and then forms a map so you can look it up. Maps have an apply function, so you can use it as above.
A slightly simpler version
set.flatMap(_.find(map.contains).map(y => y -> map(y)))
If I have a custom generator then the shrinker will remember my suchThat clause and not shrink with invalid values:
val myGen = Gen.identifier.suchThat { _.length > 3 }
// all shrinks have > 3 characters
property("failing case") = forAll (myGen) { (a: String) =>
println(s"Gen suchThat Value: $a")
a == "Impossible"
}
If I do something further to the generated value (ie map it) then the shrinker "forgets" my suchThat clause:
// the shrinker will shrink all the way down to ""
property("failing case") = forAll (myGen.map{_ + "bbb"}) { (a: String) =>
println(s"Gen with map Value: $a")
a == "Impossible"
}
Is it possible to have suchThat values propagate through generators. In my real project I am doing more than a simple map but that seems to be the simplest example of the limitation I am hitting.
I'm fairly certain the answer is no (at least at this point in time).
This is quite annoying although perhaps not as trivial as it seems. The generator result does attempt to keep track of the sieve although it gets lost in map and flatMap. Apart from applying the sieve to the result of the shrink there isn't any other connection back to the generator. Even if there were all the intermediate results would need to be retained and applied to each sieve at the correct points. That then raises the question of: What exactly is being shrunk? The generated result or the original generator(s)?
The only solution that I have found so far is to either:
Disable shrinking, or
Implement a custom Shrink, or
Add a whenever clause that rechecks the generated value.
This can be quite challenging, especially when composing multiple generators.