getting number of values within reduceByKey RDD - scala

When the reduceByKey operation is called, it receives the list of values for a particular key. My questions are:
Are the values it receives in sorted order?
Is it possible to know how many values it receives?
I'm trying to calculate the first quartile of the list of values of a key within reduceByKey. Is this possible to do within reduceByKey?

1. No, that would be totally going against the whole point of a reduce operation - i.e. to parallelize an operation into an arbitrary tree of suboperations by taking advantage of associativity and commutativity.
2. You'll need to define a new monoid by composing the integer monoid and whatever it is you're doing. Let's assume your operation is op; then
yourRdd.map(kv => (kv._1, (kv._2, 1)))
.reduceByKey((left, right) => (left._1 op right._1, left._2 + right._2))
will give you an RDD[(KeyType, (ReducedValueType, Int))], where the Int is the number of values the reduce received for each key; a concrete sketch follows after point 3.
3. You'll have to be more specific about what you mean by first quartile. Given that the answer to 1 is no, you would need a bound that defines the first quartile; then you won't need the data to be sorted, because you could filter the values by that bound.
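To make point 2 concrete, here is a minimal sketch with addition standing in for op (it assumes the values are numeric):
// addition stands in for `op`; any associative, commutative operation works
val counted = yourRdd
  .map { case (k, v) => (k, (v, 1)) }
  .reduceByKey { case ((sum1, n1), (sum2, n2)) => (sum1 + sum2, n1 + n2) }
// counted: RDD[(KeyType, (Sum, Int))]; the Int is the number of values per key
And for point 3, assuming you already have such a bound (q1Bound here is hypothetical, not computed), the filtering needs no sort at all:
val firstQuartileValues = yourRdd.filter { case (_, v) => v <= q1Bound }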

Related

How does scala's VectorMap work and how is it different than ListMap?

How does Scala's VectorMap work? It says that it is constant time for lookup.
I think ListMap has to iterate through everything to find an entry. Why would vector map be different?
Is it a hash table combined with a vector, where the hash table will map a key to an index in the vector, which has the entries?
Essentially, yes. It has a regular Map inside that maps keys to tuples (index, value), where the index points into a Vector of keys, which is only used for in-order access (.head, .tail, .last, .iterator, etc.).
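A small sketch of that behaviour (Scala 2.13+; the keys and values are illustrative):
import scala.collection.immutable.VectorMap

val vm = VectorMap("a" -> 1, "c" -> 3, "b" -> 2)
vm("c")     // keyed lookup goes through the internal Map
vm.head     // ("a", 1): in-order access goes through the internal Vector
vm.toList   // List(("a", 1), ("c", 3), ("b", 2)): insertion order is preserved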

Are there any ways to speed up Map.getOrElse(val, 0) on big tuple maps?

I have a simple immutable Map in Scala:
// "..." means "and so on"
val myLibrary = Map("qwe" -> 1.2, "qasd" -> -0.59, ...)
And for that map I call a MyFind method, which calls getOrElse(str, 0):
def MyFind(srcMap: Map[String, Double], str: String): Double = {
  srcMap.getOrElse(str, 0.0)
}
val res = MyFind(myLibrary, "qwe")
The problem is that this method is called several times for different input strings. E.g., I assume that for a map of length 100 and one input string it will try to compare that string 100 times (once per map entry). As you can guess, for 10,000 entries that becomes 10,000 comparisons.
Because of that, with a huge map of over 10,000 entries, my method for finding the values of string keys significantly slows down the work.
What can you advise to speed up that code?
Maybe use another type of Map?
Maybe another collection?
Map does not have linear-time lookup. The default concrete implementation of Map is HashMap: Map is the interface for immutable maps, while scala.collection.immutable.HashMap is a concrete implementation, which has effectively constant lookup time, as per the collections performance characteristics:
                 lookup   add   remove   min
HashSet/HashMap    eC     eC      eC      L
E.g., I assume that for a map of length 100 and one input string it will try to compare that string 100 times (once per map entry). As you can guess, for 10,000 entries that becomes 10,000 comparisons.
No, it won't. That's rather the point of Map in the first place. While it allows implementations which do require checking each value one by one (such as ListMap), they are very rarely used; by default, when calling Map(...), you'll get a HashMap, which doesn't. Its lookup is logarithmic time (with a large base), so going from 100 to 10,000 entries roughly doubles the cost instead of increasing it by 100 times.
Because of that, with a huge map of over 10,000 entries, my method for finding the values of string keys significantly slows down the work.
10000 is quite small.
Actually, look at http://www.lihaoyi.com/post/BenchmarkingScalaCollections.html#performance. You can also see there that mutable maps are much faster. Note that this predates the collection changes in Scala 2.13, so the numbers may have changed.
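If the lookups really dominate, one option along those lines is to build a mutable HashMap once and reuse it; a minimal sketch (Scala 2.13+, reusing the names from the question, with Double values to match myLibrary):
import scala.collection.mutable

// one-off conversion; afterwards every lookup hits the mutable hash table
val fastLibrary = mutable.HashMap.from(myLibrary)
def myFind(srcMap: collection.Map[String, Double], str: String): Double =
  srcMap.getOrElse(str, 0.0)
val res = myFind(fastLibrary, "qwe")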

max() with struct() in Spark Dataset

I have something like the below in Spark, where I'm grouping and then trying to find the entry with the highest value from my struct.
test.map(x => tester(x._1, x._2, x._3, x._4, x._5))
.toDS
.select($"ac", $"sk", struct($"num1", struct($"time", $"num1")).as("grp"))
.groupBy($"ac", $"sk")
.agg(max($"grp")).show(false)
I am not sure how the max function decides which value is the max. The reason I used a nested struct is that it seemed to make the max function use num1 instead of the next numbers when everything was in the same struct.
The StructTypes are compared lexicographically, field by field from left to right, and all fields have to be recursively orderable. So in your case:
It will compare the first element of the struct.
If the elements are not equal, it will return the struct with the higher value.
Otherwise it will proceed to the second element.
Since the second field is a struct as well, it will repeat the procedure from the first point, this time comparing the time fields first.
Note that the nested num1 is only evaluated if the top-level num1 fields are equal, therefore it doesn't affect the ordering in practice.
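A minimal sketch of that ordering, assuming a SparkSession named spark with implicits in scope (data and column names are illustrative, using a flat struct since the nested num1 has no effect):
import org.apache.spark.sql.functions.{max, struct}
import spark.implicits._

val df = Seq(("a", 1, 10L), ("a", 2, 5L), ("a", 2, 7L)).toDF("ac", "num1", "time")
df.groupBy($"ac")
  .agg(max(struct($"num1", $"time")).as("best"))
  .show(false)
// num1 = 2 beats num1 = 1; between the two rows with num1 = 2, time = 7 wins the tie-break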

Functional Opposite of Subtract by Key

I have two RDDs of the form RDD1[K, V1] and RDD2[K, V2]. I was hoping to remove the values in RDD2 whose keys are not in RDD1. (Essentially an inner join on the RDDs' keys, but I don't want to copy over RDD1's values.)
I understand that there's a method subtractByKey which performs the opposite of this. (It keeps the pairs whose keys do not appear in the other RDD.)
You cannot avoid having some type of value here, so applying join and mapping values seems to be the way to go. You can use:
rdd2.join(rdd1.mapValues(_ => None)).mapValues(_._1)
which first replaces RDD1's values with dummies (usually you can skip that, because there is not much to gain here unless the values are largish):
_.mapValues(_ => None)
then joins, and finally drops the placeholders:
_.mapValues(_._1)
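Put together, a minimal sketch (assuming a SparkContext sc and toy data):
val rdd1 = sc.parallelize(Seq(("a", 1), ("b", 2)))
val rdd2 = sc.parallelize(Seq(("a", "x"), ("c", "y")))

// keep only the rdd2 pairs whose key also appears in rdd1, without copying rdd1's values
val kept = rdd2.join(rdd1.mapValues(_ => None)).mapValues(_._1)
// kept contains ("a", "x")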

Efficiently take one value for each key out of a RDD[(key,value)]

My starting point is an RDD[(key, value)] in Scala using Apache Spark. The RDD contains roughly 15 million tuples. Each key has roughly 50 ± 20 values.
Now I'd like to take one value (it doesn't matter which one) for each key. My current approach is the following:
HashPartition the RDD by the key. (There is no significant skew.)
Group the tuples by key, resulting in RDD[(key, array of values)].
Take the first of each value array.
Basically looks like this:
...
candidates
.groupByKey()
.map(c => (c._1, c._2.head))
...
The grouping is the expensive part. It is still fast because there is no network shuffle and candidates is in memory, but can I do it faster?
My idea was to work on the partitions directly, but I'm not sure what I get out of the HashPartition. If I take the first tuple of each partition, I will get every key but maybe multiple tuples for a single key depending on the number of partitions? Or will I miss keys?
Thank you!
How about reduceByKey with a function that returns the first argument? Like this:
candidates.reduceByKey((x, _) => x)
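This gives the same per-key result as the groupByKey version, but because reduceByKey combines values as it goes, it never has to materialise the full per-key collection, so it is typically cheaper even when candidates is already partitioned and in memory.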