Is there any way to speed up Map.getOrElse(val, 0) on large maps? - scala

I have a simple immutable Map in Scala:
// "..." just means "and so on"
val myLibrary = Map("qwe" -> 1.2, "qasd" -> -0.59, ...)
For that map I call a MyFind method, which calls getOrElse(str, 0):
def MyFind(srcMap: Map[String, Double], str: String): Double =
  srcMap.getOrElse(str, 0.0)
val res = MyFind(myLibrary, "qwe")
The problem is that this method is called many times, for different input strings. As I understand it, for a map of 100 entries and one input string it will compare that string 100 times (once per map entry), and for 10,000 entries that becomes 10,000 comparisons.
Because of that, with a huge map of over 10,000 entries, my method that looks up string keys in the map slows the work down significantly.
What can you advise to speed up this code?
Maybe use another type of Map?
Maybe another collection?

Map does not have linear-time lookup. The default concrete implementation of Map is HashMap: Map is the interface for immutable maps, while scala.collection.immutable.HashMap is a concrete implementation, which has effectively constant lookup time, as per the collections performance characteristics:
                 lookup  add  remove  min
HashSet/HashMap    eC    eC     eC     L
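For illustration, a quick REPL check of what the default factory actually builds (a sketch; maps of up to four entries use the specialized Map1..Map4 classes, and the exact runtime class name varies by Scala version):
scala> val m = Map("a" -> 1, "b" -> 2, "c" -> 3, "d" -> 4, "e" -> 5)
scala> m.getClass.getName   // a scala.collection.immutable.HashMap implementation class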

As I understand it, for a map of 100 entries and one input string it will compare that string 100 times (once per map entry), and for 10,000 entries that becomes 10,000 comparisons.
No, it won't. That's rather the point of Map in the first place. While it allows implementations that do require checking each value one by one (such as ListMap), those are very rarely used, and by default, when calling Map(...), you'll get a HashMap, which doesn't. Its lookup is logarithmic time (with a large base), so going from 100 to 10,000 entries roughly doubles the lookup cost instead of increasing it 100 times.
Because of that, with a huge map of over 10,000 entries, my method that looks up string keys in the map slows the work down significantly.
10,000 entries is quite small.
Actually, look at http://www.lihaoyi.com/post/BenchmarkingScalaCollections.html#performance. You can also see there that mutable maps are much faster. Note that this benchmark predates the collection changes in Scala 2.13, so the numbers may have changed.
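If the lookups really are the bottleneck, that benchmark suggests switching to a mutable map; here is a minimal sketch, assuming the table is built once and then only read (fastLibrary and myFind are illustrative names, not from the question):
import scala.collection.mutable

// Copy the immutable map into a mutable HashMap once, then reuse it for all lookups.
val fastLibrary: mutable.HashMap[String, Double] =
  mutable.HashMap.empty[String, Double] ++= myLibrary

def myFind(srcMap: collection.Map[String, Double], str: String): Double =
  srcMap.getOrElse(str, 0.0)

val res = myFind(fastLibrary, "qwe")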

Related

Multiple types in a list?

Rephrasing of my question:
I am writing a program that implements a data mining algorithm. In this program I want to store the input data that is supposed to be mined. Imagine the input data as a table with rows and columns. Each row is going to be represented by an instance of my Scala class (the one in question). The columns of the input data can be of different types (Integer, Double, String, whatnot), and which types appear changes depending on the input data. I need a way to store a row inside my Scala class instance. Thus I need an ordered collection (like a special List) that can hold (many) different types as elements, where the types are only determined at runtime. How can I do this? A Vector or a List requires that all elements be of the same type. A Tuple can hold different types (which can be determined at runtime if I am not mistaken), but only up to 22 elements, which is too few.
Bonus (not sure if I am asking too much now):
I would also like the rows' columns to be named and accessible by name. However, I think this problem can easily be solved by using two lists. (Although I just read about this issue somewhere - I forgot where - and think it was solved more elegantly.)
It might be good to have my collection to be random access (so "Vector" rather than "List").
Having linear algebra (matrix multiplication etc.) capabilities would be nice.
Even more of a bonus: if I could also store matrices.
Old phrasing of my question:
I would like to have something like a data.frame as we know it from R in Scala, but I am only going to need one row. This row is going to be a member of a class. The reason for this construct is that I want methods related to each row to be close to the data itself. Each data row is also supposed to have metadata about itself, and it will be possible to supply functions so that different rows are manipulated differently. However, I need to store rows somehow within the class. A List or Vector comes to mind, but those only allow elements of a single type (all Integer, all String, etc.), whereas, as we know from data.frame, different columns (here: elements of the Vector or List) can be of different types. I would also like to store the name of each column, to be able to access the row values by column name. That seems the smallest issue, though. I hope it is clear what I mean. How can I implement this?
DataFrames in R are heterogeneous lists of homogeneous column vectors:
> df <- data.frame(c1=c(r1=1,r2=2), c2=c('a', 'b')); df
   c1 c2
r1  1  a
r2  2  b
You could think of each row as a heterogeneous list of scalar values:
> as.list(df['r1',])
$c1
[1] 1
$c2
[1] a
An analogous implementation in Scala would be a tuple of lists:
scala> val df = (List(1, 2), List('a', 'b'))
df: (List[Int], List[Char]) = (List(1, 2),List(a, b))
Each row could then just be a tuple:
scala> val r1 = (1, 'a')
r1: (Int, Char) = (1,a)
If you want to name all your variables, another possibility is a case class:
scala> case class Row (col1:Int, col2:Char)
defined class Row
scala> val r1 = Row(col1=1, col2='a')
r1: Row = Row(1,a)
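The named columns can then be read back by field name, e.g. (REPL output shown for illustration):
scala> r1.col1
res0: Int = 1
scala> r1.col2
res1: Char = a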
Hope that helps bridge the R-to-Scala divide.

Why is there no `reduceByValue` in Spark collections?

I am learning Spark and Scala and keep coming across this pattern:
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
While I understand what it does, I don't understand why it is used instead of having something like:
val lines = sc.textFile("data.txt")
val counts = lines.reduceByValue((v1, v2) => v1 + v2)
Given that Spark is designed to process large amounts of data efficiently, it seems counterintuitive to always have to perform an additional step of converting the records into key/value pairs and then reducing by key, instead of simply being able to reduce by value?
First, this "additional step" doesn't really cost much (see more details at the end) - it doesn't shuffle the data, and it is performed together with other transformations: transformations can be "pipelined" as long as they don't change the partitioning.
Second - the API you suggest seems very specific for counting - although you suggest reduceByValue will take a binary operator f: (Int, Int) => Int, your suggested API assumes each value is mapped to the value 1 before applying this operator for all identical values - an assumption that is hardly useful in any scenario other than counting. Adding such specific APIs would just bloat the interface and is never going to cover all use cases anyway (what's next - RDD.wordCount?), so it's better to give users minimal building blocks (along with good documentation).
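For illustration only, such a helper is easy to build yourself from those building blocks; the name reduceByValue and this wrapper are assumptions for the sketch, not part of Spark's API:
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical extension (not part of Spark): composes map and reduceByKey.
object RDDExtensions {
  implicit class ReduceByValueOps[T: ClassTag](rdd: RDD[T]) {
    def reduceByValue(f: (Int, Int) => Int): RDD[(T, Int)] =
      rdd.map(v => (v, 1)).reduceByKey(f)   // each record becomes a (value, 1) pair first
  }
}

// usage, assuming lines: RDD[String] as in the question:
//   import RDDExtensions._
//   val counts = lines.reduceByValue(_ + _)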
Lastly - if you're not happy with such low-level APIs, you can use Spark SQL's DataFrame API to get some higher-level APIs that will hide these details - that's one of the reasons DataFrames exist:
val linesDF = sc.textFile("file.txt").toDF("line")
val wordsDF = linesDF.explode("line","word")((line: String) => line.split(" "))
val wordCountDF = wordsDF.groupBy("word").count()
EDIT: as requested - some more details about why the performance impact of this map operation is either small or entirely negligible:
First, I'm assuming you are interested in producing the same result as the map -> reduceByKey code would produce (i.e. word count), which means somewhere the mapping from each record to the value 1 must take place, otherwise there's nothing to perform the summing function (v1, v2) => v1 + v2 on (that function takes Ints, they must be created somewhere).
To my understanding, you're just wondering why this has to happen as a separate map operation. So what we're actually interested in is the overhead of adding another map operation.
Consider these two functionally-identical Spark transformations:
val rdd: RDD[String] = ???
/*(1)*/ rdd.map(s => s.length * 2).collect()
/*(2)*/ rdd.map(s => s.length).map(_ * 2).collect()
Q: Which one is faster?
A: They perform the same
Why? Because as long as two consecutive transformations on an RDD do not change the partitioning (and that's the case in your original example too), Spark will group them together, and perform them within the same task. So, per record, the difference between these two will come down to the difference between:
/*(1)*/ s.length * 2
/*(2)*/ val r1 = s.length; r1 * 2
Which is negligible, especially when you're discussing distributed execution on large datasets, where execution time is dominated by things like shuffling, de/serialization and IO.

Efficiently take one value for each key out of a RDD[(key,value)]

My starting point is an RDD[(key, value)] in Scala using Apache Spark. The RDD contains roughly 15 million tuples. Each key has roughly 50 ± 20 values.
Now I'd like to take one value (doesn't matter which one) for each key. My current approach is the following:
1. HashPartition the RDD by the key. (There is no significant skew.)
2. Group the tuples by key, resulting in RDD[(key, array of values)].
3. Take the first element of each value array.
Basically it looks like this:
...
candidates
.groupByKey()
.map(c => (c._1, c._2.head))
...
The grouping is the expensive part. It is still fast because there is no network shuffle and candidates is in memory, but can I do it faster?
My idea was to work on the partitions directly, but I'm not sure what I get out of the HashPartition. If I take the first tuple of each partition, I will get every key but maybe multiple tuples for a single key depending on the number of partitions? Or will I miss keys?
Thank you!
How about reduceByKey with a function that returns the first argument? Like this:
candidates.reduceByKey((x, _) => x)
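Dropped into the pipeline from the question, this is just (a sketch, with candidates assumed to be the in-memory RDD[(key, value)] described above):
candidates
  .reduceByKey((x, _) => x)   // keeps one arbitrary value per key
Unlike groupByKey, this never materializes the per-key collection of values, and since candidates is already hash-partitioned by the key it doesn't shuffle either, so all the work stays within each partition.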

Getting the number of values within a reduceByKey RDD

When the reduceByKey operation is called, it receives the list of values for a particular key. My questions are:
1. Is the list of values it receives in sorted order?
2. Is it possible to know how many values it receives?
3. I'm trying to calculate the first quartile of the list of values for a key within reduceByKey. Is this possible to do within reduceByKey?
1. No, that would go against the whole point of a reduce operation - i.e. to parallelize an operation into an arbitrary tree of sub-operations by taking advantage of associativity and commutativity.
2. You'll need to define a new monoid by composing the integer monoid with whatever it is you're doing. Let's assume your operation is op; then
// `op` below stands in for your original reduce operation on the values
yourRdd.map(kv => (kv._1, (kv._2, 1)))
       .reduceByKey((left, right) => (left._1 op right._1, left._2 + right._2))
will give you an RDD[(KeyType, (ReducedValueType, Int))] where the Int will be the number of values the reduce received for each key.
3. You'll have to be more specific about what you mean by first quartile. Given that the answer to 1 is no, you would need a bound that defines the first quartile; then the data doesn't need to be sorted, because you can simply filter the values against that bound.
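A minimal sketch of that filtering idea, assuming (purely for illustration) a String key, a Double value, and an externally determined bound q1Bound below which a value counts as first-quartile:
import org.apache.spark.rdd.RDD

val yourRdd: RDD[(String, Double)] = ???   // assumed shape of the input
val q1Bound: Double = ???                  // assumed, externally determined bound

// Count, per key, how many values fall at or below the bound - no sorting needed.
val belowBoundCounts: RDD[(String, Int)] =
  yourRdd
    .filter { case (_, v) => v <= q1Bound }
    .map    { case (k, _) => (k, 1) }
    .reduceByKey(_ + _)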

Overriding Ordering[Int] in Scala

I'm trying to sort an array of integers with a custom ordering.
E.g.
quickSort[Int](indices)(Ordering.by[Int, Double](value(_)))
Basically, I'm trying to sort indices of rows by the values of a particular column. I end up with a StackOverflowError when I run this on fairly large data. If I use a more direct approach (e.g. sorting tuples), this is not a problem.
Is there a problem if you try to extend the default Ordering[Int]?
You can reproduce this like this:
val indices = (0 to 99999).toArray
val values = Array.fill[Double](100000)(math.random)
scala.util.Sorting.quickSort[Int](indices)(Ordering.by[Int, Double](values(_))) // Works
val values2 = Array.fill[Double](100000)(0.0)
scala.util.Sorting.quickSort[Int](indices)(Ordering.by[Int, Double](values2(_))) // Fails
Update:
I think I found out what the problem is (I am answering my own question). It seems that I've created a paradoxical situation by changing the ordering definition for integers.
Within the quickSort algorithm itself, array positions are also integers, and there are certain statements comparing positions of arrays. This position comparison should be following the standard integer ordering.
But because of the new definition, now these position comparators are also following the indexed value comparator and things are getting really messed up.
I suppose that, at least for the time being, I shouldn't change the default ordering of a value type, as libraries might depend on it.
Update 2:
It turns out that the above is in fact not the problem; there is a bug in quickSort when it is used together with an Ordering. When a custom Ordering is supplied, equality should be checked with its equiv method, but quickSort uses ==. This results in the indices themselves being compared, rather than the values they index.
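For reference, a minimal sketch of the "more direct approach" mentioned in the question (sorting (value, index) tuples rather than overriding Ordering[Int]), which sidesteps the quickSort/Ordering issue entirely; it reuses indices and values2 from the example above:
// Sort the indices by the values they point at, without redefining Ordering[Int].
val sortedIndices: Array[Int] =
  indices
    .map(i => (values2(i), i))   // pair each index with its value
    .sortBy(_._1)                // sort the pairs by value (default Double ordering)
    .map(_._2)                   // keep only the reordered indices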