Multiple types in a list? - scala

Rephrasing of my questions:
I am writing a program that implements a data mining algorithm. In this program I want to store the input data that is to be mined. Imagine the input data as a table with rows and columns. Each row will be represented by an instance of my Scala class (the one in question). The columns of the input data can be of different types (Integer, Double, String, whatnot), and which types occur depends on the input data. I need a way to store a row inside my Scala class instance. Thus I need an ordered collection (like a special List) that can hold (many) different types as elements, and it must be possible for the types to be determined only at runtime. How can I do this? A Vector or a List requires that all elements be of the same type. A Tuple can hold different types (which can be determined at runtime, if I am not mistaken), but only up to 22 elements, which is too few.
Bonus (not sure if I am asking too much now):
I would also like the rows' columns to be named and accessible by name. However, I think this problem can easily be solved by using two lists. (Although I just read about this issue somewhere - but I forgot where - and think it was solved more elegantly.)
It might be good to have my collection to be random access (so "Vector" rather than "List").
Having linear algebra (matrix multiplication etc.) capabilities would be nice.
Even more bonus: If I could save matrices.
Old phrasing of my question:
I would like to have something like a data.frame as we know it from R in Scala, but I am only going to need one row. This row is going to be a member of a class. The reason for this construct is that I want the methods related to each row to be close to the data itself. Each data row is also supposed to carry metadata about itself, and it will be possible to supply functions so that different rows are manipulated differently. However, I need to store the rows somehow within the class. A List or Vector comes to mind, but they require all elements to be of the same type (all Integer, all String, etc.) - whereas, as we know from data.frame, different columns (here: elements of the Vector or List) can be of different types. I would also like to save the name of each column to be able to access the row values by column name. That seems the smallest issue, though. I hope it is clear what I mean. How can I implement this?

DataFrames in R are heterogeneous lists of homogeneous column vectors:
> df <- data.frame(c1=c(r1=1,r2=2), c2=c('a', 'b')); df
c1 c2
r1 1 a
r2 2 b
You could think of each row as a heterogeneous list of scalar values:
> as.list(df['r1',])
$c1
[1] 1
$c2
[1] a
An analogous implementation in Scala would be a tuple of lists:
scala> val df = (List(1, 2), List('a', 'b'))
df: (List[Int], List[Char]) = (List(1, 2),List(a, b))
Each row could then just be a tuple:
scala> val r1 = (1, 'a')
r1: (Int, Char) = (1,a)
If you want to name all your variables, another possibility is a case class:
scala> case class Row (col1:Int, col2:Char)
defined class Row
scala> val r1 = Row(col1=1, col2='a')
r1: Row = Row(1,a)
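Since the question also asks for column types that are only known at runtime and for access by column name, here is a minimal sketch of that variant: cells are stored as Any and their types recovered by pattern matching. The names Row, colNames, and cells are illustrative, not a standard API.

```scala
// A row whose column types are only known at runtime: store cells as Any.
// Type-unsafe by design; pattern matching recovers the concrete types.
case class Row(colNames: Vector[String], cells: Vector[Any]) {
  // Named access (the "bonus" part of the question).
  def apply(name: String): Any = cells(colNames.indexOf(name))
}

val r = Row(Vector("id", "score", "label"), Vector(1, 2.5, "a"))

// Recover the runtime type of each cell by pattern matching.
val described = r.cells.map {
  case i: Int    => s"Int($i)"
  case d: Double => s"Double($d)"
  case s: String => s"String($s)"
  case other     => s"?($other)"
}
// r("score") is 2.5; described is Vector("Int(1)", "Double(2.5)", "String(a)")
```

This trades compile-time type safety for runtime flexibility, which is essentially what a data.frame row is.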
Hope that helps bridge the R-to-Scala divide.

Related

Merge objects in mutable list on Scala

I have recently started looking at Scala code and I am trying to understand how to go about a problem.
I have a mutable list of objects; these objects have an id: String and values: List[Int]. Because of the way I get the data, more than one object can have the same id. I am trying to merge the items in the list, so if, for example, I have 3 objects with id 123 and whatever values, I end up with just one object with that id and the values of the 3 combined.
I could do this the java way, iterating, and so on, but I was wondering if there is an easier Scala-specific way of going about this?
The first thing to do is avoid using mutable data and think about transforming one immutable object into another. So rather than mutating the contents of one collection, think about creating a new collection from the old one.
Once you have done that it is actually very straightforward because this is the sort of thing that is directly supported by the Scala library.
case class Data(id: String, values: List[Int])
val list: List[Data] = ???
val result: Map[String, List[Int]] =
list.groupMapReduce(_.id)(_.values)(_ ++ _)
The groupMapReduce call breaks down into three parts:
The first part groups the data by the id field and makes that the key. This gives a Map[String, List[Data]]
The second part extracts the values field and makes that the data, so the result is now Map[String, List[List[Int]]]
The third part combines all the values fields into a single list, giving the final result Map[String, List[Int]]
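The three steps above can be seen end to end on a small sample (this requires Scala 2.13 or later, where groupMapReduce was introduced; the sample data is made up):

```scala
case class Data(id: String, values: List[Int])

val list = List(
  Data("123", List(1, 2)),
  Data("456", List(3)),
  Data("123", List(4, 5))
)

// group by id, map each Data to its values, reduce groups by concatenation
val result: Map[String, List[Int]] =
  list.groupMapReduce(_.id)(_.values)(_ ++ _)
// result("123") is List(1, 2, 4, 5); result("456") is List(3)
```

Note that within each group the original encounter order of the values is preserved.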

Is any ways to speedup work of Map.getOrElse(val, 0) on big tuple maps?

I have a simple immutable Map in Scala:
// "..." means "and so on"
val myLibrary = Map("qwe" -> 1.2, "qasd" -> -0.59, ...)
And on that map I call a MyFind method which calls getOrElse(str, 0.0) (the map values are Double, so the signature uses Double):
def MyFind(srcMap: Map[String, Double], str: String): Double = {
  srcMap.getOrElse(str, 0.0)
}
val res = MyFind(myLibrary, "qwe")
The problem is that this method is called several times for different input strings. E.g., as I suppose, for a map of length 100 and 1 input string it will try to compare that string 100 times (once per map entry). As you can guess, for 10,000 entries that becomes 10,000 comparisons.
Because of that, with a huge map of over 10,000 entries, my method that finds the values of string keys in the map significantly slows down the work.
What can you advise to speed up this code?
Maybe use another type of Map?
Maybe another collection?
Map does not have linear-time lookup. The default concrete implementation of Map is HashMap:
Map is the interface for immutable maps,
while scala.collection.immutable.HashMap is a concrete implementation,
which has effectively constant lookup time, as per the collections performance characteristics:
                 lookup  add  remove  min
HashSet/HashMap    eC    eC     eC     L
E.g., as I suppose, for a map of length 100 and 1 input string it will try to compare that string 100 times (once per map entry). As you can guess, for 10,000 entries that becomes 10,000 comparisons.
No, it won't. That's rather the point of Map in the first place. While it allows implementations that do require checking each value one by one (such as ListMap), they are very rarely used, and by default, calling Map(...) gives you a HashMap, which doesn't. Its lookup is logarithmic time (with a large base), so going from 100 to 10,000 entries roughly doubles the lookup cost instead of increasing it 100 times.
Because of that, with a huge map of over 10,000 entries, my method that finds the values of string keys in the map significantly slows down the work.
10,000 is quite small.
Actually, look at http://www.lihaoyi.com/post/BenchmarkingScalaCollections.html#performance. You can also see that mutable maps are much faster. Note that this predates collection changes in Scala 2.13, so may have changed.
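If lookup speed really matters, a mutable HashMap is a near drop-in replacement here. A minimal sketch (the function name myFind mirrors the question; accepting collection.Map lets it take either flavor):

```scala
import scala.collection.mutable

// Same data as in the question, but in a mutable HashMap,
// which the linked benchmarks suggest is faster for lookup-heavy work.
val myLibrary = mutable.HashMap("qwe" -> 1.2, "qasd" -> -0.59)

// collection.Map is the common supertype of immutable and mutable maps.
def myFind(srcMap: collection.Map[String, Double], str: String): Double =
  srcMap.getOrElse(str, 0.0)

val hit  = myFind(myLibrary, "qwe")  // 1.2
val miss = myFind(myLibrary, "zzz")  // 0.0, the default
```

Either way, each getOrElse is a single hash lookup, not a scan over all entries.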

Spark Cassandra connector - using IN for filtering with dynamic data

Let's assume that I have an RDD with items of type
case class Foo(name: String, nums: Seq[Int])
and a table my_schema.foo in Cassandra with a partitioning key composed of name and num columns
Now, I'd like to fetch for each element in the input RDD all corresponding rows, i.e. something like:
SELECT * from my_schema.foo where name = :name and num IN :nums
I've tried the following approaches:
use the joinWithCassandraTable extension: rdd.joinWithCassandraTable("my_schema", "foo").on(SomeColumns("name")), but I don't know how I could specify the IN constraint
for each element of the input RDD, issue a separate query (within a map function). This does not work, as the SparkContext is not serializable and cannot be passed into the map
flatMap the input RDD to generate a separate item (name, num) for each num in nums. This will work, but it will probably be far less efficient than using an IN clause.
What would be a proper way of solving this problem?
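For illustration, the flatMap expansion from the third approach looks like this on plain collections (in Spark the same shape applies to the RDD, after which joinWithCassandraTable can join on the full partition key; Foo and the sample data are from the question):

```scala
case class Foo(name: String, nums: Seq[Int])

val foos = Seq(Foo("a", Seq(1, 2)), Foo("b", Seq(3)))

// Expand each Foo into one (name, num) pair per num, so each pair
// addresses exactly one partition of the Cassandra table.
val pairs: Seq[(String, Int)] =
  foos.flatMap(f => f.nums.map(num => (f.name, num)))
// pairs: Seq(("a", 1), ("a", 2), ("b", 3))
```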

Overriding Ordering[Int] in Scala

I'm trying to sort an array of integers with a custom ordering.
E.g.
quickSort[Int](indices)(Ordering.by[Int, Double](value(_)))
Basically, I'm trying to sort indices of rows by the values of a particular column. I end up with a StackOverflowError when I run this on fairly large data. If I use a more direct approach (e.g. sorting tuples), this is not a problem.
Is there a problem if you try to extend the default Ordering[Int]?
You can reproduce this like this:
val indices = (0 to 99999).toArray
val values = Array.fill[Double](100000)(math.random)
scala.util.Sorting.quickSort[Int](indices)(Ordering.by[Int, Double](values(_))) // Works
val values2 = Array.fill[Double](100000)(0.0)
scala.util.Sorting.quickSort[Int](indices)(Ordering.by[Int, Double](values2(_))) // Fails
Update:
I think I have found out what the problem is (and am answering my own question). It seems that I created a paradoxical situation by changing the ordering definition of integers.
Within the quickSort algorithm itself, array positions are also integers, and there are statements comparing positions within arrays. This position comparison should follow the standard integer ordering.
But because of the new definition, these position comparisons now also follow the indexed-value comparator, and things get really messed up.
I suppose that, at least for the time being, I shouldn't change the default ordering of a value type, as library code may depend on it.
Update2
It turns out that the above is in fact not the problem: there is a bug in quickSort when used together with an Ordering. The equality check an Ordering defines is equiv, but quickSort uses ==. This results in the indices being compared rather than the indexed values.
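A workaround sketch that sidesteps quickSort entirely: sortBy on the array builds and applies the Ordering internally, so the sample data from above sorts as expected (data here is shortened for readability).

```scala
// Sort indices by the value each index points to, without quickSort.
val values  = Array(0.3, 0.1, 0.2)
val indices = values.indices.toArray  // Array(0, 1, 2)

val sorted = indices.sortBy(values(_))
// sorted: Array(1, 2, 0) -- indices ordered by their value (0.1, 0.2, 0.3)
```

sortBy uses a stable merge sort under the hood, so equal values (as in the values2 case that failed) are also handled without issue.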

In which order will the elements of Scala Buffer be accessed in a for loop?

for(elt <- bufferObject: scala.collection.mutable.Buffer)
// Do something with the element of the collection
In which order will the elements in the for loop be accessed? Randomly?
From the Scala API one can see that Buffer is a subclass of Seq, in which the elements are ordered. Does this also hold for the loop above?
Here's a selection of the super-types of mutable.Buffer[A], and the traversal guarantees they provide:
Seq[A] -- All elements have a position, with an index associated; they are always traversed one by one, from lowest index to highest index.
GenSeq[A] -- All elements have a position, with an index associated; they may be traversed one by one or in parallel; if a new collection is generated, the position of the elements in the new collection will correspond to the old collection, even if the traversal is in parallel.
Iterable[A] -- Elements may be traversed in any order, but will always be traversed in the same order (that is, it can't change from one iteration to another); they'll be traversed one by one.
GenIterable[A] -- Elements may be traversed in any order, but will always be traversed in the same order (that is, it can't change from one iteration to another); traversal may happen one by one or in parallel; if a new collection is generated, the position of the elements in the new collection will correspond to the old collection, even if the traversal is in parallel.
Traversable[A] -- Same guarantees as Iterable[A], with the additional limitation that you can interrupt traversal, but you cannot determine when the next element will be chosen (which you can in Iterable[A] and descendants, by producing an Iterator).
GenTraversable[A] -- Same guarantees as GenIterable[A], with the additional limitation that you can interrupt traversal, but you cannot determine when the next element will be chosen (which you can in GenIterable[A] and descendants, by producing an Iterator).
TraversableOnce[A] -- Same guarantees as in Traversable[A], with the additional limitation that you might not be able to traverse the elements more than once.
GenTraversableOnce[A] -- Same guarantees as in GenTraversable[A], with the additional limitation that you might not be able to traverse the elements more than once.
Now, all of these guarantees apply at once, with the fewest limitations winning, which effectively means that everything said about Seq[A] holds true for mutable.Buffer[A].
Now, to the for loop:
for(elt <- bufferObject: scala.collection.mutable.Buffer)
doSomething(elt)
is the same thing as:
bufferObject.foreach(elt => doSomething(elt))
And
for(elt <- bufferObject: scala.collection.mutable.Buffer)
yield makeSomething(elt)
is the same thing as:
bufferObject.map(elt => makeSomething(elt))
In fact, all for variants will be translated into methods available on Buffer (or whatever other type you have there), so the guarantees provided by the collections all apply. Note, for instance, that a GenSeq used with map (for-yield) may process all elements in parallel, but will produce a collection where newCollection(i) == makeSomething(bufferObject(i)), that is, the indices are preserved.
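The desugaring equivalences above can be checked directly; both forms visit the Buffer in index order and produce identically ordered results:

```scala
import scala.collection.mutable.Buffer

val bufferObject = Buffer(1, 2, 3)

// for/yield desugars to map, so these are the same computation.
val viaFor = for (elt <- bufferObject) yield elt * 10
val viaMap = bufferObject.map(elt => elt * 10)
// both: Buffer(10, 20, 30), in the original index order
```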
Yes, the for-comprehension will desugar to some combination of map, flatMap, and foreach, and these all follow the Seq's defined order.
Unless you're using parallel collections (via par method), the order of operations (like for comprehension, map, foreach and other sequential methods) in mutable Buffer is guaranteed.
Should be ordered. Buffer defaults to ArrayBuffer I believe.
scala> import scala.collection.mutable.Buffer
import scala.collection.mutable.Buffer
scala> val x = Buffer(1, 2, 3, 4, 5)
x: scala.collection.mutable.Buffer[Int] = ArrayBuffer(1, 2, 3, 4, 5)
scala> for (y <- x) println(y)
1
2
3
4
5