If I am returning a Seq[T] from a function, and there is a chance that it may be empty, is it still a Seq or will it error?
In other words, do I need to wrap it in an Option or is that overkill?
It's generally overkill although it may convey some information depending on the context. Suppose you have a huge database of people, where some data could be missing. You could write queries like:
def getChildren( p: Person ): Seq[Person]
But if it returns an empty sequence, you cannot tell whether the data is missing or whether the data is available but there simply are no children. In contrast, with the definition:
def getChildren( p: Person ): Option[Seq[Person]]
You will obtain None when the data is missing and Some(s), where s is an empty sequence, when there are no children.
Seq is like a monoid: it has a zero form, the empty sequence, so returning an empty Seq by itself is perfectly fine.
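As a small sketch of what the caller gains with the second signature (Person and getChildren here are hypothetical stand-ins for the definitions above):
case class Person(name: String)

// Hypothetical lookup: None = data missing, Some(Nil) = known to have no children
def getChildren(p: Person): Option[Seq[Person]] = ???

def describe(p: Person): String = getChildren(p) match {
  case None           => "child data is missing"
  case Some(Seq())    => "known to have no children"
  case Some(children) => s"has ${children.size} children"
}
With the plain Seq[Person] signature, the first two cases would be indistinguishable.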
I have recently started looking at Scala code and I am trying to understand how to go about a problem.
I have a mutable list of objects; these objects have an id: String and values: List[Int]. The way I get the data, more than one object can have the same id. I am trying to merge the items in the list, so that if, for example, I have 3 objects with id 123 and whatever values, I end up with just one object with that id and the values of the 3 combined.
I could do this the Java way, iterating and so on, but I was wondering if there is an easier, Scala-specific way of going about this?
The first thing to do is avoid using mutable data and think about transforming one immutable object into another. So rather than mutating the contents of one collection, think about creating a new collection from the old one.
Once you have done that it is actually very straightforward because this is the sort of thing that is directly supported by the Scala library.
case class Data(id: String, values: List[Int])
val list: List[Data] = ???
val result: Map[String, List[Int]] =
list.groupMapReduce(_.id)(_.values)(_ ++ _)
The groupMapReduce call breaks down into three parts:
The first part groups the data by the id field and makes that the key. This gives a Map[String, List[Data]]
The second part extracts the values field and makes that the data, so the result is now Map[String, List[List[Int]]]
The third part combines all the values fields into a single list, giving the final result Map[String, List[Int]]
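For example, with some hypothetical input (note that groupMapReduce is available from Scala 2.13 onwards; on older versions you can get the same result with groupBy followed by mapValues):
val list = List(
  Data("123", List(1, 2)),
  Data("123", List(3)),
  Data("456", List(4)),
  Data("123", List(5, 6))
)

val result = list.groupMapReduce(_.id)(_.values)(_ ++ _)
// Map(123 -> List(1, 2, 3, 5, 6), 456 -> List(4))

// If you need the merged rows back as objects:
val merged = result.map { case (id, vs) => Data(id, vs) }.toList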
I'm working with Datasets and trying to group by and then use map.
I can manage to do it with RDDs, but with a Dataset, after the group by I don't have the option to use map.
Is there a way I can do it?
You can apply groupByKey:
def groupByKey[K](func: (T) ⇒ K)(implicit arg0: Encoder[K]): KeyValueGroupedDataset[K, T]
(Scala-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.
which returns KeyValueGroupedDataset and then mapGroups:
def mapGroups[U](f: (K, Iterator[V]) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
(Scala-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an element of arbitrary type which will be returned as a new Dataset.
This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to use the reduce function or an org.apache.spark.sql.expressions#Aggregator.
Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling toList) unless they are sure that this is possible given the memory constraints of their cluster.
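A minimal sketch of how the two calls fit together (the Event case class and the sum aggregation are made up for illustration):
import org.apache.spark.sql.SparkSession

case class Event(id: String, value: Long)  // hypothetical schema

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val ds = Seq(Event("a", 1L), Event("a", 2L), Event("b", 3L)).toDS()

// Group by the id field, then fold each group's iterator into a single row
val summed = ds
  .groupByKey(_.id)
  .mapGroups { (id, events) => (id, events.map(_.value).sum) }
// summed: Dataset[(String, Long)]
As the quoted documentation warns, this shuffles the whole Dataset; for a simple aggregation like this sum, reduceGroups or an Aggregator would be cheaper.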
Suppose I have an RDD of the following type:
RDD[(Long, List[Int])]
Can I assume that the entire list is located on the same worker? I want to know whether certain operations are acceptable at the RDD level or should be calculated at the driver. For instance:
val data: RDD[(Long, List[Int])] = someFunction() // creates a list for each timeslot
Please note that the List may be the result of an aggregation or any other operation, and is not necessarily created as one piece.
val diffFromMax = data.map(item => (item._1, findDiffFromMax(item._2)))
def findDiffFromMax(data: List[Int]): List[Int] = {
  val maxItem = data.max
  data.map(item => maxItem - item)
}
The thing is that if the List is distributed, calculating maxItem may cause a lot of network traffic. This could be handled with an RDD of the following type:
RDD[(Long, Int /* max item */, List[Int])]
where the max item is calculated at the driver.
So the questions (actually 2) are:
At what point can I assume that a piece of RDD data is located on one worker, if ever? (Answers with references to the docs or personal evaluations would be great.) And what happens in the case of a tuple inside a tuple, e.g. ((Long, Int), Double)?
What is the common practice for designing algorithms with tuples? Should I always treat the data as if it may appear on different workers? Should I always break it down to the minimal granularity in the first tuple field? For a case where there is data (Double) for a user (String) in a timeslot (Long), should the data be (Long, (String, Double)), ((Long, String), Double), or maybe (String, (Long, Double))? Or maybe this is not optimal and matrices are better?
The short answer is yes, your list would be located on a single worker.
Your tuple is a single record in the RDD. A single record is ALWAYS on a single partition (which would be on a single worker).
When you do your findDiffFromMax you are running it on the target worker (so the function is serialized to all the workers to run).
The thing you should note is that when you generate a tuple of (k, v), in general this means a key/value pair, so you can do key-based operations on the RDD. The order ((Long, (String, Double)) vs. ((Long, String), Double) or any other way) doesn't really matter, as it is all a single record. The only thing that matters is which part is the key, for the sake of key operations, so the question comes down to the logic of your calculation.
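For example, because each (key, list) pair is a single record, the per-record work can be expressed with mapValues and never needs a shuffle (a sketch reusing the data RDD from the question):
import org.apache.spark.rdd.RDD

// Each (Long, List[Int]) pair is one record on one partition, so both the max
// and the subtraction run locally on whichever worker holds that record.
val diffFromMax: RDD[(Long, List[Int])] =
  data.mapValues { values =>
    val maxItem = values.max
    values.map(maxItem - _)
  }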
Rephrasing of my questions:
I am writing a program that implements a data mining algorithm. In this program I want to save the input data which is supposed to be mined. Imagine the input data to be a table with rows and columns. Each row is going to be represented by an instance of my Scala class (the one in question). The columns of the input data can be of different types (Integer, Double, String, whatnot) and which type they are will change depending on the input data. I need a way to store a row inside my Scala class instance. Thus I need an ordered collection (like a special List) that can hold (many) different types as elements, and it must be possible for the type to be determined only at runtime. How can I do this? A Vector or a List requires that all elements be of the same type. A Tuple can hold different types (which can be determined at runtime if I am not mistaken), but only up to 22 elements, which is too few.
Bonus (not sure if I am asking too much now):
I would also like the rows' columns to be named and accessible by name. However, I think this problem can easily be solved by using two lists. (Although I just read about this issue somewhere - but I forgot where - and I think it was solved more elegantly.)
It might be good for my collection to be random access (so "Vector" rather than "List").
Having linear algebra (matrix multiplication etc.) capabilities would be nice.
Even more of a bonus: being able to store matrices.
Old phrasing of my question:
I would like to have something like a data.frame as we know it from R in Scala, but I am only going to need one row. This row is going to be a member of a class. The reason for this construct is that I want methods related to each row to be close to the data itself. Each data row is also supposed to have metadata about itself, and it will be possible to supply functions so that different rows will be manipulated differently. However, I need to store the rows somehow within the class. A List or Vector comes to mind, but they only allow all elements to be of the same type (all Integer, all String, etc.) - whereas, as we know from data.frame, different columns (here: elements in the Vector or List) can be of different types. I would also like to save the name of each column to be able to access the row values by column name. That seems the smallest issue, though. I hope it is clear what I mean. How can I implement this?
DataFrames in R are heterogeneous lists of homogeneous column vectors:
> df <- data.frame(c1=c(r1=1,r2=2), c2=c('a', 'b')); df
c1 c2
r1 1 a
r2 2 b
You could think of each row as a heterogeneous list of scalar values:
> as.list(df['r1',])
$c1
[1] 1
$c2
[1] a
An analogous implementation in Scala would be a tuple of lists:
scala> val df = (List(1, 2), List('a', 'b'))
df: (List[Int], List[Char]) = (List(1, 2),List(a, b))
Each row could then just be a tuple:
scala> val r1 = (1, 'a')
r1: (Int, Char) = (1,a)
If you want to name all your variables, another possibility is a case class:
scala> case class Row (col1:Int, col2:Char)
defined class Row
scala> val r1 = Row(col1=1, col2='a')
r1: Row = Row(1,a)
Hope that helps bridge the R to Scala divide.
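If you also want to look fields up by column name at runtime, one possibility (a sketch assuming Scala 2.13+, where every case class exposes productElementNames) is to zip the field names with the field values:
case class Row(col1: Int, col2: Char, col3: String)

val r1 = Row(1, 'a', "x")

// Pair up field names with field values; the values come back typed as Any
val byName: Map[String, Any] =
  r1.productElementNames.zip(r1.productIterator).toMap

byName("col2")  // 'a'
The price is that the lookup loses static typing, which is the usual trade-off when mimicking data.frame-style access; fully typed heterogeneous rows need something beyond the standard library.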
I have a query that takes a Seq[Int] as its argument (and performs filtering like WHERE x IN (...)), and I need to compile it since this query is fairly complex. However, when I try the naive approach:
Compiled((xs: Set[Int]) => someQuery.filter(_.x inSet xs))
It fails with a message saying that
Computation of type Set[Int] => Query[SomeTable, SomeValue, Seq] cannot be compiled (as type C)
Can Slick compile queries that take a set of integers as a parameter?
UPDATE: I use PostgreSQL as the database, so it might be possible to use arrays instead of an IN clause, but how?
As for the PostgreSQL database, the solution is much simpler than I expected.
First of all, you need a special Slick driver for PostgreSQL that supports arrays (slick-pg, for example). It is usually already included in projects that rely on PostgreSQL-specific features, so there is no extra trouble at all; that is the driver I use.
The main idea is to replace the plain SQL IN (...) clause, which takes as many bind parameters as there are items in the list and therefore cannot be statically compiled by Slick, with the PostgreSQL-specific array operator x = ANY(arr), which takes only one parameter for the whole array. It's easy to do with code like this:
val compiledQuery = Compiled((x: Rep[List[Int]]) => query.filter(_.id === x.any))
This code will generate a query like WHERE id = ANY(?), which uses only one bind parameter, so Slick will accept it for compilation.
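Running the compiled query then works like any other compiled query; a rough sketch, assuming the db and query values from above and that query yields SomeValue rows:
// Hypothetical usage: apply the compiled function to a concrete list of ids
val ids = List(1, 2, 3)
val rows = db.run(compiledQuery(ids).result)  // Future[Seq[SomeValue]]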