Merge objects in a mutable list in Scala

I have recently started looking at Scala code and I am trying to understand how to go about a problem.
I have a mutable list of objects, each of which has an id: String and a values: List[Int]. Because of the way I get the data, more than one object can have the same id. I am trying to merge the items in the list: if, for example, I have 3 objects with id 123 and whatever values, I want to end up with just one object with that id and the values of the 3 combined.
I could do this the Java way, iterating and so on, but I was wondering if there is an easier, Scala-specific way of going about this?

The first thing to do is avoid using mutable data and think about transforming one immutable object into another. So rather than mutating the contents of one collection, think about creating a new collection from the old one.
Once you have done that it is actually very straightforward because this is the sort of thing that is directly supported by the Scala library.
case class Data(id: String, values: List[Int])
val list: List[Data] = ???
val result: Map[String, List[Int]] =
list.groupMapReduce(_.id)(_.values)(_ ++ _)
The groupMapReduce call breaks down into three parts:
The first part groups the data by the id field and makes that the key. This gives a Map[String, List[Data]]
The second part extracts the values field and makes that the data, so the result is now Map[String, List[List[Int]]]
The third part combines all the values fields into a single list, giving the final result Map[String, List[Int]]
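Note that groupMapReduce was added in Scala 2.13. On earlier versions the same result can be built with groupBy, and it is easy to map back to Data objects if you want a list rather than a Map. A sketch, reusing the Data and list definitions above:
// Pre-2.13 equivalent: group by id, then flatten the values per group.
val result212: Map[String, List[Int]] =
  list.groupBy(_.id).map { case (id, items) => id -> items.flatMap(_.values) }
// If you want Data objects back rather than a Map:
val merged: List[Data] =
  result212.map { case (id, values) => Data(id, values) }.toList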

Related

Scala 2.10.6 + Spark 1.6.0: Most appropriate way to map from DataFrame to case classes

I'm looking for the most appropriate way to map the information contained in a DataFrame to some case classes I've defined, according to the following situation.
I have 2 Hive tables, and a third table which represents the many-to-many relationship between them. Let's call them "Item", "Group", and "GroupItems".
I'm considering executing a single query joining them all, to get the information of a single group and all its items.
So each row of the resulting DataFrame would contain the fields of the Group and the fields of an Item.
Then, I've created 4 different case classes to use this information in my application. Let's call them:
- ItemProps1: its properties match with some of the Item fields
- ItemProps2: its properties match with some of the Item fields
- Item: contains some properties which match with some of the Item fields, and has 1 object of type ItemProps1, and another of type ItemProps2
- Group: its properties match with the Group fields, and contains a list of items
What I want to do is to map the info contained in the resulting DataFrame into these case classes, but I don't know which would be the most appropriate way.
I know DataFrame has a method "as[U]" which is very useful for performing this kind of mapping, but I'm afraid in my case it won't be useful.
Then, I've found some options to perform the mapping manually, like the following ones:
df.map {
case Row(foo: Int, bar: String) => Record(foo, bar)
}
or:
val records = df.collect().map { row =>
  val foo = row.getAs[Int]("foo")
  val bar = row.getAs[String]("bar")
  Record(foo, bar)
}
Is any of these approaches the most appropriate one, or should I do it in another way?
Thanks a lot!
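For illustration, a minimal sketch of the manual approach extended to the nested case classes described above; the column names and fields here are hypothetical stand-ins for the real schema:
case class ItemProps1(color: String)   // hypothetical Item fields
case class ItemProps2(weight: Double)  // hypothetical Item fields
case class Item(id: Long, props1: ItemProps1, props2: ItemProps2)
case class Group(id: Long, name: String, items: List[Item])

// Build one Item per joined row, then fold the rows into a single Group
// (assuming the DataFrame holds the rows of exactly one group, and is non-empty).
val rows = df.collect()
val items = rows.map { row =>
  Item(
    row.getAs[Long]("item_id"),
    ItemProps1(row.getAs[String]("color")),
    ItemProps2(row.getAs[Double]("weight")))
}.toList
val group = Group(
  rows.head.getAs[Long]("group_id"),
  rows.head.getAs[String]("group_name"),
  items)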

Parallelization level of tupled RDD data

Suppose I have an RDD with the following type:
RDD[(Long, List[Int])]
Can I assume that the entire list is located on the same worker? I want to know whether certain operations are acceptable on the RDD level or should be calculated at the driver. For instance:
val data: RDD[(Long, List[Int])] = someFunction() // creates a list for each timeslot
Please note that the List may be the result of an aggregate or any other operation, and is not necessarily created as one piece.
val diffFromMax = data.map(item => (item._1, findDiffFromMax(item._2)))
def findDiffFromMax(data: List[Int]): List[Int] = {
  val maxItem = data.max
  data.map(item => maxItem - item)
}
The thing is that if the List is distributed, calculating maxItem may cause a lot of network traffic. This could be handled with an RDD of the following type:
RDD[(Long, Int /* max item */, List[Int])]
where the max item is calculated at the driver.
So there are actually two questions:
At what point can I assume that the data of an RDD record is located on one worker, if at all? (Answers with references to the docs or personal evaluations would be great.) What happens in the case of a Tuple inside a Tuple: ((Long, Integer), Double)?
What is the common practice for designing algorithms with Tuples? Should I always treat the data as if it may appear on different workers? Should I always break it down to the minimal granularity in the first Tuple field? For a case where there is data (Double) for a user (String) in a timeslot (Long), should the data be (Long, (String, Double)), ((Long, String), Double), or maybe (String, (Long, Double))? Or maybe this is not optimal and matrices are better?
The short answer is yes, your list would be located on a single worker.
Your tuple is a single record in the RDD. A single record is ALWAYS on a single partition (which would be on a single worker).
When you run your findDiffFromMax, it executes on the worker holding the target record (the function is serialized to all the workers so it can run there).
The thing you should note is that when you generate a tuple of (k, v), this generally means a key/value pair, so you can do key-based operations on the RDD. The order ((Long, (String, Double)) vs. ((Long, String), Double) or any other way) doesn't really matter, as it is all a single record. The only thing that matters is which part is the key, since that determines which key operations you can do, so the question comes down to the logic of your calculation.
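A small sketch (reusing data and findDiffFromMax from the question) of how this plays out: each pair is one record, so the per-list work runs on whichever worker holds that record, and the Long component acts as the key for key-based operations:
import org.apache.spark.rdd.RDD

// Each (Long, List[Int]) pair is a single record on a single partition,
// so findDiffFromMax executes locally where the record lives.
val diffs: RDD[(Long, List[Int])] = data.mapValues(findDiffFromMax)

// Because the Long is the key, key-based operations are available too:
val mergedByTimeslot: RDD[(Long, List[Int])] = data.reduceByKey(_ ++ _)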

Is there a way to add extra metadata for Spark dataframes?

Is it possible to add extra metadata to DataFrames?
Reason
I have Spark DataFrames for which I need to keep extra information. Example: a DataFrame for which I want to "remember" the highest used index in an Integer id column.
Current solution
I use a separate DataFrame to store this information. Of course, keeping this information separately is tedious and error-prone.
Is there a better solution to store such extra information on DataFrames?
To expand and Scala-fy nealmcb's answer (the question was tagged scala, not python, so I don't think this answer will be off-topic or redundant), suppose you have a DataFrame:
import org.apache.spark.sql
import sqlContext.implicits._ // for .toDF; the spark-shell pre-imports this

val df = sc.parallelize(Seq.fill(100) { scala.util.Random.nextInt() }).toDF("randInt")
And some way to get the max or whatever you want to memoize on the DataFrame:
val randIntMax = df.rdd.map { case sql.Row(randInt: Int) => randInt }.reduce(math.max)
sql.types.Metadata can only hold strings, booleans, some types of numbers, and other metadata structures. So we have to use a Long:
val metadata = new sql.types.MetadataBuilder().putLong("columnMax", randIntMax).build()
DataFrame.withColumn() actually has an overload that permits supplying a metadata argument at the end, but it's inexplicably marked [private], so we just do what it does — use Column.as(alias, metadata):
val newColumn = df.col("randInt").as("randInt_withMax", metadata)
val dfWithMax = df.withColumn("randInt_withMax", newColumn)
dfWithMax now has (a column with) the metadata you want!
dfWithMax.schema.foreach(field => println(s"${field.name}: metadata=${field.metadata}"))
> randInt: metadata={}
> randInt_withMax: metadata={"columnMax":2094414111}
Or programmatically and type-safely (sort of; Metadata.getLong() and others do not return Option and may throw a "key not found" exception):
dfWithMax.schema("randInt_withMax").metadata.getLong("columnMax")
> res29: Long = 209341992
Attaching the max to a column makes sense in your case, but in the general case of attaching metadata to a DataFrame and not a column in particular, it appears you'd have to take the wrapper route described by the other answers.
As of Spark 1.2, StructType schemas have a metadata attribute which can hold an arbitrary mapping / dictionary of information for each Column in a Dataframe. E.g. (when used with the separate spark-csv library):
customSchema = StructType([
    StructField("cat_id", IntegerType(), True,
                {'description': "Unique id, primary key"}),
    StructField("cat_title", StringType(), True,
                {'description': "Name of the category, with underscores"})])

categoryDumpDF = (sqlContext.read.format('com.databricks.spark.csv')
                  .options(header='false')
                  .load(csvFilename, schema=customSchema))

f = categoryDumpDF.schema.fields
["%s (%s): %s" % (t.name, t.dataType, t.metadata) for t in f]

["cat_id (IntegerType): {u'description': u'Unique id, primary key'}",
 "cat_title (StringType): {u'description': u'Name of the category, with underscores.'}"]
This was added in [SPARK-3569] Add metadata field to StructField - ASF JIRA, and designed for use in Machine Learning pipelines to track information about the features stored in columns, like categorical/continuous, the number of categories, and the category-to-index map. See the SPARK-3569: Add metadata field to StructField design document.
I'd like to see this used more widely, e.g. for descriptions and documentation of columns, the unit of measurement used in the column, coordinate axis information, etc.
Issues include how to appropriately preserve or manipulate the metadata information when the column is transformed, how to handle multiple sorts of metadata, how to make it all extensible, etc.
For the benefit of those thinking of expanding this functionality in Spark dataframes, I reference some analogous discussions around Pandas.
For example, see xray - bring the labeled data power of pandas to the physical sciences which supports metadata for labeled arrays.
And see the discussion of metadata for Pandas at Allow custom metadata to be attached to panel/df/series? · Issue #2485 · pydata/pandas.
See also discussion related to units: ENH: unit of measurement / physical quantities · Issue #10349 · pydata/pandas
If you want to have less tedious work, I think you can add an implicit conversion between DataFrame and your custom wrapper (haven't tested it yet though).
implicit class WrappedDataFrame(val df: DataFrame) {
  var metadata = scala.collection.mutable.Map[String, Long]()

  def addToMetaData(key: String, value: Long) {
    metadata += key -> value
  }

  ...[other methods you consider useful, getters, setters, whatever]...
}
If the implicit wrapper is in the DataFrame's scope, you can just use a normal DataFrame as if it were your wrapper, i.e.:
df.addToMetaData("size", 100)
This way also makes your metadata mutable, so you should not be forced to compute it only once and carry it around.
I would store a wrapper around your dataframe. For example:
case class MyDFWrapper(dataFrame: DataFrame, metadata: Map[String, Long])
val maxIndex = df1.agg("index" -> "max").head.getLong(0)
MyDFWrapper(df1, Map("maxIndex" -> maxIndex))
A lot of people saw the word "metadata" and went straight to "column metadata". That does not seem to be what you wanted, and it was not what I wanted when I had a similar problem. Ultimately, the problem here is that a DataFrame is an immutable data structure: whenever an operation is performed on it, the data carries over to the new DataFrame, but anything else attached to the old one does not. This means that you can't simply put a wrapper on it, because as soon as you perform an operation you've got a whole new DataFrame (potentially of a completely new type, especially with Scala/Spark's tendencies toward implicit conversions). Finally, if the DataFrame ever escapes its wrapper, there's no way to reconstruct the metadata from the DataFrame.
I had this problem in Spark Streaming, which focuses on RDDs (the underlying data structure of the DataFrame as well), and came to one simple conclusion: the only place to store the metadata is in the name of the RDD. The RDD name is never used by the core Spark system except for reporting, so it's safe to repurpose it. Then you can create your wrapper based on the RDD name, with an explicit conversion between any DataFrame and your wrapper, complete with metadata.
Unfortunately, this still leaves you with the problem of immutability and new RDDs being created with every operation. The RDD name (our metadata field) is lost with each new RDD. That means you need a way to re-add the name to your new RDD, which can be solved by providing a method that takes a function as an argument: it extracts the metadata before calling the function, calls the function to get the new RDD/DataFrame, then names the result with the metadata:
def withMetadata(fn: DataFrame => DataFrame): MetaDataFrame = {
  val meta = df.rdd.name    // stash the metadata held in the RDD name
  val result = fn(df)       // run the transformation
  result.rdd.setName(meta)  // re-attach the name to the new RDD
  MetaDataFrame(result)
}
Your wrapping class (MetaDataFrame) can provide convenience methods for parsing and setting metadata values, as well as implicit conversions back and forth between Spark DataFrame and MetaDataFrame. As long as you run all your mutations through the withMetadata method, your metadata will carry through your entire transformation pipeline. Using this method for every call is a bit of a hassle, yes, but the simple reality is that there is no first-class metadata concept in Spark.
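The MetaDataFrame wrapper itself is never shown; a minimal sketch of what it might look like, assuming the metadata is a single string carried in the underlying RDD's name:
import org.apache.spark.sql.DataFrame

// Hypothetical wrapper: the metadata is whatever string is stored in the
// RDD name, so it survives any transformation run through withMetadata.
case class MetaDataFrame(df: DataFrame) {
  def metadata: String = df.rdd.name // read the metadata back

  def setMetadata(meta: String): MetaDataFrame = {
    df.rdd.setName(meta)
    this
  }

  def withMetadata(fn: DataFrame => DataFrame): MetaDataFrame = {
    val meta = df.rdd.name
    val result = fn(df)
    result.rdd.setName(meta)
    MetaDataFrame(result)
  }
}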

Scala groupBy TreeMap / SortedMap?

Currently I have some kind of lazy groupBy stuff in Scala, which should create some kind of table structure from my ordered SQL rows.
This works great. However, at the end I need something like
Map[String, Map[String, List[Object]]]
where the second Map should be ordered so that I can get values by index while using zipWithIndex.
Is there a way to do so?
Currently this is one of my two statements:
p1.toList.groupBy(_._1.layout).mapValues(_.groupBy(_._2.name).mapValues(_.map(toRow)))
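One possible approach (a sketch reusing p1, layout, name, and toRow from the question): rebuild the inner Map as a TreeMap, which is a SortedMap, so its entries iterate in key order and zipWithIndex gives stable indices:
import scala.collection.immutable.TreeMap

// Same grouping as above, but the inner Map is rebuilt as a TreeMap
// sorted by name, so iteration order is deterministic.
val table: Map[String, TreeMap[String, List[Object]]] =
  p1.toList.groupBy(_._1.layout).map { case (layout, rows) =>
    layout -> TreeMap(rows.groupBy(_._2.name).mapValues(_.map(toRow)).toSeq: _*)
  }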

Is an empty Seq still a Seq?

If I am returning a Seq[T] from a function, when there may be a chance that it is empty, is it still a Seq or will it error?
In other words, do I need to wrap it in an Option or is that overkill?
It's generally overkill although it may convey some information depending on the context. Suppose you have a huge database of people, where some data could be missing. You could write queries like:
def getChildren( p: Person ): Seq[Person]
But if it returns an empty sequence, you cannot tell whether the data is missing or the data is available but there simply are no children. Contrast this with the definition:
def getChildren( p: Person ): Option[Seq[Person]]
You will obtain None when the data is missing, and Some(s), where s is an empty sequence, if there are no children.
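A small sketch (assuming a simple Person type) of how the Option variant lets the caller distinguish the two cases:
case class Person(name: String, children: Seq[Person], dataAvailable: Boolean = true)

// Hypothetical lookup: None means "no data", Some(Seq()) means "no children".
def getChildren(p: Person): Option[Seq[Person]] =
  if (p.dataAvailable) Some(p.children) else None

val alice = Person("Alice", Seq(Person("Bob", Nil)))

getChildren(alice) match {
  case None                               => println("data is missing")
  case Some(children) if children.isEmpty => println("no children")
  case Some(children)                     => println(s"${children.size} children")
}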
Finally, note that Seq is like a monoid: it has a zero form, the empty sequence, which is a perfectly valid value to return.