First, I am running Spark 2.* on Scala 2.11 and want to use RDDs.
I am given a RDD of case classes :
case class Purchase(uid: Int, obj: Int, price: Double)
// val rdd : RDD[Purchase]
I want to compute a value that needs the average price per uid and the average price per obj.
val perObj = rdd.groupBy(r => r.obj).map(u => (u._1, u._2.map(r => r.price).sum/u._2.size)).collect.toMap
val perUser = rdd.groupBy(r => r.uid).map(u => (u._1, u._2.map(r => r.price).sum/u._2.size)).collect.toMap
val res : Double = rdd.map(r => r.price - f(r.uid, r.obj, perUser, perObj)).reduce(_+_) / rdd.count.toDouble
Where f is a function that uses perUser(r.uid) and perObj(r.obj).
However this code uses collect() which sends the variable back to the driver node. I think that it has a big computational cost, especially when rdd is very large and I don't know if the last result makes good use of the parallelization.
Is there a way to optimize this code to make a better use of Spark ?
Related
How can I reduce a list like below concisely
Seq[Temp] = List(Temp(a,1), Temp(a,2), Temp(b,1))
to
List(Temp(a,2), Temp(b,1))
Only keep Temp objects with unique first param and max of second param.
My solution is with lot of groupBys and reduces which is giving a lengthy answer.
you have to
groupBy
sortBy values in ASC order
get the last one which is the largest
Example,
scala> final case class Temp (a: String, value: Int)
defined class Temp
scala> val data : Seq[Temp] = List(Temp("a",1), Temp("a",2), Temp("b",1))
data: Seq[Temp] = List(Temp(a,1), Temp(a,2), Temp(b,1))
scala> data.groupBy(_.a).map { case (k, group) => group.sortBy(_.value).last }
res0: scala.collection.immutable.Iterable[Temp] = List(Temp(b,1), Temp(a,2))
or instead of sortBy(fn).last you can maxBy(fn)
scala> data.groupBy(_.a).map { case (k, group) => group.maxBy(_.value) }
res1: scala.collection.immutable.Iterable[Temp] = List(Temp(b,1), Temp(a,2))
You can generate a Map with groupBy, compute the max in mapValues and convert it back to the Temp classes as in the following example:
case class Temp(id: String, value: Int)
List(Temp("a", 1), Temp("a", 2), Temp("b", 1)).
groupBy(_.id).mapValues( _.map(_.value).max ).
map{ case (k, v) => Temp(k, v) }
// res1: scala.collection.immutable.Iterable[Temp] = List(Temp(b,1), Temp(a,2))
Worth noting that the solution using maxBy in the other answer is more efficient as it minimizes necessary transformations.
You can do this using foldLeft:
data.foldLeft(Map[String, Int]().withDefaultValue(0))((map, tmp) => {
map.updated(tmp.id, max(map(tmp.id), tmp.value))
}).map{case (i,v) => Temp(i, v)}
This is essentially combining the logic of groupBy with the max operation in a single pass.
Note This may be less efficient because groupBy uses a mutable.Map internally which avoids constantly re-creating a new map. If you care about performance and are prepared to use mutable data, this is another option:
val tmpMap = mutable.Map[String, Int]().withDefaultValue(0)
data.foreach(tmp => tmpMap(tmp.id) = max(tmp.value, tmpMap(tmp.id)))
tmpMap.map{case (i,v) => Temp(i, v)}.toList
Use a ListMap if you need to retain the data order, or sort at the end if you need a particular ordering.
Assuming that I would like to write a function foo that transforms a DataFrame:
object Foo {
def foo(source: DataFrame): DataFrame = {
...complex iterative algorithm with a stopping condition...
}
}
since the implementation of foo has many "Actions" (collect, reduce etc.), calling foo will immediately triggers the expensive execution.
This is not a big problem, however since foo only converts a DataFrame to another, by convention it should be better to allow lazy execution: the implementation of foo should be executed only if the resulted DataFrame or its derivative(s) are being used on the Driver (through another "Action").
So far, the only way to reliably achieve this is through writing all implementations into a SparkPlan, and superimpose it into the DataFrame's SparkExecution, this is very error-prone and involves lots of boilerplate codes. What is the recommended way to do this?
It is not exactly clear to me what you try to achieve but Scala itself provides at least few tools which you may find useful:
lazy vals:
val rdd = sc.range(0, 10000)
lazy val count = rdd.count // Nothing is executed here
// count: Long = <lazy>
count // count is evaluated only when it is actually used
// Long = 10000
call-by-name (denoted by => in the function definition):
def foo(first: => Long, second: => Long, takeFirst: Boolean): Long =
if (takeFirst) first else second
val rdd1 = sc.range(0, 10000)
val rdd2 = sc.range(0, 10000)
foo(
{ println("first"); rdd1.count },
{ println("second"); rdd2.count },
true // Only first will be evaluated
)
// first
// Long = 10000
Note: In practice you should create local lazy binding to make sure that arguments are not evaluated on every access.
infinite lazy collections like Stream
import org.apache.spark.mllib.random.RandomRDDs._
val initial = normalRDD(sc, 1000000L, 10)
// Infinite stream of RDDs and actions and nothing blows :)
val stream: Stream[RDD[Double]] = Stream(initial).append(
stream.map {
case rdd if !rdd.isEmpty =>
val mu = rdd.mean
rdd.filter(_ > mu)
case _ => sc.emptyRDD[Double]
}
)
Some subset of these should be more than enough to implement complex lazy computations.
I'm trying to update an RDD with more information from another Map....I wrote this but is not working.
Where:
LocalCurrencies is a Sequence of Currency class
rdd: RDD[String, String]
...
val localCurrencies = Await.result(CurrencyDAO.currencies, 30 seconds)
//update ISO3
rdd.map(r => r.updated("currencyiso3", localCurrencies.find(c => c.CurrencyId ==
rdd.get("currencyid")).get.ISO3))
//Update exponent
rdd.map(r => r.updated("exponent", localCurrencies.find(c => c.CurrencyId ==
rdd.get("currencyid")).get.Exponent))
Any suggestion ?
Thanks
map doesn't modify an RDD, it creates a new one (the same applies to every Spark transformation). If you don't actually do anything with this new RDD, Spark won't even bother creating it. So you want to write
val rdd1 = rdd.map(...).map(...) // better to combine two `map`s into one
and work with rdd1 from then one (you can still use rdd as well, if needed). This isn't necessarily the only error, but you'll still need to fix it.
I have a simple map and reduce job over an RDD loaded from Cassandra.
The code looks something like this
sc.cassandraTable("app","channels").select("id").toArray.foreach((o) => {
val orders = sc.cassandraTable("fam", "table")
.select("date", "f2", "f3", "f4")
.where("id = ?", o("id")) # This o("id") is the ID i want later append to the finished list
val month = orders
.map( oo => {
var total_revenue = List(oo.getIntOption("f2"), oo.getIntOption("f3"), oo.getIntOption("f4")).flatten.reduce(_ + _)
(getDateAs("hour", oo.getDate("date")), total_revenue)
})
.reduceByKey(_ + _)
})
So this code summs the revenue up and returns something like this
(2014-11-23 18:00:00, 12412)
(2014-11-23 19:00:00, 12511)
Now I want to save this back to a Cassandra Table revenue_hour but i need the ID somehow in that list, something like that.
(2014-11-23 18:00:00, 12412, "CH1")
(2014-11-23 19:00:00, 12511, "CH1")
How can I make this work with more then just a (key, value) list? How can i pass along more values, which should not be transformed, instead just passed through to the end so I can save it back to Cassandra?
Maybe you could use a class and work with it through the flow. I mean, define RevenueHour class
case class RevenueHour(date: java.util.Date,revenue: Long, id: String)
Then built an intermediate RevenueHour in the map phase and then another one in the reduce phase.
val map: RDD[(Date, RevenueHour)] = orders.map(row =>
(
getDateAs("hour", oo.getDate("date")),
RevenueHour(
row.getDate("date"),
List(row.getIntOption("f2"),row.getIntOption("f3"),row.getIntOption("f4")).flatten.reduce(_ + _),
row.getString("id")
)
)
).reduceByKey((o1: RevenueHour, o2: RevenueHour) => RevenueHour(getDateAs("hour", o1.date), o1.revenue + o2.revenue, o1.id))
I use o1 RevenueHour because both o1 and o2 will have same key and same id (because the where clause before).
Hope it helps.
The approach presented on the question is sequencing the processing of data by iterating over a array of ids and applying a Spark job on only a (potentially small) subset of the data.
Without knowing how is the relation between the 'channels' and 'table' data, I see two options to fully utilize the ability of Spark of processing data in parallel:
Option 1
If the data on the 'table' table (called "orders" from here on) contains all the set of ids that we require in the report, we could apply the reporting logic to the whole table:
Based on the question, we will use this C* schema:
CREATE TABLE example.orders (id text,
date TIMESTAMP,
f2 decimal,
f3 decimal,
f4 decimal,
PRIMARY KEY(id, date)
);
It makes is a lot easier to access cassandra data by providing a case class that represents the schema of the table:
case class Order(id: String, date:Long, f2:Option[BigDecimal], f3:Option[BigDecimal], f4:Option[BigDecimal]) {
lazy val total = List(f2,f3,f4).flatten.sum
}
Then we can define an rdd based on the cassandra table. When we provide the case class as type, the spark-cassandra driver can directly perform a conversion for our convenience:
val ordersRDD = sc.cassandraTable[Order]("example", "orders").select("id", "date", "f2", "f3", "f4")
val revenueByIDPerHour = ordersRDD.map{order => ((order.id, getDateAs("hour", order.date)), order.total)}.reduceByKey(_ + _)
And finally save back to Cassandra:
revenueByIDPerHour.map{ case ((id,date), revenue) => (id, date, revenue)}
.saveToCassandra("example","revenue", SomeColumns("id", "date", "total"))
Option 2
if the ids contained in the ("app","channels") table should be used to filter the set of ids (e.g. valid ids), then, we can join the ids from this table with the orders. The job will be similar to the previous on, with the addition of:
val idRDD = sc.cassandraTable("app","channels").select("id").map(_.getString)
val ordersRDD = sc.cassandraTable[Order]("example", "orders").select("id", "date", "f2", "f3", "f4")
val validOrders = idRDD.join(ordersRDD.map(order => (id,order))
These two ways illustrate how to work with Cassandra and Spark, making use of the distributed nature of Spark's operations. It should also be considerably faster then executing a query for each ID in the 'channels' table.
Does Spark support distributed Map collection types ?
So if I have an HashMap[String,String] which are key,value pairs , can this be converted to a distributed Map collection type ? To access the element I could use "filter" but I doubt this performs as well as Map ?
Since I found some new info I thought I'd turn my comments into an answer. #maasg already covered the standard lookup function I would like to point out you should be careful because if the RDD's partitioner is None, lookup just uses a filter anyway. In reference to the (K,V) store on top of spark it looks like this is in progress, but a usable pull request has been made here. Here is an example usage.
import org.apache.spark.rdd.IndexedRDD
// Create an RDD of key-value pairs with Long keys.
val rdd = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
// Construct an IndexedRDD from the pairs, hash-partitioning and indexing
// the entries.
val indexed = IndexedRDD(rdd).cache()
// Perform a point update.
val indexed2 = indexed.put(1234L, 10873).cache()
// Perform a point lookup. Note that the original IndexedRDD remains
// unmodified.
indexed2.get(1234L) // => Some(10873)
indexed.get(1234L) // => Some(0)
// Efficiently join derived IndexedRDD with original.
val indexed3 = indexed.innerJoin(indexed2) { (id, a, b) => b }.filter(_._2 != 0)
indexed3.collect // => Array((1234L, 10873))
// Perform insertions and deletions.
val indexed4 = indexed2.put(-100L, 111).delete(Array(998L, 999L)).cache()
indexed2.get(-100L) // => None
indexed4.get(-100L) // => Some(111)
indexed2.get(999L) // => Some(0)
indexed4.get(999L) // => None
It seems like the pull request was well received and will probably be included in future versions of spark, so it is probably safe to use that pull request in your own code. Here is the JIRA ticket in case you were curious
The quick answer: Partially.
You can transform a Map[A,B] into an RDD[(A,B)] by first forcing the map into a sequence of (k,v) pairs but by doing so you loose the constrain that keys of a map must be a set. ie. you loose the semantics of the Map structure.
From a practical perspective, you can still resolve an element into its corresponding value using kvRdd.lookup(element) but the result will be a sequence, given that you have no warranties that there's a single lookup value as explained before.
A spark-shell example to make things clear:
val englishNumbers = Map(1 -> "one", 2 ->"two" , 3 -> "three")
val englishNumbersRdd = sc.parallelize(englishNumbers.toSeq)
englishNumbersRdd.lookup(1)
res: Seq[String] = WrappedArray(one)
val spanishNumbers = Map(1 -> "uno", 2 -> "dos", 3 -> "tres")
val spanishNumbersRdd = sc.parallelize(spanishNumbers.toList)
val bilingueNumbersRdd = englishNumbersRdd union spanishNumbersRdd
bilingueNumbersRdd.lookup(1)
res: Seq[String] = WrappedArray(one, uno)