Apache Spark map and reduce with passing values - scala

I have a simple map and reduce job over an RDD loaded from Cassandra.
The code looks something like this
sc.cassandraTable("app","channels").select("id").toArray.foreach((o) => {
  val orders = sc.cassandraTable("fam", "table")
    .select("date", "f2", "f3", "f4")
    .where("id = ?", o("id")) // This o("id") is the ID I want to append to the finished list later
  val month = orders
    .map( oo => {
      val total_revenue = List(oo.getIntOption("f2"), oo.getIntOption("f3"), oo.getIntOption("f4")).flatten.reduce(_ + _)
      (getDateAs("hour", oo.getDate("date")), total_revenue)
    })
    .reduceByKey(_ + _)
})
This code sums up the revenue and returns something like this:
(2014-11-23 18:00:00, 12412)
(2014-11-23 19:00:00, 12511)
Now I want to save this back to a Cassandra table revenue_hour, but I need the ID in that list somehow, something like this:
(2014-11-23 18:00:00, 12412, "CH1")
(2014-11-23 19:00:00, 12511, "CH1")
How can I make this work with more than just a (key, value) list? How can I pass along more values that should not be transformed, but simply passed through to the end so I can save them back to Cassandra?

Maybe you could use a class and carry it through the flow. That is, define a RevenueHour class:
case class RevenueHour(date: java.util.Date, revenue: Long, id: String)
Then build an intermediate RevenueHour in the map phase and merge two of them in the reduce phase:
val map: RDD[(Date, RevenueHour)] = orders.map { row =>
  val hour = getDateAs("hour", row.getDate("date"))
  val revenue = List(row.getIntOption("f2"), row.getIntOption("f3"), row.getIntOption("f4")).flatten.sum
  (hour, RevenueHour(hour, revenue, o.getString("id"))) // o is the channel row from the outer loop
}.reduceByKey((o1, o2) => RevenueHour(o1.date, o1.revenue + o2.revenue, o1.id))
I take the date and id from o1 because both o1 and o2 have the same key and the same id (because of the preceding where clause).
Hope it helps.
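As a rough illustration of carrying the id inside the value, here is a minimal sketch using plain Scala collections rather than an RDD, so it runs without a cluster. The names RevenueSketch, truncateToHour and revenuePerHour are made up for this example; truncateToHour only stands in for the getDateAs("hour", ...) helper from the question, whose real implementation is not shown.

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit

object RevenueSketch {
  // Same shape as the RevenueHour class above, with Instant instead of java.util.Date.
  final case class RevenueHour(hour: Instant, revenue: Long, id: String)

  // Hypothetical stand-in for getDateAs("hour", ...): drops minutes and seconds.
  def truncateToHour(ts: Instant): Instant = ts.truncatedTo(ChronoUnit.HOURS)

  // Local counterpart of map + reduceByKey: key by hour, add revenues, keep the id.
  def revenuePerHour(id: String, orders: Seq[(Instant, Int)]): Seq[RevenueHour] =
    orders
      .map { case (ts, amount) =>
        val hour = truncateToHour(ts)
        hour -> RevenueHour(hour, amount.toLong, id)
      }
      .groupBy(_._1)
      .values
      .map(_.map(_._2).reduce((a, b) => a.copy(revenue = a.revenue + b.revenue)))
      .toSeq
      .sortBy(_.hour.toEpochMilli)
}
```

The reduce step only touches the revenue field, so the id (and the hour) pass through unchanged, which is the point of wrapping them in the value.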

The approach presented in the question serializes the processing: it iterates over an array of ids and runs a Spark job on only a (potentially small) subset of the data each time.
Without knowing how the 'channels' and 'table' data relate, I see two options to make full use of Spark's ability to process data in parallel:
Option 1
If the 'table' table (called "orders" from here on) contains all the ids that we require in the report, we can apply the reporting logic to the whole table.
Based on the question, we will use this C* schema:
CREATE TABLE example.orders (id text,
  date timestamp,
  f2 decimal,
  f3 decimal,
  f4 decimal,
  PRIMARY KEY(id, date)
);
It is a lot easier to access Cassandra data by providing a case class that represents the schema of the table:
case class Order(id: String, date: Long, f2: Option[BigDecimal], f3: Option[BigDecimal], f4: Option[BigDecimal]) {
  lazy val total = List(f2, f3, f4).flatten.sum
}
Then we can define an RDD based on the Cassandra table. When we provide the case class as the type, the Spark Cassandra connector can perform the conversion for us directly:
val ordersRDD = sc.cassandraTable[Order]("example", "orders").select("id", "date", "f2", "f3", "f4")
val revenueByIDPerHour = ordersRDD.map{order => ((order.id, getDateAs("hour", order.date)), order.total)}.reduceByKey(_ + _)
And finally save back to Cassandra:
revenueByIDPerHour.map{ case ((id,date), revenue) => (id, date, revenue)}
.saveToCassandra("example","revenue", SomeColumns("id", "date", "total"))
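The composite key is what lets the id survive the reduce step. A minimal local sketch of that idea, using plain Scala collections instead of an RDD, might look like the following. The OrderRow alias, hourBucket and revenueByIdPerHour are hypothetical names for this example, and the hour bucketing on epoch milliseconds is an assumption about what getDateAs("hour", ...) does.

```scala
object CompositeKeySketch {
  // Hypothetical flattened order: (channel id, epoch millis, revenue of that row).
  type OrderRow = (String, Long, BigDecimal)

  private val HourMillis = 3600000L

  // Round a timestamp down to the start of its hour.
  def hourBucket(epochMillis: Long): Long =
    epochMillis - Math.floorMod(epochMillis, HourMillis)

  // Local counterpart of map + reduceByKey with (id, hour) as the key.
  def revenueByIdPerHour(rows: Seq[OrderRow]): Map[(String, Long), BigDecimal] =
    rows
      .groupBy { case (id, ts, _) => (id, hourBucket(ts)) }
      .map { case (key, group) => key -> group.map(_._3).sum }
}
```

Because the id is part of the key, every id is aggregated in the same pass and no per-id query is needed.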
Option 2
If the ids contained in the ("app","channels") table should be used to filter the set of ids (e.g. valid ids), then we can join the ids from this table with the orders. The job will be similar to the previous one, with the addition of:
val idRDD = sc.cassandraTable("app","channels").select("id").map(row => (row.getString("id"), ()))
val ordersRDD = sc.cassandraTable[Order]("example", "orders").select("id", "date", "f2", "f3", "f4")
val validOrders = idRDD.join(ordersRDD.map(order => (order.id, order))).map { case (_, (_, order)) => order }
These two ways illustrate how to work with Cassandra and Spark, making use of the distributed nature of Spark's operations. Either should also be considerably faster than executing a query for each ID in the 'channels' table.


How to fill Scala Seq of Sets with unique values from Spark RDD?

I'm working with Spark and Scala. I have an RDD of Array[String] that I'm going to iterate through. The RDD contains values for attributes like (name, age, work, ...). I'm using a Sequence of mutable Sets of Strings (called attributes) to collect all unique values for each attribute.
Think of the RDD as something like this:
In the end I want something like this:
attributes = (("name1","name2","name3"),("21","22"),("JobA","JobB"))
I have the following code:
val someLength = 10
val attributes = Seq.fill[mutable.Set[String]](someLength)(mutable.Set())
val splitLines = rdd.map(line => line.split("\t"))
splitLines.foreach(line => {
  for {(value, index) <- line.zipWithIndex} {
    attributes(index).add(value)
    // #1
  }
})
// #2
When I debug and stop at the line marked with #1, everything is fine, attributes is correctly filled with unique values.
But after the loop, at the line marked #2, attributes is empty again. Inspecting it shows that attributes is a sequence of sets that are all of size 0.
What am I doing wrong? Is there some kind of scoping going on, that I'm not aware of?
The answer lies in the fact that Spark is a distributed engine. I will give you a rough idea of the problem that you are facing. Here the elements in each RDD are bucketed into Partitions and each Partition potentially lives on a different node.
When you write rdd1.foreach(f) that f is wrapped inside a closure (Which gets copies of the corresponding objects). Now, this closure is serialized and then sent to each node where it is applied for each element in that Partition.
Here, your f gets a copy of attributes in its wrapped closure, so when f is executed it interacts with that copy and not with the attributes you defined on the driver. As a result, your attributes is left unchanged.
I hope the problem is clear now.
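To make the copy semantics concrete, here is a small sketch that imitates the effect locally without Spark. It does not serialize anything; it simply clones the captured sets before a "task" mutates them, which is an assumption-laden simplification of what shipping a closure to an executor does. ClosureCopySketch and runTask are names invented for this example.

```scala
import scala.collection.mutable

object ClosureCopySketch {
  // Driver-side state, shaped like the attributes value in the question.
  val attributes: Seq[mutable.Set[String]] = Seq.fill(3)(mutable.Set.empty[String])

  // Imitates a task: it works on its own copy of the captured sets.
  def runTask(lines: Seq[Array[String]]): Seq[mutable.Set[String]] = {
    val taskCopy = attributes.map(_.clone())
    for (line <- lines; (value, index) <- line.zipWithIndex) taskCopy(index) += value
    taskCopy
  }
}
```

Inside runTask the copy fills up as expected (the view at marker #1), while ClosureCopySketch.attributes stays empty (the view at marker #2).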
val yourRdd = sc.parallelize(List(...)) // (name, age, work) tuples
val yourNeededRdd = yourRdd
.flatMap({ case (name, age, work) => List(("name", name), ("age", age), ("work", work)) })
.groupBy({ case (attrName, attrVal) => attrName })
.map({ case (attrName, group) => (attrName, group.toList.map(_._2).distinct) })
// RDD(
// ("name", List("name1", "name2", "name3")),
// ("age", List("21", "22")),
// ("work", List("JobA", "JobB"))
// )
// Or
val distinctNamesRdd = yourRdd.map(_._1).distinct
// RDD("name1", "name2", "name3")
val distinctAgesRdd = yourRdd.map(_._2).distinct
// RDD("21", "22")
val distinctWorksRdd = yourRdd.map(_._3).distinct
// RDD("JobA", "JobB")
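Since the question works with Array[String] rows rather than tuples, the same idea can be keyed by column index. The sketch below uses plain Scala collections so it runs locally; on an RDD the corresponding steps would be flatMap, distinct and groupByKey. DistinctByColumnSketch and distinctPerColumn are hypothetical names.

```scala
object DistinctByColumnSketch {
  // Pair every value with its column index, drop duplicates, then group by index.
  def distinctPerColumn(rows: Seq[Array[String]]): Map[Int, Set[String]] =
    rows
      .flatMap(row => row.zipWithIndex.map(_.swap).toSeq)
      .distinct
      .groupBy(_._1)
      .map { case (index, pairs) => index -> pairs.map(_._2).toSet }
}
```

The result is a new value returned by the computation, not a shared mutable structure updated from inside it, which is what makes the approach safe to distribute.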

Pass columnNames dynamically to cassandraTable().select()

I'm reading a query from a file at run time and executing it in a Spark + Cassandra environment.
I'm executing:
sparkContext.cassandraTable("keyspaceName", "colFamilyName").select("col1", "col2", "col3").where("some condition = true")
Query in file:
select col1, col2, col3
from keyspaceName.colFamilyName
where somecondition = true
Here col1, col2, col3 can vary depending on the query parsed from the file.
Question:
How do I pick the column names from the query and pass them to select() at runtime?
I have tried many ways to do it:
1. dumbest thing done (which obviously threw an error) -
var str = "col1,col2,col3"
var selectStmt = str.split("\\,").map { x => "\"" + x.trim() + "\"" }.mkString(",")
var queryRDD = sc.cassandraTable().select(selectStmt)
Any ideas are welcome.
Side Notes :
1. I do not want to use cassandraContext because it will be deprecated/removed in the next release (https://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/spark/sparkCCcontext.html)
2. I'm on
- a. Scala 2.11
- b. spark-cassandra-connector_2.11:1.6.0-M1
- c. Spark 1.6
Use Cassandra Connector
Your use case sounds like you actually want to use CassandraConnector objects. These give you direct access to a per-executor-JVM session pool and are ideal for executing arbitrary queries. This will end up being much more efficient than creating an RDD for each query.
This would look something like
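val connector = CassandraConnector(sc.getConf)
rddOfStatements.mapPartitions { it =>
  connector.withSessionDo { session =>
    // materialize inside the block so the statements run while the session is held
    it.map(statement => session.execute(statement)).toList.iterator
  }
}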
But you most likely would want to use executeAsync and handle the futures separately for better performance.
Programmatically specifying columns in cassandraTable
The select method takes ColumnRef* which means you need to pass in some number of ColumnRefs. Normally there is an implicit conversion from String --> ColumnRef which is why you can pass in just a var-args of strings.
Here it's a little more complicated because we want to pass var args of another type so we end up with double implicits and Scala doesn't like that.
So instead we pass in ColumnName objects as varargs (:_*)
Keyspace: test
Table: dummy
- id : java.util.UUID (partition key column)
- txt : String
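val columns = Seq("id", "txt")
columns: Seq[String] = List(id, txt)
//Convert the strings to ColumnNames (a subclass of ColumnRef) and treat as var args
sc.cassandraTable("test", "dummy").select(columns.map(ColumnName(_)): _*).collect
Array(CassandraRow{id: 74f25101-75a0-48cd-87d6-64cb381c8693, txt: hello world})
//Only use the first column
sc.cassandraTable("test", "dummy").select(columns.map(ColumnName(_)).take(1): _*).collect
Array(CassandraRow{id: 74f25101-75a0-48cd-87d6-64cb381c8693})
//Only use the last column
sc.cassandraTable("test", "dummy").select(columns.map(ColumnName(_)).takeRight(1): _*).collect
Array(CassandraRow{txt: hello world})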
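The varargs behaviour described above can be reproduced in plain Scala. In the sketch below, ColumnRef, ColumnName and select are simplified stand-ins written for this example; they are not the connector's real classes, and only mimic the shape of a select(ColumnRef*) method with a String to ColumnRef implicit conversion.

```scala
import scala.language.implicitConversions

object VarargsSketch {
  // Stand-ins for illustration only; not the connector's real classes.
  sealed trait ColumnRef { def name: String }
  final case class ColumnName(name: String) extends ColumnRef

  // Lets string literals be passed where a ColumnRef is expected.
  implicit def stringToColumnRef(s: String): ColumnRef = ColumnName(s)

  def select(columns: ColumnRef*): Seq[String] = columns.map(_.name)

  // Each literal is converted individually by the implicit.
  val literal: Seq[String] = select("id", "txt")

  // A Seq[String] is not converted element by element when expanded with : _*,
  // so map it to ColumnName explicitly first.
  val names: Seq[String] = Seq("id", "txt")
  val dynamic: Seq[String] = select(names.map(ColumnName(_)): _*)
}
```

Passing names directly with : _* would not compile here, because the compiler would need a conversion from Seq[String] to Seq[ColumnRef], which the element-level implicit does not provide.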

SPARK - Use RDD.foreach to Create a Dataframe and execute actions on the Dataframe

I am new to Spark and am trying to figure out a better way to achieve the following scenario.
There is a database table containing 3 fields - Category, Amount, Quantity.
First I try to pull all the distinct Categories from the database.
val categories:RDD[String] = df.select(CATEGORY).distinct().rdd.map(r => r(0).toString)
Now, for each category, I want to execute a pipeline that essentially creates a dataframe for that category and applies some machine learning.
def execute(category: String): Unit = {
  val dfCategory = sqlCtxt.read.jdbc(JDBC_URL,"SELECT * FROM TABLE_NAME WHERE CATEGORY="+category)
  // ... run the pipeline on dfCategory
}
categories.foreach(execute)
Is it possible to do something like this? Or is there a better alternative?
// You could get all your data with a single query and convert it to an RDD
val data = sqlCtxt.read.jdbc(JDBC_URL, "TABLE_NAME", new java.util.Properties()).rdd
// then group the data by category
val groupedData = data.groupBy(row => row.getAs[String]("category"))
// then you get an RDD[(String, Iterable[org.apache.spark.sql.Row])]
// and you can iterate over it and execute your pipeline
// (foreach is an action, so the pipeline actually runs; map alone would stay lazy)
groupedData.foreach { case (categoryName, items) =>
  // executePipeline(categoryName, items)
}
Your code would fail with a "Task not serializable" exception, since you're trying to use the SQLContext (which isn't serializable) inside the execute method, which would have to be serialized and sent to the workers to be executed on each record in the categories RDD.
Assuming you know the number of categories is limited, which means the list of categories isn't too large to fit in your driver memory, you should collect the categories to driver, and iterate over that local collection using foreach:
val categoriesRdd: RDD[String] = df.select(CATEGORY).distinct().rdd.map(r => r(0).toString)
val categories: Seq[String] = categoriesRdd.collect()
Another improvement would be reusing the dataframe that you loaded instead of performing another query, using a filter for each category:
def executePipeline(singleCategoryDf: DataFrame) { /* ... */ }
categories.foreach(cat => {
  val filtered = df.filter(col(CATEGORY) === cat)
  executePipeline(filtered)
})
NOTE: to make sure the re-use of df doesn't reload it for every execution, make sure you cache() it before collecting the categories.

Write Parquet files from Spark RDD to dynamic folders

Given the following snippet (Spark version: 1.5.2):
which saves RDD data to flattened Parquet files, I would like my storage to have a structure like:
The data itself contains a country column and a timestamp one, so I started with this method. However, since I only have a timestamp in my data, I can't partition the whole thing by year/yearmonth/yearmonthday as those are not columns per se...
And this solution seemed pretty nice, except I can't get to adapt it to Parquet files...
Any idea?
I figured it out. In order for the path to be dynamically linked to the RDD, one first has to create a tuple from the rdd:
rdd.map(model => (model.country, model))
Then, the records will all have to be parsed, to retrieve the distinct countries:
val countries = rdd.map {
  case (country, model) => country
}.distinct().collect()
Now that the countries are known, the records can be written according to their distinct country:
countries.foreach { country =>
  val countryRDD = rdd.filter {
    case (c, model) => c == country
  }.map { case (_, model) => model }
  countryRDD.toDF().write.parquet(pathToStorage + "/" + country)
}
Of course, the whole collection has to be parsed twice, but it is the only solution I found so far.
Regarding the timestamp, you will just have to do the same process with a 3-tuple (the third being something like 20160214); I went with the current timestamp finally.
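For the date part of the path, one option is to derive the folder names from the timestamp when building the key. The sketch below is only an illustration: the country/year/yearmonth/yearmonthday layout, the use of UTC, and the names PartitionPathSketch and partitionPath are all assumptions, since the target structure is not shown in the question.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

object PartitionPathSketch {
  private val yearFmt = DateTimeFormatter.ofPattern("yyyy")
  private val monthFmt = DateTimeFormatter.ofPattern("yyyyMM")
  private val dayFmt = DateTimeFormatter.ofPattern("yyyyMMdd")

  // Builds base/country/yyyy/yyyyMM/yyyyMMdd from an epoch-millis timestamp (UTC).
  def partitionPath(base: String, country: String, epochMillis: Long): String = {
    val day = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC).toLocalDate
    Seq(base, country, day.format(yearFmt), day.format(monthFmt), day.format(dayFmt)).mkString("/")
  }
}
```

The resulting string can serve as the grouping key in place of the bare country, so the filter-and-write loop produces one folder per country and day.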

Distributed Map in Scala Spark

Does Spark support distributed Map collection types?
So if I have a HashMap[String,String] of key/value pairs, can it be converted to a distributed Map collection type? To access an element I could use "filter", but I doubt this performs as well as a Map lookup.
Since I found some new info I thought I'd turn my comments into an answer. @maasg already covered the standard lookup function; I would like to point out that you should be careful, because if the RDD's partitioner is None, lookup just uses a filter anyway. As for a (K,V) store on top of Spark, it looks like this is in progress, but a usable pull request has been made here. Here is an example usage.
import org.apache.spark.rdd.IndexedRDD
// Create an RDD of key-value pairs with Long keys.
val rdd = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
// Construct an IndexedRDD from the pairs, hash-partitioning and indexing
// the entries.
val indexed = IndexedRDD(rdd).cache()
// Perform a point update.
val indexed2 = indexed.put(1234L, 10873).cache()
// Perform a point lookup. Note that the original IndexedRDD remains
// unmodified.
indexed2.get(1234L) // => Some(10873)
indexed.get(1234L) // => Some(0)
// Efficiently join derived IndexedRDD with original.
val indexed3 = indexed.innerJoin(indexed2) { (id, a, b) => b }.filter(_._2 != 0)
indexed3.collect // => Array((1234L, 10873))
// Perform insertions and deletions.
val indexed4 = indexed2.put(-100L, 111).delete(Array(998L, 999L)).cache()
indexed2.get(-100L) // => None
indexed4.get(-100L) // => Some(111)
indexed2.get(999L) // => Some(0)
indexed4.get(999L) // => None
It seems like the pull request was well received and will probably be included in future versions of Spark, so it is probably safe to use that pull request in your own code. Here is the JIRA ticket in case you are curious.
The quick answer: Partially.
You can transform a Map[A,B] into an RDD[(A,B)] by first forcing the map into a sequence of (k,v) pairs, but by doing so you lose the constraint that the keys of a map must form a set, i.e. you lose the semantics of the Map structure.
From a practical perspective, you can still resolve an element into its corresponding value using kvRdd.lookup(element), but the result will be a sequence, given that you have no guarantee that there is a single value per key, as explained before.
A spark-shell example to make things clear:
val englishNumbers = Map(1 -> "one", 2 -> "two", 3 -> "three")
val englishNumbersRdd = sc.parallelize(englishNumbers.toSeq)
englishNumbersRdd.lookup(1)
res: Seq[String] = WrappedArray(one)
val spanishNumbers = Map(1 -> "uno", 2 -> "dos", 3 -> "tres")
val spanishNumbersRdd = sc.parallelize(spanishNumbers.toList)
val bilingueNumbersRdd = englishNumbersRdd union spanishNumbersRdd
bilingueNumbersRdd.lookup(1)
res: Seq[String] = WrappedArray(one, uno)
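The same semantics can be checked without a Spark shell. The following is a local sketch over a plain sequence of pairs; LookupSketch and its lookup function are written for this example and only imitate the filter-based behaviour of an RDD lookup on unpartitioned data.

```scala
object LookupSketch {
  // Returns every value stored under the key; duplicates are possible
  // because a sequence of pairs does not enforce unique keys.
  def lookup[K, V](pairs: Seq[(K, V)], key: K): Seq[V] =
    pairs.collect { case (k, v) if k == key => v }
}
```

A missing key yields an empty sequence rather than an error, and a key present in both inputs yields both values, which is why the result type is a sequence and not an Option.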