Batching within an Apache Spark RDD map - classification

I have a situation where an underlying function operates significantly more efficiently when given batches to work on. I have existing code like this:
// subjects: RDD[Subject]
val subjects = Subject.load(job, sparkContext, config)
val classifications = subjects.flatMap(subject => classify(subject)).reduceByKey(_ + _)
classifications.saveAsTextFile(config.output)
The classify method works on single elements but would be more efficient operating on groups of elements. I considered using coalesce to split the RDD into chunks and acting on each chunk as a group; however, there are two problems with this:
I'm not sure how to return the mapped RDD.
classify doesn't know in advance how big the groups should be and it varies based on the contents of the input.
Sample code showing how classify could be called in an ideal situation (the output handling is kludgy, since it can't spill for very large inputs):
def classifyRdd(subjects: RDD[Subject]): RDD[(String, Long)] = {
  val classifier = new Classifier
  subjects.foreach(subject => classifier.classifyInBatches(subject))
  classifier.classifyRemaining
  classifier.results
}
This way classifyInBatches can have code like this internally:
def classifyInBatches(subject: Subject) {
  if (!internals.canAdd(subject)) {
    partialResults.add(internals.processExisting)
  }
  internals.add(subject) // Assumption: at least one will fit.
}
What can I do in Apache Spark that will allow behavior somewhat like this?

Try using the mapPartitions method, which allows your map function to consume a partition as an iterator and produce an iterator of output.
You should be able to write something like this:
subjectsRDD.mapPartitions { subjects =>
  val classifier = new Classifier
  subjects.foreach(subject => classifier.classifyInBatches(subject))
  classifier.classifyRemaining
  classifier.results
}

Related

How to update the "bytes written" count in custom Spark data source?

I created a Spark data source that uses the "older" DataSource V1 API to write data in a specific binary format that our measuring devices and some software require, i.e., my DefaultSource extends CreatableRelationProvider.
In the appropriate createRelation method I call my own custom method to write the data from the DataFrame passed in. I am doing this with the help of Hadoop's FileSystem API, initialized from the Hadoop Configuration one can pull out of the supplied DataFrame:
def createRelation(sqlContext: SQLContext,
                   mode: SaveMode,
                   parameters: Map[String, String],
                   data: DataFrame): BaseRelation = {
  val path = ... // get from parameters; the real code has more preparation here, checking save mode etc.
  MyCustomWriter.write(data, path)
  EchoingRelation(data) // small class that just wraps the data frame into a BaseRelation with TableScan
}
In MyCustomWriter I then do all sorts of things, and in the end I save the data as a side effect of map, mapPartitions and foreachPartition calls on the cluster's executors, like this:
val confBytes = conf.toByteArray // implicit I wrote turning Hadoop Writables to Byte Array, as Configuration isn't serializable

data.
  select(...).
  where(...).
  // much more
  as[Foo].
  mapPartitions { it =>
    val conf = confBytes.toWritable[Configuration] // vice-versa like toByteArray
    val writeResult = customWriteRecords(it, conf) // writes data to the disk using Hadoop FS API
    writeResult.iterator
  }.
  // do more stuff
// do more stuff
While this approach works fine, I notice that when running this, the Output column in the Spark job UI is not updated. Is it somehow possible to propagate this information or do I have to wrap the data in Writables and use a Hadoop FileOutputFormat approach instead?
I found a hacky approach.
Inside an RDD/DataFrame operation you can get the OutputMetrics:
val metrics = TaskContext.get().taskMetrics().outputMetrics
This has the fields bytesWritten and recordsWritten. However, the setters are package-private to org.apache.spark.executor, so I created a "breakout object" in that package:
package org.apache.spark.executor

object OutputMetricsBreakout {
  def setRecordsWritten(outputMetrics: OutputMetrics,
                        recordsWritten: Long): Unit =
    outputMetrics.setRecordsWritten(recordsWritten)

  def setBytesWritten(outputMetrics: OutputMetrics,
                      bytesWritten: Long): Unit =
    outputMetrics.setBytesWritten(bytesWritten)
}
Then I can use:
val myBytesWritten = ... // calculate written bytes
OutputMetricsBreakout.setBytesWritten(metrics, myBytesWritten + metrics.bytesWritten)
This is a hack but the only "simple" way I could come up with.
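For completeness, a hedged sketch of how the breakout object could be wired into the mapPartitions write from the question; preparedData stands for the Dataset[Foo] built up there, and the bytesWritten/recordsWritten fields on writeResult are hypothetical placeholders for however your writer reports what it wrote:
import org.apache.hadoop.conf.Configuration
import org.apache.spark.TaskContext
import org.apache.spark.executor.OutputMetricsBreakout

val written = preparedData.mapPartitions { it =>
  val conf = confBytes.toWritable[Configuration]
  val writeResult = customWriteRecords(it, conf)

  // Bump this task's output metrics so the "Output" column in the UI reflects the side-effect write.
  val metrics = TaskContext.get().taskMetrics().outputMetrics
  OutputMetricsBreakout.setBytesWritten(metrics, metrics.bytesWritten + writeResult.bytesWritten)
  OutputMetricsBreakout.setRecordsWritten(metrics, metrics.recordsWritten + writeResult.recordsWritten)

  writeResult.iterator
}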

How to use combineByKey on dataframe

I am trying to achieve secondary sorting in Spark. To be precise, for all events of a user session, I want to sort them by timestamp. After the secondary sort, I need to iterate through each event of a session to implement some business logic. I am doing it as follows:
def createCombiner = (row: Row) => Array(row)

def mergeValue = (rows: Array[Row], row: Row) => {
  rows :+ row
}

def mergeCombiner = (rows1: Array[Row], rows2: Array[Row]) => rows1 ++ rows2

def attribute(eventsList: List[Row]): List[Row] = {
  for (row: Row <- eventsList) yield {
    // some logic, then return the (possibly transformed) row
    row
  }
}

var groupedAndSortedRows = rawData.rdd.map(row => {
  (row.getAs[String]("session_id"), row)
}).combineByKey(createCombiner, mergeValue, mergeCombiner)
  .mapValues(_.toList.sortBy(_.getAs[String]("client_ts")))
  .mapValues(attribute)
But I fear this is not the most time-efficient way to do this, as converting to an RDD requires deserialization and serialization, which I believe is not needed when working with DataFrames/Datasets.
I am not sure if there is an aggregator function that returns the entire row:
rawData.groupBy("session_id").someAggregateFunction()
I want someAggregateFunction() to return a list of Rows. I do not want to aggregate on specific columns; I want the list of entire Rows corresponding to a session_id. Is it possible to do this?
The answer is yes, but it may not be what you expect. Depending on how complicated your business logic is, there are two alternatives to combineByKey.
If you only need mean, min, max or another function already defined in spark.sql.functions (https://github.com/apache/spark/blob/v2.0.2/sql/core/src/main/scala/org/apache/spark/sql/functions.scala), you can certainly use groupBy(...).agg(...). I guess that's not your case, and writing your own UDAF is no better than combineByKey unless the business logic is common enough to be reused across other datasets.
If you need slightly more complicated logic, you can use window functions. Define a window spec with Window.partitionBy($"session_id").orderBy($"client_ts".desc) and you can easily implement topN, moving averages, ntile, etc. See https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html; you can also implement a custom window aggregation function yourself.
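For illustration, a hedged sketch of both variants, assuming the session_id and client_ts columns from the question, hypothetical payload columns (event_type, payload), and a SparkSession named spark:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Number each session's events in timestamp order (useful for topN, lead/lag, running aggregates).
val w = Window.partitionBy($"session_id").orderBy($"client_ts")
val numbered = rawData.withColumn("event_seq", row_number().over(w))

// Or collect each session's events into a single row, sorted by timestamp,
// so the business logic can iterate over them without leaving the DataFrame API.
val sessions = rawData
  .groupBy($"session_id")
  .agg(sort_array(collect_list(struct($"client_ts", $"event_type", $"payload"))).as("events"))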

How to filter RDD relying on hash map?

I'm new to Spark and Scala, but I have to solve the following problem:
I have one ORC file containing rows which I have to check against a certain condition coming from a hash map.
I build the hash map (filename -> timestamp) with 120,000 entries this way (getTimestamp returns an Option[Long]):
val tgzFilesRDD = sc.textFile("...")
val fileNameTimestampRDD = tgzFilesRDD.map(itr => {
  (itr, getTimestamp(itr))
})
val fileNameTimestamp = fileNameTimestampRDD.collect.toMap
And retrieve the RDD with 6 million entries like this:
val sessionDataDF = sqlContext.read.orc("...")
case class SessionEvent(archiveName: String, eventTimestamp: Long)
val sessionEventsRDD = sessionDataDF.as[SessionEvent].rdd
And do the check:
val sessionEventsToReport = sessionEventsRDD.filter(se => {
  val timestampFromFile = fileNameTimestamp.getOrElse(se.archiveName, None)
  se.eventTimestamp < timestampFromFile.getOrElse[Long](Long.MaxValue)
})
Is this the right and performant way to do it? Is caching recommended?
Will the Map fileNameTimestamp get shipped to the cluster nodes where the partitions are processed?
fileNameTimestamp will get serialized for each task, and with 120,000 entries, it may be quite expensive. You should broadcast large objects and reference the broadcast variables:
val fileNameTimestampBC = sc.broadcast(fileNameTimestampRDD.collect.toMap)
Now only one copy of this object will be shipped to each worker. There is also no need to drop down to the RDD API, as the Dataset API has a filter method:
val sessionEvents = sessionDataDF.as[SessionEvent]
val sessionEventsToReport = sessionEvents.filter(se => {
  val timestampFromFile = fileNameTimestampBC.value.getOrElse(se.archiveName, None)
  se.eventTimestamp < timestampFromFile.getOrElse[Long](Long.MaxValue)
})
The fileNameTimestamp Map you collected exists on the driver. For it to be referenced efficiently like this in a query, the worker nodes need access to it, and that is what broadcasting does.
In essence, you have rediscovered the broadcast hash join: you are left-joining sessionEventsRDD with tgzFilesRDD to gain access to the optional timestamp, and then filtering accordingly.
When using RDDs, you need to code the join strategy explicitly. The DataFrame/Dataset API has a query optimiser that can make that choice for you, and you can also explicitly ask it to use the broadcast join technique above behind the scenes. You can find examples for both approaches here.
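As an illustration (untested against your data), a hedged sketch of the explicit broadcast-join variant with the DataFrame/Dataset API, keeping the question's semantics of "keep the event when no file timestamp is known":
import org.apache.spark.sql.functions.broadcast
import sqlContext.implicits._

// Turn the small (filename, Option[Long]) RDD into a DataFrame, dropping entries without a timestamp.
val fileTimestamps = fileNameTimestampRDD
  .flatMap { case (name, tsOpt) => tsOpt.map(ts => (name, ts)) }
  .toDF("archiveName", "fileTimestamp")

val sessionEventsToReport = sessionDataDF.as[SessionEvent]
  .join(broadcast(fileTimestamps), Seq("archiveName"), "left")
  .where($"fileTimestamp".isNull || $"eventTimestamp" < $"fileTimestamp")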
Let me know if this is clear enough :)

RDD Remove elements by key

I have 2 RDDs that are pulled in with the following code:
val fileA = sc.textFile("fileA.txt")
val fileB = sc.textFile("fileB.txt")
I then map and reduce them by key:
val countsB = fileB.flatMap(line => line.split("\n"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

val countsA = fileA.flatMap(line => line.split("\n"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
I now want to find and remove all keys in countsB if the key exists in countsA.
I have tried something like:
countsB.keys.foreach(b => {
  if (countsB.collect().exists(_ == b)) {
    countsB.collect().drop(countsB.collect().indexOf(b))
  }
})
but it doesn't seem to remove them by key.
There are 3 issues with your suggested code:
You are collecting the RDDs, which means they are not RDDs anymore; they are copied into the driver application's memory as plain Scala collections, so you lose Spark's parallelism and risk OutOfMemory errors if your dataset is large
When calling drop on an immutable Scala collection (or an RDD), you don't change the original collection; you get a new collection with those records dropped, so you can't expect the original collection to change
You cannot access an RDD within a function passed to any of the RDDs higher-order methods (e.g. foreach in this case) - any function passed to these method is serialized and sent to workers, and RDDs are (intentionally) not serializable - it makes no sense to fetch them into driver memory, serialize them, and send back to workers - the data is already distributed on the workers!
To solve all these - when you want to use one RDD's data to transform/filter another one, you usually want to use some type of join. In this case you can do:
// left join, and keep only records for which there was NO match in countsA:
countsB.leftOuterJoin(countsA).collect { case (key, (valueB, None)) => (key, valueB) }
NOTE that this collect that I'm using here isn't the collect you used - this one takes a PartialFunction as an argument, and behaves like a combination of map and filter, and most importantly: it doesn't copy all data into driver memory.
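To make the distinction concrete, a small sketch of the two collect variants (assuming a SparkContext named sc):
import org.apache.spark.rdd.RDD

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Action: pulls every element into the driver as a local Array.
val local: Array[(String, Int)] = pairs.collect()

// Transformation: map + filter via a PartialFunction; the data stays distributed.
val onlyA: RDD[Int] = pairs.collect { case ("a", v) => v }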
EDIT: as The Archetypal Paul commented - you have a much shorter and nicer option - subtractByKey:
countsB.subtractByKey(countsA)

How do you parallelize RDD / DataFrame creation in Spark?

Say I have a Spark job that looks like the following:
def loadTable1() {
  val table1 = sqlContext.jsonFile(s"s3://textfiledirectory/")
  table1.cache().registerTempTable("table1")
}

def loadTable2() {
  val table2 = sqlContext.jsonFile(s"s3://testfiledirectory2/")
  table2.cache().registerTempTable("table2")
}

def loadAllTables() {
  loadTable1()
  loadTable2()
}

loadAllTables()
How do I parallelize this Spark job so that both tables are created at the same time?
You don't need to parallelize it. The RDD/DataFrame creation operations don't do anything by themselves; these data structures are lazy, so any actual calculation only happens when you start using them. And when a Spark calculation does happen, it is automatically parallelized partition by partition across the executors, so you would not generally gain anything by introducing further parallelism.
Use Futures!
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))

def loadAllTables() {
  // Fire-and-forget; keep and Await these Futures if callers need the tables ready before continuing.
  Future { loadTable1() }
  Future { loadTable2() }
}
You can do this with standard Scala threading mechanisms. Personally, I'd build a list of (path, table name) pairs and parallel-map over that, as sketched below. You could also look at futures or standard threads.
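A hedged sketch of that list-of-pairs idea using a Scala parallel collection (the paths are the ones from the question; on Scala 2.13 the .par conversion lives in the separate scala-parallel-collections module):
// Each (path, table name) pair is loaded and registered concurrently via .par.
val tables = Seq(
  "s3://textfiledirectory/"  -> "table1",
  "s3://testfiledirectory2/" -> "table2"
)

tables.par.foreach { case (path, name) =>
  sqlContext.jsonFile(path).cache().registerTempTable(name)
}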