How to update the "bytes written" count in custom Spark data source? - scala

I created a Spark Data Source that uses the "older" DataSource V1 API to write data in a specific binary format that our measuring devices and some software require, i.e., my DefaultSource extends CreatableRelationProvider.
In the appropriate createRelation method I call my own custom method to write the data from the DataFrame passed in. I am doing this with the help of Hadoop's FileSystem API, initialized from the Hadoop Configuration one can pull out of the supplied DataFrame:
def createRelation(sqlContext: SQLContext,
                   mode: SaveMode,
                   parameters: Map[String, String],
                   data: DataFrame): BaseRelation = {
  val path = ... // taken from parameters; the real code has more preparation here, checking the save mode etc.
  MyCustomWriter.write(data, path)
  EchoingRelation(data) // small class that just wraps the data frame into a BaseRelation with TableScan
}
In MyCustomWriter I then do all sorts of things and, in the end, save the data as a side effect of map, mapPartitions and foreachPartition calls on the executors of the cluster, like this:
val confBytes = conf.toByteArray // implicit I wrote that turns Hadoop Writables into a byte array, as Configuration isn't serializable

data.
  select(...).
  where(...).
  // much more
  as[Foo].
  mapPartitions { it =>
    val conf = confBytes.toWritable[Configuration] // vice versa of toByteArray
    val writeResult = customWriteRecords(it, conf) // writes data to disk using the Hadoop FS API
    writeResult.iterator
  }.
  // do more stuff
While this approach works fine, I notice that when running this, the Output column in the Spark job UI is not updated. Is it somehow possible to propagate this information or do I have to wrap the data in Writables and use a Hadoop FileOutputFormat approach instead?

I found a hacky approach.
Inside an RDD/DF operation you can get OutputMetrics:
val metrics = TaskContext.get().taskMetrics().outputMetrics
This has the fields bytesWritten and recordsWritten. However, the setters are package-private to org.apache.spark.executor, so I created a "breakout object" in that package:
package org.apache.spark.executor

object OutputMetricsBreakout {
  def setRecordsWritten(outputMetrics: OutputMetrics,
                        recordsWritten: Long): Unit =
    outputMetrics.setRecordsWritten(recordsWritten)

  def setBytesWritten(outputMetrics: OutputMetrics,
                      bytesWritten: Long): Unit =
    outputMetrics.setBytesWritten(bytesWritten)
}
Then I can use:
val myBytesWritten = ... // calculate written bytes
OutputMetricsBreakout.setBytesWritten(metrics, myBytesWritten + metrics.bytesWritten)
This is a hack, but it's the only "simple" way I could come up with.
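For illustration, here is roughly how this can be wired into the mapPartitions call from the question. customWriteRecordsCounting is a hypothetical variant of my writer that also returns the number of bytes it wrote; the only parts that matter are the TaskContext lookup and the breakout calls.

import org.apache.spark.TaskContext
import org.apache.spark.executor.OutputMetricsBreakout

data.
  as[Foo].
  mapPartitions { it =>
    val conf = confBytes.toWritable[Configuration]
    // hypothetical: returns the written records plus the byte count for this partition
    val (writeResult, partitionBytes) = customWriteRecordsCounting(it, conf)
    val metrics = TaskContext.get().taskMetrics().outputMetrics
    // add to the existing counters instead of overwriting them
    OutputMetricsBreakout.setBytesWritten(metrics, metrics.bytesWritten + partitionBytes)
    OutputMetricsBreakout.setRecordsWritten(metrics, metrics.recordsWritten + writeResult.size)
    writeResult.iterator
  }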

Related

How to construct and persist a reference object per worker in a Spark 2.3.0 UDF?

In a Spark 2.3.0 Structured Streaming job I need to append a column to a DataFrame which is derived from the value of the same row of an existing column.
I want to define this transformation in a UDF and use withColumn to build the new DataFrame.
Doing this transform requires consulting a very-expensive-to-construct reference object -- constructing it once per record yields unacceptable performance.
What is the best way to construct and persist this object once per worker node so it can be referenced repeatedly for every record in every batch? Note that the object is not serializable.
My current attempts have revolved around subclassing UserDefinedFunction to add the expensive object as a lazy member, with an alternate constructor that does the initialization normally performed by the udf function. So far I've been unable to get it to do the kind of type coercion that udf does: some deep type inference wants objects of type org.apache.spark.sql.Column, while my transformation lambda works on a String for input and output.
Something like this:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.types._

class ExpensiveReference {
  def ExpensiveReference() = ... // Very slow
  def transformString(in: String) = ... // Fast
}

class PersistentValUDF(f: AnyRef, dataType: DataType, inputTypes: Option[Seq[DataType]])
    extends UserDefinedFunction(f: AnyRef, dataType: DataType, inputTypes: Option[Seq[DataType]]) {

  lazy val ExpensiveReference = new ExpensiveReference()

  def PersistentValUDF() {
    this(((in: String) => ExpensiveReference.transformString(in)): (String => String), StringType, Some(List(StringType)))
  }
}
The further I dig into this rabbit hole the more I suspect there's a better way to accomplish this that I'm overlooking. Hence this post.
Edit:
I tested initializing a reference lazily in an object declared inside the UDF; this triggers reinitialization. Example code and helper object:
class IntBox {
  var valu = 0
  def increment: Unit = {
    valu = valu + 1
  }
  def get: Int = valu
}

val altUDF = udf((input: String) => {
  object ExpensiveRef {
    lazy val box = new IntBox
    def transform(in: String): String = {
      box.increment
      in + box.get.toString
    }
  }
  ExpensiveRef.transform(input)
})
The above UDF always appends 1, so the lazy object is being reinitialized per record.
I found this post whose Option 1 I was able to turn into a workable solution. The end result ended up being similar to Jacek Laskowski's answer, but with a few tweaks:
Pull the object definition outside of the UDF's scope. Even being lazy, it will still reinitialize if it's defined in the scope of the UDF.
Move the transform function off of the object and into the UDF's lambda (required to avoid serialization errors).
Capture the object's lazy member in the closure of the UDF lambda.
Something like this:
object ExpensiveReference {
  lazy val ref = ...
}

val persistentUDF = udf((input: String) => {
  /* transform code that references ExpensiveReference.ref */
})
DISCLAIMER Let me have a go at this, but please consider it a work in progress (downvotes are a big no-no :))
What I'd do would be to use a Scala object with a lazy val for the expensive reference.
object ExpensiveReference {
  lazy val ref = ???

  def transform(in: String) = {
    // use ref here
  }
}
With the object, whatever you do on a Spark executor (be it part of a UDF or any other computation) is going to instantiate ExpensiveReference.ref at the very first access. You could access it directly or as part of transform.
Again, it does not really matter whether you do this in a UDF, a UDAF or any other transformation. The point is that once a computation happens on a Spark executor, the "very-expensive-to-construct reference object" is constructed only once, not once per record.
It could be in a UDF (just to make it clearer).
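To make this concrete, here is a minimal sketch of the whole wiring with withColumn; the ExpensiveReference class body, the df variable and the column names are placeholders based on the question, not a definitive implementation:

import org.apache.spark.sql.functions.{col, udf}

class ExpensiveReference {                       // very slow to construct (as in the question)
  def transformString(in: String): String = in   // fast per-record transform; real logic goes here
}

object ExpensiveReference {
  lazy val ref = new ExpensiveReference()        // built at most once per executor JVM, on first access
}

// the transform logic lives inside the UDF lambda; only ExpensiveReference.ref is accessed there
val persistentUDF = udf((in: String) => ExpensiveReference.ref.transformString(in))

val withDerived = df.withColumn("derived", persistentUDF(col("existing")))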

How to perform one operation on each executor once in Spark

I have a Weka model stored in S3 which is around 400 MB in size.
Now, I have a set of records on which I want to run the model and perform prediction.
To perform prediction, what I have tried is:
Download and load the model on the driver as a static object, broadcast it to all executors, and perform a map operation on the prediction RDD.
----> Not working: in Weka, the model object needs to be modified to perform prediction, and broadcast requires a read-only copy.
Download and load the model on the driver as a static object and send it to the executors in each map operation.
-----> Working (not efficient, as I am passing a 400 MB object in each map operation).
Download the model on the driver, load it on each executor, and cache it there. (I don't know how to do that.)
Does someone have any idea how I can load the model on each executor once and cache it, so that I don't load it again for other records?
You have two options:
1. Create a singleton object with a lazy val representing the data:
object WekaModel {
  lazy val data = {
    // initialize data here. This will only happen once per JVM process.
  }
}
Then, you can use the lazy val in your map function. The lazy val ensures that each worker JVM initializes its own instance of the data. No serialization or broadcasting is performed for data.
elementsRDD.map { element =>
  // use WekaModel.data here
}
Advantages
It is more efficient, as it allows you to initialize your data once per JVM instance. This approach is a good choice when you need to initialize something like a database connection pool, for example.
Disadvantages
Less control over initialization. For example, it's trickier to initialize your object if you require runtime parameters.
You can't really free up or release the object if you need to. Usually, that's acceptable, since the OS will free up the resources when the process exits.
2. Use the mapPartitions (or foreachPartition) method on the RDD instead of just map.
This allows you to initialize whatever you need for the entire partition.
elementsRDD.mapPartitions { elements =>
  val model = new WekaModel()
  elements.map { element =>
    // use model and element. There is a single instance of model per partition.
  }
}
Advantages:
Provides more flexibility in the initialization and deinitialization of objects.
Disadvantages
Each partition will create and initialize a new instance of your object. Depending on how many partitions you have per JVM instance, it may or may not be an issue.
Here's what worked for me even better than the lazy initializer. I created an object-level pointer initialized to null, and let each executor initialize it. In the initialization block you can have run-once code. Note that each processing batch will reset local variables but not the object-level ones.
object Thing1 {
  var bigObject: BigObject = null

  def main(args: Array[String]): Unit = {
    val sc = <spark/scala magic here>
    sc.textFile(infile).map(line => {
      if (bigObject == null) {
        // this takes a minute but runs just once
        bigObject = new BigObject(parameters)
      }
      bigObject.transform(line)
    })
  }
}
This approach creates exactly one big object per executor, rather than the one big object per partition of other approaches.
If you put the var bigObject: BigObject = null within the main function's namespace, it behaves differently. In that case, it runs the bigObject constructor at the beginning of each partition (i.e., batch). If you have a memory leak, this will eventually kill the executor. Garbage collection would also need to do more work.
Here is what we usually do:
Define a singleton client, so only one client is present in each executor.
Provide a getOrCreate method that creates or fetches the client. If you have a common serving platform that serves multiple different models, you can use something like a ConcurrentMap with computeIfAbsent to keep it thread-safe.
Call getOrCreate inside an RDD-level operation such as transform or foreachPartition, so the initialization happens at the executor level (see the sketch below).
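A rough sketch of that pattern, assuming a hypothetical ModelClient trait and loadModel helper (these names are illustrative, not from any library):

import java.util.concurrent.ConcurrentHashMap

// hypothetical client type; replace with your real model wrapper
trait ModelClient { def predict(record: String): Double }

object ModelClients {
  private val clients = new ConcurrentHashMap[String, ModelClient]()

  // expensive: download from S3, deserialize the model, etc. (placeholder)
  private def loadModel(name: String): ModelClient = ???

  // thread-safe get-or-create: each executor JVM builds a given client at most once
  def getOrCreate(name: String): ModelClient =
    clients.computeIfAbsent(name, new java.util.function.Function[String, ModelClient] {
      override def apply(n: String): ModelClient = loadModel(n)
    })
}

// called from executor-side code, for example:
// records.foreachPartition { it =>
//   val client = ModelClients.getOrCreate("weka-model")
//   it.foreach(record => client.predict(record))
// }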
You can achieve this by broadcasting a case object with a lazy val as follows:
case object localSlowTwo {lazy val value: Int = {Thread.sleep(1000); 2}}
val broadcastSlowTwo = sc.broadcast(localSlowTwo)
(1 to 1000).toDS.repartition(100).map(_ * broadcastSlowTwo.value.value).collect
The event timeline for this on three executors with three threads each shows the initialization happening once per executor (screenshot omitted here). Running the last line again from the same spark-shell session does not trigger any further initialization.
This works for me, and it is thread-safe if you use a singleton with synchronized, as shown below:
object singletonObj {
  var data: dataObj = null

  def getDataObj(): dataObj = this.synchronized {
    if (this.data == null) {
      this.data = new dataObj()
    }
    this.data
  }
}

object app {
  def main(args: Array[String]): Unit = {
    lazy val mydata: dataObj = singletonObj.getDataObj()
    df.map(x => { functionA(mydata) })
  }
}

What is the best practice for reading/parsing Spark-generated files which were saved using saveAsTextFile?

As you know, if you use saveAsTextFile on an RDD[(String, Int)], the output looks like this:
(T0000036162,1747)
(T0000066859,1704)
(T0000043861,1650)
(T0000075501,1641)
(T0000071951,1638)
(T0000075623,1638)
(T0000070102,1635)
(T0000043868,1627)
(T0000094043,1626)
You may want to use this file in Spark again, so what is the best practice for reading and parsing it? Should it be something like the following, or is there a more elegant way?
val lines = sc.textFile("result/hebe")

case class Foo(id: String, count: Long)

val parsed = lines
  .map(l => l.stripPrefix("(").stripSuffix(")").split(","))
  .map(l => Foo(id = l(0), count = l(1).toLong))
It depends what you're looking for.
If you want something pretty I'd consider adding a factory apply method to Foo's companion object which takes a single string, so you could have something like
lines.map(l => Foo(l))
And Foo would look like
case class Foo(id: String, count: Long)

object Foo {
  def apply(l: String): Foo = {
    val split = l.stripPrefix("(").stripSuffix(")").split(",")
    Foo(split(0), split(1).toLong)
  }
}
If you have no requirement to output the data like that then I'd consider saving it as a sequence file.
If performance isn't an issue then it's fine. I'd just say the most important thing is to isolate the text parsing, so that later you can unit test it, come back to it, and easily edit it.
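For instance, a minimal sketch of isolating the parsing into a plain function (reusing the Foo case class from the question) so it can be unit tested without a SparkContext:

case class Foo(id: String, count: Long)

object FooParser {
  // pure function: easy to unit test and to adjust later without touching the Spark job
  def parse(line: String): Foo = {
    val fields = line.stripPrefix("(").stripSuffix(")").split(",")
    Foo(fields(0), fields(1).toLong)
  }
}

// in the Spark job:
// val parsed = sc.textFile("result/hebe").map(FooParser.parse)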
You should either save it as a DataFrame, which will use the case class as a schema (allowing you to easily parse it back into Spark), or you should map out the individual components of your RDD before saving (removing the brackets, since they only make the file larger):
yourRDD.toDF("id","count").saveAsParquetFile(path)
When you load the DataFrame back in, you can map the rows to get it back into an RDD if you want:
val rddInput = input.rdd.map(x => (x.getAs[String]("id"), x.getAs[Int]("count")))
If you prefer to store as a text file, you could consider mapping the elements without the brackets:
yourRDD.map(x => s"${x._1}, ${x._2}")
The best way is to write DataFrames directly to files instead of RDDs.
Code that writes the files:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = rdd.toDF()
df.write.parquet("dir")
Code that reads the files:
val rdd = sqlContext.read.parquet("dir").rdd.map(row => (row.getString(0), row.getLong(1)))
Before calling saveAsTextFile, map each tuple to a comma-separated string:
rdd.map(x => x.productIterator.mkString(",")).saveAsTextFile(path)
The output will not have brackets and will look like this:
T0000036162,1747
T0000066859,1704

When exactly a Spark task can be serialized?

I read some related questions about this topic, but still cannot understand the following. I have this simple Spark application which reads some JSON records from a file:
import org.apache.spark.{SparkConf, SparkContext}
// json4s imports (the Jackson backend is assumed here)
import org.json4s._
import org.json4s.jackson.JsonMethods._

object Main {
  // implicit val formats = DefaultFormats // OK: here it works

  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("Spark Test App")
    val sc = new SparkContext(conf)
    val input = sc.textFile("/home/alex/data/person.json")

    implicit val formats = DefaultFormats // Exception: Task not serializable

    val persons = input.flatMap { line ⇒
      // implicit val formats = DefaultFormats // OK: here it also works
      try {
        val json = parse(line)
        Some(json.extract[Person])
      } catch {
        case e: Exception ⇒ None
      }
    }
  }
}
I suppose the implicit formats is not serializable since it includes a ThreadLocal for the date format. But why does it work when placed as a member of the object Main or inside the closure of flatMap, and not as a plain val inside the main function?
Thanks in advance.
If formats is inside the flatMap, it's only created as part of executing the mapping function. So the mapper can be serialized and sent to the cluster, since it doesn't contain a formats yet. The flip side is that this creates formats anew every time the mapper runs (i.e. once for every row); you might prefer to use mapPartitions rather than flatMap so that the value is created once for each partition.
If formats is outside the flatMap then it's created once on the master machine, and you're attempting to serialize it and send it to the cluster.
I don't understand why formats as a field of Main would work. Maybe objects are magically pseudo-serializable because they're singletons (i.e. their fields aren't actually serialized, rather the fact that this is a reference to the single static Main instance is serialized)? That's just a guess though.
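A minimal sketch of the mapPartitions variant suggested above, which creates formats once per partition instead of once per row:

val persons = input.mapPartitions { lines =>
  implicit val formats = DefaultFormats // created on the executor, once per partition; never serialized
  lines.flatMap { line =>
    try {
      Some(parse(line).extract[Person])
    } catch {
      case e: Exception => None
    }
  }
}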
The best way to answer your question, I think, is with three short answers:
1) Why does it work when placed as a member of the object Main? The code works because it's inside an object, not necessarily the Main object. Why? Because Spark serializes your whole object and sends it to each of the executors. Moreover, an object in Scala is compiled like a Java static class, and the initial values of static fields are stored in the jar, so the workers can use them directly. This is not the same if you use a class instead of an object.
2) Why does it work if it's inside a flatMap? When you run transformations on an RDD (filter, flatMap, etc.), your transformation code is serialized on the driver node, sent to the workers, and once there it is deserialized and executed. As you can see, exactly as in 1), the code is serialized "automatically".
3) Why does this not work as a plain val inside the main function? Because the val is not serialized "automatically", but you can test it like this: val yourVal = new yourVal with Serializable

How do I map a java.util.List[Array[String]] to a Scala vector?

Background
I have a java.util.List[Array[String]] called rawdata coming directly from opencsv's CSVReader
val reader = new CSVReader(new FileReader("foobar.csv"))
val rawdata = reader.readAll()
Currently, I'm looping through rawdata and grabbing rawdata.get(i)(4) and rawdata.get(i)(5) for fields 4 and 5 in record i where i goes from 0 to 99,999.
Problem
Instead, I would like to map rawdata into a Vector[Record] where the constructor for Record takes fields 4 and 5 from above. There are 100,000 records in rawdata.
This is where I hit a bit of cognitive dissonance because Vector is immutable, but java.util.List[Array[String]] requires that I loop through it (there is no map for me to call, AFAIK)...
Question
How do I map the java.util.List[Array[String]] to Vector[Record]?
Scala provides a set of conversions from Java collections, which you can use like this:
import scala.collection.JavaConverters._
val records = rawdata.asScala.toVector.map(toRecord)
Where toRecord is some method like the following:
def toRecord(fields: Array[String]) = Record(fields(4), fields(5))
You could also perform the mapping operation with a function literal:
val records = rawdata.asScala.toVector.map { fields =>
  Record(fields(4), fields(5))
}
Both of these versions will convert the java.util.List to a scala.collection.mutable.Buffer, then to a Vector, and then perform the mapping operation. You could save one intermediate collection like this:
val records: Vector[Record] = rawdata.asScala.map(toRecord)(collection.breakOut)
Or you could convert to an iterator on the Java side:
val records = rawdata.iterator.asScala.map(toRecord).toVector
The simplest version's probably best, though, unless you're sure this is a bottleneck in your program.
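Putting it together, a small self-contained sketch; the two String fields on Record (columns 4 and 5) are an assumption, so adjust them to your actual class:

import scala.collection.JavaConverters._

case class Record(fieldFour: String, fieldFive: String) // assumed shape

def toRecord(fields: Array[String]): Record = Record(fields(4), fields(5))

val records: Vector[Record] = rawdata.asScala.toVector.map(toRecord)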