Spark mapPartitions not working in yarn-cluster mode - Scala

I am running a Spark Scala program that performs text scanning on an input file. I am trying to achieve parallelism by using rdd.mapPartitions. Inside the mapPartitions section I perform a few checks and call the map function to achieve parallel execution for each partition. Inside the map function I call a custom method where I perform the scanning and send the results back.
The code works fine when I submit it using --master local[*], but it does not work when I submit it using --master yarn-cluster. It runs without any error, but the call never gets inside the mapPartitions block itself. I verified this by placing a few println statements.
Please help me with your suggestions.
Here is the sample code:
def main(args: Array[String]) {
  val inputRdd = sc.textFile(inputFile, 2)
  val resultRdd = inputRdd.mapPartitions { iter =>
    println("Inside scanning method..")
    var scanEngine = ScanEngine.getInstance();
    ...
    ....
    ....
    var mapresult = iter.map { y =>
      line = y
      val last = line.lastIndexOf("|");
      message = line.substring(last + 1, line.length());
      getResponse(message)
    }
  }
  val finalRdd = sc.parallelize(resultRdd.map(x => x.trim()))
  finalRdd.coalesce(1, true).saveAsTextFile(hdfsOutpath)
}

def getResponse(input: String): String = {
  var result = "";
  val rList = new ListBuffer[String]();
  try {
    //logic here
  }
  return result;
}
def getResponse(input: String): String = {
var result = "";
val rList = new ListBuffer[String]();
try {
//logic here
}
return result;
}

If your evidence of it working is seeing "Inside scanning method.." printed out, it won't show up on the driver's console when run on the cluster, because that code is executed by the workers, not the driver.
You're going to have to go over the code in forensic detail, with an open mind and try to find why the job has no output. Usually when a job works on local mode but not on a cluster it is because of some subtlety in where the code is executed, or where output is recorded.
There's too much clipped code to provide a more specific answer.
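If you need confirmation that the partition code actually runs on the cluster, one option is a minimal sketch with an accumulator, assuming the Spark 1.x accumulator API and the names inputRdd, getResponse and hdfsOutpath from the question; its value is readable on the driver after an action, unlike executor-side println output:

val processedLines = sc.accumulator(0L, "processedLines") // readable on the driver

val resultRdd = inputRdd.mapPartitions { iter =>
  iter.map { line =>
    processedLines += 1L                  // incremented on the executors
    val last = line.lastIndexOf("|")
    getResponse(line.substring(last + 1))
  }
}

resultRdd.saveAsTextFile(hdfsOutpath)     // the action triggers execution
println("Lines processed on executors: " + processedLines.value)

The executor-side println output itself ends up in the executor logs (for example in the YARN application logs), not on the driver's console.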

Spark achieves parallelism with the map function as well as with mapPartitions. The number of partitions determines the amount of parallelism, and each partition executes independently whether or not you use mapPartitions.
There are only a few reasons to use mapPartitions over map; for example, when a function has a high initialization cost that you want to pay once per partition and then amortize over many calls, such as doing some NLP task on text.
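For instance, a minimal sketch of that pattern, reusing the ScanEngine from the question above (the scan call is hypothetical):

val responses = inputRdd.mapPartitions { iter =>
  val scanEngine = ScanEngine.getInstance() // expensive setup, paid once per partition
  iter.map(line => scanEngine.scan(line))   // hypothetical per-record call, cost amortized
}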

Related

Writing unit test for a spark scala function that takes no arguments?

I am writing unit tests for Scala functions where I pass a mocked Spark DataFrame to the function and then use assertSmallDataFrameEquality(actualDF, expectedDF) to check whether my function transforms the data correctly.
Recently I came across a function that takes no arguments and returns a Column. Since it does not expect any argument, how should I write a test case for this function? My function is given below.
def arriveDateMinusRuleDays: Column = {
  expr(s"date_sub(${Columns.ARRIVE_DATE},${Columns.RULE_DAYS})")
}
The test blueprint is written below:
test("arrive date minus rule days") {
  import spark.implicits._
  val today = Date.valueOf(LocalDate.now)
  val inputDF = Seq(
    (Y, today, 0, 80852),
    (S, today, 1, 18851))
    .toDF(FLAG, ARRIVE_DT, RULE_DAYS, ITEM_NBR)
  val actualOutput = DataAggJob.arriveDateMinusRuleDays() // How to pass my column values to this function
  // val expectedOutput
  assertmethod(actualOutput, expectedOutput)
  // print(actualOutput)
}
You don't need to test each individual function. The purpose of a unit test is to assert the contract between the implementation and its downstream consumer, not implementation details.
If your job returns the expected output given the input, then it is working correctly, regardless of what that particular function is doing. The function should really just be made private to avoid confusion.
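As a hedged sketch of what that looks like in practice (DataAggJob.transform, the expected values and the output column name are assumptions, not the actual job; the imports and helpers are the same as in the blueprint above), the test exercises the public transformation and compares whole DataFrames:

test("job subtracts rule days from arrive date") {
  import spark.implicits._

  val todayLocal = LocalDate.now
  val today = Date.valueOf(todayLocal)
  val inputDF = Seq(
    ("Y", today, 0, 80852),
    ("S", today, 1, 18851))
    .toDF("FLAG", "ARRIVE_DT", "RULE_DAYS", "ITEM_NBR")

  // Call the public entry point rather than the private column helper.
  val actualDF = DataAggJob.transform(inputDF)               // assumed entry point

  val expectedDF = Seq(
    ("Y", today, 0, 80852, today),
    ("S", today, 1, 18851, Date.valueOf(todayLocal.minusDays(1))))
    .toDF("FLAG", "ARRIVE_DT", "RULE_DAYS", "ITEM_NBR", "ARRIVE_DT_MINUS_RULE_DAYS")

  assertSmallDataFrameEquality(actualDF, expectedDF)         // helper from the question
}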

How to update the "bytes written" count in custom Spark data source?

I created a Spark data source that uses the "older" DataSource V1 API to write data in a specific binary format that our measuring devices and some software require, i.e., my DefaultSource extends CreatableRelationProvider.
In the appropriate createRelation method I call my own custom method to write the data from the DataFrame passed in. I do this with the help of Hadoop's FileSystem API, initialized from the Hadoop Configuration that one can pull out of the supplied DataFrame:
def createRelation(sqlContext: SQLContext,
                   mode: SaveMode,
                   parameters: Map[String, String],
                   data: DataFrame): BaseRelation = {
  val path = ... // get from parameters; in reality there is more preparation code here, checking save mode etc.
  MyCustomWriter.write(data, path)
  EchoingRelation(data) // small class that just wraps the data frame into a BaseRelation with TableScan
}
In MyCustomWriter I then do all sorts of things, and in the end I save the data as a side effect of map, mapPartitions and foreachPartition calls on the executors of the cluster, like this:
val confBytes = conf.toByteArray // implicit I wrote turning Hadoop Writables into a byte array, as Configuration isn't serializable

data.
  select(...).
  where(...).
  // much more
  as[Foo].
  mapPartitions { it =>
    val conf = confBytes.toWritable[Configuration] // vice-versa of toByteArray
    val writeResult = customWriteRecords(it, conf) // writes data to disk using the Hadoop FS API
    writeResult.iterator
  }.
  // do more stuff
// do more stuff
While this approach works fine, I notice that when running this, the Output column in the Spark job UI is not updated. Is it somehow possible to propagate this information or do I have to wrap the data in Writables and use a Hadoop FileOutputFormat approach instead?
I found a hacky approach.
Inside an RDD/DataFrame operation you can get the OutputMetrics:
val metrics = TaskContext.get().taskMetrics().outputMetrics
This has the fields bytesWritten and recordsWritten. However, the setters are package-private to org.apache.spark.executor. So I created a "breakout object" in that package:
package org.apache.spark.executor

object OutputMetricsBreakout {
  def setRecordsWritten(outputMetrics: OutputMetrics, recordsWritten: Long): Unit =
    outputMetrics.setRecordsWritten(recordsWritten)

  def setBytesWritten(outputMetrics: OutputMetrics, bytesWritten: Long): Unit =
    outputMetrics.setBytesWritten(bytesWritten)
}
Then I can use:
val myBytesWritten = ... // calculate written bytes
OutputMetricsBreakout.setBytesWritten(metrics, myBytesWritten + metrics.bytesWritten)
This is a hack, but it is the only "simple" way I could come up with.
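For completeness, a hedged sketch of how this can be combined with the mapPartitions write shown above (the writeResult fields for the byte and record counts are assumptions about the custom writer):

import org.apache.spark.TaskContext
import org.apache.spark.executor.OutputMetricsBreakout

data.as[Foo].mapPartitions { it =>
  val conf = confBytes.toWritable[Configuration]   // as in the question
  val writeResult = customWriteRecords(it, conf)   // assumed to report what it wrote

  // Update the task's output metrics so the Spark UI shows the written output.
  val metrics = TaskContext.get().taskMetrics().outputMetrics
  OutputMetricsBreakout.setBytesWritten(metrics, metrics.bytesWritten + writeResult.bytes)
  OutputMetricsBreakout.setRecordsWritten(metrics, metrics.recordsWritten + writeResult.count)

  writeResult.iterator
}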

Ignite instance name thread local must be set or this method should be accessed under org.apache.ignite.thread.IgniteThread

I am trying to access Ignite cache values from a Spark map operation and I get the following error:
Ignite grid name thread local must be set or this method should be accessed under org.apache.ignite.thread.IgniteThread
I have the exact same problem, and I tried a method suggested by the person who asked the same question:
val cache = ignite.getOrCreateCache[String, String]("newCache")
val cache_value = cache.get("key")
val myTransformedRdd = myRdd.map { x => println(cache_value) }.take(2)
This is my sample code. I understand that when we initiate Ignite (Ignition.start()), it may only be initiated on the Spark driver, while Spark executes on the executors. So on some executors Ignite may not be initiated.
So I tried this as well:
val myTransformedRdd = myRdd.map { x =>
  if (Ignition.state.toString == "STOPPED") {
    Ignition.start("/etc/ignite/examples/config/example-ignite1.xml")
  }
  println(cache_value)
}
From this I got the same error.
It seems that ignite in your sample is taken from the outer scope somewhere outside the mapper function. Make sure that you don't try to send this object over the network.
In your example you use cache_value taken from the driver's context. Your mapper function should look something like this:
val myTransformedRdd = rdd.map { _ =>
  val igniteCfg = Ignition.loadSpringBean("/etc/ignite/examples/config/example-ignite1.xml", "ignite.cfg")
  val ignite = Ignition.getOrStart(igniteCfg)
  val cache = ignite.getOrCreateCache[String, String]("newCache")
  val cacheValue = cache.get("key")
  println(cacheValue)
}
Note that the example-ignite1.xml file should have a definition of an ignite.cfg bean of type IgniteConfiguration.
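If the per-record Ignition calls turn out to be too costly, a hedged variant (same Ignition calls as above, with an explicit type annotation for the Spring bean) does the lookup once per partition instead:

import org.apache.ignite.Ignition
import org.apache.ignite.configuration.IgniteConfiguration

val myTransformedRdd = rdd.mapPartitions { it =>
  val igniteCfg: IgniteConfiguration =
    Ignition.loadSpringBean("/etc/ignite/examples/config/example-ignite1.xml", "ignite.cfg")
  val ignite = Ignition.getOrStart(igniteCfg)
  val cache = ignite.getOrCreateCache[String, String]("newCache")
  val cacheValue = cache.get("key")
  it.map(_ => cacheValue) // reuse the same cached value for every record in the partition
}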

Track requests per Minute in Spark Streaming

I am currently trying to track requests per minute in a Spark application, to use the count in another transformation. However, when I use the variable in the transformation, the code below never yields any value other than the initially set value of 0:
var rpm: Long = 0

val requestsPerMinute = stream.countByWindow(Seconds(60), Seconds(5)).foreachRDD(rdd => {
  rdd.foreach(x => {
    rpm = x
  })
})

stream.foreachRDD { rdd =>
  rdd.foreach(x => {
    //do something including parameter rpm
  })
}
I assume it has something to do with parallelization. What I also tried was to use an RDD or a Broadcast instead of the plain variable. However, that resulted in the code not being executed.
What is the recommended way to achieve this in Spark Streaming?
EDIT:
The incoming objects are timestamped, if that helps with anything.
In Spark Streaming, there are two levels of execution:
The scheduling of operations, executed in the driver, and
The distributed computation on RDDs, executed in the cluster.
There are two operations that provide access to both levels: transform and foreachRDD. In these operations, we have access to the driver's context and a reference to an RDD that we can use to apply computations to.
In the specific case of the question, to update a local variable, the operation must be executed in the driver's context:
val requestsPerMinute = stream.countByWindow(Seconds(60), Seconds(5))

requestsPerMinute.foreachRDD { rdd =>
  val computedRPM = rdd.collect()(0) // this gets the data locally
  rpm = computedRPM
}
In the original case:
rdd.foreach(x => {
  rpm = x
})
the closure (x: Long) => { rpm = x } is serialized and executed on the cluster. The side effects are applied in the remote context and lost after the operation finishes. At the driver level, the value of the variable never changes.
Also, note that it is not a good idea to use side-effecting functions for remote execution.
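To then feed the value into another transformation, as asked, one hedged option is transform: its closure also runs on the driver once per batch and can therefore read the rpm variable that the foreachRDD above keeps up to date.

val enriched = stream.transform { rdd =>
  val currentRpm = rpm            // read on the driver, once per batch
  rdd.map(x => (x, currentRpm))   // the captured value is shipped to the executors
}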

Spark's RDD.map() will not execute unless the item inside RDD is visited

I'm not quite sure how Scala and Spark work; maybe I wrote the code in the wrong way.
The function I want to achieve is, for a given Seq[String, Int], to assign a random item from v._2.path to _._2.
To do that, I implemented a method and call it on the next line:
def getVerticesWithFeatureSeq(graph: Graph[WikiVertex, WikiEdge.Value]): RDD[(VertexId, WikiVertex)] = {
  graph.vertices.map(v => {
    //For each token in the sequence, assign an article to it based on its path (root to this node)
    println(v._1 + " before " + v._2.featureSequence)
    v._2.featureSequence = v._2.featureSequence.map(f => (f._1, v._2.path.apply(new scala.util.Random().nextInt(v._2.path.size))))
    println(v._1 + " after " + v._2.featureSequence)
    (v._1, v._2)
  })
}

val dt = getVerticesWithFeatureSeq(wikiGraph)
val dt = getVerticesWithFeatureSeq(wikiGraph)
When I execute it, I suppose the println should print something out, but it doesn't.
If I add another line of code:
dt.foreach(println)
the println inside map prints correctly.
Is there some latency in Spark's code execution? As if, when no one accesses a variable, the computation is deferred or even cancelled?
Is graph.vertices an RDD? That would explain your issue, since Spark transformations are lazy and nothing is executed until an action is invoked, foreach in your case:
val dt = getVerticesWithFeatureSeq(wikiGraph) //no result is computed yet, map transformation is 'recorded'
dt.foreach(println) //foreach action requires a result, this triggers the computation
RDDs remember the transformations applied to them and are only computed when an action requires a result to be returned to the driver program.
You can check http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations for further details and a list of available transformations and actions.
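A minimal sketch that demonstrates the lazy behaviour described above:

val doubled = sc.parallelize(1 to 3).map { x =>
  println("mapping " + x) // nothing is printed yet: the map is only recorded
  x * 2
}

doubled.count() // the action triggers the computation; the println output appears now
                // (in the executor logs when run on a cluster, as noted earlier)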