Apache Spark and Remote Method Invocation - scala

I am trying to understand how Apache Spark works behind the scenes. After coding a little in Spark, I am fairly sure that it implements RDDs as RMI remote objects, doesn't it?
In this way, it can modify them inside transformations such as map, flatMap, and so on. Objects that are not part of an RDD are simply serialized and sent to a worker during execution.
In the example below, lines and tokens will be treated as remote objects, while the string toFind will simply be serialized and copied to the workers.
val lines: RDD[String] = sc.textFile("large_file.txt")
val toFind = "Some cool string"
val tokens =
  lines.flatMap(_ split " ")
       .filter(_.contains(toFind))
Am I wrong? I googled a little but I have not found any reference to how Spark RDDs are implemented internally.

You are correct. Spark serializes closures to perform remote method invocation.
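For intuition, here is a sketch (illustrative only, not Spark's internal code) of what "serializing closures" means in practice: everything the function you pass to a transformation captures must be serializable, because the closure is shipped to the executors. The Config class below is made up for the example.
import org.apache.spark.rdd.RDD

// Hypothetical class used only for illustration; note that it is NOT Serializable.
class Config(val toFind: String)

val conf = new Config("Some cool string")
val lines: RDD[String] = sc.textFile("large_file.txt")

// The lambda captures `conf`, so Spark has to serialize it to ship the closure
// to the workers. Since Config is not serializable, this typically fails with
// org.apache.spark.SparkException: Task not serializable.
val bad = lines.filter(_.contains(conf.toFind))

// Capturing only a plain String works: the string is serialized as part of the
// closure and copied to every worker, exactly as described in the question.
val toFind = conf.toFind
val ok = lines.filter(_.contains(toFind))
println(ok.count())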

Related

org.mockito.exceptions.misusing.WrongTypeOfReturnValue on spark test cases

I'm currently writing test cases for Spark with Mockito, and I'm mocking a SparkContext which gets wholeTextFiles called on it. I have something like this:
val rdd = sparkSession.sparkContext.makeRDD(Seq(("Java", 20000),
("Python", 100000), ("Scala", 3000)))
doReturn(rdd).when(mockContext).wholeTextFiles(testPath)
However, I keep getting an error saying wholeTextFiles is supposed to output an int:
org.mockito.exceptions.misusing.WrongTypeOfReturnValue: ParallelCollectionRDD cannot be returned by wholeTextFiles$default$2()
wholeTextFiles$default$2() should return int
I know this isn't the case; the Spark docs say that wholeTextFiles returns an RDD object. Any tips on how I can fix this? I can't have my doReturn be of type int, because then the rest of my function fails, since I turn the wholeTextFiles output into a DataFrame.
I resolved this with an alternative approach. It works just fine if I use the Mockito.when() / thenReturn pattern, but I decided to change the entire test case to start a local SparkSession and load some sample files into an RDD, because that is a more in-depth test of my function, in my opinion.
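For reference, a rough sketch of the when()/thenReturn pattern mentioned above, assuming the same sparkSession, mockContext and testPath as in the question; the argument matchers and sample data are purely illustrative.
import org.mockito.Mockito.when
import org.mockito.ArgumentMatchers.{anyInt, eq => eqTo}

// wholeTextFiles returns RDD[(String, String)] (path -> file contents),
// so the stubbed value is shaped accordingly.
val rdd = sparkSession.sparkContext.makeRDD(
  Seq(("file1.txt", "some contents"), ("file2.txt", "other contents")))

// when(...).thenReturn(...) records the invocation of wholeTextFiles itself,
// so the hidden default-argument method (wholeTextFiles$default$2, which
// returns an Int) is no longer the call that gets stubbed by mistake.
when(mockContext.wholeTextFiles(eqTo(testPath), anyInt())).thenReturn(rdd)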

New to Scala and Apache Flink: why does my map function run correctly in the REPL but fail in Flink?

I'm trying (unsuccessfully) to run a simple hello-world-style program in Apache Flink. The code takes a message from Apache Kafka, adds a "." after each letter, and prints the new string to stdout. The code correctly gets the message from Kafka, but the map function that adds the "." fails. I've tried the function at the REPL prompt and the Scala code works correctly there.
Scala code:
scala> val input = "hello"
input: String = hello
scala> val output = input.flatMap(value => value + ".")
output: String = h.e.l.l.o
Flink program:
[flink code; the cut-off line reads:]
val messageStream = env.addSource(new FlinkKafkaConsumer09("CL", new SimpleStringSchema, properties))
I can't figure out where I'm going wrong; I've tried the Apache documentation to no avail. Any help you could give me would be well received.
First of all, I would recommend studying some basics of functional programming and operations like map/flatMap/reduce, etc.
All of the mentioned functions apply to collections. In your Scala example, as @pedrofuria pointed out, you apply the flatMap function to a String, which is a collection of chars.
In the Flink example, messageStream can be abstracted as a collection of strings, so to perform the operation you described you ought to do something like:
val stream = messageStream.map(str => str.mkString("."))
I used mkString instead of the flatMap from your example because your flatMap, rather than producing
h.e.l.l.o (as you wrote)
actually produces
h.e.l.l.o.
(note the trailing dot), whereas mkString(".") gives exactly h.e.l.l.o.
But once more: really start with the basics of functional programming.
Thanks for the help, I figured out the problem, sort of.
Although the flatMap function works at the Scala prompt, it doesn't work in Flink proper, as Flink requires flatMap to be passed a new FlatMapFunction with an override. I'm still unsure why it works at the Scala prompt in Flink, but the code now compiles and runs as expected.
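For anyone hitting the same wall, a sketch of what that fix could look like, assuming messageStream is the DataStream[String] built from the Kafka source above (imports and exact types may vary slightly between Flink versions):
import org.apache.flink.api.common.functions.FlatMapFunction
import org.apache.flink.streaming.api.scala._ // brings the implicit TypeInformation into scope
import org.apache.flink.util.Collector

// Pass an explicit FlatMapFunction with an overridden flatMap method
// instead of a bare Scala lambda.
val dotted = messageStream.flatMap(new FlatMapFunction[String, String] {
  override def flatMap(value: String, out: Collector[String]): Unit =
    out.collect(value.mkString(".")) // "hello" -> "h.e.l.l.o"
})

dotted.print()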
That is because flatMap at the Scala prompt is not the same as flatMap in your Flink program.
The flatMap at your Scala prompt is just a plain Scala collection function.
Flink's flatMap applies to a Flink DataSet or DataStream, with input like the one below:
val input = benv.fromElements(
  "To be, or not to be,--that is the question:--",
  "Whether 'tis nobler in the mind to suffer",
  "The slings and arrows of outrageous fortune",
  "Or to take arms against a sea of troubles,")

val counts = input
  .flatMap { _.toLowerCase.split("\\W+") }
  .map { (_, 1) }.groupBy(0).sum(1)
See: scala-shell

Spark Closures with Array [duplicate]

This question already has an answer here:
LinkedHashMap variable is not accessable out side the foreach loop
I have an array that works when it is inside the closure (it has some values), but outside the loop the array size is 0. I want to know what causes this behavior.
I need hArr to be accessible outside the loop for a batch HBase put.
val hArr = new ArrayBuffer[Put]()

rdd.foreach(row => {
  val hConf = HBaseConfiguration.create()
  val hTable = new HTable(hConf, tablename)
  val hRow = new Put(Bytes.toBytes(row._1.toString))
  hRow.add(...)
  hArr += hRow
  println("hArr: " + hArr.toArray.mkString(","))
})

println("hArr.size: " + hArr.size)
The problem is that any objects captured in an RDD closure are copied to the workers, which then operate on their own local versions. foreach should only be used for saving to disk or something along those lines.
If you want this in an array, then you can map and then collect:
rdd.map(row => {
  val hConf = HBaseConfiguration.create()
  val hTable = new HTable(hConf, tablename)
  val hRow = new Put(Bytes.toBytes(row._1.toString))
  hRow.add(...)
  hRow
}).collect()
I have found that quite a few new Spark users are confused about how the mapper and reducer functions get run and how they relate to things defined in the driver program. In general, the mapper/reducer functions you define and register via map, foreach, reduceByKey, or their many variants are not executed in your driver program. In your driver program you just register them for Spark to run remotely and in a distributed fashion. When those functions reference objects you instantiated in your driver program, you have literally created a "closure", which will compile fine most of the time. But usually that is not what you intended, and you will typically run into problems at runtime, seeing either NotSerializable or ClassNotFound exceptions.
You can either do all the output work remotely via the foreach() variants, or try collecting all the data back to your driver program for output by calling collect(). But be careful with collect(): it pulls all the data from the distributed nodes into your driver program. Only do that when you are absolutely sure your final aggregated data is small.
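If the real goal is the batch HBase put rather than an array in the driver, one common alternative is to do the writing on the workers with foreachPartition. The sketch below keeps the question's old HTable API; the column family, qualifier and value of each Put are placeholders, since the real ones are elided in the original code.
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

rdd.foreachPartition { rows =>
  // one configuration/table per partition, created on the worker itself
  val hConf = HBaseConfiguration.create()
  val hTable = new HTable(hConf, tablename)
  val puts = rows.map { row =>
    val hRow = new Put(Bytes.toBytes(row._1.toString))
    // placeholder column family/qualifier/value; the real ones are elided above
    hRow.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(row._2.toString))
    hRow
  }.toList
  hTable.put(puts.asJava)
  hTable.close()
}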

Spark streams: enrich stream with reference data

I have Spark Streaming set up so that it reads from a socket and does some enrichment of the data before publishing it on a RabbitMQ queue.
The enrichment looks up information from a Map that was instantiated by reading a regular text file (Source.fromFile...) before setting up the streaming context.
I have a feeling that this is not really the way it should be done. On the other hand, when using a StreamingContext, I can only read from streams, not from static files as I would be able to do with a SparkContext.
I could try to allow multiple contexts but I'm not sure if this is the right way either.
Any advice would be greatly appreciated.
Assuming that the map being used for enrichment is small enough to be held in memory, the recommended way to use that data in a Spark job is through broadcast variables. The content of such a variable is sent once to each executor, thereby avoiding the overhead of serializing datasets captured in a closure.
Broadcast variables are wrappers instantiated in the driver, and the data is 'unwrapped' in a closure using the broadcastVar.value method.
This would be an example of how to use broadcast variables with a DStream:
// could replace with Source.fromFile as well; this is just more practical
val data = sc.textFile("lookup.txt").map(toKeyValue).collectAsMap()

// declare the broadcast variable
val bcastData = sc.broadcast(data)

... initialize streams ...

val enrichedDStream = socketDStream.map { elem =>
  // doing every step here explicitly for illustrative purposes;
  // usually one would just chain these calls
  // get the map within the broadcast wrapper
  val lookupMap = bcastData.value
  // use the map to look up some data
  val lookupValue = lookupMap.getOrElse(elem, "not found")
  // create the desired result
  (elem, lookupValue)
}

enrichedDStream.saveTo...
If your file is small and not on a distributed file system, Source.fromFile is fine (whatever gets the job done).
If you want to read files via the SparkContext, you can still access it via streamingContext.sparkContext and combine it with the DStream in transform or foreachRDD.
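A rough sketch of that second option; ssc, socketDStream and the comma-separated format of lookup.txt are assumptions made purely for illustration:
// read the reference data with the SparkContext backing the StreamingContext
val lookupRdd = ssc.sparkContext
  .textFile("lookup.txt")
  .map { line => val Array(k, v) = line.split(","); (k, v) }

// join each batch of the stream against the reference RDD inside transform
val enriched = socketDStream
  .map(elem => (elem, ()))
  .transform(rdd => rdd.leftOuterJoin(lookupRdd))
  .map { case (key, (_, value)) => (key, value.getOrElse("not found")) }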

Graphx: I've got NullPointerException inside mapVertices

I want to use GraphX. For now I'm just launching it locally.
I get a NullPointerException in these few lines. The first println works fine, but the second one fails.
..........
val graph: Graph[Int, Int] = Graph(users, relationships)

println("graph.inDegrees = " + graph.inDegrees.count) // this line works well

graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + graph.inDegrees.count) // but this one fails
  42 // doesn't mean anything
}).vertices.collect
And it does not matter which method of the 'graph' object I call. But 'graph' itself is not null inside 'mapVertices'.
Exception failure in TID 2 on host localhost:
java.lang.NullPointerException
org.apache.spark.graphx.impl.GraphImpl.mapReduceTriplets(GraphImpl.scala:168)
org.apache.spark.graphx.GraphOps.degreesRDD(GraphOps.scala:72)
org.apache.spark.graphx.GraphOps.inDegrees$lzycompute(GraphOps.scala:49)
org.apache.spark.graphx.GraphOps.inDegrees(GraphOps.scala:48)
ololo.MyOwnObject$$anonfun$main$1.apply$mcIJI$sp(Twitter.scala:42)
Reproduced using GraphX 2.10 on Spark 1.0.2. I'll give you a workaround and then explain what I think is happening. This works for me:
val c = graph.inDegrees.count

graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + c)
}).vertices.collect
In general, Spark gets prickly when you try to access an entire RDD or other distributed object (like a Graph) in code that's intended to execute in parallel on a single partition, like the function you're passing into mapVertices. But it's also usually a bad idea even when you can get it to work. (As a separate matter, as you've seen, when it doesn't work it tends to result in really unhelpful behavior.)
The vertices of a Graph are represented as an RDD, and the function you pass into mapVertices runs locally in the appropriate partitions, where it is given access to local vertex data: id and v. You really don't want the entire graph to be copied to each partition. In this case you just need to broadcast a scalar to each partition, so pulling it out solved the problem and the broadcast is really cheap.
There are tricks in the Spark APIs for accessing more complex objects in such a situation, but if you use them carelessly they will destroy your performance because they'll tend to introduce lots of communication. Often people are tempted to use them because they don't understand the computation model, rather than because they really need to, although that does happen too.
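For instance, if you genuinely needed a whole lookup table inside mapVertices rather than a single count, the usual trick is a broadcast variable. A hedged sketch (only sensible when the collected map is small enough to fit comfortably in driver and executor memory):
// collect the in-degrees to the driver and broadcast the (small) map
val degreeMap = sc.broadcast(graph.inDegrees.collectAsMap())

graph.mapVertices((id, v) => {
  // read this vertex's in-degree from the local broadcast copy; 0 if absent
  degreeMap.value.getOrElse(id, 0)
}).vertices.collect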
Spark does not support nested RDDs or user-defined functions that refer to other RDDs, hence the NullPointerException; see this thread on the spark-users mailing list. In this case, you're attempting to call count() on a Graph (which performs an action on a Spark RDD) from inside of a mapVertices() transformation, leading to a NullPointerException when mapVertices() attempts to access data structures that are only callable by the Spark driver.
In a nutshell, only the Spark driver can launch new Spark jobs; you can't call actions on RDDs from inside of other RDD actions.
See https://stackoverflow.com/a/23793399/590203 for another example of this issue.