New to Spark, mapping with graphx graphs - NullPointerException - scala

My goal is to count triangles in multiple subgraphs from a common full graph. The subgraph is defined by a constant set of nodes + a node from an RDD[Long]. I'm new to spark/graphx, so this may be an improper use of map. The code I'm sharing will reproduce my error.
To start, I have a subgraph of a full graph declared as shown below
import org.apache.spark.rdd._
import org.apache.spark.graphx._
val nodes: RDD[(VertexId, String)] = sc.parallelize(Array((3L, "3"), (7L, "7"), (5L, "5"), (2L, "2"),(4L,"4")))
val vertices: RDD[Edge[String]] = sc.parallelize(Array(Edge(3L, 7L, "a"), Edge(3L, 5L, "b"), Edge(2L, 5L, "c"), Edge(5L, 7L, "d"), Edge(2L, 7L, "e"),Edge(4L,5L,"f")))
val graph: Graph[String,String] = Graph(nodes, vertices, "z")
val baseNodes: Array[Long] = Array(2L,5L,7L)
val subgraph = graph.subgraph(vpred = (vid,attr)=> baseNodes contains vid)
Then I declare an RDD[Long] of other nodes from the graph.
val testNodes: RDD[Long] = sc.parallelize(Array(3L,4L))
I want to add each testNode to the subgraph and count the triangles present at testNode.
val triangles: RDD[(Long,Int)] = testNodes.map{ newNode =>
val newNodes: Array[Long] = baseNodes :+ newNode
val newSubgraph = graph.subgraph(vpred = (vid,attr)=> newNodes contains vid)
(newNode,findTriangles(7L,newSubgraph))
}
triangles.foreach(x=>x.toString)
My findTriangles works fine if I call it outside of the map function.
def findTriangles(id:Long,subgraph:Graph[String,String]): Int = {
val triCounts = subgraph.triangleCount().vertices
val count:Int = triCounts.filter{case(item,count)=> {item.toInt == id}}.map{case(item,count)=>count}.first
count
}
val triangles = findTriangles(7L,subgraph) //1
But when I run my map function to calculate triangles, I get a NullPointerException. I think the problem is using my graph val inside the mapping function. Is that the issue? Is there a way to work around this?

I think the issue is the baseNodes variable. Variables that are declared locally, such as baseNodes in your example, are only visible in the Spark driver, not in the executors that actually execute transformations and actions. To avoid the NullPointerException, you need to parallelize any variable that you'll need in transformations (like map) that run on the executors. Alternatively, if the variable is read-only, you can broadcast it to the executors using Spark's broadcast construct. In your case, baseNodes does not get modified within the map operation, so it's a good candidate to be broadcast rather than parallelized.
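A minimal sketch of the broadcast usage being suggested (reusing sc, baseNodes and testNodes from the question; it only illustrates how executors read a broadcast value, not the full triangle-counting pipeline):
import org.apache.spark.broadcast.Broadcast
// Broadcast the read-only array once from the driver.
val baseNodesBC: Broadcast[Array[Long]] = sc.broadcast(baseNodes)
// Executors access the broadcast copy via .value inside transformations.
val contained = testNodes.map(n => (n, baseNodesBC.value.contains(n)))
contained.collect()  // Array((3,false), (4,false)) since baseNodes is Array(2L, 5L, 7L)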

Related

GraphX Scala vertices always return 0

I'm trying to implement this GraphX example:
import org.apache.spark._
import org.apache.spark.graphx._
val conf = new SparkConf().setAppName("GraphX Example")
val sc = new SparkContext(conf)
// Create an RDD of vertices
val verticesRDD = sc.parallelize(Seq((-1L, "nowhere"), (1L, "yahou"), (2L, "sanae"), (3L, "hanane"), (4L, "said"), (5L, "halima")))
// Create an RDD of edges
val edgesRDD = sc.parallelize(Seq(Edge(1L, 3L, "commenter"), Edge(1L, 3L, "suivre"), Edge(2L, 3L, "commenter"), Edge(2L, 5L, "connecter"), Edge(4L, 2L, "connecter")))
// Create the graph with the default vertex
val graph = Graph(verticesRDD, edgesRDD, "nowhere")
graph.vertices.collect.foreach(println)
graph.edges.collect.foreach(println)
val numVertices = graph.numVertices
val numEdges = graph.numEdges
println(s"Number of vertices: $numVertices")
println(s"Number of edges: $numEdges")
and it always returns 0 for $numVertices.
Nothing seems obviously wrong to me.
PS: In my example I expect the result to be 6.
The issue finally was with:
val conf = new SparkConf().setAppName("GraphX Example")
val sc = new SparkContext(conf)
When I ran these two lines a second time in a spark-shell, the shell shut down automatically.
The solution was to restart my machine and re-execute the script without these two lines, and it worked. Thank you all.
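For context, a minimal sketch of the working setup (a sketch only: spark-shell already provides a SparkContext named sc, so the script uses it directly instead of creating another one):
import org.apache.spark.graphx._
// In spark-shell, `sc` (and `spark`) already exist, so do not build a second SparkContext.
val verticesRDD = sc.parallelize(Seq((1L, "yahou"), (2L, "sanae"), (3L, "hanane")))
val edgesRDD = sc.parallelize(Seq(Edge(1L, 3L, "commenter"), Edge(2L, 3L, "commenter")))
val graph = Graph(verticesRDD, edgesRDD, "nowhere")
println(graph.numVertices) // 3 with this data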

Spark 2.3: Reading dataframe inside rdd.map()

I want to iterate through each row of an RDD using .map() and I want to use a dataframe inside the map function as follows:
val rdd = ... // rdd holding seq of ids in each row
val df = ... // df with columns `id: String` and `value: Double`
rdd
  .map { case Row(listOfStrings: Seq[String]) =>
    listOfStrings.foldLeft(Seq[Double]())(op = (temp, curr) => {
      // calling df here
      val extractValue: Double = df.filter(s"id == $curr").first()(1)
      temp :+ extractValue
    })
  }
Above is pseudocode which I made up, and this results in an exception because I cannot call a dataframe inside .map().
The only way I can think of overcoming this is to collect df before .map() so that it is no longer a dataframe. Is there a method in which I can do this without collecting? Note that joining the rdd and df is not suitable.
Basically you have an RDD of lists of IDs (RDD[Seq[String]]) and a dataframe of (id, value) tuples. You are trying to replace the IDs in the RDD with the corresponding values from the dataframe.
The way you are trying to do it is impossible in Spark. You cannot reference a dataframe or an RDD inside a map. Indeed, they are objects that you manipulate in the driver to parallelize jobs, which are executed by the workers. However, the code inside map is executed by a worker, and a worker cannot delegate work to other workers; only the driver can. This is why (intuitively) what you are trying to do is not possible.
You say that a join is not suitable. I am not sure why, but that is exactly what I propose, in combination with a flatMap. I use the RDD API, but we could write similar code using the dataframe API (see the sketch after the example below).
// generating data
val data = Seq(Seq("a", "b", "c"), Seq("d", "e"), Seq("f"))
val rdd = sc.parallelize(data)
val df = Seq("a" -> 1d, "b" -> 2d, "c" -> 3d,
"d" -> 4d, "e" -> 5d, "f" -> 6d)
.toDF("id", "value")
// Transforming the dataframe into a RDD[String, Double]
val rdd_df = df.rdd
.map(row => row.getAs[String]("id") -> row.getAs[Double]("value"))
val result = rdd
// We start with zipWithUniqueId to remember how the lists were arranged
.zipWithUniqueId
// we flatten the lists, remembering for each row the list id
.flatMap{ case (ids, unique_id) => ids.map(id => id -> unique_id) }
.join(rdd_df)
.map{ case(_, (unique_id, value)) => unique_id -> value }
// we reform the lists by grouping by list id
.groupByKey
.map(_._2.toArray)
scala> result.collect
res: Array[Array[Double]] = Array(Array(1.0, 2.0, 3.0), Array(4.0, 5.0), Array(6.0))
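For completeness, a rough sketch of the dataframe-API variant mentioned above (a sketch only: it assumes import spark.implicits._ is in scope, the names listsDF and resultDF are illustrative, and collect_list does not guarantee the order of values within each list):
import org.apache.spark.sql.functions.{collect_list, explode, monotonically_increasing_id}
// One array column per original row, plus a synthetic row id so the lists can be regrouped later.
val listsDF = rdd.toDF("ids").withColumn("list_id", monotonically_increasing_id())
val resultDF = listsDF
  .withColumn("id", explode($"ids"))   // one row per (list_id, id)
  .join(df, "id")                      // attach the value for each id
  .groupBy("list_id")
  .agg(collect_list($"value").as("values"))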

Scala sortbykey and collect function

I am a beginner with Spark in Scala. I am writing a program that reads a CSV file and counts the total spending per ID number. After computing the totals, when I sort the RDD using sortByKey() it does not come out sorted, but after applying collect() it prints in the proper order.
Before collect()
(0,5524.9497)
(51,4975.2197)
(1,4958.5996)
(52,5245.0605)
(2,5994.591)
(53,4945.3)
(3,4659.63)
(4,4815.05)
(5,4561.0703)
(6,5397.8794)
(7,4755.0693)
(8,5517.24)
(9,5322.6494)
(10,4819.6997)
After collect()
(0,5524.9497)
(1,4958.5996)
(2,5994.591)
(3,4659.63)
(4,4815.05)
(5,4561.0703)
(6,5397.8794)
(7,4755.0693)
(8,5517.24)
(9,5322.6494)
(10,4819.6997)
Code
def main(args: Array[String]) = {
Logger.getLogger("org").setLevel(Level.ERROR) //Set for displaying errors in the program if any
val sc = new SparkContext("local[*]", "CustomerSpending")
val lines = sc.textFile("../customer-orders.csv")
val field = lines.map(x => (x.split(",")(0).toInt, x.split(",")(2).toFloat))
val collectThemAll = field.reduceByKey((x,y) => x+y)
val sorted = collectThemAll.sortByKey().collect()
sorted.foreach(println)
}
}
Spark applies transformations lazily, i.e. only when you call an action such as collect or take. So your sortByKey() is only actually executed when you call collect.
I created an App based on your sample data. I printed the RDD dependency using toDebugString so you can get insight into what is happening behind the scenes.
App
import org.apache.spark.sql.SparkSession
object PlaygroundApp extends App {
val spark = SparkSession
.builder()
.appName("Stackoverflow App")
.master("local[*]")
.getOrCreate()
val sc = spark.sparkContext
val lines = sc.parallelize(Seq(
(0, 5524.9497),
(51, 4975.2197),
(1, 4958.5996),
(52, 5245.0605),
(2, 5994.591),
(53, 4945.3),
(9, 5322.6494),
(10, 4819.6997))
)
val collectThemAll = lines.reduceByKey((x, y) => x + y)
println("---Before sort")
collectThemAll.foreach(println)
println(collectThemAll.toDebugString)
println()
println("---After sort")
val sorted = collectThemAll.sortByKey()
sorted.collect().foreach(println)
println(sorted.toDebugString)
}
Output
---Before sort
(2,5994.591)
(53,4945.3)
(0,5524.9497)
(52,5245.0605)
(10,4819.6997)
(9,5322.6494)
(1,4958.5996)
(51,4975.2197)
(12) ShuffledRDD[1] at reduceByKey at PlaygroundApp.scala:28 []
+-(12) ParallelCollectionRDD[0] at parallelize at PlaygroundApp.scala:17 []
---After sort
(0,5524.9497)
(1,4958.5996)
(2,5994.591)
(9,5322.6494)
(10,4819.6997)
(51,4975.2197)
(52,5245.0605)
(53,4945.3)
(8) ShuffledRDD[4] at sortByKey at PlaygroundApp.scala:37 []
+-(12) ShuffledRDD[1] at reduceByKey at PlaygroundApp.scala:28 []
+-(12) ParallelCollectionRDD[0] at parallelize at PlaygroundApp.scala:17 []

Spark Structured Streaming MemoryStream + Row + Encoders issue

I am trying to run some tests on my local machine with Spark Structured Streaming.
In batch mode, here are the Rows that I am dealing with:
val recordSchema = StructType(List(StructField("Record", MapType(StringType, StringType), false)))
val rows = List(
Row(
Map("ID" -> "1",
"STRUCTUREID" -> "MFCD00869853",
"MOLFILE" -> "The MOL Data",
"MOLWEIGHT" -> "803.482",
"FORMULA" -> "C44H69NO12",
"NAME" -> "Tacrolimus",
"HASH" -> "52b966c551cfe0fa7d526bac16abcb7be8b8867d",
"SMILES" -> """[H][C#]12O[C#](O)([C#H](C)C[C##H]1OC)""",
"METABOLISM" -> "The metabolism 500"
)),
Row(
Map("ID" -> "2",
"STRUCTUREID" -> "MFCD00869854",
"MOLFILE" -> "The MOL Data",
"MOLWEIGHT" -> "603.482",
"FORMULA" -> "",
"NAME" -> "Tacrolimus2",
"HASH" -> "52b966c551cfe0fa7d526bac16abcb7be8b8867d",
"SMILES" -> """[H][C#]12O[C#](O)([C#H](C)C[C##H]1OC)""",
"METABOLISM" -> "The metabolism 500"
))
)
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), recordSchema)
Working with that in batch mode works like a charm, no issue.
Now I'm trying to move to streaming mode using MemoryStream for testing. I added the following:
implicit val ctx = spark.sqlContext
val intsInput = MemoryStream[Row]
But the compiler complains as follows:
No implicits found for parameter evidence$1: Encoder[Row]
Hence my question: what should I do here to get this working?
I also saw that if I add the following import, the error goes away:
import spark.implicits._
Actually, I now get the following warning instead of an error
Ambiguous implicits for parameter evidence$1: Encoder[Row]
I do not understand the encoder mechanism well and would appreciate it if someone could explain how to avoid these implicits. The reason is that I read the following in a book about creating a DataFrame from Rows.
Recommended approach:
val myManualSchema = new StructType(Array(
new StructField("some", StringType, true),
new StructField("col", StringType, true),
new StructField("names", LongType, false)))
val myRows = Seq(Row("Hello", null, 1L))
val myRDD = spark.sparkContext.parallelize(myRows)
val myDf = spark.createDataFrame(myRDD, myManualSchema)
myDf.show()
And then the author goes on with this:
In Scala, we can also take advantage of Spark’s implicits in the
console (and if you import them in your JAR code) by running toDF on a
Seq type. This does not play well with null types, so it’s not
necessarily recommended for production use cases.
val myDF = Seq(("Hello", 2, 1L)).toDF("col1", "col2", "col3")
Could someone take the time to explain what is happening in my scenario when I use the implicit, whether it is safe to do so, and whether there is a way to do this more explicitly without importing the implicits?
Finally, if someone could point me to good documentation on Encoders and Spark type mapping, that would be great.
EDIT1
I finally got it to work with
implicit val ctx = spark.sqlContext
import spark.implicits._
val rows = MemoryStream[Map[String,String]]
val df = rows.toDF()
My problem here is that I am not confident about what I am doing. It seems that in some situations I need to create a Dataset first in order to convert it into a DataFrame of Rows via toDF. I understood that working with Datasets is type-safe but slower than working with DataFrames, so why this intermediary Dataset? This is not the first time I have seen this pattern in Spark Structured Streaming. Again, if someone could help me with these questions, that would be great.
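For reference, here is what I think an explicit-encoder variant might look like, so that MemoryStream[Row] picks up a named encoder instead of an ambiguous imported one (an untested sketch, assuming a Spark version where RowEncoder(schema) returns an ExpressionEncoder[Row]):
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
import org.apache.spark.sql.execution.streaming.MemoryStream
implicit val ctx = spark.sqlContext
// Build an encoder explicitly from the schema instead of relying on spark.implicits._
implicit val rowEncoder: ExpressionEncoder[Row] = RowEncoder(recordSchema)
val rowStream = MemoryStream[Row]   // resolves Encoder[Row] to rowEncoder
val streamingDF = rowStream.toDF()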
I encourage you to use Scala's case classes for data modeling.
final case class Product(name: String, catalogNumber: String, cas: String, formula: String, weight: Double, mld: String)
Now you can have a List of Product in memory:
val inMemoryRecords: List[Product] = List(
Product("Cyclohexanecarboxylic acid", " D19706", "1148027-03-5", "C(11)H(13)Cl(2)NO(5)", 310.131, "MFCD11226417"),
Product("Tacrolimus", "G51159", "104987-11-3", "C(44)H(69)NO(12)", 804.018, "MFCD00869853"),
Product("Methanol", "T57494", "173310-45-7", "C(8)H(8)Cl(2)O", 191.055, "MFCD27756662")
)
The structured streaming API makes it easy to reason about stream processing by using the widely known Dataset[T] abstraction. Roughly speaking, you just have to worry about three things:
Source: a source can generate an input data stream which we can represent as a Dataset[Input]. Every new data item Input that arrives is going to be appended into this unbounded dataset. You can manipulate the data as you wish (e.g. Dataset[Input] => Dataset[Output]).
StreamingQueries and Sink: a query generates a result table that's updated from the Source every trigger interval. Changes are written into external storage called a Sink.
Output modes: there are different modes on which you can write data into the Sink: complete mode, append mode, and update mode.
Let's assume that you want to know the products that contain a molecular weight bigger than 200 units.
As you said, using the batch API is fairly simple and straight-forward:
// Create an static dataset using the in-memory data
val staticData: Dataset[Product] = spark.createDataset(inMemoryRecords)
// Processing...
val result: Dataset[Product] = staticData.filter(_.weight > 200)
// Print results!
result.show()
When using the Streaming API you just need to define a source and a sink as an extra step. In this example, we can use the MemoryStream and the console sink to print out the results.
// Create an streaming dataset using the in-memory data (memory source)
val productSource = MemoryStream[Product]
productSource.addData(inMemoryRecords)
val streamingData: Dataset[Product] = productSource.toDS()
// Processing...
val result: Dataset[Product] = streamingData.filter(_.weight > 200)
// Print results by using the console sink.
val query: StreamingQuery = result.writeStream.format("console").start()
// Stop streaming
query.awaitTermination(timeoutMs=5000)
query.stop()
Note that the staticData and the streamingData have the exact type signature (i.e., Dataset[Product]). This allows us to apply the same processing steps regardless of using the Batch or Streaming API. You can also think of implementing a generic method def processing[In, Out](inputData: Dataset[In]): Dataset[Out] = ??? to avoid repeating yourself in both approaches.
Complete code example:
object ExMemoryStream extends App {
// Boilerplate code...
val spark: SparkSession = SparkSession.builder
.appName("ExMemoryStreaming")
.master("local[*]")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
implicit val sqlContext: SQLContext = spark.sqlContext
// Define your data models
final case class Product(name: String, catalogNumber: String, cas: String, formula: String, weight: Double, mld: String)
// Create some in-memory instances
val inMemoryRecords: List[Product] = List(
Product("Cyclohexanecarboxylic acid", " D19706", "1148027-03-5", "C(11)H(13)Cl(2)NO(5)", 310.131, "MFCD11226417"),
Product("Tacrolimus", "G51159", "104987-11-3", "C(44)H(69)NO(12)", 804.018, "MFCD00869853"),
Product("Methanol", "T57494", "173310-45-7", "C(8)H(8)Cl(2)O", 191.055, "MFCD27756662")
)
// Defining processing step
def processing(inputData: Dataset[Product]): Dataset[Product] =
inputData.filter(_.weight > 200)
// STATIC DATASET
val datasetStatic: Dataset[Product] = spark.createDataset(inMemoryRecords)
println("This is the static dataset:")
processing(datasetStatic).show()
// STREAMING DATASET
val productSource = MemoryStream[Product]
productSource.addData(inMemoryRecords)
val datasetStreaming: Dataset[Product] = productSource.toDS()
println("This is the streaming dataset:")
val query: StreamingQuery = processing(datasetStreaming).writeStream.format("console").start()
query.awaitTermination(timeoutMs=5000)
// Stop query and close Spark
query.stop()
spark.close()
}

Spark: Sort records in groups?

I have a set of records which I need to:
1) Group by 'date', 'city' and 'kind'
2) Sort every group by 'prize'
In my code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object Sort {
case class Record(name:String, day: String, kind: String, city: String, prize:Int)
val recs = Array (
Record("n1", "d1", "k1", "c1", 10),
Record("n1", "d1", "k1", "c1", 9),
Record("n1", "d1", "k1", "c1", 8),
Record("n2", "d2", "k2", "c2", 1),
Record("n2", "d2", "k2", "c2", 2),
Record("n2", "d2", "k2", "c2", 3)
)
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setAppName("Test")
.set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)
val rs = sc.parallelize(recs)
val rsGrp = rs.groupBy(r => (r.day, r.kind, r.city)).map(_._2)
val x = rsGrp.map{r =>
val lst = r.toList
lst.map{e => (e.prize, e)}
}
x.sortByKey()
}
}
When I try to sort a group I get an error:
value sortByKey is not a member of org.apache.spark.rdd.RDD[List[(Int,
Sort.Record)]]
What is wrong? How to sort?
You need to define a key and then use mapValues to sort the values.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._
object Sort {
case class Record(name:String, day: String, kind: String, city: String, prize:Int)
// Define your data
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setAppName("Test")
.setMaster("local")
.set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)
val rs = sc.parallelize(recs)
// Generate the pair RDD necessary to call groupByKey, and group it
val key: RDD[((String, String, String), Iterable[Record])] = rs.keyBy(r => (r.day, r.city, r.kind)).groupByKey
// Once grouped you need to sort values of each Key
val values: RDD[((String, String, String), List[Record])] = key.mapValues(iter => iter.toList.sortBy(_.prize))
// Print result
values.collect.foreach(println)
}
}
groupByKey is expensive; it has two implications:
On average, the majority of the data gets shuffled across the remaining N-1 partitions.
All of the records for the same key are loaded into memory on a single executor, potentially causing memory errors.
Depending on your use case, you have different, better options:
If you don't care about the ordering, use reduceByKey or aggregateByKey.
If you want to just group and sort without any other transformation, prefer repartitionAndSortWithinPartitions (Spark 1.3.0+, http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.OrderedRDDFunctions), but be very careful about which partitioner you specify and test it, because you are now relying on side effects that may change behaviour in a different environment (see the sketch after this list). See also the examples in this repository: https://github.com/sryza/aas/blob/master/ch08-geotime/src/main/scala/com/cloudera/datascience/geotime/RunGeoTime.scala.
If you are applying a transformation or a non-reducible aggregation (fold or scan) to the iterable of sorted records, then check out the spark-sorted library: https://github.com/tresata/spark-sorted. It provides three APIs for paired RDDs: mapStreamByKey, foldLeftByKey and scanLeftByKey.
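A rough sketch of the repartitionAndSortWithinPartitions idea (illustrative only, reusing the Record class and the rs RDD from the question; GroupPartitioner is a hypothetical helper and the partition count is arbitrary):
import org.apache.spark.Partitioner
// Partition on the grouping key only, so all records of a group land in the same partition.
class GroupPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (group, _) => ((group.hashCode % numPartitions) + numPartitions) % numPartitions
  }
}
// Composite key (grouping key, prize); the default tuple Ordering sorts by group first, then prize.
val keyed = rs.map(r => ((r.day, r.city, r.kind), r.prize) -> r)
val sortedWithinGroups = keyed.repartitionAndSortWithinPartitions(new GroupPartitioner(4))
// Each partition now holds whole groups, with the records of a group adjacent and ordered by prize.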
Replace map with flatMap
val x = rsGrp.flatMap{ r =>
  val lst = r.toList
  lst.map{ e => (e.prize, e) }
}
this will give you a
org.apache.spark.rdd.RDD[(Int, Record)] = FlatMappedRDD[10]
and then you can call sortBy(_._1) on the RDD above.
As an alternative to @gasparms' solution, I think one can try a filter followed by an rdd.sortBy operation: for each key, filter the records that match it and sort them. The prerequisite is that you need to keep track of all your keys (filter combinations); you can also build that set as you traverse the records. A rough sketch follows.
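(Illustrative only, reusing rs and Record from the question; note this launches one Spark job per key, so it is only reasonable for a small number of groups.)
// Collect the distinct grouping keys to the driver first.
val keys = rs.map(r => (r.day, r.city, r.kind)).distinct.collect()
// For each key, filter the matching records and sort them by prize.
val sortedGroups = keys.map { k =>
  k -> rs.filter(r => (r.day, r.city, r.kind) == k).sortBy(_.prize).collect().toList
}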