Problems subgraphing in a foreach loop in Spark GraphX - Scala

Hoping somebody can help.
I'm trying to write a program which needs to carry out a function on each edge connected to each node in a GraphX network.
To do this I want to iterate over each node, identify all edges connected to it, and then apply a function to each of those edges. My problem seems to arise when doing any kind of subgraphing or filtering within a foreach loop.
So, for example, the code below should output each edge connected to a node:
graph.vertices.foreach {
  network =>
    val KeyVert = network._1
    val EGraph = graph.subgraph(e => e.dstId == KeyVert)
    println(KeyVert)
    EGraph.edges.foreach(println)
}
However, it only works if you add the collect function to collect the vertex data from the RDD, e.g.
graph.vertices.collect.foreach {
  network =>
    val KeyVert = network._1
    val EGraph = graph.subgraph(e => e.dstId == KeyVert)
    println(KeyVert)
    EGraph.edges.foreach(println)
}
The network is too large to collect the edge data, so any help would be much appreciated.

The problem is the distinction between the driver and the workers. When you call the collect function, all the data is brought back to the driver, and then the foreach appears to work. In fact, graph.vertices.foreach did not report any error, right? That's because it really does run; it just prints the info to the workers' logs rather than to the driver's console. Hope it helps.
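If the goal is simply to see, for every vertex, the edges pointing at it without collecting anything to the driver, one option worth sketching (an assumption on my part, not something from the answers above) is to key the edge RDD by destination id and group it, so that no RDD operation is nested inside another:
// Sketch: key each edge by its destination vertex and group, keeping the
// per-vertex work on the executors instead of nesting RDD operations.
val edgesByVertex = graph.edges.map(e => (e.dstId, e)).groupByKey()

edgesByVertex.foreach { case (keyVert, edges) =>
  // println still goes to the worker logs, not the driver console
  println(keyVert)
  edges.foreach(println)
}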

graph.vertices.map {
  network =>
    val KeyVert = network._1
    val EGraph = graph.subgraph(e => e.dstId == KeyVert)
    println(KeyVert)
    EGraph.edges.map(println)
}
That may solve your problem.

Related

Filtering a collection of IO's: List[IO[Page]] scala

I am refactoring a scala http4s application to remove some pesky side effects causing my app to block. I'm replacing .unsafeRunSync with cats.effect.IO. The problem is as follows:
I have 2 lists: alreadyAccessible: IO[List[Page]] and pages: List[Page]
I need to filter out the pages that are not contained in alreadyAccessible.
Then map over the resulting list to "grant access" in the database to these pages (e.g. call another method that hits the database and returns an IO[Page]).
val addable: List[Page] = pages.filter(p => !alreadyAccessible.contains(p))
val added: List[Page] = addable.map((p: Page) => {
  pageModel.grantAccess(roleLst.head.id, p.id) match {
    case Right(p) => p
  }
})
This is close to what I want; however, it does not work, because filter requires a function that returns a Boolean, whereas alreadyAccessible is of type IO[List[Page]] and the list cannot be pulled out of the IO monad. I understand you can't extract data from the IO, so maybe transform it instead:
val added: List[IO[Page]] = for(page <- pages) {
  val granted = alreadyAccessible.flatMap((aa: List[Page]) => {
    if (!aa.contains(page))
      pageModel.grantAccess(roleLst.head.id, page.id) match { case Right(p) => p }
    else null
  })
} yield granted
This unfortunately does not compile and fails with the following error:
Error:(62, 7) ';' expected but 'yield' found.
} yield granted
I think I am somehow misusing the for-comprehension syntax; I just don't understand why I cannot do what I'm doing.
I know there must be a straightforward solution to such a problem, so any input or advice is greatly appreciated. Thank you for your time in reading this!
granted is going to be an IO[List[Page]]. There's no particular point in having IO inside anything else unless you truly are going to treat the actions like values and reorder them/filter them etc.
val granted: IO[List[Page]] = for {
How do you compute it? Well, the first step is to execute alreadyAccessible to get the actual list. In fact, alreadyAccessible is misnamed. It is not the list of accessible pages; it is an action that gets the list of accessible pages. I would recommend you rename it getAlreadyAccessible.
alreadyAccessible <- getAlreadyAccessible
Then you filter pages with it
val required = pages.filterNot(alreadyAccessible.contains)
Now, I cannot decipher what you're doing to these pages. I'm just going to assume you have some kind of function grantAccess: Page => IO[Page]. If you map this function over required, you will get a List[IO[Page]], which is not desirable. Instead, we should traverse with grantAccess, which will produce an IO[List[Page]] that executes each IO[Page] and then assembles all the results into a List[Page].
granted <- required.traverse(grantAccess)
And we're done
} yield granted
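Putting those pieces together, here is a minimal sketch of the whole thing (assuming cats.implicits is in scope, the signatures named above, and the question's own Page type):
import cats.effect.IO
import cats.implicits._ // provides .traverse for List

// Assumed, following the answer above:
//   getAlreadyAccessible is the renamed alreadyAccessible action
//   grantAccess: Page => IO[Page] wraps the database call
def grantMissing(pages: List[Page],
                 getAlreadyAccessible: IO[List[Page]],
                 grantAccess: Page => IO[Page]): IO[List[Page]] =
  for {
    alreadyAccessible <- getAlreadyAccessible              // run the action, get the List[Page]
    required = pages.filterNot(alreadyAccessible.contains) // pages that still need access
    granted <- required.traverse(grantAccess)              // IO[List[Page]]
  } yield granted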

Creating Seq after waiting for all results from map/foreach in Scala

I am trying to loop over inputs and process them to produce scores.
Just for the first input, I want to do some processing that takes a while.
The function ends up returning only the values from the 'else' branch; the 'if' branch finishes executing after the function has already returned.
I am new to Scala and understand the behavior but not sure how to fix it.
I've tried inputs.zipWithIndex.map instead of foreach but the result is the same.
def getscores(
    inputs: inputs
): Future[Seq[scoreInfo]] = {
  var scores: Seq[scoreInfo] = Seq()
  inputs.zipWithIndex.foreach {
    case (f, i) => {
      if (i == 0) {
        // long operation that returns Future[Option[scoreInfo]]
        getgeoscore(f).foreach(gso => {
          gso.foreach(score => {
            scores = scores.:+(score)
          })
        })
      } else {
        scores = scores.:+(
          scoreInfo(
            id = "",
            score = 5
          )
        )
      }
    }
  }
  Future {
    scores
  }
}
For what you need, I would drop the mutable variable, replace foreach with map to obtain an immutable list of Futures, use recover to handle exceptions, and then apply Future.sequence, like below:
def getScores(inputs: Inputs): Future[List[ScoreInfo]] = Future.sequence(
  inputs.zipWithIndex.map { case (input, idx) =>
    if (idx == 0)
      getGeoScore(input).map(_.getOrElse(defaultScore)).recover { case e => errorHandling(e) }
    else
      Future.successful(ScoreInfo("", 5))
  })
To capture/print the result, one way is to use onComplete:
getScores(inputs).onComplete(println)
The part you're missing is a tricky element of concurrency: the order of execution when using multiple futures is not guaranteed.
If this block is long-running, it will take a while before it appends the score to scores:
// long operation that returns Future[Option[scoreInfo]]
getgeoscore(f).foreach(gso => {
  gso.foreach(score => {
    // stick a println("here") in here to see what happens, for demonstration purposes only
    scores = scores.:+(score)
  })
})
Since that executes concurrently, your getscores function will simultaneously continue iterating over the rest of the inputs in your zipWithIndex. That iteration, especially since it's trivial work, will likely finish well before the long-running getgeoscore(f) completes the Future it scheduled, and the code will exit the function and move on to whatever comes after your call to getscores:
val futureScores: Future[Seq[scoreInfo]] = getScores(inputs)
futureScores.onComplete {
  case Success(scoreInfoSeq) => println(s"Here's the scores: ${scoreInfoSeq.mkString(",")}")
}
// at this point the call to getgeoscore(f) could still be running and finish later, but you will never know
doSomeOtherWork()
Now, to clean this up: since you can run zipWithIndex on your inputs parameter, I assume it is something like inputs: Seq[Input]. If all you want to do is operate on the first input, then use head to retrieve only the first element, i.e. getgeoscore(inputs.head); you don't need the rest of the code you have there.
Also, as a note, if you're using Scala, get out of the habit of using mutable vars, especially when working with concurrency. Scala is built around supporting immutability, so if you find yourself reaching for a var, try a val instead and look at what Scala's collection library offers to make it work.
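As a tiny illustration of that last point (the names here are made up for the example), the usual replacement for a var accumulator is a fold:
val numbers = Seq(1, 2, 3)

// Mutable style: a var updated by a side-effecting loop
var total = 0
numbers.foreach(n => total += n)

// Immutable style: the same result as a single expression, no var needed
val totalImmutable = numbers.foldLeft(0)(_ + _)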
In general, that is, when you have several concurrent futures, I would say Leo's answer describes the right way to do it. However, you only want the first element transformed by a long-running operation, so you can use the future returned by the respective function and append the other elements when the long-running call returns, by mapping the future result:
def getscores(inputs: Inputs): Future[Seq[scoreInfo]] =
  getgeoscore(inputs.head)
    .map { optInfo =>
      optInfo.toSeq ++ inputs.tail.map(_ => scoreInfo(id = "", score = 5))
    }
So you need neither zipWithIndex nor an additional future, nor do you have to join the results of several futures with sequence. Mapping the future simply gives you a new future whose result is transformed by the function passed to .map().

SparkSQL performance issue with collect method

We are currently facing a performance issue in a Spark SQL application written in Scala. The application flow is described below.
1. The Spark application reads a text file from an input HDFS directory.
2. It creates a DataFrame on top of the file by programmatically specifying the schema. This DataFrame is an exact replica of the input file kept in memory and has around 18 columns.
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema)
3. It creates a filtered DataFrame from the DataFrame constructed in step 2. This one contains the unique account numbers, obtained using distinct.
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
4. Using the two DataFrames constructed in steps 2 and 3, we get all the records belonging to one account number and run some JSON parsing logic on top of the filtered data.
var filtrEqpDF =
  eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
5. Finally, the parsed JSON data is put into an HBase table.
We are facing performance issues when calling the collect method on these DataFrames, because collect fetches all the data to a single node and processes it there, losing the benefit of parallel processing.
Also, in the real scenario we can expect around 10 billion records, so collecting all of them on the driver node might crash the program due to memory or disk space limitations.
I don't think the take method, which fetches a limited number of records at a time, can be used in our case. We have to get all the unique account numbers from the whole dataset, so I am not sure it will suit our requirements.
I would appreciate any help to avoid calling collect and any other best practices to follow. Code snippets/suggestions/git links will be very helpful if anyone has faced similar issues.
Code snippet
val eqpSchemaString = "acoountnumber ....."
val eqpSchema = StructType(eqpSchemaString.split(" ").map(fieldName =>
  StructField(fieldName, StringType, true)))
val eqpRdd = sc.textFile(inputPath)
val eqpRowRdd = eqpRdd.map(_.split(",")).map(eqpRow => Row(eqpRow(0).trim, eqpRow(1).trim, ....)
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema)
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
distAccNrsDF.foreach { data =>
  var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
  var result = new JSONObject()
  result.put("jsonSchemaVersion", "1.0")
  val firstRowAcc = filtrEqpDF(0)
  //Json parsing logic
  {
    .....
    .....
  }
}
The approach usually taken in this kind of situation is:
Instead of collect, invoke foreachPartition: foreachPartition applies a function to each partition (represented by an Iterator[Row]) of the underlying DataFrame separately (the partition being the atomic unit of parallelism of Spark)
the function will open a connection to HBase (thus making it one per partition) and send all the contained values through this connection
This means that every executor opens a connection (which is not serializable, but lives within the boundaries of the function and thus never needs to be sent across the network) and independently sends its contents to HBase, without any need to collect all the data on the driver (or on any one node, for that matter).
It looks like you are reading a CSV file, so probably something like the following will do the trick:
spark.read.csv(inputPath). // Using DataFrameReader but your way works too
  foreachPartition { rows =>
    val conn = ??? // Create HBase connection
    for (row <- rows) { // Loop over the iterator
      val data = parseJson(row) // Your parsing logic
      ??? // Use 'conn' to save 'data'
    }
  }
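For completeness, here is a rough sketch of what the ??? placeholders could become with the standard HBase client API, written against the eqpDF from the question; the table name "eqp", the column family "cf", and the buildJson helper are assumptions, not part of the original code:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

eqpDF.rdd.foreachPartition { rows =>
  // One connection per partition, created on the executor
  val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("eqp")) // assumed table name
  try {
    rows.foreach { row =>
      val accountNumber = row.getAs[String]("accountnumber")
      val json = buildJson(row) // hypothetical helper: your JSON parsing logic goes here
      val put = new Put(Bytes.toBytes(accountNumber))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("json"), Bytes.toBytes(json))
      table.put(put)
    }
  } finally {
    table.close()
    conn.close()
  }
}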
You can avoid collect in your code if you have a large set of data.
collect returns all the elements of the dataset as an array to the driver program. It is usually useful after a filter or another operation that returns a sufficiently small subset of the data.
It can also cause the driver to run out of memory, because collect() fetches the entire RDD/DataFrame onto a single machine.
I have just edited your code, which should work for you.
var distAccNrsDF = eqpDF.select("accountnumber").distinct()
distAccNrsDF.foreach { data =>
  var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'")
  var result = new JSONObject()
  result.put("jsonSchemaVersion", "1.0")
  val firstRowAcc = filtrEqpDF(0)
  //Json parsing logic
  {
    .....
    .....
  }
}

Iterator[Something] to Iterator[Seq[Something]]

I need to process a "big" file (something that does not fit in memory).
I want to batch-process the data - let's say, for the sake of the example, insert it into a database. The file is too big to fit in memory, and processing the elements one by one is too slow.
So I'd like to go from an Iterator[Something] to an Iterator[Iterable[Something]] to batch the elements.
Starting with this:
CSVReader.open(new File("big_file"))
  .iteratorWithHeaders
  .map(Something.parse)
  .foreach(Jdbi.insertSomething)
I could do something dirty in the foreach statement with a mutable sequence and a flush every x elements, but I'm sure there is a smarter way to do this...
// Yuk... :-(
val buffer = ArrayBuffer[Something]()
CSVReader.open(new File("big_file"))
  .iteratorWithHeaders
  .map(Something.parse)
  .foreach { something =>
    buffer.append(something)
    if (buffer.size == 1000) {
      Jdbi.insertSomethings(buffer.toList)
      buffer.clear()
    }
  }
Jdbi.insertSomethings(buffer.toList)
If your batches can have a fixed size (as in your example), the grouped method on Scala's Iterator does exactly what you want:
val iterator = Iterator.continually(1)
iterator.grouped(10000).foreach(xs => println(xs.size))
This will run in a constant amount of memory (not counting whatever text is stored in memory by your terminal, of course).
I'm not sure what your iteratorWithHeaders returns, but if it's a Java iterator, you can convert it to a Scala one like this:
import scala.collection.JavaConverters._
val myScalaIterator: Iterator[Int] = myJavaIterator.asScala
This will remain appropriately lazy.
If I understood your problem correctly, you can just use Iterator.grouped. So, adapting your example a little bit:
val si: Iterator[Something] = CSVReader.open(new File("big_file"))
  .iteratorWithHeaders
  .map(Something.parse)

val gsi: Iterator[Seq[Something]] = si.grouped(1000)

gsi.foreach { slst =>
  Jdbi.insertSomethings(slst.toList)
}
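For what it's worth, grouped also slots straight into the original pipeline, so everything stays lazy (assuming, as in the question, that iteratorWithHeaders yields a Scala Iterator and Jdbi.insertSomethings takes a List):
CSVReader.open(new File("big_file"))
  .iteratorWithHeaders
  .map(Something.parse)
  .grouped(1000) // Iterator[Seq[Something]]
  .foreach(batch => Jdbi.insertSomethings(batch.toList))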

Preferred way of processing this data with parallel arrays

Imagine a sequence of java.io.File objects. The sequence is not in any particular order; it gets populated by a directory traversal. The names of the files can be like this:
/some/file.bin
/some/other_file_x1.bin
/some/other_file_x2.bin
/some/other_file_x3.bin
/some/other_file_x4.bin
/some/other_file_x5.bin
...
/some/x_file_part1.bin
/some/x_file_part2.bin
/some/x_file_part3.bin
/some/x_file_part4.bin
/some/x_file_part5.bin
...
/some/x_file_part10.bin
Basically, I can have 3 types of files. The first type is the simple one, which only has a .bin extension. The second type is assembled from the pieces _x1.bin through _x5.bin. And the third type can be formed of 10 smaller parts, from _part1 through _part10.
I know the naming may be strange, but this is what I have to work with :)
I want to group the files together (all the pieces of a file should be processed together), and I was thinking of using parallel arrays to do this. The thing I'm not sure about is how to perform the reduce/accumulation part, since all the threads will be working on the same array.
val allBinFiles = allBins.toArray // array of java.io.File
I was thinking of handling it with something like this:
val mapAcumulator = java.util.Collections.synchronizedMap[String, ListBuffer[File]](new java.util.HashMap[String, ListBuffer[File]]())
allBinFiles.par.foreach { file =>
  file match {
    // for something like /some/x_file_x4.bin nameTillPart will be /some/x_file
    case ComposedOf5Name(nameTillPart) => {
      mapAcumulator.getOrElseUpdate(nameTillPart, new ListBuffer[File]()) += file
    }
    case ComposedOf10Name(nameTillPart) => {
      mapAcumulator.getOrElseUpdate(nameTillPart, new ListBuffer[File]()) += file
    }
    // simple file, without any pieces
    case _ => {
      mapAcumulator.getOrElseUpdate(file.toString, new ListBuffer[File]()) += file
    }
  }
}
I was thinking of doing it as shown in the code above: having extractors for the files and using part of the path as the key in the map. For example, /some/x_file would hold /some/x_file_x1.bin to /some/x_file_x5.bin as values. I also think there could be a better way of handling this, and I would be interested in your opinions.
The alternative is to use groupBy:
val mp = allBinFiles.par.groupBy {
  case ComposedOf5Name(x) => x
  case ComposedOf10Name(x) => x
  case f => f.toString
}
This will return a parallel map of parallel arrays of files (ParMap[String, ParArray[File]]). If you want a sequential map of sequential sequences of files from this point:
val sqmp = mp.map { case (k, v) => (k, v.seq) }.seq
To ensure that the parallelism kicks in, make sure you have enough elements in your parallel array (10k+).
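Both snippets rely on the ComposedOf5Name and ComposedOf10Name extractors being defined somewhere. A minimal regex-based sketch of what they might look like (the patterns are guesses based on the file names listed above):
import java.io.File

// Hypothetical extractors matching the naming patterns in the question;
// each one returns the path prefix shared by all pieces of the same file.
object ComposedOf5Name {
  private val Pattern = """(.*)_x[1-5]\.bin""".r
  def unapply(f: File): Option[String] = f.getPath match {
    case Pattern(prefix) => Some(prefix) // e.g. /some/other_file_x4.bin -> /some/other_file
    case _               => None
  }
}

object ComposedOf10Name {
  private val Pattern = """(.*)_part(?:[1-9]|10)\.bin""".r
  def unapply(f: File): Option[String] = f.getPath match {
    case Pattern(prefix) => Some(prefix) // e.g. /some/x_file_part7.bin -> /some/x_file
    case _               => None
  }
}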