Adding the contents of an RDD[(Array[String], Long)] into a new array inside a new RDD: RDD[Array[(Array[String], Long)]] - scala

I have an RDD[Array[String]] which I zipWithIndex:
val dataWithIndex = data.zipWithIndex()
Now I have an RDD[(Array[String], Long)], and I would like to put all the pairs in the RDD into a single array while still keeping it inside an RDD. Is there an efficient way to do so? My final data structure should be RDD[Array[(Array[String], Long)]], where the RDD essentially contains only one element.
Right now I do the following, but it is very inefficient because of collect():
import scala.collection.mutable.ListBuffer

val dataWithIndex = data.zipWithIndex()
val dataNoRDD = dataWithIndex.collect()
val dataArr = ListBuffer[Array[(Array[String], Long)]]()
dataArr += dataNoRDD
val initData = sc.parallelize(dataArr)

The conclusion is that this seems to be extremely hard to do with standard functionality.
However, if the input comes from a Hadoop filesystem, it is possible by extending certain Hadoop classes.
First, you implement WritableComparable to define the custom record type the RDD will contain. For the data to be read into that type, you also define a custom FileInputFormat and extend it to support your custom Writable. For the FileInputFormat to know what to do with the data being read, a custom RecordReader has to be written by extending RecordReader; in particular, its nextKeyValue() method defines what each RDD element will contain. All three of these classes are written in Java, but with some simple tricks they can be used from Scala.
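Purely as an illustrative sketch of how such custom classes would then be wired in from Scala (MyArrayWritable and MyArrayInputFormat are hypothetical placeholders for the Java classes described above; only newAPIHadoopFile itself is a real Spark API):
import org.apache.hadoop.io.NullWritable

// MyArrayWritable: the custom WritableComparable holding an Array[(Array[String], Long)]
// MyArrayInputFormat: the custom FileInputFormat whose RecordReader emits one such value
val customRDD = sc.newAPIHadoopFile[NullWritable, MyArrayWritable, MyArrayInputFormat](
  "hdfs:///path/to/input"
).values   // drop the NullWritable keys; one element per nextKeyValue() call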

Related

Using key-value pair RDD to build a kdtree in Spark

I am trying to build kd-trees from points in a pair RDD called "RDDofPoints" with type RDD[(BoundingBox[Double], (Double, Double))]. All the points are assigned to a particular bounding box, and my goal is to build a kd-tree for each of the bounding boxes.
I am trying to use reduceByKey for this purpose. However, I am stuck at how to call the buildtree function in this case.
The function declaration of buildtree is:
def buildtree(points: RDD[(Double, Double)], depth: Int = 0): Option[KdNodeforRDD]
And, I am trying to call it as:
val treefromPairRDD = RDDofPoints.reduceByKey((k,v) => buildtree(v))
Obviously, this does not work. I am fairly new to Scala and Spark; please suggest an appropriate way to go about this. I am not sure about using reduceByKey; if some other pair RDD function can be applied here, which one would it be?
Thank you.
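For context, reduceByKey expects a merge function of type ((Double, Double), (Double, Double)) => (Double, Double), so it cannot call buildtree, which takes an entire RDD. A minimal sketch of one common alternative, assuming a hypothetical local variant buildTreeLocal(points: Seq[(Double, Double)], depth: Int = 0): Option[KdNodeforRDD] that works on an in-memory collection instead of an RDD:
// Group the points of each bounding box together, then build each tree locally.
val treesPerBox: RDD[(BoundingBox[Double], Option[KdNodeforRDD])] =
  RDDofPoints
    .groupByKey()                                        // Iterable of points per bounding box
    .mapValues(points => buildTreeLocal(points.toSeq))   // buildTreeLocal is hypothetical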

How to convert Dataset to a Scala Iterable?

Is there a way to convert a org.apache.spark.sql.Dataset to a scala.collection.Iterable? It seems like this should be simple enough.
You can do myDataset.collect or myDataset.collectAsList.
But then it will no longer be distributed. If you want to be able to spread your computations out over multiple machines, you need to use one of the distributed data structures, such as RDD, DataFrame or Dataset.
You can also use toLocalIterator if you just need to iterate the contents on the driver; it has the advantage of loading only one partition at a time into memory, instead of the entire dataset. An Iterator is not an Iterable (although it is a Traversable), but depending on what you are doing it may be what you want.
You could try something like this:
import org.apache.spark.sql.Dataset

def toLocalIterable[T](dataset: Dataset[T]): Iterable[T] = new Iterable[T] {
  def iterator = scala.collection.JavaConverters.asScalaIterator(dataset.toLocalIterator)
}
The conversion via JavaConverters.asScalaIterator is necessary because the toLocalIterator method of Dataset returns a java.util.Iterator instead of a scala.collection.Iterator (which is what toLocalIterator on RDD returns). I suspect this is a bug.
In Scala 2.11 you can do the following:
import scala.collection.JavaConverters._
dataset.toLocalIterator.asScala.toIterable
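For example, a small usage sketch of that conversion (here names stands in for any Dataset[String] you already have):
import scala.collection.JavaConverters._

// Lazily pulls one partition at a time to the driver instead of collecting everything.
val asIterable: Iterable[String] = names.toLocalIterator.asScala.toIterable
asIterable.take(5).foreach(println)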

Extract column values of Dataframe as List in Apache Spark

I want to convert a string column of a data frame to a list. What I can find in the Dataframe API is RDD, so I tried converting it back to an RDD first, and then applying the toArray function to the RDD. In this case, the length and the SQL work just fine. However, the result I got from the RDD has square brackets around every element, like this: [A00001]. I was wondering if there is an appropriate way to convert a column to a list, or a way to remove the square brackets.
Any suggestions would be appreciated. Thank you!
This should return the collection containing a single list:
dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect()
Without the mapping, you just get a Row object, which contains every column from the database.
Keep in mind that this will probably get you a list of type Any. If you want to specify the result type, you can use .asInstanceOf[YOUR_TYPE] in the mapping: r => r(0).asInstanceOf[YOUR_TYPE].
P.S. due to automatic conversion you can skip the .rdd part.
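Putting the two remarks together, a minimal typed sketch (the column name and the String type are placeholders for your own):
val values: List[String] =
  dataFrame.select("YOUR_COLUMN_NAME")
    .rdd
    .map(r => r(0).asInstanceOf[String])   // cast each extracted value to the expected type
    .collect()
    .toList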
With Spark 2.x and Scala 2.11
I can think of 3 possible ways to convert the values of a specific column to a List.
Common code snippets for all the approaches
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.getOrCreate
import spark.implicits._ // for .toDF() method
val df = Seq(
  ("first", 2.0),
  ("test", 1.5),
  ("choose", 8.0)
).toDF("id", "val")
Approach 1
df.select("id").collect().map(_(0)).toList
// res9: List[Any] = List(first, test, choose)
What happens here? We are collecting the data to the driver with collect() and picking element zero from each record.
This is not an ideal way of doing it; let's improve it with the next approach.
Approach 2
df.select("id").rdd.map(r => r(0)).collect.toList
// res10: List[Any] = List(first, test, choose)
How is it better? We have distributed the map transformation load among the workers rather than a single driver.
I know rdd.map(r => r(0)) does not seem elegant to you. So, let's address that in the next approach.
Approach 3
df.select("id").map(r => r.getString(0)).collect.toList
// res11: List[String] = List(first, test, choose)
Here we are not converting the DataFrame to an RDD. Look at map: it won't accept r => r(0) (or _(0)) as in the previous approach, due to encoder issues in DataFrame, so we end up using r => r.getString(0); this should be addressed in future versions of Spark.
Conclusion
All the options give the same output, but 2 and 3 are more efficient; the third one is both efficient and elegant (I'd think).
Databricks notebook
I know the answer given and asked for assumes Scala, so I am just providing a little snippet of Python code in case a PySpark user is curious. The syntax is similar to the given answer, but to properly pop the list out I actually have to reference the column name a second time in the mapping function, and I do not need the select statement.
i.e. A DataFrame, containing a column named "Raw"
To get each row value in "Raw" combined as a list where each entry is a row value from "Raw" I simply use:
MyDataFrame.rdd.map(lambda x: x.Raw).collect()
In Scala and Spark 2+, try this (assuming your column name is "s"):
df.select('s).as[String].collect
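Note that this needs the SparkSession implicits in scope, both for the 's column syntax and for the String encoder; for instance (assuming a SparkSession named spark):
import spark.implicits._

val values: Array[String] = df.select('s).as[String].collect()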
sqlContext.sql(" select filename from tempTable").rdd.map(r => r(0)).collect.toList.foreach(out_streamfn.println) //remove brackets
It works perfectly.
Since no one has given any solution in Java (real programming language), here it is. You can thank me later:
List<String> whatever_list = df.toJavaRDD().map(new Function<Row, String>() {
    public String call(Row row) {
        return row.getAs("column_name").toString();
    }
}).collect();
logger.info(String.format("list is %s", whatever_list)); // verification
from pyspark.sql.functions import col
df.select(col("column_name")).collect()
Here collect is the function which in turn converts it to a list.
Beware of using the list on a huge data set; it will decrease performance.
It is good to check the data.
Below is for Python:
df.select("col_name").rdd.flatMap(lambda x: x).collect()
An updated solution that gets you a list:
dataFrame.select("YOUR_COLUMN_NAME").map(r => r.getString(0)).collect.toList
This is the Java answer:
df.select("id").collectAsList();

Modifying an RDD of objects in Spark (Scala)

I have:
val rdd1: RDD[myClass]
It has been initialized; I checked while debugging and all the members have their default values.
If I do
rdd1.foreach(x => x.modifier())
where modifier is a member function of myClass which modifies some of the member variables.
After executing this, if I check the values inside the RDD, they have not been modified.
Can someone explain what's going on here?
And is it possible to make sure the values are modified inside the RDD?
EDIT:
class myClass(var id: String, var sessions: Buffer[Long], var avgsession: Long) {
  def calcAvg() {
    // calculate the average by summing over sessions and dividing by its length
    // store this average in avgsession
  }
}
The avgsession attribute is not updated if I do
myrdd.foreach(x => x.calcAvg())
RDDs are immutable; calling a mutating method on the objects they contain will not have any effect.
The way to obtain the result you want is to produce new copies of MyClass instead of modifying the instances in place:
case class MyClass(id: String, avgsession: Long) {
  def modifier(a: Int): MyClass =
    this.copy(avgsession = this.avgsession + a)
}
Now you still cannot update rdd1, but you can obtain rdd2 that will contain the updated instances:
val rdd2 = rdd1.map(_.modifier(18))
The answer to this question is slightly more nuanced than the original accepted answer here. The original answer is correct only with respect to data that is not cached in memory. RDD data that is cached in memory can be mutated in memory as well and the mutations will remain even though the RDD is supposed to be immutable. Consider the following example:
val rdd = sc.parallelize(Seq(new mutable.HashSet[Int]()))
rdd.foreach(_+=1)
rdd.collect.foreach(println)
If you run that example you will get Set() as the result just like the original answer states.
However if you were to run the exact same thing with a cache call:
val rdd = sc.parallelize(Seq(new mutable.HashSet[Int]()))
rdd.cache
rdd.foreach(_+=1)
rdd.collect.foreach(println)
Now the result will print as Set(1). So it depends on whether the data is cached in memory. If Spark is recomputing from the source or reading from a serialized copy on disk, it will always reset back to the original object and appear to be immutable, but if it is not loading from a serialized form then the mutations will in fact stick.
The RDD's objects are treated as immutable. By using map, you can iterate over the RDD and return a new one.
val rdd2 = rdd1.map(x => x.modifier())
I have observed that code like yours will work after calling RDD.persist when running in Spark/YARN. It is probably unsupported/accidental behavior and you should avoid it, but it is a workaround that may help in a pinch. I'm running version 1.5.0.
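A minimal sketch of that workaround, purely for illustration; it only appears to work while the deserialized objects stay cached in executor memory (see the caching answer above), so don't rely on it:
rdd1.persist()                    // keep the deserialized objects in executor memory
rdd1.foreach(x => x.modifier())   // mutations may stick only while the objects stay cached
rdd1.collect()                    // may now show the modified values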

Subtract an RDD from another RDD doesn't work correctly

I want to subtract an RDD from another RDD. I looked into the documentation and found that subtract can do that. However, when I tested subtract, the final RDD remained the same and the values were not removed!
Is there any other function to do that? Or am I using subtract incorrectly?
Here is the code that I used:
val vertexRDD: org.apache.spark.rdd.RDD[(VertexId, Array[Int])]
val clusters = vertexRDD.takeSample(false, 3)
val clustersRDD: RDD[(VertexId, Array[Int])] = sc.parallelize(clusters)
val finalRDD = vertexRDD.subtract(clustersRDD)
finalRDD.collect().foreach(println(_))
Performing set operations like subtract with mutable types (Array in this example) is usually unsupported, or at least not recommended.
Try using an immutable type instead.
I believe WrappedArray is the relevant container for storing arrays in sets, but I'm not sure.
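A hedged sketch of that suggestion, based on the code in the question: convert the Array values to immutable Vectors, which have value-based equals and hashCode, before the set operation.
val vertexImmutable   = vertexRDD.mapValues(_.toVector)     // RDD[(VertexId, Vector[Int])]
val clustersImmutable = clustersRDD.mapValues(_.toVector)

val result = vertexImmutable
  .subtract(clustersImmutable)
  .mapValues(_.toArray)          // back to Array[Int] if the original shape is needed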
If your RDD is composed of mutable objects it won't work... the problem is that it won't show an error either, so this kind of problem is hard to identify. I had a similar one yesterday and used a workaround:
// key both of your RDDs by the same immutable value
val keyedRDD = rdd.keyBy(x => someImmutableKey(x))
val keyedOtherRDD = otherRDD.keyBy(x => someImmutableKey(x))
val resultRDD = keyedRDD.subtractByKey(keyedOtherRDD).values
Recently I tried the subtract operation on two RDDs (of array lists) and it works, but watch the order of the call: a.subtract(b) returns the elements of a that are not in b, so the method has to be called on the RDD you are subtracting from.
Correct: val result = fromList.subtract(theElementYouWantToSubtract)
Incorrect: val result = theElementYouWantToSubtract.subtract(fromList) (this compiles and runs without any error message, but silently returns the wrong elements)
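As a quick sanity check of the semantics (toy values):
val fromList = sc.parallelize(Seq(1, 2, 3, 4))
val toRemove = sc.parallelize(Seq(2, 4))

// a.subtract(b) keeps the elements of a that are not in b
fromList.subtract(toRemove).collect()   // elements 1 and 3 (order not guaranteed)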