Spark - Convert Tuples to Tab Separated String - Scala

I want to create a function that takes an RDD of tuples and converts each tuple to a tab-separated string. I want the function to be able to handle tuples of any size.
If I already have this RDD created, I can get the desired output using:
rdd.map(line => (0 to (line.productArity-1)).map(line.productElement(_)).toList.mkString("\t"))
How can I convert this piece of code to work as a function that takes an RDD of tuples, or is there a good library that already does this?

Something like this should work:
def toTab[T <: Product](rdd:RDD[T]) = rdd.map(_.productIterator.mkString("\t"))
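For example, a minimal usage sketch (assuming a SparkContext named sc; the sample tuples are made up just to show the call):
import org.apache.spark.rdd.RDD

def toTab[T <: Product](rdd: RDD[T]): RDD[String] =
  rdd.map(_.productIterator.mkString("\t"))

val tuples = sc.parallelize(Seq((1, "a", 2.0), (2, "b", 3.5)))
toTab(tuples).collect().foreach(println)
// prints each tuple's fields joined by tabs, e.g. 1<TAB>a<TAB>2.0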

Related

How to print a String or String[Array] in Scala (Spark)?

I'm trying to unit test the values returned in a String, but when I try to print, the console gives
MapPartitionsRDD[32]
My code is as follows:
UPDATED:
val src = exact_bestmatch_src.filter(line => line.split(",")(0).toInt.equals(i))
val dest = exact_bestmatch_Dest.filter(line => line.split(",")(0).toInt.equals(i)).toArray()
for (print1 <- src) {
  var n1: String = src.toString()
  var sourceArr: Array[String] = n1.split(",")
  for (print2 <- dest) {
    var n2: String = dest.toString()
    for (i <- 0 until sourceArr.length) {
      if (n1.split(",")(i).equals(n2.split(",")(i))) {
      }
    }
  }
}
I also tried println(n1.mkString())
I'm trying to compare both the src and dest RDDs to find out the differences between the rows.
If you want to see each record in the RDD printed as a separate line, you can use:
src.foreach(println)
This will run the println function on each record, on the executor that holds it (the records might be spread across several different executors). If this runs in a test using Spark's "local" mode, there's only one "executor" and it's the same process as the driver, so that's not a problem.
Alternatively, if you do have more than one executor (non-local mode) and you want to make sure the RDD's elements are printed to the driver console, you can first collect the RDD's elements into a local collection and then print them:
src.collect().foreach(println)
NOTE that this assumes the RDD is small enough to be collected into a single machine's memory.
Calling toString on an RDD does not access the RDD's data (as it might be too large to fit as a String in the driver machine's memory); as you observed, it just prints the type of the RDD and its ID.
You don't have a list or array. You'd need to collect() an RDD in order to get one, or you need to iterate it via foreach.
Calling println on any object already calls its toString method, by the way. And RDD doesn't have a mkString method.
Calling toString on src just means you are getting a string representation which can be anything. For RDD this is not the content of the RDD (as this would require getting all the content of the RDD to the driver and printing it at once).
As others have mentioned, in order to print the content of the RDD you need to first get all the data to the driver.
Let's consider the simple solution already proposed:
src.collect().foreach(println)
The first part, collect, tells Spark to get all the content of the RDD and bring it to the driver as a sequence of records. The foreach tells Scala to go over each record in the sequence and pass it as an argument to the println function, which prints it. You could of course use mkString instead of foreach to get a single string.
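As a small sketch of that mkString variant (assuming src is an RDD small enough to collect):
println(src.collect().mkString("\n"))
// builds one string with a record per line, then prints it once on the driver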

Apache Spark in Scala not printing rdd values

I am new to Spark and Scala as well, so this might be a very basic question.
I created a text file with 4 lines of some words. The rest of the code is as below:
val data = sc.textFile("file:///home//test.txt").map(x=> x.split(" "))
println(data.collect)
println(data.take(2))
println(data.collect.foreach(println))
All the above println commands produce output like: [Ljava.lang.String;@1ebec410
Any idea how I can display the actual contents of the RDD? I have even tried saveAsTextFile, but it also saves the same java... line.
I am using the IntelliJ IDE for Spark Scala, and yes, I have gone through other posts related to this, but no help. Thanking you in advance.
The return type of data is RDD[Array[String]]. Previously you were printing the Array[String], which prints something like [Ljava.lang.String;@1ebec410 because the toString() method of Array is not overridden, so it just prints the hash code of the object.
You can try casting the Array[String] to a List[String] by using the toList method; now you will be able to see the content inside the list, because the toString() method of List in Scala is overridden and shows the content.
That means if you try
data.collect.foreach(arr => println(arr.toList))
this will show you the content, or, as @Raphael has suggested,
data.collect().foreach(arr => println(arr.mkString(", ")))
this will also work, because arr.mkString(", ") will convert the array into a String with each element separated by ,.
Hope this clears your doubt.
Thanks
data is of type RDD[Array[String]]; what you print is the toString of the Array[String] ([Ljava.lang.String;@1ebec410). Try this:
data.collect().foreach(arr => println(arr.mkString(", ")))
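A quick local illustration of the point (plain Scala, no Spark needed; the exact hash code will differ):
val arr = Array("foo", "bar")
println(arr)                // [Ljava.lang.String;@1ebec410 -- Array does not override toString
println(arr.toList)         // List(foo, bar)
println(arr.mkString(", ")) // foo, bar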

How to flatMap nested lists in Spark

I have an RDD in Spark which looks like this:
[Foo1, Bar[bar1,bar2]]
The Bar object has a getList method which may return the lists [bar11, bar12, bar13] and [bar21, bar22] respectively. I want the output to look like this:
[Foo1, [bar11, bar12, bar13, bar21, bar22]]
The approach that I am able to think of is something like this:
my_rdd.map(x => (x._1, x._2.getList))
  .flatMap {
    case (x, y) => y.map((x, _))
  }
The first map operation returns Foo1 and all the lists; however, I am not able to flatten them beyond that.
In your code, x._2.getList returns a list of lists. Use the flatten method as follows to get the expected result:
my_rdd.map(x => (x._1,x._2.getList.flatten))
You can do this with one line:
my_rdd.mapValues(_.flatMap(_.getList))
There is another answer which uses map instead of mapValues. While this produces the same RDD elements, I think it's important to get in the habit of using the "minimal" function necessary with Spark RDDs, because you can actually pay a huge performance cost for using map instead of mapValues without realizing it: the map function on an RDD strips the partitioner, if it exists, while mapValues does not.
If you have an RDD[(K, V)] and call rdd.groupByKey(), you'll end up with an RDD[(K, Iterable[V])] that is partitioned by K. If you want to join with another RDD by K, you've already done most of the work.
If you add a map in between the groupByKey() and the join, Spark will re-shuffle that RDD. This is very painful! mapValues is safe.
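A minimal sketch of that difference, assuming a local SparkContext named sc (the data is made up; only the partitioner values matter):
val grouped = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c"))).groupByKey()
println(grouped.partitioner)                     // Some(HashPartitioner) -- set by groupByKey
println(grouped.mapValues(_.toList).partitioner) // still Some(HashPartitioner)
println(grouped.map(identity).partitioner)       // None -- a later join by key would re-shuffle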

Extract column values of Dataframe as List in Apache Spark

I want to convert a string column of a DataFrame to a list. What I can find in the DataFrame API is RDD, so I tried converting it back to an RDD first and then applying the toArray function to the RDD. In this case, the length and SQL work just fine. However, the result I got from the RDD has square brackets around every element, like this: [A00001]. I was wondering if there's an appropriate way to convert a column to a list or a way to remove the square brackets.
Any suggestions would be appreciated. Thank you!
This should return the collection containing a single list:
dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect()
Without the mapping, you just get a Row object, which contains every column from the database.
Keep in mind that this will probably get you a list of type Any. If you want to specify the result type, you can use .asInstanceOf[YOUR_TYPE] in the mapping: r => r(0).asInstanceOf[YOUR_TYPE]
P.S. due to automatic conversion you can skip the .rdd part.
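For example, a sketch that types the result as String (YOUR_COLUMN_NAME is a placeholder, as above):
val values: List[String] = dataFrame
  .select("YOUR_COLUMN_NAME")
  .rdd
  .map(r => r(0).asInstanceOf[String])
  .collect()
  .toList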
With Spark 2.x and Scala 2.11
I'd think of 3 possible ways to convert values of a specific column to a List.
Common code snippets for all the approaches
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.getOrCreate
import spark.implicits._ // for .toDF() method
val df = Seq(
  ("first", 2.0),
  ("test", 1.5),
  ("choose", 8.0)
).toDF("id", "val")
Approach 1
df.select("id").collect().map(_(0)).toList
// res9: List[Any] = List(first, test, choose)
What happens now? We are collecting data to Driver with collect() and picking element zero from each record.
This is not an excellent way of doing it; let's improve it with the next approach.
Approach 2
df.select("id").rdd.map(r => r(0)).collect.toList
// res10: List[Any] = List(first, test, choose)
How is it better? We have distributed the map transformation load among the workers rather than putting it all on a single Driver.
I know rdd.map(r => r(0)) does not seem elegant to you. So, let's address it in the next approach.
Approach 3
df.select("id").map(r => r.getString(0)).collect.toList
// res11: List[String] = List(first, test, choose)
Here we are not converting the DataFrame to an RDD. Look at map: it won't accept r => r(0) (or _(0)) as in the previous approach, due to encoder issues in DataFrame, so we end up using r => r.getString(0); this should be addressed in future versions of Spark.
Conclusion
All the options give the same output, but 2 and 3 are effective; finally, the 3rd one is both effective and elegant (I'd think).
I know the answer given and asked for is assumed for Scala, so I am just providing a little snippet of Python code in case a PySpark user is curious. The syntax is similar to the given answer, but to properly pop the list out I actually have to reference the column name a second time in the mapping function, and I do not need the select statement.
i.e. A DataFrame, containing a column named "Raw"
To get each row value in "Raw" combined as a list, where each entry is a row value from "Raw", I simply use:
MyDataFrame.rdd.map(lambda x: x.Raw).collect()
In Scala and Spark 2+, try this (assuming your column name is "s"):
df.select('s').as[String].collect
sqlContext.sql("select filename from tempTable").rdd.map(r => r(0)).collect.toList.foreach(out_streamfn.println) // removes the brackets
It works perfectly.
List<String> whatever_list = df.toJavaRDD().map(new Function<Row, String>() {
    public String call(Row row) {
        return row.getAs("column_name").toString();
    }
}).collect();
logger.info(String.format("list is %s", whatever_list)); // verification
Since no one has given any solution in Java (Real Programming Language).
You can thank me later
from pyspark.sql.functions import col
df.select(col("column_name")).collect()
Here collect is the function which in turn converts it to a list.
Beware of using the list on a huge data set; it will decrease performance.
It is good to check the data.
Below is for Python:
df.select("col_name").rdd.flatMap(lambda x: x).collect()
An updated solution that gets you a list:
dataFrame.select("YOUR_COLUMN_NAME").map(r => r.getString(0)).collect.toList
This is the Java answer.
df.select("id").collectAsList();

Subtract an RDD from another RDD doesn't work correctly

I want to subtract an RDD from another RDD. I looked into the documentation and I found that subtract can do that. Actually, when I tested subtract, the final RDD remained the same and the values were not removed!
Is there any other function to do that? Or am I using subtract incorrectly?
Here is the code that I used:
val vertexRDD: org.apache.spark.rdd.RDD[(VertexId, Array[Int])]
val clusters = vertexRDD.takeSample(false, 3)
val clustersRDD: RDD[(VertexId, Array[Int])] = sc.parallelize(clusters)
val finalRDD = vertexRDD.subtract(clustersRDD)
finalRDD.collect().foreach(println(_))
Performing set operations like subtract with mutable types (Array in this example) is usually unsupported, or at least not recommended.
Try using an immutable type instead.
I believe WrappedArray is the relevant container for storing arrays in sets, but I'm not sure.
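As a sketch of that idea with the RDDs from the question (converting the Array values to an immutable Vector so that equality is structural, then converting back at the end):
val vertexVec   = vertexRDD.mapValues(_.toVector)
val clustersVec = clustersRDD.mapValues(_.toVector)
val remaining   = vertexVec.subtract(clustersVec).mapValues(_.toArray)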
If your RDD is composed of mutable objects it won't work... The problem is it won't show an error either, so this kind of problem is hard to identify. I had a similar one yesterday and used a workaround.
rdd.keyBy(someImmutableValue) // do this using the same key value for both your RDDs
val resultRDD = rdd.subtractByKey(otherRDD).values
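Spelled out for the question's RDDs, a sketch of that workaround might look like this (keying by the VertexId, which is immutable, so subtractByKey only ever compares keys):
val keyedVertices = vertexRDD.keyBy(_._1)   // same key value used for both RDDs
val keyedClusters = clustersRDD.keyBy(_._1)
val resultRDD = keyedVertices.subtractByKey(keyedClusters).values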
Recently I tried the subtract operation on 2 RDDs (of array lists) and it is working. The important note is: the RDD val after the .subtract method should be the list from which you're subtracting, not the other way around.
Correct: val result = theElementYouWantToSubtract.subtract(fromList)
Incorrect: val result = fromList.subtract(theElementYouWantToSubtract) (will not give any compile/runtime error message)