Apache Spark in Scala not printing RDD values

I am new to Spark and Scala as well, so this might be a very basic question.
I created a text file with 4 lines of some words. The rest of the code is as below:
val data = sc.textFile("file:///home//test.txt").map(x=> x.split(" "))
println(data.collect)
println(data.take(2))
println(data.collect.foreach(println))
All of the above println commands produce output like: [Ljava.lang.String;@1ebec410
Any idea how I can display the actual contents of the RDD? I have even tried saveAsTextFile, and it also saves the same [Ljava.lang.String;... line.
I am using the IntelliJ IDE for Spark/Scala and yes, I have gone through other posts related to this, but they did not help. Thanks in advance.

The type of your RDD is RDD[Array[String]]. You were printing the Array[String], which prints something like [Ljava.lang.String;@1ebec410 because Array does not override toString(), so you just get the default representation (class name plus hash code) of the object.
You can convert the Array[String] to a List[String] with the toList method; then you will be able to see the contents, because List's toString() in Scala is overridden and shows the elements.
That means if you try
data.collect.foreach(arr => println(arr.toList))
this will show you the content, or, as @Raphael has suggested,
data.collect().foreach(arr => println(arr.mkString(", ")))
this will also work, because arr.mkString(", ") converts the array into a single String with the elements separated by commas.
Hope this clears up your doubt.
Thanks

data is of type RDD[Array[String]]; what you print is the toString of the Array[String] ([Ljava.lang.String;@1ebec410). Try this:
data.collect().foreach(arr => println(arr.mkString(", ")))
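If the goal is just to see the original lines rather than the split tokens, a minimal sketch (assuming the same test.txt path as in the question) is to print before splitting, or to rejoin the tokens when printing:
val lines = sc.textFile("file:///home//test.txt")
lines.collect().foreach(println) // each line of the file
lines.map(_.split(" ")).collect().foreach(arr => println(arr.mkString(" "))) // the split tokens, rejoined with spaces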

Related

Why does nested flatMap - map in Scala give an RDD of type Object instead of a list of tuples?

I have an RDD that I want to group according to some key, but it just doesn't work. I am a Scala and Spark beginner. So I have the following RDD:
rdd: RDD[WikipediaArticle]
val meinVal = rdd.flatMap(article => langs.map(lang => { if (article.mentionsLanguage(lang)) { Tuple2(lang, article) } else { None } })).filter(_ != None)
meinVal.collect.foreach(println) gives:
(Scala,WikipediaArticle(2,Scala and Java run on the JVM))
(Java,WikipediaArticle(2,Scala and Java run on the JVM))
(Scala,WikipediaArticle(3,Scala is not purely functional))
I have two questions:
Why can I not apply the groupByKey function? It is an RDD that contains a list of tuples, and the first tuple entry is the key.
I don't see how to apply groupBy either. I thought I could do meinVal.groupBy(x => x._1), but that throws an error.
I notice that when I use an IDE and hover over meinVal it shows that it is RDD[Object], whereas it should be RDD[(String, WikipediaArticle)]. I do not know how to get this information without the IDE. So it seems that the RDD just contains one big Object; I only don't see why that is.
Anyone? Please?
Irene
Ok, so thanks to this post https://stackoverflow.com/a/29426336/909909 I figured it out. The problem was not the nested flatMap-map construct, but the condition in the map instruction. In my code I returned None if the condition was not met. Since None is not a tuple type, I get an RDD[Object] and therefore I cannot use groupByKey.
To solve this I use Option and then flatten the rdd to get rid of the Option and its Nones again.
val meinVal = rdd.flatMap(article => langs.map(lang => { if (article.mentionsLanguage(lang)) Some(Tuple2(lang, article)) else None }).flatten)
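A minimal, self-contained sketch of the same idea (the WikipediaArticle case class, the langs list, and mentionsLanguage below are simplified stand-ins for the originals, and sc is an existing SparkContext): once the Nones are flattened away, the element type is (String, WikipediaArticle) and groupByKey becomes available.
case class WikipediaArticle(id: Int, text: String) {
  def mentionsLanguage(lang: String): Boolean = text.split(" ").contains(lang)
}
val langs = List("Scala", "Java")
val articles = sc.parallelize(Seq(
  WikipediaArticle(2, "Scala and Java run on the JVM"),
  WikipediaArticle(3, "Scala is not purely functional")
))
// Returning Option and flattening keeps the element type a pair instead of Object
val meinVal = articles.flatMap(article =>
  langs.map(lang => if (article.mentionsLanguage(lang)) Some((lang, article)) else None).flatten
)
meinVal.groupByKey().collect().foreach(println) // e.g. (Scala,CompactBuffer(WikipediaArticle(2,...), WikipediaArticle(3,...)))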

How to print a String or Array[String] in Scala (Spark)?

I'm trying to unit test the values returned in a String, but when I try to print, the console gives
MapPartitionsRDD[32]
My code is as follows:
UPDATED:
val src = exact_bestmatch_src.filter(line => line.split(",")(0).toInt.equals(i))
val dest = exact_bestmatch_Dest.filter(line => line.split(",")(0).toInt.equals(i)).toArray()
for (print1 <- src) {
  var n1: String = src.toString()
  var sourceArr: Array[String] = n1.split(",")
  for (print2 <- dest) {
    var n2: String = dest.toString()
    for (i <- 0 until sourceArr.length) {
      if (n1.split(",")(i).equals(n2.split(",")(i))) {
      }
    }
  }
}
I also tried println(n1.mkstring())
I'm trying to compare both the src and dest RDDs to find out the differences between the rows.
If you want to see each record in the RDD printed as a separate line, you can use:
src.foreach(println)
This will run the println function on each record, within the executor that holds it (which might be several different executors). If this runs in some test using Spark's "local" mode, there's only one "executor" and it's the same process as the driver, so that's not a problem.
Alternatively, if you do have more than one executor (non-local mode) and you want to make sure the RDD's elements are printed to the driver console, you can first collect the RDD's elements into a local collection and then print them:
src.collect().foreach(println)
NOTE that this assumes the RDD is small enough to be collected into a single machine's memory.
Calling toString on an RDD does not access the RDD's data (as it might be too large to fit as a String in the driver machine's memory), as you observed it just prints the type of the RDD and its ID.
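If the RDD might be too large to collect, a small alternative sketch (using the same src value from the question) is to bring back only a handful of records:
src.take(10).foreach(println) // collects at most 10 records to the driver and prints them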
You don't have a list or array. You'd need to collect() an RDD in order to get one, or you need to iterate it via foreach.
Calling println on any object already calls the toString method for it, by the way. And RDD doesn't have a mkString method
Calling toString on src just means you are getting a string representation, which can be anything. For an RDD this is not the content of the RDD (as that would require bringing all of the RDD's content to the driver and holding it at once).
As others have mentioned, in order to print the content of the RDD you first need to get all the data to the driver.
Let's consider the simple solution already proposed:
src.collect().foreach(println)
The first part, collect, tells Spark to get all the content of the RDD and bring it to the driver as a sequence of records. The foreach tells Scala to go over each record in the sequence and pass it as an argument to the println function, which prints it. You could of course use mkString instead of foreach to get a single string.
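For example, a minimal sketch along those lines (again assuming src is small enough to collect):
println(src.collect().mkString("\n")) // one String, with each record on its own line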

Extract column values of Dataframe as List in Apache Spark

I want to convert a string column of a data frame to a list. What I can find in the Dataframe API is RDD, so I tried converting it back to an RDD first, and then applying the toArray function to the RDD. In this case, the length and SQL work just fine. However, the result I got from the RDD has square brackets around every element, like this: [A00001]. I was wondering if there's an appropriate way to convert a column to a list, or a way to remove the square brackets.
Any suggestions would be appreciated. Thank you!
This should return the collection containing a single list:
dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect()
Without the mapping, you just get a Row object, which contains every column from the database.
Keep in mind that this will probably get you a list of Any type. If you want to specify the result type, you can use .asInstanceOf[YOUR_TYPE] in the r => r(0).asInstanceOf[YOUR_TYPE] mapping.
P.S. due to automatic conversion you can skip the .rdd part.
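Building on that remark, a sketch of the typed variant (YOUR_COLUMN_NAME and the String element type are placeholders to adapt) could look like:
val values: List[String] = dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0).asInstanceOf[String]).collect().toList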
With Spark 2.x and Scala 2.11
I'd think of 3 possible ways to convert values of a specific column to a List.
Common code snippets for all the approaches
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.getOrCreate
import spark.implicits._ // for .toDF() method
val df = Seq(
("first", 2.0),
("test", 1.5),
("choose", 8.0)
).toDF("id", "val")
Approach 1
df.select("id").collect().map(_(0)).toList
// res9: List[Any] = List(first, test, choose)
What happens here? We are collecting the data to the Driver with collect() and picking element zero from each record.
This is not an excellent way of doing it; let's improve it with the next approach.
Approach 2
df.select("id").rdd.map(r => r(0)).collect.toList
//res10: List[Any] = List(first, test, choose)
How is it better? We have distributed the map transformation load among the workers rather than a single Driver.
I know rdd.map(r => r(0)) does not seem elegant to you. So, let's address that in the next approach.
Approach 3
df.select("id").map(r => r.getString(0)).collect.toList
//res11: List[String] = List(first, test, choose)
Here we are not converting the DataFrame to an RDD. Look at map: it won't accept r => r(0) (or _(0)) as in the previous approach, because of encoder issues with DataFrame, so we end up using r => r.getString(0); this may be addressed in later versions of Spark.
Conclusion
All the options give the same output, but 2 and 3 are more effective; finally, the 3rd one is both effective and elegant (I'd think).
Databricks notebook
I know the answer given and asked for assumes Scala, so I am just providing a little snippet of Python code in case a PySpark user is curious. The syntax is similar to the given answer, but to properly pop the list out I actually have to reference the column name a second time in the mapping function, and I do not need the select statement.
i.e. A DataFrame, containing a column named "Raw"
To get each row value in "Raw" combined as a list where each entry is a row value from "Raw" I simply use:
MyDataFrame.rdd.map(lambda x: x.Raw).collect()
In Scala and Spark 2+, try this (assuming your column name is "s"):
df.select('s).as[String].collect
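For completeness, a slightly fuller sketch of that typed approach (df and the column name "s" come from the answer above; import spark.implicits._ supplies both the 's column syntax and the String encoder):
import spark.implicits._
val values: List[String] = df.select('s).as[String].collect().toList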
sqlContext.sql(" select filename from tempTable").rdd.map(r => r(0)).collect.toList.foreach(out_streamfn.println) //remove brackets
it works perfectly
List<String> whatever_list = df.toJavaRDD().map(new Function<Row, String>() {
    public String call(Row row) {
        return row.getAs("column_name").toString();
    }
}).collect();
logger.info(String.format("list is %s", whatever_list)); // verification
Since no one has given a solution in Java (a real programming language), here it is.
You can thank me later.
from pyspark.sql.functions import col
df.select(col("column_name")).collect()
Here collect is the function which in turn converts it to a list.
Beware of using a list on a huge data set; it will decrease performance.
It is good to check the data.
Below is for Python:
df.select("col_name").rdd.flatMap(lambda x: x).collect()
An updated solution that gets you a list:
dataFrame.select("YOUR_COLUMN_NAME").map(r => r.getString(0)).collect.toList
This is a Java answer.
df.select("id").collectAsList();

Spark - Convert Tuples to Tab Separated String

I want to create a function that takes an RDD of tuples and converts each tuple to a tab separated string. I want the function to be able to handle Tuples of any size.
If I already have this RDD created, I can get the desired output using:
rdd.map(line => (0 to (line.productArity-1)).map(line.productElement(_)).toList.mkString("\t"))
How can I convert this piece of code to work as a function that takes an RDD of tuples, or is there a good library that already does this?
Something like this should work:
def toTab[T <: Product](rdd:RDD[T]) = rdd.map(_.productIterator.mkString("\t"))
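A quick usage sketch (the sample tuples and the existing SparkContext sc are assumptions for illustration; this works for tuples of any arity because they all extend Product):
import org.apache.spark.rdd.RDD
def toTab[T <: Product](rdd: RDD[T]): RDD[String] = rdd.map(_.productIterator.mkString("\t"))
val tuples = sc.parallelize(Seq(("a", 1, true), ("b", 2, false)))
toTab(tuples).collect().foreach(println) // each line is the tuple's elements joined by tabs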

Scala getLines() with yield not as I expect

My understanding of the "Programming in Scala" book is that the following should return an Array[String] when instead it returns an Iterator[String]. What am I missing?
val data = for (line <- Source.fromFile("data.csv").getLines()) yield line
I'm using Scala 2.9.
Thanks in advance.
The chapter you want to read to understand what's happening is http://www.artima.com/pins1ed/for-expressions-revisited.html
for (x <- expr_1) yield expr_2
is translated to
expr_1.map(x => expr_2)
So if expr_1 is an Iterator[String] as it is in your case, then expr_1.map(line => line) is also an Iterator[String].
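A tiny sketch of that equivalence (data.csv is the file from the question; both forms yield an Iterator[String]):
import scala.io.Source
val viaFor = for (line <- Source.fromFile("data.csv").getLines()) yield line // Iterator[String]
val viaMap = Source.fromFile("data.csv").getLines().map(line => line) // the desugared equivalent
val asArray = viaFor.toArray // Array[String], if that is what you need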
Nope, it returns an Iterator. See: http://www.scala-lang.org/api/current/index.html#scala.io.BufferedSource
But the following should work if an Array is your goal:
Source.fromFile("data.csv").getLines().toArray
If you want to convert an Iterator to an Array (as mentioned in your comment), then try the following after you've yielded your Iterator:
data.toArray
@dhg is correct, and here is a bit more detail on why.
The code in your example calls Source.fromFile, which returns a BufferedSource. Then you call getLines, which returns an iterator. That iterator is then yielded and stored as data.
Calling toArray on the Iterator will get you an array of Strings like you want.