Spark 2.0 Scala - RDD.toDF()

I am working with Spark 2.0 Scala. I am able to convert an RDD to a DataFrame using the toDF() method.
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
But for the life of me I cannot find where this is in the API docs. It is not under RDD, but it is under Dataset (link 1). However, I have an RDD, not a Dataset.
Also, I can't see it under implicits (link 2).
So please help me understand why toDF() can be called on my RDD. Where is this method inherited from?

It's coming from here:
Spark 2 API
Explanation: if you import sqlContext.implicits._, you get an implicit method (rddToDatasetHolder) that converts an RDD into a DatasetHolder; you then call toDF on the DatasetHolder.
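In other words, with the import in scope, rdd.toDF() is roughly sugar for the following (a sketch of what the compiler inserts; rddToDatasetHolder and DatasetHolder are the names used in Spark 2's SQLImplicits):
import sqlContext.implicits._   // brings rddToDatasetHolder and the Encoders it needs into scope
// roughly what the compiler expands rdd.toDF() into:
val holder: org.apache.spark.sql.DatasetHolder[String] = rddToDatasetHolder(rdd)
val df = holder.toDF()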

Yes, you need to import the sqlContext implicits like this:
val sqlContext = new org.apache.spark.sql.SQLContext(sc) // or reuse the one the shell provides
import sqlContext.implicits._
val df = rdd.toDF()
before you call toDF on your RDDs.
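In Spark 2 the same conversion is also exposed through a SparkSession; here is a minimal sketch assuming a shell or application where spark is the session (the log-file path is the one from the question):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._   // same implicits, exposed on the session
val rdd = spark.sparkContext.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()        // compiles because the implicit RDD-to-holder conversion is in scope
df.show()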

Yes, I finally found peace of mind on this issue; it was troubling me like hell, and this post is a life saver. I was trying to generically load data from log files into case class objects held in a mutable List, the idea being to finally convert the list into a DataFrame. However, because the list was mutable and Spark 2.1.1 has changed the toDF implementation, for whatever reason the list was not getting converted. I was even about to resort to saving the data to a file and loading it back with .read, but five minutes ago this post saved my day.
I did it exactly as described: after loading the data into the mutable list, I immediately used
import spark.sqlContext.implicits._
val df = <mutable list object>.toDF
df.show()

I have done just this with Spark 2 and it worked:
val orders = sc.textFile("/user/gd/orders")
val ordersDF = orders.toDF()

Related

Scala - Turning RDD[String] into a Map

I have a very large file that contains individual JSONs which I would like to iterate through, turning each one into a Map using the Jackson library:
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.ScalaObjectMapper
val mapper = new ObjectMapper() with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)
val lines = sc.textFile(fileName)
On a single JSON string, I can run this without issues:
mapper.readValue[Map[String, Object]](JSONString)
to get my map.
However, if I try the same thing by iterating through an RDD[String] like so, I get the following error:
lines.foreach(line => mapper.readValue[Map[String, Object]](line))
org.apache.spark.SparkException: Task not serializable
I can do lines.take(10000) or so and then work on that, but the file is so huge that I can't "take" or "collect" the whole thing in one go, and I want the same solution to work across files of all different sizes.
After the string becomes a Map, I need to perform functions on it and write to a string, so any solution that allows me to do that without going over my allocated memory will help. Thank you!
Managed to solve this with the below:
import scala.util.parsing.json._
val myMap = JSON.parseFull(jsonString).get.asInstanceOf[Map[String, Object]]
The above will work on an RDD[String].
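For example, applied across the whole RDD (a sketch; fileName is the variable from the question, and the final line is just a hypothetical follow-up transformation):
import scala.util.parsing.json._
val lines = sc.textFile(fileName)
// map runs on the executors, so the whole file is never collected to the driver
val parsed = lines.map(line => JSON.parseFull(line).get.asInstanceOf[Map[String, Object]])
parsed.map(m => m.keys.mkString(","))   // hypothetical follow-up: work with each Map and produce a String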

org.mockito.exceptions.misusing.WrongTypeOfReturnValue on spark test cases

I'm currently writing test cases for Spark with Mockito, and I'm mocking a SparkContext that gets wholeTextFiles called on it. I have something like this:
val rdd = sparkSession.sparkContext.makeRDD(Seq(("Java", 20000),
("Python", 100000), ("Scala", 3000)))
doReturn(rdd).when(mockContext).wholeTextFiles(testPath)
However, I keep getting an error saying wholeTextFiles is supposed to return an int:
org.mockito.exceptions.misusing.WrongTypeOfReturnValue: ParallelCollectionRDD cannot be returned by wholeTextFiles$default$2()
wholeTextFiles$default$2() should return int
I know this isn't the case; the Spark docs say that wholeTextFiles returns an RDD. Any tips on how I can fix this? I can't have my doReturn be of type int, because then the rest of my function fails, since I turn the wholeTextFiles output into a DataFrame.
Resolved this with an alternative approach. It works just fine if I use the Mockito.when() ... thenReturn pattern, but I decided instead to change the entire test case to start a local SparkSession and load some sample files into an RDD, because that's a more in-depth test of my function, in my opinion.
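For reference, a minimal sketch of the when/thenReturn variant mentioned above (the test path and sample data here are hypothetical stand-ins; the RDD is typed (String, String) to match what wholeTextFiles actually returns):
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.mockito.Mockito.{mock, when}
val sparkSession = SparkSession.builder().master("local[*]").getOrCreate()
val mockContext = mock(classOf[SparkContext])
val testPath = "/some/test/path"   // hypothetical stand-in for the real test path
// an RDD typed (String, String), matching the return type of wholeTextFiles
val rdd: RDD[(String, String)] =
  sparkSession.sparkContext.makeRDD(Seq(("file1.txt", "Java,20000")))
// stubbing via when(...) sidesteps the doReturn/default-argument mismatch
when(mockContext.wholeTextFiles(testPath)).thenReturn(rdd)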

Scala: Print content of function definition

I have a Spark application and I implemented a DataFrame extension,
def transform: DataFrame => DataFrame
so the app developer can pass custom transformations into my framework, like
builder.load(path).transform(_.filter(col("sample") === lit("")))
Now I want to track what happened during Spark execution:
Log:
val df = spark.read()
val df2 = df.filter(col("sample") === lit(""))
...
So the idea is to keep a log of actions and pretty-print it at the end, but to do this I need to somehow get the content of the DataFrame => DataFrame function. Possibly macros can help me, but I am not sure. I don't actually need the code (although I will appreciate it), just a direction to go in.
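For concreteness, here is a rough sketch of the kind of extension point described above (Builder, load and result are hypothetical stand-ins for the framework's real API; this only illustrates the setup, not a solution to the logging question):
import org.apache.spark.sql.{DataFrame, SparkSession}
// hypothetical builder that lets the app developer plug in DataFrame => DataFrame steps
class Builder(spark: SparkSession) {
  private var df: DataFrame = _
  def load(path: String): Builder = { df = spark.read.text(path); this }
  def transform(f: DataFrame => DataFrame): Builder = { df = f(df); this }
  def result: DataFrame = df
}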

Accessing Spark.SQL

I am new to Spark. Following the example below from a book, I found that the command was giving an error. What would be the best way to run a Spark SQL command while coding in Spark in general?
scala> // Use SQL to create another DataFrame containing the account summary records
scala> val acSummary = spark.sql("SELECT accNo, sum(tranAmount) as TransTotal FROM trans GROUP BY accNo")
<console>:37: error: not found: value spark
I tried importing org.apache.spark.SparkContext and using the sc object, but no luck.
Assuming you're in the spark-shell, first get an SQLContext like this:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Then you can do:
val acSummary = sqlContext.sql("SELECT accNo, sum(tranAmount) as TransTotal FROM trans GROUP BY accNo")
So the value spark that is available in spark-shell is actually an instance of SparkSession (https://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.sql.SparkSession)
val spark = SparkSession.builder().getOrCreate()
will give you one.
What version are you using? It appears you're in the shell and this should work, but only in Spark 2+; otherwise you have to use sqlContext.sql.
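Putting the two answers together, a minimal Spark 2 sketch (assuming the trans view has already been registered, as in the book's example):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
// "trans" must have been registered first, e.g. transDF.createOrReplaceTempView("trans")
val acSummary = spark.sql("SELECT accNo, sum(tranAmount) as TransTotal FROM trans GROUP BY accNo")
acSummary.show()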

SQLContext implicits

I am learning Spark and Scala. I am well versed in Java, but not so much in Scala. I am going through a tutorial on Spark and came across the following lines of code, which have not been explained:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
(sc is the SparkContext instance)
I know the concepts behind Scala implicits (at least I think I do). Could somebody explain to me what exactly is meant by the import statement above? What implicits are bound to the sqlContext instance when it is instantiated, and how? Are these implicits defined inside the SQLContext class?
EDIT
The following seems to work for me as well (fresh code):
val sqlc = new SQLContext(sc)
import sqlContext.implicits._
In the code just above, what exactly is sqlContext and where is it defined?
From ScalaDoc:
sqlContext.implicits contains "(Scala-specific) Implicit methods available in Scala for converting common Scala objects into DataFrames. "
And is also explained in Spark programming guide:
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
For example, in the code below .toDF() won't work unless you import sqlContext.implicits._:
import scala.io.Source  // needed for Source.fromFile; Airport is a case class defined elsewhere
val airports = sc.makeRDD(Source.fromFile(airportsPath).getLines().drop(1).toSeq, 1)
.map(s => s.replaceAll("\"", "").split(","))
.map(a => Airport(a(0), a(1), a(2), a(3), a(4), a(5), a(6)))
.toDF()
What implicits are bound to the sqlContext instance when it is instantiated and how? Are these implicits defined inside the SQLContext class?
Yes, they are defined in the object implicits inside the SQLContext class, which extends SQLImplicits. It looks like there are two types of implicit conversions defined there:
RDD to DataFrameHolder conversion, which enables the above-mentioned rdd.toDF().
Various instances of Encoder, which are "Used to convert a JVM object of type T to and from the internal Spark SQL representation."
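A short sketch of both kinds in action (assuming a spark-shell session so sc exists, and Spark 1.6+ so the Dataset API is available; Person is just an example case class):
case class Person(name: String, age: Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
// 1) the RDD-to-holder conversion is what supplies toDF() on an RDD of a case class
val peopleDF = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25))).toDF()
// 2) the Encoder instances from the same import are what let you build typed Datasets
val peopleDS = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()
peopleDF.show()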