value toDF is not a member of Seq[(Int,String)] - scala

I am trying to execute the following code but getting this error:
value toDF is not a member of Seq[(Int,String)].
I have the case class outside main and I have imported implicits too, but I am still getting this error. Can someone help me resolve this? I am using Spark 2.1.0 (the Scala 2.11 build) and Scala 2.11.8.
import org.apache.spark.sql._
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark._

final case class Email(id: Int, text: String)

object SampleKMeans {
  def main(args: Array[String]) = {
    val spark = SparkSession.builder.appName("SampleKMeans")
      .master("yarn")
      .getOrCreate()
    import spark.implicits._

    val emails = Seq(
      "This is an email from...",
      "SPAM SPAM spam",
      "Hello, We'd like to offer you")
      .zipWithIndex.map(_.swap).toDF("id", "text").as[Email]
  }
}

You already have a SparkSession; importing spark.implicits._ after the session is created will work in your case:
val spark = SparkSession.builder.appName("SampleKMeans")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._
Now the toDF method works as expected.
If the error still exists, you need to check the versions of the Spark and Scala libraries you are using and make sure they match.
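If you build with sbt, a minimal sketch of a matching dependency setup (an assumption; the question does not show the build file, and the versions must match your environment) could look like this:

// build.sbt - hypothetical sketch; Spark artifact versions must match the Scala binary version
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.1.0",
  "org.apache.spark" %% "spark-sql"   % "2.1.0",
  "org.apache.spark" %% "spark-mllib" % "2.1.0"
)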
Hope this helps!

Related

import sqlContext cannot be resolved although defined with SQLContext instance

I followed the solutions here; however, I am still getting the "cannot resolve symbol SQLContext" error, and ".implicits._" cannot be resolved either. What would be the reason for this?
Spark/Scala versions I use:
Scala 2.12.13
Spark 3.0.1 (without bundled Hadoop)
Here is my related code part:
import org.apache.log4j.LogManager
import org.apache.spark.{SparkConf, SparkContext}

object Count {
  def main(args: Array[String]) {
    ...
    ...
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
  }
}
You didn't import SQLContext at all:
import org.apache.spark.sql.SQLContext
You should probably not use SQLContext anymore in the first place though:
As of Spark 2.0, this is replaced by SparkSession. However, we are keeping the class here for backward compatibility.
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/SQLContext.html
See How to create SparkSession from existing SparkContext for how to obtain a SparkSession from a SparkContext, and then import sparkSession.implicits._.
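For reference, a minimal sketch of that pattern (assuming an existing SparkContext named sc, as in the question) looks roughly like this:

import org.apache.spark.sql.SparkSession

// Build a SparkSession on top of the existing SparkContext's configuration.
val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
import spark.implicits._

// toDF and toDS are now available on local collections and RDDs.
val df = Seq((1, "a"), (2, "b")).toDF("id", "value")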

toDF is not working in Spark Scala IDE, but works perfectly in spark-shell [duplicate]

This question already has answers here:
Spark 2.0 Scala - RDD.toDF()
(4 answers)
Closed 2 years ago.
I am new to Spark and I am trying to run the code below both from spark-shell and from the Spark Scala Eclipse IDE.
When I run it from the shell, it works perfectly.
But in the IDE, it gives a compilation error.
Please help.
package sparkWCExample.spWCExample

import org.apache.log4j.Level
import org.apache.spark.sql.{ Dataset, SparkSession, DataFrame, Row }
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql._

object TwitterDatawithDataset {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Spark Scala WordCount Example")
      .setMaster("local[1]")
    val spark = SparkSession.builder()
      .config(conf)
      .appName("CsvExample")
      .master("local")
      .getOrCreate()
    val csvData = spark.sparkContext
      .textFile("C:\\Sankha\\Study\\data\\bank_data.csv", 3)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    case class Bank(age: Int, job: String)
    val bankDF = dfData.map(x => Bank(x(0).toInt, x(1)))
    val df = bankDF.toDF()
  }
}
The exception, at compile time itself, is as below:
Description Resource Path Location Type
value toDF is not a member of org.apache.spark.rdd.RDD[Bank] TwitterDatawithDataset.scala /spWCExample/src/main/java/sparkWCExample/spWCExample line 35 Scala Problem
To use toDF(), you must enable implicit conversions:
import spark.implicits._
In spark-shell it is enabled by default, which is why the code works there. The :imports command can be used to see what imports are already present in your shell:
scala> :imports
1) import org.apache.spark.SparkContext._ (70 terms, 1 are implicit)
2) import spark.implicits._ (1 types, 67 terms, 37 are implicit)
3) import spark.sql (1 terms)
4) import org.apache.spark.sql.functions._ (385 terms)
This works fine for me in Eclipse Scala IDE:
case class Bank(age: Int, job: String)
val u = Array((1, "manager"), (2, "clerk"))
import spark.implicits._
spark.sparkContext.makeRDD(u).map(r => Bank(r._1, r._2)).toDF().show()
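Applied to the code from the question, a minimal sketch of a corrected version (assuming bank_data.csv has no header row, with age in the first column and job in the second; note the case class is moved outside main so Spark can derive an encoder for it) might look like this:

import org.apache.spark.sql.SparkSession

// Case class at the top level so that an encoder can be derived for it.
case class Bank(age: Int, job: String)

object TwitterDatawithDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CsvExample")
      .master("local")
      .getOrCreate()
    import spark.implicits._ // enables .toDF() on RDDs of case classes

    val bankDF = spark.sparkContext
      .textFile("C:\\Sankha\\Study\\data\\bank_data.csv", 3)
      .map(_.split(","))
      .map(x => Bank(x(0).toInt, x(1)))
      .toDF()

    bankDF.show()
  }
}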

Running wordcount failed in scala

I am trying to run a word count program in Scala. Here is what my code looks like.
package myspark;

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.implicits._

object WordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext( "local", "Word Count", "/home/hadoop/spark-2.2.0-bin-hadoop2.7/bin", Nil, Map(), Map())
    val input = sc.textFile("/myspark/input.txt")
    Val count = input.flatMap(line ⇒ line.split(" "))
      .map(word ⇒ (word, 1))
      .reduceByKey(_ + _)
    count.saveAsTextFile("outfile")
    System.out.println("OK");
  }
}
Then I tried to execute it with Spark:
spark-shell -i /myspark/WordCount.scala
And I get this error.
... 149 more
<console>:14: error: not found: value spark
import spark.implicits._
^
<console>:14: error: not found: value spark
import spark.sql
^
That file does not exist
Can someone please explain the error in this code? I am very new to Spark and Scala both. I have verified that the input.txt file is in the mentioned location.
You can take a look here to get started: Learning Spark-WordCount.
Other than that, there are several errors that I can see:
import org.apache.spark..implicits._: the two dots won't work.
Also, have you added the Spark dependency to your project, perhaps even as provided? You must do that at least in order to run the Spark code.
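For example, with sbt that dependency might be declared roughly like this (a sketch; the exact version should match the Spark installation mentioned in the question):

// build.sbt - hypothetical sketch
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0" % "provided"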
First of all, check whether you have added the right dependencies. I can also see a few mistakes in your code.
Create a SparkSession, not a SparkContext (see the SparkSession API):
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
Then use this spark variable:
import spark.implicits._
I am not sure why you have written import org.apache.spark..implicits._ with two dots between spark and implicits.
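Putting those points together, a minimal sketch of the corrected word count (assuming the input path from the question and that the program is packaged and run with spark-submit) could be:

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession instead of constructing a SparkContext directly.
    val spark = SparkSession.builder()
      .appName("Word Count")
      .master("local[*]")
      .getOrCreate()

    val input = spark.sparkContext.textFile("/myspark/input.txt")
    val counts = input.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("outfile")

    spark.stop()
  }
}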

How should I write unit tests in Spark, for a basic data frame creation example?

I'm struggling to write a basic unit test for creation of a data frame, using the example text file provided with Spark, as follows.
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {

  private val master = "local[*]"
  private val appName = "data_load_testing"

  private var spark: SparkSession = _

  override def beforeEach() {
    spark = new SparkSession.Builder().appName(appName).getOrCreate()
  }

  import spark.implicits._

  case class Person(name: String, age: Int)

  val df = spark.sparkContext
    .textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
    .toDF()

  test("Creating dataframe should produce data from of correct size") {
    assert(df.count() == 3)
    assert(df.take(1).equals(Array("Michael", 29)))
  }

  override def afterEach(): Unit = {
    spark.stop()
  }
}
I know that the code itself works (from spark.implicits._ .... toDF()) because I have verified this in the Spark Scala shell, but inside the test class I'm getting lots of errors; the IDE does not recognise import spark.implicits._ or toDF(), and therefore the tests don't run.
I am using SparkSession which automatically creates SparkConf, SparkContext and SQLContext under the hood.
My code simply uses the example code from the Spark repo.
Any ideas why this is not working? Thanks!
NB. I have already looked at the Spark unit test questions on StackOverflow, like this one: How to write unit tests in Spark 2.0+?
I have used this to write the test but I'm still getting the errors.
I'm using Scala 2.11.8, and Spark 2.2.0 with SBT and IntelliJ. These dependencies are correctly included within the SBT build file. The errors on running the tests are:
Error:(29, 10) value toDF is not a member of org.apache.spark.rdd.RDD[dataLoadTest.this.Person]
possible cause: maybe a semicolon is missing before `value toDF'?
.toDF()
Error:(20, 20) stable identifier required, but dataLoadTest.this.spark.implicits found.
import spark.implicits._
IntelliJ won't recognise import spark.implicits._ or the .toDF() method.
I have imported:
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, FlatSpec, FunSuite, Matchers}
You need to assign the sqlContext to a val for implicits to work. Since your sparkSession is a var, implicits won't work with it.
So you need to do:
val sQLContext = spark.sqlContext
import sQLContext.implicits._
Moreover, you can do the DataFrame creation inside your test functions, so that your test class looks like the following:
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {

  private val master = "local[*]"
  private val appName = "data_load_testing"

  var spark: SparkSession = _

  override def beforeEach() {
    spark = new SparkSession.Builder().appName(appName).master(master).getOrCreate()
  }

  test("Creating dataframe should produce data from of correct size") {
    val sQLContext = spark.sqlContext
    import sQLContext.implicits._

    val df = spark.sparkContext
      .textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
      .toDF()

    assert(df.count() == 3)
    assert(df.take(1)(0)(0).equals("Michael"))
  }

  override def afterEach() {
    spark.stop()
  }
}

case class Person(name: String, age: Int)
There are many libraries for unit testing of Spark; one of the most widely used is
spark-testing-base, by Holden Karau.
This library provides everything with sc as the SparkContext; below is a simple example:
class TestSharedSparkContext extends FunSuite with SharedSparkContext {

  val expectedResult = List(("a", 3), ("b", 2), ("c", 4))

  test("Word counts should be equal to expected") {
    verifyWordCount(Seq("c a a b a c b c c"))
  }

  def verifyWordCount(seq: Seq[String]): Unit = {
    assertResult(expectedResult)(new WordCount().transform(sc.makeRDD(seq)).collect().toList)
  }
}
Here, everything is prepared with sc as the SparkContext.
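Note that the WordCount class referenced in the test is not part of spark-testing-base; a hypothetical sketch of such a class under test might be:

import org.apache.spark.rdd.RDD

// Hypothetical class under test: splits lines into words and counts occurrences.
class WordCount extends Serializable {
  def transform(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
}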
Another approach is to create a TestSparkWrapper and use it for multiple test cases, as below:
import org.apache.spark.sql.SparkSession

trait TestSparkWrapper {
  lazy val sparkSession: SparkSession =
    SparkSession.builder().master("local").appName("spark test example").getOrCreate()
}
Use this TestSparkWrapper for all the tests with ScalaTest, combining it with BeforeAndAfterAll or BeforeAndAfterEach.
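For instance, a hypothetical test class mixing in the trait (assuming ScalaTest's FunSuite and BeforeAndAfterAll are available, as in the question; the class name and data are made up for illustration) might look like this:

import org.scalatest.{BeforeAndAfterAll, FunSuite}

class SharedSessionTest extends FunSuite with BeforeAndAfterAll with TestSparkWrapper {

  test("toDF works with the shared session") {
    import sparkSession.implicits._
    // Same sample people as in Spark's people.txt example.
    val df = Seq(("Michael", 29), ("Andy", 30), ("Justin", 19)).toDF("name", "age")
    assert(df.count() == 3)
  }

  override def afterAll(): Unit = {
    sparkSession.stop()
  }
}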
Hope this helps!

Converting error with RDD operation in Scala

I am new to Scala and I ran into an error while doing some practice.
I tried to convert an RDD into a DataFrame; the following is my code.
package com.sclee.examples

import com.sun.org.apache.xalan.internal.xsltc.compiler.util.IntType
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType};

object App {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("examples").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    case class Person(name: String, age: Long)

    val personRDD = sc.makeRDD(Seq(Person("A",10),Person("B",20)))
    val df = personRDD.map({
      case Row(val1: String, val2: Long) => Person(val1,val2)
    }).toDS()
    // val ds = personRDD.toDS()
  }
}
I followed the instructions in the Spark documentation and also referenced some blogs showing how to convert an RDD into a DataFrame, but I got the error below.
Error:(20, 27) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
val df = personRDD.map({
I tried to fix the problem by myself but failed. Any help will be appreciated.
The following code works:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

object SparkTest {
  def main(args: Array[String]): Unit = {

    // use the SparkSession of Spark 2
    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    import spark.implicits._

    // this is your RDD - just a sample of how to create an RDD
    val personRDD: RDD[Person] = spark.sparkContext.parallelize(Seq(Person("A", 10), Person("B", 20)))

    // the SparkSession has a method to convert it to a Dataset
    val ds = spark.createDataset(personRDD)
    println(ds.count())
  }
}
I made the following changes:
use SparkSession instead of SparkContext and SQLContext
move the Person class out of the App (I'm not sure why I had to do this)
use createDataset for the conversion
However, I guess it's pretty uncommon to do this conversion, and you probably want to read your input directly into a Dataset using the read method.
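For example, a minimal sketch of that approach (assuming a hypothetical people.csv file with a header row and name,age columns; the object name is made up for illustration) could be:

import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(name: String, age: Long)

object ReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReadExample").master("local").getOrCreate()
    import spark.implicits._

    // Read the CSV straight into a typed Dataset instead of converting an RDD.
    val people: Dataset[Person] = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("people.csv")
      .as[Person]

    people.show()
  }
}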