toDF is not working in spark scala ide , but works perfectly in spark-shell [duplicate] - scala

This question already has answers here:
Spark 2.0 Scala - RDD.toDF()
(4 answers)
Closed 2 years ago.
I am new to Spark and I am trying to run the code below both from spark-shell and from the Spark Scala Eclipse IDE.
When I run it from the shell, it works perfectly.
But in the IDE, it gives a compilation error.
Please help.
package sparkWCExample.spWCExample

import org.apache.log4j.Level
import org.apache.spark.sql.{ Dataset, SparkSession, DataFrame, Row }
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql._

object TwitterDatawithDataset {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Spark Scala WordCount Example")
      .setMaster("local[1]")
    val spark = SparkSession.builder()
      .config(conf)
      .appName("CsvExample")
      .master("local")
      .getOrCreate()
    val csvData = spark.sparkContext
      .textFile("C:\\Sankha\\Study\\data\\bank_data.csv", 3)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    case class Bank(age: Int, job: String)
    val bankDF = dfData.map(x => Bank(x(0).toInt, x(1)))
    val df = bankDF.toDF()
  }
}
The exception, raised at compile time itself, is as below:
Description Resource Path Location Type
value toDF is not a member of org.apache.spark.rdd.RDD[Bank] TwitterDatawithDataset.scala /spWCExample/src/main/java/sparkWCExample/spWCExample line 35 Scala Problem

To use toDF(), you must enable implicit conversions:
import spark.implicits._
In spark-shell it is enabled by default, which is why the code works there. The :imports command can be used to see which imports are already present in your shell:
scala> :imports
1) import org.apache.spark.SparkContext._ (70 terms, 1 are implicit)
2) import spark.implicits._ (1 types, 67 terms, 37 are implicit)
3) import spark.sql (1 terms)
4) import org.apache.spark.sql.functions._ (385 terms)

This works fine for me in Eclipse Scala IDE:
case class Bank(age: Int, job: String)
val u = Array((1, "manager"), (2, "clerk"))
import spark.implicits._
spark.sparkContext.makeRDD(u).map(r => Bank(r._1, r._2)).toDF().show()
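Applied to the code from the question, the same fix would look roughly like the sketch below: build the SparkSession first, move the Bank case class to the top level (so an encoder can be found for it), and import the implicits from that session. This assumes the CSV is comma-separated with age in the first column and job in the second; the path is taken verbatim from the question.
package sparkWCExample.spWCExample

import org.apache.spark.sql.SparkSession

// defined at top level so Spark can derive an encoder for it
case class Bank(age: Int, job: String)

object TwitterDatawithDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CsvExample")
      .master("local[1]")
      .getOrCreate()

    // implicits from the session you just created enable rdd.toDF()
    import spark.implicits._

    val csvData = spark.sparkContext
      .textFile("C:\\Sankha\\Study\\data\\bank_data.csv", 3) // path from the question

    val bankDF = csvData
      .map(_.split(","))                      // assumes a comma-separated file
      .map(x => Bank(x(0).trim.toInt, x(1)))
      .toDF()

    bankDF.show()
  }
}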

Related

value toDF is not a member of Seq[(Int,String)]

I am trying to execute the following code but am getting this error:
value toDF is not a member of Seq[(Int,String)].
I have the case class outside main and I have imported the implicits too, but I am still getting this error. Can someone help me resolve this? I am using Spark 2.1.0 (the Scala 2.11 build) and Scala 2.11.8.
import org.apache.spark.sql._
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark._

final case class Email(id: Int, text: String)

object SampleKMeans {
  def main(args: Array[String]) = {
    val spark = SparkSession.builder.appName("SampleKMeans")
      .master("yarn")
      .getOrCreate()
    import spark.implicits._
    val emails = Seq(
      "This is an email from...",
      "SPAM SPAM spam",
      "Hello, We'd like to offer you")
      .zipWithIndex.map(_.swap).toDF("id", "text").as[Email]
  }
}
You already have a SparkSession; just import spark.implicits._ after creating it and it will work in your case:
val spark = SparkSession.builder.appName("SampleKMeans")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._
Now the toDF method works as expected.
If the error still exists, you need to check the versions of the Spark and Scala libraries that you are using.
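In particular, the Scala suffix of the Spark artifacts must match the project's scalaVersion. A build.sbt along these lines (versions are only illustrative, based on the ones you mention) keeps them consistent:
// build.sbt - illustrative versions; %% keeps the Scala suffix in sync with scalaVersion
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql"  % "2.1.0"
)
The %% operator appends the Scala binary version to the artifact name, so spark-core_2.11 is selected automatically.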
Hope this helps!

Running wordcount failed in scala

I am trying to run a word count program in Scala. Here is what my code looks like.
package myspark;

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark..implicits._

object WordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext( "local", "Word Count", "/home/hadoop/spark-2.2.0-bin-hadoop2.7/bin", Nil, Map(), Map())
    val input = sc.textFile("/myspark/input.txt")
    Val count = input.flatMap(line ⇒ line.split(" "))
      .map(word ⇒ (word, 1))
      .reduceByKey(_ + _)
    count.saveAsTextFile("outfile")
    System.out.println("OK");
  }
}
Then I tried to execute it in spark.
spark-shell -i /myspark/WordCount.scala
And I get this error.
... 149 more
<console>:14: error: not found: value spark
import spark.implicits._
^
<console>:14: error: not found: value spark
import spark.sql
^
That file does not exist
Can someone please explain the error in this code? I am very new to Spark and Scala both. I have verified that the input.txt file is in the mentioned location.
You can take a look here to get started: Learning Spark - WordCount.
Other than that, there are several errors that I can see:
import org.apache.spark..implicits._: the two dots won't work.
Also, have you added the Spark dependency to your project? Maybe even as provided? You must do that at the very least to run Spark code.
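If the dependency is missing, something like the lines below in build.sbt would add it (versions are illustrative, matching the spark-2.2.0 install from the question; drop the provided scope if you run the job locally from the IDE rather than via spark-submit):
// build.sbt - illustrative; "provided" assumes the cluster already ships Spark
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.2.0" % "provided"
)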
First of all, check whether you have added the right dependencies. I can also see a few mistakes in your code.
Create a SparkSession, not a SparkContext (see the SparkSession API):
import org.apache.spark.sql.SparkSession
val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
Then use this spark variable
import spark.implicits._
I am not sure why you have written import org.apache.spark..implicits._ with two dots between spark and implicits.
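Putting both points together, a corrected version of the word count could look like the sketch below (the input and output paths are kept from the question; local[*] is an assumption for running it from the IDE):
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Word Count")
      .master("local[*]")
      .getOrCreate()

    // the word count itself only needs the SparkContext held by the session
    val input = spark.sparkContext.textFile("/myspark/input.txt")
    val count = input
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    count.saveAsTextFile("outfile")
    spark.stop()
  }
}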

Error found when importing spark.implicits

I am using Spark 1.4.0.
When I tried to import spark.implicits using this command:
import spark.implicits._, this error appeared:
<console>:19: error: not found: value spark
import spark.implicits._
^
Can anyone help me to resolve this problem?
It's because SparkSession is available from Spark 2.0 onwards, and the spark value is an object of type SparkSession in the Spark 2.x REPL.
In Spark 1.4 use
import sqlContext.implicits._
The sqlContext value is automatically created in the Spark REPL for Spark 1.x.
To make it complete: in a standalone application you first have to create the sqlContext yourself:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
val conf = new SparkConf().setMaster("local").setAppName("my app")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
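With those implicits in scope, toDF also works on an RDD of tuples in Spark 1.x, for example (the data values are just illustrative):
// quick check that the implicit conversion is in scope
val df = sc.parallelize(Seq((1, "manager"), (2, "clerk"))).toDF("id", "job")
df.show()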

Converting error with RDD operation in Scala

I am new to Scala and I ran into an error while doing some practice.
I tried to convert an RDD into a DataFrame; the following is my code.
package com.sclee.examples

import com.sun.org.apache.xalan.internal.xsltc.compiler.util.IntType
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType};

object App {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("examples").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    case class Person(name: String, age: Long)
    val personRDD = sc.makeRDD(Seq(Person("A",10),Person("B",20)))

    val df = personRDD.map({
      case Row(val1: String, val2: Long) => Person(val1,val2)
    }).toDS()

    // val ds = personRDD.toDS()
  }
}
I followed the instructions in the Spark documentation and also referenced some blogs showing how to convert an RDD into a DataFrame, but I got the error below.
Error:(20, 27) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
val df = personRDD.map({
I tried to fix the problem myself but failed. Any help will be appreciated.
The following code works:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

object SparkTest {
  def main(args: Array[String]): Unit = {
    // use the SparkSession of Spark 2
    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()
    import spark.implicits._

    // this is your RDD - just a sample of how to create an RDD
    val personRDD: RDD[Person] = spark.sparkContext.parallelize(Seq(Person("A",10),Person("B",20)))

    // the SparkSession has a method to convert an RDD to a Dataset
    val ds = spark.createDataset(personRDD)
    println(ds.count())
  }
}
I made the following changes:
use SparkSession instead of SparkContext and SqlContext
move the Person class out of the App object (I'm not sure why I had to do this)
use createDataset for conversion
However, I guess it's pretty uncommon to do this conversion; you probably want to read your input directly into a Dataset using the read method, as sketched below.
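For example, if the persons were stored in a CSV file, something like the sketch below would skip the RDD entirely; the file path and the header option are assumptions made purely for illustration:
import org.apache.spark.sql.{Encoders, SparkSession}

case class Person(name: String, age: Long)

object ReadDirectly {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read example").master("local[*]").getOrCreate()
    import spark.implicits._

    // hypothetical input file with a "name,age" header line
    val ds = spark.read
      .option("header", "true")
      .schema(Encoders.product[Person].schema) // reuse the case class as the schema
      .csv("/path/to/persons.csv")
      .as[Person]

    ds.show()
  }
}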

Spark dataframe join is failing if key column contains a period(".") in the end

I am getting the exception below when I do a join between two dataframes in Spark (version 1.5, Scala 2.10).
Exception in thread "main" org.apache.spark.sql.AnalysisException: syntax error in attribute name: col1.;
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99)
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:118)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:182)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:653)
at com.nielsen.buy.integration.commons.Demo$.main(Demo.scala:62)
at com.nielsen.buy.integration.commons.Demo.main(Demo.scala)
The code works fine if the column in the dataframe does not contain a period. Please do help me out.
You can find the code that I am using below.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import com.google.gson.Gson
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.Row

object Demo
{
  lazy val sc: SparkContext = {
    val conf = new SparkConf().setMaster("local")
      .setAppName("demooo")
      .set("spark.driver.allowMultipleContexts", "true")
    new SparkContext(conf)
  }
  sc.setLogLevel("ERROR")
  lazy val sqlcontext=new SQLContext(sc)

  val data=List(Row("a","b"),Row("v","b"))
  val dataRdd=sc.parallelize(data)
  val schema = new StructType(Array(StructField("col.1",StringType,true),StructField("col2",StringType,true)))
  val df1=sqlcontext.createDataFrame(dataRdd, schema)

  val data2=List(Row("a","b"),Row("v","b"))
  val dataRdd2=sc.parallelize(data2)
  val schema2 = new StructType(Array(StructField("col3",StringType,true),StructField("col4",StringType,true)))
  val df2=sqlcontext.createDataFrame(dataRdd2, schema2)

  val val1="col.1"
  val df3= df1.join(df2,df1.col(val1).equalTo(df2.col("col3")),"outer").show
}
In general, a period is used to access members of a struct field.
The Spark version you are using (1.5) is relatively old. Several such issues were fixed in later versions, so an upgrade might solve the issue on its own.
That said, you can simply use withColumnRenamed to rename the column to something which does not have a period before the join.
So you basically do something like this:
val dfTmp = df1.withColumnRenamed(val1, "JOIN_COL")
val df3= dfTmp.join(df2,dfTmp.col("JOIN_COL").equalTo(df2.col("col3")),"outer").withColumnRenamed("JOIN_COL", val1)
df3.show
By the way, show returns Unit, so you probably meant df3 to be the expression without the show call and then call df3.show separately.
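Another option that may work, depending on the Spark version, is to escape the column name with backticks so the period is not parsed as a struct-field accessor; this works on recent Spark versions, but treat it as something to verify on 1.5:
// backticks make the attribute parser treat the whole string as one column name
val df3 = df1.join(df2, df1.col(s"`$val1`").equalTo(df2.col("col3")), "outer")
df3.show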