Cannot resolve symbol read in org.apache.spark.read - scala

What is the problem with the following Scala Spark code?
import org.apache.spark
// ...
val pathIn = "D:\\myfolder\\myfile.csv"
spark.read(pathIn).csv()
error: cannot resolve symbol "read"
pom.xml dependencies:
<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<encoding>UTF-8</encoding>
<scala.version>2.12.1</scala.version>
<scala.compat.version>2.12</scala.compat.version>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.12.1</version>
</dependency>
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_2.12</artifactId>
<version>3.0.8</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>2.4.4</version>
</dependency>
</dependencies>
I added some dependencies because I can't import SparkSession.

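Use the following. You need both spark-core and spark-sql as dependencies: SparkSession lives in spark-sql, which is missing from your pom.xml. In your code, spark refers to the package org.apache.spark (from import org.apache.spark), and a package has no read member. read belongs to a SparkSession instance, and the path goes to csv(), not to read.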
import org.apache.spark.sql.SparkSession
val spark : SparkSession = SparkSession.builder
.appName("test")
.master("local[2]")
.getOrCreate()
import spark.implicits._
val pathIn = "D:\\myfolder\\myfile.csv"
spark.read.csv(pathIn).show()

Related

Scala - object sql is not a member of package org.apache.spark

When I try to build a Maven project with the Scala nature in the Eclipse IDE, I get this error:
object sql is not a member of package org.apache.spark
We tried adding this dependency to pom.xml:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>1.6.0</version>
</dependency>
Input code
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
object MyApp {
def main(args: Array[String]) {
//Read from KAFKA TOPIC
val conf = new SparkConf().setMaster("local[*]").setAppName("Spark-Kafk-Integration")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(5))
val kafkaStream = KafkaUtils.createStream(ssc, "hostname:2181", "spark-streaming-consumer-group", Map("test4" -> 1))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
kafkaStream.foreachRDD(rdd => {
rdd.foreach(println)
if(rdd.count()>0) {
// rdd.toDF("value").coalesce(1).write.mode(SaveMode.Append).text("file:///D:/my/")
// rdd.toDF("value").coalesce(1).write.mode(SaveMode.Append).text("file://user/cloudera/testdata")
rdd.toDF("value").coalesce(1).write.mode(SaveMode.Append).text("hdfs://hostname:8020/user/cloudera/testdata")
// rdd.saveAsTextFile("C:/data/spark/")
}
})
}
}
Complete POM.XML
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.cyb</groupId>
<artifactId>First</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>First</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>1.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.11</artifactId>
<version>1.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.10.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>1.6.0</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>
Output
We want to write the stream data from the Kafka topic to HDFS storage.
Any help would be much appreciated.
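You need to import the Spark SQL classes to use Spark SQL functions. Try importing the following. Note that HiveContext lives in the separate spark-hive artifact, so that import also needs a spark-hive dependency in pom.xml.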
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLImplicits
import org.apache.spark.sql.SQLContext

Dataframes scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;) error

When I run a plain word count program (code below) without any DataFrame, I am able to run the application with spark-submit.
object wordCount {
def main(args: Array[String]): Unit = {
val logFile= "path/thread.txt"
val sparkConf = new SparkConf().setAppName("Spark Word Count")
val sc = new SparkContext(sparkConf)
val file = sc.textFile(logFile)
val counts = file.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("path/output1234")
sc.stop()
}
}
But when I run the code below:
import scala.reflect.runtime.universe
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
object wordCount {
def main(args: Array[String]): Unit = {
val logFile = "path/thread.txt"
val sparkConf = new SparkConf().setAppName("Spark Word Count")
val sc = new SparkContext(sparkConf)
val file = sc.textFile(logFile)
val counts = file.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
case class count1(key:String,value:Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._;
counts.toDF.registerTempTable("count1")
val counts1 = sqlContext.sql("select * from count1")
counts.saveAsTextFile("path/output1234")
sc.stop()
}
}
I am getting the below error:
Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
at com.cadillac.spark.sparkjob.wordCount$.main(wordCount.scala:18)
I am not sure what I am missing.
The pom.xml I am using is below:
<name>sparkjob</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.10</version>
</dependency>
</dependencies>
</project>
Please suggest any changes.
My cluster runs:
Spark version 2.1.0-mapr-1703
Scala version 2.11.8
Thanks in advance
The documentation explains the reason for this error as follows:
This means that there is a mix of Scala versions in the libraries used in your code. The collection API is different between Scala 2.10 and 2.11 and this the most common error which occurs if a Scala 2.10 library is attempted to be loaded in a Scala 2.11 runtime. To fix this make sure that the name has the correct Scala version suffix to match your Scala version.
So change your dependencies from
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.6.1</version>
</dependency>
to
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.0</version>
</dependency>
and add one more dependency:
<!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
With the artifacts matching the cluster's Spark 2.1.0 and Scala 2.11.8, the error should go away.
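The suffix rule quoted above can be checked mechanically. The following is a minimal sketch in plain Scala with no Spark dependency. The object VersionCheck and its helpers are hypothetical names made up for illustration. It compares the _2.xx suffix of an artifactId against the binary version of the Scala runtime the code is running on.

```scala
import scala.util.Properties

// Hypothetical helper, not part of Spark or Maven.
object VersionCheck {
  // Binary version is the first two components, e.g. "2.11" from "2.11.8".
  def binaryVersion(full: String): String =
    full.split('.').take(2).mkString(".")

  // Scala suffix of an artifactId, e.g. Some("2.10") from "spark-core_2.10".
  def artifactSuffix(artifactId: String): Option[String] = {
    val idx = artifactId.lastIndexOf('_')
    if (idx >= 0) Some(artifactId.substring(idx + 1)) else None
  }

  // True when the artifact was built for the given Scala version.
  def matches(artifactId: String, scalaVersion: String): Boolean =
    artifactSuffix(artifactId) == Some(binaryVersion(scalaVersion))

  def main(args: Array[String]): Unit = {
    // Version of the Scala library actually on the classpath.
    val running = Properties.versionNumberString
    println(s"Running on Scala $running")
    Seq("spark-core_2.10", "spark-sql_2.11").foreach { id =>
      println(s"$id matches runtime: ${matches(id, running)}")
    }
  }
}
```

Running this inside the same JVM setup as the job (for example from spark-shell on the cluster) shows which Scala version is really in use, which is the version the artifact suffixes in pom.xml have to agree with.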

Why does reading stream from Kafka fail with "Unable to find encoder for type stored in a Dataset"?

I am trying to use Spark Structured Streaming with Kafka.
object StructuredStreaming {
def main(args: Array[String]) {
if (args.length < 2) {
System.err.println("Usage: StructuredStreaming <hostname> <port>")
System.exit(1)
}
val host = args(0)
val port = args(1).toInt
val spark = SparkSession
.builder
.appName("StructuredStreaming")
.config("spark.master", "local")
.getOrCreate()
import spark.implicits._
// Subscribe to 1 topic
val lines = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9093")
.option("subscribe", "sparkss")
.load()
lines.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
}
}
I took this code from the Spark documentation and got this build error:
Unable to find encoder for type stored in a Dataset. Primitive types
(Int, String, etc) and Product types (case classes) are supported by
importing spark.implicits._ Support for serializing other types will
be added in future releases.
.as[(String, String)]
I read in another SO post that this is caused by a missing import spark.implicits._, but adding it does not change anything for me.
UPDATE:
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<slf4j.version>1.7.12</slf4j.version>
<spark.version>2.1.0</spark.version>
<scala.version>2.10.4</scala.version>
<scala.binary.version>2.10</scala.binary.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.10</artifactId>
<version>2.1.0</version>
</dependency>
</dependencies>
I then tried Scala 2.11.8:
<scala.version>2.11.8</scala.version>
<scala.binary.version>2.11</scala.binary.version>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.11</artifactId>
<version>2.1.0</version>
</dependency>
</dependencies>
With Scala 2.11.8 and the corresponding dependencies (for Scala 2.11), it eventually worked.
Warning: you need to restart your project in IntelliJ. There seem to be problems when you change versions without restarting, because the errors remain until you do.

Spark Scala : Unable to import sqlContext.implicits._

I tried the code below and cannot import sqlContext.implicits._. The Scala IDE reports this error and the code does not build:
value implicits is not a member of org.apache.spark.sql.SQLContext
Do I need to add any dependencies in pom.xml?
Spark version 1.5.2
package com.Spark.ConnectToHadoop
import org.apache.spark.SparkConf
import org.apache.spark.SparkConf
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.rdd.RDD
//import groovy.sql.Sql.CreateStatementCommand
//import org.apache.spark.SparkConf
object CountWords {
def main(args:Array[String]){
val objConf = new SparkConf().setAppName("Spark Connection").setMaster("spark://IP:7077")
var sc = new SparkContext(objConf)
val objHiveContext = new HiveContext(sc)
objHiveContext.sql("USE test")
var rdd= objHiveContext.sql("select * from Table1")
val options=Map("path" -> "hdfs://URL/apps/hive/warehouse/test.db/TableName")
//val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ //Error
val dataframe = rdd.toDF()
dataframe.write.format("orc").options(options).mode(SaveMode.Overwrite).saveAsTable("TableName")
}
}
My pom.xml file is as follows
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.Sudhir.Maven1</groupId>
<artifactId>SparkDemo</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>SparkDemo</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>0.9.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>
First create the SQLContext:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
This gives you a sqlContext bound to sc (spark-shell creates one automatically at launch).
Then import its implicits:
import sqlContext.implicits._
With the release of Spark 2.0.0 (July 26, 2016) one should now use the following:
import spark.implicits._ // spark = SparkSession.builder().getOrCreate()
https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html
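As a minimal sketch of the Spark 2.x form, assuming spark-sql for your Scala version is on the classpath: the object name ImplicitsSketch, the local[*] master, and the sample rows are illustrative assumptions, not something from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical example object; names and data are for illustration only.
object ImplicitsSketch {
  // toDF on a local Seq only compiles with spark.implicits._ in scope.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("a", 1), ("b", 2)).toDF("key", "value")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("implicits-sketch")
      .master("local[*]") // assumption: local run; drop this when using spark-submit
      .getOrCreate()
    sample(spark).show()
    spark.stop()
  }
}
```

The import has to come after the SparkSession value exists, because implicits is a member of that instance rather than of a package.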
Your spark-sql dependency is an old version (1.0.0) that does not match your spark-core version (1.5.2). Change it to:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.5.2</version>
</dependency>
If you build with sbt, update the library versions to:
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.12" % "2.4.6" % "provided",
"org.apache.spark" % "spark-sql_2.12" % "2.4.6" % "provided"
)
Then import the SQL implicits as below:
val spark = SparkSession.builder()
.appName("appName")
.getOrCreate()
import spark.sqlContext.implicits._;
You can also use a Maven property so that both Spark artifacts share one version:
<properties>
<spark.version>2.2.0</spark.version>
</properties>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/analysis/OverrideFunctionRegistry

I tried the code below with Spark and Scala. The code and pom.xml are attached.
package com.Spark.ConnectToHadoop
import org.apache.spark.SparkConf
import org.apache.spark.SparkConf
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.rdd.RDD
//import groovy.sql.Sql.CreateStatementCommand
//import org.apache.spark.SparkConf
object CountWords {
def main(args:Array[String]){
val objConf = new SparkConf().setAppName("Spark Connection").setMaster("spark://IP:7077")
var sc = new SparkContext(objConf)
val objHiveContext = new HiveContext(sc)
objHiveContext.sql("USE test")
var test= objHiveContext.sql("show tables")
var i = 0
var testing = test.collect()
for(i<-0 until testing.length){
println(testing(i))
}
}
}
I have added the spark-core_2.10, spark-catalyst_2.10, spark-sql_2.10, and spark-hive_2.10 dependencies. Do I need to add any more dependencies?
Edit:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.Sudhir.Maven1</groupId>
<artifactId>SparkDemo</artifactId>
<version>IntervalMeterData1</version>
<packaging>jar</packaging>
<name>SparkDemo</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<spark.version>1.5.2</spark.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-catalyst_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>
It looks like you forgot to bump spark-hive from 1.2.1 to 1.5.2 so that it matches your other Spark artifacts:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>1.5.2</version>
</dependency>
Consider introducing a Maven property, such as spark.version:
<properties>
<spark.version>1.5.2</spark.version>
</properties>
Then modify all your Spark dependencies in this manner:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
Bumping Spark versions then becomes much less painful.
Just adding the spark.version property to your <properties> is not enough. You also have to reference it as ${spark.version} in each dependency.