SparkSQL: not found value expr - scala

I'm having some problems building a simple application with Spark SQL. What I want to do is add a new column to a DataFrame, so I have done:
val sqlContext = new HiveContext(sc)
import sqlContext._
// creating the DataFrame
correctDF.withColumn("COL1", expr("concat('000',COL1)"))
but when I build it with sbt it throws the exception:
not found: value expr
(Eclipse complains about it as well.)
In the spark-shell, however, it works like a charm.
In my build.sbt file I have:
scalaVersion := "2.10.5"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.0" % "provided"
libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.6.0" % "provided"
libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.6.0" % "provided"
I've added the last line after I read a post, but nothing changed...
Can someone help me?

I found the answer. I was missing this import:
import org.apache.spark.sql.functions._
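For reference, here is a minimal, self-contained sketch of the fix against the same Spark 1.6 / HiveContext setup (the object name and the sample data are only illustrative, not the original application):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions._   // expr, concat, lit, ... live here

object AddColumnExample {                  // hypothetical app name
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AddColumnExample"))
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // illustrative stand-in for correctDF
    val correctDF = Seq(Tuple1("1"), Tuple1("42")).toDF("COL1")
    val result = correctDF.withColumn("COL1", expr("concat('000', COL1)"))
    result.show()
  }
}
Without the functions._ import, expr is simply not in scope, which is exactly what the compiler reports.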

Related

spark-cassandra (java.lang.NoClassDefFoundError: org/apache/spark/sql/cassandra/package)

I am trying to read a DataFrame from Cassandra 4.0.3 with Spark 3.2.1 using Scala 2.12.15 and sbt 1.6.2, but I have a problem.
This is my sbt file:
name := "StreamHandler"
version := "1.6.2"
scalaVersion := "2.12.15"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.2.1" % "provided",
  "org.apache.spark" %% "spark-sql" % "3.2.1" % "provided",
  "org.apache.cassandra" % "cassandra-all" % "4.0.3" % "test",
  "org.apache.spark" %% "spark-streaming" % "3.2.1" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "3.2.0",
  "com.datastax.cassandra" % "cassandra-driver-core" % "4.0.0"
)
libraryDependencies += "com.datastax.dse" % "dse-java-driver-core" % "2.1.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "3.2.1" % "provided"
libraryDependencies += "org.apache.commons" % "commons-math3" % "3.6.1" % "provided"
and this is my Scala file:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming._
import org.apache.spark.sql.types._
import org.apache.spark.sql.cassandra._
import com.datastax.oss.driver.api.core.uuid.Uuids
import com.datastax.spark.connector._
object StreamHandler {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("Stream Handler")
      .config("spark.cassandra.connection.host", "localhost")
      .getOrCreate()

    import spark.implicits._

    val Temp_DF = spark
      .read
      .cassandraFormat("train_temperature", "project")
      .load()

    Temp_DF.show(10)
  }
}
and this is the result:
Usually the problem is that sbt package builds a jar containing only your code, without its dependencies. To mitigate this you have two approaches:
Specify the Cassandra connector when using spark-submit, with --packages com.datastax.spark:spark-cassandra-connector_2.12:3.2.0, as described in the documentation.
Create a fat jar (with all necessary dependencies) using the SBT assembly plugin, taking care not to include the Spark classes in the fat jar; a minimal build sketch follows below.
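For the second approach, the sketch below assumes sbt-assembly 1.x (the exact plugin version is my assumption). Because the Spark artifacts in your build.sbt are already marked "provided", they are left out of the fat jar automatically; the merge strategy handles the duplicate files that connector jars commonly bring along.
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")

// build.sbt additions
ThisBuild / assemblyMergeStrategy := {
  case PathList("META-INF", "services", _*) => MergeStrategy.concat   // keep data source registrations
  case PathList("META-INF", _*)             => MergeStrategy.discard  // drop signatures/manifests
  case _                                    => MergeStrategy.first    // pick one copy of anything else
}
Running sbt assembly then produces a single jar under target/scala-2.12/ that you can pass to spark-submit without --packages.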

Getting error while running Scala Spark code to list blobs in storage

I am getting the error below while trying to list blobs using the google-cloud-storage library:
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
I have tried changing the version of the google-cloud-storage library in build.sbt, but I keep getting the same error again and again.
import com.google.auth.oauth2.GoogleCredentials
import com.google.cloud.storage._
import com.google.cloud.storage.Storage.BlobListOption
val credentials: GoogleCredentials = GoogleCredentials.getApplicationDefault()
val storage: Storage = StorageOptions.newBuilder().setCredentials(credentials).setProjectId(projectId).build().getService()
val blobs = storage.list(bucketName, BlobListOption.currentDirectory(), BlobListOption.prefix(path))
My build.sbt looks like this:
version := "0.1"
scalaVersion := "2.11.8"
logBuffered in Test := false
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided",
  "org.scalatest" %% "scalatest" % "3.0.0" % Test,
  "com.typesafe" % "config" % "1.3.1",
  "org.scalaj" %% "scalaj-http" % "2.4.0",
  "com.google.cloud" % "google-cloud-storage" % "1.78.0"
)
Please help me.
This is happening because Spark uses an older version of the Guava library than google-cloud-storage does, and that older version doesn't have the Preconditions.checkArgument method. This leads to the java.lang.NoSuchMethodError exception.
You can find a more detailed answer and instructions on how to fix this issue here.
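If you are building a fat jar with sbt-assembly, one common workaround (a sketch only, not necessarily what the linked answer describes) is to shade the newer Guava pulled in by google-cloud-storage so it can no longer collide with the older Guava on Spark's classpath. The relocated package name below is arbitrary:
// build.sbt, assuming the sbt-assembly plugin is already on the build
assembly / assemblyShadeRules := Seq(
  // rewrite Guava classes and every reference to them inside the fat jar
  ShadeRule.rename("com.google.common.**" -> "repackaged.com.google.common.@1").inAll
)
Your code keeps calling the ordinary Guava API at compile time, while the bytecode bundled in the assembled jar is rewritten to the relocated package, so Spark's own Guava no longer shadows it at runtime.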

Spark - Error “A master URL must be set in your configuration” using Intellij IDEA

I am getting an error when I try to run a Spark streaming application from IntelliJ IDEA.
Environment:
Spark core version 2.2.0
IntelliJ IDEA version 2017.3.5
Additional info:
Spark is running in YARN mode.
The error:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Exception in thread "main" java.lang.ExceptionInInitializerError
at kafka_stream.kafka_stream.main(kafka_stream.scala)
Caused by: org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:376)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at kafka_stream.InitSpark$class.$init$(InitSpark.scala:15)
at kafka_stream.kafka_stream$.<init>(kafka_stream.scala:6)
at kafka_stream.kafka_stream$.<clinit>(kafka_stream.scala)
... 1 more
Process finished with exit code 1
I tried this:
val spark: SparkSession = SparkSession.builder()
  .appName("SparkStructStream")
  .master("spark://127.0.0.1:7077")
  //.master("local[*]")
  .getOrCreate()
but I'm still getting the same master URL error.
Content of my build.sbt file:
name := "KafkaSpark"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.11" % "2.2.0",
  "org.apache.spark" % "spark-sql_2.11" % "2.2.0",
  "org.apache.spark" % "spark-streaming_2.11" % "2.2.0",
  "org.apache.spark" % "spark-streaming-kafka_2.11" % "1.6.3"
)
// https://mvnrepository.com/artifact/org.apache.kafka/kafka_2.11
libraryDependencies += "org.apache.kafka" % "kafka_2.11" % "0.11.0.0"
// https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.11.0.0"
// https://mvnrepository.com/artifact/org.apache.kafka/kafka-streams
libraryDependencies += "org.apache.kafka" % "kafka-streams" % "0.11.0.0"
// https://mvnrepository.com/artifact/org.apache.kafka/connect-api
libraryDependencies += "org.apache.kafka" % "connect-api" % "0.11.0.0"
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
resolvers += Resolver.mavenLocal
resolvers += "central maven" at "https://repo1.maven.org/maven2/"
Any help on this would be much appreciated.
It looks like the parameter is not being passed somehow, e.g. Spark is initialized somewhere earlier. Nevertheless, you can try the VM option -Dspark.master=local[*], which passes the parameter to all places where it is not defined, so it should solve your problem. In IntelliJ it's in the run configuration list -> Edit Configurations... -> VM Options.
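A minimal sketch of what that looks like in code: with -Dspark.master=local[*] set under VM Options, SparkConf picks the master up from the JVM system properties, so the builder needs no hard-coded .master(...) call (the object name here is illustrative, not the asker's):
import org.apache.spark.sql.SparkSession

object MasterFromVmOption {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkStructStream")
      .getOrCreate()                 // master comes from -Dspark.master=local[*]

    println(s"running against master: ${spark.sparkContext.master}")
    spark.stop()
  }
}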
Download winutils.exe and place it at C:\hadoop\bin\winutils.exe.
Include the line below under your main method definition:
System.setProperty("hadoop.home.dir", "C:\\hadoop")
and it works well.

Build sbt for spark with janusgraph and gremlin scala

I was trying to set up an IntelliJ build for Spark with JanusGraph using gremlin-scala, but I am running into errors.
My build.sbt file is:
version := "1.0"
scalaVersion := "2.11.11"
libraryDependencies += "com.michaelpollmeier" % "gremlin-scala" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-mllib
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.2.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-hive
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.2.1"
// https://mvnrepository.com/artifact/org.janusgraph/janusgraph-core
libraryDependencies += "org.janusgraph" % "janusgraph-core" % "0.2.0"
libraryDependencies ++= Seq(
  "ch.qos.logback" % "logback-classic" % "1.2.3" % Test,
  "org.scalatest" %% "scalatest" % "3.0.3" % Test
)
resolvers ++= Seq(
  Resolver.mavenLocal,
  "Sonatype OSS" at "https://oss.sonatype.org/content/repositories/public"
)
But I am getting errors when I try to compile code that uses the gremlin-scala library or io.Source. Can someone share their build file or tell me what I should modify to fix it?
Thanks in advance.
So, I was trying to compile this code:
import gremlin.scala._
import org.apache.commons.configuration.BaseConfiguration
import org.janusgraph.core.JanusGraphFactory
class Test1() {
  val conf = new BaseConfiguration()
  conf.setProperty("storage.backend", "inmemory")
  val gr = JanusGraphFactory.open(conf)
  val graph = gr.asScala()
  graph.close
}

object Test {
  def main(args: Array[String]) {
    val t = new Test1()
    println("in Main")
  }
}
The errors I get are:
Error:(1, 8) not found: object gremlin
import gremlin.scala._
Error:(10, 18) value asScala is not a member of org.janusgraph.core.JanusGraph
val graph = gr.asScala()
If you go to the Gremlin-Scala GitHub page you'll see that the current version is "3.3.1.1" and that
Typically you just need to add a dependency on "com.michaelpollmeier" %% "gremlin-scala" % "SOME_VERSION" and one for the graph db of your choice to your build.sbt (this readme assumes tinkergraph). The latest version is displayed at the top of this readme in the maven badge.
It is not a surprise that the API has changed when the major version of the library is different. If I change your first dependency to
//libraryDependencies += "com.michaelpollmeier" % "gremlin-scala" % "2.3.0" //old!
libraryDependencies += "com.michaelpollmeier" %% "gremlin-scala" % "3.3.1.1"
then your example code compiles for me.
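Condensed into a minimal build sketch, the dependency pair that works together here is the following (versions taken from the question and answer above; everything else in the build file stays unchanged):
scalaVersion := "2.11.11"

libraryDependencies ++= Seq(
  "com.michaelpollmeier" %% "gremlin-scala"   % "3.3.1.1",
  "org.janusgraph"        % "janusgraph-core" % "0.2.0"
)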

error: object xml is not a member of package com.databricks.spark

I am trying to read an XML file using SBT, but I am facing an issue when I compile it.
build.sbt
name:= "First Spark"
version:= "1.0"
organization := "in.goai"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
libraryDependencies += "com.databricks" % "spark-avro_2.10" % "2.0.1"
libraryDependencies += "org.scala-lang.modules" %% "scala-xml" % "1.0.2"
resolvers += Resolver.mavenLocal
.scala file
package in.goai.spark
import scala.xml._
import com.databricks.spark.xml
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkContext, SparkConf}
object SparkMeApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("First Spark")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val fileName = args(0)
    val df = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "book").load("fileName")
    val selectedData = df.select("title", "price")
    val d = selectedData.show
    println(s"$d")
  }
}
When I compile it with "sbt package" it shows the error below:
[error] /home/hadoop/dev/first/src/main/scala/SparkMeApp.scala:4: object xml is not a member of package com.databricks.spark
[error] import com.databricks.spark.xml
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 9 s, completed Sep 22, 2017 4:11:19 PM
Do I need to add any other jar files related to XML? Please suggest a fix, and please point me to any link with information about the jars needed for different file formats.
Because you're using Scala 2.11 and Spark 2.0, in build.sbt, change your dependencies to the following:
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
libraryDependencies += "com.databricks" %% "spark-avro" % "3.2.0"
libraryDependencies += "com.databricks" %% "spark-xml" % "0.4.1"
libraryDependencies += "org.scala-lang.modules" %% "scala-xml" % "1.0.6"
Change the spark-avro version to 3.2.0: https://github.com/databricks/spark-avro#requirements
Add "com.databricks" %% "spark-xml" % "0.4.1": https://github.com/databricks/spark-xml#scala-211
Change the scala-xml version to 1.0.6, the current version for Scala 2.11: http://mvnrepository.com/artifact/org.scala-lang.modules/scala-xml_2.11
In your code, delete the following import statement:
import com.databricks.spark.xml
Note that your code doesn't actually use the spark-avro or scala-xml libraries. Remove those dependencies from your build.sbt (and the import scala.xml._ statement from your code) if you're not going to use them.
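Putting those changes together, a cleaned-up sketch of the reader might look like the following. Note that it also passes the fileName variable to load rather than the string literal "fileName", which looks like a separate bug in the original code:
package in.goai.spark

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object SparkMeApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("First Spark")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val fileName = args(0)
    val df = sqlContext.read
      .format("com.databricks.spark.xml")   // provided by the spark-xml dependency
      .option("rowTag", "book")
      .load(fileName)                        // the variable, not the literal "fileName"

    df.select("title", "price").show()
  }
}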