Spark Scala: "cannot resolve symbol saveAsTextFile (reduceByKey)" in IntelliJ IDEA

I suspect some dependencies are missing from my build.sbt file.
I've added library dependencies to build.sbt, but I still get the error mentioned in the title of this question. I searched Google for a solution but couldn't find one.
My Spark Scala source code (filterEventId100.scala):
package com.projects.setTopBoxDataAnalysis
import java.lang.System._
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.sql.SparkSession
object filterEventId100 extends App {
if (args.length < 2) {
println("Usage: JavaWordCount <Input-File> <Output-file>")
exit(1)
}
val spark = SparkSession
.builder
.appName("FilterEvent100")
.getOrCreate()
val data = spark.read.textFile(args(0)).rdd
val result = data.flatMap{line: String => line.split("\n")}
.map{serverData =>
val serverDataArray = serverData.replace("^", "::")split("::")
val evenId = serverDataArray(2)
if (evenId.equals("100")) {
val serverId = serverDataArray(0)
val timestempTo = serverDataArray(3)
val timestempFrom = serverDataArray(6)
val server = new Servers(serverId, timestempFrom, timestempTo)
val res = (serverId, server.dateDiff(server.timestampFrom, server.timestampTo))
res
}
}.reduceByKey{
case(x: Long, y: Long) => if ((x, y) != null) {
if (x > y) x else y
}
}
result.saveAsTextFile(args(1))
spark.stop
}
class Servers(val serverId: String, val timestampFrom: String, val timestampTo: String) {
val DATE_FORMAT = "yyyy-MM-dd hh:mm:ss.SSS"
private def convertStringToDate(s: String): Date = {
val dateFormat = new SimpleDateFormat(DATE_FORMAT)
dateFormat.parse(s)
}
private def convertDateStringToLong(dateAsString: String): Long = {
convertStringToDate(dateAsString).getTime
}
def dateDiff(tFrom: String, tTo: String): Long = {
val dDiff = convertDateStringToLong(tTo) - tFrom.toLong
dDiff
}
}
My build.sbt file:
name := "SetTopProject"
version := "0.1"
scalaVersion := "2.12.8"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.4.3" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
"org.apache.spark" %% "spark-sql_2.12" % "2.4.3" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
"org.apache.hadoop" %% "hadoop-common" % "3.2.0" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
"org.apache.spark" %% "spark-sql_2.12" % "2.4.3" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
"org.apache.spark" %% "spark-hive_2.12" % "2.4.3" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
"org.apache.spark" %% "spark-yarn_2.12" % "2.4.3" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy")
)
I expected everything to work because
val spark = SparkSession
.builder
.appName("FilterEvent100")
.getOrCreate()
compiles without any errors, and I use the spark value to define the data value:
val data = spark.read.textFile(args(0)).rdd
on which the reduceByKey and saveAsTextFile functions are then called:
val result = data.flatMap{line: String => line.split("\n")}...
}.reduceByKey {case(x: Long, y: Long) => if ((x, y) != null) {
if (x > y) x else y
}
result.saveAsTextFile(args(1))
What should I do to remove the compiler errors for the saveAsTextFile and reduceByKey calls?

Replace
val spark = SparkSession
  .builder
  .appName("FilterEvent100")
  .getOrCreate()
val data = spark.read.textFile(args(0)).rdd
with
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("FilterEvent100")
val sc = new SparkContext(conf)
val spark = SparkSession.builder.config(sc.getConf).getOrCreate()
val data = sc.textFile(args(0))
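Independently of how the context is created, the element type produced by map may also matter here (this observation is not part of the answer above). In Scala an if without an else has a result type that also covers the missing branch, so the question's map yields elements typed as Any rather than (String, Long), and reduceByKey is only offered on RDDs of key/value pairs. Below is a minimal sketch of the pair-typed shape, using plain Scala collections instead of Spark and a simplified, hypothetical record layout (serverId, filler, eventId, value); the object and method names are made up for illustration.

```scala
object EventFilterSketch {
  // Parse one record; only event 100 yields a (serverId, value) pair.
  // The field layout is a simplified stand-in for the question's data.
  def parse(line: String): Option[(String, Long)] = {
    val fields = line.replace("^", "::").split("::")
    if (fields.length > 3 && fields(2) == "100") Some((fields(0), fields(3).toLong))
    else None
  }

  // Keep the largest value per key, which is what the question's reduceByKey intends.
  def maxPerKey(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(line => parse(line)) // records that are not event 100 are dropped here
      .groupBy(_._1)
      .map { case (id, pairs) => id -> pairs.map(_._2).max }
}
```

On an RDD the analogous calls would be flatMap with the same parse function followed by reduceByKey((x, y) => math.max(x, y)); that variant is not exercised here.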

Related

Scala Spark Encoders.product[X] (where X is a case class) keeps giving me "No TypeTag available for X" error

I am working in IntelliJ IDEA, in a Scala worksheet. I want to create an encoder for a Scala case class. Various posts on the internet suggest using Encoders.product, but it has never worked for me.
The following code
import org.apache.spark.sql.*
val spark: SparkSession =
SparkSession
.builder()
.appName("test")
.master("local")
.getOrCreate()
import scala3encoders.given
case class classa(i: Int, j: Int, s: String)
val enc = Encoders.product[classa]
keeps throwing this error:
-- Error: ----------------------------------------------------------------------
1 |val enc = Encoders.product[classa]
| ^
| No TypeTag available for classa
1 error found
Does anyone know what's going on there?
The content of build.sbt file is:
scalaVersion := "3.1.3"
scalacOptions ++= Seq("-language:implicitConversions", "-deprecation")
libraryDependencies ++= Seq(
excludes(("org.apache.spark" %% "spark-core" % "3.2.0").cross(CrossVersion.for3Use2_13)),
excludes(("org.apache.spark" %% "spark-sql" % "3.2.0").cross(CrossVersion.for3Use2_13)),
excludes("io.github.vincenzobaz" %% "spark-scala3" % "0.1.3"),
"org.scalameta" %% "munit" % "0.7.26" % Test
)
//netty-all replaces all these excludes
def excludes(m: ModuleID): ModuleID =
m.exclude("io.netty", "netty-common").
exclude("io.netty", "netty-handler").
exclude("io.netty", "netty-transport").
exclude("io.netty", "netty-buffer").
exclude("io.netty", "netty-codec").
exclude("io.netty", "netty-resolver").
exclude("io.netty", "netty-transport-native-epoll").
exclude("io.netty", "netty-transport-native-unix-common").
exclude("javax.xml.bind", "jaxb-api").
exclude("jakarta.xml.bind", "jaxb-api").
exclude("javax.activation", "activation").
exclude("jakarta.annotation", "jakarta.annotation-api").
exclude("javax.annotation", "javax.annotation-api")
// Without forking, ctrl-c doesn't actually fully stop Spark
run / fork := true
Test / fork := true
Encoders.product[classa] is a Scala 2 thing. This method accepts an implicit TypeTag. There are no TypeTags in Scala 3. In Scala 3 the library maintainers propose to work in the following way:
https://github.com/vincenzobaz/spark-scala3/blob/main/examples/src/main/scala/sql/StarWars.scala
package sql
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.{Dataset, DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
object StarWars extends App:
val spark = SparkSession.builder().master("local").getOrCreate
import spark.implicits.localSeqToDatasetHolder
import scala3encoders.given
extension [T: Encoder] (seq: Seq[T])
def toDS: Dataset[T] =
localSeqToDatasetHolder(seq).toDS
case class Friends(name: String, friends: String)
val df: Dataset[Friends] = Seq(
("Yoda", "Obi-Wan Kenobi"),
("Anakin Skywalker", "Sheev Palpatine"),
("Luke Skywalker", "Han Solo, Leia Skywalker"),
("Leia Skywalker", "Obi-Wan Kenobi"),
("Sheev Palpatine", "Anakin Skywalker"),
("Han Solo", "Leia Skywalker, Luke Skywalker, Obi-Wan Kenobi, Chewbacca"),
("Obi-Wan Kenobi", "Yoda, Qui-Gon Jinn"),
("R2-D2", "C-3PO"),
("C-3PO", "R2-D2"),
("Darth Maul", "Sheev Palpatine"),
("Chewbacca", "Han Solo"),
("Lando Calrissian", "Han Solo"),
("Jabba", "Boba Fett")
).toDS.map((n,f) => Friends(n, f))
val friends = df.as[Friends]
friends.show()
case class FriendsMissing(who: String, friends: Option[String])
val dsMissing: Dataset[FriendsMissing] = Seq(
("Yoda", Some("Obi-Wan Kenobi")),
("Anakin Skywalker", Some("Sheev Palpatine")),
("Luke Skywalker", Option.empty[String]),
("Leia Skywalker", Some("Obi-Wan Kenobi")),
("Sheev Palpatine", Some("Anakin Skywalker")),
("Han Solo", Some("Leia Skywalker, Luke Skywalker, Obi-Wan Kenobi"))
).toDS
.map((a, b) => FriendsMissing(a, b))
dsMissing.show()
case class Character(
name: String,
height: Int,
weight: Option[Int],
eyecolor: Option[String],
haircolor: Option[String],
jedi: String,
species: String
)
val characters: Dataset[Character] = spark.sqlContext
.read
.option("header", "true")
.option("delimiter", ";")
.option("inferSchema", "true")
.csv("StarWars.csv")
.as[Character]
characters.show()
val sw_df = characters.join(friends, Seq("name"))
sw_df.show()
case class SW(
name: String,
height: Int,
weight: Option[Int],
eyecolor: Option[String],
haircolor: Option[String],
jedi: String,
species: String,
friends: String
)
val sw_ds = sw_df.as[SW]
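Applied to the case class from the question, the same approach might look like the sketch below. It assumes the question's build.sbt (Spark 3.2.0 via for3Use2_13 plus spark-scala3 0.1.3) and relies only on the given instances that the example above imports; the name encoderSketch is made up for illustration.

```scala
// Scala 3
import org.apache.spark.sql.Encoder
import scala3encoders.given

// Declared at the top level of the file in this sketch
case class classa(i: Int, j: Int, s: String)

@main def encoderSketch(): Unit =
  // The encoder is resolved from the imported givens; no TypeTag is requested
  val enc: Encoder[classa] = summon[Encoder[classa]]
  println(enc.schema)
```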
So if you really need Encoders.product[classa], compile that part of your code with Scala 2:
src/main/scala/App.scala
// this is Scala 3
object App {
def main(args: Array[String]): Unit = {
println(App1.schema)
// Seq(StructField(i,IntegerType,false), StructField(j,IntegerType,false), StructField(s,StringType,true))
}
}
scala2/src/main/scala/App1.scala
import org.apache.spark.sql._
// this is Scala 2
object App1 {
val schema = Encoders.product[classa].schema
}
common/src/main/scala/classa.scala
case class classa(i: Int, j: Int, s: String)
build.sbt
lazy val sparkCore = "org.apache.spark" %% "spark-core" % "3.2.0"
lazy val sparkSql = "org.apache.spark" %% "spark-sql" % "3.2.0"
lazy val scala3V = "3.1.3"
lazy val scala2V = "2.13.8"
lazy val root = project
.in(file("."))
.settings(
scalaVersion := scala3V,
scalacOptions ++= Seq("-language:implicitConversions", "-deprecation"),
libraryDependencies ++= Seq(
excludes(sparkCore.cross(CrossVersion.for3Use2_13)),
excludes(sparkSql.cross(CrossVersion.for3Use2_13)),
excludes("io.github.vincenzobaz" %% "spark-scala3" % "0.1.3"),
"org.scalameta" %% "munit" % "0.7.26" % Test
)
)
.dependsOn(scala2, common)
lazy val scala2 = project
.settings(
scalaVersion := scala2V,
libraryDependencies ++= Seq(
sparkCore,
sparkSql
)
)
.dependsOn(common)
lazy val common = project
.settings(
scalaVersion := scala3V,
crossScalaVersions := Seq(scala2V, scala3V)
)
//netty-all replaces all these excludes
def excludes(m: ModuleID): ModuleID =
m.exclude("io.netty", "netty-common").
exclude("io.netty", "netty-handler").
exclude("io.netty", "netty-transport").
exclude("io.netty", "netty-buffer").
exclude("io.netty", "netty-codec").
exclude("io.netty", "netty-resolver").
exclude("io.netty", "netty-transport-native-epoll").
exclude("io.netty", "netty-transport-native-unix-common").
exclude("javax.xml.bind", "jaxb-api").
exclude("jakarta.xml.bind", "jaxb-api").
exclude("javax.activation", "activation").
exclude("jakarta.annotation", "jakarta.annotation-api").
exclude("javax.annotation", "javax.annotation-api")
// Without forking, ctrl-c doesn't actually fully stop Spark
run / fork := true
Test / fork := true
Alternatively, in Scala 3 you can compute a TypeTag with Scala 2 runtime compilation (the reflective Toolbox); see "How to compile and execute scala code at run-time in Scala3?".
Scala 2 macros don't work in Scala 3, so we can't use runtime.currentMirror or q"...", but we can use universe.runtimeMirror and tb.parse. It turns out this still works in Scala 3.
// this is Scala 3
import org.apache.spark.sql.*
import scala.tools.reflect.ToolBox
import scala.reflect.runtime.universe
import scala.reflect.runtime.universe.*
import mypackage.classa
val rm = universe.runtimeMirror(getClass.getClassLoader)
val tb = rm.mkToolBox()
val typeTag = tb.eval(tb.parse(
"scala.reflect.runtime.universe.typeTag[mypackage.classa]"
)).asInstanceOf[TypeTag[classa]]
Encoders.product[classa](typeTag).schema
// Seq(StructField(i,IntegerType,false), StructField(j,IntegerType,false), StructField(s,StringType,true))
build.sbt
lazy val sparkCore = "org.apache.spark" %% "spark-core" % "3.2.0"
lazy val sparkSql = "org.apache.spark" %% "spark-sql" % "3.2.0"
lazy val scala3V = "3.1.3"
lazy val scala2V = "2.13.8"
lazy val root = project
.in(file("."))
.settings(
scalaVersion := scala3V,
scalacOptions ++= Seq(
"-language:implicitConversions",
"-deprecation"
),
libraryDependencies ++= Seq(
excludes(sparkCore.cross(CrossVersion.for3Use2_13)),
excludes(sparkSql.cross(CrossVersion.for3Use2_13)),
excludes("io.github.vincenzobaz" %% "spark-scala3" % "0.1.3"),
"org.scalameta" %% "munit" % "0.7.26" % Test,
scalaOrganization.value % "scala-reflect" % scala2V,
scalaOrganization.value % "scala-compiler" % scala2V,
),
)
def excludes(m: ModuleID): ModuleID =
m.exclude("io.netty", "netty-common").
exclude("io.netty", "netty-handler").
exclude("io.netty", "netty-transport").
exclude("io.netty", "netty-buffer").
exclude("io.netty", "netty-codec").
exclude("io.netty", "netty-resolver").
exclude("io.netty", "netty-transport-native-epoll").
exclude("io.netty", "netty-transport-native-unix-common").
exclude("javax.xml.bind", "jaxb-api").
exclude("jakarta.xml.bind", "jaxb-api").
exclude("javax.activation", "activation").
exclude("jakarta.annotation", "jakarta.annotation-api").
exclude("javax.annotation", "javax.annotation-api")
// Without forking, ctrl-c doesn't actually fully stop Spark
run / fork := true
Test / fork := true
One more option is to create a type tag manually:
import scala.reflect.runtime.universe.*
import org.apache.spark.sql.*
val rm = runtimeMirror(getClass.getClassLoader)
val tpe: Type = internal.typeRef(internal.typeRef(NoType, rm.staticPackage("mypackage"), Nil), rm.staticClass("mypackage.classa"), Nil)
val ttg: TypeTag[_] = createTypeTag(rm, tpe)
Encoders.product[classa](ttg.asInstanceOf[TypeTag[classa]]).schema
// Seq(StructField(i,IntegerType,false), StructField(j,IntegerType,false), StructField(s,StringType,true))
package mypackage
case class classa(i: Int, j: Int, s: String)
import scala.reflect.api
inline def createTypeTag(mirror: api.Mirror[_ <: api.Universe with Singleton], tpe: mirror.universe.Type): mirror.universe.TypeTag[_] = {
mirror.universe.TypeTag.apply(mirror.asInstanceOf[api.Mirror[mirror.universe.type]],
new api.TypeCreator {
override def apply[U <: api.Universe with Singleton](m: api.Mirror[U]): m.universe.Type = {
tpe.asInstanceOf[m.universe.Type]
}
}
)
}
scalaVersion := "3.1.3"
libraryDependencies ++= Seq(
scalaOrganization.value % "scala-reflect" % "2.13.8",
"org.apache.spark" %% "spark-core" % "3.2.0" cross CrossVersion.for3Use2_13 exclude("org.scala-lang.modules", "scala-xml_2.13"),
"org.apache.spark" %% "spark-sql" % "3.2.0" cross CrossVersion.for3Use2_13 exclude("org.scala-lang.modules", "scala-xml_2.13"),
)
The inline modifier is there to make the type tag serializable/deserializable. See also:
In scala 2.12, why none of the TypeTag created in runtime is serializable?
How to create a TypeTag manually? (answer)
Get a TypeTag from a Type?
In Scala, how to create a TypeTag from a type that is serializable?
What causes ClassCastException when serializing TypeTags?
Get TypeTag[A] from Class[A]

SharedSparkSession is not working in Spark MemoryStream scala testing

I am trying to write a unit test for Spark MemoryStream, but SharedSparkSession cannot be imported in my test program.
import org.apache.spark.sql.test.SharedSparkSession
class MemoryStreamTest extends AnyFunSuite with SharedSparkSession {
....
}
My build.sbt configuration is below:
scalaVersion := "2.12.0"
val sparkVersion = "3.0.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-streaming" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
libraryDependencies += "org.scalatest" %% "scalatest" % "3.2.5" % "test"
libraryDependencies += "com.novocode" % "junit-interface" % "0.11" % "test"
Do I need to add any other dependency artifacts, or change the scalatest version?
The program below fails on the import of SharedSparkSession.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.{LongOffset, MemoryStream}
import org.scalatest.funsuite.AnyFunSuite
import org.apache.spark.sql.test.SharedSparkSession
class MemoryStreamTest extends AnyFunSuite with SharedSparkSession {
  test("spark structured streaming can read from memory socket") {
    // We can import sql implicits
    implicit val sqlCtx = sparkSession.sqlContext
    import sqlImplicits._
    val events = MemoryStream[String]
    val queryName: String = "calleventaggs"
    // Add events to MemoryStream as if they came from Kafka
    val batch = Seq(
      "this is a value to read",
      "and this is another value"
    )
    val currentOffset = events.addData(batch)
    val streamingQuery = StreamingDataFrames.writeData(events.toDF(), "memory", queryName)
    streamingQuery.processAllAvailable()
    events.commit(currentOffset.asInstanceOf[LongOffset])
    val result: DataFrame = sparkSession.table(queryName)
    result.show
    streamingQuery.awaitTermination(1000L)
    assertResult(batch.size)(result.count)
    val values = result.take(2)
    assertResult(batch(0))(values(0).getString(0))
    assertResult(batch(1))(values(1).getString(0))
  }
}
SharedSparkSession is an internal test utility of the Apache Spark project and is not accessible through the packages listed in your sbt file.
The ScalaDocs do not mention SharedSparkSession.
In the Spark source, the trait SharedSparkSession extends SQLTestUtils, which is another testing utility.
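If the trait itself is needed, one commonly used route is to depend on Spark's test-jars, which are published under the tests classifier. The fragment below is a sketch under that assumption and has not been checked against this project; the regular spark-core and spark-sql dependencies stay in place.

```scala
// build.sbt fragment (sketch): sparkVersion is the val already defined in the question's build.sbt
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"     % sparkVersion % Test classifier "tests",
  "org.apache.spark" %% "spark-sql"      % sparkVersion % Test classifier "tests",
  "org.apache.spark" %% "spark-catalyst" % sparkVersion % Test classifier "tests"
)
```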
For unit tests it is usually sufficient to create a local SparkSession yourself.
See the working code below:
import module.JsValueToString
import org.apache.log4j.{Level, Logger}
import org.scalatest.funsuite.AnyFunSuite
import org.apache.spark.sql.functions.{col, concat, current_timestamp, date_format, from_json, from_unixtime, from_utc_timestamp, lit, regexp_replace, sha2, struct, to_json, to_utc_timestamp, udf}
import org.apache.spark.sql.{DataFrame, SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.{LongOffset, MemoryStream}
import org.scalatest.BeforeAndAfterAll
import scala.io.Source
class KafkaProducerFlattenerTestCase extends AnyFunSuite with BeforeAndAfterAll {
Logger.getLogger("org").setLevel(Level.ERROR)
@transient var spark: SparkSession = _
override def beforeAll(): Unit = {
spark = SparkSession
.builder()
.appName("KafkaProducerFlattenerTestCase")
.master("local[*]")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
}
override def afterAll(): Unit = {
spark.stop()
}
test("MemoryStream testcase for Flattener JSON") {
implicit val sparkSesion: SparkSession = spark
implicit val ctx = spark.sqlContext
import sparkSesion.implicits._
val input = MemoryStream[String]
val valueDf = input.toDS().selectExpr("CAST(value As STRING)")
val df2 = (valueDf.select(to_json(struct(col("*"))).alias("content")))
df2.printSchema()
print(" Before Write Stream")
val formatName = ("memory")
val query = df2.writeStream
.queryName("testCustomSinkBasic")
.format(formatName)
.start()
val jasonContent = readJson()
input.addData(jasonContent)
assert(query.isActive === true)
query.processAllAvailable()
assert(query.exception === None)
print("query....... "+query.runId)
val eventName = spark.sql("select * from testCustomSinkBasic")
val actualValString = JsValueToString(eventName)
println("actualValString..... "+actualValString)
assert(actualValString === expectValue())
}
def readJson(): String ={
val fileContents = Source.fromFile("src/resources/Json.txt").getLines().mkString
fileContents
}
def expectValue(): String = {
val expectVal = """{"publishTime":"123","name[0].lastname":"def","name[0].fname":"abc","name[1].lastname":"jkl","name[1].fname":"ghi","lpid":"1234"}"""
expectVal
}
}
The class covered by the test:
import com.usb.transformation.JsFlattener
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, struct, to_json}
import play.api.libs.json.{JsObject, JsValue, Json}
object JsValueToString extends Serializable{
var df3 : String = null
def apply(eventName : DataFrame): (String) ={
eventName.foreach(x => {
val content = x.getAs[String]("content").replace("\\", "")
val subStr = content.substring(10, content.length()-2)
println("content ...."+content)
println("subString "+subStr)
val str2Json: JsValue = Json.parse(subStr)
df3 = JsFlattener(str2Json).as[JsObject].toString
println("df3 Value......"+df3)
})
df3
}
}

Trying to read file from s3 with FLINK using the IDE getting Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

I am trying to read a file from s3 using Flink from IntelliJ and getting the following exception:
Caused by: java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.S3AFileSystem not found
This is what my code looks like:
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.column.page.PageReadStore
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import org.apache.parquet.io.ColumnIOFactory
class ParquetSourceFunction extends SourceFunction[String]{
override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
val inputPath = "s3a://path-to-bucket/"
val outputPath = "s3a://path-to-output-bucket/"
val conf = new Configuration()
conf.set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
val readFooter = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(inputPath), conf))
val metadata = readFooter.getFileMetaData
val schema = metadata.getSchema
val parquetFileReader = new ParquetFileReader(conf, metadata, new Path(inputPath), readFooter.getRowGroups, schema.getColumns)
// val parquetFileReader2 = new ParquetFileReader(new Path(inputPath), ParquetReadOptions)
var pages: PageReadStore = null
try {
while ({ pages = parquetFileReader.readNextRowGroup; pages != null }) {
val rows = pages.getRowCount
val columnIO = new ColumnIOFactory().getColumnIO(schema)
val recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema))
(0L until rows).foreach { _ =>
val group = recordReader.read()
val myString = group.getString("field_name", 0)
ctx.collect(myString)
}
}
}
}
override def cancel(): Unit = ???
}
object Job {
def main(args: Array[String]): Unit = {
// set up the execution environment
lazy val env = StreamExecutionEnvironment.getExecutionEnvironment
lazy val stream = env.addSource(new ParquetSourceFunction)
stream.print()
env.execute()
}
}
sbt dependencies:
val flinkVersion = "1.12.1"
val flinkDependencies = Seq(
"org.apache.flink" %% "flink-clients" % flinkVersion,// % "provided",
"org.apache.flink" %% "flink-scala" % flinkVersion,// % "provided",
"org.apache.flink" %% "flink-streaming-scala" % flinkVersion, // % "provided")
"org.apache.flink" %% "flink-parquet" % flinkVersion)
lazy val root = (project in file(".")).
settings(
libraryDependencies ++= flinkDependencies,
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "3.3.0" ,
libraryDependencies += "org.apache.parquet" % "parquet-hadoop" % "1.11.1",
libraryDependencies += "org.apache.flink" %% "flink-table-planner-blink" % "1.12.1" //% "provided"
)
S3 is only supported by adding the respective flink-s3-fs-hadoop jar to your plugins folder, as described in the plugin docs. For a local IDE setup, the root that should contain the plugins directory is the working directory by default. You can override it with the environment variable FLINK_PLUGINS_DIR.
To get flink-s3-fs-hadoop into the plugins directory, I'm guessing some sbt glue is necessary (or you do it once manually). In Gradle, I'd define a plugin scope and copy the jars to the plugins directory in a custom task.
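The manual route can be sketched in a few lines of Scala. This assumes the layout from the Flink plugin docs, where each plugin sits in its own subdirectory of the plugins directory; the object name, the subdirectory name and the jar location are placeholders chosen for illustration.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

object InstallS3Plugin {
  // Copies a plugin jar into <pluginsRoot>/<pluginName>/, creating the directories if needed.
  // Returns the path of the copied jar.
  def install(jar: Path, pluginsRoot: Path, pluginName: String): Path = {
    val targetDir = pluginsRoot.resolve(pluginName)
    Files.createDirectories(targetDir)
    Files.copy(jar, targetDir.resolve(jar.getFileName), StandardCopyOption.REPLACE_EXISTING)
  }
}
```

A call could look like InstallS3Plugin.install(Paths.get("/path/to/flink-s3-fs-hadoop-1.12.1.jar"), Paths.get("plugins"), "s3-fs-hadoop"), run from the working directory the IDE uses for the job (Paths is java.nio.file.Paths).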

java.lang.ClassNotFoundException: org.apache.spark.sql.DataFrame error when running Scala MongoDB connector

I am trying to run a Scala example with SBT to read data from MongoDB. I am getting this error whenever I try to access the data read from Mongo into the RDD.
Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.getDeclaredMethod(Class.java:2128)
at java.io.ObjectStreamClass.getPrivateMethod(ObjectStreamClass.java:1431)
at java.io.ObjectStreamClass.access$1700(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:494)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1134)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
I have imported DataFrame explicitly, even though it is not used in my code. Can anyone help with this issue?
My code:
package stream
import org.apache.spark._
import org.apache.spark.SparkContext._
import com.mongodb.spark._
import com.mongodb.spark.config._
import com.mongodb.spark.rdd.MongoRDD
import org.bson.Document
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.DataFrame
object SpaceWalk {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("SpaceWalk")
.setMaster("local[*]")
.set("spark.mongodb.input.uri", "mongodb://127.0.0.1/nasa.eva")
.set("spark.mongodb.output.uri", "mongodb://127.0.0.1/nasa.astronautTotals")
val sc = new SparkContext(sparkConf)
val rdd = sc.loadFromMongoDB()
def breakoutCrew ( document: Document ): List[(String,Int)] = {
println("INPUT"+document.get( "Duration").asInstanceOf[String])
var minutes = 0;
val timeString = document.get( "Duration").asInstanceOf[String]
if( timeString != null && !timeString.isEmpty ) {
val time = document.get( "Duration").asInstanceOf[String].split( ":" )
minutes = time(0).toInt * 60 + time(1).toInt
}
import scala.util.matching.Regex
val pattern = new Regex("(\\w+\\s\\w+)")
val names = pattern findAllIn document.get( "Crew" ).asInstanceOf[String]
var tuples : List[(String,Int)] = List()
for ( name <- names ) { tuples = tuples :+ (( name, minutes ) ) }
return tuples
}
val logs = rdd.flatMap( breakoutCrew ).reduceByKey( (m1: Int, m2: Int) => ( m1 + m2 ) )
//logs.foreach(println)
def mapToDocument( tuple: (String, Int ) ): Document = {
val doc = new Document();
doc.put( "name", tuple._1 )
doc.put( "minutes", tuple._2 )
return doc
}
val writeConf = WriteConfig(sc)
val writeConfig = WriteConfig(Map("collection" -> "astronautTotals", "writeConcern.w" -> "majority", "db" -> "nasa"), Some(writeConf))
logs.map( mapToDocument ).saveToMongoDB( writeConfig )
import org.apache.spark.sql.SQLContext
import com.mongodb.spark.sql._
import org.apache.spark.sql.DataFrame
// load the first dataframe "EVAs"
val sqlContext = new SQLContext(sc);
import sqlContext.implicits._
val evadf = sqlContext.read.mongo()
evadf.printSchema()
evadf.registerTempTable("evas")
// load the 2nd dataframe "astronautTotals"
val astronautDF = sqlContext.read.option("collection", "astronautTotals").mongo[astronautTotal]()
astronautDF.printSchema()
astronautDF.registerTempTable("astronautTotals")
sqlContext.sql("SELECT astronautTotals.name, astronautTotals.minutes FROM astronautTotals" ).show()
sqlContext.sql("SELECT astronautTotals.name, astronautTotals.minutes, evas.Vehicle, evas.Duration FROM " +
"astronautTotals JOIN evas ON astronautTotals.name LIKE evas.Crew" ).show()
}
}
case class astronautTotal ( name: String, minutes: Integer )
This is my sbt file:
name := "Project"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
//libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.2.1"
libraryDependencies += "org.apache.bahir" %% "spark-streaming-twitter" % "2.0.0"
libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" % "0.1"
addCommandAlias("c1", "run-main stream.SaveTweets")
addCommandAlias("c2", "run-main stream.SpaceWalk")
outputStrategy := Some(StdoutOutput)
//outputStrategy := Some(LoggedOutput(log: Logger))
fork in run := true
This error occurs because mongo-spark-connector 0.1 only supports Spark 1.x and is incompatible with the Spark 2.0.0 you are using. Use mongo-spark-connector 2.0.0 or later instead. See: https://docs.mongodb.com/spark-connector/v2.0/
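In the question's sbt file that would mean changing the connector line roughly as sketched below. Version 2.0.0 is simply the lowest version the answer allows; check the linked documentation for the release that matches your Spark and Scala versions.

```scala
// build.sbt fragment (sketch): replaces the "mongo-spark-connector" % "0.1" line
libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" % "2.0.0"
```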

Analyze Twitter data with Spark

Can anyone help me with how to analyze Twitter data based on keywords that I supply? I found this code, but it gives me an error.
import java.io.File
import com.google.gson.Gson
import org.apache.spark.streaming.twitter.TwitterUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
/**
* Collect at least the specified number of tweets into json text files.
*/
object Collect {
private var numTweetsCollected = 0L
private var partNum = 0
private var gson = new Gson()
def main(args: Array[String]) {
// Process program arguments and set properties
if (args.length < 3) {
System.err.println("Usage: " + this.getClass.getSimpleName +
"<outputDirectory> <numTweetsToCollect> <intervalInSeconds> <partitionsEachInterval>")
System.exit(1)
}
val Array(outputDirectory, Utils.IntParam(numTweetsToCollect), Utils.IntParam(intervalSecs), Utils.IntParam(partitionsEachInterval)) =
Utils.parseCommandLineWithTwitterCredentials(args)
val outputDir = new File(outputDirectory.toString)
if (outputDir.exists()) {
System.err.println("ERROR - %s already exists: delete or specify another directory".format(
outputDirectory))
System.exit(1)
}
outputDir.mkdirs()
println("Initializing Streaming Spark Context...")
val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(intervalSecs))
val tweetStream = TwitterUtils.createStream(ssc, Utils.getAuth)
.map(gson.toJson(_))
tweetStream.foreachRDD((rdd, time) => {
val count = rdd.count()
if (count > 0) {
val outputRDD = rdd.repartition(partitionsEachInterval)
outputRDD.saveAsTextFile(outputDirectory + "/tweets_" + time.milliseconds.toString)
numTweetsCollected += count
if (numTweetsCollected > numTweetsToCollect) {
System.exit(0)
}
}
})
ssc.start()
ssc.awaitTermination()
}
}
The error is
object gson is not a member of package com.google
If you know of any link about this or how to fix the problem, please share it with me, because I want to analyze Twitter data with Spark.
Thanks. :)
As Peter pointed out, you are missing the gson dependency, so you'll need to add the following dependency to your build.sbt:
libraryDependencies += "com.google.code.gson" % "gson" % "2.4"
You can also define all the dependencies in one sequence:
libraryDependencies ++= Seq(
"com.google.code.gson" % "gson" % "2.4",
"org.apache.spark" %% "spark-core" % "1.2.0",
"org.apache.spark" %% "spark-streaming" % "1.2.0",
"org.apache.spark" %% "spark-streaming-twitter" % "1.2.0"
)
Bonus: for other missing dependencies, you can search for them on http://mvnrepository.com/, and if you need to find the jar/dependency that contains a given class, you can also use the findjar website.