I have a basic Spark/Kafka program and I am trying to run the following code:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import java.util.regex.Pattern
import java.util.regex.Matcher
import org.apache.spark.streaming.kafka._
import kafka.serializer.StringDecoder
import Utilities._
object WordCount {
def main(args: Array[String]): Unit = {
val ssc = new StreamingContext("local[*]", "KafkaExample", Seconds(1))
setupLogging()
// Construct a regular expression (regex) to extract fields from raw Apache log lines
val pattern = apacheLogPattern()
// hostname:port for Kafka brokers, not Zookeeper
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
// List of topics you want to listen for from Kafka
val topics = List("testLogs").toSet
// Create our Kafka stream, which will contain (topic,message) pairs. We tack a
// map(_._2) at the end in order to only get the messages, which contain individual
// lines of data.
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topics).map(_._2)
// Extract the request field from each log line
val requests = lines.map(x => {val matcher:Matcher = pattern.matcher(x); if (matcher.matches()) matcher.group(5) else "[error]"})
// Extract the URL from the request
val urls = requests.map(x => {val arr = x.toString().split(" "); if (arr.size == 3) arr(1) else "[error]"})
// Reduce by URL over a 5-minute window sliding every second
val urlCounts = urls.map(x => (x, 1)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(300), Seconds(1))
// Sort and print the results
val sortedResults = urlCounts.transform(rdd => rdd.sortBy(x => x._2, false))
sortedResults.print()
// Kick it off
ssc.checkpoint("/home/")
ssc.start()
ssc.awaitTermination()
}
}
I am using IntelliJ IDEA and created the Scala project with sbt. The build.sbt file is as follows:
name := "Sample"
version := "1.0"
organization := "com.sundogsoftware"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "1.4.1",
"org.apache.spark" %% "spark-streaming-kafka" % "1.4.1",
"org.apache.hadoop" % "hadoop-hdfs" % "2.6.0"
)
However, when I try to build the code, I get the following error:
Error:scalac: missing or invalid dependency detected while loading class file 'StreamingContext.class'.
Could not access type Logging in package org.apache.spark,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with -Ylog-classpath to see the problematic classpath.)
A full rebuild may help if 'StreamingContext.class' was compiled against an incompatible version of org.apache.spark.
Error:scalac: missing or invalid dependency detected while loading class file 'DStream.class'.
Could not access type Logging in package org.apache.spark,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with -Ylog-classpath to see the problematic classpath.)
A full rebuild may help if 'DStream.class' was compiled against an incompatible version of org.apache.spark.
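When you use several Spark libraries together, their versions should always match.
The Kafka version you use also matters, so the connector should be, for example, spark-streaming-kafka-0-10_2.11. Note that the 0-10 connector lives in the org.apache.spark.streaming.kafka010 package, whereas the code in the question imports the older org.apache.spark.streaming.kafka API (with StringDecoder), so either the imports need updating or the spark-streaming-kafka-0-8 artifact should be used instead.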
...
scalaVersion := "2.11.8"
val sparkVersion = "2.2.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
"org.apache.hadoop" % "hadoop-hdfs" % "2.6.0"
)
This is a useful site if you need to check the exact dependencies you should use:
https://search.maven.org/
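To see why the exact artifact name matters, here is a minimal sketch of what sbt's `%%` operator does to a module name under the default binary cross-versioning: it appends the Scala binary version as a suffix. `binaryVersion` and `crossName` are hypothetical helpers written for illustration, not sbt APIs, and the sketch ignores pre-release and Scala 3 version rules.

```scala
object CrossVersionSketch {
  // Hypothetical helper: "2.11.8" -> "2.11" (the Scala binary version).
  def binaryVersion(scalaVersion: String): String =
    scalaVersion.split('.').take(2).mkString(".")

  // Hypothetical helper mimicking what `%%` does to the artifact name.
  def crossName(artifact: String, scalaVersion: String): String =
    s"${artifact}_${binaryVersion(scalaVersion)}"
}
```

With `scalaVersion := "2.11.8"`, `crossName("spark-streaming-kafka-0-10", "2.11.8")` yields `spark-streaming-kafka-0-10_2.11`. Passing a name that already ends in `_2.11` yields a doubled suffix, so with `%%` the name should be written without the suffix (or a single `%` used with the full name).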
I am getting the following error while trying to read data from Kafka. I am using docker-compose to run Kafka and Spark.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
Here is my code for reading:
object Livedata extends App with LazyLogging {
logger.info("starting livedata...")
val spark = SparkSession.builder().appName("livedata").master("local[*]").getOrCreate()
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka:9092")
.option("subscribe", "topic")
.option("startingOffsets", "latest")
.load()
df.printSchema()
val hadoopConfig = spark.sparkContext.hadoopConfiguration
hadoopConfig.set("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
hadoopConfig.set("fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
}
After reading a few answers, I added all the packages to the sbt build.
Here is the build.sbt file:
lazy val root = (project in file(".")).
settings(
inThisBuild(List(
organization := "com.live.data",
version := "0.1.0",
scalaVersion := "2.12.2",
assemblyJarName in assembly := "livedata.jar"
)),
name := "livedata",
libraryDependencies ++= List(
"org.scalatest" %% "scalatest" % "3.0.5",
"com.typesafe.scala-logging" %% "scala-logging" % "3.9.0",
"org.apache.spark" %% "spark-sql" % "2.4.0",
"org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.0" % "provided",
"org.apache.kafka" % "kafka-clients" % "2.5.0",
"org.apache.kafka" % "kafka-streams" % "2.5.0",
"org.apache.kafka" %% "kafka-streams-scala" % "2.5.0"
)
)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
I am not sure what the main issue is here.
Update:
I finally found the solution in the question "Error when connecting spark structured streaming + kafka".
The main issue was the org.apache.spark.sql.AnalysisException: Failed to find data source: kafka exception. It occurs because the spark-sql-kafka library is not on the classpath, and Spark cannot find the org.apache.spark.sql.sources.DataSourceRegister file inside the META-INF/services folder.
The following block needs to be added to build.sbt. It includes the org.apache.spark.sql.sources.DataSourceRegister file in the final jar.
// META-INF discarding
assemblyMergeStrategy in assembly := {
case PathList("META-INF","services",xs @ _*) => MergeStrategy.filterDistinctLines
case PathList("META-INF",xs @ _*) => MergeStrategy.discard
case "application.conf" => MergeStrategy.concat
case _ => MergeStrategy.first
}
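To illustrate why the merge strategy matters, here is a minimal sketch in plain Scala (no sbt-assembly involved) of combining the same META-INF/services file from two jars by keeping each distinct line, which is roughly what `MergeStrategy.filterDistinctLines` does. The file contents are sample data typed in for illustration rather than read from real jars, and `mergeDistinctLines` and `keepFirst` are hypothetical helpers.

```scala
object ServiceFileMergeSketch {
  // Sample contents of the DataSourceRegister service file in two jars
  // (illustrative data, not read from real jars).
  val fromSparkSql: Seq[String] = Seq(
    "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
    "org.apache.spark.sql.execution.datasources.json.JsonFileFormat"
  )
  val fromSparkSqlKafka: Seq[String] = Seq(
    "org.apache.spark.sql.kafka010.KafkaSourceProvider"
  )

  // Keep every non-empty line once, in first-seen order.
  def mergeDistinctLines(files: Seq[Seq[String]]): Seq[String] =
    files.flatten.map(_.trim).filter(_.nonEmpty).distinct

  // Keeping only the first copy of the file drops the Kafka provider entry.
  def keepFirst(files: Seq[Seq[String]]): Seq[String] =
    files.headOption.getOrElse(Seq.empty)
}
```

Merging both files keeps the Kafka provider line, so the "kafka" format can be looked up at runtime. Keeping only the first file, or discarding META-INF entirely, loses it.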
spark-sql-kafka-0-10 is not provided by the Spark runtime, so remove the "provided" scope from that dependency. (spark-sql is provided, though, so you could add the scope to that one instead.)
You also shouldn't pull in Kafka Streams, since Spark does not use it, and kafka-clients is pulled in transitively by spark-sql-kafka, so you don't need that either.
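Put together, the dependency list might look like the following sketch. The versions are copied from the question and are assumptions about the target cluster, not recommendations. Marking spark-sql as "provided" also assumes the job is launched with spark-submit, which supplies those classes at runtime.

```scala
// build.sbt fragment (a sketch; versions copied from the question)
libraryDependencies ++= List(
  "org.scalatest" %% "scalatest" % "3.0.5",
  "com.typesafe.scala-logging" %% "scala-logging" % "3.9.0",
  // Supplied by the Spark runtime, so it can be marked "provided".
  "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided",
  // Not part of the Spark distribution, so it must be bundled in the assembly jar.
  // It also brings in kafka-clients transitively.
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.0"
)
```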
I am new to Spark and I am trying this example:
import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.{Vectors,Vector}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}
object App {
def main(args: Array[String]) {
if (args.length != 5) {
System.err.println(
"Usage: StreamingKMeansExample " +
"<trainingDir> <testDir> <batchDuration> <numClusters> <numDimensions>")
System.exit(1)
}
// $example on$
val conf = new SparkConf().setAppName("StreamingKMeansExample")
val ssc = new StreamingContext(conf, Seconds(args(2).toLong))
val trainingData = ssc.textFileStream(args(0)).map(Vectors.parse)
val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)
val model = new StreamingKMeans()
.setK(args(3).toInt)
.setDecayFactor(1.0)
.setRandomCenters(args(4).toInt, 0.0)
model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
ssc.start()
ssc.awaitTermination()
// $example off$
}
}
However, LabeledPoint.parse cannot be resolved; only the apply and unapply methods are available, not parse.
It is probably the version I am using, so here is my build.sbt:
name := "myApp"
version := "0.1"
scalaVersion := "2.11.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "2.2.0",
"org.apache.spark" %% "spark-mllib" % "2.3.1"
)
EDIT: Since nothing else worked, I wrote a custom LabeledPoint class, which solved the compile problem. However, when I run it, the predicted values are always zero.
The training input file is:
[36.72, 67.44]
[92.20, 11.81]
[90.85, 48.07]
.....
and the test txt is
(2, [9.26,68.19])
(1, [3.27,9.14])
(9, [66.66,13.85])
....
So why are the results 2,0 1,0 9,0? Is there a problem with LabeledPoint?
Need some help, please.
I am using IntelliJ with SBT to build my apps.
I'm working on an app to read a Kafka topic in Spark Streaming in order to do some ETL work on it. Unfortunately, I can't read from Kafka.
The KafkaUtils.createDirectStream isn't resolving and keeps giving me errors (CANNOT RESOLVE SYMBOL). I have done my research and it appears I have the correct dependencies.
Here is my build.sbt:
name := "ASUIStreaming"
version := "0.1"
scalacOptions += "-target:jvm-1.8"
scalaVersion := "2.11.11"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-8_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0"
libraryDependencies += "org.apache.kafka" %% "kafka-clients" % "0.8.2.1"
libraryDependencies += "org.scala-lang.modules" %% "scala-parser-combinators" % "1.0.4"
Any suggestions? I should also mention I don't have admin access on the laptop since this is a work computer, and I am using a portable JDK and IntelliJ installation. However, my colleagues at work are in the same situation and it works fine for them.
Thanks in advance!
Here is the main Spark Streaming code snippet I'm using.
Note: I've masked some of the confidential work data such as IP and Topic name etc.
import org.apache.kafka.clients.consumer.ConsumerRecord
import kafka.serializer.StringDecoder
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark
import org.apache.kafka.clients.consumer._
import org.apache.kafka.common.serialization.StringDeserializer
import scala.util.parsing.json._
import org.apache.spark.streaming.kafka._
object ASUISpeedKafka extends App
{
// Create a new Spark Context
val conf = new SparkConf().setAppName("ASUISpeedKafka").setMaster("local[*]")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(2))
//Identify the Kafka Topic and provide the parameters and Topic details
val kafkaTopic = "TOPIC1"
val topicsSet = kafkaTopic.split(",").toSet
val kafkaParams = Map[String, String]
(
"metadata.broker.list" -> "IP1:PORT, IP2:PORT2",
"auto.offset.reset" -> "smallest"
)
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]
(
ssc, kafkaParams, topicsSet
)
}
I was able to resolve the issue. After re-creating the project and adding all the dependencies again, I found that in IntelliJ certain code has to be on the same line, otherwise it won't compile.
In this case, putting the val kafkaParams definition on the same line (instead of splitting it into a block) solved the issue!
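A minimal sketch of the working layout, using the placeholder broker addresses from the question. The likely cause is Scala's semicolon inference rather than anything specific to IntelliJ: a line break before `(` ends the statement, so an argument list that starts on its own line is not attached to `Map[String, String]`. The object name `SameLineArgs` is hypothetical.

```scala
object SameLineArgs {
  // The opening parenthesis stays on the same line as `Map[String, String]`,
  // so the pairs are parsed as the argument list of Map.apply.
  val kafkaParams: Map[String, String] = Map[String, String](
    "metadata.broker.list" -> "IP1:PORT,IP2:PORT2", // placeholder addresses
    "auto.offset.reset" -> "smallest"
  )
}
```

The same layout applies to the `KafkaUtils.createDirectStream[...]` call: keep the `(` on the line that ends with the type parameters.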
I have managed to run Mahout rowsimilarity on flat files in the following format:
item-id tag1 tag-2 tag3
This has to be run from the CLI, and the output is again flat files. I want it to read its input from MongoDB (I am open to other databases too) and write the output back to a database, where our system can pick it up.
I've researched this for the past few days and found the following:
Will have to write Scala code implementing RowSimilarity
Pass it an IndexedDataSet object to process the data
Convert the output to required format (json/csv)
What I have yet to figure out is how to import data from the database into an IndexedDataSet. I have also read about the RDD format, but I still can't figure out how to convert JSON data to an RDD that the RowSimilarity code can use.
tl;dr: How to convert MongoDB data so that it can be processed by mahout/spark rowsimilarity?
Edit1: I have found some code that converts Mongodata to RDD from this link: https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage#scala-example
Now I need help to convert it to IndexedDataset so that it can be passed to SimilarityAnalysis.rowSimilarityIDS.
tl;dr: How do I convert RDD to IndexedDataset
Below is the answer:
import org.apache.hadoop.conf.Configuration
import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.math.indexeddataset.Schema
import org.apache.mahout.sparkbindings
import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
import org.apache.spark.rdd.RDD
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat
object SparkExample extends App {
implicit val mc = sparkbindings.mahoutSparkContext(masterUrl = "local", appName = "RowSimilarity")
val mongoConfig = new Configuration()
mongoConfig.set("mongo.input.uri", "mongodb://hostname:27017/db.collection")
val documents: RDD[(Object, BSONObject)] = mc.newAPIHadoopRDD(
mongoConfig,
classOf[MongoInputFormat],
classOf[Object],
classOf[BSONObject]
)
val documents_Array: RDD[(String, Array[String])] = documents.map(
doc1 => (
doc1._2.get("product_id").toString(),
// Strip the surrounding [" and "], split on the " , " separators, then
// lowercase each value and replace its spaces with hyphens.
doc1._2.get("product_attribute_value").toString()
.replace("[ \"", "").replace("\"]", "").split("\" , \"")
.map(value => value.toLowerCase.replace(" ", "-"))
)
)
val new_doc: RDD[(String, String)] = documents_Array.flatMapValues(x => x)
val myIDs = IndexedDatasetSpark(new_doc)(mc)
val readWriteSchema = new Schema(
"rowKeyDelim" -> "\t",
"columnIdStrengthDelim" -> ":",
"omitScore" -> false,
"elementDelim" -> " "
)
SimilarityAnalysis.rowSimilarityIDS(myIDs).dfsWrite("hdfs://hadoop:9000/mongo-hadoop-rowsimilarity", readWriteSchema)(mc)
}
build.sbt:
name := "scala-mongo"
version := "1.0"
scalaVersion := "2.10.6"
libraryDependencies += "org.mongodb" %% "casbah" % "3.1.1"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "org.mongodb.mongo-hadoop" % "mongo-hadoop-core" % "1.4.2"
libraryDependencies ++= Seq(
"org.apache.hadoop" % "hadoop-client" % "2.6.0" exclude("javax.servlet", "servlet-api") exclude ("com.sun.jmx", "jmxri") exclude ("com.sun.jdmk", "jmxtools") exclude ("javax.jms", "jms") exclude ("org.slf4j", "slf4j-log4j12") exclude("hsqldb","hsqldb"),
"org.scalatest" % "scalatest_2.10" % "1.9.2" % "test"
)
libraryDependencies += "org.apache.mahout" % "mahout-math-scala_2.10" % "0.11.2"
libraryDependencies += "org.apache.mahout" % "mahout-spark_2.10" % "0.11.2"
libraryDependencies += "org.apache.mahout" % "mahout-math" % "0.11.2"
libraryDependencies += "org.apache.mahout" % "mahout-hdfs" % "0.11.2"
resolvers += "typesafe repo" at "http://repo.typesafe.com/typesafe/releases/"
resolvers += Resolver.mavenLocal
I've used mongo-hadoop to get data from Mongo and use it. Since my data had an array, I had to use flatMapValues to flatten it and then pass to IDS for proper output.
PS: I posted the answer here and not the linked question because this Q&A covers the full scope of getting data and processing it.
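The flattening step can be sketched with plain Scala collections, without Spark or Mahout. The product IDs and tags below are made-up sample data, and `FlattenTagsSketch` is a hypothetical name. The result has the same shape as what `flatMapValues(x => x)` produces on the RDD: one (id, tag) pair per array element.

```scala
object FlattenTagsSketch {
  // Sample (product_id, tags) pairs; the values are made up for illustration.
  val documents: Seq[(String, Array[String])] = Seq(
    "p1" -> Array("red", "cotton"),
    "p2" -> Array("blue")
  )

  // One (id, tag) pair per tag, mirroring RDD.flatMapValues(x => x).
  val pairs: Seq[(String, String)] =
    documents.flatMap { case (id, tags) => tags.toSeq.map(tag => id -> tag) }
}
```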
The code below causes Spark to become unresponsive:
System.setProperty("hadoop.home.dir", "H:\\winutils");
val sparkConf = new SparkConf().setAppName("GroupBy Test").setMaster("local[1]")
val sc = new SparkContext(sparkConf)
def main(args: Array[String]) {
val text_file = sc.textFile("h:\\data\\details.txt")
val counts = text_file
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
println(counts);
}
I'm setting hadoop.home.dir in order to avoid the error mentioned here: Failed to locate the winutils binary in the hadoop binary path
This is what my build.sbt file looks like:
lazy val root = (project in file(".")).
settings(
name := "hello",
version := "1.0",
scalaVersion := "2.11.0"
)
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.11" % "1.6.0"
)
Should this Spark Scala code compile and run with the build.sbt above?
I think the code is fine, since it was taken verbatim from http://spark.apache.org/examples.html, but I am not sure whether the Hadoop WinUtils path is required.
Update: "The solution was to use fork := true in the main build.sbt"
Here is the reference: Spark: ClassNotFoundException when running hello world example in scala 2.11
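As a sketch, the setting from the update goes at the top level of build.sbt. Forking makes sbt launch the application in a separate JVM rather than inside sbt's own process.

```scala
// build.sbt
// Run the application in a separate JVM instead of inside sbt's own JVM.
fork := true
```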
This is the content of my build.sbt. Note that if your internet connection is slow, resolving the dependencies might take some time.
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.6.1",
"org.apache.spark" %% "spark-mllib" % "1.6.1",
"org.apache.spark" %% "spark-sql" % "1.6.1",
"org.slf4j" % "slf4j-api" % "1.7.12"
)
run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))
In main I added the following; the path depends on where you placed the winutil folder.
System.setProperty("hadoop.home.dir", "c:\\winutil")
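The `run in Compile <<= ...` line above uses the `<<=` operator from sbt 0.13, which is no longer available in sbt 1.x. Assuming sbt 1.x, the same redefinition of `run` is usually written with `:=` and `.evaluated`, as in this sketch:

```scala
// build.sbt (sbt 1.x syntax is an assumption; the original targets sbt 0.13)
Compile / run := Defaults.runTask(
  Compile / fullClasspath,
  Compile / run / mainClass,
  Compile / run / runner
).evaluated
```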