Creating a Kafka consumer using Spark Streaming - Scala

I have started Kafka, created a topic and a producer. Now I want to read the messages sent by that producer. My code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

def main(args: Array[String]) {
  val sparkConf = new SparkConf()
  val spark = new SparkContext(sparkConf)
  val streamingContext = new StreamingContext(spark, Seconds(5))
  // Receiver-based stream: ZooKeeper quorum, consumer group, topic -> number of receiver threads
  val kafkaStream = KafkaUtils.createStream(streamingContext,
    "localhost:2181",
    "test-group",
    Map("test" -> 1))
  kafkaStream.print()
  streamingContext.start()
  streamingContext.awaitTermination()
}
The dependencies I use
<properties>
<spark.version>1.6.2</spark.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>
But every time I try to run it in IDEA I get:
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
at org.apache.spark.util.Utils$.getSystemProperties(Utils.scala:1582)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:59)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:53)
at com.mypackage.KafkaConsumer$.main(KafkaConsumer.scala:10)
at com.mypackage.KafkaConsumer.main(KafkaConsumer.scala)
Other questions here point to conflicts between dependencies.
I use Scala 2.10.5 and Spark 1.6.2. I have tried them in other projects and they worked fine.
Line 10 in this case is val sparkConf = new SparkConf()
I run the app in IDEA without packaging it.
What can be the reason for this problem?

This is a Scala version mismatch: your code and your dependencies use different Scala versions.
You say you are using Scala 2.10, but you import spark-xx_2.11 dependencies. Unify your Scala version.
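If it helps, a quick way to confirm which Scala and Spark versions actually end up on the runtime classpath is a check like the following (a minimal sketch; the object name is arbitrary):
import org.apache.spark.SPARK_VERSION

object VersionCheck {
  def main(args: Array[String]): Unit = {
    // Print the Scala library version and the Spark version found on the classpath
    println(s"Scala: ${scala.util.Properties.versionNumberString}")
    println(s"Spark: $SPARK_VERSION")
  }
}
If the printed Scala version starts with 2.10 while your Spark artifacts are the _2.11 builds (or vice versa), that mismatch is the source of the NoSuchMethodError.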

Related

Getting dependency error for SparkSession and SQLContext

I am getting dependency errors for SQLContext and SparkSession in my Spark program:
val sqlContext = new SQLContext(sc)
val spark = SparkSession.builder()
Error for SQLContext:
Symbol 'type org.apache.spark.Logging' is missing from the classpath. This symbol is required by 'class org.apache.spark.sql.SQLContext'. Make sure that type Logging is in your classpath and check for conflicting dependencies with -Ylog-classpath. A full rebuild may help if 'SQLContext.class' was compiled against an incompatible version of org.apache.spark.
Error for SparkSession:
not found: value SparkSession
Below are the Spark dependencies in my pom.xml:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.6.0-cdh5.15.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>2.0.0-cloudera1-SNAPSHOT</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-catalyst_2.10</artifactId>
<version>1.6.0-cdh5.15.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-test-tags_2.10</artifactId>
<version>1.6.0-cdh5.15.1</version>
</dependency>
You can't have both Spark 2 and Spark 1.6 dependencies defined in your project.
org.apache.spark.Logging is not available in Spark 2 anymore.
Change
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>2.0.0-cloudera1-SNAPSHOT</version>
</dependency>
to
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.0-cdh5.15.1</version>
</dependency>
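Also note that SparkSession does not exist in Spark 1.6; the 1.6-style entry point goes through SparkContext and SQLContext. A minimal sketch (the app name and master here are placeholders):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object Spark16Entry {
  def main(args: Array[String]): Unit = {
    // Placeholder app name and master; adjust for your environment
    val conf = new SparkConf().setAppName("spark16-app").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // ... use sqlContext for SQL/DataFrame work ...
    sc.stop()
  }
}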

Spark Streaming Kafka: ClassNotFoundException for ByteArrayDeserializer when run with spark-submit

I'm new to Scala / Spark Streaming, and to Stack Overflow, so please excuse my formatting. I have made a Scala app that reads log files from a Kafka stream. It runs fine within the IDE, but I'll be damned if I can get it to run using spark-submit. It always fails with:
ClassNotFoundException: org.apache.kafka.common.serialization.ByteArrayDeserializer
The line referenced in the exception is the load() call in this snippet:
val records = spark
.readStream
.format("kafka") // <-- use KafkaSource
.option("subscribe", kafkaTopic)
.option("kafka.bootstrap.servers", kafkaBroker) // 192.168.4.86:9092
.load()
.selectExpr("CAST(value AS STRING) AS temp")
.withColumn("record", deSerUDF($"temp"))
IDE: IntelliJ
Spark: 2.2.1
Scala: 2.11.8
Kafka: kafka_2.11-0.10.0.0
Relevant parts of pom.xml:
<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<encoding>UTF-8</encoding>
<scala.version>2.11.8</scala.version>
<scala.compat.version>2.11</scala.compat.version>
<spark.version>2.2.1</spark.version>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>com.github.scala-incubator.io</groupId>
<artifactId>scala-io-file_2.11</artifactId>
<version>0.4.3-1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.10.0.0</version>
<!-- version>2.0.0</version -->
</dependency>
Note: I don't think it is related, but I have to use zip -d BroLogSpark.jar "META-INF/*.SF" and zip -d BroLogSpark.jar "META-INF/*.DSA" to get past complaints about the manifest signatures.
My jar file does not include anything from org.apache.kafka. I have seen several posts that strongly suggest I have a version mismatch, and I have tried countless permutations of changes to pom.xml and spark-submit. After each change, I confirm that it still runs within the IDE, then try spark-submit again on the same system, as the same user. Below is my most recent attempt, where my BroLogSpark.jar is in the current directory and "192.168.4.86:9092 profile" are the input arguments.
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.1,org.apache.kafka:kafka-clients:0.10.0.0 BroLogSpark.jar 192.168.4.86:9092 BroFile
Add the dependency below too:
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>0.10.0.0</version>
</dependency>
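To double-check whether the deserializer is actually reachable on the driver classpath when you launch with spark-submit, a tiny diagnostic like this can help (a sketch only; the class name is taken from the exception above):
object ClasspathCheck {
  def main(args: Array[String]): Unit = {
    // Throws ClassNotFoundException if kafka-clients is missing from the classpath
    val cls = Class.forName("org.apache.kafka.common.serialization.ByteArrayDeserializer")
    // Shows which jar the class was actually loaded from
    println(s"${cls.getName} loaded from ${cls.getProtectionDomain.getCodeSource.getLocation}")
  }
}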

Unable to connect to remote Cassandra via Spark + Scala

I'm having some trouble trying to connect to a remote Cassandra instance using Apache Spark and Scala. In the past I successfully managed to connect, in the same way, to MongoDB.
This time I really don't understand why I'm getting the following error:
Failed to open native connection to Cassandra at {127.0.0.1}:9042
I guess it's a dependency and version problem, but I was not able to find anything related to this particular issue, either in the documentation or in other questions.
I connect to my server via an SSH tunnel using JSch, and that works fine. Then I am able to connect to the local Apache Spark instance using SparkConnectionFactory.scala:
package connection
import org.apache.spark.{SparkConf, SparkContext}

class SparkConnectionFactory {
  var sparkContext: SparkContext = _

  def initSparkConnection = {
    val configuration = new SparkConf(true).setMaster("local[8]")
      .setAppName("my_test")
      .set("spark.cassandra.connection.host", "localhost")
      .set("spark.cassandra.input.consistency.level", "ONE")
      .set("spark.driver.allowMultipleContexts", "true")
    val sc = new SparkContext(configuration)
    sparkContext = sc
  }

  def getSparkInstance: SparkContext = {
    sparkContext
  }
}
And calling it in my Main.scala:
val sparkConnectionFactory = new SparkConnectionFactory
sparkConnectionFactory.initSparkConnection
val sc : SparkContext = sparkConnectionFactory.getSparkInstance
But when I try to select all the items inside a Cassandra table using:
import com.datastax.spark.connector._ // provides cassandraTable on SparkContext

val rdd = sc.cassandraTable("my_keyspace", "my_table")
rdd.foreach(println)
I get the error I wrote above.
On my server I installed Scala ~v2.11.6, Spark ~v2.1.1, SparkSQL ~v2.1.1. Of course I have 8 cores and a replication factor of 1. In my pom.xml I have:
. . .
<properties>
<scala.version>2.11.6</scala.version>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
. . .
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector_2.10 -->
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.11</artifactId>
<version>2.0.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/commons-codec/commons-codec -->
<dependency>
<groupId>commons-codec</groupId>
<artifactId>commons-codec</artifactId>
<version>1.9</version>
</dependency>
</dependencies>
Is my issue caused by conflicting versions? If yes, how can I fix this? If not, any hint on what's causing it?
Thanks in advance.
I'm forwarding port 9042 to 8988
Then that's the port you need to connect to:
.set("spark.cassandra.connection.port", "8988") // SparkConf values must be strings

Exception while running Spark program with SQL context in Scala

I am trying to run a simple Spark Scala program built with Maven.
Below is the source code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object parquetoperations {
  def main(args: Array[String]) {
    val sparkconf = new SparkConf().setAppName("spark1").setMaster("local")
    val sc = new SparkContext(sparkconf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val peopleRDD = sc.textFile(args(0))
    val peopleDF = peopleRDD.map(_.split(","))
      .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
      .toDF()
    peopleDF.createOrReplaceTempView("people")
    val adultsDF = sqlContext.sql("select * from people where age>18")
    //adultsDF.map(x => "Name: "+x.getAs[String]("name")+ " age is: "+x.getAs[Int]("age")).show();
  }
}
and below are the Maven dependencies I have:
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.10.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-xml</artifactId>
<version>2.11.0-M4</version>
</dependency>
It throws the error below. I have tried to debug it in various ways with no luck:
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.$scope()Lscala/xml/TopScope$;
This looks like an error related to loading the Spark web UI.
All your dependencies are on Scala 2.10, but your scala-xml dependency is on Scala 2.11.
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-xml</artifactId>
<version>2.11.0-M4</version>
</dependency>
By the way, unless you have a strong reason not to, I would suggest you move to Scala 2.11.8. Everything is much better with 2.11 compared to 2.10.
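If you do move to Scala 2.11 (and therefore the _2.11 Spark artifacts), the natural Spark 2.x entry point is SparkSession. A minimal sketch of the same program, reusing the Person case class from the question:
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

object parquetoperations {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark1").master("local").getOrCreate()
    import spark.implicits._
    // Same pipeline as above, driven by SparkSession instead of SQLContext
    val peopleDF = spark.sparkContext.textFile(args(0))
      .map(_.split(","))
      .map(a => Person(a(0), a(1).trim.toInt))
      .toDF()
    peopleDF.createOrReplaceTempView("people")
    spark.sql("select * from people where age > 18").show()
    spark.stop()
  }
}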

Getting an error while trying to run a simple Spark Streaming Kafka example

I am trying to run a simple Kafka Spark Streaming example. Here is the error I am getting:
16/10/02 20:45:43 INFO SparkEnv: Registering OutputCommitCoordinator
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.$scope()Lscala/xml/TopScope$;
at org.apache.spark.ui.jobs.StagePage.<init>(StagePage.scala:44)
at org.apache.spark.ui.jobs.StagesTab.<init>(StagesTab.scala:34)
at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:62)
at org.apache.spark.ui.SparkUI$.create(SparkUI.scala:215)
at org.apache.spark.ui.SparkUI$.createLiveUI(SparkUI.scala:157)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:443)
at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:836)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:84)
at org.apache.spark.streaming.api.java.JavaStreamingContext.<init>(JavaStreamingContext.scala:138)
at com.application.SparkConsumer.App.main(App.java:27)
I am setting up this example using the following pom. I have tried to track down the missing scala.Predef method, added the missing dependency for spark-streaming-kafka-0-8-assembly, and I can see the class when I explore that jar.
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>0.8.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.8.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.0.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.0.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-8-assembly_2.11</artifactId>
<version>2.0.0</version>
</dependency>
I have tried a simple Spark word count example and it works fine. It is only when I use spark-streaming-kafka that I run into trouble. I have tried to look up this error, but no luck.
Here is the code snippet:
SparkConf sparkConf = new SparkConf().setAppName("someapp").setMaster("local[2]");
// Create the context with 2 seconds batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));
int numThreads = Integer.parseInt(args[3]);
Map<String, Integer> topicMap = new HashMap<String,Integer>();
topicMap.put("fast-messages", 1);
Map<String, String> kafkaParams = new HashMap<String,String>();
kafkaParams.put("metadata.broker.list", "localhost:9092");
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc,"zoo1","my-consumer-group", topicMap);
There seems to be a problem with the 2.11 build of Kafka 0.8.2.0. After switching to the 2.10 build it worked fine.
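For reference, the Scala equivalent of the Java snippet above, using the same receiver-based 0-8 API (a minimal sketch reusing the question's ZooKeeper quorum, group id, and topic map):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object SparkConsumer {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("someapp").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    // Receiver-based stream: ZooKeeper quorum, consumer group, topic -> receiver threads
    val messages = KafkaUtils.createStream(ssc, "zoo1", "my-consumer-group", Map("fast-messages" -> 1))
    messages.map(_._2).print() // keys are dropped; values are the message payloads
    ssc.start()
    ssc.awaitTermination()
  }
}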