Spark 2.3.1 structured streaming kafka ClassNotFound [duplicate] - scala

This question already has answers here:
Why does format("kafka") fail with "Failed to find data source: kafka." (even with uber-jar)?
(8 answers)
Closed 4 years ago.
I am trying to use Spark 2.3.1 structured streaming with Kafka. Getting the following error:
java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
... 49 elided
Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
... 51 more
Please advise.
Scala code (using IntelliJ 2018.1):
import org.apache.log4j.Logger._
import org.apache.spark.sql.SparkSession
object test {
  def main(args: Array[String]): Unit = {
    println("test3")

    import org.apache.log4j._
    getLogger("org").setLevel(Level.ERROR)
    getLogger("akka").setLevel(Level.ERROR)

    val spark = SparkSession.
      builder.
      master("local").
      appName("StructuredNetworkWordCount").
      getOrCreate()

    import spark.implicits._

    val lines = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "t_tweets")
      .load()

    lines.selectExpr("CAST(value AS STRING)")
      .as[(String)]

    // Split the lines into words
    val words = lines.as[String].flatMap(_.split(" "))

    // Generate running word count
    val wordCounts = words.groupBy("value").count()

    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
build.sbt:
name := "scalaSpark3"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql-kafka-0-10
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.1" % "provided"
// https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "1.1.0"
// https://mvnrepository.com/artifact/org.apache.spark/spark-streaming
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.3.1" % "provided"
Full error log:
objc[7301]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_151.jdk/Contents/Home/bin/java (0x10c6334c0) and /Library/Java/JavaVirtualMachines/jdk1.8.0_151.jdk/Contents/Home/jre/lib/libinstrument.dylib (0x10c6bf4e0). One of the two will be used. Which one is undefined.
test3
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:159)
at test$.main(test.scala:33)
at test.main(test.scala)
Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
... 3 more
My code is based on the sample code here:
https://spark.apache.org/docs/2.3.1/structured-streaming-programming-guide.html
https://spark.apache.org/docs/2.3.1/structured-streaming-kafka-integration.html

The Kafka integration is not included on your Spark classpath.
Therefore, remove the provided scope from
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.1" % "provided"
and make sure you build an uber jar before you run spark-submit.
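For reference, a minimal sketch of how the dependencies could look after that change (versions copied from the question; treat this as an illustration rather than the exact file):
// build.sbt (sketch): no "provided" scope on spark-sql-kafka-0-10, so the
// kafka data source is available both from the IDE and inside the uber jar
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.1"
Alternatively, if the dependency stays marked as provided, it can be supplied at submit time with --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1, as the Structured Streaming Kafka integration guide linked above describes.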

Related

sparkSession throwing Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/collect/Maps

I was trying to write a simple Scala program that uses Spark, with the following content.
src/main/scala/SimpleApp.scala:
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.random
object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "<Some Valid Text File Path>" // Should be some file on your system
    val spark = SparkSession.builder.appName("Simple Application").master("local").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}
build.sbt:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.12.10"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.5"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.5"
but when I run the program I get the following exception stack trace:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/03/21 03:23:07 INFO SparkContext: Running Spark version 2.4.5
Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/collect/Maps
at org.apache.hadoop.metrics2.lib.MetricsRegistry.<init>(MetricsRegistry.java:42)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:93)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:140)
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:38)
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:36)
at org.apache.hadoop.security.UserGroupInformation$UgiMetrics.create(UserGroupInformation.java:120)
at org.apache.hadoop.security.UserGroupInformation.<clinit>(UserGroupInformation.java:236)
at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2422)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:293)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at SimpleApp$.main(SimpleApp.scala:9)
at SimpleApp.main(SimpleApp.scala)
Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Maps
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 17 more
I tried running in debug mode, and the exception seems to be thrown when trying to create the SparkSession object. What am I missing?
I have installed Spark from brew and it works from the terminal.
I found a solution. To run this in the IDE I needed to add a few extra dependencies. I appended the following to build.sbt:
libraryDependencies += "com.google.guava" % "guava" % "28.2-jre"
libraryDependencies += "com.fasterxml.jackson.core" % "jackson-core" % "2.10.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.7.2"

Spark-Kafka invalid dependency detected

I have some basic Spark-Kafka code and I am trying to run the following:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import java.util.regex.Pattern
import java.util.regex.Matcher
import org.apache.spark.streaming.kafka._
import kafka.serializer.StringDecoder
import Utilities._
object WordCount {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext("local[*]", "KafkaExample", Seconds(1))
    setupLogging()

    // Construct a regular expression (regex) to extract fields from raw Apache log lines
    val pattern = apacheLogPattern()

    // hostname:port for Kafka brokers, not Zookeeper
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

    // List of topics you want to listen for from Kafka
    val topics = List("testLogs").toSet

    // Create our Kafka stream, which will contain (topic,message) pairs. We tack a
    // map(_._2) at the end in order to only get the messages, which contain individual
    // lines of data.
    val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics).map(_._2)

    // Extract the request field from each log line
    val requests = lines.map(x => { val matcher: Matcher = pattern.matcher(x); if (matcher.matches()) matcher.group(5) })

    // Extract the URL from the request
    val urls = requests.map(x => { val arr = x.toString().split(" "); if (arr.size == 3) arr(1) else "[error]" })

    // Reduce by URL over a 5-minute window sliding every second
    val urlCounts = urls.map(x => (x, 1)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(300), Seconds(1))

    // Sort and print the results
    val sortedResults = urlCounts.transform(rdd => rdd.sortBy(x => x._2, false))
    sortedResults.print()

    // Kick it off
    ssc.checkpoint("/home/")
    ssc.start()
    ssc.awaitTermination()
  }
}
I am using the IntelliJ IDE and created the Scala project with sbt. The details of the build.sbt file are as follows:
name := "Sample"
version := "1.0"
organization := "com.sundogsoftware"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.4.1",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.4.1",
  "org.apache.hadoop" % "hadoop-hdfs" % "2.6.0"
)
However, when I try to build the code, it creates following error:
Error:scalac: missing or invalid dependency detected while loading class file 'StreamingContext.class'.
Could not access type Logging in package org.apache.spark,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with -Ylog-classpath to see the problematic classpath.)
A full rebuild may help if 'StreamingContext.class' was compiled against an incompatible version of org.apache.spark.
Error:scalac: missing or invalid dependency detected while loading class file 'DStream.class'.
Could not access type Logging in package org.apache.spark,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with -Ylog-classpath to see the problematic classpath.)
A full rebuild may help if 'DStream.class' was compiled against an incompatible version of org.apache.spark.
When using different Spark libraries together, the versions of all of them should always match.
The version of Kafka you use also matters; the artifact should be, for example, spark-streaming-kafka-0-10_2.11:
...
scalaVersion := "2.11.8"
val sparkVersion = "2.2.0"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
  "org.apache.hadoop" % "hadoop-hdfs" % "2.6.0"
)
This is a useful site if you need to check the exact dependencies you should use:
https://search.maven.org/
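One detail worth noting in the fix above: in sbt, %% appends the Scala binary version to the artifact name automatically, so the _2.11 suffix should not be written out as well. A small sketch of the two equivalent forms:
// %% appends the Scala binary version from scalaVersion (_2.11 here),
// so this resolves to spark-streaming-kafka-0-10_2.11
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.2.0"
// the same dependency with the suffix spelled out and a single %
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.2.0"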

Spark 2.3.0 Failed to find data source: kafka

I am attempting to set up a Kafka stream using a CSV so that I can stream it into Spark. However, I keep getting:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
My code looks like this
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.execution.streaming.FileStreamSource.Timestamp
import org.apache.spark.sql.types._
object SpeedTester {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[4]").appName("SpeedTester").config("spark.driver.memory", "8g").getOrCreate()
    val rootLogger = Logger.getRootLogger()
    rootLogger.setLevel(Level.ERROR)
    import spark.implicits._

    val mySchema = StructType(Array(
      StructField("incident_id", IntegerType),
      StructField("date", StringType),
      StructField("state", StringType),
      StructField("city_or_county", StringType),
      StructField("n_killed", IntegerType),
      StructField("n_injured", IntegerType)
    ))

    val streamingDataFrame = spark.readStream.schema(mySchema).csv("C:/Users/zoldham/IdeaProjects/flinkpoc/Data/test")

    streamingDataFrame.selectExpr("CAST(incident_id AS STRING) AS key",
      "to_json(struct(*)) AS value").writeStream
      .format("kafka")
      .option("topic", "testTopic")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("checkpointLocation", "C:/Users/zoldham/IdeaProjects/flinkpoc/Data")
      .start()

    val df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "testTopic").load()

    val df1 = df.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)").as[(String, Timestamp)]
      .select(from_json(col("value"), mySchema).as("data"), col("timestamp"))
      .select("data.*", "timestamp")

    df1.writeStream
      .format("console")
      .option("truncate", "false")
      .start()
      .awaitTermination()
  }
}
And my build.sbt file looks like this
name := "Spark POC"
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
libraryDependencies += "com.microsoft.sqlserver" % "mssql-jdbc" % "6.2.1.jre8"
libraryDependencies += "org.scalafx" %% "scalafx" % "8.0.144-R12"
libraryDependencies += "org.apache.ignite" % "ignite-core" % "2.5.0"
libraryDependencies += "org.apache.ignite" % "ignite-spring" % "2.5.0"
libraryDependencies += "org.apache.ignite" % "ignite-indexing" % "2.5.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10_2.11" % "2.3.0"
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.11.0.1"
What is causing that error? As you can see, I plainly included Kafka in the library dependencies, and even followed the official guide. Here is the stack trace:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:283)
at SpeedTester$.main(SpeedTester.scala:61)
at SpeedTester.main(SpeedTester.scala)
Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
... 3 more
You need to add the missing dependency:
"org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"
as stated in the documentation (the Structured Streaming Kafka integration guide, for example).
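For clarity, the Structured Streaming Kafka source lives in spark-sql-kafka-0-10, which is a different artifact from the spark-streaming-kafka-0-10 (DStream) dependency already in the build, so the line to add is:
// Structured Streaming source/sink for Kafka (needed for readStream.format("kafka"))
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"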

Scala Exception

I am learning Scala programming to write a driver program for word count in Apache Spark. I am using Windows 7 and the latest Spark version, 2.2.0. While executing the program I get the error mentioned below.
How do I fix this and get the result?
SBT
name := "sample"
version := "0.1"
scalaVersion := "2.12.3"
val sparkVersion = "2.2.0"
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.11" % sparkVersion,
  "org.apache.spark" % "spark-sql_2.11" % sparkVersion,
  "org.apache.spark" % "spark-streaming_2.11" % sparkVersion
)
Driver Program
package com.demo.file
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql.SparkSession
object Reader {
  def main(args: Array[String]): Unit = {
    println("Welcome to Reader.")
    val filePath = "C:\\notes.txt"
    val spark = SparkSession.builder.appName("Simple app").config("spark.master", "local").getOrCreate()
    val fileData = spark.read.textFile(filePath).cache()
    val count_a = fileData.filter(line => line.contains("a")).count()
    val count_b = fileData.filter(line => line.contains("b")).count()
    println(s" count of A $count_a and count of B $count_b")
    spark.stop()
  }
}
Error
Welcome to Reader.
Exception in thread "main" java.lang.NoClassDefFoundError: scala/Product$class
at org.apache.spark.SparkConf$DeprecatedConfig.<init>(SparkConf.scala:723)
at org.apache.spark.SparkConf$.<init>(SparkConf.scala:571)
at org.apache.spark.SparkConf$.<clinit>(SparkConf.scala)
at org.apache.spark.SparkConf.set(SparkConf.scala:92)
at org.apache.spark.SparkConf.set(SparkConf.scala:81)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6$$anonfun$apply$1.apply(SparkSession.scala:905)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6$$anonfun$apply$1.apply(SparkSession.scala:905)
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:138)
at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:236)
at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:229)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:138)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:905)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at com.demo.file.Reader$.main(Reader.scala:11)
at com.demo.file.Reader.main(Reader.scala)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 18 more
Spark 2.2.0 is built and distributed to work with Scala 2.11 by default. To write applications in Scala, you need to use a compatible Scala version (e.g. 2.11.x). Your Scala version is 2.12.x, which is why it is throwing the exception.
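Concretely, that means setting scalaVersion to a 2.11 release and letting %% pick up the matching artifacts; a sketch of the adjusted build.sbt using the versions from the question:
name := "sample"
version := "0.1"
// Spark 2.2.0 artifacts are compiled against Scala 2.11, so the project must use 2.11 as well
scalaVersion := "2.11.8"
val sparkVersion = "2.2.0"
libraryDependencies ++= Seq(
  // %% appends the _2.11 suffix automatically based on scalaVersion
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion
)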

Spark: Creating DataFrame gives exception

I am trying to create a DataFrame using Spark's SQLContext. I am using Spark 1.6.3 and Scala 2.10.5. Below is my code for creating DataFrames.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import com.knoldus.pipeline.KMeansPipeLine
object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    val kMeans = new KMeansPipeLine()
    val df = sqlContext.createDataFrame(Seq(
      ("a#email.com", 12000, "M"),
      ("b#email.com", 43000, "M"),
      ("c#email.com", 5000, "F"),
      ("d#email.com", 60000, "M")
    )).toDF("email", "income", "gender")

    val categoricalFeatures = List("gender", "email")
    val numberOfClusters = 2
    val iterations = 10

    val predictionResult = kMeans.predict(sqlContext, df, categoricalFeatures, numberOfClusters, iterations)
  }
}
It's giving me the following exception. What mistake am I making? Can anyone help me resolve this?
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.createDataFrame(Lscala/collection/Seq;Lscala/reflect/api/TypeTags$TypeTag;)Lorg/apache/spark/sql/Dataset;
at SimpleApp$.main(SimpleApp.scala:24)
at SimpleApp.main(SimpleApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The dependencies I have used are:
scalaVersion := "2.10.5"
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "2.0.0" % "provided",
  "org.apache.spark" % "spark-sql_2.10" % "2.0.0" % "provided",
  "org.apache.spark" % "spark-mllib_2.10" % "2.0.0" % "provided",
  "knoldus" % "k-means-pipeline" % "0.0.1"
)
As far as I can see, your createDataFrame call is missing its second argument. The method signatures are described here:
https://spark.apache.org/docs/1.6.1/api/scala/index.html#org.apache.spark.sql.SQLContext#createDataFrame(org.apache.spark.api.java.JavaRDD,%20java.lang.Class)
In your case the overload being used is
def createDataFrame[A <: Product](data: Seq[A])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[A]): DataFrame
:: Experimental :: Creates a DataFrame from a local Seq of Product.
OR you can convert the Seq into a List/RDD and use the two-argument overload.
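For illustration, here is a sketch of the second option in Spark 1.6 style, i.e. building the DataFrame from an RDD[Row] plus an explicit schema via the two-argument createDataFrame overload (it reuses sc and sqlContext from the question; column names and values are copied from the original code):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Explicit schema matching the ("email", "income", "gender") columns
val schema = StructType(Seq(
  StructField("email", StringType),
  StructField("income", IntegerType),
  StructField("gender", StringType)
))

// Rows as an RDD, then the two-argument createDataFrame(rowRDD, schema)
val rows = sc.parallelize(Seq(
  Row("a#email.com", 12000, "M"),
  Row("b#email.com", 43000, "M"),
  Row("c#email.com", 5000, "F"),
  Row("d#email.com", 60000, "M")
))
val df = sqlContext.createDataFrame(rows, schema)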