I use Flink to process some data, and I found that fromCollection in StreamExecutionEnvironment has many overloads.
Part of the pom.xml is:
<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
<scala.version>2.12.16</scala.version>
<scala.binary.version>2.12</scala.binary.version>
<flink.version>1.14.4</flink.version>
</properties>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
</dependency>
<!-- SQL Server-->
<dependency>
<groupId>com.microsoft.sqlserver</groupId>
<artifactId>sqljdbc4</artifactId>
<version>4.0</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>druid-spring-boot-starter</artifactId>
<version>1.1.10</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>com.microsoft.sqlserver</groupId>
<artifactId>mssql-jdbc</artifactId>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
<version>1.14.4</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_${scala.binary.version}</artifactId>
<version>1.14.4</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-runtime-web_${scala.binary.version}</artifactId>
<version>1.14.4</version>
</dependency>
</dependencies>
It seems that the Scala compiler cannot determine which fromCollection overload to use.
package com.mycode.learnflink.ts.sync
import com.mycode.learnflink.model.datasourcesync.domain.RiverWaterRegime
import org.apache.flink.streaming.api.datastream.DataStreamSource
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
class SyncWater {
var cachedRiverWaterRegimes: List[RiverWaterRegime] = List()
private val BATCH_SIZE = 1000
private var size = 0
def flinkProcess(list:List[RiverWaterRegime] ): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(6)
val source:DataStreamSource[String] = env.fromCollection(list) // Error:Cannot resolve overloaded method 'fromCollection'
}
}
What can I do to eliminate this error?
In your case, you should import the Scala API classes rather than the Java ones; the Scala variant of fromCollection takes an implicit TypeInformation, which the createTypeInformation import provides.
Here is an example.
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}
class SyncWater {
type RiverWaterRegime = Int // stand-in for your RiverWaterRegime domain class
private val BATCH_SIZE = 1000
private var size = 0
def flinkProcess(list:List[RiverWaterRegime] ): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(6)
val source = env.fromCollection(list)
}
}
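For completeness, here is a minimal runnable sketch (assuming the Scala DataStream API from flink-streaming-scala; the sample data and the job name are just placeholders) that also adds a sink and an execute call, without which the pipeline never actually runs:
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}
object SyncWaterJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(6)
    // The Scala fromCollection takes an implicit TypeInformation, derived here
    // by createTypeInformation, so the overload resolves without ambiguity.
    val source = env.fromCollection(List(1, 2, 3))
    source.print()            // a simple sink so the pipeline produces output
    env.execute("sync-water") // placeholder job name; nothing runs until execute is called
  }
}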
Related
I am following this link to use pureconfig to load the config:
https://pureconfig.github.io/docs/overriding-behavior-for-case-classes.html
Here is my code:
import com.typesafe.config.ConfigFactory
import pureconfig._
private case class SampleConf(foo: Int, bar: String)
object TestConfigLoad {
def main(args: Array[String]): Unit = {
loadConfig[SampleConf](ConfigFactory.parseString("{ FOO: 2, BAR: two }"))
}
}
When I run it, I am getting this error:
Error:scalac: Error: scala.collection.immutable.$colon$colon.tl$1()Lscala/collection/immutable/List;
java.lang.NoSuchMethodError: scala.collection.immutable.$colon$colon.tl$1()Lscala/collection/immutable/List;
at shapeless.LazyMacros$DerivationContext$State.addDependency(lazy.scala:363)
These are the entries in the pom file
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.7</version>
</dependency>
<dependency>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.15.2</version>
</dependency>
<dependency>
<groupId>com.github.pureconfig</groupId>
<artifactId>pureconfig_2.11</artifactId>
<version>0.8.0</version>
</dependency>
<dependency>
<groupId>org.clapper</groupId>
<artifactId>grizzled-slf4j_2.11</artifactId>
<version>1.3.3</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.9</version>
</dependency>
<dependency>
<groupId>com.lihaoyi</groupId>
<artifactId>sourcecode_2.11</artifactId>
<version>0.1.4</version>
</dependency>
You have to match the Scala versions: you cannot mix 2.10 (scala-library) and 2.11 (pureconfig_2.11).
Unless you have a good reason, use the latest stable version (2.12.8 at the time of writing).
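Once the Scala binary versions line up, a minimal sketch of loading the config could look like the one below (assuming a pureconfig version where loadConfig returns an Either, as 0.8.x does, and using lowercase keys so the default field mapping applies; overriding that mapping is what the linked page covers):
import com.typesafe.config.ConfigFactory
import pureconfig._
case class SampleConf(foo: Int, bar: String)
object TestConfigLoad {
  def main(args: Array[String]): Unit = {
    // loadConfig returns Either[ConfigReaderFailures, SampleConf]
    loadConfig[SampleConf](ConfigFactory.parseString("{ foo: 2, bar: two }")) match {
      case Right(conf)    => println(s"loaded: $conf")
      case Left(failures) => println(s"failed to load config: $failures")
    }
  }
}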
I am getting a compile-time error in my Scala Spark program on the filter method. Below is my code:
case class TrafficSchema(a: String, b: Int, c: Int, d: Int, e: Float)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val data = sqlContext.read.csv(input_Path)
val header = data.first()
val trackerData = data.filter(head => head != header)
.map(f => TrafficSchema(f.get(0).toString(), f.get(1).toString().toInt, f.get(2).toString().toInt, f.get(3).toString().toInt, f.get(4).toString().toFloat)).toDF()
I am getting a compile-time error on the filter method; below is the error:
Reference to method filter in class Dataset should not have survived past type checking, it should have been processed and eliminated during expansion of an enclosing macro.
And below are my maven dependencies:
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>2.0.0-cloudera1-SNAPSHOT</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>2.0.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.5</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.25</version>
</dependency>
<dependency>
<groupId>org.clapper</groupId>
<artifactId>grizzled-slf4j_${scala.compat.version}</artifactId>
<version>1.3.1</version>
</dependency>
<dependency>
<groupId>com.typesafe</groupId>
<artifactId>config</artifactId>
<version>1.3.2</version>
</dependency>
<!-- Test Scopes -->
<dependency>
<groupId>org.scalactic</groupId>
<artifactId>scalactic_${scala.compat.version}</artifactId>
<version>3.0.5</version>
</dependency>
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_${scala.compat.version}</artifactId>
<version>3.0.5</version>
<scope>test</scope>
</dependency>
</dependencies>
And the Scala version I am using is 2.10.6
I am not sure what I am doing wrong.
I am getting the error:
Exception in thread "main" java.lang.NoClassDefFoundError
Here is my complete program:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object StatefulNetworkWordCount {
  def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    val newCount = runningCount.getOrElse(0) + newValues.sum
    Some(newCount)
  }
  def main(args: Array[String]): Unit = {
    // App name and batch interval below are placeholders
    val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    // Set checkpoint directory
    ssc.checkpoint("E:\\sparkdata")
    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)
    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    // Count each word in each batch
    val pairs = words.map(word => (word, 1))
    // Update state using `updateStateByKey`
    val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
    // Print the first ten elements of each RDD generated in this DStream to the console
    runningCounts.print()
    ssc.start()            // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}
And my pom.xml file:
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.11</artifactId>
<version>1.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>1.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>1.6.0</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.11</artifactId>
<version>1.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.1.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-twitter_2.11</artifactId>
<version>1.6.0</version>
</dependency>
<dependency>
<groupId>org.twitter4j</groupId>
<artifactId>twitter4j-stream</artifactId>
<version>3.0.3</version>
</dependency>
<!-- JDBC Connector Jar -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.31</version>
</dependency>
<dependency>
<groupId>au.com.bytecode</groupId>
<artifactId>opencsv</artifactId>
<version>2.4</version>
</dependency>
</dependencies>
</project>
You are including different versions of Spark: you have both 2.1.1 and 1.6.0.
You should use the same Spark version for all Spark dependencies.
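One quick sanity check (a minimal sketch, assuming spark-core ends up on the runtime classpath) is to print the Spark version that actually wins at runtime and compare it with the versions declared in the pom:
object SparkVersionCheck {
  def main(args: Array[String]): Unit = {
    // SPARK_VERSION reports the version of the spark-core jar actually loaded;
    // with mixed 2.1.1 / 1.6.0 dependencies it may not be the one you expect.
    println(org.apache.spark.SPARK_VERSION)
  }
}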
When I run a normal word-count program (with the code below) without any DataFrame involved, I am able to run the application with spark-submit.
object wordCount {
def main(args: Array[String]): Unit = {
val logFile= "path/thread.txt"
val sparkConf = new SparkConf().setAppName("Spark Word Count")
val sc = new SparkContext(sparkConf)
val file = sc.textFile(logFile)
val counts = file.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("path/output1234")
sc.stop()
}
}
But when I run the below code
import scala.reflect.runtime.universe
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
object wordCount {
def main(args: Array[String]): Unit = {
val logFile = "path/thread.txt"
val sparkConf = new SparkConf().setAppName("Spark Word Count")
val sc = new SparkContext(sparkConf)
val file = sc.textFile(logFile)
val counts = file.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
case class count1(key:String,value:Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._;
counts.toDF.registerTempTable("count1")
val counts1 = sqlContext.sql("select * from count1")
counts.saveAsTextFile("path/output1234")
sc.stop()
}
}
I am getting the below error:
Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
at com.cadillac.spark.sparkjob.wordCount$.main(wordCount.scala:18)
I am not sure what I am missing.
The pom.xml I am using is below:
<name>sparkjob</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.10</version>
</dependency>
</dependencies>
</project>
Please suggest any changes.
My cluster is running:
spark-version 2.1.0-mapr-1703
Scala version 2.11.8
Thanks in advance
If you go to this documentation, the reason for the error is described there as:
This means that there is a mix of Scala versions in the libraries used in your code. The collection API is different between Scala 2.10 and 2.11, and this is the most common error, which occurs if a Scala 2.10 library is attempted to be loaded in a Scala 2.11 runtime. To fix this, make sure that the name has the correct Scala version suffix to match your Scala version.
So change your dependencies from
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.6.1</version>
</dependency>
to
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.0</version>
</dependency>
and add one more dependency
<!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
I guess the error should then go away.
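One extra check worth doing (a small sketch, assuming it is run on the same classpath as the job) is to print the Scala version at runtime; it should match the _2.11 suffix of the Spark artifacts:
object ScalaVersionCheck {
  def main(args: Array[String]): Unit = {
    // Prints something like "version 2.11.8", which should agree with the
    // _2.11 suffix of spark-core_2.11 and spark-sql_2.11.
    println(scala.util.Properties.versionString)
  }
}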
I have tried the code below with Spark and Scala; attaching the code and pom.xml.
package com.Spark.ConnectToHadoop
import org.apache.spark.SparkConf
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.rdd.RDD
//import groovy.sql.Sql.CreateStatementCommand
//import org.apache.spark.SparkConf
object CountWords {
def main(args:Array[String]){
val objConf = new SparkConf().setAppName("Spark Connection").setMaster("spark://IP:7077")
var sc = new SparkContext(objConf)
val objHiveContext = new HiveContext(sc)
objHiveContext.sql("USE test")
var test= objHiveContext.sql("show tables")
var i = 0
var testing = test.collect()
for(i<-0 until testing.length){
println(testing(i))
}
}
}
I have added the spark-core_2.10, spark-catalyst_2.10, spark-sql_2.10 and spark-hive_2.10 dependencies. Do I need to add any more dependencies?
Edit:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.Sudhir.Maven1</groupId>
<artifactId>SparkDemo</artifactId>
<version>IntervalMeterData1</version>
<packaging>jar</packaging>
<name>SparkDemo</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<spark.version>1.5.2</spark.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-catalyst_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>
Looks like you forgot to bump the spark-hive:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>1.5.2</version>
</dependency>
Consider introducing a Maven property, such as spark.version:
<properties>
<spark.version>1.5.2</spark.version>
</properties>
And modify all your Spark dependencies in this manner:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
Bumping up Spark versions won't be as painful then.
Just adding the spark.version property in your <properties> is not enough; you have to reference it with ${spark.version} in the dependencies.