How to use Flink's KafkaSource with Scala in 2022

I've checked out this similar but 7-year-old question, but it does not apply to newer Flink versions.
I'm trying to get a simple Flink Kafka job running and have tried various versions, getting different compile errors for each. I'm using sbt to manage my dependencies:
val flinkDependencies = Seq(
  "org.apache.flink" %% "flink-clients" % flinkVersion,
  "org.apache.flink" %% "flink-scala" % flinkVersion,
  "org.apache.flink" %% "flink-streaming-scala" % flinkVersion,
  "org.apache.flink" %% "flink-connector-kafka" % flinkVersion
)
Versions tried:
Scala 2.11.12 and 2.12.15
Flink 1.14.6
The code I'm trying to compile (relevant bits):
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.flink.connector.kafka.source.KafkaSource
...
val env = ExecutionEnvironment.getExecutionEnvironment
val kafkaConsumer = new KafkaSource.builder[String]
  .setBootstrapservers("localhost:9092")
  .setGroupId("flink")
  .setTopics("test")
  .build()
val text = env.fromSource(kafkaConsumer)
I did not find an official example confirming that this is indeed how one is supposed to use the KafkaSource, but I found this setup here and here. To my still very new Java eyes this looks aligned with the API docs. But I can't get it to work with either Scala version:
[error] somepathwithmyfile: type builder is not a member of object org.apache.flink.connector.kafka.source.KafkaSource
[error] val kafkaConsumer = new KafkaSource.builder[String]
[error] ^
[error] somepathwithmyfile: value fromSource is not a member of org.apache.flink.api.scala.ExecutionEnvironment
[error] val text = env.fromSource(kafkaConsumer)
[error] ^
[error] two errors found

For the first problem, drop the new:
val kafkaConsumer = KafkaSource.builder[String]
...
For the second problem, fromSource is defined on the streaming environment (org.apache.flink.streaming.api.scala.StreamExecutionEnvironment), not on the batch ExecutionEnvironment you are using, and it requires three arguments:
/** Create a DataStream using a [[Source]]. */
@Experimental
def fromSource[T: TypeInformation](
    source: Source[T, _ <: SourceSplit, _],
    watermarkStrategy: WatermarkStrategy[T],
    sourceName: String): DataStream[T] = {
  val typeInfo = implicitly[TypeInformation[T]]
  asScalaStream(javaEnv.fromSource(source, watermarkStrategy, sourceName, typeInfo))
}
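Putting the two fixes together, here is a minimal sketch of the corrected job (assuming Flink 1.14.x and the Scala streaming API; the object name, the source name, and the use of SimpleStringSchema as a value-only deserializer are illustrative, while the broker/topic/group values are the ones from the question):

import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.streaming.api.scala._

object KafkaJob {
  def main(args: Array[String]): Unit = {
    // fromSource lives on the streaming environment, not the batch ExecutionEnvironment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // builder is a static factory method, so no `new`
    val kafkaSource = KafkaSource.builder[String]()
      .setBootstrapServers("localhost:9092")
      .setTopics("test")
      .setGroupId("flink")
      .setValueOnlyDeserializer(new SimpleStringSchema())
      .build()

    // all three arguments: source, watermark strategy, source name
    val text = env.fromSource(kafkaSource, WatermarkStrategy.noWatermarks[String](), "kafka-source")

    text.print()
    env.execute("kafka-job")
  }
}

The wildcard import of org.apache.flink.streaming.api.scala._ also brings in the implicit TypeInformation[String] that fromSource requires.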
Also, note that Flink does not (yet) support Scala 2.12.15. See https://issues.apache.org/jira/browse/FLINK-20969. However, Flink 1.15 can be used with newer versions of Scala (including Scala 3) if you exclude Flink's built-in Scala API support. See https://flink.apache.org/2022/02/22/scala-free.html for more on this.
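If you do go that Scala-free route, a hedged build.sbt sketch might look like the following (the exact version is just an example; in Flink 1.15 most modules, including the Kafka connector, are published without the Scala suffix, so plain % is used instead of %%):

val flinkVersion = "1.15.2"

libraryDependencies ++= Seq(
  "org.apache.flink" % "flink-clients"         % flinkVersion,
  "org.apache.flink" % "flink-streaming-java"  % flinkVersion,
  "org.apache.flink" % "flink-connector-kafka" % flinkVersion
)

You then use the Java DataStream API directly from Scala instead of flink-scala / flink-streaming-scala.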

Related

Hazelcast server with scala client issue

I am trying to set up a Hazelcast server and client on my local machine. I am also trying to connect to the local Hazelcast server with the Scala client.
For the server I used the code below:
import com.hazelcast.config._
import com.hazelcast.Scala._

object HazelcastServer {
  def main(args: Array[String]): Unit = {
    val conf = new Config
    serialization.Defaults.register(conf.getSerializationConfig)
    serialization.DynamicExecution.register(conf.getSerializationConfig)
    val hz = conf.newInstance()
    val cmap = hz.getMap[String, String]("test")
    cmap.put("a", "A")
    cmap.put("b", "B")
  }
}
and the Hazelcast client as:
import com.hazelcast.Scala._
import client._
import com.hazelcast.client._
import com.hazelcast.config._

object Hazelcast_Client {
  def main(args: Array[String]): Unit = {
    val conf = new Config
    serialization.Defaults.register(conf.getSerializationConfig)
    serialization.DynamicExecution.register(conf.getSerializationConfig)
    val hz = conf.newClient()
    val cmap = hz.getMap("test")
    println(cmap.size())
  }
}
In my build.sbt,
libraryDependencies += "com.hazelcast" % "hazelcast" % "3.7.2"
libraryDependencies += "com.hazelcast" %% "hazelcast-scala" % "3.7.2"
I am getting the error below and am stuck on dependency issues.
Symbol 'type <none>.config.ClientConfig' is missing from the classpath.
[error] This symbol is required by 'value com.hazelcast.Scala.client.package.conf'.
[error] Make sure that type ClientConfig is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
[error] A full rebuild may help if 'package.class' was compiled against an incompatible version of <none>.config.
[error] val conf = new Config
I referred to the Hazelcast documentation. I am not able to find any good Hazelcast Scala examples to understand the setup and start playing with. If anybody can help in solving this issue, or share really good Scala examples, that would be helpful.
I've done a Scala+Akka Hazelcast before. My build.sbt included
libraryDependencies += "com.hazelcast" % "hazelcast-all" % "3.7.2"
I seem to remember that hazelcast-all was required.
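For reference, a hedged build.sbt sketch of that setup (swapping the core hazelcast artifact for hazelcast-all so the client classes such as ClientConfig end up on the classpath; the versions are the ones from the question):

libraryDependencies ++= Seq(
  "com.hazelcast"  % "hazelcast-all"   % "3.7.2",
  "com.hazelcast" %% "hazelcast-scala" % "3.7.2"
)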

Flink ES connection not compiling as expected

My problem is somewhat as described here.
Part of the code (actually taken from the Apache site) is below:
val httpHosts = new java.util.ArrayList[HttpHost]
httpHosts.add(new HttpHost("127.0.0.1", 9200, "http"))
httpHosts.add(new HttpHost("10.2.3.1", 9200, "http"))

val esSinkBuilder = new ElasticsearchSink.Builder[String](
  httpHosts,
  new ElasticsearchSinkFunction[String] {
    def createIndexRequest(element: String): IndexRequest = {
      val json = new java.util.HashMap[String, String]
      json.put("data", element)
      return Requests.indexRequest()
        .index("my-index")
        .`type`("my-type")
        .source(json)
If I add these three import statements:
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer
import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink
I get this error:
object elasticsearch is not a member of package org.apache.flink.streaming.connectors
object elasticsearch6 is not a member of package org.apache.flink.streaming.connectors
If I do not add those import statements, I get the error below:
Compiling 1 Scala source to E:\sar\scala\practice\readstbdata\target\scala-2.11\classes ...
[error] E:\sar\scala\practice\readstbdata\src\main\scala\example\readcsv.scala:35:25: not found: value ElasticsearchSink
[error] val esSinkBuilder = new ElasticsearchSink.Builder[String](
[error] ^
[error] E:\sar\scala\practice\readstbdata\src\main\scala\example\readcsv.scala:37:7: not found: type ElasticsearchSinkFunction
[error] new ElasticsearchSinkFunction[String] {
[error] ^
[error] two errors found
[error] (Compile / compileIncremental) Compilation failed
[error] Total time: 1 s, completed 10 Feb, 2020 2:15:04 PM
In the Stack Overflow question I referred to above, some function has been extended. My understanding is that flink.streaming.connectors.elasticsearch has to be extended into the REST libraries. 1) Is my understanding correct? 2) If yes, can I have the complete extensions? 3) If my understanding is wrong, please give me a solution.
Note: I added the following statements in build.sbt
libraryDependencies += "org.elasticsearch.client" % "elasticsearch-rest-high-level-client" % "7.5.2" ,
libraryDependencies += "org.elasticsearch" % "elasticsearch" % "7.5.2",
The streaming connectors are not part of the Flink binary distribution. You have to package them with your application.
For elasticsearch6 you need to add flink-connector-elasticsearch6_2.11, which you can do as follows:
libraryDependencies += "org.apache.flink" %% "flink-connector-elasticsearch6" % "1.6.0"
Once this jar is part of your build, the compiler will find the missing components. However, I don't know if this ES6 client will work with Elasticsearch 7.5.2.
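As a hedged sketch, the dependency block might look like this (Flink 1.6.0 and Scala 2.11 as above; the connector pulls in its own Elasticsearch client transitively, so the hand-picked elasticsearch / rest-high-level-client dependencies from the question should not be needed just to compile):

libraryDependencies ++= Seq(
  "org.apache.flink" %% "flink-scala"                    % "1.6.0",
  "org.apache.flink" %% "flink-streaming-scala"          % "1.6.0",
  "org.apache.flink" %% "flink-connector-elasticsearch6" % "1.6.0"
)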
Flink Elasticsearch Connector 7
Please look at the working and detailed answer which I have provided here, which is written in Scala.

run-main-0) scala.ScalaReflectionException: class java.sql.Date in JavaMirror with ClasspathFilter(

Hi, I have a file given to me by my teacher. It is about Scala and Spark.
When I run the code it gives me this exception:
(run-main-0) scala.ScalaReflectionException: class java.sql.Date in
JavaMirror with ClasspathFilter
The file itself looks like this:
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object Main {
  type Embedding = (String, List[Double])
  type ParsedReview = (Integer, String, Double)

  org.apache.log4j.Logger getLogger "org" setLevel (org.apache.log4j.Level.WARN)
  org.apache.log4j.Logger getLogger "akka" setLevel (org.apache.log4j.Level.WARN)

  val spark = SparkSession.builder
    .appName("Sentiment")
    .master("local[9]")
    .getOrCreate

  import spark.implicits._

  val reviewSchema = StructType(Array(
    StructField("reviewText", StringType, nullable = false),
    StructField("overall", DoubleType, nullable = false),
    StructField("summary", StringType, nullable = false)))

  // Read the file and merge the text and summary into a single text column
  def loadReviews(path: String): Dataset[ParsedReview] =
    spark
      .read
      .schema(reviewSchema)
      .json(path)
      .rdd
      .zipWithUniqueId
      .map[(Integer, String, Double)] { case (row, id) =>
        (id.toInt, s"${row getString 2} ${row getString 0}", row getDouble 1)
      }
      .toDS
      .withColumnRenamed("_1", "id")
      .withColumnRenamed("_2", "text")
      .withColumnRenamed("_3", "overall")
      .as[ParsedReview]

  // Load the GLoVe embeddings file
  def loadGlove(path: String): Dataset[Embedding] =
    spark
      .read
      .text(path)
      .map { _ getString 0 split " " }
      .map(r => (r.head, r.tail.toList.map(_.toDouble))) // yuck!
      .withColumnRenamed("_1", "word")
      .withColumnRenamed("_2", "vec")
      .as[Embedding]

  def main(args: Array[String]) = {
    val glove = loadGlove("Data/glove.6B.50d.txt") // take glove
    val reviews = loadReviews("Data/Electronics_5.json") // FIXME

    // replace the following with the project code
    glove.show
    reviews.show

    spark.stop
  }
}
I need to keep the line
import org.apache.spark.sql.Dataset
because some code depends on it, but it is exactly because of it that the exception is thrown.
My build.sbt file looks like this:
name := "Sentiment Analysis Project"
version := "1.1"
scalaVersion := "2.11.12"
scalacOptions ++= Seq("-unchecked", "-deprecation")
initialCommands in console :=
"""
import Main._
"""
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-mllib" %
"2.3.0"
libraryDependencies += "org.scalactic" %% "scalactic" % "3.0.5"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.5" %
"test"
The Scala guide recommends you compile with Java8:
We recommend using Java 8 for compiling Scala code. Since the JVM is backward compatible, it is usually safe to use a newer JVM to run your code compiled by the Scala compiler for older JVM versions.
Although it's only a recommendation, I found it to fix the issue you mention.
In order to install Java 8 using Homebrew, it's best to use jenv, which will help you handle multiple Java versions should you need to.
brew install jenv
Then run the following to add a tap (repository) of alternative versions of casks, since Java 8 is not in the default tap anymore:
brew tap homebrew/cask-versions
To install Java 8:
brew cask install homebrew/cask-versions/adoptopenjdk8
Run the following to add the previously installed Java version to jenv's list of versions:
jenv add /Library/Java/JavaVirtualMachines/<installed_java_version>/Contents/Home
Finally run
jenv global 1.8
or
jenv local 1.8
to use Java 1.8 globally or locally (in the current folder).
For more information, follow the instructions on jenv's website.

Spark not working with pureconfig

I'm trying to use pureconfig and ConfigFactory for my Spark application configuration.
Here is my code:
import pureconfig.{loadConfigOrThrow}

object Source {
  def apply(keyName: String, configArguments: Config): Source = {
    keyName.toLowerCase match {
      case "mysql" =>
        val properties = loadConfigOrThrow[DBConnectionProperties](configArguments)
        new MysqlSource(None, properties)
      case "files" =>
        val properties = loadConfigOrThrow[FilesSourceProperties](configArguments)
        new Files(properties)
      case _ => throw new NoSuchElementException(s"Unknown Source ${keyName.toLowerCase}")
    }
  }
}
import Source
val config = ConfigFactory.parseString(result.mkString("\n"))
val source = Source("mysql",config.getConfig("source.mysql"))
When I run it from the IDE (IntelliJ) or directly with java (i.e. java -jar ...) it works fine.
But when I run it with spark-submit, it fails with the following error:
Exception in thread "main" java.lang.NoSuchMethodError: shapeless.Witness$.mkWitness(Ljava/lang/Object;)Lshapeless/Witness;
From a quick search I found an issue similar to this question, which suggests the reason for this is that both Spark and pureconfig depend on shapeless, but with different versions.
I tried to shade it as suggested in the answer:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("shapeless.**" -> "shadeshapless.#1")
    .inLibrary("com.github.pureconfig" %% "pureconfig" % "0.7.0").inProject
)
but it didn't work either.
Can it be for a different reason? Any idea what may work?
Thanks
You also have to shade shapeless inside its own JAR, in addition to pureconfig:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("shapeless.**" -> "shadeshapless.#1")
    .inLibrary("com.chuusai" % "shapeless_2.11" % "2.3.2")
    .inLibrary("com.github.pureconfig" %% "pureconfig" % "0.7.0")
    .inProject
)
Make sure to add the correct shapeless version.
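One hedged usage note: the shade rules only apply to the fat jar produced by sbt-assembly, so the plugin has to be on the build and spark-submit has to be given the assembled jar rather than the plain packaged one. For example (the plugin version here is just an example):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

Then build with sbt assembly and submit the resulting *-assembly.jar.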

No RowReaderFactory can be found for this type error when trying to map Cassandra row to case object using spark-cassandra-connector

I am trying to get a simple example working mapping rows from Cassandra to a scala case class using Apache Spark 1.1.1, Cassandra 2.0.11, & the spark-cassandra-connector (v1.1.0). I have reviewed the documentation at the spark-cassandra-connector github page, planetcassandra.org, datastax, and generally searched around; but have not found anyone else encountering this issue. So here goes...
I am building a tiny Spark application using sbt (0.13.5), Scala 2.10.4, and Spark 1.1.1 against Cassandra 2.0.11. Modelling the example from the spark-cassandra-connector docs, the following two lines present an error in my IDE and fail to compile.
case class SubHuman(id:String, firstname:String, lastname:String, isGoodPerson:Boolean)
val foo = sc.cassandraTable[SubHuman]("nicecase", "human").select("id","firstname","lastname","isGoodPerson").toArray
The simple error presented by eclipse is:
No RowReaderFactory can be found for this type
The compile error is only slightly more verbose:
> compile
[info] Compiling 1 Scala source to /home/bkarels/dev/simple-case/target/scala-2.10/classes...
[error] /home/bkarels/dev/simple-case/src/main/scala/com/bradkarels/simple/SimpleApp.scala:82: No RowReaderFactory can be found for this type
[error] val foo = sc.cassandraTable[SubHuman]("nicecase", "human").select("id","firstname","lastname","isGoodPerson").toArray
[error] ^
[error] one error found
[error] (compile:compile) Compilation failed
[error] Total time: 1 s, completed Dec 10, 2014 9:01:30 AM
>
Scala source:
package com.bradkarels.simple

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd._
// Likely don't need this import - but throwing darts hits the bullseye once in a while...
import com.datastax.spark.connector.rdd.reader.RowReaderFactory

object CaseStudy {
  def main(args: Array[String]) {
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext("spark://127.0.0.1:7077", "simple", conf)

    case class SubHuman(id: String, firstname: String, lastname: String, isGoodPerson: Boolean)
    val foo = sc.cassandraTable[SubHuman]("nicecase", "human").select("id", "firstname", "lastname", "isGoodPerson").toArray
  }
}
With the bothersome lines removed, everything compiles fine, assembly works, and I can perform other Spark operations normally. For example, if I remove the problem lines and drop in:
val rdd:CassandraRDD[CassandraRow] = sc.cassandraTable("nicecase", "human")
I get back the RDD and work with it as expected. That said, I suspect that my sbt project, assembly plugin, etc. are not contributing to the issues. The working source (less the new attempt to map to a case class as the connector intended) can be found on GitHub here.
But, to be more thorough, my build.sbt:
name := "Simple Case"
version := "0.0.1"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.1.1",
"org.apache.spark" %% "spark-sql" % "1.1.1",
"com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0" withSources() withJavadoc()
)
So the question is what have I missed? Hoping this is something silly, but if you have encountered this and can help me get past this puzzling little issue I would very much appreciate it. Please let me know if there are any other details that would be helpful in troubleshooting.
Thank you.
This may be my newness with Scala in general, but I resolved this issue by moving the case class declaration out of the main method. So the simplified source now looks like this:
package com.bradkarels.simple

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd._

object CaseStudy {
  case class SubHuman(id: String, firstname: String, lastname: String, isGoodPerson: Boolean)

  def main(args: Array[String]) {
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext("spark://127.0.0.1:7077", "simple", conf)

    val foo = sc.cassandraTable[SubHuman]("nicecase", "human").select("id", "firstname", "lastname", "isGoodPerson").toArray
  }
}
The complete source (updated & fixed) can be found on GitHub: https://github.com/bradkarels/spark-cassandra-to-scala-case-class