RestHighLevelClient Not Found in Scala

I am trying to insert into Elasticsearch (ES) from a Scala program.
In build.sbt I have added:
libraryDependencies += "org.elasticsearch.client" % "elasticsearch-rest-high-level-client" % "7.5.2"
libraryDependencies += "org.elasticsearch" % "elasticsearch" % "7.5.2"
My code is:
val client = new RestHighLevelClient( RestClient.builder(new HttpHost("localhost", 9200, "http")))
While compiling I get the errors below:
not found: type RestHighLevelClient
not found: value RestClient
Am I missing an import? My goal is to take a stream from Flink and insert it into Elasticsearch.
Any help is greatly appreciated.

To use Elasticsearch with Flink it's going to be easier if you use Flink's ElasticsearchSink, rather than working with RestHighLevelClient directly. However, a version of that sink for Elasticsearch 7.x is coming in Flink 1.10, which hasn't been released yet (it's coming very soon; RC1 is already out).
Using this connector requires an extra dependency, such as flink-connector-elasticsearch6_2.11 (or flink-connector-elasticsearch7_2.11, coming with Flink 1.10).
See the docs on using Elasticsearch with Flink.
The reason to prefer Flink's sink over using RestHighLevelClient yourself is that the Flink sink makes bulk requests, handles errors and retries, and it's tied in with Flink's checkpointing mechanism, so it's able to guarantee that nothing is lost if something fails.
As for your actual question, maybe you need to add
libraryDependencies += "org.elasticsearch.client" % "elasticsearch-rest-client" % "7.5.2"

You don't need to add these dependencies separately to insert data into Elasticsearch with Flink Streaming:
libraryDependencies += "org.elasticsearch.client" % "elasticsearch-rest-high-level-client" % "7.5.2"
libraryDependencies += "org.elasticsearch" % "elasticsearch" % "7.5.2"
Just use flink-connector-elasticsearch7 or flink-connector-elasticsearch6:
libraryDependencies += "org.apache.flink" %% "flink-connector-elasticsearch7" % "1.10.0"
All Elasticsearch dependencies come along with the Flink Elasticsearch connector, so you don't need to include them separately in the build.sbt file.
build.sbt file for Flink Elasticsearch
name := "flink-streaming-demo"
scalaVersion := "2.12.11"
val flinkVersion = "1.10.0"
libraryDependencies += "org.apache.flink" %% "flink-scala" % flinkVersion % "provided"
libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided"
libraryDependencies += "org.apache.flink" %% "flink-connector-elasticsearch7" % flinkVersion
For more details, please go through the working Flink-Elasticsearch code I have provided here.
Note: From Elasticsearch 6.x onwards the REST client is fully supported; up to Elasticsearch 5.x the Transport client was used.

Related

Spark Cassandra Join ClassCastException

I am trying to join two Cassandra tables with:
t1.join(t2, Seq("some column"), "left")
I am getting the below error message:
Exception in thread "main" java.lang.ClassCastException: scala.Tuple8 cannot be cast to scala.Tuple7 at org.apache.spark.sql.cassandra.execution.CassandraDirectJoinStrategy.apply(CassandraDirectJoinStrategy.scala:27)
I am using Cassandra v3.11.13 and Spark 3.3.0. The code dependencies:
libraryDependencies ++= Seq(
  "org.scalatest" %% "scalatest" % "3.2.11" % Test,
  "com.github.mrpowers" %% "spark-fast-tests" % "1.0.0" % Test,
  "graphframes" % "graphframes" % "0.8.1-spark3.0-s_2.12" % Provided,
  "org.rogach" %% "scallop" % "4.1.0" % Provided,
  "org.apache.spark" %% "spark-sql" % "3.1.2" % Provided,
  "org.apache.spark" %% "spark-graphx" % "3.1.2" % Provided,
  "com.datastax.spark" %% "spark-cassandra-connector" % "3.2.0" % Provided)
Your help is greatly appreciated
The Spark Cassandra connector does not yet support Apache Spark 3.3.0, and I suspect that is why it's not working, though I haven't verified this myself.
Support for Spark 3.3.0 has been requested in SPARKC-686 but the amount of work required is significant so stay tuned.
The latest supported Spark version is 3.2 using spark-cassandra-connector 3.2. Cheers!
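Until then, keeping the Spark and connector versions aligned is the practical fix; a hedged build.sbt sketch, where the exact patch versions are assumptions you should check against the connector's compatibility matrix:
// Align Spark with the latest connector line that is currently supported
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.2.1" % Provided,
  "com.datastax.spark" %% "spark-cassandra-connector" % "3.2.0" % Provided
)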
This commit adds initial support for Spark 3.3.x, although it is awaiting RCs/publishing at the time of this comment, so for now you would need to build and package the jars yourself to resolve the above error when using Spark 3.3. This could be a good opportunity to provide feedback on any subsequent RCs as an active user.
I will update this comment when RCs/stable releases are available, which should resolve the above issue for others hitting it. Unfortunately, I don't have enough reputation to add this as a comment to the thread above.

How to define Scala API for Kafka Streams as dependency in build.sbt?

I am trying to start a new SBT Scala project and have the following in the build.sbt file:
name := "ScalaKafkaStreamsDemo"
version := "1.0"
scalaVersion := "2.12.1"
libraryDependencies += "javax.ws.rs" % "javax.ws.rs-api" % "2.1" artifacts(Artifact("javax.ws.rs-api", "jar", "jar"))
libraryDependencies += "org.apache.kafka" %% "kafka" % "2.0.0"
libraryDependencies += "org.apache.kafka" % "kafka-streams" % "2.0.0"
So according to the GitHub repo, in 2.0.0 I should see the Scala classes/functions etc. that I want to use; however, they just don't seem to be available. Within IntelliJ I can open up kafka-streams-2.0.0.jar, but I don't see any Scala classes.
Is there another JAR I need to include?
Just while we are on the subject of extra JARs, does anyone know what JAR I need to include to be able to use the EmbeddedKafkaCluster?
The artifact you need is kafka-streams-scala:
libraryDependencies += "org.apache.kafka" %% "kafka-streams-scala" % "2.0.1"
(please use 2.0.1, or even better 2.1.0, as 2.0.0 has some Scala API bugs)
To answer your latter question, it's in the test-jar, which you can address using a classifier:
libraryDependencies += "org.apache.kafka" %% "kafka-streams" % "2.0.1" % "test" classifier "test"
But note that this is an internal class and subject to change (or removal) without notice. If at all possible, it's highly recommended that you use the TopologyTestDriver in the test-utils instead:
libraryDependencies += "org.apache.kafka" %% "kafka-streams-test-utils" % "2.0.1" % "test"
It looks like you are facing the issue of an unresolved javax.ws.rs-api dependency that happens with some Java projects that are direct or transitive dependencies of Scala projects built with sbt. I've faced it with Scala projects that use Apache Spark and recently with Kafka Streams (with and without the Scala API).
A workaround that has worked fine for me is to simply exclude the dependency and define it again explicitly.
excludeDependencies += ExclusionRule("javax.ws.rs", "javax.ws.rs-api")
libraryDependencies += "javax.ws.rs" % "javax.ws.rs-api" % "2.1.1"
Make sure that you use the latest and greatest sbt (i.e. 1.2.7 as of this writing).
With that said, the dependencies in build.sbt should be as follows:
scalaVersion := "2.12.8"
val kafkaVer = "2.1.0"
libraryDependencies += "org.apache.kafka" % "kafka-streams" % kafkaVer
libraryDependencies += "org.apache.kafka" %% "kafka-streams-scala" % kafkaVer
excludeDependencies += ExclusionRule("javax.ws.rs", "javax.ws.rs-api")
libraryDependencies += "javax.ws.rs" % "javax.ws.rs-api" % "2.1.1"
Within IntelliJ I can open up the kafka-streams-2.0.0.jar, but I don't see any Scala classes. Is there another JAR I need to include?
The following dependency is all you need:
libraryDependencies += "org.apache.kafka" %% "kafka-streams-scala" % kafkaVer
You can also use the following workaround, which works in my case (more details here):
// Typically placed in project/PackagingTypePlugin.scala; it forces the
// packaging.type system property to "jar" so javax.ws.rs-api resolves.
import sbt._

object PackagingTypePlugin extends AutoPlugin {
  override val buildSettings = {
    sys.props += "packaging.type" -> "jar"
    Nil
  }
}

Apache Flink 1.4 with Apache Kafka 1.0.0

I am trying to get an Apache Flink Scala project to integrate with Apache Kafka 1.0.0. When I attempt to add the flink-connector-kafka package in my build.sbt file, I get an error saying it cannot be resolved.
When I then look at the options available in the Maven repository, there is no dependency available for Apache Kafka 2.11-1.0.0 for any connector version above 0.10.2.
val flinkVersion = "1.4.1"
val flinkDependencies = Seq(
  "org.apache.flink" %% "flink-scala" % flinkVersion % "provided",
  "org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided",
  "org.apache.flink" %% "flink-connector-kafka" % flinkVersion)
Does anyone know how to integrate these versions correctly so that I can connect Apache Flink 1.4 to Apache Kafka 2.11-1.0.0? Nothing I try seems to work (and I do not wish to downgrade the Kafka version I am connecting to).
This should work. Try:
val flinkVersion = "1.4.2"
libraryDependencies ++= Seq(
"org.apache.flink" %% "flink-streaming-scala" % flinkVersion,
"org.apache.flink" %% "flink-connector-kafka-0.11" % flinkVersion
)
Try
org.apache.flink" % "flink-connector-kafka-0.11_2.11" % "1.4.0
flink-connector-kafka-0.11_2.11 is Flink's latest Kafka connector available.
Sources: https://search.maven.org/#search%7Cga%7C1%7Cflink%20kafka%20connector , https://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.flink%22%20AND%20a%3A%22flink-connector-kafka-0.11_2.11%22
A Kafka 1.0 broker is backwards compatible with 0.11 and 0.10 APIs.
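For completeness, a minimal consumer sketch against the 0.11 connector; the topic name, broker address, and group id are placeholders:
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011

object KafkaToFlink extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  val props = new Properties()
  props.setProperty("bootstrap.servers", "localhost:9092") // placeholder broker
  props.setProperty("group.id", "flink-demo")              // placeholder group id

  // The 0.11 connector works against a Kafka 1.0 broker thanks to protocol compatibility
  val source = new FlinkKafkaConsumer011[String]("my-topic", new SimpleStringSchema(), props)
  env.addSource(source).print()
  env.execute("kafka-with-flink-1.4")
}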

Input Spark Dataframe to DeepLearning4J model

I have data in my Spark dataframe (df) with 24 features; the 25th column is my target variable. I want to fit my dl4j model on this dataset, but the model takes input in the form of org.nd4j.linalg.api.ndarray.INDArray, org.nd4j.linalg.dataset.DataSet or org.nd4j.linalg.dataset.api.iterator.DataSetIterator. How can I convert my dataframe to the required type?
I've also tried using the Pipeline method to feed the Spark dataframe into the model directly, but the sbt dependency for dl4j-spark-ml does not resolve. My build.sbt file is:
scalaVersion := "2.11.8"
libraryDependencies += "org.deeplearning4j" %% "dl4j-spark-ml" % "0.8.0_spark_2-SNAPSHOT"
libraryDependencies += "org.deeplearning4j" % "deeplearning4j-core" % "0.8.0"
libraryDependencies += "org.nd4j" % "nd4j" % "0.8.0"
libraryDependencies += "org.nd4j" % "nd4j-native-platform" % "0.8.0"
libraryDependencies += "org.nd4j" % "nd4j-backends" % "0.8.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.1"
Can someone guide me from here ? Thanks in advance.
You can use the snapshots, which have re-added the spark.ml integration.
If you want to use snapshots, add the OSS Sonatype repository:
https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/pom.xml#L16
The version at the time of this writing is:
0.8.1-SNAPSHOT
Please verify the latest version against the examples repo, though:
https://github.com/deeplearning4j/dl4j-examples/blob/master/pom.xml#L21
You can't mix versions of dl4j. The version you're trying to use is very out of date (by more than a year). Please upgrade to the latest version beyond that.
The new spark.ml integration examples can be found here:
https://github.com/deeplearning4j/deeplearning4j/tree/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/test/java/org/deeplearning4j/spark/ml/impl
Make sure to add the proper dependency, which is typically something like
org.deeplearning4j:dl4j-spark-ml_${YOUR SCALA BINARY VERSION}:0.8.1_spark_${YOUR SPARK VERSION (1 or 2)}-SNAPSHOT
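In sbt that Maven coordinate translates roughly to the sketch below; the exact snapshot version is an assumption, so verify it against the examples repo linked above:
// Snapshots are published to the Sonatype OSS snapshots repository
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"

// %% appends the Scala binary version (e.g. _2.11); pick the _spark_1 or _spark_2 variant matching your Spark
libraryDependencies += "org.deeplearning4j" %% "dl4j-spark-ml" % "0.8.1_spark_2-SNAPSHOT"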

Saving data from Spark to Cassandra results in java.lang.ClassCastException

I'm trying to save data from Spark to Cassandra in Scala using saveToCassandra for an RDD or save with a dataframe (both result in the same error). The full message is:
java.lang.ClassCastException:
com.datastax.driver.core.DefaultResultSetFuture cannot be cast to
com.google.common.util.concurrent.ListenableFuture
I've followed along with the code here and still seem to get the error.
I'm using Cassandra 3.6, Spark 1.6.1, and spark-cassandra-connector 1.6. Let me know if there's anything else I can provide to help with the debugging.
I had a similar exception and fixed it by changing the Scala version in build.sbt:
scalaVersion := "2.10.6"
and the library dependencies:
libraryDependencies ++= Seq(
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0",
  "com.datastax.cassandra" % "cassandra-driver-core" % "3.0.2",
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-sql" % "1.6.1" % "provided"
)
With this configuration, the example from the 5-minute quick start guide works fine.
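For reference, a minimal saveToCassandra sketch under that configuration; the keyspace, table, and column names are placeholders:
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object SaveDemo extends App {
  val conf = new SparkConf()
    .setAppName("cassandra-save-demo")
    .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
  val sc = new SparkContext(conf)

  // Each tuple maps onto the (key, value) columns of the target table
  val rdd = sc.parallelize(Seq(("k1", 1), ("k2", 2)))
  rdd.saveToCassandra("my_keyspace", "kv", SomeColumns("key", "value"))
}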