Input Spark DataFrame to DeepLearning4J model - Scala

I have data in a Spark DataFrame (df) with 24 feature columns; the 25th column is my target variable. I want to fit my DL4J model on this dataset, but the model takes input in the form of org.nd4j.linalg.api.ndarray.INDArray, org.nd4j.linalg.dataset.DataSet or org.nd4j.linalg.dataset.api.iterator.DataSetIterator. How can I convert my DataFrame to the required type?
I've also tried using the Pipeline approach to feed the Spark DataFrame into the model directly, but the sbt dependency for dl4j-spark-ml is not resolving. My build.sbt file is:
scalaVersion := "2.11.8"
libraryDependencies += "org.deeplearning4j" %% "dl4j-spark-ml" % "0.8.0_spark_2-SNAPSHOT"
libraryDependencies += "org.deeplearning4j" % "deeplearning4j-core" % "0.8.0"
libraryDependencies += "org.nd4j" % "nd4j" % "0.8.0"
libraryDependencies += "org.nd4j" % "nd4j-native-platform" % "0.8.0"
libraryDependencies += "org.nd4j" % "nd4j-backends" % "0.8.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.1"
Can someone guide me from here? Thanks in advance.
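A minimal sketch of one possible conversion, assuming all 25 columns are numeric (DoubleType) and treating the target as a single regression-style value (a classification label would need one-hot encoding); df and dataSetRDD are placeholder names:

import org.apache.spark.sql.Row
import org.nd4j.linalg.dataset.DataSet
import org.nd4j.linalg.factory.Nd4j

// Map each Row to a DataSet: the first 24 columns become the feature
// vector, the 25th column becomes the label.
val dataSetRDD = df.rdd.map { row: Row =>
  val features = Nd4j.create(Array.tabulate(24)(i => row.getDouble(i)))
  val label    = Nd4j.create(Array(row.getDouble(24)))
  new DataSet(features, label)
}

The resulting RDD[DataSet] can be passed to dl4j-spark's SparkDl4jMultiLayer (via toJavaRDD), or collected and wrapped in a ListDataSetIterator for local training.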

You can use the snapshots, which have re-added the spark.ml integration.
If you want to use snapshots, add the OSS Sonatype repository; see how the examples POM does it:
https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/pom.xml#L16
The version at the time of this writing is:
0.8.1-SNAPSHOT
Please verify the latest version with the examples repo though:
https://github.com/deeplearning4j/dl4j-examples/blob/master/pom.xml#L21
You can't mix versions of DL4J. The version you're trying to use is very out of date (by more than a year); please upgrade to the latest version.
The new spark.ml integration examples can be found here:
https://github.com/deeplearning4j/deeplearning4j/tree/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/test/java/org/deeplearning4j/spark/ml/impl
Make sure to add the proper dependency, which is typically something like
org.deeplearning4j:dl4j-spark-ml_${YOUR SCALA BINARY VERSION}:0.8.1_spark_${YOUR SPARK VERSION (1 or 2)}-SNAPSHOT
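In build.sbt form, and purely as an illustration with Scala 2.11 and Spark 2 filled in (verify the exact artifact name and version against the examples repo linked above), that would look roughly like:

// Snapshot builds are served from the standard OSS Sonatype snapshots repository.
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
libraryDependencies += "org.deeplearning4j" % "dl4j-spark-ml_2.11" % "0.8.1_spark_2-SNAPSHOT"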

Related

Spark + Scala: How to add external dependencies in build.sbt [duplicate]

This question already has an answer here: How to define Kafka (data source) dependencies for Spark Streaming? (1 answer). Closed 2 years ago.
I'm new to Spark (using v2.4.5), and am still trying to figure out the correct way to add external dependencies. When trying to add Kafka streaming to my project, my build.sbt looked like this:
name := "Stream Handler"
version := "1.0"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.5" % "provided",
  "org.apache.spark" % "spark-streaming_2.11" % "2.4.5" % "provided",
  "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.4.5"
)
This builds successfully, but when running with spark-submit, I get a java.lang.NoClassDefFoundError with KafkaUtils.
I was able to get my code working by passing in the dependency through the --packages option like this:
$ spark-submit [other_args] --packages "org.apache.spark:spark-streaming-kafka-0-10_2.11:2.4.5"
Ideally I would like to set up all the dependencies in the build.sbt, but I'm not sure what I'm doing wrong. Any advice would be appreciated!
your "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.4.5" is wrong.
change that to below like mvnrepo.. https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.4.5"

RestHighLevelClient Not Found in Scala

I am trying to insert into Elasticsearch (ES) from a Scala program.
In build.sbt I have added:
libraryDependencies += "org.elasticsearch.client" % "elasticsearch-rest-high-level-client" % "7.5.2"
libraryDependencies += "org.elasticsearch" % "elasticsearch" % "7.5.2"
My code is
val client = new RestHighLevelClient( RestClient.builder(new HttpHost("localhost", 9200, "http")))
While compiling I am getting errors like these:
not found: type RestHighLevelClient
not found: value RestClient
Am I missing some import? My goal is to take a stream from Flink and insert it into Elasticsearch.
Any help is greatly appreciated.
To use Elasticsearch with Flink it's going to be easier if you use Flink's ElasticsearchSink, rather than working with RestHighLevelClient directly. However, a version of that sink for Elasticsearch 7.x is coming in Flink 1.10, which hasn't been released yet (it's coming very soon; RC1 is already out).
Using this connector requires an extra dependency, such as flink-connector-elasticsearch6_2.11 (or flink-connector-elasticsearch7_2.11, coming with Flink 1.10).
See the docs on using Elasticsearch with Flink.
The reason to prefer Flink's sink over using RestHighLevelClient yourself is that the Flink sink makes bulk requests, handles errors and retries, and it's tied in with Flink's checkpointing mechanism, so it's able to guarantee that nothing is lost if something fails.
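For reference, here is a minimal sketch of that sink, adapted from the Flink 1.10 documentation for the flink-connector-elasticsearch7 connector; the index name, host, and the toy input stream are just placeholders:

import java.util

import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.elasticsearch.{ElasticsearchSinkFunction, RequestIndexer}
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink
import org.apache.http.HttpHost
import org.elasticsearch.client.Requests

val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[String] = env.fromElements("a", "b", "c")

val httpHosts = new util.ArrayList[HttpHost]
httpHosts.add(new HttpHost("localhost", 9200, "http"))

// Turn each element into an IndexRequest; the sink batches these into bulk requests.
val esSinkBuilder = new ElasticsearchSink.Builder[String](
  httpHosts,
  new ElasticsearchSinkFunction[String] {
    override def process(element: String, ctx: RuntimeContext, indexer: RequestIndexer): Unit = {
      val json = new util.HashMap[String, String]()
      json.put("data", element)
      indexer.add(Requests.indexRequest.index("my-index").source(json))
    }
  }
)

// Flush after every element while experimenting; drop this in production so bulk batching kicks in.
esSinkBuilder.setBulkFlushMaxActions(1)

stream.addSink(esSinkBuilder.build())
env.execute("elasticsearch-sink-demo")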
As for your actual question, maybe you need to add
libraryDependencies += "org.elasticsearch.client" % "elasticsearch-rest-client" % "7.5.2"
You don't need to use these dependencies separately to insert data into Elasticsearch when using Flink streaming:
libraryDependencies += "org.elasticsearch.client" % "elasticsearch-rest-high-level-client" % "7.5.2" ,
libraryDependencies += "org.elasticsearch" % "elasticsearch" % "7.5.2"
Just use flink-connector-elasticsearch7 or flink-connector-elasticsearch6:
libraryDependencies += "org.apache.flink" %% "flink-connector-elasticsearch7" % "1.10.0"
All the Elasticsearch dependencies come along with the Flink Elasticsearch connector, so you don't need to include them separately in the build.sbt file.
build.sbt file for Flink Elasticsearch
name := "flink-streaming-demo"
scalaVersion := "2.12.11"
val flinkVersion = "1.10.0"
libraryDependencies += "org.apache.flink" %% "flink-scala" % flinkVersion % "provided"
libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided"
libraryDependencies += "org.apache.flink" %% "flink-connector-elasticsearch7" % flinkVersion
For more details, please go through the working Flink-Elasticsearch code I have provided here.
Note: From Elasticsearch 6.x onwards, the REST client is fully supported; up to Elasticsearch 5.x, the Transport client was used.

How to define Scala API for Kafka Streams as dependency in build.sbt?

I am trying to start a new SBT Scala project and have the following in build.sbt file:
name := "ScalaKafkaStreamsDemo"
version := "1.0"
scalaVersion := "2.12.1"
libraryDependencies += "javax.ws.rs" % "javax.ws.rs-api" % "2.1" artifacts(Artifact("javax.ws.rs-api", "jar", "jar"))
libraryDependencies += "org.apache.kafka" %% "kafka" % "2.0.0"
libraryDependencies += "org.apache.kafka" % "kafka-streams" % "2.0.0"
So according to the GitHub repo, in 2.0.0 I should see the Scala classes/functions etc. that I want to use; however, they just don't seem to be available. Within IntelliJ I can open up kafka-streams-2.0.0.jar, but I don't see any Scala classes.
Is there another JAR I need to include?
Just while we are on the subject of extra JARs, does anyone know what JAR I need to include to be able to use the EmbeddedKafkaCluster?
The artifact you need is kafka-streams-scala:
libraryDependencies += "org.apache.kafka" %% "kafka-streams-scala" % "2.0.1"
(please use 2.0.1, or even better 2.1.0, as 2.0.0 has some Scala API bugs)
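To see where the Scala classes live, here is a minimal sketch against kafka-streams-scala 2.1.0 (topic names and broker address are placeholders):

import java.util.Properties

import org.apache.kafka.streams.scala._
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.kstream.KStream
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

val builder = new StreamsBuilder()

// Implicit Consumed/Produced instances are derived from the imported implicit serdes.
val words: KStream[String, String] = builder.stream[String, String]("input-topic")
words.mapValues(_.toUpperCase).to("output-topic")

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "scala-dsl-demo")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

val streams = new KafkaStreams(builder.build(), props)
streams.start()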
To answer your latter question (about EmbeddedKafkaCluster): it's in the test jar, which you can address using a classifier:
libraryDependencies += "org.apache.kafka" %% "kafka-streams" % "2.0.1" % "test" classifier "test"
But note that this is an internal class and subject to change (or removal) without notice. If at all possible, it's highly recommended that you use the TopologyTestDriver in the test-utils instead:
libraryDependencies += "org.apache.kafka" %% "kafka-streams-test-utils" % "2.0.1" % "test"
It looks like you're facing the unresolved javax.ws.rs-api dependency issue that happens with some Java projects that are direct or transitive dependencies of Scala projects built with sbt. I've faced it with Scala projects that use Apache Spark and recently with Kafka Streams (with and without the Scala API).
A workaround that has worked fine for me is to simply exclude the dependency and define it again explicitly.
excludeDependencies += ExclusionRule("javax.ws.rs", "javax.ws.rs-api")
libraryDependencies += "javax.ws.rs" % "javax.ws.rs-api" % "2.1.1"
Make sure that you use the latest and greatest of sbt (i.e. 1.2.7 as of the time of this writing).
With that said, the dependencies in build.sbt should be as follows:
scalaVersion := "2.12.8"
val kafkaVer = "2.1.0"
libraryDependencies += "org.apache.kafka" % "kafka-streams" % kafkaVer
libraryDependencies += "org.apache.kafka" %% "kafka-streams-scala" % kafkaVer
excludeDependencies += ExclusionRule("javax.ws.rs", "javax.ws.rs-api")
libraryDependencies += "javax.ws.rs" % "javax.ws.rs-api" % "2.1.1"
"Within IntelliJ I can open up the kafka-streams-2.0.0.jar, but I don't see any Scala classes. Is there another JAR I need to include?"
The following dependency is all you need:
libraryDependencies += "org.apache.kafka" %% "kafka-streams-scala" % kafkaVer
You can also use the following workaround, which works in my case; more details here:
import sbt._

// This sbt AutoPlugin typically lives under project/ (e.g. project/PackagingTypePlugin.scala);
// it sets the packaging.type system property so the javax.ws.rs-api POM resolves as a jar.
object PackagingTypePlugin extends AutoPlugin {
  override val buildSettings = {
    sys.props += "packaging.type" -> "jar"
    Nil
  }
}

Check Spark packages version [duplicate]

This question already has an answer here: sbt unresolved dependency for spark-cassandra-connector 2.0.2 (1 answer). Closed 5 years ago.
I am trying to set up my first Scala project with IntelliJ IDEA on Ubuntu 16.04. I need the Spark library and I think I have installed it correctly on my computer; however, I am not able to reference it in the project dependencies. In particular, I have added the following code to my build.sbt:
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core" % "2.1.1",
  "org.apache.spark" % "spark-sql" % "2.1.1")
However, sbt complains about not finding the packages (Unresolved Dependencies error: org.apache.spark#spark-core;2.1.1: not found and org.apache.spark#spark-sql;2.1.1: not found).
I think that the versions of the packages are incorrect (I copied the previous code from the web, just to try).
How can I determine the correct packages versions?
If you use %, you have to spell out the full artifact name yourself, including the Scala binary version, as in:
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "2.1.1",
  "org.apache.spark" % "spark-sql_2.10" % "2.1.1")
And if you don't want to hard-code the Scala version and instead let sbt pick the correct artifact from your scalaVersion, you use %% as in:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.1",
  "org.apache.spark" %% "spark-sql" % "2.1.1")
You can check the installed Spark version by running
spark-submit --version
and find the available artifact versions in the Maven repository.
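A quick sketch of a consistent setup (Spark 2.1.x is published for Scala 2.10 and 2.11, so scalaVersion has to match one of those):

scalaVersion := "2.11.8"

// %% appends the Scala binary version, so these resolve to
// spark-core_2.11 and spark-sql_2.11 at version 2.1.1.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.1",
  "org.apache.spark" %% "spark-sql" % "2.1.1")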

Saving data from Spark to Cassandra results in java.lang.ClassCastException

I'm trying to save data from Spark to Cassandra in Scala, using saveToCassandra on an RDD or save with a DataFrame (both result in the same error). The full message is:
java.lang.ClassCastException: com.datastax.driver.core.DefaultResultSetFuture cannot be cast to com.google.common.util.concurrent.ListenableFuture
I've followed along with the code here and still seem to get the error.
I'm using Cassandra 3.6, Spark 1.6.1, and spark-cassandra-connector 1.6. Let me know if there's anything else I can provide to help with the debugging.
I had a similar exception and fixed it by changing the Scala version in build.sbt:
scalaVersion := "2.10.6"
and library dependencies:
libraryDependencies ++= Seq(
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0",
  "com.datastax.cassandra" % "cassandra-driver-core" % "3.0.2",
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-sql" % "1.6.1" % "provided"
)
With this configuration, the example from the 5-minute quick start guide works fine.
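For completeness, here is a minimal sketch of the kind of call from the question against the 1.6 connector API; the keyspace test_ks and table kv are assumed to already exist, and the connection host is only an example:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-save-demo")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// saveToCassandra is added to the RDD by the com.datastax.spark.connector._ import.
sc.parallelize(Seq((1, "one"), (2, "two")))
  .saveToCassandra("test_ks", "kv", SomeColumns("key", "value"))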