Spark Cassandra Join ClassCastException - scala

I am trying to join two Cassandra tables with:
t1.join(t2, Seq("some column"), "left")
I am getting the below error message:
Exception in thread "main" java.lang.ClassCastException: scala.Tuple8 cannot be cast to scala.Tuple7 at org.apache.spark.sql.cassandra.execution.CassandraDirectJoinStrategy.apply(CassandraDirectJoinStrategy.scala:27)
I am using Cassandra v3.11.13 and Spark 3.3.0. These are the code dependencies:
libraryDependencies ++= Seq(
"org.scalatest" %% "scalatest" % "3.2.11" % Test,
"com.github.mrpowers" %% "spark-fast-tests" % "1.0.0" % Test,
"graphframes" % "graphframes" % "0.8.1-spark3.0-s_2.12" % Provided,
"org.rogach" %% "scallop" % "4.1.0" % Provided,
"org.apache.spark" %% "spark-sql" % "3.1.2" % Provided,
"org.apache.spark" %% "spark-graphx" % "3.1.2" % Provided,
"com.datastax.spark" %% "spark-cassandra-connector" % "3.2.0" % Provided)
Your help is greatly appreciated.

The Spark Cassandra connector does not support Apache Spark 3.3.0 yet, and I suspect that is the reason it's not working, though I haven't verified it myself.
Support for Spark 3.3.0 has been requested in SPARKC-686, but the amount of work required is significant, so stay tuned.
The latest supported Spark version is 3.2 using spark-cassandra-connector 3.2. Cheers!
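For reference, here is a minimal build.sbt sketch with the Spark and connector versions aligned on 3.2 (the exact patch versions are illustrative, assuming you can run against a Spark 3.2.x runtime):
libraryDependencies ++= Seq(
  // Spark 3.2.x paired with spark-cassandra-connector 3.2.x; keep major.minor in sync
  "org.apache.spark" %% "spark-sql" % "3.2.1" % Provided,
  "com.datastax.spark" %% "spark-cassandra-connector" % "3.2.0" % Provided
)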

This commit adds initial support for Spark 3.3.x. It is still awaiting RCs/publishing at the time of this comment, so for the time being you would need to build and package the jars yourself to resolve the above error when using Spark 3.3. This could also be a good opportunity to provide feedback on any subsequent RCs as an active user.
I will update this comment when RCs/stable releases are available, which should resolve this issue for others hitting it. Unfortunately, I don't have enough reputation to add this as a comment to the thread above.
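As a rough sketch of what that might look like: clone the connector repository at that commit, publish it locally (e.g. with sbt publishLocal), and then point your build at the locally published artifact. The snapshot version string below is hypothetical; use whatever version your local build actually produces:
// build.sbt: depend on the locally built connector instead of the released 3.2.0
// (the version string is a placeholder for whatever sbt publishLocal emits)
libraryDependencies +=
  "com.datastax.spark" %% "spark-cassandra-connector" % "3.3.0-SNAPSHOT" % Provided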

Related

RestHighLevelClient Not Found in Scala

I am trying to insert into Elasticsearch (ES) in a Scala program.
In build.sbt I have added
libraryDependencies += "org.elasticsearch.client" % "elasticsearch-rest-high-level-client" % "7.5.2" ,
libraryDependencies += "org.elasticsearch" % "elasticsearch" % "7.5.2"
My code is
val client = new RestHighLevelClient( RestClient.builder(new HttpHost("localhost", 9200, "http")))
While compiling I am getting errors as below
not found: type RestHighLevelClient
not found: value RestClient
Am I missing some import? My goal is to get a stream from Flink and insert into Elasticsearch.
Any help is greatly appreciated.
To use Elasticsearch with Flink it's going to be easier if you use Flink's ElasticsearchSink, rather than working with RestHighLevelClient directly. However, a version of that sink for Elasticsearch 7.x is coming in Flink 1.10, which hasn't been released yet (it's coming very soon; RC1 is already out).
Using this connector requires an extra dependency, such as flink-connector-elasticsearch6_2.11 (or flink-connector-elasticsearch7_2.11, coming with Flink 1.10).
See the docs on using Elasticsearch with Flink.
The reason to prefer Flink's sink over using RestHighLevelClient yourself is that the Flink sink makes bulk requests, handles errors and retries, and it's tied in with Flink's checkpointing mechanism, so it's able to guarantee that nothing is lost if something fails.
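For reference, a minimal Scala sketch of the Elasticsearch 7 sink as it looks in Flink 1.10 (the source, the String element type, and the index name are placeholders):
import java.util
import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.elasticsearch.{ElasticsearchSinkFunction, RequestIndexer}
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink
import org.apache.http.HttpHost
import org.elasticsearch.client.Requests

val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[String] = env.fromElements("a", "b", "c") // placeholder source

val httpHosts = new util.ArrayList[HttpHost]()
httpHosts.add(new HttpHost("localhost", 9200, "http"))

val esSinkBuilder = new ElasticsearchSink.Builder[String](
  httpHosts,
  new ElasticsearchSinkFunction[String] {
    def process(element: String, ctx: RuntimeContext, indexer: RequestIndexer): Unit = {
      val json = new util.HashMap[String, String]()
      json.put("data", element)
      // "my-index" is just a placeholder index name
      indexer.add(Requests.indexRequest().index("my-index").source(json))
    }
  }
)
esSinkBuilder.setBulkFlushMaxActions(1) // flush every record while testing; tune for production

stream.addSink(esSinkBuilder.build())
env.execute("es-sink-example")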
As for your actual question, maybe you need to add
libraryDependencies += "org.elasticsearch.client" % "elasticsearch-rest-client" % "7.5.2"
You don't need to add these dependencies separately to insert data into Elasticsearch from Flink Streaming:
libraryDependencies += "org.elasticsearch.client" % "elasticsearch-rest-high-level-client" % "7.5.2"
libraryDependencies += "org.elasticsearch" % "elasticsearch" % "7.5.2"
Just use flink-connector-elasticsearch7 (or flink-connector-elasticsearch6):
libraryDependencies += "org.apache.flink" %% "flink-connector-elasticsearch7" % "1.10.0"
All the Elasticsearch dependencies come along with the Flink Elasticsearch connector, so you don't need to include them separately in the build.sbt file.
build.sbt file for Flink Elasticsearch
name := "flink-streaming-demo"
scalaVersion := "2.12.11"
val flinkVersion = "1.10.0"
libraryDependencies += "org.apache.flink" %% "flink-scala" % flinkVersion % "provided"
libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided"
libraryDependencies += "org.apache.flink" %% "flink-connector-elasticsearch7" % flinkVersion
For more details, please go through the working Flink-Elasticsearch code I have provided here.
Note: From Elasticsearch 6.x onwards, the REST client is fully supported; up to Elasticsearch 5.x, the Transport client was used.

Flink write to S3 on EMR

I am trying to write some outputs to S3 using EMR with Flink. I am using Scala 2.11.7, Flink 1.3.2, and EMR 5.11. However, I got the following error:
java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.addResource(Lorg/apache/hadoop/conf/Configuration;)V
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.initialize(EmrFileSystem.java:93)
at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.initialize(HadoopFileSystem.java:345)
at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:350)
at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:389)
at org.apache.flink.core.fs.Path.getFileSystem(Path.java:293)
at org.apache.flink.api.common.io.FileOutputFormat.open(FileOutputFormat.java:222)
at org.apache.flink.api.java.io.TextOutputFormat.open(TextOutputFormat.java:78)
at org.apache.flink.streaming.api.functions.sink.OutputFormatSinkFunction.open(OutputFormatSinkFunction.java:61)
at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:111)
at org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:376)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
at java.lang.Thread.run(Thread.java:748)
My build.sbt looks like this:
libraryDependencies ++= Seq(
"org.apache.flink" % "flink-core" % "1.3.2",
"org.apache.flink" % "flink-scala_2.11" % "1.3.2",
"org.apache.flink" % "flink-streaming-scala_2.11" % "1.3.2",
"org.apache.flink" % "flink-shaded-hadoop2" % "1.3.2",
"org.apache.flink" % "flink-clients_2.11" % "1.3.2",
"org.apache.flink" %% "flink-avro" % "1.3.2",
"org.apache.flink" %% "flink-connector-filesystem" % "1.3.2"
)
I also found this post, but it didn't resolve the issue: External checkpoints to S3 on EMR
I simply write the output to S3: input.writeAsText("s3://test/flink"). Any suggestions would be appreciated.
I'm not sure of the right combination of flink-shaded-hadoop and EMR versions, but after several rounds of trial and error I was able to write to S3 by using a newer version of flink-shaded-hadoop2: "org.apache.flink" % "flink-shaded-hadoop2" % "1.4.0"
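In build.sbt terms, that amounts to swapping just the shaded Hadoop line while keeping the rest of the Flink 1.3.2 dependencies from the question, e.g.:
libraryDependencies ++= Seq(
  // ... the other Flink 1.3.2 dependencies as listed in the question ...
  "org.apache.flink" % "flink-shaded-hadoop2" % "1.4.0" // upgraded from 1.3.2
)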
Your issue is probably due to the fact that some libraries are loaded by EMR/YARN/Flink before your own classes, which leads to the NoSuchMethodError: the classes actually loaded are not the ones you provided, but the ones provided by EMR. Check the classpath in the JobManager/TaskManager logs.
A solution is to put your own jars in the Flink lib directory so that they are loaded before the EMR ones.

finding spark scala packages

I'm sure this is simpler than it looks, but I'm willing to look dumb.
I'm working my way through some Scala/Spark examples, which occasionally call for adding library dependencies, e.g.,
libraryDependencies ++= Seq(
scalaTest % Test,
"org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
"org.apache.spark" %% "spark-mllib" % "2.2.0"
)
The question is, how do you find the appropriate names and versions for the libraries? It seems the texts all give import statements; there has to be some kind of registry or something. But where?
The correct version of a library can always be found by searching mvnrepository. If you are trying to use a version from a proprietary distribution, you need to add that distribution's repository, for example one of these (see the resolver sketch after the list):
Cloudera repository
MapR repository
hdp_maven_artifacts
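For example, adding a vendor repository in sbt looks like this (the Cloudera URL is the commonly used public one; adjust it for your distribution, and then use the version strings published in that repository):
// Add the vendor's repository so sbt can resolve distribution-specific artifacts
resolvers += "Cloudera Releases" at "https://repository.cloudera.com/artifactory/cloudera-repos/"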

Using spark ML models outside of spark [hdfs DistributedFileSystem could not be instantiated]

I've been trying to follow along things blog post:
https://www.phdata.io/exploring-spark-mllib-part-4-exporting-the-model-for-use-outside-of-spark/
Using spark 2.1 with built in Hadoop 2.7 run locally I can save a model:
trainedModel.save("mymodel.model")
However, if I try to load the model from a regular Scala (sbt) shell, HDFS fails to load.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.{PipelineModel, Predictor}
val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("myApp"))
val model = PipelineModel.load("mymodel.model")
I get this error:
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.DistributedFileSystem could not be instantiated
Is it in fact possible to use a spark model without calling spark-submit, or spark-shell? The article I linked to was the only one I'd seen mentioning such functionality.
My build.sbt is using the following dependencies:
"org.apache.spark" %% "spark-core" % "2.1.0",
"org.apache.spark" % "spark-sql_2.11" % "2.1.0",
"org.apache.spark" % "spark-hive_2.11" % "2.1.0",
"org.apache.spark" % "spark-mllib_2.11" % "2.1.0",
"org.apache.hadoop" % "hadoop-hdfs" % "2.7.0"
In both cases I am using Scala 2.11.8.
Edit: Okay, it looks like including this was the source of the problem:
"org.apache.hadoop" % "hadoop-hdfs" % "2.7.0"
I removed that line and the problem went away.
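For anyone hitting the same thing, the working dependency list is simply the one above with the hadoop-hdfs line removed (spark-core already brings in a compatible Hadoop client):
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" % "spark-sql_2.11" % "2.1.0",
  "org.apache.spark" % "spark-hive_2.11" % "2.1.0",
  "org.apache.spark" % "spark-mllib_2.11" % "2.1.0"
)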
Try:
trainedModel.write.overwrite().save("mymodel.model")
Also, if your model is saved locally, you can remove HDFS from your configuration. This should prevent Spark from attempting to instantiate HDFS.

Saving data from Spark to Cassandra results in java.lang.ClassCastException

I'm trying to save data from Spark to Cassandra in Scala using saveToCassandra for an RDD or save with a dataframe (both result in the same error). The full message is:
java.lang.ClassCastException:
com.datastax.driver.core.DefaultResultSetFuture cannot be cast to
com.google.common.util.concurrent.ListenableFuture
I've followed along with the code here and still seem to get the error.
I'm using Cassandra 3.6, Spark 1.6.1, and spark-cassandra-connector 1.6. Let me know if there's anything else I can provide to help with the debugging.
I had a similar exception and fixed it by changing the Scala version in build.sbt:
scalaVersion := "2.10.6"
and library dependencies:
libraryDependencies ++= Seq(
"com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0",
"com.datastax.cassandra" % "cassandra-driver-core" % "3.0.2",
"org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
"org.apache.spark" %% "spark-sql" % "1.6.1" % "provided"
)
With this configuration, the example from the 5-minute quick start guide works fine.
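For completeness, a minimal sketch along the lines of the 5-minute quick start, assuming a local Cassandra node and a test.kv table created as CREATE TABLE test.kv (key text PRIMARY KEY, value int):
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-save-example")
  .setMaster("local[2]")
  .set("spark.cassandra.connection.host", "127.0.0.1") // assumed local Cassandra node

val sc = new SparkContext(conf)

// Write two rows into test.kv via the connector's RDD API
val rows = sc.parallelize(Seq(("key1", 1), ("key2", 2)))
rows.saveToCassandra("test", "kv", SomeColumns("key", "value"))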