Why does Spark Cassandra Connector fail with NoHostAvailableException?

I am having problems getting the Spark Cassandra Connector working in Scala.
I'm using these versions:
Scala 2.10.4
spark-core 1.0.2
cassandra-thrift 2.1.0 (my installed cassandra is v2.1.0)
cassandra-clientutil 2.1.0
cassandra-driver-core 2.0.4 (recommended for connector?)
spark-cassandra-connector 1.0.0
I can connect and talk to Cassandra (w/o spark) and I can talk to Spark (w/o Cassandra) but the connector gives me:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.0.0.194:9042 (com.datastax.driver.core.TransportException: [/10.0.0.194:9042] Cannot connect))
What am I missing? Cassandra is a default install (port 9042 for cql according to cassandra.yaml). I'm trying to connect locally ("local").
My code:
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext("local","test",conf)
val rdd = sc.cassandraTable("myks","users")
val rr = rdd.first
println(s"Result: $rr")

"local" in this context specifies the Spark master (telling Spark to run in local mode), not the Cassandra connection host.
To set the Cassandra connection host you have to set a different property on the SparkConf:
import org.apache.spark._

val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "IP Cassandra Is Listening On")
  .set("spark.cassandra.auth.username", "cassandra") // Optional
  .set("spark.cassandra.auth.password", "cassandra") // Optional
val sc = new SparkContext("spark://Spark Master IP:7077", "test", conf)
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md
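Putting the two pieces together for the local setup in the question, a minimal working sketch might look like this (assuming Cassandra is listening on 127.0.0.1:9042 and the myks.users table from the question exists):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // adds cassandraTable to SparkContext

val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local") // Spark master: run in local mode
  .set("spark.cassandra.connection.host", "127.0.0.1") // Cassandra contact point
val sc = new SparkContext(conf)
val rdd = sc.cassandraTable("myks", "users") // keyspace and table from the question
println(s"Result: ${rdd.first}")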

Related

spark Cassandra tuning

How do I set the following Cassandra write parameters in Spark Scala code, for DataStax Spark Cassandra Connector 1.6.3 and Spark 1.6.2?
spark.cassandra.output.batch.size.rows
spark.cassandra.output.concurrent.writes
spark.cassandra.output.batch.size.bytes
spark.cassandra.output.batch.grouping.key
In DataStax Spark Cassandra Connector 1.6.X, you can pass these parameters as part of your SparkConf.
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "192.168.123.10")
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")
  .set("spark.cassandra.output.batch.size.rows", "100")
  .set("spark.cassandra.output.concurrent.writes", "100")
  .set("spark.cassandra.output.batch.size.bytes", "100")
  .set("spark.cassandra.output.batch.grouping.key", "partition")
val sc = new SparkContext("spark://192.168.123.10:7077", "test", conf)
You can refer to this readme for more information.
The most flexible way is to add those variables in a file, such as spark.conf:
spark.cassandra.output.concurrent.writes 10
etc...
and then create your spark context in your app with something like:
val conf = new SparkConf()
val sc = new SparkContext(conf)
and finally, when you submit your app, you can specify your properties file with:
spark-submit --properties-file spark.conf ...
Spark will automatically read the configuration from spark.conf when creating the Spark context.
That way, you can modify the properties in spark.conf without having to recompile your code each time.
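For instance, a spark.conf covering the four parameters from the question might look like this (the values are the same illustrative ones used above, not tuning recommendations):

spark.cassandra.connection.host            192.168.123.10
spark.cassandra.output.batch.size.rows     100
spark.cassandra.output.concurrent.writes   100
spark.cassandra.output.batch.size.bytes    100
spark.cassandra.output.batch.grouping.key  partition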

Connection to Cassandra from spark Error

I am using Spark 2.0.2 and Cassandra 3.11.2. I am using this code, but it gives me a connection error.
./spark-shell --jars ~/spark/spark-cassandra-connector/spark-cassandra-connector/target/full/scala-2.10/spark-cassandra-connector-assembly-2.0.5-121-g1a7fa1f8.jar
import com.datastax.spark.connector._
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val test = sc.cassandraTable("sensorkeyspace", "sensortable")
test.count
When I enter the test.count command, it gives me this error:
java.io.IOException: Failed to open native connection to Cassandra at {127.0.0.1}:9042
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:168)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$8.apply(CassandraConnector.scala:154)
Can you check the cassandra.yaml file? It may be that more concurrent connections are open at any one time than the node allows.
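As a starting point, these are the cassandra.yaml settings that usually matter for a refused native connection (the values shown are the stock defaults, not something taken from the question):

# cassandra.yaml
rpc_address: localhost           # address the native transport binds to
native_transport_port: 9042      # CQL port the connector dials
# native_transport_max_concurrent_connections: -1   # -1 means unlimited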

spark cassandra intelliJ scala

Environment:
My Laptop:
IntelliJ IDEA 2017.1.4
JRE: 1.8.0_112-release-736-b21 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
Mac OS X 10.12.2
IP Address: 192.168.3.203
AWS Environment:
Linux ip-172-31-16-45 x86_64 x86_64 GNU/Linux
DSE Version : 5.1.0
IP Address: 172.31.16.45
PUBLIC IP: 35.35.42.42
Executed the following commands:
[ec2-user@ip-172-31-16-45 ~]$ dse cassandra -k
and then
[ec2-user@ip-172-31-16-45 ~]$ dse spark
The log file is at /home/ec2-user/.spark-shell.log
warning: there was one deprecation warning; re-run with -deprecation for details
New Spark Session
WARN 2017-09-13 13:18:26,907 org.apache.spark.SparkContext: Use an existing SparkContext, some configuration may not take effect.
Extracting Spark Context
Extracting SqlContext
Spark context Web UI available at http://172.31.16.45:4040
Spark context available as 'sc' (master = dse://?, app id = app-20170913131826-0001).
Spark session available as 'spark'.
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Spark version: 2.0.2.6 (the one I see in the browser when I open http://35.35.42.42:4040/executors/)
Both Spark and the Cassandra database are running on 35.35.42.42 (public IP), which is 172.31.16.45 (local IP).
When I code in the Scala environment on AWS, everything runs fine. What I need is to run the same code from IntelliJ. Is that possible? If not, what is the process?
Now I opened IntelliJ on my laptop and tried running below code:
// In .setMaster I tried all combinations: local IP, public IP, spark://, dse://, etc.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
object helloworld {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("dse://35.35.42.42:4040")
    // val conf = new SparkConf().setAppName("Simple Application").setMaster("spark://172.31.16.45:4040")
    // val conf = new SparkConf().setAppName("Simple Application").setMaster("spark://35.35.42.42:7077")
    // val conf = new SparkConf().setAppName("Simple Application").setMaster("spark://172.31.16.45:7077")
    val sc = new SparkContext(conf)
    // val rdd = sc.cassandraTable("tdata", "map")
  }
}
Error:
org.apache.spark.SparkException: Could not parse Master URL: 'dse://35.35.42.42:4040'

Scala Code to connect to Spark and Cassandra

I have Scala (IntelliJ) running on my laptop. I also have Spark and Cassandra running on machines A, B and C (a 3-node cluster using DataStax, running in Analytics mode).
I tried running Scala programs on the cluster and they run fine.
I need to write code in IntelliJ on my laptop and run it against the cluster. How do I connect and run? I know I am making a mistake in the code; I used placeholder words, and I need help writing the specific code. For example: localhost is incorrect.
import org.apache.spark.{SparkContext, SparkConf}

object HelloWorld {
  def main(args: Array[String]) {
    val conf = new SparkConf(true).set("spark:master", "localhost")
    val sc = new SparkContext(conf)
    val data = sc.cassandraTable("my_keyspace", "my_table")
  }
}
val conf = new SparkConf().setAppName("APP_NAME")
  .setMaster("local")
  .set("spark.cassandra.connection.host", "localhost")
  .set("spark.cassandra.auth.username", "")
  .set("spark.cassandra.auth.password", "")
Use the code above to connect to local Spark and Cassandra. If your Cassandra cluster has authentication enabled, fill in the username and password.
To connect to a remote Spark and Cassandra cluster, replace localhost in spark.cassandra.connection.host with your Cassandra host, and in setMaster use spark://SPARK_HOST:7077.
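Putting that together for the remote case, a minimal sketch (SPARK_HOST and CASSANDRA_HOST are placeholders for your own addresses):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // adds cassandraTable to SparkContext

object HelloWorld {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("APP_NAME")
      .setMaster("spark://SPARK_HOST:7077") // remote Spark master
      .set("spark.cassandra.connection.host", "CASSANDRA_HOST") // Cassandra contact point
      .set("spark.cassandra.auth.username", "cassandra") // only if auth is enabled
      .set("spark.cassandra.auth.password", "cassandra")
    val sc = new SparkContext(conf)
    val data = sc.cassandraTable("my_keyspace", "my_table") // keyspace/table from the question
    println(data.count())
    sc.stop()
  }
}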

Spark-Scala with Cassandra

I am a beginner with Spark, Scala and Cassandra, and I am working on ETL programming.
My project's ETL POCs require Spark, Scala and Cassandra. I configured Cassandra on my Ubuntu system under /usr/local/Cassandra/* and after that installed Spark and Scala. I am now using a Scala editor to start my work; I have created a simple job that loads a file into a landing location, but after that I am stuck trying to connect to Cassandra from Scala, and I cannot find any help on how to connect and process the data in the destination database.
Can anyone tell me whether this is the correct way, or where I am going wrong? Please help me understand how to achieve this process with the above combination.
Thanks in advance!
Add spark-cassandra-connector to your pom.xml or build.sbt by following the connector's instructions, then work this way.
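For example, with sbt that is a single line in build.sbt (the version below is an assumption; pick the release matching your Spark version):

// build.sbt - the connector version here is an assumption, match it to your Spark version
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.5"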
Import this in your file
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.cassandra._
Spark Scala file:
object SparkCassandraConnector {
  def main(args: Array[String]) {
    val conf = new SparkConf(true)
      .setAppName("UpdateCassandra")
      .setMaster("spark://spark:7077") // Spark master URL
      .set("spark.cassandra.input.split.size_in_mb", "64") // split size is in MB, not bytes
      .set("spark.cassandra.connection.host", "192.168.3.167") // Cassandra host
      .set("spark.cassandra.auth.username", "cassandra")
      .set("spark.cassandra.auth.password", "cassandra")
    // connect to Cassandra for Spark and SQL queries
    val spark = SparkSession.builder()
      .config(conf)
      .getOrCreate()
    // load data from the published table
    val df = spark
      .read
      .cassandraFormat("table_name", "keyspace_name")
      .load()
  }
}
This will work for Spark 2.2 and Cassandra 2.x; you can do all of this easily with spark-cassandra-connector.
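As a follow-up for the "process the data in the destination database" part, writing a DataFrame back out uses the same cassandraFormat helper on the writer side (table and keyspace names here are placeholders, and the target table must already exist):

// append the DataFrame to an existing Cassandra table
df.write
  .cassandraFormat("table_name", "keyspace_name")
  .mode("append")
  .save()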