Spark Cassandra IntelliJ Scala - scala

Environment:
My Laptop:
IntelliJ IDEA 2017.1.4
JRE: 1.8.0_112-release-736-b21 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
Mac OS X 10.12.2
IP Address: 192.168.3.203
AWS Environment:
Linux ip-172-31-16-45 x86_64 x86_64 GNU/Linux
DSE Version : 5.1.0
IP Address: 172.31.16.45
PUBLIC IP: 35.35.42.42
Executed the following commands:
[ec2-user@ip-172-31-16-45 ~]$ dse cassandra -k
and then
[ec2-user@ip-172-31-16-45 ~]$ dse spark
The log file is at /home/ec2-user/.spark-shell.log
warning: there was one deprecation warning; re-run with -deprecation for details
New Spark Session
WARN 2017-09-13 13:18:26,907 org.apache.spark.SparkContext: Use an existing SparkContext, some configuration may not take effect.
Extracting Spark Context
Extracting SqlContext
Spark context Web UI available at http://172.31.16.45:4040
Spark context available as 'sc' (master = dse://?, app id = app-20170913131826-0001).
Spark session available as 'spark'.
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Spark version: 2.0.2.6 (the version shown in the browser when I open http://35.35.42.42:4040/executors/)
Both Spark and the Cassandra database are running on the same node: public IP 35.35.42.42, local IP 172.31.16.45.
When I code directly in the Scala shell on AWS, my code runs fine. What I need is to run the same code from IntelliJ. Is that possible? If not, what is the process?
I then opened IntelliJ on my laptop and tried running the code below:
// In .setMaster I tried every combination: local IP, public IP, spark://, dse://, etc.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import com.datastax.spark.connector._

object helloworld {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("dse://35.35.42.42:4040")
    // val conf = new SparkConf().setAppName("Simple Application").setMaster("spark://172.31.16.45:4040")
    // val conf = new SparkConf().setAppName("Simple Application").setMaster("spark://35.35.42.42:7077")
    // val conf = new SparkConf().setAppName("Simple Application").setMaster("spark://172.31.16.45:7077")
    val sc = new SparkContext(conf)
    // val rdd = sc.cassandraTable("tdata", "map")
  }
}
Error:
org.apache.spark.SparkException: Could not parse Master URL: 'dse://35.35.42.42:4040'
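For what it's worth, port 4040 is the Spark web UI, not a master endpoint, and the dse:// scheme is only understood when DSE's Spark dependencies are on the application's classpath, which is likely why plain Spark cannot parse it. Below is a minimal sketch of what the IntelliJ side might look like against a standalone master, assuming the master is reachable from the laptop on the default port 7077 and the Spark Cassandra Connector is on the classpath; the object name and the choice of the public IP are illustrative only, not a confirmed working configuration:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object HelloWorldRemote {
  def main(args: Array[String]): Unit = {
    // Placeholder addresses taken from the question; 7077 is the usual standalone
    // master port, while 4040 is only the web UI and cannot be used as a master URL.
    val conf = new SparkConf()
      .setAppName("Simple Application")
      .setMaster("spark://35.35.42.42:7077")
      .set("spark.cassandra.connection.host", "35.35.42.42")

    val sc = new SparkContext(conf)
    val rdd = sc.cassandraTable("tdata", "map") // keyspace and table from the question
    println(rdd.count())
    sc.stop()
  }
}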

Related

Clarification about running spark jobs on a cluster (AWS)

I have a HortonWorks cluster running on AWS EC2 machines, on which I would like to run a Spark Streaming job that consumes tweets about Game of Thrones.
Before trying to run it on my cluster, I ran it locally.
The code is working, here it is:
import org.apache.spark.streaming.{StreamingContext, Seconds}
import org.apache.spark.streaming.twitter._
import org.apache.spark.{SparkConf, SparkContext}

object Twitter_Stream extends App {
  val consumerKey = "hidden"
  val consumerSecret = "hidden"
  val accessToken = "hidden"
  val accessTokenSecret = "hidden"

  val sparkConf = new SparkConf().setAppName("GotTweets").setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Seconds(1))

  val myStream = TwitterUtils.createStream(ssc, None, Array("#GoT", "#WinterIsHere", "#GameOfThrones"))

  val rddTweets = myStream.foreachRDD(rdd => {
    rdd.take(10).foreach(println)
  })

  ssc.start()
  ssc.awaitTermination()
}
My question is more precisely about this specific line of code:
val sparkConf = new SparkConf().setAppName("GotTweets").setMaster("local[2]")
I replaced "local[2]" with "spark://ip-address-EC2:7077", which corresponds to one of my EC2 machines, but I get a connection failure.
I'm sure that port 7077 is open on that machine.
Also, when I run this code with this configuration (setMaster("local[2]")) on one of my EC2 machines, will Spark use all the machines of the cluster, or will it run only on a single machine?
Here is the exception:
17/07/24 11:53:42 INFO AppClient$ClientEndpoint: Connecting to master spark://ip-adress:7077...
17/07/24 11:53:44 WARN AppClient$ClientEndpoint: Failed to connect to master ip-adress:7077
java.io.IOException: Failed to connect to spark://ip-adress:7077
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
To run a Spark application on YARN, submit it with spark-submit using --master yarn. There is no need to call setMaster inside the Scala source code.
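As an illustration of that advice (a sketch, not a tested job), the streaming code from the question would keep only the application name in the SparkConf and let spark-submit supply the master; the jar name in the comment is a placeholder for your own assembled artifact, and the Twitter credentials are assumed to be provided through twitter4j configuration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Twitter_Stream extends App {
  // The master comes from the command line, not from the code, for example:
  //   spark-submit --master yarn --deploy-mode client --class Twitter_Stream got-tweets-assembly.jar
  val sparkConf = new SparkConf().setAppName("GotTweets")
  val ssc = new StreamingContext(sparkConf, Seconds(1))

  // Twitter credentials are expected via twitter4j properties when passing None here.
  val myStream = TwitterUtils.createStream(ssc, None, Array("#GoT", "#WinterIsHere", "#GameOfThrones"))
  myStream.foreachRDD(rdd => rdd.take(10).foreach(println))

  ssc.start()
  ssc.awaitTermination()
}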

Scala Code to connect to Spark and Cassandra

I have Scala (IntelliJ) running on my laptop. I also have Spark and Cassandra running on machines A, B, and C (a 3-node cluster using DataStax, running in Analytics mode).
I tried running Scala programs on the cluster, and they run fine.
I need to write code in IntelliJ on my laptop and run it from there. How do I connect and run it? I know I am making a mistake in the code below; I used placeholder values, and I need help writing the specific configuration. For example, localhost is incorrect.
import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._ // needed for sc.cassandraTable

object HelloWorld {
  def main(args: Array[String]) {
    val conf = new SparkConf(true).set("spark:master", "localhost")
    val sc = new SparkContext(conf)
    val data = sc.cassandraTable("my_keyspace", "my_table")
  }
}
val conf = new SparkConf().setAppName("APP_NAME")
.setMaster("local")
.set("spark.cassandra.connection.host", "localhost")
.set("spark.cassandra.auth.username", "")
.set("spark.cassandra.auth.password", "")
Use the code above to connect to a local Spark and Cassandra. If your Cassandra cluster has authentication enabled, set the username and password.
If you want to connect to a remote Spark and Cassandra cluster, replace localhost with the Cassandra host, and in setMaster use spark://SPARK_HOST:7077.
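For example, a sketch of the remote variant; SPARK_MASTER_HOST and CASSANDRA_HOST are placeholders for your own machines, and 7077 is the default port of a standalone Spark master:

import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("APP_NAME")
  .setMaster("spark://SPARK_MASTER_HOST:7077")
  .set("spark.cassandra.connection.host", "CASSANDRA_HOST")
  .set("spark.cassandra.auth.username", "cassandra") // only needed when authentication is enabled
  .set("spark.cassandra.auth.password", "cassandra")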

Running Sparkling-Water with external H2O backend

I was following the steps for running Sparkling Water with the external backend from here. I am using Spark 1.4.1 and sparkling-water-1.4.16. I've built the extended H2O jar and exported the H2O_ORIGINAL_JAR and H2O_EXTENDED_JAR environment variables. I start the H2O backend with
java -jar $H2O_EXTENDED_JAR -md5skip -name test
But when I start sparkling water via
./bin/sparkling-shell
and in it try to get the H2OConf with
import org.apache.spark.h2o._
val conf = new H2OConf(sc).setExternalClusterMode().useManualClusterStart().setCloudName("test")
val hc = H2OContext.getOrCreate(sc, conf)
it fails on the second line with
<console>:24: error: trait H2OConf is abstract; cannot be instantiated
val conf = new H2OConf(sc).setExternalClusterMode().useManualClusterStart().setCloudName("test")
^
I've tried adding the newly built extended H2O jar with the --jars parameter to either Sparkling Water or standalone Spark, with no progress. Does anyone have any hints?
This is unsupported for versions of Spark earlier than 2.0.
Download the latest version of the Sparkling Water jar and add it when starting the sparkling-shell:
./bin/sparkling-shell --master yarn-client --jars "<path to the jar located>"
Then run the code by setting the extended h2o driver:
import org.apache.spark.h2o._
val conf = new H2OConf(spark)
  .setExternalClusterMode()
  .useAutoClusterStart()
  .setH2ODriverPath("//home//xyz//sparkling-water-2.2.5/bin//h2odriver-sw2.2.5-hdp2.6-extended.jar")
  .setNumOfExternalH2ONodes(2)
  .setMapperXmx("6G")
val hc = H2OContext.getOrCreate(spark, conf)

Apache spark error: not found: value sqlContext

I am trying to set up Spark on Windows 10. Initially, I faced this error while starting, and the solution in the link helped. However, I am still not able to run import sqlContext.sql, as it still throws an error:
----------------------------------------------------------------
Fri Mar 24 12:07:05 IST 2017:
Booting Derby version The Apache Software Foundation - Apache Derby - 10.12.1.1 - (1704137): instance a816c00e-015a-ff08-6530-00000ac1cba8
on database directory C:\metastore_db with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1#37606fee
Loaded from file:/F:/Soft/spark/spark-2.1.0-bin-hadoop2.7/bin/../jars/derby-10.12.1.1.jar
java.vendor=Oracle Corporation
java.runtime.version=1.8.0_101-b13
user.dir=C:\
os.name=Windows 10
os.arch=amd64
os.version=10.0
derby.system.home=null
Database Class Loader started - derby.database.classpath=''
17/03/24 12:07:09 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://10.128.18.22:4040
Spark context available as 'sc' (master = local[*], app id = local-1490337421381).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import sqlContext.sql
<console>:23: error: not found: value sqlContext
import sqlContext.sql
^
Spark context available as 'sc' (master = local[*], app id = local-1490337421381).
Spark session available as 'spark'.
In Spark 2.0.x, the entry point of Spark is SparkSession and that is available in Spark shell as spark, so try this way:
spark.sqlContext.sql(...)
You can also create your own SQLContext like this:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
The first option is my choice, as the Spark shell has already created one for you, so make use of it.
Since you are using Spark 2.1, you'll have to use the SparkSession object. You can get a reference to the SparkContext from the SparkSession object:
var sSession = org.apache.spark.sql.SparkSession.getOrCreate();
var sContext = sSession.sparkContext;
If you are on Cloudera and have this issue, the solution from this GitHub ticket worked for me (https://github.com/cloudera/clusterdock/issues/30):
The root user (who you're running as when you start spark-shell) has no user directory in HDFS. If you create one (sudo -u hdfs hdfs dfs -mkdir /user/root followed by sudo -u hdfs hdfs dfs -chown root:root /user/root), this should be fixed.
I.e. create a home directory in HDFS for the user running spark-shell. This fixed it for me.
And don't forget to import the context!
import org.apache.spark.sql.{SparkSession, types}
You have to create a sqlContext in order to execute SQL statements. In Spark 2.0, you can obtain the SQLContext easily from the SparkSession, as shown below.
val sqlContext = spark.sqlContext
sqlContext.sql("SELECT * FROM sometable")
Alternatively, you could also execute SQL statements using SparkSession like below.
spark.sql("SELECT * FROM sometable")
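For instance, a small self-contained check you can run in the Spark 2.x shell; the DataFrame and the view name are made up purely for illustration, since the table referenced in the SQL has to exist before it can be queried:

// Hypothetical example data; "sometable" must be registered before querying it.
val df = spark.range(0, 5).toDF("id")
df.createOrReplaceTempView("sometable")

spark.sql("SELECT * FROM sometable").show()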

Why does Spark Cassandra Connector fail with NoHostAvailableException?

I am having problems getting Spark Cassandra Connector working in Scala.
I'm using these versions:
Scala 2.10.4
spark-core 1.0.2
cassandra-thrift 2.1.0 (my installed cassandra is v2.1.0)
cassandra-clientutil 2.1.0
cassandra-driver-core 2.0.4 (recommended for connector?)
spark-cassandra-connector 1.0.0
I can connect and talk to Cassandra (w/o spark) and I can talk to Spark (w/o Cassandra) but the connector gives me:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.0.0.194:9042 (com.datastax.driver.core.TransportException: [/10.0.0.194:9042] Cannot connect))
What am I missing? Cassandra is a default install (port 9042 for cql according to cassandra.yaml). I'm trying to connect locally ("local").
My code:
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // required for sc.cassandraTable

val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext("local", "test", conf)
val rdd = sc.cassandraTable("myks", "users")
val rr = rdd.first
println(s"Result: $rr")
Local in this context specifies the Spark master (telling it to run in local mode), not the Cassandra connection host.
To set the Cassandra connection host, you have to set a different property in the SparkConf:
import org.apache.spark._
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "IP Cassandra Is Listening On")
.set("spark.cassandra.username", "cassandra") //Optional
.set("spark.cassandra.password", "cassandra") //Optional
val sc = new SparkContext("spark://Spark Master IP:7077", "test", conf)
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md
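Putting the answer together, a sketch of the original read with the corrected configuration; the Cassandra IP comes from the exception above, the keyspace and table from the question, and "local[*]" is used here only so the snippet runs without a standalone master. Adjust all of these to your own environment:

import org.apache.spark._
import com.datastax.spark.connector._

val conf = new SparkConf(true)
  .setAppName("Simple Application")
  .set("spark.cassandra.connection.host", "10.0.0.194") // the node from the exception; adjust to your cluster

// "local[*]" keeps everything on this machine; use "spark://<master host>:7077" against a real cluster.
val sc = new SparkContext("local[*]", "test", conf)

val rdd = sc.cassandraTable("myks", "users")
println(s"Result: ${rdd.first}")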