StreamingContext couldn't bind to a port used by Java - streaming

I have started the Spark master and workers and can easily run a MapReduce job like wordcount on HDFS.
Now I want to run a streaming job on a text stream, but when I try to create a new StreamingContext
I get this error:
scala> val ssc = new StreamingContext("spark://master:7077","test", Seconds(2))
13/07/17 11:13:45 INFO slf4j.Slf4jEventHandler: Slf4jEventHandler started
org.jboss.netty.channel.ChannelException: Failed to bind to: /192.168.2.105:48594
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:298)
....
I checked the port and it was in use by a Java process. I killed that process and got kicked out of spark-shell.
Is there any way to make the StreamingContext bind to a random free port?

Java is the underlying process for Spark (Scala runs on the JVM). It is possible that you have multiple copies of Spark / Spark Streaming running; can you look into that?
Specifically: I get the same result if a spark-shell is already running.
You can check for other Spark processes with:
ps -ef | grep spark | grep -v grep
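If what you actually want is a random free port rather than a fixed one, the driver port can be left for Spark to choose. A minimal sketch, assuming a Spark version recent enough to offer SparkConf and the standard spark.driver.port property (where 0 means "pick a random free port"):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: spark.driver.port = 0 asks Spark to bind the driver to a
// random free port instead of one that may already be taken.
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("test")
  .set("spark.driver.port", "0")

val ssc = new StreamingContext(conf, Seconds(2))

Two drivers on the same machine (for example a lingering spark-shell plus this one) then end up on different ports instead of colliding.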

Related

Spark 2.2.0 unable to connect to Phoenix 4.11.0 version in loading the table to DF

I'm using the tech stack below and am trying to connect to Phoenix tables using PySpark code. I downloaded the following jars from the URL below and tried executing the code shown. In the logs the connection to HBase is established, but the console just hangs without doing anything. Please let me know if anybody has encountered and fixed a similar issue.
https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-spark/4.11.0-HBase-1.2
jars:
phoenix-spark-4.11.0-HBase-1.2.jar
phoenix-client.jar
Tech Stack all running in same host:
Apache Spark 2.2.0 Version
Hbase 1.2 Version
Phoenix 4.11.0 Version
I copied hbase-site.xml to /spark/conf/hbase-site.xml.
Command executed:
usr/local/spark> spark-submit phoenix.py --jars /usr/local/spark/jars/phoenix-spark-4.11.0-HBase-1.2.jar --jars /usr/local/spark/jars/phoenix-client.jar
Phoenix.py:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("pysparkPhoenixLoad").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("org.apache.phoenix.spark").option("table",
"schema.table1").option("zkUrl", "localhost:2181").load()
df.show()
Error log: the HBase connection is established, but the console hangs and a timeout error is thrown
18/07/30 12:28:15 WARN HBaseConfiguration: Config option "hbase.regionserver.lease.period" is deprecated. Instead, use "hbase.client.scanner.timeout.period"
18/07/30 12:28:54 INFO RpcRetryingCaller: Call exception, tries=10, retries=35, started=38367 ms ago, cancelled=false, msg=row 'SYSTEM:CATALOG,,' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=master01,16020,1532591192223, seqNum=0
Take a look at these answers:
phoenix jdbc doesn't work, no exceptions and stuck
HBase Java client - unknown host: localhost.localdomain
Both of those issues happened in Java (with JDBC), but it looks like a similar issue here.
Try adding the ZooKeeper hostname (master01, as I see in the error message) to your /etc/hosts:
127.0.0.1 master01
if you are running your whole stack locally.
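As a related check, the zkUrl option in the question points at localhost. A small, hypothetical variation of the same read call that targets the quorum host from the error message instead (assuming ZooKeeper is listening on master01:2181):

df = sqlContext.read.format("org.apache.phoenix.spark") \
    .option("table", "schema.table1") \
    .option("zkUrl", "master01:2181") \
    .load()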

Connection to Cassandra from spark Error

I am using Spark 2.0.2 and Cassandra 3.11.2. I am using this code, but it gives me a connection error.
./spark-shell --jars ~/spark/spark-cassandra-connector/spark-cassandra-connector/target/full/scala-2.10/spark-cassandra-connector-assembly-2.0.5-121-g1a7fa1f8.jar
import com.datastax.spark.connector._
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val test = sc.cassandraTable("sensorkeyspace", "sensortable")
test.count
When I run the test.count command, it gives me this error:
java.io.IOException: Failed to open native connection to Cassandra at {127.0.0.1}:9042
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:168)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$8.apply(CassandraConnector.scala:154)
Can you check the cassandra.yaml file? It seems the maximum number of concurrent connections may already be open at any given point in time.
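A sketch of the cassandra.yaml settings usually worth checking for a "Failed to open native connection ... :9042" error; the property names are from a stock Cassandra 3.x configuration and the values shown are only illustrative:

# cassandra.yaml (illustrative values, not a drop-in config)
start_native_transport: true        # the native protocol must be enabled
native_transport_port: 9042         # the port the Spark connector dials
rpc_address: 127.0.0.1              # address the node listens on for client connections
# -1 means unlimited; a small value here can exhaust concurrent connections
native_transport_max_concurrent_connections: -1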

How to use mesos master url in a self-contained Scala Spark program

I am creating a self-contained Scala program that uses Spark for parallelization in some parts. In my specific situation, the Spark cluster is available through mesos.
I create spark context like this:
val conf = new SparkConf().setMaster("mesos://zk://<mesos-url1>,<mesos-url2>/spark/mesos-rtspark").setAppName("foo")
val sc = new SparkContext(conf)
I found out from searching around that you have to set the MESOS_NATIVE_JAVA_LIBRARY env var to point to the libmesos library, so when running my Scala program I do this:
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.dylib sbt run
But this results in a SparkException:
ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Could not parse Master URL: 'mesos://zk://<mesos-url1>,<mesos-url2>/spark/mesos-rtspark'
At the same time, using spark-submit seems to work fine after exporting the MESOS_NATIVE_JAVA_LIBRARY env var.
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.dylib spark-submit --class <MAIN CLASS> ./target/scala-2.10/<APP_JAR>.jar
Why?
How can I make the standalone program run like spark-submit?
Add the spark-mesos jar to your classpath.
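For an sbt-based, self-contained program that would look roughly like the lines below. This is only a sketch: it assumes a Spark version (2.1+) where the Mesos integration ships as a separate spark-mesos module, and the versions shown are placeholders that must match your spark-core version.

// build.sbt (sketch; versions are placeholders and must match each other)
libraryDependencies += "org.apache.spark" %% "spark-core"  % "2.1.0"
libraryDependencies += "org.apache.spark" %% "spark-mesos" % "2.1.0"

With that on the classpath (and MESOS_NATIVE_JAVA_LIBRARY still pointing at libmesos), Spark should be able to resolve the mesos:// master URL from inside sbt run the same way spark-submit does.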

Simple Spark program eats all resources

I have a server running a Spark master and a slave. Spark was built manually with the following flags:
build/mvn -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests clean package
I'm trying to execute the following simple program remotely:
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("testApp").setMaster("spark://sparkserver:7077")
val sc = new SparkContext(conf)
println(sc.parallelize(Array(1,2,3)).reduce((a, b) => a + b))
}
Spark dependency:
"org.apache.spark" %% "spark-core" % "1.6.1"
Log on program executing:
16/04/12 18:45:46 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
My cluster WebUI:
Why does such a simple application use all available resources?
P.S. I also noticed that if I allocate more memory for my app (e.g. 10 GB), the following log lines appear many times:
16/04/12 19:23:40 INFO AppClient$ClientEndpoint: Executor updated: app-20160412182336-0008/208 is now RUNNING
16/04/12 19:23:40 INFO AppClient$ClientEndpoint: Executor updated: app-20160412182336-0008/208 is now EXITED (Command exited with code 1)
I think the reason is the connection between the master and the slave. This is how I set up the master and slave (on the same machine):
sbin/start-master.sh
sbin/start-slave.sh spark://sparkserver:7077
P.P.S. When I connect to the Spark master with spark-shell, everything works fine:
spark-shell --master spark://sparkserver:7077
By default, YARN will allocate all "available" resources if YARN dynamic resource allocation is set to true and your job still has queued tasks. You can also look at your YARN configuration, namely the number of executors and the memory allocated to each one, and tune them according to your needs.
In spark-defaults.conf, set spark.cores.max=4.
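The same cap can also be set programmatically on the app's own SparkConf. A minimal sketch; spark.executor.memory is an extra, optional limit not mentioned above:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: cap what this standalone app may take from the cluster
val conf = new SparkConf()
  .setAppName("testApp")
  .setMaster("spark://sparkserver:7077")
  .set("spark.cores.max", "4")          // at most 4 cores in total for this app
  .set("spark.executor.memory", "2g")   // optional: per-executor memory cap
val sc = new SparkContext(conf)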
It was a driver issue. The driver (my Scala app) was running on my local computer, and the workers had no access to it. As a result, all resources were eaten by repeated attempts to reconnect to the driver.
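For that situation the usual knobs are spark.driver.host (plus an open spark.driver.port), so the executors know a reachable address for the driver. A hedged sketch; the IP and port shown are placeholders for whatever the workers can actually reach:

import org.apache.spark.SparkConf

// Sketch: tell the executors where to reach the remotely running driver
val conf = new SparkConf()
  .setAppName("testApp")
  .setMaster("spark://sparkserver:7077")
  .set("spark.driver.host", "192.0.2.10")   // placeholder: an address the workers can reach
  .set("spark.driver.port", "51000")        // placeholder: a port open in the firewall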

Programmatically setting (remote) master address for launching Spark

Note that the following local setting does work:
sc = new SparkContext("local[8]", testName)
But setting the remote master programmatically does not work:
sc = new SparkContext(master, testName)
or (same end result)
val sconf = new SparkConf()
.setAppName(testName)
.setMaster(master)
sc = new SparkContext(sconf)
In both of the latter cases the result is:
[16:25:33,427][INFO ][AppClient$ClientActor] Connecting to master akka.tcp://sparkMaster#mellyrn:7077/user/Master...
[16:25:33,439][WARN ][ReliableDeliverySupervisor] Association with remote system [akka.tcp://sparkMaster#mellyrn:7077]
has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
The following command line approach for setting the spark master consistently works (verified on multiple projects):
$SPARK_HOME/bin/spark-submit --master spark://mellyrn.local:7077
--class $1 $curdir/sparkclass.jar )
Clearly there is some additional configuration happening related to the command-line spark-submit. Does anyone want to posit what that might be?
In the UNIX shell script below:
SP_MAST_URL=$("$CASSANDRA_HOME"/dse client-tool spark master-address)
echo $SP_MAST_URL
this will print the master URL from your Spark cluster environment. You can use this command-line utility (provided by DSE) and pass the result on to the spark-submit command.
Note: CASSANDRA_HOME is the path where the Cassandra installation lives; it can be any UNIX file path, depending on the environment.
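A short usage sketch of how the captured address might then feed the working command-line path from the question (the class and jar names are the question's own placeholders):

SP_MAST_URL=$("$CASSANDRA_HOME"/dse client-tool spark master-address)
$SPARK_HOME/bin/spark-submit --master "$SP_MAST_URL" \
  --class <MAIN CLASS> ./target/scala-2.10/<APP_JAR>.jar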