Accessing spark cluster mode in spark submit - scala

I am trying to run my Spark Scala code using spark-submit, and I want it to run on the Spark cluster. What should I use as the master when building the SparkSession? I have used it like this:
val spark = SparkSession.builder()
  .master("spark://prod5:7077")
  .appName("MyApp")
  .getOrCreate()
But it doesn't seem to work. What should I use as the master for using spark cluster?

If you are submitting the job from your IDE, just make sure that "prod5" really is the master host, and try changing the port to 6066, which is the default REST submission port.

From the official documentation:
spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
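A common pattern is to leave .master() out of the code entirely and let spark-submit's --master flag decide where the job runs, so the same jar works locally, on the standalone cluster, or on YARN. A minimal sketch (MyApp and myApp.jar are placeholder names):

import org.apache.spark.sql.SparkSession

object MyApp {
  def main(args: Array[String]): Unit = {
    // No .master() here: the --master flag passed to spark-submit decides
    // where this runs (local[*], spark://host:port, yarn, ...)
    val spark = SparkSession.builder()
      .appName("MyApp")
      .getOrCreate()

    // ... job logic ...

    spark.stop()
  }
}

It would then be submitted with something like spark-submit --master spark://prod5:7077 --class MyApp myApp.jar for client mode against the standalone master, or with port 6066 and --deploy-mode cluster if you go through the REST submission endpoint mentioned above.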

Related

Error to write dataframe in Cassandra table on Amazon Keyspaces

I'm trying to write a dataframe to AWS Keyspaces, but I'm getting the error below:
Code:
dfExploded.write.cassandraFormat(table = "table", keyspace = "hub").mode(SaveMode.Append).save()
Stack trace:
21/08/18 21:45:18 WARN DefaultTokenFactoryRegistry: [s0] Unsupported partitioner 'com.amazonaws.cassandra.DefaultPartitioner', token map will be empty.
java.lang.AssertionError: assertion failed: There are no contact points in the given set of hosts
at scala.Predef$.assert(Predef.scala:223)
at com.datastax.spark.connector.cql.LocalNodeFirstLoadBalancingPolicy$.determineDataCenter(LocalNodeFirstLoadBalancingPolicy.scala:195)
at com.datastax.spark.connector.cql.CassandraConnector$.$anonfun$dataCenterNodes$1(CassandraConnector.scala:192)
at scala.Option.getOrElse(Option.scala:189)
at com.datastax.spark.connector.cql.CassandraConnector$.dataCenterNodes(CassandraConnector.scala:192)
at com.datastax.spark.connector.cql.CassandraConnector$.alternativeConnectionConfigs(CassandraConnector.scala:207)
at com.datastax.spark.connector.cql.CassandraConnector$.$anonfun$sessionCache$3(CassandraConnector.scala:169)
at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:34)
at com.datastax.spark.connector.cql.RefCountedCache.syncAcquire(RefCountedCache.scala:69)
at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:57)
at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:89)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111)
at com.datastax.spark.connector.datasource.CassandraCatalog$.com$datastax$spark$connector$datasource$CassandraCatalog$$getMetadata(CassandraCatalog.scala:455)
at com.datastax.spark.connector.datasource.CassandraCatalog$.getTableMetaData(CassandraCatalog.scala:421)
at org.apache.spark.sql.cassandra.DefaultSource.getTable(DefaultSource.scala:68)
at org.apache.spark.sql.cassandra.DefaultSource.inferSchema(DefaultSource.scala:72)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
at org.apache.spark.sql.DataFrameWriter.getTable$1(DataFrameWriter.scala:339)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:301)
SparkSubmit:
spark-submit --deploy-mode cluster --master yarn \
--conf=spark.cassandra.connection.port="9142" \
--conf=spark.cassandra.connection.host="cassandra.sa-east-1.amazonaws.com" \
--conf=spark.cassandra.auth.username="BUU" \
--conf=spark.cassandra.auth.password="123456789" \
--conf=spark.cassandra.connection.ssl.enabled="true" \
--conf=spark.cassandra.connection.ssl.trustStore.path="cassandra_truststore.jks" \
--conf=spark.cassandra.connection.ssl.trustStore.password="123456"
Connecting with cqlsh works fine, but in Spark I get this error.
To read and write data between Amazon Keyspaces and Apache Spark using the open-source Spark Cassandra Connector, all you have to do is update the partitioner for your Keyspaces account.
Docs: https://docs.aws.amazon.com/keyspaces/latest/devguide/spark-integrating.html
The issue, as the error states, is that AWS Keyspaces uses a partitioner (com.amazonaws.cassandra.DefaultPartitioner) that isn't supported by the Spark Cassandra connector.
There isn't a lot of public documentation about what the underlying database for AWS Keyspaces is, so I've long suspected that there's a CQL API engine sitting in front of Keyspaces: it "looks" like Cassandra but is probably backed by something else, such as DynamoDB. I'm more than happy to be corrected by someone here from AWS just so I can put that to bed. 🙂
The default Cassandra partitioner is Murmur3Partitioner and is the only recommended partitioner. The older partitioners such as RandomPartitioner and ByteOrderedPartitioner are supported only for backward compatibility but should never be used for new clusters.
Finally, we don't test the Spark connector against AWS Keyspaces so be prepared for a lot of surprises there. Cheers!
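For reference, a minimal sketch of the same write with the connector options set in code rather than on the spark-submit line; the host, credentials and truststore values are the placeholders from the question, and (per the AWS docs linked above) the account's partitioner has to be changed to Murmur3Partitioner before the connector can find any contact points:

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.cassandra._

// Sketch only: host, credentials and truststore values are the
// placeholders from the question, not working settings.
val spark = SparkSession.builder()
  .appName("KeyspacesWriter")
  .config("spark.cassandra.connection.host", "cassandra.sa-east-1.amazonaws.com")
  .config("spark.cassandra.connection.port", "9142")
  .config("spark.cassandra.auth.username", "BUU")
  .config("spark.cassandra.auth.password", "123456789")
  .config("spark.cassandra.connection.ssl.enabled", "true")
  .config("spark.cassandra.connection.ssl.trustStore.path", "cassandra_truststore.jks")
  .config("spark.cassandra.connection.ssl.trustStore.password", "123456")
  .getOrCreate()

val dfExploded: DataFrame = ??? // the DataFrame being written in the question

// Same write as in the question; it only succeeds once the Keyspaces
// account's partitioner is Murmur3Partitioner, as described in the AWS
// documentation linked above.
dfExploded.write
  .cassandraFormat(table = "table", keyspace = "hub")
  .mode(SaveMode.Append)
  .save()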

Memory Allocation In Spark-Scala Application

I am executing a Spark-Scala job using the spark-submit command. I have written my code in Spark SQL, where I join two tables and load the result into a third Hive table.
The code works fine, but sometimes I run into issues such as an OutOfMemoryError (Java heap space) or a timeout error.
So I want to control my job manually by passing the number of executors, cores and memory. When I used 16 executors, 1 core and 20 GB of executor memory, my Spark application got stuck.
Can someone please suggest how I should tune my Spark application manually with the correct parameters? And are there any other Hive- or Spark-specific parameters I can use for faster execution?
Below is the configuration of my cluster:
Number of Nodes: 5
Number of Cores per Node: 6
RAM per Node: 125 GB
Spark-submit command:
spark-submit --class org.apache.spark.examples.sparksc \
--master yarn-client \
--num-executors 16 \
--executor-memory 20g \
--executor-cores 1 \
examples/jars/spark-examples.jar
It depends on the volume of your data. You can make the parameters dynamic. This link has a very nice explanation:
How to tune spark executor number, cores and executor memory?
You can enable spark.shuffle.service.enabled and use:
spark.sql.shuffle.partitions=400
hive.exec.compress.intermediate=true
hive.exec.reducers.bytes.per.reducer=536870912
hive.exec.compress.output=true
hive.output.codec=snappy
mapred.output.compression.type=BLOCK
If your data is larger than 700 MB you can enable the spark.speculation property.
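As a concrete sketch of those suggestions applied when the session is built (the values are the ones from the answer above, not tuned measurements):

import org.apache.spark.sql.SparkSession

// Sketch of the settings suggested above. spark.shuffle.service.enabled also
// requires the external shuffle service to be running on the YARN NodeManagers.
val spark = SparkSession.builder()
  .appName("JoinAndLoadToHive")
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.sql.shuffle.partitions", "400")
  .enableHiveSupport()
  .getOrCreate()

// Hive-side settings from the answer, applied through Spark SQL
spark.sql("SET hive.exec.compress.intermediate=true")
spark.sql("SET hive.exec.reducers.bytes.per.reducer=536870912")
spark.sql("SET hive.exec.compress.output=true")
spark.sql("SET mapred.output.compression.type=BLOCK")

For the executor sizing itself, the linked tuning question would suggest leaving one core per node for the OS and using a few multi-core executors (for example --num-executors 5 --executor-cores 5 --executor-memory 20g on this 5-node, 6-core cluster) rather than 16 single-core executors; treat those numbers as a starting point to test, not a prescription.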

How to make spark streaming process multiple batches?

Spark uses parallelism; however, while testing my application and looking at the Spark UI, under the Streaming tab I often notice under "Active Batches" that the status of one batch is "processing" and the rest are "queued". Is there a parameter I can configure to make Spark process multiple batches simultaneously?
Note: I am using spark.streaming.concurrentJobs greater than 1, but that doesn't seem to apply to batch processing (?)
I suppose that you are using YARN to launch your Spark streaming job.
YARN queues your batches because it doesn't have enough resources to launch your streaming/Spark batches simultaneously.
You can try to limit the resources requested from YARN with:
--driver-memory -> memory for the driver
--executor-memory -> memory for each worker
--num-executors -> number of distinct YARN containers
--executor-cores -> number of threads you get inside each executor
For example:
spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 800m \
--executor-memory 800m \
--num-executors 4 \
--class my.class \
myjar
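To let several batches run at once you also need the property mentioned in the question; a minimal sketch, assuming a simple socket source and that YARN has granted enough resources for the extra jobs:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: allow several streaming output jobs (batches) to be scheduled
// concurrently. This only helps if the cluster actually has spare
// executors/cores for them.
val conf = new SparkConf()
  .setAppName("ConcurrentBatches")
  .set("spark.streaming.concurrentJobs", "4")

val ssc = new StreamingContext(conf, Seconds(10))

// Hypothetical source, just to make the sketch self-contained
val lines = ssc.socketTextStream("localhost", 9999)
lines.count().print()

ssc.start()
ssc.awaitTermination()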

spark-submit gets idle in local mode

I am trying to test a jar using spark-submit (Spark 1.6.0) on a Cloudera cluster which has Kerberos enabled.
The fact is that if I launch this command:
spark-submit --master local --class myDriver myApp.jar -c myConfig.conf
In local or local[*] mode, the process stalls after a couple of stages. However, if I use the yarn-client or yarn-cluster master modes, the process finishes correctly. The process reads and writes some files in HDFS.
Furthermore, these traces appear:
17/07/05 16:12:51 WARN spark.SparkContext: Requesting executors is only supported in coarse-grained mode
17/07/05 16:12:51 WARN spark.ExecutorAllocationManager: Unable to reach the cluster manager to request 1 total executors!
It is surely a matter of configuration, but the fact is that I don't know what is happening. Any ideas? What configuration options should I change?
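Those two warnings come from the ExecutorAllocationManager, which only runs when dynamic allocation is enabled, and Cloudera distributions often enable it cluster-wide; one thing to try (an assumption, not a confirmed fix) is forcing it off for the local run:

import org.apache.spark.{SparkConf, SparkContext}

// Assumption: the cluster-wide defaults enable dynamic allocation, which the
// local master cannot honour (hence the warnings above). Force it off for a
// local test run.
val conf = new SparkConf()
  .setAppName("myDriver")
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.shuffle.service.enabled", "false")

val sc = new SparkContext(conf)

Equivalently, the same setting can be passed on the command line with --conf spark.dynamicAllocation.enabled=false.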

How to configure executors with custom StatsD Spark metrics sink

How do I sink Spark Streaming metrics to this StatsD sink for executors?
Similar to other reported issues (sink class not found, sink class in executor), I can get driver metrics, but executors throw ClassNotFoundException with my setup:
StatsD sink class is compiled with my Spark-Streaming app (my.jar)
spark-submit is run with:
--files ./my.jar (to pull jar containing sink into executor)
--conf "spark.executor.extraClassPath=my.jar"
Spark Conf is configured in the driver with:
val conf = new SparkConf()
conf.set("spark.metrics.conf.*.sink.statsd.class",
         "org.apache.spark.metrics.sink.StatsDSink")
    .set("spark.metrics.conf.*.sink.statsd.host", conf.get("host"))
    .set("spark.metrics.conf.*.sink.statsd.port", "8125")
Looks like you hit the bug https://issues.apache.org/jira/browse/SPARK-18115. I hit it too and googled your question :(
Copy your jar file to the $SPARK_HOME/jars folder so the executors can find the sink class when their metrics system starts.
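Putting the workaround together, a hedged sketch of the driver side once the sink jar has been copied into $SPARK_HOME/jars (class and property names are the ones from the question; the StatsD host is a placeholder):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Assumes the jar containing the custom sink is already in $SPARK_HOME/jars,
// so executors can load it when their metrics system starts (the --files /
// extraClassPath route fails because of SPARK-18115).
val conf = new SparkConf()
  .set("spark.metrics.conf.*.sink.statsd.class",
       "org.apache.spark.metrics.sink.StatsDSink")
  .set("spark.metrics.conf.*.sink.statsd.host", "statsd.example.com") // placeholder
  .set("spark.metrics.conf.*.sink.statsd.port", "8125")

val spark = SparkSession.builder().config(conf).getOrCreate()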