How to configure executors with custom StatsD Spark metrics sink - scala

How do I sink Spark Streaming metrics to this StatsD sink for executors?
Similar to other reported issues (sink class not found, sink class in executor), I can get driver metrics, but executors throw ClassNotFoundException with my setup:
The StatsD sink class is compiled into my Spark Streaming app (my.jar)
spark-submit is run with:
--files ./my.jar (to pull the jar containing the sink onto the executors)
--conf "spark.executor.extraClassPath=my.jar"
The SparkConf is configured in the driver with:
val conf = new SparkConf()
conf.set("spark.metrics.conf.*.sink.statsd.class",
         "org.apache.spark.metrics.sink.StatsDSink")
    .set("spark.metrics.conf.*.sink.statsd.host", conf.get("host"))
    .set("spark.metrics.conf.*.sink.statsd.port", "8125")

Looks like you hit the bug https://issues.apache.org/jira/browse/SPARK-18115. I hit it too and found your question while googling :(

Copy your jar files to the $SPARK_HOME/jars folder.
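For context, a custom sink normally has to live in the org.apache.spark.metrics.sink package, because Spark's Sink trait is private[spark], and it must expose the constructor Spark looks up reflectively. The skeleton below is only a sketch under those assumptions, not the asker's actual class: the class name mirrors the config above, and the three-argument constructor (Properties, MetricRegistry, SecurityManager) is what Spark 2.x expects (newer versions drop the SecurityManager parameter), so verify it against your Spark version.
package org.apache.spark.metrics.sink

import java.util.Properties
import com.codahale.metrics.MetricRegistry
import org.apache.spark.SecurityManager

private[spark] class StatsDSink(
    val property: Properties,
    val registry: MetricRegistry,
    securityMgr: SecurityManager) extends Sink {

  // Host and port come from the spark.metrics.conf.*.sink.statsd.* properties.
  private val host = property.getProperty("host", "127.0.0.1")
  private val port = property.getProperty("port", "8125").toInt

  // Called by the MetricsSystem on startup; schedule a reporter here that
  // periodically flushes `registry` to StatsD at host:port.
  override def start(): Unit = {}

  // Called on shutdown; stop the reporter here.
  override def stop(): Unit = {}

  // One-shot flush of the registry.
  override def report(): Unit = {}
}
The underlying issue in SPARK-18115 is roughly that the executor's metrics system starts before jars shipped with --files/--jars are on its classpath, which is why placing the jar in $SPARK_HOME/jars (already on the system classpath) works around it.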

Related

Not able to execute Pyspark script using spark action in Oozie - Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog'

I am facing the below error while running a Spark action through an Oozie workflow on an EMR 5.14 cluster:
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog'"
My PySpark script runs fine when executed as a normal Spark job, but fails when executed via Oozie.
PySpark program:
from pyspark import SparkContext
from pyspark.sql import SparkSession, HiveContext

spark = SparkSession.builder.appName("PysparkTest").config("hive.support.quoted.identifiers", "none").enableHiveSupport().getOrCreate()
sc = SparkContext.getOrCreate()
sqlContext = HiveContext(sc)
sqlContext.sql("show databases").show()
I have created a workflow.xml and job.properties taking reference from the link.
I copied all the Spark and Hive related configuration files into the same directory ($SPARK_CONF_DIR/).
Hive is also configured to use MySQL for the metastore.
It would be great if you could help me figure out the problem I am facing when running this PySpark program as a jar file in an Oozie Spark action.
'Error while instantiating org.apache.spark.sql.hive.HiveExternalCatalog' means the catalog jar it is trying to find is not in the Oozie sharelib Spark directory.
Please add the following property in your job.properties file.
oozie.action.sharelib.for.spark=hive,spark,hcatalog
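For reference, a minimal job.properties with that property in place might look like the following sketch; the hosts, ports and application path are placeholders, not values from the question:
nameNode=hdfs://<namenode-host>:8020
jobTracker=<resourcemanager-host>:8032
oozie.use.system.libpath=true
oozie.action.sharelib.for.spark=hive,spark,hcatalog
oozie.wf.application.path=${nameNode}/user/<user>/apps/pyspark-app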
Also, can you please post the whole log?
And if possible, could you please run the same on EMR 5.29? I have faced a few jar issues on 5.26 and lower versions while running PySpark.

NimbusLeaderNotFoundException in Apache Storm UI

I am trying to launch the Storm UI for a streaming application, however I constantly get this error:
org.apache.storm.utils.NimbusLeaderNotFoundException: Could not find leader nimbus from seed hosts [localhost]. Did you specify a valid list of nimbus hosts for config nimbus.seeds?
at org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:250)
at org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:179)
at org.apache.storm.utils.NimbusClient.getConfiguredClient(NimbusClient.java:138)
at org.apache.storm.daemon.ui.resources.StormApiResource.getClusterConfiguration(StormApiResource.java:116)
I have launched Storm locally, using the storm script to start Nimbus, submit the jar, and open the UI. What could be the reason for this?
Here is the code with connection setup:
val cluster = new LocalCluster()
val bootstrapServers = "localhost:9092"
val spoutConfig = KafkaTridentSpoutConfig.builder(bootstrapServers, "tweets")
  .setProp(props)
  .setFirstPollOffsetStrategy(FirstPollOffsetStrategy.LATEST)
  .build()
val config = new Config()
cluster.submitTopology("kafkaTest", config, tridentTopology.build())
When you submit to a real cluster using storm jar, you should not use LocalCluster. Instead use the StormSubmitter class.
The error you're getting is saying that it can't find Nimbus at localhost. Are you sure Nimbus is running on the machine you're running storm jar from? If so, please post the commands you're running, and maybe also check the Nimbus log.
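As a rough sketch (assuming tridentTopology is the Trident topology built in the question and that the code is packaged and launched with storm jar), cluster submission would look something like this:
import org.apache.storm.{Config, StormSubmitter}

val config = new Config()
config.setNumWorkers(2) // placeholder worker count
// Because `storm jar` runs this on a machine with the cluster's storm.yaml,
// the client resolves nimbus.seeds from that config instead of localhost.
StormSubmitter.submitTopology("kafkaTest", config, tridentTopology.build())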

End of file exception while reading a file from remote hdfs cluster using spark

I am new to working with HDFS. I am trying to read a csv file stored on a Hadoop cluster using Spark. Every time I try to access it I get the following error:
End of File Exception between local host
I have not set up Hadoop locally since I already have access to the Hadoop cluster.
I may be missing some configuration, but I don't know which. I would appreciate the help.
I tried to debug it using this link, but it did not work for me.
This is the code, using Spark:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("Read").setMaster("local")
  .set("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
  .set("fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
val sc = new SparkContext(conf)
val data = sc.textFile("hdfs://<some-ip>/abc.csv")
I expect it to read the csv and convert it into an RDD.
Getting this error:
Exception in thread "main" java.io.EOFException: End of File Exception between local host is:
Run your Spark job on the Hadoop cluster. Use the code below:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("Read").getOrCreate()
val data = spark.sparkContext.textFile("<filePath>")
Or you can use spark-shell as well.
If you want to access HDFS from your local machine, follow this link.
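If you do read the remote cluster from your own machine, it usually helps to spell out the NameNode's RPC host and port in the URI. A minimal sketch, assuming the default RPC port 8020 and that the cluster's core-site.xml/hdfs-site.xml are on the classpath (the host is a placeholder):
import org.apache.spark.sql.SparkSession

// <namenode-host> and 8020 are placeholders; they must match the cluster's
// fs.defaultFS setting in core-site.xml.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("Read")
  .getOrCreate()

val data = spark.sparkContext.textFile("hdfs://<namenode-host>:8020/abc.csv")
data.take(5).foreach(println)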

How to write a Spark dataframe into Kinesis Stream?

I am creating a DataFrame from a Kafka topic using Spark Streaming.
I want to write the DataFrame into a Kinesis producer.
I understand that there is no official API for this as of now, and there are multiple APIs available over the internet, but sadly none of them worked for me.
Spark version: 2.2
Scala: 2.11
I tried using https://github.com/awslabs/kinesis-kafka-connector and built the jar, but I am getting errors due to conflicting package names between this jar and the Spark API. Please help.
Edit: here is the code, for others:
spark-shell --jars spark-sql-kinesis_2.11-2.2.0.jar,spark-sql-kafka-0-10_2.11-2.1.0.jar,spark-streaming-kafka-0-10-assembly_2.10-2.1.0.jar \
  --files kafka_client_jaas_spark.conf \
  --properties-file gobblin_migration.conf \
  --conf spark.port.maxRetries=100 \
  --driver-java-options "-Djava.security.auth.login.config=kafka_client_jaas_spark.conf" \
  --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas_spark.conf" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas_spark.conf"
import java.io.File
import org.apache.commons.lang3.exception.ExceptionUtils
import org.apache.spark.SparkException
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import scala.sys.process._
import org.apache.log4j.{ Logger, Level, LogManager, PropertyConfigurator }
import org.apache.spark.sql.streaming.Trigger
val streamingInputDF = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "bootstrap server")
  .option("subscribe", "<kafkatopic>")
  .option("startingOffsets", "latest")
  .option("failOnDataLoss", "false")
  .option("kafka.security.protocol", "SASL_PLAINTEXT")
  .load()
val xdf = streamingInputDF.select(col("partition").cast("String").alias("partitionKey"), col("value").alias("data"))
xdf.writeStream.format("kinesis")
  .option("checkpointLocation", "<hdfspath>")
  .outputMode("Append")
  .option("streamName", "kinesisstreamname")
  .option("endpointUrl", "kinesisendpoint")
  .option("awsAccessKeyId", "accesskey")
  .option("awsSecretKey", "secretkey")
  .start()
  .awaitTermination()
For the jar spark-sql-kinesis_2.11-2.2.0.jar, go to Qubole, download the package for your Spark version, and build the jar.
If you are behind a corporate network, set the proxy before launching Spark.
export http_proxy=http://server-ip:port/
export https_proxy=https://server-ip:port/
Kafka Connect is a service to which you can POST your connector specifications (Kinesis, in this case), and it then takes care of running the connector. It supports quite a few transformations while processing the records, but Kafka Connect plugins are not intended to be used with Spark applications.
If your use case requires you to apply business logic while processing the records, then you could go with either the Spark Streaming or the Structured Streaming approach.
If you want to take a Spark-based approach, below are the two options I can think of.
Use Structured Streaming. You could use a Structured Streaming connector for Kinesis. You can find one here. There may be others too. This is the only stable and open-source connector I am aware of. You can find an example of using Kinesis as a sink here.
Use the Kinesis Producer Library (KPL) or the aws-java-sdk-kinesis library to publish records from your Spark Streaming application. Using KPL is the preferred approach here. You could do mapPartitions, create a Kinesis client per partition, and publish the records using these libraries. There are plenty of examples in the AWS docs for these two libraries.
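A rough sketch of that second option, using the plain aws-java-sdk-kinesis client from a DStream; the stream name, region and the (key, payload) record shape are assumptions for illustration, and the KPL would be wired in at the same place:
import java.nio.ByteBuffer
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.model.PutRecordRequest
import org.apache.spark.streaming.dstream.DStream

def writeToKinesis(records: DStream[(String, Array[Byte])]): Unit = {
  records.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      // One client per partition, created on the executor side.
      val kinesis = AmazonKinesisClientBuilder.standard().withRegion("us-east-1").build()
      partition.foreach { case (key, payload) =>
        val request = new PutRecordRequest()
          .withStreamName("my-kinesis-stream") // placeholder stream name
          .withPartitionKey(key)
          .withData(ByteBuffer.wrap(payload))
        kinesis.putRecord(request)
      }
    }
  }
}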

Spark App Works Only in Standalone but not able to connect to master?

I have a Scala 2.10 / Spark 1.5.0 sbt app that I am developing in Eclipse. In my main method, I have:
val sc = new SparkContext("local[2]", "My Spark App")
// followed by operations to do things with the spark context to count up lines in a file
When I run this application within the Eclipse IDE, it works and outputs the result I expect, but when I change the Spark context to connect to my cluster using:
val master = "spark://My-Hostname-From-The-Spark-Master-Page.local:7077"
val conf = new SparkConf().setAppName("My Spark App").setMaster(master)
val sc = new SparkContext(conf)
I get errors like:
WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster#My-Hostname-From-The-Spark-Master-Page.local:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
What gives? How can I get my job to run against the existing master and worker nodes I started up? I know spark-submit is recommended, but aren't applications like Zeppelin and notebooks designed to use Spark without having to go through spark-submit?