How to use yarn to run a self-contained Spark app remotely - scala

I am creating a self-contained Scala program that uses Spark for parallelization in some parts. In my specific situation, the Spark cluster is available through YARN.
I need my Spark job to load an HDFS file located on a Hadoop cluster that is not directly accessible from my local machine. So I create a SOCKS proxy through an SSH tunnel and point the HDFS client at it by including these properties in hdfs-site.xml:
<property>
  <name>hadoop.socks.server</name>
  <value>localhost:7070</value>
</property>
<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>
<property>
  <name>dfs.client.use.legacy.blockreader</name>
  <value>true</value>
</property>
where 7070 is the local dynamic-forwarding (SOCKS) port of the SSH tunnel to the Hadoop gateway machine:
ssh -fCND 7070 <hadoop-gateway-machine>
This allows me to access HDFS files locally when I use Spark with the local[*] master configuration for testing.
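For reference, a minimal sketch of that local testing setup (assuming hdfs-site.xml and core-site.xml with the SOCKS settings above are on the classpath; the namenode host, port, and file path are placeholders):
import org.apache.spark.sql.SparkSession

object LocalHdfsSmokeTest {
  def main(args: Array[String]): Unit = {
    // local[*] master for testing; HDFS traffic goes through the SOCKS proxy
    // configured via hadoop.rpc.socket.factory.class.default above
    val spark = SparkSession.builder()
      .appName("local-hdfs-smoke-test")
      .master("local[*]")
      .getOrCreate()

    // placeholder namenode and path
    val lines = spark.sparkContext.textFile("hdfs://namenode:8020/path/to/file")
    println(s"line count: ${lines.count()}")

    spark.stop()
  }
}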
However, when I run a real Spark job on YARN on the same Hadoop cluster (configured via yarn-site.xml, hdfs-site.xml, and core-site.xml on the classpath), I see errors like:
java.lang.IllegalStateException: Library directory '<project-path>/assembly/target/scala-2.11/jars' does not exist; make sure Spark is built.
So, I set the spark.yarn.jars property directly on the SparkConf. This at least starts a YARN application. When I go to the application URL, I just keep seeing this message in one of the worker logs:
Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher
And this message in another Hadoop worker log (apparently the Spark worker that could not connect to the driver):
org.apache.spark.SparkException: Failed to connect to driver!
at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:484)
at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:345)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:187)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:653)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:651)
at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:674)
at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
My question is: what is the right way to run self-contained Spark apps on a YARN cluster? How do you do it so that you don't have to specify spark.yarn.jars and other properties? Should you also include spark-defaults.conf on the classpath?
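For reference, here is a minimal sketch of the kind of programmatic setup described above, with spark.yarn.jars set directly on the SparkConf; the HDFS location of the Spark jars is a placeholder and the jars are assumed to have been uploaded there beforehand:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch only: assumes yarn-site.xml, core-site.xml, and hdfs-site.xml are on
// the classpath, and that the Spark jars were uploaded to HDFS beforehand.
val conf = new SparkConf()
  .setAppName("self-contained-on-yarn")
  .setMaster("yarn")
  .set("spark.submit.deployMode", "client")
  .set("spark.yarn.jars", "hdfs://namenode:8020/user/spark/jars/*.jar")

val spark = SparkSession.builder().config(conf).getOrCreate()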

Related

Confluent Kafka connect distributed mode jdbc connector

We have successfully ingested data from MySQL into Kafka using the JDBC standalone connector, but we are now facing an issue using the same connector in distributed mode (as the Kafka Connect service).
The command used for the standalone connector, which works fine:
/usr/bin/connect-standalone /etc/kafka/connect-standalone.properties /etc/kafka-connect-jdbc/source-quickstart-mysql.properties
Now we have stopped it and started the Kafka Connect service in distributed mode like this:
systemctl status confluent-kafka-connect
● confluent-kafka-connect.service - Apache Kafka Connect - distributed
Loaded: loaded (/usr/lib/systemd/system/confluent-kafka-connect.service; disabled; vendor preset: disabled)
Active: active (running) since Wed 2018-11-14 22:52:49 CET; 41min ago
Docs: http://docs.confluent.io/
Main PID: 130178 (java)
CGroup: /system.slice/confluent-kafka-connect.service
└─130178 java -Xms256M -Xmx2G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.a...
Two nodes are currently running the Connect service with the same connect-distributed.properties file:
bootstrap.servers=node1IP:9092,node2IP:9092
group.id=connect-cluster
key.converter.schemas.enable=true
value.converter.schemas.enable=true
offset.storage.topic=connect-offsets
offset.storage.replication.factor=1
config.storage.topic=connect-configs
config.storage.replication.factor=1
status.storage.topic=connect-status
status.storage.replication.factor=1
offset.flush.interval.ms=10000
plugin.path=/usr/share/java
The Connect service is up and running, but it doesn't load the connectors defined under /etc/kafka/connect-standalone.properties.
What should be done so that whenever you run systemctl start confluent-kafka-connect, the service starts and also loads the connectors defined under /etc/kafka-connect-*/, just like when you run a standalone connector manually and provide the paths to the properties files?
it runs the service and starts the defined connectors under /etc/kafka-connect-*/
That's not how distributed mode works... It doesn't know which property files you want to load, and it doesn't scan those folders.[1]
With standalone mode the N+1 property files that you give are loaded immediately, yes, but for connect-distributed you must use HTTP POST calls to the Connect REST API.
Confluent Control Center or Landoop's Connect UI can provide a nice management web portal for these operations.
By the way, if you have more than one broker, I'd suggest increasing the replication factors of those Connect topics in the connect-distributed.properties file.
[1] It might be a nice feature if it did, but then you would have to ensure connectors are never deleted/stopped in distributed mode, and you would just end up in an inconsistent state between what's running and the files on the filesystem.
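For illustration, here is a minimal sketch of registering a connector through the Connect REST API programmatically, using the Java 11+ HttpClient from Scala; the worker host and port and the connector name and config values are placeholders:
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object RegisterJdbcConnector {
  def main(args: Array[String]): Unit = {
    // Placeholder connector definition; adjust connection.url, topic.prefix, etc.
    val connectorJson =
      """{
        |  "name": "mysql-jdbc-source",
        |  "config": {
        |    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        |    "connection.url": "jdbc:mysql://mysql-host:3306/mydb?user=myuser&password=mypass",
        |    "mode": "incrementing",
        |    "incrementing.column.name": "id",
        |    "topic.prefix": "mysql-"
        |  }
        |}""".stripMargin

    // POST to one of the distributed workers; placeholder host, default REST port 8083
    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://node1IP:8083/connectors"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
      .build()

    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    println(s"${response.statusCode()} ${response.body()}")
  }
}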
I can describe what I did to start the JDBC connector in distributed mode:
On my local machine, I am using the Confluent CLI utility to boot up the services faster.
./confluent start
Afterwards I stopped kafka-connect:
./confluent stop connect
and then I proceeded to manually start the customized connect-distributed workers on two different ports (18083 and 28083):
➜ bin ./connect-distributed ../etc/kafka/connect-distributed-worker1.properties
➜ bin ./connect-distributed ../etc/kafka/connect-distributed-worker2.properties
NOTE: Set the plugin.path setting to the full (not relative) path, e.g. plugin.path=/full/path/to/confluent-5.0.0/share/java
Then I can easily add a new connector:
curl -s -X POST -H "Content-Type: application/json" --data @/full/path/to/confluent-5.0.0/etc/kafka-connect-jdbc/source-quickstart-sqlite.json http://localhost:18083/connectors
This should do the trick.
As already pointed out by cricket_007, consider a replication factor of at least 3 for those Connect topics in case you're dealing with data that you don't want to lose if one of the brokers has an outage.
A connector in distributed mode cannot be deployed via a property file as in standalone mode. Use the REST API instead; please refer to https://docs.confluent.io/current/connect/managing/configuring.html#connect-managing-distributed-mode

HDFS-sink connector: No FileSystem for scheme: http

I'm following the Confluent documentation, but when I add an HDFS sink connector, I get this error:
Caused by: java.io.IOException: No FileSystem for scheme: http
Could anyone help me, please?
The HDFS sink connector doesn't work with HTTP URLs (such as HttpFS).
You need to give it a supported Hadoop Compatible FileSystem, such as:
hdfs://
file:// (will write to the local disk of individual Connect workers in distributed mode; works best in standalone mode)
s3a:// (Assuming hadoop-aws on CLASSPATH)
wasb:// (Assuming hadoop-azure on CLASSPATH)
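For illustration, a minimal sketch of an HDFS sink connector configuration with a supported scheme, written here as a Scala map; the namenode host/port, topic name, and flush size are placeholders:
// Sketch only: the same keys would go into the "config" object POSTed to the
// Connect REST API; hdfs.url must use a supported scheme such as hdfs://
val hdfsSinkConfig = Map(
  "connector.class" -> "io.confluent.connect.hdfs.HdfsSinkConnector",
  "topics"          -> "my-topic",
  "hdfs.url"        -> "hdfs://namenode:8020",
  "flush.size"      -> "3",
  "tasks.max"       -> "1"
)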

spark-submit gets idle in local mode

I am trying to test a jar using spark-submit (Spark 1.6.0) on a Cloudera cluster, which has Kerberos enabled.
The fact is that if I launch this command:
spark-submit --master local --class myDriver myApp.jar -c myConfig.conf
With local or local[*] as the master, the process gets stuck after a couple of stages. However, if I use the yarn-client or yarn-cluster master modes, the process finishes correctly. The job reads and writes some files in HDFS.
Furthermore, these traces appear:
17/07/05 16:12:51 WARN spark.SparkContext: Requesting executors is only supported in coarse-grained mode
17/07/05 16:12:51 WARN spark.ExecutorAllocationManager: Unable to reach the cluster manager to request 1 total executors!
It is surely a matter of configuration, but the fact is that I don't know what is happening. Any ideas? What configuration options should I change?

Error running multiple kafka standalone hdfs connectors

We are trying to launch multiple standalone kafka hdfs connectors on a given node.
For each connector, we are setting rest.port and offset.storage.file.filename to a different port and path, respectively.
Also, the Kafka broker's JMX port is 9999.
When I start the Kafka standalone connector, I get the error:
Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 9999; nested exception is:
java.net.BindException: Address already in use (Bind failed)
This happens even though rest.port is set to 9100.
kafka version: 2.12-0.10.2.1
kafka-connect-hdfs version: 3.2.1
Please help.
We are trying to launch multiple standalone kafka hdfs connectors on a given node.
Have you considered running these multiple connectors within a single instance of Kafka Connect? This might make things easier.
Kafka Connect itself can handle running multiple connectors within a single worker process. Kafka Connect in distributed mode can run on a single node, or across multiple ones.
For those who are trying to use the rest.port flag and are still getting the Address already in use error: that flag was marked as deprecated in KIP-208 and finally removed in a later PR.
From that point on, listeners can be used to change the default REST port.
Examples from the Javadoc:
listeners=HTTP://myhost:8083
listeners=HTTP://:8083
Configuring and Running Workers - Standalone mode
You may have Kafka Connect processes running that you don't know about. You can check for them with:
ps -ef | grep connect
If you find any, kill those processes.

Setup kafka-connect to fetch data from remote brokers

I'm trying to set up a Kafka Connect sink connector. Kafka Connect is part of the Kafka Connect worker (confluent-3.2.0). I have a Kafka broker (confluent-3.2.0) up and running on machine A. I want to set up a Kafka Connect sink connector on another machine B to consume messages, using a custom sink connector jar. Assume that the Kafka broker and ZooKeeper ports on machine A are open to machine B.
So should I install/set up confluent-3.2.0 on machine B (since Kafka Connect is part of the Kafka package), add the sink connector jar to the classpath, and run the following command?
./bin/connect-distributed.sh worker.properties
Yes. What you describe will work and is the easiest way to set up this system, even though on machine B you really only need the start script, the configuration properties file, the jars for Kafka Connect, and the jars for the custom connector.