acessing azure databricks data from kubernetes - kubernetes

I am running the following command from a kubernetes cluster to access a file from azure databricks
spark-submit --packages io.delta:delta-core_2.12:0.7.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" --conf "spark.delta.logStore.class=org.apache.spark.sql.delta.storage.HDFSLogStore" script.py
I am getting this error.
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2499)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2593)
Do I need to install any jars from hadoop azure. Please guide me

Related

Spark submit truncates arguments in yarn cluster mode

I am running spark application on yarn cluster in cluster deploy mode using following command
spark-submit --conf spark.executor.memory=24g --conf spark.master=yarn --conf spark.submit.deployMode=cluster --conf spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8 --conf spark.files=file:///opt/configurations/app.conf --class com.example.HelloWorld --queue sample_q file:///opt/jars/example.jar '{"sample":{}}'
This command is not passing the entire argument to HelloWorld class.
Main method argument passed : {"sample":{
Main method argument expected: {"sample":{}}
The same command is running properly with client deploy mode
spark-submit --conf spark.executor.memory=24g --conf spark.master=yarn --conf spark.submit.deployMode=client --conf spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8 --conf spark.files=file:///opt/configurations/app.conf --class com.example.HelloWorld --queue sample_q file:///opt/jars/example.jar '{"sample":{}}'
Upon inspecting the launch_container.sh script in yarn worker node it was found that the command also had truncated string within it (--arg '{\"sample\":{')
Spark Version: 2.3
Hadoop Version: 2.7.3
Yarn consider {{ and }} as parameter expansion character hence any occurrence is considered as an environment variable and replaced with the corresponding value. Since there is no environment variable.
This causes an issue in cluster deploy mode as driver runs in yarn cluster.
Reference: YarnApplicationConstants

Kubernetes spark-submit

I am trying to use kuberenets as cluster manger for spark. I also want to ship the container logs to splunk. Now I do have monitoring stack deployed (fluent-bit, prometheus etc)in the same namespace and the way it works is if your pod has a certain environment_variable it will start reading the logs and push it to splunk.
The thing I am not able to find is how do I set a environment variable and populate it
bin/spark-submit \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--master k8s://https://my-kube-cluster.com \
--conf spark.executor.instances=2 \
--conf spark.app.name=spark-pi \
....
....
....
--conf spark.kubernetes.driverEnv.UID="set it to spark driver pod id" \
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar
To configure additional Spark Driver Pod environment variables you can pass additional --conf spark.kubernetes.driverEnv.EnvironmentVariableName=EnvironmentVariableValue (please refer docs for more details).
To configure additional Spark Executor Pods environment variables you can pass additional --conf spark.executorEnv.EnvironmentVariableName=EnvironmentVariableValue (please refer docs for more details).
Hope it helps.

scala spark to read file from hdfs cluster

I am learning to develop spark applications using Scala. And I am in my very first steps.
I have my scala IDE on windows. configured and runs smoothly if reading files from local drive. However, I have access to a remote hdfs cluster and Hive database, and I want to develop, try, and test my applications against that Hadoop cluster... but I don't know how :(
If I try
val rdd=sc.textFile("hdfs://masternode:9000/user/hive/warehouse/dwh_db_jrtf.db/discipline")
I will get an error that contains:
Exception in thread "main" java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: "MyLap/11.22.33.44"; destination host is: "masternode":9000;
Can anyone guide me please ?
You can use SBT to package your code in a .jar file. scp your file on your Node then try to submit it by doing a spark-submit.
spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
You can't access to your Cluster from your Windows Machine in that way.

How to pass external configuration file to pyspark(Spark 2.x) program?

When I am running pyspark program interactive shell able to fetch the configuration file(config.ini) inside pyspark script,
But when I am trying to run same script using Spark submit command with master yarn and cluster deployment mode is cluster it giving me error as config file not exists, I have checked yarn log and able to see same, below is command for running the pyspark job.
spark2-submit --master yarn --deploy-mode cluster test.py /home/sys_user/ask/conf/config.ini
With spark2-sumbmit command there is parameter provided properties-file, you can use that to get this properties file available in spark-submit command.
e.g. spark2-submit --master yarn --deploy-mode cluster --properties-file $CONF_FILE_NAME pyspark_script.py
Pass the ini file in spark.files parameter
.config('spark.files', 'config/local/config.ini') \
Read in pyspark:
with open(SparkFiles.get('config.ini')) as config_file:
print(config_file.read())
It works for me.

how to use Spark-submit configuration: jars,packages:in cluster mode?

When use Spark-submit in cluster mode(yarn-cluster),jars and packages configuration confused me: for jars, i can put them in HDFS, instead of in local directory . But for packages, because they build with Maven, with HDFS,it can't work. my way like below:
spark-submit --jars hdfs:///mysql-connector-java-5.1.39-bin.jar --driver-class-path /home/liac/test/mysql-connector-java-5.1.39/mysql-connector-java-5.1.39-bin.jar --conf "spark.mongodb.input.uri=mongodb://192.168.27.234/test.myCollection2?readPreference=primaryPreferred" --conf "spark.mongodb.output.uri=mongodb://192.168.27.234/test.myCollection2" --packages com.mongodb.spark:hdfs:///user/liac/package/jars/mongo-spark-connector_2.11-1.0.0-assembly.jar:1.0.0 --py-files /home/liac/code/diagnose_disease/tool.zip main_disease_tag_spark.py --master yarn-client
error occur:
`Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. The coordinate provided is: com.mongodb.spark:hdfs:///user/liac/package/jars/mongo-spark-connector_2.11-1.0.0-assembly.jar:1.0.0
Anyone can tell me how to use jars and packages in cluster mode? and what's wrong with my way?
Your use of the --packages argument is wrong:
--packages com.mongodb.spark:hdfs:///user/liac/package/jars/mongo-spark-connector_2.11-1.0.0-assembly.jar:1.0.0
It needs to be in the form of groupId:artifactId:version as the output suggests. You cannot use a URL with it.
An example for using mongoDB with spark with the built-in repository support:
$SPARK_HOME/bin/spark-shell --packages org.mongodb.spark:mongo-spark-connector_2.11:1.0.0
If you insist on using your own jar you can provide it via --repositories. The value of the argument is
Comma-separated list of remote repositories to search for the Maven coordinates specified in packages.
For example, in your case, it could be
--repositories hdfs:///user/liac/package/jars/ --packages org.mongodb.spark:mongo-spark-connector_2.11:1.0.0