I am learning to develop Spark applications using Scala, and I am in my very first steps.
I have my Scala IDE on Windows, configured and running smoothly when reading files from the local drive. However, I have access to a remote HDFS cluster and a Hive database, and I want to develop, try, and test my applications against that Hadoop cluster... but I don't know how :(
If I try
val rdd=sc.textFile("hdfs://masternode:9000/user/hive/warehouse/dwh_db_jrtf.db/discipline")
I will get an error that contains:
Exception in thread "main" java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: "MyLap/11.22.33.44"; destination host is: "masternode":9000;
Can anyone guide me, please?
You can use SBT to package your code into a .jar file. scp your file to your node, then try to submit it with spark-submit:
spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
You can't access your cluster from your Windows machine in that way.
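For reference, here is a minimal sketch of the kind of application you would package this way; the object name is only an example, the master is deliberately left to spark-submit's --master flag, and the HDFS path is the one from the question.

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of an application to package with SBT and run via spark-submit.
// The master is not set here so that spark-submit's --master flag controls it.
object DisciplineCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DisciplineCount")
    val sc = new SparkContext(conf)

    // HDFS path taken from the question; adjust it to your cluster.
    val rdd = sc.textFile("hdfs://masternode:9000/user/hive/warehouse/dwh_db_jrtf.db/discipline")
    println(s"Line count: ${rdd.count()}")

    sc.stop()
  }
}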
I have Spark jobs running on YARN. These days I'm moving to Spark on Kubernetes.
On Kubernetes I'm having an issue: files uploaded via --files can't be read by the Spark driver.
On YARN, as described in many answers, I can read those files using Source.fromFile(filename).
But I can't read the files in Spark on Kubernetes.
Spark version: 3.0.1
Scala version: 2.12.6
deploy-mode: cluster
submit commands
$ spark-submit --class <className> \
--name=<jobName> \
--master=k8s://https://api-hostname:6443 \
...
--deploy-mode=cluster \
--files app.conf \
--conf spark.kubernetes.file.upload.path=hdfs://<nameservice>/path/to/sparkUploads/ \
app.jar
After executing the above command, app.conf is uploaded to hdfs://<nameservice>/path/to/sparkUploads/spark-upload-xxxxxxx/,
and in the driver's pod I found app.conf in the /tmp/spark-******/ directory, along with app.jar.
But the driver can't read app.conf: Source.fromFile(filename) returns null, and there were no permission problems.
Update 1
In the Spark Web UI -> "Environment" tab, spark://<pod-name>-svc.ni.svc:7078/files/app.conf appears under "Classpath Entries". Does this mean app.conf is available on the classpath?
On the other hand, in Spark on YARN the user.dir property was included in the system classpath.
I found SPARK-31726: Make spark.files available in driver with cluster deploy mode on kubernetes
Update 2
I found that the driver pod's /opt/spark/work-dir/ directory was included in the classpath,
but /opt/spark/work-dir/ is empty on the driver pod, whereas on the executor pod it contains app.conf and app.jar.
I think that is the problem, and SPARK-31726 describes it.
Update 3
After reading Jacek's answer, I tested org.apache.spark.SparkFiles.getRootDirectory().
It returns /var/data/spark-357eb33e-1c17-4ad4-b1e8-6f878b1d8253/spark-e07d7e84-0fa7-410e-b0da-7219c412afa3/userFiles-59084588-f7f6-4ba2-a3a3-9997a780af24
Update 4 - workaround
First, I create ConfigMaps to hold the files that I want to read in the driver/executors.
Next, the ConfigMaps are mounted on the driver/executors. To mount a ConfigMap, use a Pod Template or the Spark Operator.
Files distributed with --files should be accessed using the SparkFiles.get utility:
get(filename: String): String
Get the absolute path of a file added through SparkContext.addFile().
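For example, a minimal sketch (in Scala, matching the question) of reading the distributed app.conf in the driver; the file name is the one used in the question's spark-submit command:

import org.apache.spark.SparkFiles
import scala.io.Source

// Resolve the absolute path of the file shipped with --files, then read it.
val confPath = SparkFiles.get("app.conf")
val source = Source.fromFile(confPath)
try {
  println(source.mkString)
} finally {
  source.close()
}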
I found another temporary solution in Spark 3.3.0.
We can use the --archives flag. Files that are not tar, tar.gz, or zip skip the unpacking step and are then placed in the working directory of the driver and executors.
Although the docs for --archives don't mention executors, I tested it and it works.
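Under that assumption, a short sketch of reading the file with a path relative to the working directory:

import scala.io.Source

// Assumes --archives placed app.conf in the driver/executor working directory.
val config = Source.fromFile("app.conf").mkString
println(config)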
When I run the pyspark program in the interactive shell, it is able to fetch the configuration file (config.ini) inside the pyspark script.
But when I try to run the same script using spark-submit with master yarn and deploy mode cluster, it gives me an error saying the config file does not exist. I have checked the YARN log and can see the same error. Below is the command for running the pyspark job.
spark2-submit --master yarn --deploy-mode cluster test.py /home/sys_user/ask/conf/config.ini
The spark2-submit command provides a --properties-file parameter; you can use it to make this properties file available to your job.
e.g. spark2-submit --master yarn --deploy-mode cluster --properties-file $CONF_FILE_NAME pyspark_script.py
Pass the ini file via the spark.files parameter:
.config('spark.files', 'config/local/config.ini') \
Read in pyspark:
from pyspark import SparkFiles

with open(SparkFiles.get('config.ini')) as config_file:
    print(config_file.read())
It works for me.
When using spark-submit in cluster mode (yarn-cluster), the jars and packages configuration confused me: for jars, I can put them in HDFS instead of in a local directory. But for packages, because they are built with Maven, it doesn't work with HDFS. My command is like below:
spark-submit --jars hdfs:///mysql-connector-java-5.1.39-bin.jar --driver-class-path /home/liac/test/mysql-connector-java-5.1.39/mysql-connector-java-5.1.39-bin.jar --conf "spark.mongodb.input.uri=mongodb://192.168.27.234/test.myCollection2?readPreference=primaryPreferred" --conf "spark.mongodb.output.uri=mongodb://192.168.27.234/test.myCollection2" --packages com.mongodb.spark:hdfs:///user/liac/package/jars/mongo-spark-connector_2.11-1.0.0-assembly.jar:1.0.0 --py-files /home/liac/code/diagnose_disease/tool.zip main_disease_tag_spark.py --master yarn-client
This error occurs:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. The coordinate provided is: com.mongodb.spark:hdfs:///user/liac/package/jars/mongo-spark-connector_2.11-1.0.0-assembly.jar:1.0.0
Can anyone tell me how to use jars and packages in cluster mode? And what's wrong with my approach?
Your use of the --packages argument is wrong:
--packages com.mongodb.spark:hdfs:///user/liac/package/jars/mongo-spark-connector_2.11-1.0.0-assembly.jar:1.0.0
It needs to be in the form of groupId:artifactId:version as the output suggests. You cannot use a URL with it.
An example of using MongoDB with Spark using the built-in repository support:
$SPARK_HOME/bin/spark-shell --packages org.mongodb.spark:mongo-spark-connector_2.11:1.0.0
If you insist on using your own jar, you can provide it via --repositories. The value of the argument is:
Comma-separated list of remote repositories to search for the Maven coordinates specified in packages.
For example, in your case, it could be
--repositories hdfs:///user/liac/package/jars/ --packages org.mongodb.spark:mongo-spark-connector_2.11:1.0.0
The situation is as follows:
I'm doing this on Windows 7, with MIT Kerberos client kfw 4.0.1. I'm connecting to a YARN cluster, via OpenVPN, that is secured with Kerberos 5. This cluster has been around for a while and it's been in use by other people, so the error is not likely to be on that side of things.
I can get a ticket via kinit (returns without error). However, once I try to do any of the following commands:
hdfs dfs -ls
spark-shell --master yarn
spark-submit anything --master yarn --deploy-mode cluster
essentially any spark or hadoop command on the cluster
I get the error: Can't get Kerberos realm (or Unable to locate Kerberos realm).
My krb5.ini file is in C:\ProgramData\MIT\Kerberos5
How can I further troubleshoot this?
Your JVM cannot locate the krb5.conf file. You have several options:
set JVM property: -Djava.security.krb5.conf=/path/to/krb5.conf
or put the krb5.conf file into the <jdk-home>/jre/lib/security folder
or put the krb5.conf file into the c:\winnt\ folder
More information about how the krb5.conf file is located can be found here: https://docs.oracle.com/javase/7/docs/technotes/guides/security/jgss/tutorials/KerberosReq.html
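If changing the file location or JVM options is not convenient, here is a sketch of setting the property programmatically; the path is the one from the question, and this must run before any Kerberos-related class is loaded:

// Sketch only: point the JVM at the krb5 config before Hadoop/Kerberos classes load.
// The path is taken from the question; adjust it to your machine.
System.setProperty("java.security.krb5.conf", "C:\\ProgramData\\MIT\\Kerberos5\\krb5.ini")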
I am getting this error when I want to run the SparkPi example.
beyhan#beyhan:~/spark-1.2.0-bin-hadoop2.4$ /home/beyhan/spark-1.2.0-bin-hadoop2.4/bin/spark-submit --master ego-client --class org.apache.spark.examples.SparkPi /home/beyhan/spark-1.2.0-bin-hadoop2.4/lib/spark-examples-1.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Error: Master must start with yarn, spark, mesos, or local
Run with --help for usage help or --verbose for debug output
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Also, I have already started my master in another terminal:
>./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /home/beyhan/spark-1.2.0-bin-hadoop2.4/sbin/../logs/spark-beyhan-org.apache.spark.deploy.master.Master-1-beyhan.out
Any suggestions?
Thanks.
Download and extract Spark:
$ cd ~/Downloads
$ wget -c http://archive.apache.org/dist/spark/spark-1.2.0/spark-1.2.0-bin-hadoop2.4.tgz
$ cd /tmp
$ tar zxf ~/Downloads/spark-1.2.0-bin-hadoop2.4.tgz
$ cd spark-1.2.0-bin-hadoop2.4/
Start master:
$ sbin/start-master.sh
Find the master's URL in the log file that the above command printed. Let's assume the master is: spark://ego-server:7077
In this case, you can also find your master URL by visiting http://localhost:8080/
Start one slave, and connect it to master:
$ sbin/start-slave.sh --master spark://ego-server:7077
Another way to ensure that the master is up and running is to start a shell bound to that master:
$ bin/spark-shell --master "spark://ego-server:7077"
If you get a spark shell, then everything seems fine.
Now execute your job:
$ find . -name "spark-example*jar"
./lib/spark-examples-1.2.0-hadoop2.4.0.jar
$ bin/spark-submit --master "spark://ego-server:7077" --class org.apache.spark.examples.SparkPi ./lib/spark-examples-1.2.0-hadoop2.4.0.jar
The error you're getting
Error: Master must start with yarn, spark, mesos, or local
means that --master ego-client is not recognized by Spark.
Use
--master local
for a local execution of Spark, or
--master spark://your-spark-master-ip:7077