Spark-submit --class command not found? - scala

I am running a project with Kafka and Apache Spark. To run my Kafka stream, I run this command from inside the project directory:
$SPARK_HOME/spark-submit --class "TwitterStream" --master local[*] target/scala-2.11/scalakafka_2.11-0.1.jar
However, I simply get the following error:
bash: /spark-submit: No such file or directory
If I add \ to the end of the command, it seems to enter spark-submit, but nothing happens!

First, check that you have configured SPARK_HOME properly by running echo $SPARK_HOME. If it is not set, run export SPARK_HOME=<path to your Spark folder>. Also note that spark-submit lives in the bin directory of the Spark distribution, so the command should start with $SPARK_HOME/bin/spark-submit.

Is it possible to run a Spark Scala script without going inside spark-shell?

The only two ways I know to run Scala-based Spark code are to either compile a Scala program into a jar file and run it with spark-submit, or to run a Scala script by using :load inside the spark-shell. My question is: is it possible to run a Scala file directly from the command line, without first going into spark-shell and then issuing :load?
You can simply use stdin redirection with spark-shell:
spark-shell < YourSparkCode.scala
This command starts spark-shell, interprets your YourSparkCode.scala line by line, and quits at the end.
Another option is to use the -I <file> option of the spark-shell command:
spark-shell -I YourSparkCode.scala
The only difference is that the latter command leaves you inside the shell, and you must issue the :quit command to close the session.
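For reference, here is a minimal sketch of what such a YourSparkCode.scala might contain (the file name and its contents are hypothetical, not taken from the original question):
// YourSparkCode.scala -- hypothetical script run via stdin redirection or -I
// spark-shell already provides the `spark` SparkSession, so no setup is needed here.
val numbers = spark.range(1, 11)                             // DataFrame with ids 1..10
val total = numbers.selectExpr("sum(id)").first().getLong(0) // sum the id column
println(s"Sum of 1..10 = $total")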
[Update]
Passing parameters
Since spark-shell does not execute your source as an application but just interprets your source file line by line, you cannot pass any parameters directly as application arguments.
Fortunately, there are many options to achieve the same result (e.g. externalizing the parameters in another file and reading it at the very beginning of your script).
But I personally find Spark configuration the cleanest and most convenient way.
You pass your parameters via the --conf option:
spark-shell --conf spark.myscript.arg1=val1 --conf spark.yourspace.arg2=val2 < YourSparkCode.scala
(please note that the spark. prefix in the property name is mandatory; otherwise Spark will discard your property as invalid)
And read these arguments in your Spark code as below:
val arg1: String = spark.conf.get("spark.myscript.arg1")
val arg2: String = spark.conf.get("spark.myscript.arg2")
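If a parameter may be left unset, you can also read it with a fallback (a small sketch using the same hypothetical property names as above):
// Falls back to the given default when the property is not set
val arg1WithDefault: String = spark.conf.get("spark.myscript.arg1", "defaultValue1")
// Returns None when the property is not set
val maybeArg2: Option[String] = spark.conf.getOption("spark.yourspace.arg2")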
It is possible via spark-submit.
https://spark.apache.org/docs/latest/submitting-applications.html
You can even put it into a bash script or create an sbt task
https://www.scala-sbt.org/1.x/docs/Tasks.html
to run your code.
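As an illustration, a custom sbt task along those lines could look roughly like this (a sketch only; the task name, main class, and master are assumptions borrowed from the first question above, and spark-submit is assumed to be on your PATH):
// build.sbt (sketch)
import scala.sys.process._

lazy val sparkSubmit = taskKey[Unit]("Package the project and run it with spark-submit")

sparkSubmit := {
  val jar = (Compile / packageBin).value           // build the jar first
  val cmd = Seq(
    "spark-submit",
    "--class", "TwitterStream",                    // assumed main class
    "--master", "local[*]",
    jar.getAbsolutePath
  )
  val exitCode = cmd.!                             // run spark-submit and wait for it to finish
  if (exitCode != 0) sys.error(s"spark-submit failed with exit code $exitCode")
}
You would then run it with sbt sparkSubmit.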

Spark-shell -i path/to/filename alternative

We have:
spark-shell -i path/to/script.scala
to run a Scala script. Is it possible to add something like this to the spark-defaults.conf file so that it always loads the Scala script on startup of spark-shell and thus does not have to be added to the command line?
I would like to use this to store imports, credentials and user-defined functions that I use regularly, so that I don't have to enter the commands every time I start spark-shell.
Thanks,
Shane
You can go to the Spark /bin directory, create a file spark-shell-new.cmd, and paste
spark-shell -i path/to/script.scala
into it, then run spark-shell-new in cmd like the default spark-shell.
You can do something like this:
:load <path_to_script>
Write all the required lines of code in that script, as in the sketch below.
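For example, such a startup script might look like this (a hypothetical sketch; the imports, credential variables, and UDF are placeholders, not taken from the question):
// init.scala -- hypothetical script loaded with :load or spark-shell -i
import org.apache.spark.sql.functions._

// Credentials pulled from the environment rather than hard-coded (placeholder variable names)
val dbUser = sys.env.getOrElse("DB_USER", "")
val dbPass = sys.env.getOrElse("DB_PASS", "")

// A user-defined function available both in DataFrame code and in SQL
val normalize = udf((s: String) => if (s == null) "" else s.trim.toLowerCase)
spark.udf.register("normalize", (s: String) => if (s == null) "" else s.trim.toLowerCase)

println("Init script loaded")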

kafka command error: Could not find or load main class >-Djava.net.preferIPv4Stack=true

I'm using the Confluent Kafka platform. I want to use the command-line tool to list all topics, but it shows the error below. Under the ./bin folder:
$ ./kafka-topics --list --zookeeper mykafkaaddress:port
Error: Could not find or load main class >-Djava.net.preferIPv4Stack=true
I already have $JAVA_HOME set:
$ echo $JAVA_HOME
/usr/lib/jvm/java-1.8.0-openjdk
Somewhere within the kafka-topics or kafka-run-class scripts, or in your environment, a stray > has been inserted in front of -Djava.net.preferIPv4Stack=true.
For example, you might have done export KAFKA_OPTS=">-Djava.net.preferIPv4Stack=true" or some other export that gets appended to the internal java command.
Starting a new terminal session, inspecting your environment variables, or fixing the script are some ways to resolve the issue.

spark-notebook: command not found

I want to set up Spark Notebook on my laptop following the instructions listed at http://spark-notebook.io. I ran the command bin/spark-notebook and I'm getting:
-bash: bin/spark-notebook: command not found
How do I resolve this? I want to run spark-notebook for Spark standalone and Scala.
You can download
spark-notebook-0.7.0-pre2-scala-2.10.5-spark-1.6.3-hadoop-2.7.2-with-parquet.tgz
and set the path in .bashrc.
Example:
$ sudo gedit ~/.bashrc
export SPARK_HOME=/your/path/
export PATH=$PATH:$SPARK_HOME/bin
Then start your notebook with the following command:
$ spark-notebook

PySpark: can we run an hql file from pyspark code?

I have PySpark code which writes HQL commands to a .hql file. I thought of using the subprocess library to run the hql file directly, but when I do so my hql isn't running and the program just exits without errors.
I know I can use sqlContext to read each and every line from the hql file and run it individually, but I want to run the hql file from a subprocess command. Isn't this possible?
Note: I use spark-submit to run the .py code.
You can submit it directly in a shell script with spark-sql:
$ spark-sql --master yarn-client <..other parameters for executor memory etc..> -i ./script.hql
spark-sql internally invokes spark-submit.