SPARK: Perform linear/logistic regression from spark-glmnet package - scala

I'm new to Spark, and for the last few weeks I've been learning about the methods implemented in it. This time I want to use the functions implemented in the spark-glmnet package. I am most interested in running logistic regression.
I downloaded the source files and created a fat JAR using the command:
sbt assembly
When the process was done, I copied the JAR file to a server and ran the Spark shell:
export HADOOP_CONF_DIR=/opt/etc-hadoop/;
/opt/spark-1.5.0-bin-hadoop2.4/bin/spark-shell \
--master yarn-client \
--num-executors 5 \
--executor-cores 6 \
--executor-memory 8g \
--jars /opt/spark-glmnet-assembly-1.5.jar,some_other_jars \
--driver-class-path /usr/share/hadoop-2.2.0/share/hadoop/common/lib/mysql-connector-java-5.1.30.jar
But I don't know how to run functions from this package in Spark. How can I, for example, perform logistic regression with coordinate descent?

The answer was really easy:
sc.addJar("path_to_my_jar")

Related

How to pass different filenames to spark using scala

I have the code below running on a cluster:
def main(args: Array[String]) {
  val spark = SparkSession.builder.appName("SparkData").getOrCreate()
  val sc = spark.sparkContext
  sc.setLogLevel("ERROR")
  import spark.implicits._
  import spark.sql
  //----------Write Logic Here--------------------------
  // Read csv file
  val df = spark.read.format("csv").load("books.csv") // Here I want to accept a parameter
  df.show()
  spark.stop()
}
I want to pass different files to spark.read.format using the spark-submit command.
The files are on my Linux box.
I used this:
csv_file="/usr/usr1/Test.csv"
spark2-submit \
--num-executors 30 \
--driver-memory 12g \
--executor-memory 14g \
--executor-cores 4 \
--class driver_class \
--name TTTTTT \
--master yarn \
--deploy-mode cluster \
--files myprop.properties,${csv_file} \
abc.jar
However, the program only looks for the path starting from the root folder of the HDFS cluster and fails with a file-not-found exception.
Can anyone please help me use the file from the path I specify? I want my Spark program to read the file from the path I give, not from the HDFS root.
I tried:
def main(args: Array[String]) {
  val spark = SparkSession.builder.appName("SparkData").getOrCreate()
  val sc = spark.sparkContext
  sc.setLogLevel("ERROR")
  import spark.implicits._
  import spark.sql
  val filepath = args(0)
  //----------Write Logic Here--------------------------
  // Read csv file
  val df = spark.read.format("csv").load(filepath) // Here I want to accept a parameter
  df.show()
  spark.stop()
}
I used the command below to submit it, which doesn't work:
csv_file="/usr/usr1/Test.csv"
spark2-submit \
--num-executors 30 \
--driver-memory 12g \
--executor-memory 14g \
--executor-cores 4 \
--class driver_class \
--name TTTTTT \
--master yarn \
--deploy-mode cluster \
--files myprop.properties \
abc.jar ${csv_file}
But the program is not picking up the file. Can anyone please help?
The local files URL format should be:
csv_file="file:///usr/usr1/Test.csv"
Note that the local files must also be accessible at the same path on all worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
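A minimal sketch of building such a URL from an absolute path, assuming a POSIX shell and a path without characters that would need percent-encoding. The helper name to_file_url is hypothetical and not part of Spark:

```shell
# Hypothetical helper: turns an absolute local path into a file:// URL.
# It does no percent-encoding, so it assumes a plain path.
to_file_url() {
  case "$1" in
    /*) printf 'file://%s\n' "$1" ;;
    *)  echo "expected an absolute path: $1" >&2; return 1 ;;
  esac
}

# Path taken from the question above.
csv_file="$(to_file_url /usr/usr1/Test.csv)"
echo "$csv_file"
```

The resulting value can then be passed as the program argument in place of the bare path.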
I don't have a cluster at hand right now, so I cannot test this. However:
You submit the code to YARN in cluster mode, so the Spark driver is deployed on one of the cluster's nodes, and you don't know which one.
When reading a path that starts with "file://", Spark looks for the file on the local file system of the node the driver is running on. A path with no scheme is resolved against the default file system from the Hadoop configuration, which on a YARN cluster is usually HDFS.
As you've seen, spark-submit --files copies the file into the working directory of the Spark driver (so on the node running the driver). That path is kind of arbitrary, and you should not try to infer it.
But it might work to pass just the file name as the argument to spark.read and let the Spark driver look for it in its working directory (I didn't check):
spark-submit \
... \
--files ...,/path/to/your/file.csv \
abc.jar file.csv
=> The proper/standard way to do it is to first copy your file(s) to HDFS, or to another distributed file system the Spark cluster has access to. Then you can give the Spark app the HDFS file path to use. Something like (again, I didn't test it):
hdfs dfs -put /path/to/your/file.csv /user/your/data
spark-submit ... abc.jar hdfs:///user/your/data/file.csv
For info, in case you don't know: to use the hdfs command, you need an HDFS client installed on your machine (the actual hdfs command), with a suitable configuration pointing to the HDFS cluster. There is also usually some security configuration to do on the cluster so that the client can communicate with it. But that is another issue, which depends on where HDFS is running (local, AWS, ...).
Replace ${csv_file} at the end of your spark-submit command with basename ${csv_file}:
spark2-submit \
... \
--files myprop.properties,${csv_file} \
abc.jar `basename ${csv_file}`
basename strips the directory part from the full path leaving only the file name:
$ basename /usr/usr1/foo.csv
foo.csv
That way Spark will copy the file to the staging directory and the driver program should be able to access it by its relative path. If the cluster is configured to stage on HDFS, the executors will also have access to the file.
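The file-name step above can be tried without a cluster. This is a minimal sketch, assuming a POSIX shell; it only prints the arguments that would follow spark2-submit and submits nothing:

```shell
# Path taken from the question above.
csv_file="/usr/usr1/Test.csv"

# Two equivalent ways to drop the directory part of the path.
name_via_basename="$(basename "$csv_file")"
name_via_expansion="${csv_file##*/}"

# Print the relevant arguments instead of running spark2-submit.
echo "--files myprop.properties,${csv_file} abc.jar ${name_via_basename}"
```

The parameter expansion form avoids starting an external process, but both give the same file name here.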

Where to add spark commands in python code

I am very new to this spark python world so I have another question. It's good to know I can write spark commands in spark shell or python code, so:
pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.1 \
--files path_to/secure-connect-test.zip \
--conf spark.cassandra.connection.config.cloud.path=secure-connect-test.zip \
--conf spark.cassandra.auth.username=UserName \
--conf spark.cassandra.auth.password=Password \
--conf spark.dse.continuousPagingEnabled=false
If I want to put this part inside my Python code, do I have to add it with the os.environ command? I have seen some posts that use that command, but they only add a variable. Thanks

Run PySpark job from .egg instead of .py

I am trying to run a PySpark job using Dataproc. The only difference from all the examples out there is that I want to submit the job from an .egg file instead of a .py file.
On a regular commodity cluster, submitting the PySpark job would look something like this:
spark2-submit --master yarn \
--driver-memory 20g \
--deploy-mode client \
--conf parquet.compression=SNAPPY \
--jars spark-avro_2.11-3.2.0.jar \
--py-files dummyproject-1_spark-py2.7.egg \
dummyproject-1_spark-py2.7.egg#__main__.py "param1" "param2"
Now, I want to submit exactly the same job but using Dataproc.
In order to accomplish this I am using the following command:
gcloud dataproc jobs submit pyspark \
file:///dummyproject-1_spark-py2.7.egg#__main__.py \
--cluster=my-cluster-001 \
--py-files=file:///dummyproject-1_spark-py2.7.egg
The error I am getting is:
Error: Cannot load main class from JAR
file:/dummyproject-1_spark-py2.7.egg
It is important to mention that when I try to run a simple PySpark job using .py file, it is working correctly.
Can somebody tell me, how can I run a PySpark job from .egg file instead of .py file?
It looks like there is a bug in how gcloud dataproc parses the arguments, which makes Spark try to execute your file as a Java JAR file. A workaround is to copy your __main__.py file out of your egg file and execute it independently, like this:
gcloud dataproc jobs submit pyspark \
--cluster=my-cluster-001 \
--py-files=file:///dummyproject-1_spark-py2.7.egg \
file:///__main__.py

Passing sys props to Spark 1.5 especially properties with spaces in it

After using Spark 1.2 for quite a long time, I have realised that you can no longer pass Spark configuration to the driver via --conf on the command line.
I am thinking about using system properties and picking the config up using the following bit of code:
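import org.apache.spark.SparkConf
// Look the key up in SparkConf first, then fall back to JVM system properties.
def getConfigOption(conf: SparkConf, name: String): Option[String] =
  conf.getOption(name).orElse(sys.props.get(name))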
How do I pass a config.file option and a string version of the date specified as a start time to a spark-submit command?
I have attempted using the following in my start up shell script:
--conf "spark.executor.extraJavaOptions=-Dconfig.file=../conf/application.conf -DstartTime=2016-06-04 00:00:00"
but this fails, as the space splits the command up.
Any idea how to do this successfully, or has anyone got any advice on this one?
I am editing this to show the bash script being used:
#!/bin/bash
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}
LIB_DIRECTORY=/opt/app/latest/lib/
ANALYSIS_JAR=spark-fa-2.16.18-standalone.jar
ANALYSIS_DRIVER_CLASS=com.spark.fa.Main
OTHER_OPTIONS=""
KEYTAB="/opt/app/keytab/fa.keytab"
PRINCIPAL="spark_K"
CLUSTER_OPTIONS=" \
--master yarn-client \
--driver-memory 2000M \
--executor-memory 5G \
--num-executors 39 \
--executor-cores 5 \
--conf spark.default.parallelism=200 \
--driver-java-options=-Dconfig.file=../conf/application.conf \
--conf "spark.executor.extraJavaOptions=-DstartTime='2016-06-04 00:00:00'" \
--conf spark.storage.memoryFraction=0.9 \
--files /opt/app/latest/conf/application.conf \
--conf spark.storage.safetyFraction=0.9 \
--keytab ${KEYTAB} \
--principal ${PRINCIPAL} \
"
spark-submit --class ${ANALYSIS_DRIVER_CLASS} ${CLUSTER_OPTIONS} ${LIB_DIRECTORY}/${ANALYSIS_JAR} ${CONFIG} ${#}
Use quotes:
--conf "spark.executor.extraJavaOptions=-Dconfig.file=../conf/application.conf -DstartTime='2016-06-04 00:00:00'"
If your parameter contains both spaces and single quotes (for instance a query parameter), you should enclose it in escaped double quotes \"
Example:
spark-submit --master yarn --deploy-mode cluster --conf "spark.driver.extraJavaOptions=-DfileFormat=PARQUET -Dquery=\"select * from bucket where code in ('A')\" -Dchunk=yes" spark-app.jar
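The effect of the quotes can be checked at the shell level without Spark. This is a minimal sketch, assuming a POSIX shell; count_args is a hypothetical stand-in for spark-submit that only reports how many arguments it received:

```shell
# Hypothetical stand-in for spark-submit: prints the number of arguments it received.
count_args() {
  echo "$#"
}

# Option value containing spaces, as in the question above.
java_opts="spark.executor.extraJavaOptions=-Dconfig.file=../conf/application.conf -DstartTime=2016-06-04 00:00:00"

# Unquoted expansion: the shell splits the value at every space.
unquoted_count="$(count_args --conf $java_opts)"

# Quoted expansion: the whole value travels as one argument after --conf.
quoted_count="$(count_args --conf "$java_opts")"

echo "unquoted: $unquoted_count arguments, quoted: $quoted_count arguments"
```

The same splitting applies when all options are collected in one string variable and expanded unquoted, as CLUSTER_OPTIONS is in the script from the question. In bash, an array expanded as "${array[@]}" is the usual way to keep each argument intact.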

How to pass -D parameter or environment variable to Spark job?

I want to change the Typesafe config of a Spark job between the dev and prod environments. It seems to me that the easiest way to accomplish this is to pass -Dconfig.resource=ENVNAME to the job. Then the Typesafe config library will do the job for me.
Is there a way to pass that option directly to the job? Or maybe there is a better way to change the job config at runtime?
EDIT:
Nothing happens when I add --conf "spark.executor.extraJavaOptions=-Dconfig.resource=dev" option to spark-submit command.
I got Error: Unrecognized option '-Dconfig.resource=dev'. when I pass -Dconfig.resource=dev to spark-submit command.
Change the spark-submit command line by adding three options:
--files <location_to_your_app.conf>
--conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app'
--conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app'
Here is how I run my Spark program with additional Java options:
/home/spark/spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
--files /home/spark/jobs/fact_stats_ad.conf \
--conf spark.executor.extraJavaOptions=-Dconfig.fuction.conf \
--conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.MostAvailableFirstPolicy -Dconfig.file=/home/spark/jobs/fact_stats_ad.conf' \
--class jobs.DiskDailyJob \
--packages com.databricks:spark-csv_2.10:1.4.0 \
--jars /home/spark/jobs/alluxio-core-client-1.2.0-RC2-jar-with-dependencies.jar \
--driver-memory 2g \
/home/spark/jobs/convert_to_parquet.jar \
AD_COOKIE_REPORT FACT_AD_STATS_DAILY | tee /data/fact_ad_stats_daily.log
As you can see, it passes:
the custom config file
--files /home/spark/jobs/fact_stats_ad.conf
the executor Java options
--conf spark.executor.extraJavaOptions=-Dconfig.fuction.conf
the driver Java options
--conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.MostAvailableFirstPolicy -Dconfig.file=/home/spark/jobs/fact_stats_ad.conf'
Hope it helps.
I had a lot of problems passing -D parameters to the Spark executors and the driver. Here is a quote from my blog post about it:
"
The right way to pass the parameter is through the properties
"spark.driver.extraJavaOptions" and "spark.executor.extraJavaOptions":
I've passed both the log4j configuration property and the parameter that I needed for the configuration. (To the driver I was able to pass only the log4j configuration.)
For example (written in a properties file passed to spark-submit with "--properties-file"):
"
spark.driver.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/log4j.properties -
spark.executor.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/log4j.properties -Dapplication.properties.file=hdfs:///some/path/on/hdfs/app.properties
spark.application.properties.file hdfs:///some/path/on/hdfs/app.properties
"
You can read my blog post about overall configurations of spark.
I am running on YARN as well.
--files <location_to_your_app.conf>
--conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app'
--conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app'
If you repeat the same property in several --conf options, the later one overwrites the earlier one. You can verify this after the job has started by looking at the Environment tab of the Spark UI.
So the correct way is to put all the options for one property on the same line, like this:
--conf 'spark.executor.extraJavaOptions=-Da=b -Dc=d'
If you do this, all your settings will be shown in the Spark UI.
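A minimal sketch of collecting several -D options into one value for a single --conf, assuming a POSIX shell. The helper name join_java_opts is hypothetical, and the option names are the placeholders from the line above:

```shell
# Hypothetical helper: joins all of its arguments into one space-separated string.
join_java_opts() {
  printf '%s\n' "$*"
}

# Collect every executor-side -D option first, then build a single property value.
executor_opts="$(join_java_opts -Da=b -Dc=d)"
conf_arg="spark.executor.extraJavaOptions=${executor_opts}"

# Print the option as it would be written on the spark-submit command line.
echo "--conf '${conf_arg}'"
```

Keeping the joined value in one variable makes it harder to pass the same property twice by accident.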
I am starting my Spark application via a spark-submit command launched from within another Scala application. So I have an Array like
Array(".../spark-submit", ..., "--conf", confValues, ...)
where confValues is:
for yarn-cluster mode:
"spark.driver.extraJavaOptions=-Drun.mode=production -Dapp.param=..."
for local[*] mode:
"run.mode=development"
It is a bit tricky to understand where (not) to escape quotes and spaces, though. You can check the Spark web interface for system property values.
spark-submit --driver-java-options "-Denv=DEV -Dmode=local" --class co.xxx.datapipeline.jobs.EventlogAggregator target/datapipeline-jobs-1.0-SNAPSHOT.jar
The above command works for me:
-Denv=DEV => to read DEV env properties file, and
-Dmode=local => to create SparkContext in local - .setMaster("local[*]")
Use a command like the one below; it may be helpful for you:
spark-submit --master local[2] \
--conf 'spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/tmp/log4j.properties' \
--conf 'spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/tmp/log4j.properties' \
--class com.test.spark.application.TestSparkJob target/application-0.0.1-SNAPSHOT-jar-with-dependencies.jar prod
I have tried it and it worked for me. I would also suggest going through the Spark documentation page below, which is really helpful:
https://spark.apache.org/docs/latest/running-on-yarn.html
I originally had this config file:
my-app {
environment: dev
other: xxx
}
This is how I'm loading my config in my spark scala code:
import java.io.File
import com.typesafe.config.ConfigFactory
val config = ConfigFactory.parseFile(new File("my-app.conf"))
  .withFallback(ConfigFactory.load())
  .resolve()
  .getConfig("my-app")
With this setup, despite what the Typesafe Config documentation and all the other answers say, the system property override didn't work for me when I launched my spark job like so:
spark-submit \
--master yarn \
--deploy-mode cluster \
--name my-app \
--driver-java-options='-XX:MaxPermSize=256M -Dmy-app.environment=prod' \
--files my-app.conf \
my-app.jar
To get it to work I had to change my config file to:
my-app {
environment: dev
environment: ${?env.override}
other: xxx
}
and then launch it like so:
spark-submit \
--master yarn \
--deploy-mode cluster \
--name my-app \
--driver-java-options='-XX:MaxPermSize=256M -Denv.override=prod' \
--files my-app.conf \
my-app.jar