I have SSH-ed into the Amazon EMR server and I want to submit a Spark job written in Python from the terminal (a simple word-count script and a sample.txt are both on the EMR server). How do I do this, and what is the syntax?
The word_count.py is as follows:
from pyspark import SparkConf, SparkContext
from operator import add
import sys

## Constants
APP_NAME = " HelloWorld of Big Data"

## OTHER FUNCTIONS/CLASSES
def main(sc, filename):
    textRDD = sc.textFile(filename)
    words = textRDD.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))
    wordcount = words.reduceByKey(add).collect()
    for wc in wordcount:
        print(wc[0], wc[1])

if __name__ == "__main__":
    # Configure Spark
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster("local[*]")
    sc = SparkContext(conf=conf)
    sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XXXX")
    sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "YYYY")
    filename = "s3a://bucket_name/sample.txt"
    # filename = sys.argv[1]
    # Execute Main functionality
    main(sc, filename)
You can run this command:
spark-submit s3://your_bucket/your_program.py
If you need to run the script using Python 3, run this command before spark-submit:
export PYSPARK_PYTHON=python3.6
Remember to save your program in an S3 bucket before running spark-submit.
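If the script and sample.txt are already on the EMR master node (as in the question), you can also point spark-submit at the local file; a minimal sketch, assuming word_count.py sits in your current directory:
spark-submit word_count.py
Because the script hard-codes the master (local[*]) and the input path, no extra options are needed. If you switch to the commented-out filename = sys.argv[1] line, pass the input path after the script name, for example: spark-submit word_count.py s3a://bucket_name/sample.txt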
I am trying to load a table from Salesforce using Spark. I invoked this code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._

object Sample {
  def main(arg: Array[String]) {
    val spark = SparkSession.builder().
      appName("salesforce").
      master("local[*]").
      getOrCreate()

    val tableName = "Opportunity"
    val outputPath = "output/result" + tableName

    val salesforceDf = spark.
      read.
      format("jdbc").
      option("url", "jdbc:datadirect:sforce://login.salesforce.com;").
      option("driver", "com.ddtek.jdbc.sforce.SForceDriver").
      option("dbtable", tableName).
      option("user", "").
      option("password", "xxxxxxxxx").
      option("securitytoken", "xxxxx")
      .load()

    salesforceDf.createOrReplaceTempView("Opportunity")
    spark.sql("select * from Opportunity").collect.foreach(println)

    // save the result
    salesforceDf.write.save(outputPath)
  }
}
The docs I was referring to said to start a Spark shell as:
spark-shell --jars /path_to_driver/sforce.jar
This printed a lot of lines in the terminal, and this was the last one:
22/07/12 14:57:56 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a12b060d-5c82-4283-b2b9-53f9b3863b53
And then to submit the Spark job with:
spark-submit --jars sforce.jar --class <Your class name> your jar file
However, I am not sure where this jar file is, whether it was actually created, or how to submit it. Any help is appreciated, thank you.
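For what it's worth, a hedged sketch of the usual workflow (the jar name and Scala version below are placeholders, not taken from the question): sbt assembly writes the fat jar under target/scala-<scala-version>/ inside the project directory, and that jar goes at the end of the spark-submit line, with the object name (Sample here) as the class:
spark-submit --jars /path_to_driver/sforce.jar --class Sample target/scala-2.12/<your-project>-assembly-<version>.jar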
Directory: /home/hadoop/
module.py
def incr(value):
    return int(value + 1)
main.py
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

import sys
sys.path.append('/home/hadoop/')
import module

if __name__ == '__main__':
    df = spark.createDataFrame([['a', 1], ['b', 2]], schema=['id', 'value'])
    df.show()
    print(module.incr(5))  # this works

    # this throws module not found error
    incr_udf = F.udf(lambda val: module.incr(val), T.IntegerType())
    df = df.withColumn('new_value', incr_udf('value'))
    df.show()
Spark task nodes do not have access to /home/hadoop/
How do I import module.py from within spark task nodes?
If you are submitting the Spark job to YARN, the tasks are launched by the 'yarn' user on the worker nodes, and that user will not have permission to access /home/hadoop/.
You can add --py-files module.py to your spark-submit command; the file is then shipped into every container, so you can call its functions directly with from module import * (or a plain import module).
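For example, a sketch assuming both files live in /home/hadoop/ on the master node and the job is submitted to YARN:
spark-submit --master yarn --py-files /home/hadoop/module.py /home/hadoop/main.py
With --py-files in place, module.py is shipped to every executor and put on its PYTHONPATH, so a plain import module resolves inside the UDF without the sys.path.append('/home/hadoop/') workaround.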
I'm trying to read, in my Jupyter notebook, a Parquet file that I downloaded from HDFS, but it shows up as empty. I know it is not empty because I had worked with it before saving it to HDFS. Does anyone know why it is being read as empty?
The size of the file on HDFS, and a check from the cluster environment:
hadoop fs -du -s -h /user/some/test.parquet
1.2 M 3.5 M /user/some/test.parquet
val test = spark.read.parquet("hdfs:///user/some/test.parquet")
test.count()
res0: Long = 10
In the Jupyter notebook I use the almond kernel to work in Scala:
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)
import org.apache.spark.sql._
val spark = {
  SparkSession.builder()
    .master("local[*]")
    .getOrCreate()
}

def sc = spark.sparkContext
val test = spark.read.parquet("/Users/me/some/test.parquet")
test: DataFrame = [UnitId: string, GeoId: string ... 26 more fields]
test.count()
res28: Long = 0L
For anyone who's interested, I figured out the issue.
I had downloaded the Parquet file from HDFS using "hadoop fs -getmerge", which corrupted the file.
The right approach when dealing with a Parquet file is to use "hadoop fs -get".
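The underlying reason, for anyone hitting the same thing: Spark writes a Parquet output as a directory of part files (plus a _SUCCESS marker), and each part carries its own footer metadata. hadoop fs -getmerge blindly concatenates everything into one local file, which breaks that structure, while hadoop fs -get copies the directory as-is. Using the paths from the question:
hadoop fs -getmerge /user/some/test.parquet /Users/me/some/test.parquet   # corrupts the Parquet output
hadoop fs -get /user/some/test.parquet /Users/me/some/test.parquet        # preserves the directory layout Spark expects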
I have data in an S3 bucket in directory /data/vw/. Each line is of the form:
| abc:2 def:1 ghi:3 ...
I want to convert it to the following format:
abc abc def ghi ghi ghi
The new converted lines should go to S3 in directory /data/spark
Basically, repeat each string the number of times that follows the colon. I am trying to convert a VW LDA input file to a corresponding file for consumption by Spark's LDA library.
The code:
import org.apache.spark.{SparkConf, SparkContext}

object Vw2SparkLdaFormatConverter {

  def repeater(s: String): String = {
    val ssplit = s.split(':')
    (ssplit(0) + ' ') * ssplit(1).toInt
  }

  def main(args: Array[String]) {
    val inputPath = args(0)
    val outputPath = args(1)

    val conf = new SparkConf().setAppName("FormatConverter")
    val sc = new SparkContext(conf)

    val vwdata = sc.textFile(inputPath)
    val sparkdata = vwdata.map(s => s.trim().split(' ').map(repeater).mkString)
    val coalescedSparkData = sparkdata.coalesce(100)

    coalescedSparkData.saveAsTextFile(outputPath)
    sc.stop()
  }
}
When I run this (as a Spark EMR job in AWS), the step fails with this exception:
18/01/20 00:16:28 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at ...
The code is run as:
spark-submit --class Vw2SparkLdaFormatConverter --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=true --executor-memory 4g s3a://mybucket/scripts/myscalajar.jar s3a://mybucket/data/vw s3a://mybucket/data/spark
I have tried specifying new output paths (/data/spark1, etc.) and ensuring that the path does not exist before the step is run. Even then it does not work.
What am I doing wrong? I am new to Scala and Spark so I might be overlooking something here.
You could convert the RDD to a DataFrame and then save with overwrite mode enabled:
coalescedSparkData.toDF.write.mode("overwrite").csv(outputPath)
Or if you insist on using RDD methods, you can do as described already in this answer
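If you do stay on the RDD path, here is a minimal sketch of that idea (reusing the sc and outputPath already defined in the question's code): delete the output location through the Hadoop FileSystem API before calling saveAsTextFile.
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Remove the output directory if a previous (possibly failed) run left it behind
val out = new Path(outputPath)
val fs = FileSystem.get(new URI(outputPath), sc.hadoopConfiguration)
if (fs.exists(out)) fs.delete(out, true) // true = recursive

coalescedSparkData.saveAsTextFile(outputPath)
Keep in mind this silently discards whatever was already at outputPath, so it is only appropriate when overwriting is really what you want.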
I have created a Scala program that searches for a word in a text file.
I created the Scala file with Eclipse, then compiled it and built a jar with sbt and sbt assembly. After that I ran the jar with Spark locally and it worked correctly.
Now I want to try running this program using Spark on Hadoop; I have 1 master and 2 worker machines.
Do I have to change the code? And what command do I run from the shell of the master?
I have created a bucket and I have put the text file in Hadoop.
This is my code:
import scala.io.Source

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object wordcount {
  def main(args: Array[String]) {
    // set spark context
    val conf = new SparkConf().setAppName("wordcount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val distFile = sc.textFile("bible.txt")

    print("Enter word to look for in the HOLY BILE: ")
    val word = Console.readLine
    var count = 0;
    var finalCount = 0;
    println("You entered " + word)

    val input = sc.textFile("bible.txt")
    val splitedLines = input.flatMap(line => line.split(" "))
      .filter(x => x.equals(word))

    System.out.println("The word " + word + " appear " + splitedLines.count())
  }
}
Thanks all
Just change the following line,
val conf = new SparkConf().setAppName("wordcount").setMaster("local[*]")
to
val conf = new SparkConf().setAppName("wordcount")
This way you do not have to modify the code whenever you want to switch between local mode and cluster mode. The master option can be passed to the spark-submit command as follows:
spark-submit --class wordcount --master <master-url> wordcount.jar
If you want to run your program locally, use the following command:
spark-submit --class wordcount --master local[*] wordcount.jar
The Spark documentation has the full list of master options that you can set when running the application.
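For quick reference, the most common master URL forms (not an exhaustive list) are:
local[*] : run locally, using all available cores
yarn : run on a YARN cluster (the usual choice on Hadoop/EMR)
spark://host:7077 : connect to a standalone Spark master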