Convert pyspark script to awsglue script - pyspark

I have a bunch of existing PySpark scripts that I want to execute using AWS Glue. The scripts use APIs like SparkSession.read and various transformations on PySpark DataFrames.
I wasn't able to find docs outlining how to convert such a script. Do you have a hint / examples of where I could find more info? Thanks :)

One approach: use the source/sink read/write APIs from AWS Glue and keep the DataFrame transformations as PySpark code. This enables "easy" integration with AWS services (e.g. S3, the Glue Data Catalog) and makes unit testing the DataFrame transformations simple (since that part is plain, well-known PySpark).
Example:
import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame, DynamicFrameWriter
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
# init Glue context (and Spark context)
spark_context = SparkContext()
glue_context = GlueContext(spark_context)
# init Glue job
args = getResolvedOptions(sys.argv, ["JOB_NAME", "PARAM_1"])
job = Job(glue_context)
job.init(args["JOB_NAME"], args)
# read from source (use Glue APIs)
dynamic_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={},  # e.g. {"paths": ["s3://bucket/prefix/"]}
    format="json",
    format_options={},
)
# convert DynamicFrame to DataFrame
df = dynamic_frame.toDF()
# do DataFrame transformations (use Pyspark API)
# convert DataFrame back to DynamicFrame
dynamic_frame = DynamicFrame.fromDF(
    dataframe=df,
    glue_ctx=glue_context,
    name="dynamic_frame",  # fromDF also requires a name for the resulting DynamicFrame
)
# write to sink (use Glue APIs)
DynamicFrameWriter(glue_context).from_options(
    frame=dynamic_frame,
    connection_type="s3",
    connection_options={},  # e.g. {"path": "s3://bucket/output/"}
    format="json",
    format_options={},
)
# commit job
job.commit()
There are different ways to organize this example code into classes and functions, etc. Do whatever is appropriate for your existing script.
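For instance, one way to structure it (a sketch with hypothetical column names) is to keep the transformation as a pure PySpark function with no Glue imports, so it can be unit tested locally:
from pyspark.sql import DataFrame, SparkSession, functions as F

def transform(df: DataFrame) -> DataFrame:
    # Pure PySpark logic, no Glue dependency; the column names are placeholders.
    return df.filter(F.col("status") == "active").withColumn("processed", F.lit(True))

# Quick local check of the transformation, outside of Glue:
if __name__ == "__main__":
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    sample = spark.createDataFrame([("a", "active"), ("b", "inactive")], ["id", "status"])
    transform(sample).show()
In the Glue job itself you would call transform() between the toDF() and fromDF() steps shown above.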
References:
GlueContext class (AWS)
DynamicFrame class (AWS)
DynamicFrameWriter class (AWS)

A PySpark script should run as-is on AWS Glue, since Glue is basically Spark with some custom AWS libraries added. To start, I would just paste it into Glue and try to run it.
If you need Glue-specific functionality like DynamicFrames or job bookmarks, then you will need to modify the scripts to get a GlueContext and work with that. The basic initialization is:
from awsglue.context import GlueContext
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.getOrCreate()
glueContext = GlueContext(spark_session.sparkContext)
From here onwards, you can use glueContext for Glue features or spark_session for plain Spark functionality.
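For example, a minimal sketch of mixing the two (the catalog database/table names and the S3 output path are hypothetical placeholders):
from awsglue.context import GlueContext
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()
glueContext = GlueContext(spark_session.sparkContext)

# Glue feature: read a table registered in the Glue Data Catalog as a DynamicFrame
dyf = glueContext.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")

# Plain Spark from here on: regular DataFrame transformations and writer
df = dyf.toDF().filter("year = 2020")
df.write.mode("overwrite").parquet("s3://my-bucket/output/")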
I would however avoid using Glue-specific stuff just for the sake of it, because:
it will reduce portability
there is much better community support for Spark than for Glue

Related

Connect glue job to Amazon keyspaces

I'm trying to connect an AWS Glue job to Amazon Keyspaces. Is there any way to connect to and work with those tables using PySpark?
PS: I can't use AWS cli due to organization restrictions.
You can connect AWS Glue with Amazon Keyspaces by leveraging the open-source Spark Cassandra connector.
First, you will need to enable the Murmur3 partitioner or random partitioner for your account.
UPDATE system.local set partitioner='org.apache.cassandra.dht.Murmur3Partitioner' where key='local';
Second, make sure you understand the capacity required. By default, Keyspaces tables are created in on-demand mode, which learns the required capacity by doubling resources based on your previous traffic peak. Newly created tables can perform 4,000 WCUs per second and 12,000 RCUs per second. If you need higher capacity, create your table in provisioned mode with the desired throughput, then switch to on-demand mode.
Third, find our prebuilt examples in our samples repositories. We have patterns for export, import, count, and top-N. The examples show how to load the Spark Cassandra connector from S3 and set up best practices for data loading. The following snippet shows an export to S3.
// conf and args come from the standard Glue Scala job boilerplate, omitted in this excerpt
val spark: SparkContext = new SparkContext(conf)
val glueContext: GlueContext = new GlueContext(spark)
val sparkSession: SparkSession = glueContext.getSparkSession
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.SaveMode
import sparkSession.implicits._
Job.init(args("JOB_NAME"), glueContext, args.asJava)
val tableName = args("TABLE_NAME")
val keyspaceName = args("KEYSPACE_NAME")
val backupS3 = args("S3_URI")
val backupFormat = args("FORMAT")
val tableDf = sparkSession.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> tableName, "keyspace" -> keyspaceName))
.load()
tableDf.write.format(backupFormat).mode(SaveMode.ErrorIfExists).save(backupS3)
Job.commit()
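Since the question asks about PySpark, here is a rough Python sketch of the same export pattern (untested; it assumes the Spark Cassandra connector jars are available to the job, and the job parameters mirror the Scala example above):
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
args = getResolvedOptions(sys.argv, ["JOB_NAME", "TABLE_NAME", "KEYSPACE_NAME", "S3_URI", "FORMAT"])
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# read the Keyspaces table through the Spark Cassandra connector
table_df = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(table=args["TABLE_NAME"], keyspace=args["KEYSPACE_NAME"])
    .load()
)

# export to S3, failing if the target already exists
table_df.write.format(args["FORMAT"]).mode("errorifexists").save(args["S3_URI"])
job.commit()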
Best practice is to use rate limiting with Glue per DPU/worker. Understand the throughput you want to achieve per DPU and set the throttler in the Cassandra driver settings.
advanced.throttler = {
class = RateLimitingRequestThrottler
max-requests-per-second = 1000
max-queue-size = 50000
drain-interval = 1 millisecond
}
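One way this driver profile is commonly wired into the Glue job, sketched below, is through the connector's configuration. The property name spark.cassandra.connection.config.profile.path comes from the open-source Spark Cassandra connector and the file name is a placeholder, so verify both against your connector version:
from awsglue.context import GlueContext
from pyspark import SparkConf
from pyspark.context import SparkContext

# Point the Spark Cassandra connector at a driver config file containing the
# advanced.throttler block shown above (file shipped with the job; name is a placeholder).
conf = SparkConf().set("spark.cassandra.connection.config.profile.path", "application.conf")
sc = SparkContext(conf=conf)
glue_context = GlueContext(sc)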
You will want to ensure that you have the proper IAM permissions to access Amazon Keyspaces. If you're using a VPC endpoint, you will also need to include the required permissions for it.

ValueError: Cannot run multiple SparkContexts at once; existing SparkContext - AWS Glue

Getting the below error while executing a Glue job:
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext
Not sure how to stop the existing SparkContext; any suggestion would be helpful.
If it were plain Spark code, we could do it using:
sc = SparkContext.getOrCreate()
or
sc.stop()
Not sure how to stop it in Glue.
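For reference, a minimal sketch of how the getOrCreate() pattern mentioned above is usually applied in a Glue script, so that a second SparkContext is never constructed:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Reuse an existing SparkContext if one is already running instead of creating a new one
sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session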

Code errors out from IntelliJ but runs well on the Databricks Notebook

I develop Spark code using the Scala API in IntelliJ, and when I run it I get the error below, though it runs fine in a Databricks notebook.
I am using Databricks Connect to connect from a local installation of IntelliJ to the Databricks Spark cluster. I am connected to the cluster and was able to submit a job from IntelliJ to the cluster too. As a matter of fact, everything else works except the piece below.
Databricks Connect is 6.1, Databricks Runtime is 6.2.
I imported the jar file from the cluster (using databricks-connect get-jar-dir) and set up the SBT project with the jar in the project library.
source code:
val sparkSession = SparkSession.builder.getOrCreate()
val sparkContext = sparkSession.sparkContext
import sparkSession.implicits._
val v_textFile_read = sparkContext.textFile(v_filename_path)
v_textFile_read.take(2).foreach(println)
Error:
cannot assign instance of scala.Some to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of
type scala.collection.Seq in instance of org.apache.spark.rdd.HadoopRDD
The reason I use an RDD reader for text is so I can pass this output to the createDataFrame API. As you know, the createDataFrame API takes an RDD and a schema as input parameters.
step-1: val v_RDD_textFile_read = sparkContext.textFile(v_filename_path).map(x => MMRSplitRowIntoStrings(x))
step-2: val v_DF_textFile_read = sparkSession.sqlContext.createDataFrame(v_RDD_textFile_read, v_schema)

Apache spark - loading data from elasticsearch is too slow

I'm new to Apache Spark and I'm trying to load some elasticsearch data from a scala script I'm running on it.
Here is my script:
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder.appName("Simple Application").getOrCreate()
val options = Map("es.nodes" -> "x.x.x.x:9200", "pushdown" -> "true")
import sparkSession.implicits._
val df = sparkSession.read.format("org.elasticsearch.spark.sql").options(options).load("my_index-07.05.2018/_doc").limit(5).select("SomeField", "AnotherField", "AnotherOne")
df.cache()
df.show()
And it works, but it's terribly slow. Am I doing anything wrong here?
Connectivity shouldn't be an issue at all; the index I'm trying to query has around 200k documents, but I'm limiting the query to 5 results.
Btw, I had to run the spark-shell (or submit) by passing the elasticsearch-hadoop dependency as a parameter on the command line (--packages org.elasticsearch:elasticsearch-hadoop:6.3.0). Is that the right way to do it? Is there any way to just build an sbt package that includes all the dependencies?
Thanks a lot
Are you running this locally on a single machine? If so, it could be normal... you will have to check your network, your Spark web UI, etc.
As for submitting all the dependencies without specifying them on the spark-submit command line, what we usually do is create a fat JAR by using sbt assembly.
http://queirozf.com/entries/creating-scala-fat-jars-for-spark-on-sbt-with-sbt-assembly-plugin

Google Cloud Dataproc - job file erroring on sc.textFile() command

Here is my file that I submit as a PySpark job in Dataproc, through the UI:
# Load file data from Google Cloud Storage to the Dataproc cluster, creating an RDD
# Because Spark transforms are 'lazy', we do a 'count()' action to make sure
# we successfully loaded the main data file
allFlt = sc.textFile("gs://mybucket/mydatafile")
allFlt.count()
# Remove header from file so we can work with data only
header = allFlt.take(1)[0]
dataOnly = allFlt.filter(lambda line: line != header)
It starts and then errors out with
allFlt = sc.textFile("gs://thomtect/flightinfo")
NameError: name 'sc' is not defined
Why is this? Shouldn't a Spark context have already been established by Dataproc? What do I need to add to my code so that it is accepted as Spark commands?
https://cloud.google.com/dataproc/submit-job has an example python spark job submission.
The short answer is to add the following to the top of your script:
#!/usr/bin/python
import pyspark
sc = pyspark.SparkContext()
And to expand a bit on why this is required: when Dataproc runs python scripts, it uses spark-submit (http://spark.apache.org/docs/latest/submitting-applications.html) instead of running the pyspark shell.
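Putting that fix together with the snippet from the question, a minimal complete job script would look roughly like this (the GCS path is the placeholder from the question):
#!/usr/bin/python
import pyspark

# Create the SparkContext explicitly, since the script runs under spark-submit
# rather than inside the interactive pyspark shell.
sc = pyspark.SparkContext()

# Load file data from Google Cloud Storage into an RDD; count() forces the lazy
# load so problems surface immediately.
allFlt = sc.textFile("gs://mybucket/mydatafile")
allFlt.count()

# Remove the header line so only data rows remain
header = allFlt.take(1)[0]
dataOnly = allFlt.filter(lambda line: line != header)
print(dataOnly.count())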