Connect Glue job to Amazon Keyspaces - PySpark

I'm trying to connect an AWS Glue job to Amazon Keyspaces. Is there any way to connect to and work with those tables using PySpark?
PS: I can't use the AWS CLI due to organization restrictions.

You can connect AWS Glue to Amazon Keyspaces by leveraging the open-source Spark Cassandra Connector.
First, you will need to enable the Murmur3 partitioner or random partitioner for your account:
UPDATE system.local set partitioner='org.apache.cassandra.dht.Murmur3Partitioner' where key='local';
Second, make sure you understand the capacity required. By default, Keyspaces tables are created in on-demand mode, which learns the required capacity by doubling resources based on your previous traffic peak. Newly created tables can perform 4,000 WCUs per second and 12,000 RCUs per second. If you need higher capacity, create your table in provisioned mode with the desired throughput, then switch to on-demand mode.
Third, you can find prebuilt examples in our samples repositories. We have patterns for export, import, count, and top-N. The examples show how to load the Spark Cassandra Connector to S3 and set up best practices for data loading. The following snippet shows an export to S3.
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SaveMode, SparkSession}
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val args = GlueArgParser.getResolvedOptions(sysArgs,
      Seq("JOB_NAME", "TABLE_NAME", "KEYSPACE_NAME", "S3_URI", "FORMAT").toArray)
    // Cassandra connector settings (contact point, auth, etc.) are assumed to be
    // supplied via the job's --conf parameters or a driver config profile
    val conf = new SparkConf()
    val spark: SparkContext = new SparkContext(conf)
    val glueContext: GlueContext = new GlueContext(spark)
    val sparkSession: SparkSession = glueContext.getSparkSession
    import sparkSession.implicits._
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    val tableName = args("TABLE_NAME")
    val keyspaceName = args("KEYSPACE_NAME")
    val backupS3 = args("S3_URI")
    val backupFormat = args("FORMAT")
    // Read the Keyspaces table through the Spark Cassandra Connector
    val tableDf = sparkSession.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> tableName, "keyspace" -> keyspaceName))
      .load()
    // Export to S3; fail if the target path already exists
    tableDf.write.format(backupFormat).mode(SaveMode.ErrorIfExists).save(backupS3)
    Job.commit()
  }
}
A best practice is to use rate limiting with Glue per DPU/worker. Understand the throughput you want to achieve per DPU and set the throttler in the Cassandra driver settings.
advanced.throttler = {
  class = RateLimitingRequestThrottler
  max-requests-per-second = 1000
  max-queue-size = 50000
  drain-interval = 1 millisecond
}
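If you keep those driver settings in a configuration file, one way to wire it into the Glue job looks roughly like the sketch below. The property name comes from the Spark Cassandra Connector; the file name and shipping it with the --extra-files job parameter are assumptions for illustration.
import org.apache.spark.{SparkConf, SparkContext}

// Point the Spark Cassandra Connector at a driver profile file that contains
// the advanced.throttler block above (shipped with the job, e.g. via --extra-files).
// File name is illustrative.
val conf = new SparkConf()
  .set("spark.cassandra.connection.config.profile.path", "keyspaces-application.conf")
val sc = new SparkContext(conf)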
You will want to ensure that you have the proper IAM permissions to access Amazon Keyspaces. If you're using a VPC endpoint, you will also want to include the required privileges for it.

Related

Convert pyspark script to awsglue script

I have a bunch of existing PySpark scripts that I want to execute using AWS Glue. The scripts use APIs like SparkSession.read and various transformations on PySpark DataFrames.
I wasn't able to find docs outlining how to convert such a script. Do you have a hint / examples where I could find more info? Thanks :)
One approach: Use the source/sink read/write APIs from AWS Glue and keep the DataFrame transformations as PySpark code. This enables "easy" integration with AWS services (e.g. S3, Glue Catalog) and makes unit testing the DataFrame transformations simple (since this is well known in PySpark).
Example:
import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame, DynamicFrameWriter
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
# init Glue context (and Spark context)
spark_context = SparkContext()
glue_context = GlueContext(spark_context)
# init Glue job
args = getResolvedOptions(sys.argv, ["JOB_NAME", "PARAM_1"])
job = Job(glue_context)
job.init(args["JOB_NAME"], args)
# read from source (use Glue APIs)
dynamic_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={},  # e.g. {"paths": ["s3://bucket/input-prefix/"]}
    format="json",
    format_options={},
)
# convert DynamicFrame to DataFrame
df = dynamic_frame.toDF()
# do DataFrame transformations (use Pyspark API)
# convert DataFrame back to DynamicFrame
dynamic_frame = DynamicFrame.fromDF(
    dataframe=df,
    glue_ctx=glue_context,
    name="dynamic_frame",  # fromDF requires a name for the resulting DynamicFrame
)
# write to sink (use Glue APIs)
DynamicFrameWriter(glue_context).from_options(
    frame=dynamic_frame,
    connection_type="s3",
    connection_options={},  # e.g. {"path": "s3://bucket/output-prefix/"}
    format="json",
    format_options={},
)
# commit job
job.commit()
There are different ways to organize this example code into classes and functions, etc. Do whatever is appropriate for your existing script.
References:
GlueContext class (AWS)
DynamicFrame class (AWS)
DynamicFrameWriter class (AWS)
A PySpark script should run as-is on AWS Glue, since Glue is basically Spark with some custom AWS libraries added. To start, I would just paste it into Glue and try to run it.
If you need some Glue functionality like dynamic frames or bookmarks, then you will need to modify the scripts to get a GlueContext and work with that. The basic initialization is:
from awsglue.context import GlueContext
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.getOrCreate()
glueContext = GlueContext(spark_session.sparkContext)
From here onwards, you can use glueContext for Glue features or spark_session for plain Spark functionality.
I would, however, avoid using Glue-specific features just for the sake of it, because:
it will reduce portability
there is much better community support for Spark than for Glue

Apache Spark - loading data from Elasticsearch is too slow

I'm new to Apache Spark and I'm trying to load some Elasticsearch data from a Scala script I'm running on it.
Here is my script:
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder.appName("Simple Application").getOrCreate()
val options = Map("es.nodes" -> "x.x.x.x:9200", "pushdown" -> "true")
import sparkSession.implicits._
val df = sparkSession.read.format("org.elasticsearch.spark.sql").options(options).load("my_index-07.05.2018/_doc").limit(5).select("SomeField", "AnotherField", "AnotherOne")
df.cache()
df.show()
And it works, but it's terribly slow. Am I doing anything wrong here?
Connectivity shouldn't be an issue at all; the index I'm trying to query has around 200k documents, but I'm limiting the query to 5 results.
Btw, I had to run the spark-shell (or spark-submit) by passing the elasticsearch-hadoop dependency as a parameter on the command line (--packages org.elasticsearch:elasticsearch-hadoop:6.3.0). Is that the right way to do it? Is there any way to just build an sbt package including all the dependencies?
Thanks a lot
Are you running this locally on a single machine? If so, it could be normal... You will have to check your network, your Spark web UI, etc.
As for submitting all the dependencies without specifying them on the command line with spark-submit, what we usually do is create a fat JAR by using sbt-assembly.
http://queirozf.com/entries/creating-scala-fat-jars-for-spark-on-sbt-with-sbt-assembly-plugin
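As a rough sketch of what that setup looks like (the plugin and library versions below are assumptions; match them to your Spark and Scala versions):
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt
name := "es-spark-demo"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
  // Spark itself is "provided": spark-shell/spark-submit supplies it, so it stays out of the fat JAR
  "org.apache.spark" %% "spark-sql" % "2.3.1" % "provided",
  // elasticsearch-hadoop connector built for Spark 2.x
  "org.elasticsearch" %% "elasticsearch-spark-20" % "6.3.0"
)

// Build with `sbt assembly`, then submit the resulting
// target/scala-2.11/es-spark-demo-assembly-*.jar with spark-submit (no --packages needed).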

Set spark.driver.memory for Spark running inside a web application

I have a REST API in Scala Spray that triggers Spark jobs like the following:
path("vectorize") {
get {
parameter('apiKey.as[String]) { (apiKey) =>
if (apiKey == API_KEY) {
MoviesVectorizer.calculate() // Spark Job run in a Thread (returns Future)
complete("Ok")
} else {
complete("Wrong API KEY")
}
}
}
}
I'm trying to find a way to specify the Spark driver memory for the jobs. As far as I can tell, configuring driver.memory from within the application code doesn't affect anything.
The whole web application along with the Spark is packaged in a fat Jar.
I run it by running
java -jar app.jar
Thus, as I understand it, spark-submit is not relevant here (or is it?). So I cannot specify the --driver-memory option when running the app.
Is there any way to set the driver memory for Spark within the web app?
Here's my current Spark configuration:
val spark: SparkSession = SparkSession.builder()
  .appName("Recommender")
  .master("local[*]")
  .config("spark.mongodb.input.uri", uri)
  .config("spark.mongodb.output.uri", uri)
  .config("spark.mongodb.keep_alive_ms", "100000")
  .getOrCreate()
spark.conf.set("spark.executor.memory", "10g")
val sc = spark.sparkContext
sc.setCheckpointDir("/tmp/checkpoint/")
val sqlContext = spark.sqlContext
As the documentation says, the Spark UI Environment tab shows only variables that are affected by the configuration. Everything I set is there, apart from spark.executor.memory.
This happens because you use local mode. In local mode there is no real executor: all Spark components run in a single JVM, with a single heap configuration, so executor-specific configuration doesn't matter.
spark.executor options are applicable only when the application is submitted to a cluster.
Also, Spark supports only a single application per JVM instance. This means that all core Spark properties will be applied only when the SparkContext is initialized, and persist as long as the context (not the SparkSession) is kept alive. Since SparkSession initializes the SparkContext, no additional "core" settings can be applied after getOrCreate.
This means that all "core" options should be provided using the config method of SparkSession.builder.
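A minimal sketch of that, keeping in mind that in local mode the driver is this same JVM, so its heap is ultimately governed by the -Xmx you pass when launching the fat JAR:
import org.apache.spark.sql.SparkSession

// "Core" properties go through the builder, before getOrCreate()
val spark: SparkSession = SparkSession.builder()
  .appName("Recommender")
  .master("local[*]")
  .config("spark.driver.memory", "10g") // has no effect on an already-running JVM; see note above
  .getOrCreate()

// For a fat JAR started with plain java, size the driver heap at launch instead:
//   java -Xmx10g -jar app.jar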
If you're looking for alternatives to embedding, check the exemplary answer to Best Practice to launch Spark Applications via Web Application? by T. Gawęda.
Note: Officially Spark doesn't support applications running outside spark-submit and there are some elusive bugs related to that.

Different outputs per number of partitions in Spark

I run Spark code on my local machine and on a cluster.
I create the SparkContext object for the local machine with the following code:
val sc = new SparkContext("local[*]", "Trial")
I create the SparkContext object for the cluster with the following code:
val spark = SparkSession.builder.appName(args(0)+" "+args(1)).getOrCreate()
val sc = spark.sparkContext
and I set the number of partitions to 4 for both the local machine and the cluster with the following code:
val dataset = sc.textFile("Dataset.txt", 4)
In my cluster, I created 5 workers. One of them is the driver node; the rest run as workers.
I expect the results to be the same. However, the local and cluster results are different. What are the reasons for this?
I create the SparkContext object for the local machine with the following code
and
I create the SparkContext object for the cluster with the following code:
It appears that you may have defined two different environments for sc and spark, as you define local[*] explicitly for sc while taking some default value for spark (which may read external configuration files or take the so-called master URL from spark-submit).
These may be different, and that may affect what you use.
I expect the results to be the same. However, the local and cluster results are different. What are the reasons for this?
The Dataset.txt you process in the local and cluster environments is different, hence the difference in the results. I'd strongly recommend using HDFS or some other shared file system to avoid such "surprises" in the future.
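As a rough sketch of keeping the two runs comparable (the HDFS path here is illustrative), build both from the same entry point and let spark-submit supply the master URL:
import org.apache.spark.sql.SparkSession

// Same code path for local and cluster runs; only --master differs, e.g.
//   spark-submit --master "local[*]" ...   vs.   spark-submit --master spark://host:7077 ...
val spark = SparkSession.builder.appName("Trial").getOrCreate()
val sc = spark.sparkContext

// Both environments read the same copy of the data from shared storage
val dataset = sc.textFile("hdfs:///data/Dataset.txt", 4)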

How to connect to ElasticSearch from Apache Spark Service on Bluemix

I use the Apache Spark service on Bluemix to create a demo (collecting/parsing Twitter data). I want to transport the data to Elasticsearch.
I created my scala app according to the following URL [1]:
[1] https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
However, when using a Jupyter notebook on Bluemix, I couldn't run my app properly. A special interpreter-aware SparkContext "sc" was already running, but I couldn't add properties to "sc" such as "es.nodes", "es.port", and so on to connect to Elasticsearch.
Q1.
Does anyone know how to add extra properties to the special interpreter-aware SparkContext on Bluemix? In my local Spark environment, it's easy to add them.
Q2.
I tried to create another SparkContext as follows and use it for streaming, but it was uncontrollable in the Jupyter notebook.
var conf = sc.getConf
conf.set("es.index.auto.create", "true")
conf.set("es.nodes", "XXXXXXXX")
conf.set("es.port", "9020")
conf.set("spark.driver.allowMultipleContexts", "true")
val sc1 = new SparkContext(conf)
My procedure for creating an extra SparkContext may not be right, I think.
Does anyone know how to create a second SparkContext properly on Bluemix?
If I'm not mistaken, you're already setting the properties on the configuration object within the existing SparkContext.
These lines (correcting what I assume is a typo) should be setting the options on the existing SparkContext's configuration:
val conf = sc.getConf
conf.set("es.index.auto.create", "true")
conf.set("es.nodes", "XXXXXXXX")
conf.set("es.port", "9020")
conf.set("spark.driver.allowMultipleContexts", "true")
You mentioned you couldn't add these properties -- can you elaborate on the problem you ran into when doing it this way?
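As a side note, with elasticsearch-hadoop on the classpath the es.* settings can also be passed per operation rather than on the shared "sc"; here is a minimal sketch for writing the collected tweets (the host, port, index name, and payload below are placeholders):
import org.elasticsearch.spark._

// Connection settings handed to the call itself instead of the interpreter-owned SparkContext
val esCfg = Map(
  "es.nodes" -> "XXXXXXXX",
  "es.port" -> "9020",
  "es.index.auto.create" -> "true"
)

// Illustrative payload and index/type
val tweets = sc.makeRDD(Seq(Map("user" -> "demo", "text" -> "hello")))
tweets.saveToEs("twitter_demo/tweet", esCfg)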