How to set up a local development environment for Scala Spark ETL to run in AWS Glue? - scala

I'd like to be able to write Scala in my local IDE and then deploy it to AWS Glue as part of a build process. But I'm having trouble finding the libraries required to build the GlueApp skeleton generated by AWS.
The aws-java-sdk-glue doesn't contain the imported classes, and I can't find those libraries anywhere else. They must exist somewhere; perhaps they are just a Java/Scala port of this library: aws-glue-libs
The template Scala code from AWS:
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // #params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    // #type: DataSource
    // #args: [database = "raw-tickers-oregon", table_name = "spark_delivery_2_1", transformation_ctx = "datasource0"]
    // #return: datasource0
    // #inputs: []
    val datasource0 = glueContext.getCatalogSource(database = "raw-tickers-oregon", tableName = "spark_delivery_2_1", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame()
    // #type: ApplyMapping
    // #args: [mapping = [("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")], transformation_ctx = "applymapping1"]
    // #return: applymapping1
    // #inputs: [frame = datasource0]
    val applymapping1 = datasource0.applyMapping(mappings = Seq(("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")), caseSensitive = false, transformationContext = "applymapping1")
    // #type: DataSink
    // #args: [connection_type = "s3", connection_options = {"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}, format = "json", transformation_ctx = "datasink2"]
    // #return: datasink2
    // #inputs: [frame = applymapping1]
    val datasink2 = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions("""{"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}"""), transformationContext = "datasink2", format = "json").writeDynamicFrame(applymapping1)
    Job.commit()
  }
}
And the build.sbt I have started putting together for a local build:
name := "aws-glue-scala"
version := "0.1"
scalaVersion := "2.11.12"
updateOptions := updateOptions.value.withCachedResolution(true)
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1"
The documentation for the AWS Glue Scala API seems to outline functionality similar to what is available in the AWS Glue Python library. So perhaps all that is required is to download and build the PySpark AWS Glue library and add it to the classpath? That may be possible, since the Glue Python library uses Py4J.

@Frederic gave a very helpful hint: get the dependency from s3://aws-glue-jes-prod-us-east-1-assets/etl/jars/glue-assembly.jar.
Unfortunately that version of glue-assembly.jar is already outdated and bundles Spark 2.1.
It's fine if you're using backward-compatible features, but if you rely on the latest Spark version (and possibly the latest Glue features) you can get the appropriate jar from a Glue dev endpoint under /usr/share/aws/glue/etl/jars/glue-assembly.jar.
Provided you have a dev-endpoint named my-dev-endpoint you can copy the current jar from it:
export DEV_ENDPOINT_HOST=`aws glue get-dev-endpoint --endpoint-name my-dev-endpoint --query 'DevEndpoint.PublicAddress' --output text`
scp -i dev-endpoint-private-key \
  glue@$DEV_ENDPOINT_HOST:/usr/share/aws/glue/etl/jars/glue-assembly.jar .
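To compile against the copied jar locally, one option is to add it to the sbt build as an unmanaged dependency. A minimal sketch (the jars/ path below is only an assumption about where you keep the file):
// build.sbt additions -- point the path at wherever you copied glue-assembly.jar
unmanagedJars in Compile += file("jars/glue-assembly.jar")

// Glue provides Spark at runtime, so "provided" keeps it out of the deployed assembly
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1" % "provided"
Alternatively, dropping the jar into the project's lib/ directory has the same effect, since sbt treats lib/ as unmanaged dependencies by default.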

Unfortunately, there are no libraries available for the Scala Glue API. I have already contacted Amazon support and they are aware of this problem. However, they didn't provide any ETA for delivering the API jar.

As a workaround you can download the jar from S3. The S3 URI is s3://aws-glue-jes-prod-us-east-1-assets/etl/jars/glue-assembly.jar
See https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-repl.html

This is now supported via a recent release from AWS:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html
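With the official release, the Glue ETL classes can be resolved from AWS's Maven repository instead of being copied off a dev endpoint. A rough build.sbt sketch follows; the repository URL, the AWSGlueETL coordinates, and the version are assumptions on my part, so verify them against the page above for your Glue version:
// build.sbt -- sketch only; confirm the exact repository, coordinates, and version on the page above
resolvers += "aws-glue-etl-artifacts" at "https://aws-glue-etl-artifacts.s3.amazonaws.com/release/"

// "provided": the Glue runtime already ships these classes, so they shouldn't be bundled in your artifact
libraryDependencies += "com.amazonaws" % "AWSGlueETL" % "1.0.0" % "provided"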

Related

How can I access a table in an AWS KMS encrypted Redshift cluster from a Glue job using a PySpark script?

My requirement:
I want to write a PySpark script to read data from a table in an AWS KMS encrypted Redshift cluster (with require SSL set to true).
How can I retrieve connection details like the password and use them to connect to Redshift, as in the sample code?
What is the standard way to perform this?
Do I have to use any API?
I know that the command below generates a temporary password, but this password does not work in the Glue Redshift connection. Plus, I believe it is not the recommended way.
aws redshift get-cluster-credentials --db-user adminuser --db-name dev --cluster-identifier mycluster
My sample Glue Spark script:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import SQLContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
#job.commit()
connection_options = {
    "url": "jdbc:redshift://endpoint",
    "dbtable": "some_table",
    "user": "user",
    "password": "some_password", # how can I retrieve password and avoid plaintext?
    "redshiftTmpDir": args["TempDir"]
}
df = glueContext.create_dynamic_frame_from_options("redshift", connection_options).toDF()
print(df.count())

AWS GLUE ERROR : An error occurred while calling o75.pyWriteDynamicFrame. Cannot cast STRING into a IntegerType (value: BsonString{value=''})

I have a simple Glue PySpark job which connects to a MongoDB source through a Glue catalog table, extracts data from MongoDB collections, and writes JSON output to S3 using a Glue dynamic frame.
The MongoDB database here is deeply nested NoSQL with structs and arrays. Since it is a NoSQL database, the source schema is not fixed; nested columns may vary from document to document.
However, the job fails with the below error.
ERROR: py4j.protocol.Py4JJavaError: An error occurred while calling o75.pyWriteDynamicFrame.: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 6, 10.3.29.22, executor 1): com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a IntegerType (value: BsonString{value=''})
As the job fails due to a datatype mismatch, I have tried all the solutions I could find, such as using resolveChoice(). Since the error is for a property with an 'int' datatype, I tried casting all the properties with 'int' type to 'string'.
I also tried the code with dropnullfields, writing with a Spark dataframe, applymapping, without using the catalog table (from_options directly from the Mongo table), and with and without repartition.
All these attempts are commented in the code for reference.
CODE SNIPPET
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
print("Started")
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "<catalog_db_name>", table_name = "<catalog_table_name>", additional_options = {"database": "<mongo_database_name>", "collection": "<mongo_db_collection>"}, transformation_ctx = "datasource0")
# Code to read data directly from mongo database
# datasource0 = glueContext.create_dynamic_frame_from_options(connection_type = "mongodb", connection_options = { "uri": "<connection_string>", "database": "<mongo_db_name>", "collection": "<mongo_collection>", "username": "<db_username>", "password": "<db_password>"})
# Code sample for resolveChoice (converted all the 'int' datatypes to 'string')
# resolve_dyf = datasource0.resolveChoice(specs = [("nested.property", "cast:string"),("nested.further[].property", "cast:string")])
# Code sample to dropnullfields
# dyf_dropNullfields = DropNullFields.apply(frame = resolve_dyf, transformation_ctx = "dyf_dropNullfields")
data_sink0 = datasource0.repartition(1)
print("Repartition done")
# Code sample to sink using spark's write method
# data_sink0.write.format("json").option("header","true").save("s3://<s3_folder_path>")
datasink1 = glueContext.write_dynamic_frame.from_options(frame = data_sink0, connection_type = "s3", connection_options = {"path": "s3://<S3_folder_path>"}, format = "json", transformation_ctx = "datasink1")
print("Data Sink complete")
job.commit()
NOTE
I am not exactly sure why this is happening because the issue is intermittent. Sometimes it works perfectly, but at times it fails, so it is quite confusing.
Any help will be highly appreciated.
I was facing the same problem. A simple solution is to increase the sample size from 1000 (the default for MongoDB) to 100000. Here is a sample config for your reference; it goes into the connection options used to read from MongoDB.
read_config = {
    "uri": documentdb_write_uri,
    "database": "your_db",
    "collection": "your_collection",
    "username": "user",
    "password": "password",
    "partitioner": "MongoSamplePartitioner",
    "sampleSize": "100000",
    "partitionerOptions.partitionSizeMB": "1000",
    "partitionerOptions.partitionKey": "_id"
}

error: object Service is not a member of package com.twitter.finagle - Defining Bazel dependencies in Build file, Scala finagle

I'm trying to add the finagle-http library to my new Bazel project as an external Maven dependency, but I'm getting the following errors. I assume I'm doing something wrong in creating the build without fully understanding it; I'm still learning. I'd appreciate any help on this.
error: object Service is not a member of package com.twitter.finagle
error: object util is not a member of package com.twitter
error: type Request is not a member of package com.twitter.finagle.http
error: object Response is not a member of package com.twitter.finagle.http
error: Symbol 'type com.twitter.finagle.Client' is missing from the classpath. This symbol is required by 'object com.twitter.finagle.Http'.
error: not found: value Await
The same code works using sbt. Below is the code.
import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http
import com.twitter.util.{Await, Future}
object HelloWorld extends App {
  val service = new Service[http.Request, http.Response] {
    def apply(req: http.Request): Future[http.Response] =
      Future.value(http.Response(req.version, http.Status.Ok))
  }
  val server = Http.serve(":8080", service)
  Await.ready(server)
}
WORKSPACE file
maven_install(
    artifacts = [
        "org.apache.spark:spark-core_2.11:2.4.4",
        "org.apache.spark:spark-sql_2.11:2.4.1",
        "org.apache.spark:spark-unsafe_2.11:2.4.1",
        "org.apache.spark:spark-tags_2.11:2.4.1",
        "org.apache.spark:spark-catalyst_2.11:2.4.1",
        "com.twitter:finagle-http_2.12:21.8.0",
    ],
    repositories = [
        "https://repo.maven.apache.org/maven2/",
        "https://repo1.maven.org/maven2/",
    ],
)
BUILD file
load("#io_bazel_rules_scala//scala:scala.bzl", "scala_binary")
package(default_visibility = ["//visibility:public"])
scala_binary(
name="helloworld",
main_class="microservices.HelloWorld",
srcs=[
"Main.scala",
],
deps = ["spark],
)
java_library(
name = "spark",
exports = [
"#maven//:com_twitter_finagle_http_2_12_21_8_0",
],
)
The sbt dependency that was working in my initial sbt project:
libraryDependencies += "com.twitter" %% "finagle-http" % "21.8.0"
Figured out the issue: unlike in sbt, in Bazel I had to individually add the related dependencies. I modified the WORKSPACE as below.
maven_install(
    artifacts = [
        "com.twitter:finagle-http_2.12:21.8.0",
        "com.twitter:util-core_2.12:21.8.0",
        "com.twitter:finagle-core_2.12:21.8.0",
        "com.twitter:finagle-base-http_2.12:21.8.0",
        "com.fasterxml.jackson.module:jackson-module-scala_2.12:2.11.2",
        "com.fasterxml.jackson.core:jackson-databind:2.11.2",
    ],
    repositories = [
        "https://repo.maven.apache.org/maven2/",
        "https://repo1.maven.org/maven2/",
    ],
)
BUILD file
java_library(
    name = "finagletrial",
    exports = [
        "@maven//:com_twitter_finagle_http_2_12_21_8_0",
        "@maven//:com_twitter_util_core_2_12_21_8_0",
        "@maven//:com_twitter_finagle_core_2_12_21_8_0",
        "@maven//:com_twitter_finagle_base_http_2_12_21_8_0",
        "@maven//:com_fasterxml_jackson_module_jackson_module_scala_2_12_2_11_2",
        "@maven//:com_fasterxml_jackson_core_jackson_databind_2_11_2",
    ],
)

java.lang.NoClassDefFoundError: Could not initialize class XXXXXXXX in scala spark

I have written Scala-Spark code to build my project (my IDE is IntelliJ). It shows this error while running on an AWS EMR cluster, while working fine locally.
It crashes at the commented line below:
var join_sql = "select ipfile.id,ipfile.col1,opfile.col2 from ipfile join opfile on ipfile.id=opfile.id"
var df1 = Operation.spark.sql(join_sql)
df1.createOrReplaceTempView("df1")
var df2 = df1.groupBy("col1","col2").count()
df2.createOrReplaceTempView("df2")
df2 = Operation.spark.sql("select * from df2 order by count desc")
print("count : ", df2.count())
try {
  df2.foreach(t => {
    impact = t.getAs[Long]("impact").toString // Job was aborting at this particular line
    m1 = t.getAs[String]("col1")
    m2 = t.getAs[String]("col2")
    print("m1" + "m2")
  })
When I created the jar through sbt assembly to run in local mode, it was working fine, but when I created the jar for yarn-client and executed it in cluster mode, it showed this error.

exclusions doesn't work in AWS Glue ETL job s3 connection

According to the AWS Glue documentation, we can use exclusions to exclude files when the connection type is s3:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html
"exclusions": (Optional) A string containing a JSON list of Unix-style glob patterns to exclude. For example, "[\"**.pdf\"]" excludes all PDF files. For more information about the glob syntax that AWS Glue supports, see Include and Exclude Patterns.
My S3 bucket looks like the following, and I want to exclude the test1 folder.
/mykkkkkk-test
    test1/
        testfolder/
            11.json
            22.json
    test2/
        1.json
    test3/
        2.json
    test4/
        3.json
    test5/
        4.json
I use the following code to exclude the test1 folder, but it still runs the ETL over files under test1; the exclusion doesn't work.
datasource0 = glueContext.create_dynamic_frame_from_options("s3",
{'paths': ["s3://mykkkkkk-test/"],
'exclusions': "[\"test1/**\"]",
'recurse':True,
'groupFiles': 'inPartition',
'groupSize': '1048576'},
format="json",
transformation_ctx = "datasource0")
Does exclusions really work in an ETL PySpark script? I also tried the following, but none of them work:
'exclusions': "[\"test1/**\"]",
'exclusions': ["test1/**"],
'exclusions': "[\"test1\"]",
Try using the full path for exclusion.
datasource0 = glueContext.create_dynamic_frame.from_options(
    's3',
    {
        "paths": [
            's3://bucket/sample_data/'
        ],
        "recurse": True,
        "exclusions": "[\"s3://bucket/sample_data/temp/**\"]"
    },
    "json",
    transformation_ctx = "datasource0")