AWS GLUE ERROR : An error occurred while calling o75.pyWriteDynamicFrame. Cannot cast STRING into a IntegerType (value: BsonString{value=''}) - mongodb

I have a simple Glue PySpark job that connects to a MongoDB source through a Glue catalog table, extracts data from the MongoDB collections, and writes JSON output to S3 using a Glue dynamic frame.
The MongoDB database here is deeply nested, with structs and arrays. Since it is a NoSQL database, the source schema is not fixed, and nested columns may vary from document to document.
However, the job fails with the error below.
ERROR: py4j.protocol.Py4JJavaError: An error occurred while calling o75.pyWriteDynamicFrame.: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 6, 10.3.29.22, executor 1): com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a IntegerType (value: BsonString{value=''})
Since the job fails because of a datatype mismatch, I have tried every solution I could think of, such as using resolveChoice(). Because the error is for a property with an 'int' datatype, I tried casting every property of type 'int' to 'string'.
I also tried the code with DropNullFields, writing with a Spark DataFrame, ApplyMapping, reading without the catalog table (from_options directly against the Mongo collection), and with and without repartition.
All these attempts are left commented out in the code below for reference.
CODE SNIPPET
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
print("Started")
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "<catalog_db_name>", table_name = "<catalog_table_name>", additional_options = {"database": "<mongo_database_name>", "collection": "<mongo_db_collection>"}, transformation_ctx = "datasource0")
# Code to read data directly from mongo database
# datasource0 = glueContext.create_dynamic_frame_from_options(connection_type = "mongodb", connection_options = { "uri": "<connection_string>", "database": "<mongo_db_name>", "collection": "<mongo_collection>", "username": "<db_username>", "password": "<db_password>"})
# Code sample for resolveChoice (casts all the 'int' properties to 'string')
# resolve_dyf = datasource0.resolveChoice(specs = [("nested.property", "cast:string"),("nested.further[].property", "cast:string")])
# Code sample to dropnullfields
# dyf_dropNullfields = DropNullFields.apply(frame = resolve_dyf, transformation_ctx = "dyf_dropNullfields")
data_sink0 = datasource0.repartition(1)
print("Repartition done")
# Code sample to sink using spark's write method
# data_sink0.write.format("json").option("header","true").save("s3://<s3_folder_path>")
datasink1 = glueContext.write_dynamic_frame.from_options(frame = data_sink0, connection_type = "s3", connection_options = {"path": "s3://<S3_folder_path>"}, format = "json", transformation_ctx = "datasink1")
print("Data Sink complete")
job.commit()
NOTE
I am not exactly sure why this is happening, because the issue is intermittent: sometimes the job works perfectly, and at other times it fails, which is quite confusing.
Any help will be highly appreciated.

I was facing the same problem. A simple solution is to increase the sample size from 1000 (which is the default for the MongoDB connector) to 100000. A sample config is included below for your reference.
read_config = {
    "uri": documentdb_write_uri,
    "database": "your_db",
    "collection": "your_collection",
    "username": "user",
    "password": "password",
    "partitioner": "MongoSamplePartitioner",
    "sampleSize": "100000",
    "partitionerOptions.partitionSizeMB": "1000",
    "partitionerOptions.partitionKey": "_id"
}
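For completeness, here is a minimal usage sketch (assuming documentdb_write_uri and the glueContext from the question's job are already defined; the transformation_ctx name is just an example) showing how this read_config can be passed as the connection options when reading directly from MongoDB/DocumentDB:
# Usage sketch: pass the read_config above as connection_options, following the
# same pattern as the commented-out from_options call in the question.
datasource0 = glueContext.create_dynamic_frame_from_options(
    connection_type="mongodb",
    connection_options=read_config,
    transformation_ctx="datasource0"
)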

Related

java.lang.UnsupportedOperationException: Data source mongodb does not support microbatch processing

I'm trying to read/write streaming data from the Cosmos DB API for MongoDB into Databricks PySpark and I am getting the error java.lang.UnsupportedOperationException: Data source mongodb does not support microbatch processing.
Can anyone help with how to achieve this data streaming in PySpark?
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.streaming import *
from pyspark.sql.types import StringType,BooleanType,DateType,StructType,LongType,IntegerType
spark = SparkSession.\
    builder.\
    appName("streamingExampleRead").\
    config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector:10.0.0').\
    getOrCreate()

sourceConnectionString = "<primary connection string of the Cosmos DB API for MongoDB instance>"
sourceDb = "<your database name>"
sourceCollection = "<your collection name>"

dataStreamRead = (
    spark.readStream.format("mongodb")
    .option('spark.mongodb.connection.uri', sourceConnectionString)
    .option('spark.mongodb.database', sourceDb)
    .option('spark.mongodb.collection', sourceCollection)
    .option('spark.mongodb.change.stream.publish.full.document.only', 'true')
    .option("forceDeleteTempCheckpointLocation", "true")
    .load()
)

display(dataStreamRead)

query2 = (dataStreamRead.writeStream
    .outputMode("append")
    .option("forceDeleteTempCheckpointLocation", "true")
    .format("console")
    .trigger(processingTime='1 seconds')
    .start()
    .awaitTermination())
I am getting the following error:
java.lang.UnsupportedOperationException: Data source mongodb does not support microbatch processing.
at org.apache.spark.sql.errors.QueryExecutionErrors$.microBatchUnsupportedByDataSourceError(QueryExecutionErrors.scala:1579)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$1.applyOrElse(MicroBatchExecution.scala:123)
Data source mongodb does not support microbatch processing.
=== Streaming Query ===
Identifier: [id = 78cfcef1-19de-40f4-86fc-847109263ee9, runId = d2212e1f-5247-4cd2-9c8c-3cc937e2c7c5]
Current Committed Offsets: {}
Current Available Offsets: {}
Current State: INITIALIZING
Thread State: RUNNABLE
Try using trigger(continuous="1 second") instead of trigger(processingTime='1 seconds').
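As a rough illustration of that change, here is the question's write query with only the trigger swapped; everything else is assumed to stay the same:
# Same streaming write as in the question, but with a continuous trigger
# instead of a micro-batch processingTime trigger.
query2 = (dataStreamRead.writeStream
    .outputMode("append")
    .option("forceDeleteTempCheckpointLocation", "true")
    .format("console")
    .trigger(continuous="1 second")
    .start()
    .awaitTermination())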

How to upload files from local to staging table in Snowflake using Spark

I was using the Snowflake connector for the same purpose in Python.
The code I had used was:
import snowflake.connector as sf

conn = sf.connect(user=user, password=password, account=account, warehouse=warehouse,
                  database=database, schema=schema)

def execute_query(connection, query):
    cursor = connection.cursor()
    cursor.execute(query)
    cursor.close()

query = "create or replace stage table_stage file_format = (TYPE=CSV);"
execute_query(conn, query)

query = "put file://local_file.csv @table_stage auto_compress=true"
execute_query(conn, query)
Now I need to achieve the same using Spark. The code I'm using is:
sfOptions = {
    "sfURL": "url",
    "sfAccount": "account",
    "sfUser": "user",
    "sfPassword": "password",
    "sfDatabase": "database",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "warehouse"
}

spark.sparkContext._jvm.net.snowflake.spark.snowflake.Utils.runQuery(sfOptions,
    "create or replace stage table_stage file_format = (TYPE=CSV);")
spark.sparkContext._jvm.net.snowflake.spark.snowflake.Utils.runQuery(sfOptions,
    "put file://local_file.csv @table_stage auto_compress=true")

I'm able to create the staging table with this, but I am not able to upload the files.
Please suggest any alternative method for doing the same.

How can I access a table in an AWS KMS-encrypted Redshift cluster from a Glue job using a PySpark script?

My requirement:
I want to write a PySpark script to read data from a table in an AWS KMS-encrypted Redshift cluster (with "require SSL" set to true).
How can I retrieve connection details such as the password and use them to connect to Redshift, as in the sample code below?
What is the standard way to do this?
Do I have to use any API?
I know that the command below generates a temporary password, but this password does not work in the Glue Redshift connection. Plus, I believe it is not the recommended approach.
aws redshift get-cluster-credentials --db-user adminuser --db-name dev --cluster-identifier mycluster
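For reference, a rough boto3 sketch of the same API call from inside a job is shown below (the cluster, database, and user names are the same placeholders as in the CLI command above; as noted, the temporary credentials may still not work with the Glue Redshift connection, so this is only the programmatic equivalent, not a confirmed fix):
import boto3

# Sketch: programmatic equivalent of `aws redshift get-cluster-credentials`
redshift = boto3.client("redshift")
creds = redshift.get_cluster_credentials(
    DbUser="adminuser",
    DbName="dev",
    ClusterIdentifier="mycluster",
    AutoCreate=False
)
temp_user = creds["DbUser"]          # returned as "IAM:adminuser"
temp_password = creds["DbPassword"]  # temporary password with a limited lifetime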
My sample glue spark script:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import SQLContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'TempDir'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
#job.commit()
connection_options = {
    "url": "jdbc:redshift://endpoint",
    "dbtable": "some_table",
    "user": "user",
    "password": "some_password",  # how can I retrieve password and avoid plaintext?
    "redshiftTmpDir": args["TempDir"]
}
df = glueContext.create_dynamic_frame_from_options("redshift", connection_options).toDF()
print(df.count())

Airflow task not running on schedule with PrestoDB Query

I have defined an Airflow sample task where I want to run a PrestoDB query followed by a Spark job that performs a simple word count example. Here is the DAG I defined:
from pandas import DataFrame
import logging
from datetime import timedelta
from operator import add
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.hooks.presto_hook import PrestoHook
default_args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(1),
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'presto_dag',
    default_args=default_args,
    description='A simple tutorial DAG with PrestoDB and Spark',
    # Run the DAG once per day
    schedule_interval='@daily',
)
def talk_to_presto():
    ph = PrestoHook(host='presto.myhost.com', port=9988)
    # Query PrestoDB
    query = "show catalogs"
    # Fetch Data
    data = ph.get_records(query)
    logging.info(data)
    return data

def submit_to_spark():
    # conf = SparkConf().setAppName("PySpark App").setMaster("http://sparkhost.com:18080/")
    # sc = SparkContext(conf)
    # data = sc.parallelize(list("Hello World"))
    # counts = data.map(lambda x: (x, 1)).reduceByKey(add).sortBy(lambda x: x[1], ascending=False).collect()
    # for (word, count) in counts:
    #     print("{}: {}".format(word, count))
    # sc.stop()
    return "Hello"
presto_task = PythonOperator(
    task_id='talk_to_presto',
    provide_context=True,
    python_callable=talk_to_presto,
    dag=dag,
)

spark_task = PythonOperator(
    task_id='submit_to_spark',
    provide_context=True,
    python_callable=submit_to_spark,
    dag=dag,
)
presto_task >> spark_task
When I submit the task, about 20 DAG instances stay in the running state.
But the task never completes and no logs are generated, at least for the PrestoDB query. I am able to run the same PrestoDB query correctly from Airflow's Data Profiling > Ad-Hoc Query section.
I have intentionally commented out the PySpark code, as it wasn't running and is not the focus of this question.
I have two questions:
Why don't the tasks complete and instead stay in the running state?
What am I doing wrong with the PrestoHook, given that the query isn't running?

How to set up a local development environment for Scala Spark ETL to run in AWS Glue?

I'd like to be able to write Scala in my local IDE and then deploy it to AWS Glue as part of a build process. But I'm having trouble finding the libraries required to build the GlueApp skeleton generated by AWS.
The aws-java-sdk-glue doesn't contain the imported classes, and I can't find those libraries anywhere else. They must exist somewhere, but perhaps they are just a Java/Scala port of this library: aws-glue-libs
The template Scala code from AWS:
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    // @type: DataSource
    // @args: [database = "raw-tickers-oregon", table_name = "spark_delivery_2_1", transformation_ctx = "datasource0"]
    // @return: datasource0
    // @inputs: []
    val datasource0 = glueContext.getCatalogSource(database = "raw-tickers-oregon", tableName = "spark_delivery_2_1", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame()
    // @type: ApplyMapping
    // @args: [mapping = [("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")], transformation_ctx = "applymapping1"]
    // @return: applymapping1
    // @inputs: [frame = datasource0]
    val applymapping1 = datasource0.applyMapping(mappings = Seq(("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")), caseSensitive = false, transformationContext = "applymapping1")
    // @type: DataSink
    // @args: [connection_type = "s3", connection_options = {"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}, format = "json", transformation_ctx = "datasink2"]
    // @return: datasink2
    // @inputs: [frame = applymapping1]
    val datasink2 = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions("""{"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}"""), transformationContext = "datasink2", format = "json").writeDynamicFrame(applymapping1)
    Job.commit()
  }
}
And the build.sbt I have started putting together for a local build:
name := "aws-glue-scala"
version := "0.1"
scalaVersion := "2.11.12"
updateOptions := updateOptions.value.withCachedResolution(true)
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1"
The documentation for the AWS Glue Scala API seems to outline functionality similar to what is available in the AWS Glue Python library. So perhaps all that is required is to download and build the PySpark AWS Glue library and add it to the classpath? That may be possible, since the Glue Python library uses Py4J.
@Frederic gave a very helpful hint to get the dependency from s3://aws-glue-jes-prod-us-east-1-assets/etl/jars/glue-assembly.jar.
Unfortunately, that version of glue-assembly.jar is already outdated and brings in Spark version 2.1.
That's fine if you're using backward-compatible features, but if you rely on the latest Spark version (and possibly the latest Glue features), you can get the appropriate jar from a Glue dev endpoint under /usr/share/aws/glue/etl/jars/glue-assembly.jar.
Provided you have a dev endpoint named my-dev-endpoint, you can copy the current jar from it:
export DEV_ENDPOINT_HOST=`aws glue get-dev-endpoint --endpoint-name my-dev-endpoint --query 'DevEndpoint.PublicAddress' --output text`
scp -i dev-endpoint-private-key \
glue#$DEV_ENDPOINT_HOST:/usr/share/aws/glue/etl/jars/glue-assembly.jar .
Unfortunately, there are no libraries available for the Scala Glue API. I have already contacted Amazon support and they are aware of this problem; however, they didn't provide any ETA for delivering the API jar.
As a workaround, you can download the jar from S3. The S3 URI is s3://aws-glue-jes-prod-us-east-1-assets/etl/jars/glue-assembly.jar
See https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-repl.html
This is now supported; see this recent release from AWS:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html