For a current ETL job, I am trying to create a Python shell job in AWS Glue. The transformed data needs to be persisted in DocumentDB, but I am unable to access DocumentDB from Glue.
Since the DocumentDB cluster resides in a VPC, I thought of creating an interface VPC endpoint to access DocumentDB from Glue, but DocumentDB is not one of the services supported by interface endpoints. I have seen tunneling suggested as an option, but I do not want to do that.
So I want to know whether there is a way to connect to DocumentDB from Glue.
Create a dummy JDBC connection in AWS Glue. You will not need to run a test connection, but this will allow ENIs to be created in the VPC. Attach this connection to your Python shell job, and it will be able to interact with your resources.
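Once the ENIs exist, a plain pymongo client should be able to reach the cluster from the Python shell job. A rough sketch (the endpoint, credentials, and database/collection names below are placeholders, and it assumes TLS is enabled with the RDS CA bundle available to the job):
from pymongo import MongoClient

# Placeholders: substitute your cluster endpoint, credentials, database and collection.
# With TLS enabled (the DocumentDB default) the RDS CA bundle must be available locally.
client = MongoClient(
    "mongodb://youruser:yourpassword@yourdocdbcluster.us-east-1.docdb.amazonaws.com:27017/"
    "?tls=true&tlsCAFile=rds-combined-ca-bundle.pem&retryWrites=false"
)
collection = client["yourdbname"]["yourcollectionname"]
collection.insert_one({"status": "connected"})  # simple write to verify connectivity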
Have you tried using the MongoDB connection type under Glue connections? You can connect to DocumentDB through that option.
I have been able to connect to DocumentDB from Glue and ingest data from a CSV in S3. Here's the script to do that:
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Constants
data_catalog_database = 'sample-db'
data_catalog_table = 'data'

# Job parameters: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
spark_context = SparkContext()
glue_context = GlueContext(spark_context)
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

# Read from the Data Catalog table created by the crawler
dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
    database=data_catalog_database,
    table_name=data_catalog_table
)
documentdb_write_uri = 'mongodb://yourdocumentdbcluster.amazonaws.com:27017'
write_documentdb_options = {
    "uri": documentdb_write_uri,
    "database": "yourdbname",
    "collection": "yourcollectionname",
    "username": "###",
    "password": "###"
}

# Write the DynamicFrame to DocumentDB
glue_context.write_dynamic_frame.from_options(
    frame=dynamic_frame,
    connection_type="documentdb",
    connection_options=write_documentdb_options
)

job.commit()
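If your cluster has TLS enabled (the DocumentDB default), I believe the options dict also needs the SSL flags that AWS documents for the documentdb connection type, roughly:
# Assumed variant of write_documentdb_options for a TLS-enabled cluster; values are placeholders.
write_documentdb_options = {
    "uri": documentdb_write_uri,
    "database": "yourdbname",
    "collection": "yourcollectionname",
    "username": "###",
    "password": "###",
    "ssl": "true",
    "ssl.domain_match": "false"
}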
In summary:
Create a crawler that builds the schema of your data and a Data Catalog table; the data itself can be stored in an S3 bucket.
Use that database and table to ingest the data into your DocumentDB cluster.
I have Data Catalog tables generated by crawlers: one is a data source from MongoDB, and the second is a data source from PostgreSQL (RDS). The crawlers run successfully and the connection tests pass.
I am trying to define an ETL job from MongoDB to PostgreSQL (a simple transform).
In the job I defined the source as AWS Glue Data Catalog (MongoDB) and the target as the Data Catalog PostgreSQL table.
When I run the job I get this error:
IllegalArgumentException: Missing collection name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.collection' property
It looks like this is related to the MongoDB part. I tried to set the 'database' and 'collection' parameters on the Data Catalog tables and it didn't help.
The script generated for the source is:
AWSGlueDataCatalog_node1653400663056 = glueContext.create_dynamic_frame.from_catalog(
    database="data-catalog-db",
    table_name="data-catalog-table",
    transformation_ctx="AWSGlueDataCatalog_node1653400663056"
)
What could be missing?
I had the same problem; just add the additional_options parameter shown below.
AWSGlueDataCatalog_node1653400663056 = glueContext.create_dynamic_frame.from_catalog(
    database="data-catalog-db",
    table_name="data-catalog-table",
    transformation_ctx="AWSGlueDataCatalog_node1653400663056",
    additional_options={"database": "data-catalog-db",
                        "collection": "data-catalog-table"}
)
Additional parameters can be found in the AWS documentation:
https://docs.aws.amazon.com/glue/latest/dg/connection-mongodb.html
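Note that, per the documentation linked above, the database and collection values in additional_options should be the MongoDB database and collection to read from (they may differ from the Data Catalog names). For example, with made-up names:
# Hypothetical MongoDB names; substitute the actual database and collection in your cluster.
additional_options={"database": "my_mongo_db",
                    "collection": "my_mongo_collection"}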
Scenario:
We have a source SQL database table which gets updated every 24 hours. I am designing an automated process that exports that table to a CSV file on an EC2 instance after the source DB update happens.
Problem:
I am trying to figure out the best way to load a CSV file, containing DB records exported from a table with the bcp command-line utility, into an Aurora Serverless PostgreSQL database.
My current plan is to generate a set of insert statements from that CSV file with a script, then use the AWS CLI on the EC2 Linux instance to talk to the Aurora DB via the Data API feature of Aurora Serverless, running a transaction such as:
$ ID=$(aws rds-data begin-transaction --database users --output json | jq -r .transactionId)
// empty the table
$ aws rds-data execute-statement --transaction-id $ID --database users --sql "delete from mytable"
// populate the table with the latest data
$ aws rds-data execute-statement --transaction-id $ID --database users --sql "insert into mytable values (value1,value2)"
$ aws rds-data execute-statement --transaction-id $ID --database users --sql "insert into mytable values (value1,value2)"
$ ...
$ aws rds-data commit-transaction --transaction-id $ID
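For reference, here is a rough boto3 sketch of the same idea that batches the generated inserts through the Data API (the ARNs, column names, and two-column CSV layout below are assumptions):
# Rough, untested sketch: load the bcp-exported CSV through the Aurora Serverless Data API.
# The cluster/secret ARNs, table name, and column layout are placeholders.
import csv
import boto3

rds = boto3.client("rds-data")
CLUSTER_ARN = "arn:aws:rds:us-east-1:123456789012:cluster:users-cluster"
SECRET_ARN = "arn:aws:secretsmanager:us-east-1:123456789012:secret:users-db-creds"

tx_id = rds.begin_transaction(resourceArn=CLUSTER_ARN, secretArn=SECRET_ARN,
                              database="users")["transactionId"]

# empty the table inside the transaction
rds.execute_statement(resourceArn=CLUSTER_ARN, secretArn=SECRET_ARN, database="users",
                      transactionId=tx_id, sql="delete from mytable")

# turn each CSV row into a Data API parameter set
with open("mytable.csv", newline="") as f:
    parameter_sets = [
        [{"name": "col1", "value": {"stringValue": row[0]}},
         {"name": "col2", "value": {"stringValue": row[1]}}]
        for row in csv.reader(f)
    ]

# large files may need to be chunked, since batch_execute_statement limits parameter sets per call
rds.batch_execute_statement(resourceArn=CLUSTER_ARN, secretArn=SECRET_ARN, database="users",
                            transactionId=tx_id,
                            sql="insert into mytable values (:col1, :col2)",
                            parameterSets=parameter_sets)

rds.commit_transaction(resourceArn=CLUSTER_ARN, secretArn=SECRET_ARN, transactionId=tx_id)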
Is there a better way to load that CSV file into the Aurora DB, or should I stick with the above solution?
Note:
I found the article "Loading data into an Amazon Aurora MySQL DB cluster from text files in an Amazon S3 bucket" in the AWS docs, but it explicitly states that this feature currently isn't available for Aurora Serverless clusters.
It seems that moving data from GCS to MongoDB is not common, since there is not very much documentation on it. We have the following task that we pass as the python_callable to a PythonOperator; it moves data from BigQuery into GCS as JSON:
from google.cloud import bigquery


def transfer_gcs_to_mongodb(table_name):
    # connect
    client = bigquery.Client()

    bucket_name = "our-gcs-bucket"
    project_id = "ourproject"
    dataset_id = "ourdataset"

    destination_uri = f'gs://{bucket_name}/{table_name}.json'
    dataset_ref = bigquery.DatasetReference(project_id, dataset_id)
    table_ref = dataset_ref.table(table_name)

    configuration = bigquery.job.ExtractJobConfig()
    configuration.destination_format = 'NEWLINE_DELIMITED_JSON'

    extract_job = client.extract_table(
        table_ref,
        destination_uri,
        job_config=configuration,
        location="US",
    )  # API request
    extract_job.result()  # Waits for job to complete.

    print("Exported {}:{}.{} to {}".format(project_id, dataset_id, table_name, destination_uri))
This task is successfully getting data into GCS. However, we are now stuck on how to run mongoimport correctly to get this data into MongoDB. In particular, it seems that mongoimport cannot point to a file in GCS; the file has to be downloaded locally first and then imported into MongoDB.
How should this be done in Airflow? Should we write a shell script that downloads the JSON from GCS and then runs mongoimport with the correct URI and flags? Or is there another way to run mongoimport in Airflow that we are missing?
You don't need to write a shell script to download from GCS. You can simply use the GCSToLocalFilesystemOperator, then open the file and write it to MongoDB using the insert_many function of the MongoHook.
I didn't test it, but it should be something like:
import json
from airflow.providers.mongo.hooks.mongo import MongoHook

mongo = MongoHook(conn_id=mongo_conn_id)
with open('file.json') as f:
    docs = [json.loads(line) for line in f]  # the export is newline-delimited JSON
mongo.insert_many('your_collection', docs)  # pass the target collection name
This is for a pipe of: BigQuery -> GCS -> Local File System -> MongoDB.
You can also do it in memory as: BigQuery -> GCS -> MongoDB if you prefer.
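If it helps, the download step in the DAG could look roughly like this (the task id, bucket, object, and local path are placeholders):
from airflow.providers.google.cloud.transfers.gcs_to_local import GCSToLocalFilesystemOperator

# Hedged sketch of the GCS -> local filesystem step; names and paths are placeholders.
download_json = GCSToLocalFilesystemOperator(
    task_id="download_table_json",
    bucket="our-gcs-bucket",
    object_name="our_table.json",
    filename="/tmp/our_table.json",
)
A downstream PythonOperator can then open /tmp/our_table.json and call insert_many as shown above.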
I tried to connect to a Redshift system table called stv_sessions, and I can read the data into a dataframe.
This stv_sessions table is a Redshift system table which has the process IDs of all the queries that are currently running.
To terminate a running query we can do this:
select pg_terminate_backend(pid)
While this works for me if I connect to Redshift directly (using Aginity), it gives me insufficient privilege issues when trying to run it from Databricks.
Simply put, I don't know how to run the query from a Databricks notebook.
I have tried this so far:
kill_query = "select pg_terminate_backend('12345')"
some_random_df_i_created.write.format("com.databricks.spark.redshift") \
    .option("url", redshift_url).option("dbtable", "stv_sessions") \
    .option("tempdir", temp_dir_loc).option("forward_spark_s3_credentials", True) \
    .option("preactions", kill_query).mode("append").save()
Please let me know if the methodology I am following is correct.
Thank you
Databricks purposely does not pre-include this driver. You need to download the official Amazon Redshift JDBC driver, upload it to Databricks, and attach the library to your cluster (v1.2.12 or lower is recommended with Databricks clusters). Then use a JDBC URL of the form:
val jdbcUsername = "REPLACE_WITH_YOUR_USER"
val jdbcPassword = "REPLACE_WITH_YOUR_PASSWORD"
val jdbcHostname = "REPLACE_WITH_YOUR_REDSHIFT_HOST"
val jdbcPort = 5439
val jdbcDatabase = "REPLACE_WITH_DATABASE"
val jdbcUrl = s"jdbc:redshift://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}?user=${jdbcUsername}&password=${jdbcPassword}"
Then try putting jdbcUrl in place of your redshift_url.
That may be the only reason you are getting privilege issues.
Link 1: https://docs.databricks.com/_static/notebooks/redshift.html
Link 2: https://docs.databricks.com/data/data-sources/aws/amazon-redshift.html#installation
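In PySpark, plugging that URL into your original write (and using option rather than options for the preaction) would look roughly like this; the URL, temp dir, and pid are placeholders:
# Hedged PySpark sketch: the same write as in the question, but with the full JDBC URL
# (credentials embedded) and .option() instead of .options(). All values are placeholders.
jdbc_url = "jdbc:redshift://REPLACE_HOST:5439/REPLACE_DB?user=REPLACE_USER&password=REPLACE_PASSWORD"
kill_query = "select pg_terminate_backend('12345')"

(some_random_df_i_created.write
    .format("com.databricks.spark.redshift")
    .option("url", jdbc_url)
    .option("dbtable", "stv_sessions")
    .option("tempdir", "s3a://your-bucket/redshift-temp/")
    .option("forward_spark_s3_credentials", True)
    .option("preactions", kill_query)
    .mode("append")
    .save())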
Another reason could be that the Redshift-Databricks connector only uses SSL (encryption in flight), and it is possible that IAM roles have been set on your Redshift cluster to only allow some users to delete tables.
Apologies if none of this helps your case.
I am planning to have a TimescaleDB database running on an EC2 or Lightsail instance. I would like to be able to connect to and run queries on this TimescaleDB database from a Lambda function, to insert data into and read data from the DB.
I know TimescaleDB is a Postgres extension, and there are plenty of articles online documenting the process of connecting to a Postgres DB running inside AWS RDS from a Lambda, but I can't seem to find any describing how to connect to one running on an EC2 or Lightsail instance.
Question: How do I connect to a TimescaleDB instance running on EC2 or Lightsail from a Lambda function?
I'd say the answer is the same as for connecting to RDS, as documented here:
https://docs.aws.amazon.com/lambda/latest/dg/vpc-rds.html
That guide also gives a good example of connecting to a PostgreSQL RDS instance; the difference is that instead of using rds_config, you specify the hostname/IP and other connection details so that they point to your EC2 instance. For example, if your EC2 instance's public DNS name is ec2-3-14-229-184.us-east-2.compute.amazonaws.com:
import sys
import logging
import psycopg2

logger = logging.getLogger()
logger.setLevel(logging.INFO)

host = "ec2-3-14-229-184.us-east-2.compute.amazonaws.com"
name = "demo_user"
password = "p#assword"
db_name = "demo"

try:
    conn = psycopg2.connect(host=host,
                            database=db_name,
                            user=name,
                            password=password)
    cur = conn.cursor()
except Exception:
    logger.error("ERROR: Unexpected error: Could not connect to PostgreSQL instance.")
    sys.exit()
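With the connection established at module load, a minimal handler could then run queries against TimescaleDB, for example (the table name is made up):
def lambda_handler(event, context):
    # 'conditions' is a hypothetical hypertable; replace with your own query
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM conditions")
        row_count = cur.fetchone()[0]
    return {"row_count": row_count}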