AWS Glue ETL Job Missing collection name - mongodb

I have Data Catalog tables generated by crawlers: one is a data source from MongoDB, and the second is a PostgreSQL (RDS) data source. The crawlers run successfully and the connection tests work.
I am trying to define an ETL job from MongoDB to PostgreSQL (a simple transform).
In the job I defined the source as the AWS Glue Data Catalog (MongoDB) table and the target as the Data Catalog PostgreSQL table.
When I run the job I get this error:
IllegalArgumentException: Missing collection name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.collection' property
It looks like this is related to the MongoDB part. I tried to set the 'database' and 'collection' parameters on the Data Catalog tables, and it didn't help.
The script generated for the source is:
AWSGlueDataCatalog_node1653400663056 = glueContext.create_dynamic_frame.from_catalog(
    database="data-catalog-db",
    table_name="data-catalog-table",
    transformation_ctx="AWSGlueDataCatalog_node1653400663056"
)
What could be missing?

I had the same problem; just add the additional_options parameter shown below.
AWSGlueDataCatalog_node1653400663056 = glueContext.create_dynamic_frame.from_catalog(
    database="data-catalog-db",
    table_name="data-catalog-table",
    transformation_ctx="AWSGlueDataCatalog_node1653400663056",
    additional_options={"database": "data-catalog-db",
                        "collection": "data-catalog-table"}
)
Additional parameters can be found on the AWS page
https://docs.aws.amazon.com/glue/latest/dg/connection-mongodb.html
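If you prefer to bypass the catalog table entirely, a minimal sketch along the lines below passes the database and collection straight through create_dynamic_frame.from_options. The URI, database, collection, and credentials are placeholders, not values from the question.
# Hypothetical sketch: read from MongoDB via from_options instead of the catalog.
# All connection values below are placeholders.
mongo_options = {
    "uri": "mongodb://example-host:27017",   # assumed MongoDB endpoint
    "database": "my_mongo_db",               # MongoDB database name
    "collection": "my_collection",           # MongoDB collection name
    "username": "user",
    "password": "password",
}

mongo_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="mongodb",
    connection_options=mongo_options,
    transformation_ctx="mongo_source",
)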

Related

export Amazon RDS into S3 or locally

I am using Amazon RDS Aurora PostgreSQL 10.18. I need to export specific tables with more than 50,000 rows to a CSV file (either locally or into an S3 bucket). I have tried several approaches, but all of them failed:
I tried the export-to-CSV button in the query editor after selecting all rows, but the API responded that the data was too large to return.
I tried to use aws_s3.query_export_to_s3, but got ERROR: credentials stored with the database cluster can't be accessed. Hint: Has the IAM role Amazon Resource Name (ARN) been associated with the feature-name "s3Export"? (See the sketch below.)
I tried to take a snapshot of our instance and then export it to an S3 bucket, but ended up with the error: The specified db snapshot engine mode isn't supported and can't be exported.
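The aws_s3.query_export_to_s3 error hints that no IAM role has been associated with the cluster for the s3Export feature. A minimal boto3 sketch of that association follows; the region, cluster identifier, and role ARN are placeholders, and the role must already grant write access to the target bucket.
import boto3

# Hypothetical sketch: attach an IAM role to the Aurora cluster for S3 export.
# The region, cluster identifier, and role ARN are placeholders.
rds = boto3.client("rds", region_name="us-east-1")

rds.add_role_to_db_cluster(
    DBClusterIdentifier="my-aurora-cluster",
    RoleArn="arn:aws:iam::123456789012:role/aurora-s3-export-role",
    FeatureName="s3Export",
)
Once the role is attached, aws_s3.query_export_to_s3 can be retried from a database session and should no longer report the credentials error.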

How to read tables from synapse database tables using pyspark

I am a newbie to Azure Synapse and have to work in an Azure Spark notebook. One of my colleagues connected the on-premises database using an Azure linked service. I have written a test framework for comparing the on-premises data with the data lake (curated) data, but I don't understand how to read those tables using PySpark.
Here is my linked service data structure, along with my linked service names and database name (screenshots omitted).
You can read any file stored in a Synapse linked location as a table by using the Azure Synapse Dedicated SQL Pool Connector for Apache Spark.
First, read the file that you want to expose as a table in Synapse. Use the code below to read the file:
%%pyspark
df = spark.read.load('abfss://sampleadls2@sampleadls1.dfs.core.windows.net/business.csv', format='csv', header=True)
Then save this DataFrame as a table using the code below:
%%pyspark
spark.sql("CREATE DATABASE IF NOT EXISTS business")
df.write.mode("overwrite").saveAsTable("business.data")
Now you can run any Spark SQL command on this table as shown below:
%%pyspark
data = spark.sql("SELECT * FROM business.data")
display(data)
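If the on-premises tables have already been landed in a dedicated SQL pool, a minimal sketch of reading one of them directly with the Dedicated SQL Pool Connector mentioned above is shown here. The pool, schema, and table names are placeholders, and this assumes a Synapse Spark runtime where the connector is available.
%%pyspark
# Hypothetical example: read a dedicated SQL pool table with the Synapse connector.
# "sqlpool1", "dbo", and "metadata_table" are placeholder names.
pool_df = spark.read.synapsesql("sqlpool1.dbo.metadata_table")
display(pool_df)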

o110.pyWriteDynamicFrame. null

I have created a visual job in AWS Glue that extracts data from Snowflake; the target is a PostgreSQL database in AWS.
I have been able to connect to both Snowflake and PostgreSQL, and I can preview data from both.
I have also been able to get data from Snowflake, write it to S3 as CSV, and then load that CSV into PostgreSQL.
However, when I try to get data from Snowflake and push it directly to PostgreSQL, I get the error below:
o110.pyWriteDynamicFrame. null
This means you can read the data from Snowflake into a DataFrame, but the job fails while writing that frame to PostgreSQL.
You need to check the AWS Glue logs to better understand why the write to PostgreSQL is failing.
Please also check whether the JAR versions needed by PostgreSQL are compatible with the Scala/Spark version on the AWS Glue side.
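As a sanity check, it can also help to replace the visual target node with an explicit write in the script. A minimal sketch follows; the JDBC URL, table name, credentials, and the source frame name are placeholders.
# Hypothetical sketch: write the Snowflake-sourced DynamicFrame to PostgreSQL via JDBC.
# The URL, table, credentials, and frame name are placeholders.
glueContext.write_dynamic_frame.from_options(
    frame=snowflake_dyf,  # the DynamicFrame read from Snowflake
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://example-host:5432/targetdb",
        "dbtable": "public.target_table",
        "user": "glue_user",
        "password": "glue_password",
    },
    transformation_ctx="postgres_sink",
)
If the explicit write surfaces a fuller stack trace in the CloudWatch logs, that may point to the missing or mismatched JDBC driver.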

Linking Access tables into a PostgreSQL Database using a foreign data wrapper

I'm new to Postgres, so this problem is probably a relatively easy one for someone else. However, I have spent many frustrating hours trying to figure out the solution. I have an Access database of metadata that must be kept updated for sending records to other groups. I also have a PostgreSQL database (managed through pgAdmin) that contains these same metadata tables. Currently the tables in the Postgres database are updated manually by exporting the Access tables as Excel files and then importing them into the SQL tables. It's not the most efficient process, and it could lead to errors in the SQL database if someone forgets to check that they are using the most recent data from Access before running any queries. So I would like to integrate some of the tables from my Access database with my Postgres database.
Originally I tried just installing drivers to export the Access tables directly to Postgres, which worked, but not in the way I wanted, since it just brings in a static table that I would still need to update manually. From my understanding, I can create a server connection from Postgres to Access that would then bring in updated data through a foreign data wrapper.
I tried to use ogr_fdw.
CREATE EXTENSION ogr_fdw;
When I try:
CREATE SERVER metadata
FOREIGN DATA WRAPPER ogr_fdw
OPTIONS (
datasource 'H:\Databases\20170712.accdb',
format 'ODBC' );
I receive: ERROR: unable to connect to data source "H:\Databases\20170712.accdb"
SQL state: HV00D
When I try:
CREATE SERVER metadata
FOREIGN DATA WRAPPER ogr_fdw
OPTIONS (
datasource 'H:\Databases\20170712.accdb',
format 'ACCDB' );
I receive: ERROR: unable to find format "ACCDB"
HINT: See the formats list at http://www.gdal.org/ogr_formats.html.
I also tried MDB and received the same error. MDB is the driver name given on that website, but it says it needs a JDK/JRE to compile, and I'm not really sure whether that's another type of driver I would need or what it is.
When I try:
CREATE SERVER metadata
FOREIGN DATA WRAPPER ogr_fdw
OPTIONS (
datasource 'H:\Databases\20170712.mdb',
format 'ODBC' );
I receive: ERROR: unable to connect to data source "H:\Databases\20170712.mdb"
SQL state: HV00D
Hint: Unable to initialize ODBC connection to DSN for DRIVER=Microsoft Access Driver (*.mdb);DBQ=H:\Databases\20170712.mdb,
[Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified
However, after looking at the GitHub help page for ogr_fdw, I thought it didn't need ODBC or special drivers: https://github.com/pramsey/pgsql-ogr-fdw/blob/master/FAQ.md.
A lot of this is probably due to my limited knowledge of the terminology as I read through this material. My Access database is an .accdb file, but since that wasn't working I also experimented with MDB and ODBC as the "format". If anyone has any suggestions I would greatly appreciate it.
Thanks!
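One hedged note on the last error: "Data source name not found and no default driver specified" usually means the ODBC driver manager cannot see an Access driver matching the name (or bitness) that ogr_fdw asked for. A small diagnostic sketch, assuming the pyodbc package is available on the machine running the Postgres server, that simply lists the registered ODBC drivers:
import pyodbc

# List the ODBC drivers registered on this machine. For the ODBC route to work,
# an Access driver such as "Microsoft Access Driver (*.mdb, *.accdb)" should
# appear here, with the same 32-/64-bit architecture as the PostgreSQL server.
for driver in pyodbc.drivers():
    print(driver)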

SQL Database + LOAD + CLOB files = error SQL3229W

I'm having trouble loading tables that have CLOB and BLOB columns into a 'SQL Database' database in Bluemix.
The error returned is:
SQL3229W The field value in row "617" and column "3" is invalid. The row was
rejected. Reason code: "1".
SQL3185W The previous error occurred while processing data from row "617" of
the input file.
The same procedure works normally in a local environment.
Below is the command I use to load:
load client from /home/db2inst1/ODONTO/tmp/ODONTO.ANAMNESE.IXF OF IXF LOBS FROM /home/db2inst1/ODONTO/tmp MODIFIED BY IDENTITYOVERRIDE replace into USER12135.TESTE NONRECOVERABLE
Currently the only way to load LOB files into a SQLDB or dashDB instance is to load the data and LOBs from the cloud. The options are Swift object storage in SoftLayer or Amazon S3 storage; you need an account on one of those services.
After that, you can use the following syntax:
db2 "call sysproc.admin_cmd('load from Softlayer::softlayer_end_point::softlayer_username::softlayer_api_key::softlayer_container_name::mylobs/blob.del of del LOBS FROM Softlayer::softlayer_end_point::softlayer_username::softlayer_api_key::softlayer_container_name::mylobs/ messages on server insert into LOBLOAD')"
Where:
mylobs/ is the directory inside the SoftLayer Swift object storage container (referenced in the LOBS FROM clause), and
LOBLOAD is the name of the table to load into.
Example:
db2 "call sysproc.admin_cmd('load from Softlayer::https://lon02.objectstorage.softlayer.net/auth/v1.0::SLOS424907-2:SL523907::0ac631wewqewre8af20c576ad5214ec70f163d600d247bd5d4dfef5453f72ff6::TestContainer::mylobs/blob.del of del LOBS FROM Softlayer::https://lon02.objectstorage.softlayer.net/auth/v1.0::SLOS424907-2:SL523907::0ac631wewqewre8af20c576ad5214ec70f163d600d247bd5d4dfef5453f72ff6::TestContainer::mylobs/ messages on server insert into LOBLOAD')"