I'm doing ETL from Postgres to Redshift with AWS Glue. I have imported a table with a Crawler and created a Job to just transfer the data and create a new table in Redshift. I get:
An error occurred while calling o65.getDynamicFrame. ERROR: column "id" does not exist
In the original table the column is "Id", case sensitive. Is there a way to make Glue case sensitive? (I don't have permissions to change the postgres schema).
dyf.apply_mapping(mappings, case_sensitive=True, transformation_ctx="tfx")
In mappings, you should map Id to id, for example:
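A minimal sketch of what that could look like (the Name column and the types here are hypothetical; adjust them to your schema):

dyf = dyf.apply_mapping(
    [("Id", "int", "id", "int"),          # map the case-sensitive "Id" to lowercase "id"
     ("Name", "string", "name", "string")],
    case_sensitive=True,
    transformation_ctx="tfx")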
I wanted to check whether Large Object (LO) replication is supported by AWS DMS when the source and destination DBs are both PostgreSQL.
I just used pglogical to replicate a DB which has Large Objects (OIDs etc.), and the target DB does not have the LOs.
When I query a table on the destination which uses an OID column:
select id, lo_get(json) from table_1 where id=998877;
ERROR: large object 6698726 does not exist
The json column is of the oid datatype.
If AWS DMS takes care of it, I will start using it.
Thanks
Task - I have to insert some data into a table which resides in the AWS Glue Data Catalog.
I can already use boto3 to retrieve data from the table, but I can't write to the Glue catalog.
client = boto3.client('glue', 'us-east-1')
client.put_item(tablename='abcd', item={'col1': {'S', 'goal'}, 'col2': {'S', 'goal1'}})
job.commit()
I got an error:
glue object has no attribute 'put_item'
Question - How do I insert data into a table in the AWS Glue Data Catalog?
Please help!
AWS Glue does not have a put_item method - see the Boto3 Glue documentation. While invalid, your code looks like it was written for DynamoDB, given the method name and the parameters (e.g. table, item).
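If DynamoDB is what you actually want, the equivalent boto3 call would look something like this (reusing the table and attribute names from your snippet):

import boto3

# put_item exists on the DynamoDB client, not the Glue client
dynamodb = boto3.client('dynamodb', 'us-east-1')
dynamodb.put_item(
    TableName='abcd',
    Item={'col1': {'S': 'goal'}, 'col2': {'S': 'goal1'}})

Note that the Glue Data Catalog stores table metadata rather than rows, so "inserting data" into a catalog table really means writing to its underlying data store (e.g. S3).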
I have a Glue job that loads a CSV from S3 into a Redshift table. There is one column (updated_date) which is not mapped; its default value is set to current_timestamp in UTC. But each time the Glue job runs, this updated_date column is null.
I tried removing updated_dt from the Glue metadata table, and I tried removing updated_dt from SelectFields.apply() in the Glue script.
When I do a normal INSERT statement in Redshift without the updated_dt column, the default current_timestamp value is inserted for those rows.
Thanks
Well, I had the same problem. AWS support told me to convert the Glue DynamicFrame into a Spark DataFrame and use the Spark JDBC writer to load the data into Redshift:
SparkDF = GlueDynFrame.toDF()
SparkDF.write.format('jdbc').options(
    url='<JDBC url>',
    dbtable='<schema>.<table>',
    user='<username>',
    password='<password>').mode('append').save()
I, on the other hand, handled the problem by either dropping and re-creating the target table using preactions, or by running an UPDATE that sets the default values in postactions, as sketched below.
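A rough sketch of the postactions variant with the Glue Redshift writer (the frame, connection name, table, and S3 path are all placeholders):

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,  # your DynamicFrame
    catalog_connection='my-redshift-connection',
    connection_options={
        'dbtable': 'public.my_table',
        'database': 'mydb',
        # hypothetical fix-up: backfill the unmapped column after the load
        'postactions': 'UPDATE public.my_table SET updated_dt = GETDATE() WHERE updated_dt IS NULL;'
    },
    redshift_tmp_dir='s3://my-bucket/tmp/')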
I am trying to insert a Spark DataFrame into Postgres using a JDBC write. The Postgres table has a unique constraint on one of its columns; when the DataFrame to be inserted violates the constraint, the entire batch is rejected and the Spark session closes with the error "duplicate key value violates unique constraint", which is correct, as the data is duplicate (it already exists in the database):
org.postgresql.jdbc.BatchResultHandler.handleError(BatchResultHandler.java:148
What I need is for the rows which do not violate the constraint to be inserted, and the failing rows to be ignored, without failing the entire batch.
The code used is:
mode = "Append"
url = "jdbc:postgresql://IP/DB name"
properties = {"user": "username", "password": "password"}
DF.write
.option("numPartitions",partitions_for_parallelism)
.option("batchsize",batch_size)
.jdbc(url=url, table="table name", mode=mode, properties=properties)
How can I do this?
Unfortunately, there is no out-of-the-box solution in Spark. There are a number of possible solutions I see:
Implement the conflict-resolution business logic inside a foreachPartition function that writes to the PostgreSQL database row by row: for example, catch the constraint-violation exception and report it to the log.
Drop the constraint on the PostgreSQL table and use an autogenerated PK, which means allowing duplicated rows in the database. Deduplication logic can then be implemented in each SQL query, or run as a daily/hourly job. You can see an example here.
If no other system or process writes to the PostgreSQL table except your Spark job, you can filter out all existing rows from the Spark DataFrame with a join before writing, something like the sketch below.
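A minimal sketch of that join-based filter, assuming a SparkSession named spark and that the constrained column is called unique_col (both hypothetical):

# Read back only the existing keys, then anti-join them away before writing.
existing = spark.read.jdbc(url=url, table="table name", properties=properties) \
    .select("unique_col")
new_rows = DF.join(existing, on="unique_col", how="left_anti")
new_rows.write.jdbc(url=url, table="table name", mode="append", properties=properties)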
I hope these ideas are helpful.
That is not possible if you have a unique constraint on the target. There is currently no upsert mode with these techniques; you need to design around this limitation.
I'm currently facing an issue while trying to save an Apache Spark DataFrame, loaded from a Spark temp table, to a distributed MemSQL database.
The trick is that I cannot use the MemSQLContext connector for the moment, so I'm using the JDBC driver.
Here is my code:
//store suppliers data from temp table into a dataframe
val suppliers = sqlContext.read.table("tmp_SUPPLIER")
//append data to the target table
suppliers.write.mode(SaveMode.Append).jdbc(url_memsql, "R_SUPPLIER", prop_memsql)
Here is the error message (occurring during the suppliers.write statement):
java.sql.SQLException: Distributed tables must either have a PRIMARY or SHARD key.
Note: the R_SUPPLIER table has exactly the same fields and datatypes as the temp table, and it has a primary key set.
FYI, here are some clues:
R_SUPPLIER script:
CREATE TABLE R_SUPPLIER
(
    SUP_ID INT NOT NULL PRIMARY KEY,
    SUP_CAGE_CODE CHAR(5) NULL,
    SUP_INTERNAL_SAP_CODE CHAR(5) NULL,
    SUP_NAME VARCHAR(255) NULL,
    SHARD KEY(SUP_ID)
);
The suppliers.write statement did work once, but the data had then been loaded into the DataFrame with a sqlContext.read.jdbc command rather than sqlContext.sql (i.e. the data came from a remote database, not from a Spark local temp table).
Did anyone face the same issue, please?
Are you getting that error when you run the CREATE TABLE, or when you run the suppliers.write code? That is an error you should only get when creating a table, so if you are hitting it when running suppliers.write, your code is probably trying to create and write to a new table rather than the one you created before.
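One quick way to check is to read the table back through the exact same JDBC URL and properties: if this read fails, the writer cannot see R_SUPPLIER either, and in append mode it will issue its own CREATE TABLE (without a shard key, hence the error). A minimal sketch, shown in PySpark but reusing the names from the Scala above:

# If this succeeds, SaveMode.Append should insert into the existing table
# instead of trying to create a new one.
check = sqlContext.read.jdbc(url_memsql, "R_SUPPLIER", properties=prop_memsql)
check.printSchema()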