How to handle NULL with AWS Glue bookmark - pyspark

I have a table of 30GB in size. I am running an ETL with an AWS Glue job that copies the table to an S3 bucket.
I am trying to bookmark using a combination of a couple of columns as the bookmark key. Some of these columns contain rows with null values.
I get this error:
An error occurred while calling o97.getDynamicFrame. Incorrect DATETIME value: 'null'
I would like to ask if there is any way to give the column a default value.
The other alternative is moving the entire table without a bookmark, which I don't think is efficient.
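One way to "give the column a default value" is to do it on the source side before the Glue job reads the table: backfill the existing NULLs with a sentinel timestamp and add a column default so new rows never arrive with a NULL in the bookmark key. A hedged sketch, assuming a PostgreSQL-style source and a hypothetical table my_table with a hypothetical bookmark column updated_at:

-- Hypothetical names; the sentinel must sort before any real timestamp so the
-- bookmark does not skip genuinely new rows.
UPDATE my_table
SET    updated_at = TIMESTAMP '1970-01-01 00:00:00'
WHERE  updated_at IS NULL;

-- New rows get a non-NULL value automatically from now on.
ALTER TABLE my_table
    ALTER COLUMN updated_at SET DEFAULT CURRENT_TIMESTAMP;

Whether this is enough depends on whether the bookmark key really has to include that column; if another never-NULL column (for example a monotonically increasing id) can serve as the bookmark key, that avoids the problem entirely.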

Related

Is there a way to get the row and column where the INSERT-Statement failed in PostgreSQL?

I'm trying to insert around 15000 rows with 200 columns in one batch into a table in PostgreSQL (with TimescaleDB extension) via Python by using the psycopg2 library. Some of the values I want to insert might be larger than the table's column datatype allows. This results in:
ERROR: smallint out of range, SQL state: 22003
I haven't found a way to get more information about the location of the error to handle it.
In MySQL, the column and the row where the error occurred are reported back by default, and it is even possible to clip the values to the maximum of the datatype (which would also be fine). Is there a way to handle this in a similar manner in PostgreSQL?
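As a side note on the clipping idea mentioned above: one place it can be done is inside the INSERT statement itself, clamping each value into the smallint range before it reaches the column. A minimal sketch, using a hypothetical single-column table and literal values instead of the real 15000-row batch:

-- Hypothetical example table.
CREATE TEMP TABLE readings (sensor smallint);

-- Clamp each incoming value into the smallint range before it hits the column.
INSERT INTO readings (sensor)
SELECT LEAST(GREATEST(v.val, -32768), 32767)::smallint
FROM (VALUES (100000), (42)) AS v(val);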

Load data with default values into Redshift from a parquet file

I need to load data with a default value column into Redshift, as outlined in the AWS docs.
Unfortunately the COPY command doesn't allow loading data with default values from a parquet file, so I need to find a different way to do that.
My table requires a column that defaults to Redshift's GETDATE() function:
LOAD_DT TIMESTAMP DEFAULT GETDATE()
If I use the COPY command and add the column names as arguments I get the error:
Column mapping option argument is not supported for PARQUET based COPY
What is a workaround for this?
Can you post a reference for Redshift not supporting default values for a Parquet COPY? I haven't heard of this restriction.
As for workarounds, I can think of two:
Copy the file to a temp table and then insert from this temp table into your table with the default value (a sketch follows below).
Define an external table that uses the parquet file as its source and insert from this table into the table with the default value.
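A rough sketch of the first workaround, assuming a hypothetical target table my_table(id, amount, load_dt) where load_dt is the defaulted column from the question, and placeholder values for the S3 path and IAM role:

-- Hypothetical target table; LOAD_DT defaults to GETDATE() as in the question.
CREATE TABLE my_table (
    id      BIGINT,
    amount  DECIMAL(18,2),
    load_dt TIMESTAMP DEFAULT GETDATE()
);

-- 1. Stage only the columns that actually exist in the parquet file.
CREATE TEMP TABLE stage (
    id     BIGINT,
    amount DECIMAL(18,2)
);

-- 2. The parquet COPY needs no column list when the staging table matches the file.
COPY stage
FROM 's3://my-bucket/my-prefix/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-copy-role'
FORMAT AS PARQUET;

-- 3. load_dt is left out of the column list, so its default GETDATE() applies.
INSERT INTO my_table (id, amount)
SELECT id, amount
FROM stage;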

Copy activity auto-creates nvarchar(max) columns

I have an Azure Data Factory copy activity which loads parquet files into Azure Synapse, with the sink set to auto-create the staging table.
Until today the auto-created staging tables used nvarchar(4000) string columns, and creating a temp table based on the staging one worked fine. Now the newly created tables suddenly receive nvarchar(max) instead of nvarchar(4000), and the temp table creation fails with an obvious error:
Column 'currency_abbreviation' has a data type that cannot participate in a columnstore index.
Why the AutoCreate table definition has changed and how can I return it to the "normal" behavior without nvarchar(max) columns?
I've got exactly the same problem! I'm using a Data Factory to read CSV files into my Azure data warehouse, and this used to result in nvarchar(4000) columns, but now they are all nvarchar(max). I also get the error:
Column xxx has a data type that cannot participate in a columnstore index.
My solution for now is to change my SQL code and use a CAST to change the formats, but there must be a setting in the data factory to get the former results back...
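For reference, a hedged sketch of that CAST workaround on a dedicated SQL pool, assuming a hypothetical staging table stg.currencies and showing only the problematic column:

-- Casting the auto-created nvarchar(max) column down to nvarchar(4000) lets
-- the new table keep its clustered columnstore index.
CREATE TABLE #currencies_fixed
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS
SELECT CAST(currency_abbreviation AS NVARCHAR(4000)) AS currency_abbreviation
FROM stg.currencies;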

invalid input syntax for type json aws dms postgres

I'm running a task that migrates all data from a PostgreSQL 10.4 instance to an RDS PostgreSQL 10.4 instance.
I am not able to migrate tables which have a jsonb column.
After the error, the whole table gets suspended, even though the table contains only 449 rows.
I have set the following error policy, but the whole table still gets suspended:
"DataErrorPolicy": "IGNORE_RECORD",
"DataTruncationErrorPolicy": "IGNORE_RECORD",
"DataErrorEscalationPolicy": "SUSPEND_TABLE",
"DataErrorEscalationCount": 1000,
My expectation is that the whole table should be transferred, and that any record with invalid JSON should simply be ignored.
I don't know why it gives the error 'invalid input syntax for type json'; I have checked all the JSON values and they are valid.
After debugging further, I found that this error is treated as a TABLE error, but why? That is why the table got suspended, since TableErrorPolicy is 'SUSPEND_TABLE'.
Why is this error considered a table error instead of a record error?
Is the JSONB column not supported by DMS, and is that why we are getting the error below?
Logs:
2020-09-01T12:10:04 I: Next table to load 'public'.'TEMP_TABLE' ID = 1, order = 0 (tasktablesmanager.c:1817)
2020-09-01T12:10:04 I: Start loading table 'public'.'TEMP_TABLE' (Id = 1) by subtask 1.
Start load timestamp 0005AE3F66381F0F (replicationtask_util.c:755)
2020-09-01T12:10:04 I: REPLICA IDENTITY information for table 'public'.'TEMP_TABLE': Query status='Success' Type='DEFAULT'
Description='Old values of the Primary Key columns (if any) will be captured.' (postgres_endpoint_unload.c:191)
2020-09-01T12:10:04 I: Unload finished for table 'public'.'TEMP_TABLE' (Id = 1). 449 rows sent. (streamcomponent.c:3485)
2020-09-01T12:10:04 I: Table 'public'.'TEMP_TABLE' contains LOB columns, change working mode to default mode (odbc_endpoint_imp.c:4775)
2020-09-01T12:10:04 I: Table 'public'.'TEMP_TABLE' has Non-Optimized Full LOB Support (odbc_endpoint_imp.c:4788)
2020-09-01T12:10:04 I: Load finished for table 'public'.'TEMP_TABLE' (Id = 1). 449 rows received. 0 rows skipped.
Volume transferred 190376. (streamcomponent.c:3770)
2020-09-01T12:10:04 E: RetCode: SQL_ERROR SqlState: 22P02 NativeError: 1 Message: ERROR: invalid input syntax for type json;
Error while executing the query (ar_odbc_stmt.c:2648)
2020-09-01T12:10:04 W: Table 'public'.'TEMP_TABLE' (subtask 1 thread 1) is suspended (replicationtask.c:2471)
Edit: after debugging further, this error is treated as a TABLE error, but why?
The JSONB column data type must be nullable in the target DB.
Note: in my case, after making the JSONB column nullable, this error disappeared.
As mentioned in the AWS documentation:
In this case, AWS DMS treats JSONB data as if it were a LOB column. During the full load phase of a migration, the target column must be nullable.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html#CHAP_Source.PostgreSQL.Prerequisites
https://aws.amazon.com/premiumsupport/knowledge-center/dms-error-null-value-column/
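A minimal sketch of that prerequisite, assuming a hypothetical jsonb column named payload on the target table public.temp_table from the logs:

-- Run against the TARGET database before the full load; the column name is
-- hypothetical. Re-add the constraint afterwards if it is required.
ALTER TABLE public.temp_table
    ALTER COLUMN payload DROP NOT NULL;

-- Optionally restore it once the migration has finished:
-- ALTER TABLE public.temp_table
--     ALTER COLUMN payload SET NOT NULL;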
AWS DMS treats the JSON data type in PostgreSQL as a LOB data type column. This means that the LOB size limitation when you use limited LOB mode applies to JSON data. For example, suppose that limited LOB mode is set to 4,096 KB. In this case, any JSON data larger than 4,096 KB is truncated at the 4,096 KB limit and fails the validation test in PostgreSQL.
Reference: AWS DMS - JSON data types being truncated
Update: You can tweak the error handling task settings to skip erroneous rows by setting DataErrorPolicy to IGNORE_RECORD, which determines the action AWS DMS takes when there is an error related to data processing at the record level.
Some examples of data processing errors include conversion errors, errors in transformation, and bad data. The default is LOG_ERROR. With IGNORE_RECORD, the task continues and the data for that record is ignored.
Reference: AWS DMS - Error handling task settings
You mentioned that you're migrating from PostgreSQL to PostgreSQL. Is there a specific reason to use AWS DMS?
AWS Docs: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html#CHAP_Source.PostgreSQL.Homogeneous
When you migrate from a database engine other than PostgreSQL to a PostgreSQL database, AWS DMS is almost always the best migration tool to use. But when you are migrating from a PostgreSQL database to a PostgreSQL database, PostgreSQL tools can be more effective.
...
We recommend that you use PostgreSQL database migration tools such as pg_dump under the following conditions:
You have a homogeneous migration, where you are migrating from a source PostgreSQL database to a target PostgreSQL database.
You are migrating an entire database.
The native tools allow you to migrate your data with minimal downtime.

copy csv postgres ignore rows that violate constraints

I have a .csv file with ~300,000 rows, some of which violate certain constraints I set in my postgres database. Is there a way to copy my .csv file into the database and have postgres filter out the rows that violate the constraints? I do not want these rows to show up in the database.
If this is not possible, is there any other way to solve this problem?
What I'm doing right now is:
COPY blocksequences FROM '/tmp/blocksequences.csv' CSV HEADER;
And I get:
ERROR: new row for relation "blocksequences" violates check constraint "blocksequences_partid3_check"
DETAIL: Failing row contains (M001-M049-S186, M001, null, M049, S186).
CONTEXT: COPY blocksequences, line 680: "M001-M049-S186,M001,,M049,S186"
The reason for the error: the column that contains M049 is not allowed to have that string in it. Many other rows have violations like this.
I read a little about "exception when check violation -- do nothing"; am I on the right track here? It seems like maybe that's only a MySQL thing.
Usually this is done this way (a sketch follows the list):
create a temporary table with the same structure as the destination one but without constraints,
copy the data into the temporary table with the COPY command,
copy the rows that do fulfill the constraints from the temp table into the destination one, using an INSERT command with conditions in the WHERE clause based on the table constraint,
drop the temporary table.
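A rough sketch of those steps for the blocksequences table from the question; the column name partid3 is inferred from the constraint name, and the WHERE condition is a placeholder for whatever the real check expression requires:

-- CHECK constraints are not copied by LIKE unless INCLUDING CONSTRAINTS is used,
-- so the staging table accepts every row in the file.
CREATE TEMP TABLE blocksequences_staging
    (LIKE blocksequences INCLUDING DEFAULTS);

COPY blocksequences_staging
FROM '/tmp/blocksequences.csv' WITH (FORMAT csv, HEADER);

-- Keep only the rows that satisfy the constraint; the condition below is a
-- placeholder for the actual blocksequences_partid3_check expression.
INSERT INTO blocksequences
SELECT *
FROM blocksequences_staging
WHERE partid3 IS NULL OR partid3 NOT LIKE 'M%';

DROP TABLE blocksequences_staging;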
When dealing with really large CSV files or very limited server resources, use the file_fdw extension instead of temporary tables. It's a much more efficient way, but it requires server access to the CSV file (while copying to a temporary table can be done over the network).
In Postgres 12 you can use the WHERE clause in COPY FROM.
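For example, a hedged sketch of the Postgres 12+ form, again with a placeholder condition standing in for the real check expression:

-- Rows that fail the WHERE condition are skipped during COPY itself; any row
-- that still violates a constraint aborts the whole COPY, so the condition
-- must mirror the constraint closely enough to filter them all out.
COPY blocksequences
FROM '/tmp/blocksequences.csv'
WITH (FORMAT csv, HEADER)
WHERE partid3 IS NULL OR partid3 NOT LIKE 'M%';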