Redshift STL_LOAD_ERRORS reporting col_length different than DDL - amazon-redshift

My COPY from S3 into Redshift is failing with "String length exceeds DDL length". Looking at STL_LOAD_ERRORS, four columns are showing this error. STL_LOAD_ERRORS reports col_length as 256, but the DDL clearly defines these columns as varchar(5000). The failing records have values for these columns in the range of 1000 to 4500 characters, so they should fit into the column per the DDL.
Can anyone suggest why the STL_LOAD_ERRORS would show a length of 256 when the DDL says 5000?
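One way to cross-check what the loader reports against the catalog is to query STL_LOAD_ERRORS next to pg_table_def; this is only a diagnostic sketch, and 'my_table' below is a placeholder for the target table.

-- Most recent load errors: column name, reported col_length, and the reason
select starttime, filename, line_number, colname, col_length, err_reason
from stl_load_errors
order by starttime desc
limit 20;

-- What the catalog actually holds for the target table
-- (the table's schema must be on the search_path for pg_table_def to show it)
select "column", type
from pg_table_def
where tablename = 'my_table';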

Related

How to store a column value in Redshift varchar column with length more than 65535

I tried to load the Redshift table but it failed on one column: "The length of the data column 'column_description' is longer than the length defined in the table. Table: 65535, Data: 86555."
I tried to increase the length of the column in the Redshift table, but it looks like 65535 is the maximum length Redshift supports.
Do we have any alternatives to store value in Redshift?
The answer is that Redshift doesn't support anything larger, and that one shouldn't store large artifacts in an analytic database. If you are using Redshift for its analytic power to find specific artifacts (images, files, etc.), then these should be stored in S3 and the object key (pointer) should be stored in Redshift.
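A minimal sketch of that pattern (the table and column names are made up): the large object lives in S3 and Redshift keeps only its key, with varchar(65535) as the ceiling for any remaining long text.

create table artifact_index (
    artifact_id   bigint identity(0,1),
    s3_object_key varchar(1024),   -- pointer, e.g. 's3://my-bucket/path/to/object'
    summary       varchar(65535)   -- 65535 is the largest varchar Redshift allows
);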

Is there a way to get the row and column where the INSERT-Statement failed in PostgreSQL?

I'm trying to insert around 15000 rows with 200 columns in one batch into a table in PostgreSQL (with TimescaleDB extension) via Python by using the psycopg2 library. Some of the values I want to insert might be larger than the table's column datatype allows. This results in:
ERROR: smallint out of range, SQL state: 22003
I haven't found a way to get more information about the location of the error to handle it.
In MySQL, the column and the row where the error occurred are reported back by default, and it is even possible to clip the values to the maximum of their datatype (which would also be fine). Is there a way to handle this in a similar manner in PostgreSQL?
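For reference, the error class itself is easy to reproduce in plain SQL (the table and column names below are made up); this is all the server reports back, regardless of which of the 15000 rows triggered it:

create table readings (sensor_id int, reading smallint);
insert into readings (sensor_id, reading) values (1, 40000);  -- ERROR: smallint out of range (SQLSTATE 22003)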

invalid input syntax for type json aws dms postgres

I'm running a task that migrates all data from a PostgreSQL 10.4 database to an RDS PostgreSQL 10.4 instance.
I'm not able to migrate tables that have a jsonb column.
After the error, the whole table gets suspended, even though the table contains only 449 rows.
I have set the following error policy, but the whole table is still suspended:
"DataErrorPolicy": "IGNORE_RECORD",
"DataTruncationErrorPolicy": "IGNORE_RECORD",
"DataErrorEscalationPolicy": "SUSPEND_TABLE",
"DataErrorEscalationCount": 1000,
My expectation is that the whole table should be transferred; it can ignore a record if any JSON value is bad.
I don't know why it's giving the error 'invalid input syntax for type json'; I have checked all the JSON values and they are valid.
After debugging more, this error is being treated as a TABLE error, but why? That's why the table got suspended, since TableErrorPolicy is 'SUSPEND_TABLE'.
Why is this error considered a table error instead of a record error?
Is the JSONB column not supported by DMS, and is that why we are getting the error below?
Logs:
2020-09-01T12:10:04 I: Next table to load 'public'.'TEMP_TABLE' ID = 1, order = 0 (tasktablesmanager.c:1817)
2020-09-01T12:10:04 I: Start loading table 'public'.'TEMP_TABLE' (Id = 1) by subtask 1. Start load timestamp 0005AE3F66381F0F (replicationtask_util.c:755)
2020-09-01T12:10:04 I: REPLICA IDENTITY information for table 'public'.'TEMP_TABLE': Query status='Success' Type='DEFAULT' Description='Old values of the Primary Key columns (if any) will be captured.' (postgres_endpoint_unload.c:191)
2020-09-01T12:10:04 I: Unload finished for table 'public'.'TEMP_TABLE' (Id = 1). 449 rows sent. (streamcomponent.c:3485)
2020-09-01T12:10:04 I: Table 'public'.'TEMP_TABLE' contains LOB columns, change working mode to default mode (odbc_endpoint_imp.c:4775)
2020-09-01T12:10:04 I: Table 'public'.'TEMP_TABLE' has Non-Optimized Full LOB Support (odbc_endpoint_imp.c:4788)
2020-09-01T12:10:04 I: Load finished for table 'public'.'TEMP_TABLE' (Id = 1). 449 rows received. 0 rows skipped. Volume transferred 190376. (streamcomponent.c:3770)
2020-09-01T12:10:04 E: RetCode: SQL_ERROR SqlState: 22P02 NativeError: 1 Message: ERROR: invalid input syntax for type json; Error while executing the query (ar_odbc_stmt.c:2648)
2020-09-01T12:10:04 W: Table 'public'.'TEMP_TABLE' (subtask 1 thread 1) is suspended (replicationtask.c:2471)
Edit: after debugging more, this error is being treated as a TABLE error, but why?
The JSONB column must be nullable in the target DB.
Note: in my case, after making the JSONB column nullable, this error disappeared.
As mentioned in the AWS documentation:
In this case, AWS DMS treats JSONB data as if it were a LOB column. During the full load phase of a migration, the target column must be nullable.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html#CHAP_Source.PostgreSQL.Prerequisites
https://aws.amazon.com/premiumsupport/knowledge-center/dms-error-null-value-column/
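If you hit the same issue, a sketch of the fix on the target database (the table and column names are placeholders for your own):

alter table public.temp_table alter column payload drop not null;  -- allow NULLs so DMS can load the JSONB/LOB column during full load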
AWS DMS treats the JSON data type in PostgreSQL as a LOB data type column. This means that the LOB size limitation when you use limited LOB mode applies to JSON data. For example, suppose that limited LOB mode is set to 4,096 KB. In this case, any JSON data larger than 4,096 KB is truncated at the 4,096 KB limit and fails the validation test in PostgreSQL.
Reference: AWS DMS - JSON data types being truncated
Update: You can tweak the error-handling task settings to skip erroneous rows by setting DataErrorPolicy to IGNORE_RECORD, which determines the action AWS DMS takes when there is an error related to data processing at the record level.
Some examples of data processing errors include conversion errors, errors in transformation, and bad data. The default is LOG_ERROR. With IGNORE_RECORD, the task continues and the data for that record is ignored.
Reference: AWS DMS - Error handling task settings
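For context, these policies sit in the ErrorBehavior section of the task settings JSON; the values below only illustrate where the keys live, not recommended settings:

"ErrorBehavior": {
    "DataErrorPolicy": "IGNORE_RECORD",
    "DataTruncationErrorPolicy": "IGNORE_RECORD",
    "DataErrorEscalationPolicy": "SUSPEND_TABLE",
    "DataErrorEscalationCount": 1000,
    "TableErrorPolicy": "SUSPEND_TABLE",
    "TableErrorEscalationPolicy": "STOP_TASK",
    "TableErrorEscalationCount": 0
}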
You mentioned that you're migrating from PostgreSQL to PostgreSQL. Is there a specific reason to use AWS DMS?
AWS Docs: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html#CHAP_Source.PostgreSQL.Homogeneous
When you migrate from a database engine other than PostgreSQL to a PostgreSQL database, AWS DMS is almost always the best migration tool to use. But when you are migrating from a PostgreSQL database to a PostgreSQL database, PostgreSQL tools can be more effective.
...
We recommend that you use PostgreSQL database migration tools such as pg_dump under the following conditions:
You have a homogeneous migration, where you are migrating from a source PostgreSQL database to a target PostgreSQL database.
You are migrating an entire database.
The native tools allow you to migrate your data with minimal downtime.

Redshift select * vs select single column

I'm having the following Redshift performance issue:
I have a table with ~2 billion rows, which has ~100 varchar columns and one int8 column (intCol). The table is relatively sparse, although there are columns that have values in every row.
The following query:
select colA from tableA where intCol = '111111';
returns approximately 30 rows and runs relatively quickly (~2 mins)
However, the query:
select * from tableA where intCol = '111111';
takes an undetermined amount of time (gave up after 60 mins).
I know pruning the columns in the projection is usually better but this application needs the full row.
Questions:
Is this just a fundamentally bad thing to do in Redshift?
If not, why is this particular query taking so long? Is it related to the structure of the table somehow? Is there some Redshift knob to tweak to make it faster? I haven't yet messed with the distkey and sortkey on the table, but it's not clear that those should matter in this case.
The main reason the first query is faster is that Redshift is a columnar database. A columnar database stores table data per column, writing data for the same column into the same blocks on storage. This behavior is different from a row-based database like MySQL or PostgreSQL. Because the first query selects only the colA column, Redshift does not need to access the other columns at all, while the second query accesses all ~100 columns, causing a huge amount of disk access.
To improve the performance of the second query, you may want to set a sort key on intCol, the column in the WHERE condition. By setting a sort key on a column, that column's data is stored in sorted order on disk, which reduces the cost of disk access when fetching records with a condition on that column.
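If you do add one, a minimal sketch of the DDL (assuming intCol really is the column you filter on):

-- Existing table: Redshift can change the sort key in place
alter table tableA alter sortkey (intCol);

-- Or declare it at creation time
create table tableA_sorted (
    intCol int8,
    colA   varchar(256)
    -- remaining varchar columns omitted for brevity
)
sortkey (intCol);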

SQL 2008 R2 Row size limit exceeded

I have a SQL Server 2008 R2 database. I created a table, and when trying to execute a SELECT statement (with an ORDER BY clause) against it, I receive the error "Cannot create a row of size 8870 which is greater than the allowable maximum row size of 8060."
I am able to select the data without the ORDER BY clause; however, the ORDER BY is important and I require it. I have tried the ROBUST PLAN option but still received the same error.
My table has 300+ columns with data type TEXT. I have tried using varchar and nvarchar, but have had no success.
Can someone please provide some insight?
Update:
Thanks for comments. I agree. 300+ columns in one table is not very good design. What I'm trying to do is bring excel tabs into the database as data tables. Some tabs have 300+ columns.
I first use a CREATE statement to create a table based on the excel tab so the columns vary. Then I do various SELECT, UPDATE, INSERT, etc statements on the table after the table is created with data.
The structure of the table usually follows this pattern:
fkVersionID, RowNumber(autonumber), Field1, Field2, Field3, etc...
Is there any way to get around the 8060-byte row size limit?
You mentioned that you tried nvarchar and varchar ... remember that nvarchar doubles the bytes used, but it is the only one of the two to support foreign characters in some cases, such as accent marks.
varchar is a good choice if you can limit its maximum size appropriately.
8000 characters is still a real limit, but if on average each varchar column is no more than 26 characters, you'll be okay.
You could go riskier and declare the columns as varchar(50) but on average only use 26 characters per column, meaning one value might be 36 characters long and the next 16 characters, and you are okay again (as long as you never exceed the average of 26 characters per column across the 300 columns).
Obviously, with a dynamic number of fields and the potential to far exceed that limit, the design is doomed by SQL Server's row-size specs.
Your only other alternative is to create multiple tables and, when you access the data, have a unique key to join the appropriate records on. So in your SELECT statement, use the join, and with multiple tables you can handle rows of 8000 + 8000 + ...
So it is doable, but you have to work with SQL rules.
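A sketch of that split (all table and column names are hypothetical): keep the halves one-to-one on (fkVersionID, RowNumber) and join them back when reading.

CREATE TABLE SheetPart1 (
    fkVersionID INT NOT NULL,
    RowNumber   INT NOT NULL,
    Field1      VARCHAR(50),
    Field2      VARCHAR(50),
    CONSTRAINT PK_SheetPart1 PRIMARY KEY (fkVersionID, RowNumber)
);

CREATE TABLE SheetPart2 (
    fkVersionID INT NOT NULL,
    RowNumber   INT NOT NULL,
    Field151    VARCHAR(50),
    Field152    VARCHAR(50),
    CONSTRAINT PK_SheetPart2 PRIMARY KEY (fkVersionID, RowNumber)
);

SELECT p1.Field1, p1.Field2, p2.Field151, p2.Field152
FROM SheetPart1 AS p1
JOIN SheetPart2 AS p2
  ON p2.fkVersionID = p1.fkVersionID
 AND p2.RowNumber   = p1.RowNumber
ORDER BY p1.fkVersionID, p1.RowNumber;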
I believe you're running into this limitation:
There is no limit to the number of items in the ORDER BY clause. However, there is a limit of 8,060 bytes for the row size of intermediate worktables needed for sort operations. This limits the total size of columns specified in an ORDER BY clause.
I had a legacy app like this; it was a nightmare.
First, I broke it into multiple tables, all one-to-one. This is bad, but less bad than what you've got.
Then I changed the queries to request only the columns that were actually needed. (I can't tell if you have that option.)