Debugging "String length exceeds DDL Length" error AWS Glue - pyspark

I'm writing a dynamic frame to Redshift as a table and I'm getting the following error:
An error occurred while calling o3225.pyWriteDynamicFrame. Error (code 1204) while loading data into Redshift: "String length exceeds DDL length"
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("transactionId", "string", "transaction_id", "string"), ("basicChannelGroupingPath", "string", "channel_grouping", "string")], transformation_ctx = "applymapping1")
datasink2 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = applymapping1,catalog_connection = "redshift_test", connection_options ={"preactions":"truncate table dw.table;","dbtable": "dw.table", "database": "test","postactions":post_query},redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink2")

The error is not raised by AWS Glue itself; it is passed through from Amazon Redshift. Redshift load error code 1204 means:
Input data exceeded the acceptable range for the data type.
In other words, at least one string you are trying to write is longer, in bytes, than the declared length of its target column in the Redshift table. To find it, query the system table STL_LOAD_ERRORS and look at the raw_field_value column, which holds the pre-parsing value of the field that caused the failure. Once you know which values are involved, you can add pre-processing for such cases (if needed) to resolve the issue.
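As a minimal sketch of that pre-processing step, assuming the offending column is a VARCHAR whose declared length is counted in bytes and that truncating the value is acceptable for your data. The helper name `truncate_utf8` and the 256-byte limit are illustrative assumptions, not part of Glue or Redshift; the query string only reads columns of STL_LOAD_ERRORS.

```python
# Query to inspect recent load errors (run it in any Redshift SQL client).
LOAD_ERRORS_SQL = """
SELECT starttime, colname, col_length, raw_field_value, err_reason
FROM stl_load_errors
WHERE err_code = 1204
ORDER BY starttime DESC
LIMIT 20;
"""


def truncate_utf8(value, max_bytes):
    """Cut a string so that its UTF-8 encoding fits in max_bytes.

    A multi-byte character is never split: an incomplete trailing
    character is dropped instead.
    """
    if value is None:
        return None
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    # errors="ignore" discards an incomplete character at the cut point.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    # Hypothetical limit: the target column is assumed to be VARCHAR(256).
    print(truncate_utf8("x" * 300, 256) == "x" * 256)
```

In a Glue job, a function like this could be applied to the affected column before the write, for example through a Spark UDF. Whether to truncate, reject, or reroute such rows depends on your data.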

Related

Clickhouse client syntax error kafka integration

I'm using ClickHouse client version 18.16.1 and following this blog post: https://altinity.com/blog/2020/5/21/clickhouse-kafka-engine-tutorial
When creating a table I'm using this syntax:
CREATE TABLE readings (
readings_id Int32 Codec(DoubleDelta, LZ4),
time DateTime Codec(DoubleDelta, LZ4),
date ALIAS toDate(time),
temperature Decimal(5,2) Codec(T64, LZ4)
) Engine = MergeTree
PARTITION BY toYYYYMM(time)
ORDER BY (readings_id, time);
and I'm getting an error that says
"""
Code: 62, e.displayText() = DB::Exception: Syntax error: failed at position 76 (line 2, col 23): Codec(DoubleDelta, LZ4),
time DateTime Codec(DoubleDelta, LZ4),
date ALIAS toDate(time),
temperature Decimal(5,2) Codec(T64, LZ4)
)
ENGINE = MergeTr. Expected one of: token, ClosingRoundBracket, Comma, DEFAULT, MATERIALIZED, ALIAS, COMMENT, e.what() = DB::Exception
"""
Please let me know what I'm doing wrong. Thanks.

How to make an existing column NOT NULL in AWS REDSHIFT?

I dynamically created a table through a Glue job and it is working fine. But as per a new requirement, I need to add a new column that generates unique values and should be the primary key in Redshift.
I implemented this using the row_number() function and it works fine. But the latest requirement is that this particular column should be the primary key.
When I try to do that, Redshift requires the column to be NOT NULL. Do you know how to make the column NOT NULL dynamically through the Glue job, or is there a Redshift query that makes it NOT NULL?
I tried everything I could think of without luck.
w = Window().orderBy(lit('A'))
df = timestampedDf.withColumn("row_num", row_number().over(w))
rowNumDf = DynamicFrame.fromDF(df, glueContext, "df1")
postStep = "begin; ALTER TABLE TAB_CUSTOMER_DATA ALTER COLUMN row_num INTEGER NOT NULL; ALTER TABLE TAB_CUSTOMER_DATA ADD CONSTRAINT PK_1 PRIMARY KEY (row_num); end;"
## @type: DataSink
## @args: [catalog_connection = "REDSHIFT_CONNECTION", connection_options = {"dbtable": "tab_customer_data", "database": "randomdb"}, redshift_tmp_dir = TempDir, transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = rowNumDf]
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = rowNumDf, catalog_connection = "REDSHIFT_CONNECTION", connection_options = {"dbtable": "TAB_CUSTOMER_DATA", "database": "randomdb", "postactions": postStep}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
job.commit()
I solved this using the approach from the link below:
Add a new column with a default value and NOT NULL.
Copy the old column's values into the new column.
Drop the old column.
Make the new column the primary key.
https://ubiq.co/database-blog/how-to-remove-not-null-constraint-in-redshift/
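The four steps above can be sketched as a helper that builds the "postactions" string for the Glue write. This is only an illustration under assumptions: the function name, the new column name `row_id`, the INTEGER type, and the default of 0 are hypothetical, and the begin/end wrapper simply mirrors the postStep string from the question. Check the generated SQL against your own cluster before relying on it.

```python
def build_not_null_pk_sql(table, old_col, new_col, col_type="INTEGER",
                          default="0", constraint="PK_1"):
    """Return the four steps as one semicolon-separated SQL string."""
    statements = [
        "begin",
        # 1. Add a new column with a default value and NOT NULL.
        f"ALTER TABLE {table} ADD COLUMN {new_col} {col_type} "
        f"DEFAULT {default} NOT NULL",
        # 2. Copy the old column's values into the new column.
        f"UPDATE {table} SET {new_col} = {old_col}",
        # 3. Drop the old column.
        f"ALTER TABLE {table} DROP COLUMN {old_col}",
        # 4. Make the new column the primary key.
        f"ALTER TABLE {table} ADD CONSTRAINT {constraint} "
        f"PRIMARY KEY ({new_col})",
        "end",
    ]
    return "; ".join(statements) + ";"


if __name__ == "__main__":
    # "row_id" is a hypothetical name for the replacement column.
    post_step = build_not_null_pk_sql("TAB_CUSTOMER_DATA", "row_num", "row_id")
    print(post_step)
```

The resulting string would be passed as the "postactions" value in connection_options, in place of the ALTER COLUMN statement that Redshift rejected.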

Why is ADD COLUMN on a Kafka table not supported in ClickHouse?

I have a problem adding a column to the Kafka queue in ClickHouse.
I've created a table with the command
CREATE TABLE my_db.my_queue ON CLUSTER my_cluster
(
`ts` String,
.... some other columns
)
ENGINE = Kafka()
SETTINGS
kafka_broker_list = '172.21.0.3:9092',
kafka_topic_list = 'my_topic',
kafka_group_name = 'my_group',
kafka_format = 'JSONEachRow',
kafka_row_delimiter = '\n',
kafka_num_consumers = 1,
kafka_skip_broken_messages = 10;
Then I try to add a column:
ALTER TABLE my_db.my_queue ON CLUSTER my_cluster ADD COLUMN new_column String;
But I get an error:
SQL Error [48]: ClickHouse exception, code: 48, host: 172.21.0.4, port: 8123; Code: 48,
e.displayText() = DB::Exception: There was an error on [clickhouse-server:9000]: Code: 48,
e.displayText() = DB::Exception: Alter of type 'ADD COLUMN' is not supported by storage Kafka
(version 20.11.4.13 (official build)) (version 20.11.4.13 (official build))
I am not familiar with ClickHouse or any other analytical database.
So I am wondering why this is not supported. Or should I add the column in another way?
One way to support messages with different schemas from a Kafka queue is to store the raw JSON messages, like this:
CREATE TABLE my_db.my_queue ON CLUSTER my_cluster
(
`message` String
)
ENGINE = Kafka()
SETTINGS
kafka_broker_list = '172.21.0.3:9092',
kafka_topic_list = 'my_topic',
kafka_group_name = 'my_group',
kafka_format = 'JSONAsString',
kafka_row_delimiter = '\n',
kafka_num_consumers = 1,
kafka_skip_broken_messages = 10;
The JSONAsString format will store the raw JSON in the message column. This way from the Kafka table you can post-process each new row through materialized views and JSON functions.
For instance:
CREATE TABLE my_db.post_processed_data (
`ts` String,
`another_column` String
)
-- use a proper engine
Engine=Log;
CREATE MATERIALIZED VIEW my_db.my_queue_mv TO my_db.post_processed_data
AS
SELECT
JSONExtractString(message, 'ts') AS ts,
JSONExtractString(message, 'another_column') AS another_column
FROM my_db.my_queue;
If the JSON schema of the Kafka queue changes, you can react by running ALTER TABLE .. ADD COLUMN .. on the post_processed_data table and updating the materialized view to match. The Kafka table itself stays as it is.
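A rough sketch of how the statements for such a schema change could be generated, assuming every extracted field is a String and that the materialized view is dropped and re-created rather than modified in place. The function name is hypothetical; the table and view names are the ones used above.

```python
def schema_change_statements(new_columns,
                             all_columns,
                             target="my_db.post_processed_data",
                             view="my_db.my_queue_mv",
                             source="my_db.my_queue"):
    """Build the statements for adding String columns to the target
    table and re-creating the materialized view to fill them."""
    # Drop the view first so nothing is written while the table changes.
    statements = [f"DROP TABLE IF EXISTS {view}"]
    statements += [
        f"ALTER TABLE {target} ADD COLUMN `{col}` String"
        for col in new_columns
    ]
    # One JSONExtractString call per column, as in the view above.
    select_list = ",\n".join(
        f"    JSONExtractString(message, '{col}') AS {col}"
        for col in all_columns
    )
    statements.append(
        f"CREATE MATERIALIZED VIEW {view} TO {target}\nAS\nSELECT\n"
        f"{select_list}\nFROM {source}"
    )
    return statements


if __name__ == "__main__":
    for stmt in schema_change_statements(
            ["new_column"], ["ts", "another_column", "new_column"]):
        print(stmt + ";\n")
```

The Kafka table never appears in an ALTER here; only the target table and the view change.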
The Kafka engine does not support it.
Just drop the table and re-create it with the new schema.
ALTER is not supported because the author of the Kafka engine did not need it.

How to link a Python pandas DataFrame to the mysql.connector '%s' value

I am trying to pipe a web-scraped pandas DataFrame into a MySQL table with mysql.connector, but I can't seem to link the df values to the %s placeholders. The connection is good (I can add individual rows), but it just returns errors when I replace the values with %s.
cnx = mysql.connector.connect(host = 'ip', user = 'user', passwd = 'pass', database = 'db')
cursor = cnx.cursor()
insert_df = ("""INSERT INTO table"
"(page_1, date_1, record_1, task_1)"
"VALUES ('%s','%s','%s','%s')""")
cursor.executemany(insert_df, df)
cnx.commit()
cnx.close()
This returns "ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
If I add any additional operations, it returns "ProgrammingError: Parameters for query must be an Iterable."
I am very new to this, so any help is appreciated.
The workaround for me was to redo my whole process. I used SQLAlchemy; its documentation makes this very easy. Message me if you want the code I used.
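For anyone who wants to stay with mysql.connector, here is a hedged sketch of one way it could work. cursor.executemany() expects a sequence of parameter tuples rather than a DataFrame, which is the likely source of both errors, and the %s placeholders must not be wrapped in quotes. The table name `my_table` and both function names are assumptions for illustration; `cnx` is an already open connection and `df` a DataFrame with four columns in the same order as the INSERT.

```python
INSERT_SQL = (
    "INSERT INTO my_table (page_1, date_1, record_1, task_1) "
    "VALUES (%s, %s, %s, %s)"  # placeholders are NOT quoted
)


def dataframe_to_rows(df):
    """Convert a DataFrame into a list of plain tuples, one per row.

    NaN values become None so that they are stored as SQL NULL.
    """
    rows = []
    for record in df.itertuples(index=False, name=None):
        rows.append(tuple(
            None if isinstance(v, float) and v != v else v  # NaN != NaN
            for v in record
        ))
    return rows


def insert_dataframe(cnx, df):
    """Insert every row of df through an open mysql.connector connection."""
    cursor = cnx.cursor()
    cursor.executemany(INSERT_SQL, dataframe_to_rows(df))
    cnx.commit()
    cursor.close()
```

With the connection from the question this would be called as insert_dataframe(cnx, df), followed by cnx.close().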

How to get data out of a postgres bytea column into a python variable using sqlalchemy?

I am working with the script below.
If I change the script so I avoid the bytea datatype, I can easily copy data from my postgres table into a python variable.
But if the data is in a bytea postgres column, I encounter a strange object called memory which confuses me.
Here is the script which I run against anaconda python 3.5.2:
# bytea.py
import sqlalchemy
# I should create a conn
db_s = 'postgres://dan:dan@127.0.0.1/dan'
conn = sqlalchemy.create_engine(db_s).connect()
sql_s = "drop table if exists dropme"
conn.execute(sql_s)
sql_s = "create table dropme(c1 bytea)"
conn.execute(sql_s)
sql_s = "insert into dropme(c1)values( cast('hello' AS bytea) );"
conn.execute(sql_s)
sql_s = "select c1 from dropme limit 1"
result = conn.execute(sql_s)
print(result)
# <sqlalchemy.engine.result.ResultProxy object at 0x7fcbccdade80>
for row in result:
    print(row['c1'])
# <memory at 0x7f4c125a6c48>
How do I get the data that is inside <memory at 0x7f4c125a6c48>?
You can convert it with Python's bytes():
for row in result:
    print(bytes(row['c1']))
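A small self-contained illustration of the conversion, assuming the driver hands the bytea value back as a memoryview (which is what psycopg2 does on Python 3). The helper name `bytea_to_bytes` is hypothetical, and the memoryview is built by hand here to stand in for row['c1'].

```python
def bytea_to_bytes(value):
    """Return a bytes copy of a bytea value (memoryview, bytes, or None)."""
    if value is None:
        return None  # SQL NULL stays None
    return bytes(value)


if __name__ == "__main__":
    # Stand-in for row['c1']; a real query would supply this memoryview.
    column_value = memoryview(b"hello")
    raw = bytea_to_bytes(column_value)
    print(raw)                  # b'hello'
    print(raw.decode("utf-8"))  # hello
```

Decoding is only appropriate when the column really holds text; for binary data, keep the bytes object as it is.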