AWS Glue DynamicFrame tries to write empty string as null - amazon-redshift

I have an AWS Glue job moving data from an RDS table to Redshift.
Both tables have the same schema:
-- RDS
CREATE TABLE my_table (
  id varchar(256) not null primary key,
  col1 varchar(256) not null
)
-- Redshift
CREATE TABLE my_table (
  id varchar(256) not null,
  col1 varchar(256) not null
) sortkey(id)
I crawled both schemas and wrote a trivial job that writes a DynamicFrame from the RDS source to the Redshift sink.
val datasource = glueContext.getCatalogSource(
  database = "my_rds",
  tableName = "my_table",
  redshiftTmpDir = ""
).getDynamicFrame()

glueContext.getCatalogSink(
  database = "my_redshift",
  tableName = "my_table",
  redshiftTmpDir = "s3://some-bucket/some-path"
).writeDynamicFrame(datasource)
But the job fails for rows where col1 is an empty string:
java.sql.SQLException:
Error (code 1213) while loading data into Redshift: "Missing data for not-null field"
Table name: my_table
Column name: col1
Column type: varchar(254)
Raw line: 3027616797,#NULL#
Raw field value: #NULL#
When I debug this in a Glue Spark shell, I can verify that the value is an empty string "".
scala> datasource.toDF().filter("id = '3027616797'").select("col1").collect().head.getString(0)
res23: String = ""
How can I tell glue to distinguish between empty strings "" and NULLs?

It looks like this is a problem in the Databricks data source for Redshift (docs), which AWS Glue apparently uses internally. There are open tickets about exactly this problem, but they have not been touched for over a year:
https://github.com/databricks/spark-redshift/issues/331
https://github.com/databricks/spark-redshift/issues/49
I have tried writing through that data source directly, but the result is exactly the same:
datasource
  .toDF()
  .write
  .format("com.databricks.spark.redshift")
  .option("url", "<RS_JDBC_URL>?user=<USER>&password=<PASSWORD>")
  .option("dbtable", "my_table")
  .option("tempdir", "s3://S_PATH")
  .option("forward_spark_s3_credentials", "true")
  .save()
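One possible workaround (my own sketch, not part of the answer above): substitute a placeholder for empty strings before the write, so the NOT NULL load succeeds. It is shown in PySpark for brevity; the same transformation exists in the Scala DataFrame API, and the single-space sentinel and column name are assumptions you should adapt.

# Hedged workaround sketch: replace empty strings in col1 with a single space
# before writing, since the connector collapses "" into its null marker.
from pyspark.sql import functions as F

df = datasource.toDF()  # assumes an equivalent DynamicFrame read in a Python Glue job
df_fixed = df.withColumn(
    "col1",
    F.when(F.col("col1") == "", F.lit(" ")).otherwise(F.col("col1"))
)
# df_fixed can then be written through the Glue sink or the spark-redshift writer.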

Related

Mapping Dataflows time column to Sql time column

My SQL Server database has time columns:
[start_time] time NULL,
[end_time] time NULL,
but Data Flows doesn't have a matching type for this.
The only way I can think of doing this is a post-SQL script (if you fully recreate the table):
alter table dbo.[testTable] alter column [start_time] time(0)
alter table dbo.[testTable] alter column [end_time] time(0)
I tried using a timestamp, but again it's not a matching datatype:
toTimestamp(substring(start_date,12,9),'HH:mm:ss')
so this doesn't work.
Any help on understanding this would be great.
** Updating with screenshots
So this issue is for Parquet or CSV sources going into SQL DB tables.
If you have a column that looks like a datetime, you need to keep it as a string, since there is no toDateTime function, only toTimestamp. Neither a string nor a timestamp can be converted to the datetime datatype in the SQL DB sink; you will end up with nulls in your column.
Sample before using an expression to change start_date to yyyy-MM-ddTHH:mm:ss:
You can simply map the DATETIME column to the target TIME column in the sink activity.
Make sure the option "Allow schema drift" in the sink activity is unchecked.
My test schema:
-- source table
DROP TABLE IF EXISTS [dbo].[tempSourceTable]
CREATE TABLE [dbo].[tempSourceTable](
[id] int IDENTITY(1,1) NOT NULL,
[key] nvarchar(max) NULL,
[start_date] datetime NULL
)
INSERT INTO [dbo].[tempSourceTable] ([key], [start_date]) VALUES ('key1', '2021-10-14 12:34:56')
SELECT * FROM [dbo].[tempSourceTable]
-- target table
DROP TABLE IF EXISTS [dbo].[tempTargetTable]
CREATE TABLE [dbo].[tempTargetTable](
[id] int IDENTITY(1,1) NOT NULL,
[key] nvarchar(max) NULL,
[start_time] time NULL
)
Result after executing the data flow in a pipeline:
Here is my testing CSV input:
start_date,end_date,start_date_time,end_date_time,start_time,end_time
09/01/2020,09/01/2020,09/01/2020 11:01,09/01/2020 11:01,11:01:46,11:01:52
09/01/2020,,09/01/2020 11:01,,11:01:47,
09/01/2020,09/01/2020,09/01/2020 11:01,09/01/2020 11:50,11:01:49,11:50:41
09/01/2020,09/01/2020,09/01/2020 11:01,09/01/2020 11:01,11:01:51,11:01:55
09/01/2020,09/01/2020,09/01/2020 11:01,09/01/2020 11:01,11:01:52,11:01:56
You may specify the date/time/datetime format for CSV source data:
You can see the correct parsing result in data preview:
After that, a simple sink activity should achieve what the OP wants to do:
The sink table schema I used for testing:
CREATE TABLE [dbo].[tempTargetTable](
[start_date] date NULL,
[end_date] date NULL,
[start_date_time] datetime NULL,
[end_date_time] datetime NULL,
[start_time] time NULL,
[end_time] time NULL
)
Result in DB:

leading zero insert/update issue in python and postgresql

I'm trying to insert empid into my Postgres table; the issue is that the leading zero is missing in all cases. How can I retain it?
My Python code:
sql = "insert into emp (empid,name,sal) values (%s,%s,%s)"
data = (05599,xyz,10000)
cur.execute(sql, data)
PostgreSQL column datatype:
empid --> character varying(5)
In the table I'm seeing only 5599 instead of 05599. How can I retain the leading zero while inserting/updating data into the table?
Thanks
Not sure how you got that code to work:
create table emp (empid varchar ,name varchar, sal numeric);
import psycopg2
con = psycopg2.connect("dbname=test host=localhost user=aklaver")
cur = con.cursor()
sql = "insert into emp (empid,name,sal) values (%s,%s,%s)"
data = (05599,xyz,10000)
^
SyntaxError: invalid token
# This works
data = ('05599', 'xyz' ,10000)
cur.execute(sql, data)
con.commit()
select * from emp;
empid | name | sal
-------+------+-------
05599 | xyz | 10000
So, as @a_horse_with_no_name suggested, pass it in as a string.
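If the ids arrive as integers rather than strings, a small sketch of the zero-padding step; the five-character width is an assumption matching the varchar(5) column above:

# Zero-pad the integer id before binding it as a string parameter.
empid = 5599
data = (str(empid).zfill(5), 'xyz', 10000)   # -> ('05599', 'xyz', 10000)
cur.execute(sql, data)
con.commit()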

Psycopg2: how to insert and update on conflict using psycopg2 with python?

I am using psycopg2 to run an INSERT command against a Postgres database, and when there is a conflict I just want to update the other column values.
Here is the query:
insert_sql = '''
INSERT INTO tablename (col1, col2, col3,col4)
VALUES (%s, %s, %s, %s) (val1,val2,val3,val4)
ON CONFLICT (col1)
DO UPDATE SET
(col2, col3, col4)
= (val2, val3, val4) ; '''
cur.excecute(insert_sql)
I want to find out where I am going wrong. I am using the variables val1, val2, val3, not actual values.
To quote from psycopg2's documentation:
Warning Never, never, NEVER use Python string concatenation (+) or string parameters interpolation (%) to pass variables to a SQL query string. Not even at gunpoint.
Now, for an upsert operation you can do this:
insert_sql = '''
INSERT INTO tablename (col1, col2, col3, col4)
VALUES (%s, %s, %s, %s)
ON CONFLICT (col1) DO UPDATE SET
(col2, col3, col4) = (EXCLUDED.col2, EXCLUDED.col3, EXCLUDED.col4);
'''
cur.execute(insert_sql, (val1, val2, val3, val4))
Notice that the parameters for the query are being passed as a tuple to the execute statement (this ensures psycopg2 will take care of adapting them to SQL while shielding you from injection attacks).
The EXCLUDED bit allows you to reuse the values without the need to specify them twice in the data parameter.
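For completeness, a minimal end-to-end sketch of that pattern, including the connect and commit steps the snippet above leaves out; the connection string, table name, and values are placeholders:

import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")  # placeholder DSN
cur = conn.cursor()
insert_sql = """
    INSERT INTO tablename (col1, col2, col3, col4)
    VALUES (%s, %s, %s, %s)
    ON CONFLICT (col1) DO UPDATE SET
        (col2, col3, col4) = (EXCLUDED.col2, EXCLUDED.col3, EXCLUDED.col4);
"""
cur.execute(insert_sql, ("key-1", "a", "b", "c"))  # placeholder values
conn.commit()
cur.close()
conn.close()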
Using:
INSERT INTO members (member_id, customer_id, subscribed, customer_member_id, phone, cust_atts) VALUES (%s, %s, %s, %s, %s, %s) ON CONFLICT (customer_member_id) DO UPDATE SET (phone) = (EXCLUDED.phone);
I received the following error:
psycopg2.errors.FeatureNotSupported: source for a multiple-column UPDATE item must be a sub-SELECT or ROW() expression
LINE 1: ...ICT (customer_member_id) DO UPDATE SET (phone) = (EXCLUDED.p...
Changing to:
INSERT INTO members (member_id, customer_id, subscribed, customer_member_id, phone, cust_atts) VALUES (%s, %s, %s, %s, %s, %s) ON CONFLICT (customer_member_id) DO UPDATE SET (phone) = ROW(EXCLUDED.phone);
Solved the issue.
Try:
INSERT INTO tablename (col1, col2, col3,col4)
VALUES (val1,val2,val3,val4)
ON CONFLICT (col1)
DO UPDATE SET
(col2, col3, col4)
= (val2, val3, val4);
I haven't seen anyone comment on this, but you can utilize psycopg2.extras.execute_values to insert/update many rows of data at once, which I think is the intended solution to many inserts/updates.
There are a few tutorials on YouTube that illustrate this, one being "How to connect to PSQL Database using psycopg2 + Python".
In the video they load a dataframe using pandas and insert the data from a CSV source into multiple schemas/tables. The code snippet example in that video is
from psycopg2.extras import execute_values
sql_insert = """
INSERT INTO {state}.weather_county(fips_code, county_name, temperature)
VALUES %s
ON CONFLICT (fips_code) DO UPDATE
SET
temperature=excluded.temperature,
updated_at=NOW()
;
"""
grouped = new_weather_data.groupby(by='state')  # new_weather_data is a dataframe
conn = create_rw_conn(secrets=secrets)
for state, df in grouped:
    # select only the necessary columns
    df = df[['fips_code', 'county_name', 'temperature']]
    print("[{}] upsert...".format(state))
    # convert dataframe into list of tuples for `execute_values`
    data = [tuple(x) for x in df.values.tolist()]
    cur = conn.cursor()
    execute_values(cur, sql_insert.format(state=state), data)
    conn.commit()  # <- We MUST commit to reflect the inserted data
    print("[{}] changes were committed...".format(state))
    cur.close()
The Jupyter Notebook is psycopg2-python-tutorial/new-schemas-tables-insert.ipynb
Here's a function that takes a DataFrame (df), the schema name of the table, the table name, the column you want to use as the conflict column, and an engine created by SQLAlchemy's create_engine. It updates the table with respect to the conflict column. This is an extended version of the solution by @Ionut Ticus.
Don't use pandas.to_sql() together with this; pandas.to_sql() destroys the primary key setting. In that case you need to set the primary key with an ALTER query, as the function below will suggest. The primary key is not necessarily destroyed by pandas; it might simply never have been set. The error in that case would be:
there is no unique constraint matching given keys for referenced table
and the function will suggest that you execute:
engine.execute('ALTER TABLE {schemaname}.{tablename} ADD PRIMARY KEY ({conflictcolumn});')
Function:
def update_query(df, schemaname, tablename, conflictcolumn, engine):
    """
    This function takes a dataframe as df, the schema name as schemaname, the name of the table to append/insert into as tablename,
    the column whose conflict causes the other elements of the row to be updated as conflictcolumn,
    and the database engine as engine.
    Example engine: engine_portfolio_pg = create_engine('postgresql://pythonuser:vmqJRZ#dPW24d#145.239.121.143/cetrm_portfolio')
    Example schemaname, tablename: weatherofcities.sanfrancisco , schemaname = weatherofcities, tablename = sanfrancisco.
    """
    columns = df.columns.tolist()
    deleteprimary = columns.copy()
    deleteprimary.remove(conflictcolumn)
    excluded = ""
    replacestring = '%s,' * len(df.columns.tolist())
    replacestring = replacestring[:-1]
    for column in deleteprimary:
        excluded += "EXCLUDED.{}".format(column) + ","
    excluded = excluded[:-1]
    columns = ','.join(columns)
    deleteprimary = ','.join(deleteprimary)
    insert_sql = """ INSERT INTO {schemaname}.{tablename} ({allcolumns})
    VALUES ({replacestring})
    ON CONFLICT ({conflictcolumn}) DO UPDATE SET
    ({deleteprimary}) = ({excluded})""".format(tablename=tablename, schemaname=schemaname, allcolumns=columns, replacestring=replacestring,
                                               conflictcolumn=conflictcolumn, deleteprimary=deleteprimary, excluded=excluded)
    conn = engine.raw_connection()
    conn.autocommit = True
    #conn = engine.connect()
    cursor = conn.cursor()
    i = 0
    print("------------------------" * 5)
    print("If the error below happens:")
    print("there is no unique constraint matching given keys for referenced table?")
    print("The primary key is not set; you can execute:")
    print("engine.execute('ALTER TABLE {}.{} ADD PRIMARY KEY ({});')".format(schemaname, tablename, conflictcolumn))
    print("------------------------" * 5)
    for index, row in df.iterrows():
        cursor.execute(insert_sql, tuple(row.values))
        conn.commit()
        if i == 0:
            print("Order of columns in the executed SQL query for each row")
            columns = df.columns.tolist()
            print(insert_sql % tuple(columns))
            print("----")
            print("Example of the executed SQL query for a row")
            print(insert_sql % tuple(row.values))
            print("---")
        i += 1
    conn.close()

Can we set String column as partitionColumn?

The table only has a String column, EMPLOYEE_ID, as its primary column. How do I partition on it?
val destination = spark.read.options(options).jdbc(options("url"), options("dbtable"), "EMPLOYEE_ID", "P00100001", "P00100005000000", 10, new java.util.Properties()).rdd.map(_.mkString(","))
Is there any other way to read the JDBC table and process it?
It is not possible; only integer columns can be used here. If your database supports some variant of rowid that is an integer or can be cast to an integer, you can extract it in a query (pseudocode):
(SELECT CAST(rowid AS INTEGER), * FROM TABLE) AS tmp
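A PySpark sketch of that idea (the Scala API takes the same options); the rowid expression, table name, URL, and bounds are illustrative assumptions to adapt to your database:

# Read via JDBC, partitioning on an integer column derived in the pushed-down subquery.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:postgresql://host:5432/db"  # placeholder URL
query = "(SELECT CAST(rowid AS INTEGER) AS row_id, t.* FROM my_table t) AS tmp"

df = (spark.read
      .format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", query)
      .option("partitionColumn", "row_id")   # the derived integer column
      .option("lowerBound", "1")             # placeholder bounds
      .option("upperBound", "1000000")
      .option("numPartitions", "10")
      .load())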

How to write JSON column type to Postgres with PySpark?

I have a PostgreSQL table that has a column with data type JSONB.
How do I insert a DataFrame into the PostgreSQL table via JDBC?
If I have a UDF to convert the body column to the JSONB PostgreSQL data type, what is the corresponding pyspark.sql.types type I should use?
PostgreSQL table with a JSONB column:
CREATE TABLE dummy (
id bigint,
body JSONB
);
Thanks!
It turned out that if I set "stringtype": "unspecified" in the JDBC properties, Postgres will cast automatically:
properties = {
    "user": "***",
    "password": "***",
    "stringtype": "unspecified"
}
df.write.jdbc(url=url, table="dummy", properties=properties)
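For reference, a minimal end-to-end sketch of that approach; the URL, credentials, and sample payload are placeholders:

# Write a plain string column into a JSONB column, letting Postgres do the cast.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

url = "jdbc:postgresql://localhost:5432/mydb"  # placeholder
properties = {
    "user": "***",
    "password": "***",
    "stringtype": "unspecified",  # lets Postgres cast the string into JSONB
}

# body stays an ordinary string column on the Spark side
df = spark.createDataFrame(
    [(1, json.dumps({"a": 1, "b": "x"}))],
    ["id", "body"],
)
df.write.jdbc(url=url, table="dummy", mode="append", properties=properties)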