I have a Glue task that reads data from S3, runs a couple of SQL queries on the data, and outputs the data to Redshift. I am having an odd problem where, when writing the dynamic_frame to Redshift (using glueContext.write_dynamic_frame.from_options), new columns are being created: some of my existing columns reappear with the type appended to the end of the name. For example, if my frame schema is as follows:
id: string
value: short
value2: long
ts: timestamp
In Redshift I am seeing:
id: varchar(256)
value: smallint <---- The data here is always null
value2: bigint <---- The data here is always null
ts: timestamp
value_short: smallint
value2_long: bigint
The value_short and value2_long columns are being created at the time of execution (currently testing with creds that have ALTER TABLE permissions).
Looking at the COPY command that was run, I can see the value_short and value2_long columns in the command, but those columns are not present in the dynamic frame that is written with glueContext.write_dynamic_frame.from_options.
Casting the types explicitly, as aloissiola suggested, solved this problem for me. Specifically, I used the dynamicFrame.resolveChoice function:
changetypes = select1.resolveChoice(
    specs=[
        ("value", "cast:int"),
        ("value2", "cast:int"),
    ]
)
It looks like you can cast to short and long types as well (see https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-types.html). I went through and specified types for all my columns.
The trick is to cast the short values to integer; long -> bigint seems to work for me.
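For reference, here is a minimal sketch of how the cast fits into a job like the one described. The frame name select1 comes from the snippet above; the Redshift connection options are placeholder assumptions, not the asker's actual configuration:

changetypes = select1.resolveChoice(
    specs=[
        ("value", "cast:int"),    # short column: cast to int (the trick above)
        ("value2", "cast:long"),  # long -> bigint works as-is
    ]
)

glueContext.write_dynamic_frame.from_options(
    frame=changetypes,
    connection_type="redshift",
    connection_options={
        "url": "jdbc:redshift://host:5439/dev",   # placeholder
        "user": "example_user",                   # placeholder
        "password": "example_password",           # placeholder
        "dbtable": "public.my_table",             # placeholder
        "redshiftTmpDir": "s3://my-bucket/tmp/",  # placeholder
    },
)

With the choice columns resolved up front, the generated COPY command no longer carries the value_short/value2_long variants.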
Good day!
I need to export/import data from SQL Server 2019 to AWS RDS running PostgreSQL 13.3
It's just a few hundred rows from a handful of tables.
This is my first ever encounter with Postgres, so I decided to simply script the data as "INSERT ... SELECT", as I would with SQL Server... I've also looked into AWS Glue and RDS S3 Import; it all seems waaay too much for what I need.
I am using DBeaver v21 for all of this, as I have easy access to both source and destination DBs.
This I tested with success:
CREATE TABLE public.invoices (
    invoiceno int8 NOT NULL GENERATED BY DEFAULT AS IDENTITY,
    terminalid int4 NOT NULL,
    invoicedate timestamp NOT NULL,
    description varchar(100) NOT NULL
);
INSERT INTO public.invoices(InvoiceNo,TerminalID,InvoiceDate,Description)
SELECT 7 as invoiceno , 5 as terminalid , '2018-10-24 21:29:00' as invoicedate , N'Coffe and cookie' as description
-- Updated Rows 1
-- No problem here
I scripted the rest of the data with UNION ALL, like so (shortened example):
INSERT INTO public.invoices(InvoiceNo,TerminalID,InvoiceDate,Description)
SELECT 7 as invoiceno , 5 as terminalid , '2018-10-24 21:29:00' as invoicedate , N'Coffe and cookie' as description
UNION ALL
SELECT 1000, 5 , '2018-10-24 21:29:00' , N'Tea and crumpets'
and now I get:
SQL Error [42804]: ERROR: column "invoicedate" is of type timestamp without time zone but expression is of type text
Hint: You will need to rewrite or cast the expression.
Position: 118
I do see in the message that it can be "fixed" with a CAST (or a rewrite!)...
but how come Postgres can convert one row implicitly, yet two rows are impossible?
Why does this fail when more than one row is being inserted? It clearly knows how to convert text -> timestamp...
I tried using VALUES, a CTE, and derived tables, with no success.
As I will have to spend more time with Postgres, I really would like to understand what's going on here. Is my syntax wrong (it works fine in SQL Server)? Is DBeaver messing up something with my data? Etc.
Any suggestions would be appreciated.
Thank you
'2018-10-24 21:29:00' is a string value, and Postgres is a bit more picky about correct data types than SQL Server.
You need to specify the value as a proper timestamp constant:
timestamp '2018-10-24 21:29:00'
Note that you can write that in a bit more compact form using a values clause:
INSERT INTO public.invoices(InvoiceNo,TerminalID,InvoiceDate,Description)
values
(7, 5, timestamp '2018-10-24 21:29:00', 'Coffe and cookie'),
(1000, 5 , timestamp '2018-10-24 21:29:00' , 'Tea and crumpets');
The reason for this behaviour is the order of compilation.
When you use a VIEW, the queries inside the view are compiled first, and the column types (and names too) are taken from the first part of the view, i.e. the first SELECT command. The same applies to a UNION: the types of the combined columns are settled before the INSERT target is considered.
So you end up with text instead of timestamp, and it doesn't match the type of the column in the target table.
The MSSQL compiler is a little bit smarter :-).
In the first example you have a simple INSERT INTO ... SELECT ...,
and the compiler immediately expects the timestamp type, so it raises no compilation error (though an error can still occur at execution time if the data does not pass the rules of automatic conversion).
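To illustrate the point (a sketch, using the invoices table from the question, run here via psycopg2 only for convenience; connection details are placeholders): giving the first branch of the UNION a proper timestamp is enough, because the untyped literals in the other branches are then resolved against it.

import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder connection string
with conn, conn.cursor() as cur:
    # Only the first branch is typed; the bare literal in the second
    # branch is coerced to timestamp during UNION type resolution.
    cur.execute("""
        INSERT INTO public.invoices (invoiceno, terminalid, invoicedate, description)
        SELECT 7, 5, timestamp '2018-10-24 21:29:00', 'Coffe and cookie'
        UNION ALL
        SELECT 1000, 5, '2018-10-24 21:29:00', 'Tea and crumpets'
    """)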
Postgres (v11.3, 64-bit, Windows) truncates trailing zeros for timestamps. So if I insert the timestamp '2019-06-12 12:37:07.880' into the table and read it back as text, Postgres returns '2019-06-12 12:37:07.88'.
Table date_test:
CREATE TABLE public.date_test (
    id SERIAL,
    "timestamp" TIMESTAMP WITHOUT TIME ZONE NOT NULL,
    CONSTRAINT pkey_date_test PRIMARY KEY (id)
);
SQL command when inserting data:
INSERT INTO date_test (timestamp) VALUES( '2019-06-12 12:37:07.880' )
SQL command to retrieve data:
SELECT dt.timestamp::TEXT FROM date_test dt
returns '2019-06-12 12:37:07.88'
Do you consider this a bug or a feature?
My real issue is: I'm running queries from a C++ program and I have to convert the data returned from the database to appropriate data types. Since the protocol is text-based, everything I read from the database is plain text. When parsing timestamps, I first tokenize the string and then convert each token to an integer. And because the millisecond part is truncated, the last token is "88" instead of "880", and converting "88" to an integer yields a different value than converting "880".
That's the default display format when using a cast to text.
If you want to see all three digits, use to_char()
SELECT to_char(dt.timestamp, 'yyyy-mm-dd hh24:mi:ss.ms')
FROM date_test dt;
will return 2019-06-12 12:37:07.880
It’s a matter of presentation only.
First note that 07.88 seconds and 07.880 seconds are the same amount of time (as are 7.88 and 07.880000000, for that matter).
PostgreSQL internally represents a timestamp in a way that we shouldn’t be concerned about as long as it’s an unambiguous representation. When you retrieve the timestamp, it is formatted into some string. This is where PostgreSQL apparently chooses not to print redundant trailing zeros. So it’s probably not even correct to say that it truncates anything. It just refrains from generating that 0.
I think the nicest solution would be to modify your C++ parser to accept any number of decimals and parse them correctly, with and without trailing zeros. Another solution that should work is given in the answer by a_horse_with_no_name.
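For example, a parser along these lines handles the fractional part by value rather than by digit count (a Python sketch of the idea; the C++ version would be analogous):

from datetime import datetime

def parse_pg_timestamp(text: str) -> datetime:
    # %f accepts 1 to 6 fractional digits and right-pads with zeros,
    # so '07.88' and '07.880' both parse to 880000 microseconds.
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in text else "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(text, fmt)

print(parse_pg_timestamp("2019-06-12 12:37:07.88"))   # 12:37:07.880000
print(parse_pg_timestamp("2019-06-12 12:37:07.880"))  # same value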
I have a table with a column defined as follows:
PrincipalBalance DECIMAL(13,2) NOT NULL
I have a stored procedure that returns that value as follows:
SELECT D.PrincipalBalance
FROM MyTable AS D
WHERE ID = @myKey;
If I inspect the metadata for the table, the column definition is correct (it shows as DECIMAL(13,2)). It shows with the correct type in Object Explorer as well. However, IntelliSense for the column in the stored procedure shows that it is an INT, which is thoroughly puzzling.
The only workaround I have found for this is to use a CAST/CONVERT in the stored procedure, which seems like it should be unnecessary.
Note: earlier today I experienced the same problem with a decimal column that was being returned as a VARCHAR(50) (or some other length). It's interesting that in both cases the value of the decimal column was zero.
What is going on here? Why is SQL Server selecting the wrong type?
I am trying to insert a null value into a Postgres timestamp column using Python psycopg2.
The problem is that the other data types, such as char or int, take None, whereas the timestamp variable does not recognize None.
I tried to insert NULL and null as strings, because I am using a dictionary to append the values for the insert statement.
Below is the code.
queryDictOrdered[column] = queryDictOrdered[column] if isNull(queryDictOrdered[column]) is False else NULL
and the function is
def isNull(key):
    if str(key).lower() in ('null', 'n.a', 'none'):
        return True
    else:
        return False
I get the below error messages:
DataError: invalid input syntax for type timestamp: "NULL"
DataError: invalid input syntax for type timestamp: "None"
Empty timestamps in pandas DataFrames come through as NaT ("not a time"), which is not compatible with a Postgres NULL. A quick workaround is to send the column as varchar and then run one of these two queries (depending on what you hard-coded empty values as):
update <<schema.table_name>> set <<column_name>> = NULL where <<column_name>> = 'NULL';
or
update <<schema.table_name>> set <<column_name>> = NULL where <<column_name>> = 'NaT';
Finally run:
alter table <<schema.table_name>>
alter COLUMN <<column_name>> TYPE timestamp USING <<column_name>>::timestamp without time zone;
Surely you are adding quotes around the placeholder. Read the psycopg documentation about passing parameters to queries.
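In other words (a minimal sketch; the table and column names here are made up): pass None as a bound parameter and let the driver send a real NULL.

import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder connection string
with conn, conn.cursor() as cur:
    # The %s placeholder must not be wrapped in quotes; psycopg2 adapts
    # Python None to SQL NULL, including for timestamp columns.
    cur.execute(
        "INSERT INTO events (id, created_at) VALUES (%s, %s)",
        (1, None),
    )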
Dropping this here in case it's helpful for anyone.
Using psycopg2 and the cursor object's copy_from method, you can copy missing or NaT datetime values from a pandas DataFrame to a Postgres timestamp field.
The copy_from method has a null parameter that is a "textual representation of NULL in the file. The default is the two characters string \N". See this link for more information.
Using pandas' fillna method, you can replace any missing datetime values with \N via data["my_datetime_field"].fillna("\\N"). Notice the double backslash here, where the first backslash is necessary to escape the second backslash.
Using the select_columns method from the pyjanitor module (or .loc[] and some subsetting with the column names of your DataFrame), you can coerce multiple columns at once via something akin to this, where all of your datetime fields end with an _at suffix.
data_datetime_fields = (
    data
    .select_columns("*_at")
    .apply(lambda x: x.fillna("\\N"))
)
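Put together, the flow might look like this (a sketch with assumed table and column names; note that \N is already copy_from's default null marker):

import io
import pandas as pd
import psycopg2

df = pd.DataFrame({"created_at": [pd.Timestamp("2021-01-01 12:00"), pd.NaT]})

# Cast to object first so fillna can replace NaT with the \N marker string.
df["created_at"] = df["created_at"].astype(object).fillna("\\N")

buf = io.StringIO()
df.to_csv(buf, sep="\t", header=False, index=False)
buf.seek(0)

conn = psycopg2.connect("dbname=test")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.copy_from(buf, "events", null="\\N", columns=("created_at",))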
I get the following Postgres error:
ERROR: value too long for type character varying(1024)
The offending statement is:
INSERT INTO integer_array_sensor_data (sensor_id, "time", status, "value")
VALUES (113, 86651204, 0, '{7302225, 7302161, 7302593, 730211,
... <total of 500 values>... 7301799, 7301896}');
The table:
CREATE TABLE integer_array_sensor_data (
    id SERIAL PRIMARY KEY,
    sensor_id INTEGER NOT NULL,
    "time" INTEGER NOT NULL,
    status INTEGER NOT NULL,
    "value" INTEGER[] NOT NULL,
    server_time TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT NOW()
);
Researching the PostgreSQL documentation turns up nothing about a limitation on array size.
Any idea how to fix this?
The problem doesn't come from the array itself, but from the varchar string declaring the array values in your INSERT. Some drivers type string literals as varchar(1024), causing that issue.
Instead of
'{1,2,3}'
try using
ARRAY[1,2,3]
Otherwise, you can try declaring the type of your string as TEXT (unlimited):
'{1,2,3}'::text
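For comparison, drivers that bind parameters instead of building string literals sidestep the issue entirely. A psycopg2 sketch against the question's table (connection details are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder connection string
with conn, conn.cursor() as cur:
    # A Python list is adapted to an integer[] parameter, so no
    # varchar-typed string literal is involved.
    cur.execute(
        'INSERT INTO integer_array_sensor_data '
        '(sensor_id, "time", status, "value") VALUES (%s, %s, %s, %s)',
        (113, 86651204, 0, list(range(500))),
    )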
I am starting to understand my question, although I have not found a solution yet. The problem is that there is a limitation on the string somewhere at the library level: I am actually using pqxx, and you can't have strings longer than 1024 chars. I have accepted the answer of Guillaume F. because he figured this out, but the casting doesn't work. I will edit my reply once I find a solution so people know what to do.
I just tried prepared statements, and they have the same limitation.
The workaround is to use COPY or its pqxx binding, pqxx::tablewriter.