AWS Glue failed to insert UUID into Postgres DB

I created a table in an Aurora Postgres DB with one UUID column, id, and created an AWS Glue Studio job with the following code:
import uuid

schema = ['id']
rdd = [[str(uuid.uuid4())]]
dyf = glueContext.create_dynamic_frame_from_rdd(rdd, 'dyf', schema=schema)
glueContext.write_from_options(frame_or_dfc=dyf, connection_type='postgresql', connection_options={...})
An error was reported:
2023-01-05 20:27:35,757 INFO [task-result-getter-0]
scheduler.TaskSetManager (Logging.scala:logInfo(57)): Lost task 35.1
in stage 0.0 (TID 36) on 10.248.10.50, executor 1:
java.sql.BatchUpdateException (Batch entry 0 INSERT INTO "data"."t"
("id") VALUES ('6f2ac9cd-c6a9-4798-bc9b-59c8a3d37ca1') was aborted:
ERROR: column "id" is of type uuid but expression is of type character
varying Hint: You will need to rewrite or cast the expression.
I tried rdd = [[uuid.uuid4()]] (passing a uuid.UUID object instead of a string), but it seems Spark doesn't support uuid:
RecursionError: maximum recursion depth exceeded while calling a Python object
The weird thing is that I can run the generated SQL statement successfully from a SQL client:
INSERT INTO "data"."t" ("id") VALUES ('6f2ac9cd-c6a9-4798-bc9b-59c8a3d37ca1')
Is it a bug in AWS Glue?

The problem was solved by adding an item to the connection_options:
connection_options = {
...
'stringtype': 'unspecified',
}
But it still doesn't support inserting NULL values into UUID columns. A workaround would be inserting 00000000-0000-0000-0000-000000000000 instead.
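For reference, stringtype=unspecified is a property of the PostgreSQL JDBC driver rather than anything Glue-specific, so the same fix should apply when writing through plain Spark JDBC. Below is a minimal sketch; the host, database, credentials, and SparkSession setup are assumptions, not taken from the original job.

import java.util.UUID
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("uuid-insert").getOrCreate()
import spark.implicits._

// One-column DataFrame holding the UUID as a plain string
val df = Seq(UUID.randomUUID().toString).toDF("id")

df.write
  .format("jdbc")
  // stringtype=unspecified lets the server cast the string parameter to uuid
  .option("url", "jdbc:postgresql://myhost:5432/mydb?stringtype=unspecified") // assumed host/db
  .option("dbtable", "data.t")
  .option("user", "myuser")          // assumed credentials
  .option("password", "mypassword")
  .mode("append")
  .save()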

Related

Unexpected type: BINARY

I am trying to read Parquet files via the Flink Table API, and it throws an error when I select one of the timestamp columns.
I create the table with this SQL:
CREATE TABLE MyDummyTable (
`id` INT,
ts BIGINT,
ts_ltz AS TO_TIMESTAMP_LTZ(ts, 3),
ts2 TIMESTAMP,
ts3 TIMESTAMP,
ts4 TIMESTAMP,
ts5 TIMESTAMP
)
It throws an error when I select any of ts2, ts3, ts4, or ts5.
The error stack is:
Caused by: java.lang.IllegalArgumentException: Unexpected type: BINARY
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:77)
at org.apache.flink.formats.parquet.vector.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:369)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createWritableVectors(ParquetVectorizedInputFormat.java:264)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createReaderBatch(ParquetVectorizedInputFormat.java:254)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createPoolOfBatches(ParquetVectorizedInputFormat.java:244)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createReader(ParquetVectorizedInputFormat.java:137)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createReader(ParquetVectorizedInputFormat.java:73)
at org.apache.flink.connector.file.src.impl.FileSourceSplitReader.checkSplitOrStartNext(FileSourceSplitReader.java:112)
at org.apache.flink.connector.file.src.impl.FileSourceSplitReader.fetch(FileSourceSplitReader.java:65)
at org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:56)
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:138)
... 7 more
My current approach
I am currently using the approach below to get around the problem, but it does not seem like a proper solution since I have to create two columns in the table.
CREATE TABLE MyDummyTable (
`id` INT,
ts2 STRING,
ts2_ts AS TO_TIMESTAMP(ts2)
)
Flink Version: 1.13.2
Scala Version: 2.11.12
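For completeness, the workaround table could be wired up through the Table API roughly as follows; the connector options in the WITH clause are assumptions, since the original DDL omits them.

import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

// Batch table environment (Flink 1.13 style)
val settings = EnvironmentSettings.newInstance().inBatchMode().build()
val tEnv = TableEnvironment.create(settings)

// Read the raw value as STRING and derive the timestamp as a computed column
tEnv.executeSql(
  """CREATE TABLE MyDummyTable (
    |  `id` INT,
    |  ts2 STRING,
    |  ts2_ts AS TO_TIMESTAMP(ts2)
    |) WITH (
    |  'connector' = 'filesystem',
    |  'path' = 'file:///path/to/parquet',  -- assumed location
    |  'format' = 'parquet'
    |)""".stripMargin)

tEnv.executeSql("SELECT id, ts2_ts FROM MyDummyTable").print()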

Updating JSONB object using another table

I am trying to update a JSONB field in one table with data from another table. For example,
update ms
set data = data || '{"COMMERCIAL": 3.4, "PCT" : medi_percent}'
from mix
where mix.id = ms.data_id
and data_id = 6000
and set_id = 20
This is giving me the following error -
Invalid input syntax for type json
DETAIL: Token "medi_percent" is invalid.
When I change medi_percent to a number, I don't get this error.
{"COMMERCIAL": 3.4, "PCT" : medi_percent} is not a valid JSON text. Notice there is no string interpolation happening here. You might be looking for
json_build_object('COMMERCIAL', 3.4, 'PCT', medi_percent)
instead, where medi_percent is now an expression (that will presumably refer to a column on your mix table).
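Putting that together, the full statement would look roughly like the sketch below, wrapped in a plain JDBC call to match the Scala examples elsewhere on this page. The connection details are placeholders, and the jsonb variant (jsonb_build_object) is used so the result concatenates with the jsonb column.

import java.sql.DriverManager

// Placeholder connection; requires the PostgreSQL JDBC driver on the classpath
val conn = DriverManager.getConnection("jdbc:postgresql://myhost:5432/mydb", "myuser", "mypassword")
try {
  // Assumes medi_percent lives on mix, and data_id/set_id live on ms, as the question implies
  val sql =
    """update ms
      |set data = data || jsonb_build_object('COMMERCIAL', 3.4, 'PCT', mix.medi_percent)
      |from mix
      |where mix.id = ms.data_id
      |  and ms.data_id = 6000
      |  and ms.set_id = 20""".stripMargin
  val updated = conn.createStatement().executeUpdate(sql)
  println(s"rows updated: $updated")
} finally {
  conn.close()
}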

How to resolve the error "org.apache.spark.SparkException: Requested partitioning does not match the tablename table" in spark-shell

While writing data into a Hive partitioned table, I am getting the error below.
org.apache.spark.SparkException: Requested partitioning does not match the tablename table:
I have converted my RDD to a DataFrame using a case class and am trying to write the data into an existing Hive partitioned table. But I am getting this error, and as per the printed logs, "Requested partitions:" comes back blank while the partition columns appear as expected in the Hive table.
spark-shell error:
scala> data1.write.format("hive").partitionBy("category", "state").mode("append").saveAsTable("sampleb.sparkhive6")
org.apache.spark.SparkException: Requested partitioning does not match the sparkhive6 table:
Requested partitions:
Table partitions: category,state
Hive table format:
hive> describe formatted sparkhive6;
OK
col_name data_type comment
txnno int
txndate string
custno int
amount double
product string
city string
spendby string
Partition Information
col_name data_type comment
category string
state string
Try the insertInto() function instead of saveAsTable().
scala> data1.write.format("hive")
.partitionBy("category", "state")
.mode("append")
.insertInto("sampleb.sparkhive6")
(or)
Register a temp view on top of the DataFrame, then write a SQL statement to insert the data into the Hive table.
scala> data1.createOrReplaceTempView("temp_vw")
scala> spark.sql("insert into sampleb.sparkhive6 partition(category,state) select txnno,txndate,custno,amount,product,city,spendby,category,state from temp_vw")

Getting `Mispartitioned tuple in single-partition insert statement` while trying to insert data into a partitioned table using `TABLE_NAME.insert`

I am creating a VoltDB table with the following CREATE statement:
CREATE TABLE EMPLOYEE (
ID VARCHAR(4) NOT NULL,
CODE VARCHAR(4) NOT NULL,
FIRST_NAME VARCHAR(30) NOT NULL,
LAST_NAME VARCHAR(30) NOT NULL,
PRIMARY KEY (ID, CODE)
);
And partitioning the table with
PARTITION TABLE EMPLOYEE ON COLUMN ID;
I have written a Spark job to insert data into VoltDB. I am using the Scala code below to insert the records, and it works well if we do not partition the table.
import org.voltdb._;
import org.voltdb.client._;
import scala.collection.JavaConverters._
val voltClient:Client = ClientFactory.createClient();
voltClient.createConnection("IP:PORT");
val empDf = spark.read.format("csv")
.option("inferSchema", "true")
.option("header", "true")
.option("sep", ",")
.load("/FileStore/tables/employee.csv")
// Code to convert scala seq to java varargs
def callProcedure(procName: String, parameters: Any*): ClientResponse =
voltClient.callProcedure(procName, paramsToJavaObjects(parameters: _*): _*)
def paramsToJavaObjects(params: Any*) = params.map { param ⇒
val value = param match {
case None ⇒ null
case Some(v) ⇒ v
case _ ⇒ param
}
value.asInstanceOf[AnyRef]
}
empDf.collect().foreach { row =>
callProcedure("EMPLOYEE.insert", row.toSeq:_*);
}
But I get the error below if I partition the table:
Mispartitioned tuple in single-partition insert statement.
Constraint Type PARTITIONING, Table CatalogId EMPLOYEE
Relevant Tuples:
ID CODE FIRST_NAME LAST_NAME
--- ----- ----------- ----------
1 CD01 Naresh "Joshi"
at org.voltdb.client.ClientImpl.internalSyncCallProcedure(ClientImpl.java:485)
at org.voltdb.client.ClientImpl.callProcedureWithClientTimeout(ClientImpl.java:324)
at org.voltdb.client.ClientImpl.callProcedure(ClientImpl.java:260)
at line4c569b049a9d4e51a3e8fda7cbb043de32.$read$$iw$$iw$$iw$$iw$$iw$$iw.callProcedure(command-3986740264398828:9)
at line4c569b049a9d4e51a3e8fda7cbb043de40.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(command-3986740264399793:8)
at line4c569b049a9d4e51a3e8fda7cbb043de40.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(command-3986740264399793:7)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
I found a link (https://forum.voltdb.com/forum/voltdb-discussions/building-voltdb-applications/1182-mispartitioned-tuple-in-single-partition-insert-statement) regarding the problem and tried to partition the procedure using the statements below:
PARTITION PROCEDURE EMPLOYEE.insert ON TABLE EMPLOYEE COLUMN ID;
AND
PARTITION PROCEDURE EMPLOYEE.insert ON TABLE EMPLOYEE COLUMN ID [PARAMETER 0];
But I am getting the error [Ad Hoc DDL Input]: VoltDB DDL Error: "Partition references an undefined procedure "EMPLOYEE.insert"" while executing these statements.
However, I am able to insert the data by using the @AdHoc stored procedure, but I am not able to figure out the problem or a solution for the scenario above, where I am using the EMPLOYEE.insert stored procedure to insert data into a partitioned table.
The procedure "EMPLOYEE.insert" is what is referred to as a "default" procedure, which is automatically generated by VoltDB when you create the table EMPLOYEE. It is already automatically partitioned based on the partitioning of the table, therefore you cannot call "PARTITION PROCEDURE EMPLOYEE.insert ..." to override this.
I think what is happening is that the procedure is partitioned by the ID column which in the EMPLOYEE table is a VARCHAR. The input parameter therefore should be a String. However, I think your code is somehow reading the CSV file and passing in the first column as an int value.
The java client callProcedure(String procedureName, Object... params) method accepts varargs for the parameters. This can be any Object[]. There is a check along the way somewhere on the server where the # of arguments must match the # expected by the procedure, or else the procedure call is returned as rejected, and it would never have been executed. However, I think in your case the # of arguments is ok, so it then tries to execute the procedure. It hashes the 1st parameter value corresponding to ID and then determines which partition this should go to. The invocation is routed to that partition for execution. When it executes, it tries to insert the values, but there is another check that the partition key value is correct for this partition, and this is failing.
I think if the value is passed in as an int, it is hashed to the wrong partition. Then in that partition it tries to insert the value into the column, which is a VARCHAR so it probably implicitly converts the int to a String, but it's not in the correct partition, so the insert fails with this error "Mispartitioned tuple in single-partition insert statement." which is the same error you would get if you wrote a java stored procedure and configured the wrong column as the partition key.
Disclosure: I work at VoltDB.
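Based on that explanation, one way to test the theory is to keep (or explicitly cast) every CSV column as a String so the partition key hashes as a VARCHAR on the client side. A rough sketch against the code from the question:

// Read without schema inference so every column stays a String,
// matching the VARCHAR partition column on the server
val empDf = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "false")
  .option("sep", ",")
  .load("/FileStore/tables/employee.csv")

empDf.collect().foreach { row =>
  // Convert each value to String before calling the default insert procedure
  val params = row.toSeq.map(v => if (v == null) null else v.toString)
  callProcedure("EMPLOYEE.insert", params: _*)
}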

Creating and Querying a Volatile table using Teradata JDBC in Spark using Scala

I am getting the following error with Spark 1.5.1 when querying a volatile table in Teradata:
Exception in thread "main" java.sql.SQLException: [Teradata Database] [TeraJDBC 15.00.00.20] [Error 3707] [SQLState 42000] Syntax error, expected something like a name or a Unicode delimited identifier or '(' between the 'FROM' keyword and the 'CREATE' keyword.
This is the code I am running that generates the above error:
val url = "jdbc:teradata://FOO/, TMODE=TERA,TYPE=DEFAULT"
val properties = new java.util.Properties()
val driver = "com.teradata.jdbc.TeraDriver"
properties.setProperty("driver",driver)
properties.setProperty("username","USER")
properties.setProperty("password","PASSWORD")
var query =
f"""
CREATE VOLATILE MULTISET TABLE tmp AS
( SELECT * FROM database.table )
WITH DATA PRIMARY INDEX(CR_PLCY_ID) ON COMMIT PRESERVE ROWS;
COMMIT;
SELECT * FROM tmp;
"""
var df = sqlContext.read.jdbc(url,query,properties)
Side note: I have oversimplified the original query to test whether volatile tables work within a query over JDBC. The original query has multiple volatile tables.
Any help or suggestions would be greatly appreciated.
I don't know Scala or Spark, but looking at the Javadoc for the read.jdbc method (Spark Javadoc), it expects a table name as the second parameter.
So the Spark library is doing some sort of "select * from <table parameter value>", which explains the error message (as dnoeth had already said).
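In other words, read.jdbc needs something it can drop into a FROM clause, not a multi-statement script. If a single SELECT is enough, a common workaround is to pass a derived table (a subquery with an alias); note that volatile tables are session-scoped, so they generally would not survive across the separate connections Spark opens anyway. A hedged sketch:

// Pass a derived table so Spark's generated "SELECT * FROM <table>" remains valid SQL
val query = "(SELECT * FROM database.table) AS tmp"
val df = sqlContext.read.jdbc(url, query, properties)
df.show()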