Using Kafka TimestampConverter with microseconds? - postgresql

I'm new to Kafka/Avro. My downstream database (Postgres) has a timestamptz column. My upstream database (Materialize) produces the following Avro schema for the column:
"type": [
"null",
{
"logicalType": "timestamp-micros",
"type": "long"
}
]
This seems consistent with timestamptz, which also stores microseconds. However, I'm getting the following:
org.postgresql.util.PSQLException: ERROR: column "time" is of type timestamp with time zone but expression is of type bigint
Hint: You will need to rewrite or cast the expression.
Position: 207
at io.confluent.connect.jdbc.sink.JdbcSinkTask.put(JdbcSinkTask.java:122)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:581)
... 10 more
It seems like the Kafka messages contain bigint/long and JdbcSinkConnector doesn't know how to convert it to a timestamp. I tried using Kafka's TimestampConverter, but it assumes the input timestamp is in milliseconds.
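The transform config I tried looked roughly like this (convertTime is just the name I gave the transform; time is the column from the error above):
"transforms": "convertTime",
"transforms.convertTime.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.convertTime.target.type": "Timestamp",
"transforms.convertTime.field": "time",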
It doesn't look like TimestampConverter supports microseconds directly (https://issues.apache.org/jira/browse/KAFKA-10561).
Is there a way to convert microseconds to milliseconds in Kafka connectors? A hack is fine, as long as I don't have to change the Avro schema. If there's a transform that divides by 1000 or drops the last 3 digits, that would work. Is there a way to do this without a custom transform or ksqlDB?

Related

How to transform all timestamp fields according to the Avro schema when using Kafka Connect?

In our database we have over 20 fields that we need to transform from long to timestamp. Why is there no generic solution to transform all these values?
I know I can define:
"transforms":"tsFormat",
"transforms.tsFormat.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.tsFormat.target.type": "string",
"transforms.tsFormat.field": "ts_col1",
"transforms.tsFormat.field": "ts_col2",
but this is not a solution for us. When we add a new timestamp column to the db, we need to update the connector too.
Is there a generic solution to transform all fields according to the Avro schema?
We are using Debezium, which creates something like this for all timestamp fields:
{
  "name": "PLATNOST_DO",
  "type": {
    "type": "long",
    "connect.version": 1,
    "connect.name": "io.debezium.time.Timestamp"
  }
},
So how can I find all fields with connect.name = 'io.debezium.time.Timestamp' and transform them to timestamps?
You'd need to write your own transform to be able to dynamically iterate over the record schema, check the types, and do the conversion.
That's why they are called simple message transforms.
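As a rough illustration, such a transform could look something like the sketch below (the class name and package are made up, and it only handles flat value structs, not Debezium's nested change-event envelope):

package com.example.smt; // hypothetical package

import java.util.Date;
import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.data.Timestamp;
import org.apache.kafka.connect.transforms.Transformation;

// Converts every field whose schema name is io.debezium.time.Timestamp
// (epoch milliseconds as a long) into a Connect Timestamp, whatever the field is called.
public class DebeziumTimestampFields<R extends ConnectRecord<R>> implements Transformation<R> {

    private static final String DEBEZIUM_TS = "io.debezium.time.Timestamp";

    @Override
    public R apply(R record) {
        if (record.valueSchema() == null || !(record.value() instanceof Struct)) {
            return record; // tombstones and schemaless records pass through untouched
        }
        Struct value = (Struct) record.value();
        Schema schema = record.valueSchema();

        // Rebuild the value schema, swapping Debezium timestamp longs for Connect Timestamps
        SchemaBuilder builder = SchemaBuilder.struct();
        if (schema.name() != null) {
            builder.name(schema.name());
        }
        for (Field field : schema.fields()) {
            if (DEBEZIUM_TS.equals(field.schema().name())) {
                builder.field(field.name(), field.schema().isOptional()
                        ? Timestamp.builder().optional().build()
                        : Timestamp.SCHEMA);
            } else {
                builder.field(field.name(), field.schema());
            }
        }
        Schema newSchema = builder.build();

        // Copy the values, converting epoch-millis longs to java.util.Date
        Struct newValue = new Struct(newSchema);
        for (Field field : schema.fields()) {
            Object v = value.get(field);
            if (v != null && DEBEZIUM_TS.equals(field.schema().name())) {
                newValue.put(field.name(), new Date((Long) v));
            } else {
                newValue.put(field.name(), v);
            }
        }

        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(), newSchema, newValue, record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }

    @Override
    public void close() {
    }
}

Once the jar is on the worker's plugin.path, the class can be referenced in transforms.*.type just like the built-in TimestampConverter, with no per-field configuration needed.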
Alternatively, maybe take a closer look at the Debezium properties to see if there is a missing setting that alters how timestamps get produced.

Unsupported type:TIMESTAMP(6) WITH LOCAL TIME ZONE NOT NULL

I am reading data from a PostgreSQL database into Flink, using the Table API and SQL. Apparently my timestamp column (called inserttime) is of type TIMESTAMP(6) WITH LOCAL TIME ZONE NOT NULL. But when I do an SQL query like SELECT inserttime FROM books, Flink complains that it is an unsupported type, with the error Unsupported type:TIMESTAMP(6) WITH LOCAL TIME ZONE NOT NULL.
Is this because of the NOT NULL at the end? If so, how should I cast it such that it can be read by Flink?
I've tried to use a UDF to convert it to something readable by Flink, like below:
public static class TimestampFunction extends ScalarFunction {
    public java.sql.Timestamp eval(java.sql.Timestamp t) {
        // do some conversion
    }
}
Obviously, the eval function signature is wrong, but I'm wondering if this could be a method to get Flink to read my timestamp type? If so, what should the eval function signature be?
I've also tried doing CAST(inserttime as TIMESTAMP) but the same Unsupported type error occurs.
Any ideas?

Postgresql jsonb vs datetime

I need to store two dates, valid_from and valid_to.
Is it better to use two datetime fields, like valid_from:datetime and valid_to:datetime?
Or would it be better to store the data in a jsonb field, validity: {"from": "2001-01-01", "to": "2001-02-02"}?
There are many more reads than writes to the database.
DB: PostgreSQL 9.4
You can use the daterange type.
For example:
'[2001-01-01, 2001-02-02]'::daterange means from 2001-01-01 to 2001-02-02, bounds inclusive
'(2001-01-01, 2001-02-05)'::daterange means from 2001-01-01 to 2001-02-05, bounds exclusive
Also:
Special values like infinity can be used.
lower(anyrange) => lower bound of the range
And many other things like the overlap operator, see the docs ;-)
Range Type
Use two timestamp columns (there is no datetime type in Postgres).
They can efficiently be indexed and they protect you from invalid timestamp values - nothing prevents you from storing "2019-02-31 28:99:00" in a JSON value.
If you very often need to use those two values to check whether another timestamp value lies in between, you could also consider a range type that stores both values in a single column.
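A quick sketch of both variants (table and column names are just examples):

-- two timestamp columns
CREATE TABLE subscription (
    id         serial PRIMARY KEY,
    valid_from timestamptz NOT NULL,
    valid_to   timestamptz NOT NULL,
    CHECK (valid_from <= valid_to)
);

-- a single range column instead
CREATE TABLE subscription_range (
    id       serial PRIMARY KEY,
    validity daterange NOT NULL
);

-- "is this date inside the validity window?"
SELECT *
FROM subscription_range
WHERE validity @> DATE '2001-01-15';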

Are there disadvantages on using as partition column a non-primitive column (date) in Hive?

Is there any reason why I shouldn't use a column formatted as date as the partitioning column in a table in Apache Hive?
The official documentation says:
Although currently there is no restriction on the data type of the partitioning column, allowing non-primitive columns to be partitioning column probably doesn't make sense. The dynamic partitioning column's type should be derived from the expression. The data type has to be able to be converted to a string in order to be saved as a directory name in HDFS.
https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions#DynamicPartitions-Designissues
I don't see why columns formatted as date would create any issue, since by design they can be converted to strings.
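For concreteness, the kind of table I have in mind looks something like this (names are made up):

CREATE TABLE sales (
    item  STRING,
    price DOUBLE
)
PARTITIONED BY (sale_date DATE)
STORED AS ORC;

-- each partition would end up as a directory such as
--   /user/hive/warehouse/sales/sale_date=2019-05-01

which, as far as I can tell, is exactly the string conversion the documentation mentions.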

Using a different Avro schema for new columns

I am using Flume + Kafka to sink log data to HDFS. My sink data type is Avro. The Avro schema (.avsc) has 80 fields as columns.
So I created an external table like this:
CREATE external TABLE pgar.tiz_biaws_fraud
PARTITIONED BY(partition_date INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/datapool/flume/biaws/fraud'
TBLPROPERTIES ('avro.schema.url'='hdfs://xxxx-ns/data/datapool/flume/biaws/fraud.avsc')
Now, I need to add 25 more columns to the Avro schema. In that case:
If I create a new table with a new schema that has 105 columns, I will have two tables for one project. And if I add or remove some columns in the coming days, I have to create yet another table. I am afraid of ending up with a lot of tables that use different schemas for the same project.
If I swap the old schema for the new schema in the current table, I will have only one table for the project, but I can't read the old data anymore because of the schema conflict.
What is the best way to use Avro schemas in a case like this?
This is indeed challenging. The best way is to make sure all schema changes you make are compatible with the old data - so only remove columns with defaults, and make sure you give defaults in the columns you are adding. This way you can safely swap out the schemas without a conflict and keep reading old data. Avro is pretty clever about that, it's called "schema evolution" (in case you want to google a bit more) and allows reader and writer schemas to be a bit different.
As an aside, I want to mention that Kafka has a native HDFS connector (i.e. without Flume) that uses Confluent's schema registry to handle these kinds of schema changes automatically - you can use the registry to check if the schemas are compatible, and if they are - simply write data using the new schema and the Hive table will automatically evolve to match.
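For example (a sketch; the registry URL and the subject name fraud-value are placeholders), the registry can be asked whether a candidate schema is compatible with the latest registered version before you swap it in:

curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema": "{\"type\":\"record\",\"name\":\"fraud\",\"fields\":[{\"name\":\"newColumn1\",\"type\":\"string\",\"default\":\"\"}]}"}' \
  http://localhost:8081/compatibility/subjects/fraud-value/versions/latest
# returns {"is_compatible": true} or {"is_compatible": false}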
I added new columns to the Avro schema like this:
{"name":"newColumn1", "type": "string", "default": ""},
{"name":"newColumn2", "type": "string", "default": ""},
{"name":"newColumn3", "type": "string", "default": ""},
When I use the default property, if the column doesn't exist in the current data it returns the default value, and if the column does exist it returns the data value, as expected.
For setting null as the default value, null has to be the first type in the union and the default has to be the JSON value null, not the string "null":
{ "name": "newColumn4", "type": [ "null", "string" ], "default": null },
The default value must match the first type in the union, so a union like [ "string", "null" ] can only take a string default, and a field declared without any default, such as
{ "name": "newColumn5", "type": [ "null", "string" ]},
won't fill anything in when the column is missing from old data.