Column data types in a Postgres trigger

Background: I am using logical decoding (wal2json) to get updates from a database, and it works wonderfully, except that writes made through views do not come through logical decoding because they are not represented in the WAL.
My idea to get around this is to put a trigger on a view which then sends the row change as JSON on a channel via pg_notify(). Getting the row as JSON was easy: to_jsonb(NEW). Where I am stuck is getting the column data types along with it. wal2json presents row changes like this:
{
  "kind": "insert",
  "schema": "public",
  "table": "table1_with_pk",
  "columnnames": ["a", "b", "c"],
  "columntypes": ["integer", "character varying(30)", "timestamp without time zone"],
  "columnvalues": [2, "Tuning", "2018-03-27 11:58:28.988414"]
}
I would like my channel-notify solution to produce similar output. Is this possible? Thanks in advance.
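For what it's worth, here is a sketch (untested) of one shape such a trigger could take: it pulls the column names and formatted type names from pg_catalog for the relation that fired the trigger, and orders the values to match. The channel name 'row_changes' and the view name my_view are placeholders, and note that a pg_notify() payload is capped at roughly 8000 bytes.
CREATE OR REPLACE FUNCTION notify_row_change() RETURNS trigger AS $$
DECLARE
    names text[];
    types text[];
    vals  jsonb;
BEGIN
    -- TG_RELID is the OID of the relation that fired the trigger;
    -- format_type() yields wal2json-style type names such as
    -- "character varying(30)".
    SELECT array_agg(a.attname::text ORDER BY a.attnum),
           array_agg(format_type(a.atttypid, a.atttypmod) ORDER BY a.attnum),
           jsonb_agg(to_jsonb(NEW) -> a.attname::text ORDER BY a.attnum)
      INTO names, types, vals
      FROM pg_attribute a
     WHERE a.attrelid = TG_RELID
       AND a.attnum > 0
       AND NOT a.attisdropped;

    PERFORM pg_notify('row_changes', jsonb_build_object(
        'kind',         lower(TG_OP),
        'schema',       TG_TABLE_SCHEMA,
        'table',        TG_TABLE_NAME,
        'columnnames',  to_jsonb(names),
        'columntypes',  to_jsonb(types),
        'columnvalues', vals
    )::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Triggers on a view must be row-level INSTEAD OF triggers. NB: a real
-- INSTEAD OF trigger must also apply the change to the base table;
-- that part is omitted here.
CREATE TRIGGER my_view_notify
INSTEAD OF INSERT OR UPDATE ON my_view
FOR EACH ROW EXECUTE FUNCTION notify_row_change();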

Related

Meltano: tap-postgres to target-postgres: data types such as uuid are converted to varchar

I'm working on a Meltano project where I have to extract data from one Postgres database and load it into the final warehouse (also a Postgres database) using "Key-based Incremental Replication".
After Meltano's loading process, all columns of type uuid coming from tap-postgres are changed to varchar in the target.
Could someone help me solve this issue?
Singer taps rarely capture "rich" string types such as uuid, and they output simple {"type": "string"} types without any additional metadata. Furthermore, Singer usually only supports JSONSchema Draft 4, so {"type": "string", "format": "uuid"} would not even be supported by most taps and targets.
Since Singer is an ELT analytics framework and in that context a uuid column type is not different in any meaningful way from a varchar, this is hardly an issue for most people.
If you need to replicate a database with that level of detail you may be better served by a different solution for database backup/replication.
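That said, a uuid rendered as varchar converts back losslessly, so if the type matters downstream, one heavy-handed workaround is to cast the column back after each load. A sketch, assuming a Postgres target; the table and column names are placeholders:
-- Hypothetical post-load step in the target warehouse.
ALTER TABLE analytics.users
    ALTER COLUMN id TYPE uuid USING id::uuid;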

What happens if you update the column of a Delta table by which it is partitioned?

Does it degrade write performance substantially?
I haven't been able to find this in the docs so far. Say we have underlying Parquet files: does Delta rewrite new files without the updated rows for the existing partitions, or is it handled virtually through transaction log entries?
You can always get this information from the history. For example, here is the data from the operationMetrics column after executing an update on the partition column. As you can see, it rewrites files:
{
  "numRemovedFiles": "5",
  "numCopiedRows": "0",
  "numAddedChangeFiles": "0",
  "executionTimeMs": "478",
  "scanTimeMs": "34",
  "numAddedFiles": "5",
  "numUpdatedRows": "5",
  "rewriteTimeMs": "444"
}
And if you check the file names, you will see that they are different.
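To run this check yourself, the metrics come from the table history. A minimal sketch in Spark SQL, with my_delta_table as a placeholder name:
-- Each history row carries the operation ('UPDATE' here) and the
-- operationMetrics map shown above.
DESCRIBE HISTORY my_delta_table;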

Debezium New Record State Extraction SMT doesn't work properly in case of DELETE

I'm trying to apply Debezium's New Record State Extraction SMT using the following configuration:
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": true,
"transforms.unwrap.delete.handling.mode": "rewrite",
"transforms.unwrap.add.fields": "db,schema,table,txId,ts_ms"
For INSERT and UPDATE operations I get the messages as expected, but in the case of a DELETE I get the following as a payload:
"payload": {
"id": 2,
"first_name": "",
"last_name": "",
"__db": "postgres",
"__schema": "schema1",
"__table": "user_details",
"__txId": 5145,
"__ts_ms": 1638760801510,
"__deleted": "true"
}
As you can see above, both the first_name and last_name fields have empty values, though the record I deleted has non-empty values for both of those fields. What I expect to see for those two fields is their value at the moment of deletion, as shown in Debezium's before payload chunk when the New Record State Extraction SMT is not applied.
The reason for the empty values in all columns except the PK is not related to the New Record State Extraction SMT at all. In Postgres, there is a table-level parameter, REPLICA IDENTITY, that controls the information written to the WAL to identify tuple data that is being deleted or updated.
This parameter has 4 modes:
DEFAULT
USING INDEX index
FULL
NOTHING
In the case of DEFAULT, old tuple data is only identified with the primary key of the table. Columns that are not part of the primary key do not have their old value written.
In the case of FULL, all the column values of the old tuple are written to the WAL every time. Hence, executing the following command on the target table will make the old record values be properly populated in the Debezium message:
ALTER TABLE some_table REPLICA IDENTITY FULL;
NOTE: FULL is the most verbose, and also the most resource-consuming, mode. Be careful with it, particularly for heavily updated tables.
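To confirm the change took effect, you can read the current setting back from the catalog; relreplident is 'd' for DEFAULT, 'i' for USING INDEX, 'f' for FULL and 'n' for NOTHING:
SELECT relname, relreplident
FROM pg_class
WHERE relname = 'some_table';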

Database calls, 484ms apart, are producing incorrect results in Postgres

We have "things" sending data to AWS IoT. A rule forwards the payloads to a Lambda which is responsible for inserting or updating the data into Postgres (AWS RDS). The Lambda is written in python and uses PG8000 for interacting with the db. The lambda event looks like this:
{
  "event_uuid": "8cd0b9b1-be93-49f8-1234-af4381052672",
  "date": "2021-07-08T16:09:25.138809Z",
  "serial_number": "a1b2c3",
  "temp": "34"
}
Before inserting the data into Postgres, a query is run on the table to look for any existing event_uuids which are required to be unique. For a specific reason, there is no UNIQUE constraint on the event_uuid column. If the event_uuid does not exist, the data is inserted. If the event_uuid does exist, the data is updated. This all works great, except for the following case.
THE ISSUE: one of our things is sending two identical payloads in very quick succession. It's a device-side problem that we can't resolve at the moment, so we need to account for it. Here are the timestamps from CloudWatch of when each payload was received:
2021-07-08T12:10:09.288-04:00
2021-07-08T12:10:09.772-04:00
As a result of the payloads being received 484ms apart, the Lambda is inserting both payloads instead of inserting the first and performing an update with the second one.
Any ideas on how to get around this?
Here is part of the Lambda code...
conn = make_conn()
event_query = f"""
    SELECT json_build_object('uuid', uuid)
    FROM samples
    WHERE event_uuid='{event_uuid}'
    AND serial_number='{serial_number}'
"""
event_resp = fetch_one(conn, event_query)
if event_resp:
    update_sample_query = f"""
        UPDATE samples SET temp={temp} WHERE uuid='{event_resp["uuid"]}'
    """
else:
    insert_sample_query = f"""
        INSERT INTO samples (uuid, event_uuid, temp)
        VALUES ('{uuid4()}', '{event_uuid}', {temp})
    """

Using a different Avro schema for new columns

I am using Flume + Kafka to sink log data to HDFS. My sink data type is Avro. The Avro schema (.avsc) has 80 fields as columns.
So I created an external table like this:
CREATE EXTERNAL TABLE pgar.tiz_biaws_fraud
PARTITIONED BY(partition_date INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/datapool/flume/biaws/fraud'
TBLPROPERTIES ('avro.schema.url'='hdfs://xxxx-ns/data/datapool/flume/biaws/fraud.avsc')
Now I need to add 25 more columns to the Avro schema. In that case:
If I create a new table with the new 105-column schema, I will have two tables for one project. And if I add or remove some columns in the coming days, I will have to create a new table each time. I am afraid of ending up with a lot of tables that use different schemas for the same project.
If I swap the old schema for the new one in the current table, I will have only one table for the project, but I can't read the old data anymore because of the schema conflict.
What is the best way to use Avro schemas in a case like this?
This is indeed challenging. The best way is to make sure all schema changes you make are compatible with the old data - so only remove columns with defaults, and make sure you give defaults to the columns you are adding. This way you can safely swap out the schemas without a conflict and keep reading old data. Avro is pretty clever about that; it's called "schema evolution" (in case you want to google a bit more) and allows reader and writer schemas to differ a bit.
As an aside, I want to mention that Kafka has a native HDFS connector (i.e. without Flume) that uses Confluent's schema registry to handle these kinds of schema changes automatically - you can use the registry to check if the schemas are compatible, and if they are - simply write data using the new schema and the Hive table will automatically evolve to match.
I added new columns to the Avro schema like this:
{"name":"newColumn1", "type": "string", "default": ""},
{"name":"newColumn2", "type": "string", "default": ""},
{"name":"newColumn3", "type": "string", "default": ""},
When I use the default property, if a column doesn't exist in the current data the reader returns the default value, and if the column does exist it returns the actual data value, as expected.
For setting null as the default value, note that in Avro the default of a union field must match the first type in the union. So "null" has to come first and the default must be a JSON null, not the string "null":
{ "name": "newColumn4", "type": [ "null", "string" ], "default": null },
With [ "string", "null" ] the default can only be a string (e.g. ""), and a new field declared without any default cannot be filled in when reading old data.