What happens if you update the column of a Delta table by which it is partitioned?

What happens if you update the column of a Delta table by which it is partitioned?
Does it degrade write performance substantially?
I haven't been able to find this in the docs so far: given that the underlying files are Parquet, does Delta rewrite new files without the updated rows for the existing partitions, or is the update handled virtually through transaction log entries?

You can always get this information from the history. For example, here is the content of the operationMetrics column after executing an update on the partition column. As you can see, it rewrites files:
{
"numRemovedFiles": "5",
"numCopiedRows": "0",
"numAddedChangeFiles": "0",
"executionTimeMs": "478",
"scanTimeMs": "34",
"numAddedFiles": "5",
"numUpdatedRows": "5",
"rewriteTimeMs": "444"
}
And if you check the file names, you will see that they are different.
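For reference, a quick way to pull these metrics is to query the table history; a minimal sketch in Spark SQL, where my_partitioned_table is a placeholder for your table name (or delta.`/path/to/table`):
-- Show the most recent operation recorded in the Delta transaction log;
-- the operationMetrics column of the returned row contains the numbers shown above.
DESCRIBE HISTORY my_partitioned_table LIMIT 1;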

Related

Debezium New Record State Extraction SMT doesn't work properly in case of DELETE

I'm trying to apply Debezium's New Record State Extraction SMT using the following configuration:
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": true,
"transforms.unwrap.delete.handling.mode": "rewrite",
"transforms.unwrap.add.fields": "db,schema,table,txId,ts_ms"
For INSERT and UPDATE operations I get the messages as expected, but in the case of DELETE I get the following payload:
"payload": {
"id": 2,
"first_name": "",
"last_name": "",
"__db": "postgres",
"__schema": "schema1",
"__table": "user_details",
"__txId": 5145,
"__ts_ms": 1638760801510,
"__deleted": "true"
}
As you can see above, both the first_name and last_name fields have empty values, even though the record I deleted has non-empty values for both of those fields. What I expect to see for those two fields is their value at the moment of deletion, as shown in Debezium's before payload chunk when the New Record State Extraction SMT is not applied.
The reason for the empty values in all columns except the PK is not related to the New Record State Extraction SMT at all. In Postgres, there is a table-level REPLICA IDENTITY parameter that controls the information written to the WAL to identify tuple data that is being deleted or updated.
This parameter has 4 modes:
DEFAULT
USING INDEX index
FULL
NOTHING
In the case of DEFAULT, the old tuple data is identified only by the primary key of the table. Columns that are not part of the primary key do not have their old values written.
In the case of FULL, all column values of the old tuple are always written to the WAL. Hence, executing the following command on the target table will make the old record values be properly populated in the Debezium message:
ALTER TABLE some_table REPLICA IDENTITY FULL;
NOTE: FULL is the most verbose, as well as the most resource-consuming, mode. Be careful with it, particularly for heavily updated tables.
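If you want to confirm which mode a table currently uses before changing it, the setting is stored in pg_class.relreplident ('d' = default, 'n' = nothing, 'f' = full, 'i' = index). A small sketch, using the user_details table from the question as the example:
-- Check the current REPLICA IDENTITY mode of the table
SELECT relreplident
FROM pg_class
WHERE oid = 'schema1.user_details'::regclass;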

column datatypes in postgres trigger

Background: I am using logical decoding (wal2json) to get updates from a DB, and it works wonderfully except that traffic going through views does not come through logical decoding because it's not represented in the WAL.
My idea to get around this is to put a trigger on a view which then sends the row change as JSON on a channel via pg_notify(). Getting the row as JSON was easy: to_jsonb(NEW). Where I am stuck is getting the column data types along with it. wal2json presents row changes like this:
{
"kind": "insert",
"schema": "public",
"table": "table1_with_pk",
"columnnames": ["a", "b", "c"],
"columntypes": ["integer", "character varying(30)", "timestamp without time zone"],
"columnvalues": [2, "Tuning", "2018-03-27 11:58:28.988414"]
}
I would like my channel-notify solution to have similar output; is this possible? Thanks in advance.
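For what it's worth, a minimal sketch of the approach described above, assuming a view named some_view and a channel named row_changes (both hypothetical): it sends the row as JSON via pg_notify() and pulls the declared column types from information_schema.columns. Note that data_type there omits type modifiers such as (30), unlike the wal2json output, and this sketch only covers the notify part.
-- Sketch only: notify row changes on a view as JSON, including declared column types.
CREATE OR REPLACE FUNCTION notify_view_change() RETURNS trigger AS $$
DECLARE
  types jsonb;
BEGIN
  -- Declared column types of the view, keyed by column name
  SELECT jsonb_object_agg(column_name, data_type)
    INTO types
    FROM information_schema.columns
   WHERE table_schema = TG_TABLE_SCHEMA
     AND table_name = TG_TABLE_NAME;
  PERFORM pg_notify(
    'row_changes',
    jsonb_build_object(
      'kind', lower(TG_OP),
      'schema', TG_TABLE_SCHEMA,
      'table', TG_TABLE_NAME,
      'columntypes', types,
      'row', to_jsonb(NEW)
    )::text
  );
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Views only support row-level INSTEAD OF triggers.
-- NOTE: an INSTEAD OF trigger replaces the view's own write path, so a real
-- version would also have to apply the change to the underlying table.
CREATE TRIGGER some_view_notify
INSTEAD OF INSERT OR UPDATE ON some_view
FOR EACH ROW EXECUTE PROCEDURE notify_view_change();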

Spark Delta Table Updates

I am working in Microsoft Azure Databricks environment using sparksql and pyspark.
So I have a Delta table on a lake where data is partitioned by, say, file_date. Every partition contains files storing millions of records per day, with no primary/unique key. All these records have a "status" column which can either be NULL (if everything looks good on that specific record) or not null (say, if a particular lookup mapping for a particular column is not found). Additionally, my process uses another folder called "mapping", which gets refreshed on a periodic basis, let's say nightly to keep it simple, and from which the mappings are read.
On a daily basis, there is a good chance that about 100-200 rows get errored out (status column containing not-null values). From these files, on a daily basis (hence the partitioning by file_date), a downstream job pulls all the valid records and sends them for further processing, ignoring those 100-200 errored records while waiting for the correct mapping file to be received. The downstream job, in addition to the valid-status records, should also check whether a mapping is now available for the errored records and, if so, take them further down the pipeline as well (after, of course, updating the data lake with the appropriate mapping and status).
What is the best way to go? Ideally, I would first update the Delta table/lake with the correct mapping and set the status column to "available_for_reprocessing", and my downstream job would pull the valid data for the day plus the "available_for_reprocessing" data and, after processing, update the status back to "processed". But this seems to be super difficult using Delta.
I was looking at "https://docs.databricks.com/delta/delta-update.html", and the update example there only shows a simple update with constants, not updates from multiple tables.
The other, but most inefficient, option is to pull ALL the data (both processed and errored) for, say, the last 30 days, get the mapping for the errored records and write the dataframe back into the delta lake using the replaceWhere option. This is super inefficient, as we are reading everything (hundreds of millions of records) and writing everything back just to process at most about a thousand records. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at "https://docs.databricks.com/delta/delta-update.html", the example provided is for very simple updates. Without a unique key, it is impossible to update specific records as well. Can someone please help?
I can use pyspark or sparksql, but I am lost.
If you want to update one column ('status') on the condition that all lookups are now correct for rows where they weren't correct before (where 'status' is currently incorrect), I think the UPDATE command along with EXISTS can help you solve this. It isn't mentioned in the update documentation, but it works for both delete and update operations, effectively allowing you to update/delete records on joins.
For your scenario, I believe the SQL command would look something like this:
UPDATE your_db.table_name AS a
SET status = 'correct'
WHERE EXISTS
(
  SELECT *
  FROM your_db.table_name AS b
  JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
  JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
  -- ... add further lookups if needed
  WHERE
    b.status = 'incorrect' AND
    a.lookup_column_a = b.lookup_column_a AND
    a.lookup_column_b = b.lookup_column_b
)
Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN
  UPDATE SET maindept.dname = upddept.updated_name,
             maindept.location = upddept.updated_location
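Applied to the scenario in the question, a hedged sketch might look like the following (the table name, column names, partition value, and status values are assumptions, not the asker's actual schema):
MERGE INTO my_delta_table AS t
USING mapping AS m
ON t.lookup_column = m.lookup_column
   AND t.file_date = '2021-12-01'   -- limit the rewrite to the affected partition(s)
   AND t.status IS NOT NULL         -- only the previously errored rows
WHEN MATCHED THEN
  UPDATE SET t.mapped_value = m.mapped_value,
             t.status = 'available_for_reprocessing'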

using different avro schema for new columns

I am using Flume + Kafka to sink log data to HDFS. My sink data type is Avro. In the Avro schema (.avsc), there are 80 fields as columns.
So I created an external table like this:
CREATE external TABLE pgar.tiz_biaws_fraud
PARTITIONED BY(partition_date INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/datapool/flume/biaws/fraud'
TBLPROPERTIES ('avro.schema.url'='hdfs://xxxx-ns/data/datapool/flume/biaws/fraud.avsc')
Now I need to add 25 more columns to the Avro schema. In that case:
If I create a new table with the new 105-column schema, I will have two tables for one project. And if I add or remove some columns in the coming days, I will have to create a new table each time. I am afraid of ending up with a lot of tables that use different schemas for the same project.
If I swap the old schema for the new schema in the current table, I will have only one table per project, but I can't read the old data anymore because of the schema conflict.
What is the best way to use Avro schemas in a case like this?
This is indeed challenging. The best way is to make sure all schema changes you make are compatible with the old data - so only remove columns with defaults, and make sure you give defaults in the columns you are adding. This way you can safely swap out the schemas without a conflict and keep reading old data. Avro is pretty clever about that, it's called "schema evolution" (in case you want to google a bit more) and allows reader and writer schemas to be a bit different.
As an aside, I want to mention that Kafka has a native HDFS connector (i.e. without Flume) that uses Confluent's schema registry to handle these kinds of schema changes automatically - you can use the registry to check if the schemas are compatible, and if they are - simply write data using the new schema and the Hive table will automatically evolve to match.
I added new columns to the Avro schema like this:
{"name":"newColumn1", "type": "string", "default": ""},
{"name":"newColumn2", "type": "string", "default": ""},
{"name":"newColumn3", "type": "string", "default": ""},
When I use the default property, if a column doesn't exist in the current data it returns the default value; if the column does exist, it returns the data value as expected.
For setting a null value as the default, the null branch has to come first in the union and the default itself must be JSON null, because the Avro spec requires the default value to match the first type in the union:
{ "name": "newColumn4", "type": [ "null", "string" ], "default": null },
A union such as [ "string", "null" ] with "default": "null" would instead default to the literal string "null", and a nullable union without any default cannot be filled in when reading old data that lacks the field.
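If you take the swap route from the question, pointing the existing Hive table at the new, backward-compatible schema file is just a table-property change; a sketch, assuming the new schema was uploaded as fraud_v2.avsc next to the old one:
-- Point the table at the new 105-column schema (the new file name is an assumption)
ALTER TABLE pgar.tiz_biaws_fraud
SET TBLPROPERTIES ('avro.schema.url'='hdfs://xxxx-ns/data/datapool/flume/biaws/fraud_v2.avsc');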

Is there a possibility to keep column order when reading parquet?

Saving a dataframe with columns (e.g. "a", "b") as parquet and then reading the parquet back at a later point in time does not deliver the same column order (it could be "b", "a", for example) that the file was saved with.
Unfortunately, I was not able to figure out how the order is influenced and how I can control it.
How can I keep the original column order when reading parquet?
PARQUET-188 suggests that column ordering is not part of the parquet spec, so it's probably not a good idea to rely on the ordering. You could however manage this yourself, e.g. by loading/saving the dataframe columns in lexicographical order, or by storing the column names.
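For the "manage this yourself" route, the simplest fix is to select the columns explicitly in the order you want after reading; a sketch in Spark SQL (assuming Spark is being used here), where the path is a placeholder and the column names a, b come from the question:
-- Read the parquet files and force the original column order explicitly
SELECT a, b
FROM parquet.`/path/to/saved/parquet`;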