Audit data changes with Debezium - hibernate-envers

I have a use case where I want to audit DB table data changes into another table for compliance purposes. Primarily, any changes to the data (inserts/updates/deletes) should be audited. I found different options like JaVers, Hibernate Envers, database triggers, and Debezium.
I am avoiding JaVers and Hibernate Envers because they will not capture data changes made through direct SQL queries or by other applications. The other issue I see is that the audit-related code has to be added to the main application code within the same transaction boundary.
I am also avoiding database triggers, as we do not use triggers at all in any of our deployments.
That leaves me with Debezium, which looks promising. My only concern is that we need Kafka to leverage Debezium. Is Kafka necessary for using Debezium if both the primary table and the audit table sit in the same DB instance?

Debezium is perfect for auditing, but given that it is a source connector, it represents just one part of the data pipeline in your use case. You will capture every table change event (c=create, r=read, u=update, d=delete), store it on a Kafka topic or on local disk, and then you need a sink connector (e.g. the Camel Kafka SQL or JDBC connectors, or kafka-connect-jdbc) to insert the events into the target table.
For the same-transaction-boundary requirement you can use the Outbox pattern, if eventual consistency is fine. There is also an Outbox Event Router SMT component that is part of the project.
Note that Debezium can also run embedded in a standalone Java application, storing its offsets on local disk, but you lose the HA capability given by Kafka Connect running in distributed mode. With the embedded mode you are also switching from a configuration-driven approach to a code-driven one.
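To make the embedded mode concrete, here is a minimal sketch (not a production setup): a standalone Java application that runs the Debezium engine with file-based offset and schema-history storage and writes every change event into an audit table in the same database via plain JDBC. The connection details, table names and the audit_log schema are assumptions, and exact property names vary a little between Debezium versions.

import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class EmbeddedAuditCapture {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("name", "audit-engine");
        props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
        props.setProperty("database.hostname", "localhost");
        props.setProperty("database.port", "3306");
        props.setProperty("database.user", "debezium");
        props.setProperty("database.password", "secret");
        props.setProperty("database.server.id", "5400");
        props.setProperty("topic.prefix", "audit");             // logical name of this source
        props.setProperty("table.include.list", "app.orders");  // tables to capture (assumed)
        // No Kafka: offsets and schema history are kept on local disk.
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/var/lib/audit/offsets.dat");
        props.setProperty("offset.flush.interval.ms", "10000");
        props.setProperty("schema.history.internal", "io.debezium.storage.file.history.FileSchemaHistory");
        props.setProperty("schema.history.internal.file.filename", "/var/lib/audit/schema-history.dat");

        Connection audit = DriverManager.getConnection("jdbc:mysql://localhost:3306/app", "app", "secret");

        DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
                .using(props)
                .notifying(event -> writeAuditRow(audit, event))
                .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine);  // runs until the engine is closed
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try { engine.close(); } catch (Exception ignored) { }
            executor.shutdown();
        }));
    }

    // Persist the raw change event (key plus full before/after payload as JSON) into an
    // audit table assumed to have columns (source_topic, record_key, payload).
    private static void writeAuditRow(Connection audit, ChangeEvent<String, String> event) {
        if (event.value() == null) return; // skip tombstone markers in this sketch
        try (PreparedStatement ps = audit.prepareStatement(
                "INSERT INTO audit_log (source_topic, record_key, payload) VALUES (?, ?, ?)")) {
            ps.setString(1, event.destination());
            ps.setString(2, event.key());
            ps.setString(3, event.value());
            ps.executeUpdate();
        } catch (Exception e) {
            throw new RuntimeException("Failed to persist audit record", e);
        }
    }
}

With this approach the audit insert happens outside the original transaction, so the audit table is only eventually consistent with the source table, and the restart/offset handling that Kafka Connect would otherwise provide becomes your responsibility.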

I found Debezium to be a very comprehensive solution, and it is open source, backed by Red Hat. That gives it not only credibility, but also the assurance that it is going to be supported.
It provides rich configuration to include or exclude (whitelist/blacklist) databases/tables/columns with wildcard patterns, along with controls to limit the data captured from really large columns.
Since it is driven by the binlogs, you get not only the current state but also the previous state of each row. This is ideal for audit trails, and you can customize a proper sink, for example to Elasticsearch, with one topic per table.
Use of Kafka is necessary to account for HA and for latency when bulk updates are made on the DB, even though the primary and audit tables are in the same DB.
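To make the filtering controls concrete, they are ordinary connector properties. A hypothetical fragment is below; the schema/table/column names are invented, and the exact property names depend on the Debezium version (older releases used whitelist/blacklist naming):

import java.util.Properties;

// Illustrative Debezium filtering and column-limiting options (names are made up).
final class AuditFilterConfig {
    static Properties filteringProperties() {
        Properties p = new Properties();
        p.setProperty("database.include.list", "inventory,billing");                   // capture only these schemas
        p.setProperty("table.include.list", "inventory.orders,billing.*");             // regexes of db.table
        p.setProperty("column.exclude.list", "billing.invoices.internal_notes");       // drop a column entirely
        p.setProperty("column.truncate.to.1024.chars", "inventory.orders.free_text");  // cap very large columns
        p.setProperty("column.mask.with.8.chars", "billing.customers.card_number");    // mask sensitive values
        return p;
    }
}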

Related

Kafka connect: How to handle database schema/table changes

I am wondering if there is a documented process for handling database schema changes. I am using the Debezium source connector for Postgres and the Confluent JDBC sink connector to replicate database changes. I need to make some changes in the database, as below:
Add new columns to an existing table
Modify a column's type and rename it.
I am not sure what the best way to do this is. The solution I can think of is:
Stop source connector
Wait for sinks to consume all messages
Upgrade the databases
Start source and sink connector
Debezium will automatically add new fields to the record schema for new columns, so you would update your consumers and downstream systems first to prepare for those events. No need to stop the source...
If you change types and names, then you might run into backwards-incompatible schema changes, and these operations are generally not recommended. Instead, always add new columns but "deprecate" and don't use the old ones. After you are done reading events from those old columns in all other systems, then drop those columns.

Write an Apache Iceberg table to Azure ADLS / S3 without using an external catalog

I'm trying to create an iceberg table format on cloud object storage.
The Iceberg table format needs a catalog. This catalog stores the current metadata pointer, which points to the latest metadata. The Iceberg quick start doc lists JDBC, Hive Metastore, AWS Glue, Nessie and HDFS as catalogs that can be used.
My goal is to store the current metadata pointer (version-hint.text) along with the rest of the table data (metadata, manifest lists, manifests, Parquet data files) in the object store itself.
With HDFS as the catalog, there's a file called version-hint.text in the table's metadata folder whose contents is the version number of the current metadata file.
Looking at HDFS as one of the possible catalogs, I should be able to use ADLS or S3 to store the current metadata pointer along with the rest of the data. For example: Spark connecting to ADLS using the ABFSS interface and creating an Iceberg table along with the catalog.
My questions are:
Is it safe to use the version hint file as the current metadata pointer in ADLS/S3? Will I lose any of the Iceberg features if I do this? Looking at this comment from one of the contributors suggests that it's not ideal for production.
The version hint file is used for Hadoop tables, which are named that way because they are intended for HDFS. We also use them for local FS tests, but they can't be safely used concurrently with S3. For S3, you'll need a metastore to enforce atomicity when swapping table metadata locations. You can use the one in iceberg-hive to use the Hive metastore.
Looking at the comments on this thread: is the version-hint.text file optional?
we iterate through the possible metadata locations and stop only if no new snapshot is available
Could someone please clarify?
I'm trying to do a POC with Iceberg. At this point the requirement is to be able to write new data from Databricks to the table at least every 10 minutes. This frequency might increase in the future.
The data, once written, will be read by Databricks and Dremio.
I would definitely try to use a catalog other than the HadoopCatalog / hdfs type for production workloads.
As somebody who works on Iceberg regularly (I work at Tabular), I can say that we do think of the hadoop catalog as being more for testing.
The major reason for that, as mentioned in your threads, is that the catalog provides an atomic locking compare-and-swap operation for the current top level metadata.json file. This compare and swap operation allows for the query that's updating the table to grab a lock for the table after doing its work (optimistic locking), write out the new metadata file, update the state in the catalog to point to the new metadata file, and then release that lock.
That lock isn't something that really works out of the box with the HDFS / hadoop type catalog. It then becomes possible for two concurrent writers to each write out a metadata file; one sets it and the other's work gets erased, or undefined behavior occurs, as ACID compliance is lost.
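To illustrate what that compare-and-swap buys you, here is a toy Java sketch (deliberately not Iceberg's real code): a commit succeeds only if the catalog's pointer still references the metadata file the writer started from, which is exactly the atomic guarantee a bare HDFS directory or S3 prefix cannot give you.

import java.util.concurrent.atomic.AtomicReference;

// Toy model of a catalog's atomic metadata-pointer swap (not Iceberg's real implementation).
final class ToyCatalog {
    // The single source of truth: the location of the current metadata.json.
    private final AtomicReference<String> currentMetadata =
            new AtomicReference<>("s3://bucket/tbl/metadata/v1.metadata.json");

    // A writer commits by swapping the pointer only if nobody else committed first.
    boolean commit(String expectedMetadata, String newMetadata) {
        return currentMetadata.compareAndSet(expectedMetadata, newMetadata);
    }

    public static void main(String[] args) {
        ToyCatalog catalog = new ToyCatalog();
        String base = "s3://bucket/tbl/metadata/v1.metadata.json";

        // Two writers that both started from v1 race to commit:
        boolean writerA = catalog.commit(base, "s3://bucket/tbl/metadata/v2-writer-a.metadata.json");
        boolean writerB = catalog.commit(base, "s3://bucket/tbl/metadata/v2-writer-b.metadata.json");

        System.out.println(writerA); // true  - writer A wins
        System.out.println(writerB); // false - writer B must retry on top of A's metadata instead of clobbering it
    }
}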
If you have an RDS instance or some sort of JDBC database, I would suggest that you consider using that, at least temporarily. There's also the DynamoDB catalog, and if you're using Dremio then Nessie can be used as your catalog as well.
In the next version of Iceberg -- the next major version after 0.14, which will likely be 1.0.0 -- there is a procedure to register tables into a catalog, which makes it easy to move a table from one catalog to another in a very efficient metadata-only operation, such as CALL catalog.system.register_table('$new_table_name', '$metadata_file_location');
So you're not locked into one catalog if you start with something simple like the JDBC catalog and then move on to something else. If you're just working out a POC, you could start with the Hadoop catalog and then move to something like the JDBC catalog once you're more familiar, but it's important to be aware of the potential pitfalls of the hadoop type catalog, which does not have the atomic compare-and-swap locking operation for the metadata file that represents the current table state.
There's also an option to provide a locking mechanism to the hadoop catalog, such as ZooKeeper or etcd, but that's a somewhat advanced feature and would require that you write your own custom lock implementation.
So I still stand by the JDBC catalog as the easiest to get started with, as most people can get an RDBMS from their cloud provider or spin one up pretty easily. And now that you will be able to efficiently move your tables to a new catalog with the code in the current master branch or in the next major Iceberg release, it's not something to worry about too much.
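For reference, a rough sketch of wiring Spark to an Iceberg JDBC catalog with the data and metadata on ADLS is shown below. The catalog name, JDBC URL, credentials and abfss:// path are placeholders, and the Iceberg Spark runtime jar, a PostgreSQL JDBC driver and your Azure storage credentials still have to be configured separately.

import org.apache.spark.sql.SparkSession;

// Sketch: Spark session using an Iceberg JDBC catalog, with table files in object storage.
public class IcebergJdbcCatalogExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("iceberg-jdbc-catalog-poc")
                .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
                .config("spark.sql.catalog.lake.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog")
                .config("spark.sql.catalog.lake.uri", "jdbc:postgresql://catalog-db:5432/iceberg")
                .config("spark.sql.catalog.lake.jdbc.user", "iceberg")
                .config("spark.sql.catalog.lake.jdbc.password", "secret")
                // Table data, metadata files, manifest lists and manifests all live in ADLS:
                .config("spark.sql.catalog.lake.warehouse",
                        "abfss://lakehouse@myaccount.dfs.core.windows.net/warehouse")
                .getOrCreate();

        // Only the pointer to the current metadata file is tracked in the JDBC database.
        spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.db");
        spark.sql("CREATE TABLE IF NOT EXISTS lake.db.events (id BIGINT, ts TIMESTAMP) USING iceberg");
        spark.sql("INSERT INTO lake.db.events VALUES (1, current_timestamp())");
    }
}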
Looking at the comments on this thread: is the version-hint.text file optional?
Yes. The version-hint.text file is used by the hadoop type catalog to attempt to provide an authoritative location for the table's current top-level metadata file. So version-hint.text is only found with the hadoop catalog; other catalogs store the pointer using their own specific mechanism. A table in an RDBMS instance is used to store all of the catalog's "version hints" when using the JDBC catalog, or even the Hive catalog, which is backed by the Hive Metastore (and very typically an RDBMS). Other catalogs include the DynamoDB catalog.
If you have more questions, the Apache Iceberg slack is very active.
Feel free to check out the docker-spark-iceberg getting started tutorial (which I helped create), which includes Jupyter notebooks and a docker-compose setup.
It uses the JDBC catalog backed by Postgres. With that, you can get a feel for what the catalog is doing by ssh'ing into the containers and running psql commands, as well as by looking at the table data on your local machine. There are also some nice tutorials with sample data!
https://github.com/tabular-io/docker-spark-iceberg

Streaming PostgreSQL tables into Google BigQuery

I would like to automatically stream data from an external PostgreSQL database into a Google Cloud Platform BigQuery database in my GCP account. So far, I have seen that one can query external databases (MySQL or PostgreSQL) with the EXTERNAL_QUERY() function, e.g.:
https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries
But for that to work, the database has to be in GCP Cloud SQL. I tried to see what options there are for streaming from the external PostgreSQL into a Cloud SQL PostgreSQL database, but I could only find information about replicating it as a one-time copy, not streaming:
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external
The reason I want this streaming into BigQuery is that I am using Google Data Studio to create reports from the external PostgreSQL, which works great, but GDS can only accept SQL query parameters if the data comes from a Google BigQuery database. For example, if we have a table with 1M entries and we want a Google Data Studio parameter to be supplied by the user, this will turn into:
SELECT * from table WHERE id=#parameter;
which means that the query will be faster, and won't hit the 100K records limit in Google Data Studio.
What's the best way of creating a connection between an external PostgreSQL (read-only access) and Google BigQuery so that when querying via BigQuery, one gets the same live results as querying the external PostgreSQL?
Perhaps you missed the options stated in the Google Cloud user guide?
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external#setup-replication
Notice in this section, it says:
"When you set up your replication settings, you can also decide whether the Cloud SQL replica should stay in-sync with the source database server after the initial import is complete. A replica that should stay in-sync is online. A replica that is only updated once, is offline."
I suspect online mode is what you are looking for.
What you are looking for will require some architecture design based on your needs, and some coding. There isn't a feature to automatically sync your PostgreSQL database with BigQuery (apart from the EXTERNAL_QUERY() functionality, which has some limitations: one connection per DB, performance, total number of connections, etc.).
In case you are not looking for the data in real time, what you can do, with Airflow for instance, is have a DAG that connects to all your DBs once per day (using KubernetesPodOperator, for instance), extracts the previous day's data, and loads it into BQ. A typical ETL process, though in this case more EL(T). You can run this process more often if you cannot wait one day for the previous day of data.
On the other hand, if streaming is what you are looking for, then I can think of a Dataflow job. I guess you can connect using a JDBC connector; a sketch is included below.
In addition, depending on how your pipeline is structured, it might be easier to implement (but harder to maintain) if, at the same moment you write to your PostgreSQL DB, you also stream your data into BigQuery.
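To illustrate the Dataflow/JDBC idea above, here is a rough Apache Beam sketch that reads from the external PostgreSQL over JDBC and appends the rows into an existing BigQuery table. The table names, columns and credentials are placeholders, and as written it is a batch job you would schedule rather than a true streaming pipeline:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PostgresToBigQuery {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        pipeline
            .apply("ReadFromPostgres", JdbcIO.<TableRow>read()
                .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                        "org.postgresql.Driver", "jdbc:postgresql://external-host:5432/appdb")
                    .withUsername("readonly_user")
                    .withPassword("secret"))
                .withQuery("SELECT id, name, updated_at FROM source_table")
                // Map each JDBC row to a BigQuery TableRow (columns are assumed).
                .withRowMapper(rs -> new TableRow()
                    .set("id", rs.getLong("id"))
                    .set("name", rs.getString("name"))
                    .set("updated_at", rs.getTimestamp("updated_at").toInstant().toString()))
                .withCoder(TableRowJsonCoder.of()))
            .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
                .to("my-project:reporting.source_table")  // an existing BigQuery table
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        pipeline.run();
    }
}

Run on Dataflow, this gives a snapshot-style load; for near-real-time results you would either schedule it frequently with an incremental WHERE clause or move to a CDC-based source.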
Not sure if you have tried this already, but instead of adding a parameter, if you add a dropdown filter based on a dimension, Data Studio will push that down to the underlying Postgres db in this form:
SELECT * from table WHERE id=$filter_value;
This should achieve the same results you want without going through BigQuery.

How to de-normalize data in Kafka?

I have a MySQL database with ~20 tables. The data is normalized.
Considering this example:
book -> book_authors <- authors
we try to stream the book info, e.g.:
{book_id:3, title='Red', authors:[{id:3, name:'Mary'}, {id:4, name:'John'}]}
One case where we see a serious problem: if an author's name changes, we have to re-generate all of their books.
I'm using Debezium to post the change log for each table in Kafka.
I am unable to find an elegant solution for data denormalization, e.g. for pushing it to Elasticsearch, MongoDB, etc.
I identified two solutions, but both seem to fail:
De-normalize data into a new MySQL table, at the source, and use Debezium to stream only this new table. This might not be possible, and we would have to invest a lot of effort in changing the source system's code.
Join the streams in Kafka, though I didn't manage to make it work. It seems that Kafka does not allow joining on a non-primary-key field. This seems to be a common situation with N-to-N relations.
Has anyone found a solution for denormalizing data and publishing it into a Kafka stream? This seems to be a common problem, and I couldn't find any solution yet.
Try publishing the changes from Debezium to the topics book, book_authors and authors in their raw form, which creates three disjoint streams.
Create a simple consumer application that subscribes to all three topics. Upon receiving a message on either topic, it queries the database to obtain the latest snapshot of the referenced entities, merges the data together, and publishes the denormalised version onto a new merged_book_authors topic. Downstream consumers can read directly from the merged topic.
A minor variation of the above: rather than querying the database for each Debezium change, which may be slow, build a materialised view using a fast key-value or document store such as Redis. This is a little more work, but will (1) improve the throughput of the overall pipeline and (2) take the load off the system-of-record database.
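A rough Java sketch of such a consumer is below. It assumes Debezium's record-flattening SMT (ExtractNewRecordState) is applied so each message value is the plain row as JSON, that the topics are named book, book_authors and authors without the usual Debezium server prefix, and that the column names match the question; deletes/tombstones are simply skipped here.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class BookMerger {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) throws Exception {
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "book-merger");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
             Connection db = DriverManager.getConnection("jdbc:mysql://localhost:3306/shop", "app", "secret")) {

            consumer.subscribe(List.of("book", "book_authors", "authors"));
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    if (rec.value() == null) continue; // skip tombstones in this sketch
                    JsonNode row = MAPPER.readTree(rec.value());
                    for (long bookId : affectedBookIds(db, rec.topic(), row)) {
                        producer.send(new ProducerRecord<>("merged_book_authors",
                                Long.toString(bookId), bookSnapshot(db, bookId)));
                    }
                }
                consumer.commitSync();
            }
        }
    }

    // Work out which book documents must be rebuilt for a given change event.
    static List<Long> affectedBookIds(Connection db, String topic, JsonNode row) throws Exception {
        List<Long> ids = new ArrayList<>();
        if (topic.equals("book") || topic.equals("book_authors")) {
            ids.add(row.path("book_id").asLong());
        } else { // an author changed: rebuild every book they wrote
            try (PreparedStatement ps = db.prepareStatement(
                    "SELECT book_id FROM book_authors WHERE author_id = ?")) {
                ps.setLong(1, row.path("id").asLong());
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) ids.add(rs.getLong(1));
                }
            }
        }
        return ids;
    }

    // Query the system of record for the latest denormalised view of one book.
    static String bookSnapshot(Connection db, long bookId) throws Exception {
        ObjectNode book = MAPPER.createObjectNode();
        ArrayNode authors = book.putArray("authors");
        try (PreparedStatement ps = db.prepareStatement(
                "SELECT b.book_id, b.title, a.id, a.name FROM book b "
                        + "JOIN book_authors ba ON ba.book_id = b.book_id "
                        + "JOIN authors a ON a.id = ba.author_id WHERE b.book_id = ?")) {
            ps.setLong(1, bookId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    book.put("book_id", rs.getLong(1)).put("title", rs.getString(2));
                    authors.addObject().put("id", rs.getLong(3)).put("name", rs.getString(4));
                }
            }
        }
        return MAPPER.writeValueAsString(book);
    }
}

The same loop is where you could switch the lookups to a Redis-backed materialised view, as suggested in the variation above.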

Forwarding data from one source to another in real time

I have a legacy system that is capable of inserting and updating data from its database into a remote RDBMS (using a JDBC driver) in real time. I cannot change the code since I don't have it.
We are thinking of moving this data to a NoSQL data store like Cassandra.
I am thinking of deploying Postgres in the middle and pushing the data to Cassandra or writing it to a flat file. Since there are frequent updates, I will have to store the data in two databases. Is there any ETL process which can listen to SQL statements (insert, update, delete) and forward them to a different data store?
One option would be to use Bottled Water to capture changes in PostgreSQL and create a consumer that applies those changes to e.g. Cassandra.