Apache Iceberg: how to set write.metadata.previous-versions-max

Having many historical metadata files in Apache Iceberg helps produce a linear history of table versions and ensures that concurrent writes are not lost.
In Apache Iceberg there is a table write property called:
write.metadata.previous-versions-max
which is the max number of previous version metadata files to keep before deleting after commit (https://iceberg.apache.org/docs/latest/configuration/#write-properties).
The default is 100.
The documentation also says:
Iceberg keeps track of table metadata using JSON files. Each change to a table produces a new metadata file to provide atomicity.
Old metadata files are kept for history by default. Tables with frequent commits, like those written by streaming jobs, may need to regularly clean metadata files.
How does it work under the hood and what do I actually gain by reducing or increasing this number?
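As a rough mental model of what the property controls, here is a toy simulation in Python. It is a sketch under assumptions, not Iceberg's actual implementation: the function name, the file naming, and the in-memory lists are made up for illustration. It also assumes that old files are only removed when the companion property write.metadata.delete-after-commit.enabled is set to true (it defaults to false in the configuration docs), which is modeled here by the delete_after_commit flag.

```python
from collections import deque


def simulate_commits(num_commits, previous_versions_max, delete_after_commit=True):
    """Toy model of metadata-file retention (illustrative, not Iceberg code)."""
    previous = deque()  # tracked previous metadata files, oldest first
    on_disk = []        # every metadata file still present in storage
    current = None
    for version in range(1, num_commits + 1):
        # Each commit writes a brand-new metadata file (hypothetical naming).
        new_file = f"v{version}.metadata.json"
        on_disk.append(new_file)
        if current is not None:
            previous.append(current)
        current = new_file
        # Trim the tracked history down to the configured maximum.
        while len(previous) > previous_versions_max:
            dropped = previous.popleft()
            if delete_after_commit:
                on_disk.remove(dropped)
    return current, list(previous), on_disk
```

In this model, lowering the number means fewer small JSON files accumulate in the metadata folder of a frequently committed table; raising it keeps a longer trail of earlier table states at the cost of more files.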

Related

How to actually delete files in Iceberg

I know that in Apache Iceberg I can set limits on number and age of snapshots, and that "deleting" data from the table does not result in underlying data removal, it simply masks or deletes tracking information.
I would like to actually delete the underlying files on delete, however. I know this will make time-travel inconsistent, but it is still a business requirement.
https://iceberg.apache.org/docs/latest/configuration/
As best as I can tell, I'll have to track and manage the physical life-cycle of every file independently. Am I missing something?
If you don't care about table history (or time travel) you can simply call the expire_snapshots procedure after each delete.
This is a common question among Iceberg users.
We often need an asynchronous task to delete and expire snapshots and data.
If you use Spark, you can use https://iceberg.apache.org/docs/latest/spark-procedures/#expire_snapshots, as shay said.
You can also do this using the Java API provided by Iceberg: https://iceberg.apache.org/docs/latest/api/.
Starting a separate task for each table is difficult to manage, and tables often have different TTLs. In that case, you can add custom configuration properties to each table, then run a single job that scans all Iceberg tables and decides, based on those properties, whether to delete expired snapshots and data.
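A minimal sketch of that idea in Python, assuming the table properties have already been read into plain dictionaries. The property name custom.snapshot-ttl-days, the default of 7 days, and the catalog name my_catalog are all hypothetical; only the expire_snapshots procedure and its table and older_than arguments come from the Spark procedures documentation linked above.

```python
from datetime import datetime, timedelta, timezone

TTL_PROPERTY = "custom.snapshot-ttl-days"  # hypothetical custom table property
DEFAULT_TTL_DAYS = 7                       # assumed fallback when the property is unset


def expiration_cutoffs(tables, now):
    """Map each table name to the timestamp before which snapshots expire."""
    cutoffs = {}
    for name, properties in tables.items():
        ttl_days = int(properties.get(TTL_PROPERTY, DEFAULT_TTL_DAYS))
        cutoffs[name] = now - timedelta(days=ttl_days)
    return cutoffs


def expire_statement(catalog, table, cutoff):
    """Build the Spark SQL call for one table (catalog name is an assumption)."""
    ts = cutoff.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{ts}')"
    )
```

A scheduled job could loop over the result of expiration_cutoffs and pass each generated statement to spark.sql, so that one task serves every table regardless of its TTL.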
If you are using Iceberg with Hive (version 4.0.0-alpha2 or later), you can try the expire_snapshots command in Beeline.
For example:
ALTER TABLE test_table EXECUTE expire_snapshots('2021-12-09 05:39:18.689000000');
See:
https://docs.cloudera.com/cdw-runtime/cloud/iceberg-how-to/topics/iceberg-expiring-snapshots.html
Hive Jira adding support:
https://issues.apache.org/jira/browse/HIVE-26354

Write Apache Iceberg table to Azure ADLS / S3 without using external catalog

I'm trying to create an Iceberg table on cloud object storage.
In the image below we can see that the Iceberg table format needs a catalog. This catalog stores the current metadata pointer, which points to the latest metadata. The Iceberg quick start doc lists JDBC, Hive MetaStore, AWS Glue, Nessie and HDFS as catalogs that can be used.
My goal is to store the current metadata pointer (version-hint.text) along with the rest of the table data (metadata, manifest lists, manifests, Parquet data files) in the object store itself.
With HDFS as the catalog, there’s a file called version-hint.text in
the table’s metadata folder whose contents is the version number of
the current metadata file.
Looking at HDFS as one of the possible catalogs, I should be able to use ADLS or S3 to store the current metadata pointer along with the rest of the data. For example: Spark connecting to ADLS using the ABFSS interface and creating an Iceberg table along with the catalog.
My question is:
Is it safe to use the version hint file as the current metadata pointer in ADLS/S3? Will I lose any of the Iceberg features if I do this? This comment from one of the contributors suggests that it's not ideal for production.
The version hint file is used for Hadoop tables, which are named that
way because they are intended for HDFS. We also use them for local FS
tests, but they can't be safely used concurrently with S3. For S3,
you'll need a metastore to enforce atomicity when swapping table
metadata locations. You can use the one in iceberg-hive to use the
Hive metastore.
Looking at comments on this thread, is the version-hint.text file optional?
we iterate through on the possible metadata locations and stop only if
there is not new snapshot is available
Could someone please clarify?
I'm trying to do a POC with Iceberg. At this point the requirement is to be able to write new data from Databricks to the table at least every 10 minutes. This frequency might increase in the future.
The data, once written, will be read by Databricks and Dremio.
I would definitely try to use a catalog other than the HadoopCatalog / hdfs type for production workloads.
As somebody who works on Iceberg regularly (I work at Tabular), I can say that we do think of the hadoop catalog as being more for testing.
The major reason for that, as mentioned in your threads, is that the catalog provides an atomic locking compare-and-swap operation for the current top level metadata.json file. This compare and swap operation allows for the query that's updating the table to grab a lock for the table after doing its work (optimistic locking), write out the new metadata file, update the state in the catalog to point to the new metadata file, and then release that lock.
The lock isn't something that really works out of the box with the HDFS / hadoop type catalog. It then becomes possible for two concurrent actions to each write out a metadata file; one sets it as current and the other's work gets erased, or undefined behavior occurs, because ACID compliance is lost.
If you have an RDS instance or some sort of JDBC database, I would suggest that you consider using that temporarily. There's also the DynamoDB catalog, and if you're using Dremio then Nessie can be used as your catalog as well.
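To make the compare-and-swap idea concrete, here is a small in-memory sketch in Python. ToyCatalog and commit are hypothetical names used only for illustration; a real catalog (JDBC, Hive Metastore, and so on) enforces the same check with its own mechanism, and this is not Iceberg's code.

```python
import threading


class ToyCatalog:
    """In-memory stand-in for a catalog's atomic metadata pointer (illustrative)."""

    def __init__(self, location):
        self._location = location
        self._lock = threading.Lock()

    def current(self):
        with self._lock:
            return self._location

    def compare_and_swap(self, expected, new):
        """Point to `new` only if the pointer still equals `expected`."""
        with self._lock:
            if self._location != expected:
                return False  # someone else committed first
            self._location = new
            return True


def commit(catalog, make_new_metadata, max_retries=3):
    """Optimistic commit: do the work, then swap; redo the work on conflict."""
    for _ in range(max_retries):
        base = catalog.current()
        new = make_new_metadata(base)  # write a new metadata file based on `base`
        if catalog.compare_and_swap(base, new):
            return new
    raise RuntimeError("commit failed after retries")
```

With this check in place, a writer holding a stale base location has its swap rejected and must retry on top of the latest metadata instead of silently overwriting the other writer's commit. Without an atomic swap, both writers would "succeed" and one update would be lost.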
In the next version of Iceberg -- the next major version after 0.14, which will likely be 1.0.0, there is a procedure to register tables into a catalog, which makes it easy to move a table from one catalog to another in a very efficient metadata only operation, such as CALL catalog.system.register_table('$new_table_name', '$metadata_file_location');
So you're not locked into one catalog if you start with something simple like the JDBC catalog and then move onto something else. If you're just working out a POC, you could start with the Hadoop catalog and then move to something like the JDBC catalog once you're more familiar, but it's important to be aware of the potential pitfalls of the hadoop type catalog which does not have the atomic compare-and-swap locking operation for the metadata file that represents the current table state.
There's also an option to provide a locking mechanism to the hadoop catalog, such as zookeeper or etcd, but that's a somewhat advanced feature and would require that you write your own custom lock implementation.
So I still stand by the JDBC catalog as the easiest to get started with, as most people can get an RDBMS from their cloud provider or spin one up pretty easily. And now that you will be able to efficiently move your tables to a new catalog with the code in the current master branch or in the next major Iceberg release, the choice of catalog is not something to worry about too much.
Looking at comments on this thread, is the version-hint.text file optional?
Yes, the version-hint.text file is used by the hadoop type catalog to attempt to provide an authoritative location where the table's current top-level metadata file is located. So version-hint.text is only found with the hadoop catalog, as other catalogs store this pointer in their own specific mechanism. When using the JDBC catalog, a table in an RDBMS instance stores all of the catalog's "version hints"; the same goes for the Hive catalog, which is backed by Hive Metastore (and very typically an RDBMS). Other catalogs include the DynamoDB catalog.
If you have more questions, the Apache Iceberg slack is very active.
Feel free to check out the docker-spark-iceberg getting started tutorial (which I helped create), which includes Jupyter notebooks and a docker-compose setup.
It uses the JDBC catalog backed by Postgres. With that, you can get a feel for what the catalog is doing by ssh'ing into the containers and running psql commands, as well as looking at table data on your local machine. There are also some nice tutorials with sample data!
https://github.com/tabular-io/docker-spark-iceberg

Saving JDBC db data as shared state Spark

I have an MSSQL table as a data source and I would like to save some kind of processing offset in the form of a timestamp (it is one of the table's columns), so that it is possible to process the data starting from the latest offset. I would like to save it as some kind of shared state between Spark sessions. I have researched shared state in the Spark session, but I did not find a way to store this offset there. So is it possible to use existing Spark constructs to perform this task?
As far as I know, there is no official built-in feature for passing data between sessions in Spark. As an alternative I would consider the following options/suggestions:
First, the offset column must be an indexed field in MSSQL so that it can be queried quickly.
If there is already an in-memory system (e.g. Redis, Apache Ignite) installed and used by your project, I would store the offset there.
I wouldn't use a message queue system such as Kafka, because once you consume a message you would need to resend it, so that wouldn't make sense.
As a solution I would prefer to save it in the filesystem or in Hive, even if Hive adds extra overhead since you will have only one value in that table. In the case of the filesystem, of course, the performance would be much better.
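A minimal sketch of the filesystem option in Python, assuming a local (or locally mounted) filesystem where a rename within one directory is atomic. The function names and the single-file layout are illustrative assumptions; on HDFS or an object store the corresponding filesystem API would be needed instead.

```python
import os
import tempfile
from datetime import datetime


def save_offset(path, offset):
    """Persist the offset timestamp, replacing the previous value atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename over the target,
    # so a concurrent reader never sees a half-written value.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(offset.isoformat())
    os.replace(tmp_path, path)


def load_offset(path, default):
    """Return the stored offset, or `default` if nothing was saved yet."""
    try:
        with open(path) as f:
            return datetime.fromisoformat(f.read().strip())
    except FileNotFoundError:
        return default
```

Each Spark job would call load_offset before querying MSSQL and save_offset only after the batch has been processed successfully, so that a failed run is retried from the old offset.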
Let me know if further information is needed

Using Kafka for Data Integration with Updates & Deletes

So a little background - we have a large number of data sources ranging from RDBMSs to S3 files. We would like to synchronize and integrate this data with various other data warehouses, databases, etc.
At first, this seemed like the canonical model for Kafka. We would like to stream the data changes through Kafka to the data output sources. In our test case we are capturing the changes with Oracle GoldenGate and successfully pushing the changes to a Kafka queue. However, pushing these changes through to the data output source has proven challenging.
I realize that this would work very well if we were just adding new data to the Kafka topics and queues. We could cache the changes and write the changes to the various data output sources. However this is not the case. We will be updating, deleting, modifying partitions, etc. The logic for handling this seems to be much more complicated.
We tried using staging tables and joins to update/delete the data but I feel that would become quite unwieldy quickly.
This brings me to my question - are there any different approaches we could take to handle these operations? Or should we move in a totally different direction?
Any suggestions/help is much appreciated. Thank you!
There are 3 approaches you can take:
Full dump load
Incremental dump load
Binlog replication
Full dump load
Periodically, dump your RDBMS data source table into a file, and load that into the datawarehouse, replacing the previous version. This approach is mostly useful for small tables, but is very simple to implement, and supports updates and deletes to the data easily.
Incremental dump load
Periodically, get the records that changed since your last query, and send them to be loaded to the data warehouse. Something along the lines of
SELECT *
FROM my_table
WHERE last_update > #{last_import}
This approach is slightly more complex to implement, because you have to maintain the state ("last_import" in the snippet above), and it does not support deletes. It can be extended to support deletes, but that makes it more complicated. Another disadvantage of this approach is that it requires your tables to have a last_update column.
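The state handling for the incremental approach can be sketched as follows in Python. This is only an illustration using the standard-library sqlite3 module as a stand-in for the source RDBMS; the columns id and value are assumptions, while my_table, last_update and last_import follow the snippet above.

```python
import sqlite3


def fetch_changes(conn, last_import):
    """Return rows changed since `last_import` and the next value of that state."""
    rows = conn.execute(
        "SELECT id, value, last_update FROM my_table "
        "WHERE last_update > ? ORDER BY last_update",
        (last_import,),
    ).fetchall()
    # Advance the state to the newest last_update seen; keep it if nothing changed.
    new_last_import = rows[-1][2] if rows else last_import
    return rows, new_last_import
```

The caller has to persist new_last_import between runs. Rows deleted from my_table never show up in the result, which is the limitation described above.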
Binlog replication
Write a program that continuously listens to the binlog of your RDBMS and sends these updates to be loaded to an intermediate table in the data warehouse, containing the updated values of the row, and whether it is a delete operation or update/create. Then write a query that periodically consolidates these updates to create a table that mirrors the original table. The idea behind this consolidation process is to select, for each id, the last (most advanced) version as seen in all the updates, or in the previous version of the consolidated table.
This approach is slightly more complex to implement, but allows achieving high performance even on large tables and supports updates and deletes.
Kafka is relevant to this approach in that it can be used as a pipeline for the row updates between the binlog listener and the loading to the data warehouse intermediate table.
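The consolidation step can be sketched in Python like this. It is a simplified in-memory illustration, not how a data warehouse would run it (there it would be a periodic query); the field names id, version, op and row are assumptions, with version standing in for whatever ordering the binlog provides.

```python
def consolidate(previous, updates):
    """Merge change events into the previous consolidated table.

    previous: dict mapping id -> (version, row)
    updates:  iterable of dicts with keys "id", "version", "op" and, for upserts, "row"
    """
    # Start from the previous consolidated state, treating every row as an upsert.
    latest = {key: (version, "upsert", row) for key, (version, row) in previous.items()}
    for update in updates:
        seen = latest.get(update["id"])
        # Keep only the most advanced version seen for each id.
        if seen is None or update["version"] > seen[0]:
            latest[update["id"]] = (update["version"], update["op"], update.get("row"))
    # Rows whose latest event is a delete are dropped from the mirror.
    return {
        key: (version, row)
        for key, (version, op, row) in latest.items()
        if op != "delete"
    }
```

Because only the highest version per id wins, the result does not depend on the order in which updates arrive, which suits events delivered through a pipeline such as Kafka.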
You can read more about these different replication approaches in this blog post.
Disclosure: I work at Alooma (a co-worker wrote the blog post linked above, and we provide data-pipelines as a service, solving problems like this).

Use WAL files for PostgreSQL record version control?

I want to be able to track changes to records in a PostgreSQL database. I've considered using a version field and on-update rules or triggers such that previous versions of records are kept in the table (or in a separate table). This would have the advantage of making it possible to view the version history of a record with a simple select statement. However, this functionality is something I think likely to be seldom used.
How could I satisfy the requirement of being able to construct a "version history" for a record using the WAL files? Reading the WAL and Point-in-Time recovery documentation at PostgreSQL.org has helped me understand how the state of the entire database can be rolled back to an arbitrary point in time, but not how to deal with update mistakes in particular records.
No, you cannot do this at this time. There is a large effort underway on the postgresql-hackers mailing list (the dev list) to rework WAL and build an interface to allow for logical replication in (possibly) PostgreSQL 9.3.
This is basically what you appear to be trying to do and, based on the discussions on that list, it is definitely not a trivial task.