How to properly purge HBase META table from rows without region info? - nosql

I have HBase META table which contains some records about already deleted table regions. That rows do not contain region info so nothing really harmful but anyway it is not so good.
Of course I can just manually delete appropriate rows but maybe there is some ready to use tool or 'best practice' approach? I have tried hbase hbck and hbase hbck -fixMeta ... all of them see this situation as normal and don't make any correction but when I'm checking region locations using API HBase outputs lot of warnings about records in META which do not have embedded region info which is actually true.
HBase 0.94.6 is used (Cloudera CDH 4.4).
Any automatic solution for this situation?

Ended with offline repair tool:
hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair
It fixed all my problems for now but I'd like to see other options.

Related

Is it possible to query the diff between two Apache Iceberg snapshots?

I have two snapshots in my Iceberg history table, and I want to be able to see the difference between them, or at least with columns/ rows that have been affected on the last snapshot. Is there an easy way of getting this information?
You can use the java api to get the incremental change log between two snapshot id in a table.
table
.newIncrementalChangelogScan()
.fromSnapshotExclusive(startSnapshotId)
.toSnapshot(toSnapshot)
.caseSensitive(caseSensitive)
.filter(filterExpression())
.project(expectedSchema)
.planTasks();
It will get the full change log.
If you just want to query incremental data, an easier way is to use spark or flink:
spark.read()
.format("iceberg")
.option("start-snapshot-id", "10963874102873")
.option("end-snapshot-id", "63874143573109")
.load("path/to/table")
Currently gets only the data from append operation. Cannot support replace, overwrite, delete operations.
Enjoy yourself.

How to actually delete files in Iceberg

I know that in Apache Iceberg I can set limits on number and age of snapshots, and that "deleting" data from the table does not result in underlying data removal, it simply masks or deletes tracking information.
I would like to actually delete the underlying files on delete, however. I know this will make time-travel inconsistent, but it is still a business requirement.
https://iceberg.apache.org/docs/latest/configuration/
As best as I can tell, I'll have to track and manage the physical life-cycle every file independently. Am I missing something?
If you don't care about table history (or time travel) you can simply call the expire_snapshots procedure after each delete.
What you get is a common question for many iceberg users.
We often need an asynchronous task to delete and expire snapshots\data.
If you use spark, you can use https://iceberg.apache.org/docs/latest/spark-procedures/#expire_snapshots, as shay saied.
you can also do this using the java api provided by iceberg https://iceberg.apache.org/docs/latest/api/.
Starting a task for each table is difficult to manage. Tables often have different TTL. In this case, You can add custom configurations to a table. Manually scan all iceberg tables, then determines whether to delete expired snapshots and data based on these configurations.
If you are using Iceberg with Hive (4.0.0-alpha2 + version), you can try expire_snapshot command on beeline.
Like
ALTER TABLE test_table EXECUTE expire_snapshots('2021-12-09 05:39:18.689000000');
Can read:
https://docs.cloudera.com/cdw-runtime/cloud/iceberg-how-to/topics/iceberg-expiring-snapshots.html
Hive Jira adding support:
https://issues.apache.org/jira/browse/HIVE-26354

write apache iceberg table to azure ADLS / S3 without using external catalog

I'm trying to create an iceberg table format on cloud object storage.
In the below image we can see that iceberg table format needs a catalog. This catalog stores current metadata pointer, which points to the latest metadata. The Iceberg quick start doc lists JDBC, Hive MetaStore, AWS Glue, Nessie and HDFS as list of catalogs that can be used.
My goal is to store the current metadata pointer(version-hint.text) along with rest of the table data(metadata, manifest lists, manifest, parquet data files) in the object store itself.
With HDFS as the catalog, there’s a file called version-hint.text in
the table’s metadata folder whose contents is the version number of
the current metadata file.
Looking at HDFS as one of the possible catalogs, I should be able to use ADLS or S3 to store the current metadata pointer along with rest of the data. For example: spark connecting to ADLS using ABFSS interface and creating iceberg table along with catalog.
My question is
Is it safe to use version hint file as current metadata pointer in ADLS/S3? Will I lose any of the iceberg features if I do this? Looking at this comment from one of the contributors suggests that its not ideal for production.
The version hint file is used for Hadoop tables, which are named that
way because they are intended for HDFS. We also use them for local FS
tests, but they can't be safely used concurrently with S3. For S3,
you'll need a metastore to enforce atomicity when swapping table
metadata locations. You can use the one in iceberg-hive to use the
Hive metastore.
Looking at comments on this thread, Is version-hint.text file optional?
we iterate through on the possible metadata locations and stop only if
there is not new snapshot is available
Could someone please clarify?
I'm trying to do a POC with Iceberg. At this point the requirement is to be able to write new data from data bricks to the table at least every 10 mins. This frequency might increase in the future.
The data once written will be read by databricks and dremio.
I would definitely try to use a catalog other than the HadoopCatalog / hdfs type for production workloads.
As somebody who works on Iceberg regularly (I work at Tabular), I can say that we do think of the hadoop catalog as being more for testing.
The major reason for that, as mentioned in your threads, is that the catalog provides an atomic locking compare-and-swap operation for the current top level metadata.json file. This compare and swap operation allows for the query that's updating the table to grab a lock for the table after doing its work (optimistic locking), write out the new metadata file, update the state in the catalog to point to the new metadata file, and then release that lock.
The lock isn't something that really works out of the box with HDFS / hadoop type catalog. And then it becomes possible for two concurrent actions to write out a metadata file, and then one sets it and the other's work gets erased or undefined behavior occurs as ACID compliance is lost.
If you have an RDS instance or some sort of JDBC database, I would suggest that you consider using that temporarily. There's also the DynamoDB catalog, and if you're using Dremio then nessie can be used as your catalog as well
In the next version of Iceberg -- the next major version after 0.14, which will likely be 1.0.0, there is a procedure to register tables into a catalog, which makes it easy to move a table from one catalog to another in a very efficient metadata only operation, such as CALL catalog.system.register_table('$new_table_name', '$metadata_file_location');
So you're not locked into one catalog if you start with something simple like the JDBC catalog and then move onto something else. If you're just working out a POC, you could start with the Hadoop catalog and then move to something like the JDBC catalog once you're more familiar, but it's important to be aware of the potential pitfalls of the hadoop type catalog which does not have the atomic compare-and-swap locking operation for the metadata file that represents the current table state.
There's also an option to provide a locking mechanism to the hadoop catalog, such as zookeeper or etcd, but that's a somewhat advanced feature and would require that you write your own custom lock implementation.
So I still stand by the JDBC catalog as the easiest to get started with as most people can get an RDBMS from their cloud provider or spin one up pretty easily -- especially now that you will be able to efficiently move your tables to a new catalog with the code in the current master branch or in the next major Iceberg release, it's not something to worry about too much.
Looking at comments on this thread, Is version-hint.text file optional?
Yes, the version-hint.txt file is used by the hadoop type catalog to attempt to provide an authoritative location where the table's current top-level metadata file is located. So version-hint.txt is only found with hadoop catalog, as other catalogs store it in their own specific mechanism. A table in an RDBMS instance is used to store all of the catalogs "version hints" when using the JDBC catalog or even the Hive catalog, which is backed by Hive Metastore (and very typically an RDBMS). Other catalogs include the DynamoDB catalog.
If you have more questions, the Apache Iceberg slack is very active.
Feel free to check out the docker-spark-iceberg getting started tutorial (which I helped create), which includes Jupyter notebooks and a docker-compose setup.
It uses the JDBC catalog backed by Postgres. With that, you can get a feel for what the catalog is doing by ssh'ing into the containers and running psql commands, as well as looking at table data on your local machine. There's also some nice tutorials with sample data!
https://github.com/tabular-io/docker-spark-iceberg

How to process and insert millions of MongoDB records into Postgres using Talend Open Studio

I need to process millions of records coming from MongoDb and put a ETL pipeline to insert that data into a PostgreSQL database. However, in all the methods I've tried, I keep getting the out memory heap space exception. Here's what I've already tried -
Tried connecting to MongoDB using tMongoDBInput and put a tMap to process the records and output them using a connection to PostgreSQL. tMap could not handle it.
Tried to load the data into a JSON file and then read from the file to PostgreSQL. Data got loaded into JSON file but from there on got the same memory exception.
Tried increasing the RAM for the job in the settings and tried the above two methods again, still no change.
I specifically wanted to know if there's any way to stream this data or process it in batches to counter the memory issue.
Also, I know that there are some components dealing with BulkDataLoad. Could anyone please confirm whether it would be helpful here since I want to process the records before inserting and if yes, point me to the right kind of documentation to get that set up.
Thanks in advance!
As you already tried all the possibilities the only way that I can see to do this requirement is breaking done the job into multiple sub-jobs or going with incremental load based on key columns or date columns, Considering this as a one-time activity for now.
Please let me know if it helps.

BigQuery - how to determine Data Availability

In regards to: https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability
What's the best way to programatically determine if a table's data is available after streaming?
I am getting unexpected results trying to fetch Rows and TotalRows with the following apis: Jobs.Query, Jobs.GetQueryResults, Tables.Get, Tabledata.List
Thanks.
You can tell whether data is flushed on the table by executing the Tables.Get() API and looking at the streamingBuffer.oldestEntryTime value. This can be considered a high-water mark of data that has been flushed out of the buffer.
Any data before this timestamp should be available for copy, export, and list operations.
Also, I should clarify that data in the table is available for query immediately after streaming. It only is unavailable to table copy, export, and tabledata.list() operations. Yes, this is confusing, but yes, we're also working on addressing the problem.
For tables that haven't been streamed to before or recently, there is a warmup period where new streaming data won't show up.
See https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability for more information.