I need to copy data from one HBase cluster to another. I did some research and found CopyTable, which basically scans and puts the data from one cluster to the other.
I also know that I can copy over the whole HDFS volume for HBase. I am wondering if this works, and whether it performs better than CopyTable? (I believe it should, since it copies files without any per-row logic.)
Take a look at HBase replication; you can find a short how-to here.
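For reference, the CopyTable route from the question can also be driven programmatically. This is only a rough sketch: it assumes the CopyTable class in your HBase version implements Hadoop's Tool interface (it does in recent releases), and the ZooKeeper quorum and table names are placeholders.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.CopyTable;
import org.apache.hadoop.util.ToolRunner;

public class CopyTableDriver {
    public static void main(String[] args) throws Exception {
        // Equivalent to: hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
        //   --peer.adr=... --new.name=... my_table
        int exitCode = ToolRunner.run(HBaseConfiguration.create(), new CopyTable(),
                new String[] {
                        "--peer.adr=dest-zk1,dest-zk2,dest-zk3:2181:/hbase", // destination cluster
                        "--new.name=my_table",                               // table name on the destination
                        "my_table"                                           // source table on this cluster
                });
        System.exit(exitCode);
    }
}
```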
I'm trying to create an Iceberg table on cloud object storage.
The Iceberg table format needs a catalog. This catalog stores the current metadata pointer, which points to the latest metadata file. The Iceberg quick start doc lists JDBC, Hive Metastore, AWS Glue, Nessie and HDFS as catalogs that can be used.
My goal is to store the current metadata pointer (version-hint.text) along with the rest of the table data (metadata files, manifest lists, manifests, Parquet data files) in the object store itself.
With HDFS as the catalog, there’s a file called version-hint.text in
the table’s metadata folder whose contents is the version number of
the current metadata file.
Looking at HDFS as one of the possible catalogs, I should be able to use ADLS or S3 to store the current metadata pointer along with the rest of the data. For example: Spark connecting to ADLS over the ABFSS interface and creating the Iceberg table along with the catalog.
My question is:
Is it safe to use the version hint file as the current metadata pointer in ADLS/S3? Will I lose any Iceberg features if I do this? This comment from one of the contributors suggests that it's not ideal for production:
The version hint file is used for Hadoop tables, which are named that
way because they are intended for HDFS. We also use them for local FS
tests, but they can't be safely used concurrently with S3. For S3,
you'll need a metastore to enforce atomicity when swapping table
metadata locations. You can use the one in iceberg-hive to use the
Hive metastore.
Looking at the comments on this thread, is the version-hint.text file optional?
we iterate through the possible metadata locations and stop only if no
new snapshot is available
Could someone please clarify?
I'm trying to do a POC with Iceberg. At this point the requirement is to be able to write new data from Databricks to the table at least every 10 minutes. This frequency might increase in the future.
The data, once written, will be read by Databricks and Dremio.
I would definitely try to use a catalog other than the HadoopCatalog / hdfs type for production workloads.
As somebody who works on Iceberg regularly (I work at Tabular), I can say that we do think of the hadoop catalog as being more for testing.
The major reason for that, as mentioned in your threads, is that the catalog provides an atomic locking compare-and-swap operation for the current top-level metadata.json file. This compare-and-swap operation allows the query that's updating the table to grab a lock for the table after doing its work (optimistic locking), write out the new metadata file, update the state in the catalog to point to the new metadata file, and then release that lock.
That lock isn't something that really works out of the box with the HDFS / hadoop type catalog. It then becomes possible for two concurrent writers to each write out a metadata file; one sets it, and the other's work gets erased or undefined behavior occurs, as ACID compliance is lost.
If you have an RDS instance or some sort of JDBC database, I would suggest that you consider using that, at least temporarily. There's also the DynamoDB catalog, and if you're using Dremio then Nessie can be used as your catalog as well.
In the next version of Iceberg -- the next major version after 0.14, which will likely be 1.0.0 -- there is a procedure to register tables into a catalog, which makes it easy to move a table from one catalog to another in a very efficient metadata-only operation, such as CALL catalog.system.register_table('$new_table_name', '$metadata_file_location');
So you're not locked into one catalog if you start with something simple like the JDBC catalog and then move to something else later. If you're just working out a POC, you could start with the Hadoop catalog and then move to something like the JDBC catalog once you're more familiar, but it's important to be aware of the potential pitfalls of the hadoop type catalog, which does not have the atomic compare-and-swap locking operation for the metadata file that represents the current table state.
There's also an option to provide a locking mechanism to the hadoop catalog, such as zookeeper or etcd, but that's a somewhat advanced feature and would require that you write your own custom lock implementation.
So I still stand by the JDBC catalog as the easiest to get started with, as most people can get an RDBMS from their cloud provider or spin one up pretty easily. Especially now that you will be able to efficiently move your tables to a new catalog with the code in the current master branch or in the next major Iceberg release, it's not something to worry about too much.
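For example, wiring Spark up to a JDBC catalog is just a handful of catalog properties. This is a hedged sketch: the catalog name, Postgres URI, credentials, and warehouse path are placeholders, and it assumes the iceberg-spark-runtime jar and a PostgreSQL JDBC driver are on the classpath.

```java
import org.apache.spark.sql.SparkSession;

public class IcebergJdbcCatalogExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("iceberg-jdbc-catalog")
                // Register an Iceberg catalog named "my_catalog" backed by the JDBC catalog implementation.
                .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
                .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog")
                .config("spark.sql.catalog.my_catalog.uri", "jdbc:postgresql://my-db-host:5432/iceberg")
                .config("spark.sql.catalog.my_catalog.jdbc.user", "iceberg")
                .config("spark.sql.catalog.my_catalog.jdbc.password", "secret")
                // Data and metadata files still live in object storage...
                .config("spark.sql.catalog.my_catalog.warehouse", "s3a://my-bucket/warehouse")
                .getOrCreate();

        // ...but the pointer to the current metadata.json is swapped atomically in the database.
        spark.sql("CREATE TABLE IF NOT EXISTS my_catalog.db.events (id BIGINT, ts TIMESTAMP) USING iceberg");
    }
}
```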
Looking at the comments on this thread, is the version-hint.text file optional?
Yes. The version-hint.text file is used by the hadoop type catalog to attempt to provide an authoritative location for the table's current top-level metadata file. So version-hint.text is only found with the hadoop catalog; other catalogs store that pointer in their own specific mechanism. A table in an RDBMS instance is used to store all of the catalog's "version hints" when using the JDBC catalog, or even the Hive catalog, which is backed by the Hive Metastore (and very typically an RDBMS). Other catalogs include the DynamoDB catalog.
If you have more questions, the Apache Iceberg slack is very active.
Feel free to check out the docker-spark-iceberg getting started tutorial (which I helped create), which includes Jupyter notebooks and a docker-compose setup.
It uses the JDBC catalog backed by Postgres. With that, you can get a feel for what the catalog is doing by ssh'ing into the containers and running psql commands, as well as looking at the table data on your local machine. There are also some nice tutorials with sample data!
https://github.com/tabular-io/docker-spark-iceberg
What is the best way to import data from a few CSV files in Spring Batch? I mean that each CSV file corresponds to one table in the database.
I created one batch configuration class for each table, and every table has its own job and step.
Is there a more elegant way to do this?
There's a variety of ways you could tackle the problem, but the simplest job would look something like:
FlatFileItemReader with a DelimitedLineTokenizer and BeanWrapperFieldSetMapper to read the file
Processor if you need to do any additional validation/filtering/transformation
JdbcBatchItemWriter to insert/update the target table
Here's an example that includes more information around specific dependencies, config, etc. The example uses context file config rather than annotation-based, but it should be sufficient to show you the way.
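For an annotation-based take on the same idea, a rough sketch of the reader/writer pair for one CSV-to-table pairing might look like this (the Person class, file path, column names, and table are hypothetical placeholders):

```java
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
public class PersonCsvConfig {

    @Bean
    public FlatFileItemReader<Person> personReader() {
        BeanWrapperFieldSetMapper<Person> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
        fieldSetMapper.setTargetType(Person.class);
        return new FlatFileItemReaderBuilder<Person>()
                .name("personReader")
                .resource(new FileSystemResource("input/people.csv"))
                .delimited()                              // uses a DelimitedLineTokenizer
                .names("firstName", "lastName")
                .fieldSetMapper(fieldSetMapper)           // maps columns onto Person properties
                .build();
    }

    @Bean
    public JdbcBatchItemWriter<Person> personWriter(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<Person>()
                .dataSource(dataSource)
                .sql("INSERT INTO people (first_name, last_name) VALUES (:firstName, :lastName)")
                .beanMapped()                             // bind :named parameters from Person getters
                .build();
    }
}
```

Each file/table pairing gets its own reader and writer like this, wired into its own step; steps can then be chained in one job, or generated by the partitioned approach described next.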
A more complex solution might be a single job with a partitioned step that scans the input folder for files and, leveraging reference table/schema information, creates a reader/writer step for each file that it finds.
You also may want to consider what to do with the files once you're done... Delete them? Compress them?
My job is to improve the speed of reading a lot of small files (about 1 KB each) from disk and writing them into our database.
The database is open source to me, and I can change all of the code, from the client to the server.
The database architecture is a simple master-slave, distributed, HDFS-based database similar to HBase. Small files from disk can be inserted into the database and combined into bigger blocks automatically before being written into HDFS (large files can also be split into smaller blocks by the database and then written into HDFS).
One way to change the client is to increase the number of threads.
I don't have any other ideas. Could you also suggest some ways to approach the performance analysis?
One way to process such small files could be to convert them into a SequenceFile and store it in HDFS. Then use that file as the input of a MapReduce job that puts the data into HBase or a similar database.
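As a rough illustration (paths and the key/value choices are placeholders), packing a local directory of small files into a SequenceFile with the plain Hadoop API could look like this:

```java
import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path("hdfs:///ingest/small-files.seq");

        // One SequenceFile holds many small files: the file name is the key and the raw
        // bytes are the value, so a later MapReduce job can fan them out into the database.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (File f : new File("/local/small-files").listFiles()) {
                byte[] bytes = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(bytes));
            }
        }
    }
}
```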
This uses AWS as an example, but it could be any storage/queue setup:
If the files were able to exist on shared storage such as S3, you could add one queue entry for each file and then just start throwing servers at the queue to add the files to the DB. At that point the bottleneck becomes the DB instead of the client.
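Sketched with the AWS SDK for Java v2 (the bucket, prefix, and queue URL are placeholders), the enqueue side of that could be as simple as:

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

public class EnqueueSmallFiles {
    public static void main(String[] args) {
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/file-ingest";
        try (S3Client s3 = S3Client.create(); SqsClient sqs = SqsClient.create()) {
            // One SQS message per object; workers consume the queue and load each file into the DB,
            // so you can scale the worker fleet until the database itself becomes the bottleneck.
            s3.listObjectsV2Paginator(ListObjectsV2Request.builder()
                            .bucket("my-ingest-bucket")
                            .prefix("small-files/")
                            .build())
              .contents()
              .forEach(obj -> sqs.sendMessage(SendMessageRequest.builder()
                      .queueUrl(queueUrl)
                      .messageBody(obj.key())
                      .build()));
        }
    }
}
```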
Is there any way to save an image to Mongo's GridFS and then asynchronously upload it to S3 in the background?
Maybe it is possible to chain uploaders?
The problem is the following: multiple servers are used, so the server where the image was saved to disk and the server running the background process can be different.
Also:
1. it should be removed from GridFS once uploaded to S3
2. it should be automatically removed from S3 when the corresponding entity is destroyed.
Thanks.
What does your deployment architecture look like? I'm a little confused when you say "multiple servers": do you mean multiple mongod instances? It's also a bit confusing how you specify your requirements. According to requirement 1, if you upload to S3, then the GridFS file should be removed. However, according to your requirements, a file cannot exist in both S3 and GridFS, so requirement 2 seems to contradict the first, i.e., it shouldn't exist in GridFS in the first place. Are you preserving some files on both GridFS and S3?
If you are running in a replica set or sharded cluster, you could create a tailable cursor on your gridfs collection (you can also do this on a single node, although it's not recommended). When you see an insert operation (will look like 'op':'i') you could execute a script or do something in your application to grab the file from gridfs and push the appropriate file to s3. Similarly, when you see a delete operation ('op':'d') you could summarily delete the file from s3.
The beauty of a tailable cursor is that it allows for asynchronous operations- you can have another process monitoring the oplog on a different server and performing the appropriate actions.
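A hedged sketch of that tailer with the MongoDB Java driver (the database name and namespace filter are placeholders; the oplog only exists on replica-set members):

```java
import com.mongodb.CursorType;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class GridFsToS3Tailer {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> oplog =
                    client.getDatabase("local").getCollection("oplog.rs");

            // Tail the oplog for operations on the GridFS files collection of "mydb".
            try (MongoCursor<Document> cursor = oplog
                    .find(Filters.regex("ns", "^mydb\\.fs\\.files"))
                    .cursorType(CursorType.TailableAwait)
                    .iterator()) {
                while (cursor.hasNext()) {
                    Document op = cursor.next();
                    if ("i".equals(op.getString("op"))) {
                        // New GridFS file: stream it out of GridFS, upload it to S3,
                        // then remove it from GridFS once the upload succeeds.
                    } else if ("d".equals(op.getString("op"))) {
                        // File deleted: remove the corresponding object from S3.
                    }
                }
            }
        }
    }
}
```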
I used a temp variable to store to GridFS and made a Worker (see this) to perform the async upload from GridFS to S3.
Hope this helps somebody, thanks.
I have a large Mongo database (100 GB) hosted in the cloud (MongoLab or MongoHQ). I would like to run some Map/Reduce tasks on the data to compute some expensive statistics, and was wondering what the best workflow is for getting this done. Ideally I would like to use Amazon's Map/Reduce service to do this instead of maintaining my own Hadoop cluster.
Does it make sense to copy the data from the database to S3 and then run Amazon Map/Reduce on it? Or are there better ways to get this done?
Also, further down the line I might want to run the queries frequently, like every day, so the data on S3 would need to mirror what is in Mongo. Would this complicate things?
Any suggestions/war stories would be super helpful.
Amazon EMR provides a utility called S3DistCp to get data in and out of S3. This is commonly used when running Amazon's EMR product and you don't want to host your own cluster or use up instances to store data. S3 can store all your data for you, and EMR can read/write data from/to S3.
However, transferring 100GB will take time and if you plan on doing this more than once (i.e. more than a one-off batch job), it will be a significant bottleneck in your processing (especially if the data is expected to grow).
It looks like you may not need to use S3. Mongo has implemented an adapter for running MapReduce jobs on top of your MongoDB: http://blog.mongodb.org/post/24610529795/hadoop-streaming-support-for-mongodb
This looks appealing since it lets you implement the MR in python/js/ruby.
I think this mongo-hadoop setup would be more efficient than copying 100GB of data out to S3.
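For reference, a rough sketch of what a mongo-hadoop job driver looks like in Java (the connector also supports Hadoop Streaming, which is what enables Python/JS/Ruby); the connection URIs, database/collection names, and the StatsMapper/StatsReducer classes are hypothetical placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class MongoStatsJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read directly from the hosted MongoDB and write the results back to another
        // collection, skipping the copy to S3 entirely.
        conf.set("mongo.input.uri", "mongodb://user:pass@my-hosted-mongo:27017/mydb.events");
        conf.set("mongo.output.uri", "mongodb://user:pass@my-hosted-mongo:27017/mydb.daily_stats");

        Job job = Job.getInstance(conf, "expensive-stats");
        job.setJarByClass(MongoStatsJob.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        job.setMapperClass(StatsMapper.class);    // hypothetical mapper
        job.setReducerClass(StatsReducer.class);  // hypothetical reducer
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```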
UPDATE: An example of using map-reduce with mongo here.