How to do a bulk update of metadata on GCS files (5k files per minute)

We have around 5k GCS files whose metadata needs to be updated.
With the following piece of code I can do one file at a time. Is there a way to bulk update all 5k files?
Blob b = storage.get(BlobId.fromGsUtilUri(file));
// add metadata to the output file
Map<String, String> md = new HashMap<>();
md.put("mtvalue", System.currentTimeMillis() + "");
b.toBuilder().setMetadata(md).build().update();
I went through the Google documentation, but couldn't find any such bulk update method.
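As far as I can tell, the client library has no single bulk-metadata call, but the per-object updates are independent and can simply be fanned out over a thread pool. Below is a minimal sketch in Scala against the same Java client; it assumes an existing storage client and a files list of gs:// URIs, and the bulkUpdateMetadata name and the parallelism of 32 are made up for the sketch.
import java.util.Collections
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}
import com.google.cloud.storage.{BlobId, Storage}

// Hypothetical helper: updates the metadata of every gs:// URI in `files`,
// running `parallelism` update requests concurrently.
def bulkUpdateMetadata(storage: Storage, files: Seq[String], parallelism: Int = 32): Unit = {
  val pool = Executors.newFixedThreadPool(parallelism)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

  val updates = files.map { file =>
    Future {
      val blob = storage.get(BlobId.fromGsUtilUri(file))
      val md   = Collections.singletonMap("mtvalue", System.currentTimeMillis().toString)
      blob.toBuilder().setMetadata(md).build().update()
    }
  }

  Await.result(Future.sequence(updates), Duration.Inf)
  pool.shutdown()
}
The client also offers storage.batch() to pack several calls into one HTTP request, but each object still needs its own update operation either way.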

How to process large amounts of data in Scala fs2?

We have a Scala utility which reads data from a database and then writes the data to a text file in CSV format, using the fs2 library. It then does some processing on a few columns and creates the final file. So it is a two-step process:
Read the data from the DB and create a data_tmp CSV file.
Process a few columns from the _tmp file and create the final data_final CSV file.
We use code similar to the one at this link:
https://levelup.gitconnected.com/how-to-write-data-processing-application-in-fs2-2b6f84e3939c
Stream.resource(Blocker[IO]).flatMap { blocker =>
  val inResource = getClass.getResource(in)   // data_tmp file location
  val outResource = getClass.getResource(out) // data_final file location
  io.file
    .readAll[IO](Paths.get(inResource.toURI), blocker, 4096)
    .through(text.utf8Decode)
    .through(text.lines)
    ..... // our processing logic here
    .through(text.utf8Encode)
    .through(io.file.writeAll(Paths.get(outResource.toURI), blocker))
}
Until now this has worked, as we did not have more than 5k records. Now we have a new requirement where we expect the data returned by the DB query to be in the range of 50k to 1000k records.
So we want to create multiple data_final files like data_final_1, data_final_2, and so on.
Each output file should not be larger than a specific size, let's say 2 MB, i.e. data_final should be created in chunks of 2 MB.
How can I modify the above code snippet so that we create multiple output files from the single large data_tmp CSV file?
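One way to do this, sketched below for the fs2 2.x / Blocker style used above: group the already-processed lines into fixed-size chunks and write each chunk to its own numbered file. The writeInChunks name, outDir and linesPerFile are made up for the sketch, and the chunk size is counted in lines, so it has to be tuned (or replaced by byte-counting logic) to approximate the 2 MB limit.
import java.nio.file.Paths
import cats.effect.{Blocker, ContextShift, IO}
import fs2.{Pipe, Stream, io, text}

// Hypothetical pipe: writes the processed CSV lines to numbered files
// data_final_1.csv, data_final_2.csv, ... with `linesPerFile` lines each.
def writeInChunks(blocker: Blocker, outDir: String, linesPerFile: Int)(
    implicit cs: ContextShift[IO]
): Pipe[IO, String, Unit] =
  lines =>
    lines
      .chunkN(linesPerFile) // tune so linesPerFile * average line size ~ 2 MB
      .zipWithIndex
      .flatMap { case (chunk, idx) =>
        Stream
          .chunk(chunk)
          .covary[IO]
          .intersperse("\n")
          .through(text.utf8Encode)
          .through(io.file.writeAll(Paths.get(s"$outDir/data_final_${idx + 1}.csv"), blocker))
      }
In the pipeline above, this pipe would replace the final text.utf8Encode and io.file.writeAll stages, i.e. the stream of processed lines would end with .through(writeInChunks(blocker, outDir, linesPerFile)).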

How to create a BQ external table on top of a Delta table in GCS and show only the latest snapshot

I am trying to create a BQ external table on top of a Delta table which uses Google Cloud Storage as its storage layer. On the Delta table we perform DML, which includes deletes.
I can create a BQ external table on top of the GCS bucket where all the Delta files exist. However, it also pulls in the deleted records, since a BQ external table cannot read the Delta transaction log, which says which parquet files to consider and which ones to ignore.
Is there a way to expose the latest snapshot of the Delta table (GCS location) in BQ as an external table, other than programmatically copying the data from Delta to BQ?
This question was asked more than a year ago, but I have a tricky yet powerful addition to Oliver's answer which eliminates data duplication and additional load logic.
Step 1: As Oliver suggested, generate the symlink_format_manifest files; you can either trigger this every time you update, or you can add a table property as stated here to automatically create those files whenever the Delta table is updated:
ALTER TABLE delta.`<path-to-delta-table>` SET TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true)
Step 2: Create an external table that points to the Delta table location
> bq mkdef --source_format=PARQUET "gs://test-delta/*.parquet" > bq_external_delta_logs
> bq mk --external_table_definition=bq_external_delta_logs test.external_delta_logs
Step 3: Create another external table pointing to the symlink_format_manifest/manifest file
> bq mkdef --autodetect --source_format=CSV gs://test-delta/_symlink_format_manifest/manifest > bq_external_delta_manifest
> bq mk --table --external_table_definition=bq_external_delta_manifest test.external_delta_manifest
Step 4: Create a view with the following query
> bq mk \
--use_legacy_sql=false \
--view \
'SELECT
*
FROM
`project_id.test.external_delta_logs`
WHERE
_FILE_NAME IN (SELECT * FROM `project_id.test.external_delta_manifest`)' \
test.external_delta_snapshot
Now you can get the latest snapshot, whenever your Delta table is refreshed, from the test.external_delta_snapshot view without any additional loading or data duplication. A downside of this solution is that, in case of schema changes, you have to add the new fields to your table definition either manually or from your Spark pipeline using the BQ client, etc. For those who are curious about how this solution works, please continue reading.
How this works:
The symlink manifest files contain the list of parquet files, in newline-delimited format, pointing to the current Delta version's partitions:
gs://delta-test/......-part1.parquet
gs://delta-test/......-part2.parquet
....
In addition to our Delta location, we define another external table by treating this manifest file as a CSV file (it is actually a single-column CSV file). The view we've defined takes advantage of the _FILE_NAME pseudo column mentioned here, which points to the parquet file location of every row in the table. As stated in the docs, the _FILE_NAME pseudo column is defined for every external table that points to data stored in Cloud Storage and Google Drive.
So at this point, we have the list of parquet files required for loading the latest snapshot and the ability to filter the files we want to read using the _FILE_NAME column. The view we have defined simply encodes this procedure for getting the latest snapshot. Whenever our Delta table gets updated, the manifest and the Delta log table will pick up the newest data, so we will always get the latest snapshot without any additional loading or data duplication.
One last word: it is a known fact that queries on external tables are more expensive (execution cost) than on BQ-managed tables, so it is better to experiment with both dual writes, as Oliver suggested, and the external-table solution you asked about. Storage is cheaper than execution, so there may be cases where keeping the data in both GCS and BQ costs less than keeping an external table like this.
I'm also developing this kind of pipeline, where we dump our Delta Lake files in GCS and present them in BigQuery. Generating a manifest file from your GCS Delta files will give you the latest snapshot, based on the version currently set on your Delta files. Then you need to create a custom script to parse that manifest file to get the list of files, and then run a bq load mentioning those files.
val deltaTable = DeltaTable.forPath(<path-to-delta-table>)
deltaTable.generate("symlink_format_manifest")
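If you would rather not write a custom parser, one option (just a sketch, not the exact script) is to read the manifest with Spark itself and then load only the files it lists; the gs://test-delta paths below reuse the placeholders from earlier in this thread.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DeltaManifestLoad").getOrCreate()

// The manifest is a newline-delimited list of the parquet files that make up
// the latest snapshot.
val liveFiles = spark.read
  .text("gs://test-delta/_symlink_format_manifest/manifest")
  .collect()
  .map(_.getString(0))

// Reading only those files gives the latest snapshot, without deleted records.
val latestSnapshotDF = spark.read.parquet(liveFiles: _*)
From here the file list (or the DataFrame) can be fed into whatever bq load step you use.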
The workaround below might work for small datasets:
Have a separate BQ table.
Read the Delta Lake files into a DataFrame and then overwrite the BigQuery table with it (a sketch follows below).
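A minimal sketch of that dual-write idea, assuming the spark-bigquery connector is on the classpath; the dataset/table name and the staging bucket are placeholders.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("DeltaToBigQuery").getOrCreate()

// The Delta reader resolves the transaction log, so deleted records are excluded.
val snapshotDF = spark.read.format("delta").load("gs://test-delta/")

snapshotDF.write
  .format("bigquery")
  .option("temporaryGcsBucket", "my-staging-bucket") // placeholder staging bucket
  .mode(SaveMode.Overwrite)
  .save("test.delta_snapshot")                       // placeholder dataset.table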

Appending data to an automatically partitioned DataFrame stored externally as parquet files

Trying to write DataFrame data, auto-partitioned on an attribute, to an external store in append mode overwrites the parquet files.
I have a huge amount of data that I cannot load in one go, so I read the data one folder at a time in a loop. In every iteration I partition the data based on a certain attribute and use saveAsTable to write the parquet files to Amazon S3. I am finding that my S3 folder is getting wiped out at every iteration. I want to add the data from every iteration to my Hive store in partitioned folders, so I can categorize the data and read only the category I want to work on.
This is the command I am using to save the dataframe.
DF.write.partitionBy('Type').format('parquet').mode("append").saveAsTable('AllComponents', path='s3a://xxx/<Path>')
for Pos1 in HexKey:
    folderKey = "{}".format(Pos1)
    spark = SparkSession.builder \
        .getOrCreate()
    if DataSetSchema is None:
        log.warn("Reviewing schema")
        AllComponentsDF = spark.read \
            .format('com.databricks.spark.xml') \
            .load('s3a://location' + folderKey + '0/00/*')
        DataSetSchema = AllComponentsDF.schema
    else:
        log.warn("Reading folder {}".format(Pos1))
        AllComponentsDF = spark.read \
            .format('com.databricks.spark.xml') \
            .load('s3a://location/' + folderKey + '0/00/*', schema=DataSetSchema)
    AllComponentsDF.write.partitionBy('Type').format('parquet').mode("append").saveAsTable('AllComponents', path='s3a://spark-cluster-boomi/AllComponents')

Databricks - failing to write from a DataFrame to a Delta location

I wanted to change a column name of a Databricks Delta table.
So I did the following:
// Read old table data
val old_data_DF = spark.read.format("delta")
  .load("dbfs:/mnt/main/sales")

// Created a new DF with a renamed column
val new_data_DF = old_data_DF
  .withColumnRenamed("column_a", "metric1")
  .select("*")

// Dropped and recreated the Delta files location
dbutils.fs.rm("dbfs:/mnt/main/sales", true)
dbutils.fs.mkdirs("dbfs:/mnt/main/sales")

// Trying to write the new DF to the location
new_data_DF.write
  .format("delta")
  .partitionBy("sale_date_partition")
  .save("dbfs:/mnt/main/sales")
Here I'm getting an error at the last step, when writing to Delta:
java.io.FileNotFoundException: dbfs:/mnt/main/sales/sale_date_partition=2019-04-29/part-00000-769.c000.snappy.parquet
A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement
Obviously the data was deleted, and most likely I've missed something in the above logic. Now the only place that contains the data is new_data_DF.
Writing to a location like dbfs:/mnt/main/sales_tmp also fails.
What should I do to write data from new_data_DF to a Delta location?
In general, it is a good idea to avoid using rm on Delta tables. Delta's transaction log can prevent eventual consistency issues in most cases; however, when you delete and recreate a table in a very short time, different versions of the transaction log can flicker in and out of existence.
Instead, I'd recommend using the transactional primitives provided by Delta. For example, to overwrite the data in a table you can:
df.write.format("delta").mode("overwrite").save("/delta/events")
If you have a table that has already been corrupted, you can fix it using FSCK.
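A sketch of running that from a notebook cell, assuming your runtime supports FSCK REPAIR TABLE with a path-style delta identifier:
// Dry run first to list which missing files would be removed from the log
spark.sql("FSCK REPAIR TABLE delta.`dbfs:/mnt/main/sales` DRY RUN").show()

// Then repair for real, removing the log entries for files that no longer exist
spark.sql("FSCK REPAIR TABLE delta.`dbfs:/mnt/main/sales`")
Note that FSCK only removes the dangling file entries from the transaction log; it does not bring deleted data back.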
To write the data from new_data_DF to the Delta location, you could do it in the following way.
// Read old table data
val old_data_DF = spark.read.format("delta")
  .load("dbfs:/mnt/main/sales")

// Created a new DF with a renamed column
val new_data_DF = old_data_DF
  .withColumnRenamed("column_a", "metric1")
  .select("*")

// Trying to write the new DF to the location
new_data_DF.write
  .format("delta")
  .mode("overwrite")                  // this overwrites the existing data files
  .option("overwriteSchema", "true")  // this is the key line
  .partitionBy("sale_date_partition")
  .save("dbfs:/mnt/main/sales")
The overwriteSchema option will create new physical files with the latest schema that we updated during the transformation.

Issue: inserting data into Hive creates small part files

I am processing more than 1,000,000 records from a JSON file. I read the file line by line and extract the required key values (the JSON has a mixed structure that is not fixed, so I parse it and generate the required JSON elements), generate a JSON string similar to the json_string variable, and push it to a Hive table. The data is stored properly, but the Hadoop apps/hive/warehouse/jsondb.myjson_table folder contains small part files: every insert query creates a new (0.1 to 0.20 KB) part file, and because of that a simple query on Hive takes more than 30 minutes. Below is sample code for my logic; it iterates multiple times as new records are inserted into Hive.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("SparkSessionZipsExample").enableHiveSupport().getOrCreate()
var json_string = """{"name":"yogesh_wagh","education":"phd" }"""
val df = spark.read.json(Seq(json_string).toDS)
//df.write.format("orc").saveAsTable("bds_data1.newversion");
df.write.mode("append").format("orc").insertInto("bds_data1.newversion");
I have also tried to add the Hive property to merge the files, but it doesn't work.
I have also tried to create a table from the existing table to combine the small part files into 256 MB files.
Please share sample code to insert multiple records and append records to a part file.
I think each of those individual inserts is creating a new part file.
You could create a Dataset/DataFrame of these JSON strings and then save it to the Hive table.
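A minimal sketch of that idea: accumulate a batch of generated JSON strings (the jsonStrings value and its second record are only illustrative) and write the whole batch with a single insert, coalesced to one file per batch.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("BatchedHiveInsert")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Accumulate many generated JSON strings per batch instead of one at a time
val jsonStrings: Seq[String] = Seq(
  """{"name":"yogesh_wagh","education":"phd"}""",
  """{"name":"another_record","education":"msc"}"""
)

val df = spark.read.json(jsonStrings.toDS)

df.coalesce(1)   // one part file per batch; tune the batch size accordingly
  .write
  .mode("append")
  .format("orc")
  .insertInto("bds_data1.newversion")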
You could also merge the existing small files using the Hive DDL: ALTER TABLE table_name CONCATENATE;