Where does Databricks Delta store its metadata? - pyspark

Hive stores its metadata in an external database like SQL Server. Similarly, where does Databricks Delta store its metadata information?

Databricks Delta stores its metadata on the file system, next to the data. The metadata is just a set of files in the table's _delta_log directory, in either JSON format (one file per transaction) or Parquet format (a checkpoint snapshot of the table metadata at some version).
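You can see this by listing the table's _delta_log directory. A minimal sketch, assuming a Databricks notebook (where dbutils is available) and a hypothetical table path:

# The metadata lives in the _delta_log/ subdirectory next to the table's Parquet data files.
for f in dbutils.fs.ls("/mnt/datalake/my_delta_table/_delta_log/"):
    print(f.name)

# Typical contents:
#   00000000000000000000.json                 <- one JSON commit file per transaction
#   00000000000000000001.json
#   ...
#   00000000000000000010.checkpoint.parquet   <- periodic Parquet checkpoint of the table state
#   _last_checkpoint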

Related

Merging data in Datalake

I'm working on a project where we need to bring data from a SQL Server database into a data lake.
I achieved that through a pipeline that ingests data from the source and loads it into the data lake in Parquet format.
My question is how to merge (upsert) new data from the data source into the existing files in that data lake.
You can use Azure Data Factory data flows, in which you can map the source file against your other sources and overwrite the existing file. Unlike for databases, there is no direct upsert activity in ADF for files.
Reference:
https://learn.microsoft.com/en-us/answers/questions/542994/azure-data-factory-merge-2-csv-files-with-differen.html
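If running Spark over the lake is an option, another approach (not ADF-specific) is to keep the target in Delta Lake format and use its MERGE operation for the upsert. A minimal PySpark sketch, assuming the target has already been converted to a Delta table and that the paths and the key column id are placeholders:

from delta.tables import DeltaTable

# Target Delta table and the newly ingested Parquet extract (hypothetical paths).
target = DeltaTable.forPath(spark, "/mnt/datalake/target_delta_table")
updates = spark.read.parquet("/mnt/datalake/incoming/new_extract.parquet")

(target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")   # match on the business key
    .whenMatchedUpdateAll()                     # update rows that already exist
    .whenNotMatchedInsertAll()                  # insert rows that are new
    .execute())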

Why Temporary GCS bucket is needed to write a dataframe to BigQuery: pyspark

Recently I faced an issue while writing dataframe data into BigQuery using PySpark. Here it is:
pyspark.sql.utils.IllegalArgumentException: u'Temporary or persistent GCS bucket must be informed
After researching the issue, I found that a temporary GCS bucket has to be set in spark.conf:
bucket = "temp_bucket"
spark.conf.set('temporaryGcsBucket', bucket)
I think BigQuery has no concept of a file backing a table, the way Hive does.
I would like to know more about this: why do we need a temporary GCS bucket to write data into BigQuery?
I searched for the reason behind this but couldn't find it.
Please clarify.
The Spark BigQuery connector has two write modes (writeMethod) for writing data into BigQuery: Direct and Indirect. This is an optional parameter; the default is Indirect.
Indirect
You can specify the indirect option like this: option("writeMethod", "indirect"). It is optional, and indirect is the default. This mode requires you to specify a temporary GCS bucket; if you don't, you will get the error above.
The need for the temporary bucket is as follows:
The connector writes the data to BigQuery by first buffering all the data into a Cloud Storage temporary table. Then it copies all data from Cloud Storage into BigQuery in one operation.
Taken from the GCFS spark example docs here
Direct
In this method the data is written directly to BigQuery using the BigQuery Storage Write API.
In Scala or PySpark you can specify it like this: option("writeMethod", "direct"), which eliminates the need for a temporary bucket.
You can read more about the bigquery connector here
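For illustration, here is roughly how the two modes are selected from PySpark; the table name and bucket are placeholders, and the spark-bigquery connector is assumed to be on the classpath:

# A small example DataFrame standing in for the data to be written.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Indirect (default): data is first staged in a temporary GCS bucket,
# then loaded into BigQuery, so the bucket must be provided.
(df.write.format("bigquery")
    .option("writeMethod", "indirect")            # optional, indirect is the default
    .option("temporaryGcsBucket", "temp_bucket")
    .save("my_project.my_dataset.my_table"))

# Direct: rows are streamed through the BigQuery Storage Write API,
# so no temporary bucket is needed.
(df.write.format("bigquery")
    .option("writeMethod", "direct")
    .save("my_project.my_dataset.my_table"))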

Load ORC format to Aurora Postgres DB

We have ORC files stored in S3, and we want to load them into an AWS Aurora Postgres DB.
What we found on the internet was:
Postgres supports CSV, TXT and other formats, but not ORC.
INSERT OVERWRITE DIRECTORY '<Hdfs-Directory-Path>' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE SELECT * FROM default.foo;
Can anyone please help us find a solution?
To this date, PostgreSQL on Aurora supports ingestion of data from S3 through the COPY command only from TXT and CSV files.
Since your files are in ORC format, you could convert the files to either CSV or TXT and then ingest the data. You could do this very easily with Athena, by simply creating a table over your original data and running a SELECT * FROM table query. As explained in the Working with Query Results, Output Files, and Query History
page, this will automatically generate a CSV file containing the results.
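If you want to script that conversion rather than run it from the console, a minimal sketch with boto3, assuming an Athena table orc_db.orc_table has already been defined over the ORC files and that the result bucket below exists:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Athena writes the query results as a CSV file under OutputLocation.
response = athena.start_query_execution(
    QueryString="SELECT * FROM orc_db.orc_table",
    QueryExecutionContext={"Database": "orc_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(response["QueryExecutionId"])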
This would not be optimal, as you'd pay not only the transformation price but also the storage twice (for the original ORC and the converted CSV), but it would let you convert the data pretty easily.
A better way would instead be to use a service like AWS Glue, which supports S3 as a source and has an Aurora connector. This would give you an actual ETL pipeline, and even if right now you only need the E(xtract) and L(oad), it would still leave the door open for any kind of transform you might need in the future.
In this AWS Blog titled How to extract, transform, and load data for analytic processing using AWS Glue (Part 2) they show the opposite flow (Aurora->S3 via Glue), but it should still give you an idea of the process.

Benefits and drawbacks of using Hive Warehouse Connector over Hadoop File Location

Normally we use the Hadoop file location of the Hive table to access data from our Spark ETLs. Are there any benefits to using the Hive Warehouse Connector instead of our current approach? And are there any drawbacks to using the Hive Warehouse Connector for ETLs?
I cannot think of a drawback.
Hive stores the schema and provides faster predicate pushdowns. If you read from the filesystem, you will often need to define the schema on your own, as in the sketch below.
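A rough comparison of the two access patterns; the table, path, and columns are placeholders, and the HiveWarehouseSession import assumes the HWC library is attached to the Spark session:

# 1) Reading the warehouse files directly: the schema usually has to be declared by hand.
from pyspark.sql.types import StructType, StructField, LongType, StringType

schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
])
df_files = spark.read.schema(schema).orc("/warehouse/tablespace/managed/hive/db.db/my_table")

# 2) Reading through the Hive Warehouse Connector: Hive supplies the schema
#    and predicates can be pushed down into the scan.
from pyspark_llap import HiveWarehouseSession

hive = HiveWarehouseSession.session(spark).build()
df_hwc = hive.executeQuery("SELECT id, name FROM db.my_table WHERE id > 100")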

Can I force flush a Databricks Delta table, so the disk copy has latest/consistent data?

I am accessing Databricks Delta tables from Azure Data Factory, which does not have a native connector to Databricks tables. So, as a workaround, I create the tables with the LOCATION keyword to store them in Azure Data Lake. Then, since I know the table file location, I just read the underlying Parquet files from Data Factory. This works fine.
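For context, the workaround looks roughly like this (the database, table, and storage path are placeholders):

# Create the Delta table at an explicit Data Lake location so the Parquet
# files under that path can be read directly from Data Factory.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_db.events
    USING DELTA
    LOCATION 'abfss://container@mystorageaccount.dfs.core.windows.net/delta/events'
""")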
But... what if there is cached information in the Delta transaction log that has not yet been written to disk? Say, an application updated a row in the table, and the disk does not yet reflect this fact. Then my read from Data Factory will be wrong.
So, two questions...
Could this happen? Are changes held in the log for a while before being written out?
Can I force a transaction log flush, so I know the disk copy is updated?
Azure Data Factory now has built-in Delta Lake support (this was not the case at the time the question was raised).
Delta is available as an inline dataset in an Azure Data Factory data flow activity. To get column metadata, click the Import schema button in the Projection tab. This will allow you to reference the column names and data types of the Delta table (see also the docs here).
ADF supports Delta Lake format as of July 2020:
https://techcommunity.microsoft.com/t5/azure-data-factory/adf-adds-connectors-for-delta-lake-and-excel/ba-p/1515793
The Microsoft Azure Data Factory team is enabling .. and a data flow connector for data transformation using Delta Lake
Delta is currently available in ADF as a public preview in data flows as an inline dataset.