Databricks - failing to write from a DataFrame to a Delta location - scala

I wanted to change a column name of a Databricks Delta table.
So I did the following:
// Read old table data
val old_data_DF = spark.read.format("delta")
.load("dbfs:/mnt/main/sales")
// Created a new DF with a renamed column
val new_data_DF = old_data_DF
.withColumnRenamed("column_a", "metric1")
.select("*")
// Dropped and recreated the Delta files location
dbutils.fs.rm("dbfs:/mnt/main/sales", true)
dbutils.fs.mkdirs("dbfs:/mnt/main/sales")
// Trying to write the new DF to the location
new_data_DF.write
.format("delta")
.partitionBy("sale_date_partition")
.save("dbfs:/mnt/main/sales")
Here I'm getting an error at the last step, when writing to Delta:
java.io.FileNotFoundException: dbfs:/mnt/main/sales/sale_date_partition=2019-04-29/part-00000-769.c000.snappy.parquet
A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement
Obviously the data was deleted and most likely I've missed something in the above logic. Now the only place that contains the data is the new_data_DF.
Writing to a location like dbfs:/mnt/main/sales_tmp also fails
What should I do to write data from new_data_DF to a Delta location?

In general, it is a good idea to avoid using rm on Delta tables. Delta's transaction log can prevent eventual consistency issues in most cases; however, when you delete and recreate a table in a very short time, different versions of the transaction log can flicker in and out of existence.
Instead, I'd recommend using the transactional primitives provided by Delta. For example, to overwrite the data in a table you can:
df.write.format("delta").mode("overwrite").save("/delta/events")
If you have a table that has already been corrupted, you can fix it using FSCK.
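For example, for a table whose transaction log still exists but references files that were manually deleted, a minimal sketch of running that repair from a notebook might look like the following (assuming Databricks' FSCK REPAIR TABLE command and that the path-based delta.`...` identifier is accepted here):
// List the missing files first, then remove their entries from the transaction log
spark.sql("FSCK REPAIR TABLE delta.`dbfs:/mnt/main/sales` DRY RUN").show(false)
spark.sql("FSCK REPAIR TABLE delta.`dbfs:/mnt/main/sales`")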

You could do that in the following way.
// Read old table data
val old_data_DF = spark.read.format("delta")
.load("dbfs:/mnt/main/sales")
// Created a new DF with a renamed column
val new_data_DF = old_data_DF
.withColumnRenamed("column_a", "metric1")
.select("*")
// Trying to write the new DF to the location
new_data_DF.write
.format("delta")
.mode("overwrite") // this would overwrite the whole data files
.option("overwriteSchema", "true") //this is the key line.
.partitionBy("sale_date_partition")
.save("dbfs:/mnt/main/sales")
The overwriteSchema option will create new physical files with the latest schema that we updated during the transformation.
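As a quick sanity check (reusing the same path from above; the variable name is just illustrative), you can read the table back and confirm the rename took effect:
// Read the rewritten table and verify that column_a is now metric1
val check_DF = spark.read.format("delta").load("dbfs:/mnt/main/sales")
check_DF.printSchema()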

Related

Table or view not found when reading existing delta table

I am new to Delta Lake. I was trying a simple example:
Create a dataframe from a CSV
Save it as a Delta table
Read it again.
It works fine. I can see the files are created in the default spark-warehouse folder.
But the next time I just want to read the saved table. So I comment out the code for the first two steps and re-run the program, and I get
AnalysisException: Table or view not found
val transHistory = spark.read
.option("header", "true")
.option("inferschema", true)
.csv(InputPath + "trainHistory.csv");
transHistory.write.format("delta").mode(SaveMode.Overwrite).saveAsTable("transactionshistory")
val transHistoryTable = spark.read.format("delta").table("transactionshistory")
transHistoryTable.show(10)
I am using Delta Lake 0.8.0, Spark 3.0, and Scala 2.12.13.

Databricks Delta files adding new partition causes old ones to be not readable

I have a notebook with which I am doing a history load, loading 6 months of data every time, starting with 2018-10-01.
My Delta file is partitioned by calendar_date.
After the initial load I am able to read the Delta file and look at the data just fine.
But after the second load, for dates 2019-01-01 to 2019-06-30, the previous partitions no longer load normally using the delta format.
Reading my source Delta file like this throws an error saying the
file doesn't exist
game_refined_start = (
spark.read.format("delta").load("s3://game_events/refined/game_session_start/calendar_date=2018-10-04/")
)
However, reading it like below works just fine. Any idea what could be wrong?
spark.conf.set("spark.databricks.delta.formatCheck.enabled", "false")
game_refined_start = (
spark.read.format("parquet").load("s3://game_events/refined/game_session_start/calendar_date=2018-10-04/")
)
If the overwrite mode is used, then it completely replaces the previous data. You still see the old data via parquet because Delta doesn't remove old versions immediately (whereas an overwrite with plain parquet removes the data immediately).
To fix your problem you need to use append mode. If you need to get the previous data back, you can read a specific version from the table and append it. Something like this:
path = "s3://game_events/refined/game_session_start/"
v = spark.sql(f"DESCRIBE HISTORY delta.`{path}` limit 2")
version = v.take(2)[1][0]
df = spark.read.format("delta").option("versionAsOf", version).load(path)
df.write.mode("append").save(path)

Spark JDBC - How to stop automatic creation of table if table doesn't exist

I'm currently using jdbc to write data into an existing table.
jdbcDF.write
.mode(SaveMode.Append)
.jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
I'm currently using code similar to the above. When given a wrong table name, it will create that table and insert the data there, as opposed to throwing an error. Is there any way to stop that behaviour?
This behaviour is not supported by Spark. You would need to write your own logic around it; see the sketch after the list of modes below.
According to the ScalaDocs on the SaveMode enumeration, you have the following options when writing data to a sink:
Append: Append mode means that when saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.
ErrorIfExists: ErrorIfExists mode means that when saving a DataFrame to a data source, if data already exists, an exception is expected to be thrown.
Ignore: Ignore mode means that when saving a DataFrame to a data source, if data already exists, the save operation is expected to not save the contents of the DataFrame and to not change the existing data.
Overwrite: Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame.
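As one sketch of that "own logic" route (connection details and names are placeholders; jdbcDF is the DataFrame from the question), you could check the table's existence through plain JDBC metadata before letting Spark write:
import java.sql.DriverManager
import java.util.Properties
import org.apache.spark.sql.SaveMode

// Placeholder connection details; adjust to your environment
val url = "jdbc:postgresql:dbserver"
val connectionProperties = new Properties()
connectionProperties.setProperty("user", "username")
connectionProperties.setProperty("password", "password")

// Check whether schema.tablename already exists
val conn = DriverManager.getConnection(url, connectionProperties)
val tableExists =
  try conn.getMetaData.getTables(null, "schema", "tablename", null).next()
  finally conn.close()

if (tableExists) {
  jdbcDF.write.mode(SaveMode.Append).jdbc(url, "schema.tablename", connectionProperties)
} else {
  throw new IllegalStateException("Table schema.tablename does not exist; refusing to create it")
}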

Issue: inserting data into Hive creates small part files

I am processing more than 1,000,000 records from a JSON file. I read the file line by line and extract the required key values
(the JSON structure is mixed and not fixed, so I parse it and generate the required JSON elements), generate a JSON string similar to the json_string variable, and push it to a Hive table. The data is stored properly, but the hadoop apps/hive/warehouse/jsondb.myjson_table folder contains small part files: every insert query creates a new (0.1 to 0.2 KB) part file. Because of that, even a simple query on Hive takes more than 30 minutes. Here is sample code for my logic; it iterates multiple times as new records are inserted into Hive.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("SparkSessionZipsExample").enableHiveSupport().getOrCreate()
import spark.implicits._
var json_string = """{"name":"yogesh_wagh","education":"phd" }"""
val df = spark.read.json(Seq(json_string).toDS)
//df.write.format("orc").saveAsTable("bds_data1.newversion");
df.write.mode("append").format("orc").insertInto("bds_data1.newversion");
I have also tried adding the Hive property to merge the files, but it won't work.
I have also tried to create a table from the existing table to combine the small part files into 256 MB files.
Please share sample code to insert multiple records and append records to a part file.
I think each of those individual inserts is creating a new part file.
You could create a Dataset/DataFrame of these JSON strings and then save it to the Hive table in a single write (see the sketch below).
You could also merge the existing small files using the Hive DDL ALTER TABLE table_name CONCATENATE;
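As a sketch of the single-write idea (the table name follows the question; the list of JSON strings and the coalesce(1) are just for illustration):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("BatchJsonInsert").enableHiveSupport().getOrCreate()
import spark.implicits._

// Accumulate many JSON strings and write them in one shot
// instead of issuing one insert (and one part file) per record
val jsonStrings: Seq[String] = Seq(
  """{"name":"yogesh_wagh","education":"phd"}""",
  """{"name":"some_other_name","education":"msc"}"""
)

val df = spark.read.json(jsonStrings.toDS)
df.coalesce(1) // keep the number of output files small
  .write.mode("append")
  .insertInto("bds_data1.newversion") // insertInto uses the existing table's storage format (ORC here)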

How to save data in parquet format and append entries

I am trying to follow this example to save some data in parquet format and read it. If I use write.parquet("filename"), then the iterating Spark job gives the error that
"filename" already exists.
If I use the SaveMode.Append option, then the Spark job gives the error
"org.apache.spark.sql.AnalysisException: Specifying database name or other qualifiers are not allowed for temporary tables".
Please let me know the best way to ensure new data is just appended to the parquet file. Can I define primary keys on these parquet tables?
I am using Spark 1.6.2 on Hortonworks 2.5 system. Here is the code:
// Option 1: peopleDF.write.parquet("people.parquet")
//Option 2:
peopleDF.write.format("parquet").mode(SaveMode.Append).saveAsTable("people.parquet")
// Read in the parquet file created above
val parquetFile = spark.read.parquet("people.parquet")
//Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT * FROM people.parquet")
I believe if you use .parquet("...."), you should use .mode('append'),
not SaveMode.Append:
df.write.mode('append').parquet("....")
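In Scala, using the Spark 1.6-era API from the question (peopleDF and sqlContext assumed to already exist), a minimal sketch of that approach would be:
// Append to the parquet path instead of saveAsTable,
// then query through the registered temp table rather than the path
peopleDF.write.mode("append").parquet("people.parquet")

val parquetFile = sqlContext.read.parquet("people.parquet")
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT * FROM parquetFile")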