Write only when all tables are valid with Databricks and Delta tables

I'm looping through some CSV files in a folder. I want to write these CSV files as Delta tables only if they are all valid. Each CSV file in the folder has a different name and schema. If any file is invalid, I want to reject the entire folder and all the files it contains until the data is fixed. I'm running a lot of tests, but ultimately I have to actually write the files as Delta tables with the following loop (simplified for this question):
for f in files:
    # read csv
    df = spark.read.csv(f, header=True, schema=schema)
    # writing to already existing delta table
    df.write.format("delta").save('path/' + f)
Is there a callback mechanism so that the write method is executed only if none of the dataframes returns an error? Delta table schema enforcement is pretty rigid, which is great, but errors can pop up at any time despite all the tests I'm running before passing these files into this loop.
union is not an option because I want to handle this by date, and each file has a different schema and name.

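You can use df.union() or df.unionByName() to read all of your files into a single dataframe. That dataframe is then either written in full or the write fails. Note that this only works when the files share a schema, as in the code below, where one schema is used for every file.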
# Create empty dataframe with schema to fill up
emptyRDD = spark.sparkContext.emptyRDD()
df = spark.createDataFrame(emptyRDD, schema)
for f in files:
    # read csv
    dfNext = spark.read.csv(f, header=True, schema=schema)
    df = df.unionByName(dfNext)
df.write.format("delta").save(path)
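When the files cannot be unioned because their schemas differ, as the question states, another option is a two-phase loop: read and validate every file first, and start writing only if none of them failed. The sketch below shows just that control flow in plain Python. The names write_all_or_nothing, FolderRejected, read_and_validate and write are hypothetical and not part of Spark or Delta; the caller supplies the two callbacks.

```python
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


class FolderRejected(Exception):
    """Raised when at least one file in the folder fails validation."""

    def __init__(self, errors: Dict[str, str]):
        super().__init__(f"{len(errors)} file(s) failed validation: {sorted(errors)}")
        self.errors = errors


def write_all_or_nothing(
    files: Iterable[str],
    read_and_validate: Callable[[str], T],  # hypothetical callback: returns data or raises
    write: Callable[[str, T], None],        # hypothetical callback: persists one file
) -> List[str]:
    """Validate every file first; start writing only if none of them failed."""
    validated: List[Tuple[str, T]] = []
    errors: Dict[str, str] = {}

    # Phase 1: read and validate everything, collecting every error.
    for f in files:
        try:
            validated.append((f, read_and_validate(f)))
        except Exception as exc:  # keep going so that all bad files are reported
            errors[f] = str(exc)

    if errors:
        raise FolderRejected(errors)

    # Phase 2: reached only when every file passed phase 1.
    written: List[str] = []
    for f, data in validated:
        write(f, data)
        written.append(f)
    return written
```

In a Spark job, read_and_validate would wrap spark.read.csv plus whatever checks are needed, and write would wrap df.write.format("delta").save(...). Because Spark is lazy, the validation callback has to trigger an action that actually parses the data, otherwise errors only surface during the write. Delta transactions are per table, so a failure in phase 2 can still leave some tables written; this pattern reduces that risk but does not make the whole folder atomic.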

Related

Save pyspark dataframe entries as separate html files on s3

So, if I have a list of file locations on s3, I can build a dataframe with a column containing the contents of each file in a separate row by doing the following (for example):
s3_path_list = list(df.select('path').toPandas()['path'])
df2 = spark.read.format("binaryFile").load(s3_path_list)
which returns:
df2: pyspark.sql.dataframe.DataFrame
path:string
modificationTime:timestamp
length:long
content:binary
What is the inverse of this operation?
Specifically... I have plotly generating html content stored as a string in an additional 'plot_string' column.
df3: pyspark.sql.dataframe.DataFrame
save_path:string
plot_string:string
How would I go about efficiently saving off each 'plot_string' entry as an html file at some s3 location specified in the 'save_path' column?
Clearly some form of df.write can be used to save off the dataframe (bucketed or partitioned) as parquet, csv, text table, etc... but I can't seem to find any straightforward method to perform a simple parallel write operation without a udf that initializes separate boto clients for each file... which, for large datasets, is a bottleneck (as well as being inelegant). Any help is appreciated.

Adding an additional column containing file name to pyspark dataframe

I am iterating through CSV files in a folder using a for loop and performing some operations on each CSV (getting the count of rows for each unique id and storing all these outputs in a pyspark dataframe). Now my requirement is to also add the name of the file to the dataframe for each iteration. Can anyone suggest a way to do this?

You can get the file name as a column using the function pyspark.sql.functions.input_file_name. If your files have the same schema and you want to apply the same processing pipeline to all of them, you don't need to loop over the files; you can read them all at once using a glob pattern:
from pyspark.sql.functions import input_file_name
df = spark.read.csv("path/to/the/files/*.csv", header=True, sep=";") \
    .withColumn("file_name", input_file_name())
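input_file_name returns the full path of the file rather than only its name. If just the file name is wanted, the last path segment has to be extracted. The plain-Python sketch below shows that logic on an example URI; file_name_from_uri is a hypothetical helper and the URIs are made up for illustration.

```python
import posixpath
from urllib.parse import unquote, urlparse


def file_name_from_uri(uri: str) -> str:
    """Return the last path segment of a file URI (hypothetical helper)."""
    # urlparse separates the scheme and host from the path; unquote decodes
    # percent-encoded characters in case the path is URL-encoded.
    return posixpath.basename(unquote(urlparse(uri).path))


# Made-up example paths
print(file_name_from_uri("file:///path/to/the/files/sales_2021.csv"))
print(file_name_from_uri("s3://bucket/dir/a%20b.csv"))
```

The same idea can be expressed on the dataframe column itself by splitting the path on '/' and taking the last element.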

How to specify schema for the folder structure when reading parquet file into a dataframe [duplicate]

I have to read parquet files that are stored in the following folder structure
/yyyy/mm/dd/ (eg: 2021/01/31)
If I read the files like this, it works:
unPartitionedDF = spark.read.option("mergeSchema", "true").parquet("abfss://xxx#abc.dfs.core.windows.net/Address/*/*/*/*.parquet")
Unfortunately, the folder structure is not stored in the typical partitioned format /yyyy=2021/mm=01/dd=31/ and I don't have the luxury of converting it to that format.
I was wondering if there is a way I can provide Spark a hint as to the folder structure so that it would make "2021/01/31" available as yyyy, mm, dd in my dataframe.
I have another set of files, which are stored in the /yyyy=aaaa/mm=bb/dd=cc format and the following code works:
partitionedDF = spark.read.option("mergeSchema", "true").parquet("abfss://xxx#abc.dfs.core.windows.net/Address/")
Things I have tried
I have specified the schema, but it just returned nulls
from pyspark.sql.types import StructType, StructField, LongType, TimestampType
customSchema = StructType([
    StructField("yyyy", LongType(), True),
    StructField("mm", LongType(), True),
    StructField("dd", LongType(), True),
    StructField("id", LongType(), True),
    StructField("a", LongType(), True),
    StructField("b", LongType(), True),
    StructField("c", TimestampType(), True)])
partitionDF = spark.read.option("mergeSchema", "true").schema(customSchema).parquet("abfss://xxx#abc.dfs.core.windows.net/Address/")
display(partitionDF)
The above returns no data! If I change the path to "abfss://xxx#abc.dfs.core.windows.net/Address/*/*/*/*.parquet", then I get data, but the yyyy, mm and dd columns are empty.
Another option would be to load the folder path as a column, but I can't seem to find a way to do that.
TIA
Databricks N00B!

I suggest you load the data without relying on partitioned folders, as you mentioned:
unPartitionedDF = spark.read.option("mergeSchema", "true").parquet("abfss://xxx#abc.dfs.core.windows.net/Address/*/*/*/*.parquet")
Then add a column holding the value of the input_file_name function:
import pyspark.sql.functions as F
unPartitionedDF = unPartitionedDF.withColumn('file_path', F.input_file_name())
Then you can split the values of the new file_path column into three separate columns:
df = unPartitionedDF.withColumn('year', F.split(F.col('file_path'), '/').getItem(3)) \
    .withColumn('month', F.split(F.col('file_path'), '/').getItem(4)) \
    .withColumn('day', F.split(F.col('file_path'), '/').getItem(5))
The index passed to getItem is 0-based and depends on the exact folder structure you have, so adjust it to match your paths.
I hope this resolves your problem.
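To find the right getItem indices, it can help to split one real file path by hand and look at which position each segment lands in. The sketch below does this in plain Python; path_parts is a hypothetical helper and the file name part-0000.parquet is made up, while the rest of the path follows the question.

```python
def path_parts(file_path: str) -> dict:
    """Map each 0-based index to the segment that splitting on '/' produces."""
    return dict(enumerate(file_path.split("/")))


# Hypothetical file name; the container and folder come from the question.
example = "abfss://xxx#abc.dfs.core.windows.net/Address/2021/01/31/part-0000.parquet"
parts = path_parts(example)
for index, segment in parts.items():
    print(index, repr(segment))
```

For this example path the "//" after the scheme produces an empty segment at index 1, so the year, month and day end up at indices 4, 5 and 6. The values for a real table should be checked the same way against one value of the file_path column.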

Hive SaveAsTable creates a new Parquet table file for every run

I have the following Scala code that I use to write data from a json file to a table in Hive.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
val conf = new SparkConf().setAppName("App").setMaster("local")
import org.apache.spark.sql.hive._
val hiveContext = new HiveContext(sc)
val stg_comments = hiveContext.read.schema(buildSchema()).json(<path to json file>)
stg_comments.write.mode("append").saveAsTable(<table name>)
My JSON data has newline and carriage return characters in its field values, so I cannot simply insert records into Hive (Hive tables by default do not store newlines and carriage returns in data values). That is why I need to use the saveAsTable option. The issue is that every time a JSON file is read and new records are appended to the existing table, a new Parquet file is created in the table directory in the Hive warehouse directory. This leads to very small Parquet files in the directory. I would like the data to be appended to the existing Parquet file. Does anyone know how to do that? Thanks!

This is expected behavior. There is no append-to-existing-file option here. Each job has its own set of tasks, and each task writes its own output file. Repartitioning before the write can reduce the number of files written, but it cannot prevent new files from being created.
If the number of files becomes a problem, you have to run a separate job that reads the existing small files and merges them into larger chunks.
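For such a compaction job, the number of output files to aim for can be estimated from the total data size and a target file size; that number is what would be passed to repartition or coalesce before rewriting. A minimal sketch of the arithmetic follows. The function name target_file_count and the 128 MiB default are assumptions for illustration, not values taken from the question.

```python
import math


def target_file_count(total_bytes: int, target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Estimate how many output files a compaction job should write.

    The 128 MiB default is an assumed target size, not a Spark or Hive setting.
    """
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("sizes must be non-negative and the target must be positive")
    # Always write at least one file, and round up so no file exceeds the target.
    return max(1, math.ceil(total_bytes / target_file_bytes))


# 5000 small files of about 20 KiB each fit into a single target-sized file.
print(target_file_count(5000 * 20 * 1024))
# 10 GiB of data at 128 MiB per file.
print(target_file_count(10 * 1024 ** 3))
```

The estimate is based on the on-disk size of the existing files, so the files actually written may come out somewhat larger or smaller depending on compression.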

Inserting data into Hive creates small part files

I am processing a JSON file with more than 1000000 records. I read the file line by line and extract the required key values (the JSON has a mixed structure that is not fixed, so I parse each line and generate the required JSON elements). I then generate a JSON string similar to the json_string variable below and push it to a Hive table. The data is stored properly, but the hadoop apps/hive/warehouse/jsondb.myjson_table folder contains small part files: every insert query creates a new part file (.1 to .20 kb). Because of that, even a simple query on Hive takes more than 30 minutes. Below is sample code of my logic; it runs many times, once for each new record to insert into Hive.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("SparkSessionZipsExample").enableHiveSupport().getOrCreate()
// needed for the toDS conversion below
import spark.implicits._
var json_string = """{"name":"yogesh_wagh","education":"phd" }"""
val df = spark.read.json(Seq(json_string).toDS)
//df.write.format("orc").saveAsTable("bds_data1.newversion");
df.write.mode("append").format("orc").insertInto("bds_data1.newversion");
I have also tried adding a Hive property to merge the files, but it doesn't work.
I have also tried creating a table from the existing table to combine the small part files into 256 MB files.
Please share sample code to insert multiple records and append records to a part file.

I think each of those individual inserts is creating a new part file.
You could create a dataset/dataframe from all of these JSON strings and then save it to the Hive table in one go.
You could merge the existing small files using the Hive DDL statement ALTER TABLE table_name CONCATENATE;
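The batching idea can be sketched independently of Spark: collect the generated JSON strings in memory and hand them over in large groups, so that each insert writes many records instead of one. The plain-Python sketch below shows only that buffering logic. BatchBuffer, its flush callback and the default batch size are hypothetical and would need to be adapted to the actual job.

```python
from typing import Callable, List


class BatchBuffer:
    """Collect JSON strings and pass them on in batches (hypothetical helper)."""

    def __init__(self, flush: Callable[[List[str]], None], batch_size: int = 10_000):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._flush = flush            # called once per full batch
        self._batch_size = batch_size  # assumed value; tune for the real data
        self._records: List[str] = []

    def add(self, json_string: str) -> None:
        """Buffer one record and flush automatically when the batch is full."""
        self._records.append(json_string)
        if len(self._records) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Hand over whatever is buffered; call once more after the last record."""
        if self._records:
            batch, self._records = self._records, []
            self._flush(batch)
```

In the Spark job, the flush callback would build a single dataset from the whole batch and call insertInto once, so the number of part files grows with the number of batches rather than with the number of records.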