I am new to Delta Lake. I was trying a simple example:
Create a DataFrame from a CSV file
Save it as a Delta table
Read it again.
It works fine. I can see the files are created in the default spark-warehouse folder.
But next time I just want to read the saved table. So I comment out the code for the first two steps and re-run the program, and I get
AnalysisException: Table or view not found
val transHistory = spark.read
  .option("header", "true")
  .option("inferSchema", true)
  .csv(InputPath + "trainHistory.csv")
transHistory.write.format("delta").mode(SaveMode.Overwrite).saveAsTable("transactionshistory")
val transHistoryTable = spark.read.format("delta").table("transactionshistory")
transHistoryTable.show(10)
I am using Delta Lake 0.8.0, Spark 3.0, and Scala 2.12.13.
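One possible cause, which the post does not confirm: without a persistent metastore (for example Hive support), Spark keeps table names in an in-memory catalog that is gone when the program exits, while the data files stay in spark-warehouse. A minimal sketch, assuming that is what happens here, reads the table by its path instead of by name. The path spark-warehouse/transactionshistory is an assumption based on the default warehouse location; adjust it to wherever the files actually are.

```scala
import org.apache.spark.sql.SparkSession

object ReadSavedDeltaTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-saved-delta-table-sketch")
      .master("local[*]")
      // Session settings that the Delta Lake documentation lists for Spark 3.0
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    // Assumed location: default warehouse folder plus the table name.
    val tablePath = "spark-warehouse/transactionshistory"

    // Reading by path does not need the table to be registered in any catalog.
    val transHistoryTable = spark.read.format("delta").load(tablePath)
    transHistoryTable.show(10)

    spark.stop()
  }
}
```

If reading by name matters, the other route would be a session with a persistent metastore, so that the name registered by saveAsTable survives between runs.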
I want to save a Spark DataFrame in Delta format to S3; however, for some reason, the data is not saved. I debugged all the processing steps and confirmed there was data. Right before saving it, I ran count on the DataFrame, which returned 24 rows. But as soon as save is called, no data appears in the resulting folder. What could be the reason for it?
This is how I save the data:
df
  .select(schema)
  .repartition(partitionKeys.map(new ColumnName(_)): _*)
  .sortWithinPartitions(sortByKeys.map(new ColumnName(_)): _*)
  .write
  .format("delta")
  .partitionBy(partitionKeys: _*)
  .mode(saveMode)
  .save("s3a://etl-qa/data_feed")
There is a quick start from Databricks that explains how to read from and write to Delta Lake.
If the DataFrame you are trying to save is called df, you need to execute:
df.write.format("delta").save(s3path)
Hi, I have 90 GB of data in a CSV file. I load this data into a temp table, and then from the temp table into an ORC table using a select insert command. Converting and loading the data into ORC format takes 4 hours in Spark SQL. Is there any optimization technique I can use to reduce this time? As of now I'm not using any optimization technique; I'm just using Spark SQL to load the data from the CSV file into a table (text format) and then from this temp table into the ORC table (using select insert).
I run it using spark-submit as:
spark-submit \
  --class class-name \
  --jar file
Or is there any extra parameter I can add to spark-submit to improve performance?
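Scala code (sample):
import org.apache.spark.sql.SparkSession

object demo {
  def main(args: Array[String]): Unit = {
    // SparkSession with Hive support enabled
    val sparksession = SparkSession.builder().enableHiveSupport().getOrCreate()
    // Step 1: load the CSV file into the text-format table
    val a1 = sparksession.sql("load data inpath 'filepath' overwrite into table table_name")
    // Step 2: copy from that table into the ORC table
    val b1 = sparksession.sql("insert into tablename (all_column) select 'ALL_COLUMNS' from source_table")
  }
}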
"I'm just using spark sql and loading data from csv file to table(textformat) and then from this temp table to orc table(using select insert)"
A two-step process is not needed here.
Read the CSV directly into a DataFrame, as in the sample below:
val DFCsv = spark.read.format("csv")
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("yourcsv")
If needed, repartition the DataFrame before writing, since it is a large file (not having done so may be the cause of the 4-hour delay).
DFCsv.repartition(90) repartitions the CSV data into 90 roughly equal parts. Here 90 is only a sample number; you can choose whatever suits your data and cluster. Then write it out:
DFCsv.write.format("orc")
  .partitionBy("yourpartitioncolumns")
  .saveAsTable("yourtable")
OR
// insertInto() cannot be combined with partitionBy(); the existing table's partitioning is used
DFCsv.write.format("orc")
  .insertInto("yourtable")
Note:
1) For large data, repartitioning distributes the data uniformly, which increases parallelism and hence performance.
2) If you don't have partition columns and the table is not partitioned, there is no need for partitionBy in the above samples.
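The single-step approach described above could be put together roughly as follows. This is only a sketch: the input path "yourcsv", the table name "yourtable", the overwrite mode and the partition count 90 are placeholders or assumptions, not tuned values.

```scala
import org.apache.spark.sql.SparkSession

object CsvToOrcSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-to-orc-sketch")
      .enableHiveSupport() // the question uses a Hive-enabled session
      .getOrCreate()

    // Read the CSV directly; no intermediate text-format table.
    val dfCsv = spark.read.format("csv")
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
      .load("yourcsv") // placeholder path

    // Spread the rows over 90 partitions (sample number) so the ORC write runs in parallel.
    dfCsv.repartition(90)
      .write
      .format("orc")
      .mode("overwrite") // assumption: replacing the table is acceptable
      .saveAsTable("yourtable") // placeholder table name

    spark.stop()
  }
}
```

Because inferSchema makes Spark scan the input an extra time, supplying an explicit schema with .schema(...) may also help on a 90 GB file.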
I wanted to change a column name of a Databricks Delta table.
So I did the following:
// Read old table data
val old_data_DF = spark.read.format("delta")
  .load("dbfs:/mnt/main/sales")
// Created a new DF with a renamed column
val new_data_DF = old_data_DF
  .withColumnRenamed("column_a", "metric1")
  .select("*")
// Dropped and recreated the Delta files location
dbutils.fs.rm("dbfs:/mnt/main/sales", true)
dbutils.fs.mkdirs("dbfs:/mnt/main/sales")
// Trying to write the new DF to the location
new_data_DF.write
  .format("delta")
  .partitionBy("sale_date_partition")
  .save("dbfs:/mnt/main/sales")
Here I'm getting an error at the last step when writing to Delta:
java.io.FileNotFoundException: dbfs:/mnt/main/sales/sale_date_partition=2019-04-29/part-00000-769.c000.snappy.parquet
A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement
Obviously the data was deleted and most likely I've missed something in the above logic. Now the only place that contains the data is the new_data_DF.
Writing to a location like dbfs:/mnt/main/sales_tmp also fails.
What should I do to write data from new_data_DF to a Delta location?
In general, it is a good idea to avoid using rm on Delta tables. Delta's transaction log can prevent eventual consistency issues in most cases; however, when you delete and recreate a table in a very short time, different versions of the transaction log can flicker in and out of existence.
Instead, I'd recommend using the transactional primitives provided by Delta. For example, to overwrite the data in a table you can:
df.write.format("delta").mode("overwrite").save("/delta/events")
If you have a table that has already been corrupted, you can fix it using FSCK.
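As a rough illustration of the FSCK suggestion: on Databricks, FSCK REPAIR TABLE removes transaction-log entries for files that can no longer be found. The sketch below assumes a Databricks notebook (where spark is predefined) and a hypothetical metastore table named sales; neither comes from the question, which uses a path.

```scala
// Assumption: runs in a Databricks notebook, where `spark` already exists.
// `sales` is a hypothetical table name registered for the Delta location.

// DRY RUN only lists the file entries that would be removed from the log.
spark.sql("FSCK REPAIR TABLE sales DRY RUN").show(false)

// Without DRY RUN, the missing-file entries are actually removed from the log.
spark.sql("FSCK REPAIR TABLE sales")
```

Running the dry run first seems the safer order, since it shows what the repair would drop before anything changes.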
You could do that in the following way.
// Read old table data
val old_data_DF = spark.read.format("delta")
  .load("dbfs:/mnt/main/sales")
// Created a new DF with a renamed column
val new_data_DF = old_data_DF
  .withColumnRenamed("column_a", "metric1")
  .select("*")
// Trying to write the new DF to the location
new_data_DF.write
  .format("delta")
  .mode("overwrite") // this would overwrite the whole data files
  .option("overwriteSchema", "true") // this is the key line
  .partitionBy("sale_date_partition")
  .save("dbfs:/mnt/main/sales")
The overwriteSchema option lets the overwrite replace the table schema, so new physical files are written with the latest schema produced by the transformation.
I am trying to append columns to an existing CSV file in HDFS.
Script1:
someDF1.repartition(1).write.format("com.databricks.spark.csv").mode("append").option("sep", "\t").option("header","true").save("folder/test_file.csv")
Error:
org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory.
Any suggestions on the mistake would be helpful.
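CSV files don't support schema evolution. So you have to read all the existing data in the target path and then add the new column to that DataFrame with some default value.
import org.apache.spark.sql.functions.lit
val oldDF = dfWithExistingData.withColumn("new_col", lit(null))
You can then union or merge this DataFrame with the new dataset. Note that union matches columns by position, so both DataFrames need the same column order.
val targetData = oldDF.union(newDF)
You can then write the data back to your target path in overwrite mode. If the existing data was read from that same path, write to a different location first, because Spark reads lazily and overwriting a path that is still being read from can fail.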
targetData
  .repartition(1)
  .write
  .format("com.databricks.spark.csv")
  .mode("overwrite")
  .option("sep", "\t")
  .option("header", "true")
  .save("folder")
Alternative: You can switch to a file format that supports schema evolution, e.g. Parquet, to avoid the above process.
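To illustrate the Parquet alternative, here is a minimal sketch of appending data with an extra column and reading everything back with the mergeSchema read option. The path, the column names and the sample rows are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

object ParquetSchemaEvolutionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-schema-evolution-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val path = "/tmp/schema_evolution_sketch" // hypothetical output folder

    // First write: two columns.
    Seq((1, "a"), (2, "b")).toDF("id", "name")
      .write.mode("overwrite").parquet(path)

    // Later append: same columns plus a new one. Existing files are not rewritten.
    Seq((3, "c", 10.5)).toDF("id", "name", "new_col")
      .write.mode("append").parquet(path)

    // mergeSchema reconciles the schemas of all files at read time;
    // rows from the first write get null for new_col.
    val merged = spark.read.option("mergeSchema", "true").parquet(path)
    merged.printSchema()
    merged.show()

    spark.stop()
  }
}
```

Compared with the CSV approach, nothing has to be read back and overwritten just to add a column.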
I am new to Spark streaming. I am trying Spark Structured Streaming with local CSV files. I am getting the exception below while processing.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[file:///home/Teju/Desktop/SparkInputFiles/*.csv]
This is my code.
val df = spark
  .readStream
  .format("csv")
  .option("header", "false") // Input files have no header row
  .option("delimiter", ":") // Specifying the delimiter of the input file
  .schema(inputdata_schema) // Specifying the schema for the input file
  .load("file:///home/Teju/Desktop/SparkInputFiles/*.csv")
val filterop = spark.sql("select tagShortID,Timestamp,ListenerShortID,rootOrgID,subOrgID,first(rssi_weightage(RSSI)) as RSSI_Weight from my_table where RSSI > -127 group by tagShortID,Timestamp,ListenerShortID,rootOrgID,subOrgID order by Timestamp ASC")
val outStream = filterop.writeStream.outputMode("complete").format("console").start()
I created a cron job so that every 5 minutes I get one new input CSV file, which I am trying to parse with Spark streaming.
(This is not a solution but more a comment, but given its length it ended up here. I'm going to make it an answer eventually right after I've collected enough information for investigation).
My guess is that you're doing something incorrect on df that you have not included in your question.
Since the error message is about FileSource with the path below, and it is a streaming dataset, it must be df that's in play.
FileSource[file:///home/Teju/Desktop/SparkInputFiles/*.csv]
Given the other lines I guess that you register the streaming dataset as a temporary table (i.e. my_table) that you then use in spark.sql to execute SQL and writeStream to the console.
df.createOrReplaceTempView("my_table")
If that's correct, the code you've included in the question is incomplete and does not show the reason for the error.
Add .writeStream.start to your df, as the Exception is telling you.
Read the docs for more detail.
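For reference, a minimal sketch of the pattern the answers above describe: the streaming DataFrame is only registered as a view and queried, and the only thing that executes it is writeStream.start(). The schema, the input directory and the simplified query are assumptions for illustration; the question's own schema and its rssi_weightage UDF are not shown in the post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object StreamingCsvSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-csv-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical two-column schema; file-based streaming sources need an explicit one.
    val inputSchema = StructType(Seq(
      StructField("tagShortID", StringType),
      StructField("RSSI", IntegerType)
    ))

    val df = spark.readStream
      .format("csv")
      .option("header", "false")
      .option("delimiter", ":")
      .schema(inputSchema)
      .load("/tmp/SparkInputFiles") // hypothetical directory; it must already exist

    // Registering a view is allowed. Batch actions such as df.show() or df.count()
    // at this point would raise the AnalysisException from the question.
    df.createOrReplaceTempView("my_table")

    val counts = spark.sql(
      "select tagShortID, count(*) as n from my_table where RSSI > -127 group by tagShortID")

    // The streaming query only runs once it is started through writeStream.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

The awaitTermination() call keeps the driver alive so that files arriving later, for example from the cron job, are picked up.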