I am trying to insert data into a table using an INSERT OVERWRITE statement, but I am getting the error below:
org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is also being read from.;
The command is as follows:
spark.sql("INSERT OVERWRITE TABLE edm_hive SELECT run_number+1 from edm_hive")
I tried using a temp table to store the results and then update the final table, but that did not work either.
I am also trying to insert a record into a table using a variable, but that is not working either, e.g.:
spark.sql("INSERT into TABLE Feed_metadata_s2 values ('LOGS','StartTimestamp',$StartTimestamp)")
Please suggest a solution.
This solution works well for me; I added the property below when building the SparkSession, as suggested in "Spark HiveContext : Insert Overwrite the same table it is read from":
val spark = SparkSession.builder()
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .enableHiveSupport() // needed so the Hive metastore table can be resolved
  .getOrCreate()
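For reference, a minimal PySpark equivalent (a sketch, assuming the same Hive-backed Parquet table edm_hive): with convertMetastoreParquet disabled, the INSERT OVERWRITE from the question should no longer trip the "Cannot overwrite a path that is also being read from" check.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.sql.hive.convertMetastoreParquet", "false")  # route the table through the Hive SerDe instead of the native Parquet reader
    .enableHiveSupport()
    .getOrCreate())

spark.sql("INSERT OVERWRITE TABLE edm_hive SELECT run_number + 1 FROM edm_hive")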
Using DBeaver, I am trying to add a new column and populate it from an existing column. However, after the UPDATE on the new column, other columns' values are changed to NULL. I am unsure what the issue is.
Before the update:
My statements:
ALTER TABLE project1.nhd
ADD "Address" varchar;
UPDATE project1.nhd
SET "Address" = SPLIT_PART("PropertyAddress", ',', 1);
After the update:
I thought the issue was the method of data transfer, so I tried importing the CSV as a database in DBeaver and then exporting it to the table, and also using COPY to import the data into the table. However, the result was the same.
I want to use this syntax, which was posted in a similar thread, to update the partitioning of a Delta table I have. The challenge is that the Delta table does not exist in Databricks; we use Databricks to update the Delta table via Azure Data Factory.
How can I adjust the syntax below to update the partitioning and overwrite the table via the table path?
Scala:
val input = spark.read.table("mytable")
input.write.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.partitionBy("colB") // different column
.saveAsTable("mytable")
SQL:
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM <tablename>
I tried to adjust the above code but could not get it to use the Delta table path.
If you have a path, then you need to use the correct functions (a sketch of both forms follows below):
for Python, you need to use .load for reading and .save for writing;
if you're using SQL, then you refer to the table by its path, as follows:
delta.`path`
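As a hedged sketch of both forms, assuming an example path /mnt/delta/mytable (substitute your own) and the colB / view_date partition columns from the question:

Python:
input_df = spark.read.format("delta").load("/mnt/delta/mytable")
(input_df.write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")  # needed because the partitioning changes
    .partitionBy("colB")
    .save("/mnt/delta/mytable"))

SQL:
REPLACE TABLE delta.`/mnt/delta/mytable`
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM delta.`/mnt/delta/mytable`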
I tried to create a Delta table from a Spark DataFrame using the command below:
destination_path = "/dbfs/mnt/kidneycaredevstore/delta/df_corr_feats_spark_4"
df_corr_feats_spark.write.format("delta").option("delta.columnMapping.mode", "name").option("path",destination_path).saveAsTable("CKD_Features_4")
I am getting the error below:
AnalysisException: Cannot create a table having a column whose name contains commas in Hive metastore. Table: default.abc_features_4; Column: Adverse, abc initial encounter
Please note that there are around 6k columns in this DataFrame, and they were generated as features by data scientists, so we cannot rename the columns.
How can I fix this error? Can any configuration change in the metastore solve this issue?
The column mapping feature requires writer version 5 and reader version 2 of the Delta protocol, so you need to specify these when saving:
(df_corr_feats_spark.write.format("delta")
    .option("delta.columnMapping.mode", "name")
    .option("delta.minReaderVersion", "2")
    .option("delta.minWriterVersion", "5")
    .option("path", destination_path)
    .saveAsTable("CKD_Features_4"))
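As a quick follow-up check (a suggestion, not part of the original answer), Delta's DESCRIBE DETAIL output includes the protocol versions, so you can confirm the table was created with the required ones:

spark.sql("DESCRIBE DETAIL CKD_Features_4").select("minReaderVersion", "minWriterVersion").show()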
I'm looking for a way to append a Spark DataFrame with an extra column to an existing Hive table. I'm using the code below to overwrite the table, but it only works when the df schema and the Hive table schema are equal; sometimes I need to add one column, and since the schemas don't match it does not work.
Is there a way to append the df with the new column?
Or do I have to run an ALTER TABLE ADD COLUMN in spark.sql()?
temp = spark.table('temp')
temp.write.mode('overwrite').insertInto(datalab + '.' + table,overwrite=True)
Hope my question is understandable, thanks.
You can take a DataFrame with the new set of data that has the additional columns and append it to the existing table, in the following manner:
new_data_df = ...  # the df with the additional columns
new_data_df.write.mode('append').saveAsTable('same_table_name', mergeSchema=True)
So, supposing the new column you added is 'column_new', the older records in the table will have it set to null.
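For completeness, a minimal sketch of that approach using the names from the question (temp, datalab, table) and a hypothetical column_new; note that the mergeSchema option on append is honoured by Delta tables, whereas plain Hive/Parquet tables may still need an ALTER TABLE ADD COLUMN first:

from pyspark.sql import functions as F

temp = spark.table('temp')
new_data_df = temp.withColumn('column_new', F.lit(None).cast('string'))  # hypothetical extra column
(new_data_df.write
    .mode('append')
    .option('mergeSchema', 'true')
    .saveAsTable(datalab + '.' + table))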
I have a Spark temporary table spark_tmp_view with a DATE_KEY column. I am trying to create a Hive table from it (without writing the temp table to a Parquet location). What I have tried to run is spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.result AS SELECT * FROM spark_tmp_view PARTITIONED BY(DATE_KEY DATE)")
The error I got is mismatched input 'BY' expecting <EOF>. I tried to search but still haven't been able to figure out how to do this from a Spark app, nor how to insert data afterwards. Could someone please help? Many thanks.
PARTITIONED BY is part of the definition of the table being created, so it should precede ... AS SELECT ...; see the Spark SQL syntax reference.
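A hedged sketch of the reordered statement, assuming a datasource (USING parquet) table rather than a Hive-format EXTERNAL one (CTAS restrictions differ between the two); in a CTAS the partition column is listed by name only, its type coming from the SELECT:

spark.sql("""
    CREATE TABLE IF NOT EXISTS mydb.result
    USING parquet
    PARTITIONED BY (DATE_KEY)
    AS SELECT * FROM spark_tmp_view
""")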