Insert Spark DF as column in existing hive table - pyspark

I'm looking for a way to append a Spark DF as a column to an existing Hive table. I'm using the code below to overwrite the table, but it only works when the DF schema and the Hive table schema are equal; sometimes I need to add a column, and since the schemas no longer match, it does not work.
Is there a way to append the df as a column?
Or do I have to run an ALTER TABLE ADD COLUMN via spark.sql()?
temp = spark.table('temp')
temp.write.mode('overwrite').insertInto(datalab + '.' + table, overwrite=True)
Hope my question is understandable, thanks.

You can take a DataFrame with the new set of data that has the additional columns and append it to the existing table, in the following manner.
new_data_df = ...  # a DataFrame that contains the additional column(s)
new_data_df.write.mode('append').saveAsTable('same_table_name', mergeSchema=True)
So if the new column you added is 'column_new', the older records in the table will have null values for it.
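For illustration, a minimal sketch of this approach in PySpark, reusing the names from the question; 'column_new' and its placeholder value are hypothetical, and the mergeSchema write option assumes a table format that supports schema evolution on append (such as Delta):
from pyspark.sql import functions as F

# add the new column to the data being appended (hypothetical column and value)
new_data_df = temp.withColumn('column_new', F.lit('placeholder'))

# append with schema merging so the table picks up 'column_new';
# older rows will read back NULL for it
(new_data_df.write
    .mode('append')
    .option('mergeSchema', 'true')
    .saveAsTable(datalab + '.' + table))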

Related

Cannot create a table having a column whose name contains commas in Hive metastore

I tried to create a Delta table from a Spark DataFrame using the command below:
destination_path = "/dbfs/mnt/kidneycaredevstore/delta/df_corr_feats_spark_4"
df_corr_feats_spark.write.format("delta").option("delta.columnMapping.mode", "name").option("path",destination_path).saveAsTable("CKD_Features_4")
I am getting the error below:
AnalysisException: Cannot create a table having a column whose name contains commas in Hive metastore. Table: default.abc_features_4; Column: Adverse, abc initial encounter
Please note that there are around 6k columns in this DataFrame, and they were generated by a data scientist's feature-generation process, so we cannot rename the columns.
How can I fix this error? Can a configuration change in the metastore solve this issue?
The column mapping feature requires writer version 5 and reader version 2 of the Delta protocol, so you need to specify those when saving:
(df_corr_feats_spark.write.format("delta")
    .option("delta.columnMapping.mode", "name")
    .option("delta.minReaderVersion", "2")
    .option("delta.minWriterVersion", "5")
    .option("path", destination_path)
    .saveAsTable("CKD_Features_4"))
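If the table already exists, the same protocol versions and column mapping mode can also be set as Delta table properties; a sketch, reusing the table name from the question:
spark.sql("""
    ALTER TABLE CKD_Features_4 SET TBLPROPERTIES (
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5',
        'delta.columnMapping.mode' = 'name'
    )
""")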

Spark2.4 Unable to overwrite table from same table

I am trying to insert data into a table using an INSERT OVERWRITE statement, but I am getting the error below:
org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is also being read from.;
The command is as below:
spark.sql("INSERT OVERWRITE TABLE edm_hive SELECT run_number+1 from edm_hive")
I have tried using a temp table to store the results and then update the final table, but that is not working either.
I am also trying to insert a record into the table using a variable, but that is also failing.
e.g.
spark.sql("INSERT into TABLE Feed_metadata_s2 values ('LOGS','StartTimestamp',$StartTimestamp)")
Please suggest a solution.
This solution works well for me; I added the property below to the SparkSession, as described in Spark HiveContext : Insert Overwrite the same table it is read from.
val spark = SparkSession.builder()
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .enableHiveSupport()
  .getOrCreate()
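Since the main question here is in PySpark, the equivalent configuration there would be roughly the following (a sketch, assuming you build the session yourself):
from pyspark.sql import SparkSession

# set the same property when constructing the session
spark = (SparkSession.builder
    .config("spark.sql.hive.convertMetastoreParquet", "false")
    .enableHiveSupport()
    .getOrCreate())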

Hive create partitioned table based on Spark temporary table

I have a Spark temporary table spark_tmp_view with a DATE_KEY column. I am trying to create a Hive table from it (without writing the temp table to a Parquet location first). What I tried to run is spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.result AS SELECT * FROM spark_tmp_view PARTITIONED BY(DATE_KEY DATE)")
The error I got is mismatched input 'BY' expecting <EOF>. I tried to search but still haven't been able to figure out how to do it from a Spark app, or how to insert data afterwards. Could someone please help? Many thanks.
PARTITIONED BY is part of the definition of the table being created, so it should precede ... AS SELECT ...; see the Spark SQL syntax.
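For illustration, a sketch of the reordered statement; it uses a data-source (USING parquet) table rather than the EXTERNAL Hive-format table from the question, since partitioned CTAS is supported there, and the partition column is listed by name only:
spark.sql("""
    CREATE TABLE IF NOT EXISTS mydb.result
    USING parquet
    PARTITIONED BY (DATE_KEY)
    AS SELECT * FROM spark_tmp_view
""")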

Spark DataFrame duplicate column names while using mergeSchema

I have a huge Spark DataFrame, which I create using the following statement:
val df = sqlContext.read.option("mergeSchema", "true").parquet("parquet/partitions/path")
Now when I try to do a column rename or select operation on the above DataFrame, it fails, saying ambiguous columns were found, with the following exception:
org.apache.spark.sql.AnalysisException: Reference 'Product_Type' is
ambiguous, could be Product_Type#13, Product_Type#235
Looking at the columns, I found there are two, Product_Type and Product_type, which seem to be the same column differing only in letter case, created by schema merging over time. I don't mind keeping the duplicate columns, but the Spark sqlContext for some reason doesn't like it.
I believe the spark.sql.caseSensitive config is true by default, so I don't know why it fails. I am using Spark 1.5.2 and am new to Spark.
By default, the spark.sql.caseSensitive property is false, so before your rename or select statement you should set it to true:
sqlContext.sql("set spark.sql.caseSensitive=true")
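For example (a sketch, assuming the two columns really do differ only in case), once the property is set each column can be referenced individually:
sqlContext.sql("set spark.sql.caseSensitive=true")
df.select("Product_Type", "Product_type").show()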

How do I insert a pandas DataFrame to an existing PostgreSQL table?

I have a pandas DataFrame df and a PostgreSQL table my_table. I wish to truncate my_table and insert df (which has columns in the same order) into my_table, without affecting the schema of my_table. How do I do this?
In a rather naive attempt, I tried dropping my_table and then using pandas.DataFrame.to_sql, but this creates a new table with a different schema.
I would manually truncate the table and then simply let Pandas do its job:
con.execute('TRUNCATE my_table RESTART IDENTITY;')
df.to_sql('my_table', con, if_exists='append')
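For completeness, a fuller sketch of that flow with SQLAlchemy; the connection string is hypothetical, df is the DataFrame from the question, and index=False is assumed because my_table already defines its own columns:
from sqlalchemy import create_engine, text

# hypothetical credentials and database
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

with engine.begin() as con:  # one transaction: truncate, then append
    con.execute(text("TRUNCATE my_table RESTART IDENTITY;"))
    df.to_sql("my_table", con, if_exists="append", index=False)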