I'm using PySpark with the HiveWarehouseConnector on an HDP 3 cluster.
There was a schema change, so I updated my target table with an "alter table" command, which appended the new columns at the end of the table by default.
Now I'm trying to use the following code to save a Spark dataframe to it, but the dataframe's columns are in alphabetical order and I'm getting the error message below:
df = spark.read.json(df_sub_path)
hive.setDatabase('myDB')
df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").mode('append').option('table','target_table').save()
and the error message traces to:
Caused by: java.lang.IllegalArgumentException: Hive column:
column_x cannot be found at same index: 77 in
dataframe. Found column_y. Aborting as this may lead to
loading of incorrect data.
Is there any dynamic way of appending the dataframe columns to the correct positions in the Hive table? This is important, as I expect more columns to be added to the target table.
You can read the target table without rows to get its columns. Then, using select, you can put the dataframe columns in the correct order and append it:
# Empty result that only carries the target table's column order
target = hive.executeQuery('select * from target_table where 1=0')
# Reorder the source dataframe (df from the question) to match the target
test = df.select(target.columns)
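With the columns aligned, the write-back is the same HiveWarehouseConnector call as in the question (a sketch; test is the reordered dataframe from the snippet above):
# Append the reordered dataframe; connector, mode and table option are unchanged from the question.
test.write \
    .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector") \
    .mode("append") \
    .option("table", "target_table") \
    .save()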
I am trying to read a Synapse table which has spaces in its column names.
Reading the table works as long as I select only columns without spaces or special characters:
%%spark
val df = spark.read.synapsesql("<Pool>.<schema>.<table>").select("TYPE", "Year").limit(100)
df.show()
OUTPUT:
+------+----+
| TYPE|Year|
+------+----+
|BOUGHT|LAST|
|BOUGHT|LAST|
|BOUGHT|LAST|
|BOUGHT|LAST|
When I select columns with spaces I get errors. I have tried many variants:
.select(col("""`Country Code`"""))
.select(col("`Country Code`"))
.select(col("""[Country Code]"""))
.select(col("Country Code"))
.select($"`Country Code`")
.select("`Country Code`")
.select("""`Country Code`""")
will return this error:
ERROR: com.microsoft.sqlserver.jdbc.SQLServerException: Invalid column name 'Country'.
If I omit the backticks in the select, for example:
.select("[Country Code]")
ERROR: com.microsoft.sqlserver.jdbc.SQLServerException: Invalid column name '[Country Code]'.
With backticks, Spark in Synapse takes only the first word as the column name.
Any experience?
The select on its own will work; adding show (or any other action like count) will not. There does seem to be an issue with the Synapse synapsesql API. The Invalid column name 'Country' error is coming from the SQL engine, because there seems to be no way to pass square brackets back to it. Also, parquet files do not support spaces in column names, so it's probably connected.
The workaround is to simply not use spaces in column names. Fix up the tables in a previous Synapse pipeline step if required. I'll have a look into it, but there may be no other answer.
If you want to rename existing columns in the database you can use sp_rename, eg
EXEC sp_rename 'dbo.countries.country Type', 'countryType', 'COLUMN';
This code has been tested on a Synapse dedicated SQL pool.
That particular API (synapsesql.read) unfortunately cannot handle views. You would have to materialise the view first, e.g. using a CTAS in a prior Synapse Pipeline step. The API is useful for simple patterns (get table -> process -> put back) but is pretty limited: you can't even manage table distribution (hash, round_robin, replicate), indexing (clustered columnstore, clustered index, heap) or partitioning, but you never know, they might add to it one day. I'll be keeping an eye out during the next MS conference anyway.
I have created a function to run queries over JDBC. Thanks to this I was able to read from the view. I have also added sample code showing how to get the password from Key Vault using TokenLibrary.
def spark_query(db, query):
    jdbc_hostname = "<synapse_db>.sql.azuresynapse.net"
    user = "<spark_db_client>"
    password = "<strong_password>"
    # password_from_kv = TokenLibrary.getSecret("<Linked_Key_Vault_Service_Name>", "<Key_Vault_Key_Name>", "<Key_Vault_Name>")
    return spark.read.format("jdbc") \
        .option("url", f"jdbc:sqlserver://{jdbc_hostname}:1433;databaseName={db};user={user};password={password}") \
        .option("query", query) \
        .load()
Then I created a VIEW with column names without spaces:
CREATE VIEW v_my_table
AS
SELECT [Country code] as country_code from my_table
Granted access to <spark_db_client>:
GRANT SELECT ON v_my_table to <spark_db_client>
After all this preparation I was able to read the table through the VIEW and save it to a Spark database:
query = """
SELECT country_code FROM dbo.v_my_table
"""
df = spark_query(db="<my_database>", query=query)
spark.sql("CREATE DATABASE IF NOT EXISTS spark_poc")
df.write.mode("overwrite").saveAsTable("spark_poc.my_table")
df.registerTempTable("my_table")
The values in <angle brackets> are placeholders.
1) I am reading a table from Postgres as below and creating a dataframe:
df = spark.read.format("jdbc").option("url", url). \
option("query", "SELECT * FROM test_spark"). \
load()
2) I update one value in the dataframe df:
newDf = df.withColumn('id',F.when(df['id']==10,20).otherwise(df['id']))
3) I am trying to upsert the data back into the Postgres table.
-- The code below clears out the table data:
newDf.write.mode("overwrite").option("upsert", True).\
option("condition_columns", "id").option("truncate", True).\
format("jdbc").option("url", url).option("dbtable", "test_spark").save()
-- The code below works fine:
newDf.write.mode("overwrite").option("upsert", True).\
option("condition_columns", "id").option("truncate", True).\
format("jdbc").option("url", url).option("dbtable", "test_spark1").save()
Issue: when I write the updated dataframe back to the same table (i.e. test_spark), the table data gets cleared out, but when it is a new (non-existing) table (i.e. test_spark1) it works fine.
Resolved the issue by writing the dataframe to a checkpoint directory before writing it to the DB table, as shown in the code below. Checkpointing materializes the data and breaks the lineage, so the write no longer depends on a lazy read from the very table that overwrite/truncate is about to clear:
sparkContext.setCheckpointDir('checkpoints')
newDf.checkpoint().write.format("jdbc").option("url", url).option("truncate", "true").mode("overwrite").\
option("dbtable", "spark_test").save()
I have created a DataFrame to load CSV files and created a temp table to get the column statistics.
However, when I try to run the ANALYZE command I am facing the error below.
The same Analyze command ran in Hive successfully.
Spark Version : 1.6.3
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("mode", "DROPMALFORMED")
.load("/bn_data/bopis/*.csv")
// To get the statistics of columns
df.registerTempTable("bopis")
val stat=sqlContext.sql("analyze table bopis compute statistics for columns").show()
Error:
java.lang.RuntimeException: [1.1] failure: ``with'' expected but identifier analyze found
analyze table bopis compute statistics for columns
^
Please let us know how to get the column statistics using Spark.
Thanks!
If you use the FOR COLUMNS option, you have to pass a list of column names, see https://docs.databricks.com/spark/latest/spark-sql/language-manual/analyze-table.html
In any case, even if you do, you are going to get an error, because you can't run compute statistics on a temp table (you will get Table or view 'bopis' not found in database 'default').
You'll have to create a full-blown Hive table, either via df.write.saveAsTable("bopis_hive") or via sqlContext.sql("CREATE TABLE bopis_hive as SELECT * from bopis").
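For example, something along these lines should work once the data is in a real Hive table (a sketch only; it assumes a Hive-enabled sqlContext, and col1, col2 are placeholders for your actual CSV column names):
# Persist the temp table's data as a real Hive table
df.write.saveAsTable("bopis_hive")

# FOR COLUMNS needs an explicit list of columns
sqlContext.sql("ANALYZE TABLE bopis_hive COMPUTE STATISTICS FOR COLUMNS col1, col2")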
I am reading a Hive table using Spark SQL and assigning it to a Scala val:
val x = sqlContext.sql("select * from some_table")
Then I do some processing with the dataframe x and finally come up with a dataframe y, which has the exact same schema as the table some_table.
Finally I try to insert-overwrite the dataframe y into the same Hive table some_table:
y.write.mode(SaveMode.Overwrite).saveAsTable().insertInto("some_table")
Then I get the error:
org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from
I tried creating an insert sql statement and firing it using sqlContext.sql() but it too gave me the same error.
Is there any way I can bypass this error? I need to insert the records back to the same table.
Hi, I tried doing as suggested, but I'm still getting the same error:
val x = sqlContext.sql("select * from incremental.test2")
val y = x.limit(5)
y.registerTempTable("temp_table")
val dy = sqlContext.table("temp_table")
dy.write.mode("overwrite").insertInto("incremental.test2")
scala> dy.write.mode("overwrite").insertInto("incremental.test2")
org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from.;
Actually you can also use checkpointing to achieve this. Since it breaks data lineage, Spark is not able to detect that you are reading and overwriting in the same table:
sqlContext.sparkContext.setCheckpointDir(checkpointDir)
val ds = sqlContext.sql("select * from some_table").checkpoint()
ds.write.mode("overwrite").saveAsTable("some_table")
You should first save your DataFrame y in a temporary table
y.write.mode("overwrite").saveAsTable("temp_table")
Then you can overwrite rows in your target table
val dy = sqlContext.table("temp_table")
dy.write.mode("overwrite").insertInto("some_table")
You should first save your DataFrame y as a parquet file:
y.write.parquet("temp_table")
Then load it back:
val parquetFile = sqlContext.read.parquet("temp_table")
And finally insert the data into your table:
parquetFile.write.insertInto("some_table")
In the context of Spark 2.2:
This error means that the process is reading from and writing to the same table.
Normally this should work, as the process writes to a .hiveStaging... directory first.
The error occurs with the saveAsTable method, because it overwrites the entire table instead of individual partitions.
It should not occur with the insertInto method, as that overwrites partitions, not the table.
One reason this happens is that the Hive table has the following Spark TBLProperties in its definition. The problem goes away for the insertInto method if you remove these TBLProperties:
'spark.sql.partitionProvider'
'spark.sql.sources.provider'
'spark.sql.sources.schema.numPartCols'
'spark.sql.sources.schema.numParts'
'spark.sql.sources.schema.part.0'
'spark.sql.sources.schema.part.1'
'spark.sql.sources.schema.part.2'
'spark.sql.sources.schema.partCol.0'
'spark.sql.sources.schema.partCol.1'
https://querydb.blogspot.com/2019/07/read-from-hive-table-and-write-back-to.html
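A sketch of dropping those properties (untested; mydb.my_table is a placeholder table name, and depending on your environment you may prefer to run the statement from Hive/beeline instead of spark.sql):
# Drop the Spark-specific table properties so insertInto overwrites partitions
# instead of being blocked; mydb.my_table is a hypothetical table name.
spark.sql("""
    ALTER TABLE mydb.my_table UNSET TBLPROPERTIES IF EXISTS (
        'spark.sql.partitionProvider',
        'spark.sql.sources.provider',
        'spark.sql.sources.schema.numPartCols',
        'spark.sql.sources.schema.numParts',
        'spark.sql.sources.schema.part.0',
        'spark.sql.sources.schema.part.1',
        'spark.sql.sources.schema.part.2',
        'spark.sql.sources.schema.partCol.0',
        'spark.sql.sources.schema.partCol.1'
    )
""")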
When we upgraded our HDP to 2.6.3, Spark was updated from 2.2 to 2.3, which resulted in the error below:
Caused by: org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is also being read from.;
at org.apache.spark.sql.execution.command.DDLUtils$.verifyNotReadPath(ddl.scala:906)
This error occurs for jobs that read from and write to the same path, like jobs with SCD logic.
Solution:
Set --conf "spark.sql.hive.convertMetastoreOrc=false"
or update the job so that it writes data to a temporary table, then reads from the temporary table and inserts it into the final table (see the sketch below).
https://querydb.blogspot.com/2020/09/orgapachesparksqlanalysisexception.html
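A minimal PySpark sketch of the second option (table names are placeholders; result_df stands for the dataframe produced by the job):
# Stage the result in a temporary table first, so the job never reads from
# and overwrites the same path in a single step.
result_df.write.mode("overwrite").saveAsTable("mydb.target_table_tmp")

# Reload from the staging table and overwrite the final table.
staged = spark.table("mydb.target_table_tmp")
staged.write.mode("overwrite").insertInto("mydb.target_table")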
Read the data from the Hive table in Spark:
import org.apache.hadoop.io.WritableComparable
import org.apache.hadoop.mapreduce.InputFormat
import org.apache.hive.hcatalog.data.HCatRecord
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat
val hconfig = new org.apache.hadoop.conf.Configuration()
HCatInputFormat.setInput(hconfig, "dbname", "tablename")
val inputFormat = (new HCatInputFormat).asInstanceOf[InputFormat[WritableComparable[_], HCatRecord]].getClass
val data = sc.newAPIHadoopRDD(hconfig, inputFormat, classOf[WritableComparable[_]], classOf[HCatRecord])
You'll also get the error "Cannot overwrite a path that is also being read from" in a case where you are doing this:
You are doing an "insert overwrite" into a Hive TABLE "A" from a VIEW "V" (that executes your logic),
and that VIEW also references the same TABLE "A". I found this out the hard way, as the VIEW was deeply nested code that was querying "A" as well. Bummer.
It is like cutting the very branch on which you are sitting :-(
What you need to keep in mind before doing the below is that the Hive table you are overwriting should have been created by Hive DDL, not by
Spark (df.write.saveAsTable("<table_name>")).
If that is not the case, this won't work.
I tested this in Spark 2.3.0:
val tableReadDf=spark.sql("select * from <dbName>.<tableName>")
val updatedDf=tableReadDf.<transformation> //any update/delete/addition
updatedDf.createOrReplaceTempView("myUpdatedTable")
spark.sql("""with tempView as(select * from myUpdatedTable) insert overwrite table
<dbName>.<tableName> <partition><partition_columns> select * from tempView""")
This is a good solution for me:
Extract the RDD and schema from the DataFrame.
Create a new clone DataFrame.
Overwrite the table.
private def overWrite(df: DataFrame): Unit = {
  val schema = df.schema
  val rdd = df.rdd
  val dfForSave = spark.createDataFrame(rdd, schema)
  dfForSave.write
    .mode(SaveMode.Overwrite)
    .insertInto(s"${tableSource.schema}.${tableSource.table}")
}
I'm trying to read a Hive table with Spark SQL's HiveContext. But when I submit the job, I get the following error:
Exception in thread "main" java.lang.RuntimeException: Unsupported parquet datatype optional fixed_len_byte_array(11) amount (DECIMAL(24,7))
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:77)
at org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:131)
at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:383)
at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:380)
The column type is DECIMAL(24,7). I've changed the column type with HiveQL, but it doesn't work. I've also tried casting to another decimal type in Spark SQL, like below:
val results = hiveContext.sql("SELECT cast(amount as DECIMAL(18,7)), number FROM dmp_wr.test")
But I got the same error. My code looks like this:
def main(args: Array[String]) {
  val conf: SparkConf = new SparkConf().setAppName("TColumnModify")
  val sc: SparkContext = new SparkContext(conf)
  val vectorAcc = sc.accumulator(new MyVector())(VectorAccumulator)
  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  val results = hiveContext.sql("SELECT amount, number FROM dmp_wr.test")
How can I solve this problem? Thank you for your response.
Edit 1: I found the Spark source line that throws the exception. It looks like this:
if(originalType == ParquetOriginalType.DECIMAL && decimalInfo.getPrecision <= 18)
So I created a new table whose column is of type DECIMAL(18,7), and my code works as expected.
I then dropped that table and created a new one whose column is DECIMAL(24,7); after that I changed the column type with
alter table qwe change amount amount decimal(18,7)
and I can see it is changed to DECIMAL(18,7), but Spark doesn't accept the change. It still reads the column type as DECIMAL(24,7) and gives the same error.
What can be the main reason?
alter table qwe change amount amount decimal(18,7)
Alter table commands in Hive do not touch the actual data stored in Hive; they only change the metadata in the Hive Metastore. This is very different from "alter table" commands in conventional databases (like MySQL).
When Spark reads data from Parquet files, it will try to use the metadata in the actual Parquet file to deserialize the data, which will still give it DECIMAL(24, 7).
There are two solutions to your problem:
1. Try out a new version of Spark built from trunk. See https://issues.apache.org/jira/browse/SPARK-6777, which completely changes this part of the code (it will only be in Spark 1.5, though), so hopefully you won't see the same problem again.
2. Convert the data in your table manually. You can use a Hive query like INSERT OVERWRITE TABLE new_table SELECT * FROM old_table to do it.