Could not execute sql query on view created by spark using createOrReplaceTempView() - scala

I am using Spark and I created a view using createOrReplaceTempView(), but I can't run any SQL query against that view.
Here is my code:
val temp = spark.createDataFrame(Seq(comp_current_col(col1, col2, col3)))
temp.createOrReplaceTempView("Temp")
spark.sql("SELECT * FROM Temp")
This doesn't return any data.
And for this query,
spark.sql("insert into table comp_current select col1,col2,col3 from temp")
I am getting the following error:
org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] no native library is found for os.name=Mac and os.arch=aarch64
Please help me with this!
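
For reference, here is a minimal, self-contained sketch of the same flow in Scala (spark-shell). Note that spark.sql(...) only builds a DataFrame lazily, so an action such as show() is needed before any rows appear. The snappy-java note at the end is an assumption: it presumes the error comes from an older snappy-java jar on the classpath that ships no Apple Silicon native library.

// The case class name and columns mirror the question; the values are placeholders.
case class comp_current_col(col1: String, col2: String, col3: String)

val temp = spark.createDataFrame(Seq(comp_current_col("a", "b", "c")))
temp.createOrReplaceTempView("Temp")

// spark.sql returns a DataFrame lazily; call show() (or collect()) to actually see rows.
spark.sql("SELECT * FROM Temp").show()

// For the SnappyError on os.arch=aarch64: assumption - a snappy-java version that
// bundles Apple Silicon natives (1.1.8.4 or later) resolves it, e.g.
//   spark-shell --packages org.xerial.snappy:snappy-java:1.1.8.4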

Related

How to execute an update query in Spark SQL temp tables

I am trying the below code but it is throwing some random error that I am unable to understand:
df.registerTempTable("Temp_table")
spark.sql("Update Temp_table set column_a='1'")
Currently, Spark SQL does not support UPDATE statements. The workaround is to create a Delta Lake / Iceberg table from your Spark dataframe and execute your SQL query directly on that table.
For an Iceberg implementation, refer to:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html
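
As a minimal sketch of the Delta Lake route, assuming the session was started with the Delta Lake package and its SQL extension (an assumption about your environment), and using the dataframe and column from the question:

// Assumption: the shell/application was launched with, e.g.
//   --packages io.delta:delta-core_2.12:2.4.0
//   --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
//   --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

// Persist the dataframe as a Delta table (temp_table_delta is a name chosen here).
df.write.format("delta").mode("overwrite").saveAsTable("temp_table_delta")

// UPDATE works on Delta tables, unlike on plain temp views.
spark.sql("UPDATE temp_table_delta SET column_a = '1'")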

Using a loop to create Spark SQL queries

I am trying to create some Spark SQL queries for different tables which I have collected as a list. I want to create SQL queries for all the tables present in the Hive database. The Hive context has been initialized. Following is my approach:
tables = spark.sql("show tables in survey_db")
# registering dataframe as temp view with 2 columns - tableName and db name
tables.createOrReplaceTempView("table_list")
# collecting my table names in a list
table_array= spark.sql("select collect_list(tableName) from table_list").collect()[0][0]
# array values(list)
table_array= [u'survey',u'market',u'customer']
I want to create Spark SQL queries for the table names stored in table_array. For example:
for i in table_array:
    spark.sql("select * from survey_db.'i'")
I can't use shell scripting as I have to write a PySpark script for this. Please advise if spark.sql queries can be created using a loop/map. Thanks everyone.
You can achieve the same as follows:
sql_list = [f"select * from survey_db.{table}" for table in table_array]
for sql in sql_list:
    df = spark.sql(sql)
    df.show()

How to get Create Statement of Table in some other database in Spark using JDBC

Problem statement:
I have an Impala database where multiple tables are present.
I am creating a Spark JDBC connection to Impala and loading these tables into a Spark dataframe for my validations, like this, which works fine:
val df = spark.read.format("jdbc")
.option("url","url")
.option("dbtable","tablename")
.load()
Now the next step, and my actual problem, is that I need to find the CREATE statement which was used to create the tables in Impala itself.
Since I cannot run a command like the one below (it gives an error), is there any way I can fetch the SHOW CREATE TABLE statement for tables present in Impala?
val df = spark.read.format("jdbc")
.option("url","url")
.option("dbtable","show create table tablename")
.load()
Perhaps you can use Spark SQL "natively" to execute something like
val createstmt = spark.sql("show create table <tablename>")
The resulting dataframe will have a single column (type string) which contains a complete CREATE TABLE statement.
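If you need the statement as a plain string (for example, to replay it elsewhere), something along these lines should work; <tablename> is a placeholder as above:

// SHOW CREATE TABLE yields one row with one string column; pull it out directly.
val ddl: String = spark.sql("show create table <tablename>").first().getString(0)
println(ddl)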
But if you still choose to go the JDBC route, there is always the option to use the good old JDBC interface. Scala can use any Java library, after all...
import java.sql.{Connection, DriverManager, ResultSet, Statement}
val conn: Connection = DriverManager.getConnection("url")
val stmt: Statement = conn.createStatement()
val rs: ResultSet = stmt.executeQuery("show create table <tablename>")
// ...etc...

Pyspark: writing data to Postgres using JDBC

1) I am reading a table from Postgres as below and creating a dataframe:
df = spark.read.format("jdbc").option("url", url). \
option("query", "SELECT * FROM test_spark"). \
load()
2) Updating one value in the dataframe df:
newDf = df.withColumn('id',F.when(df['id']==10,20).otherwise(df['id']))
3) I am trying to upsert the data back to the Postgres table.
-- The code below clears out the table data:
newDf.write.mode("overwrite").option("upsert", True).\
option("condition_columns", "id").option("truncate", True).\
format("jdbc").option("url", url).option("dbtable", "test_spark").save()
-- The code below works fine:
newDf.write.mode("overwrite").option("upsert", True).\
option("condition_columns", "id").option("truncate", True).\
format("jdbc").option("url", url).option("dbtable", "test_spark1").save()
Issue: When I try to write the updated dataframe back to the same table (i.e. test_spark), the table data gets cleared out, but when it is a new table (i.e. a non-existing table), it works fine.
Resolved the issue by writing the dataframe to a checkpoint directory before writing it to the DB table, as shown in the code below. Because the JDBC read is lazy, an overwrite of the same table truncates it before the rows are ever read; checkpointing materializes the dataframe first, so the data survives the overwrite.
spark.sparkContext.setCheckpointDir('checkpoints')
newDf.checkpoint().write.format("jdbc").option("url", url).option("truncate", "true").mode("overwrite").\
option("dbtable", "spark_test").save()

Unable to create Hbase table using Hive query through Spark

Using the following tutorial: https://hadooptutorial.info/hbase-integration-with-hive/, I was able to do the HBase integration with Hive. After the configuration, I was successfully able to create an HBase table using a Hive query with a Hive table mapping.
Hive query:
CREATE TABLE upc_hbt(key string, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,value:value")
TBLPROPERTIES ("hbase.table.name" = "upc_hbt");
Spark-Scala:
val createTableHql: String = s"CREATE TABLE upc_hbt2(key string, value string) " +
  "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' " +
  "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,value:value') " +
  "TBLPROPERTIES ('hbase.table.name' = 'upc_hbt2')"
hc.sql(createTableHql)
But when I execute the same Hive query through Spark it throws the following error:
Exception in thread "main" org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hadoop.hive.hbase.HBaseStorageHandler
It seems like during the Hive execution through Spark it can't find the auxpath jar location. Is there any way to solve this problem?
Thank you very much in advance.
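
One hedged workaround, assuming the failure is simply that the HBase storage handler jars Hive normally picks up via auxpath are not on Spark's classpath, is to pass those jars to Spark explicitly. The jar paths below are placeholders for whatever your distribution provides; this is a sketch, not a confirmed fix.

import org.apache.spark.sql.SparkSession

// Sketch only: jar locations and the exact jar set depend on your Hive/HBase install.
val spark = SparkSession.builder()
  .appName("hive-hbase-ddl")
  .enableHiveSupport()
  .config("spark.jars",
    "/path/to/hive-hbase-handler.jar," +
    "/path/to/hbase-client.jar,/path/to/hbase-common.jar,/path/to/hbase-server.jar")
  .getOrCreate()

spark.sql(createTableHql)

// The same jars can instead be supplied on the command line, e.g.
//   spark-submit --jars /path/to/hive-hbase-handler.jar,... your-app.jar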