Temporary View in Spark Structured Streaming - pyspark

In a Structured Streaming foreachBatch function I want to create a temporary view of the DataFrame received in the micro-batch:
def func(taba_df, epoch_id):
    taba_df.createOrReplaceTempView("taba")
But I am getting the error below:
org.apache.spark.sql.streaming.StreamingQueryException: Table or view not found: taba
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'taba' not found
Could anyone please help me resolve this issue?

A streaming query uses its own SparkSession which is cloned from the SparkSession that starts the query. And the DataFrame provided by foreachBatch is created from the streaming query's SparkSession. Hence you cannot access temp views using the original SparkSession.
One workaround is to use createGlobalTempView/createOrReplaceGlobalTempView to create a global temp view. Please note that global temp views are tied to a system-preserved database global_temp, and you need to use the qualified name to refer to a global temp view, such as SELECT * FROM global_temp.view1.
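A minimal PySpark sketch of this workaround (source_df and spark stand in for the streaming DataFrame and the original session, which are not shown in the question):

def func(taba_df, epoch_id):
    # A global temp view lives in the system-preserved database global_temp
    # and is visible from every SparkSession of the same application.
    taba_df.createOrReplaceGlobalTempView("taba")

query = source_df.writeStream.foreachBatch(func).start()

# Elsewhere in the application, refer to the view by its qualified name:
spark.sql("SELECT * FROM global_temp.taba").show()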

Related

DML error logging.. handling bad records in Spark Delta Table

Is there an option to capture error/bad records in an error table when we load data from staging to fact/hub tables? I am looking for the equivalent of Oracle's DML error logging in Spark Delta tables.
If there are any bad/rejected records, they should be loaded into another table instead of throwing an error. I need a direct solution, not data validation before loading the data into the fact table. Please refer to the notebook below, which is published publicly.
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2167814208768909/75203411582892/266969774599527/latest.html
Use the badRecordsPath option to capture bad records at the required location, as follows:
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "parquet") \
    .option("badRecordsPath", "req_path") \
    .schema(schema) \
    .load("source_path")
and then create a Delta table in place at the bad records location, as sketched below.
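A sketch of that last step, assuming the rejected records are written as JSON files under req_path (the target table name stage_bad_records is made up):

# Bad records land as JSON files in timestamped subdirectories of badRecordsPath
bad_df = spark.read.json("req_path/*/bad_records/*")

# Persist them to a separate Delta table instead of failing the load
bad_df.write.format("delta").mode("append").saveAsTable("stage_bad_records")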

Spark2.4 Unable to overwrite table from same table

I am trying to insert data into a table using an INSERT OVERWRITE statement, but I am getting the error below.
org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is also being read from.;
The command is as below:
spark.sql("INSERT OVERWRITE TABLE edm_hive SELECT run_number+1 from edm_hive")
I tried using a temp table to store the results and then update the final table, but that is not working either.
I am also trying to insert a record into the table using a variable, but that is not working either.
e.g.
spark.sql("INSERT into TABLE Feed_metadata_s2 values ('LOGS','StartTimestamp',$StartTimestamp)")
Please suggest.
This solution works well for me. I added the property below to the SparkSession.
Spark HiveContext : Insert Overwrite the same table it is read from
val spark = SparkSession.builder()
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .enableHiveSupport()
  .getOrCreate()
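As for the variable insert mentioned in the question, the value has to be interpolated into the SQL text before it reaches spark.sql (in Scala that means an s-interpolated string). A PySpark sketch of the same idea, with a made-up timestamp value:

start_timestamp = "2021-01-01 00:00:00"  # made-up value for illustration
spark.sql(
    f"INSERT INTO Feed_metadata_s2 VALUES ('LOGS', 'StartTimestamp', '{start_timestamp}')"
)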

Hive create partitioned table based on Spark temporary table

I have a Spark temporary table spark_tmp_view with a DATE_KEY column. I am trying to create a Hive table from it (without writing the temp table out to a Parquet location first). What I have tried to run is spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.result AS SELECT * FROM spark_tmp_view PARTITIONED BY(DATE_KEY DATE)")
The error I got is mismatched input 'BY' expecting <EOF>. I tried to search but still haven't been able to figure out how to do it from a Spark app, and how to insert data afterwards. Could someone please help? Many thanks.
PARTITIONED BY is part of the definition of the table being created, so it has to precede ...AS SELECT...; see the Spark SQL CREATE TABLE syntax.
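A sketch of a corrected statement using Spark's data source CTAS syntax (PARTITIONED BY lists only the column name, since its type comes from the SELECT; the LOCATION path is made up and is what makes the table external):

spark.sql("""
    CREATE TABLE IF NOT EXISTS mydb.result
    USING parquet
    PARTITIONED BY (DATE_KEY)
    LOCATION '/warehouse/mydb/result'
    AS SELECT * FROM spark_tmp_view
""")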

Cannot read case-sensitive Glue table backed by Parquet

Spark version: 2.4.2 on Amazon EMR 5.24.0
I have a Glue Catalog table backed by an S3 Parquet directory. The Parquet files have case-sensitive column names (like lastModified). No matter what I do, I get lowercase column names (lastmodified) when reading the Glue Catalog table with Spark:
import org.apache.spark.sql.SparkSession

for {
  i <- Seq(false, true)
  j <- Seq("NEVER_INFER", "INFER_AND_SAVE", "INFER_ONLY")
  k <- Seq(false, true)
} {
  val spark = SparkSession.builder()
    .config("spark.sql.hive.convertMetastoreParquet", i)
    .config("spark.sql.hive.caseSensitiveInferenceMode", j)
    .config("spark.sql.parquet.mergeSchema", k)
    .enableHiveSupport()
    .getOrCreate()
  import spark.sql

  val df = sql("""SELECT * FROM ecs_db.test_small""")
  df.columns.foreach(println)
}
Edit
The below solution is incorrect.
Glue Crawlers are not supposed to set the spark.sql.sources.schema.* properties, but Spark SQL should. The default in Spark 2.4 for spark.sql.hive.caseSensitiveInferenceMode is INFER_AND_SAVE, which means that Spark infers the schema from the underlying files and alters the table to add the spark.sql.sources.schema.* properties to SERDEPROPERTIES. In our case, Spark failed to do so because of an IllegalArgumentException: Can not create a Path from an empty string, which is thrown because the Hive Database class instance has an empty locationUri property string. That, in turn, is because the Glue database does not have a Location property. After the schema is saved, Spark reads it from the table.
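One way to check whether the spark.sql.sources.schema.* properties were actually persisted is to inspect the table metadata from Spark, e.g. (PySpark shown for brevity):

spark.sql("DESCRIBE EXTENDED ecs_db.test_small").show(100, truncate=False)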
There could be a way around this by setting INFER_ONLY, which should only infer the schema from the files and not attempt to alter the table's SERDEPROPERTIES. However, this doesn't work because of a Spark bug, where the inferred schema is then lowercased (see here).
Original solution (incorrect)
This bug happens because the Glue table's SERDEPROPERTIES is missing two important properties:
spark.sql.sources.schema.numParts
spark.sql.sources.schema.part.0
To solve the problem, I had to add those two properties via the Glue console (I couldn't do it with ALTER TABLE …).
I guess this is a bug with the Glue crawlers, which do not set these properties when creating the table.

comparing dataframes to import incremental data in spark and scala issue

I have derived a DataFrame from Oracle using SQLContext and registered it as temp table tb1.
I have another DataFrame derived from Hive using HiveContext, and I registered it as table tb2.
When I try to access these two tables using the HiveContext I get an error like "Unable to find tb1", and when I try it with the SQLContext I get an error like "Unable to find tb2".
Any help on this, please?
I'm doing it in Scala, of course.