Comparing dataframes to import incremental data in Spark and Scala issue - scala

I have derived a DataFrame from Oracle using SQLContext and registered it as temp table tb1.
I have another DataFrame derived from Hive using HiveContext and registered it as table tb2.
When I try to access these two tables through the HiveContext I get an error like "Unable to find tb1", and when I try it with the SQLContext I get "Unable to find tb2".
Any help on this, please?
I'm doing it in Scala, of course.
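For what it's worth, here is a minimal sketch of one way around this, assuming Spark 1.x: temp tables registered on a plain SQLContext and on a HiveContext live in separate catalogs, so registering both through the same HiveContext (which extends SQLContext and can also read over JDBC) keeps them visible to each other. The JDBC details, table names and the join key id are placeholders.
import org.apache.spark.sql.hive.HiveContext

// Use one context for everything: a HiveContext can read the Oracle
// table over JDBC as well as the Hive table.
val hiveContext = new HiveContext(sc)

// Hypothetical Oracle connection details - replace with the real ones.
val oracleDf = hiveContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:oracle:thin:@//dbhost:1521/service",
  "dbtable"  -> "SCHEMA.SOURCE_TABLE",
  "user"     -> "user",
  "password" -> "password"
)).load()

oracleDf.registerTempTable("tb1")
hiveContext.table("db.hive_table").registerTempTable("tb2")

// Both temp tables now sit in the same catalog and can be compared,
// e.g. to pick out the incremental rows.
hiveContext.sql("SELECT t1.* FROM tb1 t1 LEFT JOIN tb2 t2 ON t1.id = t2.id WHERE t2.id IS NULL")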

Related

Temporary View in Spark Structured Streaming

In the foreachBatch function of Structured Streaming, I want to create a temporary view of the DataFrame received in each micro-batch:
def func(tabaDf, epoch_id):
    tabaDf.createOrReplaceTempView("taba")
But I am getting the error below:
org.apache.spark.sql.streaming.StreamingQueryException: Table or view not found: taba
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'taba' not found
Can anyone please help me resolve this issue?
A streaming query uses its own SparkSession which is cloned from the SparkSession that starts the query. And the DataFrame provided by foreachBatch is created from the streaming query's SparkSession. Hence you cannot access temp views using the original SparkSession.
One workaround is using createGlobalTempView/createOrReplaceGlobalTempView to create a global temp view. Please note that global temp views are tied to a system-preserved database global_temp, and you need to use the qualified name to refer to a global temp view, such as SELECT * FROM global_temp.view1.
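A minimal Scala sketch of that workaround, assuming a streaming DataFrame named streamingDf and the view name taba from the question; the query inside the batch is just an illustration:
import org.apache.spark.sql.DataFrame

val query = streamingDf.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // Global temp views live in the system-preserved database global_temp
    // and stay visible across the sessions involved in the streaming query.
    batchDf.createOrReplaceGlobalTempView("taba")
    batchDf.sparkSession.sql("SELECT * FROM global_temp.taba").show()
  }
  .start()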

How to create an empty dataframe using a Hive external table?

I am using the code below to create a DataFrame (Spark Scala) from a Hive external table, but the DataFrame also loads the data. I need an empty DF created from the Hive external table's schema.
val table1 = sqlContext.table("db.table")
How can I create an empty DataFrame using the Hive external table?
You can just do:
val table1 = sqlContext.table("db.table").limit(0)
This will give you the empty df with only the schema. Because of lazy evaluation it also does not take longer than just loading the schema.
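A quick check, using the same table name as in the question, that the result keeps the schema but carries no rows:
val emptyDf = sqlContext.table("db.table").limit(0)

emptyDf.printSchema()      // same schema as the Hive external table
println(emptyDf.count())   // prints 0 - no rows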

PySpark - Saving Hive Table - org.apache.spark.SparkException: Cannot recognize hive type string

I am saving a Spark DataFrame to a Hive table. The DataFrame holds a nested JSON data structure. I am able to save the DataFrame as files, but it fails at the point where it creates a Hive table on top of them, saying:
org.apache.spark.SparkException: Cannot recognize hive type string
I cannot create a Hive table schema first and then insert into it, since the DataFrame consists of a couple hundred nested columns.
So I am saving it as:
df.write.partitionBy("dt","file_dt").saveAsTable("df")
I am not able to figure out what the issue is.
The issue I was having was caused by a few columns that were named with bare numbers: "1", "2", "3". Removing those columns from the DataFrame let me create the Hive table without any errors.
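A hedged sketch of that cleanup in Scala, assuming the offending columns are exactly the top-level ones whose names are bare digits (df stands for the nested DataFrame from the question):
// Collect top-level columns whose names are purely numeric ("1", "2", "3", ...)
val numericNames = df.columns.filter(name => name.nonEmpty && name.forall(_.isDigit))

// Drop them before writing, as described above. Renaming them instead
// (e.g. with withColumnRenamed to "col_1") would be an alternative if the data is needed.
val cleaned = df.drop(numericNames: _*)

cleaned.write.partitionBy("dt", "file_dt").saveAsTable("df")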

pySpark jdbc write error: An error occurred while calling o43.jdbc. : scala.MatchError: null

I am trying to write a simple Spark DataFrame to a DB2 database using PySpark. The DataFrame has only one column, with double as the data type.
The DataFrame has only one row and one column, and its schema shows that single column as double.
When I try to write this DataFrame to the DB2 table with this syntax:
dataframe.write.mode('overwrite').jdbc(url=url, table=source, properties=prop)
it creates the table in the database the first time without any issue, but if I run the code a second time, it throws the scala.MatchError: null exception from the title.
On the DB2 side the column datatype is also DOUBLE.
Not sure what I am missing.
I just changed a small part of the code and it worked perfectly.
Here is the small change I made to the syntax:
dataframe.write.jdbc(url=url, table=source, mode = 'overwrite', properties=prop)

How do I create a Hive external table from AVRO files written using Databricks?

The code below is how the data was written into HDFS using Scala. What is the HQL syntax to create a Hive table to query this data?
import com.databricks.spark.avro._
val path = "/user/myself/avrodata"
dataFrame.write.avro(path)
The examples I find require providing an avro.schema.literal to describe the schema or an avro.schema.url to the actual avro schema.
In the spark-shell all I would need to do to read this is:
scala> import com.databricks.spark.avro._
scala> val df = sqlContext.read.avro("/user/myself/avrodata")
scala> df.show()
So I cheated to get this to work. Basically, I registered the DataFrame as a temporary table and used HQL to create the Avro target table and populate it from that temp table. This method uses the metadata from the temporary table to create the Avro target table I wanted to create and populate. If the DataFrame can create a temporary table from its schema, why could it not save the table as Avro directly?
dataFrame.registerTempTable("my_tmp_table")
sqlContext.sql(s"CREATE TABLE ${schema}.${tableName} STORED AS AVRO AS SELECT * FROM my_tmp_table")